
SAMZA-2591: Async Commit [2/3]: Task Commit API changes and async commit #1490

Merged
mynameborat merged 9 commits into apache:state-backend-async-commit from dxichen:task-commit-lifecycle on May 7, 2021

Conversation

dxichen (Member) commented on Apr 16, 2021

  • Introduce new state backend APIs for blobstore and kafka changelog
  • Change the task commit lifecycle to separate snapshot, upload and cleanup phases
  • Make the TaskInstance commit upload and cleanup phases nonblocking
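The three-phase lifecycle described in these bullets could be sketched roughly as follows. The class and method names (TaskCommitSketch, snapshot, upload, cleanup) are illustrative placeholders, not Samza's actual API; the point is that only the snapshot phase blocks the caller, while upload and cleanup are chained onto an executor.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TaskCommitSketch {
  // Phase 1 (blocking): capture a consistent local snapshot per store,
  // returning store name -> state checkpoint marker (SCM).
  static Map<String, String> snapshot(String checkpointId) {
    return Map.of("store1", "scm-" + checkpointId);
  }

  // Phase 2 (async): upload the snapshotted state to the backup backend.
  static Map<String, String> upload(Map<String, String> snapshotSCMs) {
    return snapshotSCMs;
  }

  // Phase 3 (async): remove obsolete local and remote snapshots.
  static void cleanup(String checkpointId) {
  }

  // Only the snapshot runs on the caller's thread; upload and cleanup
  // are chained onto the backup executor so commit() never blocks on I/O.
  public static CompletableFuture<Map<String, String>> commit(
      String checkpointId, ExecutorService backupExecutor) {
    Map<String, String> snapshotSCMs = snapshot(checkpointId);
    return CompletableFuture
        .supplyAsync(() -> upload(snapshotSCMs), backupExecutor)
        .thenApply(uploaded -> {
          cleanup(checkpointId);
          return uploaded;
        });
  }

  public static void main(String[] args) {
    ExecutorService backupExecutor = Executors.newSingleThreadExecutor();
    Map<String, String> scms = commit("cp-1", backupExecutor).join();
    System.out.println(scms.get("store1")); // prints "scm-cp-1"
    backupExecutor.shutdown();
  }
}
```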

dxichen (Member, Author) commented on Apr 16, 2021

Please disregard the Checkpoint v2 migration commit since it is part of #1489

@dxichen dxichen force-pushed the task-commit-lifecycle branch from 23fdb14 to b03306b on April 16, 2021 21:51
Contributor (on lines 39 to 41):
[P1] Can you elaborate why job and container model are passed down here? Seems like unused parameters in this PR at the least.

Member Author:

JobModel is used for potential forward compatibility, and ContainerModel is used for restores, so I wanted to keep it symmetrical.

Comment on lines +47 to +51
TaskRestoreManager getRestoreManager(JobContext jobContext,
ContainerContext containerContext,
TaskModel taskModel,
Contributor:

[P1] Seems inconsistent with the above. Maybe consider passing the same contexts as part of the signature of getBackupManager too, so that access to JobModel and ContainerModel is always through the context variables; that keeps the access pattern consistent in the code base and helps with evolution.

That said, is the TaskContext available before instantiation so that TaskModel is also accessed through the context?

Member Author:

Will change getBackupManager to context passing as well.
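A symmetric, context-based pair of factory methods, as the reviewer suggests, might look like the sketch below. All interfaces here are empty placeholders standing in for Samza's real types; the point is simply that both directions of state transfer share one parameter list.

```java
public class ContextFactorySketch {
  // Placeholder types; in Samza these would carry JobModel,
  // ContainerModel, and per-task metadata respectively.
  interface JobContext {}
  interface ContainerContext {}
  interface TaskModel {}
  interface TaskBackupManager {}
  interface TaskRestoreManager {}

  interface StateBackendFactory {
    // Identical parameter lists keep the access pattern consistent
    // for both backup and restore.
    TaskBackupManager getBackupManager(JobContext job, ContainerContext container, TaskModel task);
    TaskRestoreManager getRestoreManager(JobContext job, ContainerContext container, TaskModel task);
  }

  static StateBackendFactory makeFactory() {
    return new StateBackendFactory() {
      public TaskBackupManager getBackupManager(JobContext j, ContainerContext c, TaskModel t) {
        return new TaskBackupManager() {};
      }
      public TaskRestoreManager getRestoreManager(JobContext j, ContainerContext c, TaskModel t) {
        return new TaskRestoreManager() {};
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(makeFactory().getBackupManager(null, null, null) != null); // prints "true"
  }
}
```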

Comment on lines +97 to +101
stateBackendToBackupManager.values()
.forEach(storageBackupManager -> storageBackupManager.init(null));
Contributor:
[P2] Can we use a sentinel checkpoint instead of null?

Member Author:

Will address this as a follow-up.
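A sentinel checkpoint, as the reviewer suggests, might look roughly like the following. The Checkpoint shape and the EMPTY_CHECKPOINT name are hypothetical, not Samza's actual classes; the idea is that init() receives a real object that downstream code can inspect instead of null-checking.

```java
import java.util.Collections;
import java.util.Map;

public class SentinelCheckpointSketch {
  // Hypothetical stand-in for a checkpoint; not Samza's Checkpoint class.
  static final class Checkpoint {
    final Map<String, String> offsets;
    Checkpoint(Map<String, String> offsets) { this.offsets = offsets; }
    boolean isEmpty() { return offsets.isEmpty(); }
  }

  // Shared sentinel: callers test isEmpty() (or compare by identity)
  // instead of null-checking, so init(EMPTY_CHECKPOINT) is always safe.
  static final Checkpoint EMPTY_CHECKPOINT = new Checkpoint(Collections.emptyMap());

  static String describe(Checkpoint checkpoint) {
    return checkpoint.isEmpty() ? "no prior checkpoint" : "restoring from checkpoint";
  }

  public static void main(String[] args) {
    System.out.println(describe(EMPTY_CHECKPOINT)); // prints "no prior checkpoint"
  }
}
```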

CheckpointManager checkpointManager, Config config, ExecutorService backupExecutor,
StorageManagerUtil storageManagerUtil, File durableStoreBaseDir) {
this.taskName = taskName;
this.containerStorageManager = containerStorageManager;
Contributor:

[P1] Can we persist the StorageEngines here instead of getting a handle of ContainerStorageManager?

Member Author:

Unfortunately the storage engines are not created at this point; they are created after init is called, which is why we hold on to the containerStorageManager.
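One way to model the lazy availability described here is to capture a supplier at construction time and resolve it in init(). This is a sketch, not the PR's implementation: plain strings stand in for real storage engines.

```java
import java.util.Map;
import java.util.function.Supplier;

public class LazyStoresSketch {
  // The supplier is captured at construction time, when the stores do
  // not exist yet; plain strings stand in for real storage engines.
  private final Supplier<Map<String, String>> storesSupplier;
  private Map<String, String> storageEngines;

  LazyStoresSketch(Supplier<Map<String, String>> storesSupplier) {
    this.storesSupplier = storesSupplier;
  }

  // By the time init() runs, the container storage manager has created
  // the stores, so the supplier can be resolved safely.
  void init() {
    this.storageEngines = storesSupplier.get();
  }

  int storeCount() {
    return storageEngines.size();
  }

  public static void main(String[] args) {
    LazyStoresSketch manager = new LazyStoresSketch(() -> Map.of("store1", "engine1"));
    manager.init();
    System.out.println(manager.storeCount()); // prints "1"
  }
}
```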

}

public void init() {
// Assuming that container storage manager has already started and created the stores
Contributor:

Is there a way to enforce or validate this assumption? Refer to the above comment on moving this to constructor. Is it here because the data may not be available during construction?

Member Author:

Yes, the data is not available during construction.

Comment on lines +115 to +126
storageEngines.forEach((storeName, storageEngine) -> {
if (storageEngine.getStoreProperties().isPersistedToDisk() &&
storageEngine.getStoreProperties().isDurableStore()) {
storageEngine.checkpoint(checkpointId);
}
});
Contributor:

Is isDurableStore() equivalent to isLoggedStore()? We used to checkpoint for persisted and logged stores. Making sure this is just a rename?

Member Author:

isDurable is a superset of isLogged: isLogged is specifically for Kafka changelog durability, while isDurable means the store is durable via either the blob store or the Kafka changelog.
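That superset relationship can be illustrated with a small sketch. The StoreProps class and its fields are hypothetical stand-ins for Samza's StoreProperties, not the real implementation.

```java
public class DurabilitySketch {
  // Hypothetical stand-in for Samza's StoreProperties.
  static final class StoreProps {
    final boolean loggedToChangelog;   // durable via Kafka changelog
    final boolean backedUpToBlobStore; // durable via blob store

    StoreProps(boolean loggedToChangelog, boolean backedUpToBlobStore) {
      this.loggedToChangelog = loggedToChangelog;
      this.backedUpToBlobStore = backedUpToBlobStore;
    }

    boolean isLoggedStore() {
      return loggedToChangelog;
    }

    // Durable via either mechanism, so every logged store is durable,
    // but not every durable store is logged.
    boolean isDurableStore() {
      return loggedToChangelog || backedUpToBlobStore;
    }
  }

  public static void main(String[] args) {
    StoreProps blobOnly = new StoreProps(false, true);
    StoreProps logged = new StoreProps(true, false);
    System.out.println(blobOnly.isDurableStore() && !blobOnly.isLoggedStore()); // true
    System.out.println(logged.isDurableStore()); // true: logged implies durable
  }
}
```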

Comment on lines +125 to +141
// for each configured state backend factory, backup the state for all stores in this task.
stateBackendToBackupManager.forEach((stateBackendFactoryName, backupManager) -> {
Map<String, String> snapshotSCMs = backupManager.snapshot(checkpointId);
LOG.debug("Created snapshot for taskName: {}, checkpoint id: {}, state backend: {}. Snapshot SCMs: {}",
taskName, checkpointId, stateBackendFactoryName, snapshotSCMs);
stateBackendToStoreSCMs.put(stateBackendFactoryName, snapshotSCMs);
});
Contributor:

Do we need to back up the state for all stores, or do the stores within each of the backup factories already satisfy the above requirement for checkpointing (persisted & durable)?

Looking at the Kafka implementation, storeChangelogs are the ones iterated for fetching the snapshot, but the above criteria need both durable & persisted.

Checking whether changelog-enabled stores are persisted to disk by default, and what the behavior is otherwise?

Member Author:

We need to back up all the stores that are durable (or logged), and write a checkpoint (to file) for all the state that is persisted. Similarly, changelog-enabled stores are checkpointed to disk only if they are persisted (i.e. non in-memory), which preserves the existing behavior.
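The resulting rule (back up everything durable, but write checkpoint files only for stores that are also persisted) can be sketched like this; the Store class and the store names are illustrative, not Samza's StorageEngine.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointFilterSketch {
  // Illustrative store descriptor; not Samza's StorageEngine.
  static final class Store {
    final String name;
    final boolean persistedToDisk;
    final boolean durable;

    Store(String name, boolean persistedToDisk, boolean durable) {
      this.name = name;
      this.persistedToDisk = persistedToDisk;
      this.durable = durable;
    }
  }

  // Only stores that are both durable and persisted write a local
  // checkpoint file; in-memory durable stores are backed up but have
  // no on-disk state to snapshot.
  static List<String> storesToCheckpoint(List<Store> stores) {
    List<String> result = new ArrayList<>();
    for (Store store : stores) {
      if (store.persistedToDisk && store.durable) {
        result.add(store.name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Store> stores = List.of(
        new Store("rocksdb-logged", true, true),  // checkpointed
        new Store("inmem-logged", false, true),   // durable, but no files
        new Store("rocksdb-local", true, false)); // persisted, not durable
    System.out.println(storesToCheckpoint(stores)); // prints "[rocksdb-logged]"
  }
}
```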

@dxichen dxichen force-pushed the task-commit-lifecycle branch from b03306b to 0f7be25 on April 29, 2021 04:10
@dxichen dxichen force-pushed the task-commit-lifecycle branch from 3008a98 to cef0f1e on April 29, 2021 23:33
@mynameborat mynameborat merged commit c117a68 into apache:state-backend-async-commit May 7, 2021
shekhars-li pushed a commit to shekhars-li/samza that referenced this pull request May 28, 2021
…mit (apache#1490)

Introduce new state backend APIs for blobstore and kafka changelog
Change the task commit lifecycle to separate snapshot, upload and cleanup phases
Make the TaskInstance commit upload and cleanup phases nonblocking