[CDAP-2013] Refactor Program States #9158

sameetandpotatoes · 2017-06-28T23:48:58Z

Work Items:

Add AbstractStateChangeProgramController as the parent class of every program type's controller, so that the state-change listener is attached automatically when the program runner creates its controller.
Add ProgramRunStatus.STARTING, update Store to record this new state, and update the RunRecord class to hold the corresponding times.
RunRecord now holds startTs, runTs, and stopTs, the times at which the program entered the STARTING, RUNNING, and terminated states, respectively.
Store the record types for ProgramRunStatus.STARTING and ProgramRunStatus.RUNNING in separate columns in the metadata store.
Add a ProgramStateWriter interface, with a DirectStoreProgramStateWriter implementation that persists to the runtime store.
Inject ProgramStateWriter for programs, and inject NoOpProgramStateWriter for the Worker and Service program runners (in-memory mode only), since they run multiple instances.
Build passes
ITN passes
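
The timing fields in the work items above can be pictured with a minimal sketch. This is not the actual RunRecord class: RunTimes, its constructor, and startupDelaySecs are hypothetical names, and seconds are assumed as the unit because the tests in this PR pass start times in TimeUnit.SECONDS.

```java
// Hypothetical stand-in for the timing fields described above; not the real RunRecord.
final class RunTimes {
  private final long startTs; // assumed seconds: when the run entered STARTING
  private final Long runTs;   // assumed seconds: when the run entered RUNNING, or null if it never did
  private final Long stopTs;  // assumed seconds: when the run terminated, or null if still active

  RunTimes(long startTs, Long runTs, Long stopTs) {
    this.startTs = startTs;
    this.runTs = runTs;
    this.stopTs = stopTs;
  }

  // Seconds spent in STARTING before reaching RUNNING, or -1 if RUNNING was never recorded.
  long startupDelaySecs() {
    return runTs == null ? -1 : runTs - startTs;
  }

  // True once a stop time has been recorded.
  boolean isTerminated() {
    return stopTs != null;
  }
}
```

A run that went straight from STARTING to KILLED would have a null runTs here, which is why the two later timestamps are modelled as nullable in this sketch.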

Once https://issues.apache.org/jira/browse/TWILL-240 is resolved, the listener will be added for all of distributed mode, and the individual program runners will no longer hold the ProgramStateWriter. Instead, the InMemory and Distributed program runners will hold the program state writers.

This needs to happen for services/flows/workers to record program states in distributed mode correctly.

JIRA: https://issues.cask.co/browse/CDAP-2013
Build: https://builds.cask.co/browse/CDAP-DUT5896

chtyim

Looks like the changes only affect local mode. How about in distributed mode?

chtyim · 2017-07-05T18:00:01Z

cdap-app-fabric/src/main/java/co/cask/cdap/internal/app/runtime/flow/FlowProgramRunner.java

+      final ProgramId programId = program.getId();
+      final Arguments systemArgs = options.getArguments();
+      final Arguments userArgs = options.getUserArguments();
+      final String twillRunId = systemArgs.getOption(ProgramOptionConstants.TWILL_RUN_ID);


This program runner is only used in local mode, so there won't be any Twill run ID.

chtyim · 2017-07-05T18:08:19Z

...app-fabric/src/main/java/co/cask/cdap/internal/app/runtime/service/ServiceProgramRunner.java

+    final ProgramId programId = program.getId();
+    final Arguments systemArgs = options.getArguments();
+    final Arguments userArgs = options.getUserArguments();
+    final String twillRunId = systemArgs.getOption(ProgramOptionConstants.TWILL_RUN_ID);


Same as above.

chtyim · 2017-07-05T18:09:44Z

...app-fabric/src/main/java/co/cask/cdap/internal/app/runtime/service/ServiceProgramRunner.java

@@ -124,19 +143,65 @@ public ProgramController run(Program program, ProgramOptions options) {
      // Add a service listener to make sure the plugin instantiator is closed when the http server is finished.
      component.addListener(new ServiceListenerAdapter() {


Looks like the listener should be refactored into a class so that the logic can be shared between FlowProgramRunner, ServiceProgramRunner and WorkerProgramRunner.

sameetandpotatoes · 2017-07-21T20:49:57Z

cdap-app-fabric/src/main/java/co/cask/cdap/internal/app/store/AppMetadataStore.java

@@ -348,9 +391,17 @@ private void recordProgramSuspendResume(ProgramId programId, String pid, String
      record = get(key, RunRecordMeta.class);
    }

+    // We can also suspend a workflow that is in the starting state


Since we introduced a starting state, it is possible for a program to be in the starting state and transition to other states besides RUNNING. There are several cases that I've had to handle now:

STARTING->SUSPENDED

STARTING->FAILED, STARTING->KILLED

resuming->RUNNING

I've tried to make these as efficient as possible by only querying the table if the original case did not work.

Can a Workflow be suspended before reaching the RUNNING state?

Yes, it can.
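
For reference, the transitions discussed in this thread can be written as a small lookup table. This is only a sketch: RunTransitions and its nested Status enum are hypothetical stand-ins rather than CDAP's ProgramRunStatus or the AppMetadataStore logic, and the transitions out of RUNNING and SUSPENDED are assumptions added to make the table complete.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a local enum and table, not CDAP's ProgramRunStatus.
final class RunTransitions {
  enum Status { STARTING, RUNNING, SUSPENDED, COMPLETED, FAILED, KILLED }

  private static final Map<Status, Set<Status>> ALLOWED = new EnumMap<>(Status.class);

  static {
    // From the thread: a starting run may be suspended, fail, or be killed before it runs.
    ALLOWED.put(Status.STARTING,
                EnumSet.of(Status.RUNNING, Status.SUSPENDED, Status.FAILED, Status.KILLED));
    // Assumption: a running run may be suspended or reach any terminal state.
    ALLOWED.put(Status.RUNNING,
                EnumSet.of(Status.SUSPENDED, Status.COMPLETED, Status.FAILED, Status.KILLED));
    // From the thread: resuming moves a suspended run to RUNNING. FAILED and KILLED are assumed.
    ALLOWED.put(Status.SUSPENDED,
                EnumSet.of(Status.RUNNING, Status.FAILED, Status.KILLED));
  }

  // Terminal states have no entry in the table, so nothing is allowed out of them.
  static boolean isAllowed(Status from, Status to) {
    Set<Status> targets = ALLOWED.get(from);
    return targets != null && targets.contains(to);
  }
}
```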

sameetandpotatoes · 2017-07-21T22:37:40Z

ITN passes: https://builds.cask.co/browse/CDAP-ITM-126
Build passes: https://builds.cask.co/browse/CDAP-DUT5896-52

chtyim · 2017-07-23T06:13:37Z

...t/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowStatsSLAHttpHandlerTest.java

@@ -81,23 +81,28 @@ public void testStatistics() throws Exception {
    ProgramId mapreduceProgram = WORKFLOW_APP.mr(mapreduceName);
    ProgramId sparkProgram = WORKFLOW_APP.spark(sparkName);

+    // Time from program starting to program running
+    int turnoverTime = 1;


what is this for?

Since we have added a distinction between a program starting and a program running, when we use setStart in the tests to persist the state, I thought I would have a variable to mark the time between a program starting and a program running.

This variable represents the time in seconds after a program starts for the program to be marked running.

chtyim · 2017-07-23T06:14:29Z

cdap-app-fabric/src/main/java/co/cask/cdap/app/guice/DistributedProgramRunnableModule.java

@@ -172,6 +175,11 @@ protected void configure() {
          // For binding DataSet transaction stuff
          install(new DataFabricFacadeModule());

+          bind(ProgramStateWriter.class).to(DirectStoreProgramStateWriter.class);
+          bind(ProgramStateWriter.class)


Please add a TODO (with a JIRA number) to have this removed.

chtyim · 2017-07-23T06:18:51Z

cdap-app-fabric/src/main/java/co/cask/cdap/app/runtime/AbstractProgramRuntimeService.java

      monitorProgram(runtimeInfo, cleanUpTask);
      return runtimeInfo;
    } catch (Exception e) {
+      // Set the program state to an error when an exception is thrown
+      programStateWriter.error(programId.run(runId), new Throwable(e.getMessage()));


Why create a new Throwable here? That would lose the stack trace. Why not just use e directly?

Good point - I wasn't aware that Exception was a subclass of Throwable. I will use e directly.
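
The difference is easy to check in isolation. The sketch below uses only java.lang classes; StackTraceDemo and its methods are hypothetical names for illustration and are not part of the PR.

```java
// Illustrative only: compares re-wrapping an exception with passing it through unchanged.
final class StackTraceDemo {
  // Produces an exception that carries a cause, thrown and caught locally.
  static Exception failingCall() {
    try {
      throw new IllegalStateException("boom", new RuntimeException("root cause"));
    } catch (IllegalStateException e) {
      return e;
    }
  }

  // The pattern questioned in the review: only the message is copied over.
  static Throwable rewrapped(Exception e) {
    return new Throwable(e.getMessage());
  }

  // The suggested fix: an Exception already is a Throwable, so pass it as-is.
  static Throwable direct(Exception e) {
    return e;
  }
}
```

Here rewrapped(e) keeps the message but has no cause, is no longer an IllegalStateException, and carries a stack trace that starts inside rewrapped rather than at the original throw site. direct(e) returns the same object with everything intact.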

chtyim · 2017-07-23T06:22:51Z

...ic/src/main/java/co/cask/cdap/internal/app/program/AbstractStateChangeProgramController.java

+                                              @Nullable String componentName) {
+    super(programRunId.getParent(), RunIds.fromString(programRunId.getRun()), componentName);
+
+    service.addListener(


I don't think we need this constructor at all, do we?

Yes, the ProgramControllerServiceAdapter can simply call the other constructor now.

Actually, removing this will break a few tests. In the next PR, when we add the InMemory program runners, it should be able to be removed as the listeners are attached differently. But unfortunately for now, this constructor will be needed

why do tests fail after removing this?

~~I'm not entirely sure. DynamicPartitioningTestRun fails for example, with issues seemingly unrelated with dataset tables not being found. I've narrowed it down to what I believe is this change.~~

~~Removing this constructor would "break" / change behavior from what's currently happening. For Spark / MR, we were adding the listener on the service, and not on the controller.~~

The issue is just on the test level, where sometimes it would see ProgramRunStatus.STARTING and think that the program is running. It has nothing to do with the constructor, so I will remove it.

chtyim · 2017-07-23T06:24:53Z

...ic/src/main/java/co/cask/cdap/internal/app/program/AbstractStateChangeProgramController.java

+    addListener(
+      new AbstractListener() {
+        @Override
+        public void init(ProgramController.State state, @Nullable Throwable cause) {


Well, since you add the listener to itself in the constructor, there won't be any state change yet. So, yeah, you can remove this init method.

ecfm · 2017-07-22T02:21:14Z

...t/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowStatsSLAHttpHandlerTest.java

@@ -81,23 +81,28 @@ public void testStatistics() throws Exception {
    ProgramId mapreduceProgram = WORKFLOW_APP.mr(mapreduceName);
    ProgramId sparkProgram = WORKFLOW_APP.spark(sparkName);

+    // Time from program starting to program running
+    int turnoverTime = 1;


Weird name, maybe startDelaySecs?

ecfm · 2017-07-22T02:23:15Z

...t/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowStatsSLAHttpHandlerTest.java

    long startTime = System.currentTimeMillis();
    long currentTimeMillis = startTime;
    String outlierRunId = null;
    for (int i = 0; i < 10; i++) {
      // workflow runs every 5 minutes
      currentTimeMillis = startTime + (i * TimeUnit.MINUTES.toMillis(5));
      RunId workflowRunId = RunIds.generate(currentTimeMillis);
-      store.setStart(workflowProgram, workflowRunId.getId(), RunIds.getTime(workflowRunId, TimeUnit.SECONDS));
+      long startTimeProgram = RunIds.getTime(workflowRunId, TimeUnit.SECONDS);


workflowStartTimeSecs might be a better name

ecfm · 2017-07-22T02:27:03Z

...t/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowStatsSLAHttpHandlerTest.java

      RunId workflowRunId = RunIds.generate(currentTimeMillis);
      runIdList.add(workflowRunId);
-      store.setStart(workflowProgram, workflowRunId.getId(), RunIds.getTime(workflowRunId, TimeUnit.SECONDS));
+      long startTimeProgram = RunIds.getTime(workflowRunId, TimeUnit.SECONDS);


Shouldn't startTimeProgram just be currentTimeMillis converted to seconds? Try to verify that.

ecfm · 2017-07-22T02:28:39Z

...t/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowStatsSLAHttpHandlerTest.java

    for (int i = 0; i < count; i++) {
      // work-flow runs every 5 minutes
-      currentTimeMillis = startTime + (i * TimeUnit.MINUTES.toMillis(5));
+      currentTimeMillis = startTime + (i * TimeUnit.MINUTES.toMillis(5)) - TimeUnit.SECONDS.toMillis(turnoverTime);


why - TimeUnit.SECONDS.toMillis(turnoverTime)?

Not needed anymore, will remove.

ecfm · 2017-07-22T02:31:45Z

cdap-app-fabric-tests/src/test/java/co/cask/cdap/runtime/WorkflowTest.java

+        long nowSecs = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis());
+        injector.getInstance(Store.class).setStartAndRun(controller.getProgramRunId().getParent(),
+                                                         controller.getProgramRunId().getRun(),
+                                                         nowSecs, nowSecs + 1);


Better to add a comment explaining the + 1, or introduce a private static final int for it.

ecfm · 2017-07-24T19:24:56Z

cdap-proto/src/main/java/co/cask/cdap/proto/RunRecord.java

  private final long startTs;

+  @SerializedName("start")
+  @Nullable


This is not nullable, since it has existed since previous versions.

ecfm · 2017-07-24T19:45:12Z

...st/java/co/cask/cdap/internal/app/runtime/schedule/constraint/ConcurrencyConstraintTest.java

    assertSatisfied(false, concurrencyConstraint.check(schedule, constraintContext));

    // add a run for the program that wasn't from a schedule
    // there are now three concurrent runs, so the constraint will not be met
-    store.setStart(WORKFLOW_ID, pid3, System.currentTimeMillis(), null, EMPTY_MAP, EMPTY_MAP);
+    store.setStartAndRun(WORKFLOW_ID, pid3, System.currentTimeMillis(), 1);


Should this be System.currentTimeMillis() + 1? Or you can just initialize local variables startTs and runTs at the beginning and share them across all the setStartAndRun calls, since the time here doesn't matter.

ecfm · 2017-07-24T19:50:08Z

cdap-app-fabric/src/test/java/co/cask/cdap/internal/app/store/AppMetadataStoreTest.java

@@ -87,8 +87,10 @@ public void testOldRunRecordFormat() throws Exception {
    txnl.execute(new TransactionExecutor.Subroutine() {
      @Override
      public void apply() throws Exception {
-        metadataStoreDataset.recordProgramStartOldFormat(program, runId.getId(),
-                                                         RunIds.getTime(runId, TimeUnit.SECONDS), null, null, null);
+        metadataStoreDataset.recordProgramStart(program, runId.getId(),


No need to add this? This test is for pre-3.6 versions.

Yes, but it uses the same recordProgramRunning method. The recordProgramRunning method expects there to be a record with ProgramRunStatus.STARTING.

The only difference in this oldFormat method is that the key does not include an application version.

ecfm · 2017-07-24T19:53:40Z

cdap-app-fabric/src/test/java/co/cask/cdap/internal/app/store/DefaultStoreTest.java

@@ -130,6 +130,14 @@ public void testStopBeforeStart() throws RuntimeException {
    store.setStop(programId, "runx", now, ProgramController.State.ERROR.getRunStatus());
  }

+  @Test(expected = UnsupportedOperationException.class)
+  public void testCompleteAfterStart() throws UnsupportedOperationException {


remove this test if the check in AppMetadataStore is removed

ecfm · 2017-07-24T19:55:34Z

...p-etl/cdap-data-streams/src/test/java/co/cask/cdap/datastreams/DataStreamsSparkSinkTest.java

@@ -134,5 +135,6 @@ public Boolean call() throws Exception {

    sparkManager.stop();
    sparkManager.waitForStatus(false, 10, 1);
+    sparkManager.waitForRun(ProgramRunStatus.KILLED, 10, TimeUnit.SECONDS);


Why add this?

So, if we don't wait for the run record to be persisted in the store, it will try to do this in the next test that it runs, and in the next test, it will persist KILLED, which would throw an exception because there is no starting/running record.

In my branch that switches to TMS, I've had to do this to many tests as the tests do not correctly wait for the record to be persisted.

Shouldn't the second test have a different runId? How would they conflict?

They don't conflict. The issue is that the next test that runs will try to write KILLED to the store because it didn't get to finish at the end of this test, and throw an error.

This issue occurs intermittently, so it's better to ensure that the program is truly killed by making sure that the run record has been written.
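
The idea of blocking until the terminal status has actually been recorded can be sketched as a generic polling helper. This is not how waitForRun is implemented in CDAP: RunStatusWaiter, waitFor, and the 10 ms poll interval are hypothetical, and the sketch uses java.util.function.Supplier, so it assumes Java 8 or later.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: polls a status source until it reports the expected value or time runs out.
final class RunStatusWaiter {
  // Returns true if the expected value was observed before the timeout, false otherwise.
  static <T> boolean waitFor(T expected, Supplier<T> actual, long timeout, TimeUnit unit) {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (true) {
      if (expected.equals(actual.get())) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        TimeUnit.MILLISECONDS.sleep(10); // assumed poll interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }
}
```

In a test, the supplier would read the latest persisted status for the run, so the test only finishes once the KILLED record is in the store and cannot leak a pending write into the next test.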

ecfm · 2017-07-24T22:28:45Z

.../src/test/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowHttpHandlerTest.java

+    HttpResponse response = deploy(SleepingWorkflowApp.class, Constants.Gateway.API_VERSION_3_TOKEN, TEST_NAMESPACE2);
+    Assert.assertEquals(200, response.getStatusLine().getStatusCode());
+
+    WorkflowId workflow = Ids.namespace(TEST_NAMESPACE2).app("SleepWorkflowApp").workflow("SleepWorkflow");


new WorkflowId(...)

ecfm · 2017-07-24T22:29:54Z

.../src/test/java/co/cask/cdap/internal/app/services/http/handlers/WorkflowHttpHandlerTest.java

+    WorkflowId workflow = Ids.namespace(TEST_NAMESPACE2).app("SleepWorkflowApp").workflow("SleepWorkflow");
+
+    // Start the workflow
+    startProgram(workflow.toId(), 200);


Remove all calls to .toId(). This method converts WorkflowId to a deprecated class.

ecfm · 2017-07-24T22:33:13Z

cdap-app-fabric/src/main/java/co/cask/cdap/app/runtime/ProgramStateWriter.java

@@ -29,7 +30,9 @@
 public interface ProgramStateWriter {

  /**
-   * Updates the program run's status to be {@link ProgramRunStatus#STARTING} at the given start time
+   * Updates the program run's status to be {@link ProgramRunStatus#STARTING} at the given start time when


Mention that the start time is taken from twillRunId; otherwise it's confusing where the start time comes from.

ecfm · 2017-07-24T22:42:13Z

...p-etl/cdap-data-streams/src/test/java/co/cask/cdap/datastreams/DataStreamsSparkSinkTest.java

@@ -134,5 +135,6 @@ public Boolean call() throws Exception {

    sparkManager.stop();
    sparkManager.waitForStatus(false, 10, 1);
+    sparkManager.waitForRun(ProgramRunStatus.KILLED, 10, TimeUnit.SECONDS);


Shouldn't the second test have a different runId? How would they conflict?

sameetandpotatoes · 2017-07-25T16:49:20Z

@maochf agreed this was good to go, so this will be merged once the build passes.

sameetandpotatoes · 2017-07-25T23:23:48Z

Build passed https://builds.cask.co/browse/CDAP-DUT5896, merging

sameetandpotatoes · 2017-08-10T19:00:17Z

Need to revert this to stabilize develop - will reopen once the Twill changes and corresponding CDAP changes have been tested.

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch from 83af8cd to 45eda07 Compare June 30, 2017 19:19

chtyim reviewed Jul 5, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 14 times, most recently from b5a28ee to ebd4da2 Compare July 8, 2017 21:21

sameetandpotatoes changed the title ~~[CDAP-2013] Move the logic of states to ProgramRunner containers~~ [CDAP-2013] Refactor Program States Jul 10, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 13 times, most recently from fc3d04c to 9c46cdf Compare July 14, 2017 01:18

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 2 times, most recently from 254f8d1 to 4808102 Compare July 21, 2017 06:28

sameetandpotatoes commented Jul 21, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 2 times, most recently from 8d25602 to 18fe493 Compare July 22, 2017 01:25

chtyim reviewed Jul 23, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 2 times, most recently from 9592d3c to be09036 Compare July 24, 2017 19:24

ecfm reviewed Jul 24, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch from bc8c955 to 097198b Compare July 24, 2017 22:36

ecfm reviewed Jul 24, 2017

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch 6 times, most recently from 1d9c77e to 88816f0 Compare July 25, 2017 04:37

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch from 88816f0 to 85d4a19 Compare July 25, 2017 17:17

[CDAP-2013] Refactor program states

18e6626

sameetandpotatoes force-pushed the feature/refactor-program-status-listener branch from 0c29316 to 18e6626 Compare July 25, 2017 20:26

sameetandpotatoes merged commit fad573f into develop Jul 25, 2017

sameetandpotatoes deleted the feature/refactor-program-status-listener branch July 25, 2017 23:24

sameetandpotatoes added a commit that referenced this pull request Aug 10, 2017

Revert "Merge pull request #9158 from caskdata/feature/refactor-progr…

25c6bf6

…am-status-listener" This reverts commit fad573f, reversing changes made to 0e07d2c.

sameetandpotatoes added a commit that referenced this pull request Aug 10, 2017

Merge pull request #9365 from caskdata/revert-2013-feature/refactor-p…

4127f26

…rogram-states Revert "Merge pull request #9158 from caskdata/feature/refactor-progr…

sameetandpotatoes restored the feature/refactor-program-status-listener branch August 13, 2017 16:52

sameetandpotatoes deleted the feature/refactor-program-status-listener branch August 13, 2017 17:10
