[CDAP-20913] Fix bug where runs fails when appfabric is restarted #15517

Merged: rmstar merged 1 commit into develop from CDAP-20913 on Jan 17, 2024

@rmstar rmstar commented Jan 3, 2024

Depends on cdapio/twill#46.

Bug description

If appfabric is restarted while a run is in progress, we fail to correctly process the program completion message because we can't find a run record with the given ProgramRunId in the AppMetadataStore.

2024-01-09 23:53:51,146 - WARN  [program.status.9:i.c.c.i.a.s.AppMetadataStore@1499] - Ignoring unexpected transition of program run program_run:default.DataFusionQuickstart_longnamewithabunchofcharactersthatexceedsmaximumlabelsize.-SNAPSHOT.workflow.DataPipelineWorkflow.81ea8d94-af49-11ee-a955-1ad446756c3b to program state COMPLETED with no existing run record.

Eventually the run record corrector transitions the run to the failed state, even though in this case the program had actually run to completion successfully.

2024-01-09 23:57:59,791 - WARN  [run-corrector:i.c.c.i.a.s.RunRecordCorrectorService@145] - Fixed RunRecord for program run program_run:default.DataFusionQuickstart_longnamewithabunchofcharactersthatexceedsmaximumlabelsize.c2e817f0-ab39-11ee-8f74-02f6e77072f9.workflow.DataPipelineWorkflow.81ea8d94-af49-11ee-a955-1ad446756c3b in RUNNING state because it is actually not running
2024-01-09 23:57:59,797 - WARN  [run-corrector:i.c.c.i.a.s.RunRecordCorrectorService@154] - Fixed 1 RunRecords with status in [STARTING, RUNNING, SUSPENDED], but the programs are not actually running
2024-01-09 23:57:59,798 - INFO  [run-corrector:i.c.c.i.a.s.RunRecordCorrectorService@107] - Corrected 1 run records with status in [STARTING, RUNNING, SUSPENDED] that have no actual running program. Such programs likely have crashed or were killed by external signal.

Root cause

After a restart, we construct the ProgramId from the app name. The app version is not available at that point, so it is set to the default value, -SNAPSHOT.
However, the ProgramRunId in the run record has a non-default version, so the lookup fails because the versions don't match.
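
The mismatch can be illustrated with a small self-contained sketch. `RunStoreSketch` and its methods are hypothetical stand-ins, not CDAP's AppMetadataStore or ProgramRunId; the only assumption carried over from the description is that the application version is part of the lookup key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a run-record store keyed by app, version, program, and run id. */
public class RunStoreSketch {
  // Default version used when none is known; matches the "-SNAPSHOT" seen in the warning log.
  public static final String DEFAULT_VERSION = "-SNAPSHOT";

  private final Map<String, String> statusByKey = new HashMap<>();

  // The version is part of the key, so two ids that differ only in version never match.
  private static String key(String app, String version, String program, String runId) {
    return String.join("/", app, version, program, runId);
  }

  public void recordRunning(String app, String version, String program, String runId) {
    statusByKey.put(key(app, version, program, runId), "RUNNING");
  }

  // Returns the new status, or empty when no run record exists for the id,
  // in which case the transition is ignored.
  public Optional<String> markCompleted(String app, String version, String program, String runId) {
    String k = key(app, version, program, runId);
    if (!statusByKey.containsKey(k)) {
      return Optional.empty();
    }
    statusByKey.put(k, "COMPLETED");
    return Optional.of("COMPLETED");
  }

  public Optional<String> status(String app, String version, String program, String runId) {
    return Optional.ofNullable(statusByKey.get(key(app, version, program, runId)));
  }
}
```

In this sketch, a run recorded under a non-default version and then completed with `DEFAULT_VERSION` is not found, so its record stays in RUNNING until something else corrects it.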

Fix

Added the application version to the LiveInfo so that it is available when constructing the ProgramId from the application name.
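
A minimal sketch of the idea, under stated assumptions: the `LiveInfo` class, its fields, and `programRunId` below are illustrative names only and do not reflect the actual Twill or CDAP types. Falling back to the default version when no version is available is also an assumption made for the example.

```java
/** Hypothetical sketch: live info that carries the application version across a restart. */
public class LiveInfoSketch {
  public static final String DEFAULT_VERSION = "-SNAPSHOT";

  /** Illustrative value holder; appVersion may be null when the running application does not expose one. */
  public static final class LiveInfo {
    final String appName;
    final String appVersion;
    final String runId;

    public LiveInfo(String appName, String appVersion, String runId) {
      this.appName = appName;
      this.appVersion = appVersion;
      this.runId = runId;
    }
  }

  // Rebuilds the run id from live info, using the carried version when present
  // and the default version otherwise (assumed fallback, for illustration).
  public static String programRunId(LiveInfo info, String programType, String program) {
    String version = info.appVersion != null ? info.appVersion : DEFAULT_VERSION;
    return String.join(".", info.appName, version, programType, program, info.runId);
  }
}
```

With the version carried along, the id rebuilt after a restart uses the same version as the stored run record instead of the default.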

@rmstar rmstar requested a review from albertshau January 10, 2024 00:39
  runId = RunIds.fromString(((ExtendedTwillApplication) application).getRunId());
  appVersion = ((ExtendedTwillApplication) application).getApplicationVersion();
} else {
  appVersion = null;

Review comment (Contributor):

can you add a comment about when we would expect this to happen (if at all)

Reply (Contributor Author):

done

@rmstar rmstar added the "build" label (Triggers github actions build) Jan 16, 2024
@rmstar rmstar merged commit 93f8fe1 into develop Jan 17, 2024
@rmstar rmstar deleted the CDAP-20913 branch January 17, 2024 06:17