Improve observability of compaction jobs #9

Closed
gaffer01 opened this issue Jun 17, 2022 · 2 comments

Labels: compactions-module, enhancement (New feature or request)

gaffer01 (Member) commented Jun 17, 2022

There is no single place to go to find out all the information about the lifecycle of a compaction job (i.e. when it was created, when it was pulled off the queue, how long it took to run, whether it succeeded, etc.). This information is scattered across various logs in CloudWatch. We should record this information in a DynamoDB table.

Suggested design:

  • Have one DynamoDB table that will be used to record information about the lifecycle of all compaction jobs (for all Sleeper tables, i.e. not one Dynamo table per Sleeper table).
  • Key design: hash key of compaction job id, sort key of timestamp of update (see the sketch after this list).
  • Record the following stages of the lifecycle of a compaction job:
    • Job creation
    • Job pulled off the queue
    • Job finish time
    • Job finish status - whether it succeeded, the total number of records read and the number written (these two are not necessarily the same, as an iterator may filter out records), and the rate at which records were written.
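
As a rough sketch of the key design above, the table could be created with the AWS SDK for Java v2 along the following lines. The table and attribute names here are placeholders, not a settled design:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeDefinition;
import software.amazon.awssdk.services.dynamodb.model.BillingMode;
import software.amazon.awssdk.services.dynamodb.model.CreateTableRequest;
import software.amazon.awssdk.services.dynamodb.model.KeySchemaElement;
import software.amazon.awssdk.services.dynamodb.model.KeyType;
import software.amazon.awssdk.services.dynamodb.model.ScalarAttributeType;

public class CreateCompactionJobStatusTable {

    public static void main(String[] args) {
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {
            dynamo.createTable(CreateTableRequest.builder()
                    .tableName("sleeper-compaction-job-status") // placeholder name
                    .attributeDefinitions(
                            AttributeDefinition.builder()
                                    .attributeName("JobId").attributeType(ScalarAttributeType.S).build(),
                            AttributeDefinition.builder()
                                    .attributeName("UpdateTime").attributeType(ScalarAttributeType.N).build())
                    .keySchema(
                            // Hash key: compaction job id
                            KeySchemaElement.builder().attributeName("JobId").keyType(KeyType.HASH).build(),
                            // Sort key: timestamp of the update (epoch millis)
                            KeySchemaElement.builder().attributeName("UpdateTime").keyType(KeyType.RANGE).build())
                    .billingMode(BillingMode.PAY_PER_REQUEST)
                    .build());
        }
    }
}
```

One table shared by all Sleeper tables would presumably carry the Sleeper table name as an ordinary attribute on each item rather than as part of the key.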

Note that a compaction job may be pulled off the queue twice, as SQS does not guarantee that a message will be delivered only once. Each compaction task should therefore perhaps get a unique id so that we can separate the updates from the different tasks.
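
For example, each status update could carry a per-task UUID, so duplicate deliveries show up as updates from distinct tasks. This is just a sketch; the class, attribute names and update types are illustrative only:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class CompactionJobStatusRecorder {

    private final DynamoDbClient dynamo = DynamoDbClient.create();
    // One id per compaction task, e.g. generated when the ECS task starts
    private final String taskId = UUID.randomUUID().toString();

    public void recordJobStarted(String jobId) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("JobId", AttributeValue.builder().s(jobId).build());
        item.put("UpdateTime", AttributeValue.builder()
                .n(Long.toString(System.currentTimeMillis())).build());
        item.put("UpdateType", AttributeValue.builder().s("started").build());
        item.put("TaskId", AttributeValue.builder().s(taskId).build());
        dynamo.putItem(PutItemRequest.builder()
                .tableName("sleeper-compaction-job-status") // placeholder name
                .item(item)
                .build());
    }
}
```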

We could also record the lifecycle of compaction ECS tasks - creation time, total run time, total number of records processed, etc.

We will need a Java class that can report the status of a particular compaction job (by querying Dynamo for the relevant information), and a script in scripts/utility to make that class easy to use.
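
A minimal sketch of that reporting query, again with placeholder table and attribute names, could query by the job id hash key and print the updates in timestamp order:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

import java.util.Map;

public class CompactionJobStatusReporter {

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    public void printStatus(String jobId) {
        QueryRequest query = QueryRequest.builder()
                .tableName("sleeper-compaction-job-status") // placeholder name
                .keyConditionExpression("JobId = :jobId")
                .expressionAttributeValues(Map.of(
                        ":jobId", AttributeValue.builder().s(jobId).build()))
                .scanIndexForward(true) // return updates in timestamp order
                .build();
        for (Map<String, AttributeValue> item : dynamo.query(query).items()) {
            AttributeValue taskId = item.get("TaskId");
            System.out.println(item.get("UpdateTime").n()
                    + " " + item.get("UpdateType").s()
                    + (taskId == null ? "" : " (task " + taskId.s() + ")"));
        }
    }
}
```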

We will also need to update the documentation to explain how to use that script.

patchwork01 (Collaborator) commented

I've made some more issues for this:

#158 Create DynamoDB table for compaction job events
#160 Record compaction job created event in DynamoDB
#161 Record compaction job processing events in DynamoDB
#162 Client to report compaction job status
#159 Create DynamoDB table for compaction task events
#163 Record compaction task events in DynamoDB

gaffer01 (Member, Author) commented

Completed by the subissues listed above.
