Improve observability of compaction jobs #9

Closed
gaffer01 opened this issue Jun 17, 2022 · 2 comments

Labels: compactions-module, enhancement (New feature or request)

gaffer01 (Member) commented Jun 17, 2022

There is no single place to go to find out all the information about the lifecycle of a compaction job (i.e. when it was created, when it was pulled off the queue, how long it took to run, whether it succeeded, etc.). This information is scattered across various logs in CloudWatch. We should record this information in a DynamoDB table.

Suggested design:

  • Have one DynamoDB table that will be used to record information about the lifecycle of all compaction jobs (for all Sleeper tables, i.e. not one Dynamo table per Sleeper table).
  • Key design: hash key of compaction job id, sort key of timestamp of update (see the sketch after this list).
  • Record the following stages of the lifecycle of a compaction job:
    • Job creation
    • Job pulled off the queue
    • Job finish time
    • Job finish status - whether it succeeded, the total number of records read and the number written (these two are not necessarily the same, as an iterator may filter out records), and the rate at which records were written.
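
As a rough sketch of the key design above, the table could be created with the AWS SDK for Java v2 along the following lines. The table and attribute names here are placeholders, not a settled design:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeDefinition;
import software.amazon.awssdk.services.dynamodb.model.BillingMode;
import software.amazon.awssdk.services.dynamodb.model.CreateTableRequest;
import software.amazon.awssdk.services.dynamodb.model.KeySchemaElement;
import software.amazon.awssdk.services.dynamodb.model.KeyType;
import software.amazon.awssdk.services.dynamodb.model.ScalarAttributeType;

public class CreateCompactionJobStatusTable {

    public static void main(String[] args) {
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {
            dynamo.createTable(CreateTableRequest.builder()
                    .tableName("sleeper-compaction-job-status") // placeholder name
                    .attributeDefinitions(
                            AttributeDefinition.builder()
                                    .attributeName("JobId").attributeType(ScalarAttributeType.S).build(),
                            AttributeDefinition.builder()
                                    .attributeName("UpdateTime").attributeType(ScalarAttributeType.N).build())
                    .keySchema(
                            // Hash key: compaction job id
                            KeySchemaElement.builder().attributeName("JobId").keyType(KeyType.HASH).build(),
                            // Sort key: timestamp of the update (epoch millis)
                            KeySchemaElement.builder().attributeName("UpdateTime").keyType(KeyType.RANGE).build())
                    .billingMode(BillingMode.PAY_PER_REQUEST)
                    .build());
        }
    }
}
```

One table shared by all Sleeper tables would presumably carry the Sleeper table name as an ordinary attribute on each item rather than as part of the key.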

Note that a compaction job may be pulled off the queue twice, as SQS does not guarantee that a message will be delivered only once. Each compaction task should therefore perhaps get a unique id so that we can separate the updates from the different tasks.
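
For example, each status update could carry a per-task UUID, so duplicate deliveries show up as updates from distinct tasks. This is just a sketch; the class, attribute names and update types are illustrative only:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class CompactionJobStatusRecorder {

    private final DynamoDbClient dynamo = DynamoDbClient.create();
    // One id per compaction task, e.g. generated when the ECS task starts
    private final String taskId = UUID.randomUUID().toString();

    public void recordJobStarted(String jobId) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("JobId", AttributeValue.builder().s(jobId).build());
        item.put("UpdateTime", AttributeValue.builder()
                .n(Long.toString(System.currentTimeMillis())).build());
        item.put("UpdateType", AttributeValue.builder().s("started").build());
        item.put("TaskId", AttributeValue.builder().s(taskId).build());
        dynamo.putItem(PutItemRequest.builder()
                .tableName("sleeper-compaction-job-status") // placeholder name
                .item(item)
                .build());
    }
}
```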

We could also record the lifecycle of compaction ECS tasks - creation time, total run time, total number of records processed, etc.

We will need a Java class that can report the status of a particular compaction job (by querying Dynamo for the relevant information), and a script in scripts/utility to make that class easy to use.
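
A minimal sketch of that reporting query, again with placeholder table and attribute names, could query by the job id hash key and print the updates in timestamp order:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

import java.util.Map;

public class CompactionJobStatusReporter {

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    public void printStatus(String jobId) {
        QueryRequest query = QueryRequest.builder()
                .tableName("sleeper-compaction-job-status") // placeholder name
                .keyConditionExpression("JobId = :jobId")
                .expressionAttributeValues(Map.of(
                        ":jobId", AttributeValue.builder().s(jobId).build()))
                .scanIndexForward(true) // return updates in timestamp order
                .build();
        for (Map<String, AttributeValue> item : dynamo.query(query).items()) {
            AttributeValue taskId = item.get("TaskId");
            System.out.println(item.get("UpdateTime").n()
                    + " " + item.get("UpdateType").s()
                    + (taskId == null ? "" : " (task " + taskId.s() + ")"));
        }
    }
}
```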

We will also need to update the documentation to explain how to use that script.

patchwork01 (Collaborator) commented

I've made some more issues for this:

#158 Create DynamoDB table for compaction job events
#160 Record compaction job created event in DynamoDB
#161 Record compaction job processing events in DynamoDB
#162 Client to report compaction job status
#159 Create DynamoDB table for compaction task events
#163 Record compaction task events in DynamoDB

gaffer01 (Member, Author) commented

Completed by the subissues listed above.
