Implement TTL support for Pinot upsert #9529

Open
deemoliu opened this issue Oct 4, 2022 · 5 comments

Comments

@deemoliu
Contributor

deemoliu commented Oct 4, 2022

Apache Pinot has provided native upsert support since v0.6.0 (#4261). It allows users to modify existing records, and many use cases have been onboarded successfully. We have observed that Pinot upsert clusters usually show high heap memory usage, because the upsert metadata (primaryKeyIndexes and validDocIndexes) is stored on the heap of the Pinot hosts. For use cases with high primary key cardinality, the heap usage of these upsert tables usually becomes the hardware bottleneck.

In some use cases, records that share a primary key are updated frequently during a time window and are no longer updated once the window has passed. Each primary key therefore has a lifecycle and becomes inactive after the time window. Currently these primary keys do not expire until the table retention period elapses, and until then they are kept in primaryKeyIndexes. We propose introducing a TTL (time-to-live) for Pinot primary keys: primary keys expire after the TTL, and inactive keys can then be removed from the upsert metadata to save heap space.

A few challenges that we want to solve:

  • Snapshot management for validDocIndexes
  • Implementing TTL for primary keys in primaryKeyIndexes
  • Snapshot backup in the deep store
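
The TTL idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Pinot's actual upsert metadata code: the class `TtlPrimaryKeyIndex`, the `RecordLocation` type, and the choice to measure expiry against the largest comparison value seen so far are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, single-threaded sketch; these are not Pinot's real classes.
final class TtlPrimaryKeyIndex {

  // Where the latest record for a primary key lives (assumed shape).
  static final class RecordLocation {
    final String segmentName;
    final int docId;
    final long comparisonValue; // e.g. event time in millis (assumption)

    RecordLocation(String segmentName, int docId, long comparisonValue) {
      this.segmentName = segmentName;
      this.docId = docId;
      this.comparisonValue = comparisonValue;
    }
  }

  private final Map<String, RecordLocation> primaryKeyIndexes = new HashMap<>();
  private final long ttlMillis;
  private long largestSeenComparisonValue = Long.MIN_VALUE;

  TtlPrimaryKeyIndex(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  // Upsert: keep whichever location has the larger (or equal) comparison value.
  void upsert(String primaryKey, RecordLocation incoming) {
    primaryKeyIndexes.merge(primaryKey, incoming,
        (current, candidate) ->
            candidate.comparisonValue >= current.comparisonValue ? candidate : current);
    largestSeenComparisonValue =
        Math.max(largestSeenComparisonValue, incoming.comparisonValue);
  }

  // Drop keys whose latest record is older than (largest seen value - TTL).
  // Returns the number of keys removed.
  int removeExpiredKeys() {
    if (primaryKeyIndexes.isEmpty()) {
      return 0;
    }
    long threshold = largestSeenComparisonValue - ttlMillis;
    int before = primaryKeyIndexes.size();
    primaryKeyIndexes.values().removeIf(loc -> loc.comparisonValue < threshold);
    return before - primaryKeyIndexes.size();
  }

  boolean contains(String primaryKey) {
    return primaryKeyIndexes.containsKey(primaryKey);
  }

  int size() {
    return primaryKeyIndexes.size();
  }
}
```

In this sketch, expiry is relative to the data's own watermark rather than wall-clock time, so a table that stops ingesting does not lose keys. Whether that is the right choice is a design question for the proposal.
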
@deemoliu
Contributor Author

deemoliu commented Oct 4, 2022

We summarized the challenges and thoughts for partial upsert in this design

Please review. cc @Jackie-Jiang @chenboat @yupeng9

@deemoliu
Contributor Author

deemoliu commented Oct 4, 2022

ValidDocIds Snapshot management PR and Pinot doc.

@deemoliu
Contributor Author

deemoliu commented Dec 6, 2022

After discussion with @Jackie-Jiang @yupeng9 @chenboat, we can break down the feature into the following parts:

  • Design doc updates
  • Part 1: when committing a segment, update replaceSegment to clean up keys
  • Part 1.1: clean up keys in the primary key indexes
  • Part 1.2: generate the snapshot locally
  • Part 1.3: [Deepstore] upload the snapshot to Deepstore
  • Part 2: periodic job in the Pinot controller (upload the snapshot if not persisted)
  • Part 3: add a download-snapshot API on the server side
  • Part 4: when loading segments, get the snapshot to avoid recomputation
  • Part 4.1: get the snapshot from a peer server
  • Part 4.2: [Deepstore] get the snapshot from Deepstore
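
The loading path in part 4 could look roughly like the sketch below: try each snapshot source in order and recompute only if none has a snapshot. Everything here is an assumption for illustration: the `ValidDocIdsSnapshotLoader` class, the `SnapshotSource` interface, the source ordering, and the use of `java.util.BitSet` as a stand-in for whatever bitmap type the real implementation uses.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of "get snapshot to avoid re-compute"; not Pinot's API.
final class ValidDocIdsSnapshotLoader {

  // One place a snapshot may come from: local disk, a peer server, Deepstore, ...
  interface SnapshotSource {
    Optional<BitSet> fetch(String segmentName);
  }

  private final List<SnapshotSource> sourcesInOrder;
  private final Function<String, BitSet> recompute;

  ValidDocIdsSnapshotLoader(List<SnapshotSource> sourcesInOrder,
                            Function<String, BitSet> recompute) {
    this.sourcesInOrder = sourcesInOrder;
    this.recompute = recompute;
  }

  // Return the first available snapshot; fall back to recomputing validDocIds.
  BitSet load(String segmentName) {
    for (SnapshotSource source : sourcesInOrder) {
      Optional<BitSet> snapshot = source.fetch(segmentName);
      if (snapshot.isPresent()) {
        return snapshot.get();
      }
    }
    return recompute.apply(segmentName);
  }

  // Serialize a snapshot, e.g. before writing it to local disk (part 1.2).
  static byte[] toBytes(BitSet validDocIds) {
    return validDocIds.toByteArray();
  }

  // Rebuild a snapshot from its serialized form.
  static BitSet fromBytes(byte[] bytes) {
    return BitSet.valueOf(bytes);
  }
}
```

A caller would pass the sources in preference order, for example peer server first and Deepstore second, matching parts 4.1 and 4.2.
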

@Jackie-Jiang
Contributor

Thanks for summarizing it. Part 1.3 is not required. The controller will ask the server for the snapshot, and the controller is then responsible for uploading it.

@deemoliu
Contributor Author

The POC was done in #10047; however, it left some corner cases unhandled.
These corner cases were addressed in #10915.
