Segment compaction for upsert real-time tables #6912

Closed
pedro93 opened this issue May 13, 2021 · 1 comment

pedro93 commented May 13, 2021

Hello,

This issue requests support for segment compaction on real-time upsert-enabled tables, which does not currently exist, as mentioned in a Slack thread. As a result, segments with old and stale entries are kept on disk and are only deleted when the segment retention policy takes effect.
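To make the request concrete, here is a minimal sketch of what compaction would mean for an upsert table: keep only the newest record per primary key and drop everything else. The `Record` layout and the `compact` function are hypothetical and purely illustrative; they are not Pinot APIs and say nothing about how Pinot would implement this.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (primary key, event timestamp, payload).
Record = Tuple[str, int, dict]


def compact(segments: List[List[Record]]) -> List[Record]:
    """Keep only the most recent record for each primary key."""
    latest: Dict[str, Record] = {}
    for segment in segments:
        for key, ts, payload in segment:
            # A later (or equal) timestamp supersedes the stored record.
            if key not in latest or ts >= latest[key][1]:
                latest[key] = (key, ts, payload)
    return list(latest.values())


# Two segments in which user "u1" appears twice; only the newer row survives.
segments = [
    [("u1", 1, {"theme": "light"}), ("u2", 1, {"theme": "dark"})],
    [("u1", 2, {"theme": "dark"})],
]
compacted = compact(segments)
```

After compaction, the stale `u1` row from the first segment is gone, which is the space saving this issue asks for.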

Giving a concrete example why this is useful:

  • Suppose you have a stream of events related to user activity (updated profile, saw an article, updated preferences, etc.).
  • You define a real-time table in Pinot whose primary key is the userId. The segment size is 500k rows and the stream is partitioned.
  • The set of users is roughly fixed (~50M).
  • You want to keep segments for a largeish time period (> 2 years).
  • Each day ~20% (10M) of the users generate some event which is consumed by Pinot.

This generates ~20 segments per day (10M events / 500k per segment). Over the course of 2 years that is 14,600 segments, when in reality only 100 segments (50M users / 500k per segment) are needed to hold the most up-to-date information for each user.
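The arithmetic behind these numbers can be checked with a short calculation. It uses only the figures from the example and assumes one event per active user per day and 365-day years; the variable names are illustrative.

```python
# Figures from the example above.
TOTAL_USERS = 50_000_000
DAILY_ACTIVE_PERCENT = 20        # ~20% of users generate an event each day
ROWS_PER_SEGMENT = 500_000
RETENTION_DAYS = 2 * 365         # assumption: 365-day years

# Assumption: each active user produces exactly one event per day.
daily_events = TOTAL_USERS * DAILY_ACTIVE_PERCENT // 100
segments_per_day = daily_events // ROWS_PER_SEGMENT

# Without compaction, every consumed segment is kept until retention expires.
segments_without_compaction = segments_per_day * RETENTION_DAYS

# With compaction, only the latest row per user needs to be stored.
segments_with_compaction = TOTAL_USERS // ROWS_PER_SEGMENT

print(segments_per_day, segments_without_compaction, segments_with_compaction)
```

Under these assumptions the uncompacted table holds 146 times as many segments as are needed.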

If the example or issue is not clear feel free to reach out.

Thank you.

@robertzych
Contributor

I recently volunteered to implement this feature. Now requesting feedback on the design doc.
