Conversation
Codecov Report
@@ Coverage Diff @@
## master #10702 +/- ##
=============================================
+ Coverage 31.97% 70.27% +38.30%
- Complexity 453 6439 +5986
=============================================
Files 2111 2112 +1
Lines 113959 114036 +77
Branches 17196 17228 +32
=============================================
+ Hits 36433 80144 +43711
+ Misses 74234 28283 -45951
- Partials 3292 5609 +2317
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1342 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
This won't work. We need to keep 2 bitmaps, one for valid docs and one for queryable docs. If we only keep valid docs, and count delete as removing it from the valid doc, the next event won't work as expected because we lost track of the delete event.
This is already a design (#10452) that covers this, so let's not duplicate the work.
@navina will post the implementation soon
This is precisely what we realized during our previous attempt (#10035) . Your solution isn't any different from what was attempted before. @atris : You had reached out to me via DM (on slack) to collaborate and agreed to review the existing design. Yet, you went ahead and did your own work. This is duplicated effort and not setting the right standard for committers of the Apache Pinot project. |
@Jackie-Jiang and myself discussed offline, and we have agreed to iterate on this PR -- and look on solving the snapshot of validDocID map issues. The main issue that I wish to avoid is the separate map for queryable docIDs. The three issues we face today with the tombstone approach are:
This PR solves 3. by maintaining the full record of deletion in the segment and the primary key index -- as is demonstrated in the added unit tests. For 1, it does not matter as long as we are able to rebuild the primary key index with the deleted primary key records present. The main challenge is 2, and we need to think about it a bit more. Will keep the group posted. |
There are many subtle changes from the last PR -- and this actually works for most cases except the snapshot load, which is easy to fix using the queryable map. The last PR was unable to handle out of order records and had no concept of denying partial updates on deleted records but allowing new records with the same primary key to be updated. To answer your specific question, we are not losing the deleted event in this PR. That is clearly demonstrated in the unit tests. For further details, please compare the two PRs.
Thank you for highlighting this. I do not wish to go into details of our 1:1 conversation here -- but my understanding was that:
In all honesty, I feel it is against the spirit of OSS if we "block" issues that we have not yet written code for. A good example is the issue on adding support for GROUP BY for filter aggregations -- where the issue was assigned to me, and I had a half baked PR that I did not get around to finishing. Since Evan posted a PR, I was happy to let my patch take a step back. In this case, I only finished my PR since you mentioned that you will work on it post May 10th and would find it hard to collaborate with me -- which led me to think that our internal timelines might not be matched. If you have any further comments, I would be happy to talk 1:1. Also, please feel free to review the PR and leave any feedback -- I would be happy to address it. |
This is not what I said.. I said not to directly close this PR and see if it is possible to solve the problems, and if @navina cannot get an implementation sticking with your timeline we can go with this one. |
Sure, I should have been more verbose, my bad.
Could you please elaborate a bit more on where things could break if we had keys in upsert metadata but not in valid doc IDs? My understanding is that the two places where things will break are:
I do not see anything breaking at the runtime because the access pattern defines that the upsert metadata dictates which keys to be looked up in valid doc IDs -- and since the upsert metadata has the superset of docIDs, it should not lose any data. I think this is fortified by the fact that this PR makes the change and there are no tests breaking. Please provide your inputs |
While all those are true, I told you that my design was open for feedback and you agreed to review it.
You didn't communicate your timeline in our conversation nor did you explain the use-case when I asked. I am not aware of how you communicated the handover of your group by issue to another contributor and I don't think it is relevant to the tasks I own. I have put in effort in this work and have shared my design ( https://docs.google.com/document/d/19s1AHCRjmqeVa0z_djYBTAt_meg6qWcUxwRO3y-G9O4/edit?usp=drivesdk ) and PR ( #10703 ) for review. |
|
Hi Team, reading with a neutral perspective, it seems like there were some miscommunications that led to duplicate effort on both sides. But I am 100% sure everyone here here:
With above in mind, and the contribution guidelines, I do see that this design doc being reviewed. My recommendation would be for all of us to collaborate on that design and any associated PRs to take this feature to completion as soon as possible. Thank you all! |
+1 to #One Team Also I hope we can agree on the direction. One key difference I can see between these two approaches is to reuse validDocId or not for deletion. I also left this comment on @navina's design. This is a key as it has implications on correctness/efficiency/complexity etc. Could you provide a detailed example on reusing validDocId is insufficient to help this discussion. Also, I'm curious what's the behavior if there's an update to record after deletion, will we ignore it or revive the record? |
I have added the reasoning in the design doc - https://docs.google.com/document/d/19s1AHCRjmqeVa0z_djYBTAt_meg6qWcUxwRO3y-G9O4/edit#heading=h.muy7y9j64ptn . can we continue the conversation in the design doc?
That's a good question. Going by the current upsert guarantees, if the event containing the update to a record has newer comparison values, it will be considered as a new record. I think it should be fine to revive it. I am not seeing any issues on doing that. Or we could make that configurable. Will add a section in the doc to further discuss this if you have any concerns. |
Please refer to the attached design document to this PR |
|
Closing this PR -- I have highlighted my concerns with the proposed design and have proposed an alternate in the attached design document. I will help land the other PR now |
Please refer to: https://docs.google.com/document/d/17WUvE6hZwauj3_gM5cPURWSjdE2e4q9au70VeFYYx4E/edit?usp=sharing
cc @xiangfu0 @yupeng9 @kishoreg @Jackie-Jiang