Presence vector #4585
Codecov Report
@@ Coverage Diff @@
## master #4585 +/- ##
============================================
- Coverage 57.92% 57.78% -0.15%
+ Complexity 16 4 -12
============================================
Files 1213 1207 -6
Lines 65135 64880 -255
Branches 9488 9436 -52
============================================
- Hits 37732 37491 -241
+ Misses 24544 24542 -2
+ Partials 2859 2847 -12
Continue to review full report at Codecov.
Instead of adding it as part of indexingConfig, I think that it may make more sense to configure this in the schema, since we are adding presenceVector for handling null values.
e.g.
{
"name": "memberId",
"dataType": "INT",
"isNullable": true
}
Sorry for the delay. I was out for a while at a conference.
Would you add the following to the commit message?
- Link "NULL value support for all data types" (#4230) for reference. After I read the issue, this PR made much more sense.
- Add a description of how the presence vector is populated. From reading the code, it seems that we generate a presence vector by default.
- Can you also add a little more explanation to the commit message on when you filter out NULL values? (e.g. when a predicate is added by the user -- column != NULL...)
I think that most comments are minor and style issues. I will do one more final review after this.
@@ -33,7 +36,8 @@
   public BitmapDocIdSet(ImmutableRoaringBitmap[] bitmaps, int startDocId, int endDocId, boolean exclusive) {
     int numBitmaps = bitmaps.length;
     if (numBitmaps > 1) {
-      MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(bitmaps);
+      Iterator iterator = Arrays.asList(bitmaps).iterator();
+      MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(iterator, startDocId, endDocId + 1);
Is this change about a bug fix or API change?
@kishoreg: can you please clarify? I'm not sure we need this change for the presence vector.
I believe this might be slightly more efficient (based on conversation with Kishore)
This could backfire because there is an extra step of selecting the range for each bitmap. I checked the code and it seems this step is always redundant, because we always use 0 as startDocId and numDocs-1 as endDocId. Also, this is not related to this PR. @kishoreg Do you see an actual performance gain from this change?
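To make the concern concrete, here is a minimal sketch. It uses java.util.BitSet as a stand-in for the RoaringBitmap types, and the class and method names are hypothetical rather than Pinot code. Under the assumption that the range is always [0, numDocs), clipping every bitmap before OR-ing produces the same result as a plain OR, so the clipping step would only add work.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; BitSet stands in for the RoaringBitmap types.
class RangeOrSketch {
  // Plain union of all bitmaps, analogous to the original or(bitmaps) call.
  static BitSet or(List<BitSet> bitmaps) {
    BitSet result = new BitSet();
    for (BitSet bitmap : bitmaps) {
      result.or(bitmap);
    }
    return result;
  }

  // Union restricted to [start, endExclusive): every input is clipped first,
  // which is the extra per-bitmap step being questioned.
  static BitSet orInRange(List<BitSet> bitmaps, int start, int endExclusive) {
    BitSet result = new BitSet();
    for (BitSet bitmap : bitmaps) {
      BitSet clipped = (BitSet) bitmap.clone();
      clipped.clear(0, start);
      int length = clipped.length();
      if (length > endExclusive) {
        clipped.clear(endExclusive, length);
      }
      result.or(clipped);
    }
    return result;
  }
}
```

With start = 0 and endExclusive = numDocs, both methods return equal bit sets for any input whose bits are all below numDocs.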
Thanks for pointing that out. Updated description. Please take a look
@icefury71 Thanks for the reply. LGTM assuming that you address the comments above (2 spaces, comment on bloom filter etc). I will merge this once that part is updated. @kishoreg Can you reply to
Why do we need to maintain a record level bitmap? That can easily cause memory issues (everything on heap).
If we want to preserve null values, we should remove the NullValueTransformer from the record reader and let the record reader insert null into the GenericRow. The segment creator/mutable segment should be able to directly process null values.
Please reformat the code with PinotStyle #3705
The bitmap added to GenericRow inside NullValueTransformer is a transient (short-lived) object, which shouldn't cause any memory concern. However, I was maintaining a bitmap per docId inside MutableSegmentImpl. I've optimized that part of the code: it's no longer necessary to keep track of all these bitmaps in memory.
I've reformatted the code based on Pinot style.
I'm still against storing the bitmap in the GenericRecord because that will couple the presence vector with the NullValueTransformer (real-time segment generation does not use NullValueTransformer, thus it won't work properly).
By real-time I'm assuming you mean either HLC / LLC realtime segments? From code inspection it does look like we're using NullValueTransformer in the real-time path: (this is calling CompositeTransformer.getDefaultTransformer which in turn brings in NullValueTransformer)
We considered this option. However, the problem is that it's extremely difficult to identify a default null value which is outside the domain of legitimate values. In many cases there could be an overlap. Given that memory is not a big concern, this could be a good way out. Thoughts?
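The overlap problem can be sketched as follows. This is an illustration under stated assumptions: Integer.MIN_VALUE is assumed to be the default null value for an INT column, java.util.BitSet stands in for a compressed bitmap, and every name below is hypothetical rather than Pinot code. A legitimate Integer.MIN_VALUE then looks exactly like a null unless null positions are tracked separately.

```java
import java.util.BitSet;

// Hypothetical illustration; not Pinot code.
class SentinelOverlapSketch {
  // Assumed sentinel used to fill in null INT values.
  static final int DEFAULT_NULL_INT = Integer.MIN_VALUE;

  final int[] values;
  final BitSet nullDocIds = new BitSet();

  SentinelOverlapSketch(Integer[] raw) {
    values = new int[raw.length];
    for (int docId = 0; docId < raw.length; docId++) {
      if (raw[docId] == null) {
        // Store the sentinel and remember that this doc was really null.
        values[docId] = DEFAULT_NULL_INT;
        nullDocIds.set(docId);
      } else {
        values[docId] = raw[docId];
      }
    }
  }

  // Ambiguous: true for real nulls and for legitimate sentinel values alike.
  boolean looksNullBySentinel(int docId) {
    return values[docId] == DEFAULT_NULL_INT;
  }

  // Unambiguous: consults the separately tracked null positions.
  boolean isNull(int docId) {
    return nullDocIds.get(docId);
  }
}
```

For the input {7, null, Integer.MIN_VALUE}, the sentinel check flags both doc 1 and doc 2, while the bitmap flags only doc 1.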
Force-pushed from 9a45eab to 005dec8.
Did a rebase against master to resolve conflicts.
The record reader you mentioned is for ingesting data into real-time. When we convert real-time segment into offline segment, we use the RealtimeSegmentConverter which does not perform any transform on the records as records should already be in the desired format.
I don't quite follow. The logic should be very straightforward. If we allow null values, put the null value into the presence vector; if not, throw an exception.
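That rule can be sketched in a few lines. This is only an illustration under stated assumptions: the class, the nullable flag, and the use of java.util.BitSet in place of a compressed bitmap are all hypothetical and not Pinot code.

```java
import java.util.BitSet;

// Hypothetical illustration of "record the null if allowed, otherwise fail".
class NullHandlingSketch {
  private final boolean nullable;
  private final BitSet nullDocIds = new BitSet();
  private int nextDocId = 0;

  NullHandlingSketch(boolean nullable) {
    this.nullable = nullable;
  }

  // Indexes one value and returns the doc ID assigned to it.
  int index(Object value) {
    if (value == null) {
      if (!nullable) {
        throw new IllegalArgumentException(
            "Null value for non-nullable column at docId " + nextDocId);
      }
      // Nullable column: remember that this doc holds a null.
      nullDocIds.set(nextDocId);
    }
    return nextDocId++;
  }

  boolean isNull(int docId) {
    return nullDocIds.get(docId);
  }
}
```

A caller would construct one instance per column, so the nullable decision is made once and the indexing path stays a single branch.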
…ate presence vector by default for all columns with this PR
- Bug fixes for NullValueTransformer and some unit tests
- Addressing review feedback
- Clean up imports
- Using dynamically generated null-bitmap instead of keeping a copy in memory
- Using a set instead of bitmap for keeping track of null columns
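The last two items can be pictured with a rough sketch. It assumes a row that carries a short-lived set of null column names, which the segment turns into per-column null bitmaps at index time. All names are hypothetical, java.util.BitSet stands in for a compressed bitmap, and this is not the code from the commits.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the code from this PR.
class RowNullTrackingSketch {
  // Stand-in for a row: field values plus a transient set of null columns.
  static class Row {
    final Map<String, Object> fields = new HashMap<>();
    final Set<String> nullColumns = new HashSet<>();
  }

  // Stand-in for a null-value transformer: substitute defaults and
  // remember which columns were null in the incoming row.
  static void fillDefaults(Row row, Map<String, Object> defaults) {
    for (Map.Entry<String, Object> entry : defaults.entrySet()) {
      if (row.fields.get(entry.getKey()) == null) {
        row.fields.put(entry.getKey(), entry.getValue());
        row.nullColumns.add(entry.getKey());
      }
    }
  }

  // Per-column null bitmaps, created lazily on the first null seen.
  final Map<String, BitSet> nullBitmaps = new HashMap<>();
  int numDocs = 0;

  // Copies the row's null markers into the bitmaps under the next doc ID.
  void index(Row row) {
    for (String column : row.nullColumns) {
      nullBitmaps.computeIfAbsent(column, c -> new BitSet()).set(numDocs);
    }
    numDocs++;
  }
}
```

Only the per-column bitmaps outlive the row; the set on the row can be discarded as soon as index returns.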
Force-pushed from 3f8bc67 to 3bf130c.
Please use the APIs introduced in #4671 to create the vector
- Addressing review comments
Good in general.
I still feel it is better to make it NullValueVector for the following reasons:
- Underlying bitmap is nullValueBitmap
- Always set null instead of set non-null
- We only need to check whether the value is null (all the return values of isPresent() are reversed)
I understand the intention of PresenceVector is to return the presenceBitmap quickly, but if the underlying storage is actually a nullValueBitmap, it is quite confusing. In the following PRs, we can maintain two bitmaps in memory, but still only store nullValueBitmap on disk. @kishoreg Thoughts?
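A minimal sketch of the proposed shape, for illustration only. The class and method names are hypothetical, and java.util.BitSet stands in for the on-disk bitmap. The stored bitmap marks nulls, callers ask isNull directly, and a presence bitmap is derived in memory only when it is needed.

```java
import java.util.BitSet;

// Hypothetical illustration of a null-value vector; not Pinot code.
class NullValueVectorSketch {
  private final BitSet nullBitmap = new BitSet();
  private final int numDocs;

  NullValueVectorSketch(int numDocs) {
    this.numDocs = numDocs;
  }

  // Writers always set nulls; non-null docs need no bookkeeping.
  void setNull(int docId) {
    nullBitmap.set(docId);
  }

  // Readers check the stored bitmap directly, with no inversion.
  boolean isNull(int docId) {
    return nullBitmap.get(docId);
  }

  // Derived view: the complement of the null bitmap within [0, numDocs).
  BitSet presenceBitmap() {
    BitSet presence = new BitSet(numDocs);
    presence.set(0, numDocs);
    presence.andNot(nullBitmap);
    return presence;
  }
}
```

Keeping the derived bitmap out of storage means the name of the class matches what is actually written to disk.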
Implementation looks good, minor comments
Modified to NullValueVector as per suggestion
LGTM. Thanks for addressing the comments
This PR adds support for a presence vector inside mutable and immutable segments. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: #4230
High level gist is as follows:
- Create a presence vector per column per segment by default. This presence vector keeps track of which document IDs have a null value in the corresponding column.
- Expose this through the DataSource interface to enable the caller to ignore such document IDs.
Presence vector is not really used anywhere as part of this PR. Subsequent PRs will use it to filter out null values in predicates (e.g. select ... from table where column_name != null)
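As a rough illustration of how a later change might consume the vector for a predicate such as column_name != null, consider the sketch below. It is written under assumptions: ColumnSource is a hypothetical stand-in for whatever the DataSource interface ends up exposing, and java.util.BitSet replaces the real bitmap type.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; none of these names are Pinot APIs.
class NotNullFilterSketch {
  // Stand-in for a per-column data source that exposes its null doc IDs.
  static class ColumnSource {
    final int numDocs;
    final BitSet nullDocIds;

    ColumnSource(int numDocs, BitSet nullDocIds) {
      this.numDocs = numDocs;
      this.nullDocIds = nullDocIds;
    }
  }

  // Doc IDs matching "column != null": every doc whose null bit is clear.
  static List<Integer> notNullDocIds(ColumnSource column) {
    List<Integer> result = new ArrayList<>();
    for (int docId = column.nullDocIds.nextClearBit(0); docId < column.numDocs;
        docId = column.nullDocIds.nextClearBit(docId + 1)) {
      result.add(docId);
    }
    return result;
  }
}
```

A real implementation would more likely return a bitmap or a doc ID iterator than a boxed list; the list is used here only to keep the example short.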