Presence vector #4585

icefury71 · 2019-09-04T21:59:40Z

This PR adds support for a presence vector inside a mutable and immutable segment. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: #4230

The high-level gist is as follows:

- Create a presence vector per column per segment by default. This presence vector keeps track of which document IDs have a null value in the corresponding column.
- Expose this through the DataSource interface to enable the caller to ignore such document IDs.

The presence vector is not yet used anywhere as part of this PR. Subsequent PRs will use it to filter out null values in predicates (e.g. select ... from table where column_name != null).
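
As a rough illustration of how a later consumer might use such a vector to skip nulls, here is a minimal sketch. It assumes a plain `java.util.BitSet` in place of the bitmap type used in the PR, and `NullAwareScanSketch` and `sumNonNull` are hypothetical names, not part of Pinot.

```java
import java.util.BitSet;

/** Hypothetical helper; shows a scan that ignores documents flagged as null. */
final class NullAwareScanSketch {
  /**
   * Sums values[docId] for every document whose bit is NOT set in nullDocIds.
   * Null documents still hold some default value in the array, which is skipped.
   */
  static long sumNonNull(int[] values, BitSet nullDocIds) {
    long sum = 0;
    // nextClearBit walks the unset bits, i.e. the documents that carry a real value.
    for (int docId = nullDocIds.nextClearBit(0); docId < values.length;
        docId = nullDocIds.nextClearBit(docId + 1)) {
      sum += values[docId];
    }
    return sum;
  }
}
```

With values `{10, Integer.MIN_VALUE, 30, Integer.MIN_VALUE}` and documents 1 and 3 marked as null, only documents 0 and 2 contribute to the sum, so the placeholder default never leaks into the result.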

codecov-io · 2019-09-04T22:48:14Z

Codecov Report

Merging #4585 into master will decrease coverage by 0.14%.
The diff coverage is 89.07%.

@@             Coverage Diff              @@
##             master    #4585      +/-   ##
============================================
- Coverage     57.92%   57.78%   -0.15%     
+ Complexity       16        4      -12     
============================================
  Files          1213     1207       -6     
  Lines         65135    64880     -255     
  Branches       9488     9436      -52     
============================================
- Hits          37732    37491     -241     
+ Misses        24544    24542       -2     
+ Partials       2859     2847      -12

Impacted Files	Coverage Δ	Complexity Δ
...e/pinot/core/segment/creator/impl/V1Constants.java	`11.11% <ø> (ø)`	`0 <0> (ø)`	⬇️
...pinot/core/segment/store/ColumnIndexDirectory.java	`93.33% <ø> (ø)`	`0 <0> (ø)`	⬇️
.../java/org/apache/pinot/core/common/DataSource.java	`100% <ø> (ø)`	`0 <0> (ø)`	⬇️
...org/apache/pinot/common/config/IndexingConfig.java	`51.72% <0%> (-1.85%)`	`0 <0> (ø)`
...re/startree/v2/store/StarTreeMetricDataSource.java	`54.16% <0%> (-2.36%)`	`0 <0> (ø)`
...startree/v2/store/StarTreeDimensionDataSource.java	`69.56% <0%> (-3.17%)`	`0 <0> (ø)`
...t/core/segment/store/SingleFileIndexDirectory.java	`86.27% <100%> (+0.18%)`	`0 <0> (ø)`	⬇️
...inot/core/segment/store/FilePerIndexDirectory.java	`90.27% <100%> (+0.88%)`	`0 <0> (ø)`	⬇️
...t/index/loader/bloomfilter/BloomFilterHandler.java	`71.66% <100%> (ø)`	`0 <0> (ø)`	⬇️
...ache/pinot/core/segment/store/ColumnIndexType.java	`85.71% <100%> (+1.09%)`	`0 <0> (ø)`	⬇️
... and 89 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3579aae...260aac4.

snleee

Instead of adding it as part of indexingConfig, I think it may make more sense to configure this in the schema, since we are adding presenceVector for handling null values.
e.g.

    {
      "name": "memberId",
      "dataType": "INT",
      "isNullable": true
    }

...core/src/main/java/org/apache/pinot/core/segment/index/readers/PresenceVectorReaderImpl.java

pinot-core/src/test/resources/data/test_presence_vector_data.json

pinot-core/src/test/resources/data/test_presence_vector_pinot_schema.json

...tion-tests/src/test/java/org/apache/pinot/integration/tests/ClusterIntegrationTestUtils.java

pinot-common/src/main/java/org/apache/pinot/common/utils/CommonConstants.java

pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/NullValueTransformer.java

...in/java/org/apache/pinot/core/realtime/impl/presence/RealtimePresenceVectorReaderWriter.java

...src/main/java/org/apache/pinot/core/segment/creator/impl/presence/PresenceVectorCreator.java

pinot-core/src/test/resources/data/test_presence_vector_data.json

pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java

pinot-core/src/main/java/org/apache/pinot/core/common/DataSource.java

snleee

Sorry for the delay. I was out for a while for the conference.

Would you add the following to the commit message?

- Link "NULL value support for all data types #4230" for reference. After I read the issue, this PR made much more sense.
- Add a description of how the presence vector is populated. From reading the code, it seems that we generate the presence vector by default.
- Can you also add a little more explanation to the commit message on when you filter out NULL values? (e.g. when a predicate is added by the user -- column != NULL...)

I think that most comments are minor or style issues. I will do one more final review after this.

pinot-core/src/main/java/org/apache/pinot/core/segment/index/column/ColumnIndexContainer.java

snleee · 2019-09-17T20:05:07Z

pinot-core/src/main/java/org/apache/pinot/core/operator/docidsets/BitmapDocIdSet.java

@@ -33,7 +36,8 @@
  public BitmapDocIdSet(ImmutableRoaringBitmap[] bitmaps, int startDocId, int endDocId, boolean exclusive) {
    int numBitmaps = bitmaps.length;
    if (numBitmaps > 1) {
-      MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(bitmaps);
+      Iterator iterator = Arrays.asList(bitmaps).iterator();
+      MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(iterator, startDocId, endDocId + 1);


Is this change a bug fix or an API change?

icefury71 · 2019-09-19

@kishoreg: can you please clarify? I'm not sure we need this change for the presence vector.

icefury71 · 2019-09-27

I believe this might be slightly more efficient (based on a conversation with Kishore).

Jackie-Jiang · 2019-10-23T23:33:32Z

This could backfire because there is an extra step of selecting the range for each bitmap. I checked the code, and it seems this step is always redundant because we always use 0 as startDocId and numDocs-1 as endDocId. Also, this is not related to this PR. @kishoreg Do you see an actual performance gain from this change?
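
The redundancy argument can be sketched as follows. This is only an illustration that uses `java.util.BitSet` as a stand-in for the RoaringBitmap types in the diff; `BitmapOrSketch`, `or`, and `orInRange` are hypothetical names, and the sketch says nothing about the real library's performance.

```java
import java.util.BitSet;

/** Hypothetical comparison of a plain union and a range-restricted union. */
final class BitmapOrSketch {
  /** Plain union of all bitmaps. */
  static BitSet or(BitSet[] bitmaps) {
    BitSet result = new BitSet();
    for (BitSet bitmap : bitmaps) {
      result.or(bitmap);
    }
    return result;
  }

  /** Union restricted to doc IDs in [start, endExclusive); each input is clipped first. */
  static BitSet orInRange(BitSet[] bitmaps, int start, int endExclusive) {
    BitSet result = new BitSet();
    for (BitSet bitmap : bitmaps) {
      // Extra per-bitmap step: copy out only the requested range.
      // BitSet.get(from, to) returns the range shifted down so that it starts at bit 0.
      BitSet clipped = bitmap.get(start, endExclusive);
      for (int i = clipped.nextSetBit(0); i >= 0; i = clipped.nextSetBit(i + 1)) {
        result.set(start + i);
      }
    }
    return result;
  }
}
```

When the range is always `[0, numDocs)`, as described in the comment above, `orInRange` returns the same bits as `or` while doing the additional clipping work for every input bitmap.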

...ava/org/apache/pinot/core/realtime/impl/presence/RealtimePresenceVectorReaderWriterTest.java

...core/src/main/java/org/apache/pinot/core/segment/index/readers/PresenceVectorReaderImpl.java

...in/java/org/apache/pinot/core/realtime/impl/presence/RealtimePresenceVectorReaderWriter.java

...re/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java

pinot-core/src/main/java/org/apache/pinot/core/segment/index/readers/PresenceVectorReader.java

...st/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImplPresenceVectorTest.java

...ava/org/apache/pinot/core/realtime/impl/presence/RealtimePresenceVectorReaderWriterTest.java

icefury71 · 2019-09-19T01:10:02Z

> Sorry for the delay. I was out for a while for the conference.
>
> Would you add the following to the commit message?
>
> - Link "NULL value support for all data types #4230" for reference. After I read the issue, this PR made much more sense.
> - Add a description of how the presence vector is populated. From reading the code, it seems that we generate the presence vector by default.
> - Can you also add a little more explanation to the commit message on when you filter out NULL values? (e.g. when a predicate is added by the user -- column != NULL...)
>
> I think that most comments are minor or style issues. I will do one more final review after this.

Thanks for pointing that out. Updated description. Please take a look

snleee · 2019-09-24T05:34:43Z

@icefury71 Thanks for the reply. LGTM assuming that you address the comments above (2 spaces, comment on bloom filter etc). I will merge this once that part is updated.

@kishoreg Can you reply to BitmapDocIdSet.java#line 39 change?

Jackie-Jiang

Why do we need to maintain a record-level bitmap? That can easily cause memory issues (everything is on heap).
If we want to preserve null values, we should remove the NullValueTransformer from the record reader and let the record reader insert null into the GenericRow. The segment creator/mutable segment should be able to process null values directly.

Please reformat the code with PinotStyle #3705

pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/NullValueTransformer.java

icefury71 · 2019-09-26T01:07:26Z

> Why do we need to maintain a record-level bitmap? That can easily cause memory issues (everything is on heap).
> If we want to preserve null values, we should remove the NullValueTransformer from the record reader and let the record reader insert null into the GenericRow. The segment creator/mutable segment should be able to process null values directly.

The bitmap added to GenericRow inside NullValueTransformer is a transient (short-lived) object, which shouldn't cause any memory concern.

However, I was maintaining a bitmap per docId inside MutableSegmentImpl. I've optimized that part of the code: it's no longer necessary to keep track of all these bitmaps in memory.

icefury71 · 2019-09-26T01:08:11Z

> @icefury71 Thanks for the reply. LGTM assuming that you address the comments above (2 spaces, comment on bloom filter etc). I will merge this once that part is updated.

I've reformatted code based on Pinot style.

Jackie-Jiang · 2019-09-26T22:18:05Z

I'm still against storing the bitmap into the GenericRecord because that will couple the presence vector with the NullValueTransformer (real-time segment generation does not use NullValueTransformer, thus it won't work properly).
Instead, we should just allow null values (missing fields) in the GenericRecord when it is passed to the SegmentCreator, then it becomes much more flexible, and we can choose to fill in default values, or add them into the presence vector.

icefury71 · 2019-09-26T22:33:36Z

> I'm still against storing the bitmap into the GenericRecord because that will couple the presence vector with the NullValueTransformer (real-time segment generation does not use NullValueTransformer, thus it won't work properly).

By real-time, I'm assuming you mean either HLC or LLC real-time segments? From code inspection, it does look like we're using NullValueTransformer in the real-time path:

https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L1158

(this calls CompositeTransformer.getDefaultTransformer, which in turn brings in NullValueTransformer)

> Instead, we should just allow null values (missing fields) in the GenericRecord when it is passed to the SegmentCreator, then it becomes much more flexible, and we can choose to fill in default values, or add them into the presence vector.

We considered this option. However, the problem is that it's extremely difficult to identify a default null value that is outside the domain of legitimate values. In many cases there could be an overlap.

Given that memory is not a big concern, this could be a good way out. Thoughts?

icefury71 · 2019-09-27T00:36:46Z

Did a rebase against master to resolve conflicts.

Jackie-Jiang · 2019-09-27T00:52:48Z

The record reader you mentioned is for ingesting data into real-time. When we convert a real-time segment into an offline segment, we use the RealtimeSegmentConverter, which does not perform any transform on the records, as the records should already be in the desired format.
The concern is more about module isolation, meaning the SegmentCreator interface should not expect the GenericRow to have a bitmap field.

> We considered this option. However, the problem is that it's extremely difficult to identify a default null value that is outside the domain of legitimate values. In many cases there could be an overlap.

I don't quite follow. The logic should be very straightforward: if we allow null values, put the null value into the presence vector; if not, throw an exception.
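
A minimal sketch of that rule, under stated assumptions: the row is modelled as a plain `Map` rather than Pinot's GenericRow, and `NullHandlingSketch`, `fillDefaults`, and the `nullableColumns` set are hypothetical names introduced only for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical ingestion step: record which columns were null, or reject the row. */
final class NullHandlingSketch {
  /**
   * Replaces null fields in the row with their defaults and returns the names of the
   * columns that were null, so the caller can mark them in a per-column null vector.
   * Throws if a null shows up in a column that does not allow nulls.
   */
  static Set<String> fillDefaults(
      Map<String, Object> row, Map<String, Object> defaults, Set<String> nullableColumns) {
    Set<String> nullColumns = new HashSet<>();
    for (Map.Entry<String, Object> entry : defaults.entrySet()) {
      String column = entry.getKey();
      if (row.get(column) != null) {
        continue; // a real value is present, nothing to do
      }
      if (!nullableColumns.contains(column)) {
        throw new IllegalArgumentException("Null value in non-nullable column: " + column);
      }
      nullColumns.add(column);
      row.put(column, entry.getValue()); // keep the stored value well-formed
    }
    return nullColumns;
  }
}
```

Because the null information travels as a separate set of column names, the default value written into the row may overlap with legitimate values without making nulls ambiguous.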

Jackie-Jiang

Please use the APIs introduced in #4671 to create the vector

pinot-common/src/main/java/org/apache/pinot/common/config/IndexingConfig.java

pinot-common/src/main/java/org/apache/pinot/common/data/Schema.java

pinot-common/src/main/java/org/apache/pinot/common/utils/CommonConstants.java

pinot-core/src/main/java/org/apache/pinot/core/common/DataSource.java

pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/NullValueTransformer.java

pinot-core/src/main/java/org/apache/pinot/core/segment/index/readers/PresenceVectorReader.java

Jackie-Jiang

Good in general.
I still feel it is better to make it NullValueVector for the following reasons:

- The underlying bitmap is nullValueBitmap
- We always set null instead of setting non-null
- We only need to check whether the value is null (every return value of isPresent() is reversed)

I understand the intention for PresenceVector is to quickly return the presenceBitmap, but if the underlying storage is actually a nullValueBitmap, it is quite confusing. In the following PRs, we can maintain two bitmaps in memory, but still only store nullValueBitmap on disk. @kishoreg Thoughts?
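
The relationship between the two bitmaps can be sketched like this. It is an illustration only, again using `java.util.BitSet` instead of the PR's bitmap type; `NullValueVectorSketch` and its methods are hypothetical and do not mirror the classes in the PR.

```java
import java.util.BitSet;

/** Hypothetical vector that stores only the null bitmap and derives presence on demand. */
final class NullValueVectorSketch {
  // Stored form: a set bit means the document has a null value.
  private final BitSet nullBitmap = new BitSet();
  private final int numDocs;

  NullValueVectorSketch(int numDocs) {
    this.numDocs = numDocs;
  }

  void setNull(int docId) {
    if (docId < 0 || docId >= numDocs) {
      throw new IndexOutOfBoundsException("docId out of range: " + docId);
    }
    nullBitmap.set(docId);
  }

  boolean isNull(int docId) {
    return nullBitmap.get(docId);
  }

  /** Derives the presence bitmap by flipping a copy of the null bitmap over [0, numDocs). */
  BitSet presenceBitmap() {
    BitSet presence = (BitSet) nullBitmap.clone();
    presence.flip(0, numDocs);
    return presence;
  }
}
```

Naming the type after what is stored keeps `isNull` a direct lookup, while a presence view remains available as a derived, in-memory bitmap.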

pinot-common/src/main/java/org/apache/pinot/common/config/IndexingConfig.java

pinot-core/src/main/java/org/apache/pinot/core/data/GenericRow.java

pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java

Jackie-Jiang · 2019-10-23T23:33:32Z

pinot-core/src/main/java/org/apache/pinot/core/realtime/impl/RealtimeSegmentConfig.java

pinot-core/src/main/java/org/apache/pinot/core/segment/index/readers/PresenceVectorReader.java

pinot-core/src/main/java/org/apache/pinot/core/segment/index/column/ColumnIndexContainer.java

pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/V1Constants.java

Jackie-Jiang

Implementation looks good, minor comments

pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java

...c/main/java/org/apache/pinot/core/segment/creator/impl/nullvalue/NullValueVectorCreator.java

pinot-core/src/main/java/org/apache/pinot/core/segment/index/readers/NullValueVectorReader.java

...ore/src/main/java/org/apache/pinot/core/segment/index/readers/NullValueVectorReaderImpl.java

...a/org/apache/pinot/core/realtime/impl/nullvalue/RealtimeNullValueVectorReaderWriterTest.java

...src/test/java/org/apache/pinot/core/segment/index/readers/NullValueVectorReaderImplTest.java

icefury71 · 2019-10-24T21:56:52Z

> Good in general.
> I still feel it is better to make it NullValueVector for the following reasons:
>
> - The underlying bitmap is nullValueBitmap
> - We always set null instead of setting non-null
> - We only need to check whether the value is null (every return value of isPresent() is reversed)
>
> I understand the intention for PresenceVector is to quickly return the presenceBitmap, but if the underlying storage is actually a nullValueBitmap, it is quite confusing. In the following PRs, we can maintain two bitmaps in memory, but still only store nullValueBitmap on disk. @kishoreg Thoughts?

Modified to NullValueVector as per suggestion

Jackie-Jiang

LGTM. Thanks for addressing the comments

This PR adds support for a presence vector inside a mutable and immutable segment. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: apache#4230 High level gist is as follows: Create a presence vector per column per segment by default. This presence vector keeps track of which document ID has a null value in the corresponding column. Expose this through the DataSource interface to enable the caller to ignore such document IDs. Presence vector is not really used anywhere as part of this PR. Subsequent PRs will use them to enable filtering out columns in predicates (eg: select ... from table where column_name != null)

icefury71 mentioned this pull request Sep 4, 2019

NULL value support for all data types #4230

Closed

snleee reviewed Sep 6, 2019

haibow reviewed Sep 9, 2019

pinot-core/src/main/java/org/apache/pinot/core/common/DataSource.java

snleee reviewed Sep 17, 2019

Jackie-Jiang requested changes Sep 24, 2019

pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/NullValueTransformer.java

icefury71 force-pushed the presence_vector branch from 9a45eab to 005dec8 on September 26, 2019 23:50

kishoreg and others added 10 commits October 17, 2019 11:37

a383328 Support for presence/null bitmap vector
ab2c923 Adding ability to create presence vector during realtime ingestion
c8ac0bf Rebasing to master
c47e995 Fixing bugs and adding test case for presence vector creation. We create presence vector by default for all columns with this PR
f6f102c - Adding unit test for Mutable segment generation
        - Bug fixes for NullValueTransformer and some unit tests
356ee6c - Optimizing storage for null columns in GenericRow
        - Addressing review feedback
ea364f2 - Add unit tests for Presence vector creator, reader
        - Clean up imports
6691700 - Reformatting code based on Pinot style
        - Using dynamically generated null-bitmap instead of keeping a copy in memory
a9793f8 Fixing accidental changes in LLRealtimeSegmentDataManager
3bf130c - Rebasing to master
        - Using a set instead of bitmap for keeping track of null columns

icefury71 force-pushed the presence_vector branch from 3f8bc67 to 3bf130c on October 18, 2019 22:22

52c8247 Adding a flag in IndexingConfig to control null handling

Jackie-Jiang reviewed Oct 22, 2019

5b2130c - Using the GenericRow API to handle null values
        - Addressing review comments

Jackie-Jiang reviewed Oct 23, 2019

icefury71 added 2 commits October 24, 2019 10:47

77fd39c Addressing review feedback regarding code style, javadocs and redundant code
2e8b054 Renaming Presence vector artifacts to NullValue vector artifacts

Jackie-Jiang reviewed Oct 24, 2019

e85a874 Improving unit tests. Adding a seal method to NullValueVectorCreator
260aac4 Reverting change in BitmapDocIdSet based on review feedback

Jackie-Jiang approved these changes Oct 25, 2019

Jackie-Jiang merged commit 1f5bf57 into apache:master Oct 28, 2019

kishoreg mentioned this pull request Jun 16, 2020

NULL value for metrics #5574

Open
