Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Presence vector #4585

Merged
merged 16 commits into from
Oct 28, 2019
Merged

Presence vector #4585

merged 16 commits into from
Oct 28, 2019

Conversation

icefury71
Copy link
Contributor

@icefury71 icefury71 commented Sep 4, 2019

This PR adds support for a presence vector inside a mutable and immutable segment. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: #4230

High level gist is as follows:

  • Create a presence vector per column per segment by default. This presence vector keeps track of which document ID has a null value in the corresponding column.
  • Expose this through the DataSource interface to enable the caller to ignore such document IDs.

Presence vector is not really used anywhere as part of this PR. Subsequent PRs will use them to enable filtering out columns in predicates (eg: select ... from table where column_name != null)

@codecov-io
Copy link

codecov-io commented Sep 4, 2019

Codecov Report

Merging #4585 into master will decrease coverage by 0.14%.
The diff coverage is 89.07%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #4585      +/-   ##
============================================
- Coverage     57.92%   57.78%   -0.15%     
+ Complexity       16        4      -12     
============================================
  Files          1213     1207       -6     
  Lines         65135    64880     -255     
  Branches       9488     9436      -52     
============================================
- Hits          37732    37491     -241     
+ Misses        24544    24542       -2     
+ Partials       2859     2847      -12
Impacted Files Coverage Δ Complexity Δ
...e/pinot/core/segment/creator/impl/V1Constants.java 11.11% <ø> (ø) 0 <0> (ø) ⬇️
...pinot/core/segment/store/ColumnIndexDirectory.java 93.33% <ø> (ø) 0 <0> (ø) ⬇️
.../java/org/apache/pinot/core/common/DataSource.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...org/apache/pinot/common/config/IndexingConfig.java 51.72% <0%> (-1.85%) 0 <0> (ø)
...re/startree/v2/store/StarTreeMetricDataSource.java 54.16% <0%> (-2.36%) 0 <0> (ø)
...startree/v2/store/StarTreeDimensionDataSource.java 69.56% <0%> (-3.17%) 0 <0> (ø)
...t/core/segment/store/SingleFileIndexDirectory.java 86.27% <100%> (+0.18%) 0 <0> (ø) ⬇️
...inot/core/segment/store/FilePerIndexDirectory.java 90.27% <100%> (+0.88%) 0 <0> (ø) ⬇️
...t/index/loader/bloomfilter/BloomFilterHandler.java 71.66% <100%> (ø) 0 <0> (ø) ⬇️
...ache/pinot/core/segment/store/ColumnIndexType.java 85.71% <100%> (+1.09%) 0 <0> (ø) ⬇️
... and 89 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3579aae...260aac4. Read the comment docs.

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding it as part of indexingConfig, I think that it may make more sense to configure this in schema since we are adding presenceVector for handling null values.
e.g.

    {
      "name": "memberId",
      "dataType": "INT",
      "isNullable": true
    }

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I was out for a while for the conference.

Would you add the following to the commit message?

  1. Link NULL value support for all data types #4230 for the reference. After I read the issue, this pr made much more sense.
  2. Add the description on how presence vector is populated. By reading the code, it seems that we generate presence vector by default.
  3. Can you also add a little bit more explanation to the commit message on when you filter out NULL values? (e.g. when predicate is added by user -- column != NULL...)

I think that most comments are minor & style issues. I will do one more final review after this.

@@ -33,7 +36,8 @@
public BitmapDocIdSet(ImmutableRoaringBitmap[] bitmaps, int startDocId, int endDocId, boolean exclusive) {
int numBitmaps = bitmaps.length;
if (numBitmaps > 1) {
MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(bitmaps);
Iterator iterator = Arrays.asList(bitmaps).iterator();
MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(iterator, startDocId, endDocId + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change about a bug fix or API change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kishoreg : can you please clarify ? I'm not sure we need this change for presence vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this might be slightly more efficient (based on conversation with Kishore)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could backfire because there is an extra step of selecting range for each bitmap. I checked the code and seems this step is always redundant because we always use 0 as startDocId and numDocs-1 as endDocId. Also this is not related to this PR. @kishoreg Do you see actual performance gain for this change?

@icefury71
Copy link
Contributor Author

Sorry for the delay. I was out for a while for the conference.

Would you add the following to the commit message?

  1. Link NULL value support for all data types #4230 for the reference. After I read the issue, this pr made much more sense.
  2. Add the description on how presence vector is populated. By reading the code, it seems that we generate presence vector by default.
  3. Can you also add a little bit more explanation to the commit message on when you filter out NULL values? (e.g. when predicate is added by user -- column != NULL...)

I think that most comments are minor & style issues. I will do one more final review after this.

Thanks for pointing that out. Updated description. Please take a look

@snleee
Copy link
Contributor

snleee commented Sep 24, 2019

@icefury71 Thanks for the reply. LGTM assuming that you address the comments above (2 spaces, comment on bloom filter etc). I will merge this once that part is updated.

@kishoreg Can you reply to BitmapDocIdSet.java#line 39 change?

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to maintain a record level bitmap? That can easily cause memory issue (everything on heap).
If we want to preserve null value, we should remove the NullValueTransformer from the record reader and let the record reader insert null into the GenericRow. The segment creator/mutable segment should be able to directly process null value.

Please reformat the code with PinotStyle #3705

@icefury71
Copy link
Contributor Author

Why do we need to maintain a record level bitmap? That can easily cause memory issue (everything on heap).
If we want to preserve null value, we should remove the NullValueTransformer from the record reader and let the record reader insert null into the GenericRow. The segment creator/mutable segment should be able to directly process null value.

The bitmap added to Generic Row inside NullValueTransformer is a transient object (short lived) - which shouldn't cause any memory concern.

However, I was maintaining a bitmap per docId inside MutableSegmentImpl. I've optimized that part of the code - its no longer necessary to keep track of all these bitmaps in memory.

@icefury71
Copy link
Contributor Author

@icefury71 Thanks for the reply. LGTM assuming that you address the comments above (2 spaces, comment on bloom filter etc). I will merge this once that part is updated.

I've reformatted code based on Pinot style.

@Jackie-Jiang
Copy link
Contributor

I'm still against storing the bitmap into the GenericRecord because that will couple the presence vector with the NullValueTransformer (real-time segment generation does not use NullValueTransformer, thus it won't work properly).
Instead, we should just allow null values (missing fields) in the GenericRecord when it is passed to the SegmentCreator, then it becomes much more flexible, and we can choose to fill in default values, or add them into the presence vector.

@icefury71
Copy link
Contributor Author

I'm still against storing the bitmap into the GenericRecord because that will couple the presence vector with the NullValueTransformer (real-time segment generation does not use NullValueTransformer, thus it won't work properly).

By real-time I'm assuming you mean either HLC / LLC realtime segments ? From code inspection it does look like we're using NullValueTransformer in real-time path:

https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/LLRealtimeSegmentDataManager.java#L1158

(this is calling CompositeTransformer.getDefaultTransformer which in turn brings in NullValueTransformer)

Instead, we should just allow null values (missing fields) in the GenericRecord when it is passed to the SegmentCreator, then it becomes much more flexible, and we can choose to fill in default values, or add them into the presence vector.

We considered this option. However, the problem is that it's extremely difficult to identify default null value which is outside the domain of legitimate values. In many cases there could be an overlap.

Given that memory is not a big concern, this could be a good way out. Thoughts ?

@icefury71
Copy link
Contributor Author

Did a rebase against master to resolve conflicts.

@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented Sep 27, 2019

The record reader you mentioned is for ingesting data into real-time. When we convert real-time segment into offline segment, we use the RealtimeSegmentConverter which does not perform any transform on the records as records should already be in the desired format.
The concern is more about module isolation, meaning SegmentCreator interface should not expect the GenericRow to have a field of bitmap.

We considered this option. However, the problem is that it's extremely difficult to identify default null value which is outside the domain of legitimate values. In many cases there could be an overlap.

I don't quite follow. The logic should be very straight forward. If we allow null value, put the null value into the presence vector; if not, throw exception

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the APIs introduced in #4671 to create the vector

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good in general.
I still feel it is better to make it NullValueVector for the following reasons:

  • Underlying bitmap is nullValueBitmap
  • Always set null instead of set non-null
  • We only need to check whether the value is null (all the return for isPresent() is reversed)

I understand the intention for PresenceVector is to fast return the presenceBitmap, but if the underlying storage is actually a nullValueBitmap, it is quite confusing. In the following PRs, we can maintain two bitmaps in memory, but still only store nullValueBitmap on disk. @kishoreg Thoughts?

@@ -33,7 +36,8 @@
public BitmapDocIdSet(ImmutableRoaringBitmap[] bitmaps, int startDocId, int endDocId, boolean exclusive) {
int numBitmaps = bitmaps.length;
if (numBitmaps > 1) {
MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(bitmaps);
Iterator iterator = Arrays.asList(bitmaps).iterator();
MutableRoaringBitmap orBitmap = MutableRoaringBitmap.or(iterator, startDocId, endDocId + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could backfire because there is an extra step of selecting range for each bitmap. I checked the code and seems this step is always redundant because we always use 0 as startDocId and numDocs-1 as endDocId. Also this is not related to this PR. @kishoreg Do you see actual performance gain for this change?

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, minor comments

@icefury71
Copy link
Contributor Author

Good in general.
I still feel it is better to make it NullValueVector for the following reasons:

  • Underlying bitmap is nullValueBitmap
  • Always set null instead of set non-null
  • We only need to check whether the value is null (all the return for isPresent() is reversed)

I understand the intention for PresenceVector is to fast return the presenceBitmap, but if the underlying storage is actually a nullValueBitmap, it is quite confusing. In the following PRs, we can maintain two bitmaps in memory, but still only store nullValueBitmap on disk. @kishoreg Thoughts?

Modified to NullValueVector as per suggestion

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for addressing the comments

@Jackie-Jiang Jackie-Jiang merged commit 1f5bf57 into apache:master Oct 28, 2019
chenboat pushed a commit to chenboat/incubator-pinot that referenced this pull request Nov 9, 2019
This PR adds support for a presence vector inside a mutable and immutable segment. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: apache#4230

High level gist is as follows:

Create a presence vector per column per segment by default. This presence vector keeps track of which document ID has a null value in the corresponding column.
Expose this through the DataSource interface to enable the caller to ignore such document IDs.
Presence vector is not really used anywhere as part of this PR. Subsequent PRs will use them to enable filtering out columns in predicates (eg: select ... from table where column_name != null)
chenboat pushed a commit to chenboat/incubator-pinot that referenced this pull request Nov 15, 2019
This PR adds support for a presence vector inside a mutable and immutable segment. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: apache#4230

High level gist is as follows:

Create a presence vector per column per segment by default. This presence vector keeps track of which document ID has a null value in the corresponding column.
Expose this through the DataSource interface to enable the caller to ignore such document IDs.
Presence vector is not really used anywhere as part of this PR. Subsequent PRs will use them to enable filtering out columns in predicates (eg: select ... from table where column_name != null)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants