Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update cardinality when converting raw column to dict based #9875

Merged
merged 5 commits into from Dec 8, 2022

Conversation

vvivekiyer
Copy link
Contributor

@vvivekiyer vvivekiyer commented Nov 30, 2022

Fixes the issue #9874

Fixes two issues:

  1. If segment name contained "%", update to metadata.properties file was failing because Commons Configuration 1.10 does not support file path containing '%'
  2. If segments created by realtime consumption and a column is RAW, the cardinality value is populated as Integer.MIN_VALUE. When dictionary is enabled on this column later, loading the column fails and cardinality is invalid. So, this PR updates cardinality while enabling dictionary.
  3. Added a test to enable disable dictionary for a column where the segment is created through realtime consumption.

Also tested manually. Results are below.

Tested manually using breakpoints and REST API calls in LLCRealtimeClusterIntegrationTest.

Created a realtime segment. The metadata.properties file is as follow for noDictionary column ActualElapsedTime

SegmentName = mytable__0__0__20221202T0322Z

column.ActualElapsedTime.cardinality = -2147483648
column.ActualElapsedTime.totalDocs = 10
column.ActualElapsedTime.dataType = INT
column.ActualElapsedTime.bitsPerElement = 31
column.ActualElapsedTime.lengthOfEachEntry = 0
column.ActualElapsedTime.columnType = METRIC
column.ActualElapsedTime.isSorted = false
column.ActualElapsedTime.hasDictionary = false
column.ActualElapsedTime.isSingleValues = true
column.ActualElapsedTime.maxNumberOfMultiValues = 0
column.ActualElapsedTime.totalNumberOfEntries = 10
column.ActualElapsedTime.isAutoGenerated = false
column.ActualElapsedTime.minValue = -9999
column.ActualElapsedTime.maxValue = 147
column.ActualElapsedTime.defaultNullValue = 0

Now, issued a REST call to enable dictionary for ActualElapsedTime. Issued reloadSegment. Verified that dictionary load also succeeds. The updated metadata is as follows:

column.ActualElapsedTime.cardinality = 7
column.ActualElapsedTime.totalDocs = 10
column.ActualElapsedTime.dataType = INT
column.ActualElapsedTime.bitsPerElement = 31
column.ActualElapsedTime.lengthOfEachEntry = 0
column.ActualElapsedTime.columnType = METRIC
column.ActualElapsedTime.isSorted = false
column.ActualElapsedTime.hasDictionary = true
column.ActualElapsedTime.isSingleValues = true
column.ActualElapsedTime.maxNumberOfMultiValues = 0
column.ActualElapsedTime.totalNumberOfEntries = 10
column.ActualElapsedTime.isAutoGenerated = false
column.ActualElapsedTime.minValue = -9999
column.ActualElapsedTime.maxValue = 147
column.ActualElapsedTime.defaultNullValue = 0

@codecov-commenter
Copy link

codecov-commenter commented Nov 30, 2022

Codecov Report

Merging #9875 (69ce962) into master (e4184fc) will decrease coverage by 45.36%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #9875       +/-   ##
=============================================
- Coverage     70.45%   25.08%   -45.37%     
+ Complexity     5545       44     -5501     
=============================================
  Files          1982     1973        -9     
  Lines        106558   106385      -173     
  Branches      16151    16152        +1     
=============================================
- Hits          75077    26690    -48387     
- Misses        26248    76967    +50719     
+ Partials       5233     2728     -2505     
Flag Coverage Δ
integration1 25.08% <0.00%> (-0.06%) ⬇️
integration2 ?
unittests1 ?
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ocal/segment/index/loader/ForwardIndexHandler.java 0.00% <0.00%> (-83.65%) ⬇️
...in/java/org/apache/pinot/spi/utils/BytesUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/NoOpRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/table/FSTType.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/user/RoleType.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/data/MetricFieldSpec.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/stream/StreamMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/common/tier/TierFactory.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/pinot/spi/config/table/TableType.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1451 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@vvivekiyer vvivekiyer marked this pull request as ready for review December 2, 2022 03:37
@vvivekiyer
Copy link
Contributor Author

@Jackie-Jiang @siddharthteotia please review. Ideally, would like to add a unit test but it wasn't straightforward. I'm trying to see if I can add some sort of testing in LLCRealtimeClusterIntegrationTest.

@Jackie-Jiang
Copy link
Contributor

@vvivekiyer Adding a test in LLCRealtimeClusterIntegrationTest should be relatively easy. You can pick a column with dictionary and a column without dictionary, then switch the encoding, reload and check, switch them back, reload and check

@vvivekiyer vvivekiyer marked this pull request as draft December 5, 2022 04:22
@vvivekiyer vvivekiyer force-pushed the updateCardinality branch 3 times, most recently from dea500c to a41b197 Compare December 6, 2022 06:31
@vvivekiyer vvivekiyer marked this pull request as ready for review December 6, 2022 06:32
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good. Thanks for adding (also cleaning up) the test


queryResponse = postQuery(query);
long numTotalDocsAfterReload = queryResponse.get("totalDocs").asLong();
assertEquals(numTotalDocs, numTotalDocsAfterReload);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to validate if total doc and query response is always correct during the reload. We can also pick a query that can return different metadata based on whether the column is dictionary encoded to ensure the changing index actually happens. E.g. if we pick a value that doesn't exist, SELECT COUNT(*) FROM myTable WHERE ActualElapsedTime = <val> will have docs scanned in filter for raw index, but no scan for dictionary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point. Added this test.
Just want to mention I enabled both dictionary and inverted index because "docs scanned in filter" is zero only if inverted index is added.

@vvivekiyer vvivekiyer force-pushed the updateCardinality branch 4 times, most recently from 3863d89 to 18426c1 Compare December 7, 2022 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants