Support disabling dictionary at runtime for an existing column #9868

Merged
merged 1 commit into from Nov 30, 2022

Conversation

vvivekiyer
Contributor

OSS issue: #9348
Label: Feature
Document

This PR adds support for disabling the dictionary on a dictionary-encoded column of an immutable segment during segment reload. Both single-value (SV) and multi-value (MV) columns are supported.

Added Tests.

@codecov-commenter

Codecov Report

Merging #9868 (59c5d27) into master (463c120) will increase coverage by 0.19%.
The diff coverage is 79.71%.

@@             Coverage Diff              @@
##             master    #9868      +/-   ##
============================================
+ Coverage     70.15%   70.34%   +0.19%     
- Complexity     5410     5520     +110     
============================================
  Files          1956     1972      +16     
  Lines        104975   105869     +894     
  Branches      15892    16020     +128     
============================================
+ Hits          73641    74473     +832     
+ Misses        26195    26175      -20     
- Partials       5139     5221      +82     
Flag Coverage Δ
integration1 25.14% <0.00%> (-0.13%) ⬇️
integration2 24.58% <0.00%> (+0.01%) ⬆️
unittests1 67.79% <79.71%> (+0.01%) ⬆️
unittests2 15.70% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ocal/segment/index/loader/ForwardIndexHandler.java 83.80% <79.64%> (-1.82%) ⬇️
...ment/creator/impl/SegmentColumnarIndexCreator.java 81.03% <100.00%> (-0.22%) ⬇️
...ntroller/helix/core/minion/TaskMetricsEmitter.java 34.88% <0.00%> (-41.87%) ⬇️
...ot/common/utils/fetcher/SegmentFetcherFactory.java 86.30% <0.00%> (-7.15%) ⬇️
...transform/function/IsNotNullTransformFunction.java 65.51% <0.00%> (-6.90%) ⬇️
...or/transform/function/IsNullTransformFunction.java 75.86% <0.00%> (-6.90%) ⬇️
...tream/kafka20/server/KafkaDataServerStartable.java 72.91% <0.00%> (-6.25%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 47.66% <0.00%> (-5.70%) ⬇️
...pache/pinot/core/transport/AsyncQueryResponse.java 86.44% <0.00%> (-5.56%) ⬇️
...apache/pinot/common/function/FunctionRegistry.java 81.63% <0.00%> (-5.21%) ⬇️
... and 127 more

@vvivekiyer vvivekiyer marked this pull request as ready for review November 28, 2022 16:57
List<Object[]> beforeResultRows4 = resultTable.getRows();

// TEST5
query = "SELECT column1, max(column1), sum(column10) from testTable WHERE column6 = 1001 GROUP BY "
Contributor

Is column6 the column whose dictionary is being removed?

Maybe compare the results of the same query before (with dictionary) and after (without dictionary)? You might also want to double-check once in a debugger that the query plan uses the scan-based evaluator and that the inverted index is no longer used (it should in fact have been dropped).

Contributor Author

Yes. We verify that the query results match exactly before and after reload.

The list of operations is documented at L726:

  • column1 -> change compression
  • column6 -> disable dictionary
  • column9 -> disable dictionary
  • column3 -> enable dictionary
  • column2 -> enable dictionary, add inverted index
  • column7 -> enable dictionary, add inverted index
  • column10 -> enable dictionary
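
As a rough illustration of the before/after comparison described in this reply, the sketch below checks that two sets of result rows match exactly. It is not the test code from this PR: `ResultRowComparison` and `rowsMatch` are hypothetical names, and the rows are assumed to be `List<Object[]>`, the type returned by `resultTable.getRows()` in the snippet above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the PR): compares result rows captured before and after a reload. */
final class ResultRowComparison {
  private ResultRowComparison() {
  }

  /** Returns true only if both lists hold the same rows, in the same order, with equal cell values. */
  static boolean rowsMatch(List<Object[]> before, List<Object[]> after) {
    if (before.size() != after.size()) {
      return false;
    }
    for (int i = 0; i < before.size(); i++) {
      // Arrays.equals compares the cells of a row element by element using equals().
      if (!Arrays.equals(before.get(i), after.get(i))) {
        return false;
      }
    }
    return true;
  }
}
```

In a test, the rows would be captured once before the reload and once after it, and `rowsMatch` would be asserted to return true for each query.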

BrokerResponseNative brokerResponseNative = getBrokerResponse(query);
assertTrue(brokerResponseNative.getProcessingExceptions() == null
|| brokerResponseNative.getProcessingExceptions().size() == 0);
ResultTable resultTable1 = brokerResponseNative.getResultTable();
assertEquals(brokerResponseNative.getNumRowsResultSet(), 1);
assertEquals(brokerResponseNative.getTotalDocs(), 400_000L);
- assertEquals(brokerResponseNative.getNumDocsScanned(), 103280L);
+ assertEquals(brokerResponseNative.getNumDocsScanned(), 40224L);
Contributor

So the range index will be rewritten to be raw-value based, right? I wonder why numDocsScanned / numEntriesScannedInFilter should change.

Contributor Author

I actually changed the query by adding another WHERE clause.

Set<String> noDictionaryColumns = new HashSet<>(_noDictionaryColumns);
indexLoadingConfig.setNoDictionaryColumns(noDictionaryColumns);
indexLoadingConfig.getNoDictionaryColumns().remove("column2");
indexLoadingConfig.getNoDictionaryColumns().remove("column3");
indexLoadingConfig.getNoDictionaryColumns().remove("column7");
indexLoadingConfig.getNoDictionaryColumns().remove("column10");
indexLoadingConfig.getNoDictionaryColumns().add("column6");
indexLoadingConfig.getNoDictionaryColumns().add("column9");
Contributor

Do we handle basic validation at the table config level (similar to what we did in the noForwardIndex work)?

For example, if someone disables the dictionary on a column but does not remove that column from the invertedIndexColumn list when updating the tableConfig, we should throw an error right there.

The other option would be to detect this invalid combination during reload, but I guess that's too late.

Contributor Author

Yes. That's already there in TableConfigUtils.

private static void validateIndexingConfig(@Nullable IndexingConfig indexingConfig, @Nullable Schema schema) {
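
To make the validation under discussion concrete, here is a minimal sketch of a table-config-level check that rejects a no-dictionary column that is still listed for an inverted or FST index. It is an illustration under stated assumptions, not the code in `TableConfigUtils`: the class `NoDictionaryConfigCheck`, its methods, and the plain `Set<String>` inputs are all hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; the real validation in Pinot may be structured differently. */
final class NoDictionaryConfigCheck {
  private NoDictionaryConfigCheck() {
  }

  /** Returns the no-dictionary columns that are still configured with an inverted or FST index. */
  static Set<String> findConflicts(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
      Set<String> fstIndexColumns) {
    Set<String> conflicts = new TreeSet<>();
    for (String column : noDictionaryColumns) {
      if (invertedIndexColumns.contains(column) || fstIndexColumns.contains(column)) {
        conflicts.add(column);
      }
    }
    return conflicts;
  }

  /** Fails fast at config-update time instead of waiting for a segment reload. */
  static void validate(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
      Set<String> fstIndexColumns) {
    Set<String> conflicts = findConflicts(noDictionaryColumns, invertedIndexColumns, fstIndexColumns);
    if (!conflicts.isEmpty()) {
      throw new IllegalStateException(
          "Cannot disable dictionary for columns that still have an inverted or FST index: " + conflicts);
    }
  }
}
```

Rejecting the config at update time means an invalid combination never reaches the reload path.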

// dictionary will only be allowed if FST and inverted index are also disabled.
if (_indexLoadingConfig.getInvertedIndexColumns().contains(column) || _indexLoadingConfig.getFSTIndexColumns()
.contains(column)) {
LOGGER.warn("Cannot disabled dictionary as column={} has FST index or inverted index or both.", column);
Contributor

So, ideally we should not have permitted the user to update the tableConfig in the first place, since this is an invalid combination. You may want to take a look at the TableConfigValidator code.

Contributor

Do we still need this here if the validator can catch it? Reload is invoked only after the tableConfig has been updated successfully, so I don't think this check is needed.

checkForwardIndexCreation(COLUMN1_NAME, 51594, 16, _schema, false, false, false, 0, null, true, 0, DataType.INT,
100000);

// TEST 3: Disable dictionary for a column (Column10) that has range index.
Contributor

How can we verify that the range index was indeed rebuilt with raw values? Maybe compare the offset and size of the range index buffer from index_map before and after?

Contributor Author

I added an additional validation based on rangeIndex size from indexMap.
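
A size-based check of this kind might look like the sketch below. It assumes the index map is a properties-style text file with keys of the form `<column>.<indexType>.size`; that key layout, the `range_index` type name, and the `IndexMapSizeCheck` helper are assumptions made for illustration and are not taken from this PR.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: reads an index size from properties-style index map text. */
final class IndexMapSizeCheck {
  private IndexMapSizeCheck() {
  }

  /** Looks up "<column>.<indexType>.size" (assumed key format) and returns it as a long. */
  static long indexSize(String indexMapText, String column, String indexType) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(indexMapText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String key = column + "." + indexType + ".size";
    String value = props.getProperty(key);
    if (value == null) {
      throw new IllegalArgumentException("No index map entry for key: " + key);
    }
    return Long.parseLong(value.trim());
  }

  /** True if the recorded size of the given index differs between the two index map snapshots. */
  static boolean sizeChanged(String beforeText, String afterText, String column, String indexType) {
    return indexSize(beforeText, column, indexType) != indexSize(afterText, column, indexType);
  }
}
```

A changed size is only indirect evidence that the index was rewritten, so a check like this complements the query-result comparison and does not replace it.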

Contributor

Thanks. Can we come up with another test query where the range predicate is a full match and therefore reads the entire range index?

Contributor

I just want to make sure there are no index-out-of-bounds or segfault-type situations caused by an incorrectly built index.

private int getMaxRowLengthForMVColumn(String column, ForwardIndexReader reader, Dictionary dictionary)
throws Exception {
ColumnMetadata existingColMetadata = _segmentMetadata.getColumnMetadataFor(column);
AbstractColumnStatisticsCollector statsCollector =
Contributor

There is probably nothing wrong with this in theory, but I wonder why we need to go through the StatsCollector framework to gather this info. We can simply read the forward index to collect it, right?

Contributor Author

IMO, using StatsCollector is convenient, and it can be the single place that needs to change if we alter anything in the future. Without StatsCollector, we might have to duplicate the logic for each data type here, which would lead to unnecessary extra code.

Contributor

I see. Fair enough
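
For context on the trade-off discussed in this thread, the sketch below computes a maximum row length for a multi-value column by scanning the rows directly, without a stats collector. It is hypothetical: `MaxRowLengthSketch` is not a Pinot class, rows are modelled as plain arrays, and row length is assumed to be the sum of the value sizes in bytes, which may not match Pinot's exact definition.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: max row length of an MV column, computed by scanning the rows directly. */
final class MaxRowLengthSketch {
  private MaxRowLengthSketch() {
  }

  /** Fixed-width case: every INT value is assumed to take Integer.BYTES bytes. */
  static int maxRowLengthInBytes(int[][] rows) {
    int max = 0;
    for (int[] row : rows) {
      max = Math.max(max, row.length * Integer.BYTES);
    }
    return max;
  }

  /** Variable-width case: sum the UTF-8 byte length of every value in the row. */
  static int maxRowLengthInBytes(String[][] rows) {
    int max = 0;
    for (String[] row : rows) {
      int rowLength = 0;
      for (String value : row) {
        rowLength += value.getBytes(StandardCharsets.UTF_8).length;
      }
      max = Math.max(max, rowLength);
    }
    return max;
  }
}
```

Each data type needs its own overload here, which is the kind of duplication the author's reply refers to.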

@siddharthteotia siddharthteotia merged commit 6210f43 into apache:master Nov 30, 2022
@siddharthteotia
Contributor

Overall looks good. @vvivekiyer, I tagged you on two comment threads that you may want to address in follow-ups.
