Derive num docs per chunk from max column value length for varbyte raw index creator #5256

Merged
merged 2 commits on Apr 17, 2020

Conversation

@siddharthteotia (Contributor) commented Apr 16, 2020

As part of internal testing for text search, we found cases where a text column value is several hundred thousand characters long. This applies to a very small percentage of rows in the overall dataset.

The VarByteChunkWriter uses a hard-coded value of 1000 for the number of docs per chunk. It is better to derive this from metadata (the length in bytes of the longest value, available from stats). For an unusually large lengthOfLongestEntry (e.g. 1 million bytes), we were seeing int overflow, since the chunk size was computed as:

1000 * (lengthOfLongestEntry + 4-byte header offset)

Secondly, the compression buffer is allocated at twice this size to account for negative compression, so the capacity of the compression buffer became negative.

This PR derives the number of docs per chunk from lengthOfLongestEntry, using a fixed target max chunk size of 1MB.
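A minimal sketch of that derivation, assuming only the names that appear in the diff below (TARGET_MAX_CHUNK_SIZE and getNumDocsPerChunk are in the PR; the exact body here is inferred from the description, not copied):

```java
private static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024; // 1MB target

// Each doc in a chunk costs its value bytes plus a 4-byte row-offset entry
// in the chunk header, so pick the doc count that keeps the worst-case chunk
// near the 1MB target. Clamp to at least 1 so a single very large value
// still gets its own chunk.
public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
  int bytesPerDoc = lengthOfLongestEntry + Integer.BYTES;
  return Math.max(TARGET_MAX_CHUNK_SIZE / bytesPerDoc, 1);
}
```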

This is backward compatible, since the number of docs per chunk is written in the file header.

There is a tentative follow-up:

Use long for the chunk offset array in the file header; currently we use int. If most of the text column values are blob-like data, the total size of text data across all rows could exceed 2GB, so we need long to track chunk offsets. This would be a backward-incompatible change, requiring a new version of the chunk writer and reader.
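(For scale: 10 million rows averaging 250 bytes per value is about 2.5GB of variable-byte data, already beyond Integer.MAX_VALUE of roughly 2.14GB.)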

@@ -162,8 +163,9 @@ private void createTextIndexForColumn(ColumnMetadata columnMetadata)
try (LuceneTextIndexCreator textIndexCreator = new LuceneTextIndexCreator(column, segmentDirectory, true)) {
try (DataFileReader forwardIndexReader = getForwardIndexReader(columnMetadata)) {
VarByteChunkSingleValueReader forwardIndex = (VarByteChunkSingleValueReader) forwardIndexReader;
ChunkReaderContext readerContext = forwardIndex.createContext();
Contributor

+1

@@ -338,7 +339,7 @@ void createV1ForwardIndexForTextIndex(String column, IndexLoadingConfig indexLoa
int totalDocs = _segmentMetadata.getTotalDocs();
Object defaultValue = fieldSpec.getDefaultNullValue();
String stringDefaultValue = (String) defaultValue;
-    int lengthOfLongestEntry = stringDefaultValue.length();
+    int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
Contributor

Good catch

Suggested change
int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
int lengthOfLongestEntry = StringUtil.encodeUtf8(stringDefaultValue).length;

Contributor Author

done
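For background on why byte length matters here: String.length() counts UTF-16 code units, while raw index sizing needs UTF-8 bytes, and the two diverge for non-ASCII data. A minimal, self-contained illustration:

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
  public static void main(String[] args) {
    String s = "héllo"; // 'é' encodes to 2 bytes in UTF-8
    System.out.println(s.length());                                // 5 (UTF-16 code units)
    System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6 (bytes)
  }
}
```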

}

@VisibleForTesting
public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
Contributor

This logic can be pushed down to the VarByteChunkSingleValueWriter?

Contributor Author

It can be. The call to super() in the constructor of VarByteChunkSingleValueWriter makes things slightly tricky, since you have to call this function twice (as part of the call to super). I think the constructor of VarByteChunkSingleValueWriter and its base class can be refactored a bit to make this logic private to the writer.

I have a follow-up coming for the TODO mentioned in the PR description. Will do it as part of that.
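A toy illustration of the constraint discussed above (not Pinot code): super(...) must be the first statement in a Java constructor, so the derived value cannot be computed into a local variable first, and the static helper ends up invoked once per argument that needs it.

```java
class BaseWriter {
  BaseWriter(int numDocsPerChunk, int chunkSizeBytes) {
    // base class would allocate chunk buffers from these values
  }
}

class VarByteWriter extends BaseWriter {
  static int numDocsPerChunk(int lengthOfLongestEntry) {
    return Math.max((1024 * 1024) / (lengthOfLongestEntry + Integer.BYTES), 1);
  }

  VarByteWriter(int lengthOfLongestEntry) {
    // super() must come first, so the helper is called for each argument
    // that depends on the derived value.
    super(numDocsPerChunk(lengthOfLongestEntry),
        numDocsPerChunk(lengthOfLongestEntry) * (lengthOfLongestEntry + Integer.BYTES));
  }
}
```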

@@ -61,7 +62,6 @@
* @param file File to write to.
* @param compressionType Type of compression to use.
* @param totalDocs Total number of docs to write.
-  * @param numDocsPerChunk Number of documents per chunk.
Contributor

?

Contributor Author

sorry, forgot to undo

@@ -27,15 +28,21 @@


public class SingleValueVarByteRawIndexCreator extends BaseSingleValueRawIndexCreator {
-  private static final int NUM_DOCS_PER_CHUNK = 1000; // TODO: Auto-derive this based on metadata.
+  private static final int TARGET_MAX_CHUNK_SIZE = 1024*1024;
Contributor

(nit) reformat

Contributor Author

done

@Jackie-Jiang (Contributor) left a comment

LGTM otherwise

@siddharthteotia siddharthteotia merged commit 142a86f into apache:master Apr 17, 2020
siddharthteotia added a commit to siddharthteotia/incubator-pinot that referenced this pull request May 29, 2020
…w index creator (apache#5256)

* Derive numDocsPerChunk from max column value length for var byte raw forward index creator

* review comments

Co-authored-by: Siddharth Teotia <steotia@steotia-mn1.linkedin.biz>
siddharthteotia added a commit that referenced this pull request May 29, 2020
siddharthteotia pushed a commit to siddharthteotia/incubator-pinot that referenced this pull request May 31, 2020
(1) PR apache#5256 added support for deriving num docs per chunk for the var byte raw index creator from the column value length. This was done specifically to support text blobs. High-QPS use cases that don't want this feature see a negative impact, since the chunk size increases (numDocsPerChunk was previously hardcoded to 1000) and, depending on the access pattern, we might end up uncompressing a bigger chunk to get the values for a set of docIds. We have made this change configurable, so the default behaviour is the same as before (1000 docs per chunk).

(2) PR apache#4791 added support for noDict on STRING/BYTES columns in consuming segments. This change particularly impacts use cases that set noDict on their STRING dimension columns for other performance reasons and also want metrics aggregation. These use cases no longer get aggregateMetrics, because the new implementation honors their table config setting of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases. So, to continue aggregating metrics for such cases, we will create a dictionary even if the column is part of the noDictionary set in the table config.
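A sketch of the configurable behaviour described in (1); the toggle name deriveNumDocsPerChunk is assumed for illustration (the real flag lives in Pinot's table/field config plumbing, not shown here):

```java
public class VarByteChunkSizing {
  static final int DEFAULT_NUM_DOCS_PER_CHUNK = 1000;   // old hardcoded behaviour
  static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024; // 1MB target when deriving

  // Default off: high-QPS point lookups keep small chunks and cheap
  // decompression; text-blob columns opt in to derived chunk sizing.
  static int numDocsPerChunk(boolean deriveNumDocsPerChunk, int lengthOfLongestEntry) {
    if (!deriveNumDocsPerChunk) {
      return DEFAULT_NUM_DOCS_PER_CHUNK;
    }
    return Math.max(TARGET_MAX_CHUNK_SIZE / (lengthOfLongestEntry + Integer.BYTES), 1);
  }
}
```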
siddharthteotia added a commit that referenced this pull request May 31, 2020
haibow pushed a commit that referenced this pull request Jun 2, 2020