Derive num docs per chunk from max column value length for varbyte raw index creator #5256
Conversation
Commits f310180 to 39af59a: Derive numDocsPerChunk from max column value length for var byte raw forward index creator
@@ -162,8 +163,9 @@ private void createTextIndexForColumn(ColumnMetadata columnMetadata)
    try (LuceneTextIndexCreator textIndexCreator = new LuceneTextIndexCreator(column, segmentDirectory, true)) {
      try (DataFileReader forwardIndexReader = getForwardIndexReader(columnMetadata)) {
        VarByteChunkSingleValueReader forwardIndex = (VarByteChunkSingleValueReader) forwardIndexReader;
        ChunkReaderContext readerContext = forwardIndex.createContext();
+1
@@ -338,7 +339,7 @@ void createV1ForwardIndexForTextIndex(String column, IndexLoadingConfig indexLoa
    int totalDocs = _segmentMetadata.getTotalDocs();
    Object defaultValue = fieldSpec.getDefaultNullValue();
    String stringDefaultValue = (String) defaultValue;
-   int lengthOfLongestEntry = stringDefaultValue.length();
+   int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
Good catch
Suggested change:
-   int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
+   int lengthOfLongestEntry = StringUtil.encodeUtf8(stringDefaultValue).length;
done
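For context on why byte length rather than character count is the right measure here (this snippet is illustrative and not part of the PR): String.length() returns the number of UTF-16 code units, while the raw forward index sizes entries by encoded bytes, so the two diverge for non-ASCII values.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthExample {
  public static void main(String[] args) {
    String ascii = "hello";
    String nonAscii = "héllo wörld 中文"; // contains 2-byte and 3-byte UTF-8 characters

    // char count vs. encoded byte count
    System.out.println(ascii.length());                                   // 5
    System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5
    System.out.println(nonAscii.length());                                // 14
    System.out.println(nonAscii.getBytes(StandardCharsets.UTF_8).length); // 20
  }
}
```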
  }

  @VisibleForTesting
  public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
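A minimal sketch of how such a derivation could look, based on the TARGET_MAX_CHUNK_SIZE constant in the diff and the 4-byte per-entry header offset mentioned in the PR description; constant names and details here are illustrative, and the actual implementation in the PR may differ.

```java
import com.google.common.annotations.VisibleForTesting;

// Sketch only: derive numDocsPerChunk so that one chunk stays near a 1 MB target.
public class NumDocsPerChunkSketch {
  private static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024; // 1 MB target chunk size
  private static final int CHUNK_HEADER_ENTRY_OFFSET_SIZE = 4;  // 4-byte offset per doc in the chunk header

  @VisibleForTesting
  public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
    // Each doc contributes at most its longest value plus one header offset entry.
    int maxBytesPerDoc = lengthOfLongestEntry + CHUNK_HEADER_ENTRY_OFFSET_SIZE;
    // Fit as many docs as possible under the target, but never fewer than one per chunk.
    return Math.max(TARGET_MAX_CHUNK_SIZE / maxBytesPerDoc, 1);
  }
}
```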
This logic can be pushed down to the VarByteChunkSingleValueWriter?
It can be. The call to super() in the constructor of VarByteChunkSingleValueWriter makes things slightly awkward, since you would have to call this function twice (as part of the call to super). I think the constructor of VarByteChunkSingleValueWriter and its base class can be refactored a little to make this logic private to the writer.
I have a follow-up coming up for the TODO mentioned in the PR description. Will do as part of that
@@ -61,7 +62,6 @@
   * @param file File to write to.
   * @param compressionType Type of compression to use.
   * @param totalDocs Total number of docs to write.
-  * @param numDocsPerChunk Number of documents per chunk.
?
sorry, forgot to undo
@@ -27,15 +28,21 @@

public class SingleValueVarByteRawIndexCreator extends BaseSingleValueRawIndexCreator {
-  private static final int NUM_DOCS_PER_CHUNK = 1000; // TODO: Auto-derive this based on metadata.
+  private static final int TARGET_MAX_CHUNK_SIZE = 1024*1024;
(nit) reformat
done
LGTM otherwise
…w index creator (#5256)

* Derive numDocsPerChunk from max column value length for var byte raw forward index creator
* review comments

Co-authored-by: Siddharth Teotia <steotia@steotia-mn1.linkedin.biz>
(1) PR #5256 added support for deriving the number of docs per chunk for the var byte raw index creator from the column value length. This was done specifically to support text blobs. High-QPS use cases that don't need this feature see a negative impact: the chunk size increases (the earlier value of numDocsPerChunk was hardcoded to 1000), and depending on the access pattern we might end up uncompressing a bigger chunk to get the values for a set of docIds. We have made this change configurable, so the default behaviour is the same as before (1000 docs per chunk).

(2) PR #4791 added support for noDict on STRING/BYTES columns in consuming segments. This change particularly impacts use cases that set noDict on their STRING dimension columns for other performance reasons and also want metrics aggregation. Those use cases no longer get aggregateMetrics because the new implementation honors their table config setting of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases. To continue aggregating metrics in such cases, we will create a dictionary even if the column is part of the noDictionary set in the table config.

Co-authored-by: Siddharth Teotia <steotia@steotia-mn1.linkedin.biz>
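To make the chunk-size tradeoff in (1) above concrete, with illustrative numbers that are assumed rather than taken from either PR: for a column of short values, the derived chunk holds roughly ten times as many docs as the old fixed chunk, so a point lookup decompresses roughly ten times as much data.

```java
public class ChunkSizeTradeoffExample {
  public static void main(String[] args) {
    int targetMaxChunkSize = 1024 * 1024;  // 1 MB target used by the derivation
    int lengthOfLongestEntry = 100;        // assumed: a column of short string values
    int oldDocsPerChunk = 1000;            // previous hard-coded value

    // The derived chunk packs ~10x more docs, so a single-doc read decompresses ~10x more data.
    int derivedDocsPerChunk = targetMaxChunkSize / (lengthOfLongestEntry + 4);
    System.out.println(derivedDocsPerChunk + " docs per chunk vs " + oldDocsPerChunk); // 10082 vs 1000
  }
}
```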
As part of internal testing for text search, we found that there can be cases where a text column value is several hundred thousand characters long. This happens for a very small percentage of rows in the overall dataset.
The VarByteChunkWriter uses a fixed, hard-coded value of 1000 for the number of docs per chunk. It is better to derive this from metadata (the length of the longest value in bytes, available from stats). For unusually high values of lengthOfLongestEntry (around 1 million bytes), we were seeing int overflow, since the chunk size was computed as:
1000 * (lengthOfLongestEntry + 4 byte header offset). Secondly, the compression buffer is allocated at twice this size to account for negative compression, so the computed capacity of the compression buffer became negative.
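A rough worked example of the overflow described above, using an assumed value slightly above 1 MB for lengthOfLongestEntry:

```java
public class ChunkSizeOverflowExample {
  public static void main(String[] args) {
    int numDocsPerChunk = 1000;            // old hard-coded value
    int lengthOfLongestEntry = 1_100_000;  // assumed: longest value of ~1.1 MB

    // chunkSize = numDocsPerChunk * (lengthOfLongestEntry + 4-byte header offset)
    int chunkSize = numDocsPerChunk * (lengthOfLongestEntry + 4); // 1100004000, still fits in an int
    // The compression buffer is sized at roughly twice the chunk size, which overflows int.
    int compressionBufferSize = chunkSize * 2;                    // -2094959296: wraps negative

    System.out.println(chunkSize);
    System.out.println(compressionBufferSize);
  }
}
```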
This PR derives the number of docs per chunk from lengthOfLongestEntry using a fixed target max chunk size of 1MB.
This is backward compatible since the number of docs per chunk is written in the file header.
There is a tentative follow-up: use long for the chunk offset array in the file header (currently we use int). If most of the text column values are blob-like data, the total size of text data across all rows could be more than 2GB, so we need long to track chunk offsets. This would be a backward-incompatible change with a new version of the chunk writer and reader.
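A back-of-the-envelope illustration, with assumed numbers, of why int offsets become the limit:

```java
public class ChunkOffsetWidthExample {
  public static void main(String[] args) {
    long numDocs = 10_000_000L;  // assumed docs in a segment
    long avgValueSize = 300L;    // assumed average text blob size in bytes

    long totalTextDataSize = numDocs * avgValueSize;           // 3,000,000,000 bytes (~2.8 GB)
    System.out.println(totalTextDataSize > Integer.MAX_VALUE); // true: offsets past ~2.1 GB need a long
  }
}
```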