Derive num docs per chunk from max column value length for varbyte raw index creator #5256
Conversation
Commits f310180 to 39af59a: Derive numDocsPerChunk from max column value length for var byte raw forward index creator
@@ -162,8 +163,9 @@ private void createTextIndexForColumn(ColumnMetadata columnMetadata)
    try (LuceneTextIndexCreator textIndexCreator = new LuceneTextIndexCreator(column, segmentDirectory, true)) {
      try (DataFileReader forwardIndexReader = getForwardIndexReader(columnMetadata)) {
        VarByteChunkSingleValueReader forwardIndex = (VarByteChunkSingleValueReader) forwardIndexReader;
        ChunkReaderContext readerContext = forwardIndex.createContext();
+1
@@ -338,7 +339,7 @@ void createV1ForwardIndexForTextIndex(String column, IndexLoadingConfig indexLoa
    int totalDocs = _segmentMetadata.getTotalDocs();
    Object defaultValue = fieldSpec.getDefaultNullValue();
    String stringDefaultValue = (String) defaultValue;
-   int lengthOfLongestEntry = stringDefaultValue.length();
+   int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
Good catch
Suggested change:
-   int lengthOfLongestEntry = stringDefaultValue.getBytes(Charset.forName("UTF-8")).length;
+   int lengthOfLongestEntry = StringUtil.encodeUtf8(stringDefaultValue).length;
done
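For context on why byte length rather than character count is the right measure here (this snippet is illustrative and not part of the PR): String.length() returns the number of UTF-16 code units, while the raw forward index sizes entries by encoded bytes, so the two diverge for non-ASCII values.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthExample {
  public static void main(String[] args) {
    String ascii = "hello";
    String nonAscii = "héllo wörld 中文"; // contains 2-byte and 3-byte UTF-8 characters

    // char count vs. encoded byte count
    System.out.println(ascii.length());                                   // 5
    System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5
    System.out.println(nonAscii.length());                                // 14
    System.out.println(nonAscii.getBytes(StandardCharsets.UTF_8).length); // 20
  }
}
```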
  }

  @VisibleForTesting
  public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
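A minimal sketch of how such a derivation could look, based on the TARGET_MAX_CHUNK_SIZE constant in the diff and the 4-byte per-entry header offset mentioned in the PR description; constant names and details here are illustrative, and the actual implementation in the PR may differ.

```java
import com.google.common.annotations.VisibleForTesting;

// Sketch only: derive numDocsPerChunk so that one chunk stays near a 1 MB target.
public class NumDocsPerChunkSketch {
  private static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024; // 1 MB target chunk size
  private static final int CHUNK_HEADER_ENTRY_OFFSET_SIZE = 4;  // 4-byte offset per doc in the chunk header

  @VisibleForTesting
  public static int getNumDocsPerChunk(int lengthOfLongestEntry) {
    // Each doc contributes at most its longest value plus one header offset entry.
    int maxBytesPerDoc = lengthOfLongestEntry + CHUNK_HEADER_ENTRY_OFFSET_SIZE;
    // Fit as many docs as possible under the target, but never fewer than one per chunk.
    return Math.max(TARGET_MAX_CHUNK_SIZE / maxBytesPerDoc, 1);
  }
}
```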
This logic can be pushed down to the VarByteChunkSingleValueWriter?
It can be. The call to super() in the constructor of VarByteChunkSingleValueWriter makes things slightly awkward, since you would have to call this function twice (as part of the call to super). I think the constructor of VarByteChunkSingleValueWriter and its base class can be refactored a little to make this logic private to the writer.
I have a follow-up coming up for the TODO mentioned in the PR description. Will do as part of that
@@ -61,7 +62,6 @@
   * @param file File to write to.
   * @param compressionType Type of compression to use.
   * @param totalDocs Total number of docs to write.
-  * @param numDocsPerChunk Number of documents per chunk.
?
sorry, forgot to undo
@@ -27,15 +28,21 @@

public class SingleValueVarByteRawIndexCreator extends BaseSingleValueRawIndexCreator {
-  private static final int NUM_DOCS_PER_CHUNK = 1000; // TODO: Auto-derive this based on metadata.
+  private static final int TARGET_MAX_CHUNK_SIZE = 1024*1024;
(nit) reformat
done
LGTM otherwise
…w index creator (#5256)

* Derive numDocsPerChunk from max column value length for var byte raw forward index creator
* review comments

Co-authored-by: Siddharth Teotia <steotia@steotia-mn1.linkedin.biz>
(1) PR #5256 added support for deriving the number of docs per chunk for the var byte raw index creator from the column value length. This was done specifically to support text blobs. High-QPS use cases that don't need this feature see a negative impact: the chunk size increases (the earlier value of numDocsPerChunk was hardcoded to 1000), and depending on the access pattern we might end up uncompressing a bigger chunk to get the values for a set of docIds. We have made this change configurable, so the default behaviour is the same as before (1000 docs per chunk).

(2) PR #4791 added support for noDict on STRING/BYTES columns in consuming segments. This change particularly impacts use cases that set noDict on their STRING dimension columns for other performance reasons and also want metrics aggregation. Those use cases no longer get aggregateMetrics because the new implementation honors their table config setting of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases. To continue aggregating metrics in such cases, we will create a dictionary even if the column is part of the noDictionary set in the table config.

Co-authored-by: Siddharth Teotia <steotia@steotia-mn1.linkedin.biz>
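To make the chunk-size tradeoff in (1) above concrete, with illustrative numbers that are assumed rather than taken from either PR: for a column of short values, the derived chunk holds roughly ten times as many docs as the old fixed chunk, so a point lookup decompresses roughly ten times as much data.

```java
public class ChunkSizeTradeoffExample {
  public static void main(String[] args) {
    int targetMaxChunkSize = 1024 * 1024;  // 1 MB target used by the derivation
    int lengthOfLongestEntry = 100;        // assumed: a column of short string values
    int oldDocsPerChunk = 1000;            // previous hard-coded value

    // The derived chunk packs ~10x more docs, so a single-doc read decompresses ~10x more data.
    int derivedDocsPerChunk = targetMaxChunkSize / (lengthOfLongestEntry + 4);
    System.out.println(derivedDocsPerChunk + " docs per chunk vs " + oldDocsPerChunk); // 10082 vs 1000
  }
}
```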
As part of internal testing for text search, we found that there can be cases where a text column value is several hundred thousand characters long. This happens for a very small percentage of rows in the overall dataset.
The VarByteChunkWriter uses a fixed, hard-coded value of 1000 for the number of docs per chunk. It is better to derive this from metadata (the length of the longest value in bytes, available from stats). For unusually high values of lengthOfLongestEntry (around 1 million bytes), we were seeing int overflow, since the chunk size was computed as:
1000 * (lengthOfLongestEntry + 4 byte header offset). Secondly, the compression buffer is allocated at twice this size to account for negative compression, so the computed capacity of the compression buffer became negative.
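A rough worked example of the overflow described above, using an assumed value slightly above 1 MB for lengthOfLongestEntry:

```java
public class ChunkSizeOverflowExample {
  public static void main(String[] args) {
    int numDocsPerChunk = 1000;            // old hard-coded value
    int lengthOfLongestEntry = 1_100_000;  // assumed: longest value of ~1.1 MB

    // chunkSize = numDocsPerChunk * (lengthOfLongestEntry + 4-byte header offset)
    int chunkSize = numDocsPerChunk * (lengthOfLongestEntry + 4); // 1100004000, still fits in an int
    // The compression buffer is sized at roughly twice the chunk size, which overflows int.
    int compressionBufferSize = chunkSize * 2;                    // -2094959296: wraps negative

    System.out.println(chunkSize);
    System.out.println(compressionBufferSize);
  }
}
```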
This PR derives the number of docs per chunk from lengthOfLongestEntry using a fixed target max chunk size of 1MB.
This is backward compatible since the number of docs per chunk is written in the file header.
There is a tentative follow-up: use long for the chunk offset array in the file header (currently we use int). If most of the text column values are blob-like data, the total size of text data across all rows could be more than 2GB, so we need long to track chunk offsets. This would be a backward-incompatible change with a new version of the chunk writer and reader.
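A back-of-the-envelope illustration, with assumed numbers, of why int offsets become the limit:

```java
public class ChunkOffsetWidthExample {
  public static void main(String[] args) {
    long numDocs = 10_000_000L;  // assumed docs in a segment
    long avgValueSize = 300L;    // assumed average text blob size in bytes

    long totalTextDataSize = numDocs * avgValueSize;           // 3,000,000,000 bytes (~2.8 GB)
    System.out.println(totalTextDataSize > Integer.MAX_VALUE); // true: offsets past ~2.1 GB need a long
  }
}
```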