Add SchemaConformingTransformerV2 to enhance text search abilities #12788

Merged

Conversation


@lnbest0707-uber lnbest0707-uber commented Apr 3, 2024

tags: feature, refactor, release-notes

This adds SchemaConformingTransformerV2, an evolved version of the existing SchemaConformingTransformer, with the following new features:

Refactored code with better readability and extensibility
Support overlapping schema fields, e.g., schema columns "a" and "a.b" can now coexist; only primitive-type fields are allowed as values
Extract flattened key-value pairs into a mergedTextIndex field for better text searching (see the sketch below)
Add shingle index tokenization for extremely large text fields
Add flexibility to map JSON-extracted field names to meaningful user-specified column names
Improve serialization logic to handle nested JSON fields
Enforce graceful handling of extracted String-type columns: collections or arrays are converted to String if the column is a single-value field

The new transformer was contributed by multiple developers: @jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber; the PR owner is consolidating the work and maintaining the open-source contribution.
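To make the flattening, mergedTextIndex, and shingling behavior concrete, here is a minimal conceptual sketch in Java. It is not the PR's implementation: the class, method, and parameter names (MergedTextIndexSketch, buildDocuments, maxLength, overlapLength) are illustrative only, and it merely demonstrates the general idea of flattening a nested record into "key:value" documents and splitting oversized values into overlapping shingles.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MergedTextIndexSketch {

  // Flatten a nested record into "key:value" documents, shingling values longer than maxLength.
  public static List<String> buildDocuments(Map<String, Object> record, int maxLength, int overlapLength) {
    List<String> documents = new ArrayList<>();
    flatten("", record, documents, maxLength, overlapLength);
    return documents;
  }

  @SuppressWarnings("unchecked")
  private static void flatten(String prefix, Map<String, Object> node, List<String> documents,
      int maxLength, int overlapLength) {
    for (Map.Entry<String, Object> entry : node.entrySet()) {
      String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
      Object value = entry.getValue();
      if (value instanceof Map) {
        // Nested object: recurse so that "a" -> {"b": 1} yields the flattened key "a.b".
        flatten(key, (Map<String, Object>) value, documents, maxLength, overlapLength);
      } else {
        String text = String.valueOf(value);
        if (text.length() <= maxLength) {
          documents.add(key + ":" + text);
        } else {
          // Shingle an oversized value into overlapping chunks so a phrase spanning a
          // chunk boundary can still be matched within one of the chunks.
          for (int start = 0; start < text.length(); start += maxLength - overlapLength) {
            int end = Math.min(start + maxLength, text.length());
            documents.add(key + ":" + text.substring(start, end));
            if (end == text.length()) {
              break;
            }
          }
        }
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> record = Map.of("a", Map.of("b", 1, "c", "hello world"));
    // Prints documents such as "a.b:1" and "a.c:hello world".
    buildDocuments(record, 32, 8).forEach(System.out::println);
  }
}

Per the feature list above, the actual transformer additionally resolves flattened keys against the table schema (so overlapping columns such as "a" and "a.b" are both honored) and applies the configured field-name mappings before emitting documents.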

@lnbest0707-uber lnbest0707-uber changed the title Upstream fork/schema transformer Add SchemaConformingTransformerV2 to enhance text search abilities Apr 3, 2024
@codecov-commenter commented Apr 3, 2024

Codecov Report

Attention: Patch coverage is 56.98630% with 157 lines in your changes missing coverage. Please review.

Project coverage is 62.03%. Comparing base (59551e4) to head (b9a013e).
Report is 218 commits behind head on master.

Files Patch % Lines
...ingestion/SchemaConformingTransformerV2Config.java 0.00% 78 Missing ⚠️
...cordtransformer/SchemaConformingTransformerV2.java 73.51% 37 Missing and 30 partials ⚠️
...recordtransformer/SchemaConformingTransformer.java 66.66% 0 Missing and 4 partials ⚠️
.../apache/pinot/segment/local/utils/Base64Utils.java 80.00% 1 Missing and 1 partial ⚠️
...ache/pinot/segment/local/utils/IngestionUtils.java 0.00% 0 Missing and 2 partials ⚠️
...he/pinot/segment/local/utils/TableConfigUtils.java 50.00% 1 Missing and 1 partial ⚠️
...ot/spi/config/table/ingestion/IngestionConfig.java 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12788      +/-   ##
============================================
+ Coverage     61.75%   62.03%   +0.28%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2468      +32     
  Lines        133233   135237    +2004     
  Branches      20636    20892     +256     
============================================
+ Hits          82274    83892    +1618     
- Misses        44911    45145     +234     
- Partials       6048     6200     +152     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.98% <56.98%> (+0.27%) ⬆️
java-21 61.86% <56.98%> (+0.23%) ⬆️
skip-bytebuffers-false 62.01% <56.98%> (+0.26%) ⬆️
skip-bytebuffers-true 61.83% <56.98%> (+34.10%) ⬆️
temurin 62.03% <56.98%> (+0.28%) ⬆️
unittests 62.02% <56.98%> (+0.28%) ⬆️
unittests1 46.52% <1.09%> (-0.37%) ⬇️
unittests2 28.14% <55.89%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown.


@chenboat chenboat self-requested a review April 3, 2024 21:21
Map<String, Object> mergedTextIndexMap = new HashMap<>();

try {
Deque<String> jsonPath = new ArrayDeque<>();
Contributor

jsonPath --> jsonPaths since we have multiple paths.

Contributor Author

The Deque represents multiple paths by mutating itself in place with add/remove operations, but at any single moment it represents only one path. Deque(a, b, c) represents the path a.b.c, which is a single path rather than three.
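To illustrate the point, here is a minimal sketch (not the PR's code; JsonPathWalker and walk are hypothetical names) of how a single Deque tracks the one current path during a depth-first traversal, growing on descent and shrinking on backtrack:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

public class JsonPathWalker {
  private final Deque<String> _jsonPath = new ArrayDeque<>();

  @SuppressWarnings("unchecked")
  public void walk(Map<String, Object> node) {
    for (Map.Entry<String, Object> entry : node.entrySet()) {
      _jsonPath.addLast(entry.getKey());   // descend: the deque now spells out the current path
      Object value = entry.getValue();
      if (value instanceof Map) {
        walk((Map<String, Object>) value);
      } else {
        // At any single moment the deque holds exactly one path, e.g. [a, b, c] -> "a.b.c".
        System.out.println(String.join(".", _jsonPath) + " = " + value);
      }
      _jsonPath.removeLast();              // backtrack before moving on to the sibling key
    }
  }
}

For example, walking {"a": {"b": 1, "c": 2}} prints "a.b = 1" and "a.c = 2"; the deque never contains more than one path at a time.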


public SchemaConformingTransformerV2Config setMergedTextIndexShinglingTokenOverlapLength(
Integer mergedTextIndexShinglingOverlapLength) {
_mergedTextIndexShinglingOverlapLength = mergedTextIndexShinglingOverlapLength;
Contributor

Can you double-check what the side effect of passing a null Integer object is, and whether we need to check for a null value?

Contributor Author

We are treating mergedTextIndexShinglingOverlapLength = null as a special case in the transformer; there should be no side effect as long as we check for the null value during usage:

if (null == mergedTextIndexShinglingOverlapLength) {
  generateTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength);
} else {
  generateShingleTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength,
      mergedTextIndexShinglingOverlapLength);
}
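For reference, the main hazard with a null Integer is auto-unboxing; here is a minimal illustrative snippet (names and values are hypothetical, not the PR's code) of the safe and unsafe patterns:

public class NullIntegerCheck {
  public static void main(String[] args) {
    Integer overlapLength = null;  // simulates an unset shingling overlap in the config

    // Safe: comparing the reference against null never unboxes, so no NullPointerException.
    if (overlapLength == null) {
      System.out.println("shingling disabled, use plain tokenization");
    } else {
      // Only reached when non-null, so unboxing into arithmetic here is safe.
      int step = 32766 - overlapLength;
      System.out.println("shingle step = " + step);
    }

    // Unsafe (kept commented out): unconditionally unboxing would throw NullPointerException.
    // int step = 32766 - overlapLength;
  }
}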


// Generate merged text index
if (null != _mergedTextIndexFieldSpec && !mergedTextIndexMap.isEmpty()) {
List<String> luceneTokens = getLuceneTokensFromMergedTextIndexMap(mergedTextIndexMap);
Contributor

Can we use Lucene's terminology? i.e. use document rather than token. Perhaps refactor the code in all the places that have this behavior.

Contributor Author

Renamed to documents, please check. In addition, two configuration names in the settings had to be changed: mergedTextIndexDocumentMaxLength and mergedTextIndexBinaryDocumentDetectionMinLength (changed from Token to Document).

}
}

private List<String> getLuceneTokensFromMergedTextIndexMap(Map<String, Object> mergedTextIndexMap) {
Contributor

Again, perhaps change the terminology from tokens to terms

* where node with "*" could represent a valid column in the schema.
*/
class SchemaTreeNode {
private boolean _isColumn;
Contributor

Nit: should probably be consistent and use either field or column.

Contributor Author

These two terms are both used in the Pinot documentation, so I assume both are acceptable names. Keeping them as is for now.

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes refactor labels Apr 5, 2024
private String _mergedTextIndexField = "__mergedTextIndex";

@JsonPropertyDescription("mergedTextIndex document max length")
private int _mergedTextIndexDocumentMaxLength = 32766;
Contributor

If we want to be 100% precise, we should change this parameter from _mergedTextIndexDocumentMaxLength to _mergedTextIndexDocumentMaxSize, as 32766 refers to the number of bytes rather than the number of characters. We should also change the implementation of the check by decoding the string to bytes, etc.

We could also simply remove this feature, since it's not necessary for us: in production we use a custom analyzer which already enforces this limit.

Contributor Author

Thanks for the insight. This would be a relatively big change; will handle it in a future patch.
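For context on the bytes-versus-characters distinction, here is a small illustrative snippet (class name and sample strings are hypothetical) showing why a character-count check can pass while the UTF-8 encoded size exceeds the 32766-byte limit discussed above:

import java.nio.charset.StandardCharsets;

public class DocumentSizeCheck {
  private static final int MAX_BYTES = 32766;  // the limit applies to encoded bytes, not chars

  public static void main(String[] args) {
    String ascii = "a".repeat(20000);        // 20000 chars -> 20000 UTF-8 bytes, under the limit
    String cjk = "\u4e2d".repeat(20000);     // 20000 chars -> 60000 UTF-8 bytes, over the limit

    printSize("ascii", ascii);
    printSize("cjk", cjk);
  }

  private static void printSize(String name, String s) {
    int bytes = s.getBytes(StandardCharsets.UTF_8).length;
    System.out.println(name + ": " + s.length() + " chars, " + bytes + " bytes, overLimit=" + (bytes > MAX_BYTES));
  }
}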

@lnbest0707-uber
Contributor Author

Some extreme column-name corner cases will be taken care of in a future patch, e.g., column names like "a.", and input data such as {"a": {"b":1}, "a.b":2, "a.":3, "a.b.":4}.

@chenboat chenboat merged commit 13673f1 into apache:master Apr 9, 2024
19 checks passed
@lnbest0707-uber lnbest0707-uber deleted the upstream-fork/schema_transformer branch April 9, 2024 23:14