-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SchemaConformingTransformerV2 to enhance text search abilities #12788
Add SchemaConformingTransformerV2 to enhance text search abilities #12788
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #12788 +/- ##
============================================
+ Coverage 61.75% 62.03% +0.28%
+ Complexity 207 198 -9
============================================
Files 2436 2468 +32
Lines 133233 135237 +2004
Branches 20636 20892 +256
============================================
+ Hits 82274 83892 +1618
- Misses 44911 45145 +234
- Partials 6048 6200 +152
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
...cal/src/main/java/org/apache/pinot/segment/local/recordtransformer/CompositeTransformer.java
Show resolved
Hide resolved
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Outdated
Show resolved
Hide resolved
Map<String, Object> mergedTextIndexMap = new HashMap<>(); | ||
|
||
try { | ||
Deque<String> jsonPath = new ArrayDeque<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jsonPath --> jsonPaths since we have multiple paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Deque object represents multiple paths by revising itself in place with add/remove. But at each single moment, it could only represent one single path. Queue(a, b, c) is representing path a.b.c which is a single path instead of 3 paths.
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Outdated
Show resolved
Hide resolved
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/Base64Utils.java
Show resolved
Hide resolved
...in/java/org/apache/pinot/spi/config/table/ingestion/SchemaConformingTransformerV2Config.java
Show resolved
Hide resolved
|
||
public SchemaConformingTransformerV2Config setMergedTextIndexShinglingTokenOverlapLength( | ||
Integer mergedTextIndexShinglingOverlapLength) { | ||
_mergedTextIndexShinglingOverlapLength = mergedTextIndexShinglingOverlapLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you double check what is the side-effect of passing a null
Integer object and whether we need to check for null value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are treating mergedTextIndexShinglingOverlapLength = null
as a special case in transformer, there should not be side effect as long as we check its null value during usage
if (null == mergedTextIndexShinglingOverlapLength) { generateTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength); } else { generateShingleTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength, mergedTextIndexShinglingOverlapLength); }
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Show resolved
Hide resolved
|
||
// Generate merged text index | ||
if (null != _mergedTextIndexFieldSpec && !mergedTextIndexMap.isEmpty()) { | ||
List<String> luceneTokens = getLuceneTokensFromMergedTextIndexMap(mergedTextIndexMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use Lucene's terminology? i.e. use document
rather than token
. Perhaps refactor the code in all the places that have this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to documents, please check. Plus two configurations names in setting has to be changed, mergedTextIndexDocumentMaxLength
and mergedTextIndexBinaryDocumentDetectionMinLength
changed from Token to Document
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Show resolved
Hide resolved
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
private List<String> getLuceneTokensFromMergedTextIndexMap(Map<String, Object> mergedTextIndexMap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, perhaps change the terminology from tokens
to terms
* where node with "*" could represent a valid column in the schema. | ||
*/ | ||
class SchemaTreeNode { | ||
private boolean _isColumn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should probably be consistent and use either field
or column
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 terms are both used in Pinot documents, so I assume both as acceptable names. Keep them as is for now.
...ain/java/org/apache/pinot/segment/local/recordtransformer/SchemaConformingTransformerV2.java
Outdated
Show resolved
Hide resolved
private String _mergedTextIndexField = "__mergedTextIndex"; | ||
|
||
@JsonPropertyDescription("mergedTextIndex document max length") | ||
private int _mergedTextIndexDocumentMaxLength = 32766; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to be 100% precise, we should change this parameter to _mergedTextIndexDocumentMaxLength
to _mergedTextIndexDocumentMaxSize
as 32766 refers to number of bytes rather than number of characters. We also should change the implementation for the check by decoding the string to bytes, etc.
We could simply remove this feature as well because it's not necessary for us because in production, we use a custom analyzer which already enforces this limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insight. This would be a relatively big change, will handle in future patch, thanks.
Some extreme column name corner case checks would be taken care in future patch, e.g., column name with |
tags:
feature
,refactor
,release-notes
This adds an evolved version of ShemaConformingTransformerV2, it evolves from the existing one with following new features:
Refactored code with better readability and extensibility
Support over-lapping schema fields, in which case it could support schema column "a" and "a.b" at the same time. And it only allows primitive type fields to be the value.
Extract flattened key-value pairs as mergedTextIndex for better text searching.
Add shingle index tokenization functionality for extremely large text fields.
Add flexibility to map json extracted field name to meaningful user specified column name
Improve serialization logics to handle nested json fields
Enforce graceful handling on extracted String type column. Will convert collection or array to String if column type is singleField.
The new transformer is contributed by multiple developers: @jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber and PR owner is summarizing and maintaining the OSS uploading.