Add SchemaConformingTransformerV2 to enhance text search abilities #12788

Merged

Conversation


@lnbest0707-uber lnbest0707-uber commented Apr 3, 2024

tags: feature, refactor, release-notes

This adds SchemaConformingTransformerV2, an evolved version of the existing SchemaConformingTransformer, with the following new features:

Refactored code with better readability and extensibility
Support overlapping schema fields, e.g., schema columns "a" and "a.b" can now coexist; only primitive-type fields are allowed as values
Extract flattened key-value pairs into a mergedTextIndex field for better text searching (see the sketch below)
Add shingle index tokenization for extremely large text fields
Add flexibility to map JSON-extracted field names to meaningful user-specified column names
Improve serialization logic to handle nested JSON fields
Enforce graceful handling of extracted String-type columns: collections or arrays are converted to String if the column is a single-value field

The new transformer was contributed by multiple developers: @jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber; the PR owner is consolidating the work and maintaining the open-source contribution.
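To make the flattening, mergedTextIndex, and shingling behavior concrete, here is a minimal conceptual sketch in Java. It is not the PR's implementation: the class, method, and parameter names (MergedTextIndexSketch, buildDocuments, maxLength, overlapLength) are illustrative only, and it merely demonstrates the general idea of flattening a nested record into "key:value" documents and splitting oversized values into overlapping shingles.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MergedTextIndexSketch {

  // Flatten a nested record into "key:value" documents, shingling values longer than maxLength.
  public static List<String> buildDocuments(Map<String, Object> record, int maxLength, int overlapLength) {
    List<String> documents = new ArrayList<>();
    flatten("", record, documents, maxLength, overlapLength);
    return documents;
  }

  @SuppressWarnings("unchecked")
  private static void flatten(String prefix, Map<String, Object> node, List<String> documents,
      int maxLength, int overlapLength) {
    for (Map.Entry<String, Object> entry : node.entrySet()) {
      String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
      Object value = entry.getValue();
      if (value instanceof Map) {
        // Nested object: recurse so that "a" -> {"b": 1} yields the flattened key "a.b".
        flatten(key, (Map<String, Object>) value, documents, maxLength, overlapLength);
      } else {
        String text = String.valueOf(value);
        if (text.length() <= maxLength) {
          documents.add(key + ":" + text);
        } else {
          // Shingle an oversized value into overlapping chunks so a phrase spanning a
          // chunk boundary can still be matched within one of the chunks.
          for (int start = 0; start < text.length(); start += maxLength - overlapLength) {
            int end = Math.min(start + maxLength, text.length());
            documents.add(key + ":" + text.substring(start, end));
            if (end == text.length()) {
              break;
            }
          }
        }
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> record = Map.of("a", Map.of("b", 1, "c", "hello world"));
    // Prints documents such as "a.b:1" and "a.c:hello world".
    buildDocuments(record, 32, 8).forEach(System.out::println);
  }
}

Per the feature list above, the actual transformer additionally resolves flattened keys against the table schema (so overlapping columns such as "a" and "a.b" are both honored) and applies the configured field-name mappings before emitting documents.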

@lnbest0707-uber lnbest0707-uber changed the title Upstream fork/schema transformer Add SchemaConformingTransformerV2 to enhance text search abilities Apr 3, 2024
@codecov-commenter commented Apr 3, 2024

Codecov Report

Attention: Patch coverage is 56.98630% with 157 lines in your changes missing coverage. Please review.

Project coverage is 62.03%. Comparing base (59551e4) to head (b9a013e).
Report is 218 commits behind head on master.

Files Patch % Lines
...ingestion/SchemaConformingTransformerV2Config.java 0.00% 78 Missing ⚠️
...cordtransformer/SchemaConformingTransformerV2.java 73.51% 37 Missing and 30 partials ⚠️
...recordtransformer/SchemaConformingTransformer.java 66.66% 0 Missing and 4 partials ⚠️
.../apache/pinot/segment/local/utils/Base64Utils.java 80.00% 1 Missing and 1 partial ⚠️
...ache/pinot/segment/local/utils/IngestionUtils.java 0.00% 0 Missing and 2 partials ⚠️
...he/pinot/segment/local/utils/TableConfigUtils.java 50.00% 1 Missing and 1 partial ⚠️
...ot/spi/config/table/ingestion/IngestionConfig.java 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12788      +/-   ##
============================================
+ Coverage     61.75%   62.03%   +0.28%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2468      +32     
  Lines        133233   135237    +2004     
  Branches      20636    20892     +256     
============================================
+ Hits          82274    83892    +1618     
- Misses        44911    45145     +234     
- Partials       6048     6200     +152     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.98% <56.98%> (+0.27%) ⬆️
java-21 61.86% <56.98%> (+0.23%) ⬆️
skip-bytebuffers-false 62.01% <56.98%> (+0.26%) ⬆️
skip-bytebuffers-true 61.83% <56.98%> (+34.10%) ⬆️
temurin 62.03% <56.98%> (+0.28%) ⬆️
unittests 62.02% <56.98%> (+0.28%) ⬆️
unittests1 46.52% <1.09%> (-0.37%) ⬇️
unittests2 28.14% <55.89%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown.


@chenboat chenboat self-requested a review April 3, 2024 21:21
Map<String, Object> mergedTextIndexMap = new HashMap<>();

try {
Deque<String> jsonPath = new ArrayDeque<>();
Contributor

jsonPath --> jsonPaths since we have multiple paths.

Contributor Author

The Deque represents multiple paths by mutating itself in place with add/remove operations, but at any single moment it represents only one path. Deque(a, b, c) represents the path a.b.c, which is a single path rather than three.
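To illustrate the point, here is a minimal sketch (not the PR's code; JsonPathWalker and walk are hypothetical names) of how a single Deque tracks the one current path during a depth-first traversal, growing on descent and shrinking on backtrack:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

public class JsonPathWalker {
  private final Deque<String> _jsonPath = new ArrayDeque<>();

  @SuppressWarnings("unchecked")
  public void walk(Map<String, Object> node) {
    for (Map.Entry<String, Object> entry : node.entrySet()) {
      _jsonPath.addLast(entry.getKey());   // descend: the deque now spells out the current path
      Object value = entry.getValue();
      if (value instanceof Map) {
        walk((Map<String, Object>) value);
      } else {
        // At any single moment the deque holds exactly one path, e.g. [a, b, c] -> "a.b.c".
        System.out.println(String.join(".", _jsonPath) + " = " + value);
      }
      _jsonPath.removeLast();              // backtrack before moving on to the sibling key
    }
  }
}

For example, walking {"a": {"b": 1, "c": 2}} prints "a.b = 1" and "a.c = 2"; the deque never contains more than one path at a time.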


public SchemaConformingTransformerV2Config setMergedTextIndexShinglingTokenOverlapLength(
Integer mergedTextIndexShinglingOverlapLength) {
_mergedTextIndexShinglingOverlapLength = mergedTextIndexShinglingOverlapLength;
Contributor

Can you double-check what the side effect of passing a null Integer object is, and whether we need to check for a null value?

Contributor Author

We are treating mergedTextIndexShinglingOverlapLength = null as a special case in the transformer; there should be no side effect as long as we check for the null value during usage:

if (null == mergedTextIndexShinglingOverlapLength) {
  generateTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength);
} else {
  generateShingleTextIndexToken(kv, luceneTokens, mergedTextIndexTokenMaxLength,
      mergedTextIndexShinglingOverlapLength);
}
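For reference, the main hazard with a null Integer is auto-unboxing; here is a minimal illustrative snippet (names and values are hypothetical, not the PR's code) of the safe and unsafe patterns:

public class NullIntegerCheck {
  public static void main(String[] args) {
    Integer overlapLength = null;  // simulates an unset shingling overlap in the config

    // Safe: comparing the reference against null never unboxes, so no NullPointerException.
    if (overlapLength == null) {
      System.out.println("shingling disabled, use plain tokenization");
    } else {
      // Only reached when non-null, so unboxing into arithmetic here is safe.
      int step = 32766 - overlapLength;
      System.out.println("shingle step = " + step);
    }

    // Unsafe (kept commented out): unconditionally unboxing would throw NullPointerException.
    // int step = 32766 - overlapLength;
  }
}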


// Generate merged text index
if (null != _mergedTextIndexFieldSpec && !mergedTextIndexMap.isEmpty()) {
List<String> luceneTokens = getLuceneTokensFromMergedTextIndexMap(mergedTextIndexMap);
Contributor

Can we use Lucene's terminology? i.e. use document rather than token. Perhaps refactor the code in all the places that have this behavior.

Contributor Author

Renamed to documents, please check. In addition, two configuration names in the settings had to be changed: mergedTextIndexDocumentMaxLength and mergedTextIndexBinaryDocumentDetectionMinLength (changed from Token to Document).

}
}

private List<String> getLuceneTokensFromMergedTextIndexMap(Map<String, Object> mergedTextIndexMap) {
Contributor

Again, perhaps change the terminology from tokens to terms

* where node with "*" could represent a valid column in the schema.
*/
class SchemaTreeNode {
private boolean _isColumn;
Contributor

Nit: should probably be consistent and use either field or column.

Contributor Author

These two terms are both used in the Pinot documentation, so I assume both are acceptable names. Keeping them as is for now.

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes refactor labels Apr 5, 2024
private String _mergedTextIndexField = "__mergedTextIndex";

@JsonPropertyDescription("mergedTextIndex document max length")
private int _mergedTextIndexDocumentMaxLength = 32766;
Contributor

If we want to be 100% precise, we should change this parameter from _mergedTextIndexDocumentMaxLength to _mergedTextIndexDocumentMaxSize, as 32766 refers to the number of bytes rather than the number of characters. We should also change the implementation of the check by decoding the string to bytes, etc.

We could also simply remove this feature, since it's not necessary for us: in production we use a custom analyzer which already enforces this limit.

Contributor Author

Thanks for the insight. This would be a relatively big change; will handle it in a future patch.
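For context on the bytes-versus-characters distinction, here is a small illustrative snippet (class name and sample strings are hypothetical) showing why a character-count check can pass while the UTF-8 encoded size exceeds the 32766-byte limit discussed above:

import java.nio.charset.StandardCharsets;

public class DocumentSizeCheck {
  private static final int MAX_BYTES = 32766;  // the limit applies to encoded bytes, not chars

  public static void main(String[] args) {
    String ascii = "a".repeat(20000);        // 20000 chars -> 20000 UTF-8 bytes, under the limit
    String cjk = "\u4e2d".repeat(20000);     // 20000 chars -> 60000 UTF-8 bytes, over the limit

    printSize("ascii", ascii);
    printSize("cjk", cjk);
  }

  private static void printSize(String name, String s) {
    int bytes = s.getBytes(StandardCharsets.UTF_8).length;
    System.out.println(name + ": " + s.length() + " chars, " + bytes + " bytes, overLimit=" + (bytes > MAX_BYTES));
  }
}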

@lnbest0707-uber
Contributor Author

Some extreme column-name corner cases will be taken care of in a future patch, e.g., column names like "a.", and input data such as {"a": {"b":1}, "a.b":2, "a.":3, "a.b.":4}.

@chenboat chenboat merged commit 13673f1 into apache:master Apr 9, 2024
19 checks passed
@lnbest0707-uber lnbest0707-uber deleted the upstream-fork/schema_transformer branch April 9, 2024 23:14