Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding jsonExtractScalar function to extract field from json object #4597

Merged
merged 5 commits into from May 13, 2020

Conversation

xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented Sep 9, 2019

Right now we could put a JSON blob into a string field.
This udf leverages JsonPath (https://github.com/json-path/JsonPath) DSL to read from a JSON string.

  1. Adding Transform function: jsonExtractScalar

Function Syntax:

jsonExtractScalar(JSON_STRING_FIELD, JSON_PATH, OUTPUT_FORMAT)

Sample queries:

Select jsonExtractScalar(myJsonMapStr,'$.k1','STRING') from myTable  where jsonExtractScalar(myJsonMapStr,'$.k1','STRING') = 'value-k1-0'";
Select sum(jsonExtractScalar(complexMapStr,'$.k4.met','INT')) from myTable group by jsonExtractScalar(complexMapStr,'$.k1','STRING')
  1. Adding Transform function: jsonExtractKey to extract the paths for a given pattern of a json.
    E.g. for a given json:
{ "k1": "v1", "k2": "v2", "k3": "v3" }

The result of jsonExtractKey(jsonField, '$.*') will return a list a string of matched json paths pattern. In above example, the result is ["$['k1']", "$['k2']", "$['k3']"]

  1. Normalized transform function to resolve both jsonExtractScalar and json_extract_scalar function names to the same function.

@xiangfu0 xiangfu0 force-pushed the jsonpath_function branch 2 times, most recently from 8e4c8ba to af7bdc1 Compare September 9, 2019 20:30
@codecov-io
Copy link

codecov-io commented Sep 9, 2019

Codecov Report

Merging #4597 into master will increase coverage by 0.42%.
The diff coverage is 73.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4597      +/-   ##
==========================================
+ Coverage   66.08%   66.50%   +0.42%     
==========================================
  Files        1072     1077       +5     
  Lines       54668    54978     +310     
  Branches     8152     8213      +61     
==========================================
+ Hits        36125    36565     +440     
+ Misses      15895    15717     -178     
- Partials     2648     2696      +48     
Impacted Files Coverage Δ
...ava/org/apache/pinot/common/utils/SchemaUtils.java 9.85% <0.00%> (ø)
...ger/realtime/Server2ControllerSegmentUploader.java 71.42% <ø> (ø)
...edicate/BaseDictionaryBasedPredicateEvaluator.java 54.16% <ø> (ø)
...predicate/BaseRawValueBasedPredicateEvaluator.java 87.87% <ø> (ø)
...e/operator/dociditerators/MVScanDocIdIterator.java 59.37% <14.28%> (-6.73%) ⬇️
...not/core/data/recordtransformer/PinotDataType.java 95.08% <33.33%> (-1.03%) ⬇️
...m/function/JsonExtractScalarTransformFunction.java 48.80% <48.80%> (ø)
...rg/apache/pinot/broker/routing/RoutingManager.java 80.91% <50.00%> (-0.24%) ⬇️
...e/pinot/common/function/TransformFunctionType.java 82.85% <50.00%> (-4.24%) ⬇️
...indexsegment/generator/SegmentGeneratorConfig.java 72.44% <50.00%> (+10.18%) ⬆️
... and 103 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44b0a91...fa1bfdc. Read the comment docs.

@mcvsubbu
Copy link
Contributor

Can you also add documentation on how json columns are to be used? I suppose the json columns cannot be used in filters (yet) right?

@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Sep 13, 2019

Will add more documents, this could be used in selection/filtering/groupby, please refer to integration tests.

Current implementation requires the column to be a string column and the content is the json string in order to be used(e.g. the AVRO field needs to be String type). I'm still investigating how to directly ingest a record/map type into the field.

@xiangfu0 xiangfu0 force-pushed the jsonpath_function branch 2 times, most recently from f5c17ef to 0f94d9f Compare September 16, 2019 06:13
@xiangfu0 xiangfu0 changed the title [WIP] Adding json_path function to extract field from json object Adding json_path function to extract field from json object Sep 17, 2019
@xiangfu0 xiangfu0 force-pushed the jsonpath_function branch 5 times, most recently from 9e171e1 to 0644c0f Compare May 2, 2020 13:32
Copy link
Member

@kishoreg kishoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have support for udf in filter predicates, this can be an amazing feature.
Can we change the function names to match with presto

@xiangfu0 xiangfu0 force-pushed the jsonpath_function branch 3 times, most recently from 3fca70e to 9ed2e19 Compare May 3, 2020 02:46
@xiangfu0
Copy link
Contributor Author

xiangfu0 commented May 3, 2020

Now that we have support for udf in filter predicates, this can be an amazing feature.
Can we change the function names to match with presto

Rename function name to json_extract_scalar also added a new function to extract json_extract_key to extract the paths for a given json.

@xiangfu0 xiangfu0 changed the title Adding json_path function to extract field from json object Adding jsonExtractScalar function to extract field from json object May 3, 2020
@xiangfu0 xiangfu0 requested review from kishoreg and npawar May 3, 2020 02:55
@kishoreg
Copy link
Member

Can you also add documentation on how json columns are to be used? I suppose the json columns cannot be used in filters (yet) right?

We can know after we added the support for expressions in the filter predicates

@xiangfu0 xiangfu0 merged commit a6fe685 into master May 13, 2020
@xiangfu0 xiangfu0 deleted the jsonpath_function branch May 13, 2020 17:37
Jackie-Jiang added a commit to Jackie-Jiang/pinot that referenced this pull request May 14, 2020
1. Fix the compilation error introduced because of the merge of apache#4597 and apache#5240
2. Fix the bug of not loading the range index if both inverted index and range index exist

TODO: The range index triggeres another severe issue of accessing closed DataBuffer which can cause JVM crash. Will address in a separate PR
Jackie-Jiang added a commit that referenced this pull request May 15, 2020
1. Fix the compilation error introduced because of the merge of #4597 and #5240
2. Fix the bug of not loading the range index if both inverted index and range index exist

TODO: The range index triggeres another severe issue of accessing closed DataBuffer which can cause JVM crash. Will address in a separate PR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants