Skip to content

Avoid re-stringifying strings in JSON_FORMAT function#13097

Closed
yashmayya wants to merge 1 commit intoapache:masterfrom
yashmayya:jsonformat-restringify
Closed

Avoid re-stringifying strings in JSON_FORMAT function#13097
yashmayya wants to merge 1 commit intoapache:masterfrom
yashmayya:jsonformat-restringify

Conversation

@yashmayya
Copy link
Contributor

@yashmayya yashmayya commented May 7, 2024

  • The JSON_FORMAT function currently just calls JsonUtils::objectToString which in turn calls Jackson Databind's ObjectMapper::writeValueAsStringMethod.
  • If a JSON object that has already been formatted is passed to this function, the string is "re-stringified" and the result is an escaped string.
  • For instance, taking the JSON object {"data":{"alias":"student1","age":24,"name":{"full.name":"Peter1","nick.name":"Pete1"}}}. Calling the JSON_FORMAT function on this will result in the string value {"data":{"alias":"student1","age":24,"name":{"full.name":"Peter1","nick.name":"Pete1"}}}.
  • Calling the JSON_FORMAT function again on this string value will result in the escaped string - "{\"data\":{\"alias\":\"student1\",\"age\":24,\"name\":{\"full.name\":\"Peter1\",\"nick.name\":\"Pete1\"}}}". Passing it through the function again will result in a doubly escaped string - "\"{\\\"data\\\":{\\\"alias\\\":\\\"student1\\\",\\\"age\\\":24,\\\"name\\\":{\\\"full.name\\\":\\\"Peter1\\\",\\\"nick.name\\\":\\\"Pete1\\\"}}}\"".
  • This can lead to unexpected results in certain scenarios. For instance, let's say that we have a table with a data JSON column with the row - {"data":{"alias":"student1","age":24,"name":{"full.name":"Peter1","nick.name":"Pete1"}}}.
  • Now suppose we want to add a new derived column alias using the ingestion transform function JSON_PATH_STRING(JSON_FORMAT(data), '$.alias') (via segment reload). Since JSON columns are stored as strings internally, calling JSON_FORMAT on its values will lead to escaped strings. This leads to an NPE here because the JSON_PATH_STRING function returns a null as it will treat the escaped string argument as a literal JSON string value. This minor patch fixes this issue by avoiding re-stringifying strings in JsonUtils::objectToString.
  • Note that the above behavior is especially confusing because if the ingestion transform function was added during table creation itself, ingestion goes through fine because the argument passed to the JSON_FORMAT function in this case would be the raw HashMap that hasn't been converted to a string yet.
  • bugfix

@yashmayya yashmayya force-pushed the jsonformat-restringify branch from 02f91b1 to 56f0747 Compare May 7, 2024 10:56
@codecov-commenter
Copy link

codecov-commenter commented May 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.16%. Comparing base (59551e4) to head (56f0747).
Report is 413 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13097      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2514      +78     
  Lines        133233   137790    +4557     
  Branches      20636    21319     +683     
============================================
+ Hits          82274    85657    +3383     
- Misses        44911    45739     +828     
- Partials       6048     6394     +346     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 35.31% <100.00%> (-26.40%) ⬇️
java-21 62.05% <100.00%> (+0.42%) ⬆️
skip-bytebuffers-false 62.14% <100.00%> (+0.39%) ⬆️
skip-bytebuffers-true 62.02% <100.00%> (+34.30%) ⬆️
temurin 62.16% <100.00%> (+0.41%) ⬆️
unittests 62.16% <100.00%> (+0.41%) ⬆️
unittests1 46.84% <100.00%> (-0.05%) ⬇️
unittests2 27.74% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya marked this pull request as ready for review May 7, 2024 11:50
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the correct behavior.
JSON_FORMAT should be able to convert any object into JSON. If the value itself is abc, then the proper JSON version should be "abc". If user wants to generate another level of json over a JSON string, we should still allow that, and the new JSON is also valid

@Jackie-Jiang
Copy link
Contributor

I don't fully follow the example in the description. If the value is already JSON string, user shouldn't call JSON_FORMAT again

@yashmayya
Copy link
Contributor Author

I'm not sure if this is the correct behavior.
JSON_FORMAT should be able to convert any object into JSON. If the value itself is abc, then the proper JSON version should be "abc".

That makes sense, I hadn't considered that.

If user wants to generate another level of json over a JSON string, we should still allow that, and the new JSON is also valid

Are there any valid use cases for that?

I don't fully follow the example in the description. If the value is already JSON string, user shouldn't call JSON_FORMAT again

Yeah, I agree in principle but this leads to the confusing behavior that I tried to document in the PR description. Let me try again below.


Scenario 1

  • There is a table with two columns in its schema - data (JSON) and alias (STRING). alias is a derived column using the following ingestion transform function: JSON_PATH_STRING(JSON_FORMAT(data), '$.alias').
  • A sample value is: {"data":{"alias":"student1","age":24,"name":{"full.name":"Peter1","nick.name":"Pete1"}}}.
  • This works as expected and the above value is ingested fine, with the alias column being set to student1.
  • In this case, the JSON_FORMAT function is called on the Map value representing data during ingestion, which results in the JSON string and that works fine with the JSON_PATH_STRING function.

Scenario 2

  • There is a table with one column in its schema - data (JSON). A sample value that has already been ingested is {"data":{"alias":"student1","age":24,"name":{"full.name":"Peter1","nick.name":"Pete1"}}}.
  • Now, the user wants to add a new derived column alias with the same logic as the above scenario. Now, the user would expect that the same ingestion transform function should work here as well with segment reload.
  • If the user updates the table config to add the ingestion transform function: JSON_PATH_STRING(JSON_FORMAT(data), '$.alias') (and also updates the schema to add the new alias column), the segment reload fails with an NPE here.
  • The reason is that in this case, since the data has already been ingested and stored in a segment, the value that is passed to the JSON_FORMAT function in this case is a String (rather than a Map as in the above scenario) since JSON columns are stored as strings internally. This results in an escaped string value that is treated as a literal JSON string value by the JSON_PATH_STRING function thus resulting in an unexpected null value being returned.

Is this working as expected and documented somewhere? Or should we solve this issue in a different way to what this PR was attempting?

@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented May 10, 2024

In the given example, the problem is actually from the recursive call of data = JSON_FORMAT(data). I don't think Pinot allows this (maybe I was wrong).
It should work as expected if we configure the ingestion transform to be:

  • data_json = JSON_FORMAT(data)
  • alias = JSON_PATH_STRING(data_json, '$,alias')

@yashmayya yashmayya closed this May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants