Changes in the Json Transform#1134
Conversation
PR Summary
|
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Random; | ||
| import java.util.*; |
There was a problem hiding this comment.
Please restore the specific imports
| if (i < fields.length - 1) { | ||
| sb.append(", "); | ||
| } | ||
| if (i < fields.length - 1) sb.append(", "); |
There was a problem hiding this comment.
This is syntactically correct but being consistent in use of braces is better IMO
| private final char[] wrappers; | ||
| private static final char[] WRAPPERS = "[]".toCharArray(); | ||
| private final boolean commaBetweenObjects; | ||
| private final boolean prettPrint; |
pom.xml
Outdated
| locale. Turkish is a notoriously tricky locale. --> | ||
| <surefire.argline>@{argLine} -Duser.language=TR -Duser.country=tr</surefire.argline> | ||
| <kotlin.version>1.9.23</kotlin.version> | ||
| <json.version>20240303</json.version> |
There was a problem hiding this comment.
Besides the pretty printing, what do you need this library for?
There was a problem hiding this comment.
Directly for the pretty print. Indirectly to ensure that the JSON output is valid when using the JSONObject and JSONArray classes.
| // map.put('\u0008', "\\u0008"); | ||
| // covered by map.put('\b', "\\b"); | ||
| // map.put('\u0009', "\\u0009"); | ||
| // covered by map.put('\t', "\\t"); | ||
| // map.put((char) 10, "\\u000A"); | ||
| // covered by map.put('\n', "\\n"); | ||
| Map.entry('\u000B', "\\u000B"), | ||
| // map.put('\u000C', "\\u000C"); | ||
| // covered by map.put('\f', "\\f"); | ||
| // map.put((char) 13, "\\u000D"); | ||
| // covered by map.put('\r', "\\r"); |
There was a problem hiding this comment.
I don't think we need to remove it
this clarifies why we don't have map.put('\u0008', "\\u0008"); and others
otherwise each time this question will pop up
There was a problem hiding this comment.
Reverting to the previous state.
Wouldn't be better to add this as a comment in the method instead of commented code?
Something like this:
/*
* The following entries are not added to the map because they are covered by another existing one
*
* map.put('\u0008', "\\u0008"); -> covered by map.put('\b', "\\b");
* map.put('\u0009', "\\u0009"); -> covered by map.put('\t', "\\t");
* map.put((char) 10, "\\u000A"); -> covered by map.put('\n', "\\n");
* map.put('\u000C', "\\u000C"); -> covered by map.put('\f', "\\f");
* map.put((char) 13, "\\u000D"); -> covered by map.put('\r', "\\r");
*/|
|
||
| if (limit == 1) { | ||
| sb.append(apply(null, schema, 0)); | ||
| return sb.toString(); | ||
| } else { | ||
| for (int i = 0; i < limit; i++) { | ||
| sb.append(apply(null, schema, i)); | ||
| if (commaBetweenObjects && i < limit - 1) sb.append(","); | ||
| } | ||
| } | ||
| return limit > 1 ? wrappers[0] + LINE_SEPARATOR + sb + LINE_SEPARATOR + wrappers[1] : sb.toString(); | ||
|
|
||
| String result = "" + WRAPPERS[0] + sb + WRAPPERS[1]; | ||
|
|
||
| return prettPrint ? result.startsWith("{") | ||
| ? new JSONObject(result).toString(2) : new JSONArray(result).toString(2) : result; |
There was a problem hiding this comment.
This is a breaking change
and this is not clear why we need it.
The main idea behind was that the output was in JSON Lines format (https://jsonlines.org/)
And the main feature is that it could be easily copied and pasted into BigQuery as stated in their doc (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json)
Now i don't see how it could be possible
There was a problem hiding this comment.
I understand your concern @snuyanzin, and I do agree it's a breaking change as the way to generate large datasets is now missing.
First, the "regular" JSON output was wrong as it wasn't taking the correct format, based on its semantics. See JSON.
The output was, all the time, an object even when the output was an array, and this change fixed it.
Now, from my point of view, the way this feature was created this biased in only one usage, which might be not the common usage for the Datafaker consumers. Datafaker has delivered this feature thinking about generating JSON output to be processed for large external datasets or tools, like logging systems and the mentioned BigQuery example. And it's awesome!
As an end user, my first way of thinking is that Datafaker will generate a valid JSON for most of the use cases.
Nonetheless, it wasn't creating a valid JSON for any other purpose. The #1109 puts some light on it.
I think that a decision should be made, in terms of which format Datafaker will support it.
It can be one, it can be both.
Maybe a new builder in the JsonTransformerBuilder can be added to take care of the large dataset output.
I will follow any decision you guys make, but with the note that the current approach seems not valid for most Datafaker users.
There was a problem hiding this comment.
May be i was not clear enough, sorry
My point is not about valid not valid json
Fixing json is ok
The problem is that here we change output format: now there either everything in one line or pretty
at the same time jsonline assumes that there should be one top level entity per line and I don't know how can we have it with the proposed solution.
So, do we really need this json lib usage? I guess we could fix wrong json issue without breaking existing approach
There was a problem hiding this comment.
Sorry for my misunderstanding.
The library is not that important if we make sure the JSON is correctly created. I intended to have this as a double-check, plus the pretty output.
I would say that we can keep all, making the pretty not default. By the way, when we apply the approach without commas, which is the JSON line solution, the pretty is not applied, but I have changed the code to have it in one line.
Can I propose to remove the one-line approach and keep the pretty output?
There was a problem hiding this comment.
Ideally, I would like the library to be removed. If that means no pretty printing, then I don't mind, I think that's a nice to have, and I don't think that's the responsibility of Datafaker.
There was a problem hiding this comment.
The problem with pretty print: measurement shows that it is 3 times slower for 10, 1000 entities and probably other amounts as well
Benchmark Mode Cnt Score Error Units
DatafakerSimpleMethods.json1 thrpt 10 2517.912 ± 43.562 ops/ms
DatafakerSimpleMethods.json10 thrpt 10 252.746 ± 3.304 ops/ms
DatafakerSimpleMethods.json100 thrpt 10 25.772 ± 0.437 ops/ms
DatafakerSimpleMethods.json100Pretty thrpt 10 8.306 ± 0.013 ops/ms
DatafakerSimpleMethods.json10Pretty thrpt 10 77.777 ± 2.426 ops/ms
DatafakerSimpleMethods.json1Pretty thrpt 10 2513.785 ± 14.620 ops/ms
There was a problem hiding this comment.
No problem.
I will remove the library and other changes, keeping the original fix, in a new PR.
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1134 +/- ##
============================================
+ Coverage 92.35% 92.46% +0.11%
- Complexity 2821 2839 +18
============================================
Files 292 293 +1
Lines 5609 5617 +8
Branches 599 599
============================================
+ Hits 5180 5194 +14
+ Misses 275 271 -4
+ Partials 154 152 -2 ☔ View full report in Codecov by Sentry. |
|
Don't open a new PR, just keep pushing to this branch. |
| } | ||
|
|
||
| StringJoiner data = new StringJoiner(LINE_SEPARATOR); | ||
| StringBuilder sb = new StringBuilder(); |
There was a problem hiding this comment.
Why do we need to remove usage of line separator here?
I agree that there might be cases when it is preferred to get result without it however sometimes it is better to have it.
If you really need to have result without it then i would suggest to add dedicated method to json transformer builder to set lineseparator, then both old and new behavior will be supported
|
@snuyanzin @kingthorin @bodiam Finally the changes are here! All tests green! |
snuyanzin
left a comment
There was a problem hiding this comment.
Thanks for the contribution and for addressing feedback
|
Sorry I missed this somehow. @snuyanzin thanks for approving/merging! |
Description
This PR changes the
JsonTransformerby generating the output as valid JSON data (except by the without comma approach). The output will always be valid when one or more data is generated.Added extra
prettyPrint()method to enable a more readable output.Changed
prettyPrint()method intoJsonTransformerclass along with thejsonlibrarywithCommaBetweenObjects, theprettyPrint()is "disabled" (set tofalse) due to a parse errorJsonTest,FakeStreamTest, andFakeCollectionTestfor methods using theJsonTransformerclass