Changes in the Json Transform #1134

Merged
snuyanzin merged 13 commits into datafaker-net:main from eliasnogueira:json-transform
Apr 4, 2024

Conversation

@eliasnogueira
Contributor

Description

This PR changes JsonTransformer so that it generates valid JSON (except when commas between objects are disabled). The output is always valid when one or more records are generated.

Added a prettyPrint() method to enable more readable output.

Changed

  • Added the prettyPrint() method to the JsonTransformer class, along with the json library
    • It is enabled by default and outputs the JSON value in a pretty-printed format
    • When disabled, a one-line String is generated
    • When using withCommaBetweenObjects, prettyPrint() is "disabled" (set to false) to avoid a parse error
      • the parse error occurs when there is no comma between multiple items
  • Refactored JsonTest, FakeStreamTest, and FakeCollectionTest for methods using the JsonTransformer class

@what-the-diff

what-the-diff bot commented Mar 22, 2024

PR Summary

  • New Library Addition
    Included a new library: org.json to help with JSON data manipulation in pom.xml.
  • Update in JsonTransformer Class
    Several enhancements in the JsonTransformer class to improve the way we generate JSON strings. This includes the capability to use a 'comma' between objects and accommodating indentation for easier readability (pretty printing feature). It also got optimized by utilizing StringBuilder to build strings more efficiently.
  • Updates in Testing Classes
    Changes were made in JsonTest.java to align with the updates in JsonTransformer.java. New test methods have been introduced, including a special one to test the new pretty printing feature - to ensure we can format returned JSON data nicely. Changes were also made in FakeCollectionTest.java and FakeStreamTest.java to correlate with modifications in the JsonTransformer constructor.

import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.*;
Collaborator

Please restore the specific imports

Contributor Author

Fixed

if (i < fields.length - 1) {
sb.append(", ");
}
if (i < fields.length - 1) sb.append(", ");
Collaborator

This is syntactically correct, but being consistent in the use of braces is better IMO

Contributor Author

Rolled back.

private final char[] wrappers;
private static final char[] WRAPPERS = "[]".toCharArray();
private final boolean commaBetweenObjects;
private final boolean prettPrint;
Collaborator

With a "y"?

Contributor Author

Fixed

pom.xml Outdated
locale. Turkish is a notoriously tricky locale. -->
<surefire.argline>@{argLine} -Duser.language=TR -Duser.country=tr</surefire.argline>
<kotlin.version>1.9.23</kotlin.version>
<json.version>20240303</json.version>
Contributor

Besides the pretty printing, what do you need this library for?

Contributor Author

Directly for the pretty print. Indirectly to ensure that the JSON output is valid when using the JSONObject and JSONArray classes.

Comment on lines -160 to -170
// map.put('\u0008', "\\u0008");
// covered by map.put('\b', "\\b");
// map.put('\u0009', "\\u0009");
// covered by map.put('\t', "\\t");
// map.put((char) 10, "\\u000A");
// covered by map.put('\n', "\\n");
Map.entry('\u000B', "\\u000B"),
// map.put('\u000C', "\\u000C");
// covered by map.put('\f', "\\f");
// map.put((char) 13, "\\u000D");
// covered by map.put('\r', "\\r");
Collaborator

@snuyanzin snuyanzin Mar 23, 2024

I don't think we need to remove it.
It clarifies why we don't have map.put('\u0008', "\\u0008"); and the others.

Otherwise this question will pop up every time.

Contributor Author

Reverting to the previous state.

Wouldn't it be better to add this as a comment in the method instead of as commented-out code?
Something like this:

    /*
     * The following entries are not added to the map because they are covered by another existing one
     *
     * map.put('\u0008', "\\u0008"); -> covered by map.put('\b', "\\b");
     * map.put('\u0009', "\\u0009"); -> covered by map.put('\t', "\\t");
     * map.put((char) 10, "\\u000A"); -> covered by map.put('\n', "\\n");
     * map.put('\u000C', "\\u000C"); -> covered by map.put('\f', "\\f");
     * map.put((char) 13, "\\u000D"); -> covered by map.put('\r', "\\r");
     */
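
For readers following along, here is a minimal sketch of why those entries are redundant. It assumes the table is built with Map.ofEntries (the diff only shows Map.entry, so that part is a guess); the class name JsonEscapeSketch and the trimmed-down table are hypothetical and are not Datafaker's actual code. In Java, '\b' and code point 8 are the same char, so a second entry for it would be a duplicate key.

```java
import java.util.Map;

final class JsonEscapeSketch {
    // Hypothetical, trimmed-down escape table; not Datafaker's actual map.
    static final Map<Character, String> ESCAPES = Map.ofEntries(
        Map.entry('"', "\\\""),
        Map.entry('\\', "\\\\"),
        Map.entry('\b', "\\b"),          // same char as code point 8
        Map.entry('\t', "\\t"),          // same char as code point 9
        Map.entry('\n', "\\n"),          // same char as code point 10
        Map.entry('\f', "\\f"),          // same char as code point 12
        Map.entry('\r', "\\r"),          // same char as code point 13
        Map.entry((char) 11, "\\u000B")  // vertical tab has no short escape
    );

    // Replaces every char that has an entry in ESCAPES; other chars pass through.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            String replacement = ESCAPES.get(c);
            sb.append(replacement != null ? replacement : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\tb" + (char) 11 + "c"));
    }
}
```

With Map.ofEntries, adding a second entry for code point 8 next to '\b' would throw IllegalArgumentException for the duplicate key, which is one more reason to keep the explanation as a comment rather than as live entries.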

Comment on lines +73 to +87

if (limit == 1) {
sb.append(apply(null, schema, 0));
return sb.toString();
} else {
for (int i = 0; i < limit; i++) {
sb.append(apply(null, schema, i));
if (commaBetweenObjects && i < limit - 1) sb.append(",");
}
}
return limit > 1 ? wrappers[0] + LINE_SEPARATOR + sb + LINE_SEPARATOR + wrappers[1] : sb.toString();

String result = "" + WRAPPERS[0] + sb + WRAPPERS[1];

return prettPrint ? result.startsWith("{")
? new JSONObject(result).toString(2) : new JSONArray(result).toString(2) : result;
Collaborator

This is a breaking change,
and it is not clear why we need it.

The main idea was that the output is in JSON Lines format (https://jsonlines.org/).

The main feature is that it can easily be copied and pasted into BigQuery, as stated in their docs (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json).

Now I don't see how that would be possible.
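
To make the two shapes under discussion concrete, here is a minimal sketch contrasting JSON Lines with a single JSON array. The class and method names are hypothetical and this is not the JsonTransformer implementation; it assumes each record has already been serialized to a one-line JSON object.

```java
import java.util.List;

final class JsonLinesSketch {
    // JSON Lines: one top-level JSON value per line, no brackets, no commas between records.
    static String toJsonLines(List<String> records) {
        return String.join("\n", records);
    }

    // Regular JSON: records separated by commas and wrapped in a single array.
    static String toJsonArray(List<String> records) {
        return "[" + String.join(",", records) + "]";
    }

    public static void main(String[] args) {
        List<String> records = List.of("{\"name\": \"Alice\"}", "{\"name\": \"Bob\"}");
        System.out.println(toJsonLines(records));
        System.out.println(toJsonArray(records));
    }
}
```

The first form can be consumed line by line by tools that expect newline-delimited JSON; the second is one valid JSON document but is no longer one entity per line.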

Contributor Author

I understand your concern @snuyanzin, and I do agree it's a breaking change as the way to generate large datasets is now missing.

First, the "regular" JSON output was wrong, as it did not follow the correct format based on JSON semantics. See JSON.
The output was always an object, even when the output was an array, and this change fixed it.

Now, from my point of view, the way this feature was created is biased toward only one usage, which might not be the common usage for Datafaker consumers. Datafaker delivered this feature with the goal of generating JSON output to be processed by large external datasets or tools, like logging systems and the mentioned BigQuery example. And it's awesome!

As an end user, my first assumption is that Datafaker will generate valid JSON for most use cases.
Nonetheless, it wasn't creating valid JSON for any other purpose. #1109 sheds some light on it.

I think that a decision should be made about which format Datafaker will support.
It can be one, it can be both.

Maybe a new builder in the JsonTransformerBuilder can be added to take care of the large dataset output.

I will follow any decision you guys make, but with the note that the current approach does not seem valid for most Datafaker users.

Collaborator

@snuyanzin snuyanzin Mar 23, 2024

Maybe I was not clear enough, sorry.
My point is not about valid vs. invalid JSON.

Fixing the JSON is OK.

The problem is that here we change the output format: now everything is either on one line or pretty-printed.

At the same time, JSON Lines assumes one top-level entity per line, and I don't know how we can have that with the proposed solution.

So, do we really need this JSON library? I guess we could fix the invalid JSON issue without breaking the existing approach.

Contributor Author

Sorry for my misunderstanding.

The library is not that important if we make sure the JSON is correctly created. I intended to have this as a double-check, plus the pretty output.

I would say that we can keep everything, making pretty printing non-default. By the way, when we apply the approach without commas, which is the JSON Lines solution, pretty printing is not applied, but I have changed the code to produce it on one line.

Can I propose to remove the one-line approach and keep the pretty output?

Contributor

Ideally, I would like the library to be removed. If that means no pretty printing, then I don't mind, I think that's a nice to have, and I don't think that's the responsibility of Datafaker.

Collaborator

@snuyanzin snuyanzin Mar 24, 2024

The problem with pretty print: measurements show that it is about 3 times slower for 10 and 100 entities, and probably for other amounts as well

Benchmark                              Mode  Cnt     Score    Error   Units
DatafakerSimpleMethods.json1          thrpt   10  2517.912 ± 43.562  ops/ms
DatafakerSimpleMethods.json10         thrpt   10   252.746 ±  3.304  ops/ms
DatafakerSimpleMethods.json100        thrpt   10    25.772 ±  0.437  ops/ms
DatafakerSimpleMethods.json100Pretty  thrpt   10     8.306 ±  0.013  ops/ms
DatafakerSimpleMethods.json10Pretty   thrpt   10    77.777 ±  2.426  ops/ms
DatafakerSimpleMethods.json1Pretty    thrpt   10  2513.785 ± 14.620  ops/ms
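
As a quick check of the "3 times slower" reading, the slowdown factor can be computed as plain throughput divided by pretty-print throughput, using the scores from the table above. The class name below is hypothetical and only does the arithmetic; it does not re-run the benchmark.

```java
import java.util.Locale;

final class PrettyPrintSlowdown {
    // Throughput is in ops/ms, so a higher score is faster; the ratio is the slowdown factor.
    static double slowdown(double plainOpsPerMs, double prettyOpsPerMs) {
        return plainOpsPerMs / prettyOpsPerMs;
    }

    public static void main(String[] args) {
        // Scores copied from the benchmark table in this thread.
        System.out.println(String.format(Locale.ROOT, "1 entity:     %.2fx", slowdown(2517.912, 2513.785)));
        System.out.println(String.format(Locale.ROOT, "10 entities:  %.2fx", slowdown(252.746, 77.777)));
        System.out.println(String.format(Locale.ROOT, "100 entities: %.2fx", slowdown(25.772, 8.306)));
    }
}
```

For a single entity the two scores are nearly identical, while for 10 and 100 entities the factor comes out slightly above 3.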

Contributor Author

@eliasnogueira eliasnogueira Mar 24, 2024

No problem.
I will remove the library and other changes, keeping the original fix, in a new PR.

@codecov-commenter

codecov-commenter commented Mar 23, 2024

Codecov Report

Attention: Patch coverage is 75.00000%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 92.46%. Comparing base (b37c566) to head (593eb8c).
Report is 42 commits behind head on main.

Files Patch % Lines
...net/datafaker/transformations/JsonTransformer.java 75.00% 1 Missing and 2 partials ⚠️

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1134      +/-   ##
============================================
+ Coverage     92.35%   92.46%   +0.11%     
- Complexity     2821     2839      +18     
============================================
  Files           292      293       +1     
  Lines          5609     5617       +8     
  Branches        599      599              
============================================
+ Hits           5180     5194      +14     
+ Misses          275      271       -4     
+ Partials        154      152       -2     

@kingthorin
Collaborator

Don't open a new PR, just keep pushing to this branch.

}

StringJoiner data = new StringJoiner(LINE_SEPARATOR);
StringBuilder sb = new StringBuilder();
Collaborator

Why do we need to remove the usage of the line separator here?

I agree that there might be cases where it is preferable to get the result without it; however, sometimes it is better to have it.

If you really need the result without it, then I would suggest adding a dedicated method to the JSON transformer builder to set the line separator; then both the old and new behavior will be supported.
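
A rough sketch of what such a builder option could look like follows. Everything here is hypothetical: SeparatorSketch, withLineSeparator, and join are illustrative names and not part of JsonTransformer or its builder. The default keeps the platform line separator, so existing callers would see the old behavior.

```java
import java.util.List;
import java.util.StringJoiner;

final class SeparatorSketch {
    private final String separator;

    private SeparatorSketch(String separator) {
        this.separator = separator;
    }

    static Builder builder() {
        return new Builder();
    }

    // Joins already-serialized records with the configured separator.
    String join(List<String> records) {
        StringJoiner joiner = new StringJoiner(separator);
        records.forEach(joiner::add);
        return joiner.toString();
    }

    static final class Builder {
        // Default preserves the old behavior: one record per line.
        private String separator = System.lineSeparator();

        // Hypothetical option: callers who want single-line output pass "".
        Builder withLineSeparator(String separator) {
            this.separator = separator;
            return this;
        }

        SeparatorSketch build() {
            return new SeparatorSketch(separator);
        }
    }
}
```

For example, SeparatorSketch.builder().build() would keep one record per line, while SeparatorSketch.builder().withLineSeparator("").build() would emit everything on a single line.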

@eliasnogueira
Contributor Author

@snuyanzin @kingthorin @bodiam Finally the changes are here!
I've removed the JSON library along with the pretty print, and the text block usage (it's causing issues on Java 22 + Windows).

All tests green!

Collaborator

@snuyanzin snuyanzin left a comment

Thanks for the contribution and for addressing feedback

@snuyanzin snuyanzin merged commit bff89ed into datafaker-net:main Apr 4, 2024
@kingthorin
Collaborator

Sorry I missed this somehow.

@snuyanzin thanks for approving/merging!

5 participants