Add clean illegal characters mode to json parser. #651

smdmts · 2017-05-28T13:14:22Z

A JSON which including broken-encoded character is throwing an exception from Jackson because that is supported pure JSON specification in below reasons.

Charset autodetection fail FasterXML/jackson-core#222 (comment)

But ideally, I want to force loading the broken-encoded JSON files.
Therefore, I add clean_illegal_char mode to the JSON parser and that works cleansing to illegal char inside of JSON characters.

Please, check this PRs.

Stack Trace.

Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5c
 at [Source: org.embulk.spi.util.FileInputInputStream@4f26425b; line: 17821, column: 69]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3473)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3480)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._decodeUtf8_2(UTF8StreamJsonParser.java:3254)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2462)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2414)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:285)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.jsonTokenToValue(JsonParser.java:188)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.jsonTokenToValue(JsonParser.java:220)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.next(JsonParser.java:152)

muga · 2017-05-28T16:24:23Z

@smdmts Thank you for sending the PR. As you know that JSON specification supports only UTF-8, UTF-16 and UTF-32 as valid encodings, it would be useful if such non-standard JSON data could be ingested by Embulk and the JSON parser.

I'm OK to add such conversion to JSON parser plugin in embulk-standards but, I ideally think that it's good to handle such conversion by decoder plugin. How about it? @dmikurube

dmikurube · 2017-05-29T03:55:03Z

I agree with @muga. Adding such cleansing methods in every possible parser doesn't sound very reasonable to me.

Creating a new decoder plugin to cleanup the illegal characters, and apply the decoder before this JSON parser might be the cleanest way. What do you think, @smdmts?

smdmts · 2017-05-29T04:01:18Z

Hi @muga, @dmikurube
Thanks for prompt replying. I agreed that cleansing methods responsibility is better to move decoder plugin. But I have the one point of the issue what JSON's backslash(\) escape handling.

https://stackoverflow.com/questions/19176024/how-to-escape-special-characters-in-building-a-json-string

So, I want to divide this PR implementation in the bellows.

add-plugins

RemoveNonStanradUTF8BytesDecorderPlugin.java
- removing a dust of including any bytes which breaking UTF-8.

modify-plugins

JsonParserPlugin.java
- additional implementing remove unfit-backslash mode.

How do you think this Idea?

hiroyuki-sato · 2017-05-29T04:09:38Z

@smdmts

Maybe you know,
you can develop decoder as a plugin.
So you don't need add decoder into core.

embulk new java-decoder remove_nonstandard_utf8

dmikurube · 2017-05-29T04:12:12Z

@hiroyuki-sato Thanks for your follow-up. Yes, RemoveNonStanradUTF8BytesDecorderPlugin.java are to be implemented as a third-party plugin, not in Embulk's standard plugins.

@smdmts What does the "unfit backslash" mean?

smdmts · 2017-05-29T04:29:11Z

@dmikurube
The backslash with some character such as (\a) is not allowed JSON specs and if including that JACKSON is throwing an exception. But including JSON specs of the backslash with character is allowed.
e.f. \t (tab) , \n (new line) is allowed.

Therefore I want to have unfit backslash cleansing mode.

dmikurube · 2017-05-29T04:47:50Z

@smdmts Thanks! Got it. It sounds reasonable to implement that part in the standard's JSON parser.

How about making this config as follows?
invalid_string_escapes: PASSTHROUGH | SKIP | UNESCAPE (default: PASSTHROUGH)

PASSTHROUGH to do nothing. (Default)
SKIP to remove these invalid escapes. (Wanted in this PR)
UNESCAPE to convert invalid escapes (like \a) just to remove \ (like just a). (Not necessary in this PR. Just for future extension.)

dmikurube · 2017-05-29T04:49:53Z

Just a memo: https://tools.ietf.org/html/rfc7159#section-7

smdmts · 2017-05-29T15:27:12Z

@dmikurube Thank you for your reviewing. Would you mind tell me more details and that behavior is in the below?

SKIP \a convertTo `` (empty value)
UNESCAPE \a convertTo a

dmikurube · 2017-05-30T03:56:54Z

@smdmts Thanks for taking care of this. I think:

SKIP: Yes, that's what I meant. Maybe nice to refer the RFC for characters to be passed.
UNESCAPE: It is not necessary for the time being, but your understanding is correct. I showed this to represent just true/false may not be sufficient.

dmikurube · 2017-05-30T04:01:53Z

Ah, you've already implemented. Thanks! Taking a look...

hiroyuki-sato · 2017-05-30T04:16:23Z

@dmikurube
All other configurations use lowercase.
So I think it is better to use lowercase these settings for consistency.
Likeskip, passthrough and unescape.
@smdmts is implementing it with uppercase.

How do you think?

dmikurube · 2017-05-30T04:26:58Z

@smdmts Left some comments. Most of them are just style nitpicking comments.

@hiroyuki-sato Ah, it might be an implicit consensus. We've conventionally used uppercases for kinds of CONSTANTS. For example, CRLF in CSV parser and MINIMAL for CSV formatter.

dmikurube · 2017-05-30T04:03:27Z

embulk-standards/src/test/java/org/embulk/standards/TestJsonParserPlugin.java

-import org.embulk.spi.FileInput;
-import org.embulk.spi.ParserPlugin;
-import org.embulk.spi.Schema;
+import org.embulk.spi.*;


Please avoid wildcard imports.

Sorry, I applied Airlift-style in my IDE.

dmikurube · 2017-05-30T04:03:35Z

embulk-standards/src/test/java/org/embulk/standards/TestJsonParserPlugin.java

-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
+import java.io.*;


dmikurube · 2017-05-30T04:03:41Z

embulk-standards/src/test/java/org/embulk/standards/TestJsonParserPlugin.java

 import java.util.List;
 import java.util.Map;

+import static org.embulk.standards.JsonParserPlugin.InvalidEscapeStringPolicy.*;


dmikurube · 2017-05-30T04:09:40Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

-            throws IOException
-    {
-        return new JsonParser().open(in);
+    private JsonParser.Stream newJsonStream(FileInputInputStream in , PluginTask task)


style nitpick: no space before comma

dmikurube · 2017-05-30T04:11:28Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

+            final Pattern p = Pattern.compile("\\p{XDigit}+");
+            @Override
+            public CharSource apply(@Nullable String input) {
+                assert input != null;


We often used com.google.common.base.Preconditions instead of assert. It's fine, though.

dmikurube · 2017-05-30T04:18:24Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

+                            }
+                        }
+                    } else {
+                        s.append(c);


Do we keep \\ at the end of line?

No, if \\ has at the end of line , it will removed at SKIP or UNESCAPE. I will check test and fix it.

{"a":"b"} \\<EOL>

convert to:

{"a":"b"} <EOL>

Got it. Sounds reasonable both for SKIP and UNESCAPE.

dmikurube · 2017-05-30T04:19:28Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

+                                    s.append(c);
+                                    break;
+                                case 'u': // hexstring such as \u0001
+                                    if (charArray.length > i + 5) {


How do you think it should work for: \\u12<EOL> and \\u12xY?

I think it could not be handling for remove \\u12xYof only \\u12. Because next character may be a valid string and if removed that it break some context in the json.
So, my implementation is removed the only backslash like u12xY or u12<EOL>.

@smdmts Thanks. It's perfect for UNESCAPE, but it may be worth considering SKIP.

Only from SKIP's simple definition like "remove these invalid escapes", behavior for \\u may not be trivial. For example, it can be still natural expectation for some users that \\u12 is removed from \\u12xY or \\u12<EOL>.

Reasonable expectations for \\u... on SKIP may be:

Only \\u is always removed.

\\u1234 ==> 1234 / \\u12xY ==> 12xY

Whole \\uXXXX is removed if valid. Only \\ is removed if invalid. (This PR)

\\u1234 ==> (empty) / \\u12xY ==> u12xY

Whole \\uXXXX is removed if valid. Only \\u is removed if invalid.

\\u1234 ==> (empty) / \\u12xY ==> 12xY

Whole \\uX... is removed even if the length of its valid part is shorter than 4.

\\u1234 ==> (empty) / \\u12xY ==> xY

I thought 2 is not very consistent with SKIP's definition to be honest. What do you think about 1 or 4?

@dmikurube
Sorry, my thought did not reach 1 or 4 and it's better than 2.
In my opinion, SKIP mode would be 1.

Because In 4 mode, if 12xY of "12" is a valid part of a sentence, it will breaking removing "12".
How do you think this idea?

dmikurube · 2017-05-30T04:21:05Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

@@ -63,17 +96,16 @@ public void run(TaskSource taskSource, Schema schema, FileInput input, PageOutpu
        final Column column = schema.getColumn(0); // record column

        try (PageBuilder pageBuilder = newPageBuilder(schema, output);
-                FileInputInputStream in = new FileInputInputStream(input)) {
+             FileInputInputStream in = new FileInputInputStream(input)) {


Let's get this back -- changes unrelated to this topic.

dmikurube · 2017-05-30T04:21:38Z

embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

    }

+
+


Let's get these newlines back -- changes unrelated to this topic.

dmikurube · 2017-05-30T04:21:50Z

embulk-standards/src/test/java/org/embulk/standards/TestJsonParserPlugin.java

@@ -60,7 +71,7 @@ public void readNormalJson()
                        "\"_c1\":-10,\n" +
                        "\"_c2\":\"エンバルク\",\n" +
                        "\"_c3\":[\"e0\",\"e1\"]\n" +
-                "}",
+                        "}",


Let's get this back -- changes unrelated to this topic.

hiroyuki-sato · 2017-05-30T04:31:33Z

@dmikurube You are right. Thanks.

smdmts · 2017-05-30T06:07:08Z

Umm..., AppVeyor was failed none-changed code by OOM.

dmikurube · 2017-05-30T06:08:24Z

Hmm, it sometimes fails by OutOfMemory, but the failures are not OOM. We may need some investigation...

smdmts · 2017-05-30T06:12:02Z

@dmikurube Thanks a lot. I fixed your comment on my PR.

dmikurube · 2017-05-30T06:14:08Z

Ah, no, it's OOM. I'm triggering retry.

dmikurube · 2017-05-30T06:45:05Z

AppVeyor passed. :)

smdmts · 2017-06-01T11:24:56Z

I implemented SKIP mode with the non-standard character with \u is below of 1. #651 (comment).

Only \u is always removed.
\u1234 ==> 1234 / \u12xY ==> 12xY

test code
https://github.com/embulk/embulk/pull/651/files#diff-b58b1c39f56eca489cc109b6b6c6947fR253

Conflicts: embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

dmikurube · 2017-06-06T14:12:40Z

@smdmts Ahh, an urgently merged PR #661 was conflicting with this PR... Sorry about that.

Is 0afac30 still conflicting?

dmikurube · 2017-06-07T06:41:43Z

@smdmts Sorry, it was AppVeyor just failing. I've rebuilt that, and it's now succeeded. I'd merge this if @muga feels okay.

Any thoughts? > @muga

smdmts · 2017-06-07T07:58:15Z

@dmikurube I merged #661 at 0afac30 and it’s fine. Thanks,

dmikurube · 2017-06-12T05:24:57Z

@muga I'll be merging this in a day unless you have any comments.

muga · 2017-06-13T00:51:27Z

@dmikurube @smdmts Baiscally LGTM. sorry for the delay.

dmikurube

@smdmts Thanks for your contribution and sorry for the delay. I'll be merging!

dmikurube · 2017-06-13T06:21:01Z

Merged!

[Doc] add document of #651

smdmts force-pushed the add_clean_illegal_char_to_jsonparser branch 2 times, most recently from 09e44d6 to caaffb0 Compare May 28, 2017 15:43

add clean illegal char mode to json parser.

f96d1d6

smdmts force-pushed the add_clean_illegal_char_to_jsonparser branch from caaffb0 to f96d1d6 Compare May 28, 2017 15:45

add InvalidEscapeStringPolicy

a662d92

dmikurube reviewed May 30, 2017

View reviewed changes

fix reviewed.

5ad7f85

modify SKIP mode in removing backslash with u.

310a1a7

hiroyuki-sato mentioned this pull request Jun 1, 2017

Plan 0.8.23 #662

Closed

3 tasks

dmikurube mentioned this pull request Jun 1, 2017

Plan 0.8.24 #665

Closed

5 tasks

Merge remote-tracking branch 'origin/master' into HEAD

0afac30

Conflicts: embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java

muga self-requested a review June 13, 2017 00:51

muga approved these changes Jun 13, 2017

View reviewed changes

dmikurube approved these changes Jun 13, 2017

View reviewed changes

dmikurube added the enhancement New feature or request label Jun 13, 2017

dmikurube merged commit d26bd3b into embulk:master Jun 13, 2017

smdmts added a commit to smdmts/embulk that referenced this pull request Jun 17, 2017

write documents of embulk#651

0ba844b

smdmts added a commit to smdmts/embulk that referenced this pull request Jun 17, 2017

add documents of embulk#651

7f7085c

smdmts mentioned this pull request Jun 17, 2017

[Doc] add document of #651 #683

Merged

muga added a commit that referenced this pull request Jun 20, 2017

Merge pull request #683 from smdmts/add_document_651

b3408fa

[Doc] add document of #651

dmikurube mentioned this pull request Mar 2, 2021

Modernize embulk-parser-json with v0.11-ready API/SPI #1395

Merged

Add clean illegal characters mode to json parser. #651

Add clean illegal characters mode to json parser. #651

Conversation

smdmts commented May 28, 2017

muga commented May 28, 2017

dmikurube commented May 29, 2017

smdmts commented May 29, 2017 • edited

add-plugins

modify-plugins

hiroyuki-sato commented May 29, 2017

dmikurube commented May 29, 2017

smdmts commented May 29, 2017 • edited

dmikurube commented May 29, 2017 • edited

dmikurube commented May 29, 2017

smdmts commented May 29, 2017

dmikurube commented May 30, 2017

dmikurube commented May 30, 2017

hiroyuki-sato commented May 30, 2017

dmikurube commented May 30, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smdmts May 30, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smdmts May 30, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smdmts May 31, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hiroyuki-sato commented May 30, 2017

smdmts commented May 30, 2017

dmikurube commented May 30, 2017

smdmts commented May 30, 2017

dmikurube commented May 30, 2017

dmikurube commented May 30, 2017

smdmts commented Jun 1, 2017

dmikurube commented Jun 6, 2017

dmikurube commented Jun 7, 2017

smdmts commented Jun 7, 2017

dmikurube commented Jun 12, 2017

muga commented Jun 13, 2017

dmikurube left a comment

Choose a reason for hiding this comment

dmikurube commented Jun 13, 2017

smdmts commented May 29, 2017 •

edited

smdmts commented May 29, 2017 •

edited

dmikurube commented May 29, 2017 •

edited

smdmts May 30, 2017 •

edited

smdmts May 30, 2017 •

edited

smdmts May 31, 2017 •

edited