Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clean illegal characters mode to json parser. #651

Merged
merged 5 commits into from Jun 13, 2017

Conversation

smdmts
Copy link
Contributor

@smdmts smdmts commented May 28, 2017

A JSON which including broken-encoded character is throwing an exception from Jackson because that is supported pure JSON specification in below reasons.

But ideally, I want to force loading the broken-encoded JSON files.
Therefore, I add clean_illegal_char mode to the JSON parser and that works cleansing to illegal char inside of JSON characters.

Please, check this PRs.

  • Stack Trace.
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5c
 at [Source: org.embulk.spi.util.FileInputInputStream@4f26425b; line: 17821, column: 69]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3473)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3480)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._decodeUtf8_2(UTF8StreamJsonParser.java:3254)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2462)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2414)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:285)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.jsonTokenToValue(JsonParser.java:188)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.jsonTokenToValue(JsonParser.java:220)
	at org.embulk.spi.json.JsonParser$AbstractParseContext.next(JsonParser.java:152)

@smdmts smdmts force-pushed the add_clean_illegal_char_to_jsonparser branch 2 times, most recently from 09e44d6 to caaffb0 Compare May 28, 2017 15:43
@smdmts smdmts force-pushed the add_clean_illegal_char_to_jsonparser branch from caaffb0 to f96d1d6 Compare May 28, 2017 15:45
@muga
Copy link
Contributor

muga commented May 28, 2017

@smdmts Thank you for sending the PR. As you know that JSON specification supports only UTF-8, UTF-16 and UTF-32 as valid encodings, it would be useful if such non-standard JSON data could be ingested by Embulk and the JSON parser.

I'm OK to add such conversion to JSON parser plugin in embulk-standards but, I ideally think that it's good to handle such conversion by decoder plugin. How about it? @dmikurube

@dmikurube
Copy link
Member

I agree with @muga. Adding such cleansing methods in every possible parser doesn't sound very reasonable to me.

Creating a new decoder plugin to cleanup the illegal characters, and apply the decoder before this JSON parser might be the cleanest way. What do you think, @smdmts?

@smdmts
Copy link
Contributor Author

smdmts commented May 29, 2017

Hi @muga, @dmikurube
Thanks for prompt replying. I agreed that cleansing methods responsibility is better to move decoder plugin. But I have the one point of the issue what JSON's backslash(\) escape handling.

So, I want to divide this PR implementation in the bellows.

add-plugins

  • RemoveNonStanradUTF8BytesDecorderPlugin.java
    • removing a dust of including any bytes which breaking UTF-8.

modify-plugins

  • JsonParserPlugin.java
    • additional implementing remove unfit-backslash mode.

How do you think this Idea?

@hiroyuki-sato
Copy link
Contributor

@smdmts

Maybe you know,
you can develop decoder as a plugin.
So you don't need add decoder into core.

embulk new java-decoder remove_nonstandard_utf8 

@dmikurube
Copy link
Member

@hiroyuki-sato Thanks for your follow-up. Yes, RemoveNonStanradUTF8BytesDecorderPlugin.java are to be implemented as a third-party plugin, not in Embulk's standard plugins.

@smdmts What does the "unfit backslash" mean?

@smdmts
Copy link
Contributor Author

smdmts commented May 29, 2017

@dmikurube
The backslash with some character such as (\a) is not allowed JSON specs and if including that JACKSON is throwing an exception. But including JSON specs of the backslash with character is allowed.
e.f. \t (tab) , \n (new line) is allowed.

Therefore I want to have unfit backslash cleansing mode.

@dmikurube
Copy link
Member

dmikurube commented May 29, 2017

@smdmts Thanks! Got it. It sounds reasonable to implement that part in the standard's JSON parser.

How about making this config as follows?
invalid_string_escapes: PASSTHROUGH | SKIP | UNESCAPE (default: PASSTHROUGH)

  • PASSTHROUGH to do nothing. (Default)
  • SKIP to remove these invalid escapes. (Wanted in this PR)
  • UNESCAPE to convert invalid escapes (like \a) just to remove \ (like just a). (Not necessary in this PR. Just for future extension.)

@dmikurube
Copy link
Member

@smdmts
Copy link
Contributor Author

smdmts commented May 29, 2017

@dmikurube Thank you for your reviewing. Would you mind tell me more details and that behavior is in the below?

  • SKIP \a convertTo `` (empty value)
  • UNESCAPE \a convertTo a

@dmikurube
Copy link
Member

@smdmts Thanks for taking care of this. I think:

  • SKIP: Yes, that's what I meant. Maybe nice to refer the RFC for characters to be passed.
  • UNESCAPE: It is not necessary for the time being, but your understanding is correct. I showed this to represent just true/false may not be sufficient.

@dmikurube
Copy link
Member

Ah, you've already implemented. Thanks! Taking a look...

@hiroyuki-sato
Copy link
Contributor

@dmikurube
All other configurations use lowercase.
So I think it is better to use lowercase these settings for consistency.
Likeskip, passthrough and unescape.
@smdmts is implementing it with uppercase.

How do you think?

@dmikurube
Copy link
Member

@smdmts Left some comments. Most of them are just style nitpicking comments.

@hiroyuki-sato Ah, it might be an implicit consensus. We've conventionally used uppercases for kinds of CONSTANTS. For example, CRLF in CSV parser and MINIMAL for CSV formatter.

import org.embulk.spi.FileInput;
import org.embulk.spi.ParserPlugin;
import org.embulk.spi.Schema;
import org.embulk.spi.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid wildcard imports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I applied Airlift-style in my IDE.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

import java.util.List;
import java.util.Map;

import static org.embulk.standards.JsonParserPlugin.InvalidEscapeStringPolicy.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

throws IOException
{
return new JsonParser().open(in);
private JsonParser.Stream newJsonStream(FileInputInputStream in , PluginTask task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nitpick: no space before comma

final Pattern p = Pattern.compile("\\p{XDigit}+");
@Override
public CharSource apply(@Nullable String input) {
assert input != null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We often used com.google.common.base.Preconditions instead of assert. It's fine, though.

}
}
} else {
s.append(c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we keep \\ at the end of line?

Copy link
Contributor Author

@smdmts smdmts May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if \\ has at the end of line , it will removed at SKIP or UNESCAPE. I will check test and fix it.

{"a":"b"}
\\<EOL>

convert to:

{"a":"b"}
<EOL>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Sounds reasonable both for SKIP and UNESCAPE.

s.append(c);
break;
case 'u': // hexstring such as \u0001
if (charArray.length > i + 5) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think it should work for: \\u12<EOL> and \\u12xY?

Copy link
Contributor Author

@smdmts smdmts May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could not be handling for remove \\u12xYof only \\u12. Because next character may be a valid string and if removed that it break some context in the json.
So, my implementation is removed the only backslash like u12xY or u12<EOL>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smdmts Thanks. It's perfect for UNESCAPE, but it may be worth considering SKIP.

Only from SKIP's simple definition like "remove these invalid escapes", behavior for \\u may not be trivial. For example, it can be still natural expectation for some users that \\u12 is removed from \\u12xY or \\u12<EOL>.

Reasonable expectations for \\u... on SKIP may be:

  1. Only \\u is always removed.
    • \\u1234 ==> 1234 / \\u12xY ==> 12xY
  2. Whole \\uXXXX is removed if valid. Only \\ is removed if invalid. (This PR)
    • \\u1234 ==> (empty) / \\u12xY ==> u12xY
  3. Whole \\uXXXX is removed if valid. Only \\u is removed if invalid.
    • \\u1234 ==> (empty) / \\u12xY ==> 12xY
  4. Whole \\uX... is removed even if the length of its valid part is shorter than 4.
    • \\u1234 ==> (empty) / \\u12xY ==> xY

I thought 2 is not very consistent with SKIP's definition to be honest. What do you think about 1 or 4?

Copy link
Contributor Author

@smdmts smdmts May 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmikurube
Sorry, my thought did not reach 1 or 4 and it's better than 2.
In my opinion, SKIP mode would be 1.

Because In 4 mode, if 12xY of "12" is a valid part of a sentence, it will breaking removing "12".
How do you think this idea?

@@ -63,17 +96,16 @@ public void run(TaskSource taskSource, Schema schema, FileInput input, PageOutpu
final Column column = schema.getColumn(0); // record column

try (PageBuilder pageBuilder = newPageBuilder(schema, output);
FileInputInputStream in = new FileInputInputStream(input)) {
FileInputInputStream in = new FileInputInputStream(input)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this back -- changes unrelated to this topic.

}



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get these newlines back -- changes unrelated to this topic.

@@ -60,7 +71,7 @@ public void readNormalJson()
"\"_c1\":-10,\n" +
"\"_c2\":\"エンバルク\",\n" +
"\"_c3\":[\"e0\",\"e1\"]\n" +
"}",
"}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this back -- changes unrelated to this topic.

@hiroyuki-sato
Copy link
Contributor

@dmikurube You are right. Thanks.

@smdmts
Copy link
Contributor Author

smdmts commented May 30, 2017

Umm..., AppVeyor was failed none-changed code by OOM.

@dmikurube
Copy link
Member

Hmm, it sometimes fails by OutOfMemory, but the failures are not OOM. We may need some investigation...

@smdmts
Copy link
Contributor Author

smdmts commented May 30, 2017

@dmikurube Thanks a lot. I fixed your comment on my PR.

@dmikurube
Copy link
Member

Ah, no, it's OOM. I'm triggering retry.

@dmikurube
Copy link
Member

AppVeyor passed. :)

@hiroyuki-sato hiroyuki-sato mentioned this pull request Jun 1, 2017
3 tasks
@dmikurube dmikurube mentioned this pull request Jun 1, 2017
5 tasks
@smdmts
Copy link
Contributor Author

smdmts commented Jun 1, 2017

I implemented SKIP mode with the non-standard character with \u is below of 1. #651 (comment).

  1. Only \u is always removed.
    \u1234 ==> 1234 / \u12xY ==> 12xY

Conflicts:
	embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java
@dmikurube
Copy link
Member

@smdmts Ahh, an urgently merged PR #661 was conflicting with this PR... Sorry about that.

Is 0afac30 still conflicting?

@dmikurube
Copy link
Member

@smdmts Sorry, it was AppVeyor just failing. I've rebuilt that, and it's now succeeded. I'd merge this if @muga feels okay.

Any thoughts? > @muga

@smdmts
Copy link
Contributor Author

smdmts commented Jun 7, 2017

@dmikurube I merged #661 at 0afac30 and it’s fine. Thanks,

@dmikurube
Copy link
Member

@muga I'll be merging this in a day unless you have any comments.

@muga
Copy link
Contributor

muga commented Jun 13, 2017

@dmikurube @smdmts Baiscally LGTM. sorry for the delay.

@muga muga self-requested a review June 13, 2017 00:51
Copy link
Member

@dmikurube dmikurube left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smdmts Thanks for your contribution and sorry for the delay. I'll be merging!

@dmikurube dmikurube added the enhancement New feature or request label Jun 13, 2017
@dmikurube dmikurube merged commit d26bd3b into embulk:master Jun 13, 2017
@dmikurube
Copy link
Member

Merged!

smdmts added a commit to smdmts/embulk that referenced this pull request Jun 17, 2017
smdmts added a commit to smdmts/embulk that referenced this pull request Jun 17, 2017
muga added a commit that referenced this pull request Jun 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

None yet

4 participants