Add clean illegal characters mode to json parser. #651
@smdmts Thank you for sending the PR. As you know, the JSON specification supports only UTF-8, UTF-16, and UTF-32 as valid encodings, so it would be useful if such non-standard JSON data could be ingested by Embulk and the JSON parser. I'm OK with adding such a conversion to the JSON parser plugin in embulk-standards, but ideally I think it is better to handle the conversion in a decoder plugin. What do you think? @dmikurube |
Hi @muga, @dmikurube. I'd like to split this PR's implementation as follows:
add-plugins
modify-plugins
What do you think of this idea? |
Maybe you know,
|
@hiroyuki-sato Thanks for your follow-up. Yes, @smdmts What does "unfit backslash" mean? |
@dmikurube Therefore, I want to add a mode that cleanses unfit backslashes. |
@smdmts Thanks! Got it. It sounds reasonable to implement that part in the standard's JSON parser. How about making this config as follows?
|
Just a memo: https://tools.ietf.org/html/rfc7159#section-7 |
@dmikurube Thank you for reviewing. Would you mind telling me more details? Is the behavior as described below?
|
@smdmts Thanks for taking care of this. I think:
|
Ah, you've already implemented it. Thanks! Taking a look... |
@dmikurube What do you think? |
@smdmts Left some comments. Most of them are just style nitpicks. @hiroyuki-sato Ah, it might be an implicit consensus. We've conventionally used uppercase for constant-like values. For example, |
import org.embulk.spi.FileInput;
import org.embulk.spi.ParserPlugin;
import org.embulk.spi.Schema;
import org.embulk.spi.*;
Please avoid wildcard imports.
Sorry, my IDE applied the Airlift code style.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.*;
ditto.
import java.util.List;
import java.util.Map;

import static org.embulk.standards.JsonParserPlugin.InvalidEscapeStringPolicy.*;
ditto.
throws IOException
{
return new JsonParser().open(in);
private JsonParser.Stream newJsonStream(FileInputInputStream in , PluginTask task)
style nitpick: no space before comma
final Pattern p = Pattern.compile("\\p{XDigit}+");
@Override
public CharSource apply(@Nullable String input) {
assert input != null;
We've often used com.google.common.base.Preconditions instead of assert. It's fine, though.
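For readers comparing the two styles, here is a minimal sketch of the difference. It uses java.util.Objects.requireNonNull from the standard library as a stand-in so the example has no dependencies; Guava's Preconditions.checkNotNull(reference, errorMessage) behaves analogously and also throws NullPointerException. The class and method names are hypothetical and are not taken from this PR.

```java
import java.util.Objects;

// Hypothetical illustration; not code from this PR.
class NullCheckSketch {
    // `assert` is only evaluated when the JVM runs with -ea (assertions enabled),
    // so in a default production run this check is skipped entirely.
    static int lengthWithAssert(String input) {
        assert input != null;
        return input.length();
    }

    // An explicit precondition check always runs and fails with a clear message.
    static int lengthWithExplicitCheck(String input) {
        Objects.requireNonNull(input, "input must not be null");
        return input.length();
    }
}
```

With assertions disabled, lengthWithAssert(null) still fails, but only later at input.length() and without a descriptive message.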
}
}
} else {
s.append(c);
Do we keep \\ at the end of a line?
No. If \\ is at the end of a line, it will be removed in both SKIP and UNESCAPE. I will check the tests and fix it.
{"a":"b"}
\\<EOL>
is converted to:
{"a":"b"}
<EOL>
Got it. Sounds reasonable for both SKIP and UNESCAPE.
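The end-of-line case agreed on above can be sketched as follows. This is only an illustration under two assumptions: that `\\` in this thread denotes a single literal backslash, and that a trailing run of backslashes with an even length should be left alone because it forms complete escaped-backslash pairs. The second assumption is not discussed in the PR. The class and method names are hypothetical.

```java
// Hypothetical illustration; not code from this PR.
class TrailingBackslashSketch {
    // `line` is assumed to exclude its line terminator.
    static String dropDanglingBackslash(String line) {
        int n = line.length();
        int run = 0;
        // Count the contiguous backslashes at the end of the line.
        while (run < n && line.charAt(n - 1 - run) == '\\') {
            run++;
        }
        // An odd run leaves one backslash that escapes nothing, so drop it.
        return (run % 2 == 1) ? line.substring(0, n - 1) : line;
    }
}
```

Because the dangling backslash is simply dropped, the result is the same whether the policy is SKIP or UNESCAPE, which matches the conclusion of the comment above.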
s.append(c);
break;
case 'u': // hexstring such as \u0001
if (charArray.length > i + 5) {
How do you think it should work for \\u12<EOL> and \\u12xY?
I don't think we can remove just the \\u12 part of \\u12xY, because the characters after the backslash may be valid string content, and removing them could break the surrounding JSON.
So my implementation removes only the backslash, leaving u12xY or u12<EOL>.
@smdmts Thanks. It's perfect for UNESCAPE, but it may be worth reconsidering for SKIP.
Going only by SKIP's simple definition, something like "remove these invalid escapes", the behavior for \\u may not be obvious. For example, some users could naturally expect \\u12 to be removed from \\u12xY or \\u12<EOL>.
Reasonable expectations for \\u... under SKIP may be:
1. Only \\u is always removed. \\u1234 ==> 1234 / \\u12xY ==> 12xY
2. The whole \\uXXXX is removed if valid. Only \\ is removed if invalid. (This PR) \\u1234 ==> (empty) / \\u12xY ==> u12xY
3. The whole \\uXXXX is removed if valid. Only \\u is removed if invalid. \\u1234 ==> (empty) / \\u12xY ==> 12xY
4. The whole \\uX... is removed even if the length of its valid part is shorter than 4. \\u1234 ==> (empty) / \\u12xY ==> xY
To be honest, I think 2 is not very consistent with SKIP's definition. What do you think about 1 or 4?
@dmikurube
Sorry, I hadn't thought of 1 or 4; both are better than 2.
In my opinion, SKIP mode should be 1, because with 4, if the "12" in 12xY is a valid part of the text, removing it would break the content.
What do you think of this idea?
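To make the four candidate behaviors discussed above concrete, here is a minimal sketch that applies each of them to a string. It is not the PR's implementation. It assumes that `\\u` in this thread denotes a single backslash followed by `u`, it handles only that escape (every other character is copied through unchanged), and it does not treat an escaped backslash before `u` specially. The class name, method name, and the use of an integer option number are all hypothetical.

```java
// Hypothetical illustration of the four SKIP candidates; not code from this PR.
class SkipPolicySketch {
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // `option` is the number of the candidate behavior in the list above (1 to 4).
    static String skip(String in, int option) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            boolean startsUnicodeEscape =
                    c == '\\' && i + 1 < in.length() && in.charAt(i + 1) == 'u';
            if (!startsUnicodeEscape) {
                out.append(c);
                i++;
                continue;
            }
            // Count the hex digits that follow the backslash and 'u', up to 4.
            int hex = 0;
            while (hex < 4 && i + 2 + hex < in.length() && isHex(in.charAt(i + 2 + hex))) {
                hex++;
            }
            boolean valid = (hex == 4);
            switch (option) {
                case 1: i += 2; break;             // always drop the backslash and 'u' only
                case 2: i += valid ? 6 : 1; break; // whole escape if valid, else the backslash only
                case 3: i += valid ? 6 : 2; break; // whole escape if valid, else the backslash and 'u'
                case 4: i += 2 + hex; break;       // drop the escape plus any leading hex digits
                default: throw new IllegalArgumentException("unknown option: " + option);
            }
        }
        return out.toString();
    }
}
```

For the input consisting of a backslash followed by u12xY, options 1 through 4 yield 12xY, u12xY, 12xY, and xY respectively, matching the examples in the list. Option 4 is the only one that consumes digits that might have been intended as ordinary text, which is the concern raised in the reply above.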
@@ -63,17 +96,16 @@ public void run(TaskSource taskSource, Schema schema, FileInput input, PageOutpu
final Column column = schema.getColumn(0); // record column

try (PageBuilder pageBuilder = newPageBuilder(schema, output);
FileInputInputStream in = new FileInputInputStream(input)) {
FileInputInputStream in = new FileInputInputStream(input)) {
Let's get this back -- changes unrelated to this topic.
} | ||
|
||
|
||
|
Let's get these newlines back -- changes unrelated to this topic.
@@ -60,7 +71,7 @@ public void readNormalJson()
"\"_c1\":-10,\n" +
"\"_c2\":\"エンバルク\",\n" +
"\"_c3\":[\"e0\",\"e1\"]\n" +
"}",
"}",
Let's get this back -- changes unrelated to this topic.
@dmikurube You are right. Thanks. |
Hmm, AppVeyor failed with an OOM in code that I didn't change. |
Hmm, it sometimes fails with OutOfMemory, but these failures are not OOM. We may need some investigation... |
@dmikurube Thanks a lot. I've addressed your comments in my PR. |
Ah, no, it's OOM. I'm triggering retry. |
AppVeyor passed. :) |
I implemented SKIP mode with the non-standard character with
|
Conflicts: embulk-standards/src/main/java/org/embulk/standards/JsonParserPlugin.java
@dmikurube I merged #661 at 0afac30 and it’s fine. Thanks, |
@muga I'll be merging this in a day unless you have any comments. |
@dmikurube @smdmts Basically LGTM. Sorry for the delay. |
@smdmts Thanks for your contribution and sorry for the delay. I'll be merging!
Merged! |
JSON that includes broken-encoded characters causes Jackson to throw an exception, because Jackson supports only the pure JSON specification, for the reasons below.
Ideally, though, I want to force-load such broken-encoded JSON files.
Therefore, I added a clean_illegal_char mode to the JSON parser, which cleanses illegal characters inside the JSON text. Please check this PR.
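As a rough illustration of what cleansing broken-encoded input can look like, the sketch below decodes bytes as UTF-8 and silently drops any malformed sequences using the standard java.nio.charset.CharsetDecoder. It is an assumption-laden example, not the mechanism this PR uses: the class and method names are hypothetical, and the choice of UTF-8 and of dropping (rather than replacing) bad bytes is made only for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from this PR.
class IllegalCharCleanSketch {
    static String decodeDroppingMalformed(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // IGNORE drops malformed byte sequences instead of throwing.
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE; rethrown unchecked to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }
}
```

The cleaned string can then be handed to a JSON parser. Using CodingErrorAction.REPLACE instead would keep a U+FFFD marker where the bad bytes were, which makes the data loss visible downstream.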