[SPARK-19595][SQL] Support json array in from_json #16929
Cc @hvanhovell, could you please take a look and see if this makes sense? |
Test build #72878 has finished for PR 16929 at commit
|
also cc @marmbrus |
I agree that it's wrong to truncate, but why not just fix the handling of arrays rather than disallowing them? |
Sure, let me change it to support them. I thought disallowing was kind of a safe choice :) |
acbce26 to 7f07acf
ef783bd to 25cdd7d
@hvanhovell, @zsxwing and @marmbrus, I just updated and rebased. Could you take another look please? |
Test build #73110 has finished for PR 16929 at commit
|
Test build #73113 has finished for PR 16929 at commit
|
Test build #73111 has finished for PR 16929 at commit
|
Test build #73114 has finished for PR 16929 at commit
|
Hi @marmbrus, does this sound good to you? |
Hmm, I'm not sure we want to change this to a generator. I think that has performance consequences as well as possibly being surprising. I would probably make it possible to handle arrays (when the correct schema is given). If they want to explode they can run |
/cc @brkyvz |
Sure, let me take a look and try. |
d6fd39b to 8c48436
Let me clean up, fix any failing tests, and update the PR description soon. It is still a WIP. |
Test build #73467 has finished for PR 16929 at commit
|
2aaf609 to a0a7091
Hi @marmbrus, @brkyvz and @hvanhovell, I think it is ready for a review. |
Test build #73474 has finished for PR 16929 at commit
|
Test build #73475 has finished for PR 16929 at commit
|
Then, let me fix this as below:
|
Test build #73570 has started for PR 16929 at commit |
Test build #73572 has started for PR 16929 at commit |
Test build #73573 has started for PR 16929 at commit |
Test build #73574 has started for PR 16929 at commit |
retest this please |
Test build #73621 has finished for PR 16929 at commit
|
@transient
lazy val converter = schema match {
  case _: StructType =>
    (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
This breaks previous behavior. I would still return the first element if rows.length > 1. Feel free to push back. I also wonder what @marmbrus thinks.
I'm okay breaking previous behavior because I'd call truncating an array a bug.
We should list this in the release notes though (i.e. go tag the JIRA).
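A minimal sketch of the behavior debated in this thread, written in plain Scala without Spark. `SchemaKind`, `Row` and `convert` are hypothetical stand-ins for illustration, not Spark's actual types, and the array branch is an assumption based on the PR title and the test names proposed later in the review, not on code shown in this snippet.

```scala
// Hypothetical, simplified stand-ins for the schema and row types (not Spark's classes).
sealed trait SchemaKind
case object StructSchema extends SchemaKind
case object ArraySchema extends SchemaKind

final case class Row(values: Seq[Any])

// Same shape as the `converter` under review: parsed rows in, final value out.
def convert(schema: SchemaKind, rows: Seq[Row]): Any = schema match {
  // Struct schema: exactly one parsed row is returned; any other count yields null
  // instead of silently keeping only the first element.
  case StructSchema => if (rows.length == 1) rows.head else null
  // Array schema (assumed behavior): every parsed row is kept.
  case ArraySchema => rows
}

// As if parsed from the JSON array [{"a": 1}, {"a": 2}].
val parsed = Seq(Row(Seq(1)), Row(Seq(2)))
```

With this model, `convert(StructSchema, parsed)` gives `null` where the old behavior would have kept only the first row, which is the break in behavior that the comments above propose to list in the release notes.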
InternalRow.fromSeq(1 :: Nil) ::
InternalRow.fromSeq(2 :: Nil) :: Nil
checkEvaluation(JsonToStruct(
  schema, Map.empty, Literal(jsonData1), gmtId), expected)
Could you put the input and the expected output on different lines for readability, please?
  schema, Map.empty, Literal(jsonData1), gmtId), expected)

// json object: `Array(Row(...))`
val jsonData2 = """{"a": 1}"""
I would make each example a separate test. This way it's easier to figure out what breaks later.
e.g.
from_json - input=array, schema=array, output=array
from_json - input=object, schema=array, output=array of single object
from_json - input=empty json array, schema=array, output=empty array
...
@HyukjinKwon Implementation seems fine. Just left a cosmetic comment on your unit tests. Otherwise LGTM |
Thank you so much. Let me clean up. |
Test build #73893 has finished for PR 16929 at commit
|
should the |
I just updated the PR description to prevent confusion. |
@@ -372,6 +372,62 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
    )
  }

  test("from_json - input=array, schema=array, output=array") {
these are great! thanks!
Merging to master. Thanks! |
Thank you @brkyvz. |
What changes were proposed in this pull request?
This PR proposes both of the following:
Do not allow JSON arrays with multiple elements in from_json with StructType as the schema; return null instead.
Currently, it only reads a single row when the input is a JSON array. So, the code below:
prints
This PR simply suggests printing null instead if the schema is StructType and the input is a JSON array with multiple elements.
Support JSON arrays in from_json with ArrayType as the schema.
prints
How was this patch tested?
Unit tests in JsonExpressionsSuite and JsonFunctionsSuite, Python doctests, and a manual test.
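The two cases in the description can be tried with a sketch like the one below. It is not the PR author's original example: the input string, the column names and the `elementSchema` value are made up for illustration, and it assumes a Spark version where `from_json` accepts a `DataType` schema. No output is shown because the result for the `StructType` case depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("from_json-array-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input: one string column holding a JSON array with two elements.
val df = Seq("""[{"a": 1}, {"a": 2}]""").toDF("json")
val elementSchema = new StructType().add("a", IntegerType)

// Case 1: StructType schema against a multi-element JSON array.
// Per the description, this should no longer truncate to the first element.
val asStruct = df.select(from_json($"json", elementSchema).as("parsed"))

// Case 2: ArrayType schema, so every element of the JSON array is kept.
val asArray = df.select(from_json($"json", ArrayType(elementSchema)).as("parsed"))

// Callers who want one row per element can explode the parsed array themselves.
val exploded = asArray.select(explode($"parsed").as("element"))

asStruct.show()
asArray.show()
exploded.show()
```

Keeping `from_json` a plain expression and leaving `explode` to the caller follows the suggestion made earlier in the conversation, which preferred this over turning `from_json` into a generator.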