Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19595][SQL] Support json array in from_json #16929

Closed
wants to merge 8 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Feb 14, 2017

What changes were proposed in this pull request?

This PR proposes to both,

Do not allow json arrays with multiple elements and return null in from_json with StructType as the schema.

Currently, it only reads the single row when the input is a json array. So, the codes below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = StructType(StructField("a", IntegerType) :: Nil)
Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()

prints

+--------------------+
|jsontostruct(struct)|
+--------------------+
|                 [1]|
+--------------------+

This PR simply suggests to print this as null if the schema is StructType and input is json array.with multiple elements

+--------------------+
|jsontostruct(struct)|
+--------------------+
|                null|
+--------------------+

Support json arrays in from_json with ArrayType as the schema.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()

prints

+-------------------+
|jsontostruct(array)|
+-------------------+
|         [[1], [2]]|
+-------------------+

How was this patch tested?

Unit test in JsonExpressionsSuite, JsonFunctionsSuite, Python doctests and manual test.

@HyukjinKwon
Copy link
Member Author

Cc @hvanhovell, could you please take a look and see if this makes sense?

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72878 has finished for PR 16929 at commit acbce26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member

zsxwing commented Feb 14, 2017

also cc @marmbrus

@marmbrus
Copy link
Contributor

I agree that its wrong to truncate, but why not just fix handling of arrays rather than disallow it?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Feb 14, 2017

Sure, let me turn it to suuport. I thought disallowing was kind of a safe choice :)..

@HyukjinKwon HyukjinKwon changed the title [SPARK-19595][SQL] Do not allow json array in from_json [WIP][SPARK-19595][SQL] Do not allow json array in from_json Feb 14, 2017
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-19595][SQL] Do not allow json array in from_json [SPARK-19595][SQL] Support json array in from_json Feb 18, 2017
@HyukjinKwon HyukjinKwon force-pushed the disallow-array branch 2 times, most recently from ef783bd to 25cdd7d Compare February 18, 2017 18:20
@HyukjinKwon
Copy link
Member Author

@hvanhovell, @zsxwing and @marmbrus, I just updated and rebased. Could you take another look please?

@SparkQA
Copy link

SparkQA commented Feb 18, 2017

Test build #73110 has finished for PR 16929 at commit 25cdd7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 18, 2017

Test build #73113 has finished for PR 16929 at commit c7b5f2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 18, 2017

Test build #73111 has finished for PR 16929 at commit 8122314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 18, 2017

Test build #73114 has finished for PR 16929 at commit d6fd39b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Hi @marmbrus, does this sounds good to you?

@marmbrus
Copy link
Contributor

Hmm, I'm not sure we want to change this to a generator. I think that has performance consequences as well as possibly being surprising. I would probably make it possible to handle arrays (when the correct schema is given). If they want to explode they can run explode(from_json(...)).

@marmbrus
Copy link
Contributor

/cc @brkyvz

@HyukjinKwon
Copy link
Member Author

Sure, let me take a look and try.

@HyukjinKwon HyukjinKwon changed the title [SPARK-19595][SQL] Support json array in from_json [WIP][SPARK-19595][SQL] Support json array in from_json Feb 25, 2017
@HyukjinKwon
Copy link
Member Author

Let me clean up and fix the tests if failed with an updated PR description soon. It is still a wip.

@SparkQA
Copy link

SparkQA commented Feb 25, 2017

Test build #73467 has finished for PR 16929 at commit 8c48436.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-19595][SQL] Support json array in from_json [SPARK-19595][SQL] Support json array in from_json Feb 26, 2017
@HyukjinKwon
Copy link
Member Author

Hi @marmbrus, @brkyvz and @hvanhovell, I think it is ready for a review.

@SparkQA
Copy link

SparkQA commented Feb 26, 2017

Test build #73474 has finished for PR 16929 at commit 25086ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 26, 2017

Test build #73475 has finished for PR 16929 at commit 2aaf609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Feb 28, 2017

Then, let me fix this as below:

  • from_json with StructType

    • JSON array -> null
    • JSON object -> Row(...)
  • from_json with ArrayType

    • JSON array -> Array(Row(...), ...)
    • JSON object -> Array(Row(...))
  • exposed APIs

    def from_json(..., schema: StructType, ...) = ...
    def from_json(..., schema: DataType, ...) = ...

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73570 has started for PR 16929 at commit 72d6410.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73572 has started for PR 16929 at commit 54e60bb.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73573 has started for PR 16929 at commit 9f1e966.

@HyukjinKwon
Copy link
Member Author

@brkyvz, @marmbrus - I think it is ready for another look. Could you see if I understood your comments correctly?

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73574 has started for PR 16929 at commit 0c088bf.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73621 has finished for PR 16929 at commit 0c088bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient
lazy val converter = schema match {
case _: StructType =>
(rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
Copy link
Contributor

@brkyvz brkyvz Mar 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks previous behavior. I would still return the first element if rows.length > 1. Feel free to push back. Also wonder what @marmbrus thinks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay breaking previous behavior because I'd call truncating an array a bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should list this in the release notes though (i.e. go tag the JIRA).

@HyukjinKwon
Copy link
Member Author

@mambrus and @brkyvz, would there be other things I should double check?

InternalRow.fromSeq(1 :: Nil) ::
InternalRow.fromSeq(2 :: Nil) :: Nil
checkEvaluation(JsonToStruct(
schema, Map.empty, Literal(jsonData1), gmtId), expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you put input and expected output in different rows for readability please

schema, Map.empty, Literal(jsonData1), gmtId), expected)

// json object: `Array(Row(...))`
val jsonData2 = """{"a": 1}"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make each example a separate test. This way it's easier to figure out what breaks later.

e.g.

  1. from_json - input=array, schema=array, output=array
  2. from_json - input=object, schema=array, output=array of single object
  3. from_json - input=empty json array, schema=array, output=empty array
    ...

@brkyvz
Copy link
Contributor

brkyvz commented Mar 4, 2017

@HyukjinKwon Implementation seems fine. Just left a cosmetic comment on your unit tests. Otherwise LGTM

@HyukjinKwon
Copy link
Member Author

Thank you so much. Let me clean up.

@SparkQA
Copy link

SparkQA commented Mar 4, 2017

Test build #73893 has finished for PR 16929 at commit 3d490e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

felixcheung commented Mar 5, 2017

should the title description of this PR be updated to reflect that we should support array of json?

@HyukjinKwon
Copy link
Member Author

I just updated the PR description to prevent confusion.

@HyukjinKwon
Copy link
Member Author

Cc @brkyvz and @marmbrus could this be merged by any chance?

@@ -372,6 +372,62 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
)
}

test("from_json - input=array, schema=array, output=array") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are great! thanks!

@brkyvz
Copy link
Contributor

brkyvz commented Mar 5, 2017

Merging to master. Thanks!

@asfgit asfgit closed this in 369a148 Mar 5, 2017
@HyukjinKwon
Copy link
Member Author

Thank you @brkyvz.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants