[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one #19017
Conversation
cc @viirya
ok to test
Thanks for triggering the test @HyukjinKwon
Test build #80971 has finished for PR 19017 at commit
@jmchung Could we avoid functional transformations with a while loop here? I think this should be avoided, in particular, when we are in a hot path. This should be a valid suggestion per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
I was thinking of an additional while loop with an if, because all we need is to set the same value to multiple fields (while) if the field is the same (if).
@HyukjinKwon That's a good point, thanks.
@HyukjinKwon @viirya I replaced the functional transformations with a while loop.
Test build #81016 has finished for PR 19017 at commit
retest this please
retest this please
Test build #81025 has finished for PR 19017 at commit
Please edit the PR title as
@viirya PR title fixed, thanks.
LGTM |
LGTM too. |
    row(idx) = jsonValue
  }
  idx = idx + 1
}
Could you rewrite it using fewer lines, in a more idiomatic Scala way?
We have followed @HyukjinKwon's suggestion #19017 (review) to avoid functional transformation with a while loop, since this is a hot path. It makes sense to me.
You can still simplify the code a lot without functional transformation.
If I comment out L451-452, the repeated fields still get the same jsonValue because fieldNames(idx) == jsonField, but the first comparison is not necessary since idx >= 0 already means a match.
Could you please give me some advice?
Would you maybe have a suggestion? The current status looks fine.
row(idx) = jsonValue
idx = idx + 1
// SPARK-21804: json_tuple returns null values within repeated columns
// except the first one; so that we need to check the remaining fields.
while (idx < fieldNames.length) {
  if (fieldNames(idx) == jsonField) {
    row(idx) = jsonValue
  }
  idx = idx + 1
}

->

do {
  row(idx) = jsonValue
  idx = fieldNames.indexOf(jsonField, idx + 1)
} while (idx >= 0)
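As a self-contained sketch of the suggested rewrite (names like `fieldNames`, `jsonField`, and `jsonValue` stand in for JsonTuple's internal state, and the row is modeled as a plain array rather than Spark's InternalRow), the do-while is written here as an equivalent while loop so it also compiles under Scala 3:

```scala
object RepeatedColumnsFill {
  // Fill every occurrence of `jsonField` in `fieldNames` with `jsonValue`,
  // jumping between occurrences via indexOf(elem, from) instead of
  // scanning and comparing every remaining field.
  def fill(fieldNames: Seq[String], jsonField: String, jsonValue: String): Array[String] = {
    val row = new Array[String](fieldNames.length)
    var idx = fieldNames.indexOf(jsonField)
    while (idx >= 0) {
      row(idx) = jsonValue
      // move straight to the next occurrence of the same field name
      idx = fieldNames.indexOf(jsonField, idx + 1)
    }
    row
  }

  def main(args: Array[String]): Unit = {
    // Repeated column "a" receives the value in both positions.
    println(fill(Seq("a", "b", "a"), "a", "1").mkString(","))
  }
}
```

In the PR itself the loop can start as a do-while because `idx` is already a confirmed match when the parsed field is found; the guardless while form above simply folds that precondition into the loop test.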
I am also thinking about whether we should use a hash table. However, the number of columns is not large, so it might not get a noticeable benefit.
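For completeness, a hedged sketch of the hash-table idea floated here (illustrative names only; as the comment notes, with few columns it likely brings no measurable gain): precompute a map from field name to all of its indices once, then fill every matching slot per parsed (field, value) pair.

```scala
object HashTableAlternative {
  // Fill all columns from a sequence of parsed (field, value) pairs using a
  // precomputed name -> indices lookup table.
  def fillAll(fieldNames: Seq[String], parsed: Seq[(String, String)]): Array[String] = {
    // Built once per evaluation: each field name maps to every index it occupies.
    val indicesByName: Map[String, Seq[Int]] =
      fieldNames.zipWithIndex.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
    val row = new Array[String](fieldNames.length)
    for ((field, value) <- parsed; i <- indicesByName.getOrElse(field, Nil))
      row(i) = value
    row
  }

  def main(args: Array[String]): Unit = {
    println(fillAll(Seq("a", "b", "a"), Seq("a" -> "1", "b" -> "2")).mkString(","))
  }
}
```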
Test build #81066 has finished for PR 19017 at commit
Is it better than functional transformation?
A hash table is overkill here.
I really doubt we can see a measurable performance difference among these different solutions. I just did not want to challenge it. Maybe you can write an end-to-end test and see the difference. Thus, I prefer the simplest one.
If we assume the performance difference is negligible, the functional transformation is actually more concise. @HyukjinKwon What do you think?
The current status looks fine enough. I don't think we should prefer simplicity in a hot path. This obviously follows the guidelines and should be safe enough to go. This does not hurt my eyes.
Sorry, I do not think the current code is ready to merge.
Could you explain why?
I think my suggestion is better: #19017 (comment) If you think mine is slower, please provide an end-to-end test to show the performance numbers. If this really impacts performance, I think using a hash table might be better.
For me, either way is fine, but I personally prefer the current way because it exactly follows the guides. BTW, I think you should do the perf tests if you think your suggestion is better.
OK. @jmchung Please change it based on my comment.
@gatorsmile OK, and really, thanks for all the nice comments.
Test build #81072 has finished for PR 19017 at commit
LGTM
Merged to master.
Thanks @HyukjinKwon @gatorsmile
Thanks @viirya, @HyukjinKwon and @gatorsmile.
What changes were proposed in this pull request?
When json_tuple extracts values from JSON, it returns null values within repeated columns except the first one, as below:
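The original reproduction snippet did not survive extraction here. As a hedged, pure-Scala model of the pre-fix behavior (Spark's actual code writes into an InternalRow; the array below is a stand-in), filling only the first indexOf match is what left later duplicate columns null:

```scala
object BugModel {
  // Pre-fix behavior (illustrative only): only the first matching column
  // is assigned, because the lookup stops at the first indexOf hit.
  def extractOnce(fieldNames: Seq[String], jsonField: String, jsonValue: String): Array[String] = {
    val row = new Array[String](fieldNames.length)
    val idx = fieldNames.indexOf(jsonField) // first occurrence only
    if (idx >= 0) row(idx) = jsonValue
    row
  }

  def main(args: Array[String]): Unit = {
    // Asking for the same field twice: the second slot stays null.
    println(extractOnce(Seq("a", "a"), "a", "1").mkString(","))
  }
}
```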
I think this should be consistent with Hive's implementation:
In this PR, we locate all the matched indices in fieldNames instead of returning only the first matched index via indexOf.
How was this patch tested?
Added test in JsonExpressionsSuite.