[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one #19017
Conversation
cc @viirya
ok to test
Thanks for triggering the test @HyukjinKwon
Test build #80971 has finished for PR 19017 at commit
@jmchung Could we avoid functional transformations with a while loop here? I think this should be avoided, in particular, when we are in a hot path. This should be a valid suggestion per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
I was thinking of an additional while loop with an if, because all we need is to set the same value to multiple fields (while) if the field is the same (if).
@HyukjinKwon That's a good point, thanks.
@HyukjinKwon @viirya I replaced the functional transformations with a while loop.
Test build #81016 has finished for PR 19017 at commit
retest this please
retest this please
Test build #81025 has finished for PR 19017 at commit
Please edit the PR title as
@viirya PR title fixed, thanks.
LGTM |
LGTM too. |
    row(idx) = jsonValue
  }
  idx = idx + 1
}
Could you rewrite it using fewer lines, in a more idiomatic Scala way?
We have followed @HyukjinKwon's suggestion #19017 (review) to avoid functional transformation with a while loop, since this is a hot path. It makes sense to me.
You can still simplify the code a lot without functional transformation.
If I comment out L451-452, the repeated fields still get the same jsonValue because fieldNames(idx) == jsonField, but the first comparison is not necessary since idx >= 0 already means a match.
Could you please give me some advice?
Would you maybe have a suggestion? The current status looks fine.
row(idx) = jsonValue
idx = idx + 1
// SPARK-21804: json_tuple returns null values within repeated columns
// except the first one; so that we need to check the remaining fields.
while (idx < fieldNames.length) {
  if (fieldNames(idx) == jsonField) {
    row(idx) = jsonValue
  }
  idx = idx + 1
}

->

do {
  row(idx) = jsonValue
  idx = fieldNames.indexOf(jsonField, idx + 1)
} while (idx >= 0)
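As a self-contained sketch of the suggested rewrite (names like `fieldNames`, `jsonField`, and `jsonValue` stand in for JsonTuple's internal state, and the row is modeled as a plain array rather than Spark's InternalRow), the do-while is written here as an equivalent while loop so it also compiles under Scala 3:

```scala
object RepeatedColumnsFill {
  // Fill every occurrence of `jsonField` in `fieldNames` with `jsonValue`,
  // jumping between occurrences via indexOf(elem, from) instead of
  // scanning and comparing every remaining field.
  def fill(fieldNames: Seq[String], jsonField: String, jsonValue: String): Array[String] = {
    val row = new Array[String](fieldNames.length)
    var idx = fieldNames.indexOf(jsonField)
    while (idx >= 0) {
      row(idx) = jsonValue
      // move straight to the next occurrence of the same field name
      idx = fieldNames.indexOf(jsonField, idx + 1)
    }
    row
  }

  def main(args: Array[String]): Unit = {
    // Repeated column "a" receives the value in both positions.
    println(fill(Seq("a", "b", "a"), "a", "1").mkString(","))
  }
}
```

In the PR itself the loop can start as a do-while because `idx` is already a confirmed match when the parsed field is found; the guardless while form above simply folds that precondition into the loop test.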
I am also thinking about whether we should use a hash table. However, the number of columns is not large, so it might not get a noticeable benefit.
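For completeness, a hedged sketch of the hash-table idea floated here (illustrative names only; as the comment notes, with few columns it likely brings no measurable gain): precompute a map from field name to all of its indices once, then fill every matching slot per parsed (field, value) pair.

```scala
object HashTableAlternative {
  // Fill all columns from a sequence of parsed (field, value) pairs using a
  // precomputed name -> indices lookup table.
  def fillAll(fieldNames: Seq[String], parsed: Seq[(String, String)]): Array[String] = {
    // Built once per evaluation: each field name maps to every index it occupies.
    val indicesByName: Map[String, Seq[Int]] =
      fieldNames.zipWithIndex.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
    val row = new Array[String](fieldNames.length)
    for ((field, value) <- parsed; i <- indicesByName.getOrElse(field, Nil))
      row(i) = value
    row
  }

  def main(args: Array[String]): Unit = {
    println(fillAll(Seq("a", "b", "a"), Seq("a" -> "1", "b" -> "2")).mkString(","))
  }
}
```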
Test build #81066 has finished for PR 19017 at commit
Is it better than functional transformation?
A hash table is overkill here.
I really doubt we can see a measurable performance difference among these different solutions. I just did not want to challenge it. Maybe you can write an end-to-end test and see the difference. Thus, I prefer the simplest one.
If we assume the performance difference is negligible, the functional transformation is actually more concise. @HyukjinKwon What do you think?
The current status looks fine enough. I don't think we should prefer simplicity in a hot path. This obviously follows the guidelines and should be safe enough to go. This does not hurt my eyes.
Sorry, I do not think the current code is ready to merge.
Could you explain why?
I think my suggestion is better: #19017 (comment) If you think mine is slower, please provide an end-to-end test to show the performance numbers. If this really impacts performance, I think using a hash table might be better.
For me, either way is fine, but I personally prefer the current way because it exactly follows the guides. BTW, I think you should do the perf tests if you think your suggestion is better.
OK. @jmchung Please change it based on my comment.
@gatorsmile OK, and really, thanks for all the nice comments.
Test build #81072 has finished for PR 19017 at commit
LGTM
Merged to master.
Thanks @HyukjinKwon @gatorsmile
Thanks @viirya, @HyukjinKwon and @gatorsmile.
What changes were proposed in this pull request?
When json_tuple extracts values from JSON, it returns null values within repeated columns except the first one, as below:
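The original reproduction snippet did not survive extraction here. As a hedged, pure-Scala model of the pre-fix behavior (Spark's actual code writes into an InternalRow; the array below is a stand-in), filling only the first indexOf match is what left later duplicate columns null:

```scala
object BugModel {
  // Pre-fix behavior (illustrative only): only the first matching column
  // is assigned, because the lookup stops at the first indexOf hit.
  def extractOnce(fieldNames: Seq[String], jsonField: String, jsonValue: String): Array[String] = {
    val row = new Array[String](fieldNames.length)
    val idx = fieldNames.indexOf(jsonField) // first occurrence only
    if (idx >= 0) row(idx) = jsonValue
    row
  }

  def main(args: Array[String]): Unit = {
    // Asking for the same field twice: the second slot stays null.
    println(extractOnce(Seq("a", "a"), "a", "1").mkString(","))
  }
}
```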
I think this should be consistent with Hive's implementation:
In this PR, we locate all the matched indices in fieldNames instead of returning only the first matched index via indexOf.
How was this patch tested?
Added test in JsonExpressionsSuite.