Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one #19017

Closed
wants to merge 4 commits into from

Conversation

jmchung
Copy link
Contributor

@jmchung jmchung commented Aug 22, 2017

What changes were proposed in this pull request?

When json_tuple in extracting values from JSON it returns null values within repeated columns except the first one as below:

scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
+---+---+----+
| c0| c1|  c2|
+---+---+----+
|  1|  2|null|
+---+---+----+

I think this should be consistent with Hive's implementation:

hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
1    1

In this PR, we located all the matched indices in fieldNames instead of returning the first matched index, i.e., indexOf.

How was this patch tested?

Added test in JsonExpressionsSuite.

@jmchung
Copy link
Contributor Author

jmchung commented Aug 22, 2017

cc @viirya

@HyukjinKwon
Copy link
Member

ok to test

@viirya
Copy link
Member

viirya commented Aug 22, 2017

Thanks for triggering the test @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #80971 has finished for PR 19017 at commit f04b896.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmchung Could we avoid functional transformations by a while loop here? I think this should be avoided, in particular, when we are in a hot path. This should be a valid suggestion per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex

I was thinking of an additional while loop with a if because all we need is to set the same value to multiple fields (while) if the field is the same (if) ..

@jmchung
Copy link
Contributor Author

jmchung commented Aug 23, 2017

@HyukjinKwon That's a good point, thanks.

@jmchung
Copy link
Contributor Author

jmchung commented Aug 23, 2017

@HyukjinKwon @viirya I replaced the functional transformations with a while loop.
What do you think about this? Thanks.

@SparkQA
Copy link

SparkQA commented Aug 23, 2017

Test build #81016 has finished for PR 19017 at commit 8a25e92.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jmchung
Copy link
Contributor Author

jmchung commented Aug 23, 2017

retest this please

1 similar comment
@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 23, 2017

Test build #81025 has finished for PR 19017 at commit 8a25e92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Aug 24, 2017

Please edit the PR title as [SPARK-21804][SQL] json_tuple returns ....

@jmchung jmchung changed the title SPARK-21804: json_tuple returns null values within repeated columns except the first one [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one Aug 24, 2017
@jmchung
Copy link
Contributor Author

jmchung commented Aug 24, 2017

@viirya PR title fixed, thanks.

@viirya
Copy link
Member

viirya commented Aug 24, 2017

LGTM

@HyukjinKwon
Copy link
Member

LGTM too.

row(idx) = jsonValue
}
idx = idx + 1
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rewrite it using less lines? A more Scala way?

Copy link
Member

@viirya viirya Aug 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have followed @HyukjinKwon's suggestion #19017 (review) to avoid functional transformation with a while-loop, since this is a hot path. It makes sense to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still can simplify the codes a lot without functional transformation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I comment out the L451-452, the repeated fields still have the same jsonValue because fieldNames(idx) == jsonField, but the first comparison is not necessary since idx >= 0 means matched.

Could you please give me some advice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you maybe have a suggestion? The current status looks fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            row(idx) = jsonValue
            idx = idx + 1

            // SPARK-21804: json_tuple returns null values within repeated columns
            // except the first one; so that we need to check the remaining fields.
            while (idx < fieldNames.length) {
              if (fieldNames(idx) == jsonField) {
                row(idx) = jsonValue
              }
              idx = idx + 1
            }

->

            do {
              row(idx) = jsonValue
              idx = fieldNames.indexOf(jsonField, idx + 1)
            } while (idx >= 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also thinking if we should use a Hash table. However,,, the number of columns is not large. Thus, it might not get a noticeable benefit.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81066 has finished for PR 19017 at commit ff01e04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Aug 24, 2017 via email

@viirya
Copy link
Member

viirya commented Aug 24, 2017 via email

@gatorsmile
Copy link
Member

I really doubt we can see a measurable performance difference among these different solutions. I just did not want to challenge it. Maybe you can write an end-to-end test and see the difference.

Thus, I prefer to the simplest one.

@viirya
Copy link
Member

viirya commented Aug 24, 2017

If we assume the performance difference is negligible, functional transform actually is concise more. @HyukjinKwon What do you think?

@HyukjinKwon
Copy link
Member

Current status looks fine enough. I don't think we should prefer simplicity in a hot path. This follows obviously the guide lines and should be safe enough to go. This does not hurt my eyes.

@gatorsmile
Copy link
Member

Sorry, I do not think the current code is ready to merge.

@HyukjinKwon
Copy link
Member

Sorry, I do not think the current code is ready to merge.

Could you explain why?

@gatorsmile
Copy link
Member

I think my suggestion is better: #19017 (comment)

If you think mine is slower, please provide an end-to-end test to show the performance number. If this really impact the performance, I think using Hash table might be better

@HyukjinKwon
Copy link
Member

For me, either way is fine but personally prefer the current way because it exactly follows the guides.

BTW, I think you should do the perf tests if you think your suggestion is better.

@gatorsmile
Copy link
Member

OK.

@jmchung Please change it based on my comment.

@jmchung
Copy link
Contributor Author

jmchung commented Aug 24, 2017

@gatorsmile ok and really thanks for all the nice comments.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81072 has finished for PR 19017 at commit 792f350.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM

@asfgit asfgit closed this in 95713eb Aug 24, 2017
@HyukjinKwon
Copy link
Member

Merged to master.

@viirya
Copy link
Member

viirya commented Aug 24, 2017

Thanks @HyukjinKwon @gatorsmile

@jmchung
Copy link
Contributor Author

jmchung commented Aug 24, 2017

Thanks @viirya, @HyukjinKwon and @gatorsmile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants