
[SPARK-30592][SQL] Interval support for csv and json functions #27317

Closed
wants to merge 9 commits

Conversation

yaooqinn
Member

@yaooqinn commented Jan 22, 2020

What changes were proposed in this pull request?

This PR proposes fully supporting intervals in the CSV and JSON functions.

On one hand, CSV and JSON records consist of string values, and the cast logic can now convert strings to and from intervals, so it is easy to make these functions support intervals.

Before this change, the only workaround was:

SELECT cast(from_csv('1, 1 day', 'a INT, b string').b as interval)
struct<CAST(from_csv(1, 1 day).b AS INTERVAL):interval>
1 days

On the other hand, reading and writing intervals directly from/to CSV and JSON files remains banned. To read from or write to external JSON/CSV storage, an explicit cast is still required, e.g.

spark.read.schema("a string").json("a.json").selectExpr("cast(a as interval)").show
+------+
|     a|
+------+
|1 days|
+------+
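
A similar workaround applies to CSV sources; a minimal sketch, assuming a hypothetical file a.csv whose single line is "1 days":

// Intervals are still banned in CSV schemas, so read the column as string and cast it.
spark.read.schema("a string").csv("a.csv").selectExpr("cast(a as interval)").show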

Why are the changes needed?

To future-proof the interval type.

Does this PR introduce any user-facing change?

Yes, the to_json/from_json functions can now handle intervals. For from_json there is no "before" example, because interval was not supported in its schema at all; for to_json, interval values can now be used.

before

SELECT to_json(map('a', interval 25 month 100 day 130 minute));
Error in query: cannot resolve 'to_json(map('a', INTERVAL '2 years 1 months 100 days 2 hours 10 minutes'))' due to data type mismatch: Unable to convert column a of type interval to JSON.; line 1 pos 7;
'Project [unresolvedalias(to_json(map(a, 2 years 1 months 100 days 2 hours 10 minutes), Some(Asia/Shanghai)), None)]
+- OneRowRelation

after

SELECT to_json(map('a', interval 25 month 100 day 130 minute))
{"a":"2 years 1 months 100 days 2 hours 10 minutes"}

How was this patch tested?

Added a unit test:

SELECT cast(from_csv('1, 1 day', 'a INT, b string').b as interval)
struct<CAST(from_csv(1, 1 day).b AS INTERVAL):interval>
1 days
Member

@MaxGekk left a comment


Could you add a test for the case when spark.sql.legacy.to_json.verifyMapKeyType.enabled is false, to be sure that the config works?

@yaooqinn
Member Author

Thanks, @MaxGekk. On second thought, we should let these functions handle intervals, since we only need to prohibit intervals in external storage. So I have removed that conf.

val mapType1 = MapType(CalendarIntervalType, IntegerType)
val schema1 = StructType(StructField("a", mapType1) :: Nil)
val struct1 = Literal.create(null, schema1)
test("from_json - interval support") {
Contributor


nit: can we follow the CSV test and cover from/to in one test case?

Member Author


sgtm
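
A rough sketch of what a combined test might look like, assuming it sits inside JsonFunctionsSuite (so checkAnswer, testImplicits, and the usual imports such as org.apache.spark.sql.functions.{from_json, to_json}, org.apache.spark.sql.types._, and org.apache.spark.unsafe.types.CalendarInterval are in scope); the actual test may differ:

test("from_json/to_json - interval support") {
  // schema with a single interval field
  val schema = StructType(StructField("a", CalendarIntervalType) :: Nil)
  val df = Seq("""{"a":"1 days"}""").toDF("json")
  // from_json parses the interval string into a CalendarInterval value
  checkAnswer(
    df.select(from_json($"json", schema)),
    Row(Row(new CalendarInterval(0, 1, 0L))))
  // to_json(from_json(...)) round-trips back to the original JSON string
  checkAnswer(
    df.select(to_json(from_json($"json", schema))),
    Row("""{"a":"1 days"}"""))
}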

@@ -222,33 +222,24 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
Row("""{"_1":"26/08/2015 18:00"}""") :: Nil)
}

test("to_json - key types of map don't matter") {
Contributor


I think we can remove it now. There are no unsupported types anymore; JSON supports all the types.

@yaooqinn changed the title [SPARK-30592][SQL] Interval type checker for csv and json functions [SPARK-30592][SQL] Support Intervals for csv and json functions Jan 22, 2020
@yaooqinn changed the title [SPARK-30592][SQL] Support Intervals for csv and json functions [SPARK-30592][SQL] Interval support for csv and json functions Jan 22, 2020
parseJsonToken[CalendarInterval](parser, dataType) {
case VALUE_STRING =>
IntervalUtils.safeStringToInterval(UTF8String.fromString(parser.getText))
}
Member


nit: shift } to the right; it belongs to parseJsonToken, not to the case.

Member Author


thanks

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117222 has finished for PR 27317 at commit aa7742c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117223 has finished for PR 27317 at commit d0dc6cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117230 has finished for PR 27317 at commit 9cb10ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117232 has finished for PR 27317 at commit 4d0df26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117235 has finished for PR 27317 at commit 5a32fed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

SELECT from_csv('1, 1 day', 'a INT, b interval');
SELECT to_csv(named_struct('a', interval 32 month, 'b', interval 70 minute));
SELECT from_json('{"a":"1 days"}', 'a interval');
SELECT to_json(map('a', interval 25 month 100 day 130 minute));
Member


@yaooqinn Could you add round-trip test cases?

  • from_csv(to_csv(....))
  • to_csv(from_csv(...))
  • from_json(to_json(....))
  • to_json(from_json(....))

Member Author


OK, I'll make a follow-up.
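
For illustration, the requested round trips might look roughly like the following spark-shell sketch, built from the queries shown above; the actual tests were added in the follow-up PR:

// hypothetical round-trip checks: each pair should reproduce the original value
spark.sql("SELECT from_csv(to_csv(named_struct('a', interval 32 month)), 'a interval')").show(false)
spark.sql("SELECT to_csv(from_csv('1 day', 'a interval'))").show(false)
spark.sql("""SELECT from_json(to_json(named_struct('a', interval 25 month)), 'a interval')""").show(false)
spark.sql("""SELECT to_json(from_json('{"a":"1 days"}', 'a interval'))""").show(false)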

@gatorsmile
Member

@yaooqinn The "Does this PR introduce any user-facing change?" section does not include the changes made in CSV. We need to update the PR description when the code is updated.

@yaooqinn
Member Author

yaooqinn commented Feb 9, 2020

> @yaooqinn The "Does this PR introduce any user-facing change?" section does not include the changes made in CSV. We need to update the PR description when the code is updated.

The changes in CSV are not user-facing

HyukjinKwon pushed a commit that referenced this pull request Feb 10, 2020
### What changes were proposed in this pull request?

Add round-trip tests for the CSV and JSON functions, as requested in #27317 (comment).

### Why are the changes needed?

improve test coverage

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Add unit tests.

Closes #27510 from yaooqinn/SPARK-30592-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 10, 2020
(cherry picked from commit 58b9ca1)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020