[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour #13372

Closed

maropu wants to merge 3 commits into apache:master from maropu:SPARK-15585

Conversation

@maropu (Member) commented May 28, 2016

What changes were proposed in this pull request?

This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
It also explicitly sets default values for the CSV options in Python.

How was this patch tested?

Added tests in CSVSuite.
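The intended semantics can be sketched outside Spark. The following is a hypothetical Python model for illustration only (`resolve_quote` and `DEFAULT_QUOTE` are made-up names, not Spark API):

```python
# Hypothetical model of the fix: an absent option falls back to the
# default, while an explicit None for "quote" means "no quote
# character" (modelled as NUL), mirroring spark-csv's behaviour.

DEFAULT_QUOTE = '"'

def resolve_quote(options):
    if "quote" not in options:
        return DEFAULT_QUOTE                     # unset: use the default
    value = options["quote"]
    return "\u0000" if value is None else value  # None: disable quoting
```

The key point is that "option never set" and "option explicitly set to null" are distinguished, rather than both collapsing to the default.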

@maropu (Member, Author) commented May 28, 2016

@rxin Could you check whether this satisfies your suggestion? If there is no problem, I'll also set default values for the JSON options in a similar way.

@SparkQA commented May 28, 2016

Test build #59551 has finished for PR 13372 at commit 12f4f89.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented May 28, 2016

Whether it should be done here or not, shouldn't we define explicit behaviours for when the options are set to null? Currently, some options allow null as a legitimate value, whereas most of them will throw a NullPointerException as far as I know.
At the least, allowing null for the quote option should probably be documented here.
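One way to get the clearer failure mode discussed here is to validate eagerly. `get_char` below is a hypothetical Python sketch (the real code paths are Scala inside Spark):

```python
def get_char(options, key, default):
    # Fail fast with a descriptive message instead of letting a null
    # propagate and surface later as a bare NullPointerException.
    value = options.get(key, default)
    if value is None:
        raise ValueError(
            f"Option '{key}' must not be null; "
            f"set it to '\\u0000' to disable it explicitly")
    if len(value) != 1:
        raise ValueError(
            f"Option '{key}' must be a single character, got {value!r}")
    return value
```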

@SparkQA commented May 28, 2016

Test build #59559 has finished for PR 13372 at commit 04c2d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented May 28, 2016

Yea, it is a difficult question. We must define an explicit behaviour for quote, but I'm not sure about the other options. An alternative idea is to define a special function for quote, e.g. getCharForQuote, to handle null. Thoughts?

Anyway, yes, we need to document this change. Once we reach consensus, I'll add the documentation.

@HyukjinKwon (Member) commented May 28, 2016

I think at least we need to handle the other options as well with a clearer message rather than just a java.lang.NullPointerException. I see some options are already validated; for example, samplingRatio in the JSON data source throws a clear message (requirement failed: samplingRatio (-1.0) should be greater than 0) when the value is negative. This can probably be done in another PR anyway.

@maropu (Member, Author) commented May 28, 2016

That is a meaningful suggestion. I think all option validation should be done in a single place, and if there is an invalid option, Spark should throw a clear message so users can easily understand it.
As for samplingRatio, should we do the validation not in inferSchema but in JSONOptions? For example, the validation of parseMode is already done there. I think we'd be better off finishing these validations in JSONOptions and CSVOptions.

Anyway, I'm not sure these fixes should be included in this PR; follow-up PRs seem okay to me.
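Finishing all validation inside the options class, as proposed, might look like this hypothetical sketch (a Python stand-in for Spark's Scala CSVOptions/JSONOptions; the option names follow the PR discussion, the class itself is illustrative):

```python
class CSVOptions:
    """Validates every option in one place so all code paths share
    the same clear error messages. Illustrative only."""

    PARSE_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}

    def __init__(self, options):
        # parseMode is checked here, not at each call site.
        self.parse_mode = options.get("mode", "PERMISSIVE").upper()
        if self.parse_mode not in self.PARSE_MODES:
            raise ValueError(f"mode ({self.parse_mode}) is not a valid parse mode")
        # samplingRatio is checked at construction, not in inferSchema.
        ratio = float(options.get("samplingRatio", 1.0))
        if ratio <= 0:
            raise ValueError(f"samplingRatio ({ratio}) should be greater than 0")
        self.sampling_ratio = ratio
```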

@rxin (Contributor) commented Jun 6, 2016

Thanks - I'm going to merge this in master/2.0 and make some fixes myself.

@asfgit asfgit closed this in b7e8d1c Jun 6, 2016
asfgit pushed a commit that referenced this pull request Jun 6, 2016
## What changes were proposed in this pull request?
This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv.
Also, it explicitly sets default values for CSV options in python.

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13372 from maropu/SPARK-15585.

(cherry picked from commit b7e8d1c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@rxin (Contributor) commented Jun 6, 2016

Actually, sorry -- I thought about this more, and unfortunately we can't do it this way. The main problem is that we would break existing option use cases, e.g. the following code:

df.option("sep", "|").csv("...")

In this case, the default sep would still be chosen. I'm going to revert this patch and think about a workaround instead.
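The regression can be modelled with plain dictionaries. This is a hypothetical sketch (the real logic lives in PySpark's DataFrameReader; `read_csv_broken`/`read_csv_fixed` are made-up names): if the csv() wrapper always forwards explicit defaults, they get applied on top of anything the user set earlier via option().

```python
DEFAULTS = {"sep": ",", "quote": '"'}

def read_csv_broken(user_options):
    # Broken merge order: explicit defaults are applied *after* the
    # user's options, silently clobbering df.option("sep", "|").
    merged = dict(user_options)
    merged.update(DEFAULTS)
    return merged

def read_csv_fixed(user_options):
    # Correct merge order: defaults only fill the gaps the user left.
    merged = dict(DEFAULTS)
    merged.update(user_options)
    return merged
```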

@rxin (Contributor) commented Jun 6, 2016

I think the best way is probably to document this and ask users to use '\u0000' explicitly. @maropu does that work?
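As an analogue of the suggested workaround (this uses Python's stdlib csv module, not Spark, purely to illustrate "no quote character" as an explicit setting rather than a null):

```python
import csv
import io

row = ["a", "b|c", "d"]

# Default minimal quoting: the field containing the delimiter is quoted.
buf_quoted = io.StringIO()
csv.writer(buf_quoted, delimiter="|").writerow(row)

# Quoting disabled explicitly -- the analogue of asking Spark users to
# pass '\u0000' -- so the delimiter must be escaped instead.
buf_raw = io.StringIO()
csv.writer(buf_raw, delimiter="|", quoting=csv.QUOTE_NONE,
           escapechar="\\").writerow(row)

print(buf_quoted.getvalue().strip())  # a|"b|c"|d
print(buf_raw.getvalue().strip())     # a|b\|c|d
```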

@maropu (Member, Author) commented Jun 6, 2016

Okay, though I'm getting on a flight to SF, so I'll check within a day. Thanks!

@maropu maropu deleted the SPARK-15585 branch July 5, 2017 11:43