[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour #13372

Closed

maropu wants to merge 3 commits into apache:master from maropu:SPARK-15585

Conversation

@maropu (Member) commented May 28, 2016

What changes were proposed in this pull request?

This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
It also explicitly sets default values for the CSV options in Python.

How was this patch tested?

Added tests in CSVSuite.
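The intended semantics can be sketched outside Spark. The following is a hypothetical Python model for illustration only (`resolve_quote` and `DEFAULT_QUOTE` are made-up names, not Spark API):

```python
# Hypothetical model of the fix: an absent option falls back to the
# default, while an explicit None for "quote" means "no quote
# character" (modelled as NUL), mirroring spark-csv's behaviour.

DEFAULT_QUOTE = '"'

def resolve_quote(options):
    if "quote" not in options:
        return DEFAULT_QUOTE                     # unset: use the default
    value = options["quote"]
    return "\u0000" if value is None else value  # None: disable quoting
```

The key point is that "option never set" and "option explicitly set to null" are distinguished, rather than both collapsing to the default.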

@maropu (Member, Author) commented May 28, 2016

@rxin Could you check whether this satisfies your suggestion? If there is no problem, I'll also set default values for the JSON options in a similar way.

@SparkQA commented May 28, 2016

Test build #59551 has finished for PR 13372 at commit 12f4f89.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented May 28, 2016

Whether it should be done here or not, shouldn't we define explicit behaviours for when the options are set to null? Currently, some options allow null as a legitimate value, whereas most of them will throw a NullPointerException as far as I know.
At the least, allowing null for the quote option should probably be documented here.
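One way to get the clearer failure mode discussed here is to validate eagerly. `get_char` below is a hypothetical Python sketch (the real code paths are Scala inside Spark):

```python
def get_char(options, key, default):
    # Fail fast with a descriptive message instead of letting a null
    # propagate and surface later as a bare NullPointerException.
    value = options.get(key, default)
    if value is None:
        raise ValueError(
            f"Option '{key}' must not be null; "
            f"set it to '\\u0000' to disable it explicitly")
    if len(value) != 1:
        raise ValueError(
            f"Option '{key}' must be a single character, got {value!r}")
    return value
```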

@SparkQA commented May 28, 2016

Test build #59559 has finished for PR 13372 at commit 04c2d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented May 28, 2016

Yea, it is a difficult question. We must define an explicit behaviour for quote, but I'm not sure about the other options. An alternative idea is to define a special function for quote, e.g. getCharForQuote, to handle null. Thoughts?

Anyway, yes, we need to document this change. Once we reach consensus, I'll add the documentation.

@HyukjinKwon (Member) commented May 28, 2016

I think at least we need to handle the other options as well with a clearer message rather than just a java.lang.NullPointerException. I see some options are already validated; for example, samplingRatio in the JSON data source throws a clear message (requirement failed: samplingRatio (-1.0) should be greater than 0) when the value is negative. This can probably be done in another PR anyway.

@maropu (Member, Author) commented May 28, 2016

That is a meaningful suggestion. I think all option validation should be done in a single place, and if there is an invalid option, Spark should throw a clear message so users can easily understand it.
As for samplingRatio, should we do the validation not in inferSchema but in JSONOptions? For example, the validation of parseMode is already done there. I think we'd be better off finishing these validations in JSONOptions and CSVOptions.

Anyway, I'm not sure these fixes should be included in this PR; follow-up PRs seem okay to me.
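Finishing all validation inside the options class, as proposed, might look like this hypothetical sketch (a Python stand-in for Spark's Scala CSVOptions/JSONOptions; the option names follow the PR discussion, the class itself is illustrative):

```python
class CSVOptions:
    """Validates every option in one place so all code paths share
    the same clear error messages. Illustrative only."""

    PARSE_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}

    def __init__(self, options):
        # parseMode is checked here, not at each call site.
        self.parse_mode = options.get("mode", "PERMISSIVE").upper()
        if self.parse_mode not in self.PARSE_MODES:
            raise ValueError(f"mode ({self.parse_mode}) is not a valid parse mode")
        # samplingRatio is checked at construction, not in inferSchema.
        ratio = float(options.get("samplingRatio", 1.0))
        if ratio <= 0:
            raise ValueError(f"samplingRatio ({ratio}) should be greater than 0")
        self.sampling_ratio = ratio
```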

@rxin (Contributor) commented Jun 6, 2016

Thanks - I'm going to merge this in master/2.0 and make some fixes myself.

@asfgit asfgit closed this in b7e8d1c Jun 6, 2016
asfgit pushed a commit that referenced this pull request Jun 6, 2016
## What changes were proposed in this pull request?
This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv.
Also, it explicitly sets default values for CSV options in python.

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13372 from maropu/SPARK-15585.

(cherry picked from commit b7e8d1c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@rxin (Contributor) commented Jun 6, 2016

Actually, sorry -- I thought about this more, and unfortunately we can't do it this way. The main problem is that we would break existing option use cases, e.g. the following code:

df.option("sep", "|").csv("...")

In this case, the default sep would still be chosen. I'm going to revert this patch and think about a workaround instead.
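The regression can be modelled with plain dictionaries. This is a hypothetical sketch (the real logic lives in PySpark's DataFrameReader; `read_csv_broken`/`read_csv_fixed` are made-up names): if the csv() wrapper always forwards explicit defaults, they get applied on top of anything the user set earlier via option().

```python
DEFAULTS = {"sep": ",", "quote": '"'}

def read_csv_broken(user_options):
    # Broken merge order: explicit defaults are applied *after* the
    # user's options, silently clobbering df.option("sep", "|").
    merged = dict(user_options)
    merged.update(DEFAULTS)
    return merged

def read_csv_fixed(user_options):
    # Correct merge order: defaults only fill the gaps the user left.
    merged = dict(DEFAULTS)
    merged.update(user_options)
    return merged
```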

@rxin (Contributor) commented Jun 6, 2016

I think the best way is probably to document this and ask users to use '\u0000' explicitly. @maropu does that work?
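As an analogue of the suggested workaround (this uses Python's stdlib csv module, not Spark, purely to illustrate "no quote character" as an explicit setting rather than a null):

```python
import csv
import io

row = ["a", "b|c", "d"]

# Default minimal quoting: the field containing the delimiter is quoted.
buf_quoted = io.StringIO()
csv.writer(buf_quoted, delimiter="|").writerow(row)

# Quoting disabled explicitly -- the analogue of asking Spark users to
# pass '\u0000' -- so the delimiter must be escaped instead.
buf_raw = io.StringIO()
csv.writer(buf_raw, delimiter="|", quoting=csv.QUOTE_NONE,
           escapechar="\\").writerow(row)

print(buf_quoted.getvalue().strip())  # a|"b|c"|d
print(buf_raw.getvalue().strip())     # a|b\|c|d
```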

@maropu (Member, Author) commented Jun 6, 2016

Okay, though I'm getting on a flight to SF, so I'll check within a day. Thanks!

@maropu maropu deleted the SPARK-15585 branch July 5, 2017 11:43