Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15143][SPARK-15144][SQL] Add CSV tests with HadoopFsRelationTest and support for nullValue for other types #12921

Closed
wants to merge 7 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented May 5, 2016

What changes were proposed in this pull request?

Currently, nullValue option does not work for some types, BooleanType, TimestampType, DateType and StringType. So, currently there is no way to read null for those types. This PR adds the support just like the other types.

Also, CSV data source is not being tested with HadoopFsRelationTest as a HadoopFsRelation. HadoopFsRelationTest includes 50 more tests (eg. partitioned table tests).

This PR adds two variables, extraReadOptions and extraWriteOptions in HadoopFsRelationTest so that the child class gives some options for reading and writing. In order to get the tests in HadoopFsRelationTest passed, CSV data source needs to give options header and inferSchema as true for reading and header as true for writing.

How was this patch tested?

Unittests in CSVHadoopFsRelationTest, CSVTypeCastSuite and edited tests in CSVSuite

@@ -447,7 +446,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {

verifyCars(cars, withHeader = true, checkValues = false)
val results = cars.collect()
assert(results(0).toSeq === Array(2012, "Tesla", "S", "null", "null"))
assert(results(0).toSeq === Array(2012, "Tesla", "S", null, null))
Copy link
Member Author

@HyukjinKwon HyukjinKwon May 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being tested against the data as below:

year,make,model,comment,blank
"2012","Tesla","S",null,

1997,Ford,E350,"Go get one now they are going fast",
null,Chevy,Volt

Since the header is year,make,model,comment,blank, this should produce the values 2012,Tesla,S,null,null because nullValue is set to "null".

@HyukjinKwon
Copy link
Member Author

cc @rxin @falaki

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57845 has finished for PR 12921 at commit ef71599.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57846 has finished for PR 12921 at commit f80df8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57847 has finished for PR 12921 at commit 2ea702c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DateTimeUtils.millisToDays(DateTimeUtils.stringToTime(datum).getTime)
case _: StringType => UTF8String.fromString(datum)
case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
if (datum == null || (datum == options.nullValue && nullable)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply the logic below was added just like inferField():

if (datum == null || (datum == options.nullValue && nullable)) {
  null
} else {
 ... 

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57856 has finished for PR 12921 at commit 33288c8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

UTF8String.fromString(""))
assert(
CSVTypeCast.castTo("", StringType, nullable = false, CSVOptions()) ==
CSVTypeCast.castTo("", StringType, nullable = false, CSVOptions("nullValue", null)) ==
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@falaki I just noticed and thought this test implies nullValue does not apply for StringType. Is this intendedly being exclusive? I thought nullValue should be applied for all the types equivalently.

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, nulls for StringType will be lost in the roundtrip of reading and writing.

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57871 has finished for PR 12921 at commit f31f69e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 7, 2016

Test build #58045 has finished for PR 12921 at commit 1233bd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NoSuchPermanentFunctionException(db: String, func: String)
    • class NoSuchFunctionException(db: String, func: String)
    • case class GetExternalRowField(

@SparkQA
Copy link

SparkQA commented May 11, 2016

Test build #58319 has finished for PR 12921 at commit 75f1cb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Please excuse my ping @rxin @falaki

@HyukjinKwon
Copy link
Member Author

Hi @cloud-fan, Could you please take a look?

@HyukjinKwon
Copy link
Member Author

Closing this since another PR has (I think) a better change. I will maybe submit another PR for adding some tests in the future.

@HyukjinKwon HyukjinKwon deleted the SPARK-15143-15144 branch January 2, 2018 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants