
[SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R #10800

Closed
wants to merge 7 commits into from

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-12668

Spark's CSV datasource has recently been merged (filed in SPARK-12420). This is a quick PR that simply adds aliases for several CSV options so they are similar to Pandas and R:

  • Alias for delimiter -> sep
  • charset -> encoding
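The aliasing described above can be sketched as a small helper that resolves an option through its legacy name (this is an illustrative sketch only, not the actual Spark code; the object and parameter names are made up):

```scala
// Illustrative sketch: resolve an option value, preferring the new
// canonical name ("sep", "encoding") over the legacy one.
object OptionAliases {
  def resolve(parameters: Map[String, String],
              canonical: String,
              legacy: String,
              default: String): String =
    parameters.getOrElse(canonical, parameters.getOrElse(legacy, default))
}

// OptionAliases.resolve(Map("sep" -> "|"), "sep", "delimiter", ",")       // "|"
// OptionAliases.resolve(Map("delimiter" -> "|"), "sep", "delimiter", ",") // "|"
// OptionAliases.resolve(Map.empty, "sep", "delimiter", ",")               // ","
```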

@falaki
Contributor

falaki commented Jan 18, 2016

@HyukjinKwon please do not replace existing option names. Please just provide aliases. This can be done entirely inside CSVParameters.scala

@HyukjinKwon
Member Author

@falaki Ah, right.

@HyukjinKwon HyukjinKwon changed the title [SPARK-12668] Renaming CSV options to be similar to Pandas and R [SPARK-12668] Providing aliases for CSV options to be similar to Pandas and R Jan 18, 2016
@@ -44,9 +44,9 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L
}
}

-  val delimiter = CSVTypeCast.toChar(parameters.getOrElse("delimiter", ","))
+  val seq = CSVTypeCast.toChar(parameters.getOrElse("seq", ","))
Contributor


I think it is "sep", not "seq".

Also, I like the old name for the variable better. We should just rename the option itself. And for backward compatibility, we should accept "delimiter" if "sep" is not set.

@rxin
Contributor

rxin commented Jan 18, 2016

Do you know what happened to the compression codec?

@HyukjinKwon
Member Author

Sorry, I misunderstood the issue. For the codec, I assumed it was removed intentionally, since it can be set via the Hadoop configuration, and I thought the JSON datasource does not expose it as an option for the same reason.

@rxin
Contributor

rxin commented Jan 18, 2016

It'd be good to support that - but maybe we can do it in a separate pr.

@HyukjinKwon
Member Author

Sure. Do you think the JSON datasource should support that one as well?

@rxin
Contributor

rxin commented Jan 18, 2016

Yes I think so. But let's do that as a separate thing.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49572 has finished for PR 10800 at commit 8992d41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-12668] Providing aliases for CSV options to be similar to Pandas and R [SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R Jan 18, 2016
@HyukjinKwon
Member Author

Hm.. do you know anything about the ignored test `test different encoding`? When I enable and run it, it throws the exception below.

00:17:40.263 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.InvalidClassException: org.apache.spark.sql.execution.datasources.csv.CSVRelation; unable to create instance
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1788)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

I am looking into this, but I wondered if you already know the cause.

@HyukjinKwon
Member Author

Ah.. this was because CSVParameters (a member variable of CSVRelation) is not serializable but gets shipped out of the driver side. Let me fix this here and change the `ignore` to `test`.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49587 has finished for PR 10800 at commit cb0c9a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -44,9 +44,11 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L
}
}

-  val delimiter = CSVTypeCast.toChar(parameters.getOrElse("delimiter", ","))
+  val delimiter = CSVTypeCast.toChar(
+    parameters.getOrElse("delimiter", parameters.getOrElse("sep", ",")))
Contributor


We should take "sep" first, and then "delimiter", since "sep" is the new canonical option now.
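The suggested precedence can be sketched as follows (an illustrative sketch only, not the actual Spark code): because `getOrElse` checks its first key before falling back, putting "sep" in the outer call makes it win when both options are supplied.

```scala
// Illustrative sketch: "sep" is the canonical option, so it takes
// precedence when both "sep" and "delimiter" are set.
def resolveDelimiter(parameters: Map[String, String]): String =
  parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))

// resolveDelimiter(Map("sep" -> ";", "delimiter" -> "|"))  // ";"
// resolveDelimiter(Map("delimiter" -> "|"))                // "|"
```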

Contributor


While you are at it, can you make the parameters map transient, and make this not a case class?

Member Author


Sure.

@HyukjinKwon
Member Author

@rxin Can I make it like JSONOptions in a separate PR?
An instance of this class is passed to the executor side, so it has to stay serializable (as you know, case classes are serializable by default). So if I change it here, I would have to make unnecessary changes so that the class stays serializable, or so that the object is not passed out at all.
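The shape rxin suggests can be sketched as a plain (non-case) class whose raw options map is transient, so serializing an instance ships only the derived fields. This is an illustrative sketch under that assumption, not the actual Spark code; the defaults and the use of plain `String` instead of `CSVTypeCast.toChar` are simplifications.

```scala
// Illustrative sketch: a plain class with a @transient options map.
// The map is consumed eagerly on the driver to initialize plain vals,
// so only those vals need to be serialized to executors.
private[sql] class CSVParameters(@transient private val parameters: Map[String, String])
    extends Serializable {

  // Derived eagerly; these fields serialize without the backing map.
  val delimiter: String =
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))
  val charset: String =
    parameters.getOrElse("encoding", parameters.getOrElse("charset", "UTF-8"))
}
```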

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49652 has finished for PR 10800 at commit c52f1fc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49653 has finished for PR 10800 at commit 63b9ee8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

This style failure is occurring in sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala, which I did not change, and it also fails in my local test. I am looking into this.

@HyukjinKwon
Member Author

I am not sure why I am hitting this issue, but I corrected some imports into alphabetical order in SparkStrategies and InnerJoinSuite.

This was caused by [HOT][BUILD], which changed the import order.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49658 has finished for PR 10800 at commit 976e3af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 19, 2016

Thanks - I'm going to merge this.

@asfgit asfgit closed this in 453dae5 Jan 19, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-12668 branch September 23, 2016 18:28

4 participants