
[SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R #10800

Closed
wants to merge 7 commits into from

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-12668

Spark's CSV datasource has recently been merged (filed in SPARK-12420). This is a quick PR that simply adds aliases for several CSV options so they are similar to Pandas and R:

  • Alias for delimiter -> sep
  • charset -> encoding
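The aliasing described above can be sketched as a small helper that resolves an option through its legacy name (this is an illustrative sketch only, not the actual Spark code; the object and parameter names are made up):

```scala
// Illustrative sketch: resolve an option value, preferring the new
// canonical name ("sep", "encoding") over the legacy one.
object OptionAliases {
  def resolve(parameters: Map[String, String],
              canonical: String,
              legacy: String,
              default: String): String =
    parameters.getOrElse(canonical, parameters.getOrElse(legacy, default))
}

// OptionAliases.resolve(Map("sep" -> "|"), "sep", "delimiter", ",")       // "|"
// OptionAliases.resolve(Map("delimiter" -> "|"), "sep", "delimiter", ",") // "|"
// OptionAliases.resolve(Map.empty, "sep", "delimiter", ",")               // ","
```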

@falaki
Contributor

falaki commented Jan 18, 2016

@HyukjinKwon please do not replace existing option names. Please just provide aliases. This can be done entirely inside CSVParameters.scala

@HyukjinKwon
Member Author

@falaki Ah, right.

@HyukjinKwon HyukjinKwon changed the title [SPARK-12668] Renaming CSV options to be similar to Pandas and R [SPARK-12668] Providing aliases for CSV options to be similar to Pandas and R Jan 18, 2016
@@ -44,9 +44,9 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L
}
}

-  val delimiter = CSVTypeCast.toChar(parameters.getOrElse("delimiter", ","))
+  val seq = CSVTypeCast.toChar(parameters.getOrElse("seq", ","))
Contributor


I think it is "sep", not "seq".

Also, I like the old name for the variable better. We should just rename the option itself. And for backward compatibility, we should accept "delimiter" if "sep" is not set.

@rxin
Contributor

rxin commented Jan 18, 2016

Do you know what happened to the compression codec?

@HyukjinKwon
Member Author

Sorry, I misunderstood the issue. For the codec, I assumed it was removed intentionally, since it can be set via the Hadoop configuration, and I thought the JSON datasource does not expose it as an option for the same reason.

@rxin
Contributor

rxin commented Jan 18, 2016

It'd be good to support that - but maybe we can do it in a separate pr.

@HyukjinKwon
Member Author

Sure. Do you think the JSON datasource should support that one as well?

@rxin
Contributor

rxin commented Jan 18, 2016

Yes I think so. But let's do that as a separate thing.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49572 has finished for PR 10800 at commit 8992d41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-12668] Providing aliases for CSV options to be similar to Pandas and R [SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R Jan 18, 2016
@HyukjinKwon
Member Author

Hm.. do you know anything about the ignored test `test different encoding`? When I enable and run it, it throws the exception below.

00:17:40.263 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.InvalidClassException: org.apache.spark.sql.execution.datasources.csv.CSVRelation; unable to create instance
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1788)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

I am looking into this, but I wondered if you already know the cause.

@HyukjinKwon
Member Author

Ah.. this was because CSVParameters (a member variable of CSVRelation) is not serializable but gets shipped out of the driver side. Let me fix this here and change the `ignore` to `test`.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49587 has finished for PR 10800 at commit cb0c9a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -44,9 +44,11 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L
}
}

-  val delimiter = CSVTypeCast.toChar(parameters.getOrElse("delimiter", ","))
+  val delimiter = CSVTypeCast.toChar(
+    parameters.getOrElse("delimiter", parameters.getOrElse("sep", ",")))
Contributor


We should take "sep" first, and then "delimiter", since "sep" is the new canonical option now.
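The suggested precedence can be sketched as follows (an illustrative sketch only, not the actual Spark code): because `getOrElse` checks its first key before falling back, putting "sep" in the outer call makes it win when both options are supplied.

```scala
// Illustrative sketch: "sep" is the canonical option, so it takes
// precedence when both "sep" and "delimiter" are set.
def resolveDelimiter(parameters: Map[String, String]): String =
  parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))

// resolveDelimiter(Map("sep" -> ";", "delimiter" -> "|"))  // ";"
// resolveDelimiter(Map("delimiter" -> "|"))                // "|"
```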

Contributor


While you are at it, can you make the parameters map transient, and make this not a case class?

Member Author


Sure.

@HyukjinKwon
Member Author

@rxin Can I make it like JSONOptions in a separate PR?
An instance of this class is passed to the executor side, so it has to stay serializable (as you know, case classes are serializable by default). So if I change it here, I would have to make unnecessary changes so that the class stays serializable, or so that the object is not passed out at all.
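The shape rxin suggests can be sketched as a plain (non-case) class whose raw options map is transient, so serializing an instance ships only the derived fields. This is an illustrative sketch under that assumption, not the actual Spark code; the defaults and the use of plain `String` instead of `CSVTypeCast.toChar` are simplifications.

```scala
// Illustrative sketch: a plain class with a @transient options map.
// The map is consumed eagerly on the driver to initialize plain vals,
// so only those vals need to be serialized to executors.
private[sql] class CSVParameters(@transient private val parameters: Map[String, String])
    extends Serializable {

  // Derived eagerly; these fields serialize without the backing map.
  val delimiter: String =
    parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))
  val charset: String =
    parameters.getOrElse("encoding", parameters.getOrElse("charset", "UTF-8"))
}
```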

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49652 has finished for PR 10800 at commit c52f1fc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49653 has finished for PR 10800 at commit 63b9ee8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

This style failure is occurring in sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala, which I did not change, and it also fails in my local test. I am looking into this.

@HyukjinKwon
Member Author

I am not sure why I am hitting this issue, but I corrected some imports into alphabetical order in SparkStrategies and InnerJoinSuite.

This was caused by [HOT][BUILD], which changed the import order.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49658 has finished for PR 10800 at commit 976e3af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 19, 2016

Thanks - I'm going to merge this.

@asfgit asfgit closed this in 453dae5 Jan 19, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-12668 branch September 23, 2016 18:28

4 participants