[SPARK-17024][SQL] Weird behaviour of the DataFrame when a column name contains dots. #14736

Closed
izeigerman wants to merge 1 commit

@izeigerman izeigerman commented Aug 20, 2016

What changes were proposed in this pull request?

Spark SQL doesn't support field names that contain dots. This is not limited to queries such as select; it affects any manipulation of the dataset.
Here is an example dataset:

field1,field1.some,field2,field3.some
"field1","field1.some","field2","field3.some"

And a code snippet:

scala> spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv").collect

The result of this operation:

org.apache.spark.sql.AnalysisException: Can't extract value from field1#0;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:253)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:252)
  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
  at scala.collection.immutable.List.foldLeft(List.scala:84)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:252)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
….

The following code fails with the same error:

scala> val df = spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [field1: string, field1.some: string ... 2 more fields]

scala> df.select("field1", "`field1.some`", "field2", "`field3.some`").collect

This patch makes LogicalPlan treat a dot-separated string as a single attribute name when nested field resolution fails.
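
To illustrate the idea, here is a minimal, self-contained sketch of that fallback. It is not the actual patch and does not use Catalyst; `Field`, `Leaf`, `Struct`, and `DottedNameResolution` are hypothetical names introduced only for this example. The name is first interpreted as a path into nested fields, and only if that fails is the whole string looked up as a top-level column name.

```scala
// Simplified schema model (hypothetical, not Catalyst classes):
// a column is either a plain leaf or a struct with nested fields.
sealed trait Field { def name: String }
final case class Leaf(name: String) extends Field
final case class Struct(name: String, children: Seq[Field]) extends Field

object DottedNameResolution {
  // Interpret "a.b.c" as nested access: column a, then field b, then field c.
  // Returns None as soon as a part is missing or a non-struct is dereferenced.
  def resolveNested(schema: Seq[Field], name: String): Option[Field] =
    name.split('.').toList match {
      case Nil => None
      case head :: rest =>
        rest.foldLeft(schema.find(_.name == head)) {
          case (Some(Struct(_, children)), part) => children.find(_.name == part)
          case _                                 => None
        }
    }

  // Fallback: if nested resolution fails, treat the whole dotted string
  // as the name of a single top-level column.
  def resolve(schema: Seq[Field], name: String): Option[Field] =
    resolveNested(schema, name).orElse(schema.find(_.name == name))
}
```

With the four columns from the CSV header above modeled as leaves, `resolve(schema, "field1.some")` first tries to read a field `some` out of the string column `field1`, fails, and then falls back to the column literally named `field1.some`.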

How was this patch tested?

Existing tests. Also tested with the CSV file mentioned above in CSVSuite (not committed). I'm not sure exactly where I should put a test for this; LogicalPlanSuite doesn't look like an appropriate place.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

HyukjinKwon commented Aug 21, 2016

spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test.csv")
  .show()

prints the results as below:

+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+

and

spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test.csv")
  .select("field1", "`field1.some`", "field2", "`field3.some`")
  .show()

prints as below:

+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+

in the current master. Could you please confirm this?

@izeigerman
Author

Thanks @HyukjinKwon, for some reason I tested the old code. Neither master nor branch-2.0 has this issue, but it is still present in the 2.0 release.

@izeigerman izeigerman closed this Aug 21, 2016
@izeigerman izeigerman deleted the iaroslav/spark-17024 branch August 21, 2016 13:47