[SPARK-17024][SQL] Weird behaviour of the DataFrame when a column name contains dots. #14736

izeigerman · 2016-08-20T20:15:01Z

What changes were proposed in this pull request?

Spark SQL doesn't support field names that contain dots. This affects not only queries such as select but any manipulation of the dataset.
Here is a dataset example:

field1,field1.some,field2,field3.some
"field1","field1.some","field2","field3.some"

And a code snippet:

scala> spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv").collect

The result of this operation:

org.apache.spark.sql.AnalysisException: Can't extract value from field1#0;
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:253)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:252)
  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
  at scala.collection.immutable.List.foldLeft(List.scala:84)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:252)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
….

The following code fails with the same error:

scala> val df = spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [field1: string, field1.some: string ... 2 more fields]

scala> df.select("field1", "`field1.some`", "field2", "`field3.some`").collect

This patch makes LogicalPlan treat a dot-separated string as a single attribute name when nested-field resolution fails.
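
To illustrate the fallback idea described above, here is a minimal sketch in plain Scala. It is not Spark's actual code: `Field`, `Leaf`, `Struct`, `resolveNested` and `resolve` are hypothetical names invented for this example, and the schema model is heavily simplified. It only shows the order of attempts: nested-field resolution first, then the whole dotted string as one attribute name.

```scala
// Hypothetical, simplified model of attribute resolution; not Spark's implementation.
object DottedNameResolution {
  // A column is either a leaf or a struct with named children.
  sealed trait Field { def name: String }
  final case class Leaf(name: String) extends Field
  final case class Struct(name: String, children: Seq[Field]) extends Field

  // Try to resolve `parts` as a path through nested struct fields.
  private def resolveNested(schema: Seq[Field], parts: List[String]): Option[Field] =
    parts match {
      case Nil => None
      case head :: tail =>
        schema.find(_.name == head).flatMap {
          case f if tail.isEmpty   => Some(f)
          case Struct(_, children) => resolveNested(children, tail)
          case _: Leaf             => None // cannot extract a value from a non-struct field
        }
    }

  // First attempt nested resolution; if that fails, fall back to
  // treating the whole dot-separated string as a single attribute name.
  def resolve(schema: Seq[Field], name: String): Option[Field] =
    resolveNested(schema, name.split('.').toList)
      .orElse(schema.find(_.name == name))
}
```

With a flat schema containing both `field1` and `field1.some`, the nested attempt finds `field1`, cannot descend into it, and the fallback then matches the column literally named `field1.some`.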

How was this patch tested?

Existing tests. I also tested with the CSV file above in CSVSuite (test not committed). I'm not sure where exactly a test for this should go; LogicalPlanSuite doesn't look like an appropriate place.


AmplabJenkins · 2016-08-20T20:17:13Z

Can one of the admins verify this patch?

HyukjinKwon · 2016-08-21T10:18:57Z

spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test.csv")
  .show()

prints the results as below:

+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+

and

spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/test.csv")
  .select("field1", "`field1.some`", "field2", "`field3.some`")
  .show()

prints as below:

+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+
|field1|field1.some|field2|field3.some|
+------+-----------+------+-----------+

in the current master. Could you please confirm this?
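
For context on why the backticks in the select call matter, here is a rough sketch of how a column reference can be split into name parts: dots separate nested fields unless they appear inside backticks. This is a simplified illustration in plain Scala, not Spark's own parser. `QuotedNameParser` and `parse` are hypothetical names, and the sketch assumes balanced backticks with no escaping.

```scala
// Hypothetical, simplified name parser; assumes balanced, unescaped backticks.
object QuotedNameParser {
  // Split a column reference on dots, except for dots inside backticks.
  def parse(name: String): Seq[String] = {
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inBackticks = false
    for (c <- name) c match {
      case '`' =>
        inBackticks = !inBackticks // toggle quoting; the backtick itself is dropped
      case '.' if !inBackticks =>
        parts += current.toString  // an unquoted dot ends the current part
        current.clear()
      case other =>
        current.append(other)
    }
    parts += current.toString
    parts.toSeq
  }
}
```

Under this sketch, a quoted reference such as `` `field1.some` `` stays a single name part, while an unquoted `field1.some` splits into `field1` and `some` and would be read as a nested-field access.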

izeigerman · 2016-08-21T13:47:45Z

Thanks @HyukjinKwon, for some reason I tested against old code. Neither master nor branch-2.0 has this issue, but it is still present in the 2.0 release.

fix attribute resolution for the Logical Plan in case when attributes contain dots in their names.

6059dfc

izeigerman closed this Aug 21, 2016

izeigerman deleted the iaroslav/spark-17024 branch August 21, 2016 13:47
