[SPARK-19773][SparkR] SparkDataFrame should not allow duplicate names #17105

actuaryzhang · 2017-02-28T20:39:03Z

What changes were proposed in this pull request?

SparkDataFrame in SparkR accepts duplicate column names at creation, but downstream methods then fail with errors. For example, we can do:

l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
head(df)

But an error occurs when we do df$a = df$a * 2.0.

This PR adds a validity check for duplicate names at initialization.
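A validity check of this kind can be sketched as follows. This is a minimal illustration in plain Python rather than SparkR, so that it runs without a Spark session; `find_duplicate_names` and `validate_column_names` are hypothetical helper names and are not part of SparkR or of this patch.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    duplicates = []
    for name in names:
        if counts[name] > 1 and name not in duplicates:
            duplicates.append(name)
    return duplicates


def validate_column_names(names):
    """Hypothetical check: reject a schema that contains duplicated names."""
    duplicates = find_duplicate_names(names)
    if duplicates:
        raise ValueError(
            "found duplicated column name(s): " + ", ".join(duplicates)
        )
    return list(names)
```

With this sketch, `validate_column_names(["a", "b"])` returns the names unchanged, while `validate_column_names(["a", "a"])` raises a `ValueError` at construction time instead of failing later in a downstream method.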

How was this patch tested?

new tests.

actuaryzhang · 2017-02-28T20:40:42Z

@felixcheung

SparkQA · 2017-02-28T21:16:15Z

Test build #73609 has finished for PR 17105 at commit df61c70.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-02-28T22:04:03Z

@felixcheung Ahh, it seems that we have some conflicting design issues.

Judging from the tests for collect() and crossJoin, duplicate names in a SparkDataFrame appear to be allowed by design:

  # collect() correctly handles multiple columns with same name
  df <- createDataFrame(list(list(1, 2)), schema = c("name", "name"))
  ldf <- collect(df)
  expect_equal(names(ldf), c("name", "name"))

However, it seems that the mutate method does not allow dup names:

  # Check if there is any duplicated column name in the DataFrame
  dfCols <- columns(x)
  if (length(unique(dfCols)) != length(dfCols)) {
    stop("Error: found duplicated column name in the DataFrame")
  }

Do you know the reasoning behind this conflicting design? I think it's best not to allow dup names, since they do not work with Spark SQL (which does not allow dup names). For example, we cannot even extract a column; the following reports an error:

l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
df$a

What do you think?

felixcheung · 2017-03-01T05:58:13Z

@actuaryzhang there's a bit of history behind this... but long story short, Spark does support DataFrames with multiple columns having the same name, for example:

# in pyspark
>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import col
>>> data = [(1, 2, 'Foo')]
>>> df = spark.createDataFrame(data, ("key", "key", "value"))
>>> df
DataFrame[key: bigint, key: bigint, value: string]

Each column gets a unique id, so under the covers they are not actually duplicates.

The reason you get an error with df$a = df$a * 2.0 is that "a" alone is not a full unique id, so the reference is ambiguous. You get the same error in Python:

>>> df.select(col("key"))
...
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Reference 'key' is ambiguous, could be: key#0L, key#1L.;"

In R, df$a is essentially a shortcut for that, so it fails in the same way.
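The ambiguity can be illustrated with a toy model of name resolution. This is only a sketch in plain Python under stated assumptions: it is not how Spark's analyzer is implemented, and `make_columns`, `resolve`, and `AmbiguousReference` are hypothetical names invented for the illustration. The `name#index` ids merely imitate the look of ids such as `key#0L` in the error message above.

```python
class AmbiguousReference(Exception):
    """Raised when a bare column name matches more than one column."""


def make_columns(names):
    # Pair each column name with a unique id such as "key#0".
    # (Toy stand-in for the per-column ids Spark assigns internally.)
    return [(name, f"{name}#{i}") for i, name in enumerate(names)]


def resolve(columns, name):
    # Look a column up by its bare name and return its unique id.
    matches = [uid for col_name, uid in columns if col_name == name]
    if not matches:
        raise KeyError(name)
    if len(matches) > 1:
        raise AmbiguousReference(
            f"Reference '{name}' is ambiguous, could be: {', '.join(matches)}."
        )
    return matches[0]


cols = make_columns(["key", "key", "value"])
```

In this model, `resolve(cols, "value")` returns `"value#2"`, whereas `resolve(cols, "key")` raises `AmbiguousReference`, because the bare name matches both `key#0` and `key#1`. The frame itself is valid; only the lookup by bare name fails.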

As for why it is disallowed in mutate: that is just an artifact of the implementation. I think we could potentially implement it to support duplicated names.

felixcheung · 2017-03-01T05:58:36Z

So, tl;dr: I think we should support duplicated names, like everything else in Spark does.

actuaryzhang · 2017-03-01T07:49:11Z

@felixcheung Thanks for the clarification. I will close this then.

actuaryzhang added 3 commits February 28, 2017 09:25

f0f2dbb add dataframe validity check
21b8b4d add tests
df61c70 cleanup

actuaryzhang closed this Mar 1, 2017

actuaryzhang deleted the sparkRDataFrameValid branch May 15, 2017 06:04
