[SPARK-19773][SparkR] SparkDataFrame should not allow duplicate names #17105

Closed
wants to merge 3 commits into from

actuaryzhang
Contributor

What changes were proposed in this pull request?

A SparkDataFrame in SparkR accepts duplicate column names at creation, but errors occur when downstream methods are called on it. For example, the following works:

l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
head(df)

But an error occurs when we do df$a = df$a * 2.0.

In this PR, I add a validity check for duplicate names at initialization.
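
The idea of rejecting duplicate names at initialization can be illustrated outside of SparkR. The following is a minimal Python sketch, not the code from this PR; `find_duplicate_names` and `validate_column_names` are hypothetical helper names chosen for illustration.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def validate_column_names(names):
    """Hypothetical init-time check: reject a schema with repeated names."""
    duplicates = find_duplicate_names(names)
    if duplicates:
        raise ValueError(
            "found duplicated column name(s): " + ", ".join(duplicates)
        )
    return list(names)
```

With this sketch, `validate_column_names(["a", "b"])` returns the names unchanged, while `validate_column_names(["a", "a"])` raises a `ValueError` before any downstream method is called.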

How was this patch tested?

New tests.

@actuaryzhang
Contributor Author

@felixcheung

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73609 has finished for PR 17105 at commit df61c70.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@felixcheung Ahh, it seems that we have some conflicting design issues.

  1. The tests for collect() and crossJoin suggest that duplicate names in a SparkDataFrame are allowed by design:
     # collect() correctly handles multiple columns with same name
     df <- createDataFrame(list(list(1, 2)), schema = c("name", "name"))
     ldf <- collect(df)
     expect_equal(names(ldf), c("name", "name"))
  2. However, the mutate method does not allow duplicate names:
     # Check if there is any duplicated column name in the DataFrame
     dfCols <- columns(x)
     if (length(unique(dfCols)) != length(dfCols)) {
       stop("Error: found duplicated column name in the DataFrame")
     }

Do you know the reasoning behind this conflicting design? I think it's best not to allow duplicate names, since they do not work with Spark SQL (which does not allow duplicate names). For example, we cannot even extract a column; the following reports an error:

l <- list(list(1, 2), list(3, 4))
df <- createDataFrame(l, c("a", "a"))
df$a

What do you think?

@felixcheung
Member

felixcheung commented Mar 1, 2017

@actuaryzhang there's a bit of a history about this... but long story short, Spark does support DataFrame with multiple columns having the same name, for example

# in pyspark
>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *
>>> data = [(1, 2, 'Foo')]
>>> df = spark.createDataFrame(data, ("key", "key", "value"))
>>> df
DataFrame[key: bigint, key: bigint, value: string]

Each column gets a unique id, so under the covers they are not actually duplicates.

The reason you are getting an error with df$a = df$a * 2.0 is that "a" is not a fully unique id. You get the same error in Python:

>>> from pyspark.sql.functions import col
>>> df.select(col("key"))
...
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Reference 'key' is ambiguous, could be: key#0L, key#1L.;"

And in R, df$a is essentially a shortcut to that and so it will also fail similarly.
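
The behaviour described above can be modelled with a small toy sketch in plain Python. This is not Spark's analyzer; `make_schema` and `resolve` are hypothetical names, and using the column position as the unique id is an assumption made only for illustration.

```python
def make_schema(names):
    """Pair each column name with a unique id (here, its position)."""
    return [(name, col_id) for col_id, name in enumerate(names)]


def resolve(schema, name):
    """Resolve a bare name to exactly one (name, id) pair, or fail."""
    matches = [column for column in schema if column[0] == name]
    if not matches:
        raise LookupError(f"cannot resolve '{name}'")
    if len(matches) > 1:
        # Same-named columns are distinct internally, but the bare name
        # alone does not say which one is meant.
        candidates = ", ".join(f"{n}#{i}" for n, i in matches)
        raise LookupError(
            f"Reference '{name}' is ambiguous, could be: {candidates}"
        )
    return matches[0]


schema = make_schema(["key", "key", "value"])
value_column = resolve(schema, "value")  # unambiguous, so this succeeds
```

In this model the two "key" columns coexist without conflict because their ids differ; only a lookup by the bare name "key" fails, which is the same shape of failure as the AnalysisException shown earlier.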

As for why it is disallowed in mutate: it is just an artifact of the implementation. I think we could potentially implement it to support duplicated names.

@felixcheung
Member

felixcheung commented Mar 1, 2017

So, tl;dr: I think we should support duplicated names, like everything else in Spark does.

@actuaryzhang
Contributor Author

@felixcheung Thanks for the clarification. I will close this then.

@actuaryzhang actuaryzhang deleted the sparkRDataFrameValid branch May 15, 2017 06:04
3 participants