Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

We can see the DataFrame API:
def select(col: String, cols: String*)
such definition can prevent user to call select in such way: df.select( )

but, currently we can still use df.select( ) and pass compiling,
because it match the API
def select(cols: Column*)

so, my modification is, add an API such as:
def select(col: Column, cols: Column*)
and change def select(cols: Column*) into private[spark] def select(cols: Column*)
so that the public select API can only be called with non-empty param list.

How was this patch tested?

Existing test.

@WeichenXu123 WeichenXu123 changed the title [SPARK-17046][SQL] prevent_user_call_df_select_will_empty_paramlist [SPARK-17046][SQL] prevent user using dataframe.select with empty param list Aug 13, 2016
@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63723 has finished for PR 14629 at commit 1179249.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the prevent_user_call_df_select_will_empty_paramlist branch from 1179249 to 8060aa4 Compare August 13, 2016 09:22
@WeichenXu123 WeichenXu123 changed the title [SPARK-17046][SQL] prevent user using dataframe.select with empty param list [WIP][SPARK-17046][SQL] prevent user using dataframe.select with empty param list Aug 13, 2016
@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63724 has finished for PR 14629 at commit 8060aa4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

@srowen
How do you think about this problem?
I found adding two method like
def select(cols: Column*)
def select(col: Column, cols: Column*)
causing ambiguous,
I prepare to change def select(cols: Column*) into def selectInternal(cols: Column*)
and mark selectInternal as private[spark]
do you think it reasonable?

@rxin
Copy link
Contributor

rxin commented Aug 13, 2016

Why do we want to enforce this? It is valid to have a DataFrame without any columns.

@srowen
Copy link
Member

srowen commented Aug 13, 2016

Yes that's a good question. A 0-column DataFrame is valid, though that's a little different from being able to select 0 columns from a DataFrame. I don't have a database handy, but can you select no columns in any SQL syntax? Maybe best to emulate that?

@WeichenXu123
Copy link
Contributor Author

MySql do not allow select with 0 columns, and I think select() is useless, no one will do such operation, so, is it better to generate compiling error when detecting code use df.select() because it is usually a coding mistake?
Or, df.select() is useful in some particular scenario?

@hvanhovell
Copy link
Contributor

A df.select() without any columns is not useless IMO: You can still get a valid count() from a data frame.

@srowen
Copy link
Member

srowen commented Aug 14, 2016

Interesting point, yeah, because normally in an RDBMS you have to COUNT(*) or COUNT(1) and the argument is useless anyway, so would be nice to not have to provide an argument to select in this context. But while comparing to DataFrames, of course you should already count() directly if this is desired.

@srowen
Copy link
Member

srowen commented Aug 20, 2016

@hvanhovell @rxin unless you've changed your stance a little bit on this, I think the conclusion is that this isn't worth changing this behavior and we can close this @WeichenXu123

@hvanhovell
Copy link
Contributor

I haven't changed my mind of this. Lets close this one.

@WeichenXu123 WeichenXu123 deleted the prevent_user_call_df_select_will_empty_paramlist branch April 24, 2019 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants