
[SPARK-41831][CONNECT] DataFrame.select to take a single list of columns#39417

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-41831

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to make DataFrame.select accept a single list of columns, as regular PySpark supports.
Before this fix, the doctest fails as below:

File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1269, in pyspark.sql.connect.dataframe.DataFrame.transform
Failed example:
    df.transform(cast_all_to_int).transform(sort_columns_asc).show()
Exception raised:
    Traceback (most recent call last):
      File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.transform[4]>", line 1, in <module>
        df.transform(cast_all_to_int).transform(sort_columns_asc).show()
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1203, in transform
        result = func(self, *args, **kwargs)
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.transform[2]>", line 2, in cast_all_to_int
        return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 95, in select
        return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
      File "/.../spark/python/pyspark/sql/connect/plan.py", line 348, in __init__
        self._verify_expressions()
      File "/.../spark/python/pyspark/sql/connect/plan.py", line 354, in _verify_expressions
        raise InputValidationError(
    pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'(ColumnReference(int) (int))'>, Column<'(ColumnReference(float) (int))'>]'.
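The fix amounts to flattening the argument tuple when select is called with a single list, so that select([a, b]) behaves like select(a, b). A minimal standalone sketch of that flattening logic (the helper name _flatten_cols is hypothetical, not the name used in the PR):

```python
from typing import Any, List, Tuple


def _flatten_cols(cols: Tuple[Any, ...]) -> List[Any]:
    # Hypothetical helper: if select() received exactly one positional
    # argument and it is a list, unpack it so each element is treated
    # as an individual column; otherwise keep the arguments as given.
    if len(cols) == 1 and isinstance(cols[0], list):
        return cols[0]
    return list(cols)


def select(*cols: Any) -> List[Any]:
    # Sketch of the call site: after flattening, the projection sees
    # individual Column/str values rather than a single list, which is
    # what tripped the InputValidationError above.
    return _flatten_cols(cols)
```

With this, both df.select("a", "b") and df.select(["a", "b"]) resolve to the same projection list, matching the regular PySpark behavior.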

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

How was this patch tested?

Manually tested as below:

./python/run-tests --testnames 'pyspark.sql.connect.dataframe'

@HyukjinKwon
Member Author

cc @zhengruifeng
