
[SPARK-41831][CONNECT] DataFrame.select to take a single list of columns#39417

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-41831

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to make DataFrame.select accept a single list of columns, as regular PySpark supports.
Before this fix, the doctest fails as below:

File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1269, in pyspark.sql.connect.dataframe.DataFrame.transform
Failed example:
    df.transform(cast_all_to_int).transform(sort_columns_asc).show()
Exception raised:
    Traceback (most recent call last):
      File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.transform[4]>", line 1, in <module>
        df.transform(cast_all_to_int).transform(sort_columns_asc).show()
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1203, in transform
        result = func(self, *args, **kwargs)
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.transform[2]>", line 2, in cast_all_to_int
        return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 95, in select
        return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
      File "/.../spark/python/pyspark/sql/connect/plan.py", line 348, in __init__
        self._verify_expressions()
      File "/.../spark/python/pyspark/sql/connect/plan.py", line 354, in _verify_expressions
        raise InputValidationError(
    pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'(ColumnReference(int) (int))'>, Column<'(ColumnReference(float) (int))'>]'.
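The fix amounts to flattening the argument tuple when select is called with a single list, so that select([a, b]) behaves like select(a, b). A minimal standalone sketch of that flattening logic (the helper name _flatten_cols is hypothetical, not the name used in the PR):

```python
from typing import Any, List, Tuple


def _flatten_cols(cols: Tuple[Any, ...]) -> List[Any]:
    # Hypothetical helper: if select() received exactly one positional
    # argument and it is a list, unpack it so each element is treated
    # as an individual column; otherwise keep the arguments as given.
    if len(cols) == 1 and isinstance(cols[0], list):
        return cols[0]
    return list(cols)


def select(*cols: Any) -> List[Any]:
    # Sketch of the call site: after flattening, the projection sees
    # individual Column/str values rather than a single list, which is
    # what tripped the InputValidationError above.
    return _flatten_cols(cols)
```

With this, both df.select("a", "b") and df.select(["a", "b"]) resolve to the same projection list, matching the regular PySpark behavior.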

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

How was this patch tested?

Manually tested as below:

./python/run-tests --testnames 'pyspark.sql.connect.dataframe'

@HyukjinKwon
Member Author

cc @zhengruifeng
