
[SPARK-41977][SPARK-41978][CONNECT] SparkSession.range to take float as arguments #39499

Closed

Conversation

HyukjinKwon (Member)

What changes were proposed in this pull request?

This PR proposes to make Spark Connect's SparkSession.range accept floats, e.g., spark.range(10e10).
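For example, a minimal sketch of the intended behaviour (the remote address is illustrative and assumes a local Spark Connect server is running):

```python
from pyspark.sql import SparkSession

# Illustrative: connect to a local Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# 10e10 is a Python float; with this change it is coerced to int,
# matching regular PySpark's behaviour.
df = spark.range(10e10)  # equivalent to spark.range(100_000_000_000)
```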

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

No to end users, since Spark Connect has not been released yet. SparkSession.range now allows floats.

How was this patch tested?

Unit tests were enabled back.

HyukjinKwon (Member, Author)

Merged to master.

dongjoon-hyun (Member) left a comment


May I ask a question?

  • The function range's signature is not changed. Does this introduce any behavior change?

zhengruifeng (Contributor)

@dongjoon-hyun

Good question. range in Connect has the same signature as PySpark's, which should only accept integers.

But PySpark's implementation doesn't check the input type:

```python
if numPartitions is None:
    numPartitions = self._sc.defaultParallelism
if end is None:
    # Every argument is passed through int(), so floats happen to work.
    jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions))
else:
    jdf = self._jsparkSession.range(int(start), int(end), int(step), int(numPartitions))
return DataFrame(jdf, self)
```

and in some cases/tests floats are used, like range(10e10).

We might need to update the typing for both Connect and PySpark in the future.
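Purely as an illustration of that possible typing change (hypothetical stub; the actual PySpark type hints declare these parameters as int), a widened signature could look like:

```python
from typing import Optional, Union

class SessionStub:
    # Hypothetical: widens the numeric parameters to reflect the legacy
    # int() coercion shown above. Not what the real stubs currently say.
    def range(
        self,
        start: Union[int, float],
        end: Optional[Union[int, float]] = None,
        step: Union[int, float] = 1,
        numPartitions: Optional[int] = None,
    ) -> "DataFrame":  # noqa: F821 - placeholder return type
        ...
```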

HyukjinKwon (Member, Author)

Yeah... so technically it should only take ints, as that's what the method wants. There are a lot of cases like that in PySpark (e.g., DataFrameReader.jdbc), and a lot of discussion has happened.

The take is: type the desired signature, and do not type the signatures that are only supported via casting (keep those as legacy behaviour). Here I fixed Spark Connect's range to match the legacy behaviour of regular PySpark's range.
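A minimal, self-contained sketch of the coercion behaviour being matched (the helper name is hypothetical; it mirrors the int() casts quoted from regular PySpark above, while the actual Spark Connect implementation differs in structure):

```python
from typing import Optional, Tuple

def _coerce_range_args(
    start: float,
    end: Optional[float] = None,
    step: float = 1,
    numPartitions: Optional[float] = None,
) -> Tuple[int, int, int, Optional[int]]:
    # Hypothetical helper: cast every numeric argument with int(), the
    # same legacy coercion regular PySpark applies in range().
    if end is None:
        actual_start, actual_end = 0, int(start)
    else:
        actual_start, actual_end = int(start), int(end)
    parts = int(numPartitions) if numPartitions is not None else None
    return actual_start, actual_end, int(step), parts

# Floats behave as in legacy PySpark: range(10e10) acts like range(0, 10**11).
assert _coerce_range_args(10e10) == (0, 100_000_000_000, 1, None)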

@HyukjinKwon HyukjinKwon deleted the SPARK-41977 branch January 15, 2024 00:52