
[SPARK-41977][SPARK-41978][CONNECT] SparkSession.range to take float as arguments #39499

Closed

Conversation

HyukjinKwon (Member)

What changes were proposed in this pull request?

This PR proposes to make Spark Connect's SparkSession.range accept floats, e.g., spark.range(10e10).
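For example, a minimal sketch of the intended behaviour (the remote address is illustrative and assumes a local Spark Connect server is running):

```python
from pyspark.sql import SparkSession

# Illustrative: connect to a local Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# 10e10 is a Python float; with this change it is coerced to int,
# matching regular PySpark's behaviour.
df = spark.range(10e10)  # equivalent to spark.range(100_000_000_000)
```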

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

No to end users, since Spark Connect has not been released yet. SparkSession.range now allows floats.

How was this patch tested?

Unit tests were enabled back.

HyukjinKwon (Member, Author)

Merged to master.

dongjoon-hyun (Member) left a comment


May I ask a question?

  • The function range's signature is not changed. Does this introduce any behavior change?

zhengruifeng (Contributor)

@dongjoon-hyun

Good question. range in Connect has the same signature as PySpark's, which should only accept integers.

But PySpark's implementation doesn't check the input type:

```python
if numPartitions is None:
    numPartitions = self._sc.defaultParallelism
if end is None:
    # Every argument is passed through int(), so floats happen to work.
    jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions))
else:
    jdf = self._jsparkSession.range(int(start), int(end), int(step), int(numPartitions))
return DataFrame(jdf, self)
```

and in some cases/tests floats are used, like range(10e10).

We might need to update the typing for both Connect and PySpark in the future.
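Purely as an illustration of that possible typing change (hypothetical stub; the actual PySpark type hints declare these parameters as int), a widened signature could look like:

```python
from typing import Optional, Union

class SessionStub:
    # Hypothetical: widens the numeric parameters to reflect the legacy
    # int() coercion shown above. Not what the real stubs currently say.
    def range(
        self,
        start: Union[int, float],
        end: Optional[Union[int, float]] = None,
        step: Union[int, float] = 1,
        numPartitions: Optional[int] = None,
    ) -> "DataFrame":  # noqa: F821 - placeholder return type
        ...
```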

HyukjinKwon (Member, Author)

Yeah... so technically it should only take ints, as that's what the method wants. There are a lot of cases like that in PySpark (e.g., DataFrameReader.jdbc), and a lot of discussion has happened.

The take is: type the desired signature, and do not type the signatures that are only supported via casting (keep those as legacy behaviour). Here I fixed Spark Connect's range to match the legacy behaviour of regular PySpark's range.
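A minimal, self-contained sketch of the coercion behaviour being matched (the helper name is hypothetical; it mirrors the int() casts quoted from regular PySpark above, while the actual Spark Connect implementation differs in structure):

```python
from typing import Optional, Tuple

def _coerce_range_args(
    start: float,
    end: Optional[float] = None,
    step: float = 1,
    numPartitions: Optional[float] = None,
) -> Tuple[int, int, int, Optional[int]]:
    # Hypothetical helper: cast every numeric argument with int(), the
    # same legacy coercion regular PySpark applies in range().
    if end is None:
        actual_start, actual_end = 0, int(start)
    else:
        actual_start, actual_end = int(start), int(end)
    parts = int(numPartitions) if numPartitions is not None else None
    return actual_start, actual_end, int(step), parts

# Floats behave as in legacy PySpark: range(10e10) acts like range(0, 10**11).
assert _coerce_range_args(10e10) == (0, 100_000_000_000, 1, None)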

@HyukjinKwon HyukjinKwon deleted the SPARK-41977 branch January 15, 2024 00:52