[SPARK-56122][PS] Use pandas-aware numeric dtype check in Series.cov #54933
Closed
ueshin wants to merge 1 commit into apache:master from ueshin/issues/SPARK-56122/cov
Conversation
dongjoon-hyun approved these changes on Mar 21, 2026
Merged to master. Thank you, @ueshin .
terana pushed a commit to terana/spark that referenced this pull request on Mar 23, 2026
### What changes were proposed in this pull request?
This PR updates `pyspark.pandas.Series.cov` to use pandas' `is_numeric_dtype` instead of `np.issubdtype(..., np.number)` when validating the input Series dtypes.
The previous NumPy-based check works for plain NumPy dtypes, but it raises a `TypeError` for pandas extension dtypes such as `StringDtype`, which causes `Series.cov` to fail before it reaches the intended `unsupported dtype` error path.
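The contrast between the two checks can be sketched with plain pandas, independent of pyspark.pandas (a minimal illustration, not the PR's actual diff):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A Series backed by the pandas StringDtype extension dtype.
s = pd.Series(["a", "b", "c"], dtype="string")

# Old check: np.issubdtype cannot interpret pandas extension dtypes
# and raises before any "unsupported dtype" handling can run.
try:
    np.issubdtype(s.dtype, np.number)
except TypeError as exc:
    print("np.issubdtype raised:", exc)

# New check: pandas-aware, simply returns False for non-numeric dtypes.
print("is_numeric_dtype:", is_numeric_dtype(s.dtype))
```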
This PR also updates the related `Series.cov` tests to:
- accept the dtype name shown by both pandas 2 (`object`) and pandas 3 (`str`) for non-numeric string inputs
- add extension dtype coverage for nullable `Float64`
- add a string extension dtype case using pandas `"string"`
### Why are the changes needed?
With pandas 3, `Series(["a", "b", "c"]).dtype` is a pandas extension dtype (`StringDtype`), and `np.issubdtype(self.dtype, np.number)` raises:
`TypeError: Cannot interpret '<StringDtype(na_value=nan)>' as a data type`
That means `Series.cov` fails in dtype validation instead of returning the expected `TypeError("unsupported dtype: ...")`.
Using `is_numeric_dtype` is more appropriate here because it is pandas-aware and works with both NumPy dtypes and pandas extension dtypes. This keeps the existing behavior for numeric inputs while making the validation path robust across pandas versions.
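As a quick sanity check (plain pandas, outside Spark), the pandas-aware predicate accepts both dtype families, including the nullable `Float64` case added to the tests:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Plain NumPy dtype: numeric.
print(is_numeric_dtype(pd.Series([1, 2, 3]).dtype))

# Nullable extension dtype: also recognized as numeric.
print(is_numeric_dtype(pd.Series([1.5, None], dtype="Float64").dtype))

# String extension dtype: correctly reported as non-numeric.
print(is_numeric_dtype(pd.Series(["x"], dtype="string").dtype))
```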
### Does this PR introduce _any_ user-facing change?
Yes.
For pandas extension dtypes such as pandas 3 string dtype, `pyspark.pandas.Series.cov` now raises the intended `TypeError("unsupported dtype: ...")` instead of failing earlier with a NumPy dtype interpretation error.
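The intended behavior can be sketched as follows; `_check_cov_dtypes` is a hypothetical standalone helper for illustration only (the real validation lives inside `pyspark.pandas.Series.cov`):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def _check_cov_dtypes(this: pd.Series, other: pd.Series) -> None:
    # Hypothetical helper mirroring the intended validation path:
    # with is_numeric_dtype, a non-numeric extension dtype now reaches
    # the "unsupported dtype" error instead of crashing in np.issubdtype.
    for s in (this, other):
        if not is_numeric_dtype(s.dtype):
            raise TypeError("unsupported dtype: %s" % s.dtype)

try:
    _check_cov_dtypes(pd.Series([1.0, 2.0]), pd.Series(["a", "b"], dtype="string"))
except TypeError as exc:
    print(exc)  # hits the intended "unsupported dtype: ..." error
```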
### How was this patch tested?
Added the related tests; the other existing tests should continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex GPT-5
Closes apache#54933 from ueshin/issues/SPARK-56122/cov.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>