[SPARK-56122][PS] Use pandas-aware numeric dtype check in Series.cov #54933
Closed
ueshin wants to merge 1 commit into apache:master from ueshin/issues/SPARK-56122/cov
Conversation
dongjoon-hyun approved these changes on Mar 21, 2026
Merged to master. Thank you, @ueshin .
terana pushed a commit to terana/spark that referenced this pull request on Mar 23, 2026
### What changes were proposed in this pull request?
This PR updates `pyspark.pandas.Series.cov` to use pandas' `is_numeric_dtype` instead of `np.issubdtype(..., np.number)` when validating the input Series dtypes.
The previous NumPy-based check works for plain NumPy dtypes, but it raises a `TypeError` for pandas extension dtypes such as `StringDtype`, which causes `Series.cov` to fail before it reaches the intended `unsupported dtype` error path.
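The contrast between the two checks can be sketched with plain pandas, independent of pyspark.pandas (a minimal illustration, not the PR's actual diff):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A Series backed by the pandas StringDtype extension dtype.
s = pd.Series(["a", "b", "c"], dtype="string")

# Old check: np.issubdtype cannot interpret pandas extension dtypes
# and raises before any "unsupported dtype" handling can run.
try:
    np.issubdtype(s.dtype, np.number)
except TypeError as exc:
    print("np.issubdtype raised:", exc)

# New check: pandas-aware, simply returns False for non-numeric dtypes.
print("is_numeric_dtype:", is_numeric_dtype(s.dtype))
```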
This PR also updates the related `Series.cov` tests to:
- accept the dtype name shown by both pandas 2 (`object`) and pandas 3 (`str`) for non-numeric string inputs
- add extension dtype coverage for nullable `Float64`
- add a string extension dtype case using pandas `"string"`
### Why are the changes needed?
With pandas 3, `Series(["a", "b", "c"]).dtype` is a pandas extension dtype (`StringDtype`), and `np.issubdtype(self.dtype, np.number)` raises:
`TypeError: Cannot interpret '<StringDtype(na_value=nan)>' as a data type`
That means `Series.cov` fails in dtype validation instead of returning the expected `TypeError("unsupported dtype: ...")`.
Using `is_numeric_dtype` is more appropriate here because it is pandas-aware and works with both NumPy dtypes and pandas extension dtypes. This keeps the existing behavior for numeric inputs while making the validation path robust across pandas versions.
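As a quick sanity check (plain pandas, outside Spark), the pandas-aware predicate accepts both dtype families, including the nullable `Float64` case added to the tests:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Plain NumPy dtype: numeric.
print(is_numeric_dtype(pd.Series([1, 2, 3]).dtype))

# Nullable extension dtype: also recognized as numeric.
print(is_numeric_dtype(pd.Series([1.5, None], dtype="Float64").dtype))

# String extension dtype: correctly reported as non-numeric.
print(is_numeric_dtype(pd.Series(["x"], dtype="string").dtype))
```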
### Does this PR introduce _any_ user-facing change?
Yes.
For pandas extension dtypes such as pandas 3 string dtype, `pyspark.pandas.Series.cov` now raises the intended `TypeError("unsupported dtype: ...")` instead of failing earlier with a NumPy dtype interpretation error.
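The intended behavior can be sketched as follows; `_check_cov_dtypes` is a hypothetical standalone helper for illustration only (the real validation lives inside `pyspark.pandas.Series.cov`):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def _check_cov_dtypes(this: pd.Series, other: pd.Series) -> None:
    # Hypothetical helper mirroring the intended validation path:
    # with is_numeric_dtype, a non-numeric extension dtype now reaches
    # the "unsupported dtype" error instead of crashing in np.issubdtype.
    for s in (this, other):
        if not is_numeric_dtype(s.dtype):
            raise TypeError("unsupported dtype: %s" % s.dtype)

try:
    _check_cov_dtypes(pd.Series([1.0, 2.0]), pd.Series(["a", "b"], dtype="string"))
except TypeError as exc:
    print(exc)  # hits the intended "unsupported dtype: ..." error
```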
### How was this patch tested?
Added the related tests; the other existing tests should continue to pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex GPT-5
Closes apache#54933 from ueshin/issues/SPARK-56122/cov.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>