
[SPARK-56122][PS] Use pandas-aware numeric dtype check in Series.cov#54933

Closed
ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-56122/cov

Conversation


@ueshin ueshin commented Mar 20, 2026

What changes were proposed in this pull request?

This PR updates `pyspark.pandas.Series.cov` to use pandas' `is_numeric_dtype` instead of `np.issubdtype(..., np.number)` when validating the input Series dtypes.

The previous NumPy-based check works for plain NumPy dtypes, but it raises a `TypeError` for pandas extension dtypes such as `StringDtype`, which causes `Series.cov` to fail before it reaches the intended `unsupported dtype` error path.

This PR also updates the related `Series.cov` tests to:

- accept the dtype name shown by both pandas 2 (`object`) and pandas 3 (`str`) for non-numeric string inputs
- add extension dtype coverage for nullable `Float64`
- add a string extension dtype case using pandas `"string"`
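
The pandas 2 vs pandas 3 dtype-name difference the tests now account for can be checked directly in plain pandas (a minimal sketch; which name appears depends on the installed pandas version):

```python
import pandas as pd

# A plain string Series: pandas 2 infers object dtype, while pandas 3
# infers its string extension dtype (reported as "str").
s = pd.Series(["a", "b", "c"])
assert s.dtype.name in ("object", "str")

# The explicit extension dtype case added to the tests: dtype="string"
# yields a StringDtype on both pandas 2 and pandas 3.
t = pd.Series(["a", "b", "c"], dtype="string")
assert isinstance(t.dtype, pd.StringDtype)
```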

Why are the changes needed?

With pandas 3, `Series(["a", "b", "c"]).dtype` is a pandas extension dtype (`StringDtype`), and `np.issubdtype(self.dtype, np.number)` raises:

`TypeError: Cannot interpret '<StringDtype(na_value=nan)>' as a data type`

That means `Series.cov` fails in dtype validation instead of returning the expected `TypeError("unsupported dtype: ...")`.

Using `is_numeric_dtype` is more appropriate here because it is pandas-aware and works with both NumPy dtypes and pandas extension dtypes. This keeps the existing behavior for numeric inputs while making the validation path robust across pandas versions.
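
The difference between the two checks can be demonstrated in plain pandas (a minimal sketch; `Float64` and `"string"` are the extension dtypes the updated tests cover):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

num = pd.Series([1.0, 2.0], dtype="Float64").dtype   # nullable numeric extension dtype
txt = pd.Series(["a", "b"], dtype="string").dtype    # StringDtype extension dtype

# is_numeric_dtype is pandas-aware: it handles extension dtypes fine.
assert is_numeric_dtype(num)
assert not is_numeric_dtype(txt)

# np.issubdtype cannot interpret pandas extension dtypes at all:
try:
    np.issubdtype(txt, np.number)
except TypeError as exc:
    print(exc)  # "Cannot interpret ... as a data type"
```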

Does this PR introduce any user-facing change?

Yes.

For pandas extension dtypes such as pandas 3 string dtype, `pyspark.pandas.Series.cov` now raises the intended `TypeError("unsupported dtype: ...")` instead of failing earlier with a NumPy dtype interpretation error.
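
The resulting behavior can be sketched with a hypothetical `check_cov_dtype` helper that mirrors the validation step (not the actual PySpark code):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_cov_dtype(dtype):
    # Hypothetical stand-in for the dtype validation inside Series.cov:
    # non-numeric dtypes now hit the intended "unsupported dtype" error.
    if not is_numeric_dtype(dtype):
        raise TypeError("unsupported dtype: %s" % dtype)

check_cov_dtype(pd.Float64Dtype())  # numeric extension dtype: passes

try:
    check_cov_dtype(pd.StringDtype())
except TypeError as exc:
    assert "unsupported dtype" in str(exc)
```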

How was this patch tested?

Added the related tests; the other existing tests should pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex GPT-5


@dongjoon-hyun

Merged to master. Thank you, @ueshin.

terana pushed a commit to terana/spark that referenced this pull request Mar 23, 2026

Closes apache#54933 from ueshin/issues/SPARK-56122/cov.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
