[Fix] Pandas 3 string type changes by fjosw · Pull Request #278 · fjosw/pyerrors

fjosw · 2026-02-03T17:11:48Z

Fix pandas 3 compatibility for DataFrame serialization/deserialization

Handle new str dtype for string columns (pandas 3 uses str instead of object)
Use .iloc[i] for positional indexing (fixes deprecation warnings)
Use pd.isna() instead of x is None for null checks (pandas 3 string columns store nulls as nan)
Add _is_null() helper to safely handle null checks on columns containing lists
Convert NA values back to Python None after deserialization for API compatibility

Copilot

Pull request overview

This pull request addresses pandas 3 compatibility issues in the DataFrame serialization/deserialization functionality. The changes handle the new string dtype behavior in pandas 3 and update null checking mechanisms to work with both pandas NA values and Python None.

Changes:

Updated DataFrame column iteration to include both 'object' and 'str' dtypes for pandas 3+ compatibility
Replaced direct indexing (col[i]) with positional indexing (col.iloc[i]) throughout to avoid deprecation warnings
Replaced x is None checks with pd.isna(x) for null value detection and added _is_null() helper function for safe list handling
Added conversion of pandas NA values back to Python None after deserialization for API compatibility
Removed -Werror flag from pytest workflow for non-3.14 Python versions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File	Description
pyerrors/input/pandas.py	Updates serialization/deserialization functions to handle pandas 3 string types and NA values, adds _is_null() helper, converts NA to None post-deserialization
.github/workflows/pytest.yml	Removes -Werror flag from pytest for Python versions other than 3.14

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-03T17:21:37Z

pyerrors/input/pandas.py

-                        if isinstance(df[column][i], list):
-                            df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
+                        if isinstance(df[column].iloc[i], list):
+                            df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])


The logic in line 186 is incorrect for applying .gm() to items in a list. When checking 'not pd.isna(o)' for list elements, this will raise an error because 'o' is an Obs object, not a pandas value. The pd.isna() check should be changed to handle Obs objects properly, or should use 'o is not None' instead.

Copilot · 2026-02-03T17:21:37Z

pyerrors/input/pandas.py

+
+def _is_null(val):
+    """Check if a value is null (None or NA), handling list/array values."""
+    return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)


The _is_null function returns False for lists and numpy arrays, which means that empty lists ([]) will be treated as non-null. This could lead to unexpected behavior if a column contains empty lists as placeholders for null values. Consider whether empty lists should be treated as null values based on the use case.

Suggested change

return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)

# Treat empty lists/arrays (and containers whose elements are all null) as null.

if isinstance(val, list):

if len(val) == 0:

return True

# A list is null only if all its elements are null.

return all(_is_null(v) for v in val)

if isinstance(val, np.ndarray):

if val.size == 0:

return True

# For object-dtype arrays, check elementwise using _is_null.

if val.dtype == object:

return all(_is_null(v) for v in val)

# For non-object arrays, rely on pandas/numpy NA detection.

return bool(np.all(pd.isna(val)))

return pd.isna(val)

Copilot · 2026-02-03T17:21:37Z

pyerrors/input/pandas.py

+            out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if not _is_null(x) else None)
            if gz is True:
-                out[column] = out[column].transform(lambda x: gzip.compress((x if x is not None else '').encode('utf-8')))
+                out[column] = out[column].transform(lambda x: gzip.compress(x.encode('utf-8')) if not _is_null(x) else gzip.compress(b''))


When x is None/null, gzip.compress(b'') is called which compresses an empty byte string. This differs from the original behavior where None would be passed through. This change should be verified against the deserialization logic to ensure that compressed empty strings are correctly handled as null values when deserializing.

Copilot · 2026-02-03T17:21:38Z

pyerrors/input/pandas.py

+    for column in df.select_dtypes(include=string_like_dtypes):
+        if isinstance(df[column].iloc[0], bytes):
+            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')


The condition 'if x else' for checking null values in gzip decompression is imprecise. In pandas 3 with string dtypes, x could be pd.NA which would evaluate to an error in a boolean context. Use 'if not pd.isna(x)' instead of 'if x' for consistency with the rest of the changes.

Suggested change

df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')

df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')

Copilot · 2026-02-03T17:21:38Z

pyerrors/input/pandas.py

+        if isinstance(df[column].iloc[0], bytes):
+            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')


Potential IndexError when the column is empty or when all values are NaN/None. Before accessing iloc[0], verify that the column has at least one non-null value. This can happen when processing empty DataFrames or columns with all null values.

Suggested change

if isinstance(df[column].iloc[0], bytes):

if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):

df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')

col_series = df[column]

# Skip empty columns to avoid IndexError on iloc[0]

if col_series.empty:

continue

# Find the first non-null value for type inspection

first_valid_index = col_series.first_valid_index()

first_value = col_series.loc[first_valid_index] if first_valid_index is not None else None

if isinstance(first_value, bytes):

if first_value.startswith(b"\x1f\x8b\x08\x00"):

df[column] = col_series.transform(

lambda x: gzip.decompress(x).decode('utf-8')

if isinstance(x, (bytes, bytearray)) and x

else ''

)

Copilot · 2026-02-03T17:21:39Z

pyerrors/input/pandas.py

+                        if isinstance(df[column].iloc[i], list):
+                            df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
                        else:
-                            df[column].apply(lambda x: x.gm() if x is not None else x)
+                            df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)


The apply() function returns a new Series but the result is not assigned back to df[column]. This means the .gm() calls have no effect. The code should be: df[column] = df[column].apply(...) to store the result.

Copilot · 2026-02-03T17:21:39Z

pyerrors/input/pandas.py

-
-        if not all([e is None for e in df[column]]):
+    # In pandas 3+, string columns use 'str' dtype instead of 'object'
+    string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]


The version parsing could fail if the pandas version string doesn't follow the expected format (e.g., development versions like '3.0.0rc1' or '3.0.0.dev0'). Consider using a more robust version parsing approach such as packaging.version.Version or handle potential exceptions from the int() conversion.

Copilot · 2026-02-03T17:21:40Z

.github/workflows/pytest.yml

      - name: Run tests with -Werror
        if: matrix.python-version != '3.14'
-        run: pytest --cov=pyerrors -vv -Werror
+        run: pytest --cov=pyerrors -vv


The -Werror flag was removed from pytest for all Python versions except 3.14. According to the PR description, this change aims to fix deprecation warnings. If the warnings are truly fixed by this PR, the -Werror flag should remain to catch future regressions. Removing -Werror means new warnings won't cause test failures, which could allow issues to accumulate.

Suggested change

run: pytest --cov=pyerrors -vv

run: python -Werror -m pytest --cov=pyerrors -vv

Copilot · 2026-02-03T17:21:40Z

pyerrors/input/pandas.py

+            while pd.isna(df[column].iloc[i]):
                i += 1


Potential IndexError when the while loop reaches the end of the column without finding a non-null value. The notna().any() check at line 176 ensures there's at least one non-null value, but after the regex replace at line 177, some of those values might have been replaced with None. This can cause the while loop to run past the end of the DataFrame. Add a bounds check before accessing iloc[i] at line 181.

Suggested change

while pd.isna(df[column].iloc[i]):

i += 1

col_len = len(df[column])

while i < col_len and pd.isna(df[column].iloc[i]):

i += 1

if i == col_len:

# All values are NA after replacement; nothing to deserialize in this column

continue

- Use pd.isna() instead of truthiness check for gzip null guard, fixing incorrect behavior when null is np.nan (which is truthy in Python) - Add bounds check to while loop to prevent IndexError when all non-null values are empty strings converted to None by regex replace

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-19T15:16:31Z

pyerrors/input/pandas.py

            i = 0
-            while df[column][i] is None:
+            while i < len(df[column]) and pd.isna(df[column].iloc[i]):
                i += 1
-            if isinstance(df[column][i], str):
-                if '"program":' in df[column][i][:20]:
-                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if x is not None else None)
+            if isinstance(df[column].iloc[i], str):
+                if '"program":' in df[column].iloc[i][:20]:
+                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if not pd.isna(x) else None)


After the null-skipping loop, i can be equal to len(df[column]) (e.g., when the column is entirely null/empty after the '^$' -> None replacement). In that case df[column].iloc[i] will raise IndexError. Add a guard like if i == len(df[column]): continue (or restructure the logic) before accessing .iloc[i].

Copilot · 2026-02-19T15:16:31Z

pyerrors/input/pandas.py

                    if auto_gamma is True:
-                        if isinstance(df[column][i], list):
-                            df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
+                        if isinstance(df[column].iloc[i], list):
+                            df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
                        else:
-                            df[column].apply(lambda x: x.gm() if x is not None else x)
+                            df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)


The auto_gamma branch uses Series.apply(...) but discards the returned Series, and the list-case lambda builds a throwaway list with an odd else x (the whole list) element. Since Obs.gm() is side-effecting and returns None, this can be simplified to an explicit loop over the deserialized objects/lists (calling .gm() on non-null Obs), avoiding unnecessary allocations and making the intent clear.

Copilot · 2026-02-19T15:16:31Z

pyerrors/input/pandas.py

+    # In pandas 3+, string columns use 'str' dtype instead of 'object'
+    string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]
+    for column in df.select_dtypes(include=string_like_dtypes):
+        if isinstance(df[column].iloc[0], bytes):
+            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')
+
+        if df[column].notna().any():
            df[column] = df[column].replace({r'^$': None}, regex=True)
            i = 0
-            while df[column][i] is None:
+            while i < len(df[column]) and pd.isna(df[column].iloc[i]):
                i += 1
-            if isinstance(df[column][i], str):
-                if '"program":' in df[column][i][:20]:
-                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if x is not None else None)
+            if isinstance(df[column].iloc[i], str):
+                if '"program":' in df[column].iloc[i][:20]:
+                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if not pd.isna(x) else None)
                    if auto_gamma is True:
-                        if isinstance(df[column][i], list):
-                            df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
+                        if isinstance(df[column].iloc[i], list):
+                            df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
                        else:
-                            df[column].apply(lambda x: x.gm() if x is not None else x)
+                            df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)
+        # Convert NA values back to Python None for compatibility with `x is None` checks
+        if df[column].isna().any():
+            df[column] = df[column].astype(object).where(df[column].notna(), None)


New pandas-3-focused behavior here (handling string-like dtypes + pd.isna null semantics + converting NA back to None) isn’t covered by the existing pandas IO tests (they don’t create string-dtype columns / pd.NA values). Add a regression test that round-trips a DataFrame with a pandas string dtype column containing pd.NA/None and verifies the deserialized result uses Python None where expected.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-19T15:25:27Z

pyerrors/input/pandas.py

-        if not all([e is None for e in df[column]]):
+    # In pandas 3+, string columns use 'str' dtype instead of 'object'
+    string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]
+    for column in df.select_dtypes(include=string_like_dtypes):


Potential IndexError: Accessing df[column].iloc[0] without checking if the column is empty or if the DataFrame has any rows. If the DataFrame is empty, this will raise an IndexError. Consider adding a check for len(df[column]) > 0 before accessing the first element.

Suggested change

for column in df.select_dtypes(include=string_like_dtypes):

for column in df.select_dtypes(include=string_like_dtypes):

# Skip empty columns to avoid IndexError when accessing iloc[0]

if len(df[column]) == 0:

continue

Copilot · 2026-02-19T15:25:28Z

pyerrors/input/pandas.py

+            while i < len(df[column]) and pd.isna(df[column].iloc[i]):
                i += 1
-            if isinstance(df[column][i], str):
-                if '"program":' in df[column][i][:20]:
-                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if x is not None else None)
+            if isinstance(df[column].iloc[i], str):
+                if '"program":' in df[column].iloc[i][:20]:
+                    df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if not pd.isna(x) else None)
                    if auto_gamma is True:
-                        if isinstance(df[column][i], list):
-                            df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
+                        if isinstance(df[column].iloc[i], list):
+                            df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
                        else:
-                            df[column].apply(lambda x: x.gm() if x is not None else x)
+                            df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)


Potential IndexError: After the while loop at line 179, if all values in the column are NA, the variable i will equal len(df[column]), and then line 181 will try to access df[column].iloc[i] which is out of bounds. Add a check after the while loop to ensure i < len(df[column]) before accessing df[column].iloc[i].

s-kuberski · 2026-02-19T16:26:07Z

Thanks for taking care of this! I have not found anything to comment on in the changes that you have performed, keeping in mind that it might be easy to miss some unexpected behavior in I/O routines.

Would it make sense to add string type entries in the test routines?

fjosw added 2 commits February 3, 2026 18:09

[Fix] Pandas 3 string type handling

9b01f80

[ci] Temporarily remove WError because of scipy deprecation

7741f8f

fjosw requested review from Copilot and s-kuberski February 3, 2026 17:16

Copilot started reviewing on behalf of fjosw February 3, 2026 17:17 View session

Copilot AI reviewed Feb 3, 2026

View reviewed changes

fjosw mentioned this pull request Feb 19, 2026

[Fix] Migrate to odrpack because of scipy.odr deprecation #279

Open

fjosw requested a review from Copilot February 19, 2026 15:12

Copilot started reviewing on behalf of fjosw February 19, 2026 15:12 View session

Copilot AI reviewed Feb 19, 2026

View reviewed changes

fjosw requested a review from Copilot February 19, 2026 15:20

Copilot started reviewing on behalf of fjosw February 19, 2026 15:20 View session

Copilot AI reviewed Feb 19, 2026

View reviewed changes

-    return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)
+    # Treat empty lists/arrays (and containers whose elements are all null) as null.
+    if isinstance(val, list):
+        if len(val) == 0:
+            return True
+        # A list is null only if all its elements are null.
+        return all(_is_null(v) for v in val)
+    if isinstance(val, np.ndarray):
+        if val.size == 0:
+            return True
+        # For object-dtype arrays, check elementwise using _is_null.
+        if val.dtype == object:
+            return all(_is_null(v) for v in val)
+        # For non-object arrays, rely on pandas/numpy NA detection.
+        return bool(np.all(pd.isna(val)))
+    return pd.isna(val)

	df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
	df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')

-        if isinstance(df[column].iloc[0], bytes):
-            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
-                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
+        col_series = df[column]
+        # Skip empty columns to avoid IndexError on iloc[0]
+        if col_series.empty:
+            continue
+        # Find the first non-null value for type inspection
+        first_valid_index = col_series.first_valid_index()
+        first_value = col_series.loc[first_valid_index] if first_valid_index is not None else None
+        if isinstance(first_value, bytes):
+            if first_value.startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = col_series.transform(
+                    lambda x: gzip.decompress(x).decode('utf-8')
+                    if isinstance(x, (bytes, bytearray)) and x
+                    else ''
+                )

	run: pytest --cov=pyerrors -vv
	run: python -Werror -m pytest --cov=pyerrors -vv

-            while pd.isna(df[column].iloc[i]):
-                i += 1
+            col_len = len(df[column])
+            while i < col_len and pd.isna(df[column].iloc[i]):
+                i += 1
+            if i == col_len:
+                # All values are NA after replacement; nothing to deserialize in this column
+                continue

Conversation

fjosw commented Feb 3, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

s-kuberski commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants