2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
@@ -44,7 +44,7 @@ jobs:

 - name: Run tests with -Werror
   if: matrix.python-version != '3.14'
-  run: pytest --cov=pyerrors -vv -Werror
+  run: pytest --cov=pyerrors -vv

Copilot AI Feb 3, 2026

This change drops -Werror from the step that runs on every Python version other than 3.14, and the 3.14 step already runs without it, so no job treats warnings as errors anymore. According to the PR description, the goal is to fix deprecation warnings. If this PR really fixes them, -Werror should stay so that future regressions are caught; without it, new warnings no longer fail the tests and can accumulate unnoticed.

Suggested change
-  run: pytest --cov=pyerrors -vv
+  run: python -Werror -m pytest --cov=pyerrors -vv
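
To illustrate what the flag buys: with an `error` filter active, any emitted warning is raised as an exception, which is how a test run can fail on a new deprecation. The following is a minimal standard-library sketch; `legacy_call`, `fails_under_werror`, and `collect_warnings` are hypothetical names used only for illustration and are not part of pyerrors.

```python
import warnings


def legacy_call():
    """Hypothetical function that emits a deprecation warning."""
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def fails_under_werror(func):
    """Return True if func emits a warning that an 'error' filter turns into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # same effect as running Python with -Werror
        try:
            func()
        except Warning:
            return True
    return False


def collect_warnings(func):
    """Run func without failing and return the messages of the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [str(w.message) for w in caught]
```

Without the error filter, the same call only records a warning and the run continues, which is the situation the comment warns about.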

Copilot uses AI. Check for mistakes.

 - name: Run tests without -Werror for python 3.14
   if: matrix.python-version == '3.14'
48 changes: 29 additions & 19 deletions pyerrors/input/pandas.py
@@ -145,9 +145,9 @@ def _serialize_df(df, gz=False):
serialize = _need_to_serialize(out[column])

if serialize is True:
- out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if x is not None else None)
+ out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if not _is_null(x) else None)
if gz is True:
- out[column] = out[column].transform(lambda x: gzip.compress((x if x is not None else '').encode('utf-8')))
+ out[column] = out[column].transform(lambda x: gzip.compress(x.encode('utf-8')) if not _is_null(x) else gzip.compress(b''))

Copilot AI Feb 3, 2026

When x is null, gzip.compress(b'') is called, which compresses an empty byte string. For None this matches the original code, which substituted '' before encoding; the new branch additionally covers the other null values that _is_null recognizes through pd.isna (for example NaN or pd.NA). The deserialization logic should still be checked to confirm that compressed empty strings are read back as null values.
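
The round trip in question can be checked in isolation. Below is a small sketch using only the standard library; `compress_cell` and `decompress_cell` are hypothetical helpers that mimic the per-cell lambdas in the diff, with the empty-string-to-None step standing in for the `replace({r'^$': None}, regex=True)` call in `_deserialize_df`.

```python
import gzip

# Prefix that the deserializer in the diff checks before gunzipping a column.
GZIP_MAGIC = b"\x1f\x8b\x08\x00"


def compress_cell(text):
    """Compress a string cell; a null cell becomes a compressed empty string."""
    return gzip.compress((text if text is not None else "").encode("utf-8"))


def decompress_cell(blob):
    """Decompress a cell and map the empty string back to None."""
    text = gzip.decompress(blob).decode("utf-8")
    return None if text == "" else text
```

Under these assumptions a null cell survives the round trip as None, and its compressed form still starts with the gzip prefix the deserializer looks for.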

return out


@@ -166,37 +166,47 @@ def _deserialize_df(df, auto_gamma=False):
------
In case any column of the DataFrame is gzipped it is gunzipped in the process.
"""
- for column in df.select_dtypes(include="object"):
- if isinstance(df[column][0], bytes):
- if df[column][0].startswith(b"\x1f\x8b\x08\x00"):
- df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8'))
-
- if not all([e is None for e in df[column]]):
+ # In pandas 3+, string columns use 'str' dtype instead of 'object'
+ string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]

Copilot AI Feb 3, 2026

This check reads only the first dot-separated component of pd.__version__, so pre-release strings such as '3.0.0rc1' or '3.0.0.dev0' still yield 3; int() would fail only if that leading component were not purely numeric. If sturdier parsing is wanted, consider packaging.version.Version, or handle the ValueError that the int() conversion can raise.
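
As a rough illustration of defensive major-version parsing, the sketch below uses a regular expression from the standard library instead of packaging.version.Version, so it has no extra dependency. `major_version` is a hypothetical helper, not something defined in pyerrors or pandas, and it returns None instead of raising when no leading integer is found.

```python
import re


def major_version(version):
    """Return the leading integer of a version string, or None if there is none."""
    match = re.match(r"\s*v?(\d+)", version)
    if match is None:
        return None
    return int(match.group(1))


def string_like_dtypes(version):
    """Pick the dtypes to scan, following the pandas 3 rule stated in the diff."""
    major = major_version(version)
    if major is not None and major >= 3:
        return ["object", "str"]
    return ["object"]
```

Note that the first dot-separated component of '3.0.0rc1' is still '3', so the simpler split-based check in the diff also handles that string.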

+ for column in df.select_dtypes(include=string_like_dtypes):

Copilot AI Feb 19, 2026

Potential IndexError: Accessing df[column].iloc[0] without checking if the column is empty or if the DataFrame has any rows. If the DataFrame is empty, this will raise an IndexError. Consider adding a check for len(df[column]) > 0 before accessing the first element.

Suggested change
- for column in df.select_dtypes(include=string_like_dtypes):
+ for column in df.select_dtypes(include=string_like_dtypes):
+ # Skip empty columns to avoid IndexError when accessing iloc[0]
+ if len(df[column]) == 0:
+ continue

+ if isinstance(df[column].iloc[0], bytes):
+ if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
+ df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')
+
+ if df[column].notna().any():
df[column] = df[column].replace({r'^$': None}, regex=True)
i = 0
- while df[column][i] is None:
+ while i < len(df[column]) and pd.isna(df[column].iloc[i]):
i += 1
- if isinstance(df[column][i], str):
- if '"program":' in df[column][i][:20]:
- df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if x is not None else None)
+ if isinstance(df[column].iloc[i], str):
+ if '"program":' in df[column].iloc[i][:20]:
+ df[column] = df[column].transform(lambda x: import_json_string(x, verbose=False) if not pd.isna(x) else None)
Comment on lines 178 to +183

Copilot AI Feb 19, 2026

After the null-skipping loop, i can be equal to len(df[column]) (e.g., when the column is entirely null/empty after the '^$' -> None replacement). In that case df[column].iloc[i] will raise IndexError. Add a guard like if i == len(df[column]): continue (or restructure the logic) before accessing .iloc[i].
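
One way to restructure the logic is to separate the search for the first non-null entry from the code that uses it, so the "nothing found" case is explicit. The sketch below is only an illustration: `first_non_null_index` is a hypothetical helper, and `is_null` is a simplified stand-in for pd.isna that recognizes only None and float NaN.

```python
import math


def is_null(value):
    """Simplified null check: None or a float NaN (pd.isna covers more cases)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_non_null_index(values):
    """Return the index of the first non-null entry, or None if there is none."""
    for index, value in enumerate(values):
        if not is_null(value):
            return index
    return None
```

A caller would then skip the column when the helper returns None, which also covers a column with no rows at all.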

if auto_gamma is True:
- if isinstance(df[column][i], list):
- df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
+ if isinstance(df[column].iloc[i], list):
+ df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])

Copilot AI Feb 3, 2026

The logic in line 186 is incorrect for applying .gm() to items in a list. When checking 'not pd.isna(o)' for list elements, this will raise an error because 'o' is an Obs object, not a pandas value. The pd.isna() check should be changed to handle Obs objects properly, or should use 'o is not None' instead.

else:
- df[column].apply(lambda x: x.gm() if x is not None else x)
+ df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)
Comment on lines +185 to +188

Copilot AI Feb 3, 2026

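The apply() call returns a new Series, and that result is never assigned back to df[column]. This only makes the calls ineffective if .gm() returns the updated object. Another comment in this review states that Obs.gm() works by side effect and returns None; if so, the calls do take effect, and writing df[column] = df[column].apply(...) would replace the column's contents with those None results. Either way, the intent should be made explicit.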

Comment on lines 184 to +188

Copilot AI Feb 19, 2026

The auto_gamma branch uses Series.apply(...) but discards the returned Series, and the list-case lambda builds a throwaway list with an odd else x (the whole list) element. Since Obs.gm() is side-effecting and returns None, this can be simplified to an explicit loop over the deserialized objects/lists (calling .gm() on non-null Obs), avoiding unnecessary allocations and making the intent clear.
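
The explicit loop described here could look roughly like the sketch below. It assumes, as the comment states, that gm() mutates the object and returns None. `FakeObs` is a hypothetical stand-in for Obs and `apply_gamma` is an illustrative helper, neither taken from pyerrors.

```python
class FakeObs:
    """Hypothetical stand-in for Obs: gm() mutates state and returns None."""

    def __init__(self):
        self.analysed = False

    def gm(self):
        self.analysed = True


def apply_gamma(cells):
    """Call gm() on every non-null object, whether a cell holds one object or a list."""
    for cell in cells:
        if cell is None:
            continue
        # Normalize a single object to a one-element list so both cases share one loop.
        items = cell if isinstance(cell, list) else [cell]
        for item in items:
            if item is not None:
                item.gm()
```

Because nothing is returned or reassigned, the loop makes it clear that the objects are updated in place and that the column itself is left untouched.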

Comment on lines +179 to +188

Copilot AI Feb 19, 2026

Potential IndexError: After the while loop at line 179, if all values in the column are NA, the variable i will equal len(df[column]), and then line 181 will try to access df[column].iloc[i] which is out of bounds. Add a check after the while loop to ensure i < len(df[column]) before accessing df[column].iloc[i].

+ # Convert NA values back to Python None for compatibility with `x is None` checks
+ if df[column].isna().any():
+ df[column] = df[column].astype(object).where(df[column].notna(), None)
Comment on lines +169 to +191

Copilot AI Feb 19, 2026

New pandas-3-focused behavior here (handling string-like dtypes + pd.isna null semantics + converting NA back to None) isn’t covered by the existing pandas IO tests (they don’t create string-dtype columns / pd.NA values). Add a regression test that round-trips a DataFrame with a pandas string dtype column containing pd.NA/None and verifies the deserialized result uses Python None where expected.

return df


def _need_to_serialize(col):
serialize = False
i = 0
- while i < len(col) and col[i] is None:
+ while i < len(col) and _is_null(col.iloc[i]):
i += 1
if i == len(col):
return serialize
- if isinstance(col[i], (Obs, Corr)):
+ if isinstance(col.iloc[i], (Obs, Corr)):
serialize = True
- elif isinstance(col[i], list):
- if all(isinstance(o, Obs) for o in col[i]):
+ elif isinstance(col.iloc[i], list):
+ if all(isinstance(o, Obs) for o in col.iloc[i]):
serialize = True
return serialize


+ def _is_null(val):
+ """Check if a value is null (None or NA), handling list/array values."""
+ return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)

Copilot AI Feb 3, 2026

The _is_null function returns False for lists and numpy arrays, which means that empty lists ([]) will be treated as non-null. This could lead to unexpected behavior if a column contains empty lists as placeholders for null values. Consider whether empty lists should be treated as null values based on the use case.

Suggested change
- return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)
+ # Treat empty lists/arrays (and containers whose elements are all null) as null.
+ if isinstance(val, list):
+ if len(val) == 0:
+ return True
+ # A list is null only if all its elements are null.
+ return all(_is_null(v) for v in val)
+ if isinstance(val, np.ndarray):
+ if val.size == 0:
+ return True
+ # For object-dtype arrays, check elementwise using _is_null.
+ if val.dtype == object:
+ return all(_is_null(v) for v in val)
+ # For non-object arrays, rely on pandas/numpy NA detection.
+ return bool(np.all(pd.isna(val)))
+ return pd.isna(val)
