Patch pre-Release v0.30.1 #2626
  def convert_all_nan_unknown_to_double(data):
      def is_column_pd_na(data, col):
-         return all([isinstance(x, type(pd.NA)) for x in data[col]])
+         return all(data[col].isna())
@freddyaboulton and @bchen1116 I think you recommended both this and all(data[col] == pd.NA). I tried all(data[col] == pd.NA) first, but it behaved oddly: comparing a column full of pd.NA to pd.NA returns a column full of pd.NA rather than booleans. I found a bit more about this here, where they discuss the design decision that arithmetic and comparisons involving pd.NA return pd.NA, so that pd.NA "propagates". I ended up just going with what I had, and it was super slow, as you foresaw. I'm now trying to atone for my sins.
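For anyone following along, here is a minimal sketch of the behavior described above. It assumes a pandas nullable dtype ("Int64" is my choice for illustration; the actual column dtype in this PR is not stated) and is not the code under review.

```python
import pandas as pd

# Assumption: a column backed by a pandas nullable dtype.
col = pd.Series([pd.NA, pd.NA, pd.NA], dtype="Int64")

# Comparing to pd.NA propagates NA elementwise, so the result is
# a column of pd.NA rather than a column of True.
compared = col == pd.NA
compared_is_all_na = bool(compared.isna().all())

# Even the scalar comparison yields pd.NA, not True.
scalar_result = pd.NA == pd.NA

# The vectorized null check gives a plain boolean answer.
fast_check = bool(col.isna().all())

# The elementwise isinstance check gives the same answer but
# loops in Python, which is what made it slow on large columns.
slow_check = all(isinstance(x, type(pd.NA)) for x in col)
```

Both checks agree on this column; the difference is only that isna() runs vectorized.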
assert invalid_targets_check.validate(X, y=pd.Series([np.nan, np.nan, np.nan])) == {
    "warnings": [],
    "errors": [
        DataCheckError(
The slightly different logic catches fully null Unknown columns and changes them to np.nan Double columns both in dataframes and in series now.
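A rough pandas-only sketch of that idea, for illustration. The helper name all_null_to_double is hypothetical, and the Unknown and Double logical types are approximated here by plain pandas dtypes (object and float64); this is not the implementation in the PR.

```python
import numpy as np
import pandas as pd


def all_null_to_double(data):
    """Hypothetical helper: cast fully null columns or series to float64 NaN."""
    if isinstance(data, pd.Series):
        if data.isna().all():
            return pd.Series(
                np.nan, index=data.index, name=data.name, dtype="float64"
            )
        return data
    converted = data.copy()
    for col in converted.columns:
        if converted[col].isna().all():
            # Replace the all-null column with a float64 column of NaN.
            converted[col] = pd.Series(
                np.nan, index=converted.index, dtype="float64"
            )
    return converted


# Usage: the all-null column "a" becomes float64; "b" is left alone.
df = pd.DataFrame({"a": [pd.NA, pd.NA], "b": [1, 2]})
df_out = all_null_to_double(df)

# The same helper handles a fully null series, such as a target.
y_out = all_null_to_double(pd.Series([pd.NA, None]))
```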
Codecov Report
@@ Coverage Diff @@
## main #2626 +/- ##
=====================================
Coverage 99.9% 99.9%
=====================================
Files 297 297
Lines 27071 27071
=====================================
Hits 27027 27027
Misses 44 44
bchen1116 left a comment
LGTM! Thanks for making the changes. Do we need to update the release notes with this PR #? (I think you can just tack it on to the end of the ww update PR.)
dsherry left a comment
Well done @chukarsten, I'm so glad we caught this pre-release 👏
A small update to do a better job of detecting all-pd.NA values. I guess.