ENH add check to automatically drop null columns #13
Conversation
I'd like to get your thoughts on the performance, @stephen-hoover. For a 7593x1777 data frame, the fit takes about a minute with the null check. I also made a basic plot of performance by number of rows/cols. The upper line is scaling by number of columns (with a fixed 1000 rows), and the lower line is scaling by number of rows (with a fixed 1000 cols). I'm not totally sure it's worth the performance cost, but I'm not sure how else to resolve the bug in CivisML.
On your plot, what's the execution time without the null check? Spending a minute on this fit does feel excessive, but fitting a model on data with half a million columns is probably going to take a lot longer, so it doesn't matter so much.
This new cleaning should be documented in the ETLTransformer's class docstring.
Should there be a switch in the constructor to turn this on or off? That could help mitigate speed concerns.
A not-for-this-PR question: in the tests, you originally had cols_to_drop with a scalar input. Does this point to a concern with the object? What happens if users provide a single string? Will the object silently do the wrong thing?
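For illustration, a minimal sketch of that hazard (the coercion guard at the end is hypothetical, not something in this PR):

    # A bare string is iterable, so membership tests quietly match
    # single characters instead of column names.
    cols_to_drop = 'age'
    print('a' in cols_to_drop)        # True: a one-letter column name collides
    print([c for c in cols_to_drop])  # ['a', 'g', 'e'] -- characters, not columns

    # A defensive coercion the constructor could apply:
    if isinstance(cols_to_drop, str):
        cols_to_drop = [cols_to_drop]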
civismlext/preprocessing.py
Outdated
    @@ -148,14 +149,25 @@ def __init__(self,
            self.fill_value = fill_value
            self.dataframe_output = dataframe_output

        def _flag_nulls(self, X, cols_to_drop):
            null_cols = [col for col in X if
                         X[col].isnull().all() and col not in cols_to_drop]
If you flip the order of these checks, you could save a little run time: if col not in cols_to_drop comes first, you'll avoid checking for nulls in any column users have already requested to drop.
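A sketch of the reordering; Python's and short-circuits, so the cheap membership test now gates the expensive null scan:

    # Membership check first: columns already slated for dropping are
    # skipped without scanning their values for nulls.
    null_cols = [col for col in X if
                 col not in cols_to_drop and X[col].isnull().all()]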
civismlext/preprocessing.py
Outdated
                warnings.warn('The following contain only nulls and '
                              'will be dropped: ' + str(null_cols),
                              UserWarning)
            cols_to_drop.extend(null_cols)
This modifies the input list. I think you should pick either modifying the input or returning a list, but not both. You could return cols_to_drop + null_cols and the output will be the same, without modifying the input.
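A sketch of the non-mutating version, assuming the same signature (the warning text is taken from the diff above):

    import warnings

    def _flag_nulls(self, X, cols_to_drop):
        # Collect all-null columns the user hasn't already asked to drop.
        null_cols = [col for col in X if
                     col not in cols_to_drop and X[col].isnull().all()]
        if null_cols:
            warnings.warn('The following contain only nulls and '
                          'will be dropped: ' + str(null_cols),
                          UserWarning)
        # Return a new list; the caller's cols_to_drop is left untouched.
        return cols_to_drop + null_cols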
        # check that we don't add the col if it's already being dropped
        assert expander._flag_nulls(data_raw, drop_cols_2) == drop_cols_2
        assert len(w) == 1
        assert issubclass(w[-1].category, UserWarning)
Small thing, but I'd put this check first, so that you can have assert len(w) == 0. It's mildly confusing to assert one warning in a test which shouldn't warn.
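Roughly the suggested shape (a sketch; expander, data_raw, and the drop_cols variables are stand-ins from the test's context):

    import warnings

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter('always')
        # Non-warning case first: an already-dropped column is not
        # re-flagged, and no warning should have fired yet.
        assert expander._flag_nulls(data_raw, drop_cols_2) == drop_cols_2
        assert len(w) == 0
        # Warning case second: a genuinely all-null column warns once.
        assert expander._flag_nulls(data_raw, drop_cols) != drop_cols
        assert len(w) == 1
        assert issubclass(w[-1].category, UserWarning)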
As a random comment, the current sklearn mean imputer does this as well (and also on transform, which is even worse!). I would argue strongly that this behavior is not desired. An all-null column is a serious enough ETL issue that even throwing a warning is holding hands too much. The code should detect those and then fail with an informative error message.
@beckermr, that's actually the reason for this PR in the first place. I passed in a dataset with null columns and got failures deep in CivisML, because the silently dropped columns made the column_names metadata have different dimensions than the dataset itself. If DataFrameETL took care of this, we wouldn't have to worry about the imputer's behavior. But I disagree that we should raise an exception instead of a warning. If I were using this estimator and I got the error, I would just add the null columns to cols_to_drop.
@elsander The point here is that if someone is using this and it errors on an all-null column that they did not know about, they should be forced to go look at the data (even if they just end up dropping it). What we are doing with this change is possibly hiding an upstream ETL bug from a user. Remember, nobody looks at warnings. They only look at failures. The fact that someone might just add the column to the ones to be dropped anyway is beside the point. By adding it automatically, we are making an assumption that all null columns on input are meant to be dropped. This assumption is problematic. This is the point. There are no magic buttons.
Null checking accounts for essentially all of the time that fit takes, so a constructor switch to disable it would address the speed concern. The downside is that it wouldn't actually fix the metadata bug that motivated this fix-- because of the imputer's behavior, it'll continue to fail without a robust way for us to check. In that case, the decision would essentially be that we don't guarantee that everything will work properly if there are nan columns (although the model training itself should still succeed).

null_plot.pdf
What do you think about changing the drop_null_cols parameter so that valid inputs are [None, False, 'raise', 'warn'], so that users can select between different levels of severity?
It's probably better to change the parameter name if it takes those options. Maybe handle_null_cols, and give the option "drop" instead of "warn".
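One way to validate that option set up front (a sketch; the parameter name follows the suggestion above and is not final):

    VALID_NULL_HANDLING = (None, False, 'raise', 'warn')

    def _validate_null_handling(handle_null_cols):
        # Fail fast in the constructor rather than deep inside fit.
        if handle_null_cols not in VALID_NULL_HANDLING:
            raise ValueError(
                'handle_null_cols must be one of {}, got {!r}'.format(
                    VALID_NULL_HANDLING, handle_null_cols))
        return handle_null_cols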
civismlext/preprocessing.py
Outdated
        def _flag_nulls(self, X, cols_to_drop):
            null_cols = [col for col in X if
                         col not in cols_to_drop and X[col].isnull().all()]
Looking at this line -- you don't really care about the null status of every entry in the DataFrame. All you want to know is whether or not there's at least one non-null value. Therefore, I have two suggestions for performance improvements.
First, I've found a significant improvement in run time by replacing X[col].isnull().all() with X[col].first_valid_index() is None. It's 4x faster for a tiny DataFrame, and 10x faster for a 1M row DataFrame.
Second, if you think the normal use case is for most columns to be mostly non-null (I think that's the case), you can add a further condition to the list comprehension:
    null_cols = [col for col in X if
                 col not in cols_to_drop and
                 pd.isnull(X[col].values[0]) and
                 X[col].first_valid_index() is None]
Checking if the first element of the series is null takes about 4 us when I test it locally. That's significantly faster than either the isnull check or the first_valid_index check.
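A quick way to reproduce the comparison locally (a sketch; absolute timings will vary by machine):

    import timeit
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x': np.random.rand(1_000_000)})

    # Full scan of every entry vs. stopping at the first valid value
    # vs. peeking at only the first element.
    t_all = timeit.timeit(lambda: df['x'].isnull().all(), number=100)
    t_fvi = timeit.timeit(lambda: df['x'].first_valid_index() is None,
                          number=100)
    t_first = timeit.timeit(lambda: pd.isnull(df['x'].values[0]), number=100)
    print(t_all, t_fvi, t_first)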
You'll need to special-case empty DataFrames when using these checks. You should skip the null check if the DataFrame length is zero.
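A sketch of the guard, layered on the comprehension above:

    # Zero-length frames: values[0] would raise IndexError, and
    # isnull().all() is vacuously True, so skip the check entirely.
    if len(X) == 0:
        null_cols = []
    else:
        null_cols = [col for col in X if
                     col not in cols_to_drop and
                     pd.isnull(X[col].values[0]) and
                     X[col].first_valid_index() is None]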
I like this a lot-- really clever to check the first value to save extra time!
Why are the choices [None, False, 'raise', 'warn'] instead of [True, 'raise', 'warn'] with a default of 'raise'?
For a check that adds this much to fit time, I suggested the default be False. A default option other than False would impose the cost of the null check on every user, whether or not they need it.
Ahhhh mmmk. Thx.
With the performance improvements, the number of rows has essentially no effect as long as the number of nan columns is low, and it takes about 1/5 of the time it did before for the same number of columns.
civismlext/preprocessing.py
Outdated
            if self.drop_null_cols == 'warn':
                warnings.warn(msg, UserWarning)
            elif self.drop_null_cols == 'raise':
                raise RuntimeError(msg)
The message isn't correct in the case where you're raising an exception.
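Something like this keeps the message accurate in both branches (a sketch; the wording is illustrative):

    msg = 'The following columns contain only nulls: ' + str(null_cols)
    if self.drop_null_cols == 'warn':
        # Only the warning path should promise to drop the columns.
        warnings.warn(msg + ' They will be dropped.', UserWarning)
    elif self.drop_null_cols == 'raise':
        raise RuntimeError(msg)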
civismlext/preprocessing.py
Outdated
            If True, columns which are all nulls will issue a warning during
            `fit` and be dropped during `transform`. If False, there will not
            be a check for null columns (but `fit` performance will be better).
        drop_null_cols : {None, False, 'raise', 'warn'} (default: False)
Are you happy with the name "drop_null_cols" now that it's not a boolean? I can see False and "warn" as options for a parameter named "drop_null_cols", but "raise" feels a bit stranger.
Oops, missed this comment at first-- maybe "check_null_cols", or "handle_null_cols"? I think "check_null_cols" is clearest.
CHANGELOG.md
Outdated
    ## [0.1.5] - 2017-10-27

    ### Added
    - Added `drop_null_cols` argument to check for null columns (#13)
One last thing to change to check_null_cols here.
LGTM!
This adds a check during fitting for columns that contain only nulls; if any are found, they are dropped and a warning is issued.