
ENH add check to automatically drop null columns #13

Merged: elsander merged 10 commits into civisanalytics:master from drop_nulls on Oct 31, 2017

Conversation

@elsander (Contributor)

This adds a check during fitting for columns that contain only nulls; if any are found, they are dropped and a warning is raised.
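
A minimal sketch of the behavior this adds, using a toy frame with one all-null column (the data and column names here are illustrative, not from the PR):

    import warnings
    import numpy as np
    import pandas as pd

    X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [np.nan, np.nan, np.nan]})

    # Flag columns in which every entry is null, warn, and drop them.
    null_cols = [col for col in X if X[col].isnull().all()]
    if null_cols:
        warnings.warn('The following contain only nulls and '
                      'will be dropped: ' + str(null_cols), UserWarning)
    X = X.drop(null_cols, axis=1)  # leaves only column 'a'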

@elsander (Contributor, Author)

I'd like to get your thoughts on the performance, @stephen-hoover. For a 7593x1777 data frame with many columns to expand and fewer than 10 null columns, the null flagging alone took 0.31 s on average, fitting with null flagging took 0.37 s, and fit_transform with null flagging took 1.32 s, so the performance hit is significant.

I also made a basic plot of performance by number of rows/columns. The upper line shows scaling with the number of columns (at a fixed 1000 rows), and the lower line shows scaling with the number of rows (at a fixed 1000 columns). I'm not sure the fix is worth the performance cost, but I'm also not sure how else to resolve the bug in CivisML.
performance.pdf
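
For reference, a rough way to reproduce this kind of timing for the flagging step alone (the frame shape matches the one above; the data itself is made up):

    import time
    import numpy as np
    import pandas as pd

    # A 7593x1777 frame with a handful of all-null columns.
    X = pd.DataFrame(np.random.rand(7593, 1777))
    for col in X.columns[:5]:
        X[col] = np.nan

    start = time.perf_counter()
    null_cols = [col for col in X if X[col].isnull().all()]
    print('null flagging took %.2f s; flagged %d columns'
          % (time.perf_counter() - start, len(null_cols)))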

@stephen-hoover (Contributor) left a comment:

On your plot, what's the execution time without the null check? Spending a minute on this fit does feel excessive, but fitting a model on data with half a million columns is probably going to take far longer anyway, so the overhead matters less there.

This new cleaning should be documented in the ETLTransformer's class docstring.

Should there be a switch in the constructor to turn this on or off? That could help mitigate speed concerns.

A not-for-this-PR question: in the tests, you originally had cols_to_drop with a scalar input. Does this point to a concern with the object? What happens if users provide a single string? Will the object silently do the wrong thing?

@@ -148,14 +149,25 @@ def __init__(self,
        self.fill_value = fill_value
        self.dataframe_output = dataframe_output

    def _flag_nulls(self, X, cols_to_drop):
        null_cols = [col for col in X if
                     X[col].isnull().all() and col not in cols_to_drop]
@stephen-hoover (Contributor):

If you flip the order of these checks, you could save a little run time. If `col not in cols_to_drop` comes first, then for any column the user has already asked to drop, you avoid scanning it for nulls at all.
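
That is, the reordered comprehension would be (this is the ordering a later revision of the diff adopts):

    null_cols = [col for col in X if
                 col not in cols_to_drop and X[col].isnull().all()]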

    warnings.warn('The following contain only nulls and '
                  'will be dropped: ' + str(null_cols),
                  UserWarning)
    cols_to_drop.extend(null_cols)
@stephen-hoover (Contributor):

This modifies the input list. I think you should pick either modifying the input or returning a list, but not both. You could return cols_to_drop + null_cols and the output will be the same, without modifying the input.
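
Putting the suggestion together, a sketch of `_flag_nulls` that returns a new list rather than mutating its argument:

    def _flag_nulls(self, X, cols_to_drop):
        null_cols = [col for col in X if
                     col not in cols_to_drop and X[col].isnull().all()]
        if null_cols:
            warnings.warn('The following contain only nulls and '
                          'will be dropped: ' + str(null_cols),
                          UserWarning)
        # Return a new list instead of extending the caller's list in place.
        return cols_to_drop + null_cols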

# check that we don't add the col if it's already being dropped
assert expander._flag_nulls(data_raw, drop_cols_2) == drop_cols_2
assert len(w) == 1
assert issubclass(w[-1].category, UserWarning)
@stephen-hoover (Contributor):

Small thing, but I'd put this check first, so that you can have `assert len(w) == 0`. It's mildly confusing to assert one warning in a test which shouldn't warn.
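
For illustration, the reordered check might look like this (assuming the `expander`, `data_raw`, and `drop_cols_2` names from the existing test):

    import warnings

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter('always')
        # The column is already being dropped, so nothing should be flagged
        # and no warning should be issued.
        assert expander._flag_nulls(data_raw, drop_cols_2) == drop_cols_2
        assert len(w) == 0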

@beckermr (Contributor)

As a random comment, the current sklearn mean imputer does this as well (and also on transform, which is even worse!).

I would argue strongly that this behavior is not desirable.

An all-null column is a serious enough ETL issue that even throwing a warning is too much hand-holding. The code should detect those columns and then fail with an informative error message.

@elsander (Contributor, Author)

@beckermr, that's actually the reason for this PR in the first place. I passed in a dataset with null columns and got failures deep in CivisML, because the silently dropped columns made the column_names metadata have different dimensions than the dataset itself. If DataFrameETL took care of this, we wouldn't have to worry about the imputer's behavior.

But I disagree that we should raise an exception instead of a warning. If I were using this estimator and I got the error, I would just add the null columns to cols_to_drop and run again. That's an extra step for the user when we could simply use a warning and add the columns to cols_to_drop for them. We should drop the columns if we're going to the trouble of detecting them.

@elsander (Contributor, Author)

Regarding cols_to_drop with a scalar input: DataFrameETL handled a scalar in that argument correctly before this PR. The null-flagging function is the only code that explicitly relies on it being a list. With this change, DataFrameETL will throw the following error during _flag_nulls:

TypeError: must be str, not list
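
A minimal reproduction of that failure mode (toy names, not from the test suite):

    cols_to_drop = 'col_a'    # scalar input rather than ['col_a']
    null_cols = ['col_b']
    cols_to_drop + null_cols  # TypeError: must be str, not list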

@beckermr (Contributor)

@elsander The point here is that if someone is using this and it errors on an all-null column they did not know about, they should be forced to go look at the data (even if they just end up dropping the column). What we are doing with this change is possibly hiding an upstream ETL bug from the user.

Remember, nobody looks at warnings. They only look at failures.

The fact that someone might just add the column to the ones to be dropped anyway is beside the point.

By adding it automatically, we are assuming that all-null columns on input are meant to be dropped. That assumption is problematic. This is the point.

There are no magic buttons.

@elsander (Contributor, Author) commented Oct 27, 2017

Null checking accounts for essentially all of the time that `fit` takes (see the attached plot). `fit` is still faster than `transform`, but I'm not sure it's worth having this on by default. Making it a switch that is off by default would let users opt in when they want it, which should also address Matt's concern.

The downside is that it wouldn't actually fix the metadata bug that motivated this change: because of the imputer's behavior, CivisML will continue to fail without a robust way for us to check. In that case, we'd essentially be deciding that we don't guarantee everything works properly when there are NaN columns (although the model training itself should still succeed).

null_plot.pdf
Note: in the plot, the red lines are without null checking and the blue lines are with it, in case the legend is hard to read.

elsander assigned stephen-hoover and unassigned elsander on Oct 27, 2017
@stephen-hoover (Contributor) left a comment:

What do you think about changing the drop_null_cols parameter so that the valid inputs are [None, False, 'raise', 'warn'], letting users select between different levels of severity?

It's probably better to change the parameter name if it takes those options. Maybe handle_null_cols, and give the option "drop" instead of "warn".
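
A hypothetical sketch of validating that option set in the constructor (the parameter name and accepted values follow the proposal above, not necessarily what the PR ships):

    VALID_NULL_OPTIONS = (None, False, 'raise', 'warn')

    class DataFrameETL:
        def __init__(self, drop_null_cols=False):
            if drop_null_cols not in VALID_NULL_OPTIONS:
                raise ValueError('drop_null_cols must be one of %r, got %r'
                                 % (VALID_NULL_OPTIONS, drop_null_cols))
            self.drop_null_cols = drop_null_cols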


    def _flag_nulls(self, X, cols_to_drop):
        null_cols = [col for col in X if
                     col not in cols_to_drop and X[col].isnull().all()]
@stephen-hoover (Contributor):

Looking at this line: you don't really care about the null status of every entry in the DataFrame. All you want to know is whether there's at least one non-null value. So I have two suggestions for performance improvements.

First, I've found a significant improvement in run time by replacing X[col].isnull().all() with X[col].first_valid_index() is None. It's 4x faster for a tiny DataFrame, and 10x faster for a 1M row DataFrame.

Second, if you think the normal use case is for most columns to be mostly non-null (I think that's the case), you can add a further condition to the list comprehension:

    null_cols = [col for col in X if
                 col not in cols_to_drop and
                 pd.isnull(X[col].values[0]) and
                 X[col].first_valid_index() is None]

Checking if the first element of the series is null takes about 4 us when I test it locally. That's significantly faster than either the isnull check or the first_valid_index check.
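
A rough micro-benchmark of the two full-column checks described above (timings will vary by machine and pandas version):

    import timeit
    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.rand(1000000))  # large, fully non-null column

    # Full scan of every entry vs. stopping at the first valid value.
    print(timeit.timeit(lambda: s.isnull().all(), number=100))
    print(timeit.timeit(lambda: s.first_valid_index() is None, number=100))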

@stephen-hoover (Contributor):

You'll need to special-case empty DataFrames when using these checks. You should skip the null check if the DataFrame length is zero.
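
For example, a guard along these lines at the top of `_flag_nulls` (placement is hypothetical):

    # X[col].values[0] would raise an IndexError on a zero-length frame,
    # so skip the null scan entirely for empty DataFrames.
    if len(X) == 0:
        return list(cols_to_drop)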

@elsander (Contributor, Author):

I like this a lot. Really clever to check the first value to save extra time!

@beckermr (Contributor)

Why are the choices

 [None, False, 'raise', 'warn']

instead of

 [True, 'raise', 'warn']

with a default of 'raise'?

@stephen-hoover (Contributor)

For a handle_null_cols option, True would be split between "raise" and "drop". Maybe "error" and "drop"? And we'd accept both None and False because people might try either.

I suggested False as the default because that preserves the current behavior and avoids a potential performance hit. (Although I think the performance hit is small.)

A default option other than False would require a major version increase, wouldn't it? That could break existing code.

@beckermr (Contributor)

Ahhhh mmmk. Thx.

@elsander (Contributor, Author)

With the performance improvements, the number of rows has essentially no effect as long as the number of NaN columns is low, and the check takes about 1/5 as long as before for the same number of columns.

null_plot_2.pdf

elsander assigned stephen-hoover and unassigned elsander on Oct 27, 2017
    if self.drop_null_cols == 'warn':
        warnings.warn(msg, UserWarning)
    elif self.drop_null_cols == 'raise':
        raise RuntimeError(msg)
@stephen-hoover (Contributor):

The message isn't correct in the case where you're raising an exception.
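
One way to address this would be to tailor the message to the action (a sketch; the warn text is from the earlier diff, the raise text is illustrative):

    if self.drop_null_cols == 'warn':
        warnings.warn('The following contain only nulls and '
                      'will be dropped: ' + str(null_cols), UserWarning)
    elif self.drop_null_cols == 'raise':
        raise RuntimeError('The following contain only nulls: '
                           + str(null_cols))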

    drop_null_cols : {None, False, 'raise', 'warn'} (default: False)
        If True, columns which are all nulls will issue a warning during
        `fit` and be dropped during `transform`. If False, there will not
        be a check for null columns (but `fit` performance will be better).
@stephen-hoover (Contributor):

Are you happy with the name "drop_null_cols" now that it's not a boolean? I can see False and "warn" as options for a parameter named "drop_null_cols", but "raise" feels a bit stranger.

@elsander (Contributor, Author):

Oops, I missed this comment at first. Maybe "check_null_cols" or "handle_null_cols"? I think "check_null_cols" is clearest.

CHANGELOG.md Outdated
## [0.1.5] - 2017-10-27

### Added
- Added `drop_null_cols` argument to check for null columns (#13)
@stephen-hoover (Contributor):

One last spot to change to `check_null_cols` here.

elsander assigned stephen-hoover and unassigned elsander on Oct 31, 2017
@stephen-hoover (Contributor) left a comment:

LGTM!

elsander merged commit 5546ba1 into civisanalytics:master on Oct 31, 2017
elsander deleted the drop_nulls branch on October 31, 2017 14:51