ENH add check to automatically drop null columns #13

elsander · 2017-10-26T17:39:03Z

This adds a check during fitting for columns that only contain nulls, and if any are found, to drop them and throw a warning.

elsander · 2017-10-26T17:50:37Z

I'd like to get your thoughts on the performance, @stephen-hoover. For a 7593x1777 data frame,
with lots of cols to expand, and <10 null columns, the null flagging took on average .31 sec,
fitting with null flagging took .37 sec, and fit_transform with null flagging took 1.32 sec,
so the performance hit is significant.

I also made a basic plot of performance by number of rows/cols. The upper line is scaling by number of columns (with a fixed 1000 rows), and the lower line is scaling by number of rows (with a fixed 1000 cols). I'm not totally sure it's worth the performance cost, but not sure how else to resolve the bug in CivisML.
performance.pdf

stephen-hoover

On your plot, what's the execution time without the null check? Spending a minute on this fit does feel excessive, but fitting a model on data with half a million columns is probably going to take a lot longer, so it doesn't matter so much.

This new cleaning should be documented in the ETLTransformer's class doc string.

Should there be a switch in the constructor to turn this on or off? That could help mitigate speed concerns.

A not-for-this PR question: In the tests, you originally had cols_to_drop with a scalar input. Does this point to a concern with the object? What happens if users provide a single string? Will the object silently do the wrong thing?

stephen-hoover · 2017-10-27T16:19:21Z

civismlext/preprocessing.py

@@ -148,14 +149,25 @@ def __init__(self,
        self.fill_value = fill_value
        self.dataframe_output = dataframe_output

+    def _flag_nulls(self, X, cols_to_drop):
+        null_cols = [col for col in X if
+                     X[col].isnull().all() and col not in cols_to_drop]


If you flip the order of these checks, then you could save a little bit of run time. If users request to drop a column and the col not in cols_to_drop comes first, then you'll avoid having to check the column for nulls.

stephen-hoover · 2017-10-27T16:22:22Z

civismlext/preprocessing.py

+            warnings.warn('The following contain only nulls and '
+                          'will be dropped: ' + str(null_cols),
+                          UserWarning)
+        cols_to_drop.extend(null_cols)


This modifies the input list. I think you should pick either modifying the input or returning a list, but not both. You could return cols_to_drop + null_cols and the output will be the same, without modifying the input.

stephen-hoover · 2017-10-27T16:28:05Z

civismlext/test/test_preprocessing.py

+        # check that we don't add the col if it's already being dropped
+        assert expander._flag_nulls(data_raw, drop_cols_2) == drop_cols_2
+        assert len(w) == 1
+        assert issubclass(w[-1].category, UserWarning)


Small thing, but I'd put this check first, so that you can have assert len(w) == 0. It's mildly confusing to assert one warning in a test which shouldn't warn.

beckermr · 2017-10-27T16:44:06Z

As a random comment, the current sklearn mean imputer does this as well (and also on transform, which is even worse!).

I would argue strongly this behavior is not desired.

An all null column is a serious enough ETL issue that even throwing a warning is holding hands too much. The code should detect those and then fail with an informative error message.

elsander · 2017-10-27T16:55:50Z

@beckermr, that's actually the reason for this PR in the first place. I passed in a dataset with null columns and got failures deep in CivisML, because the silently dropped columns made the column_names metadata have different dimensions than the dataset itself. If DataFrameETL took care of this, we wouldn't have to worry about the imputer's behavior.

But I disagree that we should raise an exception instead of a warning. If I were using this estimator and I got the error, I would just add the null columns to cols_to_drop and run again. That's an extra step for the user when we could simply use a warning and add the columns to cols_to_drop for them. We should drop the columns if we're going to the trouble of detecting them.

elsander · 2017-10-27T17:12:22Z

Regarding cols_to_drop with scalar input, DataFrameETL could handle scalar input correctly in that argument before this PR. The null flagging function is the only one that explicitly relies on the fact it's a list. With this change, DataFrameETL will throw the following error during _flag_nulls:

TypeError: must be str, not list

beckermr · 2017-10-27T17:21:57Z

@elsander The point here is that if someone is using this and it errors on an all null column that you did not know about, they should be forced to go look at the data (even if they just end up dropping it). What we are doing with this change is possibly hiding an upstream ETL bug from a user.

Remember, nobody looks at warnings. They only look at failures.

The fact that someone might just add the column to the ones to be dropped anyways is besides the point.

By adding it automatically, we are making an assumption that all null columns on input are meant to be dropped. This assumption is problematic. This is the point.

There are no magic buttons.

elsander · 2017-10-27T18:22:52Z

Null checking accounts for essentially all of the time that fit takes (see the attached plot). fit is still faster than transform, but I'm not sure it's worth having this on by default. Having it as a switch that is off by default would allow users to use that option if they wanted to, which should also address Matt's concern.

The downside is that it wouldn't actually fix the metadata bug that motivated this fix-- because of the imputer's behavior, it'll continue to fail without a robust way for us to check. In that case, the decision would essentially be that we don't guarantee that everything will work properly if there are nan columns (although the model training itself should still succeed).

null_plot.pdf
^Note: In the plot, the red lines are without null checking, blue lines are with, in case the legend is a little obscure.

stephen-hoover

What do you think about changing the drop_null_cols parameter so that valid inputs are [None, False, 'raise', 'warn'], so that users can select between different levels of severity?

It's probably better to change the parameter name if it takes those options. Maybe handle_null_cols, and give the option "drop" instead of "warn".

stephen-hoover · 2017-10-27T19:18:17Z

civismlext/preprocessing.py

+
+    def _flag_nulls(self, X, cols_to_drop):
+        null_cols = [col for col in X if
+                     col not in cols_to_drop and X[col].isnull().all()]


Looking at this line -- you don't really care about the null status of every entry in the DataFrame. All you want to know is whether or not there's at least one non-null value. Therefore, I have two suggestions for performance improvements.

First, I've found a significant improvement in run time by replacing X[col].isnull().all() with X[col].first_valid_index() is None. It's 4x faster for a tiny DataFrame, and 10x faster for a 1M row DataFrame.

Second, if you think the normal use case is for most columns to be mostly non-null (I think that's the case), you can add a further condition to the list comprehension:

null_cols = [col for col in X if col not in cols_to_drop and pd.isnull(X[col].values[0]) and X[col].first_valid_index() is None]

Checking if the first element of the series is null takes about 4 us when I test it locally. That's significantly faster than either the isnull check or the first_valid_index check.

You'll need to special-case empty DataFrames when using these checks. You should skip the null check if the DataFrame length is zero.

I like this idea a lot-- really clever idea, to check the first value to save extra time!

beckermr · 2017-10-27T19:26:55Z

Why are the choices

 [None, False, 'raise', 'warn']

instead of

 [True, 'raise', 'warn']

with a default of 'raise'?

stephen-hoover · 2017-10-27T19:38:12Z

For a handle_null_cols option, the True would be split up between "raise" and "drop". Maybe "error" and "drop"? And both None and False because people might try either.

I suggested the default to be False because that preserves the current behavior and avoids a potential performance hit. (Although I think the performance hit is small.)

A default option other than False would require a major version increase, wouldn't it? That could break existing code.

beckermr · 2017-10-27T20:24:57Z

Ahhhh mmmk. Thx.

…ents

elsander · 2017-10-27T21:20:43Z

With the performance improvements, number of rows has essentially no effect as long as the number of nan columns is low, and it takes about 1/5 of the time as before for the same number of columns.

null_plot_2.pdf

stephen-hoover · 2017-10-27T22:16:24Z

civismlext/preprocessing.py

+            if self.drop_null_cols == 'warn':
+                warnings.warn(msg, UserWarning)
+            elif self.drop_null_cols == 'raise':
+                raise RuntimeError(msg)


The message isn't correct in the case where you're raising an exception.

stephen-hoover · 2017-10-27T22:20:04Z

civismlext/preprocessing.py

-        If True, columns which are all nulls will issue a warning during
-        `fit` and be dropped during `transform`. If False, there will not
-        be a check for null columns (but `fit` performance will be better).
+    drop_null_cols : {None, False, 'raise', 'warn'} (default: False)


Are you happy with the name "drop_null_cols" now that it's not a boolean? I can see False and "warn" as options for a parameter named "drop_null_cols", but "raise" feels a bit stranger.

Oops, missed this comment at first-- maybe "check_null_cols", or "handle_null_cols"? I think "check_null_cols" is clearest.

stephen-hoover · 2017-10-30T22:53:33Z

CHANGELOG.md

+## [0.1.5] - 2017-10-27
+
+### Added
+- Added `drop_null_cols` argument to check for null columns (#13)


One last thing to change to check_null_cols here.

stephen-hoover

LGTM!

ENH add check to automatically drop null columns

91d9cfb

elsander requested a review from stephen-hoover October 26, 2017 17:39

Liz Sander added 3 commits October 26, 2017 12:51

STY PEP8 fix

fde619a

TST no f string formatting after all :(

a4ea11f

TST fix stateful handling of warnings in python 2.7

201524f

elsander assigned stephen-hoover Oct 26, 2017

stephen-hoover suggested changes Oct 27, 2017

View reviewed changes

stephen-hoover assigned elsander and unassigned stephen-hoover Oct 27, 2017

ENH add null checking as a flag rather than default

53e0053

elsander assigned stephen-hoover and unassigned elsander Oct 27, 2017

stephen-hoover suggested changes Oct 27, 2017

View reviewed changes

stephen-hoover assigned elsander and unassigned stephen-hoover Oct 27, 2017

ENH more detailed options for drop_null_cols and performance improvem…

f8f2623

…ents

DOC update changelog and micro version

282bb2c

elsander assigned stephen-hoover and unassigned elsander Oct 27, 2017

stephen-hoover suggested changes Oct 27, 2017

View reviewed changes

stephen-hoover assigned elsander and unassigned stephen-hoover Oct 27, 2017

ENH fix error message

967c88e

elsander assigned stephen-hoover and elsander and unassigned elsander and stephen-hoover Oct 30, 2017

ENH drop_null_cols -> check_null_cols

aedcbbd

elsander assigned stephen-hoover and unassigned elsander Oct 30, 2017

elsander mentioned this pull request Oct 30, 2017

V3.3.0 civisanalytics/datascience-python#47

Closed

elsander added the enhancement label Oct 30, 2017

stephen-hoover reviewed Oct 30, 2017

View reviewed changes

stephen-hoover assigned elsander and unassigned stephen-hoover Oct 30, 2017

STY changelog update

fda15fc

elsander assigned stephen-hoover and unassigned elsander Oct 31, 2017

stephen-hoover approved these changes Oct 31, 2017

View reviewed changes

stephen-hoover assigned elsander and unassigned stephen-hoover Oct 31, 2017

elsander merged commit 5546ba1 into civisanalytics:master Oct 31, 2017

elsander deleted the drop_nulls branch October 31, 2017 14:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH add check to automatically drop null columns #13

ENH add check to automatically drop null columns #13

elsander commented Oct 26, 2017

elsander commented Oct 26, 2017

stephen-hoover left a comment

stephen-hoover Oct 27, 2017

stephen-hoover Oct 27, 2017

stephen-hoover Oct 27, 2017

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017

elsander commented Oct 27, 2017

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017 •

edited

Loading

stephen-hoover left a comment

stephen-hoover Oct 27, 2017

stephen-hoover Oct 27, 2017

elsander Oct 27, 2017

beckermr commented Oct 27, 2017

stephen-hoover commented Oct 27, 2017

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017

stephen-hoover Oct 27, 2017

stephen-hoover Oct 27, 2017

elsander Oct 30, 2017

stephen-hoover Oct 30, 2017

stephen-hoover left a comment

ENH add check to automatically drop null columns #13

ENH add check to automatically drop null columns #13

Conversation

elsander commented Oct 26, 2017

elsander commented Oct 26, 2017

stephen-hoover left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017

elsander commented Oct 27, 2017

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017 • edited Loading

stephen-hoover left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beckermr commented Oct 27, 2017

stephen-hoover commented Oct 27, 2017

beckermr commented Oct 27, 2017

elsander commented Oct 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephen-hoover left a comment

Choose a reason for hiding this comment

elsander commented Oct 27, 2017 •

edited

Loading