
Standardize component predict/predict_proba in/out to pandas series/dataframe #853

Merged: 12 commits merged into master from ds_236_standardize_component_output_type on Jun 19, 2020

Conversation


@dsherry dsherry commented Jun 15, 2020

This PR adds test coverage which checks that, for any component listed in all_components(), predict outputs a pd.Series and predict_proba outputs a pd.DataFrame. It also fixes specific components' implementations to conform to this expectation.

It would be great if we could find a way in the code to avoid having to convert to/from pd datastructures all over the place. We could define a component metaclass to help with this, to wrap subclasses' predict and predict_proba methods in standardization code. But for now, this PR was easier and adds value.

We should add similar coverage for pipelines, but I'd like us to do that separately. It's unclear how necessary that is if we have good component-level coverage.
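
Roughly, the kind of component-level check being added looks like the sketch below. This is illustrative only: the import paths, the X_y fixture, and the assumption that every component constructs with default parameters are mine, not the exact test in this PR.

import pandas as pd
import pytest

from evalml.pipelines.components import Estimator  # import paths assumed, may differ
from evalml.pipelines.components.utils import all_components  # import path assumed


@pytest.mark.parametrize("component_class", list(all_components().values()))
def test_component_prediction_types(component_class, X_y):
    X, y = X_y
    if not issubclass(component_class, Estimator):
        pytest.skip("only estimators expose predict/predict_proba")
    component = component_class()  # assumes defaults work for every component
    component.fit(X, y)
    assert isinstance(component.predict(X), pd.Series)
    try:
        proba = component.predict_proba(X)
    except (AttributeError, NotImplementedError):
        return  # regressors don't provide predict_proba
    assert isinstance(proba, pd.DataFrame)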

codecov bot commented Jun 15, 2020

Codecov Report

Merging #853 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #853      +/-   ##
==========================================
+ Coverage   99.70%   99.74%   +0.04%     
==========================================
  Files         195      195              
  Lines        8181     8365     +184     
==========================================
+ Hits         8157     8344     +187     
+ Misses         24       21       -3     
Impacted Files Coverage Δ
evalml/objectives/objective_base.py 100.00% <100.00%> (ø)
evalml/pipelines/binary_classification_pipeline.py 100.00% <100.00%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.00% <100.00%> (ø)
...valml/pipelines/components/estimators/estimator.py 100.00% <100.00%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.00% <100.00%> (ø)
.../pipelines/components/transformers/drop_columns.py 100.00% <100.00%> (ø)
...transformers/feature_selection/feature_selector.py 83.78% <100.00%> (+5.40%) ⬆️
...onents/transformers/imputers/per_column_imputer.py 100.00% <100.00%> (ø)
...components/transformers/imputers/simple_imputer.py 100.00% <100.00%> (ø)
...l/pipelines/components/transformers/transformer.py 100.00% <100.00%> (ø)
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a56753...3006d3d.

@dsherry dsherry force-pushed the ds_236_standardize_component_output_type branch from 2b1e2ee to 31c7796 on June 17, 2020 04:06
@@ -145,7 +145,7 @@
"outputs": [],
"source": [
"# get the predicted probabilities associated with the \"true\" label\n",
"y_pred_proba = pipeline.predict_proba(X)[:, 1]\n",
"y_pred_proba = pipeline.predict_proba(X)[1]\n",
Contributor:

can we make this code pipeline.predict_proba(X)["true"] now?

Contributor Author (dsherry):

Yeah, that's up next, but not in this PR--that's tracked by #645 which will get unblocked by this work.

@dsherry dsherry marked this pull request as ready for review June 17, 2020 20:54
dsherry commented Jun 17, 2020

Still battling codecov but this is ready for review

if not isinstance(X, pd.DataFrame):
X = pd.DataFrame(X)
if not isinstance(y, pd.Series):
y = pd.Series(y)
Contributor Author (dsherry):

The solution in this PR is not elegant: it adds this sort of code to a bunch of components, rather than putting a general pattern in place.

I still feel good about this PR, because the test coverage iterates over anything in all_components, meaning new components will have consistent coverage.

I will file something to track finding a permanent solution for how to avoid scattering this code around our codebase. One idea is to use a metaclass on components, which wraps fit/predict/etc. and ensures the inputs and return type are consistent. We could use the same solution for #851.
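
For reference, a minimal sketch of what that metaclass idea could look like. This is hypothetical and not part of this PR; a real version would also need to handle inherited methods and pass through extra arguments like objectives.

import pandas as pd


class _StandardizedOutputMeta(type):
    """Hypothetical sketch: wrap predict/predict_proba so subclasses always return pandas types."""

    def __new__(mcs, name, bases, namespace):
        predict = namespace.get("predict")
        if predict is not None:
            # coerce whatever the subclass returns into a pd.Series
            namespace["predict"] = lambda self, X: pd.Series(predict(self, X))
        predict_proba = namespace.get("predict_proba")
        if predict_proba is not None:
            # coerce 2d probability output into a pd.DataFrame
            namespace["predict_proba"] = lambda self, X: pd.DataFrame(predict_proba(self, X))
        return super().__new__(mcs, name, bases, namespace)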

@@ -40,7 +40,7 @@ def transform(self, X, y=None):
Returns:
pd.DataFrame: Transformed X
"""
cols = self.parameters.get("columns", [])
cols = self.parameters.get("columns") or []
Contributor Author (dsherry):

Hm, I don't remember if this change was necessary; I'll try backing it out.
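
For context, the two spellings only differ when the key is present but explicitly set to None:

params = {"columns": None}

params.get("columns", [])    # None -- the default only applies when the key is missing
params.get("columns") or []  # []   -- also guards against a stored None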

@@ -36,18 +36,19 @@ def transform(self, X, y=None):
if isinstance(X, pd.DataFrame):
self.input_feature_names = list(X.columns.values)
else:
self.input_feature_names = range(len(X.shape[1]))
Contributor Author (dsherry):

This was a bug!
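
(For anyone skimming: X.shape[1] is already an int, so calling len() on it raises a TypeError; my reading is that the intended expression was range(X.shape[1]).)

import numpy as np

X = np.zeros((10, 3))
X.shape[1]          # 3, an int
# len(X.shape[1])   # TypeError: object of type 'int' has no len()
range(X.shape[1])   # range(0, 3), the intended column indices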

X_t = pd.DataFrame(X_t, columns=X_null_dropped.columns).astype(X_null_dropped.dtypes.to_dict())
return X_t
return pd.DataFrame(X_t, columns=X_null_dropped.columns).astype(X_null_dropped.dtypes.to_dict())
return pd.DataFrame(X_t)
Contributor Author (dsherry):

This isn't great. Suggestions appreciated

@angela97lin (Contributor) left a comment:

Glad that you're doing this, even if it's just a short-term solution! 👍

@@ -186,11 +186,15 @@ def test_normalize_confusion_matrix_error():
normalize_confusion_matrix(conf_mat, 'all')


def test_graph_roc_curve_binary(X_y):
@pytest.mark.parametrize("data_type", ['np', 'pd'])
Contributor:

Oo cool, didn't know this was a thing!

Contributor Author (dsherry):

Yeah! Pretty fun :) and you can run individual tests from the cmdline with pytest evalml/tests/pipeline_tests/test_graph_utils.py::test_graph_roc_curve_binary\[np\] (backslash escaping needed for unix shell)
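
For anyone new to the pattern, the parametrized flag is typically consumed inside the test body roughly like this (a sketch, not the exact test code from this PR):

import pandas as pd
import pytest


@pytest.mark.parametrize("data_type", ['np', 'pd'])
def test_graph_roc_curve_binary(data_type, X_y):
    X, y = X_y
    if data_type == 'pd':
        # run the same assertions against pandas inputs as well as numpy
        X = pd.DataFrame(X)
        y = pd.Series(y)
    ...  # rest of the test is unchanged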

@@ -64,9 +68,12 @@ def fit(self, X, y=None):

def predict(self, X):
predictions = self._component_obj.predict(X)
if predictions.ndim == 2 and predictions.shape[1] == 1:
Contributor:

Is this related to the standardization or an unrelated change? :o

Contributor Author (dsherry):

Related. If predictions is a 2d input with one column, inverse_transform will fail.
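
In other words, the guard handles wrapped estimators that hand back a (n_samples, 1) array; a toy illustration of the squeeze (not the exact diff):

import numpy as np
import pandas as pd

raw = np.array([[0], [1], [1]])         # some wrapped estimators return shape (n, 1)
if raw.ndim == 2 and raw.shape[1] == 1:
    raw = raw.flatten()                 # squeeze to (n,) so downstream 1d code works
predictions = pd.Series(raw)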

y_true = [2, 0, 2, 2, 0, 1]
y_predicted = [0, 0, 2, 2, 0, 2]
expected_return_type = pd.DataFrame
Contributor:

Just wondering, why not just check assert isinstance(conf_mat, pd.DataFrame)? :o

Contributor Author (dsherry):

Oops, thanks, yeah this was left over from an earlier state

Contributor Author (dsherry):

Just pushed this too.

Comment on lines 395 to 427
range_index = pd.RangeIndex(start=0, stop=X_np.shape[1], step=1)
assert (X_df_no_col_names.columns == range_index).all()
Contributor:

I like this check, but I'm also conflicted because it seems more like we're testing pandas and what a pandas DataFrame does when we don't specify a columns parameter 🤷‍♀️

Contributor Author (dsherry):

You're right. This was a sanity check I left in from when I was setting up the test. I can clean this up. Thanks.

Contributor:

Not sure if I'm looking at an outdated version but just FYI!

Contributor Author (dsherry):

Augh! Yes, thanks for the reminder

components = all_components()
transformers = dict(filter(lambda el: issubclass(el[1], Transformer), components.items()))
for component_name, component_class in transformers.items():
print('Testing transformer {}'.format(component_class.name))
Contributor:

Is this intentionally left here o:

Contributor Author (dsherry):

I've found leaving prints in these tests which iterate over each component/pipeline is helpful when there are failures, because the last printed value is the one for which the failure occurred.

Contributor Author (dsherry):

I guess this particular print is redundant because there's another below. Will remove

Contributor:

Ah that makes sense, but do we want to keep them in when we push to master? :O

Contributor Author (dsherry):

Yeah. Is there a reason we shouldn't? My understanding is pytest will suppress print output unless there's a failure, which is when this information would become helpful. I say this because I don't see this print (or others) in the current test output!

@@ -59,4 +59,4 @@ def test_xgboost_feature_name_with_random_ascii(X_y):
predictions = clf.predict(X)
assert len(predictions) == len(y)
assert not np.isnan(predictions).all()
assert not np.isnan(clf.predict_proba(X)).all()
assert not np.isnan(clf.predict_proba(X)).all().all()
Contributor:

Why the second .all() call? I see that when running the tests, this generates a higher-dimensional result, so the second call is necessary to get something that assert will accept. But why take np.isnan(clf.predict_proba(X)) instead of np.isnan(clf.predict_proba(X).to_numpy()), which is generally what you have in the other files? Just wondering, in terms of consistency, if there was any reason for not doing the to_numpy in this case as opposed to the other cases.

Contributor:

Just in terms of my other comment, I think it makes sense to choose one or the other (numpy arrays or pandas dataframes) for most tests, or at least for classes of tests, so they're standardized.

@dsherry (Contributor Author), Jun 18, 2020:

First, predict_proba returns a 2d array of predicted probabilities, class vs row. Next, all estimators' predict() methods now return pd.Series, and all estimators' predict_proba() methods now return pd.DataFrame. Finally, in order to check if there are no nans in a 2d array, we have to do .all().all() to AND the check across every row and column. Does that make sense?
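
Concretely, with a toy frame:

import numpy as np
import pandas as pd

proba = pd.DataFrame([[0.2, 0.8], [0.6, 0.4]])

np.isnan(proba)                   # DataFrame of booleans, one per cell
np.isnan(proba).all()             # Series: one boolean per column
np.isnan(proba).all().all()       # single boolean across the whole frame
np.isnan(proba.to_numpy()).all()  # equivalent single boolean via numpy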

RE your second comment, I agree, and that's what this PR does! We've standardized component and pipeline predictions to pandas types. So all the tests check that the output is pandas. See the main new tests I added in test_components.py for type checking across all components

Contributor:

I think my question was more along the lines of: why do you do .all() twice instead of converting to numpy and just doing .all() once, which is more similar to the way some other tests are done?
And then re my second comment: it seems like that change made sense, but there were other tests that first converted to numpy and then checked the closeness of the arrays rather than just using the pandas assert_frame_equal. It was more of a consistency question (for both).

Contributor Author (dsherry):

Oh, got it, np.isnan(clf.predict_proba(X).to_numpy()).all(), yep that'd work.

RE assert_frame_equal, yeah I think whatever gets the job done! In this case, assert_frame_equal worked and didn't cause platform issues.

Contributor:

I guess for the future, does it matter if there is consistency in the testing between pd.testing.assert_frame_equal(a, b) vs np.testing.assert_almost_equal(a.to_numpy(), b.to_numpy()), or is either OK in regard to style/efficiency?

Contributor Author (dsherry):

Good question. As long as our tests provide good coverage, I think we're doing well. What I mean by that is: if we need the "almost equal" functionality to avoid numerical precision errors, we should use that, but otherwise it's just a style thing.
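
For example, pandas' own comparison can absorb small float noise too; the exact tolerance knobs vary by pandas version, so check_exact is the one shown here:

import pandas as pd

a = pd.DataFrame({"p0": [0.1, 0.9], "p1": [0.9, 0.1]})
b = a + 1e-9  # tiny float noise

pd.testing.assert_frame_equal(a, b, check_exact=False)   # passes: near-equal values allowed
# pd.testing.assert_frame_equal(a, b, check_exact=True)  # raises AssertionError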

@@ -716,4 +716,4 @@ def test_clone_fitted(X_y, lr_pipeline):
pipeline_clone.fit(X, y)
X_t_clone = pipeline_clone.predict_proba(X)

np.testing.assert_almost_equal(X_t, X_t_clone)
pd.testing.assert_frame_equal(X_t, X_t_clone)
Contributor:

Is there a reason you use pd.testing.assert_frame_equal here, but in other places convert to a numpy array and then use the similar numpy version, np.testing.assert_almost_equal? I think it might make sense to use this particular call elsewhere instead of converting to numpy arrays each time; especially since this PR is standardizing to pandas, it makes sense to test in that environment as well. It seems like this can also support the "almost equal" part of the numpy comparisons, since its documentation shows you can specify how close values need to be to count as equal.

Contributor Author (dsherry):

So, this test isn't doing any conversion to np array -- predict and predict_proba now always return pd dataframes. In light of that, we need to use pandas' comparison method, because np's method doesn't work with pd types. Try it out on the cmdline if you're curious.

Contributor:

Ah, sorry, my comment wasn't that clear. I think I was wondering why you do it this way here, but then convert to a numpy array and compare with the numpy method in other places, instead of doing it in this simpler way. I just left another comment on an example of using the numpy way (with the extra conversion) rather than the pandas way like you do here. I think the one here (this file, test_pipelines) makes more sense.

@@ -64,13 +64,13 @@ def test_catboost_objective_tuning(X_y):
# testing objective parameter passed in does not change results
objective = Precision()
y_pred_with_objective = clf.predict(X, objective)
np.testing.assert_almost_equal(y_pred, y_pred_with_objective, decimal=5)
np.testing.assert_almost_equal(y_pred.to_numpy(), y_pred_with_objective.to_numpy(), decimal=5)
@ctduffy (Contributor), Jun 18, 2020:

@dsherry here is an example of the place where you convert to numpy and then use the numpy function instead of just using the pandas assertion. (sorry probably should have commented on this one instead of the one with the pandas version of testing array equality)

@dsherry dsherry force-pushed the ds_236_standardize_component_output_type branch from fcfdf0e to e6199a7 on June 19, 2020 13:24
…unit test coverage for pd+np dtypes with objective scoring
@angela97lin (Contributor) left a comment:

Left a few small comments but otherwise LGTM!

Comment on lines +53 to +56
if isinstance(y_true, pd.Series):
y_true = y_true.to_numpy()
if isinstance(y_pred_proba, (pd.Series, pd.DataFrame)):
y_pred_proba = y_pred_proba.to_numpy()
Contributor:

This, along with the new static method you added in ObjectiveBase, makes me wonder if it'd be worth extracting into some util method but 🤷‍♀️

Contributor Author (dsherry):

I agree! It would definitely be worth it. In this PR I just wanted to get solid test coverage in place and apply minimal changes to get them passing. Once this is merged, yeah, we should get a better fix in place!
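
For what it's worth, the eventual shared helper could be as small as something like this (hypothetical name, not part of this PR):

import numpy as np
import pandas as pd


def _to_numpy(values):
    """Hypothetical util: pass numpy arrays through, convert pandas objects to numpy."""
    if isinstance(values, (pd.Series, pd.DataFrame)):
        return values.to_numpy()
    return np.asarray(values)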

@@ -238,7 +240,7 @@ class MockTransformerWithFitAndTransform(Transformer):
hyperparameter_ranges = {}

def fit(self, X, y=None):
return X
return self
Contributor:

👍


@dsherry dsherry merged commit da28481 into master Jun 19, 2020
@dsherry dsherry deleted the ds_236_standardize_component_output_type branch June 19, 2020 18:09
@angela97lin angela97lin mentioned this pull request Jun 30, 2020