Standardize component predict/predict_proba in/out to pandas series/dataframe #853
Conversation
Codecov Report
@@ Coverage Diff @@
## master #853 +/- ##
==========================================
+ Coverage 99.70% 99.74% +0.04%
==========================================
Files 195 195
Lines 8181 8365 +184
==========================================
+ Hits 8157 8344 +187
+ Misses 24 21 -3
Continue to review full report at Codecov.
Force-pushed 2b1e2ee to 31c7796
```diff
@@ -145,7 +145,7 @@
     "outputs": [],
     "source": [
      "# get the predicted probabilities associated with the \"true\" label\n",
-     "y_pred_proba = pipeline.predict_proba(X)[:, 1]\n",
+     "y_pred_proba = pipeline.predict_proba(X)[1]\n",
```
can we make this code `pipeline.predict_proba(X)["true"]` now?
Yeah, that's up next, but not in this PR--that's tracked by #645 which will get unblocked by this work.
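(For illustration, a rough sketch of what that label-based selection could look like once #645 names `predict_proba` columns after class labels; the data and column names here are made up.)

```python
import pandas as pd

# Hypothetical future output: predict_proba returns a DataFrame whose
# columns are the class labels rather than positional integers.
proba = pd.DataFrame({"false": [0.9, 0.2], "true": [0.1, 0.8]})

y_pred_proba = proba["true"]  # probability Series for the "true" class
```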
Still battling codecov but this is ready for review
```python
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X)
if not isinstance(y, pd.Series):
    y = pd.Series(y)
```
The solution in this PR is not elegant: it adds this sort of code to a bunch of components, rather than putting a general pattern in place.

I still feel good about this PR, because the test coverage iterates over everything in `all_components`, meaning new components will have consistent coverage.

I will file something to track finding a permanent solution for how to avoid scattering this code around our codebase. One idea is to use a metaclass on components, which wraps `fit`/`predict`/etc. and ensures the inputs and return type are consistent. We could use the same solution for #851.
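(A minimal sketch of that metaclass idea, assuming made-up names like `ComponentMeta` and `_standardize_io`; nothing like this exists in the codebase yet.)

```python
import functools

import pandas as pd


def _standardize_io(method):
    """Hypothetical wrapper: coerce X to a DataFrame on the way in and
    predictions to pandas types on the way out."""
    @functools.wraps(method)
    def wrapper(self, X, *args, **kwargs):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        result = method(self, X, *args, **kwargs)
        if method.__name__ == "predict" and not isinstance(result, pd.Series):
            result = pd.Series(result)
        elif method.__name__ == "predict_proba" and not isinstance(result, pd.DataFrame):
            result = pd.DataFrame(result)
        return result
    return wrapper


class ComponentMeta(type):
    """Hypothetical metaclass: wraps fit/predict/predict_proba on every subclass
    so individual components don't need their own conversion boilerplate."""
    def __new__(mcs, name, bases, namespace):
        for attr in ("fit", "predict", "predict_proba"):
            if attr in namespace:
                namespace[attr] = _standardize_io(namespace[attr])
        return super().__new__(mcs, name, bases, namespace)
```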
```diff
@@ -40,7 +40,7 @@ def transform(self, X, y=None):
         Returns:
             pd.DataFrame: Transformed X
         """
-        cols = self.parameters.get("columns", [])
+        cols = self.parameters.get("columns") or []
```
Hm, I don't remember if this change was necessary; will try backing it out.
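(For reference, the two forms only differ when the key is present but maps to `None`:)

```python
params = {"columns": None}

params.get("columns", [])    # None -- the default only applies when the key is absent
params.get("columns") or []  # []   -- also covers an explicit None value
```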
```diff
@@ -36,18 +36,19 @@ def transform(self, X, y=None):
         if isinstance(X, pd.DataFrame):
             self.input_feature_names = list(X.columns.values)
         else:
-            self.input_feature_names = range(len(X.shape[1]))
```
This was a bug!
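(Specifically: `X.shape[1]` is already an int, so calling `len` on it raises `TypeError`. Presumably the fix is a range over the column count:)

```python
import numpy as np

X = np.zeros((10, 3))
# len(X.shape[1]) raises TypeError, since X.shape[1] is the int 3.
input_feature_names = range(X.shape[1])  # range(0, 3)
```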
```diff
-        X_t = pd.DataFrame(X_t, columns=X_null_dropped.columns).astype(X_null_dropped.dtypes.to_dict())
-        return X_t
+        return pd.DataFrame(X_t, columns=X_null_dropped.columns).astype(X_null_dropped.dtypes.to_dict())
+        return pd.DataFrame(X_t)
```
This isn't great. Suggestions appreciated
Glad that you're doing this, even if it's just a short-term solution! 👍
```diff
@@ -186,11 +186,15 @@ def test_normalize_confusion_matrix_error():
         normalize_confusion_matrix(conf_mat, 'all')


-def test_graph_roc_curve_binary(X_y):
+@pytest.mark.parametrize("data_type", ['np', 'pd'])
```
Oo cool, didn't know this was a thing!
Yeah! Pretty fun :) And you can run individual tests from the cmdline with `pytest evalml/tests/pipeline_tests/test_graph_utils.py::test_graph_roc_curve_binary\[np\]` (backslash escaping needed for unix shells).
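(Roughly what the parametrized test looks like; a sketch, not the exact test body, with `X_y` being the existing fixture:)

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("data_type", ['np', 'pd'])
def test_graph_roc_curve_binary(data_type, X_y):
    X, y = X_y
    if data_type == 'pd':
        # Each parameter value gets its own test id, e.g.
        # test_graph_roc_curve_binary[np] and test_graph_roc_curve_binary[pd].
        X = pd.DataFrame(X)
        y = pd.Series(y)
    ...  # rest of the original test body
```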
```diff
@@ -64,9 +68,12 @@ def fit(self, X, y=None):


     def predict(self, X):
         predictions = self._component_obj.predict(X)
+        if predictions.ndim == 2 and predictions.shape[1] == 1:
```
Is this related to the standardization or an unrelated change? :o
Related. If `predictions` is a 2d input with one column, `inverse_transform` will fail.
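(In other words, a guard along these lines; the `flatten` call is an assumption, since the diff above only shows the condition:)

```python
import numpy as np


def _squeeze_predictions(predictions):
    # inverse_transform expects 1d input, so squeeze a (n, 1)-shaped
    # prediction array down to shape (n,).
    if predictions.ndim == 2 and predictions.shape[1] == 1:
        predictions = predictions.flatten()
    return predictions


_squeeze_predictions(np.zeros((5, 1))).shape  # (5,)
```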
```python
    y_true = [2, 0, 2, 2, 0, 1]
    y_predicted = [0, 0, 2, 2, 0, 2]
    expected_return_type = pd.DataFrame
```
Just wondering, why not just check `assert isinstance(conf_mat, pd.DataFrame)`? :o
Oops, thanks, yeah this was left over from an earlier state
Just pushed this too.
```python
    range_index = pd.RangeIndex(start=0, stop=X_np.shape[1], step=1)
    assert (X_df_no_col_names.columns == range_index).all()
```
I like this check, but I'm also conflicted, because it seems more like we're testing pandas and what a pandas DataFrame does when we don't specify a columns parameter 🤷‍♀️
You're right. This was a sanity check I left in from when I was setting up the test. I can clean this up. Thanks.
Not sure if I'm looking at an outdated version but just FYI!
Augh! Yes, thanks for the reminder
```python
    components = all_components()
    transformers = dict(filter(lambda el: issubclass(el[1], Transformer), components.items()))
    for component_name, component_class in transformers.items():
        print('Testing transformer {}'.format(component_class.name))
```
Is this intentionally left here o:
I've found leaving `print`s in these tests which iterate over each component/pipeline is helpful when there are failures, because the last printed value is the one for which the failure occurred.
I guess this particular `print` is redundant because there's another below. Will remove.
Ah that makes sense, but do we want to keep them in when we push to master? :O
Yeah. Is there a reason we shouldn't? My understanding is that pytest suppresses print output unless there's a failure, which is exactly when this information becomes helpful. I say this because I don't see this print (or the others) in the current test output!
```diff
@@ -59,4 +59,4 @@ def test_xgboost_feature_name_with_random_ascii(X_y):
     predictions = clf.predict(X)
     assert len(predictions) == len(y)
     assert not np.isnan(predictions).all()
-    assert not np.isnan(clf.predict_proba(X)).all()
+    assert not np.isnan(clf.predict_proba(X)).all().all()
```
Why the second `.all()` call? I see that when running the tests, this generates a higher-dimensional array, so the second call is necessary to get something that assert will accept. But why take `np.isnan(clf.predict_proba(X))` instead of `np.isnan(clf.predict_proba(X).to_numpy())`, which is generally what you have in the other files? Just wondering, in terms of consistency, whether there was any reason for not doing the `to_numpy` in this case as opposed to the other cases.
Just in terms of my other comment, I think it makes sense to choose one or the other (numpy arrays or pandas dataframes) for most tests, or at least for classes of tests, to standardize between them.
First, `predict_proba` returns a 2d array of predicted probabilities, class vs row. Next, all estimators' `predict()` methods now return `pd.Series`, and all estimators' `predict_proba()` methods now return `pd.DataFrame`. Finally, in order to check for `nan`s across a 2d array, we have to do `.all().all()` to AND the check across every row and column. Does that make sense?

RE your second comment, I agree, and that's what this PR does! We've standardized component and pipeline predictions to pandas types, so all the tests check that the output is pandas. See the main new tests I added in `test_components.py` for type checking across all components.
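(Concretely: on a DataFrame, the first `.all()` reduces each column to a single bool, and the second reduces that Series to the one bool an assert can use. A tiny made-up example:)

```python
import numpy as np
import pandas as pd

proba = pd.DataFrame({0: [0.1, 0.9], 1: [0.9, 0.1]})

np.isnan(proba).all()        # per-column Series: {0: False, 1: False}
np.isnan(proba).all().all()  # single bool, usable in a bare assert
```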
I think my question was more along the lines of: why call `.all()` twice instead of converting to numpy and doing `.all()` once, which is more similar to the way some other tests are done?

And re my second comment: it seems like that change made sense, but then there were other tests that first converted to numpy and then checked the closeness of the arrays rather than just using the pandas `assert_frame_equal`. It was more of a consistency question (for both).
Oh, got it: `np.isnan(clf.predict_proba(X).to_numpy()).all()`, yep, that'd work.

RE `assert_frame_equal`, yeah, I think whatever gets the job done! In this case, `assert_frame_equal` worked and didn't cause platform issues.
I guess for the future: does it matter if there is continuity in the testing between `pd.testing.assert_frame_equal(a, b)` vs `np.testing.assert_almost_equal(a.to_numpy(), b.to_numpy())`, or is either OK in regard to style/efficiency?
Good question. As long as our tests provide good coverage, I think we're doing well. What I mean by that is: if we need the "almost equal" functionality to avoid numerical precision errors, we should use it; otherwise it's just a style thing.
```diff
@@ -716,4 +716,4 @@ def test_clone_fitted(X_y, lr_pipeline):
     pipeline_clone.fit(X, y)
     X_t_clone = pipeline_clone.predict_proba(X)

-    np.testing.assert_almost_equal(X_t, X_t_clone)
+    pd.testing.assert_frame_equal(X_t, X_t_clone)
```
Is there a reason you use `pd.testing.assert_frame_equal` here, but in other places convert to a numpy array first and then use the similar numpy version, `np.testing.assert_almost_equal`? I think it might make sense to use this particular call elsewhere instead of converting to numpy arrays each time; especially since this PR standardizes on pandas, it makes sense to test in that environment as well. It seems like this can also support the "almost equal" part of the numpy comparisons, since its documentation shows an ability to specify how close you want the threshold for equality to be.
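(For reference, a small sketch of that tolerance support; the data here is made up:)

```python
import pandas as pd

a = pd.DataFrame({"p": [0.3000001]})
b = pd.DataFrame({"p": [0.3]})

# check_exact=False enables approximate float comparison, similar in spirit to
# np.testing.assert_almost_equal, while also validating index, columns, and dtypes.
pd.testing.assert_frame_equal(a, b, check_exact=False)
```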
So, this test isn't doing any conversion to an np array: `predict` and `predict_proba` now always return pd dataframes. In light of that, we need to use pandas' comparison method, because np's method doesn't work with pd types. Try it out on the cmdline if you're curious.
Ah, sorry, my comment wasn't that clear. I was wondering why you do it this way here, but then convert to a numpy array and compare with the numpy method in other places, instead of doing it in this simpler way. I just left another comment on an example of the numpy way (with the extra conversion) rather than the pandas way you use here. I think the one here (this file, test_pipelines) makes more sense.
```diff
@@ -64,13 +64,13 @@ def test_catboost_objective_tuning(X_y):
     # testing objective parameter passed in does not change results
     objective = Precision()
     y_pred_with_objective = clf.predict(X, objective)
-    np.testing.assert_almost_equal(y_pred, y_pred_with_objective, decimal=5)
+    np.testing.assert_almost_equal(y_pred.to_numpy(), y_pred_with_objective.to_numpy(), decimal=5)
```
@dsherry here is an example of the place where you convert to numpy and then use the numpy function instead of just using the pandas assertion. (Sorry, I probably should have commented on this one instead of the one with the pandas version of testing array equality.)
Force-pushed fcfdf0e to e6199a7 ("…unit test coverage for pd+np dtypes with objective scoring")
Left a few small comments but otherwise LGTM!
```python
    if isinstance(y_true, pd.Series):
        y_true = y_true.to_numpy()
    if isinstance(y_pred_proba, (pd.Series, pd.DataFrame)):
        y_pred_proba = y_pred_proba.to_numpy()
```
This, along with the new static method you added in `ObjectiveBase`, makes me wonder if it'd be worth extracting into some util method, but 🤷‍♀️
I agree! It would definitely be worth it. In this PR I just wanted to get solid test coverage in place and apply minimal changes to get them passing. Once this is merged, yeah, we should get a better fix in place!
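(One possible shape for that util; the name `_to_numpy` is made up:)

```python
import pandas as pd


def _to_numpy(data):
    """Hypothetical helper: coerce pandas structures to numpy, pass everything else through."""
    if isinstance(data, (pd.Series, pd.DataFrame)):
        return data.to_numpy()
    return data
```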
```diff
@@ -238,7 +240,7 @@ class MockTransformerWithFitAndTransform(Transformer):
     hyperparameter_ranges = {}

     def fit(self, X, y=None):
-        return X
+        return self
```
👍
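(Returning `self` follows the scikit-learn convention, which keeps fit and transform chainable; a self-contained toy illustration, not code from the PR:)

```python
import pandas as pd


class NoOpTransformer:
    """Toy transformer illustrating the fit-returns-self convention."""

    def fit(self, X, y=None):
        return self  # enables chaining: transformer.fit(X).transform(X)

    def transform(self, X, y=None):
        return pd.DataFrame(X)


X_t = NoOpTransformer().fit([[1, 2]]).transform([[1, 2]])
```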
This PR adds test coverage which checks that for any component listed in `all_components()`, `predict` outputs a `pd.Series` and `predict_proba` outputs a `pd.DataFrame`. It also fixes specific components' impls to conform to this expectation.

It would be great if we could find a way in the code to avoid having to convert to/from pd datastructures all over the place. We could define a component metaclass to help with this, to wrap subclasses' `predict` and `predict_proba` methods in standardization code. But for now, this PR was easier and adds value.

We should add similar coverage for pipelines, but I'd like us to do that separately. It's unclear how necessary that is if we have good component-level coverage.