Add partial dependence for datetime columns #2180
Conversation
    is_categorical = [_is_feature_categorical(f, X) for f in features]
    is_datetime = [_is_feature_datetime(f, X) for f in features]
    feature_names = _get_feature_names_from_str_or_col_index(X, features)
    if any(is_datetime):
Datetime columns run into the same problem as categorical columns: the sklearn partial dependence implementation casts the grid to the dtype of the first feature. So if a float feature comes first, the sklearn implementation will convert the datetime column to floats and then fail on the DateTime Featurizer. However, unlike the categorical case, we cannot fix this by moving the datetime column first, because datetime and numeric dtypes do not mix. Here's an example with just a DateTime Featurizer and a logistic regression classifier. I think this is an acceptable solution for now, since it fixes the immediate issue with one-way partial dependence for datetimes. Moving forward, we should explore rolling our own PD implementation, as @freddyaboulton suggested in #1963, if more use cases make this sklearn implementation precarious to use with its oddities.
evalml/model_understanding/graphs.py:665: in partial_dependence
avg_pred, values = sk_partial_dependence(wrapped, X=X, features=features, percentiles=percentiles, grid_resolution=grid_resolution)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:63: in inner_f
return f(*args, **kwargs)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py:486: in partial_dependence
estimator, grid, features_indices, X, response_method
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py:162: in _partial_dependence_brute
pred = prediction_method(X_eval)
evalml/pipelines/components/utils.py:172: in predict_proba
return _convert_woodwork_types_wrapper(self.pipeline.predict_proba(X).to_dataframe()).to_numpy()
evalml/pipelines/pipeline_meta.py:23: in _check_for_fit
return method(self, X)
evalml/pipelines/binary_classification_pipeline.py:46: in predict_proba
return super().predict_proba(X)
evalml/pipelines/pipeline_meta.py:23: in _check_for_fit
return method(self, X)
evalml/pipelines/classification_pipeline.py:104: in predict_proba
proba = self.estimator.predict_proba(X).to_dataframe()
evalml/pipelines/components/component_base_meta.py:27: in _check_for_fit
return method(self, X)
evalml/pipelines/components/estimators/estimator.py:79: in predict_proba
pred_proba = self._component_obj.predict_proba(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:1469: in predict_proba
return super()._predict_proba_lr(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_base.py:323: in _predict_proba_lr
prob = self.decision_function(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_base.py:284: in decision_function
X = check_array(X, accept_sparse='csr')
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:63: in inner_f
return f(*args, **kwargs)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:540: in check_array
dtype_orig = np.result_type(*dtypes_orig)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (dtype('<M8[ns]'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), ...), kwargs = {}
relevant_args = (dtype('<M8[ns]'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), ...)
> ???
E TypeError: The DTypes <class 'numpy.dtype[float64]'> and <class 'numpy.dtype[datetime64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.
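The root cause can be reproduced outside evalml in a couple of lines (a minimal sketch, independent of the pipeline above): `np.result_type` has no common promotion rule for `datetime64` and `float64`, and that is exactly the call `check_array` makes in the last frame of the traceback.

```python
import numpy as np

# sklearn's check_array computes a common dtype for all input columns via
# np.result_type. datetime64 and float64 have no common dtype, so mixing a
# datetime column with numeric columns raises TypeError.
try:
    np.result_type(np.dtype("datetime64[ns]"), np.dtype("float64"))
    raised = False
except TypeError:
    raised = True

print(raised)
```

This is why reordering columns cannot help: any grid that contains both a datetime feature and a numeric feature hits this promotion failure.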
@jeremyliweishih and I talked about this and converting to object dtype doesn't work either.
Codecov Report
@@            Coverage Diff            @@
##              main    #2180    +/-   ##
=========================================
+ Coverage    100.0%   100.0%    +0.1%
=========================================
  Files          296      296
  Lines        24593    24677      +84
=========================================
+ Hits         24575    24659      +84
  Misses          18       18
Continue to review full report at Codecov.
freddyaboulton
left a comment
@jeremyliweishih Looks great! The only things blocking merge are:
- how the grid resolution is updated for datetimes. I think we need to do it only if the feature is a datetime to avoid more of #2116.
- Consensus from @dsherry @chukarsten that we're ok with this approach, since I think it'll be slow for datasets with a large number of unique dates: 1000 unique dates means 1000 pipeline evaluations.
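To illustrate the cost concern above, here is a rough sketch of brute-force partial dependence (not evalml's actual implementation; function and variable names are illustrative): each grid point requires predictions over all of `X`, so a datetime column with 1000 unique dates means 1000 full prediction passes.

```python
import numpy as np
import pandas as pd

def brute_force_partial_dependence(predict, X, feature):
    """One full prediction pass over X per unique value of `feature`."""
    grid = np.sort(X[feature].unique())
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                    # hold the feature fixed at this grid point
        averaged.append(np.mean(predict(X_mod)))  # average prediction over all rows
    return grid, np.array(averaged)

# Toy example: a "model" that just doubles the feature.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.0]})
grid, avg = brute_force_partial_dependence(lambda df: df["a"] * 2, X, "a")
```

With a datetime column, `grid` would hold every unique timestamp, which is why capping the grid resolution matters.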
freddyaboulton
left a comment
@jeremyliweishih Looks fantastic! Thanks for making the changes. I left a small suggestion for improving performance that I'd like your thoughts on.
evalml/model_understanding/graphs.py
    if X_cats.shape[1] != 0:
    is_categorical = [_is_feature_of_type(features, X, ww.logical_types.Categorical)] if not isinstance(features, (list, tuple)) else [_is_feature_of_type(f, X, ww.logical_types.Categorical) for f in features]
    if X_cats.shape[1] != 0 and any(is_categorical):
        max_num_cats = max(X_cats.describe().loc["nunique"])
I think rather than taking the max over all categorical columns, we should take the max over only the categorical features the user passed in. My concern: given a dataset with a numeric feature, a categorical feature with 10 values, and a categorical feature with 100 values, a user computing partial dependence on the numeric column and the 10-value column would still get a grid size of 100. That could slow down the entire computation, especially if the user sets 10 <= grid_resolution <= 100.
Let me know what you think. I think it's in scope for #2116.
Yup, that's a good catch - I'll make the changes!
Fixes #1963, #2116.