
Add partial dependence for datetime columns#2180

Merged
jeremyliweishih merged 18 commits into main from js_1963_dt_pd
Apr 27, 2021
Conversation


@jeremyliweishih (Collaborator) commented Apr 22, 2021

Fixes #1963, #2116.

is_categorical = [_is_feature_categorical(f, X) for f in features]
is_datetime = [_is_feature_datetime(f, X) for f in features]
feature_names = _get_feature_names_from_str_or_col_index(X, features)
if any(is_datetime):
@jeremyliweishih (Collaborator, Author) Apr 22, 2021

Datetime columns run into the same problem as categorical columns: the sklearn partial dependence implementation changes the dtypes of X to match the first feature. So if a float feature comes first, the sklearn impl will convert the datetime column into a float, and the DateTime Featurizer will then fail. However, unlike the categorical case, we cannot solve the issue by moving the datetime column first, because datetime and numerical dtypes do not mix. Here's an example with just a DateTime Featurizer and a logistic regression classifier. I think this is an acceptable solution for now, since it solves the immediate issue with one-way partial dependence for datetimes. Moving forward we should explore rolling our own PD impl, as @freddyaboulton suggested in #1963, if more use cases make this sklearn impl precarious to use with its oddities.

evalml/model_understanding/graphs.py:665: in partial_dependence
    avg_pred, values = sk_partial_dependence(wrapped, X=X, features=features, percentiles=percentiles, grid_resolution=grid_resolution)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:63: in inner_f
    return f(*args, **kwargs)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py:486: in partial_dependence
    estimator, grid, features_indices, X, response_method
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py:162: in _partial_dependence_brute
    pred = prediction_method(X_eval)
evalml/pipelines/components/utils.py:172: in predict_proba
    return _convert_woodwork_types_wrapper(self.pipeline.predict_proba(X).to_dataframe()).to_numpy()
evalml/pipelines/pipeline_meta.py:23: in _check_for_fit
    return method(self, X)
evalml/pipelines/binary_classification_pipeline.py:46: in predict_proba
    return super().predict_proba(X)
evalml/pipelines/pipeline_meta.py:23: in _check_for_fit
    return method(self, X)
evalml/pipelines/classification_pipeline.py:104: in predict_proba
    proba = self.estimator.predict_proba(X).to_dataframe()
evalml/pipelines/components/component_base_meta.py:27: in _check_for_fit
    return method(self, X)
evalml/pipelines/components/estimators/estimator.py:79: in predict_proba
    pred_proba = self._component_obj.predict_proba(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:1469: in predict_proba
    return super()._predict_proba_lr(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_base.py:323: in _predict_proba_lr
    prob = self.decision_function(X)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/linear_model/_base.py:284: in decision_function
    X = check_array(X, accept_sparse='csr')
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:63: in inner_f
    return f(*args, **kwargs)
../../.pyenv/versions/evalml/lib/python3.7/site-packages/sklearn/utils/validation.py:540: in check_array
    dtype_orig = np.result_type(*dtypes_orig)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (dtype('<M8[ns]'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), ...), kwargs = {}
relevant_args = (dtype('<M8[ns]'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), ...)

>   ???
E   TypeError: The DTypes <class 'numpy.dtype[float64]'> and <class 'numpy.dtype[datetime64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.
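For reference, the dtype-promotion failure at the bottom of this traceback can be reproduced directly with numpy, independent of evalml or sklearn (a minimal sketch; sklearn's `check_array` hits the same `np.result_type` call on the column dtypes):

```python
import numpy as np

# numpy has no common dtype for datetime64 and float64, so result_type
# raises TypeError -- the same error surfaced in the traceback above.
try:
    np.result_type(np.dtype("datetime64[ns]"), np.dtype("float64"))
    promoted = True
except TypeError:
    promoted = False  # no common DType for these two
```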

@freddyaboulton (Contributor) Apr 23, 2021

@jeremyliweishih and I talked about this and converting to object dtype doesn't work either.
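A minimal sketch of why the object-dtype escape hatch mentioned in the error message doesn't help either (illustrative only, not the PR's code): the object array still holds Timestamp values that a numeric estimator cannot coerce.

```python
import pandas as pd

X = pd.DataFrame({
    "dates": pd.date_range("2021-01-01", periods=3),
    "nums": [1.0, 2.0, 3.0],
})
# Casting to object sidesteps the dtype-promotion TypeError...
X_obj = X.astype(object).to_numpy()
# ...but the Timestamps inside still can't become floats, so any numeric
# estimator fails at its own validation step instead.
try:
    X_obj.astype(float)
    converted = True
except (TypeError, ValueError):
    converted = False
```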


codecov bot commented Apr 22, 2021

Codecov Report

Merging #2180 (5e9f7a6) into main (ccf7022) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2180     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         296      296             
  Lines       24593    24677     +84     
=========================================
+ Hits        24575    24659     +84     
  Misses         18       18             
Impacted Files                                         Coverage Δ
evalml/model_understanding/graphs.py                   100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py  100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@freddyaboulton (Contributor) left a comment:

@jeremyliweishih Looks great! The only things blocking merge are:

  1. How the grid resolution is updated for datetimes. I think we should only update it when the feature is a datetime, to avoid more of #2116.
  2. Consensus from @dsherry @chukarsten that we're ok with this approach, since I think it'll be slow for datasets with a large number of dates: 1000 unique dates means 1000 pipeline evaluations.

@freddyaboulton (Contributor) left a comment:

@jeremyliweishih Looks fantastic! Thanks for making the changes. I left a small suggestion for improving performance that I'd like your thoughts on.

if X_cats.shape[1] != 0:
is_categorical = (
    [_is_feature_of_type(features, X, ww.logical_types.Categorical)]
    if not isinstance(features, (list, tuple))
    else [_is_feature_of_type(f, X, ww.logical_types.Categorical) for f in features]
)
if X_cats.shape[1] != 0 and any(is_categorical):
max_num_cats = max(X_cats.describe().loc["nunique"])
@freddyaboulton (Contributor):

I think rather than looking at the max over all categoricals, we should look at the max over only the categorical features the user passed in. My concern is that if a dataset has a numeric feature, a categorical feature with 10 values, and a categorical feature with 100 values, and the user wants the partial dependence of the numeric column and the 10-value column, the grid size will still be set to 100. That may slow down the entire process, especially if the user sets 10 <= grid_size <= 100.

Let me know what you think. I think it's in scope of #2116.

@jeremyliweishih (Collaborator, Author):

Yup, that's a good catch - I'll make the changes!

Development

Successfully merging this pull request may close these issues.

Partial dependence fails on datetime columns

2 participants