Two-Way Dependence Plots (#1690)
* Plots for two-way partial dependence are fully implemented for binary and multiclass problems.

* By necessity, the partial dependence calculation for multiclass is now explicitly implemented as well.
chukarsten committed Jan 25, 2021
1 parent 70c42b1 commit 1f8779b
Showing 4 changed files with 227 additions and 69 deletions.
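For orientation, the graphs.py changes below choose between one-way and two-way partial dependence from the type of the `features` argument. A minimal sketch of that dispatch follows. The helper name `resolve_mode` is hypothetical (it is not part of evalml), and raising on unsupported types is an assumption, since the commit does not handle that case explicitly:

```python
def resolve_mode(features):
    """Hypothetical helper: classify `features` as 'one-way' or 'two-way'."""
    # A single column index or column name means one-way partial dependence.
    if isinstance(features, (int, str)):
        return "one-way"
    # A list or tuple must hold exactly two entries of the same kind.
    if isinstance(features, (list, tuple)):
        if len(features) != 2:
            raise ValueError("Only one or two-way partial dependence is supported.")
        all_str = all(isinstance(f, str) for f in features)
        all_int = all(isinstance(f, int) for f in features)
        if not (all_str or all_int):
            raise ValueError("Features must be all integers or all strings, not a mixture.")
        return "two-way"
    # Assumption: anything else is rejected rather than silently ignored.
    raise ValueError(f"Unsupported type for features: {type(features).__name__}")


print(resolve_mode("mean radius"))                    # one-way
print(resolve_mode(("mean radius", "mean texture")))  # two-way
```

Resolving the mode once up front keeps the later plotting branches to a simple string comparison, which appears to be the approach taken in `graph_partial_dependence`.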
3 changes: 2 additions & 1 deletion docs/source/release_notes.rst
@@ -14,6 +14,7 @@ Release Notes
* Added support for list inputs for objectives :pr:`1663`
* Added support for ``AutoMLSearch`` to handle time series classification pipelines :pr:`1666`
* Enhanced ``DelayedFeaturesTransformer`` to encode categorical features and targets before delaying them :pr:`1691`
* Added 2-way dependence plots. :pr:`1690`
* Added ability to directly iterate through components within Pipelines :pr:`1583`
* Fixes
* Fixed inconsistent attributes and added Exceptions to docs :pr:`1673`
@@ -32,7 +33,7 @@ Release Notes
* Fixed bug where time series baseline estimators were not receiving ``gap`` and ``max_delay`` in ``AutoMLSearch`` :pr:`1645`
* Fixed jupyter notebooks to help the RTD buildtime :pr:`1654`
* Added ``positive_only`` objectives to ``non_core_objectives`` :pr:`1661`
* Fixed stacking argument ``n_jobs`` for IterativeAlgorithm :pr:`1706`
* Fixed stacking argument ``n_jobs`` for IterativeAlgorithm :pr:`1706`
* Updated CatBoost estimators to return self in ``.fit()`` rather than the underlying model for consistency :pr:`1701`
* Added ability to initialize pipeline parameters in ``AutoMLSearch`` constructor :pr:`1676`
* Changes
4 changes: 2 additions & 2 deletions docs/source/user_guide/model_understanding.ipynb
@@ -126,7 +126,7 @@
"outputs": [],
"source": [
"from evalml.model_understanding.graphs import partial_dependence\n",
"partial_dependence(pipeline, X, feature='mean radius')"
"partial_dependence(pipeline, X, features='mean radius')"
]
},
{
@@ -136,7 +136,7 @@
"outputs": [],
"source": [
"from evalml.model_understanding.graphs import graph_partial_dependence\n",
"graph_partial_dependence(pipeline, X, feature='mean radius')"
"graph_partial_dependence(pipeline, X, features='mean radius')"
]
},
{
153 changes: 110 additions & 43 deletions evalml/model_understanding/graphs.py
@@ -433,30 +433,52 @@ def graph_binary_objective_vs_threshold(pipeline, X, y, objective, steps=100):
return _go.Figure(layout=layout, data=data)


def partial_dependence(pipeline, X, feature, grid_resolution=100):
"""Calculates partial dependence.
def partial_dependence(pipeline, X, features, grid_resolution=100):
"""Calculates one or two-way partial dependence. If a single integer or
string is given for features, one-way partial dependence is calculated. If
a tuple of two integers or strings is given, two-way partial dependence
is calculated with the first feature in the y-axis and second feature in the
x-axis.
Arguments:
pipeline (PipelineBase or subclass): Fitted pipeline
X (ww.DataTable, pd.DataFrame, np.ndarray): The input data used to generate a grid of values
for feature where partial dependence will be calculated at
feature (int, string): The target features for which to create the partial dependence plot for.
If feature is an int, it must be the index of the feature to use.
If feature is a string, it must be a valid column name in X.
features (int, string, tuple[int or string]): The target feature(s) for which to create the partial dependence plot.
If features is an int, it must be the index of the feature to use.
If features is a string, it must be a valid column name in X.
If features is a tuple of int/strings, it must contain valid column integers/names in X.
grid_resolution (int): Number of samples of feature(s) for partial dependence plot
Returns:
pd.DataFrame: DataFrame with averaged predictions for all points in the grid averaged
over all samples of X and the values used to calculate those predictions. The dataframe will
contain two columns: "feature_values" (grid points at which the partial dependence was calculated) and
"partial_dependence" (the partial dependence at that feature value). For classification problems, there
will be a third column called "class_label" (the class label for which the partial
dependence was calculated). For binary classification, the partial dependence is only calculated for the
"positive" class.
over all samples of X and the values used to calculate those predictions.
In the one-way case: The dataframe will contain two columns, "feature_values" (grid points at which the
partial dependence was calculated) and "partial_dependence" (the partial dependence at that feature value).
For classification problems, there will be a third column called "class_label" (the class label for which
the partial dependence was calculated). For binary classification, the partial dependence is only calculated
for the "positive" class.
In the two-way case: The data frame will contain grid_resolution number of columns and rows where the
index and column headers are the sampled values of the first and second features, respectively, used to make
the partial dependence contour. The values of the data frame contain the partial dependence data for each
feature value pair.
Raises:
ValueError: if the user provides a tuple of not exactly two features.
ValueError: if the provided pipeline isn't fitted.
ValueError: if the provided pipeline is a Baseline pipeline.
"""
X = _convert_to_woodwork_structure(X)
X = _convert_woodwork_types_wrapper(X.to_dataframe())

if isinstance(features, (list, tuple)):
if len(features) != 2:
raise ValueError("Too many features given to graph_partial_dependence. Only one or two-way partial "
"dependence is supported.")
if not (all([isinstance(x, str) for x in features]) or all([isinstance(x, int) for x in features])):
raise ValueError("Features provided must be a tuple entirely of integers or strings, not a mixture of both.")
if not pipeline._is_fitted:
raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.BASELINE:
@@ -466,10 +488,10 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
elif isinstance(pipeline, evalml.pipelines.RegressionPipeline):
pipeline._estimator_type = "regressor"
pipeline.feature_importances_ = pipeline.feature_importance
if ((isinstance(feature, int) and X.iloc[:, feature].isnull().sum()) or (isinstance(feature, str) and X[feature].isnull().sum())):
if ((isinstance(features, int) and X.iloc[:, features].isnull().sum()) or (isinstance(features, str) and X[features].isnull().sum())):
warnings.warn("There are null values in the features, which will cause NaN values in the partial dependence output. Fill in these values to remove the NaN values.", NullsInColumnWarning)
try:
avg_pred, values = sk_partial_dependence(pipeline, X=X, features=[feature], grid_resolution=grid_resolution)
avg_pred, values = sk_partial_dependence(pipeline, X=X, features=features, grid_resolution=grid_resolution)
finally:
# Delete scikit-learn attributes that were temporarily set
del pipeline._estimator_type
@@ -480,34 +502,50 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
elif isinstance(pipeline, evalml.pipelines.MulticlassClassificationPipeline):
classes = pipeline.classes_

data = pd.DataFrame({"feature_values": np.tile(values[0], avg_pred.shape[0]),
"partial_dependence": np.concatenate([pred for pred in avg_pred])})
if isinstance(features, (int, str)):
data = pd.DataFrame({"feature_values": np.tile(values[0], avg_pred.shape[0]),
"partial_dependence": np.concatenate([pred for pred in avg_pred])})
elif isinstance(features, (list, tuple)):
data = pd.DataFrame(avg_pred.reshape((-1, avg_pred.shape[-1])))
data.columns = values[1]
data.index = np.tile(values[0], avg_pred.shape[0])

if classes is not None:
data['class_label'] = np.repeat(classes, len(values[0]))

return data


def graph_partial_dependence(pipeline, X, feature, class_label=None, grid_resolution=100):
"""Create an one-way partial dependence plot.
def graph_partial_dependence(pipeline, X, features, class_label=None, grid_resolution=100):
"""Create a one-way or two-way partial dependence plot. Passing a single integer or
string as features will create a one-way partial dependence plot with the feature values
plotted against the partial dependence. Passing features a tuple of int/strings will create
a two-way partial dependence plot with a contour of feature[0] in the y-axis, feature[1]
in the x-axis and the partial dependence in the z-axis.
Arguments:
pipeline (PipelineBase or subclass): Fitted pipeline
X (ww.DataTable, pd.DataFrame, np.ndarray): The input data used to generate a grid of values
for feature where partial dependence will be calculated at
feature (int, string): The target feature for which to create the partial dependence plot for.
If feature is an int, it must be the index of the feature to use.
If feature is a string, it must be a valid column name in X.
features (int, string, tuple[int or string]): The target feature(s) for which to create the partial dependence plot.
If features is an int, it must be the index of the feature to use.
If features is a string, it must be a valid column name in X.
If features is a tuple of int/strings, it must contain valid column integers/names in X.
class_label (string, optional): Name of class to plot for multiclass problems. If None, will plot
the partial dependence for each class. This argument does not change behavior for regression or binary
classification pipelines. For binary classification, the partial dependence for the positive label will
always be displayed. Defaults to None.
grid_resolution (int): Number of samples of feature(s) for partial dependence plot
Returns:
pd.DataFrame: pd.DataFrame with averaged predictions for all points in the grid averaged
over all samples of X and the values used to calculate those predictions.
plotly.graph_objects.Figure: figure object containing the partial dependence data for plotting
Raises:
ValueError: if a graph is requested for a class name that isn't present in the pipeline
"""
if isinstance(features, (list, tuple)):
mode = "two-way"
elif isinstance(features, (int, str)):
mode = "one-way"
_go = import_or_raise("plotly.graph_objects", error_msg="Cannot find dependency plotly.graph_objects")
if jupyter_check():
import_or_raise("ipywidgets", warning=True)
@@ -516,13 +554,21 @@ def graph_partial_dependence(pipeline, X, feature, class_label=None, grid_resolu
msg = f"Class {class_label} is not one of the classes the pipeline was fit on: {', '.join(list(pipeline.classes_))}"
raise ValueError(msg)

part_dep = partial_dependence(pipeline, X, feature=feature, grid_resolution=grid_resolution)
feature_name = str(feature)
title = f"Partial Dependence of '{feature_name}'"
layout = _go.Layout(title={'text': title},
xaxis={'title': f'{feature_name}'},
yaxis={'title': 'Partial Dependence'},
showlegend=False)
part_dep = partial_dependence(pipeline, X, features=features, grid_resolution=grid_resolution)

if mode == "two-way":
title = f"Partial Dependence of '{features[0]}' vs. '{features[1]}'"
layout = _go.Layout(title={'text': title},
xaxis={'title': f'{features[0]}'},
yaxis={'title': f'{features[1]}'},
showlegend=False)
elif mode == "one-way":
feature_name = str(features)
title = f"Partial Dependence of '{feature_name}'"
layout = _go.Layout(title={'text': title},
xaxis={'title': f'{feature_name}'},
yaxis={'title': 'Partial Dependence'},
showlegend=False)
if isinstance(pipeline, evalml.pipelines.MulticlassClassificationPipeline):
class_labels = [class_label] if class_label is not None else pipeline.classes_
_subplots = import_or_raise("plotly.subplots", error_msg="Cannot find dependency plotly.graph_objects")
@@ -534,21 +580,42 @@ def graph_partial_dependence(pipeline, X, feature, class_label=None, grid_resolu
# Don't specify share_xaxis and share_yaxis so that we get tickmarks in each subplot
fig = _subplots.make_subplots(rows=rows, cols=cols, subplot_titles=class_labels)
for i, label in enumerate(class_labels):

# Plotly trace indexing begins at 1 so we add 1 to i
fig.add_trace(_go.Scatter(x=part_dep.loc[part_dep.class_label == label, 'feature_values'],
y=part_dep.loc[part_dep.class_label == label, 'partial_dependence'],
line=dict(width=3),
name=label),
row=(i + 2) // 2, col=(i % 2) + 1)
label_df = part_dep.loc[part_dep.class_label == label]
if mode == "two-way":
x = label_df.index
y = np.array([col for col in label_df.columns if isinstance(col, (int, float))])
z = label_df.values
fig.add_trace(_go.Contour(x=x, y=y, z=z, name=label, coloraxis="coloraxis"),
row=(i + 2) // 2, col=(i % 2) + 1)
elif mode == "one-way":
x = label_df['feature_values']
y = label_df['partial_dependence']
fig.add_trace(_go.Scatter(x=x, y=y, line=dict(width=3), name=label),
row=(i + 2) // 2, col=(i % 2) + 1)
fig.update_layout(layout)
fig.update_xaxes(title=f'{feature_name}', range=_calculate_axis_range(part_dep['feature_values']))
fig.update_yaxes(range=_calculate_axis_range(part_dep['partial_dependence']))

if mode == "two-way":
title = f'{features[0]}'
xrange = _calculate_axis_range(part_dep.index)
yrange = _calculate_axis_range(np.array([x for x in part_dep.columns if isinstance(x, (int, float))]))
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False)
elif mode == "one-way":
title = f'{feature_name}'
xrange = _calculate_axis_range(part_dep['feature_values'])
yrange = _calculate_axis_range(part_dep['partial_dependence'])
fig.update_xaxes(title=title, range=xrange)
fig.update_yaxes(range=yrange)
else:
trace = _go.Scatter(x=part_dep['feature_values'],
y=part_dep['partial_dependence'],
name='Partial Dependence',
line=dict(width=3))
if mode == "two-way":
trace = _go.Contour(x=part_dep.index,
y=part_dep.columns,
z=part_dep.values,
name="Partial Dependence")
elif mode == "one-way":
trace = _go.Scatter(x=part_dep['feature_values'],
y=part_dep['partial_dependence'],
name='Partial Dependence',
line=dict(width=3))
fig = _go.Figure(layout=layout, data=[trace])

return fig
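
The diff above delegates the actual calculation to scikit-learn's partial dependence routine. As a rough illustration of what a two-way partial dependence surface represents, here is a brute-force sketch in plain Python. The function `two_way_partial_dependence`, the `toy_model`, and the sample rows are all hypothetical and are not taken from evalml or scikit-learn; real implementations also build the grid from the data rather than taking it as an argument.

```python
def two_way_partial_dependence(predict, rows, i, j, grid_i, grid_j):
    """Brute-force two-way partial dependence for a single-output predict function.

    Returns a nested list where surface[a][b] is the mean prediction over all
    rows after forcing feature i to grid_i[a] and feature j to grid_j[b].
    """
    surface = []
    for value_i in grid_i:
        line = []
        for value_j in grid_j:
            total = 0.0
            for row in rows:
                modified = list(row)   # copy so the input data is untouched
                modified[i] = value_i  # overwrite the first feature
                modified[j] = value_j  # overwrite the second feature
                total += predict(modified)
            line.append(total / len(rows))
        surface.append(line)
    return surface


def toy_model(row):
    # Hypothetical model: linear in feature 0, an interaction between
    # features 0 and 1, plus feature 2 as an additive term.
    return 2.0 * row[0] + row[0] * row[1] + row[2]


rows = [[0.0, 1.0, 5.0], [1.0, 2.0, 7.0], [2.0, 3.0, 9.0]]
surface = two_way_partial_dependence(toy_model, rows, 0, 1, [0.0, 1.0], [0.0, 10.0])
print(surface)  # [[7.0, 7.0], [9.0, 19.0]]
```

The outer list follows the first feature's grid and the inner lists follow the second feature's grid, which mirrors the two-way DataFrame layout described in the `partial_dependence` docstring (index from the first feature, columns from the second).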