
<br>
========================================<br>
Release Highlights for scikit-learn 0.22<br>
========================================<br>
.. currentmodule:: sklearn<br>
We are pleased to announce the release of scikit-learn 0.22, which comes<br>
with many bug fixes and new features! We detail below a few of the major<br>
features of this release. For an exhaustive list of all the changes, please<br>
refer to the :ref:`release notes <changes_0_22>`.<br>
To install the latest version (with pip)::<br>
    pip install --upgrade scikit-learn<br>
or with conda::<br>
    conda install scikit-learn<br>


############################################################################<br>
New plotting API<br>
----------------<br>
<br>
A new plotting API is available for creating visualizations. This new API<br>
allows for quickly adjusting the visuals of a plot without involving any<br>
recomputation. It is also possible to add different plots to the same<br>
figure. The following example illustrates :func:`~metrics.plot_roc_curve`,<br>
but other plot utilities are supported, such as<br>
:func:`~inspection.plot_partial_dependence`,<br>
:func:`~metrics.plot_precision_recall_curve`, and<br>
:func:`~metrics.plot_confusion_matrix`. Read more about this new API in the<br>
:ref:`User Guide <visualizations>`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import plot_roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

In [None]:
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

In [None]:
svc_disp = plot_roc_curve(svc, X_test, y_test)
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
rfc_disp.figure_.suptitle("ROC curve comparison")

In [None]:
plt.show()

##########################################################################<br>
Stacking Classifier and Regressor<br>
---------------------------------<br>
:class:`~ensemble.StackingClassifier` and<br>
:class:`~ensemble.StackingRegressor`<br>
allow you to have a stack of estimators with a final classifier or<br>
a regressor.<br>
Stacked generalization consists of stacking the outputs of individual<br>
estimators and using a final estimator to compute the final prediction.<br>
Stacking makes it possible to combine the strengths of the individual<br>
estimators by using their outputs as the input of a final estimator.<br>
Base estimators are fitted on the full ``X`` while<br>
the final estimator is trained using cross-validated predictions of the<br>
base estimators using ``cross_val_predict``.<br>
<br>
Read more in the :ref:`User Guide <stacking>`.

In [None]:
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split

In [None]:
X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svc', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train).score(X_test, y_test)
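
The training scheme described above (base estimators fitted on the full training data, final estimator trained on cross-validated predictions) can be sketched by hand with ``cross_val_predict``. This is only an illustration under assumptions, not the internal implementation of ``StackingClassifier``: the choice of base models, ``cv=5``, the use of ``predict_proba`` as the stacked output, and all variable names are picked for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, stratify=y_iris, random_state=42
)

# Illustrative base models (an assumption of this sketch).
base_models = [
    RandomForestClassifier(n_estimators=10, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold class probabilities: each training row is predicted by a
# model that did not see that row during fitting.
meta_train = np.hstack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")
    for model in base_models
])

# Base models are then refitted on the full training set and used to
# build the meta-features of the test set.
for model in base_models:
    model.fit(X_tr, y_tr)
meta_test = np.hstack([model.predict_proba(X_te) for model in base_models])

# The final estimator only ever sees the stacked predictions.
final_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)
manual_score = final_model.score(meta_test, y_te)
print(meta_train.shape, manual_score)
```

Training the final estimator on out-of-fold predictions keeps it from learning on outputs that the base models produced for rows they had already memorized.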

############################################################################<br>
Permutation-based feature importance<br>
------------------------------------<br>
<br>
:func:`inspection.permutation_importance` can be used to estimate the<br>
importance of each feature for any fitted estimator:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

In [None]:
X, y = make_classification(random_state=0, n_features=5, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                n_jobs=-1)

In [None]:
fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
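# The boxes are drawn in sorted order, so label them with the matching
# feature indices rather than 0..n_features-1.
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=sorted_idx)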
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
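
The idea behind the estimate can be sketched in a few lines of plain Python: shuffle one feature column at a time and record how much the score drops. This is a simplified illustration, not the scikit-learn implementation; the toy data, the hand-written ``toy_model`` and the helper names are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X_toy = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
         for _ in range(40)]
y_toy = [int(row[0] > 0) for row in X_toy]


def toy_model(row):
    # Stands in for a fitted estimator; it only looks at feature 0.
    return int(row[0] > 0)


toy_importances = permutation_importance_sketch(toy_model, X_toy, y_toy)
print(toy_importances)
```

Because ``toy_model`` ignores feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero, while shuffling feature 0 lowers the accuracy.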

############################################################################<br>
Native support for missing values for gradient boosting<br>
-------------------------------------------------------<br>
<br>
The :class:`ensemble.HistGradientBoostingClassifier`<br>
and :class:`ensemble.HistGradientBoostingRegressor` now have native<br>
support for missing values (NaNs). This means that there is no need for<br>
imputing data when training or predicting.

In [None]:
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

In [None]:
X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

In [None]:
gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))

##########################################################################<br>
Precomputed sparse nearest neighbors graph<br>
------------------------------------------<br>
Most estimators based on nearest neighbors graphs now accept precomputed<br>
sparse graphs as input, to reuse the same graph for multiple estimator fits.<br>
To use this feature in a pipeline, one can use the ``memory`` parameter, along<br>
with one of the two new transformers,<br>
:class:`neighbors.KNeighborsTransformer` and<br>
:class:`neighbors.RadiusNeighborsTransformer`. The precomputation<br>
can also be performed by custom estimators to use alternative<br>
implementations, such as approximate nearest neighbors methods.<br>
See more details in the :ref:`User Guide <neighbors_transformer>`.

In [None]:
from tempfile import TemporaryDirectory
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

In [None]:
X, y = make_classification(random_state=0)

In [None]:
with TemporaryDirectory(prefix="sklearn_cache_") as tmpdir:
    estimator = make_pipeline(
        KNeighborsTransformer(n_neighbors=10, mode='distance'),
        Isomap(n_neighbors=10, metric='precomputed'),
        memory=tmpdir)
    estimator.fit(X)

    # We can decrease the number of neighbors and the graph will not be
    # recomputed.
    estimator.set_params(isomap__n_neighbors=5)
    estimator.fit(X)

############################################################################<br>
KNN Based Imputation<br>
--------------------<br>
We now support imputation for completing missing values using k-Nearest<br>
Neighbors.<br>
<br>
Each sample's missing values are imputed using the mean value from the<br>
``n_neighbors`` nearest neighbors found in the training set. Two samples are<br>
considered close if the features that are present in both samples are close.<br>
By default, a euclidean distance metric<br>
that supports missing values,<br>
:func:`~metrics.nan_euclidean_distances`, is used to find the nearest<br>
neighbors.<br>
<br>
Read more in the :ref:`User Guide <knnimpute>`.

In [None]:
import numpy as np
from sklearn.impute import KNNImputer

In [None]:
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
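
To make the distance concrete, here is a small pure-Python sketch of a NaN-aware euclidean distance and of imputing a single value with it. It follows the documented definition of ``nan_euclidean_distances`` as we understand it (squared differences over the coordinates present in both samples, rescaled by the ratio of total to present coordinates). The helper name and the step-by-step imputation are illustrative assumptions, not the library's code.

```python
import math


def nan_euclidean_sketch(a, b):
    """Euclidean distance over shared coordinates, rescaled for the gaps."""
    present = [(u, v) for u, v in zip(a, b)
               if not (math.isnan(u) or math.isnan(v))]
    if not present:
        return math.nan
    squared = sum((u - v) ** 2 for u, v in present)
    # Scale up as if the missing coordinates contributed the same on average.
    return math.sqrt(len(a) / len(present) * squared)


nan = math.nan
X_small = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# Distance between the first two rows: only features 0 and 1 are shared.
d01 = nan_euclidean_sketch(X_small[0], X_small[1])

# Impute feature 2 of row 0 from its 2 nearest rows that have that feature.
target_row, feature, k = 0, 2, 2
donors = [i for i, row in enumerate(X_small)
          if i != target_row and not math.isnan(row[feature])]
donors.sort(key=lambda i: nan_euclidean_sketch(X_small[target_row], X_small[i]))
imputed = sum(X_small[i][feature] for i in donors[:k]) / k
print(d01, imputed)
```

Here the two nearest donors are rows 1 and 2, so the missing entry is filled with the mean of 3 and 5.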

###########################################################################<br>
Tree pruning<br>
------------<br>
<br>
It is now possible to prune most tree-based estimators once the trees are<br>
built. The pruning is based on minimal cost-complexity. Read more in the<br>
:ref:`User Guide <minimal_cost_complexity_pruning>`.

In [None]:
X, y = make_classification(random_state=0)

In [None]:
rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print("Average number of nodes without pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))

In [None]:
rf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print("Average number of nodes with pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))
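
Rather than guessing a value for ``ccp_alpha``, one option is to inspect the candidate values on a single tree with ``cost_complexity_pruning_path``. The snippet below is a hedged sketch on one ``DecisionTreeClassifier``; the variable names are illustrative, and clipping each alpha at zero is a precaution of this example against floating-point round-off.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_prune, y_prune = make_classification(random_state=0)

# Effective alphas at which successive subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_prune, y_prune
)

node_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_prune, y_prune)
    node_counts.append(tree.tree_.node_count)
    print("ccp_alpha={:.4f} -> {} nodes".format(alpha, tree.tree_.node_count))
```

Larger values of ``ccp_alpha`` prune more aggressively, so the node count shrinks as alpha grows; in practice the value would typically be chosen by cross-validation.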

##########################################################################<br>
Retrieve dataframes from OpenML<br>
-------------------------------<br>
:func:`datasets.fetch_openml` can now return a pandas dataframe and thus<br>
properly handle datasets with heterogeneous data:

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
titanic = fetch_openml('titanic', version=1, as_frame=True)
print(titanic.data.head()[['pclass', 'embarked']])

##########################################################################<br>
Checking scikit-learn compatibility of an estimator<br>
---------------------------------------------------<br>
Developers can check that their estimators are compatible with scikit-learn<br>
using :func:`~utils.estimator_checks.check_estimator`. For<br>
instance, ``check_estimator(LinearSVC)`` passes.<br>
<br>
We now provide a ``pytest``-specific decorator,<br>
:func:`~utils.estimator_checks.parametrize_with_checks`, which allows ``pytest``<br>
to run all checks independently and report the checks that are failing.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils.estimator_checks import parametrize_with_checks

In [None]:
@parametrize_with_checks([LogisticRegression, DecisionTreeRegressor])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

##########################################################################<br>
ROC AUC now supports multiclass classification<br>
----------------------------------------------<br>
The :func:`~metrics.roc_auc_score` function can also be used in multiclass<br>
classification. Two averaging strategies are currently supported: the<br>
one-vs-one algorithm computes the average of the pairwise ROC AUC scores, and<br>
the one-vs-rest algorithm computes the average of the ROC AUC scores for each<br>
class against all other classes. In both cases, the multiclass ROC AUC scores<br>
are computed from the probability estimates that a sample belongs to a<br>
particular class according to the model. The OvO and OvR algorithms support<br>
weighting uniformly (``average='macro'``) and weighting by the prevalence<br>
(``average='weighted'``).<br>
<br>
Read more in the :ref:`User Guide <roc_metrics>`.

In [None]:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

In [None]:
X, y = make_classification(n_classes=4, n_informative=16, random_state=0)
clf = SVC(decision_function_shape='ovo', probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovo'))
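
The one-vs-rest strategy with uniform weighting can be written out in plain Python to show what is being averaged. This is a minimal sketch under assumptions, not the scikit-learn implementation: the probabilities are made up by hand, the helper names are hypothetical, and the binary AUC is computed by brute force as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half.

```python
def binary_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba):
    """Unweighted mean of the per-class 'this class vs the rest' AUCs."""
    classes = sorted(set(y_true))
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in proba]
        labels = [target == cls for target in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Hand-made class probabilities for six samples and three classes.
y_demo = [0, 0, 1, 1, 2, 2]
proba_demo = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
ovr_auc = ovr_macro_auc(y_demo, proba_demo)
print(ovr_auc)
```

Classes 0 and 2 are ranked perfectly, while class 1 gets 7 of its 8 pairs right, so the macro average is (1 + 0.875 + 1) / 3. The one-vs-one strategy would average over pairs of classes instead, restricting each comparison to the samples of the two classes involved.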