Version 0.22.1

In Development

This is a bug-fix release, primarily to resolve some packaging issues in version 0.22.0. It also includes minor documentation improvements and some bug fixes.

Changelog

sklearn.cluster

  • cluster.KMeans with algorithm="elkan" now uses the same stopping criterion as with the default algorithm="full". 15930 by inder128.

sklearn.inspection

  • inspection.permutation_importance now returns the same importances when a random_state is given, whether n_jobs=1 or n_jobs>1, with both shared-memory backends (thread safety) and isolated-memory, process-based backends. It also avoids casting the data to object dtype and avoids a read-only error on large dataframes with n_jobs>1, as reported in 15810. Follow-up of 15898 by Shivam Gargsya <shivamgargsya>. 15933 by Guillaume Lemaitre <glemaitre> and Olivier Grisel.
  • inspection.plot_partial_dependence and inspection.PartialDependenceDisplay.plot now consistently check the number of axes passed in. 15760 by Thomas Fan.

sklearn.metrics

  • metrics.plot_confusion_matrix now raises an error when normalize is invalid. Previously, it ran without error and applied no normalization. 15888 by Hanmin Qin.
  • metrics.plot_confusion_matrix now colors the labels correctly to maximize contrast with their background. 15936 by Thomas Fan and DizietAsahi.
  • metrics.classification_report no longer ignores the value of the zero_division keyword argument. 15879 by Bibhash Chandra Mitra <Bibyutatsu>.

sklearn.model_selection

  • model_selection.GridSearchCV and model_selection.RandomizedSearchCV now accept scalar values provided in fit_params again. A change in 0.22 had broken backward compatibility. 15863 by Adrin Jalali <adrinjalali> and Guillaume Lemaitre <glemaitre>.

sklearn.utils

  • utils.check_array now correctly converts pandas DataFrame with boolean columns to floats. 15797 by Thomas Fan.

Version 0.22.0

December 3 2019

For a short description of the main highlights of the release, please refer to sphx_glr_auto_examples_release_highlights_plot_release_highlights_0_22_0.py.

Website update

Our website was revamped and given a fresh new look. 14849 by Thomas Fan.

Clear definition of the public API

Scikit-learn has a public API, and a private API.

We do our best not to break the public API, and to only introduce backward-compatible changes that do not require any user action. However, in cases where that's not possible, any change to the public API is subject to a deprecation cycle of two minor versions. The private API isn't publicly documented and isn't subject to any deprecation cycle, so users should not rely on its stability.

A function or object is public if it is documented in the API Reference and if it can be imported with an import path without leading underscores. For example sklearn.pipeline.make_pipeline is public, while sklearn.pipeline._name_estimators is private. sklearn.ensemble._gb.BaseEnsemble is private too because the whole _gb module is private.

Up to 0.22, some tools were de-facto public (no leading underscore), while they should have been private in the first place. In version 0.22, these tools have been made properly private, and the public API space has been cleaned. In addition, importing from most sub-modules is now deprecated: you should for example use from sklearn.cluster import Birch instead of from sklearn.cluster.birch import Birch (in practice, birch.py has been moved to _birch.py).
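As a minimal sketch of the import convention described above, the snippet below imports Birch from its public location and applies the leading-underscore rule to a few dotted paths. The helper is_public_path is hypothetical and exists only for illustration; it checks the underscore rule alone, not whether a tool is documented in the API Reference.

```python
import numpy as np

# Recommended: import from the public package, not from a sub-module file.
from sklearn.cluster import Birch


def is_public_path(dotted_path):
    """Hypothetical helper: True if no component starts with an underscore."""
    return not any(part.startswith("_") for part in dotted_path.split("."))


# Tiny illustrative dataset with two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
labels = Birch(n_clusters=2).fit(X).labels_

public = is_public_path("sklearn.pipeline.make_pipeline")
private_function = is_public_path("sklearn.pipeline._name_estimators")
private_module = is_public_path("sklearn.ensemble._gb.BaseEnsemble")
```

Here public is True, while private_function and private_module are both False because one component of each path starts with an underscore.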

Note

All the tools in the public API should be documented in the API Reference. If you find a public tool (without leading underscore) that isn't in the API reference, that means it should either be private or documented. Please let us know by opening an issue!

This work was tracked in issue 9250 and issue 12927.

Deprecations: using FutureWarning from now on

When deprecating a feature, previous versions of scikit-learn used to raise a DeprecationWarning. Since DeprecationWarnings aren't shown by default by Python, scikit-learn needed to resort to a custom warning filter to always show the warnings. That filter would sometimes interfere with users' custom warning filters.

Starting from version 0.22, scikit-learn will show FutureWarnings for deprecations, as recommended by the Python documentation. FutureWarnings are always shown by default by Python, so the custom filter has been removed and scikit-learn no longer interferes with user filters. 15080 by Nicolas Hug.
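The following sketch uses only the standard library to illustrate how a FutureWarning-based deprecation interacts with user filters. The function old_function is hypothetical and stands in for any deprecated feature; it is not part of scikit-learn.

```python
import warnings


def old_function():
    """Hypothetical deprecated helper, used only for illustration."""
    warnings.warn(
        "old_function is deprecated; use new_function instead.",
        FutureWarning,
    )
    return 42


# Record the warning to confirm which category is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_function()

categories = [w.category for w in caught]

# A user filter is respected: the same call is silent once
# FutureWarning is explicitly ignored.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    old_function()
```

Because the library no longer installs its own filter, a user-level filterwarnings call like the one above is the only thing that decides whether the message is shown.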

Changed models

The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.

  • cluster.KMeans when n_jobs=1.
  • decomposition.SparseCoder, decomposition.DictionaryLearning, and decomposition.MiniBatchDictionaryLearning
  • decomposition.SparseCoder with algorithm='lasso_lars'
  • decomposition.SparsePCA where normalize_components has no effect due to deprecation.
  • ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor.
  • impute.IterativeImputer when X has features with no missing values.
  • linear_model.Ridge when X is sparse.
  • model_selection.StratifiedKFold and any use of cv=int with a classifier.
  • cross_decomposition.CCA when using scipy >= 1.3

Details are listed in the changelog below.

(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)

Changelog

sklearn.base

  • From version 0.24 base.BaseEstimator.get_params will raise an AttributeError rather than return None for parameters that are in the estimator's constructor but not stored as attributes on the instance. 14464 by Joel Nothman.

sklearn.calibration

  • Fixed a bug that made calibration.CalibratedClassifierCV fail when given a sample_weight parameter of type list (in the case where sample_weights are not supported by the wrapped estimator). 13575 by William de Vazelhes <wdevazelhes>.

sklearn.cluster

  • cluster.SpectralClustering now accepts precomputed sparse neighbors graph as input. 10482 by Tom Dupre la Tour and Kumar Ashutosh <thechargedneutron>.
  • cluster.SpectralClustering now accepts a n_components parameter. This parameter extends SpectralClustering class functionality to match cluster.spectral_clustering. 13726 by Shuzhe Xiao <fdas3213>.
  • Fixed a bug where cluster.KMeans produced inconsistent results between n_jobs=1 and n_jobs>1 due to the handling of the random state. 9288 by Bryan Yang <bryanyang0528>.
  • Fixed a bug where elkan algorithm in cluster.KMeans was producing Segmentation Fault on large arrays due to integer index overflow. 15057 by Vladimir Korolev <balodja>.
  • cluster.MeanShift now accepts a max_iter parameter with a default value of 300 instead of always using the default 300. It also now exposes an n_iter_ attribute indicating the maximum number of iterations performed on each seed. 15120 by Adrin Jalali.
  • cluster.AgglomerativeClustering and cluster.FeatureAgglomeration now raise an error if affinity='cosine' and X has samples that are all-zeros. 7943 by mthorrell.

sklearn.compose

  • Adds compose.make_column_selector which is used with compose.ColumnTransformer to select DataFrame columns on the basis of name and dtype. 12303 by Thomas Fan.
  • Fixed a bug in compose.ColumnTransformer which failed to select the proper columns when using a boolean list, with NumPy older than 1.12. 14510 by Guillaume Lemaitre.
  • Fixed a bug in compose.TransformedTargetRegressor which did not pass **fit_params to the underlying regressor. 14890 by Miguel Cabrera <mfcabrera>.
  • The compose.ColumnTransformer now requires the number of features to be consistent between fit and transform. A FutureWarning is raised now, and this will raise an error in 0.24. If the number of features isn't consistent and negative indexing is used, an error is raised. 14544 by Adrin Jalali.
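As a rough illustration of compose.make_column_selector, the sketch below routes numeric columns to a scaler and all remaining columns to a one-hot encoder. The DataFrame contents and column names are made up for the example, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one categorical column and two numeric columns.
df = pd.DataFrame(
    {
        "city": ["London", "Paris", "London", "Rome"],
        "age": [31.0, 45.0, 27.0, 52.0],
        "income": [40.0, 55.0, 38.0, 61.0],
    }
)

ct = ColumnTransformer(
    [
        # Select every numeric column by dtype.
        ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
        # Select every non-numeric column by dtype.
        ("cat", OneHotEncoder(), make_column_selector(dtype_exclude=np.number)),
    ]
)

Xt = ct.fit_transform(df)
n_rows, n_cols = Xt.shape  # 2 scaled columns + 3 one-hot columns
```

The selectors are evaluated when fit is called, so the same ColumnTransformer can be reused on a DataFrame with a different set of column names as long as the dtypes carry the intended meaning.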

sklearn.cross_decomposition

  • cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression have a new function inverse_transform to transform data to the original space. 15304 by Jaime Ferrando Huertas <jiwidi>.
  • decomposition.KernelPCA now properly checks the eigenvalues found by the solver for numerical or conditioning issues. This ensures consistency of results across solvers (different choices for eigen_solver), including approximate solvers such as 'randomized' and 'lobpcg' (see 12068). 12145 by Sylvain Marié <smarie>
  • Fixed a bug where cross_decomposition.PLSCanonical and cross_decomposition.PLSRegression were raising an error when fitted with a target matrix Y in which the first column was constant. 13609 by Camila Williamson <camilaagw>.
  • cross_decomposition.CCA now produces the same results with scipy 1.3 and previous scipy versions. 15661 by Thomas Fan.

sklearn.datasets

  • datasets.fetch_openml now supports heterogeneous data using pandas by setting as_frame=True. 13902 by Thomas Fan.
  • datasets.fetch_openml now includes the target_names in the returned Bunch. 15160 by Thomas Fan.
  • The parameter return_X_y was added to datasets.fetch_20newsgroups and datasets.fetch_olivetti_faces . 14259 by Sourav Singh <souravsingh>.
  • datasets.make_classification now accepts array-like weights parameter, i.e. list or numpy.array, instead of list only. 14764 by Cat Chenal <CatChenal>.
  • The parameter normalize was added to datasets.fetch_20newsgroups_vectorized. 14740 by Stéphan Tulkens <stephantul>.

  • Fixed a bug in datasets.fetch_openml, which failed to load an OpenML dataset that contains an ignored feature. 14623 by Sarra Habchi <HabchiSarra>.

sklearn.decomposition

  • decomposition.NMF(solver='mu') fitted on sparse input matrices now uses batching to avoid briefly allocating an array with size (#non-zero elements, n_components). 15257 by Mart Willocx.
  • decomposition.dict_learning() and decomposition.dict_learning_online() now accept method_max_iter and pass it to decomposition.sparse_encode. 12650 by Adrin Jalali.
  • decomposition.SparseCoder, decomposition.DictionaryLearning, and decomposition.MiniBatchDictionaryLearning now take a transform_max_iter parameter and pass it to either decomposition.dict_learning() or decomposition.sparse_encode(). 12650 by Adrin Jalali.
  • decomposition.IncrementalPCA now accepts sparse matrices as input, converting them to dense in batches thereby avoiding the need to store the entire dense matrix at once. 13960 by Scott Gigante <scottgigante>.
  • decomposition.sparse_encode() now passes the max_iter to the underlying linear_model.LassoLars when algorithm='lasso_lars'. 12650 by Adrin Jalali.
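A minimal sketch of the sparse-input support in decomposition.IncrementalPCA mentioned above. The random matrix, its shape and the batch size are arbitrary choices for illustration.

```python
import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA

# Random sparse matrix in CSR format; only about 10% of entries are non-zero.
X_sparse = sp.random(200, 20, density=0.1, format="csr", random_state=0)

# Each batch of 50 rows is densified on its own, so the full dense
# matrix never has to be held in memory at once.
ipca = IncrementalPCA(n_components=3, batch_size=50)
X_reduced = ipca.fit_transform(X_sparse)
```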

sklearn.dummy

  • dummy.DummyClassifier now handles checking the existence of the provided constant in multioutput cases. 14908 by Martina G. Vilas <martinagvilas>.
  • The default value of the strategy parameter in dummy.DummyClassifier will change from 'stratified' in version 0.22 to 'prior' in 0.24. A FutureWarning is raised when the default value is used. 15382 by Thomas Fan.
  • The outputs_2d_ attribute is deprecated in dummy.DummyClassifier and dummy.DummyRegressor. It is equivalent to n_outputs > 1. 14933 by Nicolas Hug

sklearn.ensemble

  • Added ensemble.StackingClassifier and ensemble.StackingRegressor to stack predictors using a final classifier or regressor. 11047 by Guillaume Lemaitre <glemaitre> and Caio Oliveira <caioaao> and 15138 by Jon Cusick <jcusick13>.
  • Many improvements were made to ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor:

    • Estimators now natively support dense data with missing values both for training and predicting. They also support infinite values. 13911 and 14406 by Nicolas Hug, Adrin Jalali and Olivier Grisel.
    • Estimators now have an additional warm_start parameter that enables warm starting. 14012 by Johann Faouzi <johannfaouzi>.
    • inspection.partial_dependence and inspection.plot_partial_dependence now support the fast 'recursion' method for both estimators. 13769 by Nicolas Hug.
    • For ensemble.HistGradientBoostingClassifier the training loss or score is now monitored on a class-wise stratified subsample to preserve the class balance of the original training set. 14194 by Johann Faouzi <johannfaouzi>.
    • ensemble.HistGradientBoostingRegressor now supports the 'least_absolute_deviation' loss. 13896 by Nicolas Hug.
    • Estimators now bin the training and validation data separately to avoid any data leak. 13933 by Nicolas Hug.
    • Fixed a bug where early stopping would break with string targets. 14710 by Guillaume Lemaitre.
    • ensemble.HistGradientBoostingClassifier now raises an error if categorical_crossentropy loss is given for a binary classification problem. 14869 by Adrin Jalali.

    Note that pickles from 0.21 will not work in 0.22.

  • The new max_samples argument allows limiting the size of bootstrap samples to less than the size of the dataset. It was added to ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, and ensemble.ExtraTreesRegressor. 14682 by Matt Hancock <notmatthancock> and 5963 by Pablo Duboue <DrDub>.
  • ensemble.VotingClassifier.predict_proba will no longer be present when voting='hard'. 14287 by Thomas Fan.
  • The named_estimators_ attribute in ensemble.VotingClassifier and ensemble.VotingRegressor now correctly maps to dropped estimators. Previously, the named_estimators_ mapping was incorrect whenever one of the estimators was dropped. 15375 by Thomas Fan.
  • utils.estimator_checks.check_estimator is now run by default on both ensemble.VotingClassifier and ensemble.VotingRegressor. This resolves issues regarding shape consistency during predict, which was failing when the underlying estimators were not outputting consistent array dimensions. Note that it should be replaced by refactoring the common tests in the future. 14305 by Guillaume Lemaitre.
  • ensemble.AdaBoostClassifier computes probabilities based on the decision function as in the literature. Thus, predict and predict_proba give consistent results. 14114 by Guillaume Lemaitre.
  • Stacking and Voting estimators now ensure that their underlying estimators are either all classifiers or all regressors. ensemble.StackingClassifier, ensemble.StackingRegressor, and ensemble.VotingClassifier and VotingRegressor now raise consistent error messages. 15084 by Guillaume Lemaitre.
  • Fixed ensemble.AdaBoostRegressor, where the loss should be normalized by the max of the samples with non-null weights only. 14294 by Guillaume Lemaitre.
  • presort is now deprecated in ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor, and the parameter has no effect. Users are recommended to use ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor instead. 14907 by Adrin Jalali.
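As a hedged sketch of two additions in this section, the example below stacks two classifiers with ensemble.StackingClassifier and uses max_samples to limit the bootstrap sample size of the random forest. The dataset, base estimators and hyperparameter values are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        # max_samples=0.8 draws 80% of the training rows for each tree.
        ("rf", RandomForestClassifier(n_estimators=20, max_samples=0.8, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    # The final estimator is trained on the base estimators' predictions.
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

All base estimators here are classifiers, which matches the consistency check described above for Stacking and Voting estimators.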

sklearn.feature_extraction

  • A warning will now be raised if a parameter choice means that another parameter will be unused on calling the fit() method for feature_extraction.text.HashingVectorizer, feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer. 14602 by Gaurav Chawla <getgaurav2>.
  • Functions created by build_preprocessor and build_analyzer of feature_extraction.text.VectorizerMixin can now be pickled. 14430 by Dillon Niederhut <deniederhut>.
  • feature_extraction.text.strip_accents_unicode now correctly removes accents from strings that are in NFKD normalized form. 15100 by Daniel Grady <DGrady>.
  • Fixed a bug that caused feature_extraction.DictVectorizer to raise an OverflowError during the transform operation when producing a scipy.sparse matrix on large input data. 15463 by Norvan Sahiner <norvan>.
  • Deprecated the unused copy parameter of feature_extraction.text.TfidfVectorizer.transform. It will be removed in v0.24. 14520 by Guillem G. Subies <guillemgsubies>.

sklearn.feature_selection

  • Updated the following feature_selection estimators to allow NaN/Inf values in transform and fit: feature_selection.RFE, feature_selection.RFECV, feature_selection.SelectFromModel, and feature_selection.VarianceThreshold. Note that if the underlying estimator of the feature selector does not allow NaN/Inf then it will still error, but the feature selectors themselves no longer enforce this restriction unnecessarily. 11635 by Alec Peters <adpeters>.
  • Fixed a bug where feature_selection.VarianceThreshold with threshold=0 did not remove constant features due to numerical instability, by using range rather than variance in this case. 13704 by Roddy MacSween <rlms>.

sklearn.gaussian_process

  • Gaussian process models on structured data: gaussian_process.GaussianProcessRegressor and gaussian_process.GaussianProcessClassifier can now accept a list of generic objects (e.g. strings, trees, graphs, etc.) as the X argument to their training/prediction methods. A user-defined kernel should be provided for computing the kernel matrix among the generic objects, and should inherit from gaussian_process.kernels.GenericKernelMixin to notify the GPR/GPC model that it handles non-vectorial samples. 15557 by Yu-Hang Tang <yhtang>.
  • gaussian_process.GaussianProcessClassifier.log_marginal_likelihood and gaussian_process.GaussianProcessRegressor.log_marginal_likelihood now accept a clone_kernel=True keyword argument. When set to False, the kernel attribute is modified in place, which may result in a performance improvement. 14378 by Masashi Shibata <c-bata>.
  • From version 0.24 gaussian_process.kernels.Kernel.get_params will raise an AttributeError rather than return None for parameters that are in the estimator's constructor but not stored as attributes on the instance. 14464 by Joel Nothman.

sklearn.impute

  • Added impute.KNNImputer, to impute missing values using k-Nearest Neighbors. 12852 by Ashim Bhattarai <ashimb9> and Thomas Fan and 15010 by Guillaume Lemaitre.
  • impute.IterativeImputer has a new skip_complete flag that is False by default, which, when True, will skip computation on features that have no missing values during the fit phase. 13773 by Sergey Feldman <sergeyf>.
  • impute.MissingIndicator.fit_transform now avoids repeated computation of the masked matrix. 14356 by Harsh Soni <harsh020>.
  • impute.IterativeImputer now works when there is only one feature. By Sergey Feldman <sergeyf>.
  • Fixed a bug in impute.IterativeImputer where features were imputed in the reverse of the desired order with imputation_order either "ascending" or "descending". 15393 by Venkatachalam N <venkyyuvy>.
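A small sketch of impute.KNNImputer on a toy matrix. The data and the choice of n_neighbors=2 are illustrative; each missing entry is replaced by the mean of that feature over the two nearest rows that have the feature observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

# Distances are computed with a NaN-aware Euclidean metric, so rows with
# missing entries can still be compared on their shared observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

For this input, the missing value in the first row becomes 4.0 (the mean of 3 and 5) and the missing value in the third row becomes 5.5 (the mean of 3 and 8).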

sklearn.inspection

  • inspection.permutation_importance has been added to measure the importance of each feature in an arbitrary trained model with respect to a given scoring function. 13146 by Thomas Fan.
  • inspection.partial_dependence and inspection.plot_partial_dependence now support the fast 'recursion' method for ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor. 13769 by Nicolas Hug.
  • inspection.plot_partial_dependence has been extended to now support the new visualization API described in the User Guide <visualizations>. 14646 by Thomas Fan.
  • inspection.partial_dependence accepts pandas DataFrame and pipeline.Pipeline containing compose.ColumnTransformer. In addition inspection.plot_partial_dependence will use the column names by default when a dataframe is passed. 14028 and 15429 by Guillaume Lemaitre.
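The sketch below shows one way to call inspection.permutation_importance on a fitted model and a held-out set. The Ridge model, the diabetes dataset and n_repeats=10 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1e-2).fit(X_train, y_train)

# Each feature is shuffled n_repeats times; the drop in the model's score
# (R^2 for a regressor by default) is recorded for every repeat.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Repeating the call with the same random_state gives the same importances.
again = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]  # most important first
```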

sklearn.kernel_approximation

  • Fixed a bug where kernel_approximation.Nystroem raised a KeyError when using kernel="precomputed". 14706 by Venkatachalam N <venkyyuvy>.

sklearn.linear_model

  • The 'liblinear' logistic regression solver is now faster and requires less memory. 14108, 14170, 14296 by Alex Henrie <alexhenrie>.
  • linear_model.BayesianRidge now accepts hyperparameters alpha_init and lambda_init which can be used to set the initial value of the maximization procedure in fit. 13618 by Yoshihiro Uchida <c56pony>.
  • linear_model.Ridge now correctly fits an intercept when X is sparse, solver="auto" and fit_intercept=True, because the default solver in this configuration has changed to sparse_cg, which can fit an intercept with sparse data. 13995 by Jérôme Dockès <jeromedockes>.
  • linear_model.Ridge with solver='sag' now accepts F-ordered and non-contiguous arrays and makes a conversion instead of failing. 14458 by Guillaume Lemaitre.
  • linear_model.LassoCV no longer forces precompute=False when fitting the final model. 14591 by Andreas Müller.
  • linear_model.RidgeCV and linear_model.RidgeClassifierCV now correctly score when cv=None. 14864 by Venkatachalam N <venkyyuvy>.
  • Fixed a bug in linear_model.LogisticRegressionCV where the scores_, n_iter_ and coefs_paths_ attributes would have a wrong ordering with penalty='elasticnet'. 15044 by Nicolas Hug.
  • Fixed linear_model.MultiTaskLassoCV and linear_model.MultiTaskElasticNetCV when X is of dtype int and fit_intercept=True. 15086 by Alex Gramfort <agramfort>.
  • The liblinear solver now supports sample_weight. 15038 by Guillaume Lemaitre.

sklearn.manifold

  • manifold.Isomap, manifold.TSNE, and manifold.SpectralEmbedding now accept precomputed sparse neighbors graph as input. 10482 by Tom Dupre la Tour and Kumar Ashutosh <thechargedneutron>.
  • Exposed the n_jobs parameter in manifold.TSNE for multi-core calculation of the neighbors graph. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). 15082 by Roman Yurchak.
  • Improved efficiency of manifold.TSNE when method="barnes_hut" by computing the gradient in parallel. 13213 by Thomas Moreau <tommoral>.
  • Fixed a bug where manifold.spectral_embedding (and therefore manifold.SpectralEmbedding and cluster.SpectralClustering) computed wrong eigenvalues with eigen_solver='amg' when n_samples < 5 * n_components. 14647 by Andreas Müller.
  • Fixed a bug in manifold.spectral_embedding used in manifold.SpectralEmbedding and cluster.SpectralClustering where eigen_solver="amg" would sometimes result in a LinAlgError. 13393 by Andrew Knyazev <lobpcg> and 13707 by Scott White <whitews>.
  • Deprecate training_data_ unused attribute in manifold.Isomap. 10482 by Tom Dupre la Tour.

sklearn.metrics

  • metrics.plot_roc_curve has been added to plot ROC curves. This function introduces the visualization API described in the User Guide <visualizations>. 14357 by Thomas Fan.
  • Added a new parameter zero_division to multiple classification metrics: precision_score, recall_score, f1_score, fbeta_score, precision_recall_fscore_support, classification_report. This allows setting the returned value for ill-defined metrics. 14900 by Marc Torrellas Socastro <marctorrellas>.
  • Added the metrics.pairwise.nan_euclidean_distances metric, which calculates euclidean distances in the presence of missing values. 12852 by Ashim Bhattarai <ashimb9> and Thomas Fan.
  • New ranking metrics metrics.ndcg_score and metrics.dcg_score have been added to compute Discounted Cumulative Gain and Normalized Discounted Cumulative Gain. 9951 by Jérôme Dockès <jeromedockes>.
  • metrics.plot_precision_recall_curve has been added to plot precision recall curves. 14936 by Thomas Fan.
  • metrics.plot_confusion_matrix has been added to plot confusion matrices. 15083 by Thomas Fan.
  • Added multiclass support to metrics.roc_auc_score with corresponding scorers 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', and 'roc_auc_ovo_weighted'. 12789 and 15274 by Kathy Chen <kathyxchen>, Mohamed Maskani <maskani-moh>, and Thomas Fan.
  • Add metrics.mean_tweedie_deviance measuring the Tweedie deviance for a given power parameter. Also add mean Poisson deviance metrics.mean_poisson_deviance and mean Gamma deviance metrics.mean_gamma_deviance that are special cases of the Tweedie deviance for power=1 and power=2 respectively. 13938 by Christian Lorentzen <lorentzenchr> and Roman Yurchak.
  • Improved performance of metrics.pairwise.manhattan_distances in the case of sparse matrices. 15049 by Paolo Toccaceli <ptocca>.
  • The parameter beta in metrics.fbeta_score is updated to accept the zero and float('+inf') value. 13231 by Dong-hee Na <corona10>.
  • Added parameter squared in metrics.mean_squared_error to return root mean squared error. 13467 by Urvang Patel <urvang96>.
  • Allow computing averaged metrics in the case of no true positives. 14595 by Andreas Müller.
  • Multilabel metrics now support list of lists as input. 14865 by Srivatsan Ramesh <srivatsan-ramesh>, Herilalaina Rakotoarison <herilalaina>, Léonard Binet <leonardbinet>.
  • metrics.median_absolute_error now supports multioutput parameter. 14732 by Agamemnon Krasoulis <agamemnonc>.
  • 'roc_auc_ovr_weighted' and 'roc_auc_ovo_weighted' can now be used as the scoring parameter of model-selection tools. 14417 by Thomas Fan.
  • metrics.confusion_matrix accepts a normalize parameter that allows normalizing the confusion matrix by column, by row, or overall. 15625 by Guillaume Lemaitre <glemaitre>.
  • Raise a ValueError in metrics.silhouette_score when a precomputed distance matrix contains non-zero diagonal entries. 12258 by Stephen Tierney <sjtrny>.
  • scoring="neg_brier_score" should be used instead of scoring="brier_score_loss" which is now deprecated. 14898 by Stefan Matcovici <stefan-matcovici>.
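To make two of the additions above concrete, the sketch below uses zero_division with metrics.precision_score and normalize with metrics.confusion_matrix. The label vectors are toy values chosen so the results can be checked by hand.

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 1, 1]

# No sample is predicted as class 1, so precision for class 1 is 0/0.
# zero_division sets the value returned in that ill-defined case.
y_pred_no_positives = [0, 0, 0]
precision_as_one = precision_score(y_true, y_pred_no_positives, zero_division=1)
precision_as_zero = precision_score(y_true, y_pred_no_positives, zero_division=0)

# normalize="true" divides each row by the number of true samples in that
# class; "pred" normalizes by column and "all" by the total count.
y_pred = [0, 0, 1]
cm_by_row = confusion_matrix(y_true, y_pred, normalize="true")
```

Here precision_as_one is 1.0, precision_as_zero is 0.0, and cm_by_row is [[1.0, 0.0], [0.5, 0.5]].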

sklearn.model_selection

  • Improved performance of multimetric scoring in model_selection.cross_validate, model_selection.GridSearchCV, and model_selection.RandomizedSearchCV. 14593 by Thomas Fan.
  • model_selection.learning_curve now accepts parameter return_times which can be used to retrieve computation times in order to plot model scalability (see learning_curve example). 13938 by Hadrien Reboul <H4dr1en>.
  • model_selection.RandomizedSearchCV now accepts lists of parameter distributions. 14549 by Andreas Müller.
  • Reimplemented model_selection.StratifiedKFold to fix an issue where one test set could be n_classes larger than another. Test sets should now be near-equally sized. 14704 by Joel Nothman.
  • The cv_results_ attribute of model_selection.GridSearchCV and model_selection.RandomizedSearchCV now only contains unfitted estimators. This potentially saves a lot of memory since the state of the estimators isn't stored. 15096 by Andreas Müller.
  • model_selection.KFold and model_selection.StratifiedKFold now raise a warning if random_state is set but shuffle is False. This will raise an error in 0.24.
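A brief sketch of passing a list of parameter distributions to model_selection.RandomizedSearchCV, which is useful when some parameters only apply to certain settings. The estimator, dataset and candidate values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One dictionary per sub-space: gamma is only sampled for the RBF kernel.
param_distributions = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
]

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=6, cv=3, random_state=0
)
search.fit(X, y)

best_kernel = search.best_params_["kernel"]
n_candidates = len(search.cv_results_["params"])
```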

sklearn.multioutput

  • multioutput.MultiOutputClassifier now has attribute classes_. 14629 by Agamemnon Krasoulis <agamemnonc>.
  • multioutput.MultiOutputClassifier now has predict_proba as property and can be checked with hasattr. 15488 and 15490 by Rebekah Kim <rebekahkim>.

sklearn.naive_bayes

  • Added naive_bayes.CategoricalNB that implements the Categorical Naive Bayes classifier. 12569 by Tim Bicker <timbicker> and Florian Wilhelm <FlorianWilhelm>.
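A minimal sketch of naive_bayes.CategoricalNB on a toy problem. Features are assumed to be encoded as non-negative integers (for example with an ordinal encoder), and the data are made up so that only the first feature carries information about the class.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Two categorical features, each encoded as integer category indices.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])

clf = CategoricalNB()  # default Laplace smoothing, alpha=1.0
clf.fit(X, y)

predictions = clf.predict(np.array([[0, 1], [1, 0]]))
```

With this data the second feature is distributed identically in both classes, so the predictions follow the first feature and come out as [0, 1].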

sklearn.neighbors

  • Added neighbors.KNeighborsTransformer and neighbors.RadiusNeighborsTransformer, which transform input dataset into a sparse neighbors graph. They give finer control on nearest neighbors computations and enable easy pipeline caching for multiple use. 10482 by Tom Dupre la Tour.
  • neighbors.KNeighborsClassifier, neighbors.KNeighborsRegressor, neighbors.RadiusNeighborsClassifier, neighbors.RadiusNeighborsRegressor, and neighbors.LocalOutlierFactor now accept precomputed sparse neighbors graph as input. 10482 by Tom Dupre la Tour and Kumar Ashutosh <thechargedneutron>.
  • neighbors.RadiusNeighborsClassifier now supports predicting probabilities by using predict_proba and supports more outlier_label options: 'most_frequent', or different outlier_labels for multi-outputs. 9597 by Wenbo Zhao <webber26232>.
  • Efficiency improvements for neighbors.RadiusNeighborsClassifier.predict. 9597 by Wenbo Zhao <webber26232>.
  • neighbors.KNeighborsRegressor now raises an error when metric='precomputed' and fit on non-square data. 14336 by Gregory Dexter <gdex1>.
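The sketch below chains neighbors.KNeighborsTransformer with a classifier that consumes the resulting sparse neighbors graph through metric="precomputed". The dataset and neighbor counts are illustrative; the transformer is given at least as many neighbors as the classifier needs.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(
    # Step 1: build a sparse graph holding distances to the nearest neighbors.
    KNeighborsTransformer(n_neighbors=10, mode="distance"),
    # Step 2: classify using only the distances stored in that graph.
    KNeighborsClassifier(n_neighbors=5, metric="precomputed"),
)
pipe.fit(X, y)
accuracy = pipe.score(X, y)
```

Splitting the two steps this way means the graph can be cached (for example via the pipeline's memory option) and reused while tuning the classifier's own n_neighbors.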

sklearn.neural_network

  • Added a max_fun parameter to neural_network.BaseMultilayerPerceptron, neural_network.MLPRegressor, and neural_network.MLPClassifier to give control over the maximum number of function evaluations allowed when the tol improvement is not met. 9274 by Daniel Perry <daniel-perry>.

sklearn.pipeline

  • pipeline.Pipeline now supports score_samples if the final estimator does. 13806 by Anaël Beaugnon <ab-anssi>.
  • The fit in pipeline.FeatureUnion now accepts fit_params to pass to the underlying transformers. 15119 by Adrin Jalali.
  • None as a transformer is now deprecated in pipeline.FeatureUnion. Please use 'drop' instead. 15053 by Thomas Fan.

sklearn.preprocessing

  • preprocessing.PolynomialFeatures is now faster when the input data is dense. 13290 by Xavier Dupré <sdpython>.
  • Avoid unnecessary data copy when fitting preprocessors preprocessing.StandardScaler, preprocessing.MinMaxScaler, preprocessing.MaxAbsScaler, preprocessing.RobustScaler and preprocessing.QuantileTransformer which results in a slight performance improvement. 13987 by Roman Yurchak.
  • preprocessing.KernelCenterer now raises an error when fit on non-square data. 14336 by Gregory Dexter <gdex1>.
  • preprocessing.QuantileTransformer now guarantees the quantiles_ attribute to be completely sorted in non-decreasing manner. 15751 by Tirth Patel <tirthasheshpatel>.

sklearn.model_selection

  • model_selection.GridSearchCV and model_selection.RandomizedSearchCV now support the _pairwise property, which prevents an error during cross-validation for estimators with pairwise inputs (such as neighbors.KNeighborsClassifier when metric is set to 'precomputed'). 13925 by Isaac S. Robson <isrobson> and 15524 by Xun Tang <xun-tang>.

sklearn.svm

  • svm.SVC and svm.NuSVC now accept a break_ties parameter. This parameter results in predict breaking the ties according to the confidence values of decision_function, if decision_function_shape='ovr', and the number of target classes > 2. 12557 by Adrin Jalali.
  • SVM estimators now throw a more specific error when kernel='precomputed' and fit on non-square data. 14336 by Gregory Dexter <gdex1>.
  • Fixed a bug where svm.SVC, svm.SVR, svm.NuSVR and svm.OneClassSVM generated an invalid model when fit() received negative or zero values for the sample_weight parameter. This behavior occurred only in some border scenarios. In these cases, fit() now fails with an exception. 14286 by Alex Shacked <alexshacked>.
  • The n_support_ attribute of svm.SVR and svm.OneClassSVM was previously non-initialized, and had size 2. It now has size 1 with the correct value. 15099 by Nicolas Hug.
  • Fixed a bug in BaseLibSVM._sparse_fit where n_SV=0 raised a ZeroDivisionError. 14894 by Danna Naser <danna-naser>.
  • The liblinear solver now supports sample_weight. 15038 by Guillaume Lemaitre.
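As a sketch of the break_ties option for svm.SVC, the example below fits a three-class problem and compares predict with the argmax of decision_function. The dataset and kernel are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# break_ties only has an effect with decision_function_shape="ovr"
# and more than two classes.
clf = SVC(
    kernel="linear",
    decision_function_shape="ovr",
    break_ties=True,
    random_state=0,
).fit(X, y)

predicted = clf.predict(X)
scores = clf.decision_function(X)  # shape (n_samples, n_classes)

# With break_ties=True, predictions follow the highest confidence value.
from_scores = clf.classes_[np.argmax(scores, axis=1)]
```

Enabling break_ties adds some computational cost at prediction time compared with the default tie-handling, so it is a trade-off rather than a free improvement.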

sklearn.tree

  • Adds minimal cost complexity pruning, controlled by ccp_alpha, to tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier, tree.ExtraTreeRegressor, ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor. 12887 by Thomas Fan.
  • presort is now deprecated in tree.DecisionTreeClassifier and tree.DecisionTreeRegressor, and the parameter has no effect. 14907 by Adrin Jalali.
  • The classes_ and n_classes_ attributes of tree.DecisionTreeRegressor are now deprecated. 15028 by Mei Guan <meiguan>, Nicolas Hug, and Adrin Jalali.
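A short sketch of minimal cost complexity pruning with ccp_alpha on a decision tree. The dataset and the value ccp_alpha=0.01 are illustrative; in practice the value would usually be chosen by validation over the alphas returned by cost_complexity_pruning_path.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Unpruned tree (ccp_alpha defaults to 0.0).
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Effective alphas at which subtrees would be pruned, in increasing order.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
candidate_alphas = path.ccp_alphas

# Larger ccp_alpha removes more of the tree.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

full_nodes = full_tree.tree_.node_count
pruned_nodes = pruned_tree.tree_.node_count
```

The pruned tree has no more nodes than the unpruned one, and typically far fewer.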

sklearn.utils

  • utils.estimator_checks.check_estimator can now generate checks by setting generate_only=True. Previously, running utils.estimator_checks.check_estimator would stop when the first check failed. With generate_only=True, all checks can run independently and report the ones that are failing. Read more in rolling_your_own_estimator. 14381 by Thomas Fan.
  • Added a pytest specific decorator, utils.estimator_checks.parametrize_with_checks, to parametrize estimator checks for a list of estimators. 14381 by Thomas Fan.
  • A new random variable, utils.fixes.loguniform, implements a log-uniform random variable (e.g., for use in RandomizedSearchCV). For example, the outcomes 1, 10 and 100 are all equally likely for loguniform(1, 100). See 11232 by Scott Sievert <stsievert> and Nathaniel Saul <sauln>, and SciPy PR 10815 <scipy/scipy#10815>.
  • utils.safe_indexing (now deprecated) accepts an axis parameter to index array-like across rows and columns. The column indexing can be done on NumPy array, SciPy sparse matrix, and Pandas DataFrame. An additional refactoring was done. 14035 and 14475 by Guillaume Lemaitre.
  • utils.extmath.safe_sparse_dot works between 3D+ ndarray and sparse matrix. 14538 by Jérémie du Boisberranger <jeremiedbb>.
  • utils.check_array now raises an error instead of casting NaN to integer. 14872 by Roman Yurchak.
  • utils.check_array will now correctly detect numeric dtypes in pandas dataframes, fixing a bug where float32 was upcast to float64 unnecessarily. 15094 by Andreas Müller.
  • The following utils have been deprecated and are now private:
    • choose_check_classifiers_labels
    • enforce_estimator_tags_y
    • mocking.MockDataFrame
    • mocking.CheckingClassifier
    • optimize.newton_cg
    • random.random_choice_csc
    • utils.choose_check_classifiers_labels
    • utils.enforce_estimator_tags_y
    • utils.optimize.newton_cg
    • utils.random.random_choice_csc
    • utils.safe_indexing
    • utils.mocking
    • utils.fast_dict
    • utils.seq_dataset
    • utils.weight_vector
    • utils.fixes.parallel_helper (removed)
    • All of utils.testing except for all_estimators which is now in utils.
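
To make the log-uniform entry above concrete, here is a small sketch. It assumes SciPy 1.4 or later, where the same distribution is available as scipy.stats.loguniform (the utils.fixes.loguniform name from this release may not exist in other scikit-learn versions). "Equally likely" is read here as equal probability mass per decade: for loguniform(1, 100), half of the mass falls in [1, 10) and half in [10, 100].

```python
import numpy as np
from scipy.stats import loguniform  # assumption: SciPy >= 1.4

# Log-uniform distribution on [1, 100].
dist = loguniform(1, 100)

# CDF at 10 is log(10 / 1) / log(100 / 1).
cdf_at_10 = dist.cdf(10)

# Draw samples and measure the fraction that lands in the first decade.
samples = dist.rvs(size=10_000, random_state=0)
frac_low = np.mean(samples < 10)

print(cdf_at_10, frac_low)
```

Such a frozen distribution can be passed as a value in the param_distributions dictionary of RandomizedSearchCV, which is convenient for scale parameters such as a regularization strength.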

sklearn.isotonic

  • Fixed a bug where isotonic.IsotonicRegression.fit raised an error when X.dtype == 'float32' and X.dtype != y.dtype. 14902 by Lucas <lostcoaster>.

Miscellaneous

  • Ported lobpcg from SciPy, which implements some bug fixes that are only available in SciPy 1.3+. 13609 and 14971 by Guillaume Lemaitre.
  • Scikit-learn now converts any input data structure implementing a duck array to a numpy array (using __array__) to ensure consistent behavior instead of relying on __array_function__ (see NEP 18). 14702 by Andreas Müller.
  • Replaced manual checks with check_is_fitted. Errors thrown when using a non-fitted estimator are now more uniform. 13013 by Agamemnon Krasoulis <agamemnonc>.

Changes to estimator checks

These changes mostly affect library developers.

  • Estimators are now expected to raise a NotFittedError if predict or transform is called before fit; previously an AttributeError or ValueError was acceptable. 13013 by Agamemnon Krasoulis <agamemnonc>.
  • Binary only classifiers are now supported in estimator checks. Such classifiers need to have the binary_only=True estimator tag. 13875 by Trevor Stephens.
  • Estimators are expected to convert input data (X, y, sample_weights) to numpy.ndarray and never call __array_function__ on the original datatype that is passed (see NEP 18). 14702 by Andreas Müller.
  • The requires_positive_X estimator tag (for models that require X to be non-negative) is now used by utils.estimator_checks.check_estimator to make sure a proper error message is raised if X contains some negative entries. 14680 by Alex Gramfort <agramfort>.
  • Added a check that pairwise estimators raise an error on non-square data. 14336 by Gregory Dexter <gdex1>.
  • Added two common multioutput estimator tests ~utils.estimator_checks.check_classifier_multioutput and ~utils.estimator_checks.check_regressor_multioutput. 13392 by Rok Mihevc <rok>.
  • Added check_transformer_data_not_an_array to checks where missing
  • The estimator tags resolution now follows the regular MRO. They used to be overridable only once. 14884 by Andreas Müller.
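
For library developers, the NotFittedError expectation above can be met with check_is_fitted. The following is a minimal sketch; MeanPredictor is a hypothetical estimator invented for this example and is not part of scikit-learn. It assumes a scikit-learn version in which check_is_fitted can be called with the estimator alone.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class MeanPredictor(BaseEstimator):
    """Hypothetical estimator that always predicts the training mean."""

    def fit(self, X, y):
        # Attributes with a trailing underscore mark the estimator as fitted.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self)
        return np.full(len(X), self.mean_)


est = MeanPredictor()
try:
    est.predict([[0.0]])
    raised = False
except NotFittedError:
    raised = True

# After fitting, predict works as usual.
fitted_pred = est.fit([[0.0], [1.0]], [1.0, 3.0]).predict([[5.0]])
print(raised, fitted_pred)
```

NotFittedError subclasses both ValueError and AttributeError, so callers that caught either of the previously accepted exception types keep working.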

Code and Documentation Contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, including:

Aaron Alphonsus, Abbie Popa, Abdur-Rahmaan Janhangeer, abenbihi, Abhinav Sagar, Abhishek Jana, Abraham K. Lagat, Adam J. Stewart, Aditya Vyas, Adrin Jalali, Agamemnon Krasoulis, Alec Peters, Alessandro Surace, Alexandre de Siqueira, Alexandre Gramfort, alexgoryainov, Alex Henrie, Alex Itkes, alexshacked, Allen Akinkunle, Anaël Beaugnon, Anders Kaseorg, Andrea Maldonado, Andrea Navarrete, Andreas Mueller, Andreas Schuderer, Andrew Nystrom, Angela Ambroz, Anisha Keshavan, Ankit Jha, Antonio Gutierrez, Anuja Kelkar, Archana Alva, arnaudstiegler, arpanchowdhry, ashimb9, Ayomide Bamidele, Baran Buluttekin, barrycg, Bharat Raghunathan, Bill Mill, Biswadip Mandal, blackd0t, Brian G. Barkley, Brian Wignall, Bryan Yang, c56pony, camilaagw, cartman_nabana, catajara, Cat Chenal, Cathy, cgsavard, Charles Vesteghem, Chiara Marmo, Chris Gregory, Christian Lorentzen, Christos Aridas, Dakota Grusak, Daniel Grady, Daniel Perry, Danna Naser, DatenBergwerk, David Dormagen, deeplook, Dillon Niederhut, Dong-hee Na, Dougal J. Sutherland, DrGFreeman, Dylan Cashman, edvardlindelof, Eric Larson, Eric Ndirangu, Eunseop Jeong, Fanny, federicopisanu, Felix Divo, flaviomorelli, FranciDona, Franco M. 
Luque, Frank Hoang, Frederic Haase, g0g0gadget, Gabriel Altay, Gabriel do Vale Rios, Gael Varoquaux, ganevgv, gdex1, getgaurav2, Gideon Sonoiya, Gordon Chen, gpapadok, Greg Mogavero, Grzegorz Szpak, Guillaume Lemaitre, Guillem García Subies, H4dr1en, hadshirt, Hailey Nguyen, Hanmin Qin, Hannah Bruce Macdonald, Harsh Mahajan, Harsh Soni, Honglu Zhang, Hossein Pourbozorg, Ian Sanders, Ingrid Spielman, J-A16, jaehong park, Jaime Ferrando Huertas, James Hill, James Myatt, Jay, jeremiedbb, Jérémie du Boisberranger, jeromedockes, Jesper Dramsch, Joan Massich, Joanna Zhang, Joel Nothman, Johann Faouzi, Jonathan Rahn, Jon Cusick, Jose Ortiz, Kanika Sabharwal, Katarina Slama, kellycarmody, Kennedy Kang'ethe, Kensuke Arai, Kesshi Jordan, Kevad, Kevin Loftis, Kevin Winata, Kevin Yu-Sheng Li, Kirill Dolmatov, Kirthi Shankar Sivamani, krishna katyal, Lakshmi Krishnan, Lakshya KD, LalliAcqua, lbfin, Leland McInnes, Léonard Binet, Loic Esteve, loopyme, lostcoaster, Louis Huynh, lrjball, Luca Ionescu, Lutz Roeder, MaggieChege, Maithreyi Venkatesh, Maltimore, Maocx, Marc Torrellas, Marie Douriez, Markus, Markus Frey, Martina G. 
Vilas, Martin Oywa, Martin Thoma, Masashi SHIBATA, Maxwell Aladago, mbillingr, m-clare, Meghann Agarwal, m.fab, Micah Smith, miguelbarao, Miguel Cabrera, Mina Naghshhnejad, Ming Li, motmoti, mschaffenroth, mthorrell, Natasha Borders, nezar-a, Nicolas Hug, Nidhin Pattaniyil, Nikita Titov, Nishan Singh Mann, Nitya Mandyam, norvan, notmatthancock, novaya, nxorable, Oleg Stikhin, Oleksandr Pavlyk, Olivier Grisel, Omar Saleem, Owen Flanagan, panpiort8, Paolo, Paolo Toccaceli, Paresh Mathur, Paula, Peng Yu, Peter Marko, pierretallotte, poorna-kumar, pspachtholz, qdeffense, Rajat Garg, Raphaël Bournhonesque, Ray, Ray Bell, Rebekah Kim, Reza Gharibi, Richard Payne, Richard W, rlms, Robert Juergens, Rok Mihevc, Roman Feldbauer, Roman Yurchak, R Sanjabi, RuchitaGarde, Ruth Waithera, Sackey, Sam Dixon, Samesh Lakhotia, Samuel Taylor, Sarra Habchi, Scott Gigante, Scott Sievert, Scott White, Sebastian Pölsterl, Sergey Feldman, SeWook Oh, she-dares, Shreya V, Shubham Mehta, Shuzhe Xiao, SimonCW, smarie, smujjiga, Sönke Behrends, Soumirai, Sourav Singh, stefan-matcovici, steinfurt, Stéphane Couvreur, Stephan Tulkens, Stephen Cowley, Stephen Tierney, SylvainLan, th0rwas, theoptips, theotheo, Thierno Ibrahima DIOP, Thomas Edwards, Thomas J Fan, Thomas Moreau, Thomas Schmitt, Tilen Kusterle, Tim Bicker, Timsaur, Tim Staley, Tirth Patel, Tola A, Tom Augspurger, Tom Dupré la Tour, topisan, Trevor Stephens, ttang131, Urvang Patel, Vathsala Achar, veerlosar, Venkatachalam N, Victor Luzgin, Vincent Jeanselme, Vincent Lostanlen, Vladimir Korolev, vnherdeiro, Wenbo Zhao, Wendy Hu, willdarnell, William de Vazelhes, wolframalpha, xavier dupré, xcjason, x-martian, xsat, xun-tang, Yinglr, yokasre, Yu-Hang "Maxin" Tang, Yulia Zamriy, Zhao Feng