Version 0.23
In Development
In an effort to promote clear and non-ambiguous use of the library, most constructor and function parameters must now be passed as keyword arguments (i.e. using the param=value syntax) instead of positionally. To ease the transition, a FutureWarning is raised if a keyword-only parameter is used positionally. In version 0.25, these parameters will be strictly keyword-only, and a TypeError will be raised. #15005 by Joel Nothman, Adrin Jalali, Thomas Fan, and Nicolas Hug. See SLEP009 for more details.
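A minimal sketch of the preferred keyword style (LogisticRegression is chosen here purely as an illustration; the parameter values are invented):

```python
from sklearn.linear_model import LogisticRegression

# Preferred: pass parameters as keywords.
clf = LogisticRegression(penalty="l2", C=0.5)

# Passing the same parameters positionally, e.g. LogisticRegression("l2"),
# raises a FutureWarning in 0.23 and becomes a TypeError from 0.25 on.
```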
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
- ensemble.BaggingClassifier, ensemble.BaggingRegressor, and ensemble.IsolationForest
- cluster.KMeans with algorithm="elkan" and algorithm="full"
- cluster.Birch
- compose.ColumnTransformer.get_feature_names
- compose.ColumnTransformer.fit
- datasets.make_multilabel_classification
- decomposition.PCA with n_components='mle'
- decomposition.NMF and decomposition.non_negative_factorization with float32 dtype input
- decomposition.KernelPCA.inverse_transform
- ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor
- estimators_samples_ in ensemble.BaggingClassifier, ensemble.BaggingRegressor and ensemble.IsolationForest
- ensemble.StackingClassifier and ensemble.StackingRegressor with sample_weight
- gaussian_process.GaussianProcessRegressor
- linear_model.RANSACRegressor with sample_weight
- linear_model.RidgeClassifierCV
- metrics.mean_squared_error with squared and multioutput='raw_values'
- metrics.mutual_info_score with negative scores
- metrics.confusion_matrix with zero-length y_true and y_pred
- neural_network.MLPClassifier
- preprocessing.StandardScaler with partial_fit and sparse input
- preprocessing.Normalizer with norm='max'
- Any model using the svm.libsvm or the svm.liblinear solver, including svm.LinearSVC, svm.LinearSVR, svm.NuSVC, svm.NuSVR, svm.OneClassSVM, svm.SVC, svm.SVR, and linear_model.LogisticRegression
- tree.DecisionTreeClassifier, tree.ExtraTreeClassifier and ensemble.GradientBoostingClassifier, as well as the predict method of tree.DecisionTreeRegressor, tree.ExtraTreeRegressor, and ensemble.GradientBoostingRegressor, with read-only float32 input in predict, decision_path and predict_proba
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
sklearn.cluster

- The cluster.Birch implementation of the predict method avoids a high memory footprint by calculating the distances matrix using a chunked scheme. #16149 by Jeremie du Boisberranger <jeremiedbb> and Alex Shacked <alexshacked>.
- The critical parts of cluster.KMeans have a more optimized implementation. Parallelism is now over the data instead of over initializations, allowing better scalability. #11950 by Jeremie du Boisberranger <jeremiedbb>.
- cluster.KMeans now supports sparse data when solver = "elkan". #11950 by Jeremie du Boisberranger <jeremiedbb>.
- cluster.AgglomerativeClustering has a faster and more memory-efficient implementation of single linkage clustering. #11514 by Leland McInnes <lmcinnes>.
- cluster.KMeans with algorithm="elkan" now converges with tol=0 as with the default algorithm="full". #16075 by Erich Schubert <kno10>.
- Fixed a bug in cluster.Birch where the n_clusters parameter could not have a np.int64 type. #16484 by Jeremie du Boisberranger <jeremiedbb>.
- The n_jobs parameter of cluster.KMeans, cluster.SpectralCoclustering and cluster.SpectralBiclustering is deprecated. They now use OpenMP based parallelism. For more details on how to control the number of threads, please refer to our parallelism notes. #11950 by Jeremie du Boisberranger <jeremiedbb>.
- The precompute_distances parameter of cluster.KMeans is deprecated. It has no effect. #11950 by Jeremie du Boisberranger <jeremiedbb>.
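Since parallelism is now handled internally via OpenMP, KMeans no longer needs an n_jobs argument. A minimal usage sketch (toy data invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; parallelism over the data happens internally,
# so no n_jobs argument is passed.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```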
sklearn.compose

- compose.ColumnTransformer is now faster when working with dataframes and strings are used to specify subsets of the data for transformers. #16431 by Thomas Fan.
- The compose.ColumnTransformer method get_feature_names now supports 'passthrough' columns, with the feature name being either the column name for a dataframe, or 'xi' for column index i. #14048 by Lewis Ball <lrjball>.
- The compose.ColumnTransformer method get_feature_names now returns correct results when one of the transformer steps applies on an empty list of columns. #15963 by Roman Yurchak.
- compose.ColumnTransformer.fit will error when selecting a column name that is not unique in the dataframe. #16431 by Thomas Fan.
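A brief sketch of selecting dataframe columns by name in a ColumnTransformer (toy data invented for illustration; the feature-name API differs across versions, so it is not shown here):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Select the transformed subset by column name; remaining columns
# are passed through unchanged.
ct = ColumnTransformer([("scale", StandardScaler(), ["a"])],
                       remainder="passthrough")
out = ct.fit_transform(df)
```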
sklearn.datasets

- datasets.fetch_openml has reduced memory usage because it no longer stores the full dataset text stream in memory. #16084 by Joel Nothman.
- datasets.fetch_california_housing now supports heterogeneous data using pandas by setting as_frame=True. #15950 by Stephanie Andrews <gitsteph> and Reshama Shaikh <reshamas>.
- The embedded dataset loaders load_breast_cancer, load_diabetes, load_digits, load_iris, load_linnerud and load_wine now support loading as a pandas DataFrame by setting as_frame=True. #15980 by wconnell and Reshama Shaikh <reshamas>.
- Added a return_centers parameter to datasets.make_blobs, which can be used to return the centers for each cluster. #15709 by shivamgargsya and Venkatachalam N <venkyyuvy>.
- The functions datasets.make_circles and datasets.make_moons now accept a two-element tuple. #15707 by Maciej J Mikulski <mjmikulski>.
- datasets.make_multilabel_classification now raises a ValueError for arguments n_classes < 1 or length < 1. #16006 by Rushabh Vasani <rushabh-v>.
- The StreamHandler was removed from sklearn.logger to avoid double logging of messages in common cases where a handler is attached to the root logger, and to follow the Python logging documentation recommendation for libraries to leave log message handling to users and application code. #16451 by Christoph Deil <cdeil>.
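A short sketch of the two new dataset conveniences, as_frame=True and return_centers=True:

```python
from sklearn.datasets import load_iris, make_blobs

# as_frame=True returns the data as a pandas DataFrame
# (features plus the target column).
iris = load_iris(as_frame=True)

# return_centers=True additionally returns the true cluster centers.
X, y, centers = make_blobs(n_samples=30, centers=3,
                           return_centers=True, random_state=0)
```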
sklearn.decomposition

- decomposition.NMF and decomposition.non_negative_factorization now preserve float32 dtype. #16280 by Jeremie du Boisberranger <jeremiedbb>.
- TruncatedSVD.transform is now faster on given sparse csc matrices. #16837 by wornbb.
- decomposition.PCA with a float n_components parameter will exclusively choose the components that explain the variance greater than n_components. #15669 by Krishna Chaitanya <krishnachaitanya9>.
- decomposition.PCA with n_components='mle' now correctly handles small eigenvalues, and does not infer 0 as the correct number of components. #16224 by Lisa Schwetlick <lschwetlick>, Gelavizh Ahmadi <gelavizh1> and Marija Vlajic Wheeler <marijavlajic>, and #16841 by Nicolas Hug.
- The decomposition.KernelPCA method inverse_transform now applies the correct inverse transform to the transformed data. #16655 by Lewis Ball <lrjball>.
- Fixed a bug that was causing decomposition.KernelPCA to sometimes raise "invalid value encountered in multiply" during fit. #16718 by Gui Miotto <gui-miotto>.
- Added an n_components_ attribute to decomposition.SparsePCA and decomposition.MiniBatchSparsePCA. #16981 by Mateusz Górski <Reksbril>.
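A sketch of the float n_components behaviour: PCA keeps the smallest number of components whose cumulative explained variance exceeds the given fraction (random data invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5)

# Keep just enough components to explain more than 90% of the variance.
pca = PCA(n_components=0.9).fit(X)
```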
sklearn.ensemble

- ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now support sample_weight. #14696 by Adrin Jalali and Nicolas Hug.
- Early stopping in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor is now determined with a new early_stopping parameter instead of n_iter_no_change. Its default value is 'auto', which enables early stopping if there are at least 10,000 samples in the training set. #14516 by Johann Faouzi <johannfaouzi>.
- ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now support monotonic constraints, useful when features are supposed to have a positive/negative effect on the target. #15582 by Nicolas Hug.
- Added a boolean verbose flag to ensemble.VotingClassifier and ensemble.VotingRegressor. #16069 by Sam Bail <spbail>, Hanna Bruce MacDonald <hannahbrucemacdonald>, Reshama Shaikh <reshamas>, and Chiara Marmo <cmarmo>.
- Fixed a bug in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor that would not respect the max_leaf_nodes parameter if the criterion was reached at the same time as the max_depth criterion. #16183 by Nicolas Hug.
- Changed the convention for the max_depth parameter of ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor. The depth now corresponds to the number of edges to go from the root to the deepest leaf. Stumps (trees with one split) are now allowed. #16182 by Santhosh B <santhoshbala18>.
- Fixed a bug in ensemble.BaggingClassifier, ensemble.BaggingRegressor and ensemble.IsolationForest where the attribute estimators_samples_ did not generate the proper indices used during fit. #16437 by Jin-Hwan CHO <chofchof>.
- Fixed a bug in ensemble.StackingClassifier and ensemble.StackingRegressor where the sample_weight argument was not being passed to cross_val_predict when evaluating the base estimators on cross-validation folds to obtain the input to the meta estimator. #16539 by Bill DeRose <wderose>.
- Added an additional option loss="poisson" to ensemble.HistGradientBoostingRegressor, which adds Poisson deviance with log-link, useful for modeling count data. #16692 by Christian Lorentzen <lorentzenchr>.
- Fixed a bug where ensemble.HistGradientBoostingRegressor and ensemble.HistGradientBoostingClassifier would fail with multiple calls to fit when warm_start=True, early_stopping=True, and there is no validation set. #16663 by Thomas Fan.
sklearn.feature_extraction

- feature_extraction.text.CountVectorizer now sorts features after pruning them by document frequency. This improves performance for datasets with large vocabularies combined with min_df or max_df. #15834 by Santiago M. Mola <smola>.
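For context, a small sketch of document-frequency pruning (toy corpus invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "banana cherry", "banana durian banana"]

# min_df=2 prunes terms appearing in fewer than two documents;
# the surviving vocabulary is kept in sorted order.
vec = CountVectorizer(min_df=2).fit(docs)
```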
sklearn.feature_selection

- Added support for multioutput data in feature_selection.RFE and feature_selection.RFECV. #16103 by Divyaprabha M <divyaprabha123>.
- Added feature_selection.SelectorMixin back to the public API. #16132 by trimeta.
sklearn.gaussian_process

- gaussian_process.kernels.Matern returns the RBF kernel when nu=np.inf. #15503 by Sam Dixon <sam-dixon>.
- Fixed a bug in gaussian_process.GaussianProcessRegressor that caused predicted standard deviations to only be between 0 and 1 when WhiteKernel is not used. #15782 by plgreenLIRU.
sklearn.impute

- impute.IterativeImputer accepts both scalar and array-like inputs for max_value and min_value. Array-like inputs allow a different max and min to be specified for each feature. #16403 by Narendra Mukherjee <narendramukherjee>.
- impute.SimpleImputer, impute.KNNImputer, and impute.IterativeImputer accept pandas' nullable integer dtype with missing values. #16508 by Thomas Fan.
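A minimal sketch of imputing a pandas nullable-integer column (IterativeImputer is omitted because it requires an experimental import; the data is invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A nullable Int64 column with a pd.NA missing value.
df = pd.DataFrame({"x": pd.array([1, 2, None, 5], dtype="Int64")})

# The imputer treats pd.NA like np.nan and fills it with the column mean.
out = SimpleImputer(strategy="mean").fit_transform(df)
```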
sklearn.inspection

- inspection.partial_dependence and inspection.plot_partial_dependence now support the fast 'recursion' method for ensemble.RandomForestRegressor and tree.DecisionTreeRegressor. #15864 by Nicolas Hug.
sklearn.linear_model

- Added generalized linear models (GLM) with non-normal error distributions, including linear_model.PoissonRegressor, linear_model.GammaRegressor and linear_model.TweedieRegressor, which use Poisson, Gamma and Tweedie distributions respectively. #14300 by Christian Lorentzen <lorentzenchr>, Roman Yurchak, and Olivier Grisel.
- Added support for sample_weight in linear_model.ElasticNet and linear_model.Lasso for dense feature matrix X. #15436 by Christian Lorentzen <lorentzenchr>.
- linear_model.RidgeCV and linear_model.RidgeClassifierCV no longer allocate a potentially large array to store dual coefficients for all hyperparameters during fit, nor an array to store all error or LOO predictions unless store_cv_values is True. #15652 by Jérôme Dockès <jeromedockes>.
- linear_model.LassoLars and linear_model.Lars now support a jitter parameter that adds random noise to the target. This might help with stability in some edge cases. #15179 by angelaambroz.
- Fixed a bug where, if a sample_weight parameter was passed to the fit method of linear_model.RANSACRegressor, it would not be passed to the wrapped base_estimator during the fitting of the final model. #15773 by Jeremy Alexandre <J-A16>.
- Added a best_score_ attribute to linear_model.RidgeCV and linear_model.RidgeClassifierCV. #15653 by Jérôme Dockès <jeromedockes>.
- Fixed a bug in linear_model.RidgeClassifierCV to pass a specific scoring strategy. Before, the internal estimator output scores instead of predictions. #14848 by Venkatachalam N <venkyyuvy>.
- linear_model.LogisticRegression will now avoid an unnecessary iteration when solver='newton-cg' by checking for inferior or equal instead of strictly inferior for the maximum of absgrad and tol in utils.optimize._newton_cg. #16266 by Rushabh Vasani <rushabh-v>.
- Deprecated the public attributes standard_coef_, standard_intercept_, average_coef_, and average_intercept_ in linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.PassiveAggressiveClassifier, and linear_model.PassiveAggressiveRegressor. #16261 by Carlos Brandt <chbrandt>.
- linear_model.ARDRegression is more stable and much faster when n_samples > n_features. It can now scale to hundreds of thousands of samples. The stability fix might imply changes in the number of non-zero coefficients and in the predicted output. #16849 by Nicolas Hug.
- Fixed a bug in linear_model.ElasticNetCV, linear_model.MultiTaskElasticNetCV, linear_model.LassoCV and linear_model.MultiTaskLassoCV where fitting would fail when using the joblib loky backend. #14264 by Jérémie du Boisberranger <jeremiedbb>.
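A sketch of the new GLM estimators on invented count data (the alpha and max_iter values are arbitrary choices for the example):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(1.0 + 2.0 * X[:, 0]))

# PoissonRegressor uses a log link, so its predictions are always
# strictly positive, which suits count-valued targets.
reg = PoissonRegressor(alpha=1e-3, max_iter=300).fit(X, y)
pred = reg.predict(X)
```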
sklearn.metrics

- metrics.pairwise.pairwise_distances_chunked now allows its reduce_func to not have a return value, enabling in-place operations. #16397 by Joel Nothman.
- Fixed a bug in metrics.mean_squared_error to not ignore the argument squared when argument multioutput='raw_values'. #16323 by Rushabh Vasani <rushabh-v>.
- Fixed a bug in metrics.mutual_info_score where negative scores could be returned. #16362 by Thomas Fan.
- Fixed a bug in metrics.confusion_matrix that would raise an error when y_true and y_pred were length zero and labels was not None. In addition, we raise an error when an empty list is given to the labels parameter. #16442 by Kyle Parsons <parsons-kyle-89>.
- Changed the formatting of values in metrics.ConfusionMatrixDisplay.plot and metrics.plot_confusion_matrix to pick the shorter format (either '2g' or 'd'). #16159 by Rick Mackenbach <Rick-Mackenbach> and Thomas Fan.
- From version 0.25, metrics.pairwise.pairwise_distances will no longer automatically compute the VI parameter for Mahalanobis distance and the V parameter for seuclidean distance if Y is passed. The user will be expected to compute this parameter on the training data of their choice and pass it to pairwise_distances. #16993 by Joel Nothman.
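A sketch of the per-output errors that the multioutput='raw_values' path returns (toy arrays invented for illustration; the squared flag from this release was later superseded by a dedicated root_mean_squared_error function in newer scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 3.0], [3.0, 6.0]])

# multioutput='raw_values' returns one error per output column:
# the first column matches exactly, the second does not.
mse = mean_squared_error(y_true, y_pred, multioutput="raw_values")
```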
sklearn.model_selection

- model_selection.GridSearchCV and model_selection.RandomizedSearchCV yield stack trace information in fit-failed warning messages, in addition to the previously emitted type and details. #15622 by Gregory Morse <GregoryMorse>.
- model_selection.cross_val_predict supports method="predict_proba" when y=None. #15918 by Luca Kubin <lkubin>.
- model_selection.fit_grid_point is deprecated in 0.23 and will be removed in 0.25. #16401 by Arie Pratama Sutiono <ariepratama>.
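For context, a sketch of cross_val_predict with method="predict_proba" (shown here with a labelled dataset for simplicity; the changelog entry above additionally covers the y=None case):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# method="predict_proba" returns out-of-fold class probabilities,
# one column per class.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=3, method="predict_proba")
```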
sklearn.multioutput

- multioutput.RegressorChain now supports fit_params for base_estimator during fit. #16111 by Venkatachalam N <venkyyuvy>.
sklearn.naive_bayes

- A correctly formatted error message is shown in naive_bayes.CategoricalNB when the number of features in the input differs between predict and fit. #16090 by Madhura Jayaratne <madhuracj>.
sklearn.neural_network

- neural_network.MLPClassifier and neural_network.MLPRegressor have a reduced memory footprint when using stochastic solvers, 'sgd' or 'adam', and shuffle=True. #14075 by meyer89.
- Increased the numerical stability of the logistic loss function in neural_network.MLPClassifier by clipping the probabilities. #16117 by Thomas Fan.
sklearn.inspection

- inspection.PartialDependenceDisplay now exposes the deciles lines as attributes so they can be hidden or customized. #15785 by Nicolas Hug.
sklearn.preprocessing

- The argument drop of preprocessing.OneHotEncoder will now accept the value 'if_binary' and will drop the first category of each feature with two categories. #16245 by Rushabh Vasani <rushabh-v>.
- preprocessing.OneHotEncoder's drop_idx_ ndarray can now contain None, where drop_idx_[i] = None means that no category is dropped for index i. #16585 by Chiara Marmo <cmarmo>.
- preprocessing.MaxAbsScaler, preprocessing.MinMaxScaler, preprocessing.StandardScaler, preprocessing.PowerTransformer, preprocessing.QuantileTransformer, and preprocessing.RobustScaler now support pandas' nullable integer dtype with missing values. #16508 by Thomas Fan.
- preprocessing.OneHotEncoder is now faster at transforming. #15762 by Thomas Fan.
- Fixed a bug in preprocessing.StandardScaler which was incorrectly computing statistics when calling partial_fit on sparse inputs. #16466 by Guillaume Lemaitre <glemaitre>.
- Fixed a bug in preprocessing.Normalizer with norm='max', which was not taking the absolute value of the maximum values before normalizing the vectors. #16632 by Maura Pintor <Maupin1991> and Battista Biggio <bbiggio>.
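A sketch of drop='if_binary' together with the None-valued drop_idx_ entries (toy categories invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a", "x"], ["b", "y"], ["a", "z"]], dtype=object)

# The binary first column drops its first category (1 output column);
# the three-category second column keeps all categories (3 columns).
enc = OneHotEncoder(drop="if_binary").fit(X)
out = enc.transform(X).toarray()
```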
sklearn.semi_supervised

- semi_supervised.LabelSpreading and semi_supervised.LabelPropagation avoid divide-by-zero warnings when normalizing label_distributions_. #15946 by ngshya.
sklearn.svm

- Improved the libsvm and liblinear random number generators used to randomly select coordinates in the coordinate descent algorithms. The platform-dependent C rand() was used, which is only able to generate numbers up to 32767 on Windows (see this blog post) and also has poor randomization power, as suggested by this presentation. It was replaced with C++11 mt19937, a Mersenne Twister that correctly generates 31-bit/63-bit random numbers on all platforms. In addition, the crude "modulo" postprocessor used to get a random number in a bounded interval was replaced by the tweaked Lemire method, as suggested by this blog post. Any model using the svm.libsvm or the svm.liblinear solver, including svm.LinearSVC, svm.LinearSVR, svm.NuSVC, svm.NuSVR, svm.OneClassSVM, svm.SVC, svm.SVR, and linear_model.LogisticRegression, is affected. In particular, users can expect better convergence when the number of samples (LibSVM) or the number of features (LibLinear) is large. #13511 by Sylvain Marié <smarie>.
- Fixed the use of custom kernels not taking float entries, such as string kernels, in svm.SVC and svm.SVR. Note that custom kernels are now expected to validate their input where they previously received valid numeric arrays. #11296 by Alexandre Gramfort and Georgi Peev <georgipeev>.
- The svm.SVR and svm.OneClassSVM attributes probA_ and probB_ are now deprecated as they were not useful. #15558 by Thomas Fan.
sklearn.tree

- The tree.plot_tree rotate parameter was unused and has been deprecated. #15806 by Chiara Marmo <cmarmo>.
- Fixed support of read-only float32 array input in the predict, decision_path and predict_proba methods of tree.DecisionTreeClassifier, tree.ExtraTreeClassifier and ensemble.GradientBoostingClassifier, as well as the predict method of tree.DecisionTreeRegressor, tree.ExtraTreeRegressor, and ensemble.GradientBoostingRegressor. #16331 by Alexandre Batisse <batalex>.
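A sketch of predicting on a read-only float32 array, such as one obtained from a memory-mapped file (the write flag is cleared manually here to simulate that case):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
y = [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Simulate a read-only input, e.g. a joblib memory-mapped array.
X_ro = X.copy()
X_ro.setflags(write=False)
pred = clf.predict(X_ro)
```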
sklearn.utils

- Improved the error message in utils.validation.column_or_1d. #15926 by Loïc Estève <lesteve>.
- Added a warning in utils.check_array for pandas sparse DataFrames. #16021 by Rushabh Vasani <rushabh-v>.
- utils.check_array now constructs a sparse matrix from a pandas DataFrame that contains only SparseArray columns. #16728 by Thomas Fan.
- utils.validation.check_array supports pandas' nullable integer dtype with missing values when force_all_finite is set to False or 'allow-nan', in which case the data is converted to floating point values where pd.NA values are replaced by np.nan. As a consequence, all sklearn.preprocessing transformers that accept numeric inputs with missing values represented as np.nan now also accept being directly fed pandas dataframes with pd.Int* or pd.UInt* typed columns that use pd.NA as a missing value marker. #16508 by Thomas Fan.
- Passing classes to utils.estimator_checks.check_estimator and utils.estimator_checks.parametrize_with_checks is now deprecated, and support for classes will be removed in 0.24. Pass instances instead. #17032 by Nicolas Hug.
- utils.all_estimators now only returns public estimators. #15380 by Thomas Fan.
sklearn.cluster

- cluster.AgglomerativeClustering now raises a specific error when the distance matrix is not square and affinity='precomputed'. #16257 by Simona Maggio <simonamaggio>.
Miscellaneous

- scikit-learn now works with mypy without errors. #16726 by Roman Yurchak.
- Most estimators now expose a n_features_in_ attribute. This attribute is equal to the number of features passed to the fit method. See SLEP010 for details. #16112 by Nicolas Hug.
- Estimators now have a requires_y tag which is False by default except for estimators that inherit from ~sklearn.base.RegressorMixin or ~sklearn.base.ClassifierMixin. This tag is used to ensure that a proper error message is raised when y was expected but None was passed. #16622 by Nicolas Hug.
- The default setting print_changed_only has been changed from False to True. This means that the repr of estimators is now more concise and only shows the parameters whose default value has been changed when printing an estimator. You can restore the previous behaviour by using sklearn.set_config(print_changed_only=False). Also, note that it is always possible to quickly inspect the parameters of any estimator using est.get_params(deep=False). #17061 by Nicolas Hug.
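A short sketch of two of these conveniences, n_features_in_ and the concise repr (LinearRegression chosen only as an illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.zeros((10, 3))
y = np.zeros(10)
reg = LinearRegression().fit(X, y)

# n_features_in_ records how many features fit received.
n = reg.n_features_in_

# With print_changed_only=True (the new default), an estimator with
# all-default parameters prints with no arguments at all.
r = repr(LinearRegression())
```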
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, including: