- Fixed packaging (986)
- Compatibility with Python 3.10
- Dropped support for Python 3.7
- Compatibility with scikit-learn 1.2.0 and newer
- Compatibility with scikit-learn 1.1 and newer (910)
- Fixed a regression in meta inference for wrappers when the base estimator returned a scipy.sparse matrix (889)
- Meta-estimators like wrappers.ParallelPostFit now work with cuDF and CuPy objects (862)
- Fixed incompatibility with new Dask optimizations in wrappers.ParallelPostFit (878)
- Added support for scikit-learn 1.0.0. scikit-learn 1.0.0 is now the minimum supported version.
- LogisticRegression.predict_proba now correctly returns an (n, 2) array for binary classification (760)
- Fixed multioutput behavior to be consistent with scikit-learn (820)
- Added MAPE to regression metrics (822)
- NumPy 1.20 compatibility (784)
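For reference, MAPE (mean absolute percentage error) averages the absolute error relative to each true value. The sketch below only illustrates that formula in plain Python. The name mape_sketch is hypothetical, and this is not dask-ml's implementation, which operates on Dask arrays.

```python
def mape_sketch(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Assumption: no true value is zero; library implementations guard this case.
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Errors of 10%, 10% and 0% average to roughly 6.7%.
error = mape_sketch([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```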
- Compatibility with scikit-learn 0.24
- Improved documentation for working with PyTorch models, see pytorch (699)
- Improved documentation for working with Keras / TensorFlow models, see keras (713)
- Fixed handling of remote vocabularies in dask_ml.feature_extraction.text.HashingVectorizer (719)
- Added dask_ml.metrics.regression.mean_squared_log_error (725)
- Allow user-provided categories in dask_ml.preprocessing.OneHotEncoder (727)
- Added dask_ml.linear_model.LogisticRegression.decision_function (728)
- Added compute argument to dask_ml.decomposition.TruncatedSVD (743)
- Fixed sign stability in incremental PCA (742)
- Improved documentation for RandomizedSearchCV
- Improved logging in dask_ml.cluster.KMeans (688)
- Added support for dask.dataframe objects in dask_ml.model_selection.HyperbandSearchCV (701)
- Added squared=True option to dask_ml.metrics.mean_squared_error (707)
- Added dask_ml.feature_extraction.text.CountVectorizer (705)
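As a rough illustration of the squared option, the plain-Python sketch below assumes it follows the scikit-learn convention: squared=True returns the mean squared error and squared=False returns its square root (RMSE). The function name mean_squared_error_sketch is hypothetical and stands in for the real metric.

```python
import math


def mean_squared_error_sketch(y_true, y_pred, squared=True):
    """Return MSE when squared=True, RMSE when squared=False."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


mse = mean_squared_error_sketch([3.0, 5.0], [1.0, 5.0])                  # (4 + 0) / 2
rmse = mean_squared_error_sketch([3.0, 5.0], [1.0, 5.0], squared=False)  # sqrt of the above
```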
- Support for Python 3.8 (669)
- Compatibility with scikit-learn 0.23.0 (669)
- scikit-learn 0.23.0 or newer is now required (669)
- Removed previously deprecated Partial classes. Use dask_ml.wrappers.Incremental instead (674)
- Added dask_ml.decomposition.IncrementalPCA for out-of-core / distributed incremental PCA (619)
- Improved logging and monitoring in incremental model selection (528)
- Added dask_ml.ensemble.BlockwiseVotingClassifier and dask_ml.ensemble.BlockwiseVotingRegressor for blockwise training and ensemble prediction (657)
- Improved documentation for hyper-parameter-search (432)
- Added shuffle support to dask_ml.model_selection.train_test_split for DataFrame input (625)
- Improved performance of dask_ml.model_selection.GridSearchCV by re-using cached tasks (622)
- Added support for DataFrame to dask_ml.model_selection.GridSearchCV (612)
- Fixed dask_ml.linear_model.LinearRegression.score to use r2_score rather than mse (614)
- Handle missing data in dask_ml.preprocessing.StandardScaler (608)
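The switch from mse to r2_score in LinearRegression.score matters because the two run in opposite directions: a lower MSE is better, while R² is 1.0 for a perfect fit and 0.0 for a model that always predicts the mean. The sketch below shows the R² formula in plain Python. The name r2_score_sketch is hypothetical and this is not the library's code.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: y_true is not constant, so ss_tot is non-zero.
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_sketch(y_true, y_true)        # predictions match exactly
baseline = r2_score_sketch(y_true, [2.5] * 4)    # always predict the mean
```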
- Changed the name of the second positional argument in model_selection.IncrementalSearchCV from param_distribution to parameters to match the name of the base class.
- Compatibility with scikit-learn 0.22.1.
- Added dask_ml.preprocessing.BlockTransformer, an extension of scikit-learn's FunctionTransformer (366).
- Added dask_ml.feature_extraction.FeatureHasher, which is similar to scikit-learn's implementation.
- Fixed an issue with the 1.1.0 wheel (575)
- Make svd_flip work even when arrays are read-only (592)
- Non-arrays (e.g. Dask Bags and DataFrames) are now allowed in dask_ml.wrappers.Incremental. This is useful for text classification pipelines (570)
- The index is now preserved in dask_ml.preprocessing.PolynomialFeatures for DataFrame inputs (563)
- dask_ml.decomposition.PCA now works with DataFrame inputs (543)
- dask_ml.cluster.KMeans handles inputs where some blocks are length-0 (559)
- Improved error reporting for mixed inputs to dask_ml.model_selection.train_test_split (552)
- Removed deprecated dask_ml.joblib module. Use joblib.parallel_backend instead (545)
- dask_ml.preprocessing.QuantileTransformer now handles DataFrame input (533)
- Added new meta-estimators for hyperparameter search on distributed datasets: dask_ml.model_selection.HyperbandSearchCV and dask_ml.model_selection.SuccessiveHalvingSearchCV
- Dropped Python 2 support (500)
- Compatibility with scikit-learn 0.21.1
- Cross-validation results in GridSearchCV and RandomizedSearchCV are now gathered as completed, in case a worker is lost (433)
- Fixed bug in dask_ml.model_selection.train_test_split when only one of train / test size is provided (502)
- Consistent random state for dask_ml.model_selection.IncrementalSearchCV
- Fixed various issues with 32-bit Windows builds (487)
Note: dask-ml 0.13.0 will be the last release to support Python 2.
- dask_ml.model_selection.IncrementalSearchCV now returns Dask objects for post-fit methods like .predict, etc. (423).
Note that this version of Dask-ML requires scikit-learn >= 0.20.0.
- Added dask_ml.model_selection.IncrementalSearchCV, a meta-estimator for hyperparameter optimization on larger-than-memory datasets (356). See hyperparameter.incremental for more.
- Added dask_ml.preprocessing.PolynomialFeatures, a drop-in replacement for the scikit-learn version (347).
- Added auto-rechunking to Dask Arrays with more than one block along the features in dask_ml.wrappers.ParallelPostFit (376)
- Added support for Dask DataFrame inputs to dask_ml.cluster.KMeans (390)
- Added a compute keyword to dask_ml.wrappers.ParallelPostFit.score to support lazily evaluating a model's score (402)
- Changed dask_ml.wrappers.ParallelPostFit to automatically rechunk input arrays to methods like predict when they have more than one block along the features (376).
- Fixed a bug in dask_ml.impute.SimpleImputer with Dask DataFrames filling the count of the most frequent item, rather than the item itself (385).
- Fixed a bug in dask_ml.model_selection.ShuffleSplit returning the same split when the random_state was set (380).
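The SimpleImputer fix above turns on the difference between the most frequent item and how many times it occurs. The standard-library sketch below shows that distinction only. It does not reproduce dask-ml's implementation, and the sample data is made up for illustration.

```python
from collections import Counter

# Hypothetical observed values for a categorical column.
values = ["a", "b", "a", "c", "a", "b"]

# most_common(1) yields a (value, count) pair; mixing the two up is the
# kind of mistake the changelog entry describes.
most_common_value, most_common_count = Counter(values).most_common(1)[0]

# Missing entries should be filled with the item ("a"), not its count (3).
column = ["b", None, "c", None]
filled = [most_common_value if v is None else v for v in column]
```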
- Added support for dask.dataframe.DataFrame to dask_ml.model_selection.train_test_split (351)
- Added dask_ml.model_selection.ShuffleSplit (340)
- Fixed handling of errors in the predict and score steps of dask_ml.model_selection.GridSearchCV and dask_ml.model_selection.RandomizedSearchCV (339)
- Compatibility with Dask 0.18 for dask_ml.preprocessing.LabelEncoder (you'll also notice improved performance) (336).
- Added a roadmap. Please open an issue if you'd like something to be included on the roadmap. (322)
- Added many examples to the documentation and the dask examples binder.
We're now using Numba for performance-sensitive parts of Dask-ML. Dask-ML is now a pure-python project, so we can provide universal wheels.
- Automatically replace default scikit-learn scorers with dask-aware versions in Incremental (200)
- Added the dask_ml.metrics.log_loss loss function and neg_log_loss scorer (318)
- Fixed handling of array-like fit-parameters to GridSearchCV and BaseSearchCV (320)
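To show how the log_loss function and the neg_log_loss scorer relate, here is a plain-Python sketch of binary log loss. The scorer is the negated loss because scorers follow a "higher is better" convention. The name log_loss_sketch and the eps clipping value are assumptions for illustration, not dask-ml's code.

```python
import math


def log_loss_sketch(y_true, p_pred, eps=1e-15):
    """Binary log loss: mean of -[y * log(p) + (1 - y) * log(1 - p)]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


loss = log_loss_sketch([1, 0], [0.5, 0.5])  # equals ln(2) for coin-flip predictions
neg_loss = -loss                            # what a "neg_log_loss" style scorer reports
```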
- Fixed dtype in LabelEncoder.fit_transform to be integer, rather than the dtype of the classes for dask arrays (311)
- Added sample_weight support for dask_ml.metrics.accuracy_score. (217)
- Improved performance of training on dask_ml.cluster.SpectralClustering (152)
- Added dask_ml.preprocessing.LabelEncoder. (226)
- Fixed issue in model_selection meta-estimators not respecting the default Dask scheduler (260)
- Removed the basis_inds_ attribute from dask_ml.cluster.SpectralClustering as it's no longer used (152)
- Changed dask_ml.wrappers.Incremental.fit to clone the underlying estimator before training (258). This induces a few changes:
  - The underlying estimator no longer gives access to learned attributes like coef_. We recommend using Incremental.coef_.
  - State no longer leaks between successive fit calls. Note that Incremental.partial_fit is still available if you want state, like learned attributes or random seeds, to be re-used. This is useful if you're making multiple passes over the training data.
- Changed get_params and set_params for dask_ml.wrappers.Incremental to no longer magically get / set parameters for the underlying estimator (258). To specify parameters for the underlying estimator, use the double-underscore prefix convention established by scikit-learn: inc.set_params(estimator__alpha=10)
Dask-SearchCV is now being developed in the dask/dask-ml repository. Users who previously installed dask-searchcv should now just install dask-ml.
- Fixed random seed generation on 32-bit platforms (230)
- Removed the get keyword from the incremental learner fit methods. (187)
- Deprecated the various Partial* estimators in favor of the dask_ml.wrappers.Incremental meta-estimator (190)
- Added a new meta-estimator dask_ml.wrappers.Incremental for wrapping any estimator with a partial_fit method. See incremental.blockwise-metaestimator for more. (190)
- Added an R2-score metric dask_ml.metrics.r2_score.
- The n_samples_seen_ attribute on dask_ml.preprocessing.StandardScaler is now consistently numpy.nan (157).
- Changed the algorithm for dask_ml.datasets.make_blobs, dask_ml.datasets.make_regression and dask_ml.datasets.make_classification to reduce the single-machine peak memory usage (67)
- Added dask_ml.model_selection.train_test_split and dask_ml.model_selection.ShuffleSplit (172)
- Added dask_ml.metrics.classification_score, dask_ml.metrics.mean_absolute_error, and dask_ml.metrics.mean_squared_error.
- dask_ml.preprocessing.StandardScaler now works on DataFrame inputs (157).
This release added several new estimators.
Scale features using statistics that are robust to outliers. This mirrors sklearn.preprocessing.RobustScaler (62).
Encodes categorical features as ordinal, in one ordered feature (119).
A meta-estimator for fitting with any scikit-learn estimator, but post-processing (predict, transform, etc.) in parallel on dask arrays. See parallel-meta-estimators for more (132).
- Changed the arguments of the dask-glm based estimators in dask_glm.linear_model to match scikit-learn's API (94). This affects the LinearRegression, LogisticRegression and PoissonRegression estimators.
  - To specify lambduh use C = 1.0 / lambduh (the default of 1.0 is unchanged)
  - The rho, over_relax, abstol and reltol arguments have been removed. Provide them in solver_kwargs instead.
- Accept dask.dataframe for dask-glm based estimators (84).
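The C = 1.0 / lambduh relationship noted above can be sketched as a small conversion helper for migrating old call sites. The helper name lambduh_to_C and the solver option values are hypothetical; only the reciprocal relationship and the solver_kwargs argument come from the notes above.

```python
def lambduh_to_C(lambduh):
    """Convert the old regularization strength to the scikit-learn style C."""
    if lambduh <= 0:
        raise ValueError("lambduh must be positive")
    return 1.0 / lambduh


# The default is unchanged: lambduh=1.0 maps to C=1.0.
default_C = lambduh_to_C(1.0)

# Stronger regularization (larger lambduh) means a smaller C.
strong_C = lambduh_to_C(4.0)

# Removed solver arguments are now grouped in a dict (illustrative values),
# to be passed along the lines of LogisticRegression(C=strong_C, solver_kwargs=solver_kwargs).
solver_kwargs = {"rho": 1.0, "abstol": 1e-4, "reltol": 1e-2}
```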
- Added dask_ml.decomposition.TruncatedSVD and dask_ml.decomposition.PCA (78)
- Added KMeans.predict (83)
- Changed the fitted attributes on MinMaxScaler and StandardScaler to be concrete NumPy or pandas objects, rather than persisted dask objects (75).