modified cross_validation.rst file with a k-fold cross validation example #994

Open
wants to merge 6 commits into
base: main
41 changes: 41 additions & 0 deletions docs/source/cross_validation.rst
@@ -24,6 +24,47 @@ The interface for splitting Dask arrays is the same as scikit-learn's version.

X_train.compute()[:3]

The following example performs k-fold cross validation purely in Dask. See the `KFold documentation <https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html>`_ for more information on k-fold cross validation.

.. ipython:: python

import dask.array as da
from dask_ml.model_selection import KFold
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression
from statistics import mean

X, y = make_regression(n_samples=200,   # number of observations
                       n_features=5,    # number of features
                       random_state=0,  # random seed
                       chunks=20)       # rows per chunk (10 chunks in total)

train_scores: list[float] = []
test_scores: list[float] = []

model = LinearRegression()

Dask-ML's ``KFold`` splits the data into k consecutive folds. Here we set k to 5, which gives 5-fold cross validation.

.. ipython:: python

kf = KFold(n_splits=5)

for i, j in kf.split(X):
Member

Looks like the doc build failed here: https://github.com/dask/dask-ml/actions/runs/9434455032/job/26270343032?pr=994

Is that from the lack of a newline after the ipython directive?

Author

@npk7 npk7 Jun 18, 2024

thank you for spotting that, added a new line

    X_train, X_test = X[i], X[j]
    y_train, y_test = y[i], y[j]

    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    train_scores.append(train_score)
    test_scores.append(test_score)

print("mean training score:", mean(train_scores))
print("mean testing score:", mean(test_scores))
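
To make the idea of "k consecutive folds" more concrete, here is a minimal pure-Python sketch of how consecutive train/test index sets can be generated. It illustrates the general technique only and is not Dask-ML's actual implementation; the helper name ``consecutive_kfold_indices`` is hypothetical and does not exist in Dask-ML.

```python
def consecutive_kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) index lists for k consecutive folds.

    Hypothetical helper for illustration only; not part of Dask-ML.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")

    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        # Everything outside the current test window is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


for train_idx, test_idx in consecutive_kfold_indices(n_samples=10, n_splits=5):
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so every observation is used for evaluation once and for training in the remaining k - 1 rounds.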



While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend
using the Dask version for performance reasons: the Dask version is faster