Using scikit-learn estimators with Dask-ML helps scale to large datasets. All the computation remains on scikit-learn's shoulders but the data management is handled by Dask. This allows scaling to large datasets distributed across many machines, or to datasets that do not fit in memory.

This example shows a particular wrapping of a scikit-learn object with `Incremental`. There are other algorithms inside Dask-ML that follow the scikit-learn API but take advantage of Dask's distributed architecture for the optimization algorithms, especially with clustering or matrix decomposition (e.g., spectral clustering and PCA respectively).

This example will show

* wrapping a scikit-learn estimator that implements `partial_fit` with `dask_ml`
* training, predicting, and scoring with this wrapped classifier
* integration with other parts of scikit-learn (e.g., with `GridSearchCV`)

## Getting started with Dask
First, we create the distributed scheduler:

In [None]:
from distributed import Client

client = Client()
client

## Data creation
We care about whether the plumbing for parallel or distributed training is in place on somewhat realistic datasets, not about any particular dataset. This means that we will create synthetic data.

Let's create some synthetic data that is small enough to run in a reasonable time for you but still realistically large. The (synthetic) dataset we create has 100,000 examples and 100 features.



In [None]:
import sklearn.datasets
import dask.array as da
import numpy as np
n, d = int(100e3), 100

X = da.random.normal(size=(n, d), chunks=n // 10)
w_star = da.exp(da.random.uniform(size=d, chunks=d))
w_star = w_star**4
noise = da.random.normal(size=n, chunks=n) * d / 1
y = da.sign(X @ w_star + noise)

In [None]:
from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Model creation

Now, let's create a scikit-learn model as normal, then wrap it with `dask_ml.wrappers.Incremental`. This is fairly straightforward:

In [None]:
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

est = SGDClassifier(loss='log', penalty='l1', tol=1e-3)
inc = Incremental(est, scoring='accuracy')

It is important to specify the scoring parameter in `Incremental`; otherwise, scikit-learn scorers are fed Dask arrays (which they're not optimized for).

Our model (SGDClassifier here) must implement `partial_fit` to be used with `Incremental`.

`Incremental` does data management: it calls `est.partial_fit` on each chunk of the passed data. It's not clear how to let `est` use parallel algorithms that look at all the data at once.
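To make the chunk-by-chunk idea concrete, here is a minimal sketch in plain Python. It is not Dask-ML's actual implementation: `CountingPerceptron` and `fit_in_chunks` are hypothetical names made up for this illustration, and the data is a tiny hand-written list rather than a Dask array. It only shows the pattern of feeding one chunk at a time to `partial_fit`.

```python
# Toy illustration of the pattern described above (not dask_ml's real code).

class CountingPerceptron:
    """Hypothetical minimal estimator exposing a scikit-learn-style partial_fit."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.n_calls = 0  # number of chunks this model has been shown

    def predict_one(self, x):
        # Sign of the dot product; ties are assigned to the +1 class.
        s = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else -1

    def partial_fit(self, X, y):
        # Update the existing weights using only the rows in this chunk.
        self.n_calls += 1
        for x, target in zip(X, y):
            if self.predict_one(x) != target:
                self.w = [wi + target * xi for wi, xi in zip(self.w, x)]
        return self


def fit_in_chunks(est, X, y, chunk_size):
    """One pass over the data, calling est.partial_fit once per chunk."""
    for start in range(0, len(X), chunk_size):
        stop = start + chunk_size
        est.partial_fit(X[start:stop], y[start:stop])
    return est


X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0], [1.5, 0.5], [-0.5, -1.5]]
y = [1, 1, -1, -1, 1, -1]

model = fit_in_chunks(CountingPerceptron(n_features=2), X, y, chunk_size=2)
print(model.n_calls)  # one partial_fit call per chunk
```

The estimator only ever sees one chunk at a time, which is why it has to support incremental updates through `partial_fit`.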

## Model training
For an `Incremental` model, `partial_fit` and `fit` are aliased to the same method, and each call does one complete pass over the dataset. Let's call `partial_fit` several times and score the model after each call (which will allow for a visualization):

In [None]:
%%time
# takes about 1min with 8 workers
from distributed.metrics import time
data = []
start = time()
for iteration in range(10):
    inc.partial_fit(X_train, y_train, classes=da.unique(y))
    data += [{'score': inc.score(X_test, y_test),
             'epochs': iteration + 1,
             'time': time() - start}]
    print(data[-1])

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data)
df.plot(x='epochs', y='score')
plt.title('Score over iterations')
plt.ylabel('Score')

## Scoring and predicting

In [None]:
y_hat = inc.predict(X_test)
errors = np.abs(y_hat - y_test) / 2
accuracy = 1 - errors.mean()
accuracy.compute()
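The manual accuracy computation above relies on the labels being -1 or +1: when a prediction is correct the difference is 0, and when it is wrong the absolute difference is 2, so dividing by 2 gives an error indicator of 0 or 1. A small sketch with plain Python lists (the labels below are made up for illustration):

```python
# Made-up labels in {-1, +1}, for illustration only.
y_true = [1, -1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1]

# |prediction - truth| is 0 when they agree and 2 when they differ,
# so dividing by 2 yields a 0/1 error indicator per example.
errors = [abs(p - t) / 2 for p, t in zip(y_pred, y_true)]

# Accuracy is one minus the mean error rate.
accuracy = 1 - sum(errors) / len(errors)
print(errors, accuracy)
```

This should agree with `inc.score`, since the wrapper was created with `scoring='accuracy'`.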

In [None]:
inc.score(X_test, y_test)

## Grid search

The accuracy we obtained is good, but not great. Specifying a different loss function and a different amount of regularization (`alpha`) might help.

Let's do a grid search to see:

In [None]:
from dask_ml.model_selection import GridSearchCV
params = {'alpha': np.logspace(-6, 0, num=10),
          'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']}
grid = GridSearchCV(est, params, return_train_score=False)

Notice that we pass the scikit-learn estimator here, not the `Incremental` wrapper. `Incremental` is only a small wrapper for data management, and `GridSearchCV` is also a Dask tool that handles data management, so this isn't a huge deal: all the core computation happens inside the scikit-learn model.
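To get a feel for how much work this grid implies, the sketch below enumerates the parameter combinations with the standard library only. The `alphas` list reproduces the values of `np.logspace(-6, 0, num=10)` by hand, and `candidates` is a name introduced just for this illustration; it is not how `GridSearchCV` builds its task graph internally.

```python
from itertools import product

# Same values as np.logspace(-6, 0, num=10): ten points from 1e-6 to 1,
# evenly spaced in the exponent.
alphas = [10 ** (-6 + 6 * i / 9) for i in range(10)]
losses = ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']

# Every (alpha, loss) pair is one candidate model for the grid search.
candidates = [{'alpha': a, 'loss': l} for a, l in product(alphas, losses)]
print(len(candidates))
```

Each candidate is then fitted once per cross-validation fold, which is why the search takes noticeably longer than a single model fit.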

In [None]:
%%time
# took about 3 minutes with 8 workers
grid.fit(X, y)

In [None]:
grid.best_params_

Now, let's plot a heatmap to see how these two parameters influence the prediction scores:

In [None]:
import seaborn as sns

df = pd.DataFrame(grid.cv_results_)
show = df.pivot_table(index='param_loss', columns='param_alpha', values='mean_test_score')
show.columns = ['%0.3f' % np.log10(x) for x in show.columns]
show.columns.name = 'log10(param_alpha)'

sns.heatmap(show, fmt='0.2f', cmap='magma')
_ = plt.xticks(rotation=45)