# Incremental training of large datasets

Machine learning with big data is difficult. To help with this, many Scikit-Learn estimators implement a `partial_fit` method that enables incremental learning in batches. The Scikit-Learn documentation discusses this approach in more depth in its user guide, "[Strategies to scale computationally: bigger data](http://scikit-learn.org/stable/modules/scaling_strategies.html)".

This notebook demonstrates the use of Dask-ML's `Incremental` meta-estimator, which takes advantage of `partial_fit`. Scikit-Learn handles all of the computation while Dask handles the data management, loading and moving batches of data as necessary. This allows scaling to large datasets distributed across many machines, or to datasets that do not fit in memory, all with a familiar workflow.

This example will show the following:

* wrapping a scikit-learn estimator that implements `partial_fit` with [dask_ml.wrappers.Incremental](https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.Incremental.html#dask_ml.wrappers.Incremental)
* training, predicting, and scoring on this wrapped classifier

Although this example uses Scikit-Learn's `SGDClassifier`, `Incremental` will work for any class that implements `partial_fit` and the [scikit-learn base estimator API].

<img src="http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png"> <img src="https://www.continuum.io/sites/default/files/dask_stacked.png" width="100px">

[scikit-learn base estimator API]:http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html



## Dask setup

Creating a distributed scheduler will provide good feedback because we can view the dask.distributed dashboard. This will provide progress and performance metrics, including visualization of jobs that are running.

In [2]:
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:62513 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 4 |
| | Memory: 8.59 GB |


## Data creation

We create a synthetic dataset that is large enough to be interesting, but small enough to run quickly.  It has 1,000,000 examples and 100 features.

In [3]:
import numpy as np
import dask.array as da
from dask_ml.datasets import make_classification
import dask

n, d = 1000000, 100

X, y = make_classification(n_samples=n, n_features=d,
                           chunks=n // 100, flip_y=0.2)
X

dask.array<normal, shape=(1000000, 100), dtype=float64, chunksize=(10000, 100)>

For more information on creating dask arrays and dataframes from real data, see documentation on [Dask arrays](https://dask.pydata.org/en/latest/array-creation.html) and [Dask dataframes](https://dask.pydata.org/en/latest/dataframe-create.html).
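As a small illustration of the array case, the sketch below wraps an in-memory NumPy array in a Dask array with `da.from_array`. The data and the names `X_np`, `y_np`, `X_da`, and `y_da` are made up for this example. Chunking along the rows only means that every block holds complete samples.

```python
import numpy as np
import dask.array as da

# Hypothetical in-memory data standing in for a real dataset.
rng = np.random.RandomState(0)
X_np = rng.normal(size=(1000, 5))
y_np = (X_np[:, 0] > 0).astype(int)

# Chunk along the rows only, so each block contains whole samples
# with all of their features.
X_da = da.from_array(X_np, chunks=(100, 5))
y_da = da.from_array(y_np, chunks=100)

print(X_da)            # lazy array description, no data is shown
print(X_da.numblocks)  # number of blocks along each axis
```

Keeping the row chunks of `X_da` and `y_da` aligned matters because the features and labels are consumed block by block.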

## Split data for training and testing

We split our dataset into training and testing sets so that the model is evaluated on data it did not see during training:

In [4]:
from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train

dask.array<concatenate, shape=(900000, 100), dtype=float64, chunksize=(9000, 100)>

### Persist data in memory

This dataset is small enough to fit in distributed memory, so we call `dask.persist` now.

In [5]:
X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)

Persisting keeps the data in memory, so nothing has to be recomputed when it is referenced later. If the data came from a CSV file and was not persisted, the file would have to be re-read every time the data was used.

If your dataset does not fit in memory, skip this step. Everything will still work, but it will be slower and use less memory.
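The effect of persisting can be sketched on a small array. This is only an illustration: it uses Dask's default local scheduler rather than the distributed client created earlier, and the names `x` and `x_persisted` are made up for the example.

```python
import dask
import dask.array as da

# A small lazy array: nothing is computed when this line runs.
x = da.ones((1000, 10), chunks=(100, 10)) + 1

# persist() computes the blocks and keeps the results in memory. It returns
# a Dask array whose task graph only refers to the finished chunks.
(x_persisted,) = dask.persist(x)

lazy_tasks = len(dict(x.__dask_graph__()))
persisted_tasks = len(dict(x_persisted.__dask_graph__()))
print("tasks before persist:", lazy_tasks)
print("tasks after persist:", persisted_tasks)
```

The persisted array carries fewer tasks because the creation and addition steps have already run. Later computations start from the stored chunks.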

### Precompute classes

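We pre-compute the classes from our training data. This is required for this classification example because `partial_fit` sees only one chunk at a time, and a single chunk is not guaranteed to contain every class: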

In [6]:
classes = da.unique(y_train).compute()
classes

array([0, 1])
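The requirement can be seen with plain Scikit-Learn. In the sketch below, the data and the names `X_batch` and `y_batch` are invented for illustration, and the first batch happens to contain only one of the two classes.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_batch = rng.normal(size=(20, 3))
y_batch = np.zeros(20, dtype=int)  # this batch only contains class 0

clf = SGDClassifier(random_state=0)

# Without `classes`, the first call to partial_fit is rejected.
try:
    clf.partial_fit(X_batch, y_batch)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# With the full set of classes supplied up front, the call succeeds even
# though class 1 has not been seen yet.
clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
print(clf.classes_)
```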

## Create Scikit-Learn model

We make the underlying Scikit-Learn estimator:

In [6]:
from sklearn.linear_model import SGDClassifier

est = SGDClassifier(loss='log', penalty='l2', tol=1e-3)

Here we use `SGDClassifier`, but any estimator that implements the `partial_fit` method will work.  A list of Scikit-Learn models that implement this API is available [here](http://scikit-learn.org/stable/modules/scaling_strategies.html#incremental-learning).
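A quick way to check whether a given estimator qualifies is to test for the method directly. The selection of classes below is arbitrary and only meant as an illustration.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

candidates = [
    SGDClassifier,
    Perceptron,
    MultinomialNB,
    MiniBatchKMeans,
    LogisticRegression,
    RandomForestClassifier,
]

# An estimator supports incremental learning if it exposes partial_fit.
supports = {cls.__name__: hasattr(cls, "partial_fit") for cls in candidates}

for name, ok in supports.items():
    print(f"{name:25s} partial_fit: {ok}")
```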


## Wrap with Dask-ML's Incremental meta-estimator

We now wrap our `SGDClassifier` with the [`dask_ml.wrappers.Incremental`](https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.Incremental.html#dask_ml.wrappers.Incremental) meta-estimator.

In [7]:
from dask_ml.wrappers import Incremental

inc = Incremental(est, scoring='accuracy')

Recall that `Incremental` only does data management while leaving the actual algorithm to the underlying Scikit-Learn estimator.

Note: When the testing data are Dask arrays, it is helpful to specify the `scoring` parameter of `Incremental`; otherwise, Scikit-Learn scorers are fed Dask arrays, for which they are not well optimized.

## Model training

`Incremental` implements a `fit` method, which performs one loop over the dataset, calling `partial_fit` on each chunk of the Dask array.

In [8]:
inc.fit(X_train, y_train, classes=classes)

Incremental(estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=0.001, verbose=0, warm_start=False),
      random_state=None, scoring='accuracy', shuffle_blocks=True)

In [9]:
inc.score(X_test, y_test)

0.58387
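The single pass that `fit` performs can be pictured with plain NumPy and Scikit-Learn. The sketch below is a conceptual stand-in, not Dask-ML's implementation. The synthetic data, the default `SGDClassifier` loss, and the chunk count are all assumptions made for this example. It also visits the chunks in order, whereas the output above shows `Incremental` defaulting to `shuffle_blocks=True`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(10_000, 20))
# Labels depend linearly on the features, so a linear model can learn them.
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(int)

X_train, y_train = X[:9_000], y[:9_000]
X_test, y_test = X[9_000:], y[9_000:]
classes = np.unique(y_train)

clf = SGDClassifier(random_state=0)

# One pass over the data: each chunk is handed to partial_fit exactly once.
n_chunks = 9
for X_chunk, y_chunk in zip(
    np.array_split(X_train, n_chunks), np.array_split(y_train, n_chunks)
):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = clf.score(X_test, y_test)
print("Accuracy after one pass:", accuracy)
```

In this picture, Dask's role is to produce the chunks and move them to where the estimator lives, while the estimator itself only ever sees one in-memory block at a time.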

### Pass over the training data many times

Calling `.fit` passes over all chunks of our data once. However, in many cases we may want to pass over the training data many times. To do this, we can use the `Incremental.partial_fit` method and a for loop.

In [10]:
est = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
inc = Incremental(est, scoring='accuracy')

In [11]:
for _ in range(10):
    inc.partial_fit(X_train, y_train, classes=classes)
    print('Score:', inc.score(X_test, y_test))    

Score: 0.59736
Score: 0.61563
Score: 0.61536
Score: 0.62153
Score: 0.62517
Score: 0.6231
Score: 0.62423
Score: 0.62558
Score: 0.62435
Score: 0.62726


Note that we pass the precomputed `classes` array. We could pass `da.unique(y_train)` to `partial_fit` instead, but then the unique values would have to be recomputed on every call.

## Predict and Score

Finally, we can call `Incremental.predict` and `Incremental.score` on our testing data.

In [12]:
inc.predict(X_test)  # Predict produces lazy dask arrays

dask.array<predict, shape=(100000,), dtype=int64, chunksize=(1000,)>

In [13]:
inc.predict(X_test)[:100].compute()  # call compute to get results

array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0])

In [14]:
inc.score(X_test, y_test)

0.62726
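The lazy behaviour of `predict` shown above can be pictured as applying a fitted Scikit-Learn model to each block of a Dask array. The sketch below does this with `map_blocks`. It is an assumption-laden illustration on made-up data, not a description of how `Incremental.predict` is implemented.

```python
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_np = rng.normal(size=(1_000, 5))
y_np = (X_np[:, 0] > 0).astype(int)

# Fit an ordinary in-memory model first.
clf = SGDClassifier(random_state=0).fit(X_np, y_np)

X_da = da.from_array(X_np, chunks=(100, 5))

# Building the prediction is lazy: map_blocks only records the work to do.
# drop_axis=1 tells Dask that each (rows, features) block becomes a
# one-dimensional block of predictions.
lazy_pred = X_da.map_blocks(clf.predict, dtype=y_np.dtype, drop_axis=1)
print(lazy_pred)

# Only the blocks needed for the requested slice are computed.
first_ten = lazy_pred[:10].compute()
print(first_ten)
```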

## Learn more

In this notebook we went over using Dask-ML's `Incremental` meta-estimator to automate the process of incremental training with Scikit-Learn estimators that implement the `partial_fit` method.  If you want to learn more about this process you might want to investigate the following documentation:

1.  [Scikit-Learn user guide: Strategies to scale computationally: bigger data](http://scikit-learn.org/stable/modules/scaling_strategies.html)
2.  [Dask-ML Incremental API documentation](https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.Incremental.html#dask_ml.wrappers.Incremental)
3.  [List of Scikit-Learn estimators compatible with Dask-ML's Incremental](http://scikit-learn.org/stable/modules/scaling_strategies.html#incremental-learning)
4. For more info on the train-test split for model evaluation, see [Hyperparameters and Model Validation](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html).