<br>
<br>
<center><img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;" align="center"/></center>
<br>
<center><img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;"/></center>

# Parallel and Distributed Machine Learning

We've now seen how Dask makes data analysis scalable with parallelization via Dask DataFrames. Let's now see how [Dask-ML](https://dask-ml.readthedocs.io) allows us to do machine learning in a parallel and distributed manner. Note, machine learning is really just a special case of data analysis (one that automates analytical model building), so the 💪 Dask gains 💪 we've seen should apply here as well!

In this notebook, we'll 

* break down machine learning scaling problems into two categories
* solve an ML problem with a single machine (with Scikit-Learn)
* solve an ML problem with a single machine and parallelism (with Scikit-Learn and Joblib)
* solve an ML problem with multiple machines and parallelism (with Scikit-Learn, Joblib and Dask)
* solve an ML problem with multiple machines *in the cloud* and parallelism (with Scikit-Learn, Joblib, Dask and Coiled)

*A bit about me:* I'm Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/). We build products that bring the power of scalable data science and machine learning to you, such as single-click hosted clusters on the cloud. We want to take the DevOps out of data science so you can get back to your real job. If you're interested in taking Coiled for a test drive, you can sign up for our [free Beta here](https://beta.coiled.io/).

## 1. Types of scaling problems in machine learning

Say you have a machine learning workflow that works well for small problems. There are two main types of scaling challenges you can run into: scaling the **size of your model** and scaling the **size of your data**. That is:

1. CPU-bound problems: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.
2. Memory-bound problems: Data is larger than RAM, and sampling isn't an option.

Here's a handy diagram for visualizing these problems:

![](images/ml-dimensions.png)

In the bottom-left quadrant, your datasets aren’t too large (and therefore fit comfortably in RAM) and your model isn’t too large. Here, you’re much better off using something like scikit-learn, XGBoost, and similar libraries. You don't need to leverage multiple machines in a distributed manner with a library like Dask-ML here.

If you’re in any of the other quadrants, however, distributed machine learning is the way to go.

Here's a bird's eye view of the strategy we'll apply in this notebook:

* For in-memory problems, just use scikit-learn (or your favorite ML library).
* For large models, use Joblib's Dask backend (`joblib.parallel_backend("dask")`) and your favorite scikit-learn estimator.
* For large datasets, use `dask_ml` estimators.
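
The strategy above can be sketched as a tiny decision helper. This is only an illustration: the function names, the 50% RAM headroom, and the threshold of 20 model fits are hypothetical choices made for this sketch, not rules from Dask-ML or scikit-learn.

```python
def estimate_array_bytes(n_samples, n_features, bytes_per_value=8):
    """Rough size of a dense array, assuming 8-byte (float64) values."""
    return n_samples * n_features * bytes_per_value


def suggest_strategy(data_bytes, ram_bytes, n_model_fits,
                     headroom=0.5, many_fits_threshold=20):
    """Pick a strategy from the list above.

    headroom and many_fits_threshold are hypothetical knobs for this sketch.
    """
    # Leave room for copies made during training: treat data as "in memory"
    # only if it uses at most a fraction of the available RAM.
    fits_in_ram = data_bytes <= headroom * ram_bytes
    if not fits_in_ram:
        return "dask_ml estimators"
    if n_model_fits >= many_fits_threshold:
        return "scikit-learn on a Dask cluster via Joblib"
    return "plain scikit-learn"


ram = 16 * 1024**3  # a 16 GiB laptop, as an example
small_data = estimate_array_bytes(10_000, 4)
large_data = estimate_array_bytes(100_000_000, 20)

print(suggest_strategy(small_data, ram, n_model_fits=8))
print(suggest_strategy(small_data, ram, n_model_fits=30))
print(suggest_strategy(large_data, ram, n_model_fits=1))
```

In practice the cut-offs depend on your hardware and estimator, so treat the numbers as placeholders.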

## 2. Scikit-Learn in five minutes

<img src="images/scikit_learn_logo_small.svg" alt="scikit-learn logo"/>

Scikit-Learn has a nice, consistent API.

1. You instantiate an `Estimator` (e.g. `LinearRegression`, `RandomForestClassifier`, etc.). All of the model's *hyperparameters* (user-specified parameters, not the ones learned by the estimator) are passed to the estimator when it's created.
2. You call `estimator.fit(X, y)` to train the estimator.
3. Use `estimator` to inspect attributes, make predictions, etc. 

Let's generate some random data.

In [1]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=4, random_state=0)
X[:8]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-1.90879217, -1.1602627 , -0.27364545, -0.82766028],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [-1.17047054,  0.02212382, -2.17376797, -0.13421976],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959],
       [ 1.68616989,  1.6329131 , -1.42072654,  1.04050557],
       [-0.93912893, -1.02270838,  1.10093827, -0.63714432]])

In [2]:
y[:8]

array([0, 0, 1, 0, 0, 0, 0, 1])

We'll fit a Support Vector Classifier.

In [3]:
from sklearn.svm import SVC

Now we create the estimator and fit it.

In [4]:
estimator = SVC(random_state=0)
estimator.fit(X, y)

SVC(random_state=0)

We inspect the learned attributes.

In [5]:
estimator.support_vectors_[:4]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959]])

And check the accuracy.

In [6]:
estimator.score(X, y)

0.905
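
Note that the score above is computed on the same data the estimator was trained on, so it tends to be optimistic. A minimal sketch of scoring on held-out data follows; the `_demo` names, the smaller sample size, and the 25% test split are assumptions made for this example so it does not overwrite the notebook's `X`, `y`, or `estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A smaller dataset than the one above, just to keep the sketch fast.
X_demo, y_demo = make_classification(n_samples=2000, n_features=4, random_state=0)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

held_out_estimator = SVC(random_state=0).fit(X_train, y_train)

train_accuracy = held_out_estimator.score(X_train, y_train)
test_accuracy = held_out_estimator.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

The cross-validation used by `GridSearchCV` later in this notebook applies the same idea repeatedly across several folds.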

## 3. Hyperparameters

Most models have *hyperparameters*. They affect the fit, but are specified up front instead of learned during training.

In [7]:
estimator = SVC(C=0.00001, shrinking=False, random_state=0)
estimator.fit(X, y)
estimator.support_vectors_[:4]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [-1.17047054,  0.02212382, -2.17376797, -0.13421976]])

In [8]:
estimator.score(X, y)

0.5007
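
To make the distinction concrete, here is a small sketch contrasting hyperparameters (set up front, readable via `get_params()`) with learned attributes (which scikit-learn names with a trailing underscore and which only exist after `fit`). The `_demo` names and the value `C=0.5` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)

demo_estimator = SVC(C=0.5, random_state=0)

# Hyperparameters are available immediately, before any training.
hyperparameters = demo_estimator.get_params()
print(hyperparameters["C"], hyperparameters["kernel"])

# Learned attributes do not exist until the estimator has been fit.
has_learned_before_fit = hasattr(demo_estimator, "support_vectors_")
demo_estimator.fit(X_demo, y_demo)
has_learned_after_fit = hasattr(demo_estimator, "support_vectors_")
print(has_learned_before_fit, has_learned_after_fit)
```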

## 4. Hyperparameter Optimization

There are a few ways to learn the best *hyper*parameters while training. One is `GridSearchCV`.
As the name implies, this does a brute-force search over a grid of hyperparameter combinations.

In [9]:
from sklearn.model_selection import GridSearchCV

In [10]:
%%time
estimator = SVC(gamma='auto', random_state=0, probability=True)
param_grid = {
    'C': [0.001, 10.0],
    'kernel': ['rbf', 'poly'],
}

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV] C=0.001, kernel=rbf .............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .............................. C=0.001, kernel=rbf, total=   3.0s
[CV] C=0.001, kernel=rbf .............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.0s remaining:    0.0s


[CV] .............................. C=0.001, kernel=rbf, total=   3.1s
[CV] C=0.001, kernel=poly ............................................
[CV] ............................. C=0.001, kernel=poly, total=   1.5s
[CV] C=0.001, kernel=poly ............................................
[CV] ............................. C=0.001, kernel=poly, total=   1.5s
[CV] C=10.0, kernel=rbf ..............................................
[CV] ............................... C=10.0, kernel=rbf, total=   1.0s
[CV] C=10.0, kernel=rbf ..............................................
[CV] ............................... C=10.0, kernel=rbf, total=   1.0s
[CV] C=10.0, kernel=poly .............................................
[CV] .............................. C=10.0, kernel=poly, total=   2.0s
[CV] C=10.0, kernel=poly .............................................
[CV] .............................. C=10.0, kernel=poly, total=   2.0s


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   15.1s finished


CPU times: user 18.2 s, sys: 385 ms, total: 18.6 s
Wall time: 18.8 s


GridSearchCV(cv=2,
             estimator=SVC(gamma='auto', probability=True, random_state=0),
             param_grid={'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']},
             verbose=2)
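
The "8 fits" in the log above is simply the number of grid candidates times the number of cross-validation folds. A short sketch of that arithmetic, using scikit-learn's `ParameterGrid` to enumerate the candidates (the `demo_` names are just for this example):

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "C": [0.001, 10.0],
    "kernel": ["rbf", "poly"],
}
demo_folds = 2

# Every combination of the listed values is one candidate.
candidates = list(ParameterGrid(demo_grid))
n_candidates = len(candidates)
n_fits = n_candidates * demo_folds

for params in candidates:
    print(params)
print(f"{n_candidates} candidates x {demo_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full dataset after the search. Because each extra hyperparameter multiplies the number of candidates, the fit count grows quickly, which is what makes grid search a natural target for parallelism.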

## 5. Single-machine parallelism with Joblib

<img src="images/joblib_logo.svg" alt="Joblib logo" style="width: 300px;"/>

![](images/unmerged_grid_search_graph.svg)

Scikit-Learn has nice *single-machine* parallelism, via Joblib.
Any Scikit-Learn estimator that can operate in parallel exposes an `n_jobs` keyword.
This controls the number of CPU cores that will be used.

In [11]:
%%time
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed:    8.4s remaining:   14.0s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:   11.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:   11.2s finished


CPU times: user 3.71 s, sys: 275 ms, total: 3.98 s
Wall time: 15.1 s


GridSearchCV(cv=2,
             estimator=SVC(gamma='auto', probability=True, random_state=0),
             n_jobs=-1,
             param_grid={'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']},
             verbose=2)
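
Under the hood, `n_jobs` is handed to Joblib, which runs independent function calls in separate workers. Here is a minimal sketch of Joblib used directly, outside scikit-learn; the choice of `math.factorial` and `n_jobs=2` is arbitrary.

```python
import math

from joblib import Parallel, cpu_count, delayed

# n_jobs=-1 asks Joblib to use every core it can see.
n_cores = cpu_count()
print(f"Joblib sees {n_cores} cores")

# Each delayed(...) call is an independent task; results come back in order.
factorials = Parallel(n_jobs=2)(delayed(math.factorial)(n) for n in range(6))
print(factorials)
```

`GridSearchCV` does essentially this with one task per (candidate, fold) pair, which is why the log above reports a `LokyBackend` with one worker per core.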

## 6. Multi-machine parallelism with Dask

<img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;"/>

![](images/merged_grid_search_graph.svg)

Dask can talk to Scikit-Learn (via Joblib) so that your *cluster* is used to train a model. 

If you run this on a laptop, it will take quite some time, but the CPU usage will be satisfyingly near 100% for the duration. To run faster, you would need a distributed cluster, which means passing the scheduler's address to `Client`, something like

```
c = Client('tcp://my.scheduler.address:8786')
```

Details on the many ways to create a cluster can be found [here](https://docs.dask.org/en/latest/setup/single-distributed.html).

Let's try it on a larger problem (more hyperparameters).

In [12]:
import joblib
import dask.distributed

c = dask.distributed.Client()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 65243 instead


In [13]:
param_grid = {
    'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],
    # Uncomment this for larger Grid searches on a cluster
    # 'kernel': ['rbf', 'poly', 'linear'],
    # 'shrinking': [True, False],
}

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=5, n_jobs=-1)

In [14]:
%%time
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.1min finished


CPU times: user 15.8 s, sys: 1.46 s, total: 17.3 s
Wall time: 1min 11s


In [15]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0}, 0.9119000000000002)

## 7. Multi-machine parallelism in the cloud with Coiled

<br>
<img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;"/>
<br>

Coiled, [among other things](https://coiled.io/why-coiled/), provides hosted and scalable Dask clusters. The biggest barriers to entry for doing machine learning at scale are "Do you have access to a cluster?" and "Do you know how to manage it?" Coiled solves both of those problems. Let's see how.

We'll spin up a Coiled cluster (with 10 workers in this case), then instantiate a Dask Client to use with that cluster.

In [16]:
import coiled
from dask.distributed import Client

In [17]:
# Spin up cluster, instantiate a Client
cluster = coiled.Cluster(n_workers=10, configuration="my-cluster-config")
client = Client(cluster)
client

Creating Cluster. This takes about a minute ...Checking environment images
Valid environment image found


Client: Scheduler: tls://ec2-18-217-2-109.us-east-2.compute.amazonaws.com:8786, Dashboard: http://ec2-18-217-2-109.us-east-2.compute.amazonaws.com:8787/status
Cluster: Workers: 0, Cores: 0, Memory: 0 B


Now, watch this. We can fit our estimator with multi-machine parallelism in the cloud by quickly *switching to a Dask parallel backend*.

In [19]:
%%time
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  30 | elapsed:    8.7s remaining:    8.7s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   23.7s finished


CPU times: user 8.06 s, sys: 820 ms, total: 8.88 s
Wall time: 40.7 s


How does this work so seamlessly? Dask-ML developers worked with the Scikit-Learn and Joblib developers to implement a Dask parallel backend. Internally, scikit-learn talks to Joblib, Joblib talks to Dask, and Dask schedules all of those tasks on the cluster. That the cluster lives in the cloud adds no complexity to the code, which is Coiled's mission on full display.
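
The Joblib-to-Dask hand-off can be seen without scikit-learn at all. The sketch below assumes `dask.distributed` is installed and starts a small in-process client (threads only, dashboard disabled) purely for illustration; on a real cluster you would pass the scheduler address or a cluster object to `Client` instead.

```python
import math

import joblib
from dask.distributed import Client

# A tiny in-process "cluster", used here only so the sketch is self-contained.
demo_client = Client(
    processes=False, n_workers=1, threads_per_worker=2, dashboard_address=None
)

try:
    # Inside this block, Joblib sends its tasks to the Dask scheduler
    # instead of its default local backend.
    with joblib.parallel_backend("dask"):
        roots = joblib.Parallel()(
            joblib.delayed(math.sqrt)(n) for n in [0, 1, 4, 9, 16]
        )
    print(roots)
finally:
    demo_client.close()
```

Any library that parallelizes through Joblib, scikit-learn included, picks up the Dask backend in the same way.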

The best parameters and best score:

In [20]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0}, 0.9119000000000002)



## Bonus! Training on large datasets

Let's talk about one more thing. Sometimes you'll want to train on a larger than memory dataset. `dask-ml` has implemented estimators that work well on Dask Arrays and DataFrames that may be larger than your machine's RAM.

In [None]:
import dask.array as da
import dask.delayed
from sklearn.datasets import make_blobs
import numpy as np

We'll make a small (random) dataset locally using Scikit-Learn.

In [None]:
n_centers = 12
n_features = 20

X_small, y_small = make_blobs(n_samples=1000, centers=n_centers, n_features=n_features, random_state=0)

centers = np.zeros((n_centers, n_features))

for i in range(n_centers):
    centers[i] = X_small[y_small == i].mean(0)
    
centers[:4]

The small dataset will be the template for our large random dataset.
We'll use `dask.delayed` to adapt `sklearn.datasets.make_blobs`, so that the actual dataset is being generated on our workers. 

In [None]:
n_samples_per_block = 200000
n_blocks = 500

delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,
                                     centers=centers,
                                     n_features=n_features,
                                     random_state=i)[0]
            for i in range(n_blocks)]
arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype=X_small.dtype)
          for obj in delayeds]
X = da.concatenate(arrays)
X
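
Before persisting an array like this, it is worth checking how big it is. A back-of-the-envelope sketch, assuming 8-byte `float64` values (which is what `make_blobs` returns); the variable names here are chosen for this example only.

```python
rows_per_block = 200_000
block_count = 500
feature_count = 20
bytes_per_value = 8  # assumption: float64

total_rows = rows_per_block * block_count
block_bytes = rows_per_block * feature_count * bytes_per_value
total_bytes = block_bytes * block_count

print(f"rows:       {total_rows:,}")
print(f"per block:  {block_bytes / 1e6:.0f} MB")
print(f"whole array: {total_bytes / 1e9:.0f} GB")
```

Each block is small enough for a single worker to generate comfortably, but the full array is far larger than a typical laptop's RAM, which is why it should only be persisted on a cluster.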

In [None]:
X = X.persist()  # Only run this on the cluster.

The algorithms implemented in Dask-ML are scalable. They handle larger-than-memory datasets just fine.

They follow the scikit-learn API, so if you're familiar with scikit-learn, you'll feel at home with Dask-ML.

In [None]:
from dask_ml.cluster import KMeans

In [None]:
clf = KMeans(init_max_iter=3, oversampling_factor=10)

In [None]:
%time clf.fit(X)

In [None]:
clf.labels_

In [None]:
clf.labels_[:10].compute()

**Recap:** We
* broke down machine learning scaling problems into two categories (data size vs. model size).
* solved an ML problem with a single machine (with Scikit-Learn).
* solved an ML problem with a single machine and parallelism (with Scikit-Learn and Joblib).
* solved an ML problem with multiple machines and parallelism (with Scikit-Learn, Joblib and Dask).
* solved an ML problem with multiple machines *in the cloud* and parallelism (with Scikit-Learn, Joblib, Dask and Coiled).

We also
* used `dask-ml` estimators that work well on Dask Arrays and DataFrames to train on datasets larger than your machine's RAM.