# Dask for Machine Learning

Dask integrates well with machine learning libraries like scikit-learn.

In [1]:
from dask.distributed import Client, progress
client = Client(processes=False, threads_per_worker=4, n_workers=1, memory_limit='2GB')
client

Client: Scheduler: inproc://192.168.7.20/93787/1, Dashboard: http://localhost:8787/status
Cluster: Workers: 1, Cores: 4, Memory: 2.00 GB


## Distributed Training

<img src="images/scikit-learn-logo-notext.png"/> <img src="images/joblib_logo.svg" width="20%"/> 

Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.

Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code.
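
To make the mechanism concrete, here is a minimal sketch of the joblib pattern that scikit-learn relies on for `n_jobs`. The `square` function is a hypothetical stand-in for one unit of training work, and the `threading` backend is used only so the sketch runs without a cluster. Swapping the backend name in the context manager is the one-line change described above.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Hypothetical stand-in for one unit of work, such as fitting one fold."""
    return x * x


# Serial baseline for comparison.
serial = [square(i) for i in range(8)]

# The backend is selected by the context manager; the Parallel call itself
# stays the same whichever backend is active.
with parallel_backend("threading", n_jobs=2):
    threaded = Parallel()(delayed(square)(i) for i in range(8))

print(threaded == serial)
```

Because the choice of backend lives outside the `Parallel` call, library code such as scikit-learn does not need to know where its tasks end up running.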

In [2]:
import dask_ml.joblib  # register the distributed backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV 

## Create a Random Array

We'll use scikit-learn to create a pair of random arrays, one for the features `X`, and one for the target `y`.

In [3]:
X, y = make_classification(n_samples=1000)
X[:5]

array([[-0.33858087,  0.10661578, -1.41685748,  0.85416003, -1.13306368,
         0.43936999, -0.89155779, -0.57320501,  0.6961914 , -0.31448662,
         0.32456905,  3.34340464,  1.53807366, -1.12349946, -1.68874323,
        -0.3993426 ,  1.89866847,  0.96504723,  0.66673932, -0.9261896 ],
       [ 0.73393777,  0.41953825, -2.20315789,  0.70963521, -0.47589734,
         0.18055579, -0.11839911,  0.1566506 , -0.54005911, -3.21362844,
        -0.58537196, -0.62717674, -0.69529353, -0.76255876,  0.41663376,
        -0.34438597, -0.36720999, -1.26199171, -1.31643299,  2.16659469],
       [ 0.91076983, -0.45821806,  0.46588253,  2.58363374,  0.59169886,
         0.64780522, -0.51263702, -1.08780596,  0.50648207, -0.36087092,
        -0.83027318, -0.96445621, -1.20075522, -0.19168403,  1.25850669,
        -0.58273732,  0.92955235,  2.13926301, -0.865342  ,  0.9191456 ],
       [ 2.37885291,  1.65703815,  2.06113315, -1.28752591,  1.23545085,
        -0.13716117, -0.00335223,  0.6448825 , -

We'll then fit a `LassoCV` model, testing out 500 values of $\alpha$.

In [4]:
clf = LassoCV(n_alphas=500, n_jobs=-1)
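
As a rough illustration of what "500 values of $\alpha$" means, the sketch below builds a log-spaced grid the way I understand `LassoCV` to do by default. It starts at the smallest $\alpha$ that drives every coefficient to zero and ends at a fixed fraction of that value. The data here is synthetic NumPy data, not the notebook's `X` and `y`, and the ratio `eps = 1e-3` is an assumption about scikit-learn's default.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((1000, 20))              # illustrative features
y_demo = rng.integers(0, 2, size=1000).astype(float)  # illustrative 0/1 target

# Center the data, as a model with an intercept would.
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Smallest alpha for which the Lasso solution is all zeros.
alpha_max = np.abs(Xc.T @ yc).max() / len(yc)

eps = 1e-3       # assumed ratio alpha_min / alpha_max
n_alphas = 500
alphas = np.geomspace(alpha_max, alpha_max * eps, num=n_alphas)

print(alphas.shape, alphas[0], alphas[-1])
```

Each cross-validation fold then fits the model along this whole grid, which is why a larger `n_alphas` increases the work per fold.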

To fit that normally, we'd just call `clf.fit(X, y)`. To fit it using the cluster, we just need to use a context manager provided by joblib.
We'll pre-scatter the data to each worker, which can help with performance.

In [5]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly

with joblib.parallel_backend('dask', scatter=[X, y]):
    clf.fit(X, y)

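The training tasks were split among all the workers on the cluster. Note that `LassoCV` parallelizes over cross-validation folds, with each task computing the full path of 500 $\alpha$ values for its fold.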
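
For readers who want to try the whole flow in one place and inspect the result, here is a self-contained sketch. It assumes a recent joblib, which as far as I know ships the `'dask'` backend itself, so no `dask_ml.joblib` import is used. The smaller sample size and grid, `random_state=0`, and the disabled dashboard are illustrative choices rather than the notebook's settings, and `n_alphas` may be deprecated in newer scikit-learn releases.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV

# In-process client with the dashboard disabled (illustrative settings).
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None)

X_small, y_small = make_classification(n_samples=200, random_state=0)
model = LassoCV(n_alphas=50, n_jobs=-1)

# Route joblib's tasks to the Dask cluster and pre-scatter the data.
with joblib.parallel_backend("dask", scatter=[X_small, y_small]):
    model.fit(X_small, y_small)

print("selected alpha:", model.alpha_)
print("grid size:", model.alphas_.shape)
print("CV error path shape:", model.mse_path_.shape)  # (n_alphas, n_folds)

client.close()
```

The fitted attributes are the same as for a local fit; only the place where the cross-validation tasks ran has changed.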