# Dask-ML

## Notebook Objectives
* **Demonstrate scikit-learn**, a library for machine learning in Python.
* Use **Joblib and Dask to leverage parallelism** in case of compute-bound challenges.
* Use **Dask-ML for distributed machine learning** in case of memory-bound challenges.
* Take a brief look at **machine learning in the cloud** for additional computing resources (optional).
* **References** for further reading.

## scikit-learn for machine learning

scikit-learn is a powerful library for machine learning in Python. It provides tools for pre-processing, model training, evaluation, and more.

If your model and data fit on your computer, we recommend using scikit-learn as usual, with no parallelism.

Let's take a look at how you can train machine learning models in scikit-learn.

#### Creating Datasets

We start by generating some synthetic data using scikit-learn's [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) function. `make_classification` generates a random classification problem; here we create one with 100k samples and 10 features.

In [6]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

Let's examine the X and y variables. Note that X represents the set of input variables and y the output/target variables.

In [7]:
X[:5]

array([[-0.7462974 ,  0.19602952,  0.11141229,  0.59340009,  1.32627975,
        -1.10504115, -0.63411817,  1.19223806, -0.32277383, -0.03057938],
       [-0.74584283, -0.24857446,  0.50831426, -0.6628635 ,  1.24896798,
         0.95601408, -2.28687281,  1.12441665, -1.53928374,  0.78151558],
       [-0.62459237, -0.02605275, -0.18403411, -0.94905415,  1.07726998,
         1.18669218,  0.30910096,  0.8074069 , -0.79054371,  0.059631  ],
       [-0.99690131, -0.09017488,  0.67867704,  0.28108283,  1.71104871,
         1.01523959,  0.78247076,  1.26565066, -1.39478782,  1.37608239],
       [ 0.40153919,  0.29434464, -1.76744682,  1.20321684, -0.64477815,
        -0.36214576,  0.61815685,  0.93696374,  1.26810107,  0.2989785 ]])

In [8]:
y[:5]

array([0, 0, 0, 0, 1])

#### k-nearest neighbors Classification

Next, we will implement a [k-NN classifier](https://scikit-learn.org/stable/modules/neighbors.html#classification) that creates a model based on the 'k' nearest neighbors of the query points.

scikit-learn makes it very easy to train this model. All we need to do is call the `fit` method; the `score` method then computes the accuracy (the fraction of samples the model classifies correctly).

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
%%time

neigh = KNeighborsClassifier(n_neighbors=3)
clf = neigh.fit(X, y)

In [None]:
clf.score(X,y)
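
The score above is computed on the same data the model was trained on, which tends to be optimistic for k-NN. As a minimal sketch (not part of the original notebook), the cell below holds out a test set with `train_test_split` and reports both accuracies. The smaller sample size and the `_demo` variable names are assumptions made so the example runs quickly and does not overwrite `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Smaller dataset than the notebook's 100k samples so the sketch runs quickly.
X_demo, y_demo = make_classification(n_samples=2_000, n_features=10, random_state=0)

# Hold out 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = knn.score(X_test, y_test)     # accuracy on held-out data
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The held-out accuracy is usually the more honest estimate of how the model will behave on new data.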

#### Hyperparameter Tuning

Hyperparameters are settings chosen before training that affect how your model performs. For example, in the k-NN example above, the value of k is fixed ahead of time. We might want to check how the model performs with different values of k and select the best one. This process of selecting the best hyperparameters is called hyperparameter tuning.

There are many ways to tune hyperparameters; in this notebook we will look at `GridSearchCV`.

In [10]:
from sklearn.model_selection import GridSearchCV

We can specify the parameters to be explored as shown below, and then run `fit` on all the sets of parameters.

In [11]:
param_grid = {
    'n_neighbors': [3, 5, 8],
    'weights': ['uniform', 'distance'],
}

`verbose` controls how much detail is printed for each fit, and `cv` sets the number of cross-validation folds.

In [12]:
%%time

grid_search = GridSearchCV(clf, param_grid, verbose=2, cv=2)
grid_search.fit(X, y)

Fitting 2 folds for each of 6 candidates, totalling 12 fits
[CV] END .....................n_neighbors=3, weights=uniform; total time=   5.6s
[CV] END .....................n_neighbors=3, weights=uniform; total time=   5.7s
[CV] END ....................n_neighbors=3, weights=distance; total time=   4.3s
[CV] END ....................n_neighbors=3, weights=distance; total time=   4.3s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   7.1s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   9.9s
[CV] END ....................n_neighbors=5, weights=distance; total time=   7.7s
[CV] END ....................n_neighbors=5, weights=distance; total time=   5.8s
[CV] END .....................n_neighbors=8, weights=uniform; total time=   7.3s
[CV] END .....................n_neighbors=8, weights=uniform; total time=   6.5s
[CV] END ....................n_neighbors=8, weights=distance; total time=   5.3s
[CV] END ....................n_neighbors=8, weigh

GridSearchCV(cv=2, estimator=KNeighborsClassifier(n_neighbors=3),
             param_grid={'n_neighbors': [3, 5, 8],
                         'weights': ['uniform', 'distance']},
             verbose=2)

Note the time taken!
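
The "12 fits" reported in the log is the number of parameter combinations multiplied by the number of folds. The short sketch below uses scikit-learn's `ParameterGrid` to enumerate the candidates; the names `demo_grid` and `n_folds` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, under a separate name.
demo_grid = {
    "n_neighbors": [3, 5, 8],
    "weights": ["uniform", "distance"],
}
n_folds = 2  # matches cv=2 in the GridSearchCV call

# ParameterGrid expands the dictionary into every combination of values.
candidates = list(ParameterGrid(demo_grid))
for params in candidates:
    print(params)

n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full dataset; that final fit is not counted in the log line.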

Now, we can check what the best parameters were and the best score they produced.

In [13]:
grid_search.best_params_

{'n_neighbors': 8, 'weights': 'distance'}

In [14]:
grid_search.best_score_

0.8952100000000001
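
Beyond the single best result, the fitted search object keeps the score of every candidate in its `cv_results_` attribute. The following is a small sketch of how one might rank all candidates; it refits a search on a reduced dataset (an assumption, to keep it fast), so its scores will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Reduced dataset so the sketch runs in a few seconds.
X_demo, y_demo = make_classification(n_samples=2_000, n_features=10, random_state=0)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 8], "weights": ["uniform", "distance"]},
    cv=2,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate.
results = demo_search.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],  # rank 1 is the best candidate
)
for rank, score, params in ranked:
    print(rank, round(float(score), 4), params)
```

The first row of `ranked` corresponds to `best_params_` and `best_score_`.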

## joblib and Dask for compute-bound problems

If your data fits in memory but your model is complex, a general solution is to leverage parallel computing.

### Single machine parallelism: scikit-learn + joblib

scikit-learn offers **single-machine parallelism** using a tool called Joblib. We can parallelize some algorithms by passing the number of cores in the `n_jobs` parameter.

Let's look at GridSearchCV again, but this time we will use all available CPU cores. To do this, we can set `n_jobs=-1`. Note that you can also specify the exact number of cores to use; for example, `n_jobs=4` will use 4 cores.

In [15]:
%%time

grid_search = GridSearchCV(clf, param_grid, cv=2, n_jobs=-1)
grid_search.fit(X, y)

CPU times: user 176 ms, sys: 118 ms, total: 294 ms
Wall time: 32.4 s


GridSearchCV(cv=2, estimator=KNeighborsClassifier(n_neighbors=3), n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 8],
                         'weights': ['uniform', 'distance']})

Notice how the compute time is reduced by almost half!
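
Under the hood, `n_jobs` is handed to Joblib, which distributes independent function calls across workers. As a rough illustration of that pattern (not the code scikit-learn itself runs), here is Joblib's `Parallel` and `delayed` applied to a trivial task:

```python
from joblib import Parallel, delayed

# Each delayed(...) call describes one independent task: pow(i, 2).
# Parallel runs the tasks on 2 workers and returns results in input order.
squares = Parallel(n_jobs=2)(delayed(pow)(i, 2) for i in range(8))
print(squares)
```

In a grid search, each task is one fit of one candidate on one fold, which is why the work parallelizes so naturally.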

### Multi-machine parallelism: scikit-learn + joblib + Dask

Dask offers a *parallel backend* to scale this computation to a cluster.

First, let's spin up a cluster and open the dashboard plots!

In [24]:
import joblib
from dask.distributed import Client

client = Client(n_workers=4)
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 55138 instead


Client
  Scheduler: tcp://127.0.0.1:55139
  Dashboard: http://127.0.0.1:55138/status
Cluster
  Workers: 4
  Cores: 12
  Memory: 16.00 GiB


Continuing with the previous GridSearchCV example, we can use Dask as the Joblib backend as shown below:

In [25]:
%%time

with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

CPU times: user 5.3 s, sys: 2.57 s, total: 7.87 s
Wall time: 36 s
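
For reference, here is a self-contained sketch of the same pattern on a much smaller problem. It assumes `dask.distributed` is installed and starts its own in-process client (named `demo_client` so it does not replace the notebook's `client`); the dataset size and grid are illustrative choices, not values from this notebook.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Tiny dataset so the sketch finishes quickly.
X_small, y_small = make_classification(n_samples=500, n_features=10, random_state=0)

small_search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5]}, cv=2)

# Threads-only local cluster without a dashboard; closed automatically on exit.
with Client(
    processes=False, n_workers=1, threads_per_worker=2, dashboard_address=None
) as demo_client:
    # Route Joblib's tasks (the individual fits) through the Dask scheduler.
    with joblib.parallel_backend("dask", client=demo_client):
        small_search.fit(X_small, y_small)

print(small_search.best_params_)
```

On a single machine the Dask backend adds some scheduling and data-transfer overhead, which is consistent with the timings above (36 s versus 32.4 s with plain Joblib). The benefit is expected to appear when the client points at a multi-machine cluster.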


## Checkpoint

**Question:** Fit a `LogisticRegressionCV` model on the given data. Implement it with and without parallelism and note the time. Reference: [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegressionCV)

In [25]:
from sklearn.linear_model import LogisticRegressionCV

In [None]:
# Your answer here

In [38]:
%%time

# Without parallelism
clf = LogisticRegressionCV(cv=4, random_state=0).fit(X, y)

CPU times: user 11.3 s, sys: 206 ms, total: 11.5 s
Wall time: 1.01 s


In [40]:
%%time

# With parallelism (We can have all 4 folds execute in parallel!)
clf = LogisticRegressionCV(cv=4, random_state=0, n_jobs=4).fit(X, y)

CPU times: user 265 ms, sys: 46.1 ms, total: 311 ms
Wall time: 657 ms


## Dask-ML for memory-bound problems

Memory-bound problems arise when your dataset is too large to fit in memory. This is where Dask can help. In the previous course, we saw how Dask DataFrame can be used to perform pandas-like operations on larger-than-memory data. Similarly, we can use Dask-ML to perform scikit-learn-like operations on our large datasets.
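
To make the idea of larger-than-memory data more concrete, the sketch below shows how a Dask array splits data into chunks and evaluates lazily. It assumes `dask[array]` is installed and, for simplicity, wraps an in-memory NumPy array; in a real memory-bound workflow the chunks would be loaded from disk or remote storage instead.

```python
import numpy as np
import dask.array as da

# Stand-in data; real larger-than-memory data would not start as a NumPy array.
rng = np.random.default_rng(0)
X_np = rng.normal(size=(100_000, 10))

# Split the rows into 4 chunks of 25,000 samples each.
X_da = da.from_array(X_np, chunks=(25_000, 10))
print(X_da.numblocks)

# Operations build a task graph; nothing is computed until .compute().
col_means_lazy = X_da.mean(axis=0)
col_means = col_means_lazy.compute()
print(col_means.shape)
```

Dask-ML estimators generally accept chunked arrays like `X_da`, so only a few chunks need to be in memory at any one time.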

We can use Dask-ML on the previous GridSearchCV example, but this time, with more parameters.

In [26]:
import dask_ml.model_selection as dcv

In [27]:
param_grid = {
    'n_neighbors': [3, 5, 8],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree'],
}

In [28]:
%%time

# Use the k-NN estimator defined earlier; `clf` was reassigned in the checkpoint above.
grid_search = dcv.GridSearchCV(neigh, param_grid, cv=2)
grid_search.fit(X, y)

CPU times: user 34.9 s, sys: 5.52 s, total: 40.5 s
Wall time: 4min 14s


GridSearchCV(cv=2, estimator=KNeighborsClassifier(n_neighbors=3),
             param_grid={'algorithm': ['auto', 'ball_tree'],
                         'n_neighbors': [3, 5, 8],
                         'weights': ['uniform', 'distance']})

Let's look at another algorithm: Logistic Regression using Dask-ML. As Dask-ML implements the scikit-learn API, the code is similar.

In [29]:
from dask_ml.linear_model import LogisticRegression

In [30]:
%%time

clf = LogisticRegression().fit(X,y)
clf.score(X,y)

CPU times: user 609 ms, sys: 136 ms, total: 744 ms
Wall time: 1.82 s


0.88207

In [31]:
clf.predict(X)[:5]

array([False, False, False, False,  True])
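
Note that the predictions above are booleans, while the labels in `y` are 0 and 1. A small NumPy sketch of converting and comparing them follows; the two arrays are copied by hand from the outputs shown in this notebook rather than recomputed.

```python
import numpy as np

# Values copied from the outputs of clf.predict(X)[:5] and y[:5] above.
pred_first5 = np.array([False, False, False, False, True])
y_first5 = np.array([0, 0, 0, 0, 1])

# Cast booleans to integers so they can be compared with the 0/1 labels.
pred_int = pred_first5.astype(int)
print(pred_int)

# Fraction of these five predictions that match the labels.
match_rate = (pred_int == y_first5).mean()
print(match_rate)
```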

That's it!

## Checkpoint

**Question:** Use Dask-ML to implement a [Naive Bayes classifier](https://ml.dask.org/naive-bayes.html) on the given dataset.

In [None]:
# Your answer here

In [45]:
from dask_ml.naive_bayes import GaussianNB

clf = GaussianNB().fit(X,y)
clf.predict(X)[:5].compute()

array([0, 0, 0, 0, 1])

Finally, let's close the cluster.

In [None]:
client.close()

## Machine Learning in the Cloud (Optional)

As we saw in the first course, Dask can also scale this computation to the cloud! There are many ways to do this, but here, we will be using Coiled. Coiled provides cluster-as-a-service functionality to provision hosted Dask clusters. It manages software environments, networking, etc. so that we can connect to the cloud quickly.

To get started, sign up on [cloud.coiled.io](https://cloud.coiled.io) and get your Coiled login token. Then, in a terminal (or command prompt), run `coiled login` and enter your token when prompted.

That's it! We can work from this same notebook now. We can import coiled and create a cluster as shown below:

In [1]:
import coiled

cluster = coiled.Cluster(n_workers=10)

Found software environment build


In [2]:
from dask.distributed import Client

client = Client(cluster)

print('Dashboard:', client.dashboard_link)

Dashboard: http://ec2-54-158-32-172.compute-1.amazonaws.com:8787



+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc   | None   | 1.10.2    | 1.10.2  |
| lz4     | None   | 3.1.3     | 3.1.3   |
| numpy   | 1.20.3 | 1.21.0    | 1.21.0  |
+---------+--------+-----------+---------+


Note that the dashboard link points to AWS.

Now, let's run Dask-ML's KMeans on some data generated with scikit-learn.

In [3]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=5, random_state=0)

In [4]:
X.shape

(100, 5)

In [5]:
from dask_ml.cluster import KMeans

In [6]:
%%time

clf = KMeans().fit(X)

CPU times: user 883 ms, sys: 39.8 ms, total: 923 ms
Wall time: 26 s


In [7]:
clf.labels_

|       | Array    | Chunk         |
|-------|----------|---------------|
| Bytes | 400 B    | 32 B          |
| Shape | (100,)   | (8,)          |
| Count | 78 Tasks | 13 Chunks     |
| Type  | int32    | numpy.ndarray |


In [8]:
clf.labels_[:10].compute()

array([2, 4, 0, 6, 0, 7, 7, 5, 1, 2], dtype=int32)

In [9]:
client.close()
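
For comparison, here is a sketch of the same clustering task with scikit-learn's own `KMeans`, which runs locally and returns labels as a plain NumPy array. Setting `n_clusters=3` is an assumption based on `make_blobs` generating three centers by default; the Dask-ML cell above used the estimator's default number of clusters instead. The `SKKMeans` alias and the `_blobs` names are used only to avoid clashing with earlier variables.

```python
from sklearn.cluster import KMeans as SKKMeans
from sklearn.datasets import make_blobs

# Same generator settings as above; make_blobs creates 3 centers by default.
X_blobs, y_blobs = make_blobs(n_samples=100, n_features=5, random_state=0)

# Fit k-means locally with one cluster per generated blob.
sk_kmeans = SKKMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)

sk_labels = sk_kmeans.labels_  # NumPy array, no .compute() needed
print(sk_labels[:10])
print(sk_kmeans.cluster_centers_.shape)
```

For a dataset this small, the local version is likely the more practical choice; the cloud cluster pays off as the data grows.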

## References

* [Dask-ML documentation](https://ml.dask.org/)
* [Dask Examples - Machine Learning](https://examples.dask.org/machine-learning.html)
* [Dask Tutorial - Machine Learning](https://tutorial.dask.org/08_machine_learning.html)