# Dask Machine Learning

- Dask-ML enables scalable machine learning.
- It comes with explicit support for certain libraries, such as XGBoost through dask-xgboost.
- It works with existing machine learning libraries such as scikit-learn, TensorFlow, and Keras.
- Large models: exploit parallelism with delayed execution, hyperparameter tuning, etc.
- Large data: use Dask collections to manage memory.

#### CPU Bound vs Memory Bound Machine Learning Models
<img src="https://raw.githubusercontent.com/dmbala/python-bigData/main/Figures/cpu_mem_bound.png" width=500 height=400>


#### Distributed Machine Learning across multiple nodes

<img src="https://raw.githubusercontent.com/dmbala/python-bigData/main/Figures/DaskDistributedJob.png" width=500 height=200>

### Compute Bound

- Distribute training and prediction across multiple nodes. 
- Hyperparameter tuning

### Memory Bound 
- Blockwise Ensemble Methods
- Incremental Learning

### Compute and Memory Bound
Re-implemented models such as dask-xgboost and dask-knn are efficient for both CPU-intensive and memory-intensive computations.

In [None]:
!pip install dask-ml

In [None]:
!pip install memory_profiler

In [None]:
# Importing dask 
import dask
import dask.array as da
import dask.dataframe as dd
from dask import delayed
import dask_ml.datasets
import dask_ml.cluster
import time
%load_ext memory_profiler
dask.__version__

## Blockwise Ensemble Methods

- Ensemble methods such as bagging and forests of randomized trees are well suited to blockwise approaches.
- Create homogeneous data blocks from dask.array or dask.dataframe.
- Train a copy of the model on each block.
- At prediction time, combine the outputs of the trained models (an ensemble average or vote).
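
The steps above can be sketched in plain Python. This is only an illustration of the idea, not Dask-ML's implementation: `ThresholdClassifier`, `split_blocks`, `fit_blockwise`, and `predict_by_vote` are hypothetical names, and the toy one-dimensional classifier stands in for a real scikit-learn estimator.

```python
from collections import Counter
from statistics import mean


class ThresholdClassifier:
    """Toy stand-in for a scikit-learn estimator (hypothetical, 1-D input only)."""

    def fit(self, xs, ys):
        # Place the decision threshold midway between the two class means.
        mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
        mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
        self.threshold_ = (mean0 + mean1) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold_ else 0 for x in xs]


def split_blocks(seq, block_size):
    """Cut a sequence into consecutive blocks, similar to the chunks of a Dask array."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def fit_blockwise(xs, ys, block_size):
    """Train one independent model per block."""
    x_blocks = split_blocks(xs, block_size)
    y_blocks = split_blocks(ys, block_size)
    return [ThresholdClassifier().fit(bx, by) for bx, by in zip(x_blocks, y_blocks)]


def predict_by_vote(models, xs):
    """Collect every model's prediction and keep the most common label per sample."""
    per_model = [m.predict(xs) for m in models]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_model)]


# Small hand-made data set: class 0 lies near 0, class 1 lies near 10.
xs = [0.0, 1.0, 9.0, 10.0, 0.5, 1.5, 8.5, 9.5, 1.0, 2.0, 9.0, 11.0]
ys = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

models = fit_blockwise(xs, ys, block_size=4)      # three blocks, three models
predictions = predict_by_vote(models, [0.2, 9.8, 3.0])
print(len(models), predictions)
```

Each model only ever sees one block, so the memory needed for training is bounded by the block size rather than by the full data set.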

In [None]:
# A classification example from dask_ml.datasets
X, y = dask_ml.datasets.make_classification(n_samples=10000, chunks=1000, random_state=0)
X

The sub-estimator should be an instantiated scikit-learn-API compatible estimator (anything that implements the fit / predict API, including pipelines). It only needs to handle in-memory datasets. We’ll use sklearn.ensemble.RandomForestClassifier.

In [None]:
import dask_ml.ensemble
from sklearn.ensemble import RandomForestClassifier
subestimator = RandomForestClassifier(random_state=0)
clf = dask_ml.ensemble.BlockwiseVotingClassifier(
    subestimator,
    classes=[0, 1]
)
clf

We can train the ensemble of models on data chunks. This will independently fit a clone of subestimator on each partition of X and y.

In [None]:
clf.fit(X, y)

In [None]:
clf.estimators_

Different estimators were trained on separate batches of data. Each estimator has its own set of parameters. 

In [None]:
preds = clf.predict(X[:20])
preds.compute()

The prediction calls subestimator.predict(chunk) for each fitted subestimator (10 in our case, one per training block). With the default hard voting, these subestimator predictions are combined by majority vote at the end.

The blockwise algorithm was applied to the training and the prediction steps. 

In [None]:
%%time
%memit clf.score(X, y)

### Predictions on large data sets

In [None]:
# Build a larger data set by concatenating N copies of X and y
N = 10
X_large = da.concatenate([X for _ in range(N)])
y_large = da.concatenate([y for _ in range(N)])
X_large

In [None]:
# rechunk returns a new array, so assign the result back
X_large = X_large.rechunk({0: 10000})
X_large

In [None]:
y_large = y_large.rechunk(10000)
y_large

In [None]:
clf.score(X_large, y_large)

In [None]:
%%time
%memit clf.score(X, y)

In [None]:
%%time
%memit clf.score(X_large, y_large)

## Incremental learning

- Some estimators can be trained incrementally. This is useful for online training as well as for training on large data sets.

- Scikit-Learn provides the partial_fit method for incremental learning. It is available on estimators based on stochastic gradient descent, mini-batch K-means, passive-aggressive algorithms, and naive Bayes.

- dask_ml.wrappers.Incremental acts as a bridge between Dask and Scikit-Learn estimators supporting the partial_fit API. 
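
The idea behind partial_fit can be sketched in plain Python: the model keeps its parameters between calls, and blocks of data are passed to it one at a time, so only one block has to be in memory at once. `TinyPerceptron` is a hypothetical toy model written for this illustration; it is not how scikit-learn or Dask-ML implement their estimators.

```python
class TinyPerceptron:
    """Toy 1-D perceptron exposing a partial_fit method (hypothetical stand-in)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0
        self.n_seen = 0

    def predict(self, xs):
        return [1 if self.weight * x + self.bias > 0 else 0 for x in xs]

    def partial_fit(self, xs, ys):
        # Update the existing parameters instead of starting from scratch.
        for x, y in zip(xs, ys):
            error = y - self.predict([x])[0]
            self.weight += self.learning_rate * error * x
            self.bias += self.learning_rate * error
        self.n_seen += len(xs)
        return self


# Each tuple is one block of (features, labels); negative x is class 0.
blocks = [
    ([-2.0, 1.5, -1.0, 2.0], [0, 1, 0, 1]),
    ([-1.5, 1.0, -2.5, 3.0], [0, 1, 0, 1]),
    ([-0.8, 0.9, -3.0, 2.5], [0, 1, 0, 1]),
]

model = TinyPerceptron()
for block_x, block_y in blocks:
    # Blocks are consumed sequentially; the model state carries over.
    model.partial_fit(block_x, block_y)

print(model.n_seen, model.predict([-3.0, 3.0]))
```

Because each update depends on the state left by the previous block, the blocks are processed one after another rather than in parallel.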


In [None]:
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

In [None]:
X, y = dask_ml.datasets.make_classification(n_samples=10000, chunks=1000, random_state=0)
X

In [None]:
estimator = SGDClassifier(random_state=10, max_iter=100)
clf = Incremental(estimator)
clf.fit(X, y, classes=[0, 1])

As usual with Dask-ML, scoring is done in parallel (and distributed on a cluster if you’re connected to one).

In [None]:
clf.score(X, y)
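
Accuracy is a convenient metric to compute block by block, because each block only has to report how many predictions were correct and how many samples it held. The sketch below illustrates that reduction with a thread pool from the standard library as a stand-in for a Dask scheduler; `count_correct` and `blockwise_accuracy` are hypothetical names, not Dask-ML functions.

```python
from concurrent.futures import ThreadPoolExecutor


def count_correct(block):
    """Return (number of correct predictions, block size) for one block."""
    y_true, y_pred = block
    return sum(t == p for t, p in zip(y_true, y_pred)), len(y_true)


def blockwise_accuracy(blocks, max_workers=2):
    """Score each block independently, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(count_correct, blocks))
    correct = sum(c for c, _ in partials)
    total = sum(n for _, n in partials)
    return correct / total


# Each tuple is (true labels, predicted labels) for one block.
blocks = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # 4 of 4 correct
    ([0, 1], [1, 1]),              # 1 of 2 correct
]

accuracy = blockwise_accuracy(blocks)
print(accuracy)
```

Combining counts rather than averaging per-block accuracies keeps the result correct when blocks have different sizes, as the last block does here.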

## Hyperparameter Search - Support Vector Classifier (CPU Bound)

<img src="https://raw.githubusercontent.com/dmbala/python-bigData/main/Figures/svc.png" width=500 height=400>

 https://www.datacamp.com/tutorial/svm-classification-scikit-learn-python

In [None]:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pandas as pd
import joblib

In [None]:
from dask.distributed import Client, LocalCluster
client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')
client 

In [None]:
X, y = make_classification(n_samples=1000, random_state=0)
X[:2]

In [None]:
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 2.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           n_jobs=-1,
                           cv=3)
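
To see why this search is CPU bound, it helps to count the model fits it implies. The sketch below enumerates the same grid with itertools.product; the variable names `candidates`, `n_candidates`, and `n_fits` are chosen here for illustration only.

```python
from itertools import product

param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 2.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}
cv = 3

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)   # 5 * 3 * 2 combinations
n_fits = n_candidates * cv       # each candidate is fitted once per CV fold

print(n_candidates, n_fits)
print(candidates[0])
```

The fits are independent of each other, which is what makes the search easy to spread across cores or workers. With the default refit=True, GridSearchCV fits the best candidate once more on the full data set.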
                           

In [None]:
%%time
grid_search.fit(X, y)

In [None]:
%%time
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)

In [None]:
grid_search.score(X, y)

In [None]:
client.shutdown()

## Summary
- Use dask-ml and Dask collections to manage large data for machine learning.
- Hyperparameter tuning can be accelerated by distributing the jobs across multiple workers or machines.