# 4.9 Machine learning with Dask-ML

## *Subjects covered*

* Building machine learning models using the Dask-ML API
* Using the Dask-ML API to extend scikit-learn
* Validating models and tuning hyperparameters using cross-validated grid search
* Using serialization to save and publish trained models

## *Content*

- [Text data preparation using Dask bags](#Text-data-preparation-using-Dask-bags)
- [Building linear models with Dask-ML](#Building-linear-models-with-Dask-ML)
- [Evaluating and tuning Dask-ML models](#Evaluating-and-tuning-Dask-ML-models)
- [Persisting Dask-ML models](#Persisting-Dask-ML-models)
- [Summary](#Summary)

## Text data preparation using Dask bags

**Tagging the review data based on the review score**

Get the data here: https://snap.stanford.edu/data/web-FineFoods.html

In [None]:
# Listing 10.1
import dask.bag as bag
from dask.diagnostics import ProgressBar

# Adjust this path to the folder where foods.txt is stored
directory_path = 'C:\\Users\\olive\\Desktop\\DAT300\\'

file_name = 'foods.txt'
file_path = directory_path + file_name
raw_data = bag.read_text(file_path)

def get_next_part(file, start_index, span_index=0, blocksize=1024):
    file.seek(start_index)
    buffer = file.read(blocksize + span_index).decode('cp1252')
    delimiter_position = buffer.find('\n\n')
    if delimiter_position == -1:
        return get_next_part(file, start_index, span_index + blocksize)
    else:
        file.seek(start_index)
        return start_index, delimiter_position
    
def get_item(filename, start_index, delimiter_position, encoding='cp1252'):
    with open(filename, 'rb') as file_handle:
        file_handle.seek(start_index)
        text = file_handle.read(delimiter_position).decode(encoding)
        elements = text.strip().split('\n')
        key_value_pairs = [(element.split(': ')[0], element.split(': ')[1]) 
                               if len(element.split(': ')) > 1 
                               else ('unknown', element) 
                               for element in elements]
        return dict(key_value_pairs)
    
with open(file_path, 'rb') as file_handle:
    size = file_handle.seek(0,2) - 1
    more_data = True
    output = []
    current_position = next_position = 0
    while more_data:
        if current_position >= size:
            more_data = False
        else:
            current_position, next_position = get_next_part(file_handle, current_position, 0)
            output.append((current_position, next_position))
            current_position = current_position + next_position + 2
            
reviews = bag.from_sequence(output).map(lambda x: get_item(file_path, x[0], x[1]))

def tag_positive_negative_by_score(element):
    if float(element['review/score']) > 3:
        element['review/sentiment'] = 'positive'
    else:
        element['review/sentiment'] = 'negative'
    return element

tagged_reviews = reviews.map(tag_positive_negative_by_score)

In [None]:
tagged_reviews.take(2)
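
To make the record format concrete, here is a minimal standalone sketch of what parsing and tagging do to a single record. The sample record is invented for illustration, and `parse_record` is a hypothetical, simplified stand-in for `get_item` that works on an in-memory string instead of a byte range in the file.

```python
# A made-up record in the "key: value" layout (illustrative only).
sample_record = (
    "product/productId: B000000000\n"
    "review/score: 5.0\n"
    "review/summary: Great snack\n"
    "review/text: Tasty and fresh"
)

def parse_record(text):
    """Turn each 'key: value' line into a dictionary entry."""
    pairs = []
    for line in text.strip().split('\n'):
        key, separator, value = line.partition(': ')
        # Lines without a 'key: value' separator are stored under 'unknown'.
        pairs.append((key, value) if separator else ('unknown', line))
    return dict(pairs)

def tag_positive_negative_by_score(element):
    """Scores above 3 count as positive, everything else as negative."""
    element['review/sentiment'] = (
        'positive' if float(element['review/score']) > 3 else 'negative'
    )
    return element

tagged = tag_positive_negative_by_score(parse_record(sample_record))
print(tagged['review/score'], tagged['review/sentiment'])
```

Each element of `tagged_reviews` should have this shape: a dictionary of review fields plus the added `review/sentiment` key.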

**Tokenizing text and removing stopwords**

In [None]:
# Listing 10.2
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from functools import partial

tokenizer = RegexpTokenizer(r'\w+')

def extract_reviews(element):
    element['review/tokens'] = element['review/text'].lower()
    return element

def tokenize_reviews(element):
    element['review/tokens'] = tokenizer.tokenize(element['review/tokens'])
    return element

def filter_stopword(word, stopwords):
    return word not in stopwords

def filter_stopwords(element, stopwords):
    element['review/tokens'] = list(filter(partial(filter_stopword, stopwords=stopwords), element['review/tokens']))
    return element

stopword_set = set(stopwords.words('english'))
more_stopwords = {'br', 'amazon', 'com', 'http', 'www', 'href', 'gp'}
all_stopwords = stopword_set.union(more_stopwords)

review_extracted_text = tagged_reviews.map(extract_reviews)
review_tokens = review_extracted_text.map(tokenize_reviews)
review_text_clean = review_tokens.map(partial(filter_stopwords, stopwords=all_stopwords))
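
The same three steps (lowercase, tokenize, filter) can be tried on a single string without Dask or NLTK. This is only a sketch: `re.findall` with the same pattern approximates the `RegexpTokenizer`, and `demo_stopwords` is a tiny hypothetical set standing in for the NLTK English list.

```python
import re
from functools import partial

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
demo_stopwords = {'the', 'is', 'and', 'br'}

def tokenize(text):
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r'\w+', text.lower())

def filter_stopword(word, stopwords):
    return word not in stopwords

def remove_stopwords(tokens, stopwords):
    return list(filter(partial(filter_stopword, stopwords=stopwords), tokens))

tokens = tokenize("The coffee is GREAT<br />and cheap!")
clean_tokens = remove_stopwords(tokens, demo_stopwords)
print(clean_tokens)
```

The example also shows why `br` is in the extra stopword list: HTML line breaks in the review text would otherwise survive tokenization as a word.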

**Counting the unique words in the Amazon Fine Foods review set**

In [None]:
# Listing 10.3
def extract_tokens(element):
    return element['review/tokens']

extracted_tokens = review_text_clean.map(extract_tokens)
unique_tokens = extracted_tokens.flatten().distinct()

with ProgressBar():
    number_of_tokens = unique_tokens.count().compute()
number_of_tokens

**Finding the top 100 most common words in the reviews dataset**

In [None]:
# Listing 10.4
def count(accumulator, element):
    return accumulator + 1

def combine(total_1, total_2):
    return total_1 + total_2

with ProgressBar():
    token_counts = extracted_tokens.flatten().foldby(lambda x: x, count, 0, combine, 0).compute()
    
top_tokens = sorted(token_counts, key=lambda x: x[1], reverse=True)
top_100_tokens = list(map(lambda x: x[0], top_tokens[:100]))
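
The roles of `count` and `combine` in `foldby` are easier to see in plain Python. The sketch below imitates the two phases on two hypothetical partitions: `count` folds the elements of one partition per key, and `combine` merges the per-partition totals. The helper functions `fold_partition` and `merge_partitions` are illustrative names, not part of Dask.

```python
def count(accumulator, element):
    return accumulator + 1

def combine(total_1, total_2):
    return total_1 + total_2

def fold_partition(tokens, binop, initial):
    """Phase 1: within one partition, fold the elements of each key."""
    totals = {}
    for token in tokens:
        totals[token] = binop(totals.get(token, initial), token)
    return totals

def merge_partitions(partials, combine, initial):
    """Phase 2: merge the per-partition totals key by key."""
    merged = {}
    for partial_totals in partials:
        for key, value in partial_totals.items():
            merged[key] = combine(merged.get(key, initial), value)
    return merged

# Two hypothetical partitions of tokens.
partitions = [['tea', 'good', 'tea'], ['good', 'tea', 'coffee']]
partials = [fold_partition(p, count, 0) for p in partitions]
token_counts = list(merge_partitions(partials, combine, 0).items())

top_tokens = sorted(token_counts, key=lambda x: x[1], reverse=True)
print(top_tokens)
```

As in the listing, the result is a list of `(token, count)` pairs, which is why the sort key is `x[1]` and the token itself is `x[0]`.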

**Generating training data by applying binary vectorization**

In [None]:
# Listing 10.5
import numpy as np
def vectorize_tokens(element):
    vectorized_tokens = np.where(np.isin(top_100_tokens, element['review/tokens']), 1, 0)
    element['review/token_vector'] = vectorized_tokens
    return element

def prep_model_data(element):
    return {'target': 1 if element['review/sentiment'] == 'positive' else 0,
            'features': element['review/token_vector']}

model_data = review_text_clean.map(vectorize_tokens).map(prep_model_data)

model_data.take(5)
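
A small example may help to see what binary vectorization produces. Here `top_tokens` is a hypothetical three-word stand-in for `top_100_tokens`, and the review tokens are invented.

```python
import numpy as np

# Hypothetical stand-in for top_100_tokens, shortened to three words.
top_tokens = ['good', 'coffee', 'tea']
review_tokens = ['tea', 'tastes', 'good']

# np.isin tests each top token for membership in the review's tokens;
# np.where turns the boolean mask into 1s and 0s.
token_vector = np.where(np.isin(top_tokens, review_tokens), 1, 0)
print(token_vector)
```

The vector has one position per top token, in the order of the top-token list, so every review maps to a vector of the same length regardless of how many words it contains.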

**Creating the feature array**

In [None]:
# Listing 10.6
from dask import array as dask_array
def stacker(partition):
    return dask_array.concatenate([element for element in partition])

with ProgressBar():
    feature_arrays = model_data.pluck('features').map(lambda x: dask_array.from_array(x, 1000).reshape(1,-1)).reduction(perpartition=stacker, aggregate=stacker)
    feature_array = feature_arrays.compute()
feature_array

**Writing the data to Zarr and reading it back in**

**Note**

You need the **zarr** Python package to run this code, because the arrays are stored in the Zarr format. It is available in the Anaconda distribution.

In [None]:
# Listing 10.7
with ProgressBar():
    feature_array.rechunk(5000).to_zarr('sentiment_feature_array.zarr')
    feature_array = dask_array.from_zarr('sentiment_feature_array.zarr')
    
with ProgressBar():
    target_arrays = model_data.pluck('target').map(lambda x: dask_array.from_array(x, 1000).reshape(-1,1)).reduction(perpartition=stacker, aggregate=stacker)
    target_arrays.compute().rechunk(5000).to_zarr('sentiment_target_array.zarr')
    target_array = dask_array.from_zarr('sentiment_target_array.zarr')

## Building linear models with Dask-ML

**Building the logistic regression**

In [9]:
# Listing 10.8
from dask_ml.linear_model import LogisticRegression
from dask_ml.model_selection import train_test_split

In [10]:
X = feature_array
y = target_array.flatten()

In [11]:
X

| | Array | Chunk |
|---|---|---|
| Bytes | 227.38 MB | 2.00 MB |
| Shape | (568454, 100) | (5000, 100) |
| Count | 115 Tasks | 114 Chunks |
| Type | int32 | numpy.ndarray |


In [12]:
y

| | Array | Chunk |
|---|---|---|
| Bytes | 2.27 MB | 20.00 kB |
| Shape | (568454,) | (5000,) |
| Count | 229 Tasks | 114 Chunks |
| Type | int32 | numpy.ndarray |
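
The numbers in these array summaries follow from the shape, the chunk size of 5,000 rows, and the 4 bytes used by each `int32` value. The short calculation below reproduces them as a sanity check; it assumes the summaries report megabytes as 10^6 bytes.

```python
import math

rows, columns = 568454, 100   # shape of the feature array X
chunk_rows = 5000             # rows per chunk after rechunk(5000)
bytes_per_value = 4           # int32

number_of_chunks = math.ceil(rows / chunk_rows)
array_megabytes = rows * columns * bytes_per_value / 1e6
chunk_megabytes = chunk_rows * columns * bytes_per_value / 1e6
target_megabytes = rows * bytes_per_value / 1e6

print(number_of_chunks, round(array_megabytes, 2),
      chunk_megabytes, round(target_megabytes, 2))
```

The last chunk is smaller than the others: 113 full chunks cover 565,000 rows, leaving 3,454 rows for chunk number 114.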


Note that Dask-ML’s ``train_test_split`` function does a 90/10 train/test split by default, unlike scikit-learn’s version, which defaults to 75/25.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [14]:
lr = LogisticRegression()

The ``fit`` method is not lazy: computation starts as soon as ``fit`` is called, so ``.compute()`` is not needed. ``ProgressBar`` is therefore used here to monitor progress.

In [15]:
with ProgressBar():
    lr.fit(X_train, y_train)

[########################################] | 100% Completed |  2.1s
[########################################] | 100% Completed |  5.7s
[########################################] | 100% Completed |  5.7s
[########################################] | 100% Completed |  6.2s
[########################################] | 100% Completed |  7.7s
[########################################] | 100% Completed |  5.9s
[########################################] | 100% Completed |  5.8s
[########################################] | 100% Completed |  5.7s
[########################################] | 100% Completed |  6.2s
[########################################] | 100% Completed |  7.0s
[########################################] | 100% Completed |  6.6s
[########################################] | 100% Completed |  5.8s
[########################################] | 100% Completed |  6.8s
[########################################] | 100% Completed |  6.5s

## Evaluating and tuning Dask-ML models

**Scoring the logistic regression model**

In [16]:
# Listing 10.9
lr.score(X_test, y_test).compute()

0.7968370685712275
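
Assuming ``score`` reports mean accuracy here, as it does for scikit-learn classifiers, the value above is the fraction of test reviews whose predicted label matches the true label. A minimal sketch with invented labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)

# Hypothetical labels and predictions for five reviews.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))
```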

### scikit-learn and ``partial_fit``

The ``partial_fit`` method is available on some models to allow for batch training. This
allows you to effectively “update” a model with additional data rather than retraining
from scratch every time your training data has been refreshed. It can also be used to
train models from large datasets that can’t be held in memory all at once. For example,
a model could be trained by loading 1,000 rows of a DataFrame, training the model on
those 1,000 rows, loading the next 1,000 rows, continuing to train, and so forth. Dask-ML
uses this interface to train scikit-learn models with minimal configuration by the user.
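
The batch-by-batch loop described above can be sketched directly with scikit-learn, without Dask. The data below are synthetic and generated only for illustration; the batch size of 1,000 rows follows the example in the text.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features and labels, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 100))
y = rng.integers(0, 2, size=5000)

model = BernoulliNB()
batch_size = 1000
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # classes lists every label up front, because a single batch
    # is not guaranteed to contain all of them.
    model.partial_fit(X[start:stop], y[start:stop], classes=[0, 1])

# Total number of training rows the model has seen across all batches.
print(model.class_count_.sum())
```

Only one batch needs to be in memory at a time, which is the property Dask-ML relies on when it feeds the chunks of a Dask array to the estimator.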

**Models in scikit-learn that implement ``partial_fit`` (Incremental learning)**

https://scikit-learn.org/stable/modules/computing.html

**Training a naïve Bayes classifier with the Incremental wrapper**

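* Import a naïve Bayes classifier from scikit-learn
* Wrap the estimator in the ``Incremental`` wrapper
* Call ``fit`` on the ``Incremental`` wrapped estimator
* Note that the classes **must be predefined** and passed to ``fit`` through the ``classes`` argument, because a single batch may not contain every class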

In [17]:
# Listing 10.10
from sklearn.naive_bayes import BernoulliNB
from dask_ml.wrappers import Incremental

nb = BernoulliNB()

parallel_nb = Incremental(nb)

with ProgressBar():
    parallel_nb.fit(X_train, y_train, classes=[0,1])

[########################################] | 100% Completed |  2.4s


* A growing number of scikit-learn algorithms support this interface, driven by growing interest in batch learning for large datasets
* The naïve Bayes algorithms support batch learning, so they can easily be used with Dask to parallelize training
* The ``Incremental`` wrapper is essentially a helper that tells Dask about the estimator object so it can pass it off to the workers for training

**Scoring the Incremental wrapped model**

In [18]:
# Listing 10.11
parallel_nb.score(X_test, y_test)

0.7898356964430215

**Using ``GridSearchCV`` to tune hyperparameters**

In [19]:
# Listing 10.12
from dask_ml.model_selection import GridSearchCV

parameters = {'penalty': ['l1', 'l2'], 'C': [0.5, 1, 2]}

lr = LogisticRegression()
tuned_lr = GridSearchCV(lr, parameters)

with ProgressBar():
    tuned_lr.fit(X_train, y_train)

[########################################] | 100% Completed | 28min 34.8s
[########################################] | 100% Completed |  6min 19.7s


* The ``GridSearchCV`` object behaves like the ``Incremental`` wrapper
    * Each model can be built on a separate worker
    * Search time can be reduced substantially by deploying a cluster or adding workers
* It can wrap any algorithm, such as Dask-ML’s logistic regression
* Note: ``GridSearchCV`` may raise memory errors
* Alternative: ``IncrementalSearchCV``, which should be used when your full dataset doesn’t fit in memory on a single machine
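
To see how much work the grid search implies, the candidate combinations can be enumerated with the standard library. This is only a sketch of the bookkeeping, not of how Dask-ML schedules the fits; the value of `folds` is an assumption taken from the three `split*_test_score` columns in the `cv_results_` output.

```python
from itertools import product

parameters = {'penalty': ['l1', 'l2'], 'C': [0.5, 1, 2]}
folds = 3  # assumption: matches the three split*_test_score columns

# Enumerate every combination, with parameter names sorted alphabetically.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates), len(candidates) * folds)
```

Two penalties times three values of `C` give six candidates, and each candidate is fitted once per fold, so the search trains 18 models before the final refit. Because these fits are independent, they are the unit of work that can be spread across workers.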

**Visualising the results of ``GridSearchCV``**

In [20]:
# Listing 10.13
import pandas as pd
pd.DataFrame(tuned_lr.cv_results_)

| | params | mean_fit_time | std_fit_time | mean_score_time | std_score_time | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | param_C | param_penalty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'C': 0.5, 'penalty': 'l1'} | 855.061492 | 0.486273 | 0.497073 | 0.072905 | 0.79066 | 0.79409 | 0.796506 | 0.793752 | 0.002399 | 1 | 0.5 | l1 |
| 1 | {'C': 0.5, 'penalty': 'l2'} | 106.952981 | 0.184854 | 0.598966 | 0.036569 | 0.790836 | 0.793821 | 0.796477 | 0.793711 | 0.002304 | 2 | 0.5 | l2 |
| 2 | {'C': 1, 'penalty': 'l1'} | 789.225405 | 46.388903 | 0.551647 | 0.02769 | 0.790736 | 0.793668 | 0.796483 | 0.793629 | 0.002346 | 6 | 1.0 | l1 |
| 3 | {'C': 1, 'penalty': 'l2'} | 62.3774 | 0.207629 | 0.556441 | 0.033168 | 0.790836 | 0.793821 | 0.796477 | 0.793711 | 0.002304 | 2 | 1.0 | l2 |
| 4 | {'C': 2, 'penalty': 'l1'} | 346.042469 | 75.587293 | 0.401259 | 0.15074 | 0.790707 | 0.793727 | 0.796512 | 0.793649 | 0.002371 | 5 | 2.0 | l1 |
| 5 | {'C': 2, 'penalty': 'l2'} | 50.758152 | 1.961505 | 0.405135 | 0.086069 | 0.790836 | 0.793821 | 0.796477 | 0.793711 | 0.002304 | 2 | 2.0 | l2 |


**Get hyperparameters of best performing model**

In [22]:
# Print best parameters
tuned_lr.best_params_

{'C': 0.5, 'penalty': 'l1'}
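
The best parameters are simply those of the row with the highest `mean_test_score`. The sketch below repeats that selection by hand, using the parameter sets and mean scores copied from the `cv_results_` output above, and also computes the gap between the best and worst candidate.

```python
# (params, mean_test_score) pairs copied from the cv_results_ output.
results = [
    ({'C': 0.5, 'penalty': 'l1'}, 0.793752),
    ({'C': 0.5, 'penalty': 'l2'}, 0.793711),
    ({'C': 1, 'penalty': 'l1'}, 0.793629),
    ({'C': 1, 'penalty': 'l2'}, 0.793711),
    ({'C': 2, 'penalty': 'l1'}, 0.793649),
    ({'C': 2, 'penalty': 'l2'}, 0.793711),
]

best_params, best_score = max(results, key=lambda item: item[1])
worst_score = min(score for _, score in results)
spread = round(best_score - worst_score, 6)
print(best_params, spread)
```

The gap between best and worst is about 0.0001, far smaller than the `std_test_score` of roughly 0.0023 across folds, so in this run the hyperparameters barely affected accuracy. The `mean_fit_time` column, on the other hand, shows the L1 candidates taking several times longer to fit than the L2 ones.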

## Persisting Dask-ML models

* Persisting a trained model allows for
    * quickly accessing the model from where it is stored
    * publishing and sharing the model with others
    * deploying the model elsewhere
    * distributed learning

* Less powerful machines can use the model for predictions
* Prediction is usually less resource-intensive than model training

In [23]:
import dill
with open('naive_bayes_model.pkl', 'wb') as file:
    dill.dump(parallel_nb, file)

* The model is saved with ``dill``, which builds on Python’s built-in ``pickle`` library
* ``pickle`` is a binary serialisation library
* ``dill`` extends ``pickle`` and has better support for complex structures
* The ``dump`` function serialises the object and writes it to the file

* Because pickle files are binary, they must always be read and written with the ``b`` flag in the file mode, which opens the file in binary mode
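
The write/read round trip can be tried with the built-in ``pickle`` module alone. In this sketch a plain dictionary stands in for a trained model, and a temporary directory is used so that no file is left behind; both are choices made for the illustration.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for a trained model in this illustration.
model_stand_in = {'name': 'naive_bayes', 'classes': [0, 1]}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'model.pkl')
    with open(path, 'wb') as file:   # 'wb': write in binary mode
        pickle.dump(model_stand_in, file)
    with open(path, 'rb') as file:   # 'rb': read in binary mode
        restored = pickle.load(file)

print(restored == model_stand_in)
```

The restored object is an equal copy of the original, which is what makes it possible to train on one machine and predict on another.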

In [24]:
with open('naive_bayes_model.pkl', 'rb') as file:
    nb = dill.load(file)
nb.predict(np.random.randint(0, 2, (100, 100)))

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0])

## Summary

* Models can be trained using algorithms implemented in
    * Dask-ML
    * scikit-learn, for estimators that implement ``partial_fit``, using Dask-ML’s ``Incremental`` wrapper
* Trained machine learning models can be saved with the ``dill`` library and reused later to generate predictions.