# Machine Learning with Dask

1. In this task, you'll train several machine-learning models from scikit-learn, using Dask as the backend of joblib. This time, you need to use all of the variables except Class as your feature set. The Class variable will be your target variable.

2. Compare the results of your models.

In [1]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import roc_auc_score
import joblib
from dask.distributed import Client, progress
import dask.dataframe as dd
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

In [2]:
# Client setup
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

0,1
Client  Scheduler: tcp://127.0.0.1:54473  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 8  Memory: 8.00 GB


In [3]:
# Load credit card fraud dataset (File stored locally)
df = dd.read_csv(r'C:\Users\felix\Downloads\archive (1)\creditcard.csv', dtype={'Time': 'float64'})

In [None]:
# Install dask-ml and import train_test_split
!pip install dask-ml
from dask_ml.model_selection import train_test_split

In [5]:
# This is our feature set
X = df.drop(columns = ['Class'])

# This is our target variable
Y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Since our data can fit into memory
# we persist them to the RAM.
X_train.persist()
X_test.persist()
y_train.persist()
y_test.persist()

Dask Series Structure:
npartitions=3
    int64
      ...
      ...
      ...
Name: Class, dtype: int64
Dask Name: split, 3 tasks

**Random Forest**

In [6]:
# Instantiate a model
rf_model = RandomForestClassifier()

# Use parallelization to cross validate
with joblib.parallel_backend('dask'):
    scores = cross_validate(rf_model, X_train.compute(), y_train.compute(), cv=4)
    
scores

{'fit_time': array([146.0893867 , 144.20853758, 145.42736745, 139.58098388]),
 'score_time': array([0.13609958, 0.26554942, 0.23430848, 0.56800294]),
 'test_score': array([0.02029621, 0.99949172, 0.99942161, 0.99931644])}

In [7]:
# grid search parameter
rf_params = {"max_depth": [2, 4, 8, 16]}

# Instantiate model
rf_model = RandomForestClassifier()

# run grid search
grid_search_rf = GridSearchCV(rf_model,
                           param_grid=rf_params,
                           return_train_score=True,
                           iid=True,
                           cv=2,
                           scoring='roc_auc')

In [8]:
# Train model with training data
with joblib.parallel_backend('dask'):
    grid_search_rf.fit(X_train.compute(), y_train.compute())

In [9]:
print("The best value is: ", grid_search_rf.best_params_)
print("The test AUC score is: ", grid_search_rf.score(X_test.compute(), y_test.compute()))

The best value is:  {'max_depth': 4}
The test AUC score is:  0.9529803879110763


**Logistic Regression**

In [13]:
# Instantiate a logistice regression model and train
lr = LogisticRegression()

with joblib.parallel_backend('dask'):
    lr.fit(X_train.values.compute(), y_train.values.compute())

In [14]:
# Obtain scores for both train and test
preds_train = lr.predict(X_train.values.compute())
preds_test = lr.predict(X_test.values.compute())

print("Training score is: ", roc_auc_score(preds_train, y_train.values.compute()))
print("Test score is: ", roc_auc_score(preds_test, y_test.values.compute()))

Training score is:  0.9371863195855031
Test score is:  0.9503862509341974


**XGBoost**

In [17]:
# Grid search parameter
xgboost_params = {"n_estimators": [5, 10, 20]}

# Instantiate XGBoost Classifier Model
xgboost = GradientBoostingClassifier()

# run grid search
grid_search_xgb = GridSearchCV(xgboost, param_grid = xgboost_params)

In [18]:
# Train model with training data
with joblib.parallel_backend('dask'):
    grid_search_xgb.fit(X_train.compute(), y_train.compute())

In [19]:
print("The best estimator is: ", grid_search_xgb.best_params_)
print("The test AUC score is: ", grid_search_xgb.score(X_test.compute(), y_test.compute()))

The best estimator is:  {'n_estimators': 10}
The test AUC score is:  0.9991164204424966


In [None]:
# Close connection
client.close()

The XGBoost classifier has the best test score using all standard parameters with the exception of the number of estimators.