# Voting Classifier

A [voting classifier](http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) combines several different models into a single model that is (ideally) stronger than any of the individual models alone.

The individual models are trained independently of one another, so they can be fit in parallel. Dask lets users train more models in parallel than would otherwise be possible, so adding classifiers to the ensemble need not add training time.

<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" width="30%" alt="Dask logo">

Import necessary libraries.

In [1]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import sklearn.datasets

Create fake data that we can use for training the voting classifier model.

In [2]:
X, y = sklearn.datasets.make_classification(n_samples=1_000)

Create a list of tuples where the first item in the tuple is the name of an individual model and the second item is a sklearn classifier.

Load the list of models into the VotingClassifier. We set the n_jobs parameter to -1, which instructs scikit-learn to use all available processors (note that we haven't used Dask yet).

In [3]:
classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers, n_jobs=-1)
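
Parallel training is possible here because each named model is fit on the same data without depending on the others. The sketch below illustrates that idea with the standard library only. The two toy models (`MajorityClass`, `Threshold`) and the toy data are hypothetical stand-ins, and a thread pool is used purely for illustration; scikit-learn delegates this work to joblib rather than to this code.

```python
from concurrent.futures import ThreadPoolExecutor


class MajorityClass:
    """Toy model (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class Threshold:
    """Toy model (hypothetical): predicts 1 when the first feature exceeds the training mean."""

    def fit(self, X, y):
        self.cut_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[0] > self.cut_) for row in X]


# Toy data, invented for this sketch.
X_toy = [[0.1], [0.2], [0.4], [0.6], [0.9]]
y_toy = [0, 0, 0, 1, 1]

# Same (name, model) tuple layout as the classifiers list above.
named_models = [("majority", MajorityClass()), ("threshold", Threshold())]

# Each fit is independent, so all of them can be submitted at once.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(model.fit, X_toy, y_toy) for name, model in named_models}
    fitted = {name: future.result() for name, future in futures.items()}

print(sorted(fitted))
```

Because no fit waits on another, the wall time of the whole step is bounded by the slowest model rather than by the sum of all of them, provided enough workers are available.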

Train the classifier.

In [4]:
%time clf.fit(X, y)

CPU times: user 61.6 ms, sys: 19.1 ms, total: 80.7 ms
Wall time: 288 ms


VotingClassifier(estimators=[('sgd', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=T...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=-1, voting='hard', weights=None)
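
The output above shows voting='hard', meaning the ensemble predicts the class that most of its models vote for. A minimal pure-Python sketch of that majority vote follows. The `hard_vote` helper and the per-model predictions are hypothetical, made up for illustration. Ties are broken toward the smallest label, which follows how the scikit-learn documentation describes its tie-breaking, but treat that detail as an assumption rather than a copy of the library's code.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample across all models.

    predictions_per_model: one list of predicted labels per model,
    all of the same length. Ties go to the smallest label (assumption).
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        top = max(votes.values())
        voted.append(min(label for label, count in votes.items() if count == top))
    return voted


# Hypothetical predictions from three models on four samples.
sgd_preds = [0, 1, 1, 0]
logreg_preds = [0, 1, 0, 0]
svc_preds = [1, 1, 0, 1]

combined = hard_vote([sgd_preds, logreg_preds, svc_preds])
print(combined)  # [0, 1, 0, 0]
```

Each model is wrong on a different sample in this toy case, and the vote overrides every one of those individual disagreements, which is the intuition behind the ensemble being stronger than its members.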

Set up a Dask client, which provides performance and progress metrics via the dashboard.

You can view the dashboard by clicking the link after running the cell.

In [5]:
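# Importing dask_ml.joblib registers the "dask" backend with joblib.
import dask_ml.joblib
# Note: sklearn.externals.joblib matches the scikit-learn version recorded
# at the end of this notebook; it was removed in scikit-learn 0.23, where
# joblib is imported directly with `import joblib`.
from sklearn.externals import joblib
from distributed import Client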

client = Client()
client

Client: Scheduler: tcp://127.0.0.1:39415, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 4.02 GB


Now train the same model, this time using Dask.

In [6]:
%%time 
with joblib.parallel_backend("dask", scatter=[X, y]):
    clf.fit(X, y)

print(clf)

VotingClassifier(estimators=[('sgd', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=T...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=-1, voting='hard', weights=None)
CPU times: user 172 ms, sys: 30.2 ms, total: 202 ms
Wall time: 1.19 s


On this single machine there is no advantage to using Dask: scikit-learn was already using all of the computer's processors, and the Dask run took longer (1.19 s of wall time versus 288 ms). The benefit of Dask is that it lets users take advantage of cluster computing and train models across multiple machines.

Record software used in this notebook.

In [7]:
import IPython, platform, dask, sklearn, distributed

# Note: platform.linux_distribution() was removed in Python 3.8;
# this cell ran on Python 3.6, where it is still available.
print(("This notebook was created on a "
       "computer {comp} "
       "running {os} and "
       "using:\n"
       "Python {python}\n"
       "IPython {ipython}\n"
       "Dask {dask}\n"
       "Scikit Learn {sklearn}\n"
       "Distributed {distributed}\n").format(**{'comp': platform.machine(),
                                                'os': ' '.join(platform.linux_distribution()[:2]),
                                                'python': platform.python_version(),
                                                'ipython': IPython.__version__,
                                                'dask': dask.__version__,
                                                'sklearn': sklearn.__version__,
                                                'distributed': distributed.__version__
                                               }))

This notebook was created on a computer x86_64 running debian stretch/sid and using:
Python 3.6.6
IPython 6.5.0
Dask 0.18.1
Scikit Learn 0.19.1
Distributed 1.22.0
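
The cell above relies on platform.linux_distribution(), which no longer exists in Python 3.8 and later. As a hedged alternative, the sketch below records similar information using only the standard library (it needs Python 3.8+ for importlib.metadata). The helper names `installed_version` and `environment_report` are hypothetical, and the report format is an assumption, not the notebook's original output.

```python
import platform
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or a placeholder if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages):
    """Build a short text report of the machine, OS, Python, and package versions."""
    lines = [
        f"Machine: {platform.machine()}",
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    lines += [f"{name}: {installed_version(name)}" for name in packages]
    return "\n".join(lines)


# Distribution names as published on PyPI (scikit-learn, not sklearn).
report = environment_report(["ipython", "dask", "scikit-learn", "distributed"])
print(report)
```

Looking versions up by distribution name avoids importing every package just to read its `__version__`, and a missing package is reported instead of raising an ImportError.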

