# Basic ensemble in Sklearn

This week we are going to review how basic ensembles can be created in Sklearn. 

We will focus on the Voting classifiers and the Bagging classifiers: 

* Voting Classifier
* Bagging

To exemplify these classifiers, let me tell you a story: On the island of [Hoarafushi](https://www.google.com/maps/place/Hoarafushi/@6.97941,72.874541,8691m/data=!3m1!1e3!4m5!3m4!1s0x3b6d0fd6b9e86b61:0xa214d5fa9f65c83a!8m2!3d6.9826462!4d72.8951151), in the Maldives, there is a tower with a LiDAR sensor on top. This sensor takes measurements every ten minutes of different weather conditions: wind direction and intensity at different heights, air pressure, relative humidity, the sensor's temperature... And finally there is one column called `r1_rain_percentage_occurance`.

Can we predict whether `r1_rain_percentage_occurance` is different from zero (i.e. whether it rained in the 10 minutes between one reading and the previous one)?

### Be patient...
**Note:** The data will take a little while to download (around 3-5 minutes, depending on your internet connection); it is a bit large.

In [1]:
import io

import numpy as np
import pandas as pd

import requests

from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.impute import SimpleImputer

from sklearn.model_selection import cross_validate

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [2]:
data_url = 'https://energydata.info/dataset/d2fbd01c-31f5-4180-9796-ad5828c9d742/resource/51e19759-28d4-4fc8-8558-b33ad7805a22/download/wind-measurements_maldives_hoarafushi_wb-esmap-qc.csv'

data_stream=requests.get(data_url).content

In [3]:
df=pd.read_csv(io.StringIO(data_stream.decode('utf-8')), sep=',', engine='python')

In [4]:
# I will delete a few columns to simplify the data a little bit (this is NOT feature selection, I am just randomly dropping)
columns_to_delete = [
    'time', 'a20_wind_speed_mean', 'a20_wind_speed_stddev', 'a20_wind_speed_min', 'a20_wind_speed_max', 'd20_wind_direction_mean',
    'a20_vertical_wind_speed_mean', 'a20_turbulence_intensity', 'a30_wind_speed_mean', 'a30_wind_speed_stddev',
    'a30_wind_speed_min', 'a30_wind_speed_max', 'd30_wind_direction_mean', 'a30_vertical_wind_speed_mean',
    'a30_turbulence_intensity', 'a39_wind_speed_mean', 'a39_wind_speed_stddev', 'a39_wind_speed_min', 'a39_wind_speed_max',
    'd39_wind_direction_mean', 'a39_vertical_wind_speed_mean', 'a39_turbulence_intensity', 'a60_wind_speed_mean',
    'a60_wind_speed_stddev', 'a60_wind_speed_min', 'a60_wind_speed_max', 'd60_wind_direction_mean', 'a60_vertical_wind_speed_mean',
    'a60_turbulence_intensity', 'a80_wind_speed_mean', 'a80_wind_speed_stddev', 'a80_wind_speed_min', 'a80_wind_speed_max',
    'd80_wind_direction_mean', 'a80_vertical_wind_speed_mean', 'a80_turbulence_intensity', 'a100_wind_speed_mean',
    'a100_wind_speed_stddev', 'a100_wind_speed_min', 'a100_wind_speed_max', 'd100_wind_direction_mean',
    'a100_vertical_wind_speed_mean', 'a100_turbulence_intensity', 'a120_wind_speed_mean', 'a120_wind_speed_stddev',
    'a120_wind_speed_min', 'a120_wind_speed_max', 'd120_wind_direction_mean', 'a120_vertical_wind_speed_mean',
    'a120_turbulence_intensity', 'a150_wind_speed_mean', 'a150_wind_speed_stddev', 'a150_wind_speed_min', 'a150_wind_speed_max',
    'd150_wind_direction_mean', 'a150_vertical_wind_speed_mean', 'a150_turbulence_intensity', 'a11_points_in_average',
    'a20_points_in_average', 'a30_points_in_average', 'a39_points_in_average', 'a50_points_in_average', 'a60_points_in_average',
    'a80_points_in_average', 'a100_points_in_average', 'a120_points_in_average', 'a150_points_in_average', 'a200_points_in_average',
    'v1_voltage_external_mean', 'v1_voltage_internal_mean', 'b1_bearing_device_mean', 'b1_tilt_device_mean',
]

df.drop(columns_to_delete, axis = 1, inplace=True)

In [5]:
print('There are',len(df[df['r1_rain_percentage_occurance']==9999]),'samples in which the target is null. We will delete those')
df = df[df.r1_rain_percentage_occurance != 9999]

# Now we will categorise the target: 0 if didn't rain, 1 if it rained, even a little.
df.rename(columns = {'r1_rain_percentage_occurance':'target'}, inplace = True)
df['target'] = df['target'].apply(lambda x: 0 if x==0 else 1)

# Let's encode the target variable
categoriser = LabelEncoder()
categoriser.fit(df['target'])
df['target'] = categoriser.transform(df['target'])

There are 59 samples in which the target is null. We will delete those


In [6]:
# The dataset is quite imbalanced. Didn't rain much...
print('It did not rain in',len(df[df.target == 0]), 'samples; and it rained in', len(df[df.target == 1]))

It did not rain in 96929 samples; and it rained in 6896


In [7]:
# We will randomly subsample 7,000 samples of the no-rain class. This is to make our classifiers faster and 
# also to reduce a little bit the class imbalances. (Sorry, we downloaded so much data for nothing..., although you can
# now play with it by sampling more, or just taking all of the dataset)
training = df[df['target'] == 0].sample(n=7000)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
training = pd.concat([training, df[df['target']==1]])
training = training.sample(frac=1) # Sampling everything is actually "shuffling" the rows in Pandas.

In [8]:
# By the way, in the data the value 9999 means NULL!
training.replace(to_replace=9999, value=np.nan, inplace=True) # np.nan, because the np.NaN alias was removed in NumPy 2.0

X = training.drop(columns='target') # Every column except the target, wherever the target sits
y = training['target']

In [9]:
print('After subsampling, our training data now has', len(training),'samples')

After subsampling, our training data now has 13896 samples
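
As a quick sanity check for the scores that follow, here is a minimal sketch of the accuracy a classifier would get by always predicting the majority class. It rebuilds only the class counts printed above (7,000 no-rain and 6,896 rain samples); the names `y_counts_only` and `X_placeholder` are placeholders of mine, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild only the label counts reported above: 7000 no-rain (0) and 6896 rain (1).
y_counts_only = np.array([0] * 7000 + [1] * 6896)
# DummyClassifier ignores the features, so a single column of zeros is enough here.
X_placeholder = np.zeros((len(y_counts_only), 1))

# 'most_frequent' always predicts the most common class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_counts_only)
baseline_accuracy = baseline.score(X_placeholder, y_counts_only)
print('Majority-class baseline accuracy:', round(baseline_accuracy, 4))
```

With these counts the baseline is 7000 / 13896, so about 50.4%. Any model worth keeping should beat that figure.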


In [10]:
# Scaling the data:
mm = MinMaxScaler()
X = mm.fit_transform(X)

# We will impute the missing data with the median strategy: The missing value will be the median of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X = imputer.fit_transform(X)
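
One caveat: the scaler and the imputer above are fitted on all of `X`, so each cross-validation fold later sees statistics computed partly from its own test rows. A minimal sketch of the alternative, assuming you would rather refit the preprocessing inside every fold, is to wrap the steps in a pipeline. The synthetic data and the `_demo` names below are stand-ins of mine, not the Hoarafushi data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with about 5% of the values knocked out as NaN.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Same order as in the notebook: scale, impute, then classify.
# MinMaxScaler ignores NaNs when fitting and leaves them in place when transforming.
pipe_demo = make_pipeline(
    MinMaxScaler(),
    SimpleImputer(strategy='median'),
    KNeighborsClassifier(n_neighbors=3),
)

# cross_validate refits the whole pipeline on the training part of each fold,
# so the test part never influences the scaling range or the medians.
cv_demo = cross_validate(pipe_demo, X_demo, y_demo, cv=5, scoring='accuracy')
mean_accuracy_demo = cv_demo['test_score'].mean()
print('Mean CV accuracy on the synthetic data:', mean_accuracy_demo)
```

The same `pipe_demo` object could be passed anywhere a plain classifier is expected, including as one of the estimators of an ensemble.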

## Testing the classifiers separately

We are going to build an ensemble of 4 classifiers: 2 KNN classifiers (with different numbers of neighbours) and 2 SVC classifiers (one with a polynomial kernel and another one with an RBF kernel).

But first, let's check their individual Cross-Validation performances:

In [11]:
k_neighbours = [3, 15]
kernels = ['poly', 'rbf']

classifiers = []

for k in k_neighbours:
    classifiers.append(KNeighborsClassifier(n_neighbors=k))
    
# In the SVMs, we set probability=True in order to be able to use predict_proba for soft voting
for kernel in kernels:
    classifiers.append(SVC(kernel=kernel, probability=True))

In [12]:
scores = []
for c in classifiers:
    print('Doing, ', c)
    score = cross_validate(
        c, X, y, cv=5, scoring='accuracy'
    )
    print(c, 'done')
    scores.append(score['test_score'].mean())

Doing,  KNeighborsClassifier(n_neighbors=3)
KNeighborsClassifier(n_neighbors=3) done
Doing,  KNeighborsClassifier(n_neighbors=15)
KNeighborsClassifier(n_neighbors=15) done
Doing,  SVC(kernel='poly', probability=True)
SVC(kernel='poly', probability=True) done
Doing,  SVC(probability=True)
SVC(probability=True) done


In [13]:
scores

[0.8074253975732691,
 0.8123909796236418,
 0.8226099135085597,
 0.8196596260235426]

The accuracy scores of all of our classifiers are definitely **better than random (over 50%)**, and all of them sit at around 81-82%. Let's put them together in a [Voting Ensemble](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) and see how they perform!

In [14]:
votingEnsemble = VotingClassifier(
    # We can set any list of classifiers (with a string to identify them) as the estimators for our Voting Ensemble!
    estimators=[(str(c), c) for c in classifiers],
    voting='hard', # Could also be 'soft' if all your classifiers have the predict_proba method
    weights=scores, # I will weight the classifiers according to the accuracy score they obtained in the CV run I did just before
    n_jobs=-1, # Ensembles are great to work in parallel
)

In [15]:
print('Doing, ', votingEnsemble)
score = cross_validate(
    votingEnsemble, X, y, cv=5, scoring='accuracy'
)
print(votingEnsemble, 'done')

Doing,  VotingClassifier(estimators=[('KNeighborsClassifier(n_neighbors=3)',
                              KNeighborsClassifier(n_neighbors=3)),
                             ('KNeighborsClassifier(n_neighbors=15)',
                              KNeighborsClassifier(n_neighbors=15)),
                             ("SVC(kernel='poly', probability=True)",
                              SVC(kernel='poly', probability=True)),
                             ('SVC(probability=True)', SVC(probability=True))],
                 n_jobs=-1,
                 weights=[0.8074253975732691, 0.8123909796236418,
                          0.8226099135085597, 0.8196596260235426])
VotingClassifier(estimators=[('KNeighborsClassifier(n_neighbors=3)',
                              KNeighborsClassifier(n_neighbors=3)),
                             ('KNeighborsClassifier(n_neighbors=15)',
                              KNeighborsClassifier(n_neighbors=15)),
                             ("SVC(kernel='poly', probabilit

In [16]:
print('Score provided by the ensemble: ')
score['test_score'].mean()

Score provided by the ensemble: 


0.8227539796158755

We got a *slightly higher accuracy* with the majority voting ensemble. However, did you notice the training time?

Was the accuracy increase worth the time we had to wait? It will always depend on the problem we are facing, but we need to be careful with this.
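
To make the `voting` and `weights` arguments a bit more concrete, here is a small NumPy-only sketch of how weighted hard voting and weighted soft voting combine the outputs of four classifiers. The probabilities and weights are made-up numbers, and the helper names are hypothetical; this illustrates the idea and is not the scikit-learn source.

```python
import numpy as np

# Hypothetical probabilities of class 1 ("rain") from 4 classifiers (rows)
# for 3 samples (columns). These numbers are made up for illustration.
proba_rain = np.array([
    [0.10, 0.60, 0.90],
    [0.20, 0.55, 0.70],
    [0.60, 0.45, 0.80],
    [0.55, 0.10, 0.40],
])
# Illustrative weights, similar in spirit to the CV accuracies used above.
weights = np.array([0.81, 0.81, 0.82, 0.82])

# Each classifier's own hard decision: 1 if it thinks rain is more likely.
labels = (proba_rain > 0.5).astype(int)

def weighted_hard_vote(labels, weights):
    """Per sample, add up the weights behind each class and pick the heaviest."""
    return np.array([
        np.bincount(sample_votes, weights=weights, minlength=2).argmax()
        for sample_votes in labels.T
    ])

def weighted_soft_vote(proba_rain, weights):
    """Per sample, take the weighted average probability and threshold it at 0.5."""
    return (np.average(proba_rain, axis=0, weights=weights) > 0.5).astype(int)

hard_vote = weighted_hard_vote(labels, weights)
soft_vote = weighted_soft_vote(proba_rain, weights)
print('hard:', hard_vote, 'soft:', soft_vote)
```

For the first sample the two slightly heavier classifiers win the hard vote, while the soft vote goes the other way because the two dissenting classifiers are much more confident. This is the kind of difference you can look for when switching `voting='hard'` to `voting='soft'`.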

## Bagging

Let's now implement some [**Bagging**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier) strategies in our ensemble: This time we will use a homogeneous ensemble of LogisticRegression models, which, by the way, are faster to train. But first let's see how one single LogisticRegression performs:

In [17]:
score = cross_validate(
        LogisticRegression(penalty='l2', max_iter=300, verbose=0), X, y, cv=5, scoring='accuracy'
    )
print('One LogisticRegression score:', score['test_score'].mean())

One LogisticRegression score: 0.809585094788509


In [18]:
bagging = BaggingClassifier(
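    # Note: this parameter was renamed to `estimator` in scikit-learn 1.2, and `base_estimator` was removed in 1.4
    base_estimator=LogisticRegression(penalty='l2', max_iter=300, verbose=0),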
    n_estimators=50, # We will use 50 LogisticRegressions, FEEL FREE TO CHANGE THIS
    max_samples=0.30, # A maximum of 30% of the data will be used in each LogisticRegression, FEEL FREE TO CHANGE THIS
    n_jobs=-1,
    verbose=0, # Put '1' or '2' here if you want to see how Bagging is training each step
    
    # PLAYGROUND
    bootstrap=True,
    bootstrap_features=False,
)

In [19]:
print('Doing, ', bagging)
score = cross_validate(
    bagging, X, y, cv=5, scoring='accuracy'
)
print(bagging, 'done')

Doing,  BaggingClassifier(base_estimator=LogisticRegression(max_iter=300),
                  max_samples=0.3, n_estimators=50, n_jobs=-1)
BaggingClassifier(base_estimator=LogisticRegression(max_iter=300),
                  max_samples=0.3, n_estimators=50, n_jobs=-1) done


In [20]:
print('Score provided by the ensemble: ')
score['test_score'].mean()

Score provided by the ensemble: 


0.80519546651272

Well that's disappointing... The score is actually the same or even worse! What have we done wrong? Can you test some other classifier apart from LogisticRegression and see what happens?

**Note:** Even with 500 Logistic Regressions I was not able to improve the result.
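
One possible explanation, offered as a hint rather than a definitive answer: bagging mostly helps by averaging out the variance of unstable learners, and LogisticRegression is fairly stable. The sketch below, on synthetic stand-in data (not the Hoarafushi measurements), trains several models on bootstrap samples and measures how often two of them disagree on held-out points. The helper `bootstrap_disagreement` and all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

def bootstrap_disagreement(make_model, n_models=10, seed=0):
    """Average fraction of test points on which two bootstrap-trained models disagree."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw training rows with replacement.
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        preds.append(make_model().fit(X_tr[idx], y_tr[idx]).predict(X_te))
    preds = np.array(preds)
    # Compare every pair of models and average their disagreement rates.
    pairs = [(i, j) for i in range(n_models) for j in range(i + 1, n_models)]
    return float(np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs]))

logreg_disagreement = bootstrap_disagreement(lambda: LogisticRegression(max_iter=300))
tree_disagreement = bootstrap_disagreement(lambda: DecisionTreeClassifier(random_state=0))
print('LogisticRegression disagreement:', logreg_disagreement)
print('DecisionTree disagreement:', tree_disagreement)
```

On data like this you would typically see the trees disagree far more often than the logistic regressions, which leaves bagging more variance to average away.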

In [21]:
baggingDT = BaggingClassifier(
    base_estimator=None, # When this is None, Bagging will use a DecisionTreeClassifier as the base estimator (the parameter is called `estimator` from scikit-learn 1.2 on)
    n_estimators=50, # We will use 50 DecisionTreeClassifiers, FEEL FREE TO CHANGE THIS
    max_samples=0.30, # A maximum of 30% of the data will be used in each DecisionTree, FEEL FREE TO CHANGE THIS
    n_jobs=-1,
    verbose=0, # Put '1' or '2' here if you want to see how Bagging is training each step
    
    # PLAYGROUND
    bootstrap=True,
    bootstrap_features=False,
)

In [22]:
print('Doing, ', baggingDT)
score = cross_validate(
    baggingDT, X, y, cv=5, scoring='accuracy'
)
print(baggingDT, 'done')

Doing,  BaggingClassifier(max_samples=0.3, n_estimators=50, n_jobs=-1)
BaggingClassifier(max_samples=0.3, n_estimators=50, n_jobs=-1) done


In [23]:
print('Score provided by the Decision Tree Bagging ensemble: ')
score['test_score'].mean()

Score provided by the Decision Tree Bagging ensemble: 


0.8460705289672544

Now that's some improvement at least! (About 4 percentage points over the LogisticRegression Bagging ensemble in my case!)

At least, after all of the small painful processes and the disappointing results, we can finish this interactive activity, which doesn't even have any plot in it, with a happy ending.

# Conclusion and Learning Exercises:

I must confess I made this Interactive Activity slightly annoying on purpose: ensembles are not always better, and I wanted to show that. Just because you use more computing power and do more calculations doesn't mean that you will get better results. We need to be clever about this.

This is why this week's Learning Exercises are focused on improving what we have here: 
* Bagging without DecisionTrees: are there any better weak learners than those? What if we build a Bagging ensemble of Random Forests (_a Random Jungle?_)? Trying Random Forests inside the Voting Classifier, without Bagging, could also be an interesting exercise for you.
* Majority Voting: Is there any improvement with Soft Voting, compared to Hard Voting?
* Can we build a heterogeneous ensemble that improves the individual accuracies by over 5%?
* What can happen if we build ensembles of Naive Bayes learners?
* Do you remember which one was the fastest classification algorithm to train of the ones you know so far? Can we build a super-ensemble with thousands of those and see if we can reach 90% accuracy? 
* And finally, as always, play with the code and run different simulations: **more estimators, in Bagging more or less max_samples, etc.**
