## Ensemble learning and Random Forest


This notebook explores different techniques for improving model performance, with a focus on ensemble methods. First the libraries and a dataset are loaded. The data is then prepared for the machine learning algorithms and split into a training set and a test set. Several classifiers are trained and evaluated, after which different ensemble methods are applied to improve performance. The learning outcomes of this exercise are:

1. understand the differences between the methodologies
2. apply these to your own dataset. Bring your own or choose one from http://archive.ics.uci.edu/ml/index.php

#### Solve the following
1. Review the code below briefly. What are the repeating parts?
2. Run the code for a different dataset
3. For every classifier assess possible overfit (implement code for this)
4. Implement a Naive Bayes classifier as well
5. Explain the difference between RandomForest and Voting Classifier
6. Explain the difference between hard and soft voting
7. Explain the difference between bagging and stacking
8. Adjust the bagging classifier in such a way that it uses another algorithm
9. Implement stacking
10. Implement gradient boosting
11. What algorithm can you use for very big datasets, and why?
12. What is the best strategy to find an optimal model for your chosen dataset?

You can use any sources to find the answers. More information can be found in the book, in the presentation, and on the internet. The 12 questions and small assignments are a guideline to get familiar with the material. You are encouraged to explore more configurations.

# Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier



# Load and prepare Data

In [2]:
df = pd.read_pickle('../data/breast_cancer.pkl')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../data/breast_cancer.pkl'

In [None]:
# X and y
cols = [
        'texture_mean', 
        'area_mean', 
        'smoothness_mean', 
        'compactness_mean', 
        'concavity_mean',
        'concave points_mean', 
        'symmetry_mean', 
        'fractal_dimension_mean']


In [None]:
y = np.array(df['diagnosis'])
X = np.array(df[cols])
X.shape

In [5]:
# standardize the features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

def normalize(X):
    # note: the scaler is fitted on all rows, before the train/test split
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return X

X = normalize(X)

In [6]:
#split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
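
The cells above scale the full dataset before splitting it, so the test rows contribute to the scaling statistics. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is given below. It assumes that scikit-learn's bundled `load_breast_cancer` data can stand in for the pickle file, and the names `X_demo`, `X_tr`, `X_te` and the `random_state` value are illustrative, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast cancer data replaces '../data/breast_cancer.pkl'.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Split first, so the test rows play no part in computing mean and variance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training set only, then reuse it for the test set.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the test set stays unseen until evaluation, which tends to give a more honest estimate of generalization.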


---
# Try different classifiers

## Logistic

In [7]:
#train
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()
lg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
# evaluation
y_pred = lg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[101   2]
 [  6  62]]
              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96       103
         1.0       0.97      0.91      0.94        68

    accuracy                           0.95       171
   macro avg       0.96      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171



In [9]:
# compare train versus test accuracy to assess overfit
print(lg.score(X_test, y_test))
print(lg.score(X_train, y_train))

0.9532163742690059
0.9472361809045227
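
Comparing train and test accuracy is repeated for every classifier, so it can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `overfit_gap` is hypothetical, and the bundled `load_breast_cancer` data is assumed as a stand-in for the notebook's dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def overfit_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap) for a fitted model."""
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = overfit_gap(model, X_tr, y_tr, X_te, y_te)
    train_acc, test_acc, gap = results[name]
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap (train accuracy well above test accuracy) suggests overfitting; an unpruned decision tree typically shows this more strongly than a regularized linear model.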


## Decision Tree

In [10]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [11]:
y_pred = dt.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(dt.score(X_test, y_test))
print(dt.score(X_train, y_train))

[[94  9]
 [ 4 64]]
              precision    recall  f1-score   support

         0.0       0.96      0.91      0.94       103
         1.0       0.88      0.94      0.91        68

    accuracy                           0.92       171
   macro avg       0.92      0.93      0.92       171
weighted avg       0.93      0.92      0.92       171

0.9239766081871345


## SVM 


In [12]:
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [13]:
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(svm.score(X_test, y_test))

[[102   1]
 [  6  62]]
              precision    recall  f1-score   support

         0.0       0.94      0.99      0.97       103
         1.0       0.98      0.91      0.95        68

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171

0.9590643274853801


## Naive Bayes

In [14]:
# Implement code here
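
One possible starting point for this exercise is scikit-learn's `GaussianNB`, which assumes each feature is normally distributed within each class. The sketch below is an illustration rather than the intended solution; it assumes the bundled `load_breast_cancer` data in place of the notebook's pickle file, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Gaussian Naive Bayes estimates a per-class mean and variance per feature.
nb = GaussianNB()
nb.fit(X_tr, y_tr)

y_pred_nb = nb.predict(X_te)
print(confusion_matrix(y_te, y_pred_nb))
print(classification_report(y_te, y_pred_nb))

# Train versus test accuracy, to assess overfit as for the other classifiers.
nb_train_acc = nb.score(X_tr, y_tr)
nb_test_acc = nb.score(X_te, y_te)
print(nb_train_acc, nb_test_acc)
```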

---
# Ensemble learning

## Random Forest

In [15]:
rf = RandomForestClassifier(n_estimators = 10)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [16]:
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(rf.score(X_test, y_test))

[[98  5]
 [ 8 60]]
              precision    recall  f1-score   support

         0.0       0.92      0.95      0.94       103
         1.0       0.92      0.88      0.90        68

    accuracy                           0.92       171
   macro avg       0.92      0.92      0.92       171
weighted avg       0.92      0.92      0.92       171

0.9239766081871345


## Bagging with Decision Tree classifier

In [17]:
bg = BaggingClassifier(DecisionTreeClassifier(), max_features = 1.0, max_samples = 0.5) 
bg.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [18]:
y_pred = bg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(bg.score(X_test, y_test))

[[97  6]
 [ 5 63]]
              precision    recall  f1-score   support

         0.0       0.95      0.94      0.95       103
         1.0       0.91      0.93      0.92        68

    accuracy                           0.94       171
   macro avg       0.93      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171

0.935672514619883
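
`BaggingClassifier` accepts any classifier as its base estimator, not only decision trees. As a sketch, the example below bags k-nearest-neighbour classifiers instead. The choice of `KNeighborsClassifier`, the bundled `load_breast_cancer` data, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# k-NN is distance based, so the features are standardized first.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# The base estimator is passed positionally; each of the 10 copies is
# trained on a random sample containing half of the training rows.
bg_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=10,
    max_samples=0.5,
    max_features=1.0,
    random_state=0,
)
bg_knn.fit(X_tr_s, y_tr)

bg_knn_acc = bg_knn.score(X_te_s, y_te)
print(bg_knn_acc)
```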


## Boosting

In [19]:
adb = AdaBoostClassifier(LogisticRegression(), n_estimators = 10, learning_rate = 1)
adb.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=LogisticRegression(C=1.0, class_weight=None,
                                                     dual=False,
                                                     fit_intercept=True,
                                                     intercept_scaling=1,
                                                     l1_ratio=None,
                                                     max_iter=100,
                                                     multi_class='auto',
                                                     n_jobs=None, penalty='l2',
                                                     random_state=None,
                                                     solver='lbfgs', tol=0.0001,
                                                     verbose=0,
                                                     warm_start=False),
                   learning_rate=1, n_estimators=10, random_state=None)

In [20]:
y_pred = adb.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(adb.score(X_test, y_test))

[[103   0]
 [  6  62]]
              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97       103
         1.0       1.00      0.91      0.95        68

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.97      0.96      0.96       171

0.9649122807017544


## Stacking

In [21]:
# Your code here
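
As a possible starting point, scikit-learn (version 0.22 and later) provides `StackingClassifier`, which trains a final estimator on cross-validated predictions of the base estimators. The sketch below is one illustrative configuration, not the intended solution; the choice of base models, the bundled `load_breast_cancer` data, and the variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Level 0: two different base learners.
# Level 1: a logistic regression that combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(kernel="rbf")),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr_s, y_tr)

stack_acc = stack.score(X_te_s, y_te)
print(stack_acc)
```

Unlike bagging, which averages many copies of the same algorithm, stacking learns how to weight the outputs of different algorithms.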

## Gradient Boosting

In [22]:
# Your code here
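
One way to approach this exercise is `GradientBoostingClassifier` from `sklearn.ensemble`, which adds shallow trees one at a time, each fitted to the errors of the ensemble so far. The sketch below uses illustrative hyperparameter values and assumes the bundled `load_breast_cancer` data in place of the notebook's pickle file.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Tree-based models do not need feature scaling.
gb = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting stages
    learning_rate=0.1,   # shrinks the contribution of each tree
    max_depth=3,         # depth of the individual trees
    random_state=0,
)
gb.fit(X_tr, y_tr)

gb_train_acc = gb.score(X_tr, y_tr)
gb_test_acc = gb.score(X_te, y_te)
print(gb_train_acc, gb_test_acc)
```

Lowering `learning_rate` usually requires more estimators to reach the same fit, so the two parameters are best tuned together.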

## Voting classifier

In [23]:
evc = VotingClassifier(estimators = [('dt', dt), ('lg',lg), ('svm', svm)], voting = 'hard')
evc.fit(X_train, y_train)

VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     presort='deprecated',
                                                     random_state=None,
   

In [24]:
y_pred = evc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(evc.score(X_test, y_test))

[[101   2]
 [  5  63]]
              precision    recall  f1-score   support

         0.0       0.95      0.98      0.97       103
         1.0       0.97      0.93      0.95        68

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171

0.9590643274853801
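
The cell above uses hard voting, where each model casts one vote for a class label. With `voting='soft'` the predicted class probabilities are averaged instead, which requires every estimator to support `predict_proba`; for `SVC` this means setting `probability=True`. The sketch below illustrates this under the assumption that the bundled `load_breast_cancer` data stands in for the notebook's pickle file; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled data instead of the notebook's pickle file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

evc_soft = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("lg", LogisticRegression(max_iter=1000)),
        # probability=True enables predict_proba on the SVM.
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ],
    voting="soft",
)
evc_soft.fit(X_tr_s, y_tr)

# Averaged class probabilities; the predicted label is the argmax per row.
proba = evc_soft.predict_proba(X_te_s)
soft_acc = evc_soft.score(X_te_s, y_te)
print(proba[:3])
print(soft_acc)
```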
