# Data Science Programming Assignment 4

<ul>
    <li>Load the Banknote Authentication dataset and split it into a training set, a validation set and a test set</li>
    <li>Create a Bagging classifier with Decision Tree classifiers</li>
    <li>Then train various classifiers, such as Logistic Regression, Random Forest classifier, and an Extra-Trees classifier</li>
    <li>Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier</li>
    <li>Once you have found one, try it on the test set</li>
    <li>How much better does it perform compared to the individual classifiers?</li>
</ul>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,ExtraTreesClassifier,VotingClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

In [31]:
banknotes=pd.read_csv("data_banknote_authentication.csv",header='infer')

In [32]:
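#Define the features (all columns except the last) and the target (last column) from the file read

X=banknotes.iloc[:,:-1]
y=banknotes.iloc[:,-1]  #select a 1-D Series; a one-column DataFrame triggers a DataConversionWarning on fit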

In [73]:
#Split the data into train, validation and test sets

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42)
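
The two successive 20% splits do not give a 60/20/20 partition: the validation set is 20% of the remaining 80%, so roughly 16% of the data. The sketch below works out the resulting row counts. It assumes the held-out share is rounded up to a whole row, and the row count of 1372 is an assumption (the size commonly documented for this dataset), since the notebook never prints `len(banknotes)`. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    # First split: hold out the test set, rounding the held-out share up.
    n_test = math.ceil(test_size * n_rows)
    n_train_val = n_rows - n_test
    # Second split: hold out the validation set from what remains.
    n_val = math.ceil(val_size * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

# n_rows=1372 is an assumption; the notebook never prints len(banknotes).
n_train, n_val, n_test = split_sizes(1372)
print(n_train, n_val, n_test)
```

Under these assumptions the counts are 877 training, 220 validation and 275 test rows, which is consistent with the accuracy fractions reported in the outputs of this notebook.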

### Bagging Classifier

In [74]:
#Create a bagging classifier that trains an ensemble of 200 Decision Tree Classifiers, each trained on 100 training instances

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=200,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_pred = bag_clf.predict(X_test)

In [75]:
print("Accuracy score of the Bagging Classifier: ",accuracy_score(y_test, bag_pred))

Accuracy score of the Bagging Classifier:  0.9381818181818182
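
With `bootstrap=True` and `max_samples=100`, each of the 200 trees sees its own sample of 100 training rows drawn with replacement. The following is a minimal sketch of that sampling idea in plain Python, not scikit-learn's internal code. `bootstrap_indices` is a hypothetical helper and the training-set size of 877 is an assumption used only for illustration.

```python
import random

def bootstrap_indices(n_train, max_samples, n_estimators, seed=42):
    """Draw one bootstrap sample of row indices per estimator."""
    rng = random.Random(seed)
    # Sampling with replacement: the same row may appear more than once.
    return [
        [rng.randrange(n_train) for _ in range(max_samples)]
        for _ in range(n_estimators)
    ]

# n_train=877 is an assumed training-set size, used only for illustration.
samples = bootstrap_indices(n_train=877, max_samples=100, n_estimators=200)
rows_seen = {i for sample in samples for i in sample}
print(len(samples), len(samples[0]), len(rows_seen))
```

Because every tree is fitted on a small, different subset, the individual trees are weak but diverse, and their aggregated vote is what the bagging classifier reports.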


### Logistic Regression, Random Forest Classifier and Extra Trees Classifier

In [76]:
#Create a Logistic Regression, Random Forest Classifier and an Extra Trees Classifier

log_clf = LogisticRegression(solver="liblinear", random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=10, random_state=42)

In [77]:
#Fit the above models on the train data

estimators = [log_clf, random_forest_clf, extra_trees_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training the RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
Training the ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fra

In [78]:
print("Validation-set accuracy scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier respectively are:")
[estimator.score(X_val, y_val) for estimator in estimators]

Validation-set accuracy scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier respectively are:


[0.9772727272727273, 0.9863636363636363, 0.990909090909091]

### Hard Voting Classifier

In [79]:
#Ensemble: Build a hard voting classifier by combining the models created above

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', random_forest_clf), ('et',extra_trees_clf)],
    voting='hard')

In [80]:
#Train the ensemble

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', Rando...imators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [81]:
print("Accuracy of the hard voting classifier on the validation set is ",voting_clf.score(X_val, y_val))

Accuracy of the hard voting classifier on the validation set is  1.0
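
Hard voting predicts, for each sample, the class that most of the individual classifiers predict. A minimal sketch of that majority vote in plain Python is shown below. The predictions are made up for illustration, `hard_vote` is a hypothetical helper, and breaking ties in favour of the smallest label is an assumption of this sketch (with three voters and two classes no tie can occur).

```python
def hard_vote(predictions):
    """Majority vote over per-classifier label lists (one list per classifier)."""
    ensemble = []
    for votes in zip(*predictions):
        # Iterating labels in sorted order makes ties resolve to the smallest label.
        labels = sorted(set(votes))
        ensemble.append(max(labels, key=votes.count))
    return ensemble

# Hypothetical predictions from three classifiers on four samples.
lr_pred = [0, 1, 1, 0]
rf_pred = [0, 1, 0, 0]
et_pred = [1, 1, 0, 1]
print(hard_vote([lr_pred, rf_pred, et_pred]))
```

A single classifier's mistake is outvoted whenever the other two agree on the correct class, which is how the ensemble can score higher than each of its members.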


In [82]:
print("Classification scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier on the test set respectively are:")
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

Classification scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier on the test set respectively are:


[0.9854545454545455, 0.9890909090909091, 0.9963636363636363]

#### On the validation set, the ensemble has a predicting accuracy of 100%, which is better than the best individual classifier, the Extra Trees Classifier, which has a validation accuracy of 99.09%

In [83]:
#Now test the performance of the ensemble on the test data

print("Hard voting classifier accuracy on the test set: ",voting_clf.score(X_test, y_test))

Hard voting classifier accuracy on the test set:  0.9963636363636363


### Soft Voting Classifier

In [84]:
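#Switch the already fitted ensemble to soft voting; the trained estimators are reused, so no refit is done

voting_clf.voting = "soft"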

In [86]:
print("Classification scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier on the test set respectively are:")
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

Classification scores of Logistic Regression, Random Forest Classifier, and Extra Trees Classifier on the test set respectively are:


[0.9854545454545455, 0.9890909090909091, 0.9963636363636363]

In [87]:
print("Soft voting classifier accuracy on the test set: ",voting_clf.score(X_test, y_test))

Soft voting classifier accuracy on the test set:  1.0
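
Soft voting averages the class probabilities predicted by the individual classifiers and picks the class with the highest mean, so a confident classifier can outweigh two hesitant ones. The sketch below illustrates this in plain Python, assuming equal weights. The probabilities are made up for illustration and `soft_vote` is a hypothetical helper.

```python
def soft_vote(probabilities):
    """Average per-classifier class probabilities and pick the most likely class.

    probabilities[k][i] is classifier k's list of class probabilities for sample i.
    """
    n_clf = len(probabilities)
    ensemble = []
    for per_sample in zip(*probabilities):
        n_classes = len(per_sample[0])
        mean = [sum(p[c] for p in per_sample) / n_clf for c in range(n_classes)]
        ensemble.append(mean.index(max(mean)))
    return ensemble

# Hypothetical probabilities [P(class 0), P(class 1)] for one sample.
lr_proba = [[0.55, 0.45]]
rf_proba = [[0.60, 0.40]]
et_proba = [[0.05, 0.95]]
print(soft_vote([lr_proba, rf_proba, et_proba]))
```

In this example two of the three classifiers lean towards class 0, so a hard vote would return 0, while the averaged probabilities favour class 1. Differences of this kind are one way soft and hard voting can disagree on individual test samples.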


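#### On the test set, the soft voting classifier (100%) outperforms the best individual classifier, the Extra Trees Classifier (99.64%), by about 0.36 percentage points, while the hard voting classifier only matches it (99.64%). On the validation set, the hard voting classifier (100%) outperformed every individual classifier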
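
To put the question "how much better?" in concrete terms, the test-set accuracies can be converted into numbers of misclassified banknotes. The sketch below does this with the accuracies printed in this notebook. The test-set size of 275 is an assumption (it is the size consistent with the reported fractions, but the notebook never prints it), and `misclassified` is a hypothetical helper.

```python
def misclassified(accuracy, n_samples):
    """Convert an accuracy into a whole number of misclassified samples."""
    return round((1 - accuracy) * n_samples)

# Test-set accuracies copied from the outputs above.
test_accuracy = {
    "bagging": 0.9381818181818182,
    "logistic_regression": 0.9854545454545455,
    "random_forest": 0.9890909090909091,
    "extra_trees": 0.9963636363636363,
    "hard_voting": 0.9963636363636363,
    "soft_voting": 1.0,
}

# n_test=275 is an assumption: it is the test-set size consistent with these fractions.
n_test = 275
errors = {name: misclassified(acc, n_test) for name, acc in test_accuracy.items()}
for name, count in errors.items():
    print(f"{name}: {count} misclassified of {n_test}")
```

Seen this way, the gap between the soft voting ensemble and the Extra Trees classifier is a single test sample, so the improvement is real on this split but small.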