#### Necessary Imports

In [31]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt


#### Specifying link where the csv dataset is present

In [32]:
url='https://archive.ics.uci.edu/ml/machine-learning-databases/00279/SUSY.csv.gz'

#### Reading the csv file using the pandas library. 

In [33]:
Suzy=pd.read_csv(url,header=None,index_col=False)

#### Head of the df

In [34]:
Suzy.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.972861,0.653855,1.176225,1.157156,-1.739873,-0.874309,0.567765,-0.175,0.810061,-0.252552,1.921887,0.889637,0.410772,1.145621,1.932632,0.994464,1.367815,0.040714
1,1.0,1.667973,0.064191,-1.225171,0.506102,-0.338939,1.672543,3.475464,-1.219136,0.012955,3.775174,1.045977,0.568051,0.481928,0.0,0.44841,0.205356,1.321893,0.377584
2,1.0,0.44484,-0.134298,-0.709972,0.451719,-1.613871,-0.768661,1.219918,0.504026,1.831248,-0.431385,0.526283,0.941514,1.587535,2.024308,0.603498,1.562374,1.135454,0.18091
3,1.0,0.381256,-0.976145,0.693152,0.448959,0.891753,-0.677328,2.03306,1.533041,3.04626,-1.005285,0.569386,1.015211,1.582217,1.551914,0.761215,1.715464,1.492257,0.090719
4,1.0,1.309996,-0.690089,-0.676259,1.589283,-0.693326,0.622907,1.087562,-0.381742,0.589204,1.365479,1.179295,0.968218,0.728563,0.0,1.083158,0.043429,1.154854,0.094859


This displays the top few rows of the dataframe, giving an idea of the values present.

#### Dropping columns

In [35]:
Features=Suzy.drop(columns=[0,11,14,17,18],axis=1)

This is the step that distinguishes this section from the previous one. Here 4 of the 18 features (columns 11, 14, 17 and 18) are dropped, together with the label column (0), leaving the 14 features that make up the Features dataframe.
- Feature selection is the process of automatically or manually choosing the features that contribute most to the prediction variable or output of interest.
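
As a quick check of the selection above, the following plain-Python sketch lists the column indices that remain once the label column and the four dropped features are removed. The variable names are illustrative and are not part of the notebook.

```python
# Column 0 holds the label; columns 1..18 hold the 18 features.
all_columns = list(range(19))
dropped = [0, 11, 14, 17, 18]  # label column plus the four dropped features

# Keep every column index that is not in the dropped list.
kept = [c for c in all_columns if c not in dropped]
print(kept)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16]
print(len(kept))  # 14
```

The 14 remaining indices match the column header of the Features table.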

#### Features table

In [36]:
Features.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,12,13,15,16
0,0.972861,0.653855,1.176225,1.157156,-1.739873,-0.874309,0.567765,-0.175,0.810061,-0.252552,0.889637,0.410772,1.932632,0.994464
1,1.667973,0.064191,-1.225171,0.506102,-0.338939,1.672543,3.475464,-1.219136,0.012955,3.775174,0.568051,0.481928,0.44841,0.205356
2,0.44484,-0.134298,-0.709972,0.451719,-1.613871,-0.768661,1.219918,0.504026,1.831248,-0.431385,0.941514,1.587535,0.603498,1.562374
3,0.381256,-0.976145,0.693152,0.448959,0.891753,-0.677328,2.03306,1.533041,3.04626,-1.005285,1.015211,1.582217,0.761215,1.715464
4,1.309996,-0.690089,-0.676259,1.589283,-0.693326,0.622907,1.087562,-0.381742,0.589204,1.365479,0.968218,0.728563,1.083158,0.043429


The first few rows are displayed after dropping the label column and the unnecessary features.

#### Labels df

In [37]:
Labels=Suzy[0]

Creating the Labels series from column 0 of the dataframe.

#### Displaying top few rows of Labels

In [38]:
Labels.head()

0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: 0, dtype: float64

The label takes only two values, 0 or 1, so this is a binary classification problem: the task is to predict whether the label is 0 or 1 from the 14 features.

#### Shapes of the dfs

In [39]:
print(Features.shape , Labels.shape)

(5000000, 14) (5000000,)


The Features dataframe has 14 columns (4 features were dropped) and 5000000 rows.
Labels has a single column and 5000000 rows. This makes sense, as the dataset has 18 attributes and 5000000 rows.

#### Importing necessary modules from sklearn

In [40]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_validate
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, make_scorer

Scikit-learn (sklearn) is a library that provides various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting and k-means. Here the decision tree classifier, the model selection utilities and the evaluation metrics are imported.

#### Training and Testing split

In [41]:
x_train,x_test,y_train,y_test=train_test_split(Features,Labels,test_size=0.3,random_state=1,shuffle=True)


In the above cell, the data is split into two subsets, one for training and the other for testing.
The first two arguments to the function are the dataframes to be split. test_size sets the fraction of the data used for testing. Here 30% of the data is kept for testing and the remaining 70% is used to train the classifier.
The choice of test set rests on two factors: whether it is large enough to yield statistically meaningful results, and whether it is representative of the data set as a whole (the test set should not have different characteristics from the training set).

Testing is not performed on the training data because the aim is to measure how well the model generalises. The whole purpose of training a model is to make predictions on data it has not seen before. Testing on the training data is analogous to a mathematics professor setting an exam made up of the very questions used to teach the concepts: a high score would not show real understanding, and an overfitted model would go undetected.
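
The idea of a shuffled 70/30 split can be sketched in plain Python. This is only an illustration under simple assumptions, not how train_test_split is implemented internally; the function name shuffle_split and its parameters are hypothetical.

```python
import random

def shuffle_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = int(round(n_samples * test_size))  # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(n_samples=10, test_size=0.3, seed=1)
print(len(train_idx), len(test_idx))        # 7 3
print(set(train_idx).isdisjoint(test_idx))  # True
```

Because the two index lists are disjoint, no row used for training is ever used to score the model.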

#### Dimensions after splitting.

In [42]:
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(3500000, 14) (1500000, 14) (3500000,) (1500000,)



- Training Features 3500000 x 14
- Testing Features 1500000 x 14
- Training Labels 3500000 x 1
- Testing Labels 1500000 x 1

In [43]:
for dataset in [y_train,y_test]:
    print(round(len(dataset) / len(Labels), 2))

0.7
0.3


- Training percentage - 70
- Testing percentage - 30

#### Decision tree classifier

In [44]:
Suzy_DT = DecisionTreeClassifier(criterion="gini", random_state=1)


A decision tree classifier is used as the classification algorithm. It uses a decision tree to go from observations about an item to conclusions about the item's target value, and is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable takes a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
- criterion is the function used to measure the quality of a split. Here the criterion is "gini", for the Gini impurity.
- random_state is the seed used by the random number generator.
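
To make the "gini" criterion concrete, here is a minimal sketch of the Gini impurity of a node and the size-weighted impurity of a candidate split. The helper names are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum over classes of p_k ** 2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity([0, 0, 1, 1]))     # 0.5 (evenly mixed node)
print(gini_impurity([1, 1, 1, 1]))     # 0.0 (pure node)
print(split_impurity([0, 0], [1, 1]))  # 0.0 (perfect split)
```

A tree grown with this criterion prefers the split whose children have the lowest weighted impurity.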

#### Training

In [45]:
Suzy_DT.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

The above cell trains the model. It takes the training features (x_train) and the training labels (y_train) and fits the classifier to them. How well the model is fitted determines how well it generalizes to data similar to that on which it was trained: a well-fitted model produces more accurate outcomes, an overfitted model matches the training data too closely, and an underfitted model does not match it closely enough.

#### Prediction

In [46]:
y_pred=Suzy_DT.predict(x_test)

The model (a function from features to labels) has now been learnt and can be used to predict the labels of new data, which is what the above cell does. It takes the test features (x_test) and predicts the corresponding y value for each row.

#### Scoring the prediction

In [47]:
Suzy_DT.score(x_test,y_test)

0.7157366666666667

The above cell computes the accuracy of the classifier on the test set, which is about 71.57 percent.

#### Computing evaluation metrics.


In [48]:
nfolds = 10
def tn(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[1, 1]

These helper functions read the true negative, false positive, false negative and true positive counts from the confusion matrix.
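
For binary labels coded as 0 and 1, confusion_matrix places true labels on the rows and predicted labels on the columns, so the four cells are [[tn, fp], [fn, tp]]. The sketch below counts the same four quantities by hand on a toy example; the function name confusion_counts and the data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) counts for binary labels coded as 0 and 1."""
    tn_count = fp_count = fn_count = tp_count = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp_count += 1   # positive correctly predicted
        elif actual == 1 and predicted == 0:
            fn_count += 1   # positive missed
        elif actual == 0 and predicted == 1:
            fp_count += 1   # negative flagged as positive
        else:
            tn_count += 1   # negative correctly predicted
    return tn_count, fp_count, fn_count, tp_count

toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]
tn_count, fp_count, fn_count, tp_count = confusion_counts(toy_true, toy_pred)
print([[tn_count, fp_count], [fn_count, tp_count]])  # [[2, 1], [1, 2]]
```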

In [49]:
scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn),
           'ac' : make_scorer(accuracy_score),
           're' : make_scorer(recall_score),
           'pr' : make_scorer(precision_score),
           'f1' : make_scorer(f1_score),
           'auc' : make_scorer(roc_auc_score),
          } 

A dictionary, 'scoring', maps a short name to each evaluation metric. make_scorer creates a scorer from a performance metric or loss function.
- A true positive is an outcome where the model correctly predicts the positive class. A true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class. A false negative is an outcome where the model incorrectly predicts the negative class.
- accuracy_score computes the fraction of predictions that match the true labels: Accuracy = (tp + tn) / (tp + tn + fp + fn).
- Recall = tp / (tp + fn). It measures how many of the actual positives the model captures by labeling them as positive. For example, if a sick patient (actual positive) is tested and predicted as not sick (predicted negative), the cost of that false negative is extremely high when the sickness is contagious.
- Precision = tp / (tp + fp). Precision is a good measure when the cost of a false positive is high, as in email spam detection. There, a false positive means that a non-spam email (actual negative) has been identified as spam (predicted positive). The user might lose important emails if the precision of the spam detection model is not high.
- f1_score = 2 * (precision * recall) / (precision + recall). It is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0.
- roc_auc_score is an evaluation metric that considers all possible classification thresholds. The area under the ROC curve is the probability that the classifier scores a randomly chosen positive example higher than a randomly chosen negative example. Note that make_scorer(roc_auc_score), as used here, passes the hard 0/1 predictions to the metric, so the value reported below reflects a single threshold rather than the full curve.
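
The probabilistic reading of the area under the ROC curve can be illustrated directly: compare every positive example's score with every negative example's score and count how often the positive one is ranked higher. This is a brute-force sketch for small inputs only, with hypothetical scores; roc_auc_score does not compute the value this way.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]  # hypothetical scores given to actual positives
neg = [0.7, 0.3, 0.2]  # hypothetical scores given to actual negatives
auc = pairwise_auc(pos, neg)
print(round(auc, 4))   # 0.8889
```

Eight of the nine pairs are ordered correctly; only the positive scored 0.4 falls below the negative scored 0.7.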


#### Validation

In [50]:
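# random_state is omitted: it has no effect without shuffle=True, and recent scikit-learn versions raise an error if it is set
cv_results = cross_validate(Suzy_DT, x_train, y_train, scoring=scoring, cv=StratifiedKFold(n_splits=nfolds))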

The above cell performs cross validation using the stratified k-fold method. Validation is the process of making sure that the model generalizes well, that is, a model built on one set of data also performs well on a completely different set of data.
- K-fold cross validation: the training set is divided into K folds, or subsections (where K is a positive integer). The model is trained K times, each time leaving a different fold out of the training data and using it as the validation set instead, so every sample in the training set is used for evaluation once. At the end, the performance metric (e.g. accuracy, ROC AUC) is averaged across all K runs. Once the best parameter combination has been found, the model is retrained on the full data. In stratified k-fold, the folds are made by preserving the percentage of samples of each class.
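
The stratification idea can be sketched without any library: assign the samples of each class to the folds in turn, so that every fold keeps roughly the same class proportions. This round-robin scheme is an assumption made for illustration and differs from the way StratifiedKFold actually assigns samples; the function name is hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

toy_labels = [0] * 6 + [1] * 4  # 60% class 0, 40% class 1
folds = stratified_folds(toy_labels, k=2)
for held_out, validation_idx in enumerate(folds):
    # Train on every fold except the held-out one.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    positive_ratio = sum(toy_labels[i] for i in validation_idx) / len(validation_idx)
    print(held_out, len(train_idx), len(validation_idx), positive_ratio)
# 0 5 5 0.4
# 1 5 5 0.4
```

Each validation fold keeps the 40% share of class 1 found in the full toy label list.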

#### Displaying evaluation metrics results.

In [51]:
print('Cross Validation scores (nfolds = %d):'% nfolds)
print('tp: ', cv_results['test_tp'], '; mean:', cv_results['test_tp'].mean())
print('fn: ', cv_results['test_fn'], '; mean:', cv_results['test_fn'].mean())
print('fp: ', cv_results['test_fp'], '; mean:', cv_results['test_fp'].mean())
print('tn: ', cv_results['test_tn'], '; mean:', cv_results['test_tn'].mean())
print('ac: ', cv_results['test_ac'], '; mean:', cv_results['test_ac'].mean())
print('re: ', cv_results['test_re'], '; mean:', cv_results['test_re'].mean())
print('pr: ', cv_results['test_pr'], '; mean:', cv_results['test_pr'].mean())
print('f1: ', cv_results['test_f1'], '; mean:', cv_results['test_f1'].mean())
print('auc: ', cv_results['test_auc'], '; mean:', cv_results['test_auc'].mean())

Cross Validation scores (nfolds = 10):
tp:  [111083 111168 111109 111204 111167 110785 111020 111337 111135 111260] ; mean: 111126.8
fn:  [49032 48947 49006 48911 48947 49329 49094 48777 48979 48854] ; mean: 48987.6
fp:  [51161 50954 50561 50655 50627 50897 50865 51059 50649 51027] ; mean: 50845.5
tn:  [138725 138932 139325 139231 139259 138989 139020 138826 139236 138858] ; mean: 139040.1
ac:  [0.7137351  0.71456939 0.71552367 0.71552653 0.71550286 0.71364
 0.71440204 0.71475347 0.71534776 0.7146249 ] ; mean: 0.7147625713638519
re:  [0.6937701  0.69430097 0.69393249 0.69452581 0.69429906 0.69191326
 0.69338097 0.69536081 0.6940992  0.6948799 ] ; mean: 0.6940462567112372
pr:  [0.68466631 0.68570583 0.68725799 0.68704243 0.68708976 0.68520305
 0.68579547 0.68558955 0.68693443 0.68557555] ; mean: 0.6860860365206534
f1:  [0.68918814 0.68997663 0.69057911 0.69076385 0.6906756  0.68854181
 0.68956736 0.69044061 0.69049823 0.69019637] ; mean: 0.690042771112482
auc:  [0.71217001 0.71298051 0.

Each of the metric scores is printed above; they are used later to compare the different classifiers.
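
As a worked check, the summary metrics of the first fold can be recomputed from its four counts in the printed output (tp 111083, fn 49032, fp 51161, tn 138725). The function name is hypothetical. The last value is the mean of recall and specificity, which is what the area under the ROC curve reduces to when it is computed from hard 0/1 predictions.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Derive the summary metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    balanced = (recall + specificity) / 2  # ROC AUC of hard 0/1 predictions
    return accuracy, recall, precision, f1, balanced

# Counts of the first fold, copied from the printed cross validation output.
first_fold = metrics_from_counts(tp=111083, fn=49032, fp=51161, tn=138725)
print([round(value, 4) for value in first_fold])
# [0.7137, 0.6938, 0.6847, 0.6892, 0.7122]
```

These agree with the first entries of the ac, re, pr, f1 and auc rows printed above.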

### Bagging classifier

A bagging classifier is an ensemble method. Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.

The objective is to create several subsets of the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree.

As a result, we end up with an ensemble of different models. The predictions of the different trees are averaged, which is more robust than relying on a single decision tree and thus reduces the variance of the decision tree.

Other ensemble methods include random forest and gradient boosting.

In [52]:
from sklearn.ensemble import BaggingClassifier

Importing the BaggingClassifier class from sklearn.

A bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
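
The two ingredients of bagging, sampling with replacement and aggregating predictions, can be sketched in plain Python. This is a simplified illustration that uses majority voting, one of the two aggregation options mentioned above; the helper names and the per-model predictions are hypothetical, and no models are actually trained.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by voting."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

rng = random.Random(1)
sample = bootstrap_indices(n_samples=8, rng=rng)
print(len(sample), all(0 <= i < 8 for i in sample))  # 8 True

# Hypothetical predictions of three base models for four test rows.
model_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(majority_vote(model_predictions))  # [1, 0, 1, 0]
```

Using an odd number of voters for a binary label avoids ties in the vote.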

#### Creating model

In [53]:
SelectedBC_Model=BaggingClassifier(base_estimator=Suzy_DT,n_estimators=10,random_state=1)

Creating a bagging classifier model called SelectedBC_Model.
- base_estimator: the base estimator to fit on random subsets of the dataset. If None, the base estimator is a decision tree.
- n_estimators: the number of base estimators in the ensemble.

#### Training

In [54]:
SelectedBC_Model.fit(x_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort=False,
                                                        random_state=1,
                                                        splitter='best'),
       

The above cell carries out training, i.e. it fits the feature values to the label values and learns the mapping between the x and y values.

#### Prediction

In [55]:
y_pred=SelectedBC_Model.predict(x_test)

Here the labels of new data are predicted from the given features: the cell takes the test features (x_test) and predicts the corresponding y value for each row.

#### Evaluation

In [56]:
SelectedBC_Model.score(x_test,y_test)

0.7826333333333333

The above cell computes the accuracy of the classifier on the test set, which is about 78.26 percent. This is slightly lower than the accuracy of the model trained on all the features (78.34 percent).

In [57]:
nfolds = 10
def tn(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): 
	return confusion_matrix(y_true, y_pred)[1, 1]

These helper functions read the true negative, false positive, false negative and true positive counts from the confusion matrix.

In [58]:
scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn),
           'ac' : make_scorer(accuracy_score),
           're' : make_scorer(recall_score),
           'pr' : make_scorer(precision_score),
           'f1' : make_scorer(f1_score),
           'auc' : make_scorer(roc_auc_score),
          } 

A dictionary 'scoring' is used to address each of the evaluation metrics. make_scorer creates a scorer from a performance metric or loss function. Each metric's significance is as explained before.

#### Validation

In [59]:
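# random_state is omitted: it has no effect without shuffle=True, and recent scikit-learn versions raise an error if it is set
cv_results = cross_validate(SelectedBC_Model, x_train, y_train, scoring=scoring, cv=StratifiedKFold(n_splits=nfolds))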

The above cell performs cross validation using the stratified k-fold method explained earlier.

#### Evaluation results

In [60]:
print('Cross Validation scores (nfolds = %d):'% nfolds)
print('tp: ', cv_results['test_tp'], '; mean:', cv_results['test_tp'].mean())
print('fn: ', cv_results['test_fn'], '; mean:', cv_results['test_fn'].mean())
print('fp: ', cv_results['test_fp'], '; mean:', cv_results['test_fp'].mean())
print('tn: ', cv_results['test_tn'], '; mean:', cv_results['test_tn'].mean())
print('ac: ', cv_results['test_ac'], '; mean:', cv_results['test_ac'].mean())
print('re: ', cv_results['test_re'], '; mean:', cv_results['test_re'].mean())
print('pr: ', cv_results['test_pr'], '; mean:', cv_results['test_pr'].mean())
print('f1: ', cv_results['test_f1'], '; mean:', cv_results['test_f1'].mean())
print('auc: ', cv_results['test_auc'], '; mean:', cv_results['test_auc'].mean())

Cross Validation scores (nfolds = 10):
tp:  [108826 108493 108854 108458 108881 108715 108875 108991 109095 109072] ; mean: 108826.0
fn:  [51289 51622 51261 51657 51233 51399 51239 51123 51019 51042] ; mean: 51288.4
fp:  [24755 24848 24748 24921 24983 25147 24846 24852 24864 25048] ; mean: 24901.2
tn:  [165131 165038 165138 164965 164903 164739 165039 165033 165021 164837] ; mean: 164984.4
ac:  [0.78273205 0.78151491 0.78283205 0.78120634 0.78224    0.78129714
 0.78261366 0.78292795 0.78319081 0.78259938] ; mean: 0.7823154294418445
re:  [0.67967398 0.67759423 0.67984886 0.67737564 0.68002173 0.67898497
 0.67998426 0.68070875 0.68135828 0.68121463] ; mean: 0.6796765339116453
pr:  [0.81468173 0.81365072 0.81476325 0.81315649 0.81337029 0.81214236
 0.81419523 0.81431976 0.81439097 0.81324187] ; mean: 0.8137912664946277
f1:  [0.74107921 0.73941579 0.74121689 0.73908155 0.7407425  0.7396182
 0.74106216 0.74154383 0.74195863 0.7413963 ] ; mean: 0.7407115064660499
auc:  [0.77465315 0.77336838

Results of the evaluation metrics for the selected bagging classifier computed above.