# Data Classification

This dataset is generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The dataset consists of two classes: gammas (signal) and hadrons (background). There are 12332 gamma events and 6688 hadron events. You are required to apply preprocessing techniques to this dataset and use the preprocessed dataset to construct different classification models such as Decision Trees, Naïve Bayes Classifier, Random Forests, AdaBoost, K-Nearest Neighbor (K-NN) and Support Vector Machines (SVM). You are also required to tune the parameters of these models, compare the performance of the learned models before and after preprocessing, and compare the performance of the models with each other.

## Importing dataset

In [2]:
import pandas as pd
import numpy as np
from time import time

data = pd.read_csv("dataset/magic04.data",
                  names = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
                           "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"])

In [2]:
display(data.head(10))
print(len(data))

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |
| 5 | 51.624 | 21.1502 | 2.9085 | 0.242 | 0.134 | 50.8761 | 43.1887 | 9.8145 | 3.613 | 238.098 | g |
| 6 | 48.2468 | 17.3565 | 3.0332 | 0.2529 | 0.1515 | 8.573 | 38.0957 | 10.5868 | 4.792 | 219.087 | g |
| 7 | 26.7897 | 13.7595 | 2.5521 | 0.4236 | 0.2174 | 29.6339 | 20.456 | -2.9292 | 0.812 | 237.134 | g |
| 8 | 96.2327 | 46.5165 | 4.154 | 0.0779 | 0.039 | 110.355 | 85.0486 | 43.1844 | 4.854 | 248.226 | g |
| 9 | 46.7619 | 15.1993 | 2.5786 | 0.3377 | 0.1913 | 24.7548 | 43.8771 | -6.6812 | 7.875 | 102.251 | g |


19020


## Data balancing
The classes are imbalanced: there are more rows of class 'g' than of class 'h'.
We fix this by randomly removing 'g' samples until both classes have the same number of rows.

In [3]:
data_g = data[data['class'] == 'g']
data_h = data[data['class'] == 'h']

print(len(data_g))
print(len(data_h))

12332
6688


In [4]:
data_g = data_g.sample(n=len(data_h))

data = pd.concat([data_g, data_h], ignore_index=True)

In [5]:
print(len(data[data['class'] == 'g']))

6688
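
The undersampling step above draws a different random subset on every run, so the accuracies reported later can shift slightly between runs. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The helper name `undersample` and the seed value 42 are illustrative and not part of the original notebook, and the sketch runs on a small toy frame rather than the MAGIC data.

```python
import pandas as pd


def undersample(df, label_col="class", seed=42):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy stand-in: 7 'g' rows and 3 'h' rows.
toy = pd.DataFrame({"x": range(10), "class": ["g"] * 7 + ["h"] * 3})
balanced = undersample(toy)
print(balanced["class"].value_counts())
```

Because the seed is fixed, calling `undersample(toy)` twice returns the same rows, which makes later accuracy comparisons repeatable.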


## Visualization

In [6]:
# TODO

## Data Split
We randomly split the data into a 70% training set and a 30% testing set using sklearn's train_test_split. The class column (y) is separated from the features (X) afterwards, in the Classification section.

In [6]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, test_size=0.3)

In [7]:
print(len(train_df))
print(len(test_df))

9363
4013
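
A plain random split does not guarantee that both subsets keep the 50/50 class ratio produced by the balancing step. One possible refinement, sketched here on a toy frame, is to pass `stratify` and `random_state` to `train_test_split`. The toy data, the names `toy_train` and `toy_test`, and the seed value are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the balanced data: 10 'g' rows and 10 'h' rows.
toy = pd.DataFrame({"x": range(20), "class": ["g"] * 10 + ["h"] * 10})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.3,
    stratify=toy["class"],  # keep the g/h ratio the same in both subsets
    random_state=0,         # illustrative seed for a repeatable split
)

print(toy_train["class"].value_counts())
print(toy_test["class"].value_counts())
```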


## Classification

In [8]:
from sklearn.metrics import accuracy_score

X_train = train_df.drop('class', axis=1)
y_train = train_df['class']

X_test = test_df.drop('class', axis=1)
y_test = test_df['class']
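
Every model below is evaluated with the same fit, predict, score and time pattern. As a sketch of how that pattern could be factored out, the hypothetical helper `fit_and_score` below returns the accuracy and the elapsed time for any scikit-learn classifier. It is demonstrated on synthetic data from `make_classification`, which is only a stand-in for the MAGIC features.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and return (accuracy, seconds)."""
    start = perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = perf_counter() - start
    return accuracy_score(y_test, y_pred), elapsed


# Synthetic stand-in data (assumption: 10 numeric features, 2 classes).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("tree", DecisionTreeClassifier(random_state=0)), ("nb", GaussianNB())]
for name, model in models:
    acc, secs = fit_and_score(model, X_tr, y_tr, X_te, y_te)
    print(f"{name}: accuracy = {acc:.4f}, time = {secs:.3f}s")
```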

#### 1. AdaBoost Classifier
The results below show that the AdaBoostClassifier model reaches a maximum accuracy of **82.63%** at n_estimators = 1000, using the default base estimator, DecisionTreeClassifier(max_depth=1). Accuracy rises sharply from 76.80% at 10 estimators to 82.58% at 120 and then stays between roughly 82.3% and 82.6%, while the time to fit and predict keeps growing with the number of estimators (from 2.8 s at 120 estimators to 27.7 s at 1000).

In [24]:
from sklearn.ensemble import AdaBoostClassifier

n_estimators = [int(x) for x in np.linspace(start=10, stop=1000, num=10)]

for n in n_estimators:
    start_time = time()
    model = AdaBoostClassifier(n_estimators=n)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred); the value is the same either way
    print(accuracy_score(y_test, y_pred), " at n_estimators = ", n, " time = ", time() - start_time)

0.7680039870421131  at n_estimators =  10  time =  0.2566192150115967
0.8258160976825317  at n_estimators =  120  time =  2.842287540435791
0.8250685272863194  at n_estimators =  230  time =  5.582809209823608
0.8230750062297533  at n_estimators =  340  time =  8.04919719696045
0.8238225766259656  at n_estimators =  450  time =  10.693397521972656
0.8230750062297533  at n_estimators =  560  time =  13.367497444152832
0.8230750062297533  at n_estimators =  670  time =  16.633725881576538
0.8260652878146025  at n_estimators =  780  time =  21.740336418151855
0.8253177174183902  at n_estimators =  890  time =  24.896631240844727
0.8263144779466733  at n_estimators =  1000  time =  27.745393991470337
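
The loop above refits AdaBoost from scratch for every value of n_estimators. Because boosting adds estimators one at a time, a single fit with the largest value already contains all the smaller ensembles. The sketch below uses `staged_score` to read the test accuracy after each boosting round. It runs on synthetic data from `make_classification` (an assumption standing in for the MAGIC features) and uses 100 rounds to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 10 numeric features, 2 classes).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# staged_score yields the test accuracy after 1, 2, ... boosting rounds.
staged_acc = list(model.staged_score(X_te, y_te))

# Index of the best round, converted to a 1-based number of estimators.
best_round = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print(f"best accuracy {staged_acc[best_round - 1]:.4f} at n_estimators = {best_round}")
```

This gives the accuracy curve for every ensemble size from one fit, although it does not reproduce the per-size timing measured in the loop above.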


Increasing the max depth of the decision tree used as the base estimator for AdaBoost to 5 raises the best accuracy to **84.77%** at n_estimators = 890, but the time to fit and predict more than doubles (57.9 s versus 24.9 s at the same number of estimators).


In [26]:
from sklearn.tree import DecisionTreeClassifier

n_estimators = [int(x) for x in np.linspace(start=10, stop=1000, num=10)]

for n in n_estimators:
    start_time = time()
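    # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use that name on newer versions.
    model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5), n_estimators=n)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred); the value is the same either way
    print(accuracy_score(y_test, y_pred), " at n_estimators = ", n, " time = ", time() - start_time)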

0.827809618739098  at n_estimators =  10  time =  0.6828384399414062
0.8198355345128333  at n_estimators =  120  time =  7.991923570632935
0.8280588088711687  at n_estimators =  230  time =  15.19045352935791
0.8357837029653625  at n_estimators =  340  time =  23.63278603553772
0.8317966608522303  at n_estimators =  450  time =  31.768905639648438
0.8407675056067779  at n_estimators =  560  time =  43.15023469924927
0.8357837029653625  at n_estimators =  670  time =  48.17034864425659
0.8387739845502118  at n_estimators =  780  time =  50.58229374885559
0.8477448293047596  at n_estimators =  890  time =  57.874876499176025
0.847246449040618  at n_estimators =  1000  time =  64.2132306098938


#### 2. K-NN classifier
The results below show that the best accuracy, **76.7%**, is achieved at k = 24. This is lower than the best accuracy we achieved with **AdaBoost (84.77%)**, but fitting and predicting is far faster than with AdaBoost (0.18 s vs 57.9 s).

In [32]:
from sklearn.neighbors import KNeighborsClassifier

k_s = [int(x) for x in np.linspace(start=5, stop=40, num=10)]

for k in k_s:
    start_time = time()
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred); the value is the same either way
    print(accuracy_score(y_test, y_pred), " at k = ", k, " time = ", time() - start_time)

0.7622726140044854  at k =  5  time =  0.14264750480651855
0.7600299028158485  at k =  8  time =  0.13023829460144043
0.75728881136307  at k =  12  time =  0.15084028244018555
0.7600299028158485  at k =  16  time =  0.15511250495910645
0.7630201844006977  at k =  20  time =  0.18316936492919922
0.76700722651383  at k =  24  time =  0.17857074737548828
0.7647645153251931  at k =  28  time =  0.2509725093841553
0.7667580363817593  at k =  32  time =  0.1994931697845459
0.7617742337403439  at k =  36  time =  0.2338404655456543
0.7625218041365562  at k =  40  time =  0.24196386337280273
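
K-NN classifies by distance, and the features shown earlier live on very different scales (fDist is in the hundreds while fConc is below 1), so standardizing the features may change these results. The sketch below combines `StandardScaler` and `KNeighborsClassifier` in a `Pipeline` and selects k by 5-fold cross-validation on the training set instead of on the test set. The synthetic data and the grid of k values are assumptions for illustration, and no claim is made about the accuracy this would give on the MAGIC data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 10 numeric features, 2 classes).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is fitted inside each CV fold, so no test information leaks in.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

k_grid = {"knn__n_neighbors": [5, 12, 20, 28, 40]}
search = GridSearchCV(pipe, k_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["knn__n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 4))
print("held-out accuracy:", round(search.score(X_te, y_te), 4))
```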


#### 3. Random Forests
The random forest is trained in the same way as AdaBoost above. It reaches a maximum accuracy of **86.52%** at n_estimators = 670, taking 18.8 s to fit and predict. On these runs that is better than the best AdaBoost result above (84.77% in 57.9 s).

In [9]:
from sklearn.ensemble import RandomForestClassifier

n_estimators = [int(x) for x in np.linspace(start=10, stop=1000, num=10)]

for n in n_estimators:
    start_time = time()
    model = RandomForestClassifier(n_estimators=n)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred); the value is the same either way
    print(accuracy_score(y_test, y_pred), " at n_estimators = ", n, " time = ", time() - start_time)

0.8479940194368303  at n_estimators =  10  time =  0.3142087459564209
0.8631946174931473  at n_estimators =  120  time =  3.436056137084961
0.8649389484176426  at n_estimators =  230  time =  6.41419529914856
0.8616994767007227  at n_estimators =  340  time =  9.746688604354858
0.8629454273610765  at n_estimators =  450  time =  12.339335203170776
0.8612010964365812  at n_estimators =  560  time =  15.854724168777466
0.8651881385497134  at n_estimators =  670  time =  18.82325768470764
0.8644405681535011  at n_estimators =  780  time =  26.190967321395874
0.862447047096935  at n_estimators =  890  time =  27.244975566864014
0.8649389484176426  at n_estimators =  1000  time =  30.12388300895691
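
Random forests also offer an accuracy estimate that does not touch the test set: each tree is trained on a bootstrap sample, and the rows left out of that sample can be used to score it. The sketch below enables this with `oob_score=True`. It uses synthetic data from `make_classification` as a stand-in, and the choice of 200 trees is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: 10 numeric features, 2 classes).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy estimated from the out-of-bag rows of each tree.
print("out-of-bag accuracy:", round(forest.oob_score_, 4))
```

The out-of-bag estimate could be used to choose n_estimators, leaving the test set for the final comparison between models.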
