# Q1 Comparison of Classifiers

## Data Description

We use the **letter** dataset from Statlog. Table 1 summarizes the dataset statistics. The data has 26 classes.

| Dataset  |Size|Feature|
|--|----|-------|
|Train|15000|16|
|Test|5000|16|

## Comparison of Classifiers

You are required to implement the following classifiers and compute the performance achieved by different classifiers.

* **Decision Tree**
  
  Build decision trees on the dataset using the entropy and gini criteria. For each criterion, set the maximum depth to each of [5, 10, 15, 20, 25] in turn. Compare the performance (accuracy, precision, recall, F1 score and training time) and give a brief discussion.
* **KNN, Random Forest**
  
  Apply two further classifiers, KNN and Random Forest, to the dataset. For each classifier, evaluate the performance (accuracy, precision, recall, F1 score and training time). Compare the performance of the different classifiers and give a brief discussion.
  
  

In [57]:
import pandas as pd
import numpy as np

def preprocess(path):
    """
    Preprocess the txt file, where each line is "label index:value index:value ..."
    
    Input: the file path
    Output: 
        X: a list containing all feature vectors (each a list of floats)
        y: a list of labels
    """
    X, y = list(), list()
    with open(path) as f:
        for line in f:
            data = line.split()
            if not data:
                continue # skip blank lines
            y.append(data[0]) # the first item is the label
            # the remaining 16 items are "index:value" pairs; keep each value as a float
            X.append([float(item.split(":")[1]) for item in data[1:]])
    return X, y

X_train, y_train = preprocess("Q1_dataset/letter_train")
X_test, y_test = preprocess("Q1_dataset/letter_test")
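
To make the input layout concrete, the sketch below parses a single line in the `label index:value ...` form that `preprocess` expects. The helper name `parse_line` and the sample line are made up for illustration (the real files carry 16 features per line), and the values are converted to floats here.

```python
def parse_line(line):
    """Split one 'label index:value ...' line into (label, feature list)."""
    parts = line.split()
    label = parts[0]
    # Each remaining token looks like "3:0.25"; keep only the value after the colon.
    features = [float(token.split(":")[1]) for token in parts[1:]]
    return label, features


# Hypothetical sample line with 3 features instead of 16.
label, features = parse_line("7 1:-0.5 2:0.25 3:1\n")
print(label, features)
```

The label is kept as a string, which is fine for classification because the classifiers only need the labels to be distinct and comparable.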

In [58]:
from sklearn import metrics

def evaluate(y_predict, y_test):
    """
    Evaluate the classifier performance by accuracy, precision, recall and f1
    (precision, recall and f1 are macro-averaged over the classes)
    
    Input: y_predict and the ground truth y_test
    Output: accuracy, precision, recall, f1
    """
    accuracy = metrics.accuracy_score(y_test, y_predict)
    precision = metrics.precision_score(y_test, y_predict, average="macro")
    recall = metrics.recall_score(y_test, y_predict, average="macro")
    f1 = metrics.f1_score(y_test, y_predict, average="macro")
    print("accuracy = ", accuracy)
    print("precision = ", precision)
    print("recall = ", recall)
    print("f1 = ", f1)
    
    return accuracy, precision, recall, f1
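
As a rough illustration of what `average="macro"` means, the sketch below computes macro-averaged precision by hand on a toy example: precision is computed per class and the per-class values are averaged with equal weight. The function name `macro_precision` and the toy labels are assumptions for illustration, and scoring a never-predicted class as 0 is a choice made here rather than something taken from the notebook.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision; a class that is never predicted scores 0."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        per_class.append(correct / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)


# Toy data: class "C" is never predicted, so it contributes a precision of 0.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B"]
print(macro_precision(y_true, y_pred))
```

Because every class counts equally, a single class that the model never predicts pulls the macro average down noticeably, which is one reason macro precision can differ from accuracy on shallow trees.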

## Decision Tree

In [59]:
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

criterion = ["gini", "entropy"]
max_depth = [5, 10, 15, 20, 25]

for c in criterion:
    for m in max_depth:
        # time only the fit, so prediction is not counted as training time
        t0 = time.time()
        DT = DecisionTreeClassifier(criterion=c, max_depth=m)
        DT.fit(X_train, y_train)
        t1 = time.time()
        y_predict = DT.predict(X_test)
        print("criterion = %s, max_depth = %s" % (c, m))
        evaluate(y_predict, y_test)
        print("training time = ", t1 - t0)

criterion = gini, max_depth = 5
accuracy =  0.3694
precision =  0.3924221306348351
recall =  0.36850352493813343
f1 =  0.3270313126141437
training time =  0.09574317932128906
criterion = gini, max_depth = 10
accuracy =  0.7138
precision =  0.7636010679690324
recall =  0.7171218331670801
f1 =  0.7267729630119254
training time =  0.10671448707580566




criterion = gini, max_depth = 15
accuracy =  0.8328
precision =  0.8433642286386743
recall =  0.8334905973140099
f1 =  0.8359003682415955
training time =  0.1266627311706543
criterion = gini, max_depth = 20
accuracy =  0.8648
precision =  0.8660853178901877
recall =  0.8651363229926136
f1 =  0.8649724157707273
training time =  0.14162158966064453
criterion = gini, max_depth = 25
accuracy =  0.8752
precision =  0.8759060911437004
recall =  0.8758210306194438
f1 =  0.8755666418386217
training time =  0.13463997840881348
criterion = entropy, max_depth = 5
accuracy =  0.5014
precision =  0.5609909827716968
recall =  0.501366857622793
f1 =  0.49858248010632233
training time =  0.0937490463256836




criterion = entropy, max_depth = 10
accuracy =  0.7984
precision =  0.8029257770797809
recall =  0.798420071372091
f1 =  0.7991154249568595
training time =  0.11768460273742676
criterion = entropy, max_depth = 15
accuracy =  0.8716
precision =  0.8721783610718289
recall =  0.8722052134592271
f1 =  0.871932274124534
training time =  0.13065028190612793
criterion = entropy, max_depth = 20
accuracy =  0.8742
precision =  0.8748118963197716
recall =  0.8750407872449356
f1 =  0.8746052245087904
training time =  0.13065052032470703
criterion = entropy, max_depth = 25
accuracy =  0.8748
precision =  0.8752430241101877
recall =  0.8756309923714058
f1 =  0.8751226356128214
training time =  0.12768888473510742


Based on the evaluation results above, the following table summarizes the performance of the decision tree under different parameters (values rounded to 3 significant digits).

|Criterion|Max Depth|Accuracy|Precision|Recall|F1|Training Time|
|---------|---------|---------|--------|------|--|-------------|
|gini|5|0.369|0.392|0.369|0.327|0.0957s|
|gini|10|0.714|0.764|0.717|0.727|0.107s|
|gini|15|0.833|0.843|0.833|0.836|0.127s|
|gini|20|0.865|0.866|0.865|0.865|0.142s|
|gini|25|0.875|0.876|0.876|0.876|0.135s|
|entropy|5|0.501|0.561|0.501|0.499|0.0937s|
|entropy|10|0.798|0.803|0.798|0.799|0.118s|
|entropy|15|0.872|0.872|0.872|0.872|0.131s|
|entropy|20|0.874|0.875|0.875|0.875|0.131s|
|entropy|25|0.875|0.875|0.876|0.875|0.128s|
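
One way to round raw metric values to 3 significant digits, as used in the table, is Python's `g` format with the `#` flag so that trailing zeros are kept. The helper name `sig3` is hypothetical and is not part of the notebook.

```python
def sig3(value):
    """Format a number with 3 significant digits, keeping trailing zeros."""
    return format(value, "#.3g")


# Example values: an accuracy and a training time in seconds.
print(sig3(0.8752))
print(sig3(0.09574317932128906))
```

Note that this counts significant digits rather than decimal places, so a time such as 0.0957 keeps four decimals while an accuracy such as 0.875 keeps three.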

From this table, we can see that entropy clearly outperforms gini at small depths (for example, accuracy 0.501 versus 0.369 at max_depth = 5), while the gap shrinks to about one percentage point or less once max_depth reaches 20. Increasing max_depth improves accuracy, precision, recall and F1, with diminishing returns beyond a depth of about 15, and training time grows only modestly with depth. The best accuracy that the classifier achieves is about 0.87.
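
For reference, the two split criteria compared above measure the impurity of a node's class distribution in different ways. The sketch below evaluates both on a few small distributions; the function names are chosen here for illustration, and the base-2 logarithm in `entropy` is an assumption about the convention.

```python
import math


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# A pure node, a perfectly mixed two-class node, and a mostly pure node.
for probs in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(probs, round(gini(probs), 3), round(entropy(probs), 3))
```

Both measures are zero for a pure node and largest for a uniform distribution, so they usually rank candidate splits similarly; they differ mainly in how strongly they penalize mixed nodes.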

## KNN and Random Forest

In [60]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

n_neighbors = [2, 5, 8]
for n in n_neighbors:
    t0 = time.time()
    # algorithm="auto" lets scikit-learn pick the neighbor search structure
    neigh = KNeighborsClassifier(n_neighbors=n, algorithm="auto")
    neigh.fit(X_train, y_train)
    t1 = time.time()
    y_predict = neigh.predict(X_test)
    print("n_neighbors = ", n)
    evaluate(y_predict, y_test)
    print("training time = ", t1 - t0)

n_estimators = [50, 100, 150]
for n in n_estimators:
    t0 = time.time()
    # each forest uses n gini trees, each limited to a depth of 20
    RF = RandomForestClassifier(n_estimators=n, criterion="gini", max_depth=20)
    RF.fit(X_train, y_train)
    t1 = time.time()
    y_predict = RF.predict(X_test)
    print("n_estimators = ", n)
    evaluate(y_predict, y_test)
    print("training time = ", t1 - t0)



n_neighbors =  2
accuracy =  0.9474
precision =  0.9483786076543728
recall =  0.9487558725937107
f1 =  0.9475318066767836
training time =  0.1904895305633545




n_neighbors =  5
accuracy =  0.9524
precision =  0.9524332800528715
recall =  0.9527133453583622
f1 =  0.9523114644552255
training time =  0.18749642372131348




n_neighbors =  8
accuracy =  0.9482
precision =  0.9486989364162672
recall =  0.9484795525263109
f1 =  0.9481845764367763
training time =  0.18849420547485352
n_estimators =  50
accuracy =  0.9582
precision =  0.958763912446811
recall =  0.9584578191891976
f1 =  0.9584704567851836
training time =  0.9225304126739502
n_estimators =  100
accuracy =  0.961
precision =  0.961599134728099
recall =  0.9611258587790934
f1 =  0.9611716900976468
training time =  1.8101222515106201
n_estimators =  150
accuracy =  0.962
precision =  0.9627023409630729
recall =  0.9619864479713358
f1 =  0.9621593877244844
training time =  2.691796064376831


We summarize the performance of the different classifiers in the following table. We tune `n_neighbors` (the number of neighbors used for K-nearest-neighbor queries) in KNN and `n_estimators` (the number of trees in the forest) in RandomForest, with three values each, which gives 6 different classifiers.

|Classifier|Accuracy|Precision|Recall|F1|Training Time|
|---------|---------|--------|------|--|-------------|
|KNN -> n_neighbors = 2 | 0.947 | 0.948 | 0.949 | 0.948 | 0.190s |
|KNN -> n_neighbors = 5 | 0.952 | 0.952 | 0.953 | 0.952 | 0.187s |
|KNN -> n_neighbors = 8 | 0.948 | 0.949 | 0.948 | 0.948 | 0.188s |
|RandomForest -> n_estimators = 50 | 0.958 | 0.959 | 0.958 | 0.958 | 0.923s |
|RandomForest -> n_estimators = 100 | 0.961 | 0.962 | 0.961 | 0.961 | 1.81s |
|RandomForest -> n_estimators = 150 | 0.962 | 0.963 | 0.962 | 0.962 | 2.69s |
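
To show what `n_neighbors` controls, here is a minimal brute-force KNN sketch in plain Python. It is a simplification for illustration only: the function name `knn_predict` and the toy data are assumptions, and scikit-learn may use a tree-based index rather than a full scan.

```python
from collections import Counter


def knn_predict(X_train, y_train, x, n_neighbors):
    """Brute-force KNN: majority vote among the n_neighbors closest training points."""
    # Squared Euclidean distance gives the same neighbor ordering as the true distance.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two "A" points near the origin, three "B" points near (1, 1).
X_toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]]
y_toy = ["A", "A", "B", "B", "B"]
query = [0.2, 0.1]
print(knn_predict(X_toy, y_toy, query, n_neighbors=1))
print(knn_predict(X_toy, y_toy, query, n_neighbors=5))
```

With one neighbor the query follows its closest point, while with all five neighbors the larger class wins the vote. This is the trade-off behind tuning `n_neighbors`: too small is sensitive to noise, too large blurs local structure.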

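Because RandomForest is an ensemble method that trains many base classifiers and combines them, it takes much longer to train than KNN: about 2.7 s for 150 trees versus about 0.19 s for KNN, roughly 14 times as long. Adding trees increases the training time almost linearly and improves performance slightly (accuracy rises from about 0.958 with 50 trees to 0.962 with 150). For KNN, the training time is essentially independent of `n_neighbors`, because fitting only stores or indexes the training data and `n_neighbors` matters only at prediction time; the small differences between the three runs are measurement noise. Prediction time, where KNN does most of its work, is not measured here. The best `n_neighbors` is 5, and the best accuracy that KNN achieves is about 0.952, which is lower than RandomForest's best accuracy of 0.962.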
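
Since the notebook times only `fit`, a comparison that also covers prediction cost could time the two phases separately. The sketch below is one possible helper, not part of the original notebook: `time_fit_predict` and the toy `MemorizingClassifier` are hypothetical names, and the helper should work with any object that exposes `fit` and `predict`.

```python
import time


class MemorizingClassifier:
    """Toy lazy learner: fit() only stores the data, predict() does a 1-nearest-neighbor scan."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            # Index of the stored point closest to x (squared Euclidean distance).
            best = min(
                range(len(self.X_)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(self.X_[i], x)),
            )
            predictions.append(self.y_[best])
        return predictions


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (fit seconds, predict seconds, predictions) for a fit/predict model."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    y_pred = model.predict(X_test)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, y_pred


fit_s, predict_s, y_pred = time_fit_predict(
    MemorizingClassifier(), [[0.0], [1.0]], ["A", "B"], [[0.1], [0.9]]
)
print(fit_s, predict_s, y_pred)
```

For a lazy learner like this one, nearly all of the work happens in `predict`, so reporting only the fit time understates its total cost relative to a model that does its work up front.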

In short, RandomForest performs better than KNN on this dataset, helped by the variance reduction that comes from combining many trees, but it needs more time to train.
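
The intuition behind combining many trees can be sketched with a hard majority vote over a few imperfect predictors. This is a simplification for illustration: the toy predictions are made up, `majority_vote` is a hypothetical helper, and scikit-learn's documentation describes its forest as averaging the trees' class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label for each sample."""
    return [
        Counter(sample_votes).most_common(1)[0][0]
        for sample_votes in zip(*predictions_per_model)
    ]


# Each toy "tree" makes one mistake, but on a different sample.
y_true = ["A", "B", "C", "A"]
tree_1 = ["A", "B", "C", "B"]  # wrong on the 4th sample
tree_2 = ["A", "C", "C", "A"]  # wrong on the 2nd sample
tree_3 = ["B", "B", "C", "A"]  # wrong on the 1st sample
combined = majority_vote([tree_1, tree_2, tree_3])
print(combined)
```

Each individual predictor is 75% accurate here, yet the vote recovers every label because the errors do not coincide. Real trees in a forest are correlated, so the benefit is smaller than in this idealized case.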