# Loading Data

The leaf database comprises 40 different plant species. Each leaf is described by 15 attributes, including its class. These attributes are:

1. Class
2. Specimen
3. Eccentricity
4. Aspect Ratio
5. Elongation
6. Solidity
7. Stochastic Convexity
8. Isometric Factor
9. Maximal Indentation Depth
10. Lobedness
11. Average Contrast
12. Smoothness
13. Third Moment
14. Uniformity
15. Entropy

In [1]:
import utils.data_utils as data_util


data = data_util.load_data()

#print data
data.head(340)



Unnamed: 0,Class,Specimen,Eccentricity,Aspect Ratio,Elongation,Solidity,Stochastic Convexity,Isometric Factor,Maximal Indentation Depth,Lobedness,Average Contrast,Smoothness,Third Moment,Uniformity,Entropy
0,1,1,0.72694,1.4742,0.32396,0.98535,1.00000,0.835920,0.004657,0.003947,0.047790,0.127950,0.016108,0.005232,0.000275
1,1,2,0.74173,1.5257,0.36116,0.98152,0.99825,0.798670,0.005242,0.005002,0.024160,0.090476,0.008119,0.002708,0.000075
2,1,3,0.76722,1.5725,0.38998,0.97755,1.00000,0.808120,0.007457,0.010121,0.011897,0.057445,0.003289,0.000921,0.000038
3,1,4,0.73797,1.4597,0.35376,0.97566,1.00000,0.816970,0.006877,0.008607,0.015950,0.065491,0.004271,0.001154,0.000066
4,1,5,0.82301,1.7707,0.44462,0.97698,1.00000,0.754930,0.007428,0.010042,0.007938,0.045339,0.002051,0.000560,0.000024
5,1,6,0.72997,1.4892,0.34284,0.98755,1.00000,0.844820,0.004945,0.004451,0.010487,0.058528,0.003414,0.001125,0.000025
6,1,7,0.82063,1.7529,0.44458,0.97964,0.99649,0.767700,0.005928,0.006395,0.018375,0.080587,0.006452,0.002271,0.000041
7,1,8,0.77982,1.6215,0.39222,0.98512,0.99825,0.808160,0.005099,0.004731,0.024875,0.089686,0.007979,0.002466,0.000147
8,1,9,0.83089,1.8199,0.45693,0.98240,1.00000,0.771060,0.006005,0.006564,0.007245,0.040616,0.001647,0.000388,0.000033
9,1,10,0.90631,2.3906,0.58336,0.97683,0.99825,0.664190,0.008402,0.012848,0.007010,0.042347,0.001790,0.000459,0.000028


## Preparing Training and Testing Data
I will use 80% of the data for training and the rest for testing. After splitting, I have:
- 257 samples for training.
- 83 samples for testing.

In [2]:
import numpy as np
#Splitting Training and Testing Data

training_data, testing_data = data_util.split_training_and_testing_data(data)

total_train = 0
for x in range(0,len(training_data)):
    total_train += len(training_data[x])

print("Train: ", total_train )
total_test = 0
for x in range(0,len(testing_data)):
    total_test += len(testing_data[x])
print("Test: ", total_test )

train_numpy = np.asarray(training_data[0])
test_numpy = np.asarray(testing_data[0])
for x in range(1,len(training_data)):
    train_numpy = np.concatenate((train_numpy,np.asarray(training_data[x])), axis=0 )
    
for x in range(1,len(testing_data)):
    test_numpy = np.concatenate((test_numpy,np.asarray(testing_data[x])), axis=0 )


#print(test_numpy)
print(train_numpy.shape)
print(test_numpy.shape)

Train:  257
Test:  83
(257, 15)
(83, 15)
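
The helper `split_training_and_testing_data` lives in `utils/data_utils.py` and is not shown here. Since it returns one group per class, a minimal sketch of such a split could look like the following. The function name `split_per_class`, the `train_ratio` argument, and the assumption that column 0 holds the class label are mine, not taken from the original helper.

```python
import numpy as np

def split_per_class(samples, train_ratio=0.8):
    """Split samples class by class; column 0 is assumed to be the class label."""
    train_groups, test_groups = [], []
    for label in np.unique(samples[:, 0]):
        group = samples[samples[:, 0] == label]
        cut = int(round(len(group) * train_ratio))
        train_groups.append(group[:cut])
        test_groups.append(group[cut:])
    return train_groups, test_groups

# Tiny synthetic example: 2 classes with 5 samples each, 3 columns.
demo = np.array([[c, i, 0.1 * i] for c in (1, 2) for i in range(5)], dtype=float)
train_groups, test_groups = split_per_class(demo)

# A single concatenate call can replace the explicit accumulation loop.
train = np.concatenate(train_groups, axis=0)
test = np.concatenate(test_groups, axis=0)
print(train.shape, test.shape)
```

Passing the whole list of groups to `np.concatenate` avoids building the array one group at a time.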


In [3]:
import utils.data_utils as data_util

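# The first return value is the predicted class; renamed to avoid shadowing the built-in `sorted`.
predicted_class, neighbours = data_util.k_nearest_neighbourhood(train_numpy,test_numpy[0],1)

print(predicted_class)
print(neighbours)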


32.0
[220]
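
`k_nearest_neighbourhood` is also defined in `utils/data_utils.py` and is not shown. Judging from how its result is used, the first return value is the predicted class. A minimal sketch under that assumption follows; the name `knn_predict`, the returned index list, and the choice to skip columns 0 and 1 (Class and Specimen) in the distance are assumptions for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(train, query, k):
    """Return (predicted class, indices of the k nearest training rows)."""
    # Euclidean distance over the feature columns only (skip Class and Specimen).
    diffs = train[:, 2:] - query[2:]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    # Majority vote over the classes of the nearest rows.
    votes = Counter(train[nearest, 0])
    predicted = votes.most_common(1)[0][0]
    return predicted, nearest.tolist()

train_demo = np.array([
    [1.0, 1, 0.10, 0.20],
    [1.0, 2, 0.15, 0.25],
    [2.0, 1, 0.90, 0.80],
    [2.0, 2, 0.85, 0.75],
])
query_demo = np.array([2.0, 3, 0.80, 0.70])
print(knn_predict(train_demo, query_demo, k=1))
```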


## PART 1 KNN with Euclidean Distance

### Finding Best K Parameter
I will search over the odd values of k from 1 to 13 and compare them with 5-fold cross-validation to find the best k value.

In [192]:
from sklearn.model_selection import KFold

training_data, testing_data = data_util.split_training_and_testing_data(data)

train_numpy = np.asarray(training_data[0])
test_numpy = np.asarray(testing_data[0])
# Build the numpy arrays. The loops start from index 2, so the group at index 1 is skipped (249/81 rows instead of 257/83).
for x in range(2, len(training_data)):
    train_numpy = np.concatenate((train_numpy, np.asarray(training_data[x])), axis=0)

for x in range(2, len(testing_data)):
    test_numpy = np.concatenate((test_numpy, np.asarray(testing_data[x])), axis=0)

print(len(train_numpy))
print(len(test_numpy))
def find_best_k_value(train_numpy):
    train_results = {}
    split_size = 5
    kf = KFold(n_splits=split_size, random_state=0, shuffle=True)

    splits = kf.get_n_splits(train_numpy)

    print("Splits: " + str(splits))

    for k_size in range(1, 15, 2):
        mean_pos = 0
        temp_acc = 0
       
        for train_index, test_index in kf.split(train_numpy):
            # print("TRAIN:", train_index, "TEST:", test_index)
            X_train = []
            X_test = []
            X_train, X_test = train_numpy[train_index], train_numpy[test_index]

            positive = 0
            negative = 0
            for x in range(0, len(X_test)):
                sorted2, neigbours2 = data_util.k_nearest_neighbourhood(X_train, X_test[x], k_size)
                if int(X_test[x][0]) == int(sorted2):
                    positive += 1
                else:
                    negative += 1

            temp_acc = (100 * positive) / (positive + negative)
            mean_pos += temp_acc
            print("TrainSize: " + str(len(X_train)) + " TestSize:" + str(len(X_test)))
        # print("K : " + str(k_size) + " True:" + str(positive) + " False: " + str(negative) + "  Positive Accuracy Rate % " + str(
        # (100 * positive) / (positive + negative)))
        train_results[k_size] = mean_pos / split_size
        # print("K : " + str(k_size) + " True:" + str(positive) + " False: " + str(negative) + "  Negative Accuracy Rate % " + str(
        # (100 * negative) / (positive + negative)))

        print("K SIZE : " + str(k_size) + " MEAN ACC  % " + str(mean_pos / split_size))
    return train_results

## DATA is NOT NORMALIZED
print(find_best_k_value(train_numpy))

def calcTestScore(X_train, X_test,k_size):
    positive = 0
    negative = 0
    for x in range(0, len(X_test)):
        sorted2, neigbours2 = data_util.k_nearest_neighbourhood(X_train, X_test[x],k_size)
        if int(X_test[x][0]) == int(sorted2):
            positive += 1
        else:
            negative += 1
    print( "Accuracy Rate % " + str((100 * positive) / (positive + negative)))

for index in range(1,15,2):
    print("Test"+ str(index))
    calcTestScore(train_numpy, test_numpy,index)
    print("Train"+ str(index))
    calcTestScore(train_numpy, train_numpy,index)
    print("----------------------\n")



249
81
Splits: 5
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 200 TestSize:49
K SIZE : 1 MEAN ACC  % 61.46122448979592
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 200 TestSize:49
K SIZE : 3 MEAN ACC  % 53.01224489795918
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 200 TestSize:49
K SIZE : 5 MEAN ACC  % 52.220408163265304
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 200 TestSize:49
K SIZE : 7 MEAN ACC  % 48.99591836734694
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 200 TestSize:49
K SIZE : 9 MEAN ACC  % 44.587755102040816
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestSize:50
TrainSize: 199 TestS

Without normalization, the best result is at K=1: 61.46% mean accuracy in 5-fold cross-validation on the training data and about 56% on the test data.
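
The search for k can be reproduced without the project helpers. The sketch below is a NumPy-only illustration on synthetic data: it shuffles the indices, cuts them into 5 folds, and averages the accuracy of a small k-NN for each odd k. The names `knn_label` and `mean_cv_accuracy` are hypothetical, and the data is a stand-in, not the leaf dataset.

```python
import numpy as np

def knn_label(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = np.argsort(np.linalg.norm(train_x - query, axis=1), kind="stable")[:k]
    labels, counts = np.unique(train_y[order], return_counts=True)
    return labels[np.argmax(counts)]

def mean_cv_accuracy(x, y, k, n_splits=5, seed=0):
    """Mean accuracy (in %) of k-NN over shuffled folds."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, n_splits)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        hits = sum(
            knn_label(x[train_idx], y[train_idx], x[v], k) == y[v] for v in val_idx
        )
        scores.append(100.0 * hits / len(val_idx))
    return float(np.mean(scores))

# Synthetic stand-in data: two well-separated clusters.
rng = np.random.default_rng(42)
x_demo = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)

results = {k: mean_cv_accuracy(x_demo, y_demo, k) for k in range(1, 15, 2)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Using the same seed for every k means all candidates are compared on identical folds.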

Now let's try the same approach with normalized data.



In [4]:
## Normalized Data
def normalizeData(norm_data):
    for x in range(0, len(norm_data), 1):
        temp_point = norm_data[x]
        # print("BEFORE_NORM")
        # print(temp_point)
        total_sum = sum(temp_point) - norm_data[x][0] - norm_data[x][1]
        for y in range(2, len(temp_point)):
            temp_point[y] = temp_point[y] / total_sum

        # print("AFTER_NORM")
        # print(temp_point)
        norm_data[x] = temp_point
    
    return norm_data


norm_train = normalizeData(train_numpy)
norm_test = normalizeData(test_numpy)

print(norm_train[1])

print(find_best_k_value(norm_train))

def calcTestScore(X_train, X_test,k_size):
    positive = 0
    negative = 0
    for x in range(0, len(X_test)):
        sorted2, neigbours2 = data_util.k_nearest_neighbourhood(X_train, X_test[x],k_size)
        if int(X_test[x][0]) == int(sorted2):
            positive += 1
        else:
            negative += 1
    print( "Accuracy Rate % " + str((100 * positive) / (positive + negative)))

for index in range(1,15,2):
    print("Test"+ str(index))
    calcTestScore(norm_train, norm_test,index)
    print("Train"+ str(index))
    calcTestScore(norm_train, norm_train,index)
    print("----------------------\n")


[1.00000000e+00 2.00000000e+00 1.33818352e-01 2.75257384e-01
 6.51582597e-02 1.77079785e-01 1.80098108e-01 1.44091116e-01
 9.45783434e-04 9.02357825e-04 4.35879819e-03 1.63231219e-02
 1.46487011e-03 4.88560658e-04 1.35032537e-05]


NameError: name 'find_best_k_value' is not defined
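
`normalizeData` divides each feature of a sample by the sum of that sample's features (columns 2 onward), and it modifies the array it receives in place, so `train_numpy` and `norm_train` refer to the same normalized array afterwards. A vectorized sketch of the same row-wise scaling that returns a copy instead is shown below; the name `normalize_rows` and the `first_feature` parameter are mine.

```python
import numpy as np

def normalize_rows(samples, first_feature=2):
    """Scale each row's feature columns so they sum to 1; earlier columns stay untouched."""
    result = np.array(samples, dtype=float)  # copy, so the input is not modified
    row_sums = result[:, first_feature:].sum(axis=1, keepdims=True)
    result[:, first_feature:] /= row_sums
    return result

demo = np.array([[1.0, 1.0, 2.0, 6.0],
                 [2.0, 5.0, 1.0, 3.0]])
normalized = normalize_rows(demo)
print(normalized)
```

Working on a copy keeps the unnormalized array available for later comparisons.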

Normalizing the data increased our accuracy. Train: 88%, Test: 60%.

I applied the same approach with sklearn's KNN to benchmark my custom code. The k=1 classifier reaches 100% training accuracy, but test accuracy stays at about 69%, which is close to my own test results. Note that the scores printed below come from the `knn` estimator (k=1); `knn_gscv` is fitted but never scored.

In [5]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=1)
# Fit the classifier to the data

label_x = np.empty(len(norm_train))
test_label_x = np.empty(len(norm_test))
for x in range(0, len(train_numpy), 1):
    label_x[x] = train_numpy[x][0]

for x in range(0, len(test_numpy), 1):
    test_label_x[x] = test_numpy[x][0]

awesome = knn.fit(train_numpy, label_x)

param_grid = {"n_neighbors": np.arange(1, 25)}
knn_gscv = GridSearchCV(knn, param_grid, cv=5)
# fit model to data
knn_gscv.fit(train_numpy, label_x)
print("TRAIN: ", knn.score(train_numpy, label_x))
print("TEST: ", knn.score(test_numpy, test_label_x))

TRAIN:  1.0
TEST:  0.6867469879518072
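
For reference, the result of a grid search is usually read from the fitted `GridSearchCV` object itself. The following sketch uses synthetic stand-in data (not the leaf dataset) and keeps the class column out of the feature matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the leaf data: column 0 = class, remaining columns = features.
rng = np.random.default_rng(0)
classes = np.repeat([1.0, 2.0, 3.0], 20)
features = rng.normal(loc=classes[:, None], scale=0.2, size=(60, 4))
table = np.column_stack([classes, features])

X, y = table[:, 1:], table[:, 0]  # keep the label out of the feature matrix

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 15, 2)}, cv=5)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("mean CV accuracy:", search.best_score_)
print("accuracy of refit model on X:", search.score(X, y))
```

By default `GridSearchCV` refits the best configuration on all the data it was given, so `search.score` and `search.predict` use the selected k.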




## RESULTS - KNN - Euclidean

| K-Size 	| TrainData 	| TestData 	| TrainAccuracy (%) TP(KFold-5) 	| TrainAccuracy (%)  All Data 	| TestAccuracy(%) TP 	| NormalizedData 	|
|:------:	|:---------:	|:--------:	|:-----------------------------:	|:---------------------------:	|:------------------:	|:--------------:	|
| 1 	| 249 	| 81 	| % 61.46 	| %100 	| % 56.79 	| NO 	|
| 3 	| 249 	| 81 	| % 53.01 	| % 78.71 	| % 51.85 	| NO 	|
| 5 	| 249 	| 81 	| % 52.22 	| % 75.10 	| % 55.55 	| NO 	|
| 7 	| 249 	| 81 	| % 48.99 	| % 69.07 	| % 54.32 	| NO 	|
| 9 	| 249 	| 81 	| % 44.58 	| % 65.86 	| % 50.61 	| NO 	|
| 11 	| 249 	| 81 	| % 43.78 	| % 61.84 	| % 50.61 	| NO 	|
| 13 	| 249 	| 81 	| % 42.17 	| % 61.04 	| % 50.61 	| NO 	|
| 1 	| 249 	| 81 	| % 60.66 	| % 100 	| % 60.49 	| YES 	|
| 3 	| 249 	| 81 	| % 55.41 	| % 77.10 	| % 62.96 	| YES 	|
| 5 	| 249 	| 81 	| % 53.02 	| % 68.67 	| % 68.67 	| YES 	|
| 7 	| 249 	| 81 	| % 50.98 	| % 64.65 	| % 61.72 	| YES 	|
| 9 	| 249 	| 81 	| % 47.80 	| % 63.05 	| % 59.25 	| YES 	|
| 11 	| 249 	| 81 	| % 45.00 	| % 59.43 	| % 51.85 	| YES 	|
| 13 	| 249 	| 81 	| % 43.79 	| % 59.43 	| % 46.91 	| YES 	|

## Comments
Without data normalization we get better training results (Train: 100%, Test: 56% at k=1), but with data normalization we get a more general model (Train: 77%, Test: 62%, which is the k=3 row of the table). Data normalization is very important for building a model that generalizes.

## PART 2 KNN with Manhattan Distance

I will reuse the same function; I have only changed the distance formula from Euclidean to Manhattan.

In [194]:
def find_best_k_value(train_numpy):
    train_results = {}
    split_size = 5
    kf = KFold(n_splits=split_size, random_state=0, shuffle=True)

    splits = kf.get_n_splits(train_numpy)

    print("Splits: " + str(splits))

    for k_size in range(1, 15, 2):
        mean_pos = 0
        temp_acc = 0
       
        for train_index, test_index in kf.split(train_numpy):
            # print("TRAIN:", train_index, "TEST:", test_index)
            X_train = []
            X_test = []
            X_train, X_test = train_numpy[train_index], train_numpy[test_index]

            positive = 0
            negative = 0
            for x in range(0, len(X_test)):
                sorted2, neigbours2 = data_util.k_nearest_neighbourhood_manhattan(X_train, X_test[x], k_size)
                if int(X_test[x][0]) == int(sorted2):
                    positive += 1
                else:
                    negative += 1

            temp_acc = (100 * positive) / (positive + negative)
            mean_pos += temp_acc
            #print("TrainSize: " + str(len(X_train)) + " TestSize:" + str(len(X_test)))
        # print("K : " + str(k_size) + " True:" + str(positive) + " False: " + str(negative) + "  Positive Accuracy Rate % " + str(
        # (100 * positive) / (positive + negative)))
        train_results[k_size] = mean_pos / split_size
        # print("K : " + str(k_size) + " True:" + str(positive) + " False: " + str(negative) + "  Negative Accuracy Rate % " + str(
        # (100 * negative) / (positive + negative)))

        print("K SIZE : " + str(k_size) + " MEAN ACC  % " + str(mean_pos / split_size))
    return train_results

print(find_best_k_value(norm_train))


def calcTestScore(X_train, X_test,k_size):
    positive = 0
    negative = 0
    for x in range(0, len(X_test)):
        sorted2, neigbours2 = data_util.k_nearest_neighbourhood_manhattan(X_train, X_test[x],k_size)
        if int(X_test[x][0]) == int(sorted2):
            positive += 1
        else:
            negative += 1
    print( "Accuracy Rate % " + str((100 * positive) / (positive + negative)))

for index in range(1,15,2):
    print("Test"+ str(index))
    calcTestScore(norm_train, norm_test,index)
    print("Train"+ str(index))
    calcTestScore(norm_train, norm_train,index)
    print("----------------------\n")

Splits: 5
K SIZE : 1 MEAN ACC  % 88.35918367346939
K SIZE : 3 MEAN ACC  % 69.90204081632653
K SIZE : 5 MEAN ACC  % 72.29387755102042
K SIZE : 7 MEAN ACC  % 59.436734693877554
K SIZE : 9 MEAN ACC  % 63.83673469387755
K SIZE : 11 MEAN ACC  % 62.63673469387756
K SIZE : 13 MEAN ACC  % 55.81224489795918
{1: 88.35918367346939, 3: 69.90204081632653, 5: 72.29387755102042, 7: 59.436734693877554, 9: 63.83673469387755, 11: 62.63673469387756, 13: 55.81224489795918}
Test1
Accuracy Rate % 86.41975308641975
Train1
Accuracy Rate % 100.0
----------------------

Test3
Accuracy Rate % 70.37037037037037
Train3
Accuracy Rate % 100.0
----------------------

Test5
Accuracy Rate % 76.54320987654322
Train5
Accuracy Rate % 98.39357429718875
----------------------

Test7
Accuracy Rate % 70.37037037037037
Train7
Accuracy Rate % 96.3855421686747
----------------------

Test9
Accuracy Rate % 71.60493827160494
Train9
Accuracy Rate % 97.59036144578313
----------------------

Test11
Accuracy Rate % 71.60493827160494
T

## RESULTS - KNN - Manhattan

| K-Size 	| TrainData 	| TestData 	| TrainAccuracy (%) TP(KFold-5) 	| TrainAccuracy (%)  All Data 	| TestAccuracy(%) TP 	| NormalizedData 	|
|:------:	|:---------:	|:--------:	|:-----------------------------:	|:---------------------------:	|:------------------:	|:--------------:	|
| 1 	| 249 	| 81 	| % 88.35 	| % 100.0 	| % 86.41 	| YES 	|
| 3 	| 249 	| 81 	| % 69.90 	| % 100.0 	| % 70.37 	| YES 	|
| 5 	| 249 	| 81 	| % 72.29 	| % 98.39 	| % 76.54 	| YES 	|
| 7 	| 249 	| 81 	| % 59.43 	| % 96.38 	| % 70.37 	| YES 	|
| 9 	| 249 	| 81 	| % 63.83 	| % 97.59 	| % 71.60 	| YES 	|
| 11 	| 249 	| 81 	| % 62.63 	| % 95.18 	| % 71.60 	| YES 	|
| 13 	| 249 	| 81 	| % 55.81 	| % 96.38 	| % 69.13 	| YES 	|

## Comments
Using Manhattan distance as the metric increased our test accuracy at K=1 from **60%** to **86%** (normalized data).
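
The only difference between the two KNN variants is the distance function. As a small illustration (the function names are mine, not those used in `utils/data_utils.py`), the two metrics can be written as:

```python
import numpy as np

def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """L1 distance: sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
print(euclidean_distance(p, q), manhattan_distance(p, q))  # 5.0 7.0
```

Manhattan distance adds up per-feature differences without squaring them, so a single large difference dominates less than it does under Euclidean distance.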

## PART 3 Linear SVM

I will use sklearn's SVM implementation. Sklearn needs labels and data separately, so we need to separate our data first.

### Data Separation
Delete the first two columns (Class and Specimen) and separate the labels.

In [26]:
training_data, testing_data = data_util.split_training_and_testing_data(data)

total_train = 0
for x in range(0,len(training_data)):
    total_train += len(training_data[x])

print("Train: ", total_train )
total_test = 0
for x in range(0,len(testing_data)):
    total_test += len(testing_data[x])
print("Test: ", total_test )

train_numpy = np.asarray(training_data[0])
test_numpy = np.asarray(testing_data[0])
for x in range(1,len(training_data)):
    train_numpy = np.concatenate((train_numpy,np.asarray(training_data[x])), axis=0 )
    
for x in range(1,len(testing_data)):
    test_numpy = np.concatenate((test_numpy,np.asarray(testing_data[x])), axis=0 )


#print(test_numpy)
print(train_numpy.shape)
print(test_numpy.shape)




train_labels = []
test_labels = []


def delete_first_two_column(data_array):
    label_array = np.empty(len(data_array))
    for index_label in range(0, len(data_array), 1):
        #print("Doing:" + str(index_label))
        #print("DATA: " + str(data_array[index_label][0]))
        label_array[index_label] = data_array[index_label][0]
    # Comment out the line below if you are using SVM or decision tree.
    data_array = np.delete(data_array, np.s_[0:2], axis=1)

    return data_array, label_array

Train:  257
Test:  83
(257, 15)
(83, 15)


In [27]:
## Delete the first two columns from the train and test data.
train_numpy, train_labels = delete_first_two_column(train_numpy)
test_numpy, test_labels = delete_first_two_column(test_numpy)


print(train_numpy[0])
print(train_labels[0])

print(len(train_numpy[0]))
print(len(train_labels))

[7.2694e-01 1.4742e+00 3.2396e-01 9.8535e-01 1.0000e+00 8.3592e-01
 4.6566e-03 3.9465e-03 4.7790e-02 1.2795e-01 1.6108e-02 5.2323e-03
 2.7477e-04]
1.0
13
257
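
The same separation can be done with array slicing instead of an explicit loop. This is a sketch, assuming column 0 is the class and the first two columns are dropped from the features; the name `split_features_and_labels` is hypothetical.

```python
import numpy as np

def split_features_and_labels(samples, drop_columns=2):
    """Return (features, labels); column 0 is assumed to hold the class."""
    labels = samples[:, 0].copy()
    features = samples[:, drop_columns:].copy()
    return features, labels

# Two rows shaped like the leaf data: Class, Specimen, then two feature values.
demo = np.array([[1.0, 1.0, 0.72694, 1.4742],
                 [1.0, 2.0, 0.74173, 1.5257]])
features, labels = split_features_and_labels(demo)
print(features.shape, labels)
```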


In [8]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics, svm
 
kf = KFold(n_splits=5, random_state=0, shuffle=True)

clf = svm.LinearSVC(penalty='l2', verbose=2, max_iter=100000, class_weight=None)

clf.fit(train_numpy, train_labels, sample_weight=None)

# print(str(train_numpy))
#print(str(test_labels))

print("\n###---- TRAIN DATA SCORE ----####")
print(clf.score(train_numpy, train_labels))
print("###--------------------------####")
print("###---- TEST DATA SCORE ----####")
print(clf.score(test_numpy, test_labels))

average_train=0
average_test=0
train_results_dict= {}
test_results_dict={}
index_dict= 0
for train_index, test_index in kf.split(train_numpy):
    print("####################")
    
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = train_numpy[train_index], train_numpy[test_index]
    X_train_labels, X_test_labels = train_labels[train_index], train_labels[test_index]
    clf.fit(X_train, X_train_labels)
    train_results_dict[index_dict] = clf.score(X_train, X_train_labels)
    test_results_dict[index_dict] = clf.score(X_test, X_test_labels)
    average_train += clf.score(X_train, X_train_labels)
    average_test += clf.score(X_test, X_test_labels)
    index_dict +=1
    
    
    print(clf.score(X_train, X_train_labels))
    print(clf.score(X_test, X_test_labels))

print("AVERAGE RESULTS")
print(str(average_train / 5))
print(str(average_test / 5))


print(train_results_dict)
print(test_results_dict)


[LibLinear]
###---- TRAIN DATA SCORE ----####
0.8521400778210116
###--------------------------####
###---- TEST DATA SCORE ----####
0.7831325301204819
####################
[LibLinear]0.8146341463414634
0.5961538461538461
####################
[LibLinear]0.8195121951219512
0.7307692307692307
####################
[LibLinear]0.8252427184466019
0.6666666666666666
####################
[LibLinear]0.8495145631067961
0.6666666666666666
####################
[LibLinear]0.8495145631067961
0.5294117647058824
AVERAGE RESULTS
0.8316836372247216
0.6379336349924585
{0: 0.8146341463414634, 1: 0.8195121951219512, 2: 0.8252427184466019, 3: 0.8495145631067961, 4: 0.8495145631067961}
{0: 0.5961538461538461, 1: 0.7307692307692307, 2: 0.6666666666666666, 3: 0.6666666666666666, 4: 0.5294117647058824}


In [11]:
def normalizeData(norm_data):
    for x in range(0, len(norm_data), 1):
        temp_point = norm_data[x]
        # print("BEFORE_NORM")
        # print(temp_point)
        total_sum = sum(temp_point)
        for y in range(0, len(temp_point)):
            temp_point[y] = temp_point[y] / total_sum

        # print("AFTER_NORM")
        # print(temp_point)
        norm_data[x] = temp_point
    
    return norm_data

In [208]:
norm_train_data = normalizeData(train_numpy)
norm_test_data = normalizeData(test_numpy)

In [209]:


clf_norm = svm.LinearSVC(penalty='l2', verbose=1, max_iter=10000, class_weight=None)

clf_norm.fit(norm_train_data, train_labels, sample_weight=None)

# print(str(train_numpy))
#print(str(test_labels))

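print("###---- TRAIN DATA SCORE ----####")
# NOTE: `clf` is the classifier from the unnormalized experiment, not `clf_norm`;
# use clf_norm.score(...) to score the model fitted above.
print(clf.score(norm_train_data, train_labels))
print("###--------------------------####")
print("###---- TEST DATA SCORE ----####")
print(clf.score(norm_test_data, test_labels))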

average_train=0
average_test=0
train_results_dict= {}
test_results_dict={}
index_dict= 0
for train_index, test_index in kf.split(norm_train_data):
    print("####################")
    
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = norm_train_data[train_index], norm_train_data[test_index]
    X_train_labels, X_test_labels = train_labels[train_index], train_labels[test_index]
    clf_norm.fit(X_train, X_train_labels)
    train_results_dict[index_dict] = clf_norm.score(X_train, X_train_labels)
    test_results_dict[index_dict] = clf_norm.score(X_test, X_test_labels)
    average_train += clf_norm.score(X_train, X_train_labels)
    average_test += clf_norm.score(X_test, X_test_labels)
    index_dict +=1
    
    
    print(clf_norm.score(X_train, X_train_labels))
    print(clf_norm.score(X_test, X_test_labels))

print("AVERAGE RESULTS")
print(str(average_train / 5))
print(str(average_test / 5))


print(train_results_dict)
print(test_results_dict)

[LibLinear]###---- TRAIN DATA SCORE ----####
0.09727626459143969
###--------------------------####
###---- TEST DATA SCORE ----####
0.07228915662650602
####################
[LibLinear]0.33170731707317075
0.15384615384615385
####################
[LibLinear]0.2926829268292683
0.15384615384615385
####################
[LibLinear]0.3106796116504854
0.19607843137254902
####################
[LibLinear]0.3058252427184466
0.1568627450980392
####################
[LibLinear]0.2669902912621359
0.19607843137254902
AVERAGE RESULTS
0.3015770779067014
0.171342383107089
{0: 0.33170731707317075, 1: 0.2926829268292683, 2: 0.3106796116504854, 3: 0.3058252427184466, 4: 0.2669902912621359}
{0: 0.15384615384615385, 1: 0.15384615384615385, 2: 0.19607843137254902, 3: 0.1568627450980392, 4: 0.19607843137254902}


## RESULTS - Linear SVM
| Fold 	| TrainData 	| TestData 	| TrainAccuracy (%) TP 	| TestAccuracy(%) TP 	| NormalizedData 	|
|:----:	|:---------:	|:--------:	|:--------------------:	|:------------------:	|:--------------:	|
| 1 	| 249 	| 81 	| % 63.03 	| % 57.83 	| NO 	|
| 2 	| 249 	| 81 	| % 63.41 	| % 51.92 	| NO 	|
| 3 	| 249 	| 81 	| % 59.02 	| % 34.61 	| NO 	|
| 4 	| 249 	| 81 	| % 63.10 	| % 45.09 	| NO 	|
| 5 	| 249 	| 81 	| % 64.07 	| % 37.25 	| NO 	|
| ALL 	| 249 	| 81 	| % 62.54 	| % 44.36 	| NO 	|
| ALL 	| 249 	| 81 	| % 30.15 	| % 17.13 	| YES 	|

## Comments
With Linear SVM, the results on normalized data are awful; the results on non-normalized data are much better. The best results are **Train 63% and Test 57%**. Increasing the maximum number of iterations didn't change the results. The L2 penalty improved the score.
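
The manual loop over `kf.split` can also be expressed with sklearn's `cross_validate`, which returns the per-fold train and validation scores. The sketch below uses synthetic stand-in data and is only meant to show the shape of the call, not to reproduce the numbers above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 3 classes, 4 features.
rng = np.random.default_rng(0)
y = np.repeat([1.0, 2.0, 3.0], 30)
X = rng.normal(loc=y[:, None], scale=0.3, size=(90, 4))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LinearSVC(max_iter=10000), X, y, cv=kf, return_train_score=True)

print("mean train accuracy:", scores["train_score"].mean())
print("mean validation accuracy:", scores["test_score"].mean())
```

Each fold gets a fresh clone of the estimator, so no state carries over from one fold to the next.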

## PART 4 Polynomial SVM

In [9]:
for degrees in range(1,10,1):
    poly_svm = svm.SVC(kernel="poly", degree=degrees)

    poly_svm.fit(train_numpy, train_labels)

    #print("POLY NOT MIXED RESULTS")
    #print(len(train_numpy))
    #print(len(test_numpy))

    print(poly_svm.score(train_numpy, train_labels))
    print(poly_svm.score(test_numpy, test_labels))

    average_train = 0
    average_test = 0
    for train_index, test_index in kf.split(train_numpy):
        print("####################")
        # print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = train_numpy[train_index], train_numpy[test_index]
        X_train_labels, X_test_labels = train_labels[train_index], train_labels[test_index]
        poly_svm.fit(X_train, X_train_labels)
        average_train += poly_svm.score(X_train, X_train_labels)
        average_test += poly_svm.score(X_test, X_test_labels)
        #print(poly_svm.score(X_train, X_train_labels))
        #print(poly_svm.score(X_test, X_test_labels))

    print("AVERAGE KFOLD RESULTS"+ str(degrees))
    print(str(average_train / 5))
    print(str(average_test / 5))

0.8482490272373541
0.8192771084337349
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS1
0.7772341937011602
0.6065610859728506
1.0
1.0
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS2
1.0
0.996078431372549
1.0
0.9879518072289156
####################
####################
####################
####################




####################
AVERAGE KFOLD RESULTS3
0.9990243902439024
1.0
1.0
0.9759036144578314
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS4
0.9990243902439024
0.996078431372549
1.0
0.9397590361445783
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS5
0.9990243902439024
0.9883861236802414
1.0
0.927710843373494
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS6
0.9990243902439024
0.9845399698340875
1.0
0.927710843373494
####################
####################
####################




####################
####################
AVERAGE KFOLD RESULTS7
0.9990243902439024
0.9766968325791856
0.9961089494163424
0.9036144578313253
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS8
0.9970779067013972
0.9689291101055806
0.9961089494163424
0.9036144578313253
####################
####################
####################
####################
####################
AVERAGE KFOLD RESULTS9
0.9970779067013972
0.9766214177978885




In [14]:
norm_train_data = normalizeData(train_numpy)
norm_test_data = normalizeData(test_numpy)


In [15]:

poly_svm_norm = svm.SVC(kernel="poly", degree=3)

poly_svm_norm.fit(norm_train_data, train_labels)

print("POLY NOT MIXED RESULTS")

print(poly_svm_norm.score(norm_train_data, train_labels))
print(poly_svm_norm.score(norm_test_data, test_labels))

average_train = 0
average_test = 0
for train_index, test_index in kf.split(norm_train_data):
    print("####################")
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = norm_train_data[train_index], norm_train_data[test_index]
    X_train_labels, X_test_labels = train_labels[train_index], train_labels[test_index]
    poly_svm_norm.fit(X_train, X_train_labels)
    average_train += poly_svm_norm.score(X_train, X_train_labels)
    average_test += poly_svm_norm.score(X_test, X_test_labels)
    print(poly_svm_norm.score(X_train, X_train_labels))
    print(poly_svm_norm.score(X_test, X_test_labels))

print("AVERAGE KFOLD RESULTS")
print(str(average_train / 5))
print(str(average_test / 5))

POLY NOT MIXED RESULTS
0.04669260700389105
0.04819277108433735
####################
0.04878048780487805
0.038461538461538464
####################
0.08780487804878048
0.038461538461538464
####################
0.16990291262135923
0.0196078431372549
####################
0.04854368932038835
0.0392156862745098
####################
0.07766990291262135
0.0392156862745098
AVERAGE KFOLD RESULTS
0.0865403741416055
0.03499245852187029




## RESULTS - Poly SVM
| Poly Degree 	| TrainData 	| TestData 	| TrainAccuracy (%) TP 	| TestAccuracy(%) TP 	| NormalizedData 	|
|:-----------:	|:---------:	|:--------:	|:--------------------:	|:------------------:	|:--------------:	|
| 1 	| 257 	| 83 	| % 77.72 	| % 60.65 	| NO 	|
| 2 	| 257 	| 83 	| % 100 	| % 99.60 	| NO 	|
| 3 	| 257 	| 83 	| % 99.90 	| % 97.59 	| NO 	|
| 4 	| 257 	| 83 	| % 99.90 	| % 99.60 	| NO 	|
| 5 	| 257 	| 83 	| % 99.90 	| % 98.73 	| NO 	|
| 6 	| 257 	| 83 	| % 99.90 	| % 98.45 	| NO 	|
| 50 	| 257 	| 83 	| % 63.41 	| % 51.92 	| NO 	|


## Comments
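The normalized results are poor. The best results are at degree 2. For the SVM I used all columns of the CSV data and did not delete the first two columns, which gave better results. Keep in mind that the first column is the class label, so leaving it in the feature matrix exposes the target to the model and probably accounts for the near-perfect scores.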
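
A degree sweep like the one above can be written compactly with `cross_val_score`. The sketch below runs on synthetic stand-in data with the label kept out of the features, so its scores are not comparable to the table; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 3 classes, 4 features, label not included in X.
rng = np.random.default_rng(1)
y = np.repeat([1.0, 2.0, 3.0], 30)
X = rng.normal(loc=y[:, None], scale=0.3, size=(90, 4))

mean_accuracy = {}
for degree in range(1, 6):
    model = SVC(kernel="poly", degree=degree)
    mean_accuracy[degree] = cross_val_score(model, X, y, cv=5).mean()

best_degree = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_degree)
```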

## PART 5 Decision Tree

In [36]:
from sklearn import tree
from sklearn.model_selection import cross_val_score

my_decision_tree = tree.DecisionTreeClassifier(criterion="gini", class_weight="balanced")

my_decision_tree.fit(X_train, X_train_labels)

print("DTREE RESULTS")

print(my_decision_tree.score(X_train, X_train_labels))
print(my_decision_tree.score(X_test, X_test_labels))

print(cross_val_score(my_decision_tree, X_train, X_train_labels, cv=3))
print(cross_val_score(my_decision_tree, X_test, X_test_labels, cv=3))

DTREE RESULTS
1.0
0.6274509803921569
[0.46835443 0.72058824 0.38983051]
[0.30769231 0.5625     0.33333333]




In [34]:
from sklearn import tree
from sklearn.model_selection import cross_val_score

dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=150,
                                    max_features=13, max_leaf_nodes=50, min_samples_leaf=1,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

print(len(X_train_labels))
print(len(X_train))

print(cross_val_score(dtree,X_train,X_train_labels, cv=3))

0.8737864077669902
0.7254901960784313
206
206
[0.49367089 0.52941176 0.38983051]


In [40]:
## diff params
from sklearn import tree
from sklearn.model_selection import cross_val_score

dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=100,
                                    max_features=9, max_leaf_nodes=100, min_samples_leaf=1,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

print(len(X_train_labels))
print(len(X_train))

print(cross_val_score(dtree,X_train,X_train_labels, cv=3))

1.0
0.6078431372549019
206
206
[0.48101266 0.55882353 0.3220339 ]


In [41]:
## diff params
from sklearn import tree
from sklearn.model_selection import cross_val_score

dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=150,
                                    max_features=11, max_leaf_nodes=25, min_samples_leaf=1,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

print(len(X_train_labels))
print(len(X_train))

print(cross_val_score(dtree,X_train,X_train_labels, cv=3))

0.6650485436893204
0.4117647058823529
206
206
[0.30379747 0.5        0.33898305]


In [42]:
## diff params
from sklearn import tree
from sklearn.model_selection import cross_val_score

dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=150,
                                    max_features=13, max_leaf_nodes=200, min_samples_leaf=1,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

print(len(X_train_labels))
print(len(X_train))

print(cross_val_score(dtree,X_train,X_train_labels, cv=3))

1.0
0.5882352941176471
206
206
[0.40506329 0.69117647 0.45762712]


In [44]:
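# Variation: max_depth=30, max_features=13, max_leaf_nodes=30, min_samples_leaf=3
from sklearn import tree
from sklearn.model_selection import cross_val_score

# presort is omitted: it was deprecated in scikit-learn 0.22 and removed in 0.24
dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=30,
                                    max_features=13, max_leaf_nodes=30, min_samples_leaf=3,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

# Accuracy on the training and test sets
print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

# Sanity check: labels and samples have the same length
print(len(X_train_labels))
print(len(X_train))

# 3-fold cross-validation accuracy on the training set
print(cross_val_score(dtree, X_train, X_train_labels, cv=3))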

0.7135922330097088
0.45098039215686275
206
206
[0.44303797 0.5        0.47457627]


In [45]:
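# Variation: max_depth=1000, max_features=13, max_leaf_nodes=1000,
# min_samples_leaf=5, min_samples_split=3
from sklearn import tree
from sklearn.model_selection import cross_val_score

# presort is omitted: it was deprecated in scikit-learn 0.22 and removed in 0.24
dtree = tree.DecisionTreeClassifier(class_weight="balanced", criterion='gini', max_depth=1000,
                                    max_features=13, max_leaf_nodes=1000, min_samples_leaf=5,
                                    min_samples_split=3, min_weight_fraction_leaf=0.0,
                                    random_state=None, splitter='random')
dtree.fit(X_train, X_train_labels)

# Accuracy on the training and test sets
print(dtree.score(X_train, X_train_labels))
print(dtree.score(X_test, X_test_labels))

# Sanity check: labels and samples have the same length
print(len(X_train_labels))
print(len(X_train))

# 3-fold cross-validation accuracy on the training set
print(cross_val_score(dtree, X_train, X_train_labels, cv=3))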

0.6699029126213593
0.39215686274509803
206
206
[0.39240506 0.47058824 0.3220339 ]
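
The cells above differ only in a few constructor arguments, so the same comparison can be sketched as a single loop. This is a minimal sketch under assumptions: the data is a synthetic stand-in from `make_classification` (not the leaf dataset), `random_state=0` is fixed so that `splitter='random'` gives repeatable scores, and `configs` and `evaluate` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the leaf data (assumption): 13 features, 4 classes.
X, y = make_classification(n_samples=300, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# The five parameter settings tried in the cells above.
configs = [
    dict(max_depth=100, max_features=9, max_leaf_nodes=100, min_samples_leaf=1, min_samples_split=2),
    dict(max_depth=150, max_features=11, max_leaf_nodes=25, min_samples_leaf=1, min_samples_split=2),
    dict(max_depth=150, max_features=13, max_leaf_nodes=200, min_samples_leaf=1, min_samples_split=2),
    dict(max_depth=30, max_features=13, max_leaf_nodes=30, min_samples_leaf=3, min_samples_split=2),
    dict(max_depth=1000, max_features=13, max_leaf_nodes=1000, min_samples_leaf=5, min_samples_split=3),
]


def evaluate(params):
    """Fit one tree and return (train accuracy, test accuracy, mean 3-fold CV accuracy)."""
    clf = DecisionTreeClassifier(class_weight="balanced", criterion="gini",
                                 splitter="random", random_state=0, **params)
    clf.fit(X_tr, y_tr)
    cv_scores = cross_val_score(clf, X_tr, y_tr, cv=3)
    return clf.score(X_tr, y_tr), clf.score(X_te, y_te), cv_scores.mean()


results = [evaluate(p) for p in configs]
for params, (train_acc, test_acc, cv_acc) in zip(configs, results):
    print(params, round(train_acc, 3), round(test_acc, 3), round(cv_acc, 3))
```

With `random_state=None`, as in the cells above, each run draws different random splits, so the scores change from run to run.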


### Drawing Tree

In [38]:
import numpy as np


def draw_tree(clf, feature_names):
    """Print the root-to-leaf path of every leaf in a fitted decision tree."""
    left = clf.tree_.children_left
    right = clf.tree_.children_right
    threshold = clf.tree_.threshold
    # tree_.feature is negative for leaf nodes, which have no split feature
    features = [feature_names[i] if i >= 0 else None for i in clf.tree_.feature]

    # ids of the leaf nodes (a leaf has no left child, marked by -1)
    idx = np.argwhere(left == -1)[:, 0]

    def recurse(left, right, child, lineage=None):
        # Walk from a node up to the root, recording each parent split
        if lineage is None:
            lineage = [child]
        if child in left:
            parent = np.where(left == child)[0].item()
            split = 'l'
        else:
            parent = np.where(right == child)[0].item()
            split = 'r'

        # (parent id, branch taken, split threshold, split feature)
        lineage.append((parent, split, threshold[parent], features[parent]))

        if parent == 0:
            lineage.reverse()
            return lineage
        else:
            return recurse(left, right, parent, lineage)

    for child in idx:
        for node in recurse(left, right, child):
            print(node)
            


In [39]:
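# NOTE: the second argument is meant to be the list of feature names. Passing the
# labels makes the last field of each tuple below a label value, not a feature name.
draw_tree(my_decision_tree, X_train_labels)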

(0, 'l', 0.09156093746423721, 2.0)
(1, 'l', 0.019514335319399834, 1.0)
(2, 'l', 0.0138433831743896, 1.0)
3
(0, 'l', 0.09156093746423721, 2.0)
(1, 'l', 0.019514335319399834, 1.0)
(2, 'r', 0.0138433831743896, 1.0)
(4, 'l', 0.0007800880412105471, 1.0)
(5, 'l', 0.019926609471440315, 1.0)
6
(0, 'l', 0.09156093746423721, 2.0)
(1, 'l', 0.019514335319399834, 1.0)
(2, 'r', 0.0138433831743896, 1.0)
(4, 'l', 0.0007800880412105471, 1.0)
(5, 'r', 0.019926609471440315, 1.0)
7
(0, 'l', 0.09156093746423721, 2.0)
(1, 'l', 0.019514335319399834, 1.0)
(2, 'r', 0.0138433831743896, 1.0)
(4, 'r', 0.0007800880412105471, 1.0)
8
(0, 'l', 0.09156093746423721, 2.0)
(1, 'r', 0.019514335319399834, 1.0)
(9, 'l', 0.0011780084460042417, 2.0)
10
(0, 'l', 0.09156093746423721, 2.0)
(1, 'r', 0.019514335319399834, 1.0)
(9, 'r', 0.0011780084460042417, 2.0)
(11, 'l', 0.18758953362703323, 1.0)
(12, 'l', 0.11888254806399345, 1.0)
13
(0, 'l', 0.09156093746423721, 2.0)
(1, 'r', 0.019514335319399834, 1.0)
(9, 'r', 0.0011780084460
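
scikit-learn also ships `sklearn.tree.export_text`, which prints the same root-to-leaf structure as an indented rule list. The following is a minimal sketch, assuming a small synthetic dataset and hypothetical feature names `f0` to `f3` in place of the leaf data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 4 features, 3 classes.
X, y = make_classification(n_samples=120, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each split is printed with its feature name and threshold,
# and each leaf with its predicted class.
report = export_text(clf, feature_names=feature_names)
print(report)
```

Passing the column names of the training data as `feature_names` labels each split with the feature it tests.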

## RESULTS
My best result for the decision tree:

- Training accuracy: 0.8737864077669902
- Test accuracy: 0.7254901960784313

I tried to avoid overfitting by pruning the tree, that is, by limiting its depth and number of leaf nodes.

| Train Data | Test Data | Train Accuracy (%) | Test Accuracy (%) | Max_Depth | Max_Features | Max_Leaf_Nodes | Min_Samples_Leaf | Min_Samples_Split | Criterion |
|:----------:|:---------:|:------------------:|:-----------------:|:---------:|:------------:|:--------------:|:----------------:|:-----------------:|:---------:|
| 257 | 83 | 100 | 62.74 | None (unlimited) | 13 | None (unlimited) | 1 | 2 | gini |
| 257 | 83 | 87.37 | 72.54 | 150 | 13 | 50 | 1 | 2 | gini |
| 257 | 83 | 100 | 60.78 | 100 | 9 | 100 | 1 | 2 | gini |
| 257 | 83 | 66.50 | 41.17 | 150 | 11 | 25 | 1 | 2 | gini |
| 257 | 83 | 66.99 | 39.21 | 1000 | 13 | 1000 | 5 | 3 | gini |
| 257 | 83 | 72.33 | 49.01 | 150 | 13 | 30 | 3 | 2 | gini |
| 257 | 83 | 71.35 | 45.09 | 30 | 13 | 30 | 3 | 2 | gini |


## Comments
Pruning helps us avoid overfitting.
A simpler model is generally preferable because it is less prone to overfitting.
Any additional split that does not add significant value is not worthwhile.
We can reduce overfitting by tuning parameters such as:

- max_leaf_nodes
- min_samples_leaf
- max_depth

Pruning Parameters:

- max_leaf_nodes
    - Reduces the number of leaf nodes
- min_samples_leaf
    - Sets the minimum number of samples required in a leaf
    - The minimum sample size in terminal nodes can be fixed to 30, 100, 300 or 5% of the total
- max_depth
    - Reduces the depth of the tree to build a more general tree
    - Set the depth of the tree to 3, 5 or 10, depending on the results on the test data
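
As a rough illustration of how these parameters bound the size of a tree, the sketch below fits an unconstrained tree and a constrained one and compares their depth, leaf count and accuracy. It runs on synthetic data from `make_classification` (an assumption, not the leaf dataset), and the limits `max_depth=5`, `max_leaf_nodes=20` and `min_samples_leaf=5` are arbitrary illustration values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 features, 4 classes.
X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: depth, leaf count and leaf size are all limited.
small = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=20,
                               min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

for name, clf in [("unconstrained", full), ("constrained", small)]:
    print(name,
          "depth:", clf.get_depth(),
          "leaves:", clf.get_n_leaves(),
          "train:", round(clf.score(X_tr, y_tr), 3),
          "test:", round(clf.score(X_te, y_te), 3))
```

The gap between the training and test scores of each tree gives a quick indication of how strongly it overfits.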


I tried different parameter settings and found the best to be:

**(class_weight="balanced", criterion='gini', max_depth=150,
                                    max_features=13, max_leaf_nodes=50, min_samples_leaf=1,
                                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                                    presort=False, random_state=None, splitter='random')**
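
A manual search like this could also be automated with `GridSearchCV`. The sketch below is one possible setup, not the procedure used above: it runs on synthetic stand-in data, the grid values are drawn from the settings tried in this notebook, and the fixed `random_state=0` is an assumption added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 features, 4 classes.
X, y = make_classification(n_samples=300, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)

# Candidate values taken from the settings tried in this notebook.
param_grid = {
    "max_depth": [30, 100, 150],
    "max_leaf_nodes": [25, 30, 50, 100],
    "min_samples_leaf": [1, 3, 5],
}

base = DecisionTreeClassifier(class_weight="balanced", criterion="gini",
                              max_features=13, splitter="random", random_state=0)

# Every combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(base, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because the selection relies on cross-validation within the training data, the test set can be kept aside for a final check of the chosen parameters.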
