# BUMİN KAĞAN ÇETİN 3128785 MACHINE LEARNING PROJECT

DESCRIPTION ----

This dataset is composed of 1000 samples with 30 features each. The first column
is the sample id. The second column in the dataset represents the label. There
are 3 possible values for the labels. The remaining columns are numeric
features.

Your task is the following: you should compare the performance of a Support-
Vector Machine (implemented by sklearn.svm.LinearSVC) with that of a Random
Forest (implemented by sklearn.ensemble.RandomForestClassifier). Try to optimize
both algorithms' parameters and determine which one is best for this dataset. At
the end of the analysis, you should have chosen an algorithm and its optimal set
of parameters: write this choice explicitly in the conclusions of your notebook.

Your notebook should detail the procedure you have used to choose the optimal
parameters (graphs are a good idea when possible/sensible).

The notebook will be evaluated not only based on the final results, but also on
the procedure employed, which should balance practical considerations (one may
not be able to exhaustively explore all possible combinations of the parameters)
with the desire for achieving the best possible performance in the least amount
of time.

Bonus points may be assigned for particularly clean/nifty code and/or well-
presented results.

You are also free to attempt other strategies beyond the one in the assignment
(which however is mandatory!).

In [11]:
#importing some of the relevant libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import time

start_time = time.time()  # start a timer so the total runtime of the notebook can be reported at the end

In [12]:
df = pd.read_csv('mldata_0013128785.csv', sep=',')  # load the dataset and preview its first rows
df.head(10)

Unnamed: 0.1,Unnamed: 0,label,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,feature_28,feature_29,feature_30
0,0,2,0.319464,-1.177577,-1.593623,-0.164374,-1.469506,-0.38314,1.151623,1.172442,...,-1.469506,0.823956,0.450095,-0.013027,0.247483,-1.222473,1.168376,0.607972,0.574189,-0.160841
1,1,0,-2.027124,0.199568,1.970646,1.1464,0.344652,0.454211,0.199387,-0.442092,...,0.344652,0.803195,0.706193,0.466416,0.429637,0.070651,2.762511,-0.474402,3.447011,1.470679
2,2,2,1.185738,-2.743891,-2.109143,-0.335438,-1.333614,0.299884,1.500342,-0.08911,...,-1.333614,0.761581,-0.156411,0.575707,-0.21819,1.209016,0.39205,-1.095556,0.03865,-0.07655
3,3,1,-0.362543,-3.917672,-0.98508,0.950043,-4.133465,1.167349,0.714509,0.330346,...,-4.133465,-0.287202,0.522918,0.250201,-0.585323,-2.177589,-0.214211,-0.775195,3.480752,-0.193634
4,4,2,0.246926,-0.114464,2.7659,1.400277,0.187917,-0.788065,1.758315,-0.567783,...,0.187917,0.00919,-0.560612,-0.29922,-0.822967,-0.131734,0.199559,0.856419,-1.754675,5.329163
5,5,0,-0.203283,-0.207301,2.561853,-1.778584,1.473254,-0.721459,-0.524057,0.962013,...,1.473254,-0.085583,-0.162413,-0.478596,-0.292862,-0.68727,-0.062224,1.82956,-1.1497,-1.721763
6,6,1,-2.252601,-1.185303,-1.683359,0.436008,-1.951988,-1.49587,0.443065,2.42161,...,-1.951988,1.108829,-0.880836,0.373057,0.588445,-1.067123,0.5039,-0.079612,-5.120304,6.213706
7,7,2,0.531746,-0.608035,-2.43983,2.051423,-0.151138,-1.447613,-2.153115,-1.684537,...,-0.151138,-0.087412,-0.533995,1.259073,0.298055,-0.405859,1.811476,-2.311418,-4.25071,4.62065
8,8,2,-0.482255,-2.326329,-0.620424,1.804502,-1.175,0.337116,-0.013467,-0.523706,...,-1.175,0.641196,0.126834,-1.402866,0.503003,-0.506131,1.780878,1.156384,0.661142,5.359189
9,9,2,-2.022322,-0.54579,-1.350792,1.292113,-2.574619,0.505346,1.013889,0.111736,...,-2.574619,-0.032152,-1.906381,0.053999,0.270222,-0.716175,0.320396,-2.077989,1.003213,2.565052
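
In the ten rows printed above, `feature_5` and `feature_21` show identical values. Whether that holds for all 1000 rows is not visible here, but it is easy to check. The sketch below looks for columns that exactly repeat an earlier column; `duplicated_columns` is a hypothetical helper (not part of pandas) and `demo` is a small made-up frame standing in for `df`.

```python
import pandas as pd


def duplicated_columns(frame: pd.DataFrame) -> dict:
    """Map each column that exactly repeats an earlier column to that earlier column."""
    seen = {}      # tuple of column values -> first column name holding them
    repeats = {}   # repeated column name -> name of the column it copies
    for name in frame.columns:
        key = tuple(frame[name].tolist())
        if key in seen:
            repeats[name] = seen[key]
        else:
            seen[key] = name
    return repeats


# Hypothetical stand-in for the real dataset: feature_21 copies feature_5.
demo = pd.DataFrame({
    "label": [2, 0, 2, 1],
    "feature_5": [-1.469506, 0.344652, -1.333614, -4.133465],
    "feature_8": [1.172442, -0.442092, -0.08911, 0.330346],
    "feature_21": [-1.469506, 0.344652, -1.333614, -4.133465],
})

print(duplicated_columns(demo))
```

On the real data the call would be `duplicated_columns(df)`. A repeated column adds no information and could be dropped before training.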


In [13]:
# Build the feature matrix and the label vector

X = [] # features
y = [] # labels

for i in range(len(df)):
    arr = df.iloc[i].tolist()
    X.append(arr[2:])  # positions 2 onward hold the numeric features
    y.append(arr[1])   # position 1 holds the label (position 0 is the sample id)

# Hold out 20 % of the samples as a test set; of the few test_size values I tried, 0.2 worked best
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
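
The row-by-row loop can also be written with column slicing. The sketch below assumes, as the loop above does, that position 0 holds the sample id, position 1 the label and the rest the features. `toy` is synthetic data standing in for `df`, and `stratify=y` is an optional addition (not used in the original run) that keeps the class proportions equal in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for df: id, label (3 classes), then 30 numeric features.
toy = pd.DataFrame(rng.normal(size=(300, 30)),
                   columns=[f"feature_{i}" for i in range(1, 31)])
toy.insert(0, "label", np.tile([0, 1, 2], 100))
toy.insert(0, "id", np.arange(300))

X = toy.iloc[:, 2:].to_numpy()   # all feature columns
y = toy.iloc[:, 1].to_numpy()    # label column

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(X.shape, x_train.shape, x_test.shape)
print(np.bincount(y_test))       # class counts in the test split
```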

#### NOW IT'S TIME TO START BUILDING OUR FIRST MODEL: SVC

In [14]:
#Linear SVM baseline: all parameters left at their defaults, except max_iter, which is raised to 5000

from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

mach = make_pipeline(StandardScaler(),
                     LinearSVC(max_iter = 5000, random_state=0))
mach.fit(x_train, y_train)
y_pred = mach.predict(x_test)

In [15]:
# Accuracy and per-class precision, recall and F1 of the baseline on the held-out test set
print(metrics.accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.635
              precision    recall  f1-score   support

         0.0       0.74      0.70      0.72        71
         1.0       0.63      0.54      0.58        67
         2.0       0.55      0.66      0.60        62

    accuracy                           0.64       200
   macro avg       0.64      0.63      0.63       200
weighted avg       0.64      0.64      0.64       200



#### NOW IT'S TIME TO START BUILDING OUR SECOND MODEL: RANDOM FOREST

In [16]:
#Random forest baseline: default parameters (n_estimators = 100 is the default) and a fixed seed
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100, random_state = 0) 
rf.fit(x_train, y_train)

RandomForestClassifier(random_state=0)

In [17]:
# Accuracy and per-class precision, recall and F1 of the baseline on the held-out test set
y_pred = rf.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.795
              precision    recall  f1-score   support

         0.0       0.81      0.76      0.78        71
         1.0       0.82      0.81      0.81        67
         2.0       0.76      0.82      0.79        62

    accuracy                           0.80       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.80      0.79       200
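
A single 80/20 split gives a fairly noisy accuracy estimate on 200 test samples. One way to compare the two default models more robustly is k-fold cross-validation. The sketch below uses `make_classification` to generate synthetic data as a stand-in for the real dataset, so the printed scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic 3-class problem with 30 features (stand-in for the real data).
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "LinearSVC": make_pipeline(StandardScaler(),
                               LinearSVC(max_iter=5000, random_state=0)),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold CV; the pipeline refits the scaler inside every fold.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the mean together with the standard deviation across folds makes it easier to judge whether a gap between two models is larger than the fold-to-fold variation.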



###### WE SAW HOW THE ALGORITHMS ARE PERFORMING. NOW IT IS TIME TO TRY TO OPTIMIZE THE PARAMETERS AND HOPEFULLY OBTAIN HIGHER CLASSIFICATION SCORES, STARTING WITH RANDOM FOREST ALGORITHM:

In [18]:
#With both baselines in place, I tune the random forest with a randomized search over the grid below

n_estimators = list(range(50, 101))                          # number of trees in the forest (50 to 100)
max_features = ['auto', 'sqrt']                              # features considered at each split; for classifiers 'auto' is an
                                                             # alias of 'sqrt' and is deprecated in newer scikit-learn releases
max_depth = [int(x) for x in np.linspace(10, 120, num = 12)] # maximum depth of each tree (10, 20, ..., 120)
min_samples_split = [2*x for x in range(1,6)]                # minimum number of samples required to split a node
min_samples_leaf = [1, 3, 4]                                 # minimum number of samples required at a leaf node
bootstrap = [True, False]                                    # whether each tree is trained on a bootstrap sample

random_grid =  {'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth,
                'min_samples_split': min_samples_split,
                'min_samples_leaf': min_samples_leaf,
                'bootstrap': bootstrap}
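
To see why an exhaustive search is impractical here, it helps to count the combinations in this grid. The short calculation below multiplies the number of candidate values per parameter and compares the total with the 100 candidates drawn by the randomized search in the next cell.

```python
from math import prod

# Number of candidate values per parameter, as defined in the grid above.
grid_sizes = {
    "n_estimators": len(range(50, 101)),   # 51 values
    "max_features": 2,
    "max_depth": 12,
    "min_samples_split": 5,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total = prod(grid_sizes.values())
n_iter = 100  # candidates drawn by the randomized search

print(total)                     # 36720
print(f"{n_iter / total:.2%}")   # share of the grid that is sampled
```

With 5-fold cross-validation the full grid would need 183,600 fits, against 500 for the randomized search. Since `'auto'` and `'sqrt'` behave identically for a classifier, only half of the 36,720 combinations are actually distinct.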

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier(random_state = 0) 

rf_random = RandomizedSearchCV(estimator = rf,param_distributions = random_grid,
               n_iter = 100, cv = 5, scoring = "accuracy", verbose=2, random_state=0, n_jobs = -1)
rf_random.fit(x_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=0),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      120],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 3, 4],
                                        'min_samples_split': [2, 4, 6, 8, 10],
                                        'n_estimators': [50, 51, 52, 53, 54, 55,
                                                         56, 57, 58, 59, 60, 61,
                                                         62, 63, 64, 65, 66, 67,
                                                         68, 69, 70, 71, 72, 73,
                                                     

In [20]:
print ('Best Parameters for random forest classifier: ', rf_random.best_params_)

Best Parameters for random forest classifier:  {'n_estimators': 99, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 100, 'bootstrap': False}
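
`best_params_` reports only the winner. The fitted search object also exposes `cv_results_`, which can be ranked to see how close the runners-up are. This is a minimal sketch on synthetic data with a deliberately small search; the grid and `n_iter=8` are illustrative choices, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           n_classes=3, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [20, 40, 60],
                         "max_depth": [5, 10, None],
                         "min_samples_leaf": [1, 3]},
    n_iter=8, cv=3, scoring="accuracy", random_state=0)
search.fit(X, y)

# One row per sampled candidate, best mean CV accuracy first.
results = (pd.DataFrame(search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results.head())
```

If several candidates score within one standard deviation of the best, the simpler or cheaper of them may be a reasonable choice.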


In [21]:
from sklearn.ensemble import RandomForestClassifier

# Refit a forest with the best parameters found by the randomized search
rf_optimal = RandomForestClassifier(n_estimators = 99, max_depth = 100,
                                    min_samples_split = 4, min_samples_leaf = 1, 
                                    max_features = "auto", bootstrap = False, random_state = 0) 
rf_optimal.fit(x_train, y_train)
y_pred = rf_optimal.predict(x_test)
metrics.accuracy_score(y_test, y_pred) # test accuracy rises from 0.795 to 0.845, a relative
                                       # improvement of (0.845 - 0.795)/0.795 = 6.29 %

0.845

#### MOVING ON TO THE OPTIMIZATION OF THE SUPPORT VECTOR MACHINE ALGORITHM:

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

mach = SVC(max_iter = 5000, random_state=0, kernel = "linear")

# Parameter grid for the search. Note that gamma is ignored by the linear kernel, so the
# five gamma values give identical scores and the search effectively varies C only.

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             }

#mach_grid = RandomizedSearchCV(estimator = mach,param_distributions = param_grid,
               #n_iter = 100, cv = 5, verbose=2, scoring = "accuracy", random_state=0, n_jobs = -1)
    
# The grid has only 25 combinations, fewer than n_iter = 100, so I use an exhaustive GridSearchCV instead

mach_grid = GridSearchCV(mach, param_grid, cv = 5, refit = True, scoring = "accuracy", verbose = 3)
 
# Standardize the training data and fit the grid search. Without scaling, the search scored worse
# than my first attempt with default parameters, so I kept the scaling step.

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
mach_grid.fit(x_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ....................C=0.1, gamma=1;, score=0.544 total time=   0.0s
[CV 2/5] END ....................C=0.1, gamma=1;, score=0.600 total time=   0.0s
[CV 3/5] END ....................C=0.1, gamma=1;, score=0.619 total time=   0.0s
[CV 4/5] END ....................C=0.1, gamma=1;, score=0.631 total time=   0.0s
[CV 5/5] END ....................C=0.1, gamma=1;, score=0.525 total time=   0.0s
[CV 1/5] END ..................C=0.1, gamma=0.1;, score=0.544 total time=   0.0s
[CV 2/5] END ..................C=0.1, gamma=0.1;, score=0.600 total time=   0.0s
[CV 3/5] END ..................C=0.1, gamma=0.1;, score=0.619 total time=   0.0s
[CV 4/5] END ..................C=0.1, gamma=0.1;, score=0.631 total time=   0.0s
[CV 5/5] END ..................C=0.1, gamma=0.1;, score=0.525 total time=   0.0s
[CV 1/5] END .................C=0.1, gamma=0.01;, score=0.544 total time=   0.0s
[CV 2/5] END .................C=0.1, gamma=0.01



[CV 3/5] END ......................C=1, gamma=1;, score=0.600 total time=   0.0s
[CV 4/5] END ......................C=1, gamma=1;, score=0.619 total time=   0.0s
[CV 5/5] END ......................C=1, gamma=1;, score=0.544 total time=   0.0s
[CV 1/5] END ....................C=1, gamma=0.1;, score=0.562 total time=   0.0s
[CV 2/5] END ....................C=1, gamma=0.1;, score=0.575 total time=   0.0s
[CV 3/5] END ....................C=1, gamma=0.1;, score=0.600 total time=   0.0s
[CV 4/5] END ....................C=1, gamma=0.1;, score=0.619 total time=   0.0s
[CV 5/5] END ....................C=1, gamma=0.1;, score=0.544 total time=   0.0s
[CV 1/5] END ...................C=1, gamma=0.01;, score=0.562 total time=   0.0s
[CV 2/5] END ...................C=1, gamma=0.01;, score=0.575 total time=   0.0s
[CV 3/5] END ...................C=1, gamma=0.01;, score=0.600 total time=   0.0s
[CV 4/5] END ...................C=1, gamma=0.01;, score=0.619 total time=   0.0s
[CV 5/5] END ...................C=1, gamma=0.01;, score=0.544 total time=   0.0s
[CV 1/5] END ..................C=1, gamma=0.001;, score=0.562 total time=   0.0s
[CV 2/5] END ..................C=1, gamma=0.001;, score=0.575 total time=   0.0s
[CV 3/5] END ..................C=1, gamma=0.001;, score=0.600 total time=   0.0s
[CV 4/5] END ..................C=1, gamma=0.001;, score=0.619 total time=   0.0s
[CV 5/5] END ..................C=1, gamma=0.001;, score=0.544 total time=   0.0s
[CV 1/5] END .................C=1, gamma=0.0001;, score=0.562 total time=   0.0s
[CV 2/5] END .................C=1, gamma=0.0001;, score=0.575 total time=   0.0s
[CV 3/5] END .................C=1, gamma=0.0001;, score=0.600 total time=   0.0s
[CV 4/5] END .................C=1, gamma=0.0001;, score=0.619 total time=   0.0s
[CV 5/5] END .................C=1, gamma=0.0001;, score=0.544 total time=   0.0s
[CV 1/5] END .....................C=10, gamma=1;, score=0.419 total time=   0.0s
[CV 2/5] END .....................C=10, gamma=1;, score=0.431 total time=   0.0s
[CV 3/5] END .....................C=10, gamma=1;, score=0.431 total time=   0.0s
[CV 4/5] END .....................C=10, gamma=1;, score=0.487 total time=   0.0s
[CV 5/5] END .....................C=10, gamma=1;, score=0.381 total time=   0.0s
[CV 1/5] END ...................C=10, gamma=0.1;, score=0.419 total time=   0.0s
[CV 2/5] END ...................C=10, gamma=0.1;, score=0.431 total time=   0.0s
[CV 3/5] END ...................C=10, gamma=0.1;, score=0.431 total time=   0.0s
[CV 4/5] END ...................C=10, gamma=0.1;, score=0.487 total time=   0.0s
[CV 5/5] END ...................C=10, gamma=0.1;, score=0.381 total time=   0.0s
[CV 1/5] END ..................C=10, gamma=0.01;, score=0.419 total time=   0.0s
[CV 2/5] END ..................C=10, gamma=0.01;, score=0.431 total time=   0.0s
[CV 3/5] END ..................C=10, gamma=0.01;, score=0.431 total time=   0.0s
[CV 4/5] END ..................C=10, gamma=0.01;, score=0.487 total time=   0.0s
[CV 5/5] END ..................C=10, gamma=0.01;, score=0.381 total time=   0.0s
[CV 1/5] END .................C=10, gamma=0.001;, score=0.419 total time=   0.0s
[CV 2/5] END .................C=10, gamma=0.001;, score=0.431 total time=   0.0s
[CV 3/5] END .................C=10, gamma=0.001;, score=0.431 total time=   0.0s
[CV 4/5] END .................C=10, gamma=0.001;, score=0.487 total time=   0.0s
[CV 5/5] END .................C=10, gamma=0.001;, score=0.381 total time=   0.0s
[CV 1/5] END ................C=10, gamma=0.0001;, score=0.419 total time=   0.0s
[CV 2/5] END ................C=10, gamma=0.0001;, score=0.431 total time=   0.0s
[CV 3/5] END ................C=10, gamma=0.0001;, score=0.431 total time=   0.0s
[CV 4/5] END ................C=10, gamma=0.0001;, score=0.487 total time=   0.0s
[CV 5/5] END ................C=10, gamma=0.0001;, score=0.381 total time=   0.0s
[CV 1/5] END ....................C=100, gamma=1;, score=0.394 total time=   0.0s
[CV 2/5] END ....................C=100, gamma=1;, score=0.481 total time=   0.0s
[CV 3/5] END ....................C=100, gamma=1;, score=0.338 total time=   0.0s
[CV 4/5] END ....................C=100, gamma=1;, score=0.400 total time=   0.0s
[CV 5/5] END ....................C=100, gamma=1;, score=0.306 total time=   0.0s
[CV 1/5] END ..................C=100, gamma=0.1;, score=0.394 total time=   0.0s
[CV 2/5] END ..................C=100, gamma=0.1;, score=0.481 total time=   0.0s
[CV 3/5] END ..................C=100, gamma=0.1;, score=0.338 total time=   0.0s
[CV 4/5] END ..................C=100, gamma=0.1;, score=0.400 total time=   0.0s
[CV 5/5] END ..................C=100, gamma=0.1;, score=0.306 total time=   0.0s
[CV 1/5] END .................C=100, gamma=0.01;, score=0.394 total time=   0.0s
[CV 2/5] END .................C=100, gamma=0.01;, score=0.481 total time=   0.0s
[CV 3/5] END .................C=100, gamma=0.01;, score=0.338 total time=   0.0s
[CV 4/5] END .................C=100, gamma=0.01;, score=0.400 total time=   0.0s
[CV 5/5] END .................C=100, gamma=0.01;, score=0.306 total time=   0.0s
[CV 1/5] END ................C=100, gamma=0.001;, score=0.394 total time=   0.0s
[CV 2/5] END ................C=100, gamma=0.001;, score=0.481 total time=   0.0s
[CV 3/5] END ................C=100, gamma=0.001;, score=0.338 total time=   0.0s
[CV 4/5] END ................C=100, gamma=0.001;, score=0.400 total time=   0.0s
[CV 5/5] END ................C=100, gamma=0.001;, score=0.306 total time=   0.0s
[CV 1/5] END ...............C=100, gamma=0.0001;, score=0.394 total time=   0.0s
[CV 2/5] END ...............C=100, gamma=0.0001;, score=0.481 total time=   0.0s
[CV 3/5] END ...............C=100, gamma=0.0001;, score=0.338 total time=   0.0s
[CV 4/5] END ...............C=100, gamma=0.0001;, score=0.400 total time=   0.0s
[CV 5/5] END ...............C=100, gamma=0.0001;, score=0.306 total time=   0.0s
[CV 1/5] END ...................C=1000, gamma=1;, score=0.394 total time=   0.0s
[CV 2/5] END ...................C=1000, gamma=1;, score=0.481 total time=   0.0s
[CV 3/5] END ...................C=1000, gamma=1;, score=0.338 total time=   0.0s
[CV 4/5] END ...................C=1000, gamma=1;, score=0.400 total time=   0.0s
[CV 5/5] END ...................C=1000, gamma=1;, score=0.306 total time=   0.0s
[CV 1/5] END .................C=1000, gamma=0.1;, score=0.394 total time=   0.0s
[CV 2/5] END .................C=1000, gamma=0.1;, score=0.481 total time=   0.0s
[CV 3/5] END .................C=1000, gamma=0.1;, score=0.338 total time=   0.0s
[CV 4/5] END .................C=1000, gamma=0.1;, score=0.400 total time=   0.0s
[CV 5/5] END .................C=1000, gamma=0.1;, score=0.306 total time=   0.0s
[CV 1/5] END ................C=1000, gamma=0.01;, score=0.394 total time=   0.0s
[CV 2/5] END ................C=1000, gamma=0.01;, score=0.481 total time=   0.0s
[CV 3/5] END ................C=1000, gamma=0.01;, score=0.338 total time=   0.0s
[CV 4/5] END ................C=1000, gamma=0.01;, score=0.400 total time=   0.0s
[CV 5/5] END ................C=1000, gamma=0.01;, score=0.306 total time=   0.0s
[CV 1/5] END ...............C=1000, gamma=0.001;, score=0.394 total time=   0.0s
[CV 2/5] END ...............C=1000, gamma=0.001;, score=0.481 total time=   0.0s
[CV 3/5] END ...............C=1000, gamma=0.001;, score=0.338 total time=   0.0s
[CV 4/5] END ...............C=1000, gamma=0.001;, score=0.400 total time=   0.0s
[CV 5/5] END ...............C=1000, gamma=0.001;, score=0.306 total time=   0.0s
[CV 1/5] END ..............C=1000, gamma=0.0001;, score=0.394 total time=   0.0s
[CV 2/5] END ..............C=1000, gamma=0.0001;, score=0.481 total time=   0.0s
[CV 3/5] END ..............C=1000, gamma=0.0001;, score=0.338 total time=   0.0s
[CV 4/5] END ..............C=1000, gamma=0.0001;, score=0.400 total time=   0.0s
[CV 5/5] END ..............C=1000, gamma=0.0001;, score=0.306 total time=   0.0s

GridSearchCV(cv=5,
             estimator=SVC(kernel='linear', max_iter=5000, random_state=0),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
             scoring='accuracy', verbose=3)

In [23]:
# print best parameter after tuning
print ('Best Parameters for support vector classifier: ', mach_grid.best_params_)
 

Best Parameters for support vector classifier:  {'C': 0.1, 'gamma': 1}
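
The assignment names `LinearSVC`, which has no `gamma` parameter; its main regularization setting is `C`. A possible variant of the search above, sketched here on synthetic data, tunes `C` for a `LinearSVC` inside a pipeline so that the scaler is refitted on the training part of every fold. The `C` range is an illustrative assumption, not a recommendation derived from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           n_classes=3, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LinearSVC(max_iter=5000, random_state=0))

# make_pipeline names each step after its lowercased class name,
# hence the "linearsvc__" prefix for the C parameter.
param_grid = {"linearsvc__C": [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Because the pipeline is the estimator, `grid.predict(x_test)` can be called on raw test features and the scaling learned from the training data is applied automatically.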


In [24]:
mach_optimal = SVC(max_iter = 5000, C = 0.1, gamma = 1, random_state=0, kernel = 'linear')

# Caution: x_train was already standardized in the grid-search cell, so this second scaler is
# fitted on standardized data (mean close to 0, scale close to 1) and changes x_test very little.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)

mach_optimal.fit(x_train, y_train)
x_test = sc.transform(x_test)
y_pred = mach_optimal.predict(x_test)
 
# print classification report and accuracy
print(metrics.accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))      # The tuned model reaches 0.705 test accuracy, a relative improvement of
                                                 # (0.705-0.635)/0.635 = 11.02 % over the default-parameter LinearSVC baseline.

0.705
              precision    recall  f1-score   support

         0.0       0.69      0.83      0.76        71
         1.0       0.75      0.60      0.67        67
         2.0       0.68      0.68      0.68        62

    accuracy                           0.70       200
   macro avg       0.71      0.70      0.70       200
weighted avg       0.71      0.70      0.70       200
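
One detail of the cell above deserves a check: `x_train` had already been standardized in the grid-search cell, and a new scaler is fitted on it before `x_test` is transformed. A scaler fitted on standardized data learns a mean of about 0 and a scale of about 1, so it leaves raw test data nearly unchanged. The sketch below illustrates this on made-up numbers; the usual pattern is to fit the scaler once on the raw training data and reuse it for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw_train = rng.normal(loc=5.0, scale=3.0, size=(800, 4))  # made-up raw data
raw_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

first = StandardScaler().fit(raw_train)
scaled_train = first.transform(raw_train)

# Second scaler fitted on data that is already standardized.
second = StandardScaler().fit(scaled_train)
print(np.round(second.mean_, 6), np.round(second.scale_, 6))

wrong_test = second.transform(raw_test)   # stays close to the raw values
right_test = first.transform(raw_test)    # uses the raw training statistics

print(round(float(wrong_test.mean()), 2), round(float(right_test.mean()), 2))
```

In this toy example the test data transformed with the second scaler keeps a mean near 5, while the data transformed with the first scaler is centred near 0, matching what the model saw during training.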



In [25]:
print("--- %s seconds ---" % (time.time() - start_time))  # total runtime of the notebook, which was
                                                          # about 40 seconds in the run shown below

--- 40.07442665100098 seconds ---
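
A single start-to-end timer does not show which stage dominates the runtime. A small context manager built on `time.perf_counter` can time each stage separately. In this sketch `timed` and the stage names are hypothetical, and the `sum(...)` calls merely stand in for the real fitting steps.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


with timed("random_forest_search"):
    sum(i * i for i in range(200_000))   # placeholder workload

with timed("svm_search"):
    sum(i * i for i in range(50_000))    # placeholder workload

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f} s")
```

Wrapping the two hyperparameter searches this way would show how the roughly 40 seconds are split between them.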


## IN CONCLUSION:
I used GridSearchCV for the SVM model and RandomizedSearchCV for the random forest model, simply because I found each search method easier to implement for the respective algorithm. 
On the held-out test set, the optimized random forest performs much better than the optimized support vector machine: 0.845 versus 0.705 accuracy, a relative difference of (0.845-0.705)/0.705 = 19.86 %. 
Therefore, random forest is preferable to SVM for this dataset. The optimal random forest parameters found by RandomizedSearchCV are {'n_estimators': 99, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 100, 'bootstrap': False}.
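
The relative differences quoted in this notebook all follow the same formula, (new - old) / old. A tiny helper (the name `relative_gain` is hypothetical) recomputes them from the test accuracies reported above:

```python
def relative_gain(old, new):
    """Relative change from `old` to `new`, as a percentage."""
    return (new - old) / old * 100


# Test accuracies reported in the cells above.
rf_default, rf_tuned = 0.795, 0.845
svm_default, svm_tuned = 0.635, 0.705

print(f"RF tuning:  {relative_gain(rf_default, rf_tuned):.2f} %")    # 6.29 %
print(f"SVM tuning: {relative_gain(svm_default, svm_tuned):.2f} %")  # 11.02 %
print(f"RF vs SVM:  {relative_gain(svm_tuned, rf_tuned):.2f} %")     # 19.86 %
```

In absolute terms these correspond to gains of 5, 7 and 14 percentage points of accuracy, respectively.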