# SVM, Linear Regression, & Logistic Regression Analyses

- Let's run SVM, linear regression, and logistic regression on the NBA data.
- The class label $\in \{\text{name}, \text{not a name}\}$

In [5]:
# import necessary packages
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import model_selection 
from sklearn import preprocessing 
from sklearn.metrics import precision_score, recall_score, f1_score

In [28]:
# Import data
fileName = 'I_examples.csv'
df = pd.read_csv(fileName)

# Get data X (excludes first three columns and class label)
X = df.values[:,3:25].astype(float)

classLabel = df.values[:, 25].astype(int)

# Standardize columns string_length and number_words
X1 = preprocessing.scale(X[:,0:2])

X_std = np.concatenate((X1,X[:,2:25]), axis = 1)
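
Note that `preprocessing.scale` above computes the mean and standard deviation from every row, including rows that later end up in a test fold. One possible alternative, shown here only as a minimal sketch, is to fit a `StandardScaler` on the training rows and reuse its statistics for the test rows. The data, the split, and the helper `standardize_first_two` below are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: two numeric columns to standardize,
# followed by three other feature columns that are left untouched
X_demo = np.hstack([rng.normal(10, 3, size=(100, 2)),
                    rng.integers(0, 2, size=(100, 3)).astype(float)])

# Hypothetical train/test split (in the notebook these come from kfold.split)
train_idx, test_idx = np.arange(80), np.arange(80, 100)

# Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_demo[train_idx, 0:2])

def standardize_first_two(X_part):
    """Scale columns 0-1 with the training statistics; keep the rest as-is."""
    return np.concatenate((scaler.transform(X_part[:, 0:2]), X_part[:, 2:]), axis=1)

X_train_std = standardize_first_two(X_demo[train_idx])
X_test_std = standardize_first_two(X_demo[test_idx])
```

With this arrangement the test rows never influence the scaling parameters, at the cost of refitting the scaler once per fold.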

In [29]:
# Create table to place results in
resultMatrix_svm = np.zeros((11,3))
resultMatrix_linreg = np.zeros((11,3))
resultMatrix_logreg = np.zeros((11,3))

In [30]:
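# Create folds (no shuffling, so random_state is omitted; recent scikit-learn
# versions raise an error if random_state is set while shuffle=False)
kfold = model_selection.StratifiedKFold(n_splits=10)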

In [31]:
# Create classifiers
svm = SVC(kernel = 'rbf', random_state = 1, gamma = 0.1, C = 10.0)
logreg = LogisticRegression(C = 100.0, random_state = 1)
linreg = LinearRegression()

In [32]:
currRow = 0

for train_idx, test_idx in kfold.split(X_std, classLabel):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    X_train, X_test = X_std[train_idx], X_std[test_idx]
    y_train, y_test = classLabel[train_idx], classLabel[test_idx]
    
    ###
    #debug
    #vals_yTrain, counts_yTrain = np.unique(y_train, return_counts = True)
    #vals_yTest, counts_yTest = np.unique(y_test, return_counts = True)

    #print("TRAIN LABEL COUNT of 0's:", counts_yTrain[0])
    #print("TRAIN LABEL COUNT of 1's:", counts_yTrain[1])
    #print("TEST LABEL COUNT of 0's:", counts_yTest[0])
    #print("TEST LABEL COUNT of 1's:", counts_yTest[1])
    ###
    
    # fit models to data
    svm.fit(X_train, y_train)
    logreg.fit(X_train, y_train)
    linreg.fit(X_train, y_train)
    
    # predict class labels
    y_pred_svm = svm.predict(X_test)
    y_pred_logreg = logreg.predict(X_test)
    y_pred_linreg = linreg.predict(X_test)
    
    # apply a threshold (mean of this fold's test predictions, rounded to 2 decimals)
    thresh = round(np.mean(y_pred_linreg), 2)
    y_pred_linreg = np.where(y_pred_linreg > thresh, 1, 0)
    
    #compute P, R, and F1 for each classifier
    P_svm = precision_score(y_test, y_pred_svm, average = "macro")
    R_svm = recall_score(y_test, y_pred_svm, average = "macro")
    F1_svm = f1_score(y_test, y_pred_svm, average = "macro")
    
    P_logreg = precision_score(y_test, y_pred_logreg, average = "macro")
    R_logreg = recall_score(y_test, y_pred_logreg, average = "macro")
    F1_logreg = f1_score(y_test, y_pred_logreg, average = "macro")
  
    P_linreg = precision_score(y_test, y_pred_linreg, average = "macro")
    R_linreg = recall_score(y_test, y_pred_linreg, average = "macro")
    F1_linreg = f1_score(y_test, y_pred_linreg, average = "macro")    

    # add to matrix (to return)
    currFoldResult = [P_svm, R_svm, F1_svm]
    resultMatrix_svm[currRow] = currFoldResult  
    
    currFoldResult = [P_logreg, R_logreg, F1_logreg]
    resultMatrix_logreg[currRow] = currFoldResult
    
    currFoldResult = [P_linreg, R_linreg, F1_linreg]
    resultMatrix_linreg[currRow] = currFoldResult
    
    currRow = currRow + 1

TRAIN: [ 2194  2195  2196 ..., 22411 22412 22413] TEST: [   0    1    2 ..., 2398 2402 2409]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [2194 2195 2196 ..., 4874 4877 4879]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [4397 4398 4399 ..., 7377 7394 7402]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [6557 6558 6559 ..., 9496 9498 9505]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [ 8800  8802  8803 ..., 11592 11602 11608]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [11099 11100 11101 ..., 13672 13681 13683]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [13376 13377 13378 ..., 15870 15873 15880]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [15623 15624 15625 ..., 18063 18065 18071]
TRAIN: [    0     1     2 ..., 22411 22412 22413] TEST: [17903 17904 17905 ..., 20288 20293 20298]
TRAIN: [    0     1     2 ..., 20288 20293 20298] TEST: [20099 20102 20104 ..., 22411 22412 22413]


In [33]:
# calculate mean and add to resultMatrix
meanResult_svm = resultMatrix_svm[0:10,:].mean(0)
resultMatrix_svm[10] = meanResult_svm
resultMatrix_svm = resultMatrix_svm * 100

meanResult_logreg = resultMatrix_logreg[0:10,:].mean(0)
resultMatrix_logreg[10] = meanResult_logreg
resultMatrix_logreg = resultMatrix_logreg * 100

meanResult_linreg = resultMatrix_linreg[0:10,:].mean(0)
resultMatrix_linreg[10] = meanResult_linreg
resultMatrix_linreg = resultMatrix_linreg * 100

In [34]:
# save to csv file
np.savetxt("Stage1_Results/svmResults.csv", resultMatrix_svm, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

np.savetxt("Stage1_Results/logRegResults.csv", resultMatrix_logreg, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

np.savetxt("Stage1_Results/linRegResults.csv", resultMatrix_linreg, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

# Linear Regression Results

- A threshold $T$ was set to the mean of the linear regression predictions on each test fold, rounded to two decimal places. Predictions less than or equal to $T$ are labeled as class 0. Predictions greater than $T$ are labeled as class 1.
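
The thresholding rule can be illustrated on a handful of values. This is a minimal sketch; the `scores` array is made up for illustration and is not output from the model above.

```python
import numpy as np

# Hypothetical continuous outputs of a regression model on one test fold
scores = np.array([0.10, 0.35, 0.80, 0.55, 0.20])

# Threshold T: mean of the test predictions, rounded to two decimals
T = round(np.mean(scores), 2)

# Values <= T become class 0, values > T become class 1
labels = np.where(scores > T, 1, 0)
print(T, labels)
```

Here the mean is 0.4, so only the scores 0.80 and 0.55 are mapped to class 1.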

In [35]:
print("\nAverage Precision: "+ str(resultMatrix_linreg[10,0])+
      "\nAverage Recall: " + str(resultMatrix_linreg[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_linreg[10,2]))


Average Precision: 74.6905143007
Average Recall: 84.9258260986
Average F1 score: 75.5272893862


# Logistic Regression Results

- Parameter C was set to 100.0.
- This value was chosen informally: of the handful of values I tried, it gave the best results.
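
An informal search like this could be made repeatable with a grid search. The following is a minimal sketch under stated assumptions: the data comes from `make_classification` rather than `I_examples.csv`, and the candidate grid is hypothetical, not the set of values actually tried for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for (X_std, classLabel)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# Hypothetical candidate values for the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",               # same metric family as the notebook's F1
    cv=StratifiedKFold(n_splits=10),  # same fold scheme as the notebook
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Selecting C on the same folds that are used to report the final scores tends to give optimistic numbers, so a separate held-out set or nested cross-validation may be worth considering.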

In [36]:
print("\nAverage Precision: "+ str(resultMatrix_logreg[10,0])+
      "\nAverage Recall: " + str(resultMatrix_logreg[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_logreg[10,2]))


Average Precision: 89.2859540047
Average Recall: 89.0085838677
Average F1 score: 89.1129662633


# SVM Results

- An RBF kernel was used for the SVM. In practice, RBF is a reasonable default kernel to start with. The parameters were set to C = 10.0 and gamma = 0.1.
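
For reference, scikit-learn's RBF kernel is $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, so a larger gamma makes the similarity between two points fall off faster with distance. The sketch below checks that formula against `sklearn.metrics.pairwise.rbf_kernel` for two made-up points. The points are hypothetical; only `gamma = 0.1` is taken from the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical feature vectors
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 2.0]])

gamma = 0.1  # same value passed to SVC above

# Manual evaluation: squared distance is 1^2 + 2^2 = 5, so K = exp(-0.1 * 5)
manual = np.exp(-gamma * np.sum((x - z) ** 2))

# Library evaluation of the same kernel
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(manual, library)
```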

In [37]:
print("\nAverage Precision: "+ str(resultMatrix_svm[10,0])+
      "\nAverage Recall: " + str(resultMatrix_svm[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_svm[10,2]))


Average Precision: 91.1454094032
Average Recall: 90.4030671802
Average F1 score: 90.7524818537


# Notes:

- I created a folder called "Stage1_Results" and placed the result matrices there.
    - The first ten rows are the per-fold P, R, and F1 scores. The last row represents the average of each of P, R, and F1 across the folds.
- SVM performed best, followed by logistic regression and then linear regression.
- Given that recall is high, we could focus on increasing precision.
- With precision hovering around 91%, it looks as though the SVM classifier is, for the most part, correctly predicting names as names.
- However, there are times when the classifier predicts false positives,
    - i.e. it incorrectly predicts that an instance is a name when in reality it is not.
- So how do we increase P?
    - One way would be to set a higher threshold on our predictions. This would make the classifier more conservative, at the risk of lowering recall (a trade-off we could accept).
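
The threshold trade-off can be sketched on toy data. Everything below is hypothetical: the labels and scores are invented, and `precision_recall` is an illustrative helper. In the notebook, scores of this kind could come from something like `svm.decision_function(X_test)` or `logreg.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

# Hypothetical true labels and classifier scores (higher = more likely a name)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.2, 0.1])

def precision_recall(y_true, scores, threshold):
    """Precision and recall of class 1 when predicting 1 for scores > threshold."""
    y_pred = (scores > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for t in (0.3, 0.55, 0.75):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.75 moves precision from about 0.67 to 1.0 while recall drops from 1.0 to 0.5. Real data will not always behave this smoothly, so the threshold is best chosen on a validation set.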