# SVM, Linear Regression, & Logistic Regression Analyses

- Let's run SVM, linear regression, and logistic regression on the NBA data.
- The class label $\in \{\text{name}, \text{not a name}\}$

In [37]:
# import necessary packages
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import model_selection 
from sklearn import preprocessing 
from sklearn.metrics import precision_score, recall_score, f1_score

In [40]:
# Import data
fileName = 'I_examples.csv'
df = pd.read_csv(fileName)

# Get data X (excludes first three columns and class label)
X = df.values[:,3:23].astype(float)

classLabel = df.values[:, 23].astype(int)

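# Standardize columns string_length and number_words (the first two feature columns)
X1 = preprocessing.scale(X[:, 0:2])

# Re-attach the remaining (unscaled) feature columns; X has 20 columns, so slice to the end
X_std = np.concatenate((X1, X[:, 2:]), axis=1)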

In [41]:
# Create table to place results in
resultMatrix_svm = np.zeros((11,3))
resultMatrix_linreg = np.zeros((11,3))
resultMatrix_logreg = np.zeros((11,3))

In [42]:
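# Create folds (no shuffling, so random_state has no effect and is omitted;
# recent scikit-learn versions raise an error if it is set without shuffle=True)
kfold = model_selection.StratifiedKFold(n_splits=10)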

In [43]:
# Create classifiers
svm = SVC(kernel = 'rbf', random_state = 1, gamma = 0.1, C = 10.0)
logreg = LogisticRegression(C = 100.0, random_state = 1)
linreg = LinearRegression()

In [44]:
currRow = 0

for train_idx, test_idx in kfold.split(X_std, classLabel):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    X_train, X_test = X_std[train_idx], X_std[test_idx]
    y_train, y_test = classLabel[train_idx], classLabel[test_idx]
    
    ###
    #debug
    #vals_yTrain, counts_yTrain = np.unique(y_train, return_counts = True)
    #vals_yTest, counts_yTest = np.unique(y_test, return_counts = True)

    #print("TRAIN LABEL COUNT of 0's:", counts_yTrain[0])
    #print("TRAIN LABEL COUNT of 1's:", counts_yTrain[1])
    #print("TEST LABEL COUNT of 0's:", counts_yTest[0])
    #print("TEST LABEL COUNT of 1's:", counts_yTest[1])
    ###
    
    # fit models to data
    svm.fit(X_train, y_train)
    logreg.fit(X_train, y_train)
    linreg.fit(X_train, y_train)
    
    # predict class labels
    y_pred_svm = svm.predict(X_test)
    y_pred_logreg = logreg.predict(X_test)
    y_pred_linreg = linreg.predict(X_test)
    
    # apply a threshold (using mean value)
    thresh = round(np.mean(y_pred_linreg), 2)
    y_pred_linreg = np.where(y_pred_linreg > thresh, 1, 0)
    
    #compute P, R, and F1 for each classifier
    P_svm = precision_score(y_test, y_pred_svm, average = "macro")
    R_svm = recall_score(y_test, y_pred_svm, average = "macro")
    F1_svm = f1_score(y_test, y_pred_svm, average = "macro")
    
    P_logreg = precision_score(y_test, y_pred_logreg, average = "macro")
    R_logreg = recall_score(y_test, y_pred_logreg, average = "macro")
    F1_logreg = f1_score(y_test, y_pred_logreg, average = "macro")
  
    P_linreg = precision_score(y_test, y_pred_linreg, average = "macro")
    R_linreg = recall_score(y_test, y_pred_linreg, average = "macro")
    F1_linreg = f1_score(y_test, y_pred_linreg, average = "macro")    

    # add to matrix (to return)
    currFoldResult = [P_svm, R_svm, F1_svm]
    resultMatrix_svm[currRow] = currFoldResult  
    
    currFoldResult = [P_logreg, R_logreg, F1_logreg]
    resultMatrix_logreg[currRow] = currFoldResult
    
    currFoldResult = [P_linreg, R_linreg, F1_linreg]
    resultMatrix_linreg[currRow] = currFoldResult
    
    currRow = currRow + 1

TRAIN: [ 2199  2200  2201 ..., 22471 22472 22473] TEST: [   0    1    2 ..., 2408 2412 2419]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [2199 2200 2201 ..., 4898 4901 4903]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [4408 4409 4410 ..., 7403 7420 7428]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [6572 6573 6574 ..., 9533 9540 9542]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [ 8822  8823  8825 ..., 11637 11643 11651]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [11124 11126 11127 ..., 13717 13726 13728]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [13406 13408 13409 ..., 15915 15918 15925]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [15665 15666 15667 ..., 18116 18118 18124]
TRAIN: [    0     1     2 ..., 22471 22472 22473] TEST: [17950 17951 17953 ..., 20344 20349 20354]
TRAIN: [    0     1     2 ..., 20344 20349 20354] TEST: [20151 20152 20155 ..., 22471 22472 22473]


In [45]:
# calculate mean and add to resultMatrix
meanResult_svm = resultMatrix_svm[0:10,:].mean(0)
resultMatrix_svm[10] = meanResult_svm
resultMatrix_svm = resultMatrix_svm * 100

meanResult_logreg = resultMatrix_logreg[0:10,:].mean(0)
resultMatrix_logreg[10] = meanResult_logreg
resultMatrix_logreg = resultMatrix_logreg * 100

meanResult_linreg = resultMatrix_linreg[0:10,:].mean(0)
resultMatrix_linreg[10] = meanResult_linreg
resultMatrix_linreg = resultMatrix_linreg * 100

In [46]:
# save to csv file
np.savetxt("Stage1_Results/svmResults.csv", resultMatrix_svm, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

np.savetxt("Stage1_Results/logRegResults.csv", resultMatrix_logreg, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

np.savetxt("Stage1_Results/linRegResults.csv", resultMatrix_linreg, delimiter = ',', 
           header = 'P, R, F1', fmt = '%f', comments = '')

# Linear Regression Results

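- Linear regression outputs continuous values, so a threshold $T$ is needed to turn them into class labels. For each fold, $T$ was set to the mean of the test predictions (rounded to two decimals). Values less than or equal to $T$ are labeled as class 0. Values greater than $T$ are labeled as class 1.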
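
As a small illustration of this thresholding rule, the sketch below applies it to a handful of made-up regression outputs. The array `y_scores` is hypothetical and stands in for `linreg.predict(X_test)`; it is not taken from the actual data.

```python
import numpy as np

# Hypothetical continuous outputs from a linear regression fit on 0/1 labels.
y_scores = np.array([0.10, 0.35, 0.80, 0.55, -0.05, 0.95])

# Threshold T: mean of the predictions, rounded to two decimals.
T = round(float(np.mean(y_scores)), 2)

# Values strictly greater than T become class 1, everything else class 0.
y_labels = np.where(y_scores > T, 1, 0)

print("T =", T)
print("labels =", y_labels)
```

Because $T$ is the mean of the test predictions themselves, roughly half of each test fold tends to land on either side of it, regardless of the true class balance.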

In [47]:
print("\nAverage Precision: "+ str(resultMatrix_linreg[10,0])+
      "\nAverage Recall: " + str(resultMatrix_linreg[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_linreg[10,2]))


Average Precision: 74.5872978284
Average Recall: 84.7810218049
Average F1 score: 75.3998766757


# Logistic Regression Results

- Parameter C was set to 100.0. 
- This value was chosen by manually trying a few values; it gave the best results of those tried.
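
A more systematic alternative to trying values by hand is a cross-validated grid search. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `X_std` and `classLabel`, and the grid of `C` values, the `max_iter` setting, and the 5-fold split are assumptions rather than settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for (X_std, classLabel); the real data is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# Candidate values for the inverse regularization strength C (illustrative grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Score each candidate by macro-averaged F1 over stratified folds.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Note that if `C` is selected on the same folds that are used to report the final scores, the reported scores are likely to be somewhat optimistic.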

In [68]:
print("\nAverage Precision: "+ str(resultMatrix_logreg[10,0])+
      "\nAverage Recall: " + str(resultMatrix_logreg[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_logreg[10,2]))


Average Precision: 88.7815939947
Average Recall: 88.7061409214
Average F1 score: 88.7086899461


# SVM Results

- An RBF kernel was used for the SVM. In practice, this seems to be a good kernel to start with. The parameters were set to C = 10.0 and gamma = 0.1.
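
For reference, the RBF kernel measures similarity between two points as $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, so gamma controls how quickly similarity decays with distance, while C controls how strongly training errors are penalized. The helper below, `rbf_kernel_value`, is a hypothetical name written for illustration; it is a minimal sketch of the formula, not the code `SVC` runs internally.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma=0.1):
    """Return exp(-gamma * ||x - z||^2) for two feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a = [1.0, 2.0]
b = [2.0, 4.0]

# Identical points have similarity 1; similarity shrinks as points move apart.
print(rbf_kernel_value(a, a))
print(rbf_kernel_value(a, b))  # squared distance is 5, so this is exp(-0.5)
```

A larger gamma makes the kernel more local (each training point influences a smaller neighborhood), which is one reason scaling the features matters for this kernel.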

In [69]:
print("\nAverage Precision: "+ str(resultMatrix_svm[10,0])+
      "\nAverage Recall: " + str(resultMatrix_svm[10,1])+
      "\nAverage F1 score: "+ str(resultMatrix_svm[10,2]))


Average Precision: 90.5415481677
Average Recall: 90.0027569383
Average F1 score: 90.2495451337


# Notes:

- I created a folder called "Stage1_Results" and placed the result matrices there.
    - The first ten rows are the per-fold P, R, and F1 scores. The last row is the average of each of P, R, and F1 over the ten folds.
- SVM outperformed logistic regression, which in turn outperformed linear regression.
- Given that recall is already high, we should focus on increasing precision.
- With precision hovering around 90%, most instances that the classifier predicts as names are indeed names.
- However, there are times when the classifier predicts false positives,
    - i.e., it predicts that an instance is a name when in reality it is not.
- So how do we increase P?
    - One way would be to set a higher threshold on our predictions. This makes the classifier more conservative, at the risk of lowering recall (a trade-off we could accept, given that recall is already high).
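
The threshold trade-off can be illustrated with a toy example. The labels and scores below are made up; in this notebook, comparable scores could come from `svm.decision_function(X_test)` or `logreg.predict_proba(X_test)[:, 1]`. The helper `precision_recall` is hypothetical and computes precision and recall for the positive class only, not the macro average used in the tables above.

```python
import numpy as np

# Hypothetical true labels and classifier scores (higher = more likely a name).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10])

def precision_recall(y_true, scores, threshold):
    """Positive-class precision and recall when predicting 1 for scores > threshold."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Raising the threshold removes the false positive but also drops a true positive.
for t in (0.5, 0.75):
    print(t, precision_recall(y_true, scores, t))
```

In this toy data, moving the threshold from 0.5 to 0.75 raises precision from 0.75 to 1.0 while recall falls from 0.75 to 0.5; on real data the size of the trade-off would need to be measured on held-out folds.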