# Homework 5: Logistic Regression and Support Vector Machines

by Natalia Frumkin and Karanraj Chauhan with help from B. Kulis, R. Manzelli, and A. Tsiligkardis

## Problem 1: SVM Toy Example

Given the following two-class data set:

**Class -1:**
A = (1,1)
B = (2,3)

**Class +1:**
C = (2,5)
D = (4,2)

<ol type="a">
  <li>Plot the data.</li>
  <li>Plot the hyperplane described by $w = (3,2)^T$, $b = -12$.</li>
  <li>Calculate the $l_2$ distance of data point C from the hyperplane.</li>
  <li>Determine if the hyperplane linearly separates the data. Explain.</li>
  <li>Calculate the hard margin SVM hyperplane in canonical form.</li>
  <li>Which, if any, data points lie on the SVM hyperplane?</li>
</ol>
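
Parts (c) and (d) rest on two quantities: the distance $|w^Tx + b| / ||w||_2$ from a point to the hyperplane, and the sign of $y(w^Tx + b)$ for each labeled point. The following is a minimal NumPy sketch for checking hand calculations. The helper names `signed_score` and `distance_to_hyperplane` are illustrative and not part of the assignment.

```python
import numpy as np

# Data set and hyperplane taken from the problem statement
points = {"A": (1, 1), "B": (2, 3), "C": (2, 5), "D": (4, 2)}
labels = {"A": -1, "B": -1, "C": +1, "D": +1}
w = np.array([3.0, 2.0])
b = -12.0

def signed_score(x, w, b):
    """Value of w^T x + b; its sign tells which side of the hyperplane x is on."""
    return float(np.dot(w, x) + b)

def distance_to_hyperplane(x, w, b):
    """l2 distance |w^T x + b| / ||w||_2 from x to the hyperplane."""
    return abs(signed_score(x, w, b)) / float(np.linalg.norm(w))

for name, x in points.items():
    x = np.array(x, dtype=float)
    score = signed_score(x, w, b)
    # y * (w^T x + b) > 0 means the point lies strictly on its own class's side
    print(name, score, distance_to_hyperplane(x, w, b), labels[name] * score > 0)
```

A point with score exactly 0 lies on the hyperplane itself, so it is not strictly on either side.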

## Problem 2: Logistic Regression

<p>In this problem, we will use a logistic regression model to classify emails as "spam" (1) or "non-spam" (0). Recall that the hypothesis/decision rule in a logistic regression model is given by</p>

$$h_\theta(x) = \sigma(\theta^Tx) \\ \text{where } \sigma(z) = \frac{1}{1 + e^{-z}} \text{ is the sigmoid function}$$

<p>Since logistic regression does not have a closed-form solution, we will use gradient descent to obtain the parameters $\theta$. We will use the negative log likelihood with L2 regularization as the loss function. Mathematically, the loss function $l(\theta)$ for a given set of parameters $\theta$ is</p>

$$l(\theta) = NLL(\theta) + \frac{\lambda}{2}||\theta||^2 \\ \text{where } NLL(\theta) = -\sum_{i=1}^{n} \left[ y_i\log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i)) \right]$$
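
As a rough illustration, the regularized loss above can be evaluated in a few lines of NumPy. This is a sketch under assumptions: the function name `regularized_nll`, the `eps` clipping (added only to avoid `log(0)`), and the tiny data set are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_nll(theta, X, y, lam, eps=1e-12):
    """NLL(theta) + (lam / 2) * ||theta||^2 for labels y in {0, 1}."""
    h = sigmoid(X @ theta)
    # Clipping is an added numerical safeguard, not part of the formula
    h = np.clip(h, eps, 1.0 - eps)
    nll = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return nll + 0.5 * lam * np.dot(theta, theta)

# Tiny made-up example: 3 samples, 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1.0, 0.0, 0.0])
theta = np.zeros(2)

# At theta = 0 every prediction is 0.5, so the loss equals 3 * log(2)
print(regularized_nll(theta, X, y, lam=10.0))
```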

<p>The good news is that you won't have to work with these equations directly to implement gradient descent (hurray!). What you will need is the gradient of the loss function. For a given $n \times d$ data matrix $X$, $n \times 1$ vector of labels (0/1) $y$, and corresponding $n \times 1$ vector of predictions $\hat{y}$, the loss function gradient is</p>

$$\nabla l(\theta) = (\hat{y} - y)^{T} \cdot X + \lambda \cdot \theta^{T}$$
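
One way to gain confidence in a gradient expression like the one above is to compare it against central finite differences of the loss. The sketch below does this on random synthetic data. The names `loss` and `grad`, the data, and the step size `h` are assumptions made for the illustration, and $\theta$ is stored as a 1-D array.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y, lam):
    """L2-regularized negative log likelihood."""
    p = sigmoid(X @ theta)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * np.dot(theta, theta)

def grad(theta, X, y, lam):
    """Analytic gradient: (y_hat - y)^T X + lam * theta."""
    y_hat = sigmoid(X @ theta)
    return (y_hat - y) @ X + lam * theta

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)
lam = 10.0

# Central finite differences along each coordinate direction
h = 1e-6
numeric = np.array([
    (loss(theta + h * e, X, y, lam) - loss(theta - h * e, X, y, lam)) / (2 * h)
    for e in np.eye(3)
])
max_diff = np.max(np.abs(numeric - grad(theta, X, y, lam)))
print(max_diff)  # should be tiny if the analytic gradient is right
```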

<ol type="a">
    <li>Load the dataset file spambase_data.csv using pandas. The last column holds the true labels, i.e. the $y$ vector (1 means spam, 0 means not spam), and the remaining columns form the feature matrix $X$. Split the dataset into a train set and a test set. Note: a train/test ratio of 0.8/0.2 has been known to work, but you are welcome to try other values.</li>
    <li>Using the loss gradient equation above, implement gradient descent (using only the train set) to find the parameters $\theta$ of the logistic regression model. Note: a learning rate of $0.00001$, $\lambda = 10$, and $3000$ steps have been known to give decent accuracy, but you are welcome to try other values, especially for the number of steps.</li>
    <li>Report the correct classification rate (CCR) of the model on train data and test data. The CCR is defined as $$CCR = \frac{num\_correct\_predictions}{num\_samples}$$</li>   
</ol>
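
The CCR in part (c) can be computed without explicit loops. The sketch below assumes predicted probabilities are turned into 0/1 labels with a 0.5 threshold. The problem statement does not fix that threshold, and the helper name `ccr` and the example arrays are made up.

```python
import numpy as np

def ccr(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Made-up labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Thresholded predictions are [1, 0, 0, 1, 1], so 3 of 5 match
print(ccr(y_true, y_prob))
```

Both arrays should be 1-D with the same length. Comparing an `(n, 1)` array against an `(n,)` array would broadcast to an `(n, n)` matrix and give a misleading rate.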

In [156]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [157]:
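# read in raw dataset
spambase_df = pd.read_csv("spambase_data.csv")
print(spambase_df.head())

# labels are the last column ('57'); the other 57 columns are features
Ydata = spambase_df['57'].values

# split into train and test sets (80/20)
# note: x_train/x_test still contain the label column here; it is dropped before fitting
x_train, x_test, y_train, ytest = train_test_split(spambase_df, Ydata, test_size=0.20)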


      0     1     2    3     4     5     6     7     8     9  ...    48  \
0  0.00  0.64  0.64  0.0  0.32  0.00  0.00  0.00  0.00  0.00  ...  0.00   
1  0.21  0.28  0.50  0.0  0.14  0.28  0.21  0.07  0.00  0.94  ...  0.00   
2  0.06  0.00  0.71  0.0  1.23  0.19  0.19  0.12  0.64  0.25  ...  0.01   
3  0.00  0.00  0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.00   
4  0.00  0.00  0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.00   

      49   50     51     52     53     54   55    56  57  
0  0.000  0.0  0.778  0.000  0.000  3.756   61   278   1  
1  0.132  0.0  0.372  0.180  0.048  5.114  101  1028   1  
2  0.143  0.0  0.276  0.184  0.010  9.821  485  2259   1  
3  0.137  0.0  0.137  0.000  0.000  3.537   40   191   1  
4  0.135  0.0  0.135  0.000  0.000  3.537   40   191   1  

[5 rows x 58 columns]
<class 'numpy.ndarray'>


In [158]:
# fit logistic regression model
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# drop the label column and append a column of ones for the bias term
new_xtrain = x_train.drop('57', axis=1)
nrow = new_xtrain.shape[0]
onesX = np.ones((nrow, 1))
AugX = np.concatenate((new_xtrain, onesX), axis=1)

# hyperparameters suggested in the problem statement
lam = 10
rate = 0.00001
steps = 3000

# random initial weights: one per feature plus one for the bias
w_init = np.random.rand(1, AugX.shape[1])
y_train = y_train.reshape(nrow, 1)

# batch gradient descent on the L2-regularized negative log likelihood
for i in range(steps):
    y_hat = sigmoid(w_init @ AugX.T).reshape(nrow, 1)
    # gradient: (y_hat - y)^T X + lambda * theta
    grad = (y_hat - y_train).T @ AugX + lam * w_init
    w_init = w_init - rate * grad

print(w_init)

  


(1, 58)
(3680, 1)
[[-1.03990078e-01 -2.47811816e+00 -8.31885028e-01  1.50769806e+00
   8.28064642e-01  8.59121430e-01  2.85626474e+00  1.71193195e+00
   7.41649348e-01 -5.38413544e-01  1.06018054e+00 -8.96557038e+00
   1.85817512e-01 -1.14723436e-01  9.41451094e-01  3.77014228e+00
   1.43616104e+00  1.11531519e+00 -4.49359444e+00  2.03053100e+00
   1.86441620e+00  1.45167363e+00  2.74039695e+00  2.27381987e+00
  -1.95243835e+01 -9.81400586e+00 -8.35393765e+00 -4.42570521e+00
  -2.99419513e+00 -3.34718255e+00 -1.55426064e+00 -8.48108751e-01
  -3.96729890e+00 -1.19817293e+00 -3.53200027e+00 -2.27777785e+00
  -5.06975368e+00  1.50771172e-01 -3.13639735e+00 -9.44926430e-01
  -1.96897706e+00 -4.13302756e+00 -2.02354295e+00 -2.24169005e+00
  -6.45804702e+00 -7.13000841e+00 -1.48149618e-01 -8.92564646e-01
  -1.11464367e+00 -2.79308902e+00 -3.02299854e-03  2.97457095e+00
   2.14672264e+00  4.84800495e-01 -8.75288451e+00  5.58744475e+00
  -1.37419718e+00 -1.53608672e+01]]


In [159]:
# predict on train data and test data and calculate CCR
# threshold the predicted probabilities at 0.5 to get 0/1 labels
y_pred = (sigmoid(w_init @ AugX.T).ravel() >= 0.5).astype(int)
print("CCR for train: ", np.mean(y_pred == y_train.ravel()))

# Test: build the augmented test matrix the same way as for training
new_xtest = x_test.drop('57', axis=1)
onesX2 = np.ones((new_xtest.shape[0], 1))
AugX2 = np.concatenate((new_xtest, onesX2), axis=1)

y_pred2 = (sigmoid(w_init @ AugX2.T).ravel() >= 0.5).astype(int)
print("CCR for test: ", np.mean(y_pred2 == ytest.ravel()))

CCR for train:  0.6701086956521739
CCR for test:  0.6666666666666666


  
