# Logistic Regression & Gradient Descent

## Pre-processing Steps:

Build out necessary functions below.

In [1]:
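import numpy as np

# Gradient of the average logistic loss (negative log-likelihood) with respect to w
def grad(w, x, y):
    gradient = np.zeros(len(w))
    for i in range(len(y)):
        dot_prod = np.dot(w.T, x.iloc[i, :])
        # (sigmoid(w . x_i) - y_i) * x_i, averaged over all observations
        gradient += (1/len(y)) * (1 / (1 + np.exp(-dot_prod)) - y[i]) * x.iloc[i, :]
    return gradient

# Likelihood function (placeholder: not implemented in this notebook)
def fval(w, x, y):
    return None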

# Gradient norm function
def gradnorm(w, x, y):
    norm = np.linalg.norm(grad(w, x, y))
    return norm
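
The likelihood function above is left as a stub. As a minimal sketch, assuming the objective is the average negative log-likelihood whose gradient `grad` computes, it could look like the following. The names `fval_sketch` and `grad_vectorized` are hypothetical, and the synthetic data is for illustration only; it is not the notebook's data.

```python
import numpy as np

def fval_sketch(w, X, y):
    """Average negative log-likelihood for logistic regression (assumed objective)."""
    z = X @ w
    # log(1 + exp(z)) evaluated stably via logaddexp(0, z)
    return np.mean(np.logaddexp(0.0, z) - y * z)

def grad_vectorized(w, X, y):
    """Vectorized form of the same gradient: X^T (sigmoid(Xw) - y) / n."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

# Small synthetic data set (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = np.zeros(3)

loss_at_zero = fval_sketch(w_demo, X_demo, y_demo)  # log(2) when w = 0
g_demo = grad_vectorized(w_demo, X_demo, y_demo)
```

With pandas inputs, passing `x.to_numpy()` and `y.to_numpy()` would keep the arithmetic in plain NumPy arrays and avoid the per-row Python loop.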

## Question 1.

First, let's load the necessary libraries and also our training/test sets into dataframes.

In [149]:
# Import libraries
import pandas as pd
import numpy as np
import csv

# Load the training and test datasets
df_train = pd.read_csv('LRTrain.csv')
df_test = pd.read_csv('LRTest.csv')

x_train = df_train.iloc[:, :30]
y_train = df_train.iloc[:, 30]

x_test = df_test.iloc[:, :30]
y_test = df_test.iloc[:, 30]

Next, let's initialize the weight vector to zeros and evaluate our gradient function at that starting point to check that it runs.

In [3]:
w = np.zeros(df_train.shape[1] - 1)
grad(w, x_train, y_train)

radius_mean                 0.458658
texture_mean                1.691417
perimeter_mean              2.375267
area_mean                 -40.365500
smoothness_mean             0.008910
compactness_mean           -0.002134
concavity_mean             -0.015543
concave points_mean        -0.008387
symmetry_mean               0.017676
fractal_dimension_mean      0.007700
radius_se                  -0.025793
texture_se                  0.173138
perimeter_se               -0.193368
area_se                    -6.799432
smoothness_se               0.000921
compactness_se              0.000944
concavity_se                0.000370
concave points_se           0.000161
symmetry_se                 0.002530
fractal_dimension_se        0.000454
radius_worst                0.140833
texture_worst               2.067767
perimeter_worst             0.150683
area_worst                -94.503167
smoothness_worst            0.011041
compactness_worst          -0.011923
concavity_worst            -0.031749
... (output truncated)
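
Seeing output from the gradient function shows that it runs, but not that the values are right. One common check is to compare the analytic gradient with a central finite-difference approximation of the loss. The sketch below assumes the loss is the average negative log-likelihood; `loss`, `gradient`, and `finite_difference_grad` are hypothetical names, and the data is synthetic.

```python
import numpy as np

def loss(w, X, y):
    # Average negative log-likelihood (assumed objective)
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def gradient(w, X, y):
    # Analytic gradient: X^T (sigmoid(Xw) - y) / n
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def finite_difference_grad(f, w, eps=1e-6):
    # Central differences, one coordinate at a time
    g = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = eps
        g[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 4))
y_demo = (rng.random(40) < 0.5).astype(float)
w_demo = rng.normal(size=4)

analytic = gradient(w_demo, X_demo, y_demo)
numeric = finite_difference_grad(lambda w: loss(w, X_demo, y_demo), w_demo)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

A small `max_abs_diff` suggests that the gradient and the loss agree with each other.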

Now that we see that our gradient function runs and returns one value per feature, let's experiment with a for loop that uses a fixed step size and number of iterations:

In [56]:
step_size = 0.00001
iter_t = 2000

w = np.zeros(df_train.shape[1] - 1)

for i in range(iter_t):
    w = w - step_size * grad(w, x_train, y_train)

print(w)

radius_mean               -0.006887
texture_mean              -0.012517
perimeter_mean            -0.039848
area_mean                 -0.017586
smoothness_mean           -0.000067
compactness_mean           0.000029
concavity_mean             0.000125
concave points_mean        0.000055
symmetry_mean             -0.000136
fractal_dimension_mean    -0.000055
radius_se                 -0.000083
texture_se                -0.001038
perimeter_se              -0.000134
area_se                    0.008291
smoothness_se             -0.000004
compactness_se             0.000006
concavity_se               0.000013
concave points_se          0.000003
symmetry_se               -0.000013
fractal_dimension_se      -0.000002
radius_worst              -0.007117
texture_worst             -0.016294
perimeter_worst           -0.038313
area_worst                 0.023714
smoothness_worst          -0.000085
compactness_worst          0.000138
concavity_worst            0.000262
concave points_worst       0

Next, let's experiment with a while loop that stops once the norm of the gradient falls to an epsilon value of 1 or below:

In [58]:
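w_1 = np.zeros(df_train.shape[1] - 1)
g = grad(w_1, x_train, y_train)

# Keep stepping until the gradient norm drops to epsilon = 1 or below
while np.linalg.norm(g) > 1:
    w_1 = w_1 - step_size * g
    # Reuse this gradient for both the stopping test and the next step
    g = grad(w_1, x_train, y_train)

print(w_1)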

radius_mean               -0.007451
texture_mean              -0.013128
perimeter_mean            -0.042786
area_mean                 -0.017743
smoothness_mean           -0.000070
compactness_mean           0.000041
concavity_mean             0.000148
concave points_mean        0.000064
symmetry_mean             -0.000144
fractal_dimension_mean    -0.000059
radius_se                 -0.000088
texture_se                -0.001095
perimeter_se              -0.000081
area_se                    0.009141
smoothness_se             -0.000004
compactness_se             0.000010
concavity_se               0.000018
concave points_se          0.000004
symmetry_se               -0.000013
fractal_dimension_se      -0.000002
radius_worst              -0.007712
texture_worst             -0.017034
perimeter_worst           -0.040814
area_worst                 0.024539
smoothness_worst          -0.000088
compactness_worst          0.000181
concavity_worst            0.000319
concave points_worst       0
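
The two experiments above can be combined into one routine that stops on the gradient norm but also caps the number of iterations, so the loop cannot run indefinitely if the tolerance is never reached. This is a minimal sketch on synthetic data; `gradient_descent`, `tol`, and `max_iter` are hypothetical names, not part of the notebook.

```python
import numpy as np

def gradient(w, X, y):
    # Gradient of the average logistic loss: X^T (sigmoid(Xw) - y) / n
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def gradient_descent(X, y, step_size, tol, max_iter=100_000):
    """Fixed-step gradient descent; stops when ||grad|| <= tol or after max_iter steps."""
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        g = gradient(w, X, y)
        if np.linalg.norm(g) <= tol:
            return w, it          # converged after `it` updates
        w = w - step_size * g
    return w, max_iter            # iteration cap reached

# Synthetic, roughly standardized data (illustrative only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)

w_hat, n_iter = gradient_descent(X_demo, y_demo, step_size=0.5, tol=1e-2)
final_norm = np.linalg.norm(gradient(w_hat, X_demo, y_demo))
```

The synthetic features here have unit scale, which is why a much larger step size than the notebook's `0.00001` is usable. The notebook's features are unscaled, so its step size has to stay small.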

## Question 2.

Build functions for the true positive rate (TPR) and false positive rate (FPR) below. Once we have these, the true negative rate (TNR) and false negative rate (FNR) follow directly as TNR = 1 - FPR and FNR = 1 - TPR.

In [14]:
# True positive rate
def TPR(true_p, total_p):
    return true_p/total_p

# False positive rate
def FPR(false_p, total_n):
    return false_p/total_n
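
As a short worked example of how the four rates relate, the sketch below uses made-up counts (not results from this notebook). The lowercase function names are hypothetical stand-ins for `TPR` and `FPR`.

```python
def tpr(true_p, total_p):
    # Share of actual positives that were predicted positive
    return true_p / total_p

def fpr(false_p, total_n):
    # Share of actual negatives that were predicted positive
    return false_p / total_n

# Illustrative counts only: 50 actual positives, 100 actual negatives
true_p, total_p = 45, 50
false_p, total_n = 10, 100

tpr_val = tpr(true_p, total_p)
fpr_val = fpr(false_p, total_n)
tnr_val = 1 - fpr_val   # true negatives / actual negatives
fnr_val = 1 - tpr_val   # false negatives / actual positives
```

Each pair sums to one because every actual positive is either a true positive or a false negative, and every actual negative is either a false positive or a true negative.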

First, build an empty dataframe in which to report the performance of our classifier at each threshold.

In [165]:
df_t = pd.DataFrame(columns = ['TPR', 'FPR', 'TNR', 'FNR'], 
                   index = ['0.0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9', '1.0'])

Obtain actual number of positives and number of negatives from our test set below.

In [65]:
# Actual number of positives
total_p_test = y_test[y_test == 1].count()

# Actual number of negatives
total_n_test = y_test[y_test == 0].count()

print("Actual number of positives:", total_p_test)
print("Actual number of negatives:", total_n_test)

Actual number of positives: 98
Actual number of negatives: 171


After comparing a fixed step size and iteration count against a termination criterion based on the norm of the gradient, let's use the weights `w_1` obtained from the while loop (the run with the termination criterion) to calculate our performance metrics at the different thresholds.

In [120]:
t_values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
performance_t = np.zeros((len(y_test), len(t_values)))

for t in range(len(t_values)):
    for i in range(len(y_test)):
        pred = 1/(1 + np.exp(np.dot(-w_1.T,x_test.iloc[i,:])))
        if pred > t_values[t]:
            performance_t[i,t] += 1
        else:
            continue
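
The nested loop above recomputes each predicted probability once per threshold. As a sketch of an alternative, the probabilities can be computed once and compared against all thresholds with NumPy broadcasting. `predict_at_thresholds` is a hypothetical helper, and the inputs are tiny made-up arrays.

```python
import numpy as np

def predict_at_thresholds(w, X, thresholds):
    """Return an (n_samples, n_thresholds) 0/1 matrix: 1 where sigmoid(x . w) > t."""
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    # (n, 1) compared with (m,) broadcasts to (n, m)
    return (probs[:, None] > np.asarray(thresholds)).astype(float)

# Tiny illustrative inputs: probabilities are sigmoid(3), sigmoid(0), sigmoid(-3)
X_demo = np.array([[1.0, 2.0], [0.0, 0.0], [-1.0, -2.0]])
w_demo = np.array([1.0, 1.0])
t_values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

preds = predict_at_thresholds(w_demo, X_demo, t_values)
```

With the notebook's pandas objects, `np.asarray(w_1)` and `x_test.to_numpy()` would be the inputs. The comparison is strict, matching the `pred > t_values[t]` test in the loop, so every sample is predicted positive at `t = 0.0` and none at `t = 1.0`.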

Let's create our dataframe comparing our actual `y_test` values against the values we achieved at our different thresholds.

In [152]:
# Create separate arrays of y values predicted for each threshold
t_0 = performance_t[:,0]
t_1 = performance_t[:,1]
t_2 = performance_t[:,2]
t_3 = performance_t[:,3]
t_4 = performance_t[:,4]
t_5 = performance_t[:,5]
t_6 = performance_t[:,6]
t_7 = performance_t[:,7]
t_8 = performance_t[:,8]
t_9 = performance_t[:,9]
t_10 = performance_t[:,10]

# Add each array above to a dataframe where we can compare predicted y values against actual y_test values
y_test_comp = y_test.to_frame()
y_test_comp['t=0'] = t_0
y_test_comp['t=0.1'] = t_1
y_test_comp['t=0.2'] = t_2
y_test_comp['t=0.3'] = t_3
y_test_comp['t=0.4'] = t_4
y_test_comp['t=0.5'] = t_5
y_test_comp['t=0.6'] = t_6
y_test_comp['t=0.7'] = t_7
y_test_comp['t=0.8'] = t_8
y_test_comp['t=0.9'] = t_9
y_test_comp['t=1.0'] = t_10

Now, let's iterate through each column and compare our predicted values for each threshold with the actual `y_test` diagnosis values. Then, we will obtain numbers for our true positives and false positives at each threshold value.

In [220]:
# Initialize variables for number of true positives and false positives for each threshold below
true_pos = np.zeros(11)
false_pos = np.zeros(11)

# Iterate over every row for each of the 11 threshold values (including t=1.0)
for j in range(11):
    for i in range(len(y_test_comp)):
        actual = y_test_comp.iloc[i, 0]         # 'diagnosis' column
        predicted = y_test_comp.iloc[i, j + 1]  # prediction at threshold j
        if actual == 1 and predicted == 1:
            true_pos[j] += 1
        elif actual == 0 and predicted == 1:
            false_pos[j] += 1
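
The same counts can be obtained without explicit loops. The sketch below assumes the actual labels are a 0/1 vector and the predictions are a 0/1 matrix with one column per threshold. `count_tp_fp` is a hypothetical helper, and the arrays are made up for illustration.

```python
import numpy as np

def count_tp_fp(y_true, preds):
    """Count true and false positives per threshold column of a 0/1 prediction matrix."""
    y_true = np.asarray(y_true)
    preds = np.asarray(preds)
    is_pos = (y_true == 1)[:, None]   # actual positives, shaped (n, 1)
    is_neg = (y_true == 0)[:, None]   # actual negatives, shaped (n, 1)
    true_pos = np.sum(is_pos & (preds == 1), axis=0)
    false_pos = np.sum(is_neg & (preds == 1), axis=0)
    return true_pos, false_pos

# Illustrative data: 5 samples, 3 thresholds
y_demo = np.array([1, 1, 0, 0, 1])
preds_demo = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])

tp_demo, fp_demo = count_tp_fp(y_demo, preds_demo)
```

Applied to the notebook's data, `y_test.to_numpy()` and `performance_t` would be the inputs, giving one count per entry of `t_values`.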

Finally, let's use the number of true positives and false positives we found to calculate TPR, FPR, TNR, and FNR. Then, we can store these calculated metrics in the `df_t` dataframe we created before.

In [245]:
# TPR
for i in range(11):
    df_t.iloc[[i],[0]] = TPR(true_pos[i], total_p_test)

# FPR
for i in range(11):
    df_t.iloc[[i],[1]] = FPR(false_pos[i], total_n_test)

# TNR
for i in range(11):
    df_t.iloc[[i],[2]] = 1 - df_t.iloc[[i],[1]]

# FNR    
for i in range(11):
    df_t.iloc[[i],[3]] = 1 - df_t.iloc[[i],[0]]

Below, we can see the performance of our classifier.

In [246]:
df_t

| Threshold | TPR | FPR | TNR | FNR |
|---|---|---|---|---|
| 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 0.1 | 0.979592 | 0.538012 | 0.461988 | 0.020408 |
| 0.2 | 0.959184 | 0.192982 | 0.807018 | 0.040816 |
| 0.3 | 0.908163 | 0.105263 | 0.894737 | 0.091837 |
| 0.4 | 0.887755 | 0.05848 | 0.94152 | 0.112245 |
| 0.5 | 0.857143 | 0.040936 | 0.959064 | 0.142857 |
| 0.6 | 0.846939 | 0.02924 | 0.97076 | 0.153061 |
| 0.7 | 0.785714 | 0.011696 | 0.988304 | 0.214286 |
| 0.8 | 0.734694 | 0.005848 | 0.994152 | 0.265306 |
| 0.9 | 0.714286 | 0.005848 | 0.994152 | 0.285714 |
