# Logistic Regression

Prepared by: Benjamin Ricard, QBS108 TA | Last update: 12/3/2018

### Learning Objectives

By the end of this tutorial, students should be able to:

* Understand the theoretical and mathematical basis for logistic regression

### General Procedure

* Simulate data that can be classified according to logistic regression
* Analyze coefficients and model output to understand model performance

# Background

 
Logistic regression departs from classical linear regression by incorporating a logit link function. A link function connects the output of a linear model to a new scale; here, it maps an unbounded linear function to a sigmoid whose values lie between 0 and 1.

 
Starting from the linear predictor of regression, the log-odds or logit link function (the log of the probability of class $i$ divided by the probability of not class $i$) can be defined via:
$$
\beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n = \operatorname{logit}(p_i(x)) = \ln\left(\frac{p_i(x)}{1-p_i(x)}\right)
$$
For a basic two-class classification, we can solve this equation for the probability:

$$
p_i = \frac{e^{\beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{e^{\beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n}+1} = \frac{1}{1+e^{-(\beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
$$

Accordingly, $1-p_i$ is the probability of *not* being class $i$. Plotting an equation of the form $y = \frac{1}{1+e^{-x}}$ shows a sigmoid whose values lie between 0 and 1, so the output of logistic regression can be interpreted by predicting a class for each output $y_i$ from its probability. Usually the probability is simply rounded, which amounts to a threshold of 0.5. When training your own logistic regression model, take care to keep the predicted probabilities separate from the predicted classes: minimizing an error computed on probabilities and minimizing one computed on thresholded classes can lead to very different results!
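To make the relationship between the linear score, the probability, and the predicted class concrete, here is a minimal sketch using only the standard library. The helper names `sigmoid`, `logit`, and `predict_class` are illustrative, not part of any library, and the 0.5 threshold is the rounding rule described above.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

def predict_class(p, threshold=0.5):
    """Turn a probability into a 0/1 class label by thresholding."""
    return 1 if p >= threshold else 0

# Walk a few linear scores through score -> probability -> log-odds -> class.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}  class={predict_class(p)}")
```

Applying `logit` to the output of `sigmoid` recovers the original score (up to floating-point error), which is the sense in which the logit links the linear predictor to the probability.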

# Simulations

Simulate 6 informative features and 94 noise features: 3 informative features are random numbers from [0,1], 3 informative features are random numbers from [-1,0], and the 94 noise features are random numbers from [-1,1]. The *true* output is generated from the sum of the 6 informative features: 1 if the sum is >= 0, and 0 otherwise.

In [82]:
import random
import numpy
from sklearn.linear_model import LogisticRegression


emp=[]
for i in range(0,1001):  # 1001 simulated samples
    temp=[]
    # 3 informative features drawn from [0, 1]
    temp.append(random.uniform(0, 1))
    temp.append(random.uniform(0, 1))
    temp.append(random.uniform(0, 1))
    # 3 informative features drawn from [-1, 0]
    temp.append(random.uniform(-1, 0))
    temp.append(random.uniform(-1, 0))
    temp.append(random.uniform(-1, 0))
    # 94 noise features drawn from [-1, 1]
    for j in range(0,100-6):
        temp.append(random.uniform(-1, 1))
    emp.append(temp)

In [73]:
stack=numpy.vstack(emp)

## Separate training/Test data
Generate the output labels from the informative features.

In [74]:
# Rows 0-499 are used for training and rows 501-1000 for testing (row 500 is left out)
training=stack[0:500]
testing=stack[501:1001]


In [75]:
# Sum the 6 informative features of every sample
output=[]
for i in range(0,len(emp)):
    output.append(sum(emp[i][0:6]))
# Label is 1 if the sum is >= 0, and 0 otherwise
binary=[]
for i in output:
    if i>=0:
        binary.append(1)
    else:
        binary.append(0)

In [76]:
X_train=training
X_test=testing
Y_train=binary[0:500]
Y_test = binary[501:1001]

You can solve for the maximum-likelihood estimate (MLE) of the logit equation above yourself (and incorporate regularization), but you can also just use scikit-learn.

In [77]:
clf = LogisticRegression()
clf.fit(X_train,Y_train)
print('Mean Testing Accuracy: ', clf.score(X_test,Y_test))


Mean Testing Accuracy:  0.898
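For comparison with the scikit-learn fit above, the hand-solved MLE can be sketched as plain gradient ascent on the mean log-likelihood. This is a minimal illustration under stated assumptions, not what scikit-learn does internally: the function `fit_logistic`, the learning rate, the iteration count, and the small three-feature toy dataset are all hypothetical choices, and no regularization is applied (whereas `LogisticRegression()` uses an L2 penalty by default).

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=1.0, n_iter=500):
    """Gradient ascent on the mean log-likelihood; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # The gradient of the log-likelihood is (label - probability) * feature.
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b += lr * grad_b / n
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

# Hypothetical toy data: two informative features and one noise feature.
rng = random.Random(0)
X = [[rng.uniform(0, 1), rng.uniform(-1, 0), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if row[0] + row[1] >= 0 else 0 for row in X]

w, b = fit_logistic(X, y)
preds = [1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) >= 0.5 else 0 for row in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print("weights:", [round(wj, 2) for wj in w], "intercept:", round(b, 2))
print("training accuracy:", accuracy)
```

The toy data mirrors the simulation at a smaller scale, so the two informative weights should come out noticeably larger in magnitude than the noise weight.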


Recall that only the first 6 features were made informative, so we expect a good model to find them by weighting these variables more strongly. We see this is the case: the first 6 coefficients are all above 3, while the remaining ones shown stay below 0.35 in magnitude.

In [78]:
print("First 15 Coefficients: ",clf.coef_[0][0:15])

First 15 Coefficients:  [ 4.23691694  3.46056115  3.36118641  3.97574661  3.5723408   3.8072048
 -0.3320955   0.08455914  0.10260776 -0.32276413  0.02081938 -0.02323587
 -0.0667668   0.03450831  0.16773663]
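One way to check this systematically is to rank the features by the absolute value of their coefficients. The sketch below hard-codes the 15 coefficients printed above rather than reading them from `clf`, so it runs on its own; the variable names are illustrative.

```python
# Coefficients copied from the printed output above (first 15 of 100).
coef = [4.23691694, 3.46056115, 3.36118641, 3.97574661, 3.5723408, 3.8072048,
        -0.3320955, 0.08455914, 0.10260776, -0.32276413, 0.02081938, -0.02323587,
        -0.0667668, 0.03450831, 0.16773663]

# Sort feature indices by absolute coefficient, largest first.
ranked = sorted(range(len(coef)), key=lambda j: abs(coef[j]), reverse=True)
top6 = sorted(ranked[:6])
print("Six largest |coefficients| at indices:", top6)

# Compare the weakest informative coefficient with the strongest noise coefficient shown.
smallest_informative = min(abs(c) for c in coef[:6])
largest_noise = max(abs(c) for c in coef[6:])
print("Separation ratio:", round(smallest_informative / largest_noise, 1))
```

With a fitted model in memory, the same ranking could be applied to all 100 entries of `clf.coef_[0]`.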


In [79]:
print("First 10 Testing Predictions: ")
print("[Prob = 0", "Prob = 1]")  # predict_proba columns follow clf.classes_, i.e. [0, 1]
print(clf.predict_proba(X_test)[0:10])

First 10 Testing Predictions: 
[Prob = 0 Prob = 1]
[[0.02172683 0.97827317]
 [0.98085023 0.01914977]
 [0.00691872 0.99308128]
 [0.21911226 0.78088774]
 [0.01106238 0.98893762]
 [0.88934149 0.11065851]
 [0.05136115 0.94863885]
 [0.96333054 0.03666946]
 [0.25407114 0.74592886]
 [0.46572383 0.53427617]]


In [80]:
print("True Testing Labels: ",Y_test[0:10])

True Testing Labels:  [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]


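Our logistic regression model works well, as evidenced by the predicted vs true outcome. The columns of `predict_proba` follow `clf.classes_` (class 0 first), so taking the more probable class matches the true label for nine of the ten samples shown; the one miss is the last sample, whose probabilities are close to 0.5.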
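The comparison can be made explicit by thresholding the ten printed probabilities and scoring them against the true labels. The sketch below copies the values from the outputs above and assumes the columns are ordered as [P(class 0), P(class 1)], which is the `clf.classes_` order. It also computes the mean log loss, which is evaluated on the probabilities rather than on the thresholded classes.

```python
import math

# Values copied from the printed outputs above; columns assumed to be [P(class 0), P(class 1)].
proba = [[0.02172683, 0.97827317],
         [0.98085023, 0.01914977],
         [0.00691872, 0.99308128],
         [0.21911226, 0.78088774],
         [0.01106238, 0.98893762],
         [0.88934149, 0.11065851],
         [0.05136115, 0.94863885],
         [0.96333054, 0.03666946],
         [0.25407114, 0.74592886],
         [0.46572383, 0.53427617]]
y_true = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]

# Predicted class: 1 when P(class 1) is at least 0.5, else 0.
y_pred = [1 if p1 >= 0.5 else 0 for _, p1 in proba]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# Mean log loss: average negative log-probability assigned to the true class.
log_loss = -sum(math.log(row[t]) for row, t in zip(proba, y_true)) / len(y_true)

print("predicted:", y_pred)
print("accuracy :", accuracy)
print("log loss :", round(log_loss, 3))
```

Accuracy only counts whether each thresholded prediction is right, while log loss also penalizes how confident a wrong (or barely right) probability was, which is why the two measures can rank models differently.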