# Binary Classification with Logistic Regression

## What is binary classification?

- The output (dependent) variable takes exactly one of two possible values
- Every $y$ is one of a pair such as:
    - $0$ or $1$
    - $-1$ or $1$
    - yes or no
    - true or false


## Example Data: admission to Elay University, a fictional selective university

- The data below contains three columns:
    1. acceptance response
    1. amount donated to the university from the applicant's family
    1. the applicant's high school GPA
- The $y$ value for this data is the acceptance response, which can take only one of two values:
    1. accepted
    2. not accepted
- Accepted will be $y=1$ and not accepted will be $y=0$

In [None]:
import torch
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.subplots as psub

In [None]:
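df = torch.load('elay_acceptance.pt', weights_only=False) # the file holds a pickled pandas DataFrame, so full unpickling is needed on recent PyTorch versions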
df

In [None]:
fig = px.scatter(df, x="donations", y="gpa", symbol='response', color='response')
fig.update_traces(opacity=0.5)

## What does logistic regression do?

- Finds the line that best separates the data
- Above the line is "accepted"
- Below the line is "not accepted"

## Loss/Objective Function

### Zero-One Loss

- Loss is 0 if you make the correct prediction, 1 otherwise
- Above the line will predict $\hat{y}^{(i)}=1$
- Below the line will predict $\hat{y}^{(i)}=0$ (matching the $y=0$ encoding for "not accepted")
  $$ l^{(i)} = \begin{cases} 
      0 & \text{if} \ \ \hat{y}^{(i)} = y^{(i)} \\
      1 & \text{if} \ \ \hat{y}^{(i)} \neq y^{(i)}
   \end{cases} $$
- This is a step function
- The gradient of this function is undefined at the separating line and zero everywhere else, so it gives no signal about which direction improves the loss. Gradient descent can't be used here.
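
To make the flat step function concrete, here is a minimal sketch in plain Python. The points, labels, and weights are made-up values for illustration (not taken from the Elay data), and `predict_label` and `zero_one_loss` are hypothetical helper names. Nudging a weight slightly leaves the loss exactly where it was, which is why there is no slope for gradient descent to follow.

```python
def predict_label(w1, w2, b, x1, x2):
    """Predict 1 when the point is on the positive side of b + w1*x1 + w2*x2 = 0, else 0."""
    return 1 if b + w1 * x1 + w2 * x2 > 0 else 0

def zero_one_loss(params, points, labels):
    """Average zero-one loss: the fraction of points that are misclassified."""
    w1, w2, b = params
    mistakes = [int(predict_label(w1, w2, b, x1, x2) != y)
                for (x1, x2), y in zip(points, labels)]
    return sum(mistakes) / len(mistakes)

# Hypothetical standardized (donations, gpa) pairs and labels, for illustration only
points = [(-1.0, -0.5), (0.5, 1.0), (1.5, -0.2), (-0.3, -1.2)]
labels = [0, 1, 0, 0]

params = (1.0, 1.0, 0.0)   # w1, w2, b
nudged = (1.01, 1.0, 0.0)  # same line with w1 changed slightly

loss = zero_one_loss(params, points, labels)
loss_nudged = zero_one_loss(nudged, points, labels)
print(loss, loss_nudged)  # identical: the small nudge flipped no prediction
```

The loss only changes when a nudge is large enough to move a point across the line, and then it jumps by a whole misclassification at once.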


### Logistic Loss

#### Logistic Function

- Instead of making a binary prediction (0 or 1), predict the probability of being accepted.
- Probabilities are between 0.0 and 1.0
- A standard logistic function (also called a sigmoid) will take any number as an input and return a value between 0.0 and 1.0
    $$ f(z) = \frac{1}{1+e^{-z}} $$
- Code to graph this function is below


In [None]:
def logistic(z):
    return 1/(1+torch.exp(-z))

In [None]:
Zs = torch.linspace(-10,10,100)
Values = logistic(Zs)
px.line(x=Zs, y=Values, title='Logistic function').update_xaxes(title='z').update_yaxes(title='value')


- We can plug a value proportional to the signed distance from the line/boundary into the logistic function as its input $z$:
    $$ \hat{y}^{(i)} = \frac{1}{1+ \exp (-z^{(i)})} $$
    - in our case:
    $$ z^{(i)} = b +  w_1 x_1^{(i)} + w_2 x_2^{(i)} $$
    - *keep in mind that when plotting this data $x_2$ is what's displayed on "y-axis"/vertical-axis although it is not our output/y-variable.*
    - in general with any number of x-variable components:
    $$ z^{(i)} = b + \bm{w} \cdot \bm{x}^{(i)} $$
- A point on the boundary that separates the data has $z^{(i)} = 0$, so its probability is $\hat{y}^{(i)} = 0.5$
- A point above the boundary will have $\hat{y}^{(i)} > 0.5$
- A point below the boundary will have $\hat{y}^{(i)} < 0.5$
- We can multiply $z^{(i)}$ (that is, $b$ and every component of $\bm{w}$) by any nonzero factor and it won't change the position of the boundary that separates the data
    - This factor adjusts the steepness of the logistic curve
    - A factor greater than one makes it steeper, which means a point one unit above the boundary will now have a higher probability (closer to 1.0) than before
    - A positive factor less than one makes it less steep, which means a point one unit above the boundary will now have a lower probability (closer to 0.5) than before
    - A negative factor flips the side that is the positive/true/1 prediction to the other side of the boundary
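
The effect of the scaling factor can be checked numerically. This is a small sketch in plain Python; the factors and the point "one unit above the boundary" ($z = 1$) are chosen purely for illustration.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

z_on = 0.0     # a point exactly on the boundary
z_above = 1.0  # a point one unit above the boundary

factors = (0.5, 1.0, 2.0, -1.0)
on_boundary = {f: logistic(f * z_on) for f in factors}  # stays at 0.5 for every factor
probs = {f: logistic(f * z_above) for f in factors}     # moves with the factor

for f in factors:
    print(f, on_boundary[f], round(probs[f], 3))
```

Every factor leaves the boundary point at probability 0.5, while the point above the boundary moves closer to 1.0 for larger positive factors and drops below 0.5 for the negative factor.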

#### Likelihood

- Now that we have an equation that predicts a probability for each of our data points, we can compute how likely the model is to produce our dataset
- We're going to assume that each applicant's acceptance is independent of the others
- The probability that independent events all occur is the product of their probabilities
  $$ P(y^{(i)} | \hat{y}^{(i)}) = \begin{cases} 
      \hat{y}^{(i)} & \text{if} \ \ y^{(i)} = 1 \\
      1 - \hat{y}^{(i)} & \text{if} \ \ y^{(i)} = 0
   \end{cases} $$
  $$\text{or}$$
  $$ P(y^{(i)} | \hat{y}^{(i)}) = y^{(i)} \hat{y}^{(i)} + (1-y^{(i)}) (1-\hat{y}^{(i)})$$
  $$ P(\bm{y} | \hat{\bm{y}}) = \prod_{i=1}^{N} P(y^{(i)} | \hat{y}^{(i)}) = P(y^{(1)} | \hat{y}^{(1)}) \cdot P(y^{(2)} | \hat{y}^{(2)}) \cdot ... \cdot P(y^{(N)} | \hat{y}^{(N)})$$
- We want to find the $\bm{w}$ and $b$ that maximize the likelihood:
  $$ \underset{b, \bm{w}}{\operatorname{arg\ max}} \  P(\bm{y} | \hat{\bm{y}})$$
- The calculus for all those products is difficult so...
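
As a sanity check on the formulas above, the per-point probabilities and their product can be computed for a toy dataset. The labels and predicted probabilities below are hypothetical values, not output from a trained model.

```python
import math

y = [1, 0, 1, 1]              # hypothetical true labels
y_hat = [0.9, 0.2, 0.7, 0.6]  # hypothetical predicted probabilities of y = 1

# P(y_i | y_hat_i) = y_i * y_hat_i + (1 - y_i) * (1 - y_hat_i)
per_point = [yi * pi + (1 - yi) * (1 - pi) for yi, pi in zip(y, y_hat)]

# Independence assumption: the likelihood of the whole dataset is the product
likelihood = math.prod(per_point)
print(per_point, likelihood)
```

Each factor is the probability the model assigned to the label that actually occurred, so better predictions push the product toward 1.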

#### Log Likelihood

- Taking the log of a positive function does not change its $\arg \max$ or $\arg \min$, because $\ln$ is strictly increasing
- If you take the log of the likelihood it makes the calculus easier
    - $\ln (uv) = \ln(u) + \ln(v)$
- Take the natural log of our likelihood function
$$ \ln P(\bm{y} | \hat{\bm{y}}) = \ln \prod_{i=1}^{N} P(y^{(i)} | \hat{y}^{(i)}) = \sum_{i=1}^{N} \ln P(y^{(i)} | \hat{y}^{(i)}) = \ln P(y^{(1)} | \hat{y}^{(1)}) + \ln P(y^{(2)} | \hat{y}^{(2)}) + ... + \ln P(y^{(N)} | \hat{y}^{(N)})$$
$$ \ln P(y^{(i)} | \hat{y}^{(i)}) = \begin{cases} 
      \ln \left( \hat{y}^{(i)} \right) & \text{if} \ \ y^{(i)} = 1 \\
      \ln \left( 1 - \hat{y}^{(i)} \right) & \text{if} \ \ y^{(i)} = 0
   \end{cases} $$
$$\text{or}$$
$$ \ln P(y^{(i)} | \hat{y}^{(i)}) = y^{(i)} \ln \left( \hat{y}^{(i)} \right) + \left( 1-y^{(i)} \right) \ln \left( 1-\hat{y}^{(i)} \right) $$
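
The identity between the log of the product and the sum of the logs can be verified on a toy example. This sketch uses plain Python with hypothetical labels and probabilities; `log_likelihood` is an illustrative helper name. It also shows a practical side benefit, stated here as an aside: a product of many probabilities can underflow to zero in floating point, while the sum of logs stays finite.

```python
import math

def log_likelihood(y, y_hat):
    """Sum over points of y*ln(y_hat) + (1-y)*ln(1-y_hat)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, y_hat))

y = [1, 0, 1, 1]              # hypothetical true labels
y_hat = [0.9, 0.2, 0.7, 0.6]  # hypothetical predicted probabilities

product = math.prod(pi if yi == 1 else 1 - pi for yi, pi in zip(y, y_hat))
ll = log_likelihood(y, y_hat)
print(math.log(product), ll)  # the two values agree up to rounding

# With many points the raw product underflows, but the log-likelihood does not
many_y = [1] * 2000
many_y_hat = [0.5] * 2000
underflowed = math.prod(many_y_hat)
ll_many = log_likelihood(many_y, many_y_hat)
print(underflowed, ll_many)
```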

#### Negative Log Likelihood

- We want to maximize the log-likelihood, not minimize it
- We would have to do gradient **ascent** on the log-likelihood
- Or we could just multiply the log-likelihood by $-1$ and use the same gradient descent we have been using
$$ \underset{b, \bm{w}}{\operatorname{arg\ min}} \quad  - \ln P(\bm{y} | \hat{\bm{y}})$$
$$ \underset{b, \bm{w}}{\operatorname{arg\ min}} \quad  \sum_{i=1}^{N} \left[ - y^{(i)} \ln \left( \hat{y}^{(i)} \right) - (1-y^{(i)}) \ln \left( 1-\hat{y}^{(i)} \right) \right] $$

- Negative log-likelihood is the loss function for logistic regression
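
The training code in this notebook calls `torch.nn.functional.binary_cross_entropy` rather than writing the sum out by hand. The sketch below, using hypothetical labels and probabilities, checks that the two agree: with `reduction='sum'` the function matches the summed negative log-likelihood, and with the default `reduction='mean'` it is the same quantity divided by $N$, which has the same $\arg \min$.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0, 0.0, 1.0, 1.0])      # hypothetical true labels
y_hat = torch.tensor([0.9, 0.2, 0.7, 0.6])  # hypothetical predicted probabilities

# Negative log-likelihood written out term by term
nll_sum = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

bce_sum = F.binary_cross_entropy(y_hat, y, reduction='sum')
bce_mean = F.binary_cross_entropy(y_hat, y)  # default reduction is 'mean'

print(nll_sum.item(), bce_sum.item(), bce_mean.item())
```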

## Standardize the data

In [None]:
def standardize(data):
    assert isinstance(data, torch.Tensor) # check that data is a pytorch tensor
    column_mu = data.mean(dim=0) # take the mean of each column
    column_sigma = data.std(dim=0) # take the std deviation of each column
    data_standardized = (data-column_mu)/column_sigma # subtract each value's column mean, then divide by the column std deviation
    return data_standardized, column_mu, column_sigma

def unstandardize(data_stand, column_mu, column_sigma):
    assert isinstance(data_stand, torch.Tensor)
    return data_stand * column_sigma + column_mu
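
A quick way to check the two helpers is a round trip on a small tensor. The sketch below repeats compact versions of `standardize` and `unstandardize` so it runs on its own; the rows are hypothetical (donations, gpa) values, not the real dataset.

```python
import torch

def standardize(data):
    column_mu = data.mean(dim=0)    # mean of each column
    column_sigma = data.std(dim=0)  # sample std deviation of each column
    return (data - column_mu) / column_sigma, column_mu, column_sigma

def unstandardize(data_stand, column_mu, column_sigma):
    return data_stand * column_sigma + column_mu

# Hypothetical (donations, gpa) rows, for illustration only
raw = torch.tensor([[1000.0, 3.2], [25000.0, 3.9], [500.0, 2.8], [8000.0, 3.5]],
                   dtype=torch.float64)

scaled, mu, sigma = standardize(raw)
restored = unstandardize(scaled, mu, sigma)

print(scaled.mean(dim=0))  # each column mean is approximately 0
print(scaled.std(dim=0))   # each column std deviation is approximately 1
```

After standardizing, both columns are on the same scale even though donations and GPA differ by several orders of magnitude in the raw data.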

In [None]:
X_all, mus, stds = standardize(torch.tensor(df[['donations','gpa']].values, dtype=torch.float32)) # float32 matches the default dtype of torch.nn.Linear

In [None]:
Y_all = torch.tensor((df['response'] == 'accepted').values) # boolean tensor: True means accepted

## Split data into train, validate, and test

In [None]:
random_order_indices = torch.randperm(len(X_all)) # random order of X_all's indices
train_proportion = 0.7
valid_proportion = 0.15
test_proportion = 0.15
train_valid_index_border = int(len(X_all)*train_proportion) # int() truncates, which equals floor for positive numbers
valid_test_index_border = int(len(X_all)*(train_proportion+valid_proportion))
train_indices = random_order_indices[0:train_valid_index_border] # take first 70% of random_order_indices to be the train indices
valid_indices = random_order_indices[train_valid_index_border:valid_test_index_border]
test_indices = random_order_indices[valid_test_index_border:]

X = X_all[train_indices]
Y = Y_all[train_indices]
X_valid = X_all[valid_indices]
Y_valid = Y_all[valid_indices]
X_test  = X_all[test_indices]
Y_test  = Y_all[test_indices]
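
The split logic can be checked on placeholder data. In this sketch `X_all` is a dummy tensor of 100 rows and the seed is fixed only so the run is reproducible; both are assumptions for illustration. The point is that the three index sets are disjoint and together cover every row exactly once.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed for a reproducible illustration
n = 100
X_all = torch.arange(n * 2, dtype=torch.float32).reshape(n, 2)  # placeholder features

order = torch.randperm(n)          # random order of the row indices
train_end = int(n * 0.7)           # border between train and validation
valid_end = int(n * (0.7 + 0.15))  # border between validation and test

train_idx = order[:train_end]
valid_idx = order[train_end:valid_end]
test_idx = order[valid_end:]

X_train = X_all[train_idx]
print(len(train_idx), len(valid_idx), len(test_idx))
```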

## Write your own PyTorch module for logistic regression

In [None]:
class LogReg(torch.nn.Module): # inherits from torch.nn.Module
    def __init__(self): # defines initialization of this module
        super().__init__() # calls the parent class' (torch.nn.Module) initializer
        self.linear = torch.nn.Linear(in_features=2, out_features=1) # linear part of logistic regression
        self.sigmoid = torch.nn.Sigmoid() # sigmoid (logistic) part of logistic regression
        
    def forward(self, x):
        z = self.linear(x) # apply the linear part
        y_pred = self.sigmoid(z) # apply the logistic/sigmoid part
        return y_pred # return the probability
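
To connect the module back to the formula $\hat{y} = 1/(1+\exp(-z))$ with $z = b + \bm{w} \cdot \bm{x}$, the sketch below applies a `torch.nn.Linear` layer followed by a sigmoid (the same two steps as `LogReg.forward`) and compares the result with the formula computed by hand. The input rows are hypothetical standardized values, and the weights are whatever the random initialization produces.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the random weights are reproducible
linear = torch.nn.Linear(in_features=2, out_features=1)
x = torch.tensor([[0.5, -1.0], [2.0, 0.3]])  # hypothetical standardized rows

with torch.no_grad():
    y_pred = torch.sigmoid(linear(x))  # linear step, then sigmoid; shape (2, 1)

    w = linear.weight.squeeze()  # weight vector, shape (2,)
    b = linear.bias.squeeze()    # scalar bias
    z_manual = b + x @ w         # b + w . x for each row, shape (2,)
    y_manual = 1 / (1 + torch.exp(-z_manual))

print(y_pred.squeeze(), y_manual)
```

The module returns one column per output feature, so its output has shape `(N, 1)`; that is why the training code calls `.squeeze()` before comparing predictions with the labels.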

## Metrics

- you can use a metric other than your loss function to measure the performance of your model
- sometimes your loss function is difficult to explain to other people
- sometimes your loss function doesn't directly measure your goal
    - in logistic regression the loss function (negative log-likelihood) doesn't tell you directly how many predictions you got right vs wrong
    - zero-one loss would directly measure our goal but it wouldn't work with gradient descent
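
The gap between loss and accuracy can be seen on a toy example. In this plain-Python sketch, two hypothetical sets of predictions classify every point correctly at a 0.5 threshold, yet their negative log-likelihoods differ a lot. `mean_nll` and `accuracy` are illustrative helper names.

```python
import math

def mean_nll(y, y_hat):
    """Mean negative log-likelihood (binary cross-entropy)."""
    total = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, y_hat))
    return -total / len(y)

def accuracy(y, y_hat, threshold=0.5):
    """Fraction of points whose thresholded prediction matches the label."""
    return sum(int((pi > threshold) == bool(yi)) for yi, pi in zip(y, y_hat)) / len(y)

y = [1, 0, 1, 0]                      # hypothetical true labels
confident = [0.95, 0.05, 0.90, 0.10]  # hypothetical confident predictions
hesitant = [0.55, 0.45, 0.60, 0.40]   # hypothetical barely-correct predictions

print(accuracy(y, confident), mean_nll(y, confident))
print(accuracy(y, hesitant), mean_nll(y, hesitant))
```

Both sets score the same accuracy, but the loss rewards the confident predictions, so tracking both numbers during training gives a fuller picture.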

In [None]:
log_reg = LogReg() # initialize the logistic regression model
optim = torch.optim.SGD(log_reg.parameters(), lr=1.0)
epoch = 0
Losses = []
Losses_valid = []
Accuracy = []
Accuracy_valid = []

In [None]:
def graph_steps(title):
    Losses = []
    Losses_valid = []
    Acc = []
    Acc_valid = []
    fig = go.FigureWidget(psub.make_subplots(rows=1, cols=2, subplot_titles=('Loss', 'Accuracy')))
    
    trace_train = go.Scatter( x=list(range(len(Losses))), y=Losses, line=dict(color=px.colors.qualitative.Plotly[0]), name='train loss' )
    fig.add_trace(trace_train, row=1, col=1)
    
    trace_valid = go.Scatter( x=list(range(len(Losses_valid))), y=Losses_valid, line=dict(color=px.colors.qualitative.Plotly[1]), name='valid loss' )
    fig.add_trace(trace_valid, row=1, col=1)
    
    trace_train_acc = go.Scatter( x=list(range(len(Acc))), y=Acc, line=dict(color=px.colors.qualitative.Plotly[2]), name='train accuracy' )
    fig.add_trace(trace_train_acc, row=1, col=2)
    
    trace_valid_acc = go.Scatter( x=list(range(len(Acc_valid))), y=Acc_valid, line=dict(color=px.colors.qualitative.Plotly[3]), name='valid accuracy' )
    fig.add_trace(trace_valid_acc, row=1, col=2)
    
    fig.update_layout( autosize=False, width=1500, height=500, title=title)
    fig.update_xaxes( title="steps", row=1, col=1)
    fig.update_yaxes( title='loss', row=1, col=1)
    fig.update_xaxes( title="steps", row=1, col=2)
    fig.update_yaxes( title='accuracy', row=1, col=2)
    return fig

In [None]:
def step(X, Y, X_valid, Y_valid, model, optim, Losses, Losses_valid, Accuracy, Accuracy_valid):
    global epoch
    print('epoch #{}'.format(epoch))
    optim.zero_grad() # reset the gradients
    model.train() # tells pytorch we're training the linear module

    pred_y = model(X).squeeze() # make a prediction

    loss = torch.nn.functional.binary_cross_entropy(pred_y, Y.float()) # calculate logistic loss
    loss.backward() # calculate gradients
    optim.step() # take gradient descent step
    
    pred_y_binary = pred_y > 0.5
    accuracy = (pred_y_binary == Y).float().mean()

    model.eval() # tells pytorch we're not training our linear module (we're validating now instead of training)
    with torch.no_grad(): # Tells pytorch to not track gradients. This prevents unnecessary calculations and memory usage
        pred_y_valid = model(X_valid).squeeze()
        loss_valid = torch.nn.functional.binary_cross_entropy(pred_y_valid, Y_valid.float())
        pred_y_valid_binary = pred_y_valid > 0.5
        accuracy_valid = (pred_y_valid_binary == Y_valid).float().mean()

    Losses.append(loss.item()) # add current training loss to Losses list
    Losses_valid.append(loss_valid.item()) # add current validation loss to Losses_valid list
    Accuracy.append(accuracy.item())
    Accuracy_valid.append(accuracy_valid.item())
    print('Validation Loss: ' + str(loss_valid.item()))
    epoch += 1

In [None]:
fig = graph_steps('Logistic Loss')
train_loss = fig.data[0] # use this variable to add train losses to graph
valid_loss = fig.data[1] # use this variable to add validation losses to graph
train_acc = fig.data[2]
valid_acc = fig.data[3]
fig

In [None]:
step(X, Y, X_valid, Y_valid, log_reg, optim, Losses, Losses_valid, Accuracy, Accuracy_valid)
# if Losses_valid[-1] < best_model_valid_loss:
#     best_model = copy_linear(model)
#     best_model_valid_loss = Losses_valid[-1]
train_loss.x = list(range(len(Losses))) # add the train steps to graph on x-axis
train_loss.y = Losses # add the train losses to graph on y-axis
valid_loss.x = list(range(len(Losses_valid)))
valid_loss.y = Losses_valid

train_acc.x = list(range(len(Accuracy))) # add the train steps to the accuracy graph on x-axis
train_acc.y = Accuracy # add the train accuracies to graph on y-axis
valid_acc.x = list(range(len(Accuracy_valid)))
valid_acc.y = Accuracy_valid

In [None]:
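log_reg.eval() # evaluation mode for testing
with torch.no_grad(): # gradients are not needed when only making predictions
    y_test_pred = log_reg(X_test).squeeze()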
y_test_pred_binary = y_test_pred > 0.5

In [None]:
correct = y_test_pred_binary == Y_test

In [None]:
log_reg.linear.weight.data[0,1]

In [None]:
df_std = (df['donations'] - df['donations'].mean())/df['donations'].std()

In [None]:
df_std.min()

In [None]:
x_plot_max = X[:,0].max().item()
x_plot_min = X[:,0].min().item()
weight = log_reg.linear.weight.data.squeeze().tolist()
bias = log_reg.linear.bias.data.item()

In [None]:
line_x1s = [x_plot_min, x_plot_max]
line_x2s = [(-bias - weight[0]*x_plot_min)/weight[1], (-bias - weight[0]*x_plot_max)/weight[1]]
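
The two endpoints above come from solving the boundary equation for $x_2$. The boundary is where $\hat{y} = 0.5$, which is where $z = 0$. Assuming $w_2 \neq 0$:

$$ b + w_1 x_1 + w_2 x_2 = 0 \quad \Rightarrow \quad x_2 = \frac{-b - w_1 x_1}{w_2} $$

Because the model was trained on standardized inputs, $x_1$ and $x_2$ here are the standardized donations and GPA, so the line is drawn in standardized coordinates.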

In [None]:
fig = go.Figure()
true_ind = y_test_pred_binary == True
scatter_pred_true = go.Scatter(x=X_test[true_ind,0], y=X_test[true_ind,1], mode='markers', marker=dict(opacity=0.5), name="predicted accepted")
scatter_pred_false = go.Scatter(x=X_test[(~true_ind),0], y=X_test[(~true_ind),1], mode='markers', marker=dict(opacity=0.5), name="predicted not accepted")
line = go.Scatter(x=line_x1s, y=line_x2s, name='decision boundary')
fig.add_traces([scatter_pred_true, scatter_pred_false, line])

In [None]:
fig = go.Figure()
correct_ind = correct == True
scatter_correct = go.Scatter(x=X_test[correct_ind,0], y=X_test[correct_ind,1], mode='markers', marker=dict(opacity=0.5), name='correct predictions')
scatter_incorrect = go.Scatter(x=X_test[~correct_ind,0], y=X_test[~correct_ind,1], mode='markers', marker=dict(opacity=0.5), name='incorrect predictions')
# line = go.Scatter(x=X1_line, y=X2_line)
fig.add_traces([scatter_correct, scatter_incorrect, line])

In [None]:
fig = go.Figure()
accepted = Y_test == True
scatter_accepted = go.Scatter(x=X_test[accepted,0], y=X_test[accepted,1], mode='markers', marker=dict(opacity=0.5), name='accepted')
scatter_not_accepted = go.Scatter(x=X_test[~accepted,0], y=X_test[~accepted,1], mode='markers', marker=dict(opacity=0.5), name='not accepted')
# line = go.Scatter(x=X1_line, y=X2_line)
fig.add_traces([scatter_accepted, scatter_not_accepted, line])