# Logistic Regression

Just when we thought we were done with regression, it pulls us back in!

To get an idea of how this algorithm works we'll look at some phony classification data. In a later notebook you'll use logistic regression to build a cancer classifier.

## What You'll Accomplish

- Learn the logistic regression algorithm
- Show how you can interpret logistic regression output
- Talk about classification cutoffs
- Introduce the generalized linear model framework

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## The Algorithm

We'll be using logistic regression for binary classification, that is, classification problems with only two classes, typically coded as $0$ and $1$. Normally the class denoted as $1$ is the one we want to identify, for example someone who has a disease or someone who qualifies for a loan.

<i>Note that logistic regression can be adapted for more than two classes, but for now we'll only focus on two.</i>

### From Binary to Continuous
<i>Logistic regression is a form of regression algorithm.</i> It turns out that this may be a somewhat controversial statement (see <a href="https://twitter.com/TenanATC/status/1386332061087277057?s=20">https://twitter.com/TenanATC/status/1386332061087277057?s=20</a>), but it is important to remember that it is in fact a regression technique used to solve classification problems.

Remember that regression algorithms are usually used to predict continuous outcomes, but binary classification is in no way continuous. This is where logistic regression is a little clever: instead of modeling the output class, it models the probability that a particular data point is an instance of class $1$.

Let's dive in with some fake data. 

In [None]:
## Load in the randomly generated data
data = np.loadtxt("random_binary.csv",delimiter = ",")
X = data[:,0]
y = data[:,1]

In [None]:
## Perform a stratified test train split
## Practice, write the code to do that in these two blocks
## First import the package

## I'll wait before writing the code myself :)
from sklearn.model_selection import train_test_split

In [None]:
## Now split the data
## Have 20% for testing
## Set 614 as the random state
## and stratify the split

## I'll wait before writing the code myself :)
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                test_size=.2,
                                                shuffle=True,
                                                random_state=614,
                                                stratify=y)

In [None]:
# Plot the training data
plt.figure(figsize = (10,8))

plt.scatter(X_train,y_train)
plt.ylim((-.1,1.1))
plt.xlabel("Feature",fontsize = 16)
plt.ylabel("Class",fontsize = 16)

plt.show()

Now let's get down to the idea behind logistic regression. While the $y$-axis of the above plot says Class, we could just as easily label it the Probability the instance is $1$. In this case, since we know the class of each data point in the training set, the probability can only be $0$ or $1$. Now suppose you have a new data point for which you only have the vector of predictors, $X$. We're interested in the probability that this data point has class $y=1$; call this probability $P(y=1|X) = p(X)$. Since $p(X)$ can take on any value in $[0,1]$, it is a continuous variable.

The way we model the probability in logistic regression is with a sigmoidal curve, whose general form looks like this:
$$
f(x) = \frac{1}{1+e^{-x}}.
$$
A graph of the curve this function produces is shown below.

In [None]:
x = 10

plt.figure(figsize = (8,6))

plt.plot(np.arange(-x,x,.01),1/(1+np.exp(-np.arange(-x,x,.01))))


plt.xlabel("x",fontsize = 14)
plt.ylabel("f(x)",fontsize = 14)

plt.title("A Sigmoidal Curve", fontsize=16)

plt.show()

Notice that this function stays between $0$ and $1$. Also, like our phony data, it transitions from class $0$ to class $1$, and it does so in a continuous manner. This is the function type we'd like to use as our model.
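
The properties just described can be checked numerically. The block below is a small sketch; the helper name `sigmoid` and the sample points are chosen here for illustration and are not part of the notebook's data.

```python
import numpy as np

def sigmoid(x):
    """The sigmoidal curve f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

# A few sample points, chosen only for illustration
xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(xs)

# Every output lies strictly between 0 and 1
print(values)

# The curve passes through one half at x = 0
print(sigmoid(0.0))

# The curve is symmetric about the point (0, 0.5): f(-x) = 1 - f(x)
print(np.allclose(sigmoid(-xs), 1 - sigmoid(xs)))
```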

The model that is used in logistic regression is:
$$
p(X) = \frac{1}{1 + e^{-X\beta}},
$$
where $\beta = \left(\beta_0,\beta_1,\dots,\beta_m\right)^T$ is a column vector of coefficients and $X$ has been extended to include a column of ones.
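
To make the role of the column of ones concrete, here is a minimal sketch for a single feature. The feature values and the coefficients in `beta` are hypothetical numbers picked for illustration, not values fit to the phony data.

```python
import numpy as np

# Hypothetical feature values and coefficients, for illustration only
x = np.array([0.1, 0.5, 0.9])
beta = np.array([-4.0, 8.0])   # (beta_0, beta_1)

# Extend X with a column of ones so that beta_0 acts as the intercept
X = np.column_stack([np.ones(len(x)), x])

# p(X) = 1 / (1 + e^{-X beta})
p = 1.0 / (1.0 + np.exp(-X @ beta))
print(p)
```

With these made-up coefficients $X\beta = 0$ at $x = 0.5$, so the modeled probability there is one half.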

The model is fit using the statistical method of maximum likelihood estimation (for a derivation of the loss function see the homework on logistic regression).
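
As a rough illustration of what maximum likelihood fitting involves, the sketch below minimizes the average negative log-likelihood of a Bernoulli model with plain gradient descent. Everything here is an assumption made for the example: the data are synthetic, `true_beta` is hypothetical, and gradient descent was chosen for readability. `sklearn` uses its own solvers and applies regularization by default, so its coefficients will generally differ somewhat from an unpenalized fit like this one.

```python
import numpy as np

rng = np.random.default_rng(614)

# Synthetic data drawn from a known logistic model (illustration only)
n = 500
x = rng.uniform(0, 1, n)
true_beta = np.array([-4.0, 8.0])          # hypothetical coefficients
X = np.column_stack([np.ones(n), x])       # column of ones for the intercept
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)

def neg_log_likelihood(beta, X, y):
    """Average negative log-likelihood of the Bernoulli model."""
    z = X @ beta
    # log(1 + e^z) - y*z equals -[y log p + (1 - y) log(1 - p)]
    return np.mean(np.logaddexp(0, z) - y * z)

def gradient(beta, X, y):
    """Gradient of the average negative log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y) / len(y)

# Plain gradient descent starting from all zeros
beta = np.zeros(2)
learning_rate = 1.0
for _ in range(20000):
    beta = beta - learning_rate * gradient(beta, X, y)

print("Estimated coefficients:", beta)
print("Loss at zero vs. at the estimate:",
      neg_log_likelihood(np.zeros(2), X, y),
      neg_log_likelihood(beta, X, y))
```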

Let's see how to use `sklearn` to fit a logistic regression model to our phony data, then see how we can use it for classification.

### You Code 

You'll make and fit the Logistic Regression model below for our phony data.

The procedure follows the standard `sklearn` pattern for fitting and predicting a model. If you need more help check out the docs, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html</a>. The model object is stored in `sklearn.linear_model` and is called `LogisticRegression`.

In [None]:
## import the logistic regression method
from sklearn.linear_model import LogisticRegression

In [None]:
## Make a model instance here
## call it log_reg
log_reg = LogisticRegression()

In [None]:
## Fit the model here
log_reg.fit(X_train.reshape(-1,1), y_train)


Now that we've fit the model to our training data, let's plot the model along with our train data, and see what we're talking about.

In [None]:
## Plot the training data along with the fitted model

## use this to plot your fitted model
xs = np.linspace(0,1,1000)
pred = log_reg.predict_proba(xs.reshape(-1,1))[:,1]

# Plot figure here






The dotted red line shows how the logistic regression model fit on our training data changes the probability of being an instance of the $1$ class as we increase the value for our feature.

### From Probabilities to Classifications

How would you propose we turn probability output into classifications?

<br>
<br>
<br>
<br>
<br>
<br>


The standard approach is to just choose a probability cutoff. For instance, if $p(X) \geq .5$ we classify the instance as $1$, otherwise we say it is a $0$. This is an example of a <i>decision boundary</i>: any point to the left of the boundary gets classified as a $0$, and any point to the right as a $1$. Decision boundaries are a big part of many classification algorithms, so this won't be the last time we see them.
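
For a single feature, the cutoff on the probability corresponds to a cutoff on the feature itself: $p(x) = c$ exactly when $\beta_0 + \beta_1 x = \log\left(c/(1-c)\right)$. The sketch below uses hypothetical coefficients (not the ones fit above) with $\beta_1 > 0$, and the helper names `boundary` and `classify` are made up for this example.

```python
import numpy as np

# Hypothetical fitted coefficients, for illustration only
beta_0, beta_1 = -4.0, 8.0

def boundary(cutoff):
    """Feature value at which the modeled probability equals the cutoff."""
    return (np.log(cutoff / (1 - cutoff)) - beta_0) / beta_1

def classify(x, cutoff=0.5):
    """Assign class 1 when p(x) >= cutoff, otherwise class 0."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))
    return (p >= cutoff).astype(int)

# With a cutoff of .5 the boundary sits at -beta_0 / beta_1
print(boundary(0.5))

# Points on either side of the boundary get different classes
print(classify(np.array([0.2, 0.8])))

# A stricter cutoff moves the boundary to the right when beta_1 > 0
print(boundary(0.8))
```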

Let's see what our training accuracy is for a cutoff of your choice. Someone enter a cutoff value into the Zoom chat now :)

In [None]:
## Write code to calculate the accuracy for any cutoff, then choose your cutoff
cutoff = .5

## store the predicted probabilities
y_prob = log_reg.predict_proba(X_train.reshape(-1,1))[:,1]

## assign the value based on the cutoff
## (>= matches the rule p(X) >= cutoff described above)
y_train_pred = 1*(y_prob >= cutoff)

## print the accuracy
## input the accuracy after "is",
print("The training accuracy for a cutoff of",cutoff,
      "is", np.sum(y_train_pred == y_train)/len(y_train))

In [None]:
## Now plot how the accuracy changes with the cutoff
cutoffs = np.arange(0,1.01,.01)
accs = []

for cutoff in cutoffs:
    y_train_pred = 1*(y_prob >= cutoff)
    accs.append(np.sum(y_train_pred == y_train)/len(y_train))

In [None]:
plt.figure(figsize=(12,8))

plt.scatter(cutoffs,accs)

plt.xlabel("Cutoff",fontsize=16)
plt.ylabel("Training Accuracy",fontsize=16)

plt.show()

## Interpreting Logistic Regression

One nice feature of this algorithm is that we can interpret its results.

Reconsider the statistical model that we fit:
$$
p(X) = \frac{1}{1 + e^{- X \beta}}.
$$
Rearranging this equation we find the following:
$$
\log\left(\frac{p(X)}{1-p(X)}\right) =  X \beta.
$$
The expression $p(X)/(1-p(X))$ is known as the odds of the event $y=1$. So the statistical model for logistic regression is really just a linear model for the $\log$ odds of being class $1$. This allows us to interpret the coefficients of our model.
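
For readers who want the intermediate steps of that rearrangement, one way to fill them in is the following, writing $p$ for $p(X)$:
$$
\begin{aligned}
p &= \frac{1}{1 + e^{-X\beta}}, \\
1 - p &= \frac{e^{-X\beta}}{1 + e^{-X\beta}}, \\
\frac{p}{1-p} &= \frac{1}{e^{-X\beta}} = e^{X\beta}, \\
\log\left(\frac{p}{1-p}\right) &= X\beta.
\end{aligned}
$$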

Look at the model we just fit:
$$
\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x, \text{ or } \text{Odds}|x = C e^{\beta_1 x}
$$
where $x$ is the feature, and $C = e^{\beta_0}$ is a constant that does not depend on $x$.

So if we increase $x$ from say $d$ to $d+1$, a $1$ unit increase, then our odds are multiplied by $e^{\beta_1}$ (making them larger or smaller depending on the sign of $\beta_1$). We can see this below:
$$
\frac{\text{Odds}|x = d+1}{\text{Odds}|x=d} = \frac{e^{\beta_1 (d+1)}}{e^{\beta_1 d}} = e^{\beta_1}
$$
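
The same ratio can be checked numerically. This is a small sketch with hypothetical coefficients; the function name `odds` and the values of `d` are chosen only for the example.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
beta_0, beta_1 = -1.0, 2.0

def odds(x):
    """Odds of y = 1 at feature value x under the logistic model."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))
    return p / (1 - p)

# The ratio of odds for a 1 unit increase does not depend on the start d
for d in [0.0, 0.3, 1.5]:
    print(d, odds(d + 1) / odds(d))

# Each ratio above should match e^{beta_1}
print(np.exp(beta_1))
```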

Let's look at the coefficient from our phony data logistic regression and interpret it.

In [None]:
print("A .1 unit increase in our feature multiplies" + 
      " the odds of being class 1 by " + 
      str(np.round(np.exp(.1*log_reg.coef_[0][0]),2)))

### Algorithm Assumptions

While we were explaining the concept of logistic regression, we didn't mention any of the assumptions of the algorithm. Let's talk about that here before we move on to real data.
<ol>
    <li>Each sample must be independent from all other samples,</li>
    <li>When using multiple predictors, they shouldn't be highly correlated with one another,</li>
    <li>The log odds depend linearly on the predictors,</li>
    <li>Logistic regression should have a largish data set to work with.</li>
</ol>

We did not worry about these assumptions in this notebook because the data were randomly generated to fit these assumptions. However, in real world applications you should check them when deciding whether or not logistic regression is a good model choice. 
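
As one example of such a check, the sketch below looks for highly correlated predictors using pairwise correlations. The feature matrix is synthetic and the threshold of `0.9` is an arbitrary choice made for illustration; this is only one possible diagnostic, and it does not address the other assumptions.

```python
import numpy as np

rng = np.random.default_rng(614)

# Hypothetical feature matrix with three predictors;
# the third is built to be nearly a copy of the first
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the columns of X
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Flag pairs whose absolute correlation exceeds the chosen threshold
threshold = 0.9  # arbitrary cutoff for illustration
rows, cols = np.where(np.triu(np.abs(corr), k=1) > threshold)
flagged = [(int(i), int(j)) for i, j in zip(rows, cols)]
print("Highly correlated column pairs:", flagged)
```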

## Generalized Linear Models

Before we end this notebook I want to briefly touch on a more general modeling framework that captures both linear and logistic regression, <i>generalized linear models</i>.

Let's review the two types of regression models we've discussed.

#### Linear Regression

For a continuous target, $y$, and a feature matrix, $X$, we had:
$$
E(y|X) = X\beta.
$$

#### Logistic Regression

For a binary target, $y$, and a feature matrix, $X$, we had:
$$
\log\left( \frac{P(y=1|X)}{1-P(y=1|X)} \right) = X\beta.
$$

Note that for a binary $0$-$1$ variable $P(y=1|X) = E(y|X)$, so in reality we had:
$$
\log\left( \frac{E(y|X)}{1-E(y|X)} \right) = X\beta.
$$

#### Notice Anything?

In both cases we could write the following:
$$
g(E(y|X)) = X\beta,
$$

where we made a specific choice for the functional form of $g$ depending on the data type of $y$. This is the idea behind generalized linear models.

### Three Components

Given features, $X$, and target, $y$, a generalized linear model relating $y$ to $X$ is composed of three components. 

##### 1.  Random Component

This is where you assume a probability distribution for $y|X$. It is typically assumed that the distribution for $y|X$ comes from the <i>exponential family</i>, <a href="https://en.wikipedia.org/wiki/Exponential_family">https://en.wikipedia.org/wiki/Exponential_family</a>.

##### 2. Systematic Component

This is where you relate the parameters $\beta$ to the features $X$. It is always the case in a generalized linear model that the systematic component is $X\beta$.

##### 3. Link Component

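This is the function $g$ that connects the random and systematic components by relating $E(y|X)$ to $X\beta$. For linear regression $g$ is the identity, and for logistic regression $g$ is the log odds function.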

Combining all three of these components gives the following:
$$
g(E(y|X)) = X\beta.
$$
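
To see the shared structure in code, the sketch below computes the systematic component once and then applies the two choices of $g$ from this notebook. The coefficients and feature values are hypothetical, and the function names are made up for the example.

```python
import numpy as np

def identity_link(mu):
    """Link used by linear regression: g(mu) = mu."""
    return mu

def logit_link(mu):
    """Link used by logistic regression: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1 - mu))

def inverse_logit(eta):
    """Inverse of the logit link, mapping X beta back to E(y|X)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical features (with a column of ones) and coefficients
X = np.column_stack([np.ones(3), np.array([0.0, 0.5, 1.0])])
beta = np.array([-1.0, 2.0])

# Systematic component, shared by both models
eta = X @ beta

# Linear regression: E(y|X) is X beta itself
mean_linear = eta

# Logistic regression: E(y|X) is the inverse logit of X beta
mean_logistic = inverse_logit(eta)

# In both cases g(E(y|X)) recovers X beta
print(identity_link(mean_linear))
print(logit_link(mean_logistic))
```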

We won't do anything else with generalized linear models in this program or in Python. However, as you continue on in your own data science work it may be useful to be familiar with the generalized linear model setup. For those interested in learning more I encourage you to check out the following resources:

<a href="http://www.stat.cmu.edu/~ryantibs/advmethods/notes/glm.pdf">http://www.stat.cmu.edu/~ryantibs/advmethods/notes/glm.pdf</a>

<a href="http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf">http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf</a>

## That's it!

That's it for this notebook. Up next you'll use logistic regression to classify cancer cases!

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)