# Logistic Regression

This notebook contains some additional problems on logistic regression. This material corresponds to `Lectures/Supervised Learning/Classification/4. Logistic Regression`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

##### 1. How to Fit a Logistic Regression Model

We return to maximum likelihood estimation from our Regression `Practice Problems`.

Recall that in logistic regression we are interested in $P(y=1|X)$; let's call this $p(X;\beta)$. In logistic regression we model this as:
$$
p(X;\beta) = \frac{1}{1 + e^{-X\beta}}.
$$
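
As a minimal sketch of the formula above, the function below evaluates $p(X;\beta)$ with `numpy`. The function name `logistic_prob`, the small design matrix, and the coefficient values are made up for illustration and do not come from the lecture notebook.

```python
import numpy as np

def logistic_prob(X, beta):
    """Return p(X; beta) = 1 / (1 + exp(-X beta)) for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

## hypothetical design matrix: a column of ones (intercept) and one feature
X_demo = np.array([[1.0, -2.0],
                   [1.0,  0.0],
                   [1.0,  2.0]])

## made-up coefficients, chosen only for illustration
beta_demo = np.array([0.0, 1.0])

p_demo = logistic_prob(X_demo, beta_demo)
print(p_demo)
```

Every output lies strictly between $0$ and $1$, and a row with $X\beta = 0$ is mapped to a probability of exactly $0.5$.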

Because our training targets are binary, we cannot rely on the same procedure we used for linear regression. We instead use maximum likelihood estimation, which first requires writing out the likelihood function.

First attempt to set up the $\log$-likelihood for the logistic regression model. Hint: we can think of $y_i$ as a Bernoulli random variable with probability parameter $p_i=p(X_i;\beta)$.


After you have accomplished that, read through this reference starting at page 5 to see the derivation of the maximum likelihood estimate for logistic regression, <a href="https://cseweb.ucsd.edu/~elkan/250B/logreg.pdf">https://cseweb.ucsd.edu/~elkan/250B/logreg.pdf</a>.

##### Sample Solution

The likelihood function is:
$$
\prod_{i=1}^n P(y=y_i|X_i) = \prod_{i=1}^n p(X_i;\beta)^{y_i} (1 - p(X_i;\beta))^{1-y_i}.
$$

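Thus the log-likelihood is:
$$
\sum_{i=1}^n \log\left(P(y=y_i|X_i)\right) = \sum_{i=1}^n \left[ y_i \log(p(X_i;\beta)) + (1 - y_i)\log(1 - p(X_i;\beta)) \right]
$$
$$
= \sum_{i=1}^n \left[ y_i \log \left( \frac{p(X_i;\beta)}{1 - p(X_i;\beta)} \right) + \log(1 - p(X_i;\beta)) \right] = \sum_{i=1}^n \left[ y_i X_i \beta - \log(1+e^{X_i\beta}) \right],
$$

where the last equality uses $\log\left( \frac{p(X_i;\beta)}{1 - p(X_i;\beta)} \right) = X_i\beta$ and $1 - p(X_i;\beta) = \frac{1}{1+e^{X_i\beta}}$.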
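
As a quick numerical sanity check, the sketch below evaluates the log-likelihood in two ways: as a sum of Bernoulli log-probabilities, and in the simplified form $\sum_i \left[ y_i X_i\beta - \log(1+e^{X_i\beta}) \right]$. The data and coefficients are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

## made-up data: intercept column plus two features, arbitrary coefficients
n = 50
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_demo = np.array([0.5, -1.0, 2.0])

eta = X_demo @ beta_demo            # linear predictor X_i beta
p = 1.0 / (1.0 + np.exp(-eta))      # p(X_i; beta)
y_demo = rng.binomial(1, p)         # Bernoulli draws with parameter p_i

## form 1: sum of Bernoulli log-probabilities
loglik_bernoulli = np.sum(y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p))

## form 2: simplified expression y_i X_i beta - log(1 + exp(X_i beta))
loglik_simplified = np.sum(y_demo * eta - np.log1p(np.exp(eta)))

print(loglik_bernoulli, loglik_simplified)
```

The two printed values should agree up to floating point error.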

##### 2. Regularization for logistic regression

You can also implement regularization with logistic regression (ridge, lasso or elastic net). In fact, by default `sklearn`'s `LogisticRegression` is ridge logistic regression.

In order to deliberately perform regularized logistic regression, you will want to know these arguments:
- `penalty`: determines what kind of regularization you run:
    - `penalty = 'none'` performs unregularized logistic regression (recent versions of `sklearn` expect `penalty = None` instead of the string),
    - `penalty = 'l2'` performs ridge logistic regression,
    - `penalty = 'l1'` performs lasso logistic regression, and
    - `penalty = 'elasticnet'` performs elastic net logistic regression.
- `C`: controls the strength of the regularization. It acts like the inverse of the `alpha` argument from `Ridge` and `Lasso`:
    - Large `C` results in weaker regularization (think of this as a small value of `alpha`), and
    - Small `C` results in stronger regularization (think of this as a large value of `alpha`).
- `solver`: the algorithm that `sklearn` uses to fit the model. Not every solver supports every penalty, so check the documentation, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html</a>, to see which solver you should use for the regularization you want to perform.
- `l1_ratio`: if you are performing elastic net regularization you need to set this, see the Regularization Regression homework notebook.
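
To make the role of `C` concrete, here is a small sketch that fits the default (ridge) `LogisticRegression` for a few values of `C` and records the size of the coefficient vector. The choice of the iris data, the three `C` values, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

## binary target: is the flower virginica (class 2)?
X_iris, y_iris = load_iris(return_X_y=True)
y_virginica = (y_iris == 2).astype(int)

## scale the features so the penalty treats them comparably
X_scaled = StandardScaler().fit_transform(X_iris)

coef_norms = {}
for C in [100, 1, 0.01]:
    ## the default penalty is ridge, so only C is varied here
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_scaled, y_virginica)
    coef_norms[C] = np.linalg.norm(model.coef_)

print(coef_norms)
```

Smaller values of `C` should give a smaller coefficient norm, which matches the description of `C` as an inverse regularization strength.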


Load in the iris data set below. Then make a new column `virginica` that says whether an observation is of the virginica class. Attempt to perform feature selection using lasso logistic regression.

<i>Do not forget to scale the data prior to fitting the regularized model.</i>

In [2]:
## to get the iris data
from sklearn.datasets import load_iris

## Load the data
iris = load_iris()
iris_df = pd.DataFrame(iris['data'],columns = ['sepal_length','sepal_width','petal_length','petal_width'])
iris_df['iris_class'] = iris['target']

## import train_test_split
from sklearn.model_selection import train_test_split

## Making the split
iris_train, iris_test = train_test_split(iris_df.copy(), 
                                            random_state=431,
                                            shuffle=True,
                                            test_size=.2,
                                            stratify=iris_df['iris_class'])

In [3]:
iris_train['virginica'] = 1*(iris_train['iris_class'] == 2)

##### Sample Solution

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [5]:
scale = StandardScaler()

iris_train_scale = scale.fit_transform(iris_train[['sepal_length','sepal_width','petal_length','petal_width']])


Cs = [100,10,1,.1,.01,.001,.0001,.00001,.000001,.0000001]

coefs = np.zeros((len(Cs), 4))

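for i, C in enumerate(Cs):
    ## lasso penalty, liblinear is a solver that supports 'l1'
    log_reg = LogisticRegression(penalty='l1',C=C,solver='liblinear')
    log_reg.fit(iris_train_scale, iris_train['virginica'])

    ## store the four fitted coefficients for this value of C
    coefs[i,:] = log_reg.coef_[0]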

In [6]:
pd.DataFrame(coefs, 
             index=["C="+str(C) for C in Cs], 
             columns = ['sepal_length','sepal_width','petal_length','petal_width']).round(6)

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| C=100 | -8.575021 | -7.91157 | 49.931031 | 18.482984 |
| C=10 | -1.987029 | -2.022275 | 13.044011 | 6.3045 |
| C=1 | 0.0 | -0.635549 | 3.090743 | 3.599688 |
| C=0.1 | 0.0 | 0.0 | 0.0 | 1.506548 |
| C=0.01 | 0.0 | 0.0 | 0.0 | 0.0 |
| C=0.001 | 0.0 | 0.0 | 0.0 | 0.0 |
| C=0.0001 | 0.0 | 0.0 | 0.0 | 0.0 |
| C=1e-05 | 0.0 | 0.0 | 0.0 | 0.0 |
| C=1e-06 | 0.0 | 0.0 | 0.0 | 0.0 |
| C=1e-07 | 0.0 | 0.0 | 0.0 | 0.0 |


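As `C` shrinks (stronger regularization), `sepal_length` is the first coefficient driven to zero (at `C=1`), `sepal_width` and `petal_length` are both zero by `C=0.1`, and `petal_width` is the last to vanish (at `C=0.01`). So it appears that `petal_width` is the most important feature. At `C=1` the `petal_length` coefficient is much larger in magnitude than the `sepal_width` coefficient, which suggests `petal_length` comes next, then `sepal_width`. The least important seems to be `sepal_length`.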

##### 3. Interpreting coefficients for categorical features

While we did not build a logistic regression model using categorical features in our lecture notebook, this can be done. Just like we built $K-1$ dummy variables for a feature with $K$ possible categories, we do the same for logistic regression. So if we wanted to use a feature with $2$ possible categories, we would need a single dummy variable. If we wanted to use a feature with $5$ possible categories, we would need $4$ dummy variables.
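
One way to build the $K-1$ dummy variables is `pandas`' `get_dummies` with `drop_first=True`. The sketch below uses a hypothetical `color` feature with $5$ categories; the data frame and its values are invented for illustration.

```python
import pandas as pd

## hypothetical categorical feature with K = 5 categories
df_demo = pd.DataFrame({"color": ["red", "blue", "green", "black", "white", "red"]})

## drop_first=True drops one category (the first in sorted order),
## which then serves as the baseline category
dummies = pd.get_dummies(df_demo["color"], prefix="color", drop_first=True)

print(dummies.columns.tolist())
```

Here the result has $4$ dummy columns, and `black` (first alphabetically) is the category without its own column.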

While the process for adding a categorical feature to a logistic regression model is the same, the way the coefficient estimate for such a feature is interpreted is slightly different.

In order to help explain the interpretation let's fit a model together.

First generate the data below.

In [7]:
np.random.seed(135135)
X = np.zeros((200,2))
y = np.zeros(200)
X[:,0] = 10*np.random.random(200)

X[:101,1] = 0
X[101:,1] = 1

y[X[:,0] > 7] = 1
y[:101][(X[:101,0] > 3) & (X[:101,0] <=7)] = np.random.binomial(1, .9, np.sum((X[:101,0] > 3) & (X[:101,0] <=7)))
y[101:][(X[101:,0] > 3) & (X[101:,0] <=7)] = np.random.binomial(1, .1, np.sum((X[101:,0] > 3) & (X[101:,0] <=7)))

This problem has two features: a continuous feature, $X_1$, and a binary feature, $X_2$. Now use `LogisticRegression` to fit the model regressing `y` on `X`. <i>Here we will not need to make dummy variables because $X_2$ is already binary</i>.

$$
P(y=1|X) = \frac{1}{1+e^{-\left( \beta_0 + \beta_1 X_1 + \beta_2 X_2 \right)}}
$$

##### Sample Solution

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
log_reg = LogisticRegression()

log_reg.fit(X, y)

LogisticRegression()

Now look at the coefficient estimates.

In [10]:
log_reg.coef_

array([[ 1.0201619 , -2.90714949]])

Recall that the logistic regression model is a linear model for the $\log$-odds that $y=1$,

$$
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2, \text{ or } \text{Odds}|X = C e^{\beta_1 X_1 + \beta_2 X_2}.
$$

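Here $C = e^{\beta_0}$ is a constant that does not depend on $X$ (it is unrelated to the `C` argument of `LogisticRegression`). We can interpret the coefficient on the binary variable $X_2$ by making a comparison to a baseline case, say $X_2=0$.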

$$
\frac{\text{Odds}|X_1 = X_1^*, X_2 = 1}{\text{Odds}|X_1 = X_1^*, X_2 = 0} = \frac{C e^{\beta_1 X_1^* + \beta_2 (1)}}{C e^{\beta_1 X_1^* + \beta_2 (0)}} = e^{\beta_2},
$$

and so we can interpret $\beta_2$ by saying that:

<br>

<center>
    The odds that $y=1$ when $X_2=1$ are $e^{\beta_2}$ times the odds that $y=1$ when $X_2=0$ holding all other variables equal.
</center>

This interpretation is sometimes called the <i>odds ratio</i> of $X_2 = 1$ to $X_2=0$.
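
The identity above can be checked numerically. The following sketch simulates its own data (the sample size, the coefficients used to simulate, and the helper `odds` are all assumptions for illustration), fits a `LogisticRegression`, and compares the ratio of predicted odds at a fixed $X_1^*$ with $e^{\beta_2}$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

## made-up data: one continuous feature and one binary feature
n = 400
x1 = rng.uniform(0, 10, size=n)
x2 = rng.integers(0, 2, size=n)
true_log_odds = -3 + 0.8 * x1 - 2.0 * x2
y_sim = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))

model = LogisticRegression().fit(np.column_stack([x1, x2]), y_sim)

def odds(model, x1_value, x2_value):
    """Predicted odds that y=1 at the given feature values."""
    p = model.predict_proba([[x1_value, x2_value]])[0, 1]
    return p / (1 - p)

## compare X2 = 1 to X2 = 0 while holding X1 fixed
x1_star = 5.0
odds_ratio = odds(model, x1_star, 1) / odds(model, x1_star, 0)

print(odds_ratio, np.exp(model.coef_[0, 1]))
```

The two printed numbers should match up to floating point error, and the same holds for any other choice of `x1_star`.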

Interpret the estimate of $\beta_2$ for the model we just fit.

In [11]:
print("The odds that y=1 when X2=1 are",
         np.exp(log_reg.coef_[0][1]),
         "times the odds that y=1 when X2=0",
         "holding X1 constant.")

The odds that y=1 when X2=1 are 0.054631235141658414 times the odds that y=1 when X2=0 holding X1 constant.


This is the same way you would interpret the $K-1$ coefficients for a feature with $K$ possible categories, where the reference category is the one for which you did not make a dummy variable.

##### 4. Multiclass logistic regression (multinomial regression)

While we formulated logistic regression for binary classification, multiclass classification is also possible.

Suppose we have $m$ features stored in a variable $X$ that we would like to use to predict a variable $y$ that takes on values $1, 2, \dots , K$. 

The multinomial logistic regression model regressing $y$ on $X$ is:

$$
P(y=k | X=X^*) = \frac{\exp(X^*\beta^{(k)})}{1+\sum_{l=1}^{K-1} \exp(X^*\beta^{(l)}) }, \text{ for } k = 1,\dots,K-1, \text{ and}
$$

$$
P(y=K | X=X^*) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(X^*\beta^{(l)}) },
$$

where the $\beta^{(l)}$ are class specific coefficient vectors.

This is similar to when we have a categorical input variable, as we can see:

$$
\log \left( \frac{P(y=k| X=X^*)}{P(y=K| X=X^*)} \right) = X^* \beta^{(k)}.
$$
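
The formulas above can be sketched directly in `numpy`. The function name `multinomial_probs`, the observation `x_star`, and the coefficient values are hypothetical, chosen only to illustrate the case $K=3$ with class $K$ as the reference.

```python
import numpy as np

def multinomial_probs(x, betas):
    """
    x: feature vector of length m
    betas: array of shape (K-1, m), one coefficient vector per non-reference class
    Returns a length-K array of probabilities with the reference class K last.
    """
    scores = np.exp(betas @ x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)

## made-up observation, the first entry plays the role of an intercept term
x_star = np.array([1.0, 0.5, -1.2])

## made-up coefficients: beta^(1) and beta^(2), class K = 3 is the reference
betas_demo = np.array([[0.2, 1.0, -0.5],
                       [-0.3, 0.4, 0.8]])

probs = multinomial_probs(x_star, betas_demo)

## log of P(y=k) / P(y=K) for k = 1, 2
log_ratios = np.log(probs[:-1] / probs[-1])

print(probs, log_ratios)
```

The probabilities sum to $1$, and `log_ratios` reproduces `betas_demo @ x_star`, which is the log-ratio identity in the last display.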


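It is possible to fit the multinomial logistic regression model with `sklearn`'s `LogisticRegression`. This can be done by setting the `multi_class` argument to `'multinomial'` when creating the `LogisticRegression` model object. (Newer `sklearn` releases deprecate this argument; there the default solver already fits the multinomial model when the target has more than two classes.)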

Do so to fit a multinomial logistic regression model predicting the iris class. What is the training accuracy of this model?

##### Sample Solution

In [12]:
multi = LogisticRegression(multi_class='multinomial', max_iter=500)

multi.fit(iris_train[['sepal_length','sepal_width','petal_length','petal_width']], iris_train['iris_class'])

LogisticRegression(max_iter=500, multi_class='multinomial')

In [13]:
from sklearn.metrics import accuracy_score

In [14]:
accuracy_score(iris_train['iris_class'], 
               multi.predict(iris_train[['sepal_length','sepal_width','petal_length','petal_width']]))

0.975

##### 5. Generalized linear models (GLMs)

<i>This is not an exercise, just read the following.</i>

Let's review the two types of regression models we've discussed.

#### Linear Regression

For a continuous target, $y$, and a features matrix, $X$, we had:
$$
E(y|X) = X\beta.
$$

#### Logistic Regression

For a binary target, $y$, and a feature matrix, $X$, we had:

$$
\log\left( \frac{P(y=1|X)}{1-P(y=1|X)} \right) = X\beta.
$$

Note that for a binary $0$-$1$ variable $P(y=1|X) = E(y|X)$, so in reality we had:

$$
\log\left( \frac{E(y|X)}{1-E(y|X)} \right) = X\beta.
$$

#### Notice Anything?

In both cases we could write the following:

$$
g(E(y|X)) = X\beta,
$$

where we made a specific choice for the functional form of $g$ depending on the data type of $y$. This is the idea behind generalized linear models.
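
As a small illustration of the two choices of $g$, the sketch below defines the identity function (linear regression) and the logit function (logistic regression) and checks that applying each to the corresponding mean recovers $X\beta$. The function names, design matrix, and coefficients are made up for this example.

```python
import numpy as np

def identity_link(mu):
    """g for linear regression: g(mu) = mu."""
    return mu

def logit_link(mu):
    """g for logistic regression: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1 - mu))

def inverse_logit(eta):
    """Inverse of the logit: maps X beta back to a mean in (0, 1)."""
    return 1 / (1 + np.exp(-eta))

## made-up design matrix (intercept column plus one feature) and coefficients
X_demo = np.array([[1.0, -1.0],
                   [1.0,  0.0],
                   [1.0,  2.0]])
beta_demo = np.array([0.5, 1.5])
eta = X_demo @ beta_demo

mean_linear = eta                   # linear regression: E(y|X) = X beta
mean_logistic = inverse_logit(eta)  # logistic regression: E(y|X) = P(y=1|X)

print(identity_link(mean_linear), logit_link(mean_logistic))
```

Both printed arrays equal `eta`, so in each model $g(E(y|X)) = X\beta$ with a different $g$.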

### Three Components

Given features, $X$, and target, $y$, a generalized linear model relating $y$ to $X$ is composed of three components. 

##### I.  Random Component

This is where you assume a probability distribution for $y|X$. It is typically assumed that the distribution of $y|X$ comes from the <i>exponential family</i>, <a href="https://en.wikipedia.org/wiki/Exponential_family">https://en.wikipedia.org/wiki/Exponential_family</a>.

##### II. Systematic Component

This is where you relate the parameters $\beta$ to the features $X$. In a generalized linear model the systematic component is always $X\beta$.

##### III. Link Component

This is the connection between the random and systematic components: a link function $g$ applied to the mean, $E(y|X)$, of the random component.

Combining all three of these components gives the following:

$$
g(E(y|X)) = X\beta.
$$

We will not do anything else with generalized linear models in this program or in Python. However, as you continue on in your own data science work it may be useful to be familiar with the generalized linear model setup. For those interested in learning more, I encourage you to check out the following resources:

<a href="http://www.stat.cmu.edu/~ryantibs/advmethods/notes/glm.pdf">http://www.stat.cmu.edu/~ryantibs/advmethods/notes/glm.pdf</a>

<a href="http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf">http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf</a>

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).