# LOGISTIC REGRESSION

## AGENDA
1. Refresh your memory on how to do linear regression in scikit-learn
2. Attempt to use linear regression for classification
3. Show you why logistic regression is a better alternative for classification
4. Brief overview of probability, odds, e, log, and log-odds
5. Explain the form of logistic regression
6. Compare logistic regression with other models

## PART 1: PREDICTING A CATEGORICAL RESPONSE

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
# glass identification dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass['assorted'] = glass.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})
glass.head()

## PART 2: USING LOGISTIC REGRESSION

In [None]:
# Logistic regression can do what we just did:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
logreg.fit(X, y)
assorted_pred_class = logreg.predict(X)

In [None]:
LogisticRegression?

In [None]:
glass.shape

In [None]:
# print the class predictions
assorted_pred_class

In [None]:
glass.head()

In [None]:
# add predicted class to DataFrame
glass['assorted_pred_class'] = assorted_pred_class

In [None]:
glass.head()

In [None]:
# sort DataFrame by al
glass.sort_values('al', inplace=True)

In [None]:
# plot the class predictions again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_class, color='red')

### What if we wanted the **predicted probabilities** instead of just the **class predictions**, to understand how confident we are in a given prediction?

In [None]:
logreg.predict_proba(X)

In [None]:
# store the predicted probabilities of class 1
assorted_pred_prob = logreg.predict_proba(X)[:, 1]

In [None]:
assorted_pred_prob

In [None]:
# align by index: glass was re-sorted by 'al' after X was created
glass['assorted_pred_prob'] = pd.Series(assorted_pred_prob, index=X.index)

In [None]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_prob, color='red')

In [None]:
# examine some example predictions
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))

#### What does this output mean?
* The first column indicates the predicted probability of **class 0**, and the second column indicates the predicted probability of **class 1**.

## PART 3: PROBABILITY, ODDS, e, LOG, LOG-ODDS



## LOGISTIC REGRESSION ANALYSIS: UNDERSTANDING ODDS AND PROBABILITY

Probability and odds measure the same thing: **the likelihood of a specific outcome**.

In casual usage, people treat the terms odds and probability as interchangeable. That is unfortunate, because the two are not equivalent and mixing them up creates confusion.

They measure the same thing on different scales. Imagine how confusing it would be if people used degrees Celsius and degrees Fahrenheit interchangeably. “It’s going to be 35 degrees today” could really make you dress the wrong way.

In measuring the likelihood of any outcome, we need to know two things: how many times something happened and how many times it could have happened, or equivalently, how many times it didn’t. The outcome of interest is called a success, whether it’s a good outcome or not.

The other outcome is a failure. Each time one of the outcomes could occur is called a trial. Since each trial must end in success or failure, the number of successes and the number of failures add up to the total number of trials.

Probability is the number of times success occurred compared to the total number of trials.

Odds are the number of times success occurred compared to the number of times failure occurred.

For example, to predict the likelihood of accidents at a particular intersection, each car that goes through the intersection is considered a trial. Each trial has one of two outcomes: accident or safe passage. If the outcome we’re most interested in modeling is an accident, that is a success (no matter how morbid it sounds).

**Probability(success)** = number of successes/total number of trials

**Odds(success)** = number of successes/number of failures

Odds are often written as:

`Number of successes:1 failure`

which is read as the number of successes for every 1 failure. But often the :1 is dropped.

You will see a lot of researchers get stuck when learning logistic regression because they are not used to thinking of likelihood on an odds scale.

- Equal odds are 1. 1 success for every 1 failure. 1:1
- Equal probabilities are .5. 1 success for every 2 trials.


- Odds can range from 0 to infinity. 
- Odds greater than 1 indicate success is more likely than failure. 
- Odds less than 1 indicate failure is more likely than success.


- Probability can range from 0 to 1. 
- Probability greater than .5 indicates success is more likely than failure. 
- Probability less than .5 indicates failure is more likely than success.


#### THE EXAMPLE: 

In the last month, data from a particular intersection indicate that of the 1,354 cars that drove through it, 72 got into an accident.

- 72 Successes = Accident
- 1282 Failures = Safe Passage (1,354 – 72)


- Failures = Total – Successes


- Pr(Accident) = 72/1354 = .053
- Pr(Safe Passage) = 1282/1354 = .947


- Odds(Accident) = 72/1282 = .056
- Odds(Safe Passage) = 1282/72 = 17.81


Now get out your calculator to see how these relate to each other.

- Odds(Accident) = Pr(Accident)/Pr(Safe Passage) = .053/.947 = .056
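
The arithmetic for the intersection example can be checked in a few lines of Python. This is a minimal sketch that uses only the counts quoted in the example; the variable names are illustrative.

```python
# counts from the intersection example
trials = 1354
successes = 72                    # accidents
failures = trials - successes     # safe passages

pr_accident = successes / trials
pr_safe = failures / trials
odds_accident = successes / failures
odds_safe = failures / successes

print(round(pr_accident, 3), round(pr_safe, 3))      # 0.053 0.947
print(round(odds_accident, 3), round(odds_safe, 2))  # 0.056 17.81

# the odds equal the ratio of the two probabilities
print(abs(odds_accident - pr_accident / pr_safe) < 1e-12)  # True
```

Dividing the two probabilities cancels the shared denominator (the number of trials), which is why the result matches successes divided by failures.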

# $$probability = \frac {one\ outcome} {all\ outcomes}$$
# $$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

#### Examples:
- Dice roll of 1: probability = 1/6, odds = 1/5
- Even dice roll: probability = 3/6, odds = 3/3 = 1
- Dice roll less than 5: probability = 4/6, odds = 4/2 = 2

# $$odds = \frac {probability} {1 - probability}$$
# $$probability = \frac {odds} {1 + odds}$$
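
The two conversion formulas are inverses of each other. The sketch below checks this on the dice examples, using exact fractions from the standard library; the helper names `prob_to_odds` and `odds_to_prob` are made up for illustration.

```python
from fractions import Fraction

def prob_to_odds(p):
    """odds = p / (1 - p); undefined when p == 1."""
    return p / (1 - p)

def odds_to_prob(odds):
    """probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# dice examples, kept as exact fractions
examples = [("roll of 1", Fraction(1, 6)),
            ("even roll", Fraction(3, 6)),
            ("roll less than 5", Fraction(4, 6))]

for label, p in examples:
    odds = prob_to_odds(p)
    # converting back recovers the original probability
    assert odds_to_prob(odds) == p
    print(label, "probability =", p, "odds =", odds)
```

The odds come out as 1/5, 1, and 2. Note that `Fraction` prints probabilities in lowest terms, so 3/6 appears as 1/2 and 4/6 as 2/3.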

In [None]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table

#### What is **e**? It is the base rate of growth shared by all continually growing processes:

In [None]:
# exponential function: e^1
np.exp(1)

#### What is a **(natural) log**? It gives you the time needed to reach a certain level of growth:

In [None]:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)

* It is also the **inverse** of the exponential function:

In [None]:
np.log(np.exp(1))

In [None]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
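
One useful property of the log-odds scale is its symmetry around zero: a probability of 0.5 maps to 0, and complementary probabilities map to values of equal size and opposite sign. A small sketch, with a hypothetical helper named `logit`:

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))

# compare each probability with its complement
for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(p, round(logit(p), 3), round(logit(1 - p), 3))
```

Odds are squeezed between 0 and 1 when failure is more likely and stretch from 1 to infinity when success is more likely. Taking the log removes that asymmetry.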

## PART 4: WHAT IS LOGISTIC REGRESSION?

### **Linear regression:** continuous response is modeled as a linear combination of the features:
# $$y = \beta_0 + \beta_1x$$

### **Logistic regression:** log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:
# $$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$

* The left-hand side, the log of the odds, is called the **logit function**.

#### Probability is sometimes written as pi:
# $$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$

#### The equation can be rearranged into the **logistic function**:
# $$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
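
For readers who want the intermediate steps, the rearrangement can be worked out as follows. It is ordinary algebra and uses only the symbols already defined: exponentiate both sides, then solve for $\pi$.

$$\begin{aligned}
\log \left({\pi\over 1-\pi}\right) &= \beta_0 + \beta_1x \\
{\pi\over 1-\pi} &= e^{\beta_0 + \beta_1x} \\
\pi &= (1-\pi)\,e^{\beta_0 + \beta_1x} \\
\pi\left(1 + e^{\beta_0 + \beta_1x}\right) &= e^{\beta_0 + \beta_1x} \\
\pi &= \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}
\end{aligned}$$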

#### In other words:
- Logistic regression outputs the **probabilities of a specific class**
- Those probabilities can be converted into **class predictions**

#### The **logistic function** has some nice properties:
- Takes on an "s" shape
- Output is bounded by 0 and 1
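
These properties can be seen numerically. The sketch below uses hypothetical coefficients (`beta_0 = -4`, `beta_1 = 2`, chosen only for illustration, not fitted to the glass data) and an assumed cutoff of 0.5 to turn probabilities into class predictions.

```python
import math

def logistic(z):
    """Map any real number z (the log-odds) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))   # algebraically equal to e^z / (1 + e^z)

def logit(p):
    """Inverse of logistic: probability back to log-odds."""
    return math.log(p / (1 - p))

# hypothetical coefficients, for illustration only
beta_0, beta_1 = -4.0, 2.0

for x in [0.0, 1.0, 1.5, 2.5, 3.0, 4.0]:
    p = logistic(beta_0 + beta_1 * x)
    predicted_class = int(p > 0.5)   # assumed 0.5 cutoff
    print(x, round(p, 3), predicted_class)
```

The probabilities rise slowly at first, quickly near the middle, and slowly again near 1, which is the "s" shape. This simple version of `logistic` can overflow for very large negative `z`, so it is meant for illustration rather than production use.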

#### Notes:
- **Multinomial logistic regression** is used when there are more than 2 classes.
- Coefficients are estimated using **maximum likelihood estimation**, meaning that we choose parameters that maximize the likelihood of the observed data.
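
To make maximum likelihood estimation concrete, here is a rough sketch that fits a one-feature logistic model by plain gradient ascent on the log-likelihood. The dataset is made up, and gradient ascent is swapped in for readability; scikit-learn relies on more sophisticated solvers.

```python
import math

# tiny made-up dataset (not the glass data): one feature, binary response
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

def predict_prob(b0, b1, x):
    """Logistic function applied to the linear combination b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1):
    """Sum of log P(observed y | x) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_prob(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

# gradient ascent: step the coefficients uphill on the log-likelihood
b0, b1 = 0.0, 0.0
learning_rate = 0.05
for _ in range(20000):
    grad_b0 = sum(y - predict_prob(b0, b1, x) for x, y in zip(xs, ys))
    grad_b1 = sum((y - predict_prob(b0, b1, x)) * x for x, y in zip(xs, ys))
    b0 += learning_rate * grad_b0
    b1 += learning_rate * grad_b1

print("start: ", round(log_likelihood(0.0, 0.0), 3))
print("fitted:", round(log_likelihood(b0, b1), 3))
```

The fitted coefficients give a higher log-likelihood than the starting point of all zeros, which is what "choosing parameters that maximize the likelihood of the observed data" means in practice.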

## PART 5: COMPARING LOGISTIC REGRESSION WITH OTHER MODELS

### Advantages of logistic regression:
- Highly interpretable (if you remember how)
- Model training and prediction are fast
- No tuning is required (excluding regularization)
- Features don't need scaling (unless regularization is applied)
- Can perform well with a small number of observations
- Outputs well-calibrated predicted probabilities

### Disadvantages of logistic regression:
- Presumes a linear relationship between the features and the log-odds of the response
- Performance is (generally) not competitive with the best supervised learning methods
- Sensitive to irrelevant features
- Can't automatically learn feature interactions

# REFERENCES

1. [Explaining Logistic Regression](http://www.theanalysisfactor.com/explaining-logistic-regression/)
2. [Why use Odds Ratios](http://www.theanalysisfactor.com/why-use-odds-ratios/)
3. [The Intuitive Guide to Exponential Functions & e](https://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/)
4. [Demystifying the Natural Logarithm](https://betterexplained.com/articles/demystifying-the-natural-logarithm-ln/)