<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Beyond-Linear-Regression" data-toc-modified-id="Beyond-Linear-Regression-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Beyond Linear Regression</a></span></li><li><span><a href="#Predicting-a-Categorical-Response" data-toc-modified-id="Predicting-a-Categorical-Response-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Predicting a Categorical Response</a></span><ul class="toc-item"><li><span><a href="#Preparing-Data" data-toc-modified-id="Preparing-Data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Preparing Data</a></span></li><li><span><a href="#Using-a-Regression-Line" data-toc-modified-id="Using-a-Regression-Line-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Using a Regression Line</a></span></li><li><span><a href="#Interpreting-Our-Predictions" data-toc-modified-id="Interpreting-Our-Predictions-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Interpreting Our Predictions</a></span></li></ul></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Fitting-a-Line-to-$Logit(target)$" data-toc-modified-id="Fitting-a-Line-to-$Logit(target)$-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Fitting a Line to $Logit(target)$</a></span><ul class="toc-item"><li><span><a href="#sklearn.linear_model.LogisticRegression()" data-toc-modified-id="sklearn.linear_model.LogisticRegression()-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span><code>sklearn.linear_model.LogisticRegression()</code></a></span></li><li><span><a href="#.predict()-vs.-.predict_proba()" data-toc-modified-id=".predict()-vs.-.predict_proba()-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span><code>.predict()</code> vs. 
<code>.predict_proba()</code></a></span></li></ul></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For our modeling steps
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import log_loss

# For demonstrative purposes
from scipy.special import logit, expit

# Objectives

- Describe conceptually the need to move beyond linear regression
- Explain the relationship between probability and odds
- Explain the form of logistic regression

# Beyond Linear Regression

**Classification techniques** are an essential part of machine learning and data mining applications, and a large share of practical data science problems are classification problems. Many classification algorithms are available, but logistic regression is exceedingly common and useful. We shall focus on binary classification problems, to which logistic regression most immediately applies. Multi-class problems, where the target variable has more than two classes, can be handled by extending logistic regression. The iris dataset is a famous example of multi-class classification.

**Logistic Regression** is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement and can serve as a baseline for any binary classification problem. Its core concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.

Linear regression has [limitations](https://en.wikipedia.org/wiki/Generalized_linear_model).

As it stands, the algorithm could generate a prediction *anywhere on the real number line*. This *may* be realistic, like if I'm predicting national surpluses/debts.

But what if I'm predicting values of a variable that doesn't take, say, negative values, like temperature in Kelvin?

What if I'm predicting values of a variable that takes only integer values, like the number of mouseclicks on my killer ds blog per minute?

What if I'm predicting probabilities? Or something Boolean / Bernoullian?

What if the shape of my errors changes as a function of the dependent variable?

Am I stuck using linear regression? There's got to be a better way!

# Predicting a Categorical Response

Here we have a dataset about glass. Information [here](https://archive.ics.uci.edu/ml/datasets/glass+identification).

## Preparing Data

In [None]:
# glass identification dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values('al', inplace=True)
glass.head()

In [None]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()

Let's change our task, so that we're predicting **household** using **al**. Let's visualize the relationship to figure out how to do this:

In [None]:
fig, ax = plt.subplots()
ax.scatter(glass.al, glass.household)
ax.set_xlabel('al')
ax.set_ylabel('household')
ax.set_title('Type of Glass as a Function of Aluminum Content');

## Using a Regression Line

Let's draw a **regression line**, like we did before:

In [None]:
# fit a linear regression model and store the predictions

linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)

In [None]:
# scatter plot that includes the regression line

fig, ax = plt.subplots()
ax.scatter(glass.al, glass.household)
ax.plot(glass.al, glass.household_pred, color='red')
ax.set_xlabel('al')
ax.set_ylabel('household');

> What are some issues with this graph?

## Interpreting Our Predictions

If **al=3**, what class do we predict for household? 

If **al=1.5**, what class do we predict for household? 

We predict the 0 class for **lower** values of al, and the 1 class for **higher** values of al. What's our cutoff value? Around **al=2**, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.

Therefore, we'll say that if **household_pred >= 0.5**, we predict a class of **1**, else we predict a class of **0**.
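
The cutoff can also be computed directly rather than read off the plot. Here is a minimal sketch, assuming a small synthetic dataset (the values of `x` and `y` below are made up for illustration and are not the glass data): the fitted line $\hat{y} = \beta_0 + \beta_1 x$ crosses 0.5 at $x = (0.5 - \beta_0) / \beta_1$, and thresholding the predictions at 0.5 yields class labels.

```python
import numpy as np

# Hypothetical toy data: one feature and a 0/1 label (not the glass data)
x = np.array([0.5, 1.0, 1.2, 1.5, 1.8, 2.2, 2.5, 3.0])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])

# Least-squares line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# The line crosses 0.5 at this value of x
cutoff = (0.5 - b0) / b1

# Threshold the continuous predictions to get class labels
pred_class = np.where(y_hat >= 0.5, 1, 0)

print(f"cutoff: {cutoff:.2f}")
print(pred_class)
```

Because the slope is positive here, every point to the right of `cutoff` is assigned class 1 and every point to the left is assigned class 0.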

# Logistic Regression

Logistic regression can do what we just did.

The strategy now is to *generalize* the notion of linear regression; linear regression as we've known it will become a special case. In particular, we'll keep the idea of the regression best-fit line, but now **we'll allow the model to make predictions through some (non-trivial) transformation of the linear predictor**.

Let's say we've constructed our best-fit line, i.e. our linear predictor, $\hat{L} = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$.

Consider the following transformation: <br/>
$\large\hat{y} = \Large\frac{1}{1 + e^{-\hat{L}}} \large= \Large\frac{1}{1 + e^{-(\beta_0 + ... + \beta_nx_n)}}$. This is called the **sigmoid function**.

We're imagining that $\hat{L}$ can take any values between $-\infty$ and $\infty$.

$\large\rightarrow$ But what values can $\hat{y}$ take? What does this function even look like?

In [None]:
# Let's plot this function here:

X = np.linspace(-10, 10, 300)
Y = 1 / (1 + np.exp(-X))

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(X, Y, 'r');
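
A few properties of this curve can also be checked numerically. This is a small sketch; the helper name `sigmoid` is ours and not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Map a linear predictor z to a value strictly between 0 and 1."""
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 301)
s = sigmoid(z)

print(s.min(), s.max())                  # both strictly inside (0, 1)
print(sigmoid(0))                        # midpoint of the curve
print(np.allclose(sigmoid(-z), 1 - s))   # symmetry about the point (0, 0.5)
print(np.all(np.diff(s) > 0))            # strictly increasing on this grid
```

So however large or small the linear predictor gets, the transformed output can be read as a probability.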

# Fitting a Line to $Logit(target)$

Let's try applying the logit function to our target and then fitting a linear regression to that. The logit function, $\text{logit}(p) = \ln\frac{p}{1-p}$, is the inverse of the sigmoid: it maps a probability to its log-odds. Since the model will be trained not on whether the glass is household but rather on *the logit of this label*, its predictions will also be on the logit scale. But we can simply apply the sigmoid function to the model's output to get its predictions of whether the glass is household.

We can't use the target as is, because the logit of 1 is $\infty$ and the logit of 0 is $-\infty$.
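
To see where those infinities come from, recall that the odds of an event with probability $p$ are $p / (1 - p)$, and the logit is the natural log of the odds. Here is a brief sketch in plain NumPy (the helper names `odds`, `log_odds`, and `sigmoid` are illustrative):

```python
import numpy as np

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def log_odds(p):
    """Logit: the natural log of the odds."""
    return np.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(odds(p))               # p = 0.5 corresponds to even odds of 1
print(log_odds(p))           # p = 0.5 corresponds to log-odds of 0
print(sigmoid(log_odds(p)))  # the round trip recovers p

# As p approaches 0 or 1, the log-odds grow without bound
print(log_odds(np.array([1e-9, 1 - 1e-9])))
```

Probabilities very close to 0 or 1 give large but finite log-odds, which is why nudging the labels slightly away from 0 and 1 makes the logit usable.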

In [None]:
glass['household'].unique()

In [None]:
logit(glass['household']).unique()

So we'll make a small adjustment:

In [None]:
target_approx = np.where(glass['household'] == 0, 1e-9, 1-1e-9)

In [None]:
line_to_logit = LinearRegression()

X = glass[['al']]
y = logit(target_approx)

line_to_logit.fit(X, y)

In [None]:
fig, ax = plt.subplots()

final_preds = expit(line_to_logit.predict(X))
ax.scatter(X, glass['household'])
ax.plot(X, final_preds, 'm');

## `sklearn.linear_model.LogisticRegression()`

In general, we should scale our data when using this class. Scaling matters for any model that includes regularization, because the penalty treats all coefficients alike regardless of each feature's units, and scikit-learn's `LogisticRegression()` objects apply regularization by default.

Here we've forgone the scaling since we only have a single predictor.
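
With several predictors on different scales, one common pattern is to put the scaler and the model in a single pipeline, so that the scaling learned on the training data is reused at prediction time. The sketch below assumes synthetic data (`X_demo` and `y_demo` are generated here purely for illustration) and uses `StandardScaler`, which is one reasonable choice among several.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(0, 1, 200),       # small-scale feature
    rng.normal(0, 1000, 200),    # large-scale feature
])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# The scaler is fit on the training data only, then applied before the model
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
```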

In [None]:
# fit a logistic regression model and store the class predictions

logreg = LogisticRegression(random_state=42)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)

In [None]:
# plot the class predictions

fig, ax = plt.subplots()
ax.scatter(glass.al, glass.household)
ax.plot(glass.al, glass.household_pred_class, color='red')
ax.set_xlabel('al')
ax.set_ylabel('household');

## `.predict()` vs. `.predict_proba()`

Let's check out some specific examples to make predictions with. We'll use both `predict()` and `predict_proba()`.

In [None]:
glass.al

In [None]:
# examine some example predictions

# Select rows by id label as one-row DataFrames, so the input keeps
# the 2-D shape and the column name the model was fit on
for row_id in [22, 185, 164]:
    print(logreg.predict(glass.loc[[row_id], feature_cols]))
print('\n')
for row_id in [22, 185, 164]:
    print(logreg.predict_proba(glass.loc[[row_id], feature_cols])[0])
first_row = glass.loc[[22], feature_cols]

In [None]:
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]

In [None]:
# plot the predicted probabilities
fig, ax = plt.subplots()
ax.scatter(glass.al, glass.household)
ax.plot(glass.al, glass.household_pred_prob, color='red')
ax.set_xlabel('al')
ax.set_ylabel('household');

In the output of `predict_proba()`, the first column indicates the predicted probability of **class 0**, and the second column indicates the predicted probability of **class 1**.
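
The two methods are closely linked: for a binary model, the class-1 probability is the sigmoid of the linear predictor, and `predict()` returns class 1 when that probability exceeds 0.5. Here is a minimal sketch on synthetic data (the arrays `x_demo` and `y_demo` are made up for illustration and are not the glass data):

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Synthetic single-feature data with a noisy class boundary
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 4, size=(100, 1))
y_demo = (x_demo[:, 0] + rng.normal(0, 0.7, 100) > 2).astype(int)

clf = LogisticRegression().fit(x_demo, y_demo)

proba = clf.predict_proba(x_demo)   # columns follow clf.classes_
labels = clf.predict(x_demo)

# Rebuild the class-1 probability from the fitted coefficients
linear_part = clf.intercept_[0] + clf.coef_[0, 0] * x_demo[:, 0]
manual_prob = expit(linear_part)

print(np.allclose(manual_prob, proba[:, 1]))
print(np.array_equal(labels, (proba[:, 1] > 0.5).astype(int)))
```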

In [None]:
log_loss(glass.household, logreg.predict_proba(X)[:, 1])

The above is a pretty good score. For comparison, consider a baseline classifier fit on data with equal numbers of data points in the two target classes: it should be right about 50% of the time, and if it always predicts a probability of 0.5, its log loss is $-\ln(0.5) \approx 0.693$. Lower log loss is better.

In [None]:
-np.log(0.5)
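
For reference, log loss is the average negative log of the probability the model assigned to the true class. The sketch below computes it by hand on made-up labels and probabilities (the arrays are illustrative, not model output, and the helper `binary_log_loss` is ours) and confirms that a constant prediction of 0.5 gives $\ln 2 \approx 0.693$ whatever the labels are.

```python
import numpy as np

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood for binary labels and class-1 probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([0, 0, 1, 1])

# Always predicting 0.5 gives -ln(0.5), no matter what the labels are
baseline = binary_log_loss(y_true, np.full(4, 0.5))

# Confident, correct probabilities score lower (better)
confident = binary_log_loss(y_true, np.array([0.1, 0.2, 0.8, 0.9]))

print(baseline, confident)
```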