# Logistic Regression
- This notebook provides the theoretical background for logistic regression and builds a logistic regression model from scratch.

# Sources
 - These notes are largely taken from the following sources:
     - [UCLA: How Do I Interpret Odds Ratios in Logistic Regression](https://www.stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/)
     - [Win Vector: Logistic Regression Derivation](https://win-vector.com/2011/09/14/the-simpler-derivation-of-logistic-regression/)
     - [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)
     

# Background

# Model

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [2]:
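# load data, train-test-split, standardize independent variables
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2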
boston = load_boston()
X = boston.data
y = boston.target
col_names = boston.feature_names

# convert from numpy array to pandas DF so it's easier to follow the column names
X = pd.DataFrame(X, columns=col_names)
y = pd.DataFrame(y, columns=['target'])
df = pd.concat([X, y], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    df[col_names], df.target, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Example 1: Logistic regression with no predictor variables
In an equation, we are modelling $logit(p) = B_0$  
where $logit(p) = ln \dfrac{p}{1-p}$, the natural log of the odds.
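
The logit is the natural log of the odds $p / (1-p)$, and its inverse (the logistic, or sigmoid, function) maps log odds back to a probability. A minimal sketch of both follows; the helper names `logit` and `inv_logit` are illustrative and not part of the notebook.

```python
import numpy as np

def logit(p):
    """Log odds: ln(p / (1 - p)), defined for 0 < p < 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps log odds back to a probability."""
    z = np.asarray(z, dtype=float)
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)            # negative below p = 0.5, zero at 0.5, positive above
p_back = inv_logit(z)   # recovers the original probabilities
print(z, p_back)
```

Log odds are symmetric around $p = 0.5$: a probability of 0.1 and a probability of 0.9 give log odds of equal size and opposite sign.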

In [10]:
# Create DF
df2 = pd.DataFrame(np.array(y), columns = ['y'])
df2['intercept'] = 1
df2['price_above_mean'] = df2.y > df2.y.mean()

# Model
model = LogisticRegression(fit_intercept=False, penalty='none')
logit_result = model.fit(pd.DataFrame(df2['intercept']),df2['price_above_mean'])
print(f"{logit_result.coef_[0][0]:,.4}", "<<< coefficient of intercept")

# Manual calculation
pct_above_mean = df2.price_above_mean.sum() / len(df2) # this is p in logit(p)
log_odds = np.log(pct_above_mean / (1-pct_above_mean)) # this is logit(p) = ln(p / (1-p))
print(f"{log_odds:,.4}", "<<<  matches the number above")

-0.3514 <<< coefficient of intercept
-0.3514 <<<  matches the number above


### Takeaway:

The intercept can be interpreted as the log odds of the target occurring. More specifically, with no predictors in the model the fitted intercept is simply $ln \dfrac{p}{1-p}$, where $p$ is the proportion of houses whose price is above the mean house price.
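
The same number can be recovered without scikit-learn. The sketch below fits the intercept-only model by Newton's method on the log-likelihood, using the class counts for this dataset (209 of 506 prices above the mean, which is 189 + 20 from the CHAS group counts in Example 2). The function name `fit_intercept_only` and the iteration count are illustrative choices, not the solver scikit-learn uses.

```python
import numpy as np

def fit_intercept_only(n_pos, n_total, n_iter=25):
    """Maximize the Bernoulli log-likelihood for logit(p) = b0 via Newton's method."""
    b0 = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-b0))           # current predicted probability
        gradient = n_pos - n_total * p      # d(log-likelihood) / d(b0)
        curvature = n_total * p * (1 - p)   # negative second derivative
        b0 += gradient / curvature          # Newton step
    return b0

# 209 of the 506 house prices are above the mean
b0 = fit_intercept_only(n_pos=209, n_total=506)
closed_form = np.log(209 / (506 - 209))
print(f"{b0:.4f} (Newton) vs {closed_form:.4f} (closed form)")
```

The gradient is zero exactly when the predicted probability equals the observed proportion, which is why the fitted intercept coincides with the log odds of that proportion.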

## Example 2: Logistic regression with one dichotomous predictor variable
In an equation, we are modelling the following:  
$$
logit(p) = B_0 + B_1 * CHAS
$$

In [40]:
# Create DF
df2 = pd.DataFrame(np.array(y), columns = ['y'])
df2['CHAS'] = df.CHAS
df2['intercept'] = 1
df2['price_above_mean'] = df2.y > df2.y.mean()

# Model
model = LogisticRegression(fit_intercept=False, penalty='none')
logit_result = model.fit(pd.DataFrame(df2[['intercept','CHAS']]),df2.price_above_mean)

# print(f"{logit_result.intercept_[0]:,.4}", "<<< coefficient of intercept")
print(f"{logit_result.coef_[0][0]:,.4}", "<<< coefficient of B_0")
print(f"{logit_result.coef_[0][1]:,.4}", "<<< coefficient of B_1")

-0.4002 <<< coefficient of B_0
0.6878 <<< coefficient of B_1


In [56]:
# Use this output to calculate B_0 and B_1
df2.groupby('CHAS')['price_above_mean'].value_counts()

CHAS  price_above_mean
0.0   False               282
      True                189
1.0   True                 20
      False                15
Name: price_above_mean, dtype: int64

In [55]:
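# Calculate B_0: the log odds of price_above_mean within the CHAS = 0 group
chas_0_odds = (189 / (282+189)) / (282 / (282+189))
B_0 = np.log(chas_0_odds)

# Calculate B_1: the log of the odds ratio (CHAS = 1 odds over CHAS = 0 odds)
chas_1_odds = (20 / (20+15)) / (15 / (20+15))
B_1 = np.log(chas_1_odds / chas_0_odds)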

print(f"{B_0:,.4}", "<<< matches B_0")
print(f"{B_1:,.4}", "<<< matches B_1")

-0.4002 <<< matches B_0
0.6878 <<< matches B_1


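### Takeaway:

$B_0$ is the log odds of a house price being above the mean when CHAS = 0 (the tract does not bound the river). $B_1$ is the log of the odds ratio between the two groups: the odds for CHAS = 1 divided by the odds for CHAS = 0. Equivalently, $B_0 + B_1$ is the log odds when CHAS = 1.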
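
A quick numerical check of how the coefficients map back to the data: applying the inverse logit to $B_0$ and to $B_0 + B_1$ should return the observed share of above-mean prices in each CHAS group, and $e^{B_1}$ should equal the odds ratio between the groups. The sketch uses only the counts printed above; the variable names and the `inv_logit` helper are illustrative.

```python
import numpy as np

# Counts from the groupby output: (above mean, not above mean) per CHAS group
above_0, below_0 = 189, 282   # CHAS = 0
above_1, below_1 = 20, 15     # CHAS = 1

B_0 = np.log(above_0 / below_0)                          # log odds for CHAS = 0
B_1 = np.log((above_1 / below_1) / (above_0 / below_0))  # log odds ratio

def inv_logit(z):
    """Map log odds back to a probability."""
    return 1 / (1 + np.exp(-z))

p_chas_0 = inv_logit(B_0)         # share above mean when CHAS = 0
p_chas_1 = inv_logit(B_0 + B_1)   # share above mean when CHAS = 1
odds_ratio = np.exp(B_1)          # odds multiplier for CHAS = 1 vs CHAS = 0

print(f"{p_chas_0:.4f} {p_chas_1:.4f} {odds_ratio:.4f}")
```

An odds ratio above 1 means that tracts bounding the river have higher odds of an above-mean price than tracts that do not.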

## Example 3: Logistic regression with a single continuous predictor variable
In an equation, we are modelling the following:  
$$
logit(p) = B_0 + B_1 * RM
$$

In [59]:
# Create DF
df2 = pd.DataFrame(np.array(y), columns = ['y'])
df2['RM'] = df.RM
df2['intercept'] = 1
df2['price_above_mean'] = df2.y > df2.y.mean()

# Model
model = LogisticRegression(fit_intercept=False, penalty='none')
logit_result = model.fit(pd.DataFrame(df2[['intercept','RM']]),df2.price_above_mean)

# print(f"{logit_result.intercept_[0]:,.4}", "<<< coefficient of intercept")
print(f"{logit_result.coef_[0][0]:,.4}", "<<< coefficient of B_0")
print(f"{logit_result.coef_[0][1]:,.4}", "<<< coefficient of B_1")


-16.74 <<< coefficient of B_0
2.597 <<< coefficient of B_1


In [73]:
df2['rm_above_mean'] = df2.RM > df2.RM.mean()

In [78]:
# Use this output to calculate B_0 and B_1
df2.groupby('rm_above_mean')['price_above_mean'].value_counts()

rm_above_mean  price_above_mean
False          False               229
               True                 49
True           True                160
               False                68
Name: price_above_mean, dtype: int64

In [81]:
np.log((229 / (229+49)) / (49 / (229+49)))

1.541901705443613

In [83]:
np.exp(-16.74)

5.3692097844149524e-08

In [84]:
(160 / (160+68)) / (68 / (160+68))

2.3529411764705883

In [85]:
np.log(2.352)

0.8552660300363805

In [86]:
np.exp(2.597)

13.423407347176434
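
For a continuous predictor, $B_1$ is the change in log odds per one-unit increase in RM, so $e^{B_1}$ (the value computed in the last cell) is the factor by which the odds are multiplied for each additional room. Unlike Example 2, the counts obtained by splitting RM at its mean do not reproduce the coefficients, because the model uses the actual RM values rather than two groups. The sketch below illustrates the per-unit interpretation with the rounded coefficients printed above; the helpers `predicted_prob` and `odds` and the sample RM values are assumptions chosen for illustration.

```python
import numpy as np

B_0, B_1 = -16.74, 2.597  # rounded coefficients from the fitted model above

def predicted_prob(rm):
    """P(price above mean) under logit(p) = B_0 + B_1 * RM."""
    return 1 / (1 + np.exp(-(B_0 + B_1 * rm)))

def odds(rm):
    """Odds p / (1 - p) at a given RM value."""
    p = predicted_prob(rm)
    return p / (1 - p)

# The odds ratio for one extra room is the same wherever it is measured
ratio_low = odds(6.0) / odds(5.0)
ratio_high = odds(7.0) / odds(6.0)
print(ratio_low, ratio_high, np.exp(B_1))

# RM value where the log odds are zero, i.e. the predicted probability is 0.5
rm_at_half = -B_0 / B_1
print(rm_at_half, predicted_prob(rm_at_half))
```

With these rounded coefficients the predicted probability crosses 0.5 at roughly 6.45 rooms. The very small value of $e^{B_0}$ is the odds at RM = 0, which lies far outside the observed data and has no practical interpretation.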

# Miscellaneous

## Feature Definitions
- **CRIM**: per capita crime rate by town
- **ZN**: proportion of residential land zoned for lots over 25,000 sq.ft.
- **INDUS**: proportion of non-retail business acres per town.
- **CHAS**: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- **NOX**: nitric oxides concentration (parts per 10 million)
- **RM**: average number of rooms per dwelling
- **AGE**: proportion of owner-occupied units built prior to 1940
- **DIS**: weighted distances to five Boston employment centres
- **RAD**: index of accessibility to radial highways
- **TAX**: full-value property-tax rate per \$10,000
- **PTRATIO**:  pupil-teacher ratio by town  
- **B**: 1000(B_k - 0.63)^2 where B_k is the proportion of blacks by town
- **LSTAT**: % lower status of the population
- **MEDV**: median value of owner-occupied homes in \$1000's