## The Wonderful World of ML - Session 4 Assignment: Logistic Regression & Regularization

### Problem 1 - Extend Logistic Regression Model From Session 3

In the solutions notebook for session 3, we ran a logistic regression on 2016 Broncos data.  We found that when we added the **Home** variable, the coefficient on this variable was not significantly different from 0, so we left it out of our model.

In this problem, we're going to do a little feature engineering and add a variable called **YrdsDiff** which is computed by taking the difference of offensive yards minus defensive yards.  Try the following:

1) Read in the revised dataset that includes offensive and defensive stats.  Treat all integer variables as continuous, rebuild the logistic regression model using just the Broncos score, and call this model **logRegBroncos1**.

2) Create a new column in the dataframe called **YrdsDiff** and populate that column with the difference of offensive yards minus defensive yards.  Build a new model using **DenWin** and the new **YrdsDiff** variable and call this model **logRegBroncos3**.

3) Is the coefficient for the new **YrdsDiff** variable significantly different from 0 to justify adding it to our final model?  Why / Why not?

4) What does the difference in the AIC values for **logRegBroncos1** and **logRegBroncos3** suggest?

In [1]:
import numpy as np, pandas as pd
#import scikit??? as ???
data_path = "https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/broncos2016.csv"
broncos_data = pd.read_csv(data_path)
broncos_data.head()

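| Unnamed: 0 | Date | Week | DenScore | OppScore | DenWin | Home | Off1stDwns | OffPassYrds | OffRushYrds | Def1stDwns | DefPassYrds | DefRushYrds | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/08/2016 | 1 | 21 | 20 | 1 | 1 | 21 | 159 | 148 | 21 | 176 | 157 | |
| 1 | 09/18/2016 | 2 | 34 | 20 | 1 | 1 | 24 | 266 | 134 | 19 | 170 | 83 | |
| 2 | 09/25/2016 | 3 | 29 | 17 | 1 | 0 | 21 | 303 | 52 | 20 | 189 | 143 | |
| 3 | 10/02/2016 | 4 | 27 | 7 | 1 | 0 | 22 | 218 | 89 | 15 | 143 | 72 | |
| 4 | 10/09/2016 | 5 | 16 | 23 | 0 | 1 | 18 | 183 | 84 | 19 | 250 | 122 | |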
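
As a rough illustration of the **YrdsDiff** feature from step 2, the sketch below rebuilds only the five rows shown in the `head()` output above and computes the difference. It assumes that "offensive yards" means `OffPassYrds + OffRushYrds` and "defensive yards" means `DefPassYrds + DefRushYrds`; the assignment does not spell this out, so treat that definition as an assumption.

```python
import pandas as pd

# The five rows shown in the head() output, limited to the columns needed here.
sample = pd.DataFrame({
    "DenWin":      [1, 1, 1, 1, 0],
    "OffPassYrds": [159, 266, 303, 218, 183],
    "OffRushYrds": [148, 134, 52, 89, 84],
    "DefPassYrds": [176, 170, 189, 143, 250],
    "DefRushYrds": [157, 83, 143, 72, 122],
})

# Assumption: total yards on each side of the ball = passing + rushing yards.
off_yards = sample["OffPassYrds"] + sample["OffRushYrds"]
def_yards = sample["DefPassYrds"] + sample["DefRushYrds"]
sample["YrdsDiff"] = off_yards - def_yards

print(sample[["DenWin", "YrdsDiff"]])
```

The same two lines of arithmetic should work unchanged on the full `broncos_data` frame, provided the column names match those shown above.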


**TODO**: Implement building these models by hand and research how to:

1. Build logistic regression models with a Python 3.x compatible package (e.g. sklearn, statsmodels, et al.)
2. Do F-tests between 2 nested models.  [My post on Stack Overflow](https://stackoverflow.com/questions/45243802/how-do-i-do-an-f-test-to-compare-nested-linear-models-in-python) recommends using the **feature_selection.f_regression** function, but I need to spend some time on this because it didn't look very intuitive on my initial read through.
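
One possible "by hand" starting point for the TODO above is sketched here using only NumPy. It fits logistic regression with Newton's method, then compares two nested models with AIC and a likelihood-ratio test. Note the substitution: a likelihood-ratio (deviance) test is used in place of an F-test, since that is the usual nested-model comparison for logistic regression. The data are synthetic and the function names (`fit_logistic`, `log_likelihood`, `aic`) are made up for this sketch.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method; X must include an intercept column."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted probabilities
        grad = X.T @ (y - p)                          # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # negative Hessian of the log-likelihood
        w = w + np.linalg.solve(hess, grad)
    return w

def log_likelihood(X, y, w):
    z = X @ w
    # log L = sum(y*z - log(1 + e^z)); logaddexp keeps this numerically stable
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def aic(X, y, w):
    # AIC = 2k - 2 log L, where k is the number of fitted coefficients
    return 2 * X.shape[1] - 2 * log_likelihood(X, y, w)

# Synthetic data (hypothetical, NOT the Broncos data)
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x1 + 1.5 * x2)))
y = (rng.random(n) < p_true).astype(float)

X_small = np.column_stack([np.ones(n), x1])       # reduced model
X_big = np.column_stack([np.ones(n), x1, x2])     # adds one predictor
w_small, w_big = fit_logistic(X_small, y), fit_logistic(X_big, y)

# Likelihood-ratio statistic; chi-square with 1 degree of freedom under H0
lr_stat = max(0.0, 2 * (log_likelihood(X_big, y, w_big) - log_likelihood(X_small, y, w_small)))
p_value = math.erfc(math.sqrt(lr_stat / 2))       # chi-square survival function for 1 df

print("AIC small:", aic(X_small, y, w_small), "AIC big:", aic(X_big, y, w_big))
print("LR statistic:", lr_stat, "p-value:", p_value)
```

A lower AIC for the larger model, together with a small p-value, would suggest the extra predictor earns its place. The `erfc` shortcut only holds when the models differ by exactly one coefficient.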

### Problem 2 - Add an L2 Weight Penalty

In session 3, we discussed how we could apply shrinkage methods to regression models to lower our risk of overfitting.  Specifically, we discussed the effects of applying L1 (lasso) and L2 (ridge) penalties.  What we did not discuss was that you can also apply them to classification.  The basic motivation for applying these methods to classification is the same as for regression: we want to minimize overfitting.

These 2 videos (in the repo) from the [**Machine Learning: Classification** class on Coursera](https://www.coursera.org/learn/ml-classification/home/welcome) (requires a Coursera account) do a really nice job of describing why and how to apply an L2 weight penalty to a logistic regression model:

+ [Penalizing large coefficients to mitigate overfitting](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/010%20-%20Penalizing%20large%20coefficients%20to%20mitigate%20overfitting.mp4)
+ [L2 Regularized Logistic Regression](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/011%20-%20L2%20regularized%20logistic%20regression.mp4)

Let's see how an L2 weight penalty can help us.  Try this task:

1) Read in this cleaned and truncated version of the [Titanic data set](https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv) and split the data into 3 partitions: 400 training samples, 293 test samples, and 200 validation samples.  Use the code provided below to get started.

2) Fit a logistic regression model starting with **Pclass** then adding **Age** and **Sex** if they are significantly non-zero and improve RSS significantly (F-test).

3) Use the training and validation set to determine which of these 10 values of L2 weight penalty $\lambda$ minimizes the negative log-likelihood cost function (same as maximizing the log-likelihood).  Use the following algorithm:

In [2]:
titanic_data_path = "https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv"
data_all = pd.read_csv(titanic_data_path, sep=',', header=0)
df_no_nans = data_all.dropna()  # remove all rows with 1 or more NaN vals
df_no_nans.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 9 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S |
| 10 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |


**TODO**: Implement by hand and/or with a Python 3.x package (e.g. sklearn or statsmodels)
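
A by-hand version of step 3 might look like the sketch below: split the rows 400/293/200, fit an L2-penalized logistic regression on the training set for each candidate $\lambda$, and keep the value with the lowest negative log-likelihood on the validation set. Everything specific here is an assumption made for illustration: the data are synthetic stand-ins rather than the Titanic file, the 10-value grid from `np.logspace` is hypothetical because the assignment's own values are not listed, the penalty is written as $\lambda \lVert w \rVert^2$ with the intercept left unpenalized, and the helper names are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(X, y, w):
    """Negative log-likelihood of logistic regression (no penalty term)."""
    z = X @ w
    return float(np.sum(np.logaddexp(0.0, z) - y * z))

def fit_l2_logistic(X, y, lam, n_iter=50):
    """Minimize nll + lam * ||w[1:]||^2 by Newton's method (intercept unpenalized)."""
    d = X.shape[1]
    pen = 2.0 * lam * np.eye(d)   # second derivative of lam * w_j^2 is 2 * lam
    pen[0, 0] = 0.0               # do not shrink the intercept
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + pen @ w
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + pen
        w -= np.linalg.solve(hess, grad)
    return w

# Synthetic stand-in data with 893 rows (hypothetical, NOT the Titanic file)
rng = np.random.default_rng(42)
n = 893
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
w_true = np.array([-0.5, 1.0, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

# Shuffle once, then carve out 400 training, 293 test, and 200 validation rows
idx = rng.permutation(n)
train, test, valid = idx[:400], idx[400:693], idx[693:]

lambdas = np.logspace(-3, 3, 10)   # hypothetical grid of 10 penalty values
fits = [fit_l2_logistic(X[train], y[train], lam) for lam in lambdas]
valid_nll = [nll(X[valid], y[valid], w) for w in fits]

best = int(np.argmin(valid_nll))
best_lam, best_w = lambdas[best], fits[best]
print("best lambda:", best_lam)
print("test NLL at best lambda:", nll(X[test], y[test], best_w))
```

The test set is touched only once, after $\lambda$ has been chosen, so the reported test NLL is not biased by the selection step.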

$$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\Yv}{\mathbf{Y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\betav}{\mathbf{\beta}}
\newcommand{\gv}{\mathbf{g}}
\newcommand{\Hv}{\mathbf{H}}
\newcommand{\dv}{\mathbf{d}}
\newcommand{\Vv}{\mathbf{V}}
\newcommand{\vv}{\mathbf{v}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\Sv}{\mathbf{S}}
\newcommand{\Zv}{\mathbf{Z}}
\newcommand{\Norm}{\mathcal{N}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}}
\newcommand{\grad}{\mathbf{\nabla}}
\newcommand{\ebx}[1]{e^{\wv_{#1}^T \xv_n}}
\newcommand{\eby}[1]{e^{y_{n,#1}}}
\newcommand{\Tiv}{\mathbf{Ti}}
\newcommand{\Fv}{\mathbf{F}}
\newcommand{\ones}[1]{\mathbf{1}_{#1}}
$$

## The Wonderful World of ML - Session 4 Discussion: Linear & Quadratic Discriminant Analysis

In logistic regression, we fit a function directly to the probability of a certain class given the data, $P(C_n=k\,|\, \xv_n)$, where $C_n=\text{class of } n^\text{th}\text{ sample}$.  In LDA and QDA, we do something a little different.  We start by modeling the likelihood of the data within each class, $P(\xv_n\,|\, C=k)$, and then use Bayes' Theorem (shown below) to compute the probability of a certain class given the data, $P(C_n=k\,|\, \xv_n)$.

(1)$$
	P(C=k|x) = \frac{P(x|C=k) P(C=k)}{P(x)}
$$

This seems a little cumbersome, doesn't it?  Can you think of any reasons why we would want to do this?

Well, if we think about this a little, we might realize that we can typically observe how the data are distributed within a class (sometimes referred to as "the likelihood of the data"), which is the $P(\xv_n\,|\, C=k)$ term.  We might then conclude that it is natural to use these distributions within each class to infer the actual class from the data.

In order to make use of (1), we need to make an important assumption about the probability distribution of a data sample from each class to define $P(\xv_n\,|\, C=k)$.  A common assumption is that this distribution is Normal with mean $\mu_k$ and covariance matrix $\Sigma_k$.  This assumption allows $P(\xv_n\,|\, C=k)$ to be expressed using the standard form of the $p$-dimensional Gaussian distribution, as shown in equation (2).

(2)$$
P(x|C=k) = \frac{1}{(2\pi)^{\frac{p}{2}} |\Sigma_k|^{\frac{1}{2}}}
e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}
$$
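
To make equation (2) concrete, here is a small NumPy sketch that evaluates the density at a point. The function name `gaussian_density` and the example inputs are made up for illustration.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the p-dimensional Gaussian density of equation (2) at a single point x."""
    p = mu.shape[0]
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * maha) / norm_const)

# At the mean of a standard 2-D Gaussian the density equals 1 / (2 * pi)
mu = np.zeros(2)
sigma = np.eye(2)
density_at_mean = gaussian_density(mu, mu, sigma)
print(density_at_mean)
```

Moving `x` away from `mu` can only shrink the value, because the exponent in (2) is never positive.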

To classify $x$ as being from Class 1 in a two-class discrimination problem, we must check to see if  $P(C=1 | x) > P(C=2 | x)$ is true.  Rewriting each side of the inequality using Bayes' Theorem we get equation (3).

(3)$$
P(x|C=1) P(C=1) / P(x) > P(x|C=2)P(C=2)/P(x)
$$

Since $P(x)$ is positive, it can be removed from each side.  Because the logarithm is monotonically increasing, taking the log of both sides preserves the inequality, and since we have defined $P(x|C=k)$ to be a Normal distribution involving an exponential, doing so simplifies both sides, as shown in equations (4) and (5).

(4)$$
\log( P(x|C=1) P(C=1)) > \log( P(x|C=2) P(C=2))
$$

(5)$$
\log( P(x|C=1)) + \log( P(C=1)) > \log( P(x|C=2)) + \log( P(C=2))
$$

If we substitute equation (2) into (5) and simplify, dropping the constant $-\frac{p}{2}\log(2\pi)$ that appears on both sides, the left side of the inequality becomes (6) and the right side becomes (7).

(6)$$
-\frac{1}{2} \log |\Sigma_1| -\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + \log P(C=1)
$$

(7)$$
-\frac{1}{2} \log |\Sigma_2| -\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) + \log P(C=2)
$$

If we define each side of this inequality as a discriminant function, $\delta(x)$ for
Class 1 or 2, then, in general

(8)$$
\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| -\frac{1}{2}(x-\mu_k)^T
\Sigma_k^{-1} (x-\mu_k) + \log P(C=k)
$$

and the class of a new sample $x$ is $\operatorname{argmax}_k \, \delta_k(x)$.  Notice that the boundary between Class 1 and Class 2 is the set of points $x$ for which $\delta_1(x) = \delta_2(x)$.  Substituting in the definitions of these discriminant functions, we see that this equation is quadratic in $x$, meaning that the boundary between Class 1 and 2 is quadratic.  We have just defined **Quadratic Discriminant Analysis**, or **QDA**.

In order to apply QDA to given sets of data samples $X_1, X_2, \ldots, X_K$ from Classes $1, 2, \dots, K$, we must compute the likelihood and prior terms, $P(x|C=k)$ and $P(C=k)$ respectively, using (9), (10), and (11):

(9)$$
\mu_k = \frac{1}{N_k} \sum_{x \in X_k} x
$$

(10)$$
\Sigma_k = \frac{1}{N_k-1} \sum_{x\in X_k} (x-\mu_k) (x-\mu_k)^T
$$

(11)$$
P(C=k) = \frac{N_k}{N}
$$

where $N_k$ is the number of samples in $X_k$ and $N$ is the number of all samples.
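
Equations (8) through (11) translate fairly directly into NumPy. The sketch below is one way to do it, assuming synthetic 2-D data and invented function names (`fit_qda`, `qda_discriminant`, `predict_qda`). It is meant as an illustration rather than a reference implementation.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate mean, covariance, and prior per class using equations (9), (10), (11)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)                             # eq (9)
        sigma = np.atleast_2d(np.cov(Xk, rowvar=False))  # eq (10), divides by N_k - 1
        prior = Xk.shape[0] / X.shape[0]                 # eq (11)
        params[k] = (mu, sigma, prior)
    return params

def qda_discriminant(X, mu, sigma, prior):
    """Equation (8) evaluated for every row of X."""
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * maha + np.log(prior)

def predict_qda(X, params):
    classes = list(params)
    scores = np.column_stack([qda_discriminant(X, *params[k]) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# Synthetic two-class data with different covariance matrices (hypothetical)
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=150)
X2 = rng.multivariate_normal([4.0, 4.0], [[2.0, 0.5], [0.5, 1.0]], size=100)
X = np.vstack([X1, X2])
y = np.array([1] * 150 + [2] * 100)

params = fit_qda(X, y)
accuracy = float(np.mean(predict_qda(X, params) == y))
print("training accuracy:", accuracy)
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow when the determinant is very large or very small.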

### Linear Discriminant Analysis (LDA)

LDA is derived in the same way as QDA, with the additional assumption that the covariance matrices of all classes are equal, $\Sigma_k = \Sigma$.  Under this assumption, the $-\frac{1}{2}\log|\Sigma|$ and $-\frac{1}{2}x^T \Sigma^{-1} x$ terms are the same on both sides of the inequality and cancel, so the terms in (6) and (7) simplify as shown in equation (12).

(12)$$
x^T \Sigma^{-1} \mu_1 - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \log P(C=1) >
x^T \Sigma^{-1} \mu_2 - \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \log P(C=2)
$$

which results in the new discriminant defined in equation (13) which is the same as equation (4.19) in the ESL.

(13)$$
\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log P(C=k)
$$

where the shared covariance matrix, effectively a pooled average of the per-class covariances, is defined by (14).

(14)$$
\Sigma = \frac{1}{N-K} \sum_{k=1}^K \sum_{x\in X_k}(x-\mu_k)(x-\mu_k)^T
$$

Like QDA, the maximum $\delta_k(x)$ for a given $x$ determines the class.
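
A matching NumPy sketch for LDA, using the pooled covariance of equation (14) and the linear discriminant of equation (13), could look like this. The data are synthetic and the function names (`fit_lda`, `lda_scores`) are invented for the example.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, priors, and the pooled covariance of equation (14)."""
    classes = np.unique(y)
    N, d = X.shape
    means, priors = {}, {}
    scatter = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        means[k] = Xk.mean(axis=0)
        priors[k] = Xk.shape[0] / N
        diff = Xk - means[k]
        scatter += diff.T @ diff
    sigma = scatter / (N - len(classes))   # eq (14)
    return classes, means, priors, sigma

def lda_scores(X, classes, means, priors, sigma):
    """Equation (13) for every row of X, one column per class."""
    sigma_inv = np.linalg.inv(sigma)
    return np.column_stack([
        X @ sigma_inv @ means[k]
        - 0.5 * means[k] @ sigma_inv @ means[k]
        + np.log(priors[k])
        for k in classes
    ])

# Synthetic two-class data sharing one covariance matrix (hypothetical)
rng = np.random.default_rng(2)
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=100),
    rng.multivariate_normal([3.0, 3.0], cov, size=100),
])
y = np.array([1] * 100 + [2] * 100)

classes, means, priors, sigma = fit_lda(X, y)
pred = classes[np.argmax(lda_scores(X, classes, means, priors, sigma), axis=1)]
accuracy = float(np.mean(pred == y))
print("training accuracy:", accuracy)
```

With equal priors, the midpoint between the two class means scores the same for both classes, which is one quick way to check that the boundary is the expected hyperplane.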

Let's implement QDA and LDA on some 1-D and 2-D synthetic data for illustration purposes.  After that, we'll take them for a spin on [an interesting real data set](http://archive.ics.uci.edu/ml/datasets/heart+Disease).  Let's start with QDA on 1-D data.

### 1-D QDA & LDA Visualizations On Synthetic Data

#### QDA On 1-D Data

In [3]:
# TODO
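
Until this cell is filled in, the following sketch may help show why the QDA boundary is quadratic even in one dimension. It specializes equation (8) to 1-D, writes $\delta_1(x) - \delta_2(x)$ as $ax^2 + bx + c$, and solves for the two boundary points. The class means, variances, and priors are hypothetical values assumed known here rather than estimated from data.

```python
import numpy as np

# Hypothetical class parameters (assumed known, not estimated)
mu1, var1, prior1 = 0.0, 1.0, 0.5
mu2, var2, prior2 = 3.0, 4.0, 0.5

def delta(x, mu, var, prior):
    """Equation (8) specialized to one dimension."""
    return -0.5 * np.log(var) - 0.5 * (x - mu) ** 2 / var + np.log(prior)

# Coefficients of delta_1(x) - delta_2(x) = a*x^2 + b*x + c
a = -0.5 / var1 + 0.5 / var2
b = mu1 / var1 - mu2 / var2
c = (-0.5 * mu1 ** 2 / var1 + 0.5 * mu2 ** 2 / var2
     - 0.5 * np.log(var1) + 0.5 * np.log(var2)
     + np.log(prior1) - np.log(prior2))

# The decision boundaries are the roots of the quadratic
boundaries = np.sort(np.roots([a, b, c]).real)
print("decision boundaries:", boundaries)

# Classify a grid of points by the larger discriminant
grid = np.linspace(-6.0, 8.0, 15)
labels = np.where(delta(grid, mu1, var1, prior1) > delta(grid, mu2, var2, prior2), 1, 2)
print(dict(zip(grid.tolist(), labels.tolist())))
```

Because the class with the larger variance has heavier tails, it wins on both sides of the narrow class, which is why two boundary points appear. If the variances were equal, `a` would be zero and the boundary would collapse to the single point that LDA gives.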

### Applying LDA and QDA on Cleveland Heart Data (CHD)

In this section, we'll take our LDA and QDA models out for a spin on some real data.  The plots below show the results of the 2-dimensional analysis of the CHD using LDA and QDA.  First, we need to do a little cleanup of the data.  The last column (14) holds the targets, designated 0-4.  As [described in the documentation](https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/cleveland_heart_disease_data_description.txt), 0 indicates no presence of heart disease, while values from 1 to 4 indicate varying degrees of heart disease.  To simplify our analysis, we'll turn this into a binary classification problem and consider anything > 0 to indicate heart disease.

In the interest of brevity, I'll skip over many of the details of the EDA so we can focus on the classification itself.  Of the 13 predictors in the abbreviated dataset, the following 8 variables didn't appear to have any significant effect on heart disease (**dhd** target variable):  2. sex, 3. cp, 6. fbs, 7. restecg, 9. exang, 11. slope, 12. ca, 13. thal.  The number preceding the original variable name is the column in the abbreviated dataset (e.g. column 2 for *sex*).

A pairs plot of the remaining 5 predictors can be found [here](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/graphics/chd_pairs_plot.jpg) and the R code used to create this plot can be found [here](https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/R/ClevelandHeartPairsPlot.R).  From the last row of the pairs plot, the three factors that seemed to show some influence on **dhd** (renamed target variable, short for *detected heart disease*) were $\texttt{age}$, $\texttt{thalach}$ (abbreviated $\texttt{thlch}$), and $\texttt{oldpeak}$ (abbreviated $\texttt{oldpk}$).  The $\texttt{age}$ and $\texttt{oldpk}$ factors appeared to correlate positively with $\texttt{dhd}$, meaning that increases in these factors generally coincided with an increase in $\texttt{dhd}$.  The factor $\texttt{thlch}$ appeared to correlate negatively.  From these observations, the three most prominent factors were reduced to two factors as follows:

$$
x_1 = \frac{\texttt{age}}{\texttt{thlch}}
$$
and
$$
x_2 = \frac{\texttt{oldpk}+1}{\texttt{thlch}}
$$
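
As a quick illustration of these two engineered features, the sketch below computes $x_1$ and $x_2$ for a few made-up patients. The numbers are hypothetical and are not taken from the Cleveland data file.

```python
import numpy as np

# Hypothetical patient values for illustration only (NOT rows from the CHD file)
age = np.array([63.0, 41.0, 57.0])        # years
thlch = np.array([150.0, 172.0, 123.0])   # thalach
oldpk = np.array([2.3, 0.0, 1.2])         # oldpeak

x1 = age / thlch
x2 = (oldpk + 1.0) / thlch                # stays positive even when oldpk is 0

features = np.column_stack([x1, x2])      # one row per patient, columns x1 and x2
print(features)
```

Both ratios grow with the factors that appeared to correlate positively with $\texttt{dhd}$ and shrink as $\texttt{thlch}$ increases, so larger values of either feature point in the same direction.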

### LDA Results On Test Set

In [4]:
# TODO

### QDA Results On Test Set

In [5]:
# TODO