## The Wonderful World of ML - Session 4 Assignment: Logistic Regression & Regularization

### Problem 1 - Extend Logistic Regression Model From Session 3

In the solutions notebook for session 3, we ran a logistic regression on 2016 Broncos data.  We found that when we added the **Home** variable, the coefficient on this variable was not significantly different from 0, so we left it out of our model.

In this problem, we're going to do a little feature engineering and add a variable called **YrdsDiff**, computed as total offensive yards minus total defensive yards.  Try the following:

1) Read in the revised dataset that includes offensive and defensive stats.  Treat all integer variables as continuous and rebuild the logistic regression model using just the Broncos score.  Call this model **logRegBroncos**.

2) Create a new column in the dataframe called **YrdsDiff** and populate it with total offensive yards (passing plus rushing) minus total defensive yards.  Build a new model that predicts **DenWin** from **DenScore** and the new **YrdsDiff** variable, and call this model **logRegBroncos3**.

3) Is the coefficient for the new **YrdsDiff** variable significantly different from 0 to justify adding it to our final model?  Why / Why not?

4) What does the difference in AIC values between **logRegBroncos** and **logRegBroncos3** suggest?
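
As a small illustration of the feature engineering in step 2, the sketch below rebuilds **YrdsDiff** in base R for the first six games only, using the values that `head()` prints further down in this notebook. The data frame name `first_six` is made up for this example, and the full assignment should work from the complete CSV.

```r
# First six games of broncos2016.csv, typed in by hand so the example
# runs without a network connection (only the columns needed here).
first_six <- data.frame(
  DenWin      = c(1, 1, 1, 1, 0, 0),
  OffPassYrds = c(159, 266, 303, 218, 183, 220),
  OffRushYrds = c(148, 134, 52, 89, 84, 84),
  DefPassYrds = c(176, 170, 189, 143, 250, 166),
  DefRushYrds = c(157, 83, 143, 72, 122, 99)
)

# Total offensive yards minus total defensive yards, per game.
first_six$YrdsDiff <- with(
  first_six,
  (OffPassYrds + OffRushYrds) - (DefPassYrds + DefRushYrds)
)

print(first_six[, c("DenWin", "YrdsDiff")])
```

Even in this small slice the relationship is noisy: game 1 is a win with a negative yardage difference, and game 6 is a loss with a positive one.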

In [69]:
data_path <- "https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/broncos2016.csv"
broncos_data <- read.csv(data_path)
head(broncos_data)
logRegBroncos <- glm(broncos_data$DenWin ~ broncos_data$DenScore, family="binomial")
summary(logRegBroncos)

Date,Week,DenScore,OppScore,DenWin,Home,Off1stDwns,OffPassYrds,OffRushYrds,Def1stDwns,DefPassYrds,DefRushYrds,Notes
09/08/2016,1,21,20,1,1,21,159,148,21,176,157,
09/18/2016,2,34,20,1,1,24,266,134,19,170,83,
09/25/2016,3,29,17,1,0,21,303,52,20,189,143,
10/02/2016,4,27,7,1,0,22,218,89,15,143,72,
10/09/2016,5,16,23,0,1,18,183,84,19,250,122,
10/13/2016,6,13,21,0,0,16,220,84,16,166,99,



Call:
glm(formula = broncos_data$DenWin ~ broncos_data$DenScore, family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1947  -0.2699   0.2206   0.4761   1.2306  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)            -7.0906     3.7653  -1.883   0.0597 .
broncos_data$DenScore   0.3483     0.1702   2.047   0.0407 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.930  on 15  degrees of freedom
Residual deviance: 10.965  on 14  degrees of freedom
AIC: 14.965

Number of Fisher Scoring iterations: 6
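
To make the fitted coefficients more concrete, the sketch below plugs the estimates from the summary above into the logistic function. The helper `win_prob` is a name invented here, and the coefficients are copied (rounded) from the printed output rather than read from the model object.

```r
# Estimates copied from the summary output of logRegBroncos.
b0 <- -7.0906   # intercept
b1 <-  0.3483   # coefficient on DenScore

# Predicted probability of a Broncos win for a given Broncos score.
# plogis(x) is the logistic function 1 / (1 + exp(-x)).
win_prob <- function(den_score) plogis(b0 + b1 * den_score)

# Score at which the model gives a 50% win probability (log-odds = 0).
break_even <- -b0 / b1

print(round(break_even, 2))
print(round(win_prob(c(13, 21, 34)), 3))
```

Because the coefficient on **DenScore** is positive, the predicted win probability rises with every additional point scored, and the break-even score lands between 20 and 21 points.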


In [70]:
library(dplyr)

broncos_data <- mutate(broncos_data, Delta1stDwns = Off1stDwns - Def1stDwns)
broncos_data <- mutate(broncos_data, YrdsOffTotal = OffPassYrds + OffRushYrds)
broncos_data <- mutate(broncos_data, YrdsDefTotal = DefPassYrds + DefRushYrds)
broncos_data <- mutate(broncos_data, YrdsDiff = YrdsOffTotal - YrdsDefTotal)
#broncos_data[, c('DenWin', 'YrdsDiff')]
#broncos_data
logRegBroncos3 <- glm(broncos_data$DenWin ~ broncos_data$DenScore + broncos_data$YrdsDiff, family="binomial")
summary(logRegBroncos3)


Call:
glm(formula = broncos_data$DenWin ~ broncos_data$DenScore + broncos_data$YrdsDiff, 
    family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.49668  -0.07598   0.07638   0.35998   1.35019  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)           -15.94706   10.46565  -1.524    0.128
broncos_data$DenScore   0.74469    0.47232   1.577    0.115
broncos_data$YrdsDiff  -0.01815    0.01417  -1.281    0.200

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.9301  on 15  degrees of freedom
Residual deviance:  8.6284  on 13  degrees of freedom
AIC: 14.628

Number of Fisher Scoring iterations: 7
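
One way to approach questions 3 and 4 is to work directly from the numbers in the two summaries. The sketch below is an illustration, not the assignment's official solution. It rebuilds each AIC from the residual deviance (for a 0/1 response such as **DenWin**, AIC equals the residual deviance plus twice the number of estimated coefficients) and runs a likelihood-ratio test on the drop in deviance. The variable names are made up for this example.

```r
# Residual deviances and coefficient counts from the two summaries.
dev1 <- 10.965;  k1 <- 2   # logRegBroncos:  intercept + DenScore
dev3 <- 8.6284;  k3 <- 3   # logRegBroncos3: intercept + DenScore + YrdsDiff

# For ungrouped binary data, AIC = residual deviance + 2 * (number of coefficients).
aic1 <- dev1 + 2 * k1
aic3 <- dev3 + 2 * k3

# Likelihood-ratio test: the drop in deviance is compared with a
# chi-squared distribution with (k3 - k1) degrees of freedom.
lr_stat <- dev1 - dev3
lr_pval <- pchisq(lr_stat, df = k3 - k1, lower.tail = FALSE)

print(c(aic1 = aic1, aic3 = aic3, lr_stat = lr_stat, lr_pval = lr_pval))
```

The reconstructed AIC values agree with the printed ones (14.965 and 14.628), so the two models differ by only about 0.34 AIC units. That small gap, a likelihood-ratio p-value above 0.05, and the Wald p-value of 0.200 for **YrdsDiff** together give little evidence that the extra variable earns its place with only 16 games of data.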


### Problem 2 - Add an L2 Weight Penalty

In session 3, we discussed how shrinkage methods can lower the risk of overfitting.  Specifically, we discussed the effects of applying L1 (lasso) and L2 (ridge) penalties to regression models.  What we did not discuss is that the same penalties can be applied to classification models.  The motivation is the same as for regression: we want to reduce overfitting.

These two videos (in the repo) from the University of Washington's [**Machine Learning: Classification** class on Coursera](https://www.coursera.org/learn/ml-classification/home/welcome) (requires a Coursera account) do a nice job of describing why and how to apply an L2 weight penalty to a logistic regression model:

+ [Penalizing large coefficients to mitigate overfitting](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/010%20-%20Penalizing%20large%20coefficients%20to%20mitigate%20overfitting.mp4)
+ [L2 Regularized Logistic Regression](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/011%20-%20L2%20regularized%20logistic%20regression.mp4)
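
As a rough sketch of the idea in these videos, the code below writes out an L2-penalized negative log-likelihood and minimizes it with `optim`. Everything here is an assumption made for illustration: the data are simulated, the function names `penalized_nll`, `penalized_grad`, and `fit_l2_logistic` are invented, and the intercept is left unpenalized, which is a common convention but not something this assignment specifies.

```r
# Simulated data (not the Titanic set): two predictors, binary outcome.
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))            # first column is the intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1.5, -1)))

# Negative log-likelihood plus lambda * (sum of squared non-intercept weights).
penalized_nll <- function(w, X, y, lambda) {
  eta <- as.vector(X %*% w)
  # log(1 + exp(eta)) computed in a numerically stable way
  softplus <- pmax(eta, 0) + log1p(exp(-abs(eta)))
  sum(softplus - y * eta) + lambda * sum(w[-1]^2)
}

# Gradient of the cost above, passed to optim to help BFGS converge.
penalized_grad <- function(w, X, y, lambda) {
  p <- plogis(as.vector(X %*% w))
  as.vector(t(X) %*% (p - y)) + 2 * lambda * c(0, w[-1])
}

fit_l2_logistic <- function(X, y, lambda) {
  optim(rep(0, ncol(X)), penalized_nll, penalized_grad,
        X = X, y = y, lambda = lambda, method = "BFGS")$par
}

w_unpenalized <- fit_l2_logistic(X, y, lambda = 0)
w_penalized   <- fit_l2_logistic(X, y, lambda = 10)
print(rbind(w_unpenalized, w_penalized))
```

With `lambda = 0` the estimates should be close to what `glm(..., family = "binomial")` returns for the same data. Raising `lambda` pulls the non-intercept weights toward zero, which is the shrinkage effect the penalty is meant to produce.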

Let's see how an L2 weight penalty can help us.  Try this task:

1) Read in this cleaned and truncated version of the [Titanic data set](https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv).  A description of the data can be found in the [Data Dictionary section of this document](https://github.com/MichaelSzczepaniak/Jamatitanic/blob/master/data/titanic%20data%20dictionary.pdf).  After the read, split the data into 3 partitions: 400 training samples, 293 test samples, and 200 validation samples.  Use the code provided below to get started.

2) Using just the training set, fit a logistic regression model starting with **Pclass**, then adding **Age** and **Sex** if their coefficients are significantly different from zero and they improve AIC.  What does an improvement in AIC look like (increase or decrease)?  Why?

3) Use the training and validation sets to determine which of these 10 values of the L2 weight penalty $\lambda$ minimizes the negative log-likelihood cost function (same as maximizing the log-likelihood).  Use the following algorithm:



In [71]:
data_all <- read.csv("https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv", na.strings = c("NA", ""), stringsAsFactors = FALSE)
data_all <- data_all[, c('Survived', 'Pclass', 'Name', 'Sex', 'Age')]
head(data_all)

Survived,Pclass,Name,Sex,Age
0,3,"Braund, Mr. Owen Harris",male,22
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38
1,3,"Heikkinen, Miss. Laina",female,26
1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35
0,3,"Allen, Mr. William Henry",male,35
0,1,"McCarthy, Mr. Timothy J",male,54
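
The sketch below shows one plausible way to carry out steps 1 and 3: shuffle the rows, cut them into 400/293/200 partitions, fit an L2-penalized model on the training rows for each candidate $\lambda$, and keep the $\lambda$ with the lowest unpenalized negative log-likelihood on the validation rows. It isn't the algorithm from the assignment. The data are simulated stand-ins with 893 rows, the candidate `lambdas` grid is arbitrary, and all function names are invented for this example.

```r
set.seed(123)

# Simulated stand-in for the Titanic data: 893 rows, binary outcome.
n <- 893
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Shuffle row indices once, then cut into train / test / validation.
idx   <- sample(n)
train <- sim[idx[1:400], ]
test  <- sim[idx[401:693], ]
valid <- sim[idx[694:893], ]

design <- function(df) cbind(1, df$x1, df$x2)   # intercept + predictors

# Unpenalized negative log-likelihood of weights w on a data set.
nll <- function(w, X, y) {
  eta <- as.vector(X %*% w)
  sum(pmax(eta, 0) + log1p(exp(-abs(eta))) - y * eta)
}

# Fit on training data with an L2 penalty on the non-intercept weights.
fit_l2 <- function(X, y, lambda) {
  cost <- function(w) nll(w, X, y) + lambda * sum(w[-1]^2)
  grad <- function(w) {
    as.vector(t(X) %*% (plogis(as.vector(X %*% w)) - y)) + 2 * lambda * c(0, w[-1])
  }
  optim(rep(0, ncol(X)), cost, grad, method = "BFGS")$par
}

lambdas <- c(0, 0.01, 0.1, 0.5, 1, 2, 5, 10, 50, 100)   # arbitrary grid
valid_nll <- sapply(lambdas, function(l) {
  w <- fit_l2(design(train), train$y, l)
  nll(w, design(valid), valid$y)
})

best_lambda <- lambdas[which.min(valid_nll)]
print(data.frame(lambda = lambdas, valid_nll = valid_nll))
print(best_lambda)
```

The test partition is deliberately left untouched during the search, so it can still give an honest estimate of performance once $\lambda$ is fixed.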


## The Wonderful World of ML - Session 4 Discussion: Linear & Quadratic Discriminant Analysis