## The Wonderful World of ML - Session 4 Assignment: Logistic Regression & Regularization

### Problem 1 - Extend Logistic Regression Model From Session 3

In the solutions notebook for session 3, we ran a logistic regression on 2016 Broncos data.  We found that when we added the **Home** variable, the coefficient on this variable was not significantly different from 0, so we left it out of our model.

In this problem, we're going to do a little feature engineering and add a variable called **YrdsDiff**, computed as total offensive yards minus total defensive yards.  Try the following:

1) Read in the revised dataset that includes offensive and defensive stats.  Treat all integer variables as continuous and rebuild the logistic regression model using just the Broncos score.  Call this model **logRegBroncos**.

2) Create a new column in the dataframe called **YrdsDiff** and populate that column with the difference of offensive yards minus defensive yards.  Build a new model that predicts **DenWin** from **DenScore** and the new **YrdsDiff** variable and call this model **logRegBroncos3**.

3) Is the coefficient for the new **YrdsDiff** variable significantly different from 0 to justify adding it to our final model?  Why / Why not?

4) What does the difference in the values for AIC for **logRegBroncos** and **logRegBroncos3** suggest?

In [145]:
data_path <- "https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/broncos2016.csv"
broncos_data <- read.csv(data_path)
head(broncos_data)
logRegBroncos <- glm(broncos_data$DenWin ~ broncos_data$DenScore, family="binomial")
summary(logRegBroncos)

Date,Week,DenScore,OppScore,DenWin,Home,Off1stDwns,OffPassYrds,OffRushYrds,Def1stDwns,DefPassYrds,DefRushYrds,Notes
09/08/2016,1,21,20,1,1,21,159,148,21,176,157,
09/18/2016,2,34,20,1,1,24,266,134,19,170,83,
09/25/2016,3,29,17,1,0,21,303,52,20,189,143,
10/02/2016,4,27,7,1,0,22,218,89,15,143,72,
10/09/2016,5,16,23,0,1,18,183,84,19,250,122,
10/13/2016,6,13,21,0,0,16,220,84,16,166,99,



Call:
glm(formula = broncos_data$DenWin ~ broncos_data$DenScore, family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1947  -0.2699   0.2206   0.4761   1.2306  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)            -7.0906     3.7653  -1.883   0.0597 .
broncos_data$DenScore   0.3483     0.1702   2.047   0.0407 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.930  on 15  degrees of freedom
Residual deviance: 10.965  on 14  degrees of freedom
AIC: 14.965

Number of Fisher Scoring iterations: 6
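
To make the fitted coefficients more concrete, the sketch below plugs the estimates printed above into the logistic function.  The values in `scores` are arbitrary illustration values, not rows from the dataset, and the variable names are made up for this example.

```r
# Coefficient estimates copied from summary(logRegBroncos) above
b0 <- -7.0906   # intercept
b1 <-  0.3483   # change in log-odds per additional Broncos point

# Each extra point multiplies the odds of a win by exp(b1)
odds_ratio <- exp(b1)

# Score at which the model predicts a 50% chance of winning (log-odds = 0)
break_even_score <- -b0 / b1

# Predicted win probability at a few hypothetical scores
scores <- c(14, 21, 28)
win_prob <- plogis(b0 + b1 * scores)

round(odds_ratio, 3)
round(break_even_score, 1)
round(win_prob, 3)
```

Under this fit, each additional point multiplies the odds of winning by roughly 1.4, and the predicted win probability crosses 50% at a bit over 20 points.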


In [146]:
library(dplyr)

broncos_data <- mutate(broncos_data, Delta1stDwns = Off1stDwns - Def1stDwns)
broncos_data <- mutate(broncos_data, YrdsOffTotal = OffPassYrds + OffRushYrds)
broncos_data <- mutate(broncos_data, YrdsDefTotal = DefPassYrds + DefRushYrds)
broncos_data <- mutate(broncos_data, YrdsDiff = YrdsOffTotal - YrdsDefTotal)
#broncos_data[, c('DenWin', 'YrdsDiff')]
#broncos_data
logRegBroncos3 <- glm(broncos_data$DenWin ~ broncos_data$DenScore + broncos_data$YrdsDiff, family="binomial")
summary(logRegBroncos3)


Call:
glm(formula = broncos_data$DenWin ~ broncos_data$DenScore + broncos_data$YrdsDiff, 
    family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.49668  -0.07598   0.07638   0.35998   1.35019  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)           -15.94706   10.46565  -1.524    0.128
broncos_data$DenScore   0.74469    0.47232   1.577    0.115
broncos_data$YrdsDiff  -0.01815    0.01417  -1.281    0.200

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.9301  on 15  degrees of freedom
Residual deviance:  8.6284  on 13  degrees of freedom
AIC: 14.628

Number of Fisher Scoring iterations: 7


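The coefficient on **YrdsDiff** is not significantly different from zero at the 95% confidence level (p = 0.200), but the AIC is a little lower (14.628 versus 14.965).  AIC is a proxy for test error, so this suggests that **logRegBroncos3** might be slightly better than **logRegBroncos**, although a difference this small is weak evidence, especially with only 16 games.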
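
Because each game is a single 0/1 outcome, the residual deviance reported by `glm` equals -2 times the maximized log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients.  The sketch below reproduces the two AIC values from the numbers printed above and, as an extra check that is not part of the original solution, runs a likelihood-ratio test on the drop in deviance.

```r
# Residual deviances and coefficient counts copied from the two summaries above
resid_dev <- c(logRegBroncos = 10.965, logRegBroncos3 = 8.6284)
n_coef    <- c(logRegBroncos = 2,      logRegBroncos3 = 3)

# AIC = -2 * logLik + 2 * k; for 0/1 responses, -2 * logLik is the residual deviance
aic <- resid_dev + 2 * n_coef
aic

# Likelihood-ratio test for adding YrdsDiff: the drop in deviance is compared
# to a chi-squared distribution with 1 degree of freedom (one extra coefficient)
dev_drop <- resid_dev[["logRegBroncos"]] - resid_dev[["logRegBroncos3"]]
lr_p_value <- pchisq(dev_drop, df = 1, lower.tail = FALSE)
lr_p_value
```

With the fitted model objects in hand, `AIC(logRegBroncos, logRegBroncos3)` and `anova(logRegBroncos, logRegBroncos3, test = "Chisq")` should give the same comparison directly.  Lower AIC is better: the deviance term rewards fit, and the `2 * k` term charges for each added coefficient.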

### Problem 2 - Add an L2 Weight Penalty

In session 3, we discussed how we could employ shrinkage methods to lower our risk of overfitting.  Specifically, we discussed the effects of applying L1 (lasso) and L2 (ridge) penalties to regression models.  What we did not discuss is that you can also apply them to classification.  The basic motivation for applying these methods to classification is the same as for regression: we want to reduce overfitting.

These two videos (in the repo) from the University of Washington's [**Machine Learning: Classification** class on Coursera](https://www.coursera.org/learn/ml-classification/home/welcome) (requires a Coursera account) do a really nice job of describing why and how to apply an L2 weight penalty to a logistic regression model:

+ [Penalizing large coefficients to mitigate overfitting](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/010%20-%20Penalizing%20large%20coefficients%20to%20mitigate%20overfitting.mp4)
+ [L2 Regularized Logistic Regression](https://github.com/MichaelSzczepaniak/WonderfulML/raw/master/docs/resources/011%20-%20L2%20regularized%20logistic%20regression.mp4)
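
In rough terms, the L2-penalized objective can be written as follows.  The notation ($w$ for the coefficients, $x_i$ and $y_i \in \{0, 1\}$ for the features and label of sample $i$) is chosen for this note, and conventions differ between sources: some scale the penalty by $\tfrac{1}{2}$ or by the sample size, and the intercept $w_0$ is usually left unpenalized.

$$
\hat{w} = \arg\min_{w} \Big[ -\ell(w) + \lambda \sum_{j \ge 1} w_j^2 \Big],
\qquad
\ell(w) = \sum_{i} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big],
\qquad
p_i = \frac{1}{1 + e^{-w^\top x_i}}
$$

Setting $\lambda = 0$ recovers the ordinary maximum-likelihood fit that `glm` produces, while larger values of $\lambda$ shrink the coefficients toward 0.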

Let's see how an L2 weight penalty can help us.  Try this task:

1) Read in this cleaned and truncated version of the [Titanic data set](https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv).  A description of the data can be found in the [Data Dictionary section of this document](https://github.com/MichaelSzczepaniak/Jamatitanic/blob/master/data/titanic%20data%20dictionary.pdf).  After the read, split the data into 3 partitions: 400 training samples, 200 test samples, and 114 validation samples.  Use the code provided below to get started.

2) Using just the training set, fit a logistic regression model starting with **Pclass** then adding **Age** and **Sex** if they are significantly non-zero and improve AIC.  What does an improvement in AIC look like (increase or decrease)? Why?

3) Use the training and validation set to determine which of these 10 values of L2 weight penalty $\lambda$ minimizes the negative log-likelihood cost function (same as maximizing the log-likelihood).  Use the following algorithm:


In [147]:
data_all <- read.csv("https://raw.githubusercontent.com/MichaelSzczepaniak/WonderfulML/master/data/titanic_class_sex_age.csv", na.strings = c("NA", ""), stringsAsFactors = FALSE)
data_all <- data_all[, c('PassengerId','Survived', 'Pclass', 'Name', 'Sex', 'Age')]
head(data_all)

PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38
3,1,3,"Heikkinen, Miss. Laina",female,26
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35
5,0,3,"Allen, Mr. William Henry",male,35
7,0,1,"McCarthy, Mr. Timothy J",male,54


In [148]:
library(caret)
library(e1071)
set.seed(711)

# Factorize the discrete variables
data_all$Survived <- factor(data_all$Survived)
data_all$Sex <- factor(data_all$Sex)
data_all$Pclass <- factor(data_all$Pclass)

# Create training, validation, and test datasets
train_test_indices <- sample(1:nrow(data_all), 600)
valid_data <- data_all[-train_test_indices, ]     # 714 - 600 = 114 validation samples
train_indices <- sample(train_test_indices, 400)  # 400 train samples
test_indices <- setdiff(train_test_indices, train_indices)
train_data <- data_all[train_indices, ]
test_data <- data_all[test_indices, ]             # 200 test samples
# Check: If train, test, and validation are disjoint, id_keys should have nrow(data_all) = 714 rows
id_keys <- union(union(train_data$PassengerId, test_data$PassengerId), valid_data$PassengerId)
c(nrow(train_data), nrow(test_data), nrow(valid_data), length(id_keys))

In [149]:
titanic_mod1 <- glm(Survived ~ Pclass, data=train_data, family="binomial")
summary(titanic_mod1)


Call:
glm(formula = Survived ~ Pclass, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4028  -0.7059  -0.7059   0.9860   1.7389  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.5158     0.1998   2.581  0.00984 ** 
Pclass2      -0.1863     0.2861  -0.651  0.51488    
Pclass3      -1.7785     0.2641  -6.734 1.65e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 543.58  on 399  degrees of freedom
Residual deviance: 480.43  on 397  degrees of freedom
AIC: 486.43

Number of Fisher Scoring iterations: 4


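The intercept and the **Pclass3** coefficient are significant.  The **Pclass2** coefficient is not (p ≈ 0.51), which means second-class passengers are not distinguishable from the first-class baseline in this model.  Since **Pclass** enters as a single factor, we keep it.  Let's add **Age** and see what happens.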

In [150]:
titanic_mod2 <- glm(Survived ~ Pclass + Age, data=train_data, family="binomial")
summary(titanic_mod2)


Call:
glm(formula = Survived ~ Pclass + Age, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1395  -0.8144  -0.5592   0.9203   2.4766  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.270464   0.419155   5.417 6.07e-08 ***
Pclass2     -0.520854   0.308600  -1.688   0.0915 .  
Pclass3     -2.493670   0.321131  -7.765 8.15e-15 ***
Age         -0.044378   0.008941  -4.963 6.92e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 543.58  on 399  degrees of freedom
Residual deviance: 452.79  on 396  degrees of freedom
AIC: 460.79

Number of Fisher Scoring iterations: 3


Every coefficient except **Pclass2** (p ≈ 0.09) is significant at the 0.05 level, and AIC dropped from 486.43 to 460.79, so it looks like we are going in the right direction.  Let's add our last predictor, **Sex**, and see what happens.

In [151]:
titanic_mod3 <- glm(Survived ~ Pclass + Age + Sex, data=train_data, family="binomial")
summary(titanic_mod3)


Call:
glm(formula = Survived ~ Pclass + Age + Sex, family = "binomial", 
    data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7544  -0.6141  -0.3442   0.5282   2.5527  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.84730    0.54922   7.005 2.47e-12 ***
Pclass2     -0.74882    0.37871  -1.977 0.048009 *  
Pclass3     -2.54751    0.38901  -6.549 5.80e-11 ***
Age         -0.03838    0.01034  -3.712 0.000206 ***
Sexmale     -2.83001    0.29192  -9.695  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 543.58  on 399  degrees of freedom
Residual deviance: 333.29  on 395  degrees of freedom
AIC: 343.29

Number of Fisher Scoring iterations: 5
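
Since **Pclass** and **Sex** are factors, R reports them as dummy variables relative to a baseline level (first class and female here), which is why there is no `Pclass1` or `Sexfemale` row.  As an illustration, the sketch below turns the estimates above into survival probabilities for two hypothetical 30-year-old passengers.  The helper `survival_prob` is made up for this example.

```r
# Coefficient estimates copied from summary(titanic_mod3) above
coefs <- c(intercept = 3.84730, Pclass2 = -0.74882, Pclass3 = -2.54751,
           Age = -0.03838, Sexmale = -2.83001)

# Hypothetical helper: predicted survival probability for one passenger.
# The logical comparisons act as 0/1 dummy variables.
survival_prob <- function(pclass, age, sex) {
  log_odds <- coefs[["intercept"]] +
    coefs[["Pclass2"]] * (pclass == 2) +
    coefs[["Pclass3"]] * (pclass == 3) +
    coefs[["Age"]] * age +
    coefs[["Sexmale"]] * (sex == "male")
  plogis(log_odds)
}

p_first_class_woman <- survival_prob(pclass = 1, age = 30, sex = "female")
p_third_class_man   <- survival_prob(pclass = 3, age = 30, sex = "male")
round(c(p_first_class_woman, p_third_class_man), 3)
```

On the `glm` object itself, `predict(..., newdata = ..., type = "response")` should return the same kind of probabilities without copying coefficients by hand.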


All coefficients are now significant at the 0.05 level and we got an even bigger reduction in AIC (460.79 to 343.29), so let's see how this model does on the validation set.

In [152]:
# retrain using caret interface for easier evaluation
titanic_mod3 <- train(Survived ~ Pclass + Age + Sex, method='glm', data=train_data, family="binomial")
mod3_valid_results <- predict(titanic_mod3, newdata=valid_data)
mod3_valid_acc <- mean(mod3_valid_results == valid_data$Survived)  # fraction classified correctly
mod3_valid_acc

Not too bad, now lets 
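
One way to carry out the $\lambda$ search from task 3 is sketched below.  It is self-contained, so it runs on simulated data rather than the Titanic partitions.  The grid of 10 $\lambda$ values is an assumption (the assignment's own values are not listed above), and the helpers `neg_log_lik`, `penalized_cost`, and `fit_ridge` are hypothetical names.  Base R's `optim` does the minimization.

```r
set.seed(42)

# Simulated stand-in data (NOT the Titanic set): intercept column plus two features
n <- 300
x <- cbind(1, rnorm(n), rnorm(n))
w_true <- c(-0.5, 1.5, -1.0)
y <- rbinom(n, 1, plogis(drop(x %*% w_true)))
train_rows <- 1:200
valid_rows <- 201:300

# Negative log-likelihood of a logistic model with coefficients w
neg_log_lik <- function(w, x, y) {
  eta <- drop(x %*% w)
  -sum(y * eta - log1p(exp(eta)))
}

# Training cost: negative log-likelihood plus L2 penalty (intercept w[1] unpenalized)
penalized_cost <- function(w, x, y, lambda) {
  neg_log_lik(w, x, y) + lambda * sum(w[-1]^2)
}

# Assumed grid of 10 candidate penalties
lambdas <- 10^seq(-3, 2, length.out = 10)

# Fit the penalized model on the training rows for one value of lambda
fit_ridge <- function(lambda) {
  optim(par = rep(0, ncol(x)), fn = penalized_cost,
        x = x[train_rows, ], y = y[train_rows], lambda = lambda,
        method = "BFGS")$par
}

# Score each fit by the unpenalized negative log-likelihood on the validation rows
valid_nll <- sapply(lambdas, function(lam) {
  neg_log_lik(fit_ridge(lam), x[valid_rows, ], y[valid_rows])
})

best_lambda <- lambdas[which.min(valid_nll)]
data.frame(lambda = signif(lambdas, 3), valid_nll = round(valid_nll, 2))
best_lambda
```

The same loop would apply to the Titanic split by swapping in model matrices built from `train_data` and `valid_data`.  A package such as glmnet (with `alpha = 0` for a ridge penalty) is a common alternative, though its `lambda` is on a different scale than the one used here.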

## The Wonderful World of ML - Session 4 Discussion: 
## Linear & Quadratic Discriminant Analysis