In [11]:
source("2-model_tests.R")

### Linear Discriminant Analysis

#### Notes on Models

* train.set with KNN imputation
    - NZV makes no difference
    - BoxCox makes no difference
    - PCA makes no difference
* train.dv improves on train.set but gives a collinearity warning
* train.hc improves on train.dv but gives a collinearity warning unless nzv removed
    - Adding PCA slightly decreases accuracy, kappa, and specificity and slightly increases sensitivity; no collinearity warning
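
The collinearity warnings noted above arise when some predictors are (near-)exact linear combinations of others. As a minimal sketch of one way to check for this, the toy example below uses a made-up matrix `X` (not the survey data) and compares the QR rank with the number of columns. All variable names here are hypothetical.

```r
# Toy design matrix (hypothetical data): x3 is an exact linear combination of x1 and x2
set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
X <- cbind(x1 = x1, x2 = x2, x3 = x1 + 2 * x2)

# The matrix is rank-deficient when its numerical rank is below its column count
rank_X <- qr(X)$rank
is_collinear <- rank_X < ncol(X)

# Dropping the redundant column restores full column rank
X_reduced <- X[, c("x1", "x2")]
full_rank_after <- qr(X_reduced)$rank == ncol(X_reduced)

print(c(rank = rank_X, columns = ncol(X)))
```

On the real predictors, `caret::findLinearCombos()` reports which columns could be dropped to remove exact linear dependencies; whether that would silence the LDA warning here has not been checked.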

In [6]:
set.seed(1056)
lda3 <- train(x = train.hc[, -1], y = train.party, method = "lda", trControl = trCtrl, metric = "ROC",
              preProcess = c("nzv", "knnImpute"))

Loading required package: MASS

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

In lda.default(x, grouping, ...): variables are collinear

In [7]:
pred.lda3 <- predict(lda3, newdata = valid.hc[, -1], na.action = na.pass)
cm.lda3 <- confusionMatrix(pred.lda3, valid.party)

In [8]:
lda3

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 122 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (118), centered (118),
 scaled (118), remove (4) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6649332  0.6425113  0.5864466

 

In [9]:
cm.lda3

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        380        193
  Republican      210        330
                                          
               Accuracy : 0.6379          
                 95% CI : (0.6089, 0.6662)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 2.309e-13       
                                          
                  Kappa : 0.2745          
 Mcnemar's Test P-Value : 0.4254          
                                          
            Sensitivity : 0.6441          
            Specificity : 0.6310          
         Pos Pred Value : 0.6632          
         Neg Pred Value : 0.6111          
             Prevalence : 0.5301          
         Detection Rate : 0.3414          
   Detection Prevalence : 0.5148          
      Balanced Accuracy : 0.6375          
                                          
       'Positive' Class : Democrat        
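
The headline statistics above follow directly from the four cell counts. As a sketch, the base R snippet below recomputes accuracy, sensitivity, specificity, and Cohen's kappa from the counts printed for `cm.lda3`. The variable names are illustrative and not part of the original notebook.

```r
# Counts copied from the cm.lda3 printout (rows = prediction, columns = reference)
cm <- matrix(c(380, 210, 193, 330), nrow = 2,
             dimnames = list(Prediction = c("Democrat", "Republican"),
                             Reference  = c("Democrat", "Republican")))
n <- sum(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Sensitivity / specificity with "Democrat" as the positive class
sensitivity <- cm["Democrat", "Democrat"] / sum(cm[, "Democrat"])
specificity <- cm["Republican", "Republican"] / sum(cm[, "Republican"])

# Cohen's kappa: observed agreement corrected for chance agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_hat), 4))
```

The rounded values should agree with the Accuracy, Sensitivity, Specificity, and Kappa rows of the printout.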
                             

In [None]:
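# Predict on the test set; na.pass leaves missing values for the knnImpute step
predTest <- predict(lda3, newdata = test.set, na.action = na.pass)

# data_frame() is deprecated; tibble() is its replacement
test.submit <- tibble(USER_ID = testing$USER_ID, Predictions = predTest)

# Timestamp the file name in the form submission_YYYYmmddHHMMSS.csv
mod <- format(Sys.time(), "%Y%m%d%H%M%S")
write_csv(test.submit, paste0("submission_", mod, ".csv"))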


### Partial Least Squares Discriminant Analysis

#### Notes on Models
- train.dv2 improves validation accuracy over the top LDA model (0.6442 vs. 0.6379)
- train.hc2 improves validation accuracy over train.dv2 (0.6541 vs. 0.6442)

In [21]:
set.seed(1056)
pls1 <- train(x = train.dv2[, -1], y = train.party, method = "pls", trControl = trCtrl, metric = "ROC",
              preProcess = "knnImpute", tuneGrid = expand.grid(ncomp = 1:10))

Loading required package: pls

Attaching package: 'pls'

The following object is masked from 'package:caret':

    R2

The following object is masked from 'package:stats':

    loadings



In [22]:
pred.pls1 <- predict(pls1, newdata = valid.dv2[, -1], na.action = na.pass)
cm.pls1 <- confusionMatrix(pred.pls1, valid.party)

In [23]:
pls1

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Partial Least Squares 

4455 samples
 329 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (329), centered (329),
 scaled (329) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results across tuning parameters:

  ncomp  ROC        Sens       Spec     
   1     0.6164302  0.7082486  0.4744976
   2     0.6707430  0.6835997  0.5556938
   3     0.6761090  0.6852975  0.5612276
   4     0.6721289  0.6681832  0.5678227
   5     0.6650225  0.6583537  0.5627569
   6     0.6609039  0.6507277  0.5663800
   7     0.6585382  0.6431853  0.5687601
   8     0.6582865  0.6448820  0.5655142
   9     0.6576116  0.6438672  0.5677102
  10     0.6572554  0.6444593  0.5682903

ROC was used to select the optimal model using  the largest value.
The final value used for the model was ncomp = 3. 
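
The selection rule stated above (largest cross-validated ROC) can be reproduced by hand. The sketch below copies the ROC column from the `pls1` table and picks the component count with the maximum. The vector names are illustrative only.

```r
# Cross-validated ROC for ncomp = 1..10, copied from the pls1 printout
roc <- c(0.6164302, 0.6707430, 0.6761090, 0.6721289, 0.6650225,
         0.6609039, 0.6585382, 0.6582865, 0.6576116, 0.6572554)
ncomp <- seq_along(roc)

# "Largest value" rule: keep the ncomp with the highest mean ROC
best_ncomp <- ncomp[which.max(roc)]

# How much ROC is lost by using all 10 components instead of the best setting
roc_drop_at_10 <- roc[best_ncomp] - roc[10]

print(best_ncomp)
print(round(roc_drop_at_10, 4))
```

ROC peaks at three components and declines slowly afterwards, so adding components beyond the third does not improve cross-validated ROC on this data.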

In [24]:
cm.pls1

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        408        214
  Republican      182        309
                                          
               Accuracy : 0.6442          
                 95% CI : (0.6153, 0.6724)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 8.887e-15       
                                          
                  Kappa : 0.2833          
 Mcnemar's Test P-Value : 0.1193          
                                          
            Sensitivity : 0.6915          
            Specificity : 0.5908          
         Pos Pred Value : 0.6559          
         Neg Pred Value : 0.6293          
             Prevalence : 0.5301          
         Detection Rate : 0.3666          
   Detection Prevalence : 0.5588          
      Balanced Accuracy : 0.6412          
                                          
       'Positive' Class : Democrat        
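
The "No Information Rate" and "P-Value [Acc > NIR]" rows compare the model with always predicting the majority class. As a sketch, and assuming these rows come from an exact one-sided binomial test (which matches caret's documented behaviour, though it is not shown in this notebook), they can be approximated from the counts printed for `cm.pls1`:

```r
# Counts copied from the cm.pls1 printout
correct <- 408 + 309              # diagonal of the confusion matrix
n       <- 408 + 214 + 182 + 309  # all validation cases
nir     <- (408 + 182) / n        # share of the majority reference class (Democrat)

accuracy <- correct / n

# Exact binomial 95% confidence interval for the accuracy
acc_ci <- binom.test(correct, n)$conf.int

# One-sided test: is accuracy greater than the no-information rate?
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

print(round(c(accuracy = accuracy, nir = nir), 4))
print(acc_ci)
print(p_value)
```

Compare the printed interval and p-value with the "95% CI" and "P-Value [Acc > NIR]" rows of the printout.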
                             

In [27]:
set.seed(1056)
pls2 <- train(x = train.hc2[, -1], y = train.party, method = "pls", trControl = trCtrl, metric = "ROC",
              preProcess = c("nzv", "knnImpute"), tuneGrid = expand.grid(ncomp = 1:10))

In [28]:
pred.pls2 <- predict(pls2, newdata = valid.hc2[, -1], na.action = na.pass)
cm.pls2 <- confusionMatrix(pred.pls2, valid.party)

In [29]:
pls2

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Partial Least Squares 

4455 samples
 315 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (309), centered (309),
 scaled (309), remove (6) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results across tuning parameters:

  ncomp  ROC        Sens       Spec     
   1     0.6187365  0.7133308  0.4724926
   2     0.6713414  0.6855517  0.5472846
   3     0.6757346  0.6892791  0.5541572
   4     0.6705284  0.6676772  0.5629478
   5     0.6641744  0.6579318  0.5581695
   6     0.6603100  0.6523407  0.5593119
   7     0.6578861  0.6425109  0.5609282
   8     0.6575169  0.6430222  0.5578751
   9     0.6568918  0.6425967  0.5585455
  10     0.6564419  0.6417514  0.5622702

ROC was used to select the optimal model using  the largest value.
The final value used for the model was ncomp = 3. 

In [30]:
cm.pls2

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        419        214
  Republican      171        309
                                         
               Accuracy : 0.6541         
                 95% CI : (0.6253, 0.682)
    No Information Rate : 0.5301         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.3024         
 Mcnemar's Test P-Value : 0.03231        
                                         
            Sensitivity : 0.7102         
            Specificity : 0.5908         
         Pos Pred Value : 0.6619         
         Neg Pred Value : 0.6437         
             Prevalence : 0.5301         
         Detection Rate : 0.3765         
   Detection Prevalence : 0.5687         
      Balanced Accuracy : 0.6505         
                                         
       'Positive' Class : Democrat       
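
The McNemar p-value is the one statistic here that falls below 0.05, so it is worth seeing where it comes from. The test uses only the off-diagonal cells, that is, the two kinds of misclassification. The sketch below applies base R's `mcnemar.test()` with its default continuity correction to the counts printed for `cm.pls2`. That this is the exact call behind the printout is an assumption.

```r
# Counts copied from the cm.pls2 printout (rows = prediction, columns = reference)
cm <- matrix(c(419, 171, 214, 309), nrow = 2,
             dimnames = list(Prediction = c("Democrat", "Republican"),
                             Reference  = c("Democrat", "Republican")))

# McNemar's test compares the off-diagonal counts (214 vs. 171)
mc <- mcnemar.test(cm)   # continuity correction is on by default

# The same statistic by hand: (|b - c| - 1)^2 / (b + c)
b <- cm["Democrat", "Republican"]
c_ <- cm["Republican", "Democrat"]
stat_by_hand <- (abs(b - c_) - 1)^2 / (b + c_)

print(mc$p.value)
print(stat_by_hand)
```

A small p-value here indicates that the two error types are unbalanced: the model labels Republicans as Democrats (214) more often than the reverse (171).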
                                         

In [None]:
set.seed(1056)
pls3 <- train(x = train.lc[, -1], y = train.party, method = "pls", trControl = trCtrl, metric = "ROC",
              preProcess = c("nzv", "knnImpute"), tuneGrid = expand.grid(ncomp = 1:10))

In [None]:
pred.pls3 <- predict(pls3, newdata = valid.lc[, -1], na.action = na.pass)
cm.pls3 <- confusionMatrix(pred.pls3, valid.party)

In [None]:
pls3

In [None]:
cm.pls3