In [51]:
source("1-data_sets.R")

In [2]:
table(train.data$EducationLevel, train.data$Party)

                       
                        Democrat Republican
  Current K-12               350        318
  High School Diploma        270        270
  Current Undergraduate      338        288
  Associate's Degree         149        162
  Bachelor's Degree          518        434
  Master's Degree            296        207
  Doctoral Degree             86         74

In [4]:
library(tableone)
vars <- names(train.data)[-c(1, 7)]
fvars <- vars[-1]
tbl <- CreateTableOne(vars, "Party", train.data, fvars)
print(tbl)

                                Stratified by Party
                                 Democrat        Republican      p      test
  n                                 2361            2094                    
  YOB (mean (sd))                1979.64 (15.12) 1980.09 (14.89)  0.335     
  Gender = Male (%)                 1294 (56.0)     1356 (66.0)  <0.001     
  Income (%)                                                      0.418     
     under 25                        328 (17.1)      301 (17.6)             
     25-50                           314 (16.4)      259 (15.1)             
     50-74                           352 (18.4)      312 (18.2)             
     75-100                          316 (16.5)      261 (15.3)             
     100-150                         317 (16.5)      278 (16.3)             
     over 150                        290 (15.1)      299 (17.5)             
  HouseholdStatus (%)                                            <0.001     
     Domestic Partners (

In [5]:
mod.glm <- glm(Party ~ ., data = train.data[, -1], family = "binomial")

In [6]:
summary(mod.glm)


Call:
glm(formula = Party ~ ., family = "binomial", data = train.data[, 
    -1])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5822  -0.8599  -0.1507   0.8821   2.7772  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                                2.108249  30.838118   0.068   0.9455
YOB                                       -0.001423   0.015519  -0.092   0.9269
GenderMale                                -0.763292   0.334798  -2.280   0.0226
Income.L                                  -0.688942   0.376289  -1.831   0.0671
Income.Q                                   0.670283   0.311310   2.153   0.0313
Income.C                                  -0.338258   0.294258  -1.150   0.2503
Income^4                                  -0.059439   0.282202  -0.211   0.8332
Income^5                                  -0.068248   0.283464  -0.241   0.8097
HouseholdStatusDomestic Partners (w/kids) -2.825776   1.547785  -1.8

In [7]:
set.seed(123)
# 10 folds x 5 repeats = 50 resamples; the 51st element seeds the final model fit
seeds <- vector(mode = "list", length = 51)
for (i in 1:50) seeds[[i]] <- sample.int(1000, 22)
seeds[[51]] <- sample.int(1000, 1)
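
The seed list above follows the layout caret documents for `trainControl(seeds = ...)`: one integer vector per resample, plus a single integer for the final model fit. With 10 folds repeated 5 times there are 50 resamples, hence 51 elements. The sketch below wraps the same logic in a helper. `make_cv_seeds` and its argument names are hypothetical, not part of caret or of this notebook, and 22 is simply the vector length chosen here (the LDA runs below accept it, so a vector longer than the number of candidate models appears to be tolerated).

```r
# Hypothetical helper: build a seeds list for trainControl(method = "repeatedcv").
make_cv_seeds <- function(number = 10, repeats = 5, n_tune = 22, seed = 123) {
  set.seed(seed)
  n_resamples <- number * repeats
  out <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    out[[i]] <- sample.int(1000, n_tune)        # seeds for the candidate models in resample i
  }
  out[[n_resamples + 1]] <- sample.int(1000, 1) # seed for the final model fit
  out
}

cv_seeds <- make_cv_seeds()
length(cv_seeds)            # 51 = 10 folds x 5 repeats + 1
unique(lengths(cv_seeds))   # 22 and 1
```

Changing the number of folds or repeats then only requires changing the arguments, rather than the hard-coded 50 and 51.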

In [8]:
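# Repeated 10-fold CV (caret's default fold count), 5 repeats. classProbs = TRUE is needed
# because twoClassSummary reports ROC, sensitivity, and specificity.
trCtrl <- trainControl(method = "repeatedcv", repeats = 5, seeds = seeds, classProbs = TRUE,
                       returnResamp = "final", summaryFunction = twoClassSummary)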

In [9]:
library(doParallel)
registerDoParallel()

Loading required package: foreach

Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loading required package: iterators
Loading required package: parallel


### LDA

#### Notes on Models

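* train.set with KNN imputation (lda1)
    - Adding NZV (near-zero-variance) filtering makes no difference
    - Adding a BoxCox transformation makes no difference
    - Adding PCA makes no difference
* train.dv (lda2) improves on train.set in validation accuracy (0.6262 vs. 0.6208) but triggers a collinearity warning
* train.hc (lda3) improves on train.dv in validation accuracy (0.6361 vs. 0.6262) but still triggers a collinearity warning
    - Adding PCA (lda4) slightly decreases accuracy (0.6343), kappa, and specificity and slightly increases sensitivity; the collinearity warning no longer appears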

In [30]:
set.seed(1056)
lda1 <- train(x = train.set[, -1], y = train.party, method = "lda", trControl = trCtrl, preProcess = "knnImpute")

In train.default(x = train.set[, -1], y = train.party, method = "lda", : The metric "Accuracy" was not in the result set. ROC will be used instead.
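
The warning above appears because `train` defaults to `metric = "Accuracy"`, while `twoClassSummary` only returns ROC, Sens, and Spec. Passing `metric = "ROC"` explicitly should silence it; caret already falls back to ROC, so model selection should not change. A minimal sketch on the built-in `iris` data reduced to two classes (a stand-in, not this notebook's data), assuming the caret, MASS, and pROC packages are installed:

```r
library(caret)

# Toy two-class problem (an assumption for illustration; not the notebook's data)
toy <- droplevels(iris[iris$Species != "setosa", ])

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1056)
fit <- train(x = toy[, 1:4], y = toy$Species, method = "lda",
             metric = "ROC",   # matches a column produced by twoClassSummary
             trControl = ctrl)
fit$metric
```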

In [31]:
pred.lda1 <- predict(lda1, newdata = valid.set[, -1], na.action = na.pass)
cm.lda1 <- confusionMatrix(pred.lda1, valid.party)

In [32]:
lda1

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 106 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (106), centered (106),
 scaled (106) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6652186  0.6391196  0.5836564

 

In [33]:
cm.lda1

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        362        194
  Republican      228        329
                                          
               Accuracy : 0.6208          
                 95% CI : (0.5916, 0.6494)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 6.24e-10        
                                          
                  Kappa : 0.2417          
 Mcnemar's Test P-Value : 0.1082          
                                          
            Sensitivity : 0.6136          
            Specificity : 0.6291          
         Pos Pred Value : 0.6511          
         Neg Pred Value : 0.5907          
             Prevalence : 0.5301          
         Detection Rate : 0.3252          
   Detection Prevalence : 0.4996          
      Balanced Accuracy : 0.6213          
                                          
       'Positive' Class : Democrat        
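
The statistics printed above can be recomputed by hand from the four cell counts, which is a useful check on which class is treated as positive. A minimal base-R sketch using the counts from `cm.lda1` (the variable names are illustrative):

```r
# Counts from cm.lda1: rows = prediction, columns = reference
cm <- matrix(c(362, 228, 194, 329), nrow = 2,
             dimnames = list(Prediction = c("Democrat", "Republican"),
                             Reference  = c("Democrat", "Republican")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["Democrat", "Democrat"] / sum(cm[, "Democrat"])        # positive class
specificity <- cm["Republican", "Republican"] / sum(cm[, "Republican"])  # negative class

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Kappa = kappa_hat), 4)
```

The rounded values agree with the printed Accuracy (0.6208), Sensitivity (0.6136), Specificity (0.6291), and Kappa (0.2417).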
                             

In [34]:
set.seed(1056)
lda2 <- train(x = train.dv[, -1], y = train.party, method = "lda", trControl = trCtrl, preProcess = c("knnImpute"))

In lda.default(x, grouping, ...): variables are collinear

In [36]:
pred.lda2 <- predict(lda2, newdata = valid.dv[, -1], na.action = na.pass)
cm.lda2 <- confusionMatrix(pred.lda2, valid.party)

In [37]:
lda2

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 224 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (224), centered (224),
 scaled (224) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6646455  0.6450526  0.5853985

 

In [38]:
cm.lda2

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        371        197
  Republican      219        326
                                          
               Accuracy : 0.6262          
                 95% CI : (0.5971, 0.6547)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 5.961e-11       
                                          
                  Kappa : 0.2515          
 Mcnemar's Test P-Value : 0.3032          
                                          
            Sensitivity : 0.6288          
            Specificity : 0.6233          
         Pos Pred Value : 0.6532          
         Neg Pred Value : 0.5982          
             Prevalence : 0.5301          
         Detection Rate : 0.3333          
   Detection Prevalence : 0.5103          
      Balanced Accuracy : 0.6261          
                                          
       'Positive' Class : Democrat        
                             

In [41]:
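# Drop predictors flagged as highly correlated (findCorrelation's default cutoff is 0.9);
# the same column indices are removed from the validation set.
hcor <- cor(train.dv, use = "na.or.complete")
hc <- findCorrelation(hcor)
train.hc <- train.dv[, -hc]
valid.hc <- valid.dv[, -hc]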
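
One caveat with the pattern above: if `findCorrelation` flags nothing, `hc` is `integer(0)`, and negative indexing with an empty vector selects zero columns instead of all of them. A small guard avoids that. The helper name `drop_cols` is hypothetical, and the toy data frame merely stands in for `train.dv`:

```r
# Hypothetical helper: drop columns by index, tolerating an empty index vector
drop_cols <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[, -idx, drop = FALSE]
}

toy <- data.frame(a = 1:3, b = c(2, 1, 3), c = c(3, 1, 2))

ncol(toy[, -integer(0)])            # 0: the unguarded form drops everything
ncol(drop_cols(toy, integer(0)))    # 3: nothing flagged, nothing dropped
ncol(drop_cols(toy, c(2L, 3L)))     # 1
```

That case does not arise here, since the predictor count falls from 224 to 122, but the guard keeps the cell safe if the data change.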

In [42]:
set.seed(1056)
lda3 <- train(x = train.hc[, -1], y = train.party, method = "lda", trControl = trCtrl, 
              preProcess = "knnImpute")

In lda.default(x, grouping, ...): variables are collinear

In [43]:
pred.lda3 <- predict(lda3, newdata = valid.hc[, -1], na.action = na.pass)
cm.lda3 <- confusionMatrix(pred.lda3, valid.party)

In [44]:
lda3

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 122 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (122), centered (122),
 scaled (122) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6659793  0.6410738  0.5885436

 

In [45]:
cm.lda3

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        384        199
  Republican      206        324
                                          
               Accuracy : 0.6361          
                 95% CI : (0.6071, 0.6644)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 5.658e-13       
                                          
                  Kappa : 0.2701          
 Mcnemar's Test P-Value : 0.7656          
                                          
            Sensitivity : 0.6508          
            Specificity : 0.6195          
         Pos Pred Value : 0.6587          
         Neg Pred Value : 0.6113          
             Prevalence : 0.5301          
         Detection Rate : 0.3450          
   Detection Prevalence : 0.5238          
      Balanced Accuracy : 0.6352          
                                          
       'Positive' Class : Democrat        
                             

In [47]:
set.seed(1056)
lda4 <- train(x = train.hc[, -1], y = train.party, method = "lda", trControl = trCtrl, 
              preProcess = c("knnImpute", "pca"))

In train.default(x = train.hc[, -1], y = train.party, method = "lda", : The metric "Accuracy" was not in the result set. ROC will be used instead.

In [48]:
pred.lda4 <- predict(lda4, newdata = valid.hc[, -1], na.action = na.pass)
cm.lda4 <- confusionMatrix(pred.lda4, valid.party)

In [49]:
lda4

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 122 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (122), principal
 component signal extraction (122), centered (122), scaled (122) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6651858  0.6531052  0.5801463

 

In [50]:
cm.lda4

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        393        210
  Republican      197        313
                                          
               Accuracy : 0.6343          
                 95% CI : (0.6053, 0.6627)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 1.365e-12       
                                          
                  Kappa : 0.2649          
 Mcnemar's Test P-Value : 0.552           
                                          
            Sensitivity : 0.6661          
            Specificity : 0.5985          
         Pos Pred Value : 0.6517          
         Neg Pred Value : 0.6137          
             Prevalence : 0.5301          
         Detection Rate : 0.3531          
   Detection Prevalence : 0.5418          
      Balanced Accuracy : 0.6323          
                                          
       'Positive' Class : Democrat        
                             

In [52]:
set.seed(1056)
lda5 <- train(x = train.dv2[, -1], y = train.party, method = "lda", trControl = trCtrl, 
              preProcess = "knnImpute")

In lda.default(x, grouping, ...): variables are collinear

In [53]:
pred.lda5 <- predict(lda5, newdata = valid.dv2[, -1], na.action = na.pass)
cm.lda5 <- confusionMatrix(pred.lda5, valid.party)

In [54]:
lda5

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 329 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (329), centered (329),
 scaled (329) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6567246  0.6444572  0.5693361

 

In [55]:
cm.lda5

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        384        210
  Republican      206        313
                                          
               Accuracy : 0.6262          
                 95% CI : (0.5971, 0.6547)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 5.961e-11       
                                          
                  Kappa : 0.2494          
 Mcnemar's Test P-Value : 0.8831          
                                          
            Sensitivity : 0.6508          
            Specificity : 0.5985          
         Pos Pred Value : 0.6465          
         Neg Pred Value : 0.6031          
             Prevalence : 0.5301          
         Detection Rate : 0.3450          
   Detection Prevalence : 0.5337          
      Balanced Accuracy : 0.6247          
                                          
       'Positive' Class : Democrat        
                             

In [56]:
# Same correlation filter as for train.dv, applied to train.dv2
hcor2 <- cor(train.dv2, use = "na.or.complete")
hc2 <- findCorrelation(hcor2)
train.hc2 <- train.dv2[, -hc2]
valid.hc2 <- valid.dv2[, -hc2]

In [57]:
set.seed(1056)
lda6 <- train(x = train.hc2[, -1], y = train.party, method = "lda", trControl = trCtrl, 
              preProcess = "knnImpute")

In lda.default(x, grouping, ...): variables are collinear

In [58]:
pred.lda6 <- predict(lda6, newdata = valid.hc2[, -1], na.action = na.pass)
cm.lda6 <- confusionMatrix(pred.lda6, valid.party)

In [59]:
lda6

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 315 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (315), centered (315),
 scaled (315) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec    
  0.6566775  0.6447958  0.569146

 

In [60]:
cm.lda6

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        383        211
  Republican      207        312
                                         
               Accuracy : 0.6244         
                 95% CI : (0.5952, 0.653)
    No Information Rate : 0.5301         
    P-Value [Acc > NIR] : 1.324e-10      
                                         
                  Kappa : 0.2458         
 Mcnemar's Test P-Value : 0.8833         
                                         
            Sensitivity : 0.6492         
            Specificity : 0.5966         
         Pos Pred Value : 0.6448         
         Neg Pred Value : 0.6012         
             Prevalence : 0.5301         
         Detection Rate : 0.3441         
   Detection Prevalence : 0.5337         
      Balanced Accuracy : 0.6229         
                                         
       'Positive' Class : Democrat       
                                         

In [61]:
set.seed(1056)
lda7 <- train(x = train.hc2[, -1], y = train.party, method = "lda", trControl = trCtrl, 
              preProcess = c("knnImpute", "pca"))

In train.default(x = train.hc2[, -1], y = train.party, method = "lda", : The metric "Accuracy" was not in the result set. ROC will be used instead.

In [62]:
pred.lda7 <- predict(lda7, newdata = valid.hc2[, -1], na.action = na.pass)
cm.lda7 <- confusionMatrix(pred.lda7, valid.party)

In [63]:
lda7

In gsub("knnImpute", paste(x$k, "nearest neighbor imputation"), : argument 'replacement' has length > 1 and only the first element will be used

Linear Discriminant Analysis 

4455 samples
 315 predictor
   2 classes: 'Democrat', 'Republican' 

Pre-processing: YOB nearest neighbor imputation (315), principal
 component signal extraction (315), centered (315), scaled (315) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 4010, 4010, 4009, 4009, 4010, 4010, ... 
Resampling results:

  ROC        Sens       Spec     
  0.6653651  0.6579346  0.5664748

 

In [64]:
cm.lda7

Confusion Matrix and Statistics

            Reference
Prediction   Democrat Republican
  Democrat        383        224
  Republican      207        299
                                          
               Accuracy : 0.6128          
                 95% CI : (0.5834, 0.6415)
    No Information Rate : 0.5301          
    P-Value [Acc > NIR] : 1.643e-08       
                                          
                  Kappa : 0.2213          
 Mcnemar's Test P-Value : 0.4409          
                                          
            Sensitivity : 0.6492          
            Specificity : 0.5717          
         Pos Pred Value : 0.6310          
         Neg Pred Value : 0.5909          
             Prevalence : 0.5301          
         Detection Rate : 0.3441          
   Detection Prevalence : 0.5454          
      Balanced Accuracy : 0.6104          
                                          
       'Positive' Class : Democrat        
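
For a side-by-side view, the validation accuracy and kappa reported in the seven confusion matrices above can be collected into one data frame. The numbers are copied from the printed output; with the fitted objects available, `cm.lda1$overall[c("Accuracy", "Kappa")]` and its counterparts would give the same values. The `data` labels are shorthand for the inputs and preprocessing used in each cell.

```r
lda_summary <- data.frame(
  model    = paste0("lda", 1:7),
  data     = c("train.set", "train.dv", "train.hc", "train.hc + PCA",
               "train.dv2", "train.hc2", "train.hc2 + PCA"),
  accuracy = c(0.6208, 0.6262, 0.6361, 0.6343, 0.6262, 0.6244, 0.6128),
  kappa    = c(0.2417, 0.2515, 0.2701, 0.2649, 0.2494, 0.2458, 0.2213)
)

# Sort from best to worst validation accuracy
lda_summary[order(-lda_summary$accuracy), ]
```

On these numbers lda3 (train.hc without PCA) has the highest validation accuracy and kappa, although the 95% confidence intervals of all seven models overlap heavily.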
                             