# Review -- Exercises

## The objective is to practice the following:
 - repeated K-fold cross-validation
 - parallel repeated K-fold cross-validation 

## The data: Canadian Community Health Survey (CCHS)
Statistics Canada conducts the Canadian Community Health Survey (CCHS) to collect health-related data at the sub-provincial levels of geography (health region or combined health regions). The survey provides cross-sectional estimates of factors related to health status, health care utilization, and health determinants for the Canadian population. Questions in the survey cover a range of health-related topics, including physical activity, height, weight, smoking, exposure to secondhand smoke, alcohol consumption, general health, chronic health conditions, injuries, use of health care services, and related socio-demographic information. The survey targets the population of resident households in all provinces and territories of Canada but excludes households living on Indian Reserves, Canadian Forces bases, and some remote areas.

The portion of the survey data we will use is a synthetic version of the CCHS data designed specifically for the educational purposes of this course. This version includes only a few variables, described below.
 
## Data dictionary:
1. **age**: a factor with **levels:age group** as follows:
    - 1 : <20
    - 2 : 20-29
    - 3 : 30-39
    - 4 : 40-49
    - 5 : 50-59
    - 6 : 60-69
    - 7 : >=70
    
    
2. **sex**: a factor with **levels:sex** as:
    - 1 : male
    - 2 : female
    
    
3. **CANHEARTbin**: the binary outcome (stored as an integer 0/1 in the data), with **levels:CANHEARTbin** indicating whether the case is labelled as having the CANHEART profile. The CANHEART index, as defined for the Canadian population, considers:
    - Smoking: Non-Smoker or former daily or occasional smoker who quit more than 12 months
    - Overweight/obesity: BMI<25
    - Hypertension: No self-reported HTN diagnosed by health professional
    - Diabetes: No self-reported diabetes diagnosed by health professional
    - Physical activity: Covers only leisure physical activity
    - Fruit and vegetable consumption: More or less than 5 times per day
    
    The associated binary values in the data are:
    - 1 : yes = in CANHEART profile
    - 0 : no  = not in CANHEART profile
    
    
4. **householdsize**: indicates the number of persons in the household as follows:
    - 1 : 1 person
    - 2 : 2 persons
    - 3 : 3 persons
    - 4 : 4 persons
    - 5 : 5 or more persons


5. **education**: indicates the highest level of education of the respondent:
    - 1 : less than secondary education
    - 2 : secondary diploma
    - 3 : post-secondary education
    - 4 : post-secondary graduate degree


6. **maritalstatus**: shows the marital status of the respondent as one of:
    - 1 : single/never married
    - 2 : widowed/separated/divorced
    - 3 : common-law/married


7. **immigration**: a binary flag indicating whether the respondent is an immigrant:
    - 1 : yes
    - 0 : no


8. **houseincome**: indicates the household income category as:
    - 3 : low  (less than 20K)
    - 2 : med  (20K to 60K)
    - 1 : high (more than 60K)
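
The integer codes in the data dictionary can be turned into readable labels with `factor()`. The following is a minimal sketch, assuming a small hand-made data frame `toy` (hypothetical, not the course data) that uses the same coding scheme; the `*_label` column names are illustrative only.

```r
# Hypothetical toy rows that follow the coding scheme of the data dictionary.
toy <- data.frame(
  age         = factor(c(1, 6, 3, 7)),
  sex         = factor(c(2, 1, 1, 2)),
  houseincome = factor(c(1, 2, 3, 1))
)

# `levels` lists the codes, `labels` gives the meaning of each code.
age_labels <- c("<20", "20-29", "30-39", "40-49", "50-59", "60-69", ">=70")
toy$age_label <- factor(toy$age, levels = 1:7, labels = age_labels)
toy$sex_label <- factor(toy$sex, levels = 1:2, labels = c("male", "female"))

# houseincome is coded in reverse (1 = high, 3 = low), so the levels are
# listed from low to high to obtain a correctly ordered factor.
toy$income_label <- factor(toy$houseincome, levels = 3:1,
                           labels = c("low", "med", "high"), ordered = TRUE)

print(toy)
```

Listing the `houseincome` levels as `3:1` matters because the code 1 denotes the highest income group, so the default ordering would sort the categories backwards.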


In [1]:
df <- epi7913A::cchs
head(df)

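|   | age | sex | CANHEARTbin | householdsize | education | maritalstatus | immigration | houseincome |
|---|---|---|---|---|---|---|---|---|
|   | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` |
| 1 | 1 | 2 | 1 | 4 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 3 | 1 | 1 | 0 | 1 |
| 3 | 6 | 1 | 1 | 2 | 4 | 3 | 0 | 2 |
| 4 | 3 | 1 | 0 | 5 | 2 | 3 | 0 | 1 |
| 5 | 6 | 2 | 0 | 1 | 4 | 2 | 0 | 2 |
| 6 | 5 | 1 | 0 | 5 | 1 | 3 | 0 | 1 |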


In [2]:
table(df$age)


   1    2    3    4    5    6    7 
1040 1053 1150 1014 1680 1983 2080 

In [3]:
table(df$sex)


   1    2 
4429 5571 

In [4]:
table(df$CANHEARTbin)


   0    1 
3785 6215 

In [5]:
table(df$householdsize)


   1    2    3    4    5 
2779 3863 1313 1330  715 

In [6]:
table(df$education)


   1    2    3    4 
2258 2008  442 5292 

In [7]:
table(df$maritalstatus)


   1    2    3 
2827 2056 5117 

In [8]:
table(df$immigration)


   0    1 
8458 1542 

In [9]:
table(df$houseincome)


   1    2    3 
5209 3737 1054 

## We will also need the package **sdgm**
### The following functions will be needed from the **sdgm** package:
- ### sdgm::cart.bestmodel.bin() used to train a CART model with the best parameters
- ### sdgm::auc() used to calculate the area under the receiver operating characteristic (ROC) curve, known as the AUC, to measure the ranking performance of classification
- ### sdgm::brier() used to calculate the Brier score (the mean squared error of the predicted probabilities), which measures the prediction performance of probability estimates
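
To make the two metrics concrete, here is a minimal base-R sketch of what an AUC and a Brier score compute. The functions `auc_sketch()` and `brier_sketch()` are hypothetical names written for illustration; they are not the **sdgm** implementations, whose internals are not shown in this notebook. The argument order (predictions first, then observed outcome) simply mirrors the calls used in the cells of this notebook.

```r
# Brier score: mean squared difference between the predicted probability
# and the observed 0/1 outcome (lower is better).
brier_sketch <- function(prob, y) {
  mean((prob - y)^2)
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored higher than a randomly chosen
# negative case (ties count as one half).
auc_sketch <- function(prob, y) {
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)  # undefined with one class
  r <- rank(prob)                                 # average ranks for ties
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (made-up numbers, not CCHS results).
y    <- c(1, 0, 1, 1, 0)
prob <- c(0.9, 0.2, 0.6, 0.4, 0.5)
brier_sketch(prob, y)
auc_sketch(prob, y)
```

In the toy example, five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6. The AUC is undefined when a test fold contains only one class, which can happen with very small folds.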

In [1]:
library(magrittr)

full_data <- epi7913A::cchs %>% dplyr::slice_sample(prop=0.005)
voutcome  <- "CANHEARTbin"
head(full_data)

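|   | age | sex | CANHEARTbin | householdsize | education | maritalstatus | immigration | houseincome |
|---|---|---|---|---|---|---|---|---|
|   | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` |
| 1 | 4 | 2 | 1 | 3 | 4 | 3 | 0 | 1 |
| 2 | 6 | 1 | 0 | 1 | 2 | 1 | 0 | 1 |
| 3 | 3 | 1 | 1 | 2 | 3 | 1 | 0 | 2 |
| 4 | 4 | 2 | 1 | 4 | 4 | 3 | 0 | 1 |
| 5 | 5 | 2 | 0 | 1 | 4 | 1 | 0 | 2 |
| 6 | 6 | 2 | 1 | 1 | 4 | 2 | 1 | 1 |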


In [2]:
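# 5-fold cross-validation: one log-loss value per held-out fold
ll.folds<-sapply(caret::createFolds(full_data[, voutcome], k=5), function(x)
{
  testInds <- x
  trnInds <- setdiff(seq_len(nrow(full_data)), testInds)
  train_data <- full_data[trnInds,]
  test_data <- full_data[testInds,]

  # tune and fit the CART model on the training folds only
  best_model<-sdgm::cart.bestmodel.bin(train_data, voutcome, n_iter=5)

  preds<-predict(best_model, test_data)

  if (!is.null(preds))
  {
    test_ll<- MLmetrics::LogLoss(preds, test_data[,voutcome] )
  } else  {
    test_ll<-NA
    print("Logloss calculation failed")
  }
  test_ll  # return the fold's log loss explicitly
})
# na.rm is an argument of mean(), not of print()
ll.mean<-mean(ll.folds, na.rm=TRUE)
print(ll.mean)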


 Best Parameters Found: 
Round = 10	minsplit = 16.0000	minbucket = 13.0000	cp = 0.03573669	maxdepth = 9.0000	Value = -0.6104034 

 Best Parameters Found: 
Round = 3	minsplit = 11.0000	minbucket = 18.0000	cp = 0.02257529	maxdepth = 6.0000	Value = -0.6690772 

 Best Parameters Found: 
Round = 21	minsplit = 8.0000	minbucket = 9.0000	cp = 0.03575185	maxdepth = 12.0000	Value = -0.5605247 

 Best Parameters Found: 
Round = 12	minsplit = 20.0000	minbucket = 15.0000	cp = 0.05713059	maxdepth = 4.0000	Value = -0.5789702 

 Best Parameters Found: 
Round = 21	minsplit = 7.0000	minbucket = 14.0000	cp = 0.06507879	maxdepth = 7.0000	Value = -0.6683998 
[1] 1.38019
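
For reference, the log loss reported above is the mean negative log-likelihood of the observed 0/1 outcomes under the predicted probabilities. The following is a minimal base-R sketch; `log_loss_sketch()` is a hypothetical helper written for illustration, and the clipping constant `eps` is an assumption of this sketch rather than a documented detail of any package.

```r
# Binary log loss: -mean( y*log(p) + (1-y)*log(1-p) ), lower is better.
log_loss_sketch <- function(prob, y, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so that log() stays finite.
  p <- pmin(pmax(prob, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Toy example (made-up numbers, not CCHS results).
y <- c(1, 0, 1, 0)
log_loss_sketch(c(0.9, 0.1, 0.8, 0.3), y)   # fairly confident and correct
log_loss_sketch(rep(0.5, 4), y)             # uninformative prediction
```

Predicting 0.5 for every case gives a log loss of `log(2)`, about 0.693, which is a useful baseline when reading cross-validated values.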


In [3]:
# 5-fold cross-validation: one AUC value per held-out fold
auc.folds<-sapply(caret::createFolds(full_data[, voutcome], k=5), function(x){
  testInds <- x
  trnInds <- setdiff(seq_len(nrow(full_data)), testInds)
  train_data <- full_data[trnInds,]
  test_data <- full_data[testInds,]

  best_model<-sdgm::cart.bestmodel.bin(train_data, voutcome, n_iter=2)

  preds<-predict(best_model, test_data)

  # AUC
  if (!is.null(preds))
  {
    test_auc <- sdgm::auc(preds, test_data[,voutcome] )
  } else {
    test_auc <- NA
    print("AUC calculation failed because there are no predicted values")
  }
  print(paste0("AUC on CCHS Data: ", test_auc))
  test_auc  # return the numeric AUC; ending on print() would return a string
})
auc.mean<-mean(auc.folds, na.rm=TRUE)
cat("\nThe mean AUC obtained from cross-validation is:",auc.mean,"\n")


 Best Parameters Found: 
Round = 12	minsplit = 9.0000	minbucket = 17.0000	cp = 0.06776244	maxdepth = 14.0000	Value = -0.6832473 
[1] "AUC on CCHS Data: 0.666666666666667"

 Best Parameters Found: 
Round = 5	minsplit = 14.0000	minbucket = 18.0000	cp = 0.06003933	maxdepth = 11.0000	Value = -0.6513667 
[1] "AUC on CCHS Data: 0.541666666666667"

 Best Parameters Found: 
Round = 1	minsplit = 17.0000	minbucket = 19.0000	cp = 0.06804908	maxdepth = 9.0000	Value = -0.6505009 
[1] "AUC on CCHS Data: 0.547619047619048"

 Best Parameters Found: 
Round = 8	minsplit = 6.0000	minbucket = 19.0000	cp = 0.08749433	maxdepth = 11.0000	Value = -0.6436704 
[1] "AUC on CCHS Data: 0.5"

 Best Parameters Found: 
Round = 13	minsplit = 16.0000	minbucket = 11.0000	cp = 0.02188563	maxdepth = 8.0000	Value = -0.5145888 
[1] "AUC on CCHS Data: 0.56"

The mean AUC obtained from cross-validation is: 0.5631905 
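
Each cross-validation cell in this notebook uses a single random 5-fold partition, so the estimate depends on how the folds happened to be drawn. Repeated K-fold cross-validation redraws the partition several times and averages the results. The following is a minimal, self-contained sketch under stated assumptions: the data frame `sim` is simulated (it is not the CCHS data), a logistic regression from `glm()` stands in for the CART model, the folds are unstratified, and `kfold_brier()` is a hypothetical helper.

```r
set.seed(7913)  # arbitrary seed, used only to make the sketch reproducible

# Simulated stand-in data (hypothetical; not the CCHS data).
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.8 * sim$x1 - 0.5 * sim$x2))

# One K-fold pass: returns the mean Brier score over the k held-out folds.
kfold_brier <- function(data, k = 5) {
  # Assign every row to one of k folds at random (roughly equal sizes).
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  scores <- sapply(seq_len(k), function(f) {
    train <- data[fold_id != f, ]
    test  <- data[fold_id == f, ]
    fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
    prob  <- predict(fit, newdata = test, type = "response")
    mean((prob - test$y)^2)  # Brier score on the held-out fold
  })
  mean(scores)
}

# Repeat the whole K-fold procedure with a fresh partition each time.
n_repeats  <- 10
rep_scores <- replicate(n_repeats, kfold_brier(sim, k = 5))

cat("Mean Brier score over", n_repeats, "repeats:", mean(rep_scores), "\n")
cat("Standard deviation across repeats:", sd(rep_scores), "\n")
```

The spread across repeats shows how much of the variation in a single K-fold estimate comes from the partition alone.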


In [4]:
# 5-fold cross-validation: one Brier score per held-out fold
brier.folds<-sapply(caret::createFolds(full_data[, voutcome], k=5), function(x)
{
  testInds <- x
  trnInds <- setdiff(seq_len(nrow(full_data)), testInds)
  train_data <- full_data[trnInds,]
  test_data <- full_data[testInds,]

  best_model<-sdgm::cart.bestmodel.bin(train_data, voutcome, n_iter=20)

  preds<-predict(best_model, test_data)

  # Brier
  if (!is.null(preds))
  {
    test_brier <- sdgm::brier(preds, test_data[,voutcome] )
  } else {
    test_brier <- NA
    print("Brier calculation failed because there are no predicted values")
  }
  print(paste0("Brier Score on CCHS Data: ", test_brier))
  test_brier  # return the numeric score; ending on print() would return a string
})
brier.mean<-mean(brier.folds, na.rm=TRUE)
print(brier.mean)


 Best Parameters Found: 
Round = 6	minsplit = 6.0000	minbucket = 18.0000	cp = 0.08499422	maxdepth = 3.0000	Value = -0.6583479 
[1] "Brier Score on CCHS Data: 0.0211640211640213"

 Best Parameters Found: 
Round = 38	minsplit = 16.0000	minbucket = 16.0000	cp = 0.003256354	maxdepth = 15.0000	Value = -0.620214 
[1] "Brier Score on CCHS Data: 0"

 Best Parameters Found: 
Round = 29	minsplit = 9.0000	minbucket = 16.0000	cp = 0.004729816	maxdepth = 15.0000	Value = -0.6118378 
[1] "Brier Score on CCHS Data: 0"

 Best Parameters Found: 
Round = 38	minsplit = 14.0000	minbucket = 15.0000	cp = 0.1000	maxdepth = 3.0000	Value = -0.6615632 
[1] "Brier Score on CCHS Data: 0"

 Best Parameters Found: 
Round = 39	minsplit = 5.0000	minbucket = 16.0000	cp = 0.0010	maxdepth = 9.0000	Value = -0.5929837 
[1] "Brier Score on CCHS Data: 0"

[1] 0.004232804
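
Because every repeat of a K-fold cross-validation is independent of the others, the repeats can be distributed over several cores. The following is a minimal, self-contained sketch using the base **parallel** package. It rests on several assumptions: the data frame `sim` is simulated (it is not the CCHS data), a logistic regression from `glm()` stands in for the CART model, `kfold_brier()` is a hypothetical helper, and at least two cores are available.

```r
library(parallel)

# Simulated stand-in data (hypothetical; not the CCHS data).
set.seed(7913)  # arbitrary seed for the simulated data
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.8 * sim$x1 - 0.5 * sim$x2))

# One K-fold pass: returns the mean Brier score over the k held-out folds.
kfold_brier <- function(data, k = 5) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  scores <- sapply(seq_len(k), function(f) {
    train <- data[fold_id != f, ]
    test  <- data[fold_id == f, ]
    fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
    prob  <- predict(fit, newdata = test, type = "response")
    mean((prob - test$y)^2)
  })
  mean(scores)
}

n_cores <- 2                               # assumption: two cores are available
cl <- makeCluster(n_cores)                 # start the worker processes
clusterSetRNGStream(cl, iseed = 7913)      # independent, reproducible RNG streams
clusterExport(cl, c("sim", "kfold_brier")) # copy data and helper to the workers

# Each element of seq_len(n_repeats) triggers one full K-fold pass on a worker.
n_repeats  <- 10
rep_scores <- unlist(parLapply(cl, seq_len(n_repeats),
                               function(i) kfold_brier(sim, k = 5)))
stopCluster(cl)                            # always release the workers

cat("Mean Brier score over", n_repeats, "parallel repeats:",
    mean(rep_scores), "\n")
```

Parallelising over repeats rather than over folds keeps each task large enough to outweigh the cost of sending data to the workers. `clusterSetRNGStream()` is used so that the workers do not draw identical fold partitions.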
