# Feature Selection
## Topics for this notebook
### (Pearson correlation and chi-square tests are not covered in this notebook; see bivariate analysis)
1. Clinical context  
2. Read in CSV file  
3. Splitting the data into training and testing sets
4. Oversampling to make the training set more balanced  
5. Feature selection using a wrapper method
6. Feature selection using an embedded method 

## Clinical context
In this notebook we build a model to predict in-hospital death from intensive care unit (ICU) data. As a starting point, we select many of the same features used by APACHE (Acute Physiology and Chronic Health Evaluation), a scoring system that was also designed to predict death among ICU patients. Some APACHE features have been excluded from our model because of missing data, and several features that are not part of APACHE have been added in an attempt to improve model performance. Predicting death in the ICU matters because identifying which patients need immediate attention can be a matter of life or death.

## Read in CSV file 

In [1]:
#install and load required libraries/packages
install.packages("readr")
library(readr)

#location of the raw csv file in the GitHub repository
urlfile <- "https://raw.githubusercontent.com/e-cui/ENABLE-HiDAV-Online-Modules/master/Data%20Mining%20Modules/csv_files/t.csv"

#read the CSV file into a data frame called t
t <- read_csv(url(urlfile))

#show the structure of the data frame
str(t)

Parsed with column specification:
cols(
  hospital_death = col_double(),
  temp_apache = col_double(),
  map_apache = col_double(),
  h1_heartrate_max = col_double(),
  d1_resprate_max = col_double(),
  sodium_apache = col_double(),
  d1_potassium_max = col_double(),
  d1_creatinine_max = col_double(),
  d1_hematocrit_max = col_double(),
  wbc_apache = col_double(),
  gcs_eyes_apache = col_double(),
  gcs_motor_apache = col_double(),
  age = col_double(),
  pre_icu_los_days = col_double(),
  bmi = col_double(),
  intubated_apache = col_double(),
  sepsis = col_double(),
  cardiovascular_diagnosis = col_double()
)


tibble [91,713 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hospital_death          : num [1:91713] 0 0 0 0 0 0 0 0 1 0 ...
 $ temp_apache             : num [1:91713] 39.3 35.1 36.7 34.8 36.7 36.6 35 36.6 36.9 36.3 ...
 $ map_apache              : num [1:91713] 40 46 68 60 103 130 138 60 66 58 ...
 $ h1_heartrate_max        : num [1:91713] 119 114 96 100 89 83 79 118 82 96 ...
 $ d1_resprate_max         : num [1:91713] 34 32 21 23 18 32 38 28 24 44 ...
 $ sodium_apache           : num [1:91713] 134 145 138 138 138 ...
 $ d1_potassium_max        : num [1:91713] 4 4.2 4.25 5 4.25 ...
 $ d1_creatinine_max       : num [1:91713] 2.51 0.71 1.49 1.49 1.49 ...
 $ d1_hematocrit_max       : num [1:91713] 27.4 36.9 34.5 34 34.5 ...
 $ wbc_apache              : num [1:91713] 14.1 12.7 10.5 8 15.5 ...
 $ gcs_eyes_apache         : num [1:91713] 3 1 3 4 1 4 4 4 4 4 ...
 $ gcs_motor_apache        : num [1:91713] 6 3 6 6 1 6 6 6 6 6 ...
 $ age                     : num [1:91713] 68 77 25 81 19 67 5
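
Before splitting, it can be useful to check how imbalanced the outcome is and whether any columns contain missing values, since both affect the later oversampling and modelling steps. The following is a minimal sketch in base R. It uses a small invented data frame called `toy` as a stand-in for `t`, and the values are made up for illustration only.

```r
# toy stand-in for the ICU data frame; all values are invented for illustration
toy <- data.frame(
  hospital_death = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
  age = c(68, 77, 25, 81, 19, 67, 50, 59, 70, 45)
)

# counts and proportions of each outcome class
class_counts <- table(toy$hospital_death)
class_props <- prop.table(class_counts)
print(class_counts)
print(round(class_props, 2))

# number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

The same three calls can be applied to the real data frame by replacing `toy` with `t`.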

## Split the data frame into separate training and testing data frames 

In [9]:
#no extra packages are needed here: the split below uses base R's sample()

#use 75% of the rows for training
smp_size <- floor(0.75 * nrow(t))

#set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(t)), size = smp_size)

#create a training data frame and a testing data frame
train1 <- t[train_ind, ]
test1 <- t[-train_ind, ]

#show the structure of the testing data frame
str(test1)

tibble [22,929 x 18] (S3: tbl_df/tbl/data.frame)
 $ hospital_death          : num [1:22929] 0 0 0 0 0 0 0 0 0 0 ...
 $ temp_apache             : num [1:22929] 34.8 36.8 36.6 36.6 36.6 36.3 36.5 36.7 36.8 36.2 ...
 $ map_apache              : num [1:22929] 60 72 140 103 55 133 64 162 163 118 ...
 $ h1_heartrate_max        : num [1:22929] 100 90 92 108 100 116 72 76 82 132 ...
 $ d1_resprate_max         : num [1:22929] 23 23 25 41 22 36 18 28 36 44 ...
 $ sodium_apache           : num [1:22929] 138 138 138 140 138 ...
 $ d1_potassium_max        : num [1:22929] 5 4.2 4.25 4.4 4 ...
 $ d1_creatinine_max       : num [1:22929] 1.49 0.9 1.49 1.13 1.49 ...
 $ d1_hematocrit_max       : num [1:22929] 34 34.5 41.8 33.1 33.3 28.9 43.2 27 32.6 44 ...
 $ wbc_apache              : num [1:22929] 8 9.96 8.41 10.3 5 ...
 $ gcs_eyes_apache         : num [1:22929] 4 4 4 4 4 3 4 4 4 4 ...
 $ gcs_motor_apache        : num [1:22929] 6 6 6 6 6 6 6 6 6 6 ...
 $ age                     : num [1:22929] 81 72 48 
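
The split above is a simple random sample, so the share of deaths in the training and testing sets can drift slightly apart. One alternative is a stratified split that samples 75% of the rows within each outcome class. The sketch below shows the idea in base R. The helper `stratified_split` and the data frame `toy` are hypothetical names introduced for this example and are not part of the original notebook.

```r
# hypothetical helper: sample train_frac of the rows within each outcome class
stratified_split <- function(df, outcome, train_frac = 0.75, seed = 123) {
  set.seed(seed)
  # row indices grouped by outcome class
  idx_by_class <- split(seq_len(nrow(df)), df[[outcome]])
  # sample positions within each class (sample.int avoids sample()'s scalar pitfall)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }), use.names = FALSE)
  list(train = df[train_idx, , drop = FALSE],
       test = df[-train_idx, , drop = FALSE])
}

# invented data: 80 survivors and 20 deaths
toy <- data.frame(hospital_death = rep(c(0, 1), times = c(80, 20)),
                  feature = seq_len(100))

parts <- stratified_split(toy, "hospital_death")
print(table(parts$train$hospital_death))
print(table(parts$test$hospital_death))
```

Both parts keep the 80/20 class ratio of the invented data, which a plain random sample only matches approximately.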

## Oversample the minority outcome to balance the training data frame (not always necessary)  

In [10]:
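#install and load required libraries/packages
install.packages("ROSE")
library(ROSE)

#balance the training data by generating synthetic rows with ROSE
#p = 0.4 is the target proportion of the minority class (hospital_death = 1)
#only train1 is resampled; test1 stays untouched so it can serve as unseen data
#note: the counts printed below total 22,929 rows (the size of test1) and will differ for train1
data.rose <- ROSE(hospital_death ~ ., data = train1, p = 0.4, seed = 3)$data
table(data.rose$hospital_death)
str(data.rose)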

Loaded ROSE 0.0-3




    0     1 
13840  9089 

'data.frame':	22929 obs. of  18 variables:
 $ hospital_death          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ temp_apache             : num  36.4 36.3 36.1 36.3 35.7 ...
 $ map_apache              : num  98.68 160.36 50.07 5.97 146.41 ...
 $ h1_heartrate_max        : num  78.3 127.4 70.6 152.5 80.9 ...
 $ d1_resprate_max         : num  19.9 40.5 29.7 40.2 26.9 ...
 $ sodium_apache           : num  144 134 137 130 140 ...
 $ d1_potassium_max        : num  3.71 4.55 4.26 4.74 4.4 ...
 $ d1_creatinine_max       : num  1.517 0.82 1.285 -0.208 2.247 ...
 $ d1_hematocrit_max       : num  28 35.6 20.4 26.2 35.7 ...
 $ wbc_apache              : num  11.18 4.62 2.51 16.71 5.39 ...
 $ gcs_eyes_apache         : num  3.72 4.19 3.23 4.41 5.01 ...
 $ gcs_motor_apache        : num  5.74 5.73 5.46 5.47 5.51 ...
 $ age                     : num  46.1 97.2 61.6 81.2 70.5 ...
 $ pre_icu_los_days        : num  0.1724 -0.0158 8.3065 -0.3877 2.0257 ...
 $ bmi                     : num  24 22.6 20.4 21.3 28.7 ...
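
ROSE does not duplicate existing rows. It generates synthetic rows from a smoothed bootstrap, which is why the output above contains values such as fractional GCS scores and a negative creatinine. For comparison, the sketch below shows plain random oversampling in base R, where minority rows are drawn again with replacement until they reach a target share `p`. The helper `oversample_minority` and the data frame `toy` are hypothetical, and the sketch assumes the outcome is coded 0/1 with 1 as the minority class.

```r
# hypothetical helper: duplicate minority rows until they make up a share p
oversample_minority <- function(df, outcome, p = 0.4, seed = 3) {
  set.seed(seed)
  minority_rows <- which(df[[outcome]] == 1)
  majority_rows <- which(df[[outcome]] == 0)
  # minority rows needed so that minority / (minority + majority) is about p
  n_minority <- round(p / (1 - p) * length(majority_rows))
  # draw minority rows with replacement
  resampled <- minority_rows[sample.int(length(minority_rows), n_minority,
                                        replace = TRUE)]
  df[c(majority_rows, resampled), , drop = FALSE]
}

# invented data: 90 survivors and 10 deaths
toy <- data.frame(hospital_death = rep(c(0, 1), times = c(90, 10)),
                  feature = seq_len(100))

balanced <- oversample_minority(toy, "hospital_death")
print(table(balanced$hospital_death))
```

Unlike ROSE, this approach only repeats observed rows, so every value stays within the range seen in the data, at the cost of adding no new variation.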


## Backward feature selection using recursive feature elimination (wrapper method) 

In [None]:
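#feature selection using recursive feature elimination (rfe)
#backward feature selection starts with all features, then removes the least important ones
#install and load required packages/libraries
install.packages("mlbench")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
library(e1071)
library(mlbench)
library(caret)
library(randomForest)

#establish training parameters: random forest functions with 10-fold cross-validation
rfe_training <- rfeControl(functions=rfFuncs, method="cv", number=10)

#column 1 is the outcome and columns 2 to 18 are the 17 candidate predictors
#convert the outcome to a factor so that rfe runs classification rather than regression
rfe_x <- data.rose[, 2:18]
rfe_y <- as.factor(data.rose$hospital_death)

#run the rfe model using the oversampled training data
#with 17 predictors, subset sizes from 2 to 17 are evaluated
set.seed(123)
rfe_result <- rfe(rfe_x, rfe_y, sizes=2:17, rfeControl=rfe_training)
print(rfe_result)

#list the predictors in the selected subset
predictors(rfe_result)

#plot performance against the number of features
plot(rfe_result, type=c("g", "o"))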
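
Recursive feature elimination with a random forest can take a long time on data of this size. The backward idea behind it can be illustrated on a smaller scale with base R's `step()`, which starts from a logistic regression containing every feature and drops one feature at a time while the AIC improves. This swaps in a different technique (AIC-based backward elimination rather than caret's `rfe`), and the data frame `toy` and its columns are invented for this sketch.

```r
set.seed(123)
n <- 500

# invented data: two informative features and two pure-noise features
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))

# invented outcome that depends only on x1 and x2
toy$death <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - 1.5 * toy$x2))

# start from the model that contains every feature
full_model <- glm(death ~ ., data = toy, family = binomial)

# backward elimination: repeatedly drop the feature whose removal lowers AIC most
reduced_model <- step(full_model, direction = "backward", trace = 0)

# features that survived the elimination
kept_features <- attr(terms(reduced_model), "term.labels")
print(kept_features)
```

As with any wrapper method, the model is refitted for every candidate subset, which is what makes the approach thorough but computationally expensive.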


## Feature selection using a random forest algorithm (embedded method) 

In [None]:
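#install and load required libraries/packages (the installs can be skipped if already done)
install.packages("mlbench")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
library(e1071)
library(mlbench)
library(caret)
library(randomForest)

#convert the outcome to a factor in both data frames so that caret fits a classifier
#and so that the Accuracy metric and the confusion matrix are defined
data.rose$hospital_death <- as.factor(data.rose$hospital_death)
test1$hospital_death <- factor(test1$hospital_death, levels=levels(data.rose$hospital_death))

#establish the training parameters: 10-fold cross-validation repeated 3 times
t_training <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
#mtry is the number of predictors tried at each split
#use the square root of the number of predictors (all columns except the outcome)
mtry <- floor(sqrt(ncol(data.rose) - 1))
tunegrid <- expand.grid(.mtry=mtry)

#train the random forest algorithm on the oversampled training data
#the oversampled training data frame is called data.rose
#the outcome variable is hospital_death
t_model <- train(hospital_death~., data=data.rose, method="rf", metric=metric, tuneGrid=tunegrid, trControl=t_training)

#show the results of model training
print(t_model)

#apply the trained model to the unseen testing data frame and store the predictions as z
z <- predict(t_model, test1)

#compare the predictions with the true outcomes using a confusion matrix
#positive="1" reports sensitivity and specificity with death as the event of interest
confusionMatrix(z, test1$hospital_death, positive="1")

#show variable importance
variable_importance <- varImp(t_model)
print(variable_importance)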
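
Embedded methods score features as a by-product of fitting the model itself. The sketch below shows the same idea on a smaller scale with a single classification tree from the `rpart` package rather than the random forest trained above. The data frame `toy` and its columns are invented for illustration, and `rpart` is assumed to be installed (it ships with most R distributions).

```r
library(rpart)

set.seed(7)
n <- 500

# invented data: x1 is strongly related to the outcome, x2 weakly, noise1 not at all
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n))
toy$death <- factor(rbinom(n, size = 1,
                           prob = plogis(2.5 * toy$x1 - 1 * toy$x2)))

# fit a classification tree; importance is computed while the tree is grown
tree <- rpart(death ~ ., data = toy, method = "class")

# importance scores, largest first
importance <- sort(tree$variable.importance, decreasing = TRUE)
print(round(importance, 1))

# name of the highest-ranked feature
top_feature <- names(importance)[1]
print(top_feature)
```

Because the ranking comes from a single model fit, embedded methods are usually much cheaper than wrapper methods, although the ranking is tied to the chosen model type.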

## This concludes the feature selection part of the course
- This notebook and the accompanying video lecture addressed the following key questions:
1. What are the three main types of methods used for feature selection?
2. What are the advantages and disadvantages of the feature selection approaches mentioned?

 

