# Feature Selection
## Topics for this notebook
### (Pearson correlation and chi-square are not covered in this notebook; see bivariate analysis)
1. Splitting the data between training and testing sets
2. Oversampling to make the training set more balanced  
3. Feature selection using a wrapper method
4. Feature selection using an embedded method 

## Split the data frame into separate training and testing data frames 

In [None]:
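#split the t data frame into a training data frame and a testing data frame
#the training data frame holds 75% of the rows and the rest go to the testing data frame
install.packages("caTools")
library(caTools)
set.seed(123)
#sample.split expects the outcome vector (not the whole data frame) and returns one TRUE/FALSE flag per row
#it keeps the ratio of outcome classes similar in both sets
split_flag <- sample.split(t$hospital_death, SplitRatio = 0.75)
train1 <- subset(t, split_flag == TRUE)
test1 <- subset(t, split_flag == FALSE)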
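
To make the idea of a class-preserving 75/25 split concrete, here is a minimal base R sketch. It does not use caTools, and the `toy` data frame and its `age` column are simulated stand-ins invented for illustration, not the course data.

```r
# Simulated stand-in for the real data: 1000 rows, roughly 10% positive outcomes
set.seed(123)
toy <- data.frame(
  age = rnorm(1000, mean = 60, sd = 15),
  hospital_death = factor(rbinom(1000, size = 1, prob = 0.1))
)

# Group the row indices by outcome class, then draw 75% of the rows within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$hospital_death)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The class proportions should be close to each other in both sets
prop.table(table(toy_train$hospital_death))
prop.table(table(toy_test$hospital_death))
```

Sampling within each class is what keeps a rare outcome from ending up almost entirely in one of the two sets.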


## Oversample the minority outcome to balance the training data frame (not always necessary)  

In [None]:
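#oversample the minority outcome to correct for class imbalance
#oversampling is applied to the training data only so the testing data stays untouched
install.packages("ROSE")
library(ROSE)
#p = 0.4 asks for roughly 40% of the generated rows to belong to the minority class
data.rose <- ROSE(hospital_death~., p = 0.4, data=train1, seed=3)$data
#check the class counts after oversampling
table(data.rose$hospital_death)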
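
The role of the target proportion `p` can be illustrated with plain random oversampling in base R. This is a simpler technique than ROSE, which generates synthetic rows rather than copying existing ones, so treat it as a sketch of the arithmetic only. The `toy_train` data frame is simulated for illustration.

```r
# Simulated imbalanced training set: 810 survivors (0) and 90 deaths (1)
set.seed(3)
toy_train <- data.frame(
  age = rnorm(900, mean = 60, sd = 15),
  hospital_death = factor(c(rep(0, 810), rep(1, 90)))
)

p <- 0.4  # target share of the minority class after oversampling
majority <- toy_train[toy_train$hospital_death == "0", ]
minority <- toy_train[toy_train$hospital_death == "1", ]

# Number of minority rows needed so that minority / (minority + majority) is about p
n_minority_target <- round(p / (1 - p) * nrow(majority))

# Resample minority rows with replacement up to the target count
resampled <- minority[sample.int(nrow(minority), size = n_minority_target, replace = TRUE), ]
toy_balanced <- rbind(majority, resampled)

table(toy_balanced$hospital_death)
prop.table(table(toy_balanced$hospital_death))
```

Unlike this sketch, which keeps every majority row and adds minority copies, ROSE by default returns a data set of the same size as its input.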

## Backward feature selection using recursive feature elimination (wrapper method) 

In [None]:
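#feature selection using recursive feature elimination (rfe) from the caret package
#backward feature selection starts with all features then reduces the number of features 
install.packages("caret")
install.packages("randomForest")
library(caret)
library(randomForest)

#establish training parameters: random forest helper functions with 10-fold cross-validation
set.seed(123)
rfe_training <- rfeControl(functions=rfFuncs, method="cv", number=10)

#run the rfe model using the oversampled training data 
#column 1 is assumed to hold the outcome and columns 2 to 18 the 17 candidate features
#the result is stored as rfe_model so that it does not mask the rfe() function
rfe_model <- rfe(data.rose[,2:18], data.rose[,1], sizes=c(2:17), rfeControl=rfe_training)
print(rfe_model)

#list the features retained in the best-performing subset
predictors(rfe_model) 

#plot cross-validated performance against the number of features 
plot(rfe_model, type=c("g", "o"))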
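
The elimination loop behind a wrapper method can be sketched in base R. This is a simplified illustration and not caret's implementation: it swaps the random forest for a logistic regression, ranks features by the absolute z statistic, and scores each subset on a single validation split instead of cross-validation. The data and the names `x1` to `x4` are simulated.

```r
set.seed(42)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 drive the outcome; x3 and x4 are pure noise
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - 1.0 * toy$x2))

train_rows <- sample.int(n, size = 300)
toy_train <- toy[train_rows, ]
toy_valid <- toy[-train_rows, ]

features <- c("x1", "x2", "x3", "x4")
history <- data.frame(n_features = integer(), accuracy = numeric(),
                      dropped = character(), stringsAsFactors = FALSE)

while (length(features) >= 1) {
  # Refit the model on the current feature subset
  fit <- glm(reformulate(features, response = "y"), data = toy_train, family = binomial)

  # Validation accuracy for this subset
  pred <- as.integer(predict(fit, newdata = toy_valid, type = "response") > 0.5)
  acc <- mean(pred == toy_valid$y)

  # Rank features by absolute z statistic and eliminate the weakest one
  z_abs <- abs(coef(summary(fit))[features, "z value"])
  weakest <- features[which.min(z_abs)]

  history <- rbind(history, data.frame(n_features = length(features), accuracy = acc,
                                       dropped = weakest, stringsAsFactors = FALSE))
  features <- setdiff(features, weakest)
}

# One row per subset size, with the feature eliminated after that fit
print(history)
```

Reading `history` from top to bottom shows the order of elimination, and the subset size with the highest accuracy plays the role of the peak in the rfe plot.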

## Feature selection using a random forest algorithm (embedded method) 

In [None]:
#feature selection using a random forest algorithm 
#random forest is an embedded method that selects variables as part of building the model 
install.packages("mlbench")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
library(e1071)
library(mlbench)
library(caret)
library(randomForest)

#establish the training parameters 
t_training<- trainControl(method = "repeatedcv", number=10, repeats=3)
seed<- 7
metric<- "Accuracy"
set.seed(seed)
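#use the square root of the number of predictors (all columns except the outcome), rounded down
mtry<- floor(sqrt(ncol(data.rose) - 1))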
tunegrid<-expand.grid(.mtry=mtry)

#train the random forest algorithm on the oversampled training data 
t_model<- train(hospital_death~., data=data.rose, method="rf", metric=metric, tuneGrid=tunegrid, trControl=t_training)

#show the results of model training
print(t_model)

#apply the trained model to the unseen testing data frame and store the results as z  
z<- predict(t_model, test1)

#look at the results of running the model with unseen test data using a confusion matrix 
confusionMatrix(z, test1$hospital_death)

#rank the variables by importance, then print and plot the ranking
variable_importance<- varImp(t_model)
print(variable_importance)
plot(variable_importance)
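
To help read the confusion matrix output, the headline metrics can be computed by hand in base R. The labels below are hypothetical values for ten patients, invented purely for illustration.

```r
# Hypothetical actual and predicted labels (1 = death, 0 = survival)
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0), levels = c(0, 1))

# Rows are predictions and columns are the reference labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Treat "1" (death) as the positive class
tp <- cm["1", "1"]  # deaths predicted as deaths
tn <- cm["0", "0"]  # survivors predicted as survivors
fp <- cm["1", "0"]  # survivors predicted as deaths
fn <- cm["0", "1"]  # deaths predicted as survivors

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual deaths that were detected
specificity <- tn / (tn + fp)        # share of actual survivors that were detected

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With two outcome levels, caret's `confusionMatrix` treats the first factor level as the positive class unless `positive` is set, so its sensitivity and specificity may be swapped relative to this sketch.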

## This concludes the feature selection part of the course
- This notebook and the accompanying video lecture addressed the following key questions:
1. What are the three main types of methods used for feature selection?
2. What are the advantages and disadvantages of the feature selection approaches mentioned?

 

