# RwHealth: Data Science Assessment 

**Your task:**

Using this Inpatient dataset, build 3 machine learning models to predict the likelihood of a readmission. 

Evaluate each model, and determine which of the models is best.

As part of your analysis of the data please include any statistical analysis, plots or data exploration carried out.

**Dataset:**

Throughout this notebook, you will be using the Inpatient dataset.


**Tips:**

- Some code is included to get you started in R
- If you are doing this in Python please replace and refactor any code
- Add markdown cells to this notebook to include explanations  
- Include comments in your code
- Include plots where appropriate to explain your data and models


# Getting started

Now that the packages are installed, load them using the `library()` function.

In [1]:
# Load in libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Data preparation
- Load in the `Inpatient.csv` dataset
- Include some exploratory analysis
- Clean the data by one hot encoding appropriate columns and converting data types where required
- Create a test/train split of the data. 
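
As a minimal sketch of the one-hot encoding step listed above, base R's `model.matrix()` can expand a factor into indicator columns. The `toy` data frame and its column names are hypothetical and do not come from `Inpatient.csv`; the starter code further down uses `caret::dummyVars()` for the same purpose.

```r
# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
  ward = factor(c("A", "B", "C", "A")),
  los  = c(2, 5, 1, 7)
)

# "~ . - 1" drops the intercept, so the factor gets one indicator column per level
encoded <- model.matrix(~ . - 1, data = toy)
print(encoded)
```

Each row has exactly one `1` among the `ward*` columns, while the numeric column `los` passes through unchanged.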

## Load in data

Read in the `Inpatient` dataset. 


In [None]:
# read in the dataset
df <- read.csv("https://raw.githubusercontent.com/Draper-Dash/dsp-training/main/Inpatients.csv", na.strings = "")


This dataset has 33 columns. We can view the structure of the data by using `str()`:

In [None]:
# view a summary of the data using str()
str(df)
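
Alongside `str()`, it can help to see how much of each column is missing before deciding which variables to keep. The sketch below shows one way to do this in base R; the `toy` data frame is a hypothetical stand-in, and the same two lines could be applied to `df`.

```r
# Hypothetical stand-in for the real data frame
toy <- data.frame(
  age  = c(34, NA, 71, 58),
  ward = c("A", "B", NA, NA),
  los  = c(2, 5, 1, 7)
)

# Share of missing values per column, most incomplete first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 2))
```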

## Exploratory analysis

Running the code below plots a grid of histograms, one per numeric column, which gives a quick visual summary of the data.

In [None]:
# set the width and height of the plot
options(repr.plot.width=15, repr.plot.height=10)

# create a facet plot for all numerical columns
df %>%
  keep(is.numeric) %>% 
  pivot_longer(cols = everything())  %>%   # Convert to name-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ name, scales = "free") +  # in separate panels
    geom_histogram()  +                    # as histogram
    theme_minimal() +
    theme(text = element_text(size=20))
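
The histograms cover numeric columns only. Since the outcome is a binary flag, its class balance is worth checking separately, because a rare positive class makes plain accuracy misleading. A minimal sketch, using a hypothetical `readmit` vector in place of the real outcome column:

```r
# Hypothetical outcome values standing in for the readmission flag
readmit <- factor(c(0, 0, 0, 1, 0, 1, 0, 0))

counts <- table(readmit)       # absolute counts per class
shares <- prop.table(counts)   # proportion of each class
print(shares)
```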

## Setup

In [None]:
# Load extra packages
if (!require("pacman")){
  install.packages("pacman")}

pacman::p_load(#
  'caret',
  'skimr',
  'randomForest',
  'gbm',
  'earth'
)

## Data Pre-processing

In [None]:
# Work on a copy of the data frame loaded above
data <- df

# Convert features to numeric or factors
data[data == "NA"] <- NA
numerical <- c(8:11,14:16,18:32)

data[, numerical] <- apply(data[, numerical], 2, function(x) as.numeric(as.character(x)))
data[,-numerical] <- lapply(data[,-numerical], factor)

data <- data[,c(13,10:11,14:16,18:32,3:5,7,17)] # Remove PII, weird factors and variables with < 90%

# Create the training and test datasets
set.seed(100)
trainRowNumbers <- createDataPartition(data$ReadmitFlag, p=0.8, list=FALSE)
trainData <- data[trainRowNumbers,]
testData <- data[-trainRowNumbers,]

# Impute numeric variables
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model

trainData <- predict(preProcess_missingdata_model, newdata = trainData)

# Use only complete cases
if(anyNA(trainData)){
  trainData <- trainData[complete.cases(trainData),]
}

# Set predictors and outcome variables
x = trainData[, 2:26]
y = trainData$ReadmitFlag

# One-Hot Encode factor variables
dummies_model <- dummyVars(ReadmitFlag ~ ., data = trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
trainData <- data.frame(trainData_mat)

# Rescale every feature to the range [0, 1]
preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)

# Append the Y variable
trainData$ReadmitFlag <- y
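
The imputation, encoding and scaling objects above are all fitted on the training data only, so that nothing about the test set leaks into the model. A minimal base R sketch of that idea for range scaling follows; the data frames and the `rescale()` helper are hypothetical and are not part of `caret`.

```r
# Hypothetical numeric features
train_x <- data.frame(los = c(1, 3, 5), age = c(20, 40, 60))
test_x  <- data.frame(los = c(2, 7),    age = c(30, 50))

# Learn the minimum and range from the training data only
mins   <- sapply(train_x, min)
ranges <- sapply(train_x, max) - mins

# Apply the training-set parameters to any data frame with the same columns
rescale <- function(d) {
  as.data.frame(Map(function(col, lo, rg) (col - lo) / rg, d, mins, ranges))
}

train_scaled <- rescale(train_x)
test_scaled  <- rescale(test_x)
print(test_scaled)
```

Test values can fall outside [0, 1] (here `los = 7` maps to 1.5), which is expected when the test set contains values the training set never saw.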

## Feature Selection

In [None]:
# Visualize the importance of variables 
# Box
featurePlot(x = trainData[, 1:26], 
            y = trainData$ReadmitFlag, 
            plot = "box",
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")))
# Density
featurePlot(x = trainData[, 1:26], 
            y = trainData$ReadmitFlag, 
            plot = "density",
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")))

In [None]:
# Select features using RFE
set.seed(100)
options(warn=-1)

subsets <- c(1:5, 10, 15, 25)
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
rfProfile <- rfe(x=trainData[, 1:18], y=trainData$ReadmitFlag,
                 sizes = subsets,
                 rfeControl = ctrl)
rfProfile

plot(rfProfile, type = c("g", "o"))

optVars <- rfProfile$optVariables # use only optimal variables

trainData <- trainData[,c("ReadmitFlag",optVars)]

## Training


In [None]:
# MARS
set.seed(100)
model_mars = train(ReadmitFlag ~ ., data=trainData, method='earth')
fitted <- predict(model_mars)
plot(model_mars, main="Model Accuracies with MARS")

# GLM
set.seed(100)
model_glm = train(ReadmitFlag ~ ., data=trainData, method='glm')
fitted <- predict(model_glm) # No tuning for this model

#Random Forest
set.seed(100)
model_rf = train(ReadmitFlag ~ ., data=trainData, method='rf')
fitted <- predict(model_rf)
plot(model_rf, main="Model Accuracies with RandomForests")
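
The task asks for the likelihood of a readmission, whereas `predict()` on a `caret` classification model returns class labels by default; passing `type = "prob"` returns class probabilities for models that support them. The sketch below illustrates the same idea with a plain logistic regression in base R. The simulated data, the variable names and the coefficients are assumptions made purely for illustration.

```r
set.seed(100)

# Simulated toy data: longer stays get a higher chance of readmission (assumed)
n <- 200
toy <- data.frame(los = rexp(n, rate = 0.2))
toy$readmit <- rbinom(n, size = 1, prob = plogis(-2 + 0.15 * toy$los))

# Logistic regression on the toy data
fit <- glm(readmit ~ los, data = toy, family = binomial)

# type = "response" returns probabilities rather than log-odds
prob <- predict(fit, newdata = data.frame(los = c(1, 10, 30)), type = "response")
print(round(prob, 3))
```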

In [None]:
# Variable Importance
varimp_mars <- varImp(model_mars)
plot(varimp_mars, main="Variable Importance with MARS")

varimp_glm <- varImp(model_glm)
plot(varimp_glm, main="Variable Importance with GLM")

varimp_rf <- varImp(model_rf)
plot(varimp_rf, main="Variable Importance with RF")

## Testing

In [None]:
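# Pre-processing pipeline (same steps and fitted objects as for training)
# Step 1: Impute missing values
testData2 <- predict(preProcess_missingdata_model, testData)

# Step 2: Use only complete cases
if(anyNA(testData2)){
  testData2 <- testData2[complete.cases(testData2),]
}

# Step 3: Create one-hot encodings (dummy variables)
# data.frame() gives the same syntactic column names as in training
testData3 <- data.frame(predict(dummies_model, testData2))

# Step 4: Rescale using the ranges learned from the training data
testData3 <- predict(preProcess_range_model, testData3)

# Step 5: Use only optimal variables
testData4 <- testData3[,optVars]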

## Model Selection

In [None]:
# Predict on test Data and Assess 
# MARS
predicted_mars <- predict(model_mars, testData4)
confusionMatrix(
  reference = testData2$ReadmitFlag,
  data = predicted_mars,
  mode ='everything',
  positive = '1'
  )

# GLM
predicted_glm <- predict(model_glm, testData4)
confusionMatrix(
  reference = testData2$ReadmitFlag,
  data = predicted_glm,
  mode ='everything',
  positive = '1'
)

# RF
predicted_rf <- predict(model_rf, testData4)
confusionMatrix(
  reference = testData2$ReadmitFlag,
  data = predicted_rf,
  mode ='everything',
  positive = '1'
)
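
Confusion matrices depend on a single classification threshold. A threshold-free complement is the area under the ROC curve (AUC), computed from predicted probabilities. The sketch below uses the rank-based (Mann-Whitney) formulation in base R; `auc_rank()` is a hypothetical helper written for this example, and the scores and labels are made up.

```r
# AUC as the probability that a random positive outranks a random negative
auc_rank <- function(scores, labels) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)  # average ranks handle tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)             # made-up true classes
scores <- c(0.1, 0.4, 0.35, 0.8)    # made-up predicted probabilities
print(auc_rank(scores, labels))     # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, which makes it a convenient single number for comparing the three models on the same test set.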

In [None]:
# Model Comparison
# Compare model performances using resamples()
models_compare <- resamples(list(
  RF=model_rf,
  MARS=model_mars,
  GLM=model_glm))

summary(models_compare)

scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)

# Conclusion

- What insights do these models give? 
- Which model is best?
- How might you improve your analysis?

In [None]:
# your code here...