# Assignment 4

In this assignment, you will build a model to predict a length of stay of 3 days or more (a binary classification outcome) using a sample from the Texas hospital discharge dataset. We will evaluate your model's performance on a hold-out dataset that is withheld from you.

You will produce a full analysis report: describe the data (descriptive statistics and missingness patterns), train and evaluate the model, report the optimal hyperparameters, explain the model, and evaluate fairness on `sex`.

Instructions for this assignment:

1. Use this checklist for reporting. Complete it and include it as part of your submission:
[Journal of Medical Internet Research - Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Modeling Studies: Development and Validation (jmir.org)](https://www.jmir.org/2023/1/e48763)

2. You may choose any modeling technique, or build an ensemble; it is up to you. You can try several models and compare them.

3. In most publications you need to explain why you did not use a simpler modeling technique, so you should include a logistic regression model as a baseline to compare against.

4. You should submit the workbook, the checklist, and the model (see details below). For submission, __only submit your best model__. 

5. We have 10 bonus points to distribute among the class, based on how your model ranks on the hold-out dataset: 4, 3, 2, and 1 points for the top four models.

## Load useful packages

In [None]:
library(sdgm)
library(dplyr)
library(ggplot2)

## Load the data

In [None]:
# check the help file of this dataset. It documents the binary prediction model 
# that can be trained from this dataset.
?sdgm::texas

In [None]:
# In this assignment, we will provide the data pre-processing steps. Please don't change code here.

# get the dataset
df <- sdgm::getdata("texas")

# create binary outcome
df$status <- ifelse(as.numeric(df$LENGTH_OF_STAY) >= 3, 1, 0)


# transform numeric predictors
df$age  <- as.numeric(df$PAT_AGE)
df$fees <- ifelse(df$CHRGS_NON_COV < 0, NA, df$CHRGS_NON_COV)


# transform categorical predictors
df$sex            <- as.factor(ifelse(df$SEX_CODE == "F", "F",
                                      ifelse(df$SEX_CODE == "M", "M", NA)))
df$ethnicity      <- as.factor(ifelse(df$ETHNICITY == "1" | df$ETHNICITY == "1.0", 1,
                                        ifelse(df$ETHNICITY == "2" | df$ETHNICITY == "2.0", 2, NA)))
df$race           <- as.factor(df$RACE)
df$location       <- as.factor(df$PAT_STATE)
df$weekday        <- as.factor(gsub("'",'',df$ADMIT_WEEKDAY))
df$risk_mortality <- as.factor(gsub("'",'',df$RISK_MORTALITY))
df$severity       <- as.factor(gsub("'",'',df$ILLNESS_SEVERITY))
df$drg            <- as.factor(gsub("'",'',df$APR_DRG))

# select predictors
vars_select <- c("status", "age", "fees", "sex", "ethnicity", "race", "weekday",
                         "location", "risk_mortality", "severity", "drg")

# transform from tibble to dataframe
full_data <- as.data.frame(df %>% select(all_of(vars_select)))

## Question 1: Descriptive analysis

Explore and describe the dataset by printing summary statistics. Note that this question is only meant to familiarize you with the dataset; there is no need to do any data preprocessing.
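As a rough illustration of the kind of summary this question asks for, here is a minimal base R sketch. It runs on a small made-up data frame (`toy`, hypothetical values that only mimic a few columns of `full_data`), so the numbers mean nothing; the same calls should carry over to `full_data`.

```r
# Toy stand-in for full_data (hypothetical values, not the Texas sample)
toy <- data.frame(
  status = c(1, 0, 1, 0, 1, 0),
  age    = c(34, 71, NA, 52, 18, 65),
  fees   = c(120.5, NA, 0, 300, NA, 45),
  sex    = factor(c("F", "M", NA, "F", "M", "F"))
)

# Per-column summary: quartiles for numeric columns, level counts for factors
print(summary(toy))

# Share of missing values in each column
miss_rate <- colMeans(is.na(toy))
print(round(miss_rate, 2))

# Outcome prevalence: proportion of rows with status == 1
prevalence <- mean(toy$status == 1)
print(prevalence)

# Missingness pattern: which columns are missing together, row by row
row_pattern <- apply(is.na(toy), 1, function(r) {
  missing_cols <- names(toy)[r]
  if (length(missing_cols) == 0) "none" else paste(missing_cols, collapse = "+")
})
pattern <- table(row_pattern)
print(pattern)
```

The pattern table counts how many rows share each combination of missing columns, which is one simple way to see whether values tend to be missing together.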

In [None]:
# Have a look at the data
"Your answer"

In [None]:
# define the outcome variable
"Your answer"

## Question 2: Train and evaluate model, find your best model

You are free to run as many experiments as you want to find your __best model__. Remember to train a __logistic regression prediction model as your baseline__.
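For orientation only, the sketch below fits a logistic regression baseline with `stats::glm` and scores it by AUC on a simple train/test split. It uses synthetic data and a hand-written `auc` helper (both hypothetical, introduced here for illustration); it does not use the `sdgm` workflow from the lectures, which you may prefer for the actual assignment.

```r
set.seed(1)

# Synthetic stand-in data (hypothetical; not the Texas sample)
n   <- 400
toy <- data.frame(age = runif(n, 18, 90),
                  sex = factor(sample(c("F", "M"), n, replace = TRUE)))
toy$status <- rbinom(n, 1, plogis(-4.3 + 0.08 * toy$age))

# Simple 70/30 train/test split
idx   <- sample(n, size = round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Baseline: logistic regression
fit  <- glm(status ~ age + sex, data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

baseline_auc <- auc(pred, test$status)
print(baseline_auc)
```

Any more complex model you try should be compared against this kind of baseline on the same evaluation scheme and metric.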

### Question 2.1: Train and evaluate model(s)

__Important note:__ Because you will need to select and submit your best model, it is better to save all your models as you run your experiments. You must save __a model file__ for each model.

We just updated the `sdgm` package, which has `save.model(object, filename)` and `load.model(filename)` functions that will properly save all of the models.

#### Examples
best_model<-sdgm::nested.cv.bin(sdgm::cart.bestmodel.bin, full_data, voutcome)

sdgm::save.model(best_model$model, "model_best.model") # save the model

best_model_model <- sdgm::load.model("model_best.model") # load the model

__Recommendation:__ You may also want to save __a result file__ for each model, as we did in Assignment 3, to help you answer the follow-up questions.

In [None]:
# Train and evaluate the model(s)
"Your answer"

### Question 2.2: Choose your best model

Do whatever you need to determine your best model, then submit the saved model file. Please rename it to `final.model.YourName.model`.

__Note:__ Please submit your model file, not the result file (if applicable).

Describe your best model here:

Model: "Your answer"

With tune? (yes/no): "Your answer"

If yes, what were the hyperparameters: "Your answer"

In [None]:
# Do whatever you need to determine which is your best model.
"Your answer"

## Question 3: Explainability

### Question 3.1: Feature importance: permute and predict

Identify the 3 most important variables using permute and predict. Use "Lecture 9 - Variable importance on COVID.ipynb" as your reference.
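The idea behind permute and predict can be sketched in a few lines of base R: shuffle one predictor at a time, re-predict with the already fitted model, and record how much the AUC drops. The example below is an illustration on synthetic data with a `glm` model; the names `toy`, `auc`, and `permute_drop` are hypothetical, and the lecture notebook remains the reference for how to do this on your saved model.

```r
set.seed(42)

# Synthetic data: x1 drives the outcome, x2 is pure noise (hypothetical)
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$status <- rbinom(n, 1, plogis(2 * toy$x1))

fit <- glm(status ~ x1 + x2, data = toy, family = binomial)

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

base_auc <- auc(predict(fit, newdata = toy, type = "response"), toy$status)

# Permute one column, re-predict with the fitted model, and average
# the resulting drop in AUC over several shuffles
permute_drop <- function(var, n_rep = 20) {
  drops <- replicate(n_rep, {
    shuffled        <- toy
    shuffled[[var]] <- sample(shuffled[[var]])
    base_auc - auc(predict(fit, newdata = shuffled, type = "response"),
                   toy$status)
  })
  mean(drops)
}

importance <- sapply(c("x1", "x2"), permute_drop)
print(sort(importance, decreasing = TRUE))
```

A larger drop means the model relies more on that variable. For simplicity this sketch permutes the training data; evaluating on held-out data is usually preferable.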

__Question:__ based on your results below, what are the 3 most important variables?

Your answer:

In [None]:
# load your best model for the question below
best_model <- "Your answer" # load the model file

In [None]:
# permute and predict based on your best model
"Your answer"

### Question 3.2: Partial Dependence Plot (PDP)

For the 3 most important variables you identified in Question 3.1, plot their functional relationship with the output. Use "Lecture 9 - PDP and fairness examples using sdgm.ipynb" as your reference.
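To make clear what a partial dependence curve measures, here is a hand-rolled sketch in base R: for each grid value, the variable is set to that value for every row, the model predicts, and the predicted probabilities are averaged. The data, the `glm` model, and the `partial_dependence` helper are all hypothetical illustrations; the lecture notebook shows the `sdgm` way of producing these plots.

```r
set.seed(7)

# Synthetic data (hypothetical): the outcome probability rises with age
n   <- 500
toy <- data.frame(age = runif(n, 18, 90), x2 = rnorm(n))
toy$status <- rbinom(n, 1, plogis(-4 + 0.08 * toy$age))

fit <- glm(status ~ age + x2, data = toy, family = binomial)

# Partial dependence: fix `var` at each grid value for all rows,
# predict, and average the predicted probabilities
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v
    mean(predict(model, newdata = data, type = "response"))
  })
}

grid <- seq(20, 90, by = 10)
pd   <- partial_dependence(fit, toy, "age", grid)
print(data.frame(age = grid, mean_p1 = round(pd, 3)))
```

Calling `plot(grid, pd, type = "b")` draws the curve. For a factor such as `drg`, the grid would be the factor's levels, and the assignment inside the helper would need to keep the column a factor with the original levels.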

In [None]:
# PDP of variable #1
"Your answer"

In [None]:
# PDP of variable #2
"Your answer"

In [None]:
# PDP of variable #3
"Your answer"

### Question 3.3

If `drg` is one of your top 3 variables, answer the question below:

What are the top 3 DRGs (diagnosis-related groups, from `APR_DRG`) associated with the outcome, i.e. with higher p(1)?

Your answer

## Question 4: Fairness on sex

Referring to "Lecture 9 - PDP and fairness examples using sdgm.ipynb", explore the fairness of your best model on sex.
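One common way to look at group fairness is to compare error rates across groups. The sketch below does this on synthetic data: it computes the predicted-positive rate, true positive rate, and false positive rate separately for each sex. The 0.5 threshold, the choice of metrics, and the names `toy` and `group_metrics` are assumptions made for illustration; the lecture notebook may use different measures.

```r
set.seed(3)

# Synthetic data (hypothetical): the outcome depends on age only
n   <- 600
toy <- data.frame(sex = factor(sample(c("F", "M"), n, replace = TRUE)),
                  age = runif(n, 18, 90))
toy$status <- rbinom(n, 1, plogis(-4 + 0.08 * toy$age))

fit <- glm(status ~ age + sex, data = toy, family = binomial)

# Classify at an assumed threshold of 0.5
toy$pred <- as.integer(predict(fit, type = "response") >= 0.5)

# Rates computed within one group
group_metrics <- function(d) {
  c(n             = nrow(d),
    positive_rate = mean(d$pred == 1),
    tpr           = mean(d$pred[d$status == 1] == 1),
    fpr           = mean(d$pred[d$status == 0] == 1))
}

by_sex <- t(sapply(split(toy, toy$sex), group_metrics))
print(round(by_sex, 3))
```

Large gaps between the `F` and `M` rows in any of these rates would suggest that the model treats the groups differently; small gaps are expected here because the synthetic outcome does not depend on sex.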

__Question:__ Do you think your best model is fair to females? Why?

Your answer.

In [None]:
"Your answer"

## Question 5: Report your results

Check the reporting guideline linked in the instructions at the top and complete the table below.

In [None]:
# get the hyperparameters of your best model
best_model$params

In [None]:
### START CODE HERE  (REPLACE INSTANCES OF "Your answer" with your code) ###  
### Each "Your answer" can be different.
# answer
table <- data.frame(
  my_outcome     = "Your answer",
  my_feature     = "Your answer", # Note you need to paste all predictor names into a string
  n_feature      = "Your answer",
  n_sample       = "Your answer",
  my_model       = "Your answer",
  # For the best hyperparameters, there are placeholders for three;
  # modify them based on your best model.
  best_parameter = paste(paste("Name of hyperparameter 1:", "Your answer", "; "),
                         paste("Name of hyperparameter 2:", "Your answer", "; "),
                         paste("Name of hyperparameter 3:", "Your answer"), sep = ""),
  metric         = "auc",
  eval_results   = "Your answer", # AUC value of your best model
  top_3_features = paste("Your answer", "Your answer", "Your answer", sep = "; ")
)

knitr::kable(t(table), "simple")
