# Introduction

## Overview of Problem

We have received Vanderbilt hospital data extracted from the Synthetic Derivative. At Vanderbilt, bioinformaticians helped create a "mirror image" of electronic medical records (EMRs), such as those in BioVU, Vanderbilt's biorepository of DNA extracted from discarded blood collected during routine clinical testing. This mirror of the EMR is called the Synthetic Derivative, and it contains over 2 million individual patients with all clinical information available for the past 10 years. It has been scrubbed of HIPAA identifiers with an error rate of ~0.01%, and each patient is identified only by a deidentified subject ID.

For this project, we have received a subset of data on 10,000 subjects, linked by a deidentified subject ID. The objective is to use the modeling tools introduced in class to fit prediction models for patient readmission within 30 days of discharge. The data include multiple variables, detailed below in "Data." Ideally, our model will accurately predict readmission within 30 days of discharge for a patient using the admit/discharge/transfer events data. The rest of the data details information about the patients themselves, the tests and treatments they underwent at Vanderbilt, and their lab results and medications. This information could be very useful for clinicians hoping to identify which patients may need to be readmitted and which characteristics or tests are associated with readmission. Once these are identified, that subset of patients could be given more attention and/or tests to prevent readmission.
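To make the prediction target concrete, the sketch below shows one way a 30-day readmission label could be derived from admit/discharge events with pandas. The column names (`subject_id`, `admit_date`, `discharge_date`), the example rows, and the choice to count a gap of exactly 30 days as a readmission are all assumptions made for illustration; they are not the actual schema or definition used in the dataset.

```python
import pandas as pd

# Made-up events for two subjects; column names are hypothetical, not the real schema.
events = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "admit_date": pd.to_datetime(
        ["2010-01-01", "2010-01-20", "2010-06-01", "2011-03-01", "2011-05-15"]),
    "discharge_date": pd.to_datetime(
        ["2010-01-05", "2010-01-25", "2010-06-03", "2011-03-04", "2011-05-18"]),
})

def label_readmissions(df, window_days=30):
    """Flag each stay whose subject is admitted again within window_days of discharge."""
    out = df.sort_values(["subject_id", "admit_date"]).copy()
    # Next admission date for the same subject (NaT for the subject's last stay).
    next_admit = out.groupby("subject_id")["admit_date"].shift(-1)
    gap_days = (next_admit - out["discharge_date"]).dt.days
    # Comparisons against NaN are False, so a subject's last stay is labeled 0.
    out["readmit_30d"] = ((gap_days >= 0) & (gap_days <= window_days)).astype(int)
    return out

labeled = label_readmissions(events)
print(labeled[["subject_id", "discharge_date", "readmit_30d"]])
```

In this toy example only the first stay of subject 1 is labeled as a readmission case, because the next admission follows 15 days after discharge.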

## Goal and Structure of Project

This project introduces several approaches to predictive modeling of patient readmission within 30 days of discharge. Three approaches are detailed in the following Jupyter notebooks. Each notebook may include more than one model type, attempts to improve each model and test its performance, and tuning of the models to improve goodness of fit. Cross-validation will be used when appropriate, and each notebook will justify and describe the models selected and provide visualizations and discussion of the results. The models will be compared using goodness-of-fit tests and other performance characteristics. For clarity, the steps followed in each of the three modeling notebooks are listed below.

1. Identify the model approach(es), describe, and justify the selection
2. Code, parameterize, and run model (including visualization)
3. Cross-validation
4. Goodness of fit assessments, performance characteristics (including visualization)
5. Improvements to model/tuning of parameters; model selection methods, justification of improvements/tests
6. Comparison of models; identification of best model
7. Results 
8. Implications of model and conclusions
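As a rough illustration of steps 2 through 4 above, the sketch below fits a model and scores it with cross-validation using scikit-learn. Everything in it is an assumption for illustration: the data are synthetic (generated by `make_classification`, with an imbalanced outcome standing in for readmission), and logistic regression with AUC scoring is only one possible choice, not necessarily a model or metric used in the modeling notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix; about 15% positive outcomes.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0)

# Step 2: define the model. Scaling lives inside the pipeline so that it is
# re-fit on the training portion of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 3: stratified folds keep the outcome rate similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 4: one performance characteristic (area under the ROC curve) per fold.
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC per fold: {auc_scores.round(3)}; mean = {auc_scores.mean():.3f}")
```

With real data, where one patient can contribute several admissions, grouping folds by subject ID (for example with `GroupKFold`) may be worth considering so that the same patient does not appear in both training and test folds.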

We have also included a conclusions notebook that details the comparisons of the 3 model types, which ones worked and didn't work, our best model, and future directions.

## Data and Cleaning

Phenotype: Patient attributes, including sex, race, and dates of birth and death.
Includes the variables: 
1. "Sex" (F or M) 
2. "DOB" (date of birth)
3. "DOD" (date of death)
4. "Race" (W = White, B = Black, A = Asian, N = Native American, H = Hispanic, U = unidentified)
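Because "Sex" and "Race" are coded values, a simple check against the documented codes can surface unexpected entries before modeling. The sketch below is a minimal example under stated assumptions: the helper `invalid_codes`, the record layout (a list of dictionaries), and the sample rows are hypothetical and do not describe how the Phenotype file is actually stored or cleaned.

```python
# Codes taken from the variable list above.
SEX_CODES = {"F", "M"}
RACE_LABELS = {
    "W": "White", "B": "Black", "A": "Asian",
    "N": "Native American", "H": "Hispanic", "U": "Unidentified",
}

def invalid_codes(records):
    """Return (row index, field, value) for each Sex or Race value outside the documented codes."""
    problems = []
    for i, rec in enumerate(records):
        if rec.get("Sex") not in SEX_CODES:
            problems.append((i, "Sex", rec.get("Sex")))
        if rec.get("Race") not in RACE_LABELS:
            problems.append((i, "Race", rec.get("Race")))
    return problems

# Made-up rows: the second has a lowercase sex code, the third an unknown race code.
sample = [
    {"Sex": "F", "Race": "W"},
    {"Sex": "m", "Race": "B"},
    {"Sex": "M", "Race": "X"},
]
problems = invalid_codes(sample)
print(problems)
```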

Cleaning of the Phenotype dataset:

BMI: Body mass index measurement information.
Includes the variables:
1. "BMI" (numeric)
2. "Date_BMI" (date M/DD/YY)
3. "BMI_Weight" (numeric, in kg)
4. "BMI_Height" (numeric, in cm)
5. "Pregnancy_Indicator" (0, 1)
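Since the BMI file carries the recorded BMI together with weight in kg and height in cm, the standard formula (weight in kg divided by height in metres squared) can be used to check that the three values agree. The sketch below assumes this kind of consistency check is wanted; the function names and the tolerance of 0.5 BMI units are hypothetical choices, not part of the dataset documentation.

```python
def bmi_from_measurements(weight_kg, height_cm):
    """Standard BMI: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def bmi_is_consistent(bmi, weight_kg, height_cm, tolerance=0.5):
    """True when the recorded BMI is within `tolerance` of the value implied by weight and height."""
    return abs(bmi - bmi_from_measurements(weight_kg, height_cm)) <= tolerance

# Made-up example: 70 kg at 175 cm implies a BMI of about 22.9.
print(round(bmi_from_measurements(70, 175), 1))
print(bmi_is_consistent(22.9, 70, 175))
print(bmi_is_consistent(30.0, 70, 175))
```

Rows that fail such a check could point to unit mix-ups (for example, height entered in inches) and may deserve a closer look rather than automatic removal.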

Cleaning of the BMI dataset:

MED: Medications information, including dose and duration.
Includes the variables:
1. "Entry_Date" (date M/DD/YY)
2. "Drug_Name" (common drug name, string)
3. "DRUG_FORM" (if a drug comes in multiple forms, the form given; e.g. nebulizer versus inhaler for albuterol)
4. "DRUG_STRENGTH" (mL, or NA)
5. "Route" (route of drug administration; e.g. IV, FLUSH, PO)
6. "Dose_Amt" (amount of drug, in variable units; e.g. g, ML/HR, units)
7. "Drug_Freq" (how often the drug is given; e.g. twice daily, once, Q1H PRN)
8. "Duration" (length of time the drug is given; e.g. months, days)