# lemonade

> An open source deep learning library for Electronic Health Record (EHR) data.

In this initial release, the library:
- implements 2 deep learning models (an LSTM and a CNN) based on popular papers
- uses synthetic EHR data, created using the open source [Synthea Patient Generator](https://github.com/synthetichealth/synthea/wiki)
- predicts 4 conditions that are on the [CDC's list of top chronic diseases](https://www.cdc.gov/chronicdisease/about/costs/index.htm) that contribute most to healthcare costs

The end goal is to 
- keep adding more model implementations 
- keep adding different publicly available datasets 
- and have a leaderboard to track which models and configurations work best on these datasets

## Install

`pip install lemonade`

## How to use

### Setup Synthea

- [Synthea - Wiki](https://github.com/synthetichealth/synthea/wiki)
    - contains details about what the project is and how to get started and generate the data.
    - download the binary - [detailed here](https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running)
        - Don't run it yet
    - Open `./src/main/resources/synthea.properties` (the default settings file) and make the following updates
        - `exporter.years_of_history = 0` - to keep entire patient history
        - `exporter.csv.export = true` - to export in csv format
        - set all other output formats to `false`
        - leave other default settings intact
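
The settings changes above can also be applied with a small script. The following is a minimal sketch, not part of this library: `update_properties` and `update_properties_file` are hypothetical helper names, and the exact list of other output formats to switch off depends on the Synthea version, so those keys are passed in by the caller rather than hard-coded.

```python
from pathlib import Path

# The two settings named in the list above. Keys for the other output formats
# vary by Synthea version, so the caller supplies them explicitly.
REQUIRED = {
    "exporter.years_of_history": "0",
    "exporter.csv.export": "true",
}


def update_properties(text, overrides):
    """Return properties text with the given keys set; other lines are kept as-is."""
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        if stripped and not is_comment and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                line = f"{key} = {remaining.pop(key)}"
        out.append(line)
    # Append any requested keys that were not already present in the file.
    out.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def update_properties_file(path, overrides=REQUIRED):
    """Rewrite a synthea.properties file in place (hypothetical helper)."""
    path = Path(path)
    path.write_text(update_properties(path.read_text(), overrides))
```

To also disable another exporter, merge its key into the overrides, for example `{**REQUIRED, "exporter.fhir.export": "false"}`, after checking that the key exists in your copy of `synthea.properties`.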

#### Generate Data
- Once Synthea is set up, the following commands generate the data.
    - Basic setup run command is: `java -jar synthea-with-dependencies.jar`
    - Developer setup run command is: `./run_synthea`
- It's important to record the run dates; we will use them during preprocessing.
- Run with the `-p` switch to control the number of patients generated, as shown in the examples below.

`./run_synthea -s 12345 -p 10000`
- run date: 12-19-2019
- {alive=10000, dead=1076}
- raw - 1.5GB

`./run_synthea -s 54321 -p 20000`
- run date: 11-5-2020
- {alive=20000, dead=2195}
- csv - 2.9GB

`./run_synthea -s 12345 -p 100000`
- run date: 4-4-2020
- {alive=100000, dead=10872}
- csv - 14.6GB
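
Because the run date is needed later, it can help to build the command and a small run record together. This is a minimal sketch under stated assumptions: `build_synthea_command` and `run_record` are hypothetical helpers, not part of this library, and the record layout is only an illustration.

```python
import json
from datetime import date


def build_synthea_command(seed, population, developer_setup=True):
    """Build the Synthea command line using the -s and -p switches shown above."""
    if developer_setup:
        base = ["./run_synthea"]
    else:
        base = ["java", "-jar", "synthea-with-dependencies.jar"]
    return base + ["-s", str(seed), "-p", str(population)]


def run_record(seed, population, run_date=None, developer_setup=True):
    """Return a dict describing one run, including the date to keep for preprocessing."""
    run_date = run_date or date.today()
    return {
        "seed": seed,
        "population": population,
        "run_date": run_date.isoformat(),
        "command": " ".join(build_synthea_command(seed, population, developer_setup)),
    }


if __name__ == "__main__":
    # Print the record; to actually launch Synthea, pass the command list to
    # subprocess.run(..., check=True) from the Synthea directory.
    print(json.dumps(run_record(12345, 10000), indent=2))
```

Saving the printed record next to the generated data keeps the run date available when preprocessing starts.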

### Copy Into DataStore / Datastore Folder Structure
- Choose a location to store all Synthea-generated data 
    - for example `~/code/datasets`
- Copy data generated by Synthea into this specific folder structure 
    - for 10K data - `~/code/datasets/synthea/10K/raw_original`
    - **Note** - Synthea outputs all csv files in a folder called `csv`; after copying into the datastore, the csv files must be in the `raw_original` folder, where this library expects them for preprocessing
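
The copy step can be sketched as follows. `copy_to_datastore` is a hypothetical helper, not part of this library; it assumes the Synthea `csv` output folder is passed in by the caller and that the size label (for example `10K`) matches the folder structure described above.

```python
import shutil
from pathlib import Path


def copy_to_datastore(synthea_csv_dir, datastore_root, size_label):
    """Copy Synthea csv files into <datastore_root>/synthea/<size_label>/raw_original."""
    src = Path(synthea_csv_dir).expanduser()
    dest = Path(datastore_root).expanduser() / "synthea" / size_label / "raw_original"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Copy the files themselves, not the enclosing `csv` folder, so they sit
    # directly inside raw_original.
    for csv_file in sorted(src.glob("*.csv")):
        shutil.copy2(csv_file, dest / csv_file.name)
        copied.append(csv_file.name)
    return dest, copied
```

For the 10K example above, a call might look like `copy_to_datastore("<synthea output folder>/csv", "~/code/datasets", "10K")`, with the first argument replaced by wherever your Synthea run wrote its `csv` folder.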
    
### Look at Quick Starts 

## Roadmap
- **A leaderboard to track which models and configurations work best on different publicly available datasets**.


- Callbacks, Mixed Precision, etc
    - Either upgrade the library to use fastai v2,
    - or, at a minimum, build functionality for fastai-style callbacks and [PyTorch AMP](https://pytorch.org/docs/stable/amp.html).

- More models
    - Pick some of the best EHR models out there and implement them.
    - **Ideas are welcome**.
- More datasets
    - Possibly eICU and MIMIC-III.
    - **Ideas are welcome**.
- NLP on clinical notes
    - Synthea does not have clinical notes, so this can only be done with other datasets.
- Predicting different conditions
    - Again, different datasets would allow this, e.g. hospitalization data (length of stay, in-patient mortality), ER data, etc.
- Integration with experiment management tools like W&B, Comet, etc.

## Known Issues & Limitations