# Deep learning for predicting the class of an epidemic curve#



In this tutorial, we'll build a (small!) multilayer perceptron, train it, and use it to predict the class of an epidemic curve. To follow along, you'll need to install Python 3 and several packages (including numpy, tensorflow, pandas, and Keras).

The dataset is available on my GitHub, [here](https://github.com/caugusta/Disease_Modelling_Club). You should download 2 files:

- train_data.csv (~200 KB)
- test_data.csv (~50 KB)




## Describing the dataset##

This dataset consists of simulated epidemic curves (counts of infectious individuals per time unit). An SIR model was used, and three types of epidemics were generated:

- one that travels relatively slowly through the population (class 0, yellow in the image below)
- one that travels very quickly through the population (class 1, grey)
- one that travels not as quickly through the population (class 2, pink)

For those of you who are interested, the population consisted of the simulated locations of 413 swine farms in Sioux County, Iowa, based on the [FLAPS online farm location simulator](http://flaps.biology.colostate.edu/) from Colorado State University's Dr. Chris Burdett. At the time of writing, **FLAPS is down**, but I'm hoping it'll be up and running again soon. This is part of the dataset that was used for this paper [1].

These epidemics will look somewhat strange, because they've been padded with 0s so that each epidemic has the same length. This is necessary because a multilayer perceptron expects every input to have the same number of values; more on that later.
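To make the padding idea concrete, here is a minimal sketch of how variable-length curves could be right-padded with zeros. The function name `pad_curves` is hypothetical, and padding to the length of the longest curve is an assumption; the files in this tutorial are already padded, so you don't need to run this yourself.

```python
import numpy as np

def pad_curves(curves):
    """Right-pad each curve with zeros so that all curves share one length.

    Assumption: the target length is the length of the longest curve.
    """
    max_len = max(len(c) for c in curves)
    padded = np.zeros((len(curves), max_len), dtype=int)
    for i, curve in enumerate(curves):
        padded[i, :len(curve)] = curve  # copy the counts; the rest stays 0
    return padded

# Two curves of different lengths (6 days and 11 days)
curves = [
    [4, 33, 107, 194, 71, 3],
    [4, 20, 40, 79, 104, 86, 49, 21, 6, 2, 1],
]
padded = pad_curves(curves)
print(padded.shape)  # both rows now have the same number of columns
```

The zeros carry no new information about the epidemic; they only give every row the same number of columns.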

An epidemic curve from class 0 looks like this:

4 19 17 28 30 40 51 46 31 40 29 24 11 10 9 1 3 9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

This means that 4 farms were initially infected with the disease (day 1 had 4 infectious farms); day 2 had 19 infectious farms, etc.

The '0' on the end of this particular epidemic means that this epidemic curve was generated from class 0 (the relatively slow type). In this epidemic, 4+19+...+1 = 403 farms were infected. This epidemic lasted 19 days.

An epidemic curve from class 1 looks like this:

4 33 107 194 71 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Note there are many more infectious farms at the beginning. 412 farms got infected in 6 days - that's a wildfire-fast epidemic! And the '1' at the end denotes that this epidemic belongs to class 1.

An epidemic curve from class 2 looks like this:

4 20 40 79 104 86 49 21 6 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2

This epidemic lasted 11 days.
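The totals and durations quoted above can be checked with a few lines of Python. This is just a sketch: the helper name `summarize` is hypothetical, and it assumes (as described above) that the last value in each row is the class label and that the duration is the last day with at least one infectious farm.

```python
# The three example rows from the text; the final value in each row is the class label.
rows = [
    "4 19 17 28 30 40 51 46 31 40 29 24 11 10 9 1 3 9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
    "4 33 107 194 71 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1",
    "4 20 40 79 104 86 49 21 6 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2",
]

def summarize(row):
    """Return (label, total farms infected, duration in days) for one row."""
    values = [int(v) for v in row.split()]
    counts, label = values[:-1], values[-1]
    total = sum(counts)
    # Duration: the 1-based index of the last day with a nonzero count
    duration = max(day for day, c in enumerate(counts, start=1) if c > 0)
    return label, total, duration

summaries = [summarize(r) for r in rows]
for label, total, duration in summaries:
    print(f"class {label}: {total} farms infected over {duration} days")
```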

If you were to graph the epidemic curves (number of infectious farms vs index of day), they would look like this:

<img src="ThreeEpidemics_DMC-1.png" width="500" height="500">

These three classes of epidemic look very different from one another, so we would expect a classifier to perform very well. This is an extreme example, to show how the classifier works, but in reality there could be some epidemics from the yellow group that look more like epidemics from the pink group. The power of these types of models is in how they learn to distinguish types of epidemic curves _even when those epidemic curves look very similar_. For this tutorial, the goal is to understand more about how a multilayer perceptron works, so that's another topic for another day.


## Building the classifier: a multilayer perceptron (MLP)##


Okay, let's get started with some code. In Python, the first thing we do is import the various modules we'll be using. If you run the code below, there shouldn't be any output, except perhaps 'Using TensorFlow backend'.


In [135]:
import keras
import string
import pandas as pd
import tensorflow as tf
import numpy as np
from keras import Model
from keras.layers import Dense
from keras.optimizers import SGD

Now we want to read in the data we're planning to use. Note you will need to change the path below (the './Desktop/DMC/New_Presentation/Disease_Modelling_Club' part) to the location where you saved train_data.csv and test_data.csv. We'll load the data and manipulate it into a usable form.

Keras, the API we'll be using to build our MLP, requires numpy arrays as input. So we have to convert each of our epidemic curves to numpy arrays.

In [136]:
#Reading in the data

with open('./Desktop/DMC/New_Presentation/Disease_Modelling_Club/train_data.csv', 'r') as f:
    train_data = f.readlines()
    
with open('./Desktop/DMC/New_Presentation/Disease_Modelling_Club/test_data.csv', 'r') as f:
    test_data = f.readlines()
    
#Right now, train_data is a list of strings.    
#We can visualize the first line of the training data to make sure things read in properly:

#train_data[0]
#'4,19,17,28,30,40,51,46,31,40,29,24,11,10,9,1,3,9,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\n'


def parse_lines(lines):
    """Convert a list of comma-separated strings into a 2-D integer numpy array."""
    #rstrip() removes the \n from the end of each line.
    #split(',') keeps multi-digit counts such as '19' intact; calling list(s)
    #instead would break each string into single characters ('1', '9', ',', ...).
    return np.array([[int(v) for v in s.rstrip().split(',')]
                     for s in lines if s.strip()])

train_array = parse_lines(train_data) #one row per training epidemic (2400 epidemics)
test_array = parse_lines(test_data)   #one row per test epidemic (600 epidemics)

#The last value in each row is the class label (0, 1 or 2);
#everything before it is the padded epidemic curve.
x_train, y_train = train_array[:, :-1], train_array[:, -1]
x_test, y_test = test_array[:, :-1], test_array[:, -1]

In [137]:
#Check that the first training epidemic was parsed into integers,
#not into single characters:
train_array[0]

As is common in machine learning, we should normalize the input before we feed it in to our model. That means we'll subtract the mean of the training data and divide by the standard deviation.
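Here is a minimal sketch of that normalization step. The small arrays below are stand-ins so the example runs on its own, and the variable names (`x_train`, `x_test`, `x_train_norm`, `x_test_norm`) are illustrative. It assumes a single mean and standard deviation computed over all training values; the same training statistics are then reused for the test set.

```python
import numpy as np

# Stand-in data: each row is (the start of) one epidemic curve, without its label
x_train = np.array([[4.0, 19.0, 17.0],
                    [4.0, 33.0, 107.0],
                    [4.0, 20.0, 40.0]])
x_test = np.array([[4.0, 25.0, 36.0]])

# Statistics come from the training data only
train_mean = x_train.mean()
train_std = x_train.std()

# Apply the same shift and scale to both sets
x_train_norm = (x_train - train_mean) / train_std
x_test_norm = (x_test - train_mean) / train_std

print(x_train_norm.mean(), x_train_norm.std())  # close to 0 and 1
```

Normalizing each day separately (with `axis=0`) is another option, but any column that is constant across the training set has a standard deviation of 0, which would cause a division by zero; in the example curves above, every epidemic starts with 4 infectious farms.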

The complete code for this entire tutorial, all together, is available on my GitHub (here) for anyone who would like to play with it! If you use these data in a presentation or publication, please cite me.

[1] Augusta, C., R. Deardon and G. W. Taylor. Deep learning for classifying epidemic curves. [Under review] Spatial and Spatio-Temporal Epidemiology. 