# Capstone: Epidemic propagation

## Objectives
Can you run a simple spatio-temporal clustering of recent historical epidemics, and can the clusters suggest where the next epidemic might emerge?
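One way to make the clustering question concrete is to link outbreaks that are close in both space and time and treat each connected group as a cluster. The sketch below is only an illustration under stated assumptions: the coordinates and years are made-up toy values, `max_km` and `max_years` are hypothetical thresholds, and the grouping is plain single-linkage rather than a tuned clustering method.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def space_time_clusters(lat, lon, year, max_km=1500.0, max_years=3):
    """Label outbreaks so that linked neighbours share a cluster id."""
    lat, lon, year = map(np.asarray, (lat, lon, year))
    n = len(lat)
    # Two outbreaks are linked when they are close in distance AND in time.
    dist = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    gap = np.abs(year[:, None] - year[None, :])
    linked = (dist <= max_km) & (gap <= max_years)

    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill the connected component that contains `start`.
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(linked[i] & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Toy rows (hypothetical coordinates and years, for illustration only).
toy_lat = [8.5, 9.5, 6.3, 24.7, 21.5]
toy_lon = [-13.2, -13.7, -10.8, 46.7, 39.2]
toy_year = [2014, 2014, 2015, 2012, 2013]
toy_labels = space_time_clusters(toy_lat, toy_lon, toy_year)
```

Each cluster's centroid (mean latitude, longitude and year) could then serve as the point at which socio-economic features are looked up.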

Can you pull in the main socio-economic features at the centroids of these clusters and run PCA to see what they have in common? What can we predict from this?
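As a rough sketch of the PCA step, the block below standardises a small feature table and projects it onto its leading principal components. It uses a plain NumPy SVD as a stand-in for a library PCA, and the feature columns and values are invented for illustration (population in millions, a development index, and a hypothetical sanitation score).

```python
import numpy as np

def pca(features, n_components=2):
    """Project standardised features onto their top principal components."""
    X = np.asarray(features, dtype=float)
    # Standardise each column so that no feature dominates by scale alone.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardised matrix: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    scores = X @ Vt[:n_components].T
    return scores, Vt[:n_components], explained[:n_components]

# Toy rows (made-up values): population (millions), development index, sanitation score.
toy_features = [
    [1.2, 0.45, 0.4],
    [1.9, 0.47, 0.3],
    [1.0, 0.43, 0.8],
    [7.0, 0.85, 2.2],
    [4.0, 0.86, 2.7],
]
scores, axes, explained = pca(toy_features)
```

Here `axes` shows how strongly each feature loads on each component, and `explained` gives the share of variance each component captures. Centroids that sit close together in `scores` share a similar socio-economic profile.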

Can you include propagation of epidemics and see how they spread?
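To illustrate what modelling propagation could look like, the sketch below iterates a deliberately simple toy model: cases grow locally, and a small share of them is redistributed along travel links each step. The travel matrix, the `growth` and `mobility` parameters, and the starting case counts are all assumptions made for the example, not values from the data.

```python
import numpy as np

def simulate_spread(initial_cases, travel, growth=0.2, mobility=0.05, steps=10):
    """Iterate a toy model: local growth plus redistribution along travel links."""
    cases = np.asarray(initial_cases, dtype=float)
    travel = np.asarray(travel, dtype=float)
    # Row-normalise so shares[i, j] is the share of city i's travellers going to city j.
    shares = travel / travel.sum(axis=1, keepdims=True)
    history = [cases.copy()]
    for _ in range(steps):
        stay = (1 - mobility) * cases            # cases that remain in place
        moved = mobility * (shares.T @ cases)    # cases carried along travel links
        cases = (1 + growth) * (stay + moved)    # growth applied after movement
        history.append(cases.copy())
    return np.array(history)

# Toy travel volumes between three hypothetical cities (rows = origin, columns = destination).
toy_travel = [[0, 80, 20],
              [50, 0, 50],
              [10, 90, 0]]
toy_history = simulate_spread([100, 0, 0], toy_travel, steps=5)
```

Each row of `toy_history` is one time step, so reading down a column shows when a city receives its first imported cases. A fuller model would replace the fixed growth rate with something like an SIR compartment model per city.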

Therefore: is the Coronavirus predictable? Where will the next epidemic be, and how will it spread?
    
## Data required:
1. List of all recent epidemics (with location and year), including: SARS, Coronavirus, MERS, Ebola, Zika, Bird Flu
2. Socio-economic data for each outbreak location: longitude / latitude, population, development index, cleanliness, etc.
3. Epidemic propagation: airport traffic at each infected city, volume of travel between cities affected

In [1]:
import pandas as pd
import glob
import os
pd.set_option('display.max_rows',100, 'display.max_columns',100, 'display.max_colwidth',1000)

In [None]:
# Epidemics list:

epidemics_list = pd.read_csv('epidemics_list.csv')

epi_interest = ['SARS coronavirus',
                'Ebola',
                'Middle East respiratory disease',
                'Zika virus',
                'Novel coronavirus (2019-nCoV)']

# Target fields for each epidemic:
# date, city, country, long/lat, no. confirmed cases

# Inspect the disease labels to find the spellings used in the file
epidemics_list.Disease.unique()
# Ebola appears under two different labels, so match both
epidemics_list[epidemics_list['Disease'].isin(['Ebola','Ebola virus disease\n\nEbola virus virion'])]
epidemics_list[epidemics_list['Disease'].isin(['SARS coronavirus'])]

In [None]:
# Ebola (old):
ebola = pd.read_csv('ebola_data.csv')
ebola['Date'] = pd.to_datetime(ebola['Date'])
# Merge the duplicated country labels into their base names
ebola['Country'] = ebola['Country'].replace({'Guinea 2': 'Guinea', 'Liberia 2': 'Liberia'})
# Exploratory aggregate (result is not stored)
ebola.groupby(['Indicator','Country','Date']).sum()
# Sort chronologically and renumber the rows; assign the result so the new index is kept
ebola = ebola.sort_values('Date').reset_index(drop=True)
ebola[(ebola['Indicator']=='Cumulative number of confirmed Ebola cases') & (ebola['Country']=='Sierra Leone')]

In [2]:
# Ebola:
path = r'ebola-master/sl_data'
filenames = glob.glob(path + "/*.csv")

inner = []

for filename in filenames:
    inner.append(pd.read_csv(filename, index_col=None, header=0))
    
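# The files do not all share the same columns; sort=True makes explicit the
# alphabetical column order that the original run produced by default
ebola = pd.concat(inner, ignore_index=True, sort=True)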

In [None]:
# Data Cleaning:

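# Drop the percentage rows and work on copies to avoid chained-assignment warnings
ebola = ebola[ebola['variable']!='percent_seen'].copy()
ebola['date'] = pd.to_datetime(ebola['date'])
ebola_cum = ebola[ebola['variable']=='cum_confirmed'].copy()
# Strip thousands separators from the district columns, then convert them to numbers.
# replace(..., inplace=True) returns None, so the result is assigned back instead.
district_cols = ebola_cum.loc[:,'34 Military Hospital':'Western area urban'].columns
ebola_cum[district_cols] = ebola_cum[district_cols].replace({',': ''}, regex=True).apply(pd.to_numeric)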

In [None]:
ebola_cum.dtypes

In [None]:
ebola_cum.fillna(0).info()