# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [113]:
import pandas as pd
import numpy as np
import os
DATA_FOLDER = './Data/' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average* per year of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

# Task 1: Clean code
This section contains the cleaned-up, commented code. For a more detailed look at what we did, see the [Steps section](#task_1_steps). We start by loading the reports for the three countries:

In [115]:
import numpy as np
import pandas as pd

from pandas import IndexSlice as pidx

import glob

csv_files_guinea = glob.glob(os.path.join(DATA_FOLDER, "ebola/guinea_data/*.*"))
csv_files_liberia = glob.glob(os.path.join(DATA_FOLDER, "ebola/liberia_data/*.*"))
csv_files_sl = glob.glob(os.path.join(DATA_FOLDER, "ebola/sl_data/*.*"))

# DataFrame.append was removed in pandas 2.0, so each frame is built with one pd.concat.
# sort=True sorts the union of columns when the files do not all share the same columns.
frame_guinea = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_guinea], sort=True)
frame_liberia = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_liberia], sort=True)
frame_sl = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_sl], sort=True)

## Liberia
Let's start by looking at the Liberia frame:

In [116]:
frame_liberia.head(50)

Unnamed: 0,Bomi County,Bong County,Date,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18,Variable
0,,,6/16/2014,,,,,,1.0,,,0.0,1.0,,,,,,Specimens collected
1,,,6/16/2014,,,,,,0.0,,,0.0,0.0,,,,,,Specimens pending for testing
2,,,6/16/2014,,,,,,21.0,,,7.0,28.0,,,,,,Total specimens tested
3,,,6/16/2014,,,,,,1.0,,,0.0,2.0,,,,,,Newly reported deaths
4,,,6/16/2014,,,,,,4.0,,,0.0,8.0,,,,,,Total death/s in confirmed cases
5,,,6/16/2014,,,,,,2.0,,,0.0,6.0,,,,,,Total death/s in probable cases
6,,,6/16/2014,,,,,,2.0,,,0.0,2.0,,,,,,Total death/s in suspected cases
7,,,6/16/2014,,,,,,8.0,,,0.0,16.0,,,,,,"Total death/s in confirmed, probable, suspecte..."
8,,,6/16/2014,,,,,,,,,,,,,,,,Case Fatality Rate (CFR) - Confirmed & Probabl...
9,,,6/16/2014,,,,,,41.0,,,0.0,41.0,,,,,,Newly reported contacts


Looking at the variables, we see a "Total death/s in confirmed, probable, suspected cases" field, which is the one we are looking for. The "Total death/s in suspected cases", "Total death/s in probable cases" and "Total death/s in confirmed cases" fields could also be taken separately and added up, but it is easier to take the total directly. At a glance the total looks right, and it should be if it was generated computationally, although summing the three values for every date would be a good way to verify it. For now we simply assume the total is correct. For new cases there is no total: we only have the three variables "New Case/s (Suspected)", "New Case/s (Probable)" and "New case/s (confirmed)", so we will add them up.
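
The consistency check mentioned above could look roughly like the sketch below. It runs on a small made-up frame (the numbers are invented, and only a `National` column is used), so it illustrates the idea rather than the actual verification of the Liberia reports.

```python
import pandas as pd

total_label = "Total death/s in confirmed, probable, suspected cases"
parts = [
    "Total death/s in confirmed cases",
    "Total death/s in probable cases",
    "Total death/s in suspected cases",
]

# Toy rows mimicking the Liberia layout (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Date": ["6/16/2014"] * 4 + ["6/17/2014"] * 4,
        "Variable": (parts + [total_label]) * 2,
        "National": [8.0, 6.0, 2.0, 16.0, 9.0, 6.0, 2.0, 18.0],
    }
)

# One row per date, one column per variable.
wide = toy.pivot(index="Date", columns="Variable", values="National")

# Dates where the reported total differs from the sum of its three components.
component_sum = wide[parts].sum(axis=1)
mismatch = wide[component_sum != wide[total_label]]
print(mismatch[[total_label]])
```

Any date listed in `mismatch` would be a report whose total should not be trusted blindly.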

In [117]:
total_deaths_string = 'Total death/s in confirmed, probable, suspected cases'
new_cases_suspected = "New Case/s (Suspected)"
new_cases_probable = "New Case/s (Probable)"
new_cases_confirmed = "New case/s (confirmed)"

One very important thing before starting: we must tell pandas that the Date field is a date and not a string:

In [118]:
frame_liberia.Date = pd.to_datetime(frame_liberia.Date)

Now, let's do some useful preprocessing: we set the index, replace NaN values with zero, and sort by index:

In [119]:
frame_liberia = frame_liberia.set_index(['Date', 'Variable']).fillna(0).sort_index()
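
To see why this index layout is convenient, here is a minimal sketch on a made-up frame (the rows and values are invented). With a sorted `(Date, Variable)` MultiIndex, `pd.IndexSlice` can select a date range and a variable in a single `.loc` call, which is the pattern used in the rest of this section.

```python
import pandas as pd

# Toy frame with the same (Date, Variable) layout; values are made up.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-07-01", "2014-06-16", "2014-06-16", "2014-06-17"]
        ),
        "Variable": [
            "Newly reported deaths",
            "Newly reported deaths",
            "Specimens collected",
            "Newly reported deaths",
        ],
        "National": [3.0, 2.0, None, 1.0],
    }
)

# Same preprocessing as above: index, fill missing values, sort.
indexed = toy.set_index(["Date", "Variable"]).fillna(0).sort_index()

# Select one variable over a date range.
idx = pd.IndexSlice
june_deaths = indexed.loc[idx["2014-06-01":"2014-06-30", "Newly reported deaths"], :]
print(june_deaths)
```

Note that `fillna(0)` treats a missing value as a reported zero, which is a simplification.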

Now let's check if everything went correctly:

In [120]:
frame_liberia.head(50)

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
2014-06-16,Case Fatality Rate (CFR) - Confirmed & Probable Cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Contacts lost to follow-up,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Contacts seen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,95.0,0.0,0.0,0.0,95.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Contacts who completed 21 day follow-up,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Cumulative admission/isolation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Cumulative cases among HCW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Cumulative deaths among HCW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,Currently under follow-up,0.0,0.0,0.0,0.0,0.0,0.0,0.0,95.0,0.0,0.0,0.0,95.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,New Case/s (Probable),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2014-06-16,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0


Let's also have a look at the tail:

In [121]:
frame_liberia.tail(50)

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
2014-12-08,Newly Reported deaths in HCW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Newly reported contacts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Newly reported deaths,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Specimens collected,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Specimens pending for testing,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Total Number of Confirmed Cases \r\n of Guinean Nationality,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Total Number of Confirmed Cases \r\n of Sierra Leonean Nationality,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Total confirmed cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Total contacts listed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-12-08,Total death/s in confirmed cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Everything seems fine! Looking at the data folder, we see that the dates range from June to December. So let's prepare some tools for slicing the frame:

In [122]:
month_starts = pd.date_range(start="6/1/2014", end="12/9/2014", freq="MS")
month_ends = pd.date_range(start="6/1/2014", end="12/31/2014", freq="M")
month_starts = [str(entry.date()) for entry in month_starts]
month_ends = [str(entry.date()) for entry in month_ends]
months=[]
for i in range(len(month_starts)):
    months.append(slice(month_starts[i], month_ends[i]))

Quick check to see if everything is fine:

In [123]:
months

[slice('2014-06-01', '2014-06-30', None),
 slice('2014-07-01', '2014-07-31', None),
 slice('2014-08-01', '2014-08-31', None),
 slice('2014-09-01', '2014-09-30', None),
 slice('2014-10-01', '2014-10-31', None),
 slice('2014-11-01', '2014-11-30', None),
 slice('2014-12-01', '2014-12-31', None)]
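
As an aside, the same per-month grouping can also be obtained without explicit slices by grouping on the month of a `DatetimeIndex`. This is only a sketch on an invented series, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Made-up daily values indexed by report date.
dates = pd.to_datetime(["2014-06-16", "2014-06-17", "2014-07-01", "2014-07-03"])
new_cases = pd.Series([2.0, 4.0, 1.0, 3.0], index=dates)

# Average over the reports that fall in each calendar month.
monthly_avg = new_cases.groupby(new_cases.index.month).mean()
print(monthly_avg)
```

The explicit slices used in this notebook have the advantage of also listing months with no report at all, which a `groupby` silently omits.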

Excellent! Now we have our slices. Before going further, one important remark: the dataset is not clean. Among the 100 available dates, 24 label the total deaths as 'Total death/s in confirmed, \r\n probable, suspected cases'. (If the selection below does not work, try \n instead of \r\n. Different operating systems seem to read the CSV data differently: \r\n on Windows 10 and \n on Ubuntu.)

In [124]:
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), 'Total death/s in confirmed, \r\n probable, suspected cases']), :]
selection.shape

(24, 17)

And 76 have the string 'Total death/s in confirmed, probable, suspected cases' without the '\n':

In [125]:
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), 'Total death/s in confirmed, probable, suspected cases']), :]
selection.shape

(76, 17)

This is really cumbersome, but there are several ways to deal with it. We could rewrite the labels in the dataframe so that they all match, or we can simply search for both strings. We choose the second method because it is more lightweight:

In [126]:
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), ['Total death/s in confirmed, probable, suspected cases', 'Total death/s in confirmed, \n probable, suspected cases']]), :]
selection.shape

(76, 17)

In [127]:
selection

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
2014-06-16,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,0.0
2014-06-17,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,8.0,16.0,0.0,0.0,0.0,0.0,0.0
2014-06-22,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,11.0,25.0,0.0,0.0,0.0,0.0,0.0
2014-06-24,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,15.0,32.0,0.0,0.0,0.0,0.0,0.0
2014-06-25,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,0.0,0.0,18.0,37.0,0.0,0.0,0.0,0.0,0.0
2014-06-28,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0,2.0,0.0,20.0,49.0,0.0,0.0,0.0,0.0,0.0
2014-06-29,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0,2.0,0.0,20.0,49.0,0.0,0.0,0.0,0.0,0.0
2014-07-01,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,2.0,0.0,27.0,61.0,0.0,0.0,0.0,0.0,0.0
2014-07-02,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,35.0,2.0,0.0,29.0,66.0,0.0,0.0,0.0,0.0,0.0
2014-07-03,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,35.0,2.0,0.0,33.0,70.0,0.0,0.0,0.0,0.0,0.0


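The shape is still (76, 17): on this machine the line break is read as \r\n, so the \n variant matches nothing. To be safe, let us change our string to a list covering all three variants: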

In [128]:
total_deaths_string = ['Total death/s in confirmed, probable, suspected cases', 'Total death/s in confirmed, \n probable, suspected cases', 'Total death/s in confirmed, \r\n probable, suspected cases']
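
The other option mentioned earlier, rewriting the labels so that they all match, could be sketched as follows. The idea is to collapse every run of whitespace (including `\n` and `\r\n`) into a single space. The `labels` series here is a stand-in for the `Variable` column, so this is an illustration under that assumption rather than the code used in this notebook.

```python
import pandas as pd

# The three spellings of the same label seen across the reports.
labels = pd.Series(
    [
        "Total death/s in confirmed, probable, suspected cases",
        "Total death/s in confirmed, \n probable, suspected cases",
        "Total death/s in confirmed, \r\n probable, suspected cases",
    ]
)

# Collapse any run of whitespace into one space, then trim the ends.
normalized = labels.str.replace(r"\s+", " ", regex=True).str.strip()
print(normalized.unique())
```

Applied to the `Variable` column before `set_index`, this would leave a single label to search for, whatever the operating system.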

Now we have to make sure everything is fine for the other values (suspected, probable and confirmed new cases):

In [129]:
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), new_cases_confirmed]), :]
print("Confirmed: " + str(selection.shape))
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), new_cases_probable]), :]
print("Probable: " + str(selection.shape))
selection = frame_liberia.loc[(pidx[slice('2014-06-01', '2014-12-30', None), new_cases_suspected]), :]
print("Suspected: " + str(selection.shape))

Confirmed: (100, 17)
Probable: (100, 17)
Suspected: (100, 17)


In [130]:
selection

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
2014-06-16,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2014-06-17,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2014-06-22,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
2014-06-24,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2014-06-25,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
2014-06-28,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2014-06-29,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-07-01,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
2014-07-02,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2014-07-03,New Case/s (Suspected),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


It is 100 for each of them so we don't have the same problem!

In [131]:
dataframe_liberia_avg = pd.DataFrame(columns=['month', 'Liberia: death daily avg', 'Liberia: new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    selection_new_cases_confirmed = frame_liberia.loc[(pidx[month, new_cases_confirmed]), :]
    selection_new_cases_probable = frame_liberia.loc[(pidx[month, new_cases_probable]), :]
    selection_new_cases_suspected = frame_liberia.loc[(pidx[month, new_cases_suspected]), :]
    
    # Uncomment these lines to check that all have the same shape:
    # print(selection_new_cases_confirmed.shape)
    # print(selection_new_cases_probable.shape)
    # print(selection_new_cases_suspected.shape)
    # print("\n")
    
    selection_deaths = frame_liberia.loc[(pidx[month, total_deaths_string]), :]
    size = selection_deaths.shape[0]
    if size > 0:
        # Note: .sum().sum() adds up every column, i.e. the counties plus 'National'.
        avg_daily_deaths = selection_deaths.sum().sum() / size
        # We sum the three kinds of new cases:
        sum_new_cases = selection_new_cases_confirmed.sum().sum() + selection_new_cases_probable.sum().sum() + selection_new_cases_suspected.sum().sum()
        avg_daily_new_cases = sum_new_cases / selection_new_cases_confirmed.shape[0]
    else:
        # No report for this month: set both averages so that neither is undefined or stale.
        avg_daily_deaths = 0
        avg_daily_new_cases = 0

    dataframe_liberia_avg.loc[i] = [int(begin_date.month), avg_daily_deaths , avg_daily_new_cases]
dataframe_liberia_avg.set_index('month')

| month | Liberia: death daily avg | Liberia: new cases daily avg |
|---|---|---|
| 6.0 | 62.857143 | 11.428571 |
| 7.0 | 188.909091 | 17.090909 |
| 8.0 | 1036.222222 | 74.444444 |
| 9.0 | 2834.583333 | 127.666667 |
| 10.0 | 4515.08 | 91.68 |
| 11.0 | 5550.4 | 57.8 |
| 12.0 | 6417.111111 | 10263.111111 |


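That's it! We are done with Liberia. The upward trend is consistent with Ebola spreading more and more over time, with two caveats: the death figures average a cumulative total rather than daily deaths, and `.sum().sum()` adds the `National` column on top of the individual counties. The December new-cases average (10263.11) is also far out of line with the other months and deserves a closer look at the raw reports. Now let's do Guinea: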

## Guinea
Let's go a little bit faster this time! First, let's see what we have:

In [132]:
frame_guinea.head(50)

Unnamed: 0,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Date,Description,Dinguiraye,Dubreka,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
0,,0.0,5.0,,0.0,,2014-08-04,New cases of suspects,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,5.0,
1,,0.0,0.0,,0.0,,2014-08-04,New cases of probables,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,
2,,0.0,1.0,,0.0,,2014-08-04,New cases of confirmed,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,4.0,
3,,0.0,6.0,,0.0,,2014-08-04,Total new cases registered so far,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,9.0,
4,,0.0,9.0,,0.0,,2014-08-04,Total cases of suspects,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,11.0,
5,,5.0,8.0,,3.0,,2014-08-04,Total cases of probables,1.0,0.0,...,2.0,,11.0,,0.0,1.0,0.0,3.0,133.0,
6,,18.0,78.0,,1.0,,2014-08-04,Total cases of confirmed,0.0,0.0,...,2.0,,28.0,,4.0,1.0,6.0,23.0,351.0,
7,,23.0,95.0,,4.0,,2014-08-04,Cumulative (confirmed + probable + suspects),1.0,0.0,...,4.0,,39.0,,4.0,2.0,6.0,26.0,495.0,
8,,0.0,0.0,,0.0,,2014-08-04,New deaths registered today,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,2.0,
9,,0.0,0.0,,0.0,,2014-08-04,New deaths registered today (confirmed),0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,2.0,


We see the 'Total new cases registered so far' variable here. Visually we can check that it is the sum of three variables: 'New cases of suspects', 'New cases of probables' and 'New cases of confirmed'. Again, we will assume that this sum is correct (it should be if the dataset was constructed automatically), although we could check it for every date. Then we have 'Total deaths (confirmed + probables + suspects)', which is the sum of the 5 rows coming before it. We will use this value, although we could instead have summed only the three rows 'Total deaths of suspects', 'Total deaths of probables' and 'Total deaths of confirmed'. That would be another approach to try, but we will keep it simple.

In [133]:
total_deaths_string = "Total deaths (confirmed + probables + suspects)"
new_cases_string = "Total new cases registered so far"

In [134]:
frame_guinea.Date = pd.to_datetime(frame_guinea.Date)
frame_guinea = frame_guinea.set_index(['Date', 'Description']).fillna(0).sort_index()

In [135]:
frame_guinea.head(50)

Date,Description,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Dinguiraye,Dubreka,Forecariah,Gueckedou,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
2014-08-04,Cumulative (confirmed + probable + suspects),0,23,95,0,4,0,1,0,0,285,...,4,0,39,0,4,2,6,26,495,0
2014-08-04,New cases of confirmed,0,0,1,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,4,0
2014-08-04,New cases of probables,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2014-08-04,New cases of suspects,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5,0
2014-08-04,New deaths registered today,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,2,0
2014-08-04,New deaths registered today (confirmed),0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,2,0
2014-08-04,New deaths registered today (probables),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2014-08-04,New deaths registered today (suspects),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2014-08-04,Number of confirmed cases among health workers,0,0,16,0,0,0,0,0,0,3,...,0,0,1,0,0,0,2,1,21,0
2014-08-04,Number of contacts followed yesterday,0,0,331,0,0,0,0,0,0,200,...,37,0,44,0,0,26,134,0,772,0


Visually everything seems alright. Now we run the same checks as for Liberia: we want 22 rows for both strings. You can convince yourself that there are 22 different dates by opening a terminal and running the following in the guinea_data folder:
```bash
ls -1 | wc -l
```
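
The same count can be obtained from Python. The sketch below assumes the reports are `.csv` files sitting directly in the folder, and it uses a temporary folder with two invented file names so that it runs on its own. The helper `count_reports` is a hypothetical name introduced here for illustration.

```python
import glob
import os
import tempfile


def count_reports(folder):
    """Return the number of CSV files directly inside `folder`."""
    return len(glob.glob(os.path.join(folder, "*.csv")))


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2014-08-04.csv", "2014-08-26.csv"):
        open(os.path.join(tmp, name), "w").close()
    n_reports = count_reports(tmp)

print(n_reports)
```

In the notebook, one would call it on `os.path.join(DATA_FOLDER, "ebola/guinea_data")` instead.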

In [136]:
selection = frame_guinea.loc[(pidx[slice('2014-06-01', '2014-12-31', None), total_deaths_string]), :]
print("Total deaths string: " + str(selection.shape))
selection = frame_guinea.loc[(pidx[slice('2014-06-01', '2014-12-31', None), new_cases_string]), :]
print("New cases string: " + str(selection.shape))

Total deaths string: (22, 23)
New cases string: (22, 23)


22, that's good! But here an interesting problem occurs: the datatypes are all 'object' and not 'float64'. Let's try to convert them:

In [137]:
frame_guinea.apply(pd.to_numeric)

ValueError: ('Unable to parse string "100%" at position 683', 'occurred at index Beyla')

Now this is annoying! If we look at position 683, we see percentages instead of integers:

In [138]:
frame_guinea.iloc[683]

Beyla          100%
Boffa           67%
Conakry         39%
Coyah           42%
Dabola         100%
Dalaba           0%
Dinguiraye     100%
Dubreka         33%
Forecariah      56%
Gueckedou       83%
Kerouane        21%
Kindia           0%
Kissidougou     88%
Kouroussa      100%
Lola           100%
Macenta         61%
Mzerekore         0
Nzerekore       64%
Pita            38%
Siguiri         50%
Telimele        38%
Totals          62%
Yomou           45%
Name: (2014-10-01 00:00:00, Fatality rate for confirmed and probables), dtype: object

But this is the 'Fatality rate for confirmed and probables' variable, which we do not need, so we won't bother with it. We will just convert the rows we use to numeric values.
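
If the percentage rows did matter, `pd.to_numeric` offers an `errors="coerce"` option that turns unparseable entries into NaN instead of raising. A minimal sketch on invented values follows. It is not what the code below does, and coercion should be used with care since it silently hides bad values.

```python
import pandas as pd

# Made-up column mixing plain numbers and a percentage string.
raw = pd.Series(["3", "100%", "0", "12"])

# Unparseable entries become NaN rather than raising a ValueError.
as_numbers = pd.to_numeric(raw, errors="coerce")

# Alternatively, strip the percent sign first to keep the value itself.
without_percent = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")

print(as_numbers.tolist())
print(without_percent.tolist())
```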

In [139]:
dataframe_guinea_avg = pd.DataFrame(columns=['month', 'Guinea: death daily avg', 'Guinea: new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    selection_new_cases = frame_guinea.loc[(pidx[month, new_cases_string]), :].apply(pd.to_numeric)
    
    selection_deaths = frame_guinea.loc[(pidx[month, total_deaths_string]), :].apply(pd.to_numeric)
    size = selection_deaths.shape[0]
    if(size > 0):
        avg_daily_deaths = selection_deaths.sum().sum() / selection_deaths.shape[0]
        avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    else :
        avg_daily_deaths = 0
        avg_daily_new_cases = 0
    
    #avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    dataframe_guinea_avg.loc[i] = [int(begin_date.month), avg_daily_deaths , avg_daily_new_cases]
dataframe_guinea_avg.set_index('month')

| month | Guinea: death daily avg | Guinea: new cases daily avg |
|---|---|---|
| 6.0 | 0.0 | 0.0 |
| 7.0 | 0.0 | 0.0 |
| 8.0 | 1079.2 | 51.6 |
| 9.0 | 1206.31 | 39.0625 |
| 10.0 | 1475.0 | 68.0 |
| 11.0 | 0.0 | 0.0 |
| 12.0 | 0.0 | 0.0 |


## Sierra Leone
Now let's move on to Sierra Leone! We will proceed in a similar fashion. First we take a look at the head of the frame:

In [140]:
frame_sl.head(50)

Unnamed: 0,34 Military Hospital,Bo,Bo EMC,Bombali,Bonthe,Hastings-F/Town,Kailahun,Kambia,Kenema,Kenema (IFRC),...,Port Loko,Pujehun,Tonkolili,Unnamed: 18,Western area,Western area combined,Western area rural,Western area urban,date,variable
0,,654142,,494139,168729.0,,465048,341690.0,653013,,...,557978,335574,434937,,,,263619,1040888,2014-08-12,population
1,,0,,0,0.0,,0,0.0,3,,...,1,0,0,,,,0,0,2014-08-12,new_noncase
2,,1,,0,0.0,,0,0.0,9,,...,0,0,0,,,,0,0,2014-08-12,new_suspected
3,,1,,0,0.0,,0,0.0,0,,...,0,0,0,,,,0,0,2014-08-12,new_probable
4,,0,,0,0.0,,0,0.0,9,,...,2,0,0,,,,0,0,2014-08-12,new_confirmed
5,,54,,10,0.0,,201,1.0,269,,...,7,2,10,,,,5,56,2014-08-12,cum_noncase
6,,2,,1,0.0,,0,0.0,10,,...,0,0,0,,,,0,0,2014-08-12,cum_suspected
7,,1,,1,0.0,,32,0.0,0,,...,1,0,0,,,,0,1,2014-08-12,cum_probable
8,,22,,7,1.0,,378,1.0,259,,...,24,3,2,,,,1,13,2014-08-12,cum_confirmed
9,,0,,0,0.0,,2,0.0,1,,...,1,0,0,,,,0,1,2014-08-12,death_suspected


It seems we have no total deaths or total new cases here. But we have all we need: 'new_suspected', 'new_probable' and 'new_confirmed' for new cases and 'death_suspected', 'death_probable' and 'death_confirmed' for deaths. Depending on what we want to do we might also take the new_noncase into account but we won't do that here. 

In [141]:
new_suspected = 'new_suspected'
new_probable = 'new_probable'
new_confirmed = 'new_confirmed'

death_suspected = 'death_suspected'
death_probable = 'death_probable'
death_confirmed = 'death_confirmed'

Let's just check the datatypes before anything else:

In [142]:
frame_sl.dtypes

34 Military Hospital      float64
Bo                         object
Bo EMC                    float64
Bombali                    object
Bonthe                     object
Hastings-F/Town           float64
Kailahun                   object
Kambia                     object
Kenema                     object
Kenema (IFRC)             float64
Kenema (KGH)               object
Koinadugu                  object
Kono                       object
Moyamba                    object
National                   object
Police training School    float64
Police traning School     float64
Port Loko                  object
Pujehun                    object
Tonkolili                  object
Unnamed: 18               float64
Western area              float64
Western area combined     float64
Western area rural         object
Western area urban         object
date                       object
variable                   object
dtype: object

We see that some columns are not numeric, so we will have to be careful (if we try to parse them, we find percentage values, as in the Guinea case). But the most important step here is to convert the date column:

In [143]:
frame_sl.date = pd.to_datetime(frame_sl.date)

Now let's set the index, sort and fill the missing values with 0s:

In [144]:
frame_sl = frame_sl.set_index(['date', 'variable']).fillna(0).sort_index()

Now let's take a look:

In [145]:
frame_sl.head(50)

date,variable,34 Military Hospital,Bo,Bo EMC,Bombali,Bonthe,Hastings-F/Town,Kailahun,Kambia,Kenema,Kenema (IFRC),...,Police training School,Police traning School,Port Loko,Pujehun,Tonkolili,Unnamed: 18,Western area,Western area combined,Western area rural,Western area urban
2014-08-12,cfr,0.0,9.1,0.0,14.3,0,0.0,39.9,0,40.2,0.0,...,0.0,0.0,4.2,0,0,0.0,0.0,0.0,0,15.4
2014-08-12,contacts_followed,0.0,160,0.0,161,0,0.0,321,0,497,0.0,...,0.0,0.0,466,47,11,0.0,0.0,0.0,23,310
2014-08-12,contacts_healthy,0.0,157,0.0,141,0,0.0,305,0,427,0.0,...,0.0,0.0,443,47,11,0.0,0.0,0.0,23,103
2014-08-12,contacts_ill,0.0,0,0.0,0,0,0.0,11,0,3,0.0,...,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0
2014-08-12,contacts_not_seen,0.0,3,0.0,20,0,0.0,16,0,64,0.0,...,0.0,0.0,28,0,0,0.0,0.0,0.0,0,0
2014-08-12,cum_completed_contacts,0.0,67,0.0,16,26,0.0,0,0,701,0.0,...,0.0,0.0,7,10,0,0.0,0.0,0.0,0,82
2014-08-12,cum_confirmed,0.0,22,0.0,7,1,0.0,378,1,259,0.0,...,0.0,0.0,24,3,2,0.0,0.0,0.0,1,13
2014-08-12,cum_contacts,0.0,227,0.0,161,26,0.0,0,0,1356,0.0,...,0.0,0.0,471,57,16,0.0,0.0,0.0,23,390
2014-08-12,cum_noncase,0.0,54,0.0,10,0,0.0,201,1,269,0.0,...,0.0,0.0,7,2,10,0.0,0.0,0.0,5,56
2014-08-12,cum_probable,0.0,1,0.0,1,0,0.0,32,0,0,0.0,...,0.0,0.0,1,0,0,0.0,0.0,0.0,0,1


Before counting we have to check that we get all the 103 values (there are 103 different files / dates in the sl_data folder):

In [146]:
selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), new_suspected]), :]
print("New suspected: " + str(selection.shape))
selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), new_probable]), :]
print("New probable: " + str(selection.shape))
selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), new_confirmed]), :]
print("New confirmed: " + str(selection.shape))

selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), death_suspected]), :]
print("Death suspected: " + str(selection.shape))
selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), death_probable]), :]
print("Death probable: " + str(selection.shape))
selection = frame_sl.loc[(pidx[slice('2014-06-01', '2014-12-31', None), death_confirmed]), :]
print("Death confirmed: " + str(selection.shape))

New suspected: (103, 25)
New probable: (103, 25)
New confirmed: (103, 25)
Death suspected: (103, 25)
Death probable: (103, 25)
Death confirmed: (103, 25)


That's it ! 103 values for each string! So everything is fine and we have all the values. Let's proceed with counting:

In [147]:
dataframe_sl_avg = pd.DataFrame(columns=['month', 'SL: death daily avg', 'SL: new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    selection_new_suspected = frame_sl.loc[(pidx[month, new_suspected]), :].apply(pd.to_numeric)
    selection_new_probable = frame_sl.loc[(pidx[month, new_probable]), :].apply(pd.to_numeric)
    selection_new_confirmed = frame_sl.loc[(pidx[month, new_confirmed]), :].apply(pd.to_numeric)
    
    selection_death_suspected = frame_sl.loc[(pidx[month, death_suspected]), :].apply(pd.to_numeric)
    selection_death_probable = frame_sl.loc[(pidx[month, death_probable]), :].apply(pd.to_numeric)
    selection_death_confirmed = frame_sl.loc[(pidx[month, death_confirmed]), :].apply(pd.to_numeric)
    size = selection_death_confirmed.shape[0]
    if(size > 0):
        sum_daily_deaths = selection_death_suspected.sum().sum() + selection_death_probable.sum().sum() + selection_death_confirmed.sum().sum() 
        avg_daily_deaths = sum_daily_deaths / size
        sum_daily_new_cases = selection_new_suspected.sum().sum() + selection_new_probable.sum().sum() + selection_new_confirmed.sum().sum() 
        avg_daily_new_cases = sum_daily_new_cases / size
    else :
        avg_daily_deaths = 0
        avg_daily_new_cases = 0
    
    #avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    dataframe_sl_avg.loc[i] = [int(begin_date.month), avg_daily_deaths , avg_daily_new_cases]
dataframe_sl_avg.set_index('month')

month,SL: death daily avg,SL: new cases daily avg
6.0,0.0,0.0
7.0,0.0,0.0
8.0,706.0,51.8
9.0,988.759,83.5172
10.0,2330.5,142.857
11.0,2603.33,153.81
12.0,3215.2,82.0


That's it! We have the data for the Sierra Leone dataframe. In all three dataframes, the number of deaths rises over the period for which we have data.
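The counting loop above can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact computation: `toy_sl`, its two districts and all of its numbers are invented, and `monthly_total` is a hypothetical helper. It shows the same pattern of selecting rows by month and label, converting the string cells to numbers, and dividing the total by the number of report days.

```python
import pandas as pd

# Toy stand-in for frame_sl: two report days, two variables, two districts.
# All values are invented for illustration.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2014-08-12", "2014-08-13"]), ["new_confirmed", "new_suspected"]],
    names=["date", "variable"],
)
toy_sl = pd.DataFrame(
    {"Bo": ["1", "2", "3", "4"], "Kenema": ["0", "1", None, "2"]},
    index=index,
).sort_index()

idx = pd.IndexSlice
august = slice("2014-08-01", "2014-08-31")


def monthly_total(frame, month, labels):
    """Sum every cell of the rows whose variable is in `labels` for one month."""
    rows = frame.loc[idx[month, labels], :].apply(pd.to_numeric)
    return rows.sum().sum()


# One row per report day for a single label gives the number of reports.
report_days = toy_sl.loc[idx[august, "new_confirmed"], :].shape[0]
new_cases = monthly_total(toy_sl, august, ["new_confirmed", "new_suspected"])
avg_daily_new_cases = new_cases / report_days if report_days > 0 else 0
```

Passing a list of labels to a single selection avoids one `.loc` call per category, at the cost of assuming that every label appears on the same report days.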

## Merging

Now we simply have to merge the three dataframes:

In [161]:
final_dataframe = dataframe_liberia_avg.set_index(['month']).join(dataframe_guinea_avg.set_index(['month'])).join(dataframe_sl_avg.set_index(['month']))

In [162]:
final_dataframe

month,Liberia: death daily avg,Liberia: new cases daily avg,Guinea: death daily avg,Guinea: new cases daily avg,SL: death daily avg,SL: new cases daily avg
6.0,62.857143,11.428571,0.0,0.0,0.0,0.0
7.0,188.909091,17.090909,0.0,0.0,0.0,0.0
8.0,1036.222222,74.444444,1079.2,51.6,706.0,51.8
9.0,2834.583333,127.666667,1206.31,39.0625,988.759,83.5172
10.0,4515.08,91.68,1475.0,68.0,2330.5,142.857
11.0,5550.4,57.8,0.0,0.0,2603.33,153.81
12.0,6417.111111,10263.111111,0.0,0.0,3215.2,82.0
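One detail of the merge is worth keeping in mind: `DataFrame.join` performs a left join by default, so the months of the first frame decide which rows survive. The sketch below uses two small hypothetical frames with invented numbers to show the difference between the default and an outer join.

```python
import pandas as pd

# Hypothetical per-country averages; the numbers are made up for illustration.
liberia = pd.DataFrame({"month": [6, 7, 8], "Liberia: new cases daily avg": [1.0, 2.0, 3.0]})
guinea = pd.DataFrame({"month": [8, 9], "Guinea: new cases daily avg": [4.0, 5.0]})

left = liberia.set_index("month")
right = guinea.set_index("month")

# Default left join: months missing on the right become NaN, and months that
# exist only on the right (September here) are dropped.
left_joined = left.join(right)

# An outer join keeps the union of the months instead.
outer_joined = left.join(right, how="outer")
```

In the merge above this does not lose anything as long as the first frame covers every month of the other two.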


<a id='task_1_steps'></a>
# Task 1: Steps

In [11]:
# Write your answer here
import numpy as np
import pandas as pd

from pandas import IndexSlice as pidx

import glob

In [12]:
# Collect the report files of each country.
csv_files_guinea = glob.glob(os.path.join(DATA_FOLDER, "ebola/guinea_data/*.*"))
csv_files_liberia = glob.glob(os.path.join(DATA_FOLDER, "ebola/liberia_data/*.*"))
csv_files_sl = glob.glob(os.path.join(DATA_FOLDER, "ebola/sl_data/*.*"))

# DataFrame.append was removed in pandas 2.0, so each frame is built with one concat.
# sort=True sorts the columns when the files do not all share the same columns.
frame_guinea = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_guinea], sort=True)
frame_liberia = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_liberia], sort=True)
frame_sl = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files_sl], sort=True)

In [13]:
frame_liberia

Unnamed: 0,Bomi County,Bong County,Date,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18,Variable
0,0.0,0.0,9/6/2014,0.0,0.0,0.0,0.0,,13.0,26.0,,0.0,39.0,0.0,0.0,0.0,0.0,,Specimens collected
1,7.0,0.0,9/6/2014,0.0,0.0,0.0,0.0,,13.0,20.0,,0.0,41.0,0.0,1.0,0.0,0.0,,Specimens pending for testing
2,0.0,0.0,9/6/2014,0.0,0.0,0.0,0.0,,14.0,6.0,,0.0,20.0,0.0,0.0,0.0,0.0,,Total specimens tested
3,0.0,2.0,9/6/2014,0.0,0.0,1.0,0.0,,7.0,15.0,,19.0,44.0,0.0,0.0,0.0,0.0,,Newly reported deaths
4,27.0,11.0,9/6/2014,0.0,13.0,4.0,0.0,,159.0,29.0,,229.0,502.0,30.0,0.0,0.0,0.0,,Total death/s in confirmed cases
5,14.0,13.0,9/6/2014,0.0,3.0,3.0,0.0,,154.0,56.0,,134.0,418.0,40.0,0.0,1.0,0.0,,Total death/s in probable cases
6,0.0,22.0,9/6/2014,0.0,3.0,5.0,2.0,,35.0,50.0,,159.0,293.0,12.0,5.0,0.0,0.0,,Total death/s in suspected cases
7,41.0,46.0,9/6/2014,0.0,19.0,12.0,2.0,,348.0,135.0,,522.0,1213.0,82.0,5.0,1.0,0.0,,"Total death/s in confirmed, probable, suspecte..."
8,,,9/6/2014,,,,,,,,,,,,,,,,Case Fatality Rate (CFR) - Confirmed & Probabl...
9,0.0,10.0,9/6/2014,0.0,0.0,0.0,0.0,,6.0,54.0,,132.0,202.0,0.0,0.0,0.0,0.0,,Newly reported contacts


In [14]:
frame_guinea

Unnamed: 0,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Date,Description,Dinguiraye,Dubreka,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
0,,,,,,,2014-08-26,New cases of suspects,,,...,,,12,1,,,,,18,4
1,,,,,,,2014-08-26,New cases of probables,,,...,,,,,,,,,,
2,,,,,,,2014-08-26,New cases of confirmed,,1,...,,,5,,,,,,10,3
3,,,,,,,2014-08-26,Total new cases registered so far,,1,...,,,17,1,,,,,28,7
4,,0,8,,0,,2014-08-26,Total cases of suspects,0,0,...,0,,13,1,,0,0,0,30,4
5,,5,8,,3,,2014-08-26,Total cases of probables,1,0,...,2,,12,1,,1,0,3,141,0
6,,19,98,,1,,2014-08-26,Total cases of confirmed,0,8,...,2,,99,10,,5,6,23,490,10
7,,24,114,,4,,2014-08-26,Cumulative (confirmed + probable + suspects),1,8,...,4,,124,12,,6,6,26,661,14
8,,18,195,,1,,2014-08-26,Total suspected non-class cases,0,2,...,1,,63,6,,7,7,18,518,0
9,,0,0,,0,,2014-08-26,New deaths registered,0,0,...,0,,2,0,,0,0,0,5,3


In [15]:
frame_sl

Unnamed: 0,34 Military Hospital,Bo,Bo EMC,Bombali,Bonthe,Hastings-F/Town,Kailahun,Kambia,Kenema,Kenema (IFRC),...,Port Loko,Pujehun,Tonkolili,Unnamed: 18,Western area,Western area combined,Western area rural,Western area urban,date,variable
0,,654142,,494139,168729,,465048,341690,653013,,...,557978,335574,434937,,,,,,2014-10-22,population
1,,0,,0,0,,4,0,10,,...,0,0,2,,,,12,9,2014-10-22,new_noncase
2,,3,,0,0,,0,0,4,,...,0,0,4,,,,0,0,2014-10-22,new_suspected
3,,0,,0,0,,0,0,0,,...,0,0,0,,,,0,0,2014-10-22,new_probable
4,,1,,0,0,,0,1,2,,...,0,1,8,,,,26,5,2014-10-22,new_confirmed
5,,153,,256,2,,303,17,735,,...,185,11,76,,,,332,714,2014-10-22,cum_noncase
6,,55,,74,3,,18,4,75,,...,23,6,29,,,,37,63,2014-10-22,cum_suspected
7,,1,,1,0,,32,0,0,,...,1,0,0,,,,0,1,2014-10-22,cum_probable
8,,159,,470,2,,550,30,477,,...,480,26,165,,,,396,509,2014-10-22,cum_confirmed
9,,3,,18,1,,4,6,4,,...,49,3,3,,,,4,6,2014-10-22,death_suspected


# Liberia

Let's focus on Liberia for now and index its frame by `Date` and `Variable`:

In [44]:
frame_liberia = frame_liberia.set_index(['Date', 'Variable'])

We assume that the `NaN` values were not "significant enough" to be reported, so let's say they are 0:

In [46]:
frame_liberia = frame_liberia.fillna(0)
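Whether this assumption matters depends on the aggregate. As a small sketch on an invented series: pandas skips `NaN` when summing, so filling with 0 leaves sums unchanged, but it does change a mean, because the filled cells then count in the denominator.

```python
import numpy as np
import pandas as pd

counts = pd.Series([2.0, np.nan, 4.0])  # invented values

sum_raw = counts.sum()                  # NaN is skipped by default
mean_raw = counts.mean()                # divides by the 2 non-missing values
mean_filled = counts.fillna(0).mean()   # divides by all 3 values
```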

In [74]:
frame_liberia.head(31)

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
6/16/2014,Specimens collected,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Specimens pending for testing,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Total specimens tested,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,0.0,0.0,7.0,28.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Newly reported deaths,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Total death/s in confirmed cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Total death/s in probable cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Total death/s in suspected cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,"Total death/s in confirmed, probable, suspected cases",0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Case Fatality Rate (CFR) - Confirmed & Probable Cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6/16/2014,Newly reported contacts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,41.0,0.0,0.0,0.0,41.0,0.0,0.0,0.0,0.0,0.0


In [49]:
frame_liberia.tail()

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
12/9/2014,Total probable cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12/9/2014,Total confirmed cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12/9/2014,Total Number of Confirmed Cases \r\n of Sierra Leonean Nationality,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12/9/2014,Total Number of Confirmed Cases \r\n of Guinean Nationality,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12/9/2014,"Cumulative confirmed, probable and suspected cases",293.0,557.0,39.0,158.0,165.0,10.0,36.0,645.0,1255.0,23.0,4196.0,7797.0,320.0,19.0,41.0,40.0,0.0


It seems the dates go from June to December, so let's create a date range:

In [52]:
month_starts = pd.date_range(start="6/1/2014", end="12/9/2014", freq="MS")

Let's check that the month start dates are good:

In [54]:
print(month_starts)

DatetimeIndex(['2014-06-01', '2014-07-01', '2014-08-01', '2014-09-01',
               '2014-10-01', '2014-11-01', '2014-12-01'],
              dtype='datetime64[ns]', freq='MS')


The month start dates seem fine. Let's do the same thing for the end dates:

In [57]:
month_ends = pd.date_range(start="6/1/2014", end="12/31/2014", freq="M")

In [58]:
month_ends

DatetimeIndex(['2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30',
               '2014-10-31', '2014-11-30', '2014-12-31'],
              dtype='datetime64[ns]', freq='M')

Same! Now let's do some slicing:

In [192]:
# to make indexing even easier, I'm creating a list of slices
month_starts = [str(entry.date()) for entry in month_starts]
month_ends = [str(entry.date()) for entry in month_ends]
months=[]
for i in range(len(month_starts)):
    months.append(slice(month_starts[i], month_ends[i]))

In [193]:
months

[slice('2014-08-01', '2014-08-31', None),
 slice('2014-09-01', '2014-09-30', None),
 slice('2014-10-01', '2014-10-31', None)]

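That's great! The list should hold one slice per month, from June to December. The output above shows only August to October, presumably because the cell was re-run after `month_starts` and `month_ends` had been redefined for a shorter range.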
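The start and end dates can also be derived from a single object. The following is an alternative sketch, not the method used in this notebook: it builds the same kind of slice list from `pd.period_range`, where each monthly period knows its own first and last day.

```python
import pandas as pd

# One Period per month from June to December 2014.
periods = pd.period_range(start="2014-06", end="2014-12", freq="M")

# Each slice runs from the first to the last calendar day of its month.
months = [slice(str(p.start_time.date()), str(p.end_time.date())) for p in periods]
```

This keeps the two ends of every slice in step, since they no longer come from two separately built date ranges.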

In [194]:
pidx[months[0]]

slice('2014-08-01', '2014-08-31', None)

In [101]:
frame_liberia = frame_liberia.sort_index()


In [109]:
frame_liberia_copy = frame_liberia.copy()

In [111]:
frame_liberia_copy.Date = pd.to_datetime(frame_liberia_copy.Date)

In [117]:
frame_liberia_copy = frame_liberia_copy.set_index(["Date", "Variable"])

In [119]:
frame_liberia_copy = frame_liberia_copy.sort_index()

In [140]:
frame_liberia_copy.loc[(pidx[months[6], "Total death/s in confirmed cases"]), :]

Date,Variable,Bomi County,Bong County,Gbarpolu County,Grand Bassa,Grand Cape Mount,Grand Gedeh,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,National,Nimba County,River Gee County,RiverCess County,Sinoe County,Unnamed: 18
2014-12-01,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-02,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-03,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-04,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-05,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-06,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-07,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-08,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,
2014-12-09,Total death/s in confirmed cases,,,,,,,,,,,,,,,,,


This selects every "Total death/s in confirmed cases" row for one month. Note that `months[6]` is December; July is `months[1]`.

In [138]:
selection = frame_liberia_copy.loc[(pidx[months[6], "Total death/s in confirmed cases"]), :]

In [130]:
selection.shape

(11, 17)

The daily average of confirmed deaths for the month is the sum over all columns divided by the number of reports (the values shown are for July):

In [132]:
selection.sum().sum() / selection.shape[0]

86.545454545454547
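One caveat with summing over every column: the Liberia frame has a `National` column next to the county columns. If `National` already aggregates the counties, as many of the rows shown above suggest, the sum over all columns counts each value twice. The sketch below illustrates this on an invented frame; the numbers and the choice of "Newly reported deaths" are assumptions made for the example.

```python
import pandas as pd

idx = pd.IndexSlice

# Toy stand-in for frame_liberia_copy; all numbers are invented.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["7/1/2014", "7/1/2014", "7/2/2014", "7/2/2014"]),
        "Variable": ["Newly reported deaths", "Newly reported contacts"] * 2,
        "Lofa County": [1.0, 10.0, 2.0, 12.0],
        "Montserrado County": [1.0, 5.0, 0.0, 7.0],
        "National": [2.0, 15.0, 2.0, 19.0],
    }
).set_index(["Date", "Variable"]).sort_index()

july = slice("2014-07-01", "2014-07-31")
selection = toy.loc[idx[july, "Newly reported deaths"], :]

# Summing every column includes the counties and the national total.
all_columns_avg = selection.sum().sum() / selection.shape[0]

# Using National alone gives the country-wide average per report.
national_avg = selection["National"].mean()
```

In this toy case the first figure is exactly twice the second, which is the effect to watch for in the real data.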

In [16]:
# `months` (the list of monthly slices built above) must be defined before this cell runs.
dataframe_liberia_avg = pd.DataFrame(columns=['month', 'confirmed death daily avg', 'new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    selection_new_cases = frame_liberia_copy.loc[(pidx[month, "New case/s (confirmed)"]), :]
    selection_deaths_confirmed = frame_liberia_copy.loc[(pidx[month, "Total death/s in confirmed, probable, suspected cases"]), :]
    # Daily average = sum over all columns and reports / number of reports in the month.
    avg_daily_deaths_confirmed = selection_deaths_confirmed.sum().sum() / selection_deaths_confirmed.shape[0]
    avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    dataframe_liberia_avg.loc[i] = [int(begin_date.month), avg_daily_deaths_confirmed, avg_daily_new_cases]
dataframe_liberia_avg.set_index('month')

NameError: name 'months' is not defined

# Guinea

Let's just check the types:

In [159]:
frame_guinea.dtypes

Beyla          object
Boffa          object
Conakry        object
Coyah          object
Dabola         object
Dalaba         object
Date           object
Description    object
Dinguiraye     object
Dubreka        object
Forecariah     object
Gueckedou      object
Kerouane       object
Kindia         object
Kissidougou    object
Kouroussa      object
Lola           object
Macenta        object
Mzerekore      object
Nzerekore      object
Pita           object
Siguiri        object
Telimele       object
Totals         object
Yomou          object
dtype: object

So we see that `Date` has dtype `object`. For convenience we will convert it to a datetime:

In [161]:
frame_guinea.Date = pd.to_datetime(frame_guinea.Date)

In [162]:
frame_guinea.dtypes

Beyla                  object
Boffa                  object
Conakry                object
Coyah                  object
Dabola                 object
Dalaba                 object
Date           datetime64[ns]
Description            object
Dinguiraye             object
Dubreka                object
Forecariah             object
Gueckedou              object
Kerouane               object
Kindia                 object
Kissidougou            object
Kouroussa              object
Lola                   object
Macenta                object
Mzerekore              object
Nzerekore              object
Pita                   object
Siguiri                object
Telimele               object
Totals                 object
Yomou                  object
dtype: object
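The count columns are still `object` after this step, which is why the later cells call `pd.to_numeric`. As a minimal sketch on an invented three-column frame (the column names are borrowed from the Guinea data, the values are not), both conversions can be done up front:

```python
import pandas as pd

# Invented values; every column starts out as strings, as in the CSV reports.
toy = pd.DataFrame(
    {"Date": ["2014-08-04", "2014-08-26"], "Conakry": ["5", None], "Totals": ["5", "18"]}
)

toy["Date"] = pd.to_datetime(toy["Date"])

# Convert every remaining column to numbers; unparseable cells become NaN.
count_columns = toy.columns.drop("Date")
toy[count_columns] = toy[count_columns].apply(pd.to_numeric, errors="coerce")
```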

Now `Date` is a datetime column, which is easier to slice. As a reminder, here is what the frame contains:

In [174]:
frame_guinea.head(42)

Unnamed: 0,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Date,Description,Dinguiraye,Dubreka,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
0,,0.0,5.0,,0.0,,2014-08-04,New cases of suspects,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,5,
1,,0.0,0.0,,0.0,,2014-08-04,New cases of probables,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,0,
2,,0.0,1.0,,0.0,,2014-08-04,New cases of confirmed,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,4,
3,,0.0,6.0,,0.0,,2014-08-04,Total new cases registered so far,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,9,
4,,0.0,9.0,,0.0,,2014-08-04,Total cases of suspects,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,11,
5,,5.0,8.0,,3.0,,2014-08-04,Total cases of probables,1.0,0.0,...,2.0,,11.0,,0.0,1.0,0.0,3.0,133,
6,,18.0,78.0,,1.0,,2014-08-04,Total cases of confirmed,0.0,0.0,...,2.0,,28.0,,4.0,1.0,6.0,23.0,351,
7,,23.0,95.0,,4.0,,2014-08-04,Cumulative (confirmed + probable + suspects),1.0,0.0,...,4.0,,39.0,,4.0,2.0,6.0,26.0,495,
8,,0.0,0.0,,0.0,,2014-08-04,New deaths registered today,0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,2,
9,,0.0,0.0,,0.0,,2014-08-04,New deaths registered today (confirmed),0.0,0.0,...,0.0,,0.0,,0.0,0.0,0.0,0.0,2,


Let's put Date and Description as index:

In [175]:
frame_guinea = frame_guinea.set_index(['Date', 'Description'])

In [177]:
frame_guinea

Date,Description,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Dinguiraye,Dubreka,Forecariah,Gueckedou,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
2014-08-04,New cases of suspects,,0,5,,0,,0,0,,0,...,0,,0,,0,0,0,0,5,
2014-08-04,New cases of probables,,0,0,,0,,0,0,,0,...,0,,0,,0,0,0,0,0,
2014-08-04,New cases of confirmed,,0,1,,0,,0,0,,3,...,0,,0,,0,0,0,0,4,
2014-08-04,Total new cases registered so far,,0,6,,0,,0,0,,3,...,0,,0,,0,0,0,0,9,
2014-08-04,Total cases of suspects,,0,9,,0,,0,0,,2,...,0,,0,,0,0,0,0,11,
2014-08-04,Total cases of probables,,5,8,,3,,1,0,,95,...,2,,11,,0,1,0,3,133,
2014-08-04,Total cases of confirmed,,18,78,,1,,0,0,,188,...,2,,28,,4,1,6,23,351,
2014-08-04,Cumulative (confirmed + probable + suspects),,23,95,,4,,1,0,,285,...,4,,39,,4,2,6,26,495,
2014-08-04,New deaths registered today,,0,0,,0,,0,0,,2,...,0,,0,,0,0,0,0,2,
2014-08-04,New deaths registered today (confirmed),,0,0,,0,,0,0,,2,...,0,,0,,0,0,0,0,2,


Now we want to do the same things as with Liberia. First we sort according to index:

In [178]:
frame_guinea = frame_guinea.sort_index()

In [179]:
frame_guinea

Date,Description,Beyla,Boffa,Conakry,Coyah,Dabola,Dalaba,Dinguiraye,Dubreka,Forecariah,Gueckedou,...,Kouroussa,Lola,Macenta,Mzerekore,Nzerekore,Pita,Siguiri,Telimele,Totals,Yomou
2014-08-04,Cumulative (confirmed + probable + suspects),,23,95,,4,,1,0,,285,...,4,,39,,4,2,6,26,495,
2014-08-04,New cases of confirmed,,0,1,,0,,0,0,,3,...,0,,0,,0,0,0,0,4,
2014-08-04,New cases of probables,,0,0,,0,,0,0,,0,...,0,,0,,0,0,0,0,0,
2014-08-04,New cases of suspects,,0,5,,0,,0,0,,0,...,0,,0,,0,0,0,0,5,
2014-08-04,New deaths registered today,,0,0,,0,,0,0,,2,...,0,,0,,0,0,0,0,2,
2014-08-04,New deaths registered today (confirmed),,0,0,,0,,0,0,,2,...,0,,0,,0,0,0,0,2,
2014-08-04,New deaths registered today (probables),,0,0,,0,,0,0,,0,...,0,,0,,0,0,0,0,0,
2014-08-04,New deaths registered today (suspects),,0,0,,0,,0,0,,0,...,0,,0,,0,0,0,0,0,
2014-08-04,Number of confirmed cases among health workers,,0,16,,0,,0,0,,3,...,0,,1,,0,0,2,1,21,
2014-08-04,Number of contacts followed yesterday,,0,331,,0,,0,0,,200,...,37,,44,,0,26,134,0,772,


Here we see that the data only covers August to October, so we build the monthly slices for that range:

In [238]:
month_starts = pd.date_range(start="2014-08-1", end="2014-10-1", freq="MS")
month_ends = pd.date_range(start="2014-08-31", end="2014-10-31", freq="M")
month_starts = [str(entry.date()) for entry in month_starts]
month_ends = [str(entry.date()) for entry in month_ends]
months=[]
for i in range(len(month_starts)):
    months.append(slice(month_starts[i], month_ends[i]))

In [239]:
months

[slice('2014-08-01', '2014-08-31', None),
 slice('2014-09-01', '2014-09-30', None),
 slice('2014-10-01', '2014-10-31', None)]

In [240]:
dataframe_guinea_avg = pd.DataFrame(columns=['month', 'confirmed death daily avg', 'new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    # The cells are strings, so missing values are filled with 0 before converting to numbers.
    selection_new_cases = frame_guinea.loc[(pidx[month, "New cases of confirmed"]), :].fillna(0).apply(pd.to_numeric)
    selection_deaths_confirmed = frame_guinea.loc[(pidx[month, "Total deaths of confirmed"]), :].fillna(0).apply(pd.to_numeric)
    avg_daily_deaths_confirmed = selection_deaths_confirmed.sum().sum() / selection_deaths_confirmed.shape[0]
    avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    dataframe_guinea_avg.loc[i] = [int(begin_date.month), avg_daily_deaths_confirmed, avg_daily_new_cases]
dataframe_guinea_avg.set_index('month')

month,confirmed death daily avg,new cases daily avg
8.0,584.6,24.8
9.0,891.25,25.5625
10.0,1124.0,12.0
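The Guinea reports do not use one fixed label per quantity: the frames shown earlier contain both "New deaths registered" and "New deaths registered today". A minimal sketch of one way to handle this is to group the variants in a list and select with `isin`. The grouping below is an assumption, and the toy rows and their values are invented.

```python
import pandas as pd

# Labels taken from the Guinea frames shown above; treating them as the
# same quantity is an assumption.
NEW_DEATH_LABELS = ["New deaths registered", "New deaths registered today"]

# Invented rows in the layout of the raw Guinea frame.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2014-08-04", "2014-08-26", "2014-08-26"]),
        "Description": [
            "New deaths registered today",
            "New deaths registered",
            "New cases of confirmed",
        ],
        "Totals": ["2", "5", "10"],
    }
)

is_new_death = toy["Description"].isin(NEW_DEATH_LABELS)
new_deaths = pd.to_numeric(toy.loc[is_new_death, "Totals"])

# Average per report day, counting each date once.
avg_new_deaths = new_deaths.sum() / toy.loc[is_new_death, "Date"].nunique()
```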


# Sierra Leone

In [235]:
frame_sl.head(42)

Unnamed: 0,34 Military Hospital,Bo,Bo EMC,Bombali,Bonthe,Hastings-F/Town,Kailahun,Kambia,Kenema,Kenema (IFRC),...,Port Loko,Pujehun,Tonkolili,Unnamed: 18,Western area,Western area combined,Western area rural,Western area urban,date,variable
0,,654142,,494139,168729.0,,465048,341690.0,653013,,...,557978,335574,434937,,,,263619,1040888,2014-08-12,population
1,,0,,0,0.0,,0,0.0,3,,...,1,0,0,,,,0,0,2014-08-12,new_noncase
2,,1,,0,0.0,,0,0.0,9,,...,0,0,0,,,,0,0,2014-08-12,new_suspected
3,,1,,0,0.0,,0,0.0,0,,...,0,0,0,,,,0,0,2014-08-12,new_probable
4,,0,,0,0.0,,0,0.0,9,,...,2,0,0,,,,0,0,2014-08-12,new_confirmed
5,,54,,10,0.0,,201,1.0,269,,...,7,2,10,,,,5,56,2014-08-12,cum_noncase
6,,2,,1,0.0,,0,0.0,10,,...,0,0,0,,,,0,0,2014-08-12,cum_suspected
7,,1,,1,0.0,,32,0.0,0,,...,1,0,0,,,,0,1,2014-08-12,cum_probable
8,,22,,7,1.0,,378,1.0,259,,...,24,3,2,,,,1,13,2014-08-12,cum_confirmed
9,,0,,0,0.0,,2,0.0,1,,...,1,0,0,,,,0,1,2014-08-12,death_suspected


In [233]:
frame_sl.date = pd.to_datetime(frame_sl.date)

In [236]:
frame_sl = frame_sl.set_index(['date', 'variable'])

In [237]:
frame_sl = frame_sl.sort_index()

In [241]:
month_starts = pd.date_range(start="2014-08-1", end="2014-12-1", freq="MS")
month_ends = pd.date_range(start="2014-08-31", end="2014-12-31", freq="M")
month_starts = [str(entry.date()) for entry in month_starts]
month_ends = [str(entry.date()) for entry in month_ends]
months=[]
for i in range(len(month_starts)):
    months.append(slice(month_starts[i], month_ends[i]))

In [242]:
dataframe_sl_avg = pd.DataFrame(columns=['month', 'confirmed death daily avg', 'new cases daily avg'])
for i, month in enumerate(months):
    begin_date = pd.to_datetime(month.start)
    # The cells are strings, so missing values are filled with 0 before converting to numbers.
    selection_new_cases = frame_sl.loc[(pidx[month, "new_confirmed"]), :].fillna(0).apply(pd.to_numeric)
    selection_deaths_confirmed = frame_sl.loc[(pidx[month, "death_confirmed"]), :].fillna(0).apply(pd.to_numeric)
    avg_daily_deaths_confirmed = selection_deaths_confirmed.sum().sum() / selection_deaths_confirmed.shape[0]
    avg_daily_new_cases = selection_new_cases.sum().sum() / selection_new_cases.shape[0]
    dataframe_sl_avg.loc[i] = [int(begin_date.month), avg_daily_deaths_confirmed, avg_daily_new_cases]
dataframe_sl_avg.set_index('month')

month,confirmed death daily avg,new cases daily avg
8.0,625.0,38.15
9.0,897.724138,70.689655
10.0,1816.607143,114.25
11.0,2027.190476,123.142857
12.0,2629.6,65.2


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

Let's first import the 9 dataframes (we could probably use a loop, but for now let's do it like this):

In [167]:

MICROBIOME_FOLDER = os.path.join(DATA_FOLDER, 'microbiome')

microbiome_1 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID1.xls'), header=None)
microbiome_2 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID2.xls'), header=None)
microbiome_3 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID3.xls'), header=None)
microbiome_4 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID4.xls'), header=None)
microbiome_5 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID5.xls'), header=None)
microbiome_6 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID6.xls'), header=None)
microbiome_7 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID7.xls'), header=None)
microbiome_8 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID8.xls'), header=None)
microbiome_9 = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'MID9.xls'), header=None)

Now let's join them by sequentially calling merge:

In [168]:
combined_df_noind = microbiome_1.merge(microbiome_2, left_on=0, right_on=0, how='outer').merge(microbiome_3, left_on=0, right_on=0, how='outer').merge(microbiome_4, left_on=0, right_on=0, how='outer').merge(microbiome_5, left_on=0, right_on=0, how='outer').merge(microbiome_6, left_on=0, right_on=0, how='outer').merge(microbiome_7, left_on=0, right_on=0, how='outer').merge(microbiome_8, left_on=0, right_on=0, how='outer').merge(microbiome_9, left_on=0, right_on=0, how='outer')

This is ugly and the line is very long. A better option is to create a list of dataframes and then use `reduce`.
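As a minimal sketch of the idea on toy data (the three small frames and their values are invented), `reduce` applies the same two-frame merge from left to right over the whole list:

```python
from functools import reduce

import pandas as pd

# Three invented frames sharing an 'id' column, one count column each.
frames = [
    pd.DataFrame({"id": ["a", "b"], 1: [7, 2]}),
    pd.DataFrame({"id": ["a", "c"], 2: [23, 10]}),
    pd.DataFrame({"id": ["b"], 3: [4]}),
]

# Equivalent to frames[0].merge(frames[1], ...).merge(frames[2], ...).
merged = reduce(lambda left, right: pd.merge(left, right, on="id", how="outer"), frames)
indexed = merged.set_index("id")
```

The outer join keeps every `id` that appears in any frame and leaves `NaN` where a frame has no count for it.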

In [172]:
from functools import reduce

microbiome_array = []
for i in range(1, 10):
    file_name = os.path.join(MICROBIOME_FOLDER, 'MID' + str(i) + '.xls')
    current_df = pd.read_excel(file_name, header=None)
    # Name the taxon column 'id' and the count column after the file number.
    current_df.columns = ['id', i]
    microbiome_array.append(current_df)

# Outer-merge all frames on their only shared column, 'id', then tag missing counts.
microbiome = reduce(lambda left, right: pd.merge(left, right, how='outer'), microbiome_array).fillna('unknown')

In [173]:
microbiome_ind = microbiome.set_index('id')
microbiome_ind.head()

id,1,2,3,4,5,6,7,8,9
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7,23,14,2,28,7,8,unknown,16
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus",2,2,unknown,unknown,3,2,1,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus",3,10,4,unknown,14,5,5,1,6
"Archaea ""Crenarchaeota"" Thermoprotei Thermoproteales Thermofilaceae Thermofilum",3,9,5,unknown,10,4,5,unknown,5
"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Methanocellales Methanocellaceae Methanocella",7,9,7,1,17,12,18,unknown,14


This is a much cleaner way to construct the microbiome dataframe. Let's just do a few sanity checks to see if everything went correctly:

In [174]:
bacteria = microbiome_1[0][0]

In [175]:
microbiome_2.columns = ['id', 'value']
microbiome_2[microbiome_2.id == bacteria]

Unnamed: 0,id,value
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23


In [176]:
microbiome_3.columns = ['id', 'value']
microbiome_3[microbiome_3.id == bacteria]

Unnamed: 0,id,value
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",14


In [177]:
microbiome_6.columns = ['id', 'value']
microbiome_6[microbiome_6.id == bacteria]

Unnamed: 0,id,value
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7


In [178]:
microbiome_8.columns = ['id', 'value']
microbiome_8[microbiome_8.id == bacteria]

Unnamed: 0,id,value


Great! The values 23, 14 and 7 match the first row of the merged dataframe, and the empty result for `microbiome_8` matches its `unknown` entry. Now let's read the metadata information:

In [155]:
metadata = pd.read_excel(os.path.join(MICROBIOME_FOLDER, 'metadata.xls')).fillna('')

In [156]:
column_names = metadata.GROUP + metadata.SAMPLE

In [157]:
id_series = pd.Series(['id'])
# Series.append was removed in pandas 2.0; pd.concat does the same job.
column_names_final = pd.concat([id_series, column_names])

In [158]:
microbiome.columns = column_names_final

In [159]:
microbiome

Unnamed: 0,id,EXTRACTION CONTROL,NEC 1tissue,Control 1tissue,NEC 2tissue,Control 2tissue,NEC 1stool,Control 1stool,NEC 2stool,Control 2stool
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7,23,14,2,28,7,8,unknown,16
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2,2,unknown,unknown,3,2,1,unknown,unknown
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3,10,4,unknown,14,5,5,1,6
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3,9,5,unknown,10,4,5,unknown,5
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7,9,7,1,17,12,18,unknown,14
5,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",1,12,2,unknown,11,1,2,unknown,6
6,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",1,2,1,unknown,3,1,2,unknown,3
7,"Archaea ""Euryarchaeota"" Archaeoglobi Archaeogl...",1,4,unknown,1,8,4,9,unknown,3
8,"Archaea ""Euryarchaeota"" Archaeoglobi Archaeogl...",1,unknown,unknown,unknown,1,unknown,2,unknown,3
9,"Archaea ""Euryarchaeota"" Halobacteria Halobacte...",1,4,3,unknown,7,4,3,unknown,4


That's it! The columns of the combined dataframe now carry the metadata labels.
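The task also asks for a unique index. A small sketch of that last step follows, on invented data: the `GROUP` and `SAMPLE` column names follow the cells above, everything else is an assumption. Unlike the cell above, it joins the two labels with a space.

```python
import pandas as pd

# Toy metadata in the shape used above; the values are invented.
metadata = pd.DataFrame(
    {"GROUP": ["EXTRACTION CONTROL", "NEC 1"], "SAMPLE": [None, "tissue"]}
).fillna("")

# Join group and sample with a space, dropping the trailing space when SAMPLE is empty.
labels = (metadata["GROUP"] + " " + metadata["SAMPLE"]).str.strip()

# Toy merged counts with the numeric column names produced by the merge.
counts = pd.DataFrame({"id": ["x", "y"], 1: [7, "unknown"], 2: [23, 2]})
counts.columns = ["id"] + labels.tolist()

indexed = counts.set_index("id")
index_is_unique = indexed.index.is_unique
```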

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [75]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

0,1,2,3,4,5
Name,Labels,Units,Levels,Storage,NAs
pclass,,,3,integer,0
survived,Survived,,,double,0
name,Name,,,character,0
sex,,,2,integer,0
age,Age,Year,,double,263
sibsp,Number of Siblings/Spouses Aboard,,,double,0
parch,Number of Parents/Children Aboard,,,double,0
ticket,Ticket Number,,,character,0
fare,Passenger Fare,British Pound (\243),,double,1

0,1
Variable,Levels
pclass,1st
,2nd
,3rd
sex,female
,male
cabin,
,A10
,A11
,A14


For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

In [None]:
# Write your answer here

Our steps in more detail:

In [77]:
titanic_data = pd.read_excel(os.path.join(DATA_FOLDER, 'titanic.xls'))

In [78]:
titanic_data

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0000,0,0,19952,26.5500,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0000,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0000,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
