# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [1]:
%matplotlib inline
import datetime
import os
import pandas as pd
import numpy as np

In [2]:
DATA_FOLDER = os.path.join('.', 'Data') # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.
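
As a minimal sketch of the aggregation this task asks for, the example below groups toy daily reports by year and month and averages them. The column names `new_cases` and `new_deaths` and all values are illustrative assumptions, not data read from the reports.

```python
import pandas as pd

# Toy daily reports (hypothetical values, hypothetical column names)
toy = pd.DataFrame(
    {"new_cases": [9, 28, 22, 25, 30], "new_deaths": [2, 5, 2, 5, 5]},
    index=pd.to_datetime(
        ["2014-08-04", "2014-08-26", "2014-08-27", "2014-09-02", "2014-09-04"]
    ),
)

# Group by (year, month) of the report date and average the daily values
monthly_avg = toy.groupby([toy.index.year, toy.index.month]).mean()
monthly_avg.index.names = ["year", "month"]
print(monthly_avg)
```

Note that this averages over the days that have a report, not over every calendar day of the month.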

# Helper functions for analysing the data in different datasets

In [3]:
def compute_columns_series(dataframe_list):
    """Function that computes the series representing the concatenation of the columns for each dataset of a specific country."""
    
    if len(dataframe_list) == 0 :
        return None
    
    return pd.Series([column for dataframe in dataframe_list for column in dataframe.columns.values])

In [4]:
def compute_desired_column_series(dataframe_list, desired_column):
    """ Function that computes the values in the column desired_column for all the datasets of a specific country. """
    
    if len(dataframe_list) == 0 :
        return None
    
    values_list = [description for dataframe in dataframe_list for description in dataframe[desired_column].values]
    return pd.Series(values_list)
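
The idea behind these two helpers can be sketched on toy frames: flatten the column names (or the values of one column) of every file into one `Series`, then count how often each name occurs. The frames and column names below are hypothetical.

```python
import pandas as pd

# Two toy frames standing in for per-file reports (hypothetical columns)
frames = [
    pd.DataFrame(columns=["Description", "Totals", "Conakry"]),
    pd.DataFrame(columns=["Description", "Totals"]),
]

# Count in how many files each column name appears
column_counts = pd.Series(
    [column for frame in frames for column in frame.columns]
).value_counts()

# Columns that are missing from at least one file
partial_columns = column_counts[column_counts < len(frames)]
print(partial_columns)
```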

In [5]:
def load_country_data(country_folder, date_column, thousands_parse=False):
    """Function that loads all the datasets for a specific country. Returns a list of DataFrame structures, each one of them for a file 
       under the folder defined by country_folder. """
    
    results = [] # a list with the data frames for each file
    COUNTRY_FOLDER = os.path.join(DATA_FOLDER, 'ebola', country_folder)
    
    for filename in os.listdir(COUNTRY_FOLDER):
        file = os.path.join(COUNTRY_FOLDER, filename)
        
        if not thousands_parse:
            current_dataframe = pd.read_csv(file, index_col=[date_column], parse_dates=[date_column])
        else:
            current_dataframe = pd.read_csv(file, index_col=[date_column], parse_dates=[date_column], thousands=',', na_values=['-'])
        results.append(current_dataframe)
    
    return results
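
To illustrate what the `thousands` and `na_values` options of `pd.read_csv` do, here is a small sketch on an in-memory CSV. The rows are made up, and passing `parse_dates` together with the other options is an assumption of this example.

```python
import io
import pandas as pd

# Hypothetical report using ',' as thousands separator and '-' for missing values
raw = io.StringIO(
    'date,variable,National\n'
    '2014-08-12,population,"6,348,350"\n'
    '2014-08-12,new_confirmed,-\n'
)

parsed = pd.read_csv(
    raw,
    index_col=["date"],
    parse_dates=["date"],
    thousands=",",      # "6,348,350" is read as the number 6348350
    na_values=["-"],    # "-" is read as a missing value (NaN)
)
print(parsed)
```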

# 1.1. Guinea Dataset

In [6]:
guinea_dataframes = load_country_data('guinea_data', 'Date') # Loading the Guinea data

##  Guinea dataset transformation

As we can see below, the columns of the files are mostly the same, with some files breaking the totals down into more cities than others. As we will argue later, this is not an inconvenience for our task.

In [7]:
compute_columns_series(guinea_dataframes).value_counts()

Conakry        22
Kissidougou    22
Totals         22
Dubreka        22
Description    22
Boffa          22
Macenta        22
Telimele       22
Siguiri        22
Kouroussa      22
Dinguiraye     22
Dabola         22
Gueckedou      22
Pita           22
Kerouane       21
Yomou          21
Coyah          21
Forecariah     21
Dalaba         21
Kindia         20
Beyla          20
Lola           20
Mzerekore      15
Nzerekore       7
dtype: int64

As we can see below, the values of the Description column are mostly the same across the Guinea files. In fact, further analysis shows that the file "2014-08-04.csv" is the only one with different descriptions. This makes it easy to process the dataset as a whole, with only one exception to handle.

In [8]:
compute_desired_column_series(guinea_dataframes, 'Description').value_counts()

Total deaths (confirmed + probables + suspects)             22
Total cases of suspects                                     22
Cumulative (confirmed + probable + suspects)                22
Total cases of confirmed                                    22
Number of contacts to follow today                          22
Total cases of probables                                    22
New cases of confirmed                                      22
Total new cases registered so far                           22
Total contacts registered from start date                   22
Total deaths of probables                                   22
Total deaths of suspects                                    22
New cases of probables                                      22
Total deaths of confirmed                                   22
New cases of suspects                                       22
Number of contacts out of the track 21 days                 21
Fatality rate for confirmed and probables              

We keep only the descriptions that report new cases, new deaths, and the cumulative numbers of cases and deaths, because this is the only information the task requires.

All the descriptions we aggregate from the regular dataframes also appear in the irregular one, with a single exception: the description <i>New deaths registered</i> is named <i>New deaths registered today</i> there. A single function can therefore convert all the dataframes in the same way.

The final table for Guinea does not contain the breakdown by city (e.g. Dabola, Boffa), only the aggregated information in the <i>Totals</i> column of each dataframe, because city-level detail does not serve our purpose of analysing the data on a national scale.

Therefore, only the <i>Description</i> and <i>Totals</i> columns are kept from each table. To make the data easier to read, we reshape each DataFrame so that the descriptions become columns, and we rename them to shorter, more uniform names.

In [9]:
GUINEA_COLUMNS_RENAME = {'New cases of suspects' : 'new_cases_suspects',
                        'New cases of probables' : 'new_cases_probable',
                        'New cases of confirmed': 'new_cases_confirmed',
                        'Total new cases registered so far' : 'total_new_cases',
                        'Total cases of suspects' : 'cum_cases_suspects',
                        'Total cases of probables' : 'cum_cases_probables',
                        'Total cases of confirmed' : 'cum_cases_confirmed',
                        'Cumulative (confirmed + probable + suspects)' : 'total_cum_cases',
                        'Total deaths of suspects' : 'cum_deaths_suspects',
                        'Total deaths of probables' : 'cum_deaths_probables',
                        'Total deaths of confirmed' : 'cum_deaths_confirmed',
                        'Total deaths (confirmed + probables + suspects)' : 'total_cum_deaths'}

def reshape_guinea_dataframe(dataframe):
    """Function that reshapes a dataframe from the Guinea folder, making it easier to read."""
    
    columns_rename = GUINEA_COLUMNS_RENAME.copy()

    if any(description == 'New deaths registered' for description in dataframe['Description']): # usual dataframe
        columns_rename['New deaths registered'] = 'total_new_deaths'
    else: # unusual dataframe
        columns_rename['New deaths registered today'] = 'total_new_deaths' 
    
    dataframe = dataframe.set_index([dataframe.index, 'Description']) # Adding Description to index, thus being able to use unstack function
    dataframe = dataframe['Totals'].unstack('Description')  # Unstacking dataframe, adding Description values as columns
    
    dataframe.rename(columns=columns_rename, inplace=True) # Renaming the relevant columns
    return dataframe[list(columns_rename.values())] # Returning the dataframe containing only the relevant columns
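
The reshaping step can be sketched on a toy long-format report: the description is moved into the index and then unstacked into columns, which leaves one row per date. The three values are the ones reported for 2014-08-04; the frame itself is built by hand for illustration.

```python
import pandas as pd

# Toy long-format report: one row per description, all for the same date
long_report = pd.DataFrame(
    {
        "Description": [
            "New cases of suspects",
            "New cases of confirmed",
            "Total new cases registered so far",
        ],
        "Totals": [5, 4, 9],
    },
    index=pd.DatetimeIndex(["2014-08-04"] * 3, name="Date"),
)

# Move Description into the index, then pivot it into columns
wide = long_report.set_index("Description", append=True)["Totals"].unstack("Description")

# Shorten the column names
wide = wide.rename(
    columns={
        "New cases of suspects": "new_cases_suspects",
        "New cases of confirmed": "new_cases_confirmed",
        "Total new cases registered so far": "total_new_cases",
    }
)
print(wide)
```

`unstack` raises a `ValueError` when the same (date, description) pair occurs more than once, so a file with duplicated descriptions cannot be reshaped this way without cleaning it first.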

We now concatenate all the reshaped Guinea dataframes into a single DataFrame. We also fill the missing values with 0, because there are only two of them: one in the second row, in the new_cases_probable column, and one in the second-to-last row, in the cum_deaths_suspects column.

In [10]:
merged_guinea_dataframe = pd.concat(list(map(reshape_guinea_dataframe, guinea_dataframes)), axis = 0)

merged_guinea_dataframe = merged_guinea_dataframe.apply(pd.to_numeric).fillna(0)
merged_guinea_dataframe

Date,new_cases_suspects,new_cases_probable,new_cases_confirmed,total_new_cases,cum_cases_suspects,cum_cases_probables,cum_cases_confirmed,total_cum_cases,cum_deaths_suspects,cum_deaths_probables,cum_deaths_confirmed,total_cum_deaths,total_new_deaths
2014-08-04,5,0.0,4,9,11,133,351,495,2.0,133,228,363,2
2014-08-26,18,0.0,10,28,30,141,490,661,2.0,141,292,958,5
2014-08-27,12,0.0,10,22,22,142,499,663,2.0,142,294,438,2
2014-08-30,15,0.0,9,24,32,142,533,707,2.0,142,324,468,5
2014-08-31,9,8.0,29,46,36,150,563,749,2.0,150,337,489,3
2014-09-02,11,0.0,14,25,49,150,591,790,2.0,150,349,501,5
2014-09-04,13,0.0,17,30,51,151,621,823,3.0,151,368,522,5
2014-09-07,5,0.0,11,16,32,151,678,861,4.0,151,402,557,4
2014-09-08,5,0.0,11,16,32,151,678,861,4.0,151,402,557,4
2014-09-09,9,0.0,7,16,33,151,683,867,4.0,151,410,565,7


Interestingly, on 2014-08-26 the <i>total_cum_deaths</i> value is 958, while on the next day, 2014-08-27, it is 438. Since a cumulative count cannot decrease, this points to a mistake in the dataset. We therefore verify that the data is consistent, i.e. that the sum of the probable, confirmed and suspected counts equals the reported total, for each class of values:

In [11]:
def delta_new_cases(dataframe):
    return dataframe.new_cases_suspects + dataframe.new_cases_probable + dataframe.new_cases_confirmed - dataframe.total_new_cases

def delta_cum_cases(dataframe):
    return dataframe.cum_cases_suspects + dataframe.cum_cases_probables + dataframe.cum_cases_confirmed - dataframe.total_cum_cases

def delta_cum_deaths(dataframe):
    return dataframe.cum_deaths_suspects + dataframe.cum_deaths_probables + dataframe.cum_deaths_confirmed - dataframe.total_cum_deaths

# merged_guinea_dataframe.apply(delta_new_cases, axis = 1)
# merged_guinea_dataframe.apply(delta_cum_cases, axis = 1)
merged_guinea_dataframe.apply(delta_cum_deaths, axis = 1)


Date
2014-08-04      0.0
2014-08-26   -523.0
2014-08-27      0.0
2014-08-30      0.0
2014-08-31      0.0
2014-09-02      0.0
2014-09-04      0.0
2014-09-07      0.0
2014-09-08      0.0
2014-09-09      0.0
2014-09-11      0.0
2014-09-14      0.0
2014-09-16      0.0
2014-09-17      0.0
2014-09-19      0.0
2014-09-21      0.0
2014-09-22      0.0
2014-09-23      0.0
2014-09-24      0.0
2014-09-26    663.0
2014-09-30      0.0
2014-10-01      0.0
dtype: float64

The analysis shows only two inconsistencies. The first is on August 26th 2014, where the total cumulative number of deaths is too high: 958 instead of 435, the sum of the three death breakdowns (2 + 141 + 292). Looking at the evolution of <i>total_cum_deaths</i> over the days, and given that a cumulative value can never decrease, we replace 958 with the actual sum, 435.

The second inconsistency is on September 26th, where the sum of the three death breakdowns is far greater than the total reported in the table. Reasoning again from the usual evolution of the three cumulative breakdowns, and from the fact that a cumulative value can never decrease, we decide to keep the value 668 for <i>total_cum_deaths</i>.
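
The same consistency check can also be written without a row-wise `apply`. The sketch below uses two rows copied from the Guinea table (before the correction) and keeps only the dates where the breakdown and the reported total disagree.

```python
import pandas as pd

# Two rows copied from the Guinea table, before the correction
deaths = pd.DataFrame(
    {
        "cum_deaths_suspects": [2, 2],
        "cum_deaths_probables": [141, 142],
        "cum_deaths_confirmed": [292, 294],
        "total_cum_deaths": [958, 438],
    },
    index=pd.to_datetime(["2014-08-26", "2014-08-27"]),
)

parts = ["cum_deaths_suspects", "cum_deaths_probables", "cum_deaths_confirmed"]

# Breakdown sum minus reported total, computed for all rows at once
delta = deaths[parts].sum(axis=1) - deaths["total_cum_deaths"]

# Keep only the dates where the two disagree
inconsistent = delta[delta != 0]
print(inconsistent)
```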

In [12]:
merged_guinea_dataframe.at['2014-08-26', 'total_cum_deaths'] = 435
merged_guinea_dataframe

Date,new_cases_suspects,new_cases_probable,new_cases_confirmed,total_new_cases,cum_cases_suspects,cum_cases_probables,cum_cases_confirmed,total_cum_cases,cum_deaths_suspects,cum_deaths_probables,cum_deaths_confirmed,total_cum_deaths,total_new_deaths
2014-08-04,5,0.0,4,9,11,133,351,495,2.0,133,228,363,2
2014-08-26,18,0.0,10,28,30,141,490,661,2.0,141,292,435,5
2014-08-27,12,0.0,10,22,22,142,499,663,2.0,142,294,438,2
2014-08-30,15,0.0,9,24,32,142,533,707,2.0,142,324,468,5
2014-08-31,9,8.0,29,46,36,150,563,749,2.0,150,337,489,3
2014-09-02,11,0.0,14,25,49,150,591,790,2.0,150,349,501,5
2014-09-04,13,0.0,17,30,51,151,621,823,3.0,151,368,522,5
2014-09-07,5,0.0,11,16,32,151,678,861,4.0,151,402,557,4
2014-09-08,5,0.0,11,16,32,151,678,861,4.0,151,402,557,4
2014-09-09,9,0.0,7,16,33,151,683,867,4.0,151,410,565,7


Finally, we extract the Guinea dataframe with the four columns of interest:

In [13]:
guinea_data = merged_guinea_dataframe[['total_new_cases', 'total_cum_cases', 'total_new_deaths', 'total_cum_deaths']]
guinea_data

Date,total_new_cases,total_cum_cases,total_new_deaths,total_cum_deaths
2014-08-04,9,495,2,363
2014-08-26,28,661,5,435
2014-08-27,22,663,2,438
2014-08-30,24,707,5,468
2014-08-31,46,749,3,489
2014-09-02,25,790,5,501
2014-09-04,30,823,5,522
2014-09-07,16,861,4,557
2014-09-08,16,861,4,557
2014-09-09,16,867,7,565


# 1.2. Liberia Dataset

In [14]:
liberia_dataframes = load_country_data('liberia_data', 'Date') # Loading the Liberia data

##  Liberia dataset transformation

The Liberia dataset is similar to the Guinea one with respect to its columns. All the files have the <i>Variable</i> column, which is the equivalent of the <i>Description</i> column in the Guinea dataset, as well as the <i>National</i> column. Therefore, the columns we need are present in every file.

In [15]:
compute_columns_series(liberia_dataframes).value_counts()

Lofa County           100
Margibi County        100
Nimba County          100
Bomi County           100
Variable              100
National              100
Bong County           100
Montserrado County    100
RiverCess County       98
Sinoe County           93
River Gee County       92
Maryland County        87
Grand Gedeh            84
Grand Cape Mount       84
Grand Bassa            84
Gbarpolu County        79
Grand Kru              75
Unnamed: 18             1
dtype: int64

This dataset is a bit more complicated than the Guinea one: it has duplicated values in the <i>Variable</i> column and multiple ways of expressing the same concept. We handle both cases below.

In [16]:
compute_desired_column_series(liberia_dataframes, 'Variable').value_counts()

Cumulative cases among HCW                                          101
Cumulative deaths among HCW                                         101
Total death/s in confirmed cases                                    101
Total death/s in probable cases                                     101
Total death/s in suspected cases                                    101
Total probable cases                                                100
Total discharges                                                    100
Newly Reported deaths in HCW                                        100
Newly reported contacts                                             100
New admissions                                                      100
New Case/s (Probable)                                               100
Total suspected cases                                               100
New Case/s (Suspected)                                              100
Contacts lost to follow-up                                      

In [17]:
LIBERIA_COLUMNS_RENAME = {'New Case/s (Suspected)' : 'new_cases_suspects',
                        'New Case/s (Probable)' : 'new_cases_probable',
                        'New case/s (confirmed)': 'new_cases_confirmed',
                        'Total suspected cases' : 'cum_cases_suspects',
                        'Total probable cases' : 'cum_cases_probables',
                        'Total confirmed cases' : 'cum_cases_confirmed',
                        'Newly reported deaths' : 'total_new_deaths',
                        'Total death/s in suspected cases' : 'cum_deaths_suspects',
                        'Total death/s in probable cases' : 'cum_deaths_probables',
                        'Total death/s in confirmed cases' : 'cum_deaths_confirmed'}

def reshape_liberia_dataframe(dataframe):
    """Function that reshapes a dataframe from the Liberia folder, making it easier to read."""
    
    columns_rename = LIBERIA_COLUMNS_RENAME.copy()
    
    # Different files express the same concept in different ways; map every alias that
    # appears in this file to its canonical column name.
    aliases = {'Total death/s in confirmed, probable, suspected cases' : 'total_cum_deaths',
               'Cumulative confirmed, probable and suspected cases' : 'total_cum_cases',
               'Total death/s in confirmed, \n probable, suspected cases' : 'total_cum_deaths',
               'Cumulative (confirmed + probable + suspects)' : 'total_cum_cases',
               'Cumulative (confirmed + probable + suspected)' : 'total_cum_cases',
               'Total case/s (confirmed)' : 'cum_cases_confirmed',
               'Total Case/s (Probable)' : 'cum_cases_probables',
               'Total Case/s (Suspected)' : 'cum_cases_suspects',
               'Total death/s in confirmed,  probable, suspected cases' : 'total_cum_deaths'}

    for alias, column_name in aliases.items():
        if alias in dataframe['Variable'].values:
            columns_rename[alias] = column_name
        
    dataframe = dataframe.set_index([dataframe.index, 'Variable']) # Adding Variable to index, thus being able to use unstack function
    dataframe = dataframe['National'].unstack('Variable')  # Unstacking dataframe, adding Variable values as columns
    
    dataframe.rename(columns=columns_rename, inplace=True) # Renaming the relevant columns
    return dataframe[list(columns_rename.values())] # Returning the dataframe containing only the relevant columns

As we have seen in the previous table, some files contain duplicate values in the <i>Variable</i> column. Below, we identify these files:

In [18]:
for dataframe in liberia_dataframes:
    try:
        reshape_liberia_dataframe(dataframe)
    except ValueError:
        print(dataframe.index[0])


2014-10-04 00:00:00


Therefore, only one file has duplicated values in the <i>Variable</i> column. A closer look at that file, dated October 4th 2014, shows that the values of the duplicated variables differ only slightly. We can therefore compute the average of the duplicated rows and keep that value.
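
A minimal sketch of that averaging idea, on a hand-made report with one duplicated variable (the numbers are hypothetical, not taken from the October 4th file): the duplicated names are listed first, then the rows are averaged per (date, variable) pair so that the frame can be unstacked.

```python
import pandas as pd

# Toy report with one duplicated Variable (hypothetical values)
report = pd.DataFrame(
    {
        "Variable": [
            "Newly reported deaths",
            "Total confirmed cases",
            "Total confirmed cases",
        ],
        "National": [3.0, 940.0, 942.0],
    },
    index=pd.DatetimeIndex(["2014-10-04"] * 3, name="Date"),
)

# Names that occur more than once in the Variable column
duplicated_names = report.loc[report["Variable"].duplicated(), "Variable"].unique()

# Average the duplicated rows so that each (Date, Variable) pair is unique,
# then pivot the variables into columns
deduplicated = (
    report.groupby(["Date", "Variable"])["National"]
    .mean()
    .round(0)
    .unstack("Variable")
)
print(deduplicated)
```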

In [19]:
merged_liberia_dataframe = []

for index in range(len(liberia_dataframes[:54])):
    try:
        merged_liberia_dataframe.append(reshape_liberia_dataframe(liberia_dataframes[index]))
    except ValueError:
        pass
#         date = liberia_dataframes[index].index[0] # For recovering the index afterwards
#         mean_dataframe = liberia_dataframes[index].groupby('Variable', as_index = False).mean().round(0) # Computing the mean on every duplicate
#         mean_dataframe.index = [date] * len(mean_dataframe.index)
#         mean_dataframe.index.rename('Date', inplace=True)
# #         mean_dataframe['Variable'] = mean_dataframe.index # Adding Variable again as column
# #         mean_dataframe.index = range(len(mean_dataframe.index))
# #         mean_dataframe.index.rename('Id', inplace=True) # Renaming index
#         print(len(reshape_liberia_dataframe(mean_dataframe).mean(axis=1, level =0).columns))
#         merged_liberia_dataframe.append(reshape_dataframe(mean_dataframe)) # Adding reshaped dataframe to list
#         
    
merged_liberia_dataframe = pd.concat(merged_liberia_dataframe, axis = 0)

merged_liberia_dataframe

Date,new_cases_suspects,new_cases_probable,new_cases_confirmed,cum_cases_suspects,cum_cases_probables,cum_cases_confirmed,total_new_deaths,cum_deaths_suspects,cum_deaths_probables,cum_deaths_confirmed,total_cum_deaths
2014-06-16,2.0,1.0,1.0,4.0,6.0,12.0,2.0,2.0,6.0,8.0,16.0
2014-06-17,2.0,0.0,0.0,8.0,6.0,12.0,0.0,2.0,6.0,8.0,16.0
2014-06-22,5.0,0.0,5.0,6.0,8.0,28.0,4.0,1.0,8.0,16.0,25.0
2014-06-24,2.0,0.0,4.0,13.0,8.0,33.0,4.0,6.0,8.0,18.0,32.0
2014-06-25,4.0,1.0,2.0,17.0,9.0,35.0,3.0,9.0,8.0,20.0,37.0
2014-06-28,2.0,6.0,1.0,29.0,18.0,41.0,1.0,13.0,10.0,26.0,49.0
2014-06-29,0.0,0.0,2.0,29.0,18.0,43.0,0.0,13.0,10.0,26.0,49.0
2014-07-01,2.0,0.0,2.0,33.0,19.0,48.0,5.0,14.0,17.0,30.0,61.0
2014-07-02,1.0,3.0,0.0,34.0,22.0,48.0,5.0,14.0,20.0,32.0,66.0
2014-07-03,1.0,2.0,1.0,35.0,24.0,49.0,4.0,15.0,22.0,33.0,70.0


As can be seen below, the total number of cumulative deaths is consistent with its breakdown over the three classes, apart from a single date (2014-09-20) where values are missing.

In [20]:
merged_liberia_dataframe.apply(delta_cum_deaths, axis = 1)

Date
2014-06-16    0.0
2014-06-17    0.0
2014-06-22    0.0
2014-06-24    0.0
2014-06-25    0.0
2014-06-28    0.0
2014-06-29    0.0
2014-07-01    0.0
2014-07-02    0.0
2014-07-03    0.0
2014-07-07    0.0
2014-07-08    0.0
2014-07-10    0.0
2014-07-13    0.0
2014-07-17    0.0
2014-07-20    0.0
2014-07-24    0.0
2014-07-26    0.0
2014-08-02    0.0
2014-08-04    0.0
2014-08-12    0.0
2014-08-15    0.0
2014-08-17    0.0
2014-08-18    0.0
2014-08-20    0.0
2014-08-25    0.0
2014-08-28    0.0
2014-09-01    0.0
2014-09-02    0.0
2014-09-03    0.0
2014-09-04    0.0
2014-09-05    0.0
2014-09-06    0.0
2014-09-07    0.0
2014-09-08    0.0
2014-09-10    0.0
2014-09-11    0.0
2014-09-12    0.0
2014-09-13    0.0
2014-09-14    0.0
2014-09-15    0.0
2014-09-16    0.0
2014-09-17    0.0
2014-09-20    NaN
2014-09-21    0.0
2014-09-23    0.0
2014-09-25    0.0
2014-09-26    0.0
2014-09-27    0.0
2014-09-28    0.0
2014-09-30    0.0
2014-10-01    0.0
2014-10-03    0.0
dtype: float64

The final dataset for Liberia is presented below:

In [40]:
merged_liberia_dataframe['total_new_cases'] = merged_liberia_dataframe.new_cases_suspects + merged_liberia_dataframe.new_cases_confirmed + merged_liberia_dataframe.new_cases_probable
merged_liberia_dataframe['total_cum_cases'] = merged_liberia_dataframe.cum_cases_suspects + merged_liberia_dataframe.cum_cases_confirmed + merged_liberia_dataframe.cum_cases_probables

# Take an explicit copy so that the assignments below do not trigger SettingWithCopyWarning
liberia_data = merged_liberia_dataframe[['total_new_cases', 'total_cum_cases', 'total_new_deaths', 'total_cum_deaths']].copy()

# Replacing NaN in total_cum_deaths and total_cum_cases with the previous non-null value (forward fill)
# and in total_new_deaths and total_new_cases with 0

liberia_data['total_cum_deaths'] = liberia_data['total_cum_deaths'].ffill()
liberia_data['total_cum_cases'] = liberia_data['total_cum_cases'].ffill()

liberia_data['total_new_deaths'] = liberia_data['total_new_deaths'].fillna(0)
liberia_data['total_new_cases'] = liberia_data['total_new_cases'].fillna(0)

liberia_data


Date,total_new_cases,total_cum_cases,total_new_deaths,total_cum_deaths
2014-06-16,4.0,22.0,2.0,16.0
2014-06-17,2.0,26.0,0.0,16.0
2014-06-22,10.0,42.0,4.0,25.0
2014-06-24,6.0,54.0,4.0,32.0
2014-06-25,7.0,61.0,3.0,37.0
2014-06-28,9.0,88.0,1.0,49.0
2014-06-29,2.0,90.0,0.0,49.0
2014-07-01,4.0,100.0,5.0,61.0
2014-07-02,4.0,104.0,5.0,66.0
2014-07-03,4.0,108.0,4.0,70.0
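
The two filling rules described in the cell's comment (carry the previous value forward for cumulative columns, use 0 for daily counts) can be sketched on a toy frame. The values are hypothetical, and working on an explicit copy is an assumption made here so that pandas does not complain about writing to a slice.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing report (hypothetical values)
toy = pd.DataFrame(
    {
        "total_cum_deaths": [16.0, np.nan, 25.0],
        "total_new_deaths": [2.0, np.nan, 4.0],
    },
    index=pd.to_datetime(["2014-06-16", "2014-06-17", "2014-06-22"]),
)

filled = toy.copy()

# Cumulative counts: a missing value is replaced by the last known value
filled["total_cum_deaths"] = filled["total_cum_deaths"].ffill()

# Daily counts: a missing value is treated as "nothing reported", i.e. 0
filled["total_new_deaths"] = filled["total_new_deaths"].fillna(0)
print(filled)
```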


# 1.3. Sierra Leone Dataset

In [23]:
sl_dataframes = load_country_data('sl_data', 'date', thousands_parse=True) # Loading the Sierra Leone data

## Sierra Leone dataset transformation

As before, most columns of the Sierra Leone dataset are common to every file, and the <i>variable</i> and <i>National</i> columns are present in all of them.

In [24]:
compute_columns_series(sl_dataframes).value_counts()

Kono                      103
National                  103
Tonkolili                 103
Koinadugu                 103
Bombali                   103
Moyamba                   103
Kailahun                  103
Kenema                    103
Kambia                    103
Bo                        103
Bonthe                    103
Port Loko                 103
Pujehun                   103
Western area urban        103
variable                  103
Western area rural        103
Kenema (IFRC)              23
Western area combined      16
Western area               14
34 Military Hospital       10
Police training School      9
Kenema (KGH)                8
Hastings-F/Town             8
Unnamed: 18                 3
Police traning School       1
Bo EMC                      1
dtype: int64

This dataset is similar to the Guinea one, in the sense that most of the values of the <i>variable</i> column appear in every file. It is therefore easier to deal with this dataset than the previous one.

In [25]:
compute_desired_column_series(sl_dataframes, 'variable').value_counts()

new_confirmed             103
new_suspected             103
contacts_followed         103
etc_cum_deaths            103
population                103
death_probable            103
etc_currently_admitted    103
contacts_not_seen         103
cum_contacts              103
etc_new_deaths            103
etc_cum_discharges        103
contacts_healthy          103
cum_noncase               103
etc_cum_admission         103
new_noncase               103
new_probable              103
cum_confirmed             103
death_confirmed           103
new_contacts              103
cum_suspected             103
etc_new_admission         103
contacts_ill              103
cum_probable              103
percent_seen              103
etc_new_discharges        103
death_suspected           103
new_completed_contacts    103
cum_completed_contacts    103
cfr                       103
positive_corpse            35
negative_corpse            35
pending                    35
total_lab_samples          34
new_negati

We do the same as before for this dataset: we reshape each DataFrame into a single-row DataFrame whose columns are the values of the <i>variable</i> column that are of interest to us.

In [26]:
SL_COLUMNS_RENAME = {'new_suspected' : 'new_cases_suspects',
                        'new_probable' : 'new_cases_probable',
                        'new_confirmed': 'new_cases_confirmed',
                        'cum_suspected' : 'cum_cases_suspects',
                        'cum_probable' : 'cum_cases_probables',
                        'cum_confirmed' : 'cum_cases_confirmed',
                        'death_suspected' : 'cum_deaths_suspects',
                        'death_probable' : 'cum_deaths_probables',
                        'death_confirmed' : 'cum_deaths_confirmed',
                        'etc_cum_deaths' : 'total_cum_deaths',
                        'etc_new_deaths' : 'total_new_deaths'}

def reshape_sl_dataframe(dataframe):
    """Function that reshapes a dataframe from the Sierra Leone folder, making it easier to read."""
    
    columns_rename = SL_COLUMNS_RENAME.copy()
    
    dataframe = dataframe.set_index([dataframe.index, 'variable']) # Adding variable to index, thus being able to use unstack function
    dataframe = dataframe['National'].unstack('variable')  # Unstacking dataframe, adding variable values as columns
    
    dataframe.rename(columns=columns_rename, inplace=True) # Renaming the relevant columns
    return dataframe[list(columns_rename.values())] # Returning the dataframe containing only the relevant columns

Below, we observe that the file for October 10th 2014 does not follow the usual format: its last 3 rows are repeated, which makes our function raise an error. We therefore simply drop the last 3 rows of that dataframe. We prefer this to a more general method because only one dataframe breaks the usual format.

In [27]:
merged_sl_dataframe = []

for dataframe in sl_dataframes:
    try:
        merged_sl_dataframe.append(reshape_sl_dataframe(dataframe))
    except ValueError:
        dataframe = dataframe[:-3]
        merged_sl_dataframe.append(reshape_sl_dataframe(dataframe))
        
merged_sl_dataframe = pd.concat(merged_sl_dataframe, axis = 0)

merged_sl_dataframe


date,new_cases_suspects,new_cases_probable,new_cases_confirmed,cum_cases_suspects,cum_cases_probables,cum_cases_confirmed,cum_deaths_suspects,cum_deaths_probables,cum_deaths_confirmed,total_cum_deaths,total_new_deaths
2014-08-12,10,1,11,46,37,717,5,34,264,,
2014-08-13,3,1,15,39,38,733,5,34,273,,
2014-08-14,0,2,13,37,39,747,5,34,280,,
2014-08-15,6,1,10,42,37,757,5,34,287,,
2014-08-16,3,0,18,39,34,775,5,34,297,,
2014-08-17,1,0,2,35,37,778,5,34,305,,
2014-08-18,40,15,5,72,52,783,5,34,312,,
2014-08-19,16,3,9,66,40,804,5,34,320,,
2014-08-20,1,0,4,52,37,813,5,34,322,,
2014-08-21,0,0,9,52,38,823,8,34,329,,
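
Dropping a fixed number of trailing rows works here because only one file is affected. A more general alternative, sketched below on a hand-made report with hypothetical values, is to keep the first occurrence of each variable before reshaping.

```python
import pandas as pd

# Toy report whose last row repeats an earlier variable (hypothetical values)
report = pd.DataFrame(
    {
        "variable": ["new_confirmed", "cum_confirmed", "new_confirmed"],
        "National": [11, 717, 11],
    },
    index=pd.Index(["2014-10-10"] * 3, name="date"),
)

# Keep the first occurrence of each variable instead of slicing off rows
unique_rows = report.drop_duplicates(subset="variable", keep="first")

# The reshape now succeeds because every (date, variable) pair is unique
wide = unique_rows.set_index("variable", append=True)["National"].unstack("variable")
print(wide)
```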


We can see clear problems in the dataset: the <i>total_new_deaths</i> and <i>total_cum_deaths</i> columns contain only NaN values. Inspecting the files shows large blank gaps for specific rows and columns. Therefore, we must compute the <i>National</i> value for <i>total_new_deaths</i> as the sum of the values for each region, and for <i>total_cum_deaths</i> as the sum of its confirmed, probable and suspected components.
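
The two sums described above can be sketched on a single made-up report. The city names and most values below are hypothetical; only the variable names and the `National` column follow the notebook.

```python
import pandas as pd

# Hypothetical single-day report; city names and values are made up for illustration
report = pd.DataFrame(
    {
        "variable": ["etc_new_deaths", "death_suspected", "death_probable",
                     "death_confirmed", "etc_cum_deaths"],
        "Kailahun": ["2", "1", "3", "10", None],
        "Kenema": ["3", "0", "1", "20", None],
        "National": [None, "5", "34", "264", None],
    }
).fillna(0)

by_variable = report.set_index("variable")
city_columns = by_variable.columns.drop("National")

# New deaths: sum of the per-city values
total_new_deaths = pd.to_numeric(by_variable.loc["etc_new_deaths", city_columns]).sum()

# Cumulative deaths: sum of the national suspected/probable/confirmed breakdown
breakdown = ["death_suspected", "death_probable", "death_confirmed"]
total_cum_deaths = pd.to_numeric(by_variable.loc[breakdown, "National"]).sum()
```

This uses label-based selection and a vectorized `sum` in place of an explicit loop over cities; the result is the same for well-formed numeric strings.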

In [29]:
def compute_national_count(dataframe):
    """Function that fills the National value of etc_new_deaths with the sum over all cities and of etc_cum_deaths with the sum of the national suspected, probable and confirmed death counts"""
    
    dataframe = dataframe.fillna(0)
    
    columns = dataframe.columns
    cities = columns.drop(['variable', 'National']) # list of cities in the current dataframe
    
    total_new_deaths = 0
    
    dataframe_copy = dataframe.set_index(['variable']) # Reindex for an easier localization

    for city in cities:
        total_new_deaths += pd.to_numeric((dataframe_copy.at['etc_new_deaths', city]))

    total_cum_deaths = int(dataframe_copy.at['death_suspected', 'National']) + int(dataframe_copy.at['death_probable', 'National']) + int(dataframe_copy.at['death_confirmed', 'National'])
    dataframe.loc[dataframe['variable'] == 'etc_new_deaths', 'National'] = total_new_deaths
    dataframe.loc[dataframe['variable'] == 'etc_cum_deaths', 'National'] = total_cum_deaths
    
    return dataframe


sl_dataframes = list(map(compute_national_count, sl_dataframes)) # Apply the compute_national_count over each DataFrame

merged_sl_dataframe = []

for dataframe in sl_dataframes:
    try:
        merged_sl_dataframe.append(reshape_sl_dataframe(dataframe))
    except ValueError:
        dataframe = dataframe[:-3]
        merged_sl_dataframe.append(reshape_sl_dataframe(dataframe))
        
merged_sl_dataframe = pd.concat(merged_sl_dataframe, axis = 0)

# Also, we replace the zeros in the total_cum_deaths column with the next non-zero value (bfill) and, for any zeros
# left at the end, with the previous non-zero value (ffill), so that days without information reuse a neighbouring count.

merged_sl_dataframe['total_cum_deaths'] = merged_sl_dataframe['total_cum_deaths'].replace(to_replace=0, method='bfill')
merged_sl_dataframe['total_cum_deaths'] = merged_sl_dataframe['total_cum_deaths'].replace(to_replace=0, method='ffill')

merged_sl_dataframe

date,new_cases_suspects,new_cases_probable,new_cases_confirmed,cum_cases_suspects,cum_cases_probables,cum_cases_confirmed,cum_deaths_suspects,cum_deaths_probables,cum_deaths_confirmed,total_cum_deaths,total_new_deaths
2014-08-12,10,1,11,46,37,717,5,34,264,303,8
2014-08-13,3,1,15,39,38,733,5,34,273,312,6
2014-08-14,0,2,13,37,39,747,5,34,280,319,7
2014-08-15,6,1,10,42,37,757,5,34,287,326,7
2014-08-16,3,0,18,39,34,775,5,34,297,336,10
2014-08-17,1,0,2,35,37,778,5,34,305,344,8
2014-08-18,40,15,5,72,52,783,5,34,312,351,7
2014-08-19,16,3,9,66,40,804,5,34,320,359,6
2014-08-20,1,0,4,52,37,813,5,34,322,361,2
2014-08-21,0,0,9,52,38,823,8,34,329,371,7
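
The zero-filling step above relies on the `method` argument of `replace`, which recent pandas versions deprecate. A possible equivalent, sketched on a made-up series and assuming the column contains no genuine `NaN` values (they were already turned into zeros by `fillna(0)`), masks the zeros and then fills in both directions:

```python
import pandas as pd

# Toy cumulative series; the zeros stand for days with no reported value
cum_deaths = pd.Series(
    [0, 303, 0, 0, 319, 0],
    index=pd.to_datetime(
        ["2014-08-11", "2014-08-12", "2014-08-13",
         "2014-08-14", "2014-08-15", "2014-08-16"]
    ),
)

# Turn zeros into NaN, fill backward first, then forward for any trailing gap
filled = cum_deaths.mask(cum_deaths == 0).bfill().ffill()
```

The order matters: filling backward first means a gap takes the next reported value, and only a gap at the very end falls back to the last known one.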


Now, we compute the final dataframe for Sierra Leone:

In [30]:
def replace_comma(x):
    if isinstance(x, str):
        return x.replace(",", "")
    return x

merged_sl_dataframe = merged_sl_dataframe.applymap(replace_comma)
merged_sl_dataframe = merged_sl_dataframe.applymap(pd.to_numeric) # Transform every cell to numeric

merged_sl_dataframe['total_new_cases'] = merged_sl_dataframe.new_cases_suspects + merged_sl_dataframe.new_cases_confirmed + merged_sl_dataframe.new_cases_probable
merged_sl_dataframe['total_cum_cases'] = merged_sl_dataframe.cum_cases_suspects + merged_sl_dataframe.cum_cases_confirmed + merged_sl_dataframe.cum_cases_probables

merged_sl_dataframe['total_cum_cases'] = merged_sl_dataframe['total_cum_cases'].replace(to_replace=0, method='bfill')

sl_data = merged_sl_dataframe[['total_new_cases', 'total_cum_cases', 'total_new_deaths', 'total_cum_deaths']]
sl_data.index = sl_data.index.astype('datetime64[ns]')
sl_data.head(20)

date,total_new_cases,total_cum_cases,total_new_deaths,total_cum_deaths
2014-08-12,22.0,800.0,8.0,303.0
2014-08-13,19.0,810.0,6.0,312.0
2014-08-14,15.0,823.0,7.0,319.0
2014-08-15,17.0,836.0,7.0,326.0
2014-08-16,21.0,848.0,10.0,336.0
2014-08-17,3.0,850.0,8.0,344.0
2014-08-18,60.0,907.0,7.0,351.0
2014-08-19,28.0,910.0,6.0,359.0
2014-08-20,5.0,902.0,2.0,361.0
2014-08-21,9.0,913.0,7.0,371.0
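
The comma-stripping and numeric conversion above work cell by cell. A column-wise variant is sketched below on made-up cells; the `raw` frame and the `to_number` helper are hypothetical. Note one difference: `errors="coerce"` turns unparseable cells into `NaN`, where the cell-wise version would raise.

```python
import pandas as pd

# Hypothetical cells mixing plain numbers and comma-formatted strings
raw = pd.DataFrame(
    {
        "cum_cases_confirmed": ["1,234", 717, "804"],
        "new_cases_confirmed": [11, "9", "1,002"],
    }
)


def to_number(column):
    """Strip thousands separators from one column and convert it to numeric."""
    cleaned = column.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


clean = raw.apply(to_number)
```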


# Combining results

Now that we have prepared the datasets for each country, we can import all of them into one unified DataFrame. It is important to decide what the final table will contain. As we can observe, the difference in <i>total_cum_deaths</i> between two consecutive dates is generally larger than the number of new deaths reported between those dates. The reason might be that two consecutive entries are not exactly consecutive dates, or that new deaths are not counted as precisely as cumulative deaths. The same holds for new cases and cumulative cases. Therefore, we will base the averages on the columns <i>total_cum_cases</i> and <i>total_cum_deaths</i>, falling back to the reported daily values only when the cumulative difference is not positive.

For each country, we will compute the daily average per month of new cases and deaths, and then we will unify all of this information. An important observation is that all of the entries are in the year 2014, so the month number alone uniquely identifies each month and can serve as the index.

The average is computed as follows: for each month in which we have entries, we take the difference between the value on the last recorded day of the month and the value on the first recorded day, and divide it by the number of days in that interval (both end days included).
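
As a worked example of this rule, the sketch below uses only the first and last of the ten Sierra Leone rows displayed earlier (2014-08-12 and 2014-08-21). The full dataset contains more August rows, so these numbers illustrate the formula and are not the notebook's results. The `interval_average` helper is hypothetical.

```python
import pandas as pd

# First and last of the ten Sierra Leone rows shown earlier
sample = pd.DataFrame(
    {"total_cum_cases": [800.0, 913.0], "total_cum_deaths": [303.0, 371.0]},
    index=pd.to_datetime(["2014-08-12", "2014-08-21"]),
)


def interval_average(frame, column):
    """(last cumulative value - first cumulative value) / days covered, inclusive."""
    frame = frame.sort_index()
    days = (frame.index[-1] - frame.index[0]).days + 1
    return (frame[column].iloc[-1] - frame[column].iloc[0]) / days


avg_new_deaths = interval_average(sample, "total_cum_deaths")  # (371 - 303) / 10
avg_new_cases = interval_average(sample, "total_cum_cases")    # (913 - 800) / 10
```

Here the interval spans 10 days, giving 6.8 deaths and 11.3 cases per day for this sample.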

In [31]:
def compute_average(df, column, side_column):
    df = df.sort_values(by=['day'])
    delta_days = df.iloc[-1]['day'] - df.iloc[0]['day'] + 1
    delta_cum_cases = df.iloc[-1][column] - df.iloc[0][column]
    
    if delta_cum_cases > 0:   
        return delta_cum_cases / delta_days
    # If the cumulative difference is not positive, fall back to the mean of the reported daily values
    
    total = 0
    for index in range(len(df)):
        total += df.iloc[index][side_column]
        
    return total / len(df)
    

def compute_daily_average_per_month(dataframe_init): 
    dataframe = dataframe_init.copy()
    dataframe['day'] = dataframe.index.map(lambda x: x.day)
    dataframe.index = dataframe.index.map(lambda x: x.month)
    dataframe.index.name = 'Month'
    
    avg_new_deaths = dataframe.groupby('Month').apply(compute_average, column='total_cum_deaths', side_column='total_new_deaths')
    avg_new_deaths.name = 'avg_new_deaths'
    
    avg_new_cases = dataframe.groupby('Month').apply(compute_average, column='total_cum_cases', side_column='total_new_cases')
    avg_new_cases.name = 'avg_new_cases'
    
    return pd.concat([avg_new_cases, avg_new_deaths], axis = 1)

In [32]:
guinea_prepared_data = compute_daily_average_per_month(guinea_data)
liberia_prepared_data = compute_daily_average_per_month(liberia_data)
sl_prepared_data = compute_daily_average_per_month(sl_data)

In [33]:
guinea_prepared_data

Month,avg_new_cases,avg_new_deaths
8,9.071429,4.5
9,13.068966,7.344828
10,34.0,15.0


In [34]:
liberia_prepared_data

Month,avg_new_cases,avg_new_deaths
6,4.857143,2.357143
7,7.961538,3.230769
8,41.148148,23.259259
9,68.766667,35.133333
10,41.0,24.0


In [35]:
sl_prepared_data

Month,avg_new_cases,avg_new_deaths
8,19.5,6.6
9,40.533333,5.133333
10,68.741935,27.580645
11,83.551724,14.482759
12,73.923077,12.230769


Finally, we have to combine all of the previous tables:

In [36]:
guinea_prepared_data['Country'] = 'Guinea'
liberia_prepared_data['Country'] = 'Liberia'
sl_prepared_data['Country'] = 'Sierra Leone'

merged_countries = pd.concat([guinea_prepared_data, liberia_prepared_data, sl_prepared_data], axis = 0)

final_table = merged_countries.set_index([merged_countries.index, 'Country']).sort_index()

In [37]:
final_table

Month,Country,avg_new_cases,avg_new_deaths
6,Liberia,4.857143,2.357143
7,Liberia,7.961538,3.230769
8,Guinea,9.071429,4.5
8,Liberia,41.148148,23.259259
8,Sierra Leone,19.5,6.6
9,Guinea,13.068966,7.344828
9,Liberia,68.766667,35.133333
9,Sierra Leone,40.533333,5.133333
10,Guinea,34.0,15.0
10,Liberia,41.0,24.0
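
An alternative way to build the same `(Month, Country)` index, sketched here with a few rows copied from the per-country tables above, is to pass a dictionary to `pd.concat` so that the country names become an index level directly. The variable names below are illustrative.

```python
import pandas as pd

# A few rows copied from the per-country tables above
guinea = pd.DataFrame(
    {"avg_new_cases": [9.071429, 13.068966], "avg_new_deaths": [4.5, 7.344828]},
    index=pd.Index([8, 9], name="Month"),
)
liberia = pd.DataFrame(
    {"avg_new_cases": [4.857143, 41.148148], "avg_new_deaths": [2.357143, 23.259259]},
    index=pd.Index([6, 8], name="Month"),
)

# The dict keys become an outer index level named 'Country'
combined = pd.concat({"Guinea": guinea, "Liberia": liberia}, names=["Country", "Month"])

# Reorder to (Month, Country) and sort, matching the layout of final_table
combined = combined.swaplevel().sort_index()
```

This avoids adding a temporary `Country` column to each frame before concatenating.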


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.display import HTML
HTML(filename=os.path.join(DATA_FOLDER, 'titanic.html'))

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

In [None]:
# Write your answer here