# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.
import pandas as pd
import numpy as np

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.
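
One way to handle the differing expressions is to map every raw label to a canonical name before aggregating. The sketch below uses invented rows; the `CANONICAL` mapping and the `Measure` column are hypothetical names introduced only for this illustration, while the labels themselves are ones that appear in the reports.

```python
import pandas as pd

# Hypothetical mapping from raw report labels to canonical measure names.
CANONICAL = {
    'New deaths registered': 'New deaths',
    'New deaths registered today': 'New deaths',
    'Newly reported deaths': 'New deaths',
    'Total new cases registered so far': 'New cases',
}

# Toy long-format data (the values are invented).
raw = pd.DataFrame({
    'Date': pd.to_datetime(['2014-08-04', '2014-08-04', '2014-08-26', '2014-08-26']),
    'Description': ['New deaths registered today', 'Total new cases registered so far',
                    'New deaths registered', 'Total new cases registered so far'],
    'Totals': [2, 9, 5, 28],
})

# Translate each label, then pivot to one row per date and one column per measure.
raw['Measure'] = raw['Description'].map(CANONICAL)
tidy = raw.pivot_table(index='Date', columns='Measure', values='Totals', aggfunc='sum')
print(tidy)
```

Keeping the mapping in one dictionary makes it easy to review which labels were treated as equivalent.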

## Way of thinking

The final results to the question "calculate for *each country*, the *daily average per month* of *new cases* and *deaths*" are found in the last three data frames before the start of Task 2. In the course of this solution, there will be many intermediate data frames.

**1st approach**: Transform the data to obtain a final DataFrame containing all the information about new cases and deaths by `date` and `Country`. Then group by `Country` and `month` and take the mean.

**2nd approach**: Transform the data to obtain a DataFrame that collects the total (cumulative) number of cases and deaths. Next, find the first and the last day of each month, subtract the first-day value from the last-day value, and divide by the number of days between those two dates.

### Import necessary libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re

### Extract file paths and names from directory

In [None]:
sl_2nd_solution

## Guinea

In [None]:
tmp = pd.read_csv('./Data/ebola/guinea_data/2014-08-04.csv')
np.array(tmp['Description'])

In [None]:
def read_data_guinea(file_list, columns):
    df = pd.DataFrame()
    for file in file_list:
        temp = pd.read_csv(file)
        country, date = (file.split("/")[3][:-5], file.split("/")[4][:10])
        temp = temp.loc[temp['Description'].isin(columns)]
        temp[temp.columns.values[2:]] = temp[temp.columns.values[2:]].apply(pd.to_numeric)
        temp = temp[['Description', 'Totals']]
        temp = temp.set_index('Description')
        temp = temp.T
        temp['Date'] = pd.to_datetime(date)
        temp['Country'] = country
        temp = temp.set_index('Date')
        #temp = column_filter(temp, country = country)
        df = pd.concat([df, temp])
    df = df.sort_index()
        
    return df

## Solution 1st approach

In [None]:
guinea_new_case_deaths = ['New deaths registered', 'New deaths registered today', 'Total new cases registered so far']

In [None]:
guinea = read_data_guinea([file for file in file_list if file.split("/")[3][:-5] == 'guinea'], guinea_new_case_deaths)
guinea.head()

In [None]:
guinea[guinea_new_case_deaths] = guinea[guinea_new_case_deaths].apply(pd.to_numeric)

**Assumption**: The attributes `New deaths registered today` and `New deaths registered` represent the same measure under two different names.
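
If the two attributes are the same measure, they can be merged by a row-wise sum. The sketch below, on invented values, shows one detail worth knowing: a plain `sum(axis=1)` turns a row with no report at all into 0, whereas `min_count=1` keeps it missing. The frame `toy` and the names `as_zero` and `as_nan` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Each report fills only one of the two equivalent columns; the last row has neither.
toy = pd.DataFrame({
    'New deaths registered today': [2.0, np.nan, np.nan],
    'New deaths registered': [np.nan, 5.0, np.nan],
})

# Default behaviour: NaN is skipped, so an all-NaN row sums to 0.
as_zero = toy.sum(axis=1)

# With min_count=1 a row needs at least one real value, otherwise it stays NaN.
as_nan = toy.sum(axis=1, min_count=1)

print(as_zero.tolist())
print(as_nan.tolist())
```

Which behaviour is preferable depends on whether a missing report should count as zero deaths or as unknown.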

## Final DataFrame Guinea - 1st approach

In [None]:
guinea['New deaths'] = guinea[['New deaths registered today', 'New deaths registered']].sum(axis = 1)
guinea.columns = ['Country', 'New deaths registered', 'New deaths registered today', 'New cases', 'New deaths']
guinea = guinea[['New cases', 'New deaths', 'Country']]
guinea

## Solution 2nd approach

In [None]:
liberia_2nd_solution

In [None]:
liberia.plot(y = ['New Case/s (Probable)', 'New Case/s (Suspected)',
       'New case/s (confirmed)', 'Newly Reported Cases in HCW'])

**Assumption**: The new-case values between 2014-12-04 and 2014-12-09 look wrong: they are very high and hardly fluctuate, as if an offset had been added to them. Because they are not monotonically increasing, they cannot simply be cumulative values entered in the wrong field or cell. We therefore discard this part of the data.

In [None]:
liberia = liberia.loc[:'2014-12-03']
liberia_2nd_solution = liberia_2nd_solution.loc[:'2014-12-03']

Create new attributes that aggregate values of New cases/deaths

**Assumption**: `New Case/s (Probable)`, `New Case/s (Suspected)`, `New case/s (confirmed)` and `Newly Reported Cases in HCW` are disjoint sets, so their values can be summed without double counting. The same goes for `Newly Reported deaths in HCW` and `Newly reported deaths`.

In [None]:
pd.set_option('mode.chained_assignment', None)
liberia.loc[:,'New cases'] = liberia[['New Case/s (Probable)', 'New Case/s (Suspected)',
       'New case/s (confirmed)', 'Newly Reported Cases in HCW']].sum(axis = 1)
liberia.loc[:,'New deaths'] = liberia[['Newly Reported deaths in HCW',
                                 'Newly reported deaths']].sum(axis = 1)
liberia = liberia[['Country', 'New cases', 'New deaths']]

### Final Liberia DataFrame - 1st approach

In [None]:
liberia.head()

In [None]:
#Aggregated values grouped by months
liberia.groupby(liberia.index.month).sum()

In [None]:
liberia.plot(y = 'New cases')

In [None]:
#Aggregated values grouped by months and Country
liberia.groupby([liberia.index.month, 'Country']).sum()

Function that counts the number of days covered by the data in each month.
For example, if the data for November runs from the 3rd to the 28th, it returns 26.
We need this to compute the daily average per month: the divisor is the number of days the data actually covers in that month.
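
The same count can be sketched with a `groupby` on the month. This is an illustrative alternative on invented dates, not the function used in this notebook; note that grouping by month number alone would merge the same month from different years.

```python
import pandas as pd

# Toy index: November reports from the 3rd to the 28th, December from the 1st to the 5th.
dates = pd.to_datetime(['2014-11-03', '2014-11-10', '2014-11-28',
                        '2014-12-01', '2014-12-05'])
toy = pd.DataFrame({'New cases': [1, 2, 3, 4, 5]}, index=dates)

# Day of month for every report, grouped by month number.
day = pd.Series(toy.index.day, index=toy.index)
per_month = day.groupby(toy.index.month).agg(['min', 'max'])

# Days covered = last reported day - first reported day + 1.
covered_days = per_month['max'] - per_month['min'] + 1
print(covered_days)
```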

In [None]:
def first_n_last_days(df):
    #Positional indices of the first and the last report of each month
    days = df.index.day
    first_days = [0]
    last_days = []
    for i in range(1, len(days)):
        #A drop in the day-of-month value means that a new month has started
        if days[i] < days[i-1]:
            first_days.append(i)
            last_days.append(i-1)
    last_days.append(len(days)-1)
    
    #Integer arrays, so that they can be passed directly to .iloc
    return(np.array(last_days), np.array(first_days)) 

In [None]:
last_days, first_days = first_n_last_days(liberia_2nd_solution)
print(first_days, last_days)

In [None]:
#Generate list of months for indexing purposes
liberia_months = np.array(liberia_2nd_solution.index.month.drop_duplicates())

In [None]:
last_liberia = liberia_2nd_solution.iloc[last_days]
last_liberia

In [None]:
first_liberia = liberia_2nd_solution.iloc[first_days]
first_liberia

In [None]:
#Subtracting first-day-of-the-month values from last-day-of-the-month values
New_cases = last_liberia['Total cases'].values - first_liberia['Total cases'].values
New_deaths = last_liberia['Total deaths'].values - first_liberia['Total deaths'].values
d = {'New cases': New_cases, 'New deaths': New_deaths, 'Country': 'liberia'}
liberia_2nd_solution = pd.DataFrame(data=d, index=liberia_months)

## Final Liberia DataFrame - 2nd approach

**Assumption**: Because much of the data is incomplete, we use only the `National` attribute for deaths and new cases. Some reports contain figures for a few cities but no figure at the national scale. We ignore those figures, because we do not want to infer country-level numbers from only a few cities.

## Sierra Leone

In [None]:
#Almost the same function as above
def read_data_sl(file_list, columns):
    df = pd.DataFrame()
    for file in file_list:
        temp = pd.read_csv(file, na_values='-')
        country, date = (file.split("/")[3][:-5], file.split("/")[4][:10])
        #Filtering rows that we are interested in
        temp = temp.loc[temp['variable'].isin(columns)]
        #Casting objects to numeric type
        temp[['National']] = temp[['National']].apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',',''), errors='coerce'))
        temp[temp.columns.values[2:]] = temp[temp.columns.values[2:]].apply(pd.to_numeric)
        #If there is no value in the 'National' column, sum values from columns with regional information, 
        #which should sum up to 'National' value (a poor way of filling NA values)
        temp.loc[temp['National'].isnull(), 'National'] = temp.sum(axis = 1)
        temp = temp[['variable', 'National']]
        temp = temp.set_index('variable')
        temp = temp.T
        temp['Date'] = pd.to_datetime(date)
        temp['Country'] = country
        temp = temp.set_index('Date')
        df = pd.concat([df, temp])
    df = df.sort_index()
        
    return df

## Final DataFrame Guinea - 2nd solution

In [None]:
guinea_2nd_solution

## Merge all DataFrames

## 1st approach
#### Reminder:
**1st approach**: Transform the data to obtain a final DataFrame containing all the information about new cases and deaths by `date` and `Country`. Then group by `Country` and `month` and take the mean.
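
As a minimal sketch of this aggregation, the example below groups an invented frame by country and month and takes the mean. The frame `toy` and its values are assumptions for illustration; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy per-report data with a DatetimeIndex (values are invented).
toy = pd.DataFrame({
    'Country': ['guinea', 'guinea', 'guinea', 'liberia', 'liberia'],
    'New cases': [10, 20, 6, 4, 8],
    'New deaths': [1, 3, 2, 0, 2],
}, index=pd.to_datetime(['2014-08-04', '2014-08-26', '2014-09-02',
                         '2014-08-05', '2014-08-06']))
toy.index.name = 'Date'

# Group by country and by the month taken from the index, then average per report.
month = toy.index.month.rename('Month')
daily_avg = toy.groupby(['Country', month])[['New cases', 'New deaths']].mean()
print(daily_avg)
```

Taking the mean treats every report as one day; dividing a monthly sum by the number of covered days is the other convention used in this notebook.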

In [None]:
total = pd.concat([sl, liberia, guinea])

In [None]:
total.index

In [None]:
total['colDateToSort'] = total.index
total = total.sort_values(['Country', 'colDateToSort'])
total = total.drop('colDateToSort', axis=1)
total

In [None]:
days_in_each_month = np.concatenate((number_of_days(guinea), number_of_days(liberia), number_of_days(sl)))
days_in_each_month

In [None]:
total.groupby([total.index.month, 'Country']).sum().reset_index().set_index(['Country', 'Date']).sort_index()

### 1st approach solution
**Assumption**: We treat the Ebola dataset as complete: no files are missing, and if there is a gap of a few days between one file and the next, we assume that the later file covers the days in between.

In [None]:
#The code after the sum function is used only for visual purposes, to present the data in easy way or to make a proper division.
sol1 = total.groupby([total.index.month, 'Country']).sum().reset_index().set_index(['Country', 'Date']).sort_index().divide(days_in_each_month, axis = 0)
sol1 = sol1.rename(columns = {'New cases' : 'New cases daily avg', 'New deaths' : 'New deaths daily avg'})
sol1.index.rename(['Country','Month'],inplace=True)
sol1

### 1st approach - another solution
**Assumption**: We can treat each file as a daily report concerning only this one particular date.
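
The two assumptions lead to different averages when reports are sparse. The toy series below (invented values) contrasts them: dividing the monthly sum by the days spanned versus averaging over the reports themselves. The variable names are hypothetical.

```python
import pandas as pd

# Three reports in one month, with gaps between them (values are invented).
reports = pd.Series([7, 14, 21],
                    index=pd.to_datetime(['2014-11-03', '2014-11-10', '2014-11-28']))

# Convention 1: each report also covers the gap before it,
# so the monthly sum is divided by the number of days spanned.
days_spanned = reports.index.day.max() - reports.index.day.min() + 1
avg_sum_over_span = reports.sum() / days_spanned

# Convention 2: each report describes its own date only,
# so the daily average is the mean over the reports.
avg_per_report = reports.mean()

print(days_spanned, avg_sum_over_span, avg_per_report)
```

With daily reports and no gaps the two conventions coincide; the sparser the reports, the further apart the results.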

In [None]:
#Search of negative values of daily deaths
negatives = (sl['Deaths'].values[1:] - sl['Deaths'].values[:-1]).astype(int) < 0
negatives

In [None]:
sl.plot(y = 'Deaths')

In [None]:
indicies_negatives = np.array([i for i, x in enumerate(negatives) if x])
indicies_negatives

In [None]:
sl.iloc[indicies_negatives-1]['Deaths'].values

In [None]:
sl.iloc[(indicies_negatives)]['Deaths'].values

In [None]:
sl.iloc[(indicies_negatives+1)]['Deaths'].values

In [None]:
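#Overwriting the anomalous value with the mean of the neighbouring values
#(positional assignment: `negatives` is one element shorter than `sl`, so it cannot serve as a row mask)
sl.iloc[indicies_negatives, sl.columns.get_loc('Deaths')] = np.round((sl.iloc[(indicies_negatives+1)]['Deaths'].values + sl.iloc[(indicies_negatives-1)]['Deaths'].values)/2)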

In [None]:
sl['Deaths'].plot()

In [None]:
#Create the new attribute 'New deaths' by subtracting the shifted 'Deaths' attribute from 'Deaths',
#and use the value in 'etc_new_deaths' for the first day.
#This works because the Sierra Leone data is fairly complete: almost all daily reports are present.
sl['New deaths'] = np.append(sl['etc_new_deaths'].values[0], (sl['Deaths'].values[1:] - sl['Deaths'].values[:-1]).astype(int))

In [None]:
#Creating new attribute 'New cases'
sl['New cases'] = sl[['new_suspected', 'new_probable', 'new_confirmed']].sum(axis=1)

In [None]:
#Create new DataFrame only with significant attributes
sl = sl[['New cases', 'New deaths', 'Country']]

## Final Sierra Leone DataFrame - 1st approach

In [None]:
sol2 = total.groupby([total.index.month, 'Country']).mean().reset_index().set_index(['Country', 'Date']).sort_index()
sol2 = sol2.rename(columns = {'New cases' : 'New cases daily avg', 'New deaths' : 'New deaths daily avg'})
sol2.index.rename(['Country','Month'],inplace=True)
sol2

## 2nd approach
#### Reminder:
**2nd approach**: Transform the data to obtain a DataFrame that collects the total (cumulative) number of cases and deaths. Next, find the first and the last day of each month, subtract the first-day value from the last-day value, and divide by the number of days between those two dates.
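
The arithmetic of this approach can be sketched on an invented cumulative series. The frame `cum` and its values are assumptions for illustration, and the grouping by month is a compact stand-in for the first-day and last-day lookups used in this notebook.

```python
import pandas as pd

# Toy cumulative totals (values are invented).
cum = pd.DataFrame({
    'Total cases': [100, 130, 160, 170, 200],
}, index=pd.to_datetime(['2014-08-02', '2014-08-15', '2014-08-30',
                         '2014-09-03', '2014-09-23']))

month = cum.index.month

# Cumulative value on the first and the last reported day of each month.
first = cum.groupby(month)['Total cases'].first()
last = cum.groupby(month)['Total cases'].last()

# Number of days between those two reports, inclusive.
day = pd.Series(cum.index.day, index=cum.index)
span = day.groupby(month).max() - day.groupby(month).min() + 1

# Daily average = growth of the cumulative total divided by the days covered.
daily_avg = (last - first) / span
print(daily_avg)
```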

In [None]:
total_cum = pd.concat([sl_2nd_solution, liberia_2nd_solution, guinea_2nd_solution])
total_cum.index.name = 'Month'
total_cum.reset_index().set_index(['Country', 'Month']).sort_index().divide(days_in_each_month, axis = 0)

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

**Solution:**

First, we want to combine the 9+1 different Excel spreadsheets into one data frame. We do this with a for loop: outside the loop, we read the first file (MID1.xls) into a data frame (rna_data); then we loop over numbers 2-9, reading each file and appending its contents to our data frame. To preserve the information on which entry came from which file, we create a `MultiIndex` for the data frame, where the first level corresponds to the file name and the second level is the index of the entry (row number in the file, starting from 0). Column names in the data frame default to 0 and 1, and we leave them as is, since we have no further information on what the data represents.

Once the data has been read, we name the first level of the `MultiIndex` "BARCODE" and the second level "index_no". Then, we read the metadata file (metadata.xls) to a data frame using the column "BARCODE" as index, and join it with the rna_data data frame. The join is performed on the common index, which in this case is "BARCODE". Since all "BARCODE" values are contained in both data frames, it doesn't matter whether we do an inner or outer join.

Finally, we change NaN values to "unknown", and check that our indices are unique and that no NaN values are left in the data frame.
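
The same steps can also be expressed more compactly with `pd.concat` and a dictionary of frames, which builds the two-level index in one call. The sketch below is an alternative under stated assumptions: it uses small in-memory stand-ins instead of the Excel files, and the metadata columns `GROUP` and `SAMPLE` and all values are invented.

```python
import pandas as pd

# Toy stand-ins for the per-file sheets (contents are invented).
sheets = {
    'MID1': pd.DataFrame([['Archaea A', 7], ['Bacteria B', 2]]),
    'MID2': pd.DataFrame([['Archaea A', 3]]),
}

# Toy metadata indexed by BARCODE, with one missing entry.
metadata = pd.DataFrame({'BARCODE': ['MID1', 'MID2'],
                         'GROUP': ['control', 'treated'],
                         'SAMPLE': [None, 'tissue']}).set_index('BARCODE')

# The dict keys become the first index level, the row numbers the second.
combined = pd.concat(sheets, names=['BARCODE', 'index_no'])

# Join on the shared BARCODE level and replace missing values with a tag.
combined = combined.join(metadata, how='inner').fillna('unknown')

print(combined)
print(combined.index.is_unique, combined.isnull().any().any())
```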

In [None]:
# Load first file
rna_data = pd.read_excel(DATA_FOLDER+"/microbiome/MID1.xls",sheet_name="Sheet 1",header=None)
rna_indices = [["MID1"]*len(rna_data.index), list(rna_data.index)] # List of lists of first and second level indices
tuples = list(zip(*rna_indices)) # Make indices into tuples of first and second level
rna_data.index = pd.MultiIndex.from_tuples(tuples) # Create MultiIndex where each index is from a tuple

# Append other files
for ii in range(2,10):
    next_sheet = pd.read_excel(DATA_FOLDER+"/microbiome/MID%d.xls" % ii,sheet_name="Sheet 1",header=None)
    rna_indices = [["MID%d" % ii]*len(next_sheet.index), list(next_sheet.index)] # First level index according to file
    tuples = list(zip(*rna_indices))
    next_sheet.index = pd.MultiIndex.from_tuples(tuples)
    rna_data = pd.concat([rna_data, next_sheet],verify_integrity=True) # Append next sheet to data frame

# Change index names to match metadata
rna_data.index.names = ["BARCODE","index_no"]

# Read and join metadata, use BARCODE as index for metadata
rna_metadata = pd.read_excel(DATA_FOLDER+"/microbiome/metadata.xls",index_col="BARCODE")
rna_data = rna_data.join(rna_metadata, how='outer')

# Change NaNs to unknown
rna_data.fillna(value="unknown",inplace=True)

# Check that indices are unique and that no NaNs are left, and print data frame dimensions
print("All indices are unique: " + str(rna_data.index.is_unique))
print("Data frame contains NaN values: " + str(rna_data.isnull().any().any()))
print("Data frame dimensions: " + str(rna_data.shape))

In [None]:
rna_data

**Another solution:**

We might also want to index the data based on the first column (that was named 0 in rna_data). We construct a new data frame (rna_data_v2), where we rename columns 0 and 1 to NAME and VALUE, respectively (we assume that column 0 refers to species name and column 1 is some measured value). We then use BARCODE and NAME as a `MultiIndex` in rna_data_v2.

We also check that indices are unique and no NaNs are left. The number of columns in rna_data_v2 is one less than in rna_data, because here we use one of the columns as index instead.
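
A minimal sketch of this re-indexing, on an invented frame shaped like rna_data after the join, might look as follows. The frame `rna_toy`, the metadata column `GROUP` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with BARCODE as a regular column and the default column names 0 and 1.
rna_toy = pd.DataFrame({
    'BARCODE': ['MID1', 'MID1', 'MID2'],
    0: ['Archaea A', 'Bacteria B', 'Archaea A'],
    1: [7, 2, 3],
    'GROUP': ['control', 'control', 'treated'],
})

# Rename the unnamed columns, then index by (BARCODE, NAME).
rna_toy_v2 = (rna_toy
              .rename(columns={0: 'NAME', 1: 'VALUE'})
              .set_index(['BARCODE', 'NAME']))

print(rna_toy_v2)
print(rna_toy_v2.index.is_unique)
```

The index is unique here only because no name repeats within a barcode; that property has to be checked on the real data.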

In [None]:
sl.head()

In [None]:
sl_deaths_cases_cum = ['cum_confirmed', 'cum_probable', 'cum_suspected', 
                       'death_suspected', 'death_probable','death_confirmed']
sl_cum = read_data_sl([file for file in file_list if file.split("/")[3][:-5] == 'sl'], columns = sl_deaths_cases_cum)

sl_cum['Total cases'] = sl_cum[['cum_confirmed', 'cum_probable', 'cum_suspected']].sum(axis = 1).replace(to_replace=0, method='ffill')
sl_cum['Total deaths'] = sl_cum[['death_suspected', 'death_probable','death_confirmed']].sum(axis = 1).replace(to_replace=0, method='ffill')

sl_cum.head()

In [None]:
sl_cum = sl_cum[['Total cases', 'Total deaths']]
last_days, first_days = first_n_last_days(sl_cum)
last_sl = sl_cum.iloc[last_days]
last_sl

In [None]:
first_sl = sl_cum.iloc[first_days]
first_sl

In [None]:
#Subtracting first-day-of-the-month values from last-day-of-the-month values
New_cases = last_sl['Total cases'].values - first_sl['Total cases'].values
New_deaths = last_sl['Total deaths'].values - first_sl['Total deaths'].values

#Generate list of months for indexing purposes
sl_months = np.array(sl_cum.index.month.drop_duplicates())

d = {'New cases': New_cases, 'New deaths': New_deaths, 'Country': 'sl'}
sl_2nd_solution = pd.DataFrame(data=d, index=sl_months)

### Final Sierra Leone DataFrame - 2nd approach

In [None]:
def number_of_days(df):
    first_days = [np.append([], df.index.day[0])]
    last_days = []
    last = 0
    for x in np.array(df.index.day):
        if x < last:
            first_days = np.append(first_days, x)
            last_days = np.append(last_days, last)
        last = x
    last_days = np.append(last_days, df.index.day[-1])
    
    return(last_days - first_days + 1) 


In [None]:
print(number_of_days(liberia))

Function that returns two arrays: one with the indices of the first day of each month, the other with the indices of the last day of each month.

#### Additional function to filter attributes by words (currently unused)
This function keeps only those columns whose names contain specific keywords.
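
For comparison, pandas offers a built-in way to keep columns by keyword through `DataFrame.filter` with a regular expression. The sketch below is an alternative illustration on an invented, empty frame, not the helper defined in this notebook.

```python
import pandas as pd

# Toy frame that only carries column names (no rows are needed for the demo).
toy = pd.DataFrame(columns=['Newly reported deaths', 'New Case/s (Probable)',
                            'Total contacts listed', 'Date'])

# Keep columns whose name contains "death"/"Death" or "case"/"Case".
kept = toy.filter(regex=r'[Dd]eath|[Cc]ase')
print(list(kept.columns))
```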

In [None]:
def column_filter(data_frame, country = 'liberia'):
    mylist = data_frame.columns.values
    if country == 'liberia':
        r = re.compile(".*(([Dd]eath)|([Cc]ase)).*")
    elif country == 'sl':
        r = re.compile(".*([Nn]ew).*")
    elif country == 'guinea':
        r = re.compile(".*([Nn]ew).*")
    newlist = list(filter(r.match, mylist))
    data_frame = data_frame[newlist]
    newlist = [" ".join(name.replace('\n ','').replace('/s', 's').replace('_', ' ').split()) for name in newlist[:]]
    data_frame.columns = newlist
    return data_frame

### Function to read Liberia files

In [None]:
def read_data_liberia(file_list, columns):
    df = pd.DataFrame()
    for file in file_list:
        temp = pd.read_csv(file)
        #Retrieve the country name/abbreviation and the date from the file name (the files use different
        #date notations; the only consistent one is the notation in the filename)
        country, date = (file.split("/")[3][:-5], file.split("/")[4][:10])
        #Leave only these rows of the table which have in their variable attribute value from columns argument 
        temp = temp.loc[temp['Variable'].isin(columns)]
        temp = temp[['Variable', 'National']]
        #Set 'Variable' as an index to make it easier to transpose in terms of cleaning and DataFrame format
        temp = temp.set_index('Variable')
        temp = temp.T
        #Change type of 'date' object
        temp['Date'] = pd.to_datetime(date)
        #Get country name from file path
        temp['Country'] = country
        temp = temp.set_index('Date')
        #temp = column_filter(temp, country = country)
        #concatenate created DataFrame with cumulative DataFrame
        df = pd.concat([df, temp])
    df = df.sort_index()
        
    return df

##### Read files concerning Liberia cases

In [None]:
'''Attributes chosen as those that contribute to New cases/deaths. 
   These were chosen just by inspecting the csv files'''
new_cases_deaths_liberia = ['New Case/s (Probable)', 'New Case/s (Suspected)',
       'New case/s (confirmed)','Newly Reported Cases in HCW', 'Newly Reported deaths in HCW',
       'Newly reported deaths']

total_cases_deaths_liberia = ['Total death/s in confirmed, probable, suspected cases', 
                      'Total suspected cases', 'Total probable cases',
                      'Total confirmed cases','Total death/s in confirmed, \n probable, suspected cases']

#Read files that have 'liberia' word in their filepath
liberia = read_data_liberia([file for file in file_list if file.split("/")[3][:-5] == 'liberia'], new_cases_deaths_liberia)

#We discard values from 'ebola/liberia_data/2014-10-04-v142.csv' because it has duplicated columns
#with different values so the data is wrong, but in the second approach it doesn't matter at all.
liberia_2nd_solution = read_data_liberia([file for file in file_list if file.split("/")[3][:-5] == 'liberia' and file !='./Data/ebola/liberia_data/2014-10-04-v142.csv'], total_cases_deaths_liberia)
liberia_2nd_solution['Total cases'] = liberia_2nd_solution[['Total suspected cases', 'Total probable cases', 'Total confirmed cases']].sum(axis = 1)
liberia_2nd_solution['Total deaths'] = liberia_2nd_solution[['Total death/s in confirmed, \n probable, suspected cases',
                                                            'Total death/s in confirmed, probable, suspected cases']].sum(axis = 1)
liberia_2nd_solution = liberia_2nd_solution[['Total cases', 'Total deaths', 'Country']]

In [None]:
liberia

In [None]:
#Columns that we are interested in
new_cases_deaths_sl = ['etc_new_deaths','new_confirmed', 'new_probable', 'new_suspected', 
                       'death_suspected', 'death_probable','death_confirmed']

In [None]:
sl = read_data_sl([file for file in file_list if file.split("/")[3][:-5] == 'sl'], columns = new_cases_deaths_sl)
sl

**Assumption**: The attribute `etc_new_deaths` counts towards the `New deaths` attribute.

In [None]:
file_list = []
for path, subdirs, files in os.walk("./" + DATA_FOLDER + "/ebola/"):
    for name in files:
        if(name.split(".")[-1]) == "csv":
            file_list.append(os.path.join(path, name))

#Normalize path separators, in case the notebook is run on Windows
file_list = [file.replace('\\', '/') for file in file_list]
file_list

In [None]:
guinea_total_case_deaths = ['Cumulative (confirmed + probable + suspects)', 'Total deaths (confirmed + probables + suspects)']

In [None]:
guinea_total = read_data_guinea([file for file in file_list if file.split("/")[3][:-5] == 'guinea'], guinea_total_case_deaths)
guinea_total

In [None]:
guinea_total['Total deaths (confirmed + probables + suspects)'].plot()

In [None]:
#Replace the obviously bad value with the next day's value, which is a much better approximation than the previous day's value
guinea_total.loc['2014-08-26', 'Total deaths (confirmed + probables + suspects)'] = guinea_total.loc['2014-08-27', 'Total deaths (confirmed + probables + suspects)']

In [None]:
guinea_total[['Total cases', 'Total deaths']] = guinea_total[guinea_total_case_deaths]
guinea_total = guinea_total[['Total cases', 'Total deaths', 'Country']]
guinea_total.head()

In [None]:
guinea_cum = guinea_total[['Total cases', 'Total deaths']]
last_days, first_days = first_n_last_days(guinea_cum)
last_guinea = guinea_cum.iloc[last_days]
last_guinea

In [None]:
first_guinea = guinea_cum.iloc[first_days]
first_guinea

In [None]:
#Subtracting first-day-of-the-month values from last-day-of-the-month values
New_cases = last_guinea['Total cases'].values - first_guinea['Total cases'].values
New_deaths = last_guinea['Total deaths'].values - first_guinea['Total deaths'].values

#Generate list of months for indexing purposes
guinea_months = np.array(guinea_cum.index.month.drop_duplicates())

d = {'New cases': New_cases, 'New deaths': New_deaths, 'Country': 'guinea'}
guinea_2nd_solution = pd.DataFrame(data=d, index=guinea_months)

**Note**: This approach has a drawback: if a month contains only one report, its first and last day are the same, so the computed difference is zero.
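
The drawback is easy to reproduce on an invented cumulative series in which October has a single report. The second computation, which compares each month's last value with the previous month's last value, is only one possible workaround and is an assumption of this sketch, not the fix applied in this notebook.

```python
import pandas as pd

# Toy cumulative totals; October has only one report (values are invented).
cum = pd.Series([300, 340, 395, 410, 450],
                index=pd.to_datetime(['2014-09-10', '2014-09-30', '2014-10-01',
                                      '2014-11-02', '2014-11-20']),
                name='Total cases')
month = cum.index.month

# First-to-last difference inside each month: the single October report gives 0.
within_month = cum.groupby(month).last() - cum.groupby(month).first()

# Possible workaround (an assumption): difference between consecutive month-end values.
month_end = cum.groupby(month).last()
vs_previous = month_end.diff()

print(within_month)
print(vs_previous)
```

The workaround leaves the first month undefined, because there is no earlier month-end value to compare against.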

In [None]:
#Assign proper value
guinea_2nd_solution.loc[10] = guinea.loc['2014-10-01']