# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [2]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `Dataframe`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

In [2]:
# Write your answer here

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

### Solution

#### Method
1. Import data files and name columns correctly (tissue/stool/NA, see metadata).
2. Concatenate the separate DataFrames to a single DataFrame while adding a  
   hierarchical index (Control Group, Test Type) in line with metadata values.
3. Replace NaN values with 'unknown'.

#### Code

Import the data files as DataFrames and save them in a list.

In [294]:
%matplotlib notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
#import seaborn as sns
#sns.set_context('notebook')
plt.style.use('ggplot')

frames = [pd.read_excel('{}/microbiome/MID{}.xls'
                        .format(DATA_FOLDER, x), 'Sheet 1', header=None, index_col=0) for x in range(1,10)]
frames[0].head(2)

# helper function to label bars in histograms later
def autolabel(rects):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

Quick size and index uniqueness check.

In [4]:
for m in frames:
    print('{:10}'.format(str(m.shape)), end='')
    print(m.index.is_unique)

(272, 1)  True
(288, 1)  True
(367, 1)  True
(134, 1)  True
(379, 1)  True
(181, 1)  True
(395, 1)  True
(99, 1)   True
(281, 1)  True


Rename columns according to metadata.

In [5]:
for m in frames:
    m.index.name = 'Taxon'
frames[0].columns = ['NA']
for m in frames[1:5]:
    m.columns = ['Tissue']
for m in frames[5:9]:
    m.columns = ['Stool']

Concatenate with correct hierarchical indexing according to metadata.

In [6]:
data = pd.concat(frames, axis=1)
data.columns.names = ['Type']
data.index.name = 'Taxon'
data.head(5)
arrays = [['EXTRACTION CONTROL', 'NEC 1', 'Control 1', 'NEC 2',
           'Control 2', 'NEC 1', 'Control 1', 'NEC 2', 'Control 2'],
          ['NA'] + ['Tissue']*4 + ['Stool']*4]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
data.columns = index
data = data.fillna('unknown')
data.head(15)

# ['EXTRACTION CONTROL', 'NEC 1', 'Control 1', 'NEC 2', 'Control 2', 'NEC 1', 'Control 1', 'NEC 2', 'Control 2']

first,EXTRACTION CONTROL,NEC 1,Control 1,NEC 2,Control 2,NEC 1,Control 1,NEC 2,Control 2
second,NA,Tissue,Tissue,Tissue,Tissue,Stool,Stool,Stool,Stool
Taxon,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",unknown,2,1,unknown,5,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",unknown,14,15,unknown,26,unknown,1,unknown,1
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7,23,14,2,28,7,8,unknown,16
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Stetteria",unknown,unknown,unknown,unknown,1,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",unknown,1,4,unknown,5,1,2,unknown,2
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Thermodiscus",unknown,unknown,1,unknown,unknown,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera",unknown,2,1,unknown,2,unknown,1,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Hyperthermus",unknown,1,unknown,unknown,unknown,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrodictium",unknown,unknown,3,unknown,2,1,1,unknown,5
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus",2,2,unknown,unknown,3,2,1,unknown,unknown


## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [7]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

0,1,2,3,4,5
Name,Labels,Units,Levels,Storage,NAs
pclass,,,3,integer,0
survived,Survived,,,double,0
name,Name,,,character,0
sex,,,2,integer,0
age,Age,Year,,double,263
sibsp,Number of Siblings/Spouses Aboard,,,double,0
parch,Number of Parents/Children Aboard,,,double,0
ticket,Ticket Number,,,character,0
fare,Passenger Fare,British Pound (\243),,double,1

0,1
Variable,Levels
pclass,1st
,2nd
,3rd
sex,female
,male
cabin,
,A10
,A11
,A14


For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

### Solutions

### 1.
*Describe the type and the value range of each attribute. Indicate and transform the attributes that can be Categorical.*

#### Answer
A categorial attribute is an attribute that can only have a limited number of values. This value assigns each item to a nominal group/category.
The following attribues in the titanic data are categorial:
* Travel class ('pclass')
* If the passenger survived or not ('survived')
* Sex ('sex')
* Cabin ('cabin')
* Boat ('boat')
* Where the passenger embarked ('embarked')

#### Code

Import data and convert attributes from above to categorial.

In [341]:
titanic = pd.read_excel(DATA_FOLDER+'/titanic.xls', 'titanic', header=0)
# titanic.loc[titanic.duplicated('name')]
# titanic.loc[titanic.fare.isnull()]
titanic['sex'] = titanic.sex.astype('category')
titanic['embarked'] = titanic.embarked.astype('category')
titanic['boat'] = titanic.embarked.astype('category')

titanic.head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,S,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,S,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,S,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,S,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,S,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,S,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,S,,"Hudson, NY"
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,S,,"Belfast, NI"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,S,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,C,22.0,"Montevideo, Uruguay"


### 2.
*Plot histograms for the travel class, embarkation port, sex and age attributes. For the latter one, use discrete decade intervals.*

#### Method
Count number of values in the travel class, embarkation port, and sex categories. Plot results in histograms.
For age, count number of values using decade-sized bins and then plot.

#### Code

In [342]:
fig, axes = plt.subplots(nrows=2, ncols=2)

# Travel class
axes[0, 0].bar(range(3), titanic['pclass'].value_counts(), width=0.5, color='b', alpha=0.9)
plt.sca(axes[0, 0])
plt.xticks(range(3), ['1st class', '2nd class', '3rd class'])
plt.title('Nr of passengers in travel classes', fontsize=10)

# Embarkation port
axes[0, 1].bar(range(3), titanic['embarked'].value_counts(), width=0.5, color='g', alpha=0.9)
plt.sca(axes[0, 1])
plt.xticks(range(3), ['S', 'C', 'Q'])
plt.title('Embarkation port', fontsize=10)

# Males and females
axes[1, 0].bar(range(2), titanic.sex.value_counts(), width=0.25, color='#a30b3b', alpha=0.9)
plt.sca(axes[1, 0])
plt.xticks(range(2), ['Male', 'Female'])
plt.title('Nr of male and female passengers', fontsize=10)

# Age
ages = titanic['age'].copy().dropna() # remove all NaN ages
# note that max age is 80 (titanic.age.max() yields 80.0)
axes[1, 1].hist(ages, bins=range(0, 81, 10), color='#c98404', alpha=0.9)
plt.sca(axes[1, 1])
plt.xticks(range(0, 81, 10))
plt.title('Age of passengers', fontsize=10)

fig.tight_layout()
plt.show()

<IPython.core.display.Javascript object>

### 3.
*Calculate the proportion of passengers by cabin floor. Present your results in a pie chart.*

#### Method

Handle the cabin data (split multi-cabin values), count first letter frequency of the cabin values and plot result as percentage of all (known) cabins.  
Will also throw in a pie chart with percentages passengers with known/unknown cabin values.

*Comment*  
If you take a look at the data the cabins are known for most of the 1st class passengers, but are usually unknown for 2nd and 3rd class passengers. Thus this will mostly provide information on the cabin floor distribution of the first class passengers, which might unfairly skew the data.

#### Code
Handle and count cabin values.

In [343]:
cabins = titanic.cabin.copy()
# Note we have many empty values here,
# especially for the 2nd and 3rd class passengers (scroll through the xls file).
print('Null: {}\nNot null: {}'.format(cabins.isnull().sum(), cabins.notnull().sum()))
cabins = cabins.str.split()
cabins = cabins.fillna('nan')

# Handle passengers with multiple cabins by splitting into separate series entries 
# follows order of nested loops, read right->left <=> in->out and it makes sense
split = [(cabin if row!='nan' else 'nan') for row in cabins for cabin in row]

floorcount = {'unknown':0}
for c in split:
    if c != 'nan' and len(c) > 0:
        if c[0] in floorcount:
            floorcount[c[0].upper()] += 1
        else:
            floorcount[c[0].upper()] = 1
    else:
        floorcount['unknown'] += 1
print(floorcount)
# exclude the unknowns for now
knowncount = {key: floorcount[key] for key in floorcount if key != 'unknown'}
print(knowncount)
labels, count = zip(*knowncount.items())
print(labels, end=' -> ')
print(count)

n_known = sum(knowncount.values())
per_of_known = [100*x/n_known for x in count]
per_unknown_of_total = 100*titanic.cabin.isnull().sum()/len(cabins)
print('Percentages of known:', [round(v, 2) for v in per_of_known])
print('Unknown as percentage of total:', round(per_unknown_of_total, 2), 'percent')

Null: 1014
Not null: 295
{'unknown': 3042, 'C': 114, 'A': 22, 'E': 45, 'T': 1, 'D': 48, 'B': 96, 'F': 21, 'G': 9}
{'T': 1, 'C': 114, 'A': 22, 'E': 45, 'G': 9, 'B': 96, 'D': 48, 'F': 21}
('T', 'C', 'A', 'E', 'G', 'B', 'D', 'F') -> (1, 114, 22, 45, 9, 96, 48, 21)
Percentages of known: [0.28, 32.02, 6.18, 12.64, 2.53, 26.97, 13.48, 5.9]
Unknown as percentage of total: 77.46 percent


 Present results in a pie charts.

In [344]:
fig1, axes = plt.subplots(1, 2, figsize=(12, 4))

# As percentage of known
axes[0].pie(per_of_known, labels=labels, autopct='%1.1f%%', startangle=0,  explode=[0.04]*8)
plt.sca(axes[0])
plt.axis('equal')
plt.legend(title='Cabin floor', loc=3)
plt.title('Cabin floor of passengers with known cabins')

# Percentage unknown of total
axes[1].pie([per_unknown_of_total, 100-per_unknown_of_total], labels=('Unknown', 'Known'),
            autopct='%1.1f%%', startangle=90)
plt.sca(axes[1])
plt.axis('equal')
plt.title('Passangers with known cabins')

plt.tight_layout()
plt.show()

<IPython.core.display.Javascript object>

### 4.
For each travel class, calculate the proportion of the passengers that survived. Present your results in pie charts.

#### Method
Group survival data for each passenger by travel class. For each travel class, pie-plot survivors as percentage of total passengers in that travel class.

#### Code

In [347]:
totals = titanic.pclass.value_counts()
survivors = titanic.loc[titanic.survived==1]['pclass'].value_counts()
per_class_surv = []
[100*survivors[i]/totals[i] for i in range(1,4)]

fig1, axes = plt.subplots(1, 3, figsize=(10, 5))

for i in range(3):
    p_surv = 100*survivors[i+1]/totals[i+1]
    p_not = 100 - p_surv
    axes[i].pie([p_not, p_surv], autopct='%1.1f%%', startangle=90)
    plt.sca(axes[i])
    plt.axis('equal')
    plt.title('Passenger class '+str(i+1), fontsize=12)
    if i == 1:
        plt.legend(['Did not make it', 'Survived'], loc=8)

plt.show()

<IPython.core.display.Javascript object>

### 5.
*Calculate the proportion of the passengers that survived by travel class and sex. Present your results in a single histogram.*

#### Method
Group by travel class and sex, then count survivors for each group. For each group, plot survivors as percentage of total number of passenger in that group.

#### Code

In [349]:
totals = titanic[['pclass','survived', 'sex',]].groupby(['pclass', 'sex']).count()
surv = titanic.loc[titanic.survived==1][['pclass', 'sex']]

surv_count = []
surv_per = []

for i in range(1,4):
    surv_count.append(len(surv.loc[(surv.pclass==i) & (surv.sex=='male')]))
    surv_count.append(len(surv.loc[(surv.pclass==i) & (surv.sex=='female')]))
    
    surv_per.append(surv_count[-2]/int(totals.loc[i, 'male']))
    surv_per.append(surv_count[-1]/int(totals.loc[i, 'female']))

fig, ax = plt.subplots(figsize=(10,4))

ax.bar(range(6), surv_per, width=0.9, color='#1f60d1', alpha=0.8)
plt.sca(ax)
plt.xticks(range(6), ['1st - Male', '1st - Female', '2nd - Male',
                      '2nd - Female', '3rd - Male', '3rd - Female'])
ax.set_yticklabels(['{:3.2f}%'.format(x*100) for x in ax.get_yticks()])
[ax.get_children()[i].set_color('#cc1836') for i in [1, 3, 5]]
plt.title('Proportion of passengers that survived by travel class and sex', fontsize=10)
plt.show()

<IPython.core.display.Javascript object>

### 6.
*Create 2 equally populated age categories and calculate survival proportions by age category, travel class and sex. Present your results in a DataFrame with unique index.*

#### Method
Split data on median age. For each half follow save process as above, i.e.
> Group by travel class and sex, then count survivors for each group. For each group, plot survivors as percentage of total number of passenger in that group.

Plot result as two histograms, one for younger half and one for the older half of the passengers.

#### Code

In [351]:
ship = titanic.loc[titanic.age.notnull()] # skip rows with undefined age
print('Splitting on median:', ship.age.median()) # median is 28.0
age_cat = pd.qcut(ship.age, 2, labels=['bottom', 'top'])
btm = ship.loc[age_cat == 'bottom'].copy()
top = ship.loc[age_cat == 'top'].copy()

totals_btm = btm[['pclass','survived', 'sex',]].groupby(['pclass', 'sex']).count()
totals_top = top[['pclass','survived', 'sex',]].groupby(['pclass', 'sex']).count()

surv_btm = btm.loc[btm.survived==1][['pclass', 'sex']]
surv_top = top.loc[top.survived==1][['pclass', 'sex']]

surv_count_btm = []
surv_count_top = []
surv_per_btm = []
surv_per_top = []

# could have turned thnkgs into vectors of tuples/vectors and used a double loop
# but let's just do this for now since we only have only two age bins
for i in range(1,4):
    surv_count_btm.append(len(surv_btm.loc[(surv_btm.pclass==i) & (surv_btm.sex=='male')]))
    surv_count_top.append(len(surv_top.loc[(surv_top.pclass==i) & (surv_top.sex=='male')]))
    
    surv_count_btm.append(len(surv_btm.loc[(surv_btm.pclass==i) & (surv_btm.sex=='female')]))
    surv_count_top.append(len(surv_top.loc[(surv_top.pclass==i) & (surv_top.sex=='female')]))

    surv_per_btm.append(surv_count_btm[-2]/int(totals_btm.loc[i, 'male']))
    surv_per_top.append(surv_count_top[-2]/int(totals_top.loc[i, 'male']))
    
    surv_per_btm.append(surv_count_btm[-1]/int(totals_btm.loc[i, 'female']))
    surv_per_top.append(surv_count_top[-1]/int(totals_top.loc[i, 'female']))
    
fig, axes = plt.subplots(2, 1, figsize=(10,10))

axes[0].bar(range(6), surv_per_top, width=0.9, color='#1f60d1', alpha=0.8)
plt.sca(axes[0])
plt.xticks(range(6), ['1st - Male', '1st - Female', '2nd - Male',
                      '2nd - Female', '3rd - Male', '3rd - Female'])
axes[0].set_yticklabels(['{:3.2f}%'.format(x*100) for x in axes[0].get_yticks()])
[axes[0].get_children()[i].set_color('#cc1836') for i in [1, 3, 5]]
plt.title('Older half of the passengers', fontsize=10)

axes[1].bar(range(6), surv_per_btm, width=0.9, color='#1f60d1', alpha=0.8)
plt.sca(axes[1])
plt.xticks(range(6), ['1st - Male', '1st - Female', '2nd - Male',
                      '2nd - Female', '3rd - Male', '3rd - Female'])
axes[1].set_yticklabels(['{:3.2f}%'.format(x*100) for x in axes[1].get_yticks()])
[axes[1].get_children()[i].set_color('#cc1836') for i in [1, 3, 5]]
plt.title('Younger half of the passengers', fontsize=10)

plt.suptitle('Proportion of passengers that survived by travel class and sex,\nsplit into two equally populated age categories')

plt.show()


Splitting on median: 28.0


<IPython.core.display.Javascript object>