# Tutorial 8: Statistics and Data Analysis

**The goal of this assignment is to perform a data analysis on the survivors of the [Titanic disaster](https://en.wikipedia.org/wiki/RMS_Titanic).**

First, you will answer 10 questions using descriptive statistics that I have already computed.

Then, you will write code to answer additional questions about the passengers.

In both cases, you will use Pandas, a popular Python library for data analysis.

This library is very convenient, and can replace Excel when combined with Jupyter.

I recommend reading ["10 minutes to pandas"](https://pandas.pydata.org/pandas-docs/stable/10min.html) to get an overview of Pandas.

__Grade scale__: 20 points
- __correct code/answer__: 1 point
- __incorrect code/answer__: 0 points

__Further documentation__:
* https://www.kaggle.com/c/titanic
* https://learnxinyminutes.com/docs/python/
* https://pandas.pydata.org/pandas-docs/stable/
* https://pandas.pydata.org/pandas-docs/stable/10min.html

# Core

## Dataset Variables

- __survival__        Survival (0 = No; 1 = Yes)
- __pclass__          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__            Name
- __sex__             Sex
- __age__             Age
- __sibsp__           Number of Siblings/Spouses Aboard
- __parch__           Number of Parents/Children Aboard
- __ticket__          Ticket Number
- __fare__            Passenger Fare
- __cabin__           Cabin
- __embarked__        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [1]:
# we import pandas
# with 'pd' as alias
import pandas as pd

In [2]:
# import the dataset from a csv.gz
# 'df' is a DataFrame, roughly the equivalent of an Excel sheet
df = pd.read_csv('titanic.csv.gz')

df.head()

,pclass,survival,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S
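To try the queries without the `titanic.csv.gz` file, the five rows printed by `df.head()` can be rebuilt by hand. This is only a sketch: the name `sample` is an assumption, and `ticket` is stored as text here because the full column mixes plain numbers with codes such as `PC 17609`.

```python
import pandas as pd

# Hypothetical stand-in for `df`, built from the five rows shown by df.head().
sample = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 1],
    "survival": [1, 1, 0, 0, 0],
    "name": [
        "Allen, Miss. Elisabeth Walton",
        "Allison, Master. Hudson Trevor",
        "Allison, Miss. Helen Loraine",
        "Allison, Mr. Hudson Joshua Creighton",
        "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    ],
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "sibsp": [0, 1, 1, 1, 1],
    "parch": [0, 2, 2, 2, 2],
    "ticket": ["24160", "113781", "113781", "113781", "113781"],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
    "cabin": ["B5", "C22 C26", "C22 C26", "C22 C26", "C22 C26"],
    "embarked": ["S", "S", "S", "S", "S"],
})

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # one dtype per column
```

Any function written for `df` can then be called on `sample` as a quick sanity check before running it on the full dataset.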


# Questions

In [3]:
# display dataset summary
# great to get an overview
df.describe(include="all")

,pclass,survival,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
count,1309.0,1309.0,1309,1309,1046.0,1309.0,1309.0,1309,1308.0,295,1307
unique,,,1307,2,,,,929,,186,3
top,,,"Kelly, Mr. James",male,,,,CA. 2343,,C23 C25 C27,S
freq,,,2,843,,,,11,,6,914
mean,2.294882,0.381971,,,29.881135,0.498854,0.385027,,33.295479,,
std,0.837836,0.486055,,,14.4135,1.041658,0.86556,,51.758668,,
min,1.0,0.0,,,0.1667,0.0,0.0,,0.0,,
25%,2.0,0.0,,,21.0,0.0,0.0,,7.8958,,
50%,3.0,0.0,,,28.0,0.0,0.0,,14.4542,,
75%,3.0,1.0,,,39.0,1.0,0.0,,31.275,,
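The `count` row only counts non-missing values, which is why `age` (1046) and `cabin` (295) fall short of the 1309 rows. A minimal sketch on invented toy data (the frame `toy` and its values are assumptions, not taken from the dataset) shows how `count()` and `isna().sum()` relate:

```python
import numpy as np
import pandas as pd

# Toy data with deliberate gaps; not part of the Titanic file.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", None, None, None],
    "fare": [211.3, 151.55, 151.55, 7.9],
})

non_missing = toy.count()     # what describe() reports as 'count'
missing = toy.isna().sum()    # number of missing values per column

print(non_missing)
print(missing)
```

For every column, the two numbers add up to `len(toy)`.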


__1. How many passengers were on the Titanic ?__

In [4]:
def Q1():
    # YOUR CODE HERE
    # raise NotImplementedError()
    return len(df)
    
Q1()

1309

In [5]:
assert isinstance(Q1(), int)

__2. How many passengers were male ?__

In [6]:
def Q2():
    # YOUR CODE HERE
    # raise NotImplementedError()
    return len(df[df['sex']=='male'])
    
Q2()

843

In [7]:
assert isinstance(Q2(), int)

__3. How many different cabins were on the Titanic ?__

In [8]:
def Q3():
    # YOUR CODE HERE
    # raise NotImplementedError()
    # print(df['cabin'].dropna().unique())
    return len(df['cabin'].dropna().unique())
    
Q3()

186

In [9]:
assert isinstance(Q3(), int)
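Counting distinct values while ignoring missing ones can also be done with `Series.nunique()`, which skips missing values by default. The sketch below uses an invented `cabins` series to compare the two approaches:

```python
import pandas as pd

# Toy series with one duplicate and one missing value (assumed data).
cabins = pd.Series(["B5", "C22 C26", "C22 C26", None])

with_dropna = len(cabins.dropna().unique())  # drop missing values, then count distinct ones
with_nunique = cabins.nunique()              # same count in one call
including_missing = len(cabins.unique())     # unique() keeps the missing value as an entry

print(with_dropna, with_nunique, including_missing)
```

Note that `nunique()` may return a NumPy integer, so wrap it in `int()` if a plain Python `int` is required.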

__4. How old was the oldest person on board ?__

In [10]:
def Q4():
    # YOUR CODE HERE
    # raise NotImplementedError()
    # Series.max() skips missing ages; the built-in max() is unreliable with NaN
    return int(df["age"].max())
    
Q4()

80

In [11]:
assert isinstance(Q4(), int)
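The `age` column contains missing values, and Python's built-in `max` does not handle them the way `Series.max` does. A small sketch with an invented `ages` series illustrates the difference:

```python
import math

import numpy as np
import pandas as pd

# Toy series whose first value is missing (assumed data).
ages = pd.Series([np.nan, 29.0, 80.0, 2.0])

# Every comparison with NaN is False, so the built-in max() never replaces
# a NaN that comes first and returns it as the "maximum".
builtin_result = max(ages)

# Series.max() skips missing values by default (skipna=True).
pandas_result = ages.max()

print(builtin_result, pandas_result)
```

With the built-in function the result depends on where the missing values sit in the column, so the pandas method is the safer choice.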

__5. What was the median ticket fare (rounded up) ?__

In [12]:
import math
def Q5():
    # YOUR CODE HERE
    # raise NotImplementedError()
    return math.ceil(df["fare"].median())
    
Q5()

15

In [13]:
assert isinstance(Q5(), int)

__6. What was the passenger class with the most people ?__

In [14]:
def Q6():
    # YOUR CODE HERE
    # raise NotImplementedError()
    # return pd.Series.value_counts(df['pclass'])
    # mode() returns a Series (there can be ties); take its first entry
    return int(df['pclass'].mode()[0])
    
Q6()

3

In [15]:
assert isinstance(Q6(), int)

__7. From which location did the most people embark (one letter) ?__

In [16]:
def Q7():
    # YOUR CODE HERE
    # raise NotImplementedError()
    # mode() returns a Series; its first entry is the most frequent port
    return df["embarked"].mode()[0]
    
Q7()

'S'

In [17]:
assert isinstance(Q7(), str)
assert len(Q7()) == 1
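`Series.mode()` returns a Series rather than a single value, because several values can tie for the highest frequency. The sketch below, on an invented `ports` series, shows two ways to get a scalar out of it:

```python
import pandas as pd

# Toy embarkation ports with one missing value (assumed data).
ports = pd.Series(["S", "C", "S", "Q", "S", None])

modes = ports.mode()                        # Series of the most frequent value(s)
most_common = modes.iloc[0]                 # first (here: only) mode as a scalar
via_counts = ports.value_counts().idxmax()  # label with the highest count

print(most_common, via_counts)
```

Both approaches ignore missing values by default.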

__8. How many people survived the Titanic disaster (rounded down) ?__

In [18]:
def Q8():
    # YOUR CODE HERE
    # raise NotImplementedError()
    return len(df[df['survival']==1])
    
Q8()

500

In [19]:
assert isinstance(Q8(), int)
assert Q8() % 10 == 0

__9. What was the maximum number of parents/children somebody had ?__

In [20]:
def Q9():
    # YOUR CODE HERE
    # raise NotImplementedError()
    # convert the pandas/NumPy integer to a plain Python int
    return int(df["parch"].max())
    
Q9()

9

In [21]:
assert isinstance(Q9(), int)

__10. What is the 3rd quartile for the number of siblings/spouses ?__

In [22]:
def Q10():
    # YOUR CODE HERE
    # raise NotImplementedError()
    return int(df.sibsp.quantile(0.75))
Q10()

1

In [23]:
assert isinstance(Q10(), int)

# Queries

__1. Select all values of the name column__
- *hint*: have a look at https://pandas.pydata.org/pandas-docs/stable/indexing.html#basics

In [24]:
def Q1(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df['name']    
Q1(df).head()

0                      Allen, Miss. Elisabeth Walton
1                     Allison, Master. Hudson Trevor
2                       Allison, Miss. Helen Loraine
3               Allison, Mr. Hudson Joshua Creighton
4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
Name: name, dtype: object

In [25]:
assert isinstance(Q1(df), pd.Series)
assert Q1(df).shape == (1309,)

__2. Select all passengers aged 50 or older__
- *hint*: have a look at https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

In [27]:
def Q2(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df[df["age"]>=50]
    
Q2(df).head()

,pclass,survival,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
14,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
17,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C


In [28]:
assert isinstance(Q2(df), pd.DataFrame)
assert Q2(df).shape[1] == 11  # 11 cols

__3. Select all passengers in 1st and 2nd class__
- *hint*: you can also take the complement statement

In [34]:
def Q3(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df[(df["pclass"]==1)|(df["pclass"]==2)]
    
Q3(df).head()

,pclass,survival,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


In [35]:
assert isinstance(Q3(df), pd.DataFrame)
assert Q3(df).shape[1] == 11 # 11 cols
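The hint about the complement can be made concrete: since there are only three classes, "1st or 2nd" is the same as "not 3rd". The sketch below uses an invented frame `toy` and also shows `isin`, a third way to write the same filter:

```python
import pandas as pd

# Toy frame with the three passenger classes (assumed data).
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 3, 1],
    "fare": [80.0, 20.0, 7.9, 8.1, 60.0],
})

by_or = toy[(toy["pclass"] == 1) | (toy["pclass"] == 2)]  # explicit "or"
by_complement = toy[toy["pclass"] != 3]                   # complement of 3rd class
by_isin = toy[toy["pclass"].isin([1, 2])]                 # membership test

print(by_or)
```

The complement form only works because every passenger has a class of 1, 2 or 3; with missing or unexpected values the three filters could differ.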

__4. Select the passengers who embarked at Queenstown ("Q") and paid a fare of 50 or more__
- *hint*: have a look (again) at https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

In [36]:
def Q4(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df[(df["embarked"]=='Q')&(df["fare"]>=50)]
Q4(df).head()

,pclass,survival,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
206,1,0,"Minahan, Dr. William Edward",male,44.0,2,0,19928,90.0,C78,Q
207,1,1,"Minahan, Miss. Daisy E",female,33.0,1,0,19928,90.0,C78,Q
208,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0,C78,Q


In [37]:
assert isinstance(Q4(df), pd.DataFrame)
assert Q4(df).shape[1] == 11 # 11 cols

__5. Select the names of passengers aged 18 or younger who did not survive__
- *hint*: you can combine the syntax from the previous queries

In [40]:
def Q5(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df[(df["survival"]==0)&(df["age"]<=18)]["name"]
Q5(df).head()

2                    Allison, Miss. Helen Loraine
53                         Carrau, Mr. Jose Pedro
228    Penasco y Castellana, Mr. Victor de Satode
326                    Andrew, Mr. Edgardo Samuel
331                      Bailey, Mr. Percy Andrew
Name: name, dtype: object

In [41]:
assert isinstance(Q5(df), pd.Series)
assert Q5(df).shape == (98,)
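Two details matter when combining conditions: each comparison needs its own parentheses because `&` binds more tightly than `==` and `<=`, and a comparison against a missing age evaluates to `False`. The sketch below uses an invented frame `toy` with placeholder names and selects the column with `.loc` in a single step:

```python
import numpy as np
import pandas as pd

# Toy passengers with placeholder names (assumed data).
toy = pd.DataFrame({
    "name": ["passenger A", "passenger B", "passenger C", "passenger D"],
    "age": [2.0, 40.0, np.nan, 17.0],
    "survival": [0, 0, 0, 1],
})

# Parentheses are required around each comparison.
mask = (toy["survival"] == 0) & (toy["age"] <= 18)

# .loc selects the rows and the column in one operation.
names = toy.loc[mask, "name"]

print(names)
```

Passenger C did not survive but has no recorded age, so the row is excluded rather than counted as a minor.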

__6. Compute the value counts (frequencies) for the passenger class variable__
- *hint*: there is a method on pandas.Series to handle this case

In [42]:
def Q6(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df["pclass"].value_counts()
    
Q6(df)

3    709
1    323
2    277
Name: pclass, dtype: int64

In [43]:
assert isinstance(Q6(df), pd.Series)
assert set(Q6(df).to_dict().keys()) == {1, 2, 3}
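Frequencies are often easier to compare as proportions. As a sketch on an invented `classes` series, `value_counts` accepts `normalize=True` to return shares instead of raw counts:

```python
import pandas as pd

# Toy passenger classes (assumed data).
classes = pd.Series([3, 3, 1, 2, 3])

counts = classes.value_counts()                # absolute frequencies
shares = classes.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```

The shares always sum to 1, so they can be read directly as percentages of the passengers.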

__7. Compute the 95% quantile of the age variable__
- *hint*: convert the result to an int with `int()`

In [46]:
def Q7(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return int(df.age.quantile(0.95))
    
Q7(df)

57

In [47]:
assert isinstance(Q7(df), int)
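By default `quantile` interpolates linearly between the two nearest observations, so the result is usually not a value present in the data, and `int()` then truncates it. The sketch below, on an invented `values` series, makes both steps visible:

```python
import pandas as pd

# Five evenly spaced toy values (assumed data).
values = pd.Series([0, 1, 2, 3, 4])

linear = values.quantile(0.95)                          # interpolated between 3 and 4
truncated = int(linear)                                 # int() drops the fractional part
higher = values.quantile(0.95, interpolation="higher")  # nearest observation above

print(linear, truncated, higher)
```

Here the 95% position falls at index 3.8, giving roughly 3.8 with linear interpolation, 3 after truncation, and 4 with `interpolation="higher"`.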

__8. Compute the mean survival rate by passenger class__
- *hint*: have a look at http://pandas.pydata.org/pandas-docs/stable/groupby.html

In [49]:
def Q8(df):
    # YOUR CODE HERE
    #raise NotImplementedError()
    return df.groupby("pclass")["survival"].mean()
    
Q8(df)

pclass
1    0.619195
2    0.429603
3    0.255289
Name: survival, dtype: float64

In [50]:
assert isinstance(Q8(df), pd.Series)
assert set(Q8(df).to_dict().keys()) == {1, 2, 3}
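Because `survival` is coded as 0/1, its mean within a group is the share of survivors in that group. As a sketch, the rates above can be reproduced from the per-class counts reported in the cross-tabulation of query 10 (the helper names `counts`, `rows` and `frame` are assumptions):

```python
import pandas as pd

# (died, survived) per class, copied from the cross-tabulation in query 10.
counts = {1: (123, 200), 2: (158, 119), 3: (528, 181)}

# Rebuild one row per passenger with only the two columns needed here.
rows = []
for pclass, (died, survived) in counts.items():
    rows += [(pclass, 0)] * died + [(pclass, 1)] * survived
frame = pd.DataFrame(rows, columns=["pclass", "survival"])

rates = frame.groupby("pclass")["survival"].mean()

print(rates)
```

For first class this gives 200 / 323, which matches the 0.619195 shown above.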

__9. Compute the correlation for every pair of numerical variables using the Spearman method__
- *hint*: have a look at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

In [51]:
def Q9(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
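    # keep only numeric columns first; recent pandas versions raise an error on text columns
    return df.select_dtypes(include="number").corr(method="spearman")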
Q9(df)

,pclass,survival,age,sibsp,parch,fare
pclass,1.0,-0.309734,-0.395892,-0.066679,-0.028752,-0.709019
survival,-0.309734,1.0,-0.041672,0.08362,0.162086,0.294016
age,-0.395892,-0.041672,1.0,-0.129929,-0.216097,0.192676
sibsp,-0.066679,0.08362,-0.129929,1.0,0.438373,0.445566
parch,-0.028752,0.162086,-0.216097,0.438373,1.0,0.400301
fare,-0.709019,0.294016,0.192676,0.445566,0.400301,1.0


In [52]:
assert isinstance(Q9(df), pd.DataFrame)
assert Q9(df).shape == (6, 6)
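Spearman correlation is the Pearson correlation computed on ranks instead of raw values, which makes it less sensitive to extreme fares. A minimal sketch on an invented frame `toy` checks this equivalence:

```python
import numpy as np
import pandas as pd

# Toy fares and classes (assumed data).
toy = pd.DataFrame({
    "fare": [7.25, 71.3, 8.05, 53.1, 30.0],
    "pclass": [3, 1, 3, 1, 2],
})

spearman = toy.corr(method="spearman")

# rank() assigns average ranks to ties, the convention the Spearman method relies on.
pearson_on_ranks = toy.rank().corr(method="pearson")

matches = np.allclose(spearman.values, pearson_on_ranks.values)
print(spearman)
print(matches)
```

In this toy data higher fares go with lower class numbers, so the coefficient is negative, in line with the sign of the `pclass`/`fare` entry in the table above.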

__10. Compute the number of passengers per passenger class (rows) and survival (columns)__
- *hint*: have a look at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html

In [53]:
def Q10(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return pd.crosstab(df.pclass, df.survival)
    
Q10(df)

pclass,survival = 0,survival = 1
1,123,200
2,158,119
3,528,181


In [54]:
assert isinstance(Q10(df), pd.DataFrame)
assert Q10(df).shape == (3, 2)
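A cross-tabulation is a count per pair of categories, so it can also be obtained with `groupby`, and `normalize="index"` turns each row into shares. The sketch below uses an invented frame `toy`; the variable names are assumptions:

```python
import pandas as pd

# Toy classes and survival flags (assumed data).
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "survival": [1, 1, 0, 1, 0, 0, 0, 1],
})

counts = pd.crosstab(toy["pclass"], toy["survival"])

# Same table built by counting each (pclass, survival) pair.
same_counts = toy.groupby(["pclass", "survival"]).size().unstack(fill_value=0)

# Each row divided by its total: column 1 is the survival rate per class.
row_shares = pd.crosstab(toy["pclass"], toy["survival"], normalize="index")

print(counts)
print(row_shares)
```

The column `1` of `row_shares` equals the result of `toy.groupby("pclass")["survival"].mean()`, which links this query to query 8.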