# Data Analysis of the Kaggle Titanic Dataset

Machine Learning Problem: Classification  
Output Variable: Survived

## Import all Libraries

In [None]:
# pandas: handle the datasets in the pandas dataframe for data processing and analysis
import pandas as pd
print("pandas version: {}". format(pd.__version__))

# matplotlib: standard library to create visualizations
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}". format(matplotlib.__version__))

# seaborn: advanced visualization library to create more advanced charts
import seaborn as sns
print("seaborn version: {}". format(sns.__version__))

# turn off the chained-assignment warning for better readability in the Jupyter notebook
pd.options.mode.chained_assignment = None  # default='warn'

## Load Training and Test Dataset
Load the training and test datasets, which you can download from the Kaggle website. You may have to adjust the folder path.

In [None]:
# load training and test dataset
df_train = pd.read_csv('../01_rawdata/train.csv')
df_test = pd.read_csv('../01_rawdata/test.csv')
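
If the folder layout differs, a small helper can fail early with a readable message. The following is only a minimal sketch: the function name `load_titanic` and its `data_dir` argument are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir):
    """Load train.csv and test.csv from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    frames = []
    for name in ("train.csv", "test.csv"):
        path = data_dir / name
        if not path.is_file():
            # fail early with the resolved folder instead of a long traceback
            raise FileNotFoundError(f"Expected {name} in {data_dir.resolve()}")
        frames.append(pd.read_csv(path))
    return tuple(frames)
```

With the folder used in this notebook, the call would be `df_train, df_test = load_titanic('../01_rawdata')`.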

## First Look at the Training and Test Dataset
To get a first look at the training and test datasets, we print the first few lines and create basic statistical reports.

### Print the first lines of the dataset

In [None]:
# print the first 10 lines of the training data
df_train.head(10)

In [None]:
# print the first 10 lines of the test data
df_test.head(10)

### Create a Statistical Report of Numeric and Categorical Features

In [None]:
# create the statistical report of the numeric features of the training dataset
df_train.describe().transpose()

In [None]:
# create the statistical report of the numeric features of the test dataset
df_test.describe().transpose()

#### Result: statistical report of numeric features

- The **training dataset contains 891 samples** (number of rows in training dataset) and the **test dataset 418 samples** (number of rows in test dataset)
- The "PassengerId" is consecutively numbered -> it does not add any information about whether a passenger survived, but it can harm the ML algorithm because it is misleading information that was added afterwards
- The feature **"Age" has missing values**. (714 instead of 891 in the training dataset and 332 instead of 418 in the test dataset) -> handle later
- The mean of "Survived" is 0.38, therefore we already know that **38% of all passengers survived**.
- The 75th percentile of "Age" lies between 38 and 39 years, so 75% of the passengers with a known age are about 38 or younger. There are a few older passengers; the oldest is 80 years old.
- More than 75% of all passengers travel without parents or children (75th percentile of Parch == 0)
- The minimum fare is 0 -> check if children did not have to pay.
- For the test dataset the feature **"Fare" has missing values** (417 instead of 418) -> handle later
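
The missing-value counts and the zero-fare check noted above can also be read off directly instead of comparing counts in the report. A minimal sketch on a made-up sample (the frame `df_demo` is hypothetical and only reuses the Titanic column names):

```python
import pandas as pd

# tiny made-up sample with the same column names as the Titanic data
df_demo = pd.DataFrame({
    "Age": [22.0, None, 4.0, 35.0],
    "Fare": [7.25, 0.0, None, 0.0],
    "Parch": [0, 0, 1, 0],
})

# number of missing values per column (only columns with gaps are kept)
missing = df_demo.isna().sum()
missing = missing[missing > 0]
print(missing)

# passengers with a fare of 0, to inspect their age
zero_fare = df_demo.loc[df_demo["Fare"] == 0, ["Age", "Fare"]]
print(zero_fare)
```

Applied to `df_train` and `df_test`, the same two steps would list "Age", "Cabin", "Embarked" and "Fare" where they have gaps.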

In [None]:
# create the statistical report of the categorical features of the training dataset
df_train.describe(include=['O']).transpose()

In [None]:
# create the statistical report of the categorical features of the test dataset
df_test.describe(include=['O']).transpose()

#### Results: statistical report of categorical features
- All names in the column "Name" are unique
- There are 843 (577+266) male passengers and (891+418)-843 = 466 female passengers
- 914 (644+270) out of 1,309 passengers embarked in Southampton
- Not all ticket numbers are unique -> maybe children have the ticket number from their parents
- The feature **"Cabin" has missing values** (204 instead of 891 in the training dataset and 91 instead of 418 in the test dataset) -> handle later
- The feature **"Embarked" has missing values** (889 instead of 891 in the training dataset) -> handle later
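
Instead of adding the counts of both reports by hand, the two datasets can be stacked and counted in one step. A minimal sketch with made-up mini datasets (`train_demo` and `test_demo` are hypothetical stand-ins for `df_train` and `df_test`):

```python
import pandas as pd

# made-up mini versions of the training and test data
train_demo = pd.DataFrame({"Sex": ["male", "female", "male"],
                           "Ticket": ["A1", "A1", "B2"]})
test_demo = pd.DataFrame({"Sex": ["female", "male"],
                          "Ticket": ["C3", "B2"]})

# stack both datasets and count the categories of "Sex"
df_all = pd.concat([train_demo, test_demo], ignore_index=True)
sex_counts = df_all["Sex"].value_counts()
print(sex_counts)

# tickets that occur more than once in the training data
ticket_counts = train_demo["Ticket"].value_counts()
shared_tickets = ticket_counts[ticket_counts > 1]
print(shared_tickets)
```

The second part is one way to follow up on the non-unique ticket numbers: the rows that share a ticket can then be inspected for family relations.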

## Key Questions for the Data Analysis
It is important to get a better understanding of the features, because it might help you to create a better dataset for the machine learning algorithm through feature engineering. Therefore I prepared some key questions.

In [None]:
def pivot_survival_rate(df_train, target_column):
    # create a pivot table with the target_column as index and "Survived" as columns
    # count the number of entries of "PassengerId" for each combination of target_column and "Survived"
    # fill all empty cells with 0
    df_pivot = pd.pivot_table(
        df_train[['PassengerId', target_column, 'Survived']],
        index=[target_column],
        columns=["Survived"],
        aggfunc='count',
        fill_value=0)\
        .reset_index()

    # rename the columns to avoid numbers as column name
    df_pivot.columns = [target_column, 'not_survived', 'survived']

    # create a new column with the total number of survived and not survived passengers
    df_pivot['passengers'] = df_pivot['not_survived']+df_pivot['survived']

    # create a new column with the proportion of survivors to total passengers
    df_pivot['survival_rate'] = df_pivot['survived']/df_pivot['passengers']*100

    print(df_pivot.to_markdown())
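
As a cross-check for the pivot table, the same numbers can be sketched with `groupby`, because "Survived" only contains 0 and 1. This is an alternative under that assumption, not the function used in the rest of the notebook; the sample frame `df_demo` is made up.

```python
import pandas as pd

# made-up sample; "Pclass" and "Survived" mirror the Titanic column names
df_demo = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# "Survived" is 0/1, so sum = survivors, count = passengers, mean = rate
summary = (
    df_demo.groupby("Pclass")["Survived"]
    .agg(survived="sum", passengers="count", survival_rate="mean")
    .reset_index()
)
summary["survival_rate"] = summary["survival_rate"] * 100
print(summary)
```

This variant does not depend on renaming the pivot columns by position, so it also works if a sample contains only survivors or only non-survivors.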

### Did Older Passengers and Children Have a Higher Chance of Survival?
Create a basic univariate distribution plot of "Age" in the training data to find the threshold values at which the survival rate changes. Based on these thresholds, we create a new feature that categorizes the age feature (children, adult, senior). Based on the number of survived passengers, we can then calculate the survival rate for each age category.

#### Univariate Distribution Plot: "Age"

In [None]:
# create univariate distribution plot for "Age" separated by "Survived"
# common_norm=False: the distributions for survived and not survived passengers each sum up to 1
sns.kdeplot(data=df_train, x="Age", hue="Survived", common_norm=False)
#sns.kdeplot(data=df_train, x="Age", hue="Survived")

# limit the x-axis to the max age
plt.xlim(0, df_train['Age'].max())

plt.grid()
plt.show()

From the distribution plot we can derive the following by comparing the curve of survived passengers (orange) with the curve of passengers who did not survive (blue):

- Below 12 years, the chance of survival is higher than the chance of not surviving, especially for children around 5 years (peak in the survived curve).
- If a passenger is older than 60 years, the chance of survival drops quickly.

#### Create Age Category and Calculate Survival Rate of each Category

In [None]:
def age_category(age):
    """
    Transform the actual age into an age category.
    Thresholds are deduced from the distribution plot of age.
    Missing ages (NaN) fail every comparison and end up in 'no age'.
    """
    if age < 12:
        return 'children'
    if 12 <= age < 60:
        return 'adult'
    if age >= 60:
        return 'senior'
    return 'no age'

# apply the function age_category to each value of the "Age" column
df_train['Age_category'] = df_train['Age'].apply(age_category)
df_test['Age_category'] = df_test['Age'].apply(age_category)
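
The same binning can also be sketched with `pd.cut`, which avoids the row-wise function call. This is an alternative, not the approach used above; the sample ages are made up.

```python
import pandas as pd

# made-up ages, including a missing value
ages = pd.Series([0.42, 5, 12, 59, 60, None])

# right=False makes the intervals [0, 12), [12, 60), [60, inf)
age_cat = pd.cut(
    ages,
    bins=[0, 12, 60, float("inf")],
    right=False,
    labels=["children", "adult", "senior"],
)

# missing ages fall into no bin; give them their own category
age_cat = age_cat.cat.add_categories("no age").fillna("no age")
print(age_cat.tolist())
```

The result is a categorical column, so the order children, adult, senior, no age is preserved in later tables and plots.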

In [None]:
# show the survival table with the previously created function
pivot_survival_rate(df_train, "Age_category")

#### Results
- Children under 12 years have a higher survival rate (57%).
- Passengers aged 60 years or older have a lower survival rate (27%).

### Did Passengers of a Higher Pclass Also Have a Higher Chance of Survival?
First we create a count plot with the seaborn library and use the "Survived" feature as category to see the absolute number of passengers that survived and died in the three different passenger classes. Then we group the dataset by the passenger class and calculate the relative survival rate for each passenger class.

In [None]:
# create a count plot that counts the survived and not survived passengers for each passenger class
ax=sns.countplot(data=df_train, x='Pclass', hue='Survived')

# show numbers above the bars
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))

# show the legend outside of the plot
ax.legend(title='Survived', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()

From the bar chart we can see that most passengers that survived are from the 1st class, but to get the exact numbers, we use the pivot_survival_rate function.

In [None]:
pivot_survival_rate(df_train, "Pclass")

#### Results
- The higher the passenger class, the higher was the survival rate.
- **Passengers in the first class had the highest survival rate (63%), compared to 24% in the lowest class.**

### Did Passengers that Paid a Higher Fare Also Have a Higher Survival Rate?
To see if the fare influences the survival rate, we create a basic univariate distribution plot of "Fare" for the training data, because we need the information whether the passengers survived or not. For the distribution plot we use the kdeplot function of the seaborn library and separate the distribution by "Survived".

In [None]:
# create univariate distribution plot for "Fare" separated by "Survived"
# common_norm=False: the distributions for survived and not survived passengers each sum up to 1
sns.kdeplot(data=df_train, x="Fare", hue="Survived", common_norm=False)
plt.grid()
plt.xlim(0, 100)
plt.show()

#### Results
- Below a fare of 30 the chance of survival is low; the curve of passengers who did not survive peaks at a fare of around 10.
- If a passenger paid a fare higher than 30, the chance of surviving the Titanic was higher than the chance of not surviving.
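
The threshold read from the plot can be turned into a number by splitting the passengers at a fare of 30 and comparing the survival rates. A minimal sketch on made-up data (`df_demo` is hypothetical; the resulting rates say nothing about the real dataset):

```python
import pandas as pd

# made-up sample with the Titanic column names
df_demo = pd.DataFrame({
    "Fare": [7.25, 8.05, 10.5, 26.0, 31.0, 52.0, 80.0, 151.55],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# label each passenger by whether the fare exceeds the threshold of 30
fare_group = (df_demo["Fare"] > 30).map({False: "fare <= 30", True: "fare > 30"})

# mean of the 0/1 column "Survived" per group = survival rate
rate_by_fare = df_demo.groupby(fare_group)["Survived"].mean() * 100
print(rate_by_fare)
```

Note that a missing fare compares as `False` and would land in the "fare <= 30" group, so missing values should be handled before using this on the test dataset.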

### Did Women Have a Higher Chance of Survival?
To find out if the sex of a passenger had an influence on the survival rate, we pivot the training data with "Sex" as index and "Survived" as columns. 

In [None]:
pivot_survival_rate(df_train, "Sex")

#### Result
- **The survival rate of female passengers is much higher with 74% compared to the survival rate of male passengers with 19%.**

### Did the Port of Embarkation Influence the Survival Rate?
To find out if the port of embarkation influenced the survival rate, we pivot the training data with "Embarked" as index and "Survived" as columns. 

In [None]:
pivot_survival_rate(df_train, "Embarked")

#### Results
- There is a difference in the survival rate between the three different ports.
- Passengers that embarked in Southampton (S) had the lowest survival rate with 34%.
- Passengers that embarked in Cherbourg (C) had the highest survival rate with 55%.

## Try to Separate Survived and Not Survived Passengers
In addition to the key questions, we create different visualizations to see if one feature or a combination of features is able to separate the survived and not survived passengers. This task gives an indication of which features could be important for the machine learning algorithm.

### Survival Rate for Sex and Pclass
During the data analysis process we saw that the Sex as well as the Pclass had a significant influence on the survival rate. Therefore we would like to see the combined influence of Sex and Pclass on the survival rate.

- Use catplot for categorical and numerical features
- Use bar_label (matplotlib >= v3.4.2) to show the numbers for each bar

In [None]:
sns.set(font_scale=1.3)
g = sns.catplot(x="Sex", y="Survived", col="Pclass", data=df_train, kind="bar")

# loop over the three different axes created by the col feature
for i in range(3):
    # extract the matplotlib axes_subplot objects from the FacetGrid
    ax = g.facet_axis(0, i)

    # iterate through the axes containers
    for c in ax.containers:
        labels = [f'{(v.get_height()):.2f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='center')

plt.show()

#### Results Catplot of Survival Rate for Sex and Pclass
- Almost all female passengers of the first class (97%) as well as the second class (92%) survived.
- Female passengers of the 3rd class had a higher chance of survival than male passengers of the first class **-> the feature Sex has a higher influence on the survival rate than the Pclass.**
- Male passengers from the first class had more than twice as high a chance of survival as male passengers from the second and third class.
- The survival rate of male passengers differs little between the second and third class.
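
The bar heights of the catplot are group means of "Survived", so the same numbers can be sketched as a table. The sample frame `df_demo` is made up and only reuses the Titanic column names.

```python
import pandas as pd

# made-up sample with the Titanic column names
df_demo = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# mean of "Survived" per (Sex, Pclass), with Pclass spread over the columns
rate_table = (
    df_demo.groupby(["Sex", "Pclass"])["Survived"]
    .mean()
    .unstack("Pclass")
)
print(rate_table)
```

One row per sex and one column per class makes it easy to compare, for example, female 3rd class passengers with male 1st class passengers.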

### Survival Rate for Age and Pclass
- Almost all young passengers from the first and second passenger class survived, but there are a lot of young passengers from the third class that died.
- The second observation from the swarmplot is that older passengers have a higher chance of survival if they are in a higher passenger class (imagine a horizontal line, starting around the age of 50).

In [None]:
g = sns.catplot(x="Survived", y="Age", col="Pclass", data=df_train, kind="swarm")
plt.show()

### Survival Rate for selected Categorical and Numerical Features
- Use
    - catplot for categorical and
    - kdeplot for numerical features
- Use bar_label (matplotlib >= v3.4.2) to show the numbers for each bar

In [None]:
for feature in ["Sex", "Embarked", "Pclass", "SibSp", "Parch"]:
    g = sns.catplot(x=feature, y="Survived", data=df_train, kind="bar")
    
    # extract the matplotlib axes_subplot objects from the FacetGrid
    ax = g.facet_axis(0, -1)

    # iterate through the axes containers
    for c in ax.containers:
        labels = [f'{(v.get_height()):.2f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='center')
    
    plt.show()

#### Results Catplot of selected Categorical Features
- **Sex:** The survival rate of female passengers is much higher with 74% compared to the survival rate of male passengers with 19%.
- **Embarked:** There is a difference in the survival rate between the three different ports. Passengers that embarked in Southampton (S) had the lowest survival rate with 34%. Passengers that embarked in Cherbourg (C) had the highest survival rate with 55%.
- **Pclass:** The higher the passenger class, the higher was the survival rate. Passengers in the first class had the highest survival rate (63%), compared to 24% in the lowest class.
- **SibSp:** Passengers with 1 sibling or spouse had the highest survival rate (54%). Passengers with 2 siblings or spouses had the second highest survival rate (45%), but the confidence interval gets very wide, so the result is less reliable.
- **Parch:** Passengers with 3 parents or children had the highest survival rate (60%) but with a wide confidence interval. Passengers with 1 parent or child had a slightly lower mean survival rate (55%), but this result is more reliable because the confidence interval is narrower.
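
The wide confidence intervals come from small groups. A rough way to see this is to put the group size and the standard error of the proportion next to each survival rate. This sketch uses the normal-approximation formula, not the bootstrap interval that seaborn draws by default, and the sample frame `df_demo` is made up.

```python
import pandas as pd

# made-up sample: a large group (Parch == 0) and a small group (Parch == 3)
df_demo = pd.DataFrame({
    "Parch": [0] * 8 + [3] * 2,
    "Survived": [1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
})

# survival rate and number of passengers per group
stats = df_demo.groupby("Parch")["Survived"].agg(rate="mean", n="count")

# standard error of a proportion: sqrt(p * (1 - p) / n)
stats["std_err"] = (stats["rate"] * (1 - stats["rate"]) / stats["n"]) ** 0.5
print(stats)
```

A group with only a handful of passengers gets a much larger standard error, which is why a high rate in such a group should be read with care.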

In [None]:
for feature in ["Age", "Fare"]:
    g = sns.kdeplot(data=df_train, x=feature, hue="Survived", common_norm=False)
    plt.show()

#### Results Kdeplot of selected Numerical Features
- **Age:** Below 12 years, the chance of survival is higher than the chance of not surviving, especially for children around 5 years. If a passenger is older than 60 years, the chance of survival drops quickly.
- **Fare:** The kernel density estimate (KDE) plot shows values that do not exist in the dataset, like a negative "Fare", because the estimate smooths beyond the observed data. Below a fare of 30 the chance of survival is low; the curve of passengers who did not survive peaks at a fare of around 10. If a passenger paid a fare higher than 30, the chance of surviving the Titanic was higher than the chance of not surviving. **-> see the kdeplot with the limited x-axis in the key question section.**
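
Why a KDE can show a negative "Fare" is easiest to see with a hand-rolled Gaussian kernel estimate: every observation contributes a bell curve that extends to both sides. The function below is written only for illustration (it is not how seaborn computes the plot), and the fares and the bandwidth are made up.

```python
import math


def gaussian_kde_at(x, samples, bandwidth):
    """Gaussian kernel density estimate at x (hand-rolled for illustration)."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


# made-up fares, all non-negative
fares = [0.0, 7.25, 8.05, 10.5, 26.0]

# the estimate is still positive at a fare of -5, although no such value exists
density_at_negative_fare = gaussian_kde_at(-5.0, fares, bandwidth=5.0)
print(density_at_negative_fare > 0)
```

In seaborn, the `cut` and `clip` parameters of `kdeplot` control how far the curve is evaluated beyond the data; `cut=0` truncates it at the data limits.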

## Save the Analyzed Dataset

In [None]:
df_train.to_pickle('df_train.pkl')
df_test.to_pickle('df_test.pkl')
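
To check that a pickled dataset can be restored unchanged, it can be read back with `pd.read_pickle`. A minimal sketch with a made-up frame and a temporary folder, so that the files written above stay untouched:

```python
import os
import tempfile

import pandas as pd

# made-up frame with a missing value and a derived category column
df_demo = pd.DataFrame({"Age": [22.0, None], "Age_category": ["adult", "no age"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_demo.pkl")
    df_demo.to_pickle(path)
    df_loaded = pd.read_pickle(path)

# dtypes and missing values survive the round trip
print(df_loaded.equals(df_demo))
```

Unlike a CSV export, the pickle keeps the column dtypes, so new features such as "Age_category" do not have to be converted again after loading.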