In [120]:
# Import Packages
import pandas as pd
from pathlib import Path
import altair as alt
import importlib

import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath('../src'))
import altair_plots as ap
import eda_univariate as eu
importlib.reload(ap)
importlib.reload(eu)

<module 'eda_univariate' from '/Users/congminhnguyen/MPhil Econs and Data Science/titantic/titanic_problem_set/src/eda_univariate.py'>

In [121]:
# PATHS
DATA_PATH = Path("../data")
TRAIN_PATH = DATA_PATH / "train.csv"
TEST_PATH = DATA_PATH / "test.csv"

# Color palette
color_list = ["#A5D7E8", "#576CBC", "#19376D", "#0b2447"]

# Load the data
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)
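# combine_train_test is not imported by name anywhere above; this assumes it lives in eda_univariate
all_df = eu.combine_train_test(train_df, test_df)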

# Report

## Methodology 

The purpose of this report is to enhance the findings of the key visualisations in the original Titanic analysis.
Specifically, the original analysis identified that Sex, Pclass, and Family Size were associated with varying survival rates.
To build upon this analysis, we will explore whether there is a case to add interaction terms to a predictive model to account for joint effects between these features.

## Pclass and Sex

This visualization plots passenger counts on the y-axis against passenger class on the x-axis, with bars grouped by survival outcome. The chart is faceted by sex, so female and male passengers appear in separate panels, which makes it easy to see whether gender changes the likelihood of survival within a given passenger class.

In [122]:
ap.sex_class_survival_countplot(all_df, x="Pclass", group="Survived", facet="Sex", 
                     title="Passenger Counts by Class, Gender and Survival")


Initially, examining the distribution of passengers by class and gender reveals that males outnumber females across all three classes. However, the male-to-female ratio varies significantly between classes. First class exhibits the most balanced ratio, with only slightly more males than females. In contrast, second class shows a greater imbalance, and third class displays a pronounced disparity with a significantly higher number of males.
 
This observation matters because conventional wisdom suggests that females ('women and children first') and passengers in higher classes had a better chance of survival. Because third class contains disproportionately many males, class and gender are confounded: analyzing survival by class alone or by gender alone could overstate the association between each variable and survival.
 
By visualizing both class and gender in the same plot, we can explore the combined effects of these variables. For instance, it becomes evident that, on average, female passengers have a higher likelihood of survival compared to male passengers. However, this relationship is not consistent across all classes. Nearly all female passengers in first and second class survived, whereas in third class, the survival rate is closer to 50%. 
For male passengers, the relationship between class and survival is less straightforward, although it is apparent that first class males have a higher survival rate than those in second and third class.
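The comparison described above can also be read off a table of survival rates per (Sex, Pclass) cell. The following is a minimal sketch using a small made-up DataFrame; the toy rows are purely illustrative and are not the Titanic data, and only the column names `Sex`, `Pclass`, and `Survived` are taken from the notebook.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 outcome within each cell is that cell's survival rate.
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```

If the gap between female and male rates differs from one class column to the next, that is the pattern an interaction term is meant to capture.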
 
To address these combined effects, incorporating an interaction term for class and gender would be beneficial. This approach improves upon using class and gender as separate features, as it accounts for their joint impact on survival rates, particularly highlighted in the female passenger data.
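One possible way to build such an interaction feature is to label each (Sex, Pclass) combination and one-hot encode the result. This is only a sketch under assumptions: the column name `Sex_Pclass` and the `SexClass` prefix are hypothetical, and the toy rows are illustrative rather than real data.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 1, 3, 3],
})

# One categorical label per (Sex, Pclass) cell, e.g. "female_3".
# "Sex_Pclass" is a hypothetical column name chosen for this sketch.
toy["Sex_Pclass"] = toy["Sex"] + "_" + toy["Pclass"].astype(str)

# One-hot encode the combined label so a model can fit a separate
# effect for each cell rather than only additive Sex and Pclass effects.
interaction_dummies = pd.get_dummies(toy["Sex_Pclass"], prefix="SexClass")
features = pd.concat([toy, interaction_dummies], axis=1)
print(features.columns.tolist())
```

In a linear model, one dummy is typically dropped as the reference level to avoid perfect collinearity with the intercept.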

## Family Size and Sex

In [123]:
all_df["Family Size"] = eu.calculate_family_size(all_df)

Family size is an engineered feature: the sum of each passenger's parents/children count and spouse/siblings count (the `Parch` and `SibSp` columns in the Kaggle Titanic data).
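The body of `eu.calculate_family_size` is not shown in this notebook, so the following is only a guess at what such a helper might look like. The function name `family_size_sketch` and the `include_self` flag are hypothetical; whether the real helper counts the passenger themselves is not stated here.

```python
import pandas as pd

def family_size_sketch(df: pd.DataFrame, include_self: bool = False) -> pd.Series:
    """Hypothetical stand-in for eu.calculate_family_size.

    Sums siblings/spouses (SibSp) and parents/children (Parch).
    With include_self=True the passenger is counted too, so a
    solo traveller gets 1 instead of 0.
    """
    size = df["SibSp"] + df["Parch"]
    return size + 1 if include_self else size

# Toy rows for illustration only; these are NOT the Titanic data.
toy = pd.DataFrame({"SibSp": [0, 1, 3], "Parch": [0, 2, 1]})
print(family_size_sketch(toy).tolist())
print(family_size_sketch(toy, include_self=True).tolist())
```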
 
From the original analysis, family size clearly had an impact on survival rate. Based on our above plot, it may be that class or sex has interaction effects with family size.
 
A possible interaction with Pclass was explored, but no meaningful pattern emerged. Let's investigate the interaction with the Sex feature instead.
First, below is the survival rate split only by sex. About 75% of females survived, compared with about 20% of males.

In [124]:
ap.proportion_chart(all_df, x="Sex", group="Survived", title="Survival Rate by Sex")

Below is the survival rate by family size alone. Single travellers have about a 30% chance of survival, whereas small to medium-sized families have a better-than-even chance. The survival rate then decreases for larger families.

In [125]:
ap.proportion_chart(all_df, x="Family Size", group="Survived", title="Survival Rate by Family Size")

In [126]:
ap.barchart_proportions(all_df, x="Family Size", group="Survived", facet="Sex", title="Survival Rate by Family Size and Sex")

Recall that the average survival rate for women was about 75%. For family sizes up to 4, the survival chance for females is slightly higher than this average.

Additionally, recall that the average survival rate for men was 20%. We see that single men have a lower than average chance of survival, but men with small to medium families have a higher than average survival rate.
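The comparison against each sex's average can be made explicit by subtracting the per-sex baseline from the rate in every (Sex, Family Size) cell. The sketch below uses made-up rows purely for illustration; the names `baseline`, `rate`, and `diff_from_baseline` are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male"] * 4 + ["female"] * 4,
    "Family Size": [1, 1, 2, 2, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 0, 1, 0, 1, 1],
})

# Average survival rate per sex: the baseline each cell is compared to.
baseline = toy.groupby("Sex")["Survived"].mean().rename("baseline")

# Survival rate within each (Sex, Family Size) cell.
cell_rate = (
    toy.groupby(["Sex", "Family Size"])["Survived"]
    .mean()
    .rename("rate")
    .reset_index()
)

# Attach the per-sex baseline and compute the gap for each cell.
comparison = cell_rate.merge(baseline, left_on="Sex", right_index=True)
comparison["diff_from_baseline"] = comparison["rate"] - comparison["baseline"]
print(comparison)
```

If the sign or size of `diff_from_baseline` for a given family size differs between the sexes, family size does not shift survival by the same amount for men and women.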

This motivates the inclusion of an interaction term between Sex and Family Size. Family size affects survival chances for both males and females, but its effect is not the same across the sexes. In particular, binning family size into categories such as 'single', 'small-medium family', and 'large family' will highlight the differences between families and single travellers.
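The binning and interaction described above could be sketched with `pd.cut`. The cut points here are assumptions made for illustration, not values taken from the analysis: the sketch assumes Family Size counts the passenger (so 1 means travelling alone) and treats 2 to 4 as small-medium and 5 or more as large. The column names `Family Group` and `Sex_FamilyGroup` are likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Family Size": [1, 2, 4, 6],
})

# Assumed bins (right-inclusive): (0, 1] single, (1, 4] small-medium, (4, inf) large.
toy["Family Group"] = pd.cut(
    toy["Family Size"],
    bins=[0, 1, 4, np.inf],
    labels=["single", "small-medium", "large"],
)

# Interaction label combining sex with the binned family size.
toy["Sex_FamilyGroup"] = toy["Sex"] + "_" + toy["Family Group"].astype(str)
print(toy)
```

If the engineered feature instead excludes the passenger (0 means travelling alone), the bin edges would need to shift down by one.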

## Conclusion

I have presented evidence for including interaction terms between Sex and Pclass, and between Sex and Family Size. Furthermore, I recommend that Family Size be categorised into singles, small-medium families, and large families. An interaction between Pclass and Family Size was also investigated (not shown) but was deemed insignificant.