In [63]:
# Visualisation report
# Import Packages

import pandas as pd
from pathlib import Path
import altair as alt

import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath('../src'))
from altair_plots import *
from eda_univariate import *  # for combining test and train

In [64]:
# PATHS
DATA_PATH = Path("../data")
TRAIN_PATH = DATA_PATH / "train.csv"
TEST_PATH = DATA_PATH / "test.csv"

# Color palette
color_list = ["#A5D7E8", "#576CBC", "#19376D", "#0b2447"]

# Load the data
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)
all_df = combine_train_test(train_df, test_df)

# Report

## Methodology 

The purpose of this report is to augment the findings of the most important visualisations in the original Titanic analysis.
In particular, the original analysis found that Sex, Pclass, and Family Size were associated with differing survival rates.
To improve upon this analysis, I will consider whether there is a case to add interaction terms to a predictive model which account for joint effects between the features.

## Pclass and Sex

This visualisation plots the survival count on the y-axis against passenger class on the x-axis. The chart is faceted by sex, splitting the dataset into female and male passengers, which makes it easy to see whether sex affects the likelihood of survival within each passenger class.

In [68]:
facet_group_countplot(all_df, x="Pclass", group="Survived", facet="Sex", title="Survival Count by Pclass and Sex")


First, looking at the number of passengers per class and their gender split, there are more male than female passengers in all three classes. However, the ratio of males to females is not equal across classes: 1st class is the most balanced, with only slightly more males than females, followed by 2nd class, while 3rd class shows a large disparity between the number of males and females.

This is important because general intuition ('women and children first') would predict that females and passengers in higher classes were more likely to survive. A plot that looked only at class or only at sex against survival might overstate the association between being male, or travelling in 3rd class, and not surviving, because males are concentrated in 3rd class and the two effects are confounded.

By visualising both class and sex on the same plot we can dig into the joint effects of both variables. For example, we see that on average female passengers are more likely to survive than male passengers. Yet the relationship is not the same across classes: nearly all female passengers in 1st and 2nd class survived, but in 3rd class the chance of survival is closer to 50/50.
On the male section of the dataset the relationship between class and survival is less clear, although we can conclude that 1st class males are more likely to survive than 2nd and 3rd class males.
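The rates read off the chart can also be tabulated directly. The following is a minimal sketch on made-up rows (not the Titanic figures), assuming `Survived` is coded 0/1 so that its mean within each group is the survival rate; `toy_df` and `rate_table` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the Titanic figures.
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each (Sex, Pclass) cell.
rate_table = toy_df.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```

On the real data, rows without a `Survived` label (for example those originating from the test file) would need to be excluded or ignored for these rates to be meaningful.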

To account for these joint effects, I would include a joint feature of class and sex. This is an improvement on using sex and class only as separate features, because it captures the joint impact of class and gender on survival rate, which is shown most clearly in the female section of the dataset.
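One possible way to build such a joint feature is sketched below on toy rows. The column name `Sex_Pclass` and the `SexClass` prefix are hypothetical choices for illustration; the original notebook does not specify how the feature would be encoded.

```python
import pandas as pd

# Toy rows for illustration only.
toy_df = pd.DataFrame({"Sex": ["female", "male", "female"], "Pclass": [1, 3, 3]})

# Hypothetical column name: one category per (sex, class) combination.
toy_df["Sex_Pclass"] = toy_df["Sex"] + "_" + toy_df["Pclass"].astype(str)

# One-hot encode so that a linear model can fit a separate coefficient
# for each combination instead of additive sex and class effects only.
joint_dummies = pd.get_dummies(toy_df["Sex_Pclass"], prefix="SexClass")
print(joint_dummies.columns.tolist())
```

Tree-based models can often discover this kind of interaction on their own, so an explicit joint feature is likely to matter most for linear models such as logistic regression.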

## Family Size and Sex

In [69]:
all_df["Family Size"] = calculate_family_size(all_df)

Family size is an engineered feature that sums the number of parents/children and the number of siblings/spouses travelling with each passenger.
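The body of `calculate_family_size` is not shown in this report. A minimal sketch of the idea, assuming the dataset uses the column names `SibSp` and `Parch` and that the passenger themselves is not counted, could look like this (`family_size_sketch` is a hypothetical stand-in, not the project's function):

```python
import pandas as pd

def family_size_sketch(df: pd.DataFrame) -> pd.Series:
    """Relatives aboard: siblings/spouses plus parents/children.

    Hypothetical stand-in for calculate_family_size; the real helper may
    differ, for example by adding 1 to count the passenger as well.
    """
    return df["SibSp"] + df["Parch"]

# Toy rows for illustration only.
toy_df = pd.DataFrame({"SibSp": [0, 1, 3], "Parch": [0, 2, 2]})
toy_df["Family Size"] = family_size_sketch(toy_df)
print(toy_df)
```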

From the original analysis, family size clearly had an impact on survival rate. Based on the plot above, it may be that class or sex interacts with family size.

A possible interaction with Pclass was explored, but the results were insignificant. Let's investigate the interaction with the Sex feature.
First, below is the survival rate split only on sex. We see that females have about a 75% chance of survival, whereas males have about 20%.

In [70]:
proportion_chart(all_df, x="Sex", group="Survived", title="Survival Rate by Sex")

Below is the survival rate by family size alone. Single travellers have about a 30% chance of survival, whereas small to medium-sized families have a better-than-even chance. The survival rate then decreases for larger families.

In [71]:
proportion_chart(all_df, x="Family Size", group="Survived", title="Survival Rate by Family Size")

In [72]:
barchart_proportions(all_df, x="Family Size", group="Survived", facet="Sex", title="Survival Rate by Family Size and Sex")

Recall that the average survival rate for women was about 75%. For family sizes up to 4, the survival chance for females is slightly higher than this average.

Additionally, recall that the average survival rate for men was about 20%. Single men have a lower-than-average chance of survival, whereas men with small to medium families have a higher-than-average survival rate.

This motivates the inclusion of an interaction term between Sex and Family Size. Family size affects survival chances for both males and females, but its impact is not the same for each sex. In particular, grouping family size into 'single', 'small-medium family' and 'large family' categories would highlight the differences between families and single travellers.
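The grouping described above could be sketched with `pd.cut`. The cut points below (0 relatives = single, 1 to 3 = small-medium, 4 or more = large) are assumptions chosen for illustration; the report does not fix exact thresholds, and they would shift by one if family size also counted the passenger.

```python
import pandas as pd

# Toy values for illustration only.
family_size = pd.Series([0, 1, 3, 4, 7], name="Family Size")
sex = pd.Series(["male", "female", "male", "female", "male"], name="Sex")

# Assumed cut points: 0 relatives = single, 1-3 = small-medium, 4+ = large.
# Intervals are right-closed: (-1, 0], (0, 3], (3, inf).
family_group = pd.cut(
    family_size,
    bins=[-1, 0, 3, float("inf")],
    labels=["single", "small-medium", "large"],
)

# Interaction term: one category per (sex, family group) combination.
sex_family = sex + "_" + family_group.astype(str)
print(sex_family.tolist())
```

Binning before forming the interaction keeps the number of combined categories small (two sexes times three groups), which avoids sparse cells for the rarer large family sizes.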

## Conclusion

I have presented evidence for including interaction terms between Sex and Pclass, and between Sex and Family Size. Furthermore, I recommend that Family Size be categorised into singles, small-medium families, and large families. A possible interaction between Pclass and Family Size was investigated (not shown) but deemed insignificant.