# **Factors of Survival**

## Objectives
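* Identify which passenger variables are most strongly associated with survival.
* Test Hypothesis 1 (Sex and survival) and Hypothesis 2 (Pclass and survival).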

## Inputs

* outputs/datasets/collection/titanic_passengers.csv

## Outputs

* Generate code and visualisations that fulfil business requirement 1, above.

___

## Set up the Working Directory

Define and confirm the current working directory

In [None]:
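import os
# Move up one level so that relative paths resolve from the parent directory
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
# Confirm the new working directory
current_dir = os.getcwd()
current_dir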

___

## Load Collected Data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/titanic_passengers.csv"
df = pd.read_csv(df_raw_path)
df.head()

In [None]:
df.describe(include='all')

## Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

For this exploratory study, some data will be encoded and some will be dropped.

First, variables whose values are unique to each passenger (Name, PassengerId, Ticket) or largely missing (Cabin) will be dropped.

In [None]:
df = df.drop(["Name", "PassengerId", "Ticket", "Cabin"], axis=1)
df.head(5)

The Sex variable will be encoded as 1 for male and 0 for female.

In [None]:
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df.head(5)

The Embarked variable will be encoded using One-Hot Encoding. There are 2 missing values from this column; these will be imputed as 'Missing'.

In [None]:
from feature_engine.encoding import OneHotEncoder
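# Assign the result back rather than using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not update df
df['Embarked'] = df['Embarked'].fillna('Missing')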
encoder = OneHotEncoder(variables=['Embarked'], drop_last=False)
df_ohe = encoder.fit_transform(df)
df_ohe.head(5)

___

## Correlation Study

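Credit: Code Institute Walkthrough 2 - Churnometer

Two correlation coefficients are used. Pearson measures the strength of the linear relationship between two variables, while Spearman works on ranks and measures the strength of any monotonic relationship, linear or not. Both range from -1 to 1, with values near 0 indicating little or no relationship of that kind.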
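
The difference between the two `method` options passed to `corr` can be seen on a small synthetic example. The sketch below uses made-up data (not the Titanic dataset) in which `y` always increases with `x`, but not in a straight line; the names `x`, `y`, `pearson` and `spearman` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data (not from the Titanic dataset):
# y grows monotonically with x, but not linearly.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

# Pearson is computed on the raw values, Spearman on their ranks
pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` are identical, Spearman is 1, while Pearson is noticeably lower because the relationship is not linear. This is one reason to inspect both coefficients before choosing variables to study.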

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Survived'].sort_values(key=abs, ascending=False)[1:]
corr_spearman

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Survived'].sort_values(key=abs, ascending=False)[1:]
corr_pearson

For both Spearman and Pearson, the variables Sex, Pclass and Fare show weak-to-moderate or moderate correlation with the target variable. These will be investigated further.

## Exploratory Data Analysis (EDA) on Selected Variables

In [None]:
vars_to_study = ['Sex', 'Pclass', 'Fare']

In [None]:
df_eda = df.filter(vars_to_study + ['Survived'])
# Preview the filtered frame (not the full df)
df_eda.head()

### Variable Distribution by Survived

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'Survived'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

These visualisations suggest the following conclusions:

1. Female passengers were more likely to survive than male passengers.
2. Passengers who embarked at Southampton were less likely to survive than passengers who embarked at Cherbourg.
    * Passengers who embarked at Cherbourg were more likely to survive than not.
3. First class passengers were more likely to survive than third class passengers.
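
Statements of the form "group A was more likely to survive than group B" can also be checked numerically as a survival rate per group. The following is a minimal sketch on toy data; the rows are invented for illustration and are not taken from the Titanic dataset, and `toy`, `rate_by_sex` and `rate_by_class` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the Titanic dataset.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 3, 3, 1, 3, 1],
    'Survived': [1, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

The same `groupby(...)['Survived'].mean()` pattern should apply to `df_eda`, bearing in mind that Sex is encoded as 0 and 1 at this point in the notebook.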

## Parallel Plot

In order to better visualise the data, Fare will be discretised.

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser

n_classes = 10
disc = EqualFrequencyDiscretiser(q=n_classes, variables=['Fare'])
df_parallel = disc.fit_transform(df_eda)
df_parallel.head()
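
Equal-frequency discretisation places the bin edges at quantiles, so each bin holds roughly the same number of observations regardless of how skewed the values are. As a rough sketch of the idea, the example below uses pandas' `qcut` on synthetic "fare-like" values; the data and the names `fares`, `bins` and `counts` are assumptions made for illustration, and this is not a description of how `EqualFrequencyDiscretiser` is implemented.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values for illustration only (not real fares).
rng = np.random.default_rng(0)
fares = pd.Series(rng.exponential(scale=30.0, size=1000))

# Quantile-based binning: labels=False returns the integer bin index (0 to 9)
bins = pd.qcut(fares, q=10, labels=False)

# Count how many observations fall into each bin
counts = bins.value_counts().sort_index()
print(counts)
```

With heavily tied values, as can happen with real fares, some quantile edges may coincide, so the bins are only approximately equal in size.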

The discretised data will also be re-labelled with more meaningful labels.

In [None]:
classes_ranges = disc.binner_dict_['Fare'][1:-1]

fare_map = {}
for n in range(0, n_classes):
    if n == 0:
        fare_map[n] = f"<{round(classes_ranges[0],2)}"
    elif n == n_classes-1:
        fare_map[n] = f"+{round(classes_ranges[-1],2)}"
    else:
        fare_map[n] = f"{round(classes_ranges[n-1],2)} to {round(classes_ranges[n],2)}"

fare_map

In [None]:
df_parallel['Fare'] = df_parallel['Fare'].replace(fare_map)
df_parallel

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_parallel, color="Survived")
fig.show()

This parallel plot indicates a connection between the highest fares and survival rate, and highlights the relationship between Sex and Pclass.

## Hypothesis Testing

In [None]:
df['Sex'] = df['Sex'].map({1: 'male', 0: 'female'})

### Hypothesis 1: Male passengers were less likely to survive the tragedy than female passengers.

Here, the null hypothesis would be - There is no association between Sex and survival rate for passengers on the Titanic.

In [None]:

from scipy.stats import chi2_contingency

significance = 0.05

contingency_table = pd.crosstab(df['Survived'], df['Sex'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)


if (p_value < significance):
    statement = "There is sufficient evidence to reject the null hypothesis"
else:
    statement = "There is insufficient evidence to reject the null hypothesis"

statement
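
To make the chi-squared test of independence less of a black box, the sketch below computes the expected counts and the test statistic by hand and compares them with `chi2_contingency`. The counts in `observed` are hypothetical and chosen only for illustration; they are not the Titanic figures.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts for illustration only (rows: outcome, columns: group).
observed = np.array([[30, 70],
                     [60, 40]])

# Under independence: expected = row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False turns off Yates' continuity correction, which SciPy
# applies by default to 2x2 tables, so the result matches the manual value
chi2_stat, p_val, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(expected_manual)
print(chi2_manual, chi2_stat, p_val, dof)
```

Note that a Survived by Sex table is 2x2, so a default call to `chi2_contingency` applies the continuity correction and returns a slightly smaller statistic than the uncorrected formula.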


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

for i, sex in enumerate(contingency_table.columns):
    ax[i].pie(contingency_table[sex], labels=contingency_table.index, autopct='%1.1f%%', startangle=90)
    ax[i].set_title(f'Survival Status - {sex}')

plt.show()

These charts are consistent with the test result: Sex and Survived are not independent.

### Hypothesis 2: Passengers travelling in First Class were more likely to survive than passengers travelling in Third Class.

Here, the null hypothesis would be - There is no association between Pclass and survival rate for passengers on the Titanic.

In [None]:
significance = 0.05

contingency_table = pd.crosstab(df['Survived'], df['Pclass'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)


if (p_value < significance):
    statement = "There is sufficient evidence to reject the null hypothesis."
else:
    statement = "There is insufficient evidence to reject the null hypothesis"

statement


In [None]:
fig, ax = plt.subplots(1, 3, figsize=(10, 5))
 
for i, pclass in enumerate(contingency_table.columns):
    ax[i].pie(contingency_table[pclass], labels=contingency_table.index, autopct='%1.1f%%', startangle=90)
    ax[i].set_title(f'Survival Status By Class - {pclass}')

plt.show()

These charts are consistent with the test result: Pclass and Survived are not independent.
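
The test above considers all three classes together, whereas Hypothesis 2 compares First Class with Third Class specifically. One possible follow-up, sketched below under stated assumptions, is to restrict the data to those two classes before testing. `compare_two_classes` is a hypothetical helper, and the toy data is invented for illustration rather than drawn from the Titanic dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def compare_two_classes(data, class_a=1, class_b=3, alpha=0.05):
    """Chi-squared test of Survived vs Pclass, restricted to two classes.

    Hypothetical helper: assumes `data` has 'Pclass' and 'Survived' columns.
    """
    subset = data[data['Pclass'].isin([class_a, class_b])]
    table = pd.crosstab(subset['Survived'], subset['Pclass'])
    chi2_stat, p_val, dof, _ = chi2_contingency(table)
    # Survival rate per class gives the direction of any association
    rates = subset.groupby('Pclass')['Survived'].mean()
    return {'p_value': p_val, 'dof': dof,
            'reject_null': p_val < alpha, 'survival_rates': rates}


# Toy data for illustration only; NOT the Titanic counts.
toy = pd.DataFrame({
    'Pclass':   [1] * 40 + [2] * 30 + [3] * 50,
    'Survived': [1] * 30 + [0] * 10 + [1] * 15 + [0] * 15 + [1] * 10 + [0] * 40,
})

result = compare_two_classes(toy)
print(result['reject_null'])
print(result['survival_rates'])
```

Returning the per-class survival rates alongside the p-value matters because the chi-squared test only indicates that an association exists, not which class fared better.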

### Benchmarking Fares & Class

The study requires some information about the relationship between Pclass and Fare for the dashboard. For each class, it will be useful to know the 1st Quartile, median and 3rd Quartile cost of a ticket.

In [None]:
fare_info = df.groupby('Pclass')['Fare'].quantile([0.25,0.5, 0.75]).unstack()
fare_info.columns = ['1st Quartile', 'Median', '3rd Quartile']
fare_info

___

## Conclusions

The study and visualisations above support the following conclusions: 

1. There is a significant relationship between Sex and survival rate. 
    * Female passengers were more likely to survive than Male passengers.
2. There is a significant relationship between Pclass and survival rate.
    * Passengers travelling in First Class were more likely to survive than passengers travelling in Third Class.
