<a href="https://www.kaggle.com/code/collindavies/wcg-voting-ensemble-0-83014-lb?scriptVersionId=180312230" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Approach

The best models for predicting Titanic survivors build upon a few simple facts:
1. Nearly all males die.
2. Nearly all females live.
3. Women-children groups (WCGs) with members in both the train and test sets live or die together.
4. WCGs with members in only the test set live or die based on Pclass.
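
As a rough illustration, the four rules can be folded into one small function. This is a sketch under stated assumptions, not the notebook's code: the name `predict_survival` and its arguments are made up for this example, and `group_rate` stands for the share of a group's train-set members who survived (`None` when the group has no members in the train set).

```python
from typing import Optional

def predict_survival(title: str, in_group: bool,
                     group_rate: Optional[float], pclass: int) -> int:
    """Return 1 (survives) or 0 (dies) for one passenger.

    title is 'man', 'woman', or 'boy' (hypothetical labels for this sketch).
    """
    # Rule 1: adult males are predicted to die.
    if title == "man":
        return 0
    # Rule 2: women default to survival, boys default to death.
    default = 1 if title == "woman" else 0
    if not in_group:
        return default
    # Rule 4: a group with no train-set members falls back on Pclass.
    if group_rate is None:
        group_rate = 0.0 if pclass == 3 else 1.0
    # Rule 3: a group that unanimously died or survived overrides the default.
    if group_rate == 0.0:
        return 0
    if group_rate == 1.0:
        return 1
    # Mixed outcomes in the group: keep the gender-based default.
    return default

print(predict_survival("boy", True, 1.0, 3))
print(predict_survival("woman", True, None, 3))
```

Only unanimous groups override the gender-based default in this sketch, so a group with mixed outcomes leaves the prediction unchanged.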

Strategic WCG selection yields survivor prediction scores of 83+%, which places in the top 1% of the rolling leaderboard as of September 22, 2023. <br>

In this notebook, I present my strategy for selecting WCGs and the resulting prediction scores.

# Benchmark

There are a few basic strategies for predicting Titanic survivors. <br>

To understand how the model performs, I've noted a few benchmark models for reference. <br>

Here are the benchmark models and previously reported results:
* No members of the test set survive (All dead model): 62.7% (131/209)
* All females of the test set survive and all males die (Gender model): 76.6% (160/209)
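
The percentages above follow directly from the counts in parentheses. A quick check, where `accuracy_pct` is a helper name introduced only for this example and the 83% target is assumed to be measured on the same 209 passengers:

```python
import math

def accuracy_pct(correct: int, total: int) -> float:
    """Share of correct predictions as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)

all_dead = accuracy_pct(131, 209)   # All dead model
gender = accuracy_pct(160, 209)     # Gender model

# Smallest number of correct predictions out of 209 that reaches 83%
needed_for_goal = math.ceil(0.83 * 209)

print(all_dead, gender, needed_for_goal)
```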

Based on previous results for WCG models, the goal of this notebook is 83% or higher.

## Note

It is common to think that more complicated models yield better results than simple models. <br>

*There are instances when this is true.* <br>

However, a simple but well-constructed model supported by thorough data analysis and some strategic feature engineering can perform as well as or better than more complicated models in many cases.<br>

Simple models are also often easier to understand, reproduce, and edit. <br>

From what I've seen of other entries, a simple model will outperform the vast majority of the more complicated leaderboard submissions. <br>

Let's see how far we can get with strategic decisions and simple models!

## Note II

Much of this work was inspired by other folks here on Kaggle. Thanks to [Chris Deotte](https://www.kaggle.com/cdeotte) for compiling a lot of the work on this problem into useful notebooks and then taking it all a step further.

# Work

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load datasets
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

In [3]:
# Combine train and test dfs
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

df_all = concat_df(train_df, test_df)

In [4]:
# Feature cleanup
# Extract Last Name and Maiden Name from Name
df_all['Last_Name'] = df_all['Name'].str.extract(r'^(.+?),', expand=False)
df_all['Maiden_Name'] = df_all['Name'].str.extract(r' ([A-Za-z]+)\)', expand=False)

# Title
df_all['Title'] = np.where(df_all['Sex'] == 'female', 'woman',
                          np.where(df_all['Name'].str.contains('Master'), 'boy', 'man'))
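
To see what the two patterns in the cell above capture, here is a small standalone sketch that applies them with the standard `re` module to three names written in the dataset's format. The helper `parse_name` is hypothetical and exists only for this example; it mirrors the cell's logic one name at a time instead of on a whole column.

```python
import re

LAST_NAME = re.compile(r'^(.+?),')          # everything before the first comma
MAIDEN_NAME = re.compile(r' ([A-Za-z]+)\)')  # last word inside the parentheses

def parse_name(name: str, sex: str) -> dict:
    """Extract last name, maiden name, and a woman/boy/man label from one name."""
    last = LAST_NAME.search(name)
    maiden = MAIDEN_NAME.search(name)
    if sex == 'female':
        title = 'woman'
    elif 'Master' in name:
        title = 'boy'
    else:
        title = 'man'
    return {
        'Last_Name': last.group(1) if last else None,
        'Maiden_Name': maiden.group(1) if maiden else None,
        'Title': title,
    }

samples = [
    ('Braund, Mr. Owen Harris', 'male'),
    ('Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'female'),
    ('Palsson, Master. Gosta Leonard', 'male'),
]
parsed = [parse_name(name, sex) for name, sex in samples]
for row in parsed:
    print(row)
```

Names without parentheses simply get no maiden name, which matches the missing values that `str.extract` leaves in the column.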

In [5]:
# Group
# Basic conditions
df_all['group'] = (df_all['Last_Name'] + '-' 
                   + df_all['Pclass'].astype(str) + '-'
                   + df_all['Fare'].astype(str) + '-'
                   + df_all['Ticket'].str[:-1] + '-'
                   + df_all['Embarked'])

# Remove males from group
df_all.loc[df_all['Title'] == 'man', 'group'] = 'noGroup'

# Remove "groups" of 1
df_all = pd.merge(df_all, df_all.groupby('group')['PassengerId'].count().reset_index().rename(columns={"PassengerId": "group_freq"}), on='group', how='left')
df_all.loc[df_all['group_freq'] == 1.0, 'group'] = 'noGroup'

# Add group members with different last names to group
df_all['TicketId'] = (df_all['Fare'].astype(str) + '-'
                       + df_all['Ticket'].str[:-1] + '-'
                      + df_all['Embarked'])

for i, row in df_all.iterrows():
    # Only consider women and boys that are not yet assigned to a group
    if row['Title'] != 'man' and row['group'] == 'noGroup':
        # Find rows with a matching 'TicketId' and copy the 'group' of the first match
        # (the first match can be the row itself or a man, in which case 'group' stays 'noGroup')
        matching_rows = df_all[df_all['TicketId'] == row['TicketId']]
        if not matching_rows.empty:
            df_all.at[i, 'group'] = matching_rows.iloc[0]['group']

# Manually handle extended family group of Richards and Hocking
df_all.loc[df_all['Last_Name'].isin(['Richards', 'Hocking']), 'group'] = 'Hocking-Richards-2-2910-S'
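
The first part of the cell above (build a key, drop men, drop singletons) can be sketched on a few made-up passengers. Everything in `passengers` is hypothetical toy data, and `group_key` is a name introduced for this example; the key uses the same fields as the notebook, including the ticket number with its last character removed.

```python
from collections import Counter

# Hypothetical toy rows, not taken from the Titanic data
passengers = [
    {"last": "Smith", "pclass": 3, "fare": "21.0", "ticket": "349909", "embarked": "S", "title": "woman"},
    {"last": "Smith", "pclass": 3, "fare": "21.0", "ticket": "349909", "embarked": "S", "title": "boy"},
    {"last": "Smith", "pclass": 3, "fare": "21.0", "ticket": "349909", "embarked": "S", "title": "man"},
    {"last": "Jones", "pclass": 1, "fare": "80.0", "ticket": "113572", "embarked": "C", "title": "woman"},
]

def group_key(p: dict) -> str:
    """Build the WCG key; adult men never belong to a group."""
    if p["title"] == "man":
        return "noGroup"
    return "-".join([p["last"], str(p["pclass"]), p["fare"], p["ticket"][:-1], p["embarked"]])

keys = [group_key(p) for p in passengers]
counts = Counter(keys)
# A key that occurs only once is not a group
groups = ["noGroup" if counts[k] == 1 else k for k in keys]
print(groups)
```

The woman travelling alone ends up in `noGroup` because her key is unique, while the mother and son share one key.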

In [6]:
# How do we handle WCGs without any members in the training set (i.e., no groupSurvival rate)?
# Nearly every WCG member in Pclass 1 and 2 survives, while the majority of WCG members in Pclass 3 die.
df_all[df_all['group'] != 'noGroup'].groupby(['Pclass', 'Survived']).PassengerId.count()

Pclass  Survived
1       0.0          2
        1.0         25
2       0.0          1
        1.0         36
3       0.0         60
        1.0         35
Name: PassengerId, dtype: int64
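
Turning the counts above into survival rates makes the Pclass fallback easier to justify. The numbers in `counts` are copied from the output above; the 50% cutoff used to derive `fallback` is an assumption of this sketch, chosen because it reproduces the "Pclass 3 dies, everyone else lives" rule.

```python
# Counts of WCG members in the training set, copied from the output above
counts = {
    1: {"died": 2, "survived": 25},
    2: {"died": 1, "survived": 36},
    3: {"died": 60, "survived": 35},
}

# Survival rate per class as a percentage, rounded to one decimal
rates = {
    pclass: round(100 * c["survived"] / (c["died"] + c["survived"]), 1)
    for pclass, c in counts.items()
}

# Assumed cutoff: predict survival when more than half of the class survived
fallback = {pclass: int(rate > 50) for pclass, rate in rates.items()}

print(rates)
print(fallback)
```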

In [7]:
# Predict
# Mean survival within each group, computed on the training set only (first 891 rows);
# index alignment leaves the test rows as NaN
df_all['groupSurvival'] = df_all.iloc[:891].groupby('group')['Survived'].transform('mean')
# Copy each group's training-set mean to the test rows (positions 891 to 1308, i.e. PassengerId 892 to 1309)
for i in range(891, 1309):
    group_id = df_all.at[i, 'group']
    group_average = df_all[df_all['group'] == group_id]['groupSurvival'].iloc[0]
    df_all.at[i, 'groupSurvival'] = group_average

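# Groups with no members in the training set: predict by Pclass
df_all.loc[(df_all['groupSurvival'].isna()) & (df_all['Pclass'] == 3), 'groupSurvival'] = 0
df_all.loc[(df_all['groupSurvival'].isna()) & (df_all['Pclass'] != 3), 'groupSurvival'] = 1

# Start from the gender model, then override with group survival
df_all['Predict'] = 0
df_all.loc[(df_all['Sex'] == 'female'), "Predict"] = 1
# Women in a group with a survival rate of 0 are predicted to die
df_all.loc[(df_all['Title'] == 'woman') & (df_all['groupSurvival'] == 0), "Predict"] = 0
# Boys in a group with a survival rate of 1 are predicted to survive
df_all.loc[(df_all['Title'] == 'boy') & (df_all['groupSurvival'] == 1), "Predict"] = 1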
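
The propagation step in the cell above amounts to a lookup: compute one mean per group from the training rows, then read it back for each test row. A minimal sketch with plain dictionaries follows; the group labels and outcomes are invented for the example, and groups that never appear in the training rows are left as `None` here rather than filled by Pclass.

```python
from collections import defaultdict

# Hypothetical (group, survived) pairs from a training set
train_rows = [
    ("group_a", 1), ("group_a", 1),
    ("group_b", 0), ("group_b", 0),
    ("group_c", 1), ("group_c", 0),
]

# Collect the outcomes of each group, then average them
outcomes = defaultdict(list)
for group, survived in train_rows:
    outcomes[group].append(survived)
group_rate = {group: sum(vals) / len(vals) for group, vals in outcomes.items()}

# Look up the training-set mean for each test passenger's group
test_groups = ["group_a", "group_b", "group_c", "group_d"]
test_rates = [group_rate.get(group) for group in test_groups]
print(test_rates)
```

A single dictionary lookup per test row avoids filtering the whole frame inside a loop, which is what the cell above does on every iteration.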

# Results so far

So far I completed the following steps:
1) identified WCGs and assumed that survival rates among group members in the train set carry over to group members in the test set, and <br>
2) predicted survival of WCG members in the test set with no group members in the train set based on Pclass. <br>

At this point, the model scored an impressive 81+% on the leaderboard. <br>
But there's at least one more improvement I can make to the model to get closer to that goal of 83+%.

# Next step
Leverage the results of previous high-scoring models with a voting ensemble to find males frequently predicted to survive and females frequently predicted to die.<br>This approach will put the hard and thoughtful work of others to good use. <br>

I included the results from the following five models in my voting ensemble:
1. [Titanic [0.82] - [0.83]](https://www.kaggle.com/code/konstantinmasich/titanic-0-82-0-83/notebook) by Konstantin Masich,<br>
2. [Titanic: ML tutorial on small dataset - [0.82296]](https://www.kaggle.com/code/shaochuanwang/titanic-ml-tutorial-on-small-dataset-0-82296/notebook) by Shao-Chuan Wang (There are multiple output files here. I used the voting results since they reportedly scored the highest), <br>
3. [Titanic Starter with XGBoost, 173/209 LB](https://www.kaggle.com/code/numbersareuseful/titanic-starter-with-xgboost-173-209-lb/notebook) by Tae Hyong Whang, <br>
4. [Titanic: Machine Learning from Disaster](https://www.kaggle.com/code/francksylla/titanic-machine-learning-from-disaster/script) by Franck Sylla, and <br>
5. [Divide and Conquer [0.82296]](https://www.kaggle.com/code/pliptor/divide-and-conquer-0-82296/report) by Oscar Takeshita.

In [8]:
# Ensemble approach with top 5 models
ens = pd.read_csv("/kaggle/input/titanic-ensemble/Titanic_Ensemble.csv")

In [9]:
# Count how many of the five models predict survival for each passenger
ens['sum'] = (ens['Konstantin'] +
              ens['Shao'] +
              ens['Tae'] +
              ens['Franck'] +
              ens['Oscar']
             )

df_all = pd.merge(df_all, ens, on='PassengerId', how='left')
# Override predicted survivors when fewer than three of the five models agree
df_all.loc[(df_all['Predict'] == 1) & (df_all['sum'] < 2.5), 'Predict'] = 0
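
The override in the cell above is a one-directional majority vote. The sketch below restates it for a single passenger, assuming each model contributes a 0/1 vote; `apply_vote` is a name made up for this example.

```python
from typing import List

def apply_vote(predict: int, votes: List[int]) -> int:
    """Downgrade a predicted survivor when fewer than half of the models agree.

    A predicted death is never upgraded, matching the condition Predict == 1.
    """
    if predict == 1 and sum(votes) < len(votes) / 2:
        return 0
    return predict

print(apply_vote(1, [1, 0, 0, 0, 1]))  # two of five votes for survival
print(apply_vote(1, [1, 1, 1, 0, 0]))  # three of five votes for survival
print(apply_vote(0, [1, 1, 1, 1, 1]))  # predicted death stays as is
```

With five voters the threshold `len(votes) / 2` equals the 2.5 used in the cell, so ties cannot occur.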

In [10]:
# Keep the test rows only and rename the prediction column for the submission file
output = df_all.iloc[891:][['PassengerId', 'Predict']].rename(columns={'Predict': 'Survived'})
output.to_csv("submission.csv", index=False)
output.head()

     PassengerId  Survived
891          892         0
892          893         1
893          894         0
894          895         0
895          896         1
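
Before uploading, it can help to sanity-check the submission file. The sketch below is an optional extra, not part of the notebook: `check_submission` is a hypothetical helper that takes the CSV text and reports problems with the header, the row count (418 passengers in the test set), and the values in `Survived`.

```python
import csv
import io
from typing import List

def check_submission(csv_text: str, expected_rows: int = 418) -> List[str]:
    """Return a list of problems found in a submission; empty means it looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append("header must be exactly PassengerId,Survived")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(row["Survived"] not in {"0", "1"} for row in rows):
        problems.append("Survived must contain only 0 or 1")
    return problems

sample = "PassengerId,Survived\n892,0\n893,1\n"
print(check_submission(sample, expected_rows=2))
```

To check the real file, the text could be read with `open("submission.csv").read()` and passed in with the default row count.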
