
# Titanic Survival Analysis Project

In this project I try to predict which passengers in the Titanic test data set survived, using linear regression. I also aim to identify the three variables that best explain survival.

## Importing

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

## Reading in the training and test data

In [2]:
train_data = pd.read_csv("https://raw.githubusercontent.com/austinkirwin/public-projects/refs/heads/main/Python_projects/Titanic_project/train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/austinkirwin/public-projects/refs/heads/main/Python_projects/Titanic_project/test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Linear Models

I'm going to use a linear model to predict which passengers survive and which do not.

In [3]:
# Splitting the training data and removing NaN values
train_data = train_data.dropna()
# Features matrix
train_feature = train_data.drop(['Survived'], axis = 1)
# Target variable
train_target = train_data['Survived']

In [4]:
# Dropping unnecessary variables
train_feature = train_feature.drop(['Name','Ticket','Cabin','Embarked','Sex'], axis = 1)
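A note on the order of operations in the two cells above: `dropna()` runs while `Cabin` is still in the frame, so a row that is only missing `Cabin` is discarded even though that column is dropped afterwards. The sketch below uses a small made-up frame (not the real training data) to show how the row count can differ when the unused columns are dropped first. The values in `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Order used in the notebook: drop rows with any NaN, then drop columns
rows_dropna_first = toy.dropna().drop(["Cabin"], axis=1)

# Alternative order: drop unused columns, then drop rows with NaN
rows_drop_cols_first = toy.drop(["Cabin"], axis=1).dropna()

print(len(rows_dropna_first), len(rows_drop_cols_first))
```

Which order is appropriate depends on whether the dropped columns are meant to filter rows; here they are not used by the model.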

In [5]:
# Compiling and fitting the full model
full_model = LinearRegression()
full_model.fit(train_feature, train_target)

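# Drop the same columns from the test data, then drop rows with missing values
test_feature = test_data.drop(['Cabin','Embarked','Name','Sex','Ticket'], axis = 1).dropna()
prediction_values = full_model.predict(test_feature)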
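The introduction sets the goal of finding the three variables that best explain survival. One possible way to approach this with a linear model, sketched here on synthetic data rather than the Titanic data, is to standardize the features and compare the absolute coefficients. The feature values, the `true_effects` weights, and the `toy_model` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the Titanic data)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Age": rng.normal(30, 14, 200),
    "Fare": rng.normal(35, 50, 200),
    "Pclass": rng.integers(1, 4, 200).astype(float),
    "SibSp": rng.integers(0, 3, 200).astype(float),
})

# Standardize so each coefficient is measured per standard deviation
scaled = StandardScaler().fit_transform(features)

# Synthetic target with known per-standard-deviation effects (assumed)
true_effects = np.array([0.10, -0.30, 0.20, 0.05])
target = scaled @ true_effects

toy_model = LinearRegression().fit(scaled, target)

# Rank features by absolute standardized coefficient and keep the top three
ranking = pd.Series(np.abs(toy_model.coef_), index=features.columns)
top_three = list(ranking.sort_values(ascending=False).index[:3])
print(top_three)
```

Without standardizing, a feature measured in large units (such as a fare) would tend to get a small coefficient regardless of how much it matters, so raw coefficients are hard to compare.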

In [6]:
final_preds = pd.DataFrame(prediction_values)
final_preds.columns = ['Survived']
final_preds

Unnamed: 0,Survived
0,0.661221
1,0.599410
2,0.485545
3,0.727210
4,0.756698
...,...
326,1.035987
327,0.973622
328,0.833828
329,0.923213


In [7]:
# Reformatting each value to 'Yes' or 'No' for survival

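# Threshold at 0.5: scores of 0.5 or more count as survived
final_preds['Survived'] = (final_preds['Survived'] >= .5).astype(int)

# Named survival_labels rather than `map`, which would shadow the built-in
survival_labels = {1: 'Yes', 0: 'No'}
final_preds['Survived'].map(survival_labels)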

Unnamed: 0,Survived
0,Yes
1,Yes
2,No
3,Yes
4,Yes
...,...
326,Yes
327,Yes
328,Yes
329,Yes
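The thresholding step can also be written as a single vectorized expression. This is a minimal sketch on made-up scores, assuming a 0.5 cutoff and that a score of exactly 0.5 counts as survival; `toy_preds` and `cutoff` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up regression outputs standing in for the model's predictions
toy_preds = pd.DataFrame({"Survived": [0.66, 0.49, 0.50, 1.04, -0.10]})

# One vectorized step: 1 when the score reaches the cutoff, else 0
cutoff = 0.5
toy_preds["Survived"] = np.where(toy_preds["Survived"] >= cutoff, 1, 0)

# Map the 0/1 codes to labels and keep the result in a variable
labels = toy_preds["Survived"].map({1: "Yes", 0: "No"})
print(labels.tolist())
```

Note that a linear regression can produce scores below 0 or above 1, as the made-up values here do; the cutoff still assigns each of them to one class.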


## Implementing Decision Trees

For learning purposes, I will use TensorFlow Decision Forests (TF-DF) to try to predict the survivors.

In [12]:
!pip install tensorflow tensorflow_decision_forests
import tensorflow as tf
import tensorflow_decision_forests as tfdf

Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-1.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.0 kB)
Collecting wurlitzer (from tensorflow_decision_forests)
  Downloading wurlitzer-3.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting ydf (from tensorflow_decision_forests)
  Downloading ydf-0.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.5 kB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 (from tensorflow)
  Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading tensorflow_decision_forests-1.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.9 MB)
Downloading wurlitzer-3.1.1-py3-none-any.whl (8.6 kB)
Downloading ydf-0.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_

In [13]:
tfdf_train_data = pd.read_csv("https://raw.githubusercontent.com/austinkirwin/public-projects/refs/heads/main/Python_projects/Titanic_project/train.csv")
tfdf_test_data = pd.read_csv("https://raw.githubusercontent.com/austinkirwin/public-projects/refs/heads/main/Python_projects/Titanic_project/test.csv")

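Normalizing the names (stripping punctuation) and splitting each ticket into its number and its prefix. The names are tokenized later, when the data is converted to a TensorFlow dataset.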

In [14]:
def preprocess(df):
  df = df.copy()

  def normalize_name(x):
    return " ".join([v.strip(",()[].\"'") for v in x.split(" ")])

  def ticket_number(x):
    return x.split(" ")[-1]

  def ticket_item(x):
    items = x.split(" ")
    if len(items) == 1:
      return "NONE"
    return "_".join(items[0:-1])

  df["Name"] = df["Name"].apply(normalize_name)
  df["Ticket_number"] = df["Ticket"].apply(ticket_number)
  df["Ticket_item"] = df["Ticket"].apply(ticket_item)
  return df

preprocessed_train_df = preprocess(tfdf_train_data)
preprocessed_test_df = preprocess(tfdf_test_data)

preprocessed_train_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_number,Ticket_item
0,1,0,3,Braund Mr Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S,21171,A/5
1,2,1,1,Cumings Mrs John Bradley Florence Briggs Thayer,female,38.0,1,0,PC 17599,71.2833,C85,C,17599,PC
2,3,1,3,Heikkinen Miss Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S,3101282,STON/O2.
3,4,1,1,Futrelle Mrs Jacques Heath Lily May Peel,female,35.0,1,0,113803,53.1,C123,S,113803,NONE
4,5,0,3,Allen Mr William Henry,male,35.0,0,0,373450,8.05,,S,373450,NONE
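To make the preprocessing concrete, the sketch below re-declares the three helpers as standalone functions and applies them to a few names and tickets that appear in the data shown in this notebook. The last ticket string, `"A B 123"`, is a hypothetical example added only to show how a multi-part prefix is joined.

```python
def normalize_name(name):
    # Strip surrounding punctuation from each space-separated piece
    return " ".join([v.strip(",()[].\"'") for v in name.split(" ")])

def ticket_number(ticket):
    # The number is the last space-separated piece of the ticket string
    return ticket.split(" ")[-1]

def ticket_item(ticket):
    # Everything before the number is the prefix; "NONE" if there is none
    items = ticket.split(" ")
    if len(items) == 1:
        return "NONE"
    return "_".join(items[:-1])

# Names from the test data preview
names = ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"]
normalized = [normalize_name(n) for n in names]

# Tickets from the training data preview, plus one hypothetical ticket
tickets = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "A B 123"]
parsed = [(ticket_item(t), ticket_number(t)) for t in tickets]

print(normalized)
print(parsed)
```

The parsed pairs for the first four tickets match the `Ticket_item` and `Ticket_number` columns in the table above.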


We don't want to train the model on the "PassengerId" and "Ticket" features, and "Survived" is the label, so all three are removed from the input features.

In [16]:
input_features = list(preprocessed_train_df.columns)
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
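The same selection can be written as one filter over the column names, which keeps the excluded names in a single place. This is a sketch only: the `columns` list is copied from the preprocessed training frame shown earlier, and the `excluded` set is an illustrative name.

```python
# Column names as shown in the preprocessed training frame
columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp",
           "Parch", "Ticket", "Fare", "Cabin", "Embarked",
           "Ticket_number", "Ticket_item"]

# Identifier, raw ticket string, and label are kept out of the inputs
excluded = {"PassengerId", "Ticket", "Survived"}
input_features = [c for c in columns if c not in excluded]

print(len(input_features), input_features)
```

Unlike repeated `remove` calls, the filter does not raise an error if one of the excluded names is absent from the frame.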

## Converting dataset to TensorFlow dataset

In [20]:
def tokenize_names(features, labels = None):
  features["Name"] = tf.strings.split(features["Name"])
  return features, labels

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_train_df, label = "Survived").map(tokenize_names)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_test_df).map(tokenize_names)

## Training the model with default params

In [22]:
model = tfdf.keras.GradientBoostedTreesModel(
    verbose = 0,
    features = [tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features = True,
    random_seed = 10,
)
model.fit(train_ds)

self_evaluation = model.make_inspector().evaluation()
self_evaluation.loss

0.7801646590232849

## Training the model with custom params

In [24]:
model2 = tfdf.keras.GradientBoostedTreesModel(
    verbose = 0,
    features=[tfdf.keras.FeatureUsage(name = n) for n in input_features],
    exclude_non_specified_features = True,
    min_examples = 1,
    categorical_algorithm = "RANDOM",
    shrinkage = 0.05,
    split_axis = "SPARSE_OBLIQUE",
    sparse_oblique_normalization = "MIN_MAX",
    sparse_oblique_num_projections_exponent=2.0,
    num_trees = 2000,
    random_seed = 10,
)

model2.fit(train_ds)

self_evaluation2 = model2.make_inspector().evaluation()
self_evaluation2.loss

0.8411012291908264

In [25]:
model2.summary()

Model: "gradient_boosted_trees_model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1 (1.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1 (1.00 Byte)
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (11):
	Age
	Cabin
	Embarked
	Fare
	Name
	Parch
	Pclass
	Sex
	SibSp
	Ticket_item
	Ticket_number

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1.           "Sex"  0.459147 ################
    2.           "Age"  0.373636 ###########
    3.          "Fare"  0.263117 #####
    4.          "Name"  0.218394 ##
    5.   "Ticket_item"  0.181858 
    6. "Ticket_number"  0.178881 
    7.      "Embarked"  0.178076 
    8.        "Pclass"  0.177209 
    9.         "Parch"  0.176724 
   10.         "SibSp"  0.171615 

Variable Importance: NUM_AS_ROOT:
    1.  "Sex" 45.000000
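The importance scores give one answer to the question posed in the introduction. The sketch below ranks the `INV_MEAN_MIN_DEPTH` values, copied by hand from the summary above, and keeps the top three. It works on the printed numbers only, so it does not need the trained model; the dictionary name is an illustrative choice.

```python
# INV_MEAN_MIN_DEPTH scores copied from the model summary
inv_mean_min_depth = {
    "Sex": 0.459147,
    "Age": 0.373636,
    "Fare": 0.263117,
    "Name": 0.218394,
    "Ticket_item": 0.181858,
    "Ticket_number": 0.178881,
    "Embarked": 0.178076,
    "Pclass": 0.177209,
    "Parch": 0.176724,
    "SibSp": 0.171615,
}

# Sort feature names by score, highest first, and keep the first three
top_three = sorted(inv_mean_min_depth, key=inv_mean_min_depth.get, reverse=True)[:3]
print(top_three)
```

By this measure the three most important variables are Sex, Age, and Fare. Other importance measures reported by the model may rank the features differently.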

## Making predictions

In [29]:
def prediction_to_kaggle_format(model, threshold=0.5):
    # Probability of survival for each row of the test dataset
    proba_survive = model.predict(test_ds, verbose=0)[:, 0]

    # Read the IDs from the DataFrame that test_ds was built from. This
    # assumes pd_dataframe_to_tf_dataset keeps the DataFrame's row order.
    return pd.DataFrame({
        "PassengerId": preprocessed_test_df["PassengerId"],
        "Survived": (proba_survive >= threshold).astype(int)
    })

kaggle_predictions = prediction_to_kaggle_format(model2)
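The thresholding and ID pairing in `prediction_to_kaggle_format` can be checked without a trained model. The sketch below uses made-up IDs and probabilities; `to_submission` is a hypothetical helper name, and it assumes the IDs and probabilities are in the same row order.

```python
import numpy as np
import pandas as pd

# Made-up IDs and survival probabilities standing in for real model output
passenger_ids = pd.Series([892, 893, 894, 895])
proba_survive = np.array([0.12, 0.50, 0.87, 0.49])

def to_submission(ids, probabilities, threshold=0.5):
    # Pair each ID with a 0/1 decision; rows must be in the same order
    return pd.DataFrame({
        "PassengerId": ids,
        "Survived": (probabilities >= threshold).astype(int),
    })

submission = to_submission(passenger_ids, proba_survive)

# Render as CSV text with the two columns and no index column
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so the output has only the `PassengerId` and `Survived` columns.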