In [33]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split

def prepare_dataset(X, train=False, mean=0, std=1):
    """
    Preprocess the Titanic dataset by dropping irrelevant features, encoding categorical variables, 
    handling missing values, and applying feature scaling.

    Args:
        X (pd.DataFrame): The input dataset to preprocess.
        train (bool): Indicates whether the dataset is the training set. If True, calculates and returns
                      the mean and standard deviation for scaling. If False, applies the provided mean and std.
        mean (pd.Series or float): Mean of the training set for scaling (used when train=False).
        std (pd.Series or float): Standard deviation of the training set for scaling (used when train=False).

    Returns:
        pd.DataFrame: Preprocessed dataset.
        pd.Series: Mean of the features (only returned if train=True).
        pd.Series: Standard deviation of the features (only returned if train=True).
    """
    
    # Drop columns that are irrelevant or not useful for model training
    X = X.drop(["Survived", "PassengerId", "Name", "Embarked", "Ticket", "Cabin"], axis=1, errors="ignore")

    # Encode 'Sex' column as binary: male -> 1, female -> 0
    X["Sex"] = X["Sex"].apply(lambda x: 1 if x == "male" else 0)

    # Fill missing values with the mean of each column
    for col in X.columns:
        X[col] = X[col].fillna(X[col].mean())

    # Add a new binary feature 'Child' where Age < 15 -> 1 (child), otherwise -> 0
    X["Child"] = X["Age"].apply(lambda x: 1 if x < 15 else 0)

    # Add a new binary feature 'Cheap_Fare' where Fare < 18 -> 1 (cheap), otherwise -> 0
    X["Cheap_Fare"] = X["Fare"].apply(lambda x: 1 if x < 18 else 0)

    # If the dataset is for training, compute the mean and standard deviation for scaling
    if train:
        mean = X.mean()
        std = X.std()
        X = (X - mean) / std  # Normalize the features
    else:
        # Apply the mean and standard deviation from the training set
        X = (X - mean) / std

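    # Create a new feature 'Family' that sums 'SibSp' (siblings/spouses) and 'Parch' (parents/children).
    # Note: this runs after scaling, so 'Family' is a sum of standardized values, not a raw head count.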
    X["Family"] = X["SibSp"] + X["Parch"]

    # Drop columns no longer needed after feature engineering
    X = X.drop(["Age", "Fare", "SibSp", "Parch"], axis=1)

    # Return the dataset and mean/std if training, otherwise just the dataset
    if train:
        return X, mean, std
    return X

def load_datasets():
    """
    Load and preprocess the Titanic dataset, separating the features and labels for training and testing.
    
    Returns:
        pd.DataFrame: Preprocessed training features.
        pd.Series: Training labels (Survived column).
        pd.DataFrame: Preprocessed testing features.
        pd.Series: Testing labels (Survived column).
    """
    
    # Load the train and test datasets from CSV files
    x_train = pd.read_csv(os.path.join("titanic_data", "train.csv"))
    x_test = pd.read_csv(os.path.join("titanic_data", "test.csv"))
    
    # Extract the target variable 'Survived' for the training dataset
    y_train = x_train["Survived"]
    
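    # Load the test set labels from gender_submission.csv.
    # Note: this file is Kaggle's sample submission (it predicts that every woman and no man survived),
    # not ground truth, so a score against it measures agreement with that baseline.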
    y_test = pd.read_csv(os.path.join("titanic_data", "gender_submission.csv"))["Survived"]
    
    # Preprocess the training dataset and retrieve the mean and std for normalization
    x_train, mean, std = prepare_dataset(x_train, train=True)
    
    # Preprocess the test dataset using the same mean and std as the training set
    x_test = prepare_dataset(x_test, mean=mean, std=std)

    # Return the preprocessed features and labels for both training and testing
    return x_train, y_train, x_test, y_test

# Load and preprocess the datasets
x_train, y_train, x_test, y_test = load_datasets()


## Feature Selection For Titanic Dataset

To select features for training our model on the Titanic dataset, we first filtered out features that we believed would not help in determining whether a passenger survived. We removed the text features Name and Cabin because we could not find a meaningful way to derive categories from them that would help classify passengers. We tried parsing the names for titles such as Mrs. to tell whether a female passenger was married, but this did not help much. We then removed Ticket and PassengerId, which did not appear to influence a passenger's chance of survival. We had originally planned to keep the Embarked feature, but after one-hot encoding it our performance on the test set dropped by around 4-5%. We believe the extra columns introduced by the encoding caused the model to overfit the training set.
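
As a rough sketch of the two experiments described above (title parsing and one-hot encoding), assuming a few made-up rows in the same `Last, Title. First` format as the Name column. The `Title` column and the sample values are hypothetical and are not part of the final pipeline:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
sample = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann"],
    "Embarked": ["S", "C", "S"],
})

# Capture the title that sits between the comma and the first period
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# One-hot encode Embarked: each distinct port becomes its own indicator column
encoded = pd.get_dummies(sample, columns=["Embarked"], prefix="Embarked")

print(sample["Title"].tolist())
print(list(encoded.columns))
```

Each distinct value of the encoded column adds one indicator column, which is how one-hot encoding increases the feature count.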

We kept the following attributes to train our model: Pclass, Sex, Age, SibSp, Parch, and Fare. We kept the passenger class because we believed that, unfortunately, passengers in higher classes were more likely to be offered a place in a lifeboat and therefore to survive the sinking. We chose Sex because women and children were generally given priority for the lifeboats. For the same reason we created a Child feature, a binary feature that is 1 if the passenger is younger than 15 and 0 otherwise. By giving the model a discrete feature that directly says whether a passenger was a child, we hoped it would more easily pick up the relationship between age and survival. We then dropped the Age feature because we found that the survival rate of passengers older than 15 was roughly the same regardless of age, so Age provided no further discriminatory information.
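
One way to check this kind of claim is to compare survival rates between the two groups. The following is a minimal sketch on made-up rows (the values are illustrative, not results from the real data), using a vectorized form of the same `Age < 15` rule:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
df = pd.DataFrame({
    "Age": [4, 10, 22, 35, 60, 14],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Vectorized equivalent of the Child flag: 1 if Age < 15, else 0
df["Child"] = (df["Age"] < 15).astype(int)

# The mean of Survived within each group is that group's survival rate
rate_by_child = df.groupby("Child")["Survived"].mean()
print(rate_by_child)
```

A missing Age compares as False against the threshold, so in this form a passenger with unknown age would be flagged as 0 unless the gap is filled first, as the preprocessing cell does.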

SibSp and Parch give the number of siblings or spouses and the number of parents or children, respectively, that a passenger had on board with them. We believed this information would be very useful to the model, in particular because we could combine the two into a custom feature called Family, the number of family members a passenger had on board. The idea is that a person with family members on board might be more likely to advocate for those family members to take a lifeboat seat instead of themselves. After creating Family, we dropped the SibSp and Parch columns, since we believed Family captured the bulk of the information the two features had to offer.
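
In the preprocessing cell, the sum is taken after SibSp and Parch have been standardized. If Family is meant to be a literal head count, one possible alternative is to add the raw columns first and scale afterwards. This is only a sketch of that ordering on made-up values, not the code that produced the scores in this notebook:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
raw = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})

# Head count of relatives on board, taken from the unscaled columns
raw["Family"] = raw["SibSp"] + raw["Parch"]

# Standardize afterwards so Family is scaled like any other feature
family_scaled = (raw["Family"] - raw["Family"].mean()) / raw["Family"].std()

print(raw["Family"].tolist())
```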

The last feature we used was the ticket fare. We found that passengers who had paid more for their tickets had a greater chance of survival. This is likely because a more expensive ticket usually means a higher class, which we believe correlates with a higher survival rate. In the preprocessing code the fare enters the model as the binary Cheap_Fare feature, which is 1 if Fare is below 18 and 0 otherwise, and the raw Fare column is dropped.
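
The assumed link between fare and class can be inspected by looking at how the cheap-fare flag is distributed across classes. A minimal sketch on made-up rows (the numbers are illustrative and not drawn from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 52.0, 21.0, 7.9, 8.05, 15.5],
})

# Same threshold as the preprocessing cell: 1 if Fare < 18, else 0
df["Cheap_Fare"] = (df["Fare"] < 18).astype(int)

# Share of cheap tickets within each passenger class
cheap_share = df.groupby("Pclass")["Cheap_Fare"].mean()
print(cheap_share)
```

If the flag turns out to be almost fully determined by Pclass on the real data, it adds little beyond the class feature itself.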

In [32]:
from sklearn.linear_model import LogisticRegression

# Create an instance of the LogisticRegression model and fit it to the training data
clf = LogisticRegression().fit(x_train, y_train)

# Print the accuracy score of the model on the training set
print(clf.score(x_train, y_train))

# Print the accuracy score of the model on the test set
print(clf.score(x_test, y_test))


0.8103254769921436
0.9569377990430622
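
The second score should be read with care: `gender_submission.csv` is Kaggle's sample submission, which predicts survival from sex alone, so it is not a set of true labels for `test.csv`. One way to get an honest estimate is to hold out part of `train.csv`, where the real labels are known. The sketch below shows the idea on synthetic data. The `holdout_accuracy` helper, its parameters, and the demo arrays are hypothetical and not part of this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_accuracy(X, y, test_size=0.2, seed=0):
    """Fit on one part of the labeled data and score on the held-out part."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model = LogisticRegression().fit(X_fit, y_fit)
    return model.score(X_val, y_val)


# Synthetic stand-in for the preprocessed features and labels (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

acc = holdout_accuracy(X_demo, y_demo)
print(acc)
```

With the real data, the split would need to happen before `prepare_dataset` is called, so that the imputation means and the scaling statistics come only from the rows used for fitting.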
