# Preparation
Done by Lander Jacobs

This notebook is designed for use within SageMaker. It will download the Titanic dataset from Kaggle and then clean it according to the steps outlined in the previous notebook ([1_0_EDA](./1_0_EDA.ipynb)).

## Installation required for SageMaker
To download the dataset required for this project, we first need to install the Kaggle command-line client in the SageMaker environment.

In [1]:
!conda install -c conda-forge kaggle -y

Retrieving notices: ...working... done
Channels:
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.11.0

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - kaggle


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    kaggle-1.6.17              |     pyhd8ed1ab_0          77 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          77 KB

The following NEW packages will be INSTALLED:

  kaggle             conda-forge/noarch::kaggle-1.6.17-pyhd8ed1ab_0 

The following packages will be UPDATED:

  openssl                                  3

## Imports

In [2]:
import pandas as pd
import os
import json
from sklearn.model_selection import train_test_split

## Download data

You will need your own API key to access the Kaggle API; you can generate one for free from your Kaggle account. Once the key has been written to `~/.kaggle/kaggle.json`, later runs typically pick it up automatically, so you do not need to re-enter it.

In [3]:
kaggle_api_token = {"username":"","key":""}

os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Write the credentials to ~/.kaggle/kaggle.json
with open(os.path.expanduser("~/.kaggle/kaggle.json"), "w") as file:
    json.dump(kaggle_api_token, file)

# Set the permissions to secure the file
os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)
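
To avoid leaving a key in the notebook itself, the same steps can be wrapped in a small helper that receives the credentials as arguments. This is only a minimal sketch: the function name `write_kaggle_token` and its `kaggle_dir` parameter are hypothetical, and it assumes the credentials are available from somewhere outside the notebook, for example the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.

```python
import json
import os


def write_kaggle_token(username, key, kaggle_dir="~/.kaggle"):
    """Write kaggle.json with owner-only permissions and return its path.

    Hypothetical helper: mirrors the cell above, but refuses empty credentials.
    """
    if not username or not key:
        raise ValueError("Kaggle username and key must both be non-empty")

    kaggle_dir = os.path.expanduser(kaggle_dir)
    os.makedirs(kaggle_dir, exist_ok=True)

    path = os.path.join(kaggle_dir, "kaggle.json")
    with open(path, "w") as file:
        json.dump({"username": username, "key": key}, file)

    # Restrict the file to the current user (read/write only)
    os.chmod(path, 0o600)
    return path
```

It could then be called as `write_kaggle_token(os.environ["KAGGLE_USERNAME"], os.environ["KAGGLE_KEY"])`, assuming both variables are set in the environment.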

Once Kaggle is set up, you can easily download the dataset and load it into a pandas dataframe.

In [4]:
!kaggle datasets download -d marcpaulo/titanic-huge-dataset-1m-passengers --unzip

data = pd.read_csv("huge_1M_titanic.csv", header=0)
df = data.copy()
df.head()

Dataset URL: https://www.kaggle.com/datasets/marcpaulo/titanic-huge-dataset-1m-passengers
License(s): apache-2.0
Downloading titanic-huge-dataset-1m-passengers.zip to /home/ec2-user/SageMaker/titanic_project
 68%|█████████████████████████▊            | 17.0M/25.0M [00:00<00:00, 82.2MB/s]
100%|██████████████████████████████████████| 25.0M/25.0M [00:00<00:00, 88.5MB/s]


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1310 | 1 | 1 | Name1310, Miss. Surname1310 | female | NaN | 0 | 0 | SOTON/O2 3101272 | 76.760165 | NaN | C |
| 1 | 1311 | 0 | 3 | Name1311, Col. Surname1311 | male | 29.0 | 0 | 0 | 223596 | 10.193097 | NaN | S |
| 2 | 1312 | 0 | 3 | Name1312, Mr. Surname1312 | male | 20.0 | 0 | 0 | 54636 | 12.029416 | C83 | C |
| 3 | 1313 | 0 | 3 | Name1313, Mr. Surname1313 | male | 27.0 | 0 | 0 | PC 17760 | 13.429448 | NaN | S |
| 4 | 1314 | 0 | 3 | Name1314, Mr. Surname1314 | male | 32.0 | 0 | 0 | 364512 | 4.840769 | E33 | C |


## Clean data

In [None]:
# We will only use these columns for this project
used_cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
df = df[used_cols]
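
For a file of this size, the column selection can also happen while reading the CSV, so the unused columns are never loaded. The sketch below illustrates this with the `usecols` parameter of `pd.read_csv`; the inline CSV text is a made-up stand-in for `huge_1M_titanic.csv`.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (illustrative values only)
csv_text = (
    "PassengerId,Survived,Pclass,Name\n"
    "1310,1,1,x\n"
    "1311,0,3,y\n"
)

wanted = ["Survived", "Pclass"]

# usecols restricts parsing to the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.shape)
```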

We convert the column names to lowercase for simplicity and ease of use.

In [6]:
lower_used_cols = [x.lower() for x in used_cols]

new_column_names = {x: y for x,y in zip(used_cols,lower_used_cols)}

used_cols = lower_used_cols

df = df.rename(columns=new_column_names)

df.head()

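|   | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | NaN | 0 | 0 | 76.760165 | C |
| 1 | 0 | 3 | male | 29.0 | 0 | 0 | 10.193097 | S |
| 2 | 0 | 3 | male | 20.0 | 0 | 0 | 12.029416 | C |
| 3 | 0 | 3 | male | 27.0 | 0 | 0 | 13.429448 | S |
| 4 | 0 | 3 | male | 32.0 | 0 | 0 | 4.840769 | C |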
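
The same renaming can be written more compactly with the pandas string accessor on the column index. A minimal sketch on a toy dataframe (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({"Survived": [1], "Pclass": [3], "SibSp": [0]})

# Lowercase every column name in one step
sample.columns = sample.columns.str.lower()
print(list(sample.columns))
```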


Here, we remove the rows that are considered outliers or contain empty values that cannot be used.

In [7]:
# drop the rows with empty values
df = df.dropna(subset=["age"])
df = df.dropna(subset=["embarked"])

# set the upper limits for the columns that need them, as discussed in the previous notebook
max_age = 70
max_sibsp = 4
max_parch = 2
max_fare = 37

# keep only the rows whose values fall within the limits
# (note: fare uses a strict "<", the other columns use "<=")
df = df[df["age"] <= max_age]
df = df[df["sibsp"] <= max_sibsp]
df = df[df["parch"] <= max_parch]
df = df[df["fare"] < max_fare]
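
The filtering steps above can also be expressed as one combined boolean mask, which makes it easy to report how many rows were dropped. The following is a sketch under assumptions: the helper `filter_outliers` and the toy dataframe are hypothetical, while the limits are copied from the cell above.

```python
import pandas as pd

# Upper limits copied from the cell above
LIMITS = {"age": 70, "sibsp": 4, "parch": 2}
MAX_FARE = 37  # fare uses a strict "<" in the cell above


def filter_outliers(frame):
    """Drop rows with missing age/embarked, then rows outside the limits."""
    frame = frame.dropna(subset=["age", "embarked"])
    mask = frame["fare"] < MAX_FARE
    for column, limit in LIMITS.items():
        mask &= frame[column] <= limit
    return frame[mask]


# Toy data: one valid row, one missing age, one age outlier, one sibsp outlier
sample = pd.DataFrame({
    "age": [29.0, None, 80.0, 20.0],
    "sibsp": [0, 0, 0, 5],
    "parch": [0, 0, 0, 0],
    "fare": [10.0, 5.0, 5.0, 5.0],
    "embarked": ["S", "C", "S", "Q"],
})

cleaned = filter_outliers(sample)
print(f"kept {len(cleaned)} of {len(sample)} rows")  # kept 1 of 4 rows
```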

We also encode the categorical data into numerical values so that the models can use them for training.

In [8]:
# Encode the categorical variables as integer codes so that the models can train on them
df["sex"] = df["sex"].astype("category").cat.codes
df["embarked"] = df["embarked"].astype("category").cat.codes

df.head()

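|   | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 1 | 29.0 | 0 | 0 | 10.193097 | 2 |
| 2 | 0 | 3 | 1 | 20.0 | 0 | 0 | 12.029416 | 0 |
| 3 | 0 | 3 | 1 | 27.0 | 0 | 0 | 13.429448 | 2 |
| 4 | 0 | 3 | 1 | 32.0 | 0 | 0 | 4.840769 | 0 |
| 5 | 1 | 3 | 0 | 0.0 | 0 | 0 | 14.805817 | 2 |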
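
Because `cat.codes` replaces the labels with integers, it can help to record which code stands for which label before overwriting the column. A minimal sketch on toy data: the `mappings` dictionary is a hypothetical addition, and the codes in the real dataset depend on which labels are present after cleaning.

```python
import pandas as pd

sample = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "embarked": ["S", "C", "Q"],
})

mappings = {}
for column in ["sex", "embarked"]:
    categories = sample[column].astype("category")
    # Codes follow the position of each label in the sorted category list
    mappings[column] = dict(enumerate(categories.cat.categories))
    sample[column] = categories.cat.codes

print(mappings)
```

For string labels the categories are sorted alphabetically, which matches the output above, where `male` became 1 and the ports `C` and `S` became 0 and 2.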


Next, we will split the cleaned dataset into training and test sets to train all of our models. This ensures that each model is built using the same training set and can be evaluated using the same test set.

In [9]:
# all input features (numerical and encoded categorical)
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
label = 'survived'

x = df[features]
y = df[label]

# stratify on the label so both sets keep the same survival rate
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

x_train.to_csv('x_train.csv', index=False)
x_test.to_csv('x_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
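
A quick way to confirm that the stratified split behaves as intended is to compare the share of positive labels in both sets. The sketch below uses a synthetic dataframe as a stand-in for the cleaned data, so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataframe: 30% positive labels
toy = pd.DataFrame({"fare": range(100), "survived": [1] * 30 + [0] * 70})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy[["fare"]],
    toy["survived"],
    test_size=0.2,
    random_state=42,
    stratify=toy["survived"],
)

# With stratification, both rates should match the overall rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(x_tr), len(x_te), train_rate, test_rate)
```

When the saved files are loaded again in a later notebook, note that `pd.read_csv('y_train.csv')` returns a dataframe with a single `survived` column rather than a series, so the column has to be selected explicitly.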