# Titanic Notebook
## Data Preparation

In this notebook, we split the cleaned Titanic data into three sets: (1) train, (2) validation, and (3) test.

In [4]:
# Import libraries
import pandas as pd
# The train_test_split function randomly samples rows from the
# full data set and splits them into two subsets
from sklearn.model_selection import train_test_split

In [5]:
# Import the data
titanicDf = pd.read_csv('./titanic_cleaned.csv')
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Cabin_ind
0,0,3,0,22.0,1,0,7.25,S,0
1,1,1,1,38.0,1,0,71.2833,C,1
2,1,3,1,26.0,0,0,7.925,S,0
3,1,1,1,35.0,1,0,53.1,S,1
4,0,3,0,35.0,0,0,8.05,S,0


In [7]:
# Separate the data into two objects: (1) a DataFrame of features and (2) a Series holding the label
label = titanicDf['Survived']
features = titanicDf.drop('Survived', axis=1)
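
The `Unnamed: 0` column in the preview above is most likely a row index written out by an earlier `to_csv` call (an assumption, since the cleaning step is not shown here). As written, it stays in `features`. A minimal sketch of one way to keep it out, using the five preview rows as inline data in place of the real file; the names `csv_text`, `sample_df`, `sample_label`, and `sample_features` are illustrative only:

```python
import io

import pandas as pd

# The five rows shown by titanicDf.head(), inlined so the sketch is self-contained.
# The empty first header field is how a saved index typically appears in a CSV.
csv_text = """,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Cabin_ind
0,0,3,0,22.0,1,0,7.25,S,0
1,1,1,1,38.0,1,0,71.2833,C,1
2,1,3,1,26.0,0,0,7.925,S,0
3,1,1,1,35.0,1,0,53.1,S,1
4,0,3,0,35.0,0,0,8.05,S,0
"""

# index_col=0 makes pandas treat the first column as the index,
# so it never shows up as an "Unnamed: 0" feature
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

sample_label = sample_df['Survived']
sample_features = sample_df.drop('Survived', axis=1)
print(list(sample_features.columns))
```

Dropping the column explicitly with `drop('Unnamed: 0', axis=1)` after loading would be an equivalent alternative.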

In [10]:
# In this cell, we create three data sets. However, the train_test_split function
# can only create two splits at a time. Therefore, we first split the data into train
# and test, and then divide that test portion in two: one half for test and one for validation.

# Do train/val/test with 60/20/20
# The first split gives 60/40 for train and test, respectively
trainF, testF, trainL, testL = train_test_split(features, label, test_size=0.4, random_state=42)

# Next, we divide the 40% test portion into two equal data sets: (1) test and (2) validation
testF, valF, testL, valL = train_test_split(testF, testL, test_size=0.5, random_state=42)
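
The two-step split above can be wrapped in a small helper. The sketch below is hypothetical: the name `three_way_split` and the synthetic data are not part of this notebook. It derives the fraction for the second split from the requested validation and test sizes instead of hard-coding 0.5, so ratios other than 60/20/20 also work.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_size=0.2, test_size=0.2, random_state=42):
    """Split X and y into train, validation, and test parts (hypothetical helper)."""
    holdout = val_size + test_size
    # First split: train versus everything that is held out
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=random_state)
    # Second split: divide the held-out part into validation and test.
    # test_size / holdout is the share of the held-out rows that goes to test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (not the Titanic file)
rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
y = pd.Series(rng.integers(0, 2, size=100), name='target')

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```

If the class balance should be preserved in every subset, `train_test_split` also accepts a `stratify` argument (for example `stratify=y` in the first call and `stratify=y_hold` in the second); whether that is appropriate here depends on the modeling goal.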

In [11]:
# Check the ratio of the train, validation, and test datasets
for dataset in [trainF, valF, testF]:
    ratio = round(len(dataset) / len(features), 2)
    print(ratio)

0.6
0.2
0.2


In [12]:
# Export the features and labels for the train/val/test data sets
trainF.to_csv('./train_features.csv', index=False)
valF.to_csv('./validation_features.csv', index=False)
testF.to_csv('./test_features.csv', index=False)

trainL.to_csv('./train_labels.csv', index=False)
valL.to_csv('./validation_labels.csv', index=False)
testL.to_csv('./test_labels.csv', index=False)
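
As a sanity check after exporting, the files can be read back and compared with the in-memory objects. The sketch below assumes a recent pandas version (where `Series.to_csv` writes a header by default) and uses a temporary directory with a tiny made-up frame rather than the notebook's real files; all `demo_*` and `loaded_*` names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up stand-ins for one features/labels pair
demo_features = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': [7.25, 71.2833, 7.925]})
demo_labels = pd.Series([0, 1, 1], name='Survived')

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, 'demo_features.csv')
    labels_path = os.path.join(tmp_dir, 'demo_labels.csv')

    # Same export pattern as above: the index is not written
    demo_features.to_csv(features_path, index=False)
    demo_labels.to_csv(labels_path, index=False)

    # Read both files back while the temporary directory still exists
    loaded_features = pd.read_csv(features_path)
    loaded_labels = pd.read_csv(labels_path)

# Features and labels should have the same number of rows
print(loaded_features.shape, loaded_labels.shape)
```

Because `index=False` discards the original row index, the feature and label files stay aligned only by row order, so neither file should be shuffled or filtered on its own.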