# Kaggle Competition: Titanic

The Titanic problem is the `Hello World!` equivalent in the Data Science/Machine Learning field: given some data about a passenger, predict whether that passenger survived or died. This notebook presents, step by step, the construction of a model suitable for tackling the Titanic problem.

In [1]:
# Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn as skl
import xgboost as xgb
import tensorflow as tf

## Data Exploration

First, let's explore the available data to get a better understanding of it. We can perform this task easily by loading the data into a pandas DataFrame object. Then, when we invoke the `info` method, pandas summarizes the current state of our data for us.

In [2]:
# Load the training dataset into a pandas DataFrame
data = pd.read_csv("./data/train.csv")

In [3]:
# Print information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can see that each entry in our data corresponds to one passenger on the Titanic and consists of 12 fields. Furthermore, note that some fields contain missing values (Age, Cabin, and Embarked): their non-null counts fall below the 891 entries in the dataset. This can also be easily viewed with the following snippet:

In [4]:
# Snippet: Get columns with missing values in a DataFrame
# df.columns[df.isnull().any()]

# Get a Series object containing a mask where True corresponds to a column with missing values
columns_mask = data.isnull().any()
# Get an Index object containing the columns in our data
columns_index = data.columns
# Apply the mask to get a new Index object containing only the columns with missing values
columns_with_missing = columns_index[columns_mask]
print(columns_with_missing)

Index(['Age', 'Cabin', 'Embarked'], dtype='object')
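
Knowing which columns have gaps is only half of the picture; how much is missing matters too. From the `info` output above, Age lacks 891 - 714 = 177 values, Cabin lacks 891 - 204 = 687 (roughly 77% of the entries), and Embarked lacks only 2. The sketch below shows one possible way to compute such a summary with `isnull().sum()`. To stay self-contained it runs on a small toy DataFrame with made-up values (an assumption, not the real dataset), and `missing_summary` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values, NOT the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})


def missing_summary(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    # Number of missing values in every column
    counts = df.isnull().sum()
    # Keep only the columns that actually have missing values
    counts = counts[counts > 0]
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(1),
    })
    # Most incomplete columns first
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

In the notebook itself, the same helper could presumably be called as `missing_summary(data)`. A column that is mostly empty, as Cabin appears to be, usually calls for a different treatment than one missing only a couple of values.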


Now, to get a better picture of the problem's setting, we need to explore the data in more detail by asking ourselves questions about it and answering them with the information we have available.