- [Week 3] AI Saturdays - AI Developers, Boise
- June 2, 2018
- Prepared by: Ashish Sharma <accssharma@gmail.com>

## Titanic: Machine Learning from Disaster (Kaggle)

### Competition Detail

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Viewing Titanic as a Machine Learning problem

**Definition 1:**

`Machine learning is the science of getting computers to act without being explicitly programmed.`


**Definition 2:** 

`A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.`

#### Task (T)
- Binary Classification Problem (passengers survived or not)

#### Experience (E)
- passengers: “features” like passengers’ gender and class. [input (X)]
- did passenger survive the disaster: yes or no [output (y)]

#### Performance (P)
- Accuracy: the fraction of passengers whose survival is predicted correctly

#### Mathematical Interpretation
- Just a simple function approximation i.e. find a relationship between X and y.
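
To make the performance measure P concrete, the sketch below assumes P is classification accuracy, which is also what `clf.score` reports later in this notebook. The `accuracy` helper and the toy labels are hypothetical and only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# toy labels (1 = survived, 0 = did not survive); not real Titanic data
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy(y_true, y_pred)
print(acc)  # 4 of the 5 predictions match
```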

### Dataset

#### The training set (train.csv)
- used to build/train your machine learning models.
- Along with the input data, we also have the outcome (also known as the “ground truth”) for each passenger. 
- Input (X): “features” like passengers’ gender and class. 
- Output (y): whether passenger survived

#### The test set 
- used to see how well your model performs on unseen data.
- For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
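
As a rough sketch of what the test-set predictions end up as: Kaggle expects a CSV with a `PassengerId` column and a `Survived` column. The ids and predictions below are toy values standing in for `df_test["PassengerId"]` and the model output; they are assumptions for illustration, not results.

```python
import pandas as pd

# toy stand-ins for df_test["PassengerId"] and the model's 0/1 predictions
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# with no path, to_csv returns the CSV text; pass a file path to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```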

<img src="imgs/kaggle_titanic_data.png" height=550 width=900/>

### Know about different types of data
- [General data types](https://towardsdatascience.com/data-types-in-statistics-347e152e8bee)
- [Tutorial reference](https://www.kaggle.com/startupsci/titanic-data-science-solutions)

## Import necessary packages

In [None]:
# data analysis, manipulation
import pandas as pd
import numpy as np
from collections import Counter

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set(rc={'figure.figsize':(9,6)})

# machine learning
from sklearn.linear_model import LogisticRegression

## Load data

In [None]:
train_csv = "/home/asharma/.kaggle/competitions/titanic/train.csv"
test_csv = "/home/asharma/.kaggle/competitions/titanic/test.csv"

df_train = pd.read_csv(train_csv)
# quickly peek through a few of the training data
df_train.head()

In [None]:
df_test = pd.read_csv(test_csv)
df_test.head()

### Exploratory Data Analysis

- Let's quickly check the types and some basic statistics of the variables (columns) in the dataset

In [None]:
df_train.info()
print("-"*40)
df_test.info()

**What is the distribution of numerical feature values across the samples?**

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

- Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
- Survived is a categorical feature with 0 or 1 values.
- Around 38% of the samples survived, roughly representative of the actual survival rate of 32%.
- Most passengers (> 75%) did not travel with parents or children.
- Nearly 30% of the passengers had siblings and/or spouse aboard.
- Fares varied significantly with few passengers (<1%) paying as high as $512.
- Few elderly passengers (<1%) within age range 65-80.

**What is the distribution of categorical features?**

- Names are unique across the dataset (count=unique=891)
- Sex variable has two possible values with 65% male (top=male, freq=577/count=891).
- Cabin values have several duplicates across samples. In other words, several passengers shared a cabin.
- Embarked takes three possible values. S port used by most passengers (top=S)
- Ticket feature has high ratio (22%) of duplicate values (unique=681).
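
The `count`, `unique`, `top`, and `freq` figures quoted above are what `describe` reports for non-numeric columns. A minimal sketch on toy rows (not the real Titanic data), selecting the string columns explicitly:

```python
import pandas as pd

# toy rows, not the real Titanic data
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# describe() on non-numeric columns reports count, unique, top and freq
summary = df[["Sex", "Embarked"]].describe()
print(summary)
```

On the notebook's data, the same call on the string columns of `df_train` should give the figures listed above.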

In [None]:
# descriptive statistics of training data
df_train.describe(include=['number'])

In [None]:
# descriptive statistics of training data
df_test.describe(include=['number'])

**Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.

- Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

**Which features are numerical?**

These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.

- Continuous: Age, Fare. Discrete: SibSp, Parch.

**Which features are mixed data types?**

Numerical and alphanumeric data within the same feature. These are candidates for the correcting goal.

- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.

**Which features may contain errors or typos?**

This is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting.

- Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.

**Which features contain blank, null or empty values?**

These will require correcting.

- Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
- Cabin > Age are incomplete in case of test dataset.

**What are the data types for various features?**

This helps us during the converting goal.

- Seven features are integers or floats. Six in the case of the test dataset.
- Five features are strings (object).

In [None]:
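num_samples = len(df_train)
# number of passengers in the sample with Survived == 1
num_survived = df_train["Survived"].value_counts()[1]
# share of everyone on board (2224) covered by the training set
print("Dataset: {:.1%} of the population".format(num_samples / 2224))
# survival rate in our sample
print("Sample survival rate: {:.1%}".format(num_survived / num_samples))
# in reality, 492 passengers and 214 crew were saved out of 2224 on board
print("True survival rate: {:.1%}".format(706 / 2224))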

In [None]:
# fraction of passengers with no parents/children or no siblings/spouses aboard
df_train[(df_train["Parch"] == 0) | (df_train["SibSp"] == 0)].shape[0]/num_samples

In [None]:
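# fraction of all passengers who survived and had no parents/children or no siblings/spouses aboard
# (parentheses are needed because & binds more tightly than |)
df_train[((df_train["Parch"] == 0) | (df_train["SibSp"] == 0)) & (df_train["Survived"] == 1)].shape[0]/num_samples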

### Visualization

### Count plot - frequency plot of different possible values

In [None]:
sns.countplot(x="Survived", data=df_train)

In [None]:
sns.countplot(x="SibSp", data=df_train)

In [None]:
sns.countplot(x="Parch", data=df_train)

### Histogram - frequency distribution

In [None]:
g = sns.FacetGrid(df_train, col='Survived')
g.map(plt.hist, 'Age', bins=20)

#### Box plot - Numerical continuous variables 

In [None]:
def plot_boxplot(_data):
    d = sns.boxplot(x=_data)

sns.boxplot(data=df_train[["Age", "Fare"]], orient='h')

In [None]:
plot_boxplot(df_train["Fare"])

### Bar plots - relationship between different variables

In [None]:
sns.barplot(x="Survived", y="Age", hue="Pclass", data=df_train);

In [None]:
sns.barplot(x="Survived", y="Sex", data=df_train);

### Facetgrid

In [None]:
# seaborn 0.9 renamed the `size` argument to `height`
grid = sns.FacetGrid(df_train, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

In [None]:
grid = sns.FacetGrid(df_train, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Fare', alpha=.5, bins=20)
grid.add_legend();

In [None]:
grid = sns.FacetGrid(df_train, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()

### Scatterplot - relationship between two continuous variables

In [None]:
sns.regplot(x=df_train["Age"], y=df_train["Fare"])

## Observations and Decisions

- Consider Age in our model training - handle missing data
    - Most passengers are in 15-35 age range.
    - Oldest passengers (Age = 80) survived.
    - Large number of 15-25 year olds did not survive.
    - Should we band age groups?
- Consider Pclass for model training
     - Pclass=3 had most passengers, however most did not survive. 
     - Infant passengers in Pclass=2 and Pclass=3 mostly survived
     - Most passengers in Pclass=1 survived
     - Pclass varies in terms of Age distribution of passengers
- Consider Fare
    - Higher fare paying passengers had better survival. 
- Consider Sex
    - Female passengers had much better survival rate than males
- Consider Parch and SibSp
    - More than 80% of people who survived did not travel with a parent, child, sibling, or spouse
- We drop Cabin column
    - More than 75% of data is missing
- Consider Embarked - handle missing data
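
Observations like these can be checked numerically with `groupby`, and the age-banding question can be explored with `pd.cut`. The sketch below uses toy rows and hypothetical band edges (0, 16, 32, 48, 64); neither comes from the notebook's data.

```python
import pandas as pd

# toy rows, not the real Titanic data
df = pd.DataFrame({
    "Age": [4, 19, 23, 31, 45, 62],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# survival rate per sex: the mean of a 0/1 column is the fraction of 1s
rate_by_sex = df.groupby("Sex")["Survived"].mean()

# band Age into the intervals (0, 16], (16, 32], (32, 48], (48, 64]
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 16, 32, 48, 64])
rate_by_band = df.groupby("AgeBand", observed=False)["Survived"].mean()

print(rate_by_sex)
print(rate_by_band)
```

Replacing the toy frame with `df_train` (before `Survived` is dropped) would give the corresponding rates for the real sample.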

## Handling missing data
- We need to handle missing values for Age and Embarked

In [None]:
# look at the column-wise missing values count
df_train.isnull().sum()

- Embarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.

In [None]:
freq_port = df_train.Embarked.dropna().mode()[0]
freq_port

In [None]:
df_train.Embarked= df_train.Embarked.fillna(freq_port)
df_train.isnull().sum()

- Age is a continuous numerical feature. Even though filling every gap with a single value is a crude choice, for simplicity we impute the missing values with the mean

In [None]:
ag_mean_tr = df_train.Age.dropna().mean()
df_train.Age = df_train.Age.fillna(ag_mean_tr)
df_train.isnull().sum()

- Similarly, impute missing Age values in test data with mean

In [None]:
ag_mean_tst = df_test.Age.dropna().mean()
df_test.Age = df_test.Age.fillna(ag_mean_tst)
df_test.isnull().sum()

- Similarly, impute Fare with the mean value in the test data

In [None]:
fare_mean_tst = df_test.Fare.dropna().mean()
df_test.Fare = df_test.Fare.fillna(fare_mean_tst)
df_test.isnull().sum()

In [None]:
df_train.info()
print("-"*40)
df_test.info()

## Prepare Dataset

#### Convert categorical features to numerical features
- one-hot encoding
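
A small sketch of what one-hot encoding produces, and of one pitfall: if a category is missing from the test data, `pd.get_dummies` yields fewer columns there. The toy frames and the `reindex` alignment step are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# toy columns; the test frame happens to lack Embarked == "Q"
train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Embarked": ["S", "C"]})

# each category becomes its own indicator column
train_dummies = pd.get_dummies(train, columns=["Embarked"])
test_dummies = pd.get_dummies(test, columns=["Embarked"])

# reuse the training columns so both frames have identical features;
# categories absent from the test frame become all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_dummies)
```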

In [None]:
categorical_features = ["Pclass", "Sex", "Embarked"]

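def handle_categorical_data(df):
    # cast each categorical column to the pandas "category" dtype
    # (Series.iteritems was removed in pandas 2.0, so loop over the names directly)
    for col in categorical_features:
        if col in df.columns:
            df[col] = df[col].astype("category")
    return df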

In [None]:
categorical_X = handle_categorical_data(df_train).select_dtypes(include="category")
categorical_X.head()

In [None]:
# similarly for test data
categorical_X_test = handle_categorical_data(df_test).select_dtypes(include="category")
categorical_X_test.head()

In [None]:
categorical_X.columns

In [None]:
final_cat_to_num = pd.get_dummies(categorical_X)
final_cat_to_num.head()

In [None]:
final_cat_to_num.shape

In [None]:
final_cat_to_num_test = pd.get_dummies(categorical_X_test)
final_cat_to_num_test.head()

In [None]:
final_cat_to_num_test.shape

#### Numerical features

In [None]:
y = df_train["Survived"].copy()
# X_test  = test_df.drop("PassengerId", axis=1).copy()

In [None]:
df_train = df_train.drop(["Survived"], axis=1)

In [None]:
def get_numerical_data(df):
    numerical_X = df.select_dtypes("number")
    if "PassengerId" in numerical_X.columns:
        numerical_X = numerical_X.drop(["PassengerId"], axis=1)
    return numerical_X

In [None]:
numerical_X = get_numerical_data(df_train)
numerical_X.head()

In [None]:
numerical_X.shape

In [None]:
# do same thing for test data
numerical_X_test = get_numerical_data(df_test)
numerical_X_test.head()

In [None]:
numerical_X_test.shape

In [None]:
X_train  = pd.concat([numerical_X, final_cat_to_num], axis=1)
y_train = y

# verify that we have same number of examples in X_train and y
assert X_train.shape[0] == len(y)

X_test = pd.concat([numerical_X_test, final_cat_to_num_test], axis=1)

print("X train shape: ", X_train.shape)
print("y train shape: ", y_train.shape)
print("X test shape: ", X_test.shape)

# verify that we have same number of features in X_train and X_test
assert X_train.shape[1] == X_test.shape[1]

In [None]:
X_train.head()

In [None]:
y.head()

In [None]:
X_test.head()

## Building a machine learning model

### Logistic Regression

- Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. In the simplest form, the outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

- probabilistic, parametric model
    - The logistic regression model is parametric because it has a finite set of parameters. Specifically, the parameters are the regression coefficients. These usually correspond to one for each predictor plus a constant. Logistic regression is a particular form of the generalised linear model.
    - Logistic regression is probabilistic because its output is an estimate of P(Y=1), the probability of the event occurring

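- Logistic regression models the log-odds of the outcome as a linear function of the predictors. The logistic (sigmoid) function then maps that value to a probability between 0 and 1, so the predicted probability is a non-linear function of the inputs.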
- The independent variables do not need to be multivariate normal – although multivariate normality yields a more stable solution.
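
The relationship between coefficients, log-odds, and probability can be sketched in a few lines. The `predict_proba` helper and the coefficient values are hypothetical, chosen only for illustration; they are not fitted values from this notebook.

```python
import math

def predict_proba(features, coefficients, intercept):
    """P(Y=1 | x) = 1 / (1 + exp(-(intercept + coefficients . features)))."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical coefficients, not fitted values
coefficients = [2.0, -1.0]
intercept = -0.5

p_zero = predict_proba([0.25, 0.0], coefficients, intercept)  # log-odds z = 0
p_high = predict_proba([2.0, 0.5], coefficients, intercept)   # log-odds z = 3
print(p_zero, p_high)

# the log-odds recovered from a probability is the linear term z
log_odds = math.log(p_high / (1.0 - p_high))
print(log_odds)
```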

#### Assumptions
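- It does not need a linear relationship between the dependent and independent variables (linear regression, particularly when solved with the OLS method, does). It does assume a linear relationship between the independent variables and the log-odds of the outcome.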
- Logistic regression assumes that P(Y=1) is the probability of the event occurring
- The model should be fitted correctly (concepts of overfitting, underfitting)
- Logistic regression requires each observation to be independent
- Larger datasets generally give more reliable estimates
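
One common way to check for overfitting is to hold out part of the labelled data and compare accuracy on the fitted rows with accuracy on the held-out rows. The sketch below uses synthetic data as a stand-in for `X_train` / `y_train`; the 25% split and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for X_train / y_train (not the Titanic data)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# hold out 25% of the labelled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_fit, y_fit)

# a large gap between these two numbers would suggest overfitting
train_acc = model.score(X_fit, y_fit)
val_acc = model.score(X_val, y_val)
print("training accuracy:", train_acc)
print("validation accuracy:", val_acc)
```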

In [None]:
# A very simple implementation of Logistic Regression as Black-box

# create an instance of logistic regression
clf = LogisticRegression()

# train the model
clf.fit(X_train, y_train)

# make a prediction
Y_pred = clf.predict(X_test)
Y_pred

In [None]:
# make a prediction
Y_pred_prob = clf.predict_proba(X_test)
#Y_pred_prob

In [None]:
y.head()

In [None]:
# Accuracy on the training data (an optimistic estimate of performance on unseen data)
clf.score(X_train, y_train)

In [None]:
# Another example: logistic regression fit vs. linear regression fit (visualization in 2D)

In [None]:
print(__doc__)


# Code source: Gael Varoquaux
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model

# this is our test set, it's just a straight line with some
# Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
# np.float was removed in NumPy 1.24; the builtin float is equivalent
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)

X = X[:, np.newaxis]
# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)

# and plot the result
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.scatter(X.ravel(), y, color='black', zorder=20)
X_test = np.linspace(-5, 10, 300)


def model(x):
    return 1 / (1 + np.exp(-x))
loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
plt.plot(X_test, loss, color='red', linewidth=3)

ols = linear_model.LinearRegression()
ols.fit(X, y)
plt.plot(X_test, ols.coef_ * X_test + ols.intercept_, linewidth=1)
plt.axhline(.5, color='.5')

plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression Model'),
           loc="lower right", fontsize='small')
plt.show()