## Titanic: Machine Learning from Disaster

### Overview

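Titanic is a rite of passage for every Kaggler. This walkthrough treats it as a complete machine learning workflow and covers the following steps:

    1) Problem analysis
    2) Data preparation
    3) Feature engineering
    4) Model building
    5) Model ensembling
    6) System optimization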


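### Step 1: Problem Analysis

For a description of the Titanic competition, see the official Kaggle page, quoted below. It is a basic binary classification problem.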

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


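### Step 2: Data Preparation

First, load the required libraries: pandas for data handling, scikit-learn and xgboost for modeling, numpy and scipy for numerical computing, and matplotlib and seaborn for plotting.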

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# system parameters
import sys
print("Python version: {}". format(sys.version))

# functions for data processing and analysis
import pandas as pd
print("pandas version: {}". format(pd.__version__))

# machine learning algorithms
import sklearn
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn import feature_selection, model_selection, metrics
import xgboost
from xgboost import XGBClassifier
print("scikit-learn version: {}". format(sklearn.__version__))
print("xgboost version: {}". format(xgboost.__version__))

# scientific computing
import numpy as np
import scipy as sp
print("NumPy version: {}". format(np.__version__))
print("SciPy version: {}". format(sp.__version__)) 

# data visualization
from pandas.plotting import scatter_matrix
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import IPython
from IPython import display
print("matplotlib version: {}". format(mpl.__version__))
print("seaborn version: {}". format(sns.__version__))
print("IPython version: {}". format(IPython.__version__))
# Configure Visualization Defaults
# %matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use("ggplot")
sns.set_style("white")
pylab.rcParams["figure.figsize"] = 12,8

# misc libraries
import random
import time

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

print('-'*25)
# Input data files are available in the "datasets/titanic/" directory.
# Listing all files
# os.listdir works on every platform, unlike shelling out to "ls"
import os
data_dir = "datasets/titanic/"
print("\n".join(sorted(os.listdir(data_dir))))


Python version: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.23.4
scikit-learn version: 0.20.0
xgboost version: 0.81
NumPy version: 1.13.3
SciPy version: 1.1.0
matplotlib version: 2.0.2
seaborn version: 0.9.0
IPython version: 5.3.0
-------------------------
gender_submission.csv
test.csv
train.csv



Next, let's look at the basic shape of the data, using functions such as info and describe.

In [2]:
train = pd.read_csv(data_dir+"train.csv")
test = pd.read_csv(data_dir+"test.csv")
train.info()
#test.info()
#train.describe(include = "all")
#train.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


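I prefer to merge train and test before doing data cleaning and feature engineering. In addition, PassengerId is only a row identifier and the meaning of Ticket is unclear, so we drop both columns.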

In [3]:
# saving passenger id in advance in order to submit later.
passengerId = test.PassengerId

train_len = len(train)
dataset =  pd.concat([train, test], axis=0).reset_index(drop=True)
#dataset.info()
#dataset.head()
drop_column = ["PassengerId", "Ticket"]
dataset.drop(drop_column, axis=1, inplace = True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
Age         1046 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Fare        1308 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
dtypes: float64(3), int64(3), object(4)
memory usage: 102.3+ KB
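
The merge-then-split pattern used here can be sketched on a tiny example. The frames and names below (`toy_train`, `toy_test`, `n_train`) are hypothetical stand-ins, not the real CSV files; the point is only that the row count of train is enough to recover both parts after joint preprocessing.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv.
toy_train = pd.DataFrame({"Age": [22.0, None, 35.0], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"Age": [None, 40.0]})

# Remember where train ends so the frames can be separated again later.
n_train = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0, sort=False).reset_index(drop=True)

# Preprocess once on the combined frame (here: median imputation of Age).
combined["Age"] = combined["Age"].fillna(combined["Age"].median())

# Split back by position; the test part has no labels, so drop that column.
train_part = combined.iloc[:n_train].copy()
test_part = combined.iloc[n_train:].drop(columns="Survived").reset_index(drop=True)
```

One caveat of this design: statistics such as the median are computed over train and test together, so the test rows influence how the train rows are filled.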


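Let's count the missing values in the merged data. (The 418 missing Survived values are simply the unlabeled test rows.)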

In [4]:
print("Data columns with null values:\n", dataset.isnull().sum())

Data columns with null values:
 Age          263
Cabin       1014
Embarked       2
Fare           1
Name           0
Parch          0
Pclass         0
Sex            0
SibSp          0
Survived     418
dtype: int64


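Handling missing values is an important part of preprocessing, and it usually takes some judgment and several attempts. Besides filling with the mean, median, or mode, common approaches include examining relationships with other features, re-encoding missingness as its own category, dropping features with too many missing values, and predicting the missing values with a machine learning model.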
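
As a rough illustration of the simpler strategies listed above, the sketch below fills a small made-up frame (the values are hypothetical, not taken from the Titanic data) with the median, the mode, and a placeholder category, and measures the share of missing values per column as a basis for deciding whether to drop a feature.

```python
import pandas as pd

# Hypothetical toy data with gaps in every column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, None],
    "Embarked": ["S", "C", "S", None, "S"],
    "Cabin": [None, "C85", None, "C123", None],
})

# Share of missing values per column; a very high share may justify dropping.
missing_share = toy.isnull().mean()

filled = toy.copy()
# Numeric column: fill with the median, which is robust to outliers.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column with few gaps: fill with the mode.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Mostly-missing column: re-encode "missing" as its own category.
filled["Cabin"] = filled["Cabin"].fillna("N")
```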

Here we start with the two missing values in Embarked and look at the overall distribution of that feature.

In [5]:
dataset[dataset.Embarked.isnull()]
pd.DataFrame(dataset.Embarked.value_counts(dropna=False))

|     | Embarked |
|-----|----------|
| S   | 914      |
| C   | 270      |
| Q   | 123      |
| NaN | 2        |


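The most frequent value is S, so one option is to fill the missing values with the mode, S. However, a closer look at how Embarked relates to Pclass and Fare shows that for Pclass=1 and Fare=80, Embarked is most likely C, so we can also fill the missing values with C.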
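
One way to run this kind of check is to compare the typical fare per port within a class. The sketch below uses a made-up frame (all numbers are hypothetical and chosen only for illustration), groups by Pclass and Embarked, and picks the first-class port whose median fare is closest to 80.

```python
import pandas as pd

# Hypothetical toy data; the real check would use the merged dataset.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3],
    "Embarked": ["C", "C", "S", "S", "S", "Q"],
    "Fare":     [78.0, 82.0, 50.0, 54.0, 8.0, 7.5],
})

# Median fare for every (class, port) combination.
median_fare = toy.groupby(["Pclass", "Embarked"])["Fare"].median()

# Among first-class passengers, find the port whose median fare is nearest 80.
first_class = median_fare.loc[1]
closest_port = (first_class - 80.0).abs().idxmin()
print(closest_port)
```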

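For the remaining columns, we provisionally fill Fare with an approach similar to the one used for Embarked (the mean fare of comparable passengers), fill Age with the median, and fill Cabin with N.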

In [None]:
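# assign the result back instead of calling fillna(inplace=True) on a column view
dataset["Embarked"] = dataset["Embarked"].fillna("C")
# fill the missing Fare with the mean fare of third-class male passengers who embarked at S
missing_value = dataset[(dataset.Pclass == 3) & (dataset.Embarked == "S") & (dataset.Sex == "male")].Fare.mean()
dataset["Fare"] = dataset["Fare"].fillna(missing_value)
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())
dataset["Cabin"] = dataset["Cabin"].fillna("N")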
dataset.info()
dataset.describe(include = "all")

Next, let's take a quick look at how the features in train relate to each other, including their relationship to the target, Survived.

In [None]:
data_plot = dataset[:train_len]
#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data_plot, row = "Sex", col = "Pclass", hue = "Survived")
h.map(plt.hist, "Age", alpha = .75)
h.add_legend()
#pair plots of entire dataset
pp = sns.pairplot(data_plot, hue = "Survived", palette = "deep", height=1.2, diag_kind = "kde", diag_kws=dict(shade=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.select_dtypes(include="number").corr(),
        cmap = colormap,
        square=True, 
        cbar_kws={"shrink":.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor="white",
        annot_kws={"fontsize":12 }
    )
    
    plt.title("Pearson Correlation of Features", y=1.05, size=15)
correlation_heatmap(data_plot)
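
Alongside the plots, a numeric summary can make the relationship with the target easier to read. The sketch below is one possible approach, not part of the original notebook: it computes the survival rate per Sex and Pclass group on a made-up frame (the rows are hypothetical).

```python
import pandas as pd

# Hypothetical toy rows; the real analysis would use data_plot.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 target within each group is that group's survival rate.
rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()
print(rate)
```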

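### Step 3: Feature Engineering

Feature engineering is a matter of judgment, taste, and skill.

Let's go through the features one at a time, starting with Age.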

In [None]:
g = sns.kdeplot(data_plot.Age[data_plot.Survived == 0], color="Red", shade = True)
g = sns.kdeplot(data_plot.Age[data_plot.Survived == 1], ax =g, color="Blue", shade= True)
g.set_xlabel("Age")
g.set_ylabel("Density")
g = g.legend(["Not Survived","Survived"])

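The two distributions are broadly similar, but the survivor curve shows a peak among young children, perhaps a glimmer of humanity.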

In [None]:
dataset["AgeBin"] = pd.cut(dataset.Age.astype(int), 5)
dataset["IsChild"] = (dataset.Age < 16).astype(int)
#dataset.info()
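
To make the binning above concrete, here is a minimal sketch on a handful of made-up ages (the values are hypothetical, not taken from the dataset). `pd.cut` with an integer bin count splits the observed range into equal-width intervals, and `labels=False` returns integer bin codes instead of interval labels.

```python
import pandas as pd

# Hypothetical ages chosen to span a wide range.
ages = pd.Series([2, 15, 16, 28, 28, 40, 80], name="Age")

# Five equal-width intervals over the observed range [2, 80].
age_bin = pd.cut(ages, 5)

# The same binning expressed as integer codes 0..4.
age_code = pd.cut(ages, 5, labels=False)

# Binary flag using the same threshold as the notebook (under 16).
is_child = (ages < 16).astype(int)

print(pd.DataFrame({"Age": ages, "AgeBin": age_bin, "AgeCode": age_code, "IsChild": is_child}))
```

Because the intervals have equal width rather than equal counts, sparse regions such as the oldest ages can end up with very few rows per bin.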

The Cabin column holds the cabin number. Its leading letter may carry information, so let's extract it.

In [None]:
#dataset.Cabin.head()
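# keep only the leading letter of each cabin code ("N" marks a missing cabin)
dataset["Cabin"] = dataset["Cabin"].str[0]
# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(y="Survived", x="Cabin", data=dataset, kind="bar")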
g = g.set_ylabels("Survival Probability")
#dataset = pd.get_dummies(dataset, columns = ["Cabin"], prefix="Cabin")
dataset.sample(10)
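
The commented-out `get_dummies` line hints at the next step, one-hot encoding the extracted letter. A minimal sketch of both steps on a few made-up cabin values (hypothetical, for illustration only) could look like this:

```python
import pandas as pd

# Hypothetical cabin values, including missing entries.
cabins = pd.Series(["C85", None, "E46", "C123", None], name="Cabin")

# Fill gaps with "N", then keep only the leading letter.
deck = cabins.fillna("N").str[0]

# One indicator column per letter, named Cabin_<letter>.
deck_dummies = pd.get_dummies(deck, prefix="Cabin")
print(deck_dummies)
```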

For Fare, let's look at the distribution.

In [None]:
g = sns.kdeplot(data_plot.Fare[data_plot.Survived == 0], color="Red", shade = True)
g = sns.kdeplot(data_plot.Fare[data_plot.Survived == 1], ax =g, color="Blue", shade= True)
g.set_xlabel("Fare")
g.set_ylabel("Density")
g = g.legend(["Not Survived","Survived"])

In [None]:
dataset.info()