# Titanic: Machine Learning from Disaster

## Overview

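The Titanic competition is a rite of passage for Kagglers. Drawing on several existing posts, this article walks through a complete machine learning workflow in the following steps:

    1) Problem analysis
    2) Data collection
    3) Data cleaning
    4) Feature engineering
    5) Model building
    6) Model ensembling
    7) System optimization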


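## Step 1: Problem Analysis

A description of the Titanic problem is available on the official competition page. At its core, this is a basic binary classification problem.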

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


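## Step 2: Data Collection

The official site already provides clearly organized training and test sets, so there is nothing more to do in this step.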

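## Step 3: Data Cleaning

First, load the required libraries: pandas for data handling, scikit-learn and xgboost for modeling, NumPy and SciPy for numerical computing, and matplotlib and seaborn for plotting.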

In [4]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

#system parameters
import sys
print("Python version: {}". format(sys.version))

#functions for data processing and analysis
import pandas as pd
print("pandas version: {}". format(pd.__version__))

#machine learning algorithms
import sklearn
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn import feature_selection, model_selection, metrics
import xgboost
from xgboost import XGBClassifier
print("scikit-learn version: {}". format(sklearn.__version__))
print("xgboost version: {}". format(xgboost.__version__))

#scientific computing
import numpy as np
import scipy as sp
print("NumPy version: {}". format(np.__version__))
print("SciPy version: {}". format(sp.__version__)) 

#data visualization
#scatter_matrix lives in pandas.plotting (pandas.tools.plotting is deprecated since pandas 0.20)
from pandas.plotting import scatter_matrix
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import IPython
from IPython import display
print("matplotlib version: {}". format(mpl.__version__))
print("seaborn version: {}". format(sns.__version__))
print("IPython version: {}". format(IPython.__version__))
#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

#misc libraries
import random
import time

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

print('-'*25)
# Input data files are available in the "datasets/titanic/" directory.
# List all files (os.listdir is used instead of `ls` so this also runs on Windows)
import os
data_dir = "datasets/titanic/"
print("\n".join(sorted(os.listdir(data_dir))))


Python version: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.23.4
scikit-learn version: 0.20.0
xgboost version: 0.81
NumPy version: 1.13.3
SciPy version: 1.1.0
matplotlib version: 2.0.2
seaborn version: 0.9.0
IPython version: 5.3.0
-------------------------
gender_submission.csv
test.csv
train.csv



Next, look at the basic structure of the data using functions such as `info` and `describe`. For convenience, the train and test sets will be combined.

In [41]:
data_train = pd.read_csv(data_dir+"train.csv")
data_test = pd.read_csv(data_dir+"test.csv")
data_train.info()
data_train.describe(include = 'all')
#data_train.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 |
| top | | | | Peters, Miss. Katie | male | | | | 347082 | | B96 B98 | S |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |
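
The text above mentions combining the train and test sets for convenience. One possible way to do that, sketched here on small toy frames, is `pd.concat` with the `keys` argument so that each row remembers where it came from. The names `toy_train`, `toy_test`, and `combined`, and the values in them, are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values, not the real data).
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, None, 26.0],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Age": [34.5, None],
})

# Stack the two frames. `keys` adds an outer index level recording the origin,
# and sort=False keeps the column order of the first frame.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"], sort=False)

# Rows from the test set have no label, so Survived is NaN there.
print(combined)

# Split back into the two parts when needed.
train_part = combined.loc["train"]
test_part = combined.loc["test"].drop(columns="Survived")
```

Keeping the origin in the index means that cleaning steps can be applied once to `combined` and the two parts can still be separated before model training.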


The output above shows that the raw data contains many missing values. Let's look at them in detail.

In [42]:
print('Train columns with null values:\n', data_train.isnull().sum())
print("-"*10)

print('Test/Validation columns with null values:\n', data_test.isnull().sum())
print("-"*10)

Train columns with null values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------
Test/Validation columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
----------
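
Raw null counts are easier to compare across columns when expressed as a share of the rows. The helper below is a minimal sketch of that idea; `missing_summary` and the toy frame are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages for columns that contain nulls."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one null, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with hypothetical values, only to show the shape of the result.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
result = missing_summary(toy)
print(result)
```

Applied to the 891-row training set, the counts printed above work out to roughly 19.9% missing for `Age` (177/891), 77.1% for `Cabin` (687/891), and 0.2% for `Embarked` (2/891).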


In [54]:
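# Quick experiment: does pd.concat return a copy or a view of its inputs?
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
print(df1)
df2 = pd.DataFrame({'A': ['A5', 'A6', 'A7', 'A8'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
print(df2)
# Stack df2 below df1; the original index labels (0-3) are kept, so they repeat
df3 = pd.concat([df1, df2])
print(df3)
# Modify one cell of the result (first row, third column) by position
df3.iloc[0,2] = "sdf"
# df1 and df2 are unchanged: only df3 shows the new value
print(df1)
print(df2)
print(df3)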

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
    A   B   C   D
0  A5  B0  C0  D0
1  A6  B1  C1  D1
2  A7  B2  C2  D2
3  A8  B3  C3  D3
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
0  A5  B0  C0  D0
1  A6  B1  C1  D1
2  A7  B2  C2  D2
3  A8  B3  C3  D3
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
    A   B   C   D
0  A5  B0  C0  D0
1  A6  B1  C1  D1
2  A7  B2  C2  D2
3  A8  B3  C3  D3
    A   B    C   D
0  A0  B0  sdf  D0
1  A1  B1   C1  D1
2  A2  B2   C2  D2
3  A3  B3   C3  D3
0  A5  B0   C0  D0
1  A6  B1   C1  D1
2  A7  B2   C2  D2
3  A8  B3   C3  D3
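
The experiment above suggests that `pd.concat` returns a new frame, so edits to the result do not propagate back to its inputs. A consequence is that, when both original frames must be transformed, one option is to loop over references to them instead of editing the concatenated copy. The sketch below illustrates this; `left`, `right`, and `stacked` are hypothetical names, and the lowercasing step is only a placeholder transformation.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"], "C": ["C0", "C1"]})
right = pd.DataFrame({"A": ["A5", "A6"], "C": ["C0", "C1"]})

# pd.concat builds a new frame, so this edit does not reach `left`.
stacked = pd.concat([left, right])
stacked.iloc[0, 1] = "changed"
value_in_left = left.iloc[0, 1]
print(value_in_left)  # still "C0"

# To change both inputs, iterate over references to the original objects.
for frame in (left, right):
    frame["C"] = frame["C"].str.lower()

print(left)
print(right)
```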
