Before we start, let's be clear about what our goal is and take a quick look at the information we have been given  
> Competition Description  
> ...  
> **One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.** ...  
>  In this challenge, **we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.**  

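What can we conclude from the Competition Description?  
1. A major reason for the heavy loss of life was the shortage of lifeboats  
2. Our goal is to predict, from the features of the passengers/crew on board, whether each person survived  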

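OK, next let's see what useful information the dataset holds  

First, let's look at which attributes the data contains and which values each of them can take  
The table below comes from [here](https://www.kaggle.com/c/titanic/data)  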

| variables | definition | key |
|-----------|------------|-----|
| Survival | Whether the passenger survived | 0 = No, 1 = Yes |
| Pclass | Ticket class (a proxy for socio-economic status) | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | Sex | |
| Age | Age in years | |
| SibSp | Number of siblings and spouses aboard | |
| Parch | Number of parents and children aboard | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

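The table above tells us what each attribute means, so now we can make a few irresponsible guesses ~  
1. The higher the Pclass (1st being the highest), the more likely Survival is  
2. Passengers whose Sex is female are more likely to survive  
3. Passengers whose Age falls in the `child` or `elderly` range are more likely to survive  
4. The larger the SibSp, the less likely Survival is  
5. The larger the Parch, the less likely Survival is  
6. The higher the Fare, the less likely Survival is  
7. The higher up the cabin location indicated by Cabin, the less likely Survival is  
8. Embarked feels like it should have nothing to do with Survival  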
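
Guesses like these can be checked cheaply once the data is loaded by comparing the share of survivors in each group. The snippet below is only a minimal sketch: the helper name `survival_rate_by` and the toy rows are made up for illustration, and with the real data one would pass `data_train` instead of `toy`.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of survivors (mean of Survived) for each value of `column`."""
    return df.groupby(column)["Survived"].mean().sort_index()


# Toy rows, invented for this example only (not taken from train.csv).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 0],
})

rates_by_class = survival_rate_by(toy, "Pclass")
rates_by_sex = survival_rate_by(toy, "Sex")
print(rates_by_class)
print(rates_by_sex)
```

A group whose rate sits far from the overall mean of `Survived` is a hint that the attribute carries signal; it does not prove the guess, since the attributes are correlated with one another.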

Next, let's load the dataset and see what it looks like

In [2]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.feature_extraction import DictVectorizer
data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')
print('training data shape: ', data_train.shape)
data_train.sample(10)

training data shape:  (891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 205 | 206 | 0 | 3 | Strom, Miss. Telma Matilda | female | 2.0 | 0 | 1 | 347054 | 10.4625 | G6 | S |
| 472 | 473 | 1 | 2 | West, Mrs. Edwy Arthur (Ada Mary Worth) | female | 33.0 | 1 | 2 | C.A. 34651 | 27.75 | | S |
| 328 | 329 | 1 | 3 | Goldsmith, Mrs. Frank John (Emily Alice Brown) | female | 31.0 | 1 | 1 | 363291 | 20.525 | | S |
| 670 | 671 | 1 | 2 | Brown, Mrs. Thomas William Solomon (Elizabeth ... | female | 40.0 | 1 | 1 | 29750 | 39.0 | | S |
| 75 | 76 | 0 | 3 | Moen, Mr. Sigurd Hansen | male | 25.0 | 0 | 0 | 348123 | 7.65 | F G73 | S |
| 416 | 417 | 1 | 2 | Drew, Mrs. James Vivian (Lulu Thorne Christian) | female | 34.0 | 1 | 1 | 28220 | 32.5 | | S |
| 189 | 190 | 0 | 3 | Turcin, Mr. Stjepan | male | 36.0 | 0 | 0 | 349247 | 7.8958 | | S |
| 795 | 796 | 0 | 2 | Otter, Mr. Richard | male | 39.0 | 0 | 0 | 28213 | 13.0 | | S |
| 660 | 661 | 1 | 1 | Frauenthal, Dr. Henry William | male | 50.0 | 2 | 0 | PC 17611 | 133.65 | | S |
| 866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | | C |


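Well, well. Just looking at this table tells us quite a lot about the data  
1. Name cannot be used directly, but it follows the pattern ?Name + title + ?Name, so it can probably be tied to Sex  
2. Age contains NaN and needs some handling; Age is also a floating-point value  
3. SibSp and Parch are integers  
4. Ticket shows no obvious pattern and no obvious link to the other attributes, so drop it  
5. Cabin may be empty; it looks like a plain string, probably of the form \[CabinPrefix\]\[CabinNumber\]  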
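
Observations 1 and 5 can be made concrete with two regular expressions. This is a rough sketch rather than the notebook's own processing: the patterns are assumptions about how Name and Cabin are formatted, checked here only against three rows from the sample above.

```python
import pandas as pd

# Three Name / Cabin pairs copied from the sample rows shown above.
names = pd.Series([
    "Moen, Mr. Sigurd Hansen",
    "Duran y More, Miss. Asuncion",
    "Frauenthal, Dr. Henry William",
])
cabins = pd.Series(["F G73", "G6", None])

# Title: the text between the comma and the first period that follows it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Deck: the letter directly in front of the first run of digits.
# Missing cabins stay missing.
decks = cabins.str.extract(r"([A-Za-z])\d+", expand=False)

print(titles.tolist())
print(decks.tolist())
```

Note that for a value such as `F G73` this pattern picks `G`, because `F` is not followed by digits; whether that is the right deck for such entries is a judgment call.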

Those are the conclusions we can draw by eye. Next, let's use code to verify them further  

In [29]:
print('= ' * 20 + 'Check is data contain nan value' + ' =' * 20)
def get_contain_nan_columns(data):
    return [column for column in data.columns if data[column].isna().any()]
columns_contain_nan = get_contain_nan_columns(data_train)
print('Columns contain nan value: ', columns_contain_nan)
print('= = ' * 20)
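def print_columns_value_counts(data, columns_ignored=None):
    # Avoid a mutable default argument; these columns are skipped unless told otherwise
    if columns_ignored is None:
        columns_ignored = ['PassengerId', 'Sex', 'Survived', 'Pclass']
    _data = data.drop(columns_ignored, axis=1)
    for column in _data.columns:
        # Count values in the frame that was passed in, not in the global data_train
        value_counts = _data[column].value_counts(dropna=True)
        print('Value Counts Of ', column)
        # Show at most 30 randomly chosen entries per column
        print(value_counts.sample(min(value_counts.shape[0], 30)))
        print('= = ' * 20)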
print_columns_value_counts(data_train)


= = = = = = = = = = = = = = = = = = = = Check is data contain nan value = = = = = = = = = = = = = = = = = = = =
Columns contain nan value:  ['Age', 'Cabin', 'Embarked']
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
Value Counts Of  Name
Hood, Mr. Ambrose Jr                                 1
Taylor, Mrs. Elmer Zebley (Juliet Cummins Wright)    1
Madigan, Miss. Margaret "Maggie"                     1
Leyson, Mr. Robert William Norman                    1
Windelov, Mr. Einar                                  1
Bostandyeff, Mr. Guentcho                            1
O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)      1
Radeff, Mr. Alexander                                1
Chambers, Mrs. Norman Campbell (Bertha Griggs)       1
Holm, Mr. John Fredrik Alexander                     1
Palsson, Mrs. Nils (Alma Cornelia Berglund)          1
van Billiard, Mr. Austin Blyler                      1
Hedman, Mr. Oskar Arvid                              1
Serepeca, Mis

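The output above shows that the attributes with NaN values to handle are Age, Cabin, and Embarked  
As for Name, the values all look different on the surface, but the title inside each name lets us infer the passenger's rough age group, which helps us fill in the missing Age values more sensibly  
Also, for Cabin, besides the NaN values, we can see that some Cabin values take a form like `C23 C25 C27`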
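
The idea of using the title to fill in missing Age values could look roughly like the sketch below. It is one possible approach, not what the baseline cell later in this notebook does: the helper `fill_age_by_title`, the title regex, and the choice of the per-title median are all assumptions, and the toy rows are invented.

```python
import pandas as pd


def fill_age_by_title(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Age with the median Age of passengers sharing the same title."""
    out = df.copy()
    # Title: the text between the comma and the first period that follows it.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    title_median = out.groupby("Title")["Age"].transform("median")
    # Fall back to the overall median when a title has no known ages at all.
    out["Age"] = out["Age"].fillna(title_median).fillna(out["Age"].median())
    return out


# Toy rows, invented for this example only.
toy = pd.DataFrame({
    "Name": ["A, Mr. X", "B, Mr. Y", "C, Mr. Z", "D, Miss. W", "E, Miss. V"],
    "Age": [30.0, 40.0, None, 4.0, None],
})
filled = fill_age_by_title(toy)
print(filled[["Name", "Title", "Age"]])
```

To avoid leaking information, the medians would normally be computed on the training set and then reused when filling the test set.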


In [12]:
special_cabin_value = 'C23 C25 C27'
data_train.loc[data_train['Cabin'] == special_cabin_value]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0 | C23 C25 C27 | S |


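With the code above we found the passengers who share the same Special Cabin value, and what do we see? These passengers look like one family!  
But isn't something missing? It seems one member of this family has a Cabin that is not the Special Cabin, so let's search again using their name  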

In [28]:
special_name_lastname = 'Fortune'
print('= = ' * 20)
print(data_train.loc[data_train['Name'].str.startswith(special_name_lastname)].describe())
print('= = ' * 20)
print(data_train.loc[data_train['Ticket'] == '19950'].describe())

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
       PassengerId  Survived  Pclass        Age  SibSp  Parch   Fare
count     4.000000   4.00000     4.0   4.000000    4.0    4.0    4.0
mean    224.500000   0.50000     1.0  32.500000    2.5    2.5  263.0
std     197.306023   0.57735     0.0  21.110819    1.0    1.0    0.0
min      28.000000   0.00000     1.0  19.000000    1.0    2.0  263.0
25%      73.750000   0.00000     1.0  22.000000    2.5    2.0  263.0
50%     215.500000   0.50000     1.0  23.500000    3.0    2.0  263.0
75%     366.250000   1.00000     1.0  34.000000    3.0    2.5  263.0
max     439.000000   1.00000     1.0  64.000000    3.0    4.0  263.0
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
       PassengerId  Survived  Pclass        Age  SibSp  Parch   Fare
count     4.000000   4.00000     4.0   4.000000    4.0    4.0    4.0
mean    224.500000   0.50000     1.0  32.500000    2.5    2.5  263.0
std     19

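Ugh, what a headache... it looks like there really is no way to use this information to find the missing family member...  
Based on the analysis above, we can make a few more decisions:  
1. For Cabin, keeping only the leading letter is enough to identify the cabin type  
2. For Name, keep only the title in the middle for later use  
3. SibSp & Parch have definite values and need no special handling  
4. Embarked is probably not useful, so drop it  
5. Age & Fare are float64, so convert them into binned data  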
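
The last decision, turning Age and Fare into bins, might be sketched with `pd.cut` and `pd.qcut` as below. The bin edges, the labels, and the choice of four fare quantiles are hypothetical values picked for illustration, not something derived from the data, and the toy rows are invented.

```python
import pandas as pd

# Hypothetical age bands; the edges are an assumption, not tuned on the data.
AGE_BINS = [0, 12, 18, 60, 120]
AGE_LABELS = ["child", "teen", "adult", "senior"]

# Toy rows, invented for this example only.
toy = pd.DataFrame({
    "Age": [2.0, 15.0, 27.0, 64.0],
    "Fare": [7.65, 13.0, 39.0, 263.0],
})

# Fixed-width style bands for Age: intervals are (0, 12], (12, 18], (18, 60], (60, 120].
toy["AgeBand"] = pd.cut(toy["Age"], bins=AGE_BINS, labels=AGE_LABELS)

# Quantile-based bands for Fare, returned as integer codes 0..3.
toy["FareBand"] = pd.qcut(toy["Fare"], q=4, labels=False)

print(toy)
```

Rows with a missing Age stay missing after `pd.cut`, so the NaN handling still has to happen before or after binning.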

In [4]:

import re

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22
data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')

# for column in X_train.columns:
#     print(column, X_train[column].unique())
# categorical_columns = ['Sex', 'Cabin', 'Embarked']

def convert_sex_to_numerical(sex):
    # Unknown values become None and are filled in by the imputer later
    dic = {'male': 0, 'female': 1}
    return dic.get(sex, None)

# The deck letter is the character directly before the first run of digits
CABIN_PATTERN = re.compile(r'(\w)\d+')

def convert_cabin_to_numerical(cabin):
    dic = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
    if pd.isna(cabin): return cabin
    prefix = re.findall(CABIN_PATTERN, cabin)
    if len(prefix):
        return dic.get(prefix[0], np.nan)
    # No letter+digits pattern found: treat the cabin as missing
    return np.nan

def convert_embarked_to_numerical(embarked):
    dic = {'S': 0, 'C': 1, 'Q': 2}
    return dic.get(embarked, None)

def process_data(data):
    # Used for both the training and the test frame
    X = data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
    # Merge the two family counts into a single feature
    X['FamilySize'] = X['SibSp'] + X['Parch']
    X = X.drop(['SibSp', 'Parch'], axis=1)
    X.Sex = X.Sex.apply(convert_sex_to_numerical)
    X.Cabin = X.Cabin.apply(convert_cabin_to_numerical)
    X.Embarked = X.Embarked.apply(convert_embarked_to_numerical)
    return X


y_train = data_train.Survived
X_train = process_data(data_train)
X_train.drop(['Survived'], axis=1, inplace=True)
# Fill the remaining NaN values with the column mean (the default strategy)
imputer = SimpleImputer()
X_train = imputer.fit_transform(X_train)
X_test = process_data(data_test)
X_test = imputer.transform(X_test)

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(prediction)
first_submission = pd.DataFrame({
    "PassengerId": data_test["PassengerId"],
    "Survived": prediction
})
first_submission.to_csv('first_submission.csv', index=False)

[0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0
 0 0 1 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 0 0 1 0 0 0]
