Before starting, let's be clear about what our goal is and take a quick look at the information we have been given.  
> Competition Description  
> ...  
> **One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.** ...  
>  In this challenge, **we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.**  

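What can we conclude from the Competition Description?  
1. A major cause of the large loss of life was that there were not enough lifeboats  
2. The goal this time is to predict, from the characteristics of the passengers/crew on board, whether each of them survived  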

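OK, next let's see what useful information the dataset contains.  

First, let's look at which attributes the data has and which values each attribute can take.  
The table below comes from [here](https://www.kaggle.com/c/titanic/data).  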

| Variable | Definition | Key |  
|----------|------------|-----|
| Survival | Whether the passenger survived | 0 = No, 1 = Yes |  
| Pclass | Ticket class (a proxy for socio-economic status) | 1 = 1st, 2 = 2nd, 3 = 3rd |  
| Sex | Sex |  |  
| Age | Age |  |  
| SibSp | Number of siblings and spouses aboard |  |  
| Parch | Number of parents and children aboard |  |  
| Ticket | Ticket number |  |  
| Fare | Passenger fare |  |  
| Cabin | Cabin number |  |  
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |  

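The table above tells us what each attribute means, so at this point we can make the following irresponsible guesses ~  
1. The higher the class given by Pclass (1st being the highest), the greater the chance of Survival  
2. Passengers whose Sex is female have a greater chance of Survival  
3. Passengers whose Age puts them among `children` or `the elderly` have a greater chance of Survival  
4. The larger SibSp is, the smaller the chance of Survival  
5. The larger Parch is, the smaller the chance of Survival  
6. The higher the Fare, the smaller the chance of Survival  
7. The higher up the cabin location indicated by Cabin, the smaller the chance of Survival  
8. Embarked feels like it should have little to do with the chance of Survival  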
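
One way to sanity-check guesses like these is to compare the mean of `Survived` across the values of a column. The sketch below is only illustrative: the helper name `survival_rate_by` is hypothetical, and the ten rows are copied from the `data_train.sample(10)` output shown later in this notebook, which is far too few to confirm or refute anything. Running the same helper on the full `data_train` would give the actual rates.

```python
import pandas as pd

# Ten rows copied from the sample output shown later in this notebook
# (only the columns needed here); not a representative subset.
rows = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "Pclass":   [2, 3, 2, 3, 2, 3, 1, 3, 3, 2],
    "Sex":      ["female", "male", "female", "female", "male",
                 "male", "male", "female", "male", "female"],
})

def survival_rate_by(df, column):
    """Return the fraction of survivors for each value of `column`."""
    return df.groupby(column)["Survived"].mean()

print(survival_rate_by(rows, "Sex"))
print(survival_rate_by(rows, "Pclass"))
```

Because `Survived` is coded as 0/1, its mean within a group is exactly the survival rate of that group, which is why a plain `groupby(...).mean()` is enough here.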

Next, let's import the dataset and see what it looks like.

In [11]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
from sklearn.feature_extraction import DictVectorizer
data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')
print('training data shape: ', data_train.shape)
data_train.sample(10)

training data shape:  (891, 12)


|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 346 | 347 | 1 | 2 | Smith, Miss. Marion Elsie | female | 40.0 | 0 | 0 | 31418 | 13.0 | NaN | S |
| 451 | 452 | 0 | 3 | Hagland, Mr. Ingvald Olai Olsen | male | NaN | 1 | 0 | 65303 | 19.9667 | NaN | S |
| 535 | 536 | 1 | 2 | Hart, Miss. Eva Miriam | female | 7.0 | 0 | 2 | F.C.C. 13529 | 26.25 | NaN | S |
| 106 | 107 | 1 | 3 | Salkjelsvik, Miss. Anna Kristine | female | 21.0 | 0 | 0 | 343120 | 7.65 | NaN | S |
| 594 | 595 | 0 | 2 | Chapman, Mr. John Henry | male | 37.0 | 1 | 0 | SC/AH 29037 | 26.0 | NaN | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 449 | 450 | 1 | 1 | Peuchen, Major. Arthur Godfrey | male | 52.0 | 0 | 0 | 113786 | 30.5 | C104 | S |
| 677 | 678 | 1 | 3 | Turja, Miss. Anna Sofia | female | 18.0 | 0 | 0 | 4138 | 9.8417 | NaN | S |
| 200 | 201 | 0 | 3 | Vande Walle, Mr. Nestor Cyriel | male | 28.0 | 0 | 0 | 345770 | 9.5 | NaN | S |
| 58 | 59 | 1 | 2 | West, Miss. Constance Mirium | female | 5.0 | 1 | 2 | C.A. 34651 | 27.75 | NaN | S |


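Well now, just from looking at this table we can already learn quite a lot about the data:  
1. Name cannot be used directly, but it follows the pattern ?Name + title + ?Name, so it feels like it could be tied to Sex  
2. Age contains NaN values that need handling, and Age is a floating-point value  
3. SibSp and Parch are integers  
4. Ticket shows no obvious pattern and no obvious link to the other attributes, so we drop it  
5. Cabin may be empty; it looks like a plain string, probably of the form \[CabinPrefix\]\[CabinNumber\]  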
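
Observations 1 and 5 can be sketched with regular expressions. This is a minimal illustration, assuming the title is whatever sits between the comma and the first period in Name, and that the cabin prefix is the leading letter of Cabin; the column names `Title` and `CabinPrefix` are hypothetical, and the values are copied from the sample output above.

```python
import pandas as pd

# Values copied from the sample output above.
sample = pd.DataFrame({
    "Name": ["Smith, Miss. Marion Elsie",
             "Hagland, Mr. Ingvald Olai Olsen",
             "Peuchen, Major. Arthur Godfrey"],
    "Cabin": [None, None, "C104"],
})

# The title is the text between the comma and the first period.
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# The cabin prefix is the leading letter; missing cabins stay missing.
sample["CabinPrefix"] = sample["Cabin"].str.extract(r"^([A-Za-z])", expand=False)

print(sample[["Title", "CabinPrefix"]])
```

Titles such as `Miss` and `Mr` carry much the same information as Sex, which is the link guessed at in observation 1.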

Those are the conclusions we can draw by eye; next, let's use code to verify them further.  

In [34]:
print('= ' * 20 + 'Check is data contain nan value' + ' =' * 20)
def get_contain_nan_columns(data):
    return [column for column in data.columns if data[column].isna().any()]
columns_contain_nan = get_contain_nan_columns(data_train)
print('Columns contain nan value: ', columns_contain_nan)
print('= = ' * 20)
def print_columns_value_counts(data):
    for column in data.columns:
        # Use the DataFrame passed in rather than the global data_train
        value_counts = data[column].value_counts(dropna=True)
        print('Value Counts Of ', column)
        # Show at most 20 randomly chosen entries per column
        print(value_counts.sample(min(value_counts.shape[0], 20)))
        print('= = ' * 20)
print_columns_value_counts(data_train)


= = = = = = = = = = = = = = = = = = = = Check is data contain nan value = = = = = = = = = = = = = = = = = = = =
Columns contain nan value:  ['Age', 'Cabin', 'Embarked']
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
Value Counts Of  PassengerId
300    1
448    1
438    1
219    1
116    1
564    1
703    1
214    1
562    1
670    1
684    1
709    1
295    1
117    1
445    1
435    1
347    1
151    1
306    1
774    1
Name: PassengerId, dtype: int64
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
Value Counts Of  Survived
1    342
0    549
Name: Survived, dtype: int64
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
Value Counts Of  Pclass
3    491
2    184
1    216
Name: Pclass, dtype: int64
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
Value Counts Of  Name
Renouf, Mrs. Peter Henry (Lillian Jefferys)            1
Buss, Miss. Kate                    

In [2]:

import re

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')

# for column in X_train.columns:
#     print(column, X_train[column].unique())
# categorical_columns = ['Sex', 'Cabin', 'Embarked']

def convert_sex_to_numerical(sex):
    """Map 'male'/'female' to 0/1; anything else becomes NaN."""
    dic = {'male': 0, 'female': 1}
    return dic.get(sex, np.nan)

def convert_cabin_to_numerical(cabin):
    """Map the letter of the first letter+number cabin (e.g. 'C' in 'C104') to an integer."""
    dic = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
    if pd.isna(cabin):
        return cabin
    prefix = re.findall(r'([A-Za-z])\d+', cabin)
    # Cabins without a letter+number pair, or with an unlisted letter, become NaN
    return dic.get(prefix[0], np.nan) if prefix else np.nan

def convert_embarked_to_numerical(embarked):
    """Map the embarkation port code to an integer; missing values become NaN."""
    dic = {'S': 0, 'C': 1, 'Q': 2}
    return dic.get(embarked, np.nan)

def process_data(data):
    """Drop unused columns, merge SibSp and Parch into FamilySize, encode categoricals."""
    X = data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
    X['FamilySize'] = X['SibSp'] + X['Parch']
    X = X.drop(['SibSp', 'Parch'], axis=1)
    X.Sex = X.Sex.apply(convert_sex_to_numerical)
    X.Cabin = X.Cabin.apply(convert_cabin_to_numerical)
    X.Embarked = X.Embarked.apply(convert_embarked_to_numerical)
    return X


y_train = data_train.Survived
X_train = process_data(data_train).drop(['Survived'], axis=1)
# Fill missing values with column means learned from the training set only
imputer = SimpleImputer()
X_train = imputer.fit_transform(X_train)
# The test set has no Survived column; reuse the training-set means
X_test = imputer.transform(process_data(data_test))

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(prediction)
first_submission = pd.DataFrame({
    "PassengerId": data_test["PassengerId"],
    "Survived": prediction
})
first_submission.to_csv('first_submission.csv', index=False)

[0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0
 0 0 1 0 1 0 1 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0
 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0
 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0
 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 0 0 1 0 0 0]
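
The imputation step in the cell above fills missing values with column means learned from the training set and then reuses those same means for the test set (mean is the imputer's default strategy). A pandas-only sketch of that behaviour is shown below; the toy values are invented for illustration and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration; not taken from the Titanic data.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 30.0, np.nan]})
test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 8.0]})

# Learn the fill values from the training set only ...
train_means = train.mean()

# ... and apply them to both sets, so the test set never influences them.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

Filling the test set with training-set statistics keeps the preprocessing identical to what the model saw during fitting.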
