# Predicting TITANIC Passenger Survival with Keras

In [336]:
# Imports
import numpy as np
import pandas as pd
import re
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

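## 1. Loading the data

The Titanic data is read with pandas from two local CSV files, `train.csv` and `test.csv`. Missing `Embarked` values in the training set are filled with the most frequent port, and missing `Age` values are filled with the mean age.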

In [337]:
# Loading the Titanic data from local CSV files (the IMDB loader below is left disabled)
#(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

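# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())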
#df_train.fillna(0, inplace=True)
#df_test.fillna(0, inplace=True)

df_train.head()
#df_test.head()
#print(x_train.shape)
#print(x_test.shape)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
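
To make the two fills concrete, here is a minimal sketch in plain Python that applies the same idea to made-up toy columns. The helper `fill_missing` and the toy values are assumptions for illustration only and are not part of the notebook. Like the pandas `mean()` and `mode()` calls above, the statistics are computed from the non-missing entries only.

```python
from statistics import mean, mode

# Toy columns with gaps (None marks a missing value); the values are made up.
ages = [22.0, None, 26.0, 35.0, None]
ports = ["S", "C", None, "S", "S"]

def fill_missing(values, replacement):
    """Return a copy of values with every None replaced (hypothetical helper)."""
    return [replacement if v is None else v for v in values]

# Compute the statistics from the entries that are present.
age_mean = mean(v for v in ages if v is not None)
port_mode = mode(v for v in ports if v is not None)

ages_filled = fill_missing(ages, age_mean)
ports_filled = fill_missing(ports, port_mode)

print(ages_filled)
print(ports_filled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a simple default rather than the only reasonable choice.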


In [338]:
df_train.info()
df_train['Fare'].mean()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


32.204207968574636

In [339]:
# Lookup tables for the categorical columns
sex2index = {"male":0, "female":1}
emb2index = {"S":0, "C":1, "Q":2, "null":3}

# .copy() makes x_total an independent DataFrame, which avoids SettingWithCopyWarning
x_total = df_train[['Pclass','SibSp','Parch']].copy()
x_total['Sex'] = [sex2index[sex] for sex in df_train['Sex']]
x_total['Age'] = [int(age) for age in df_train['Age']]
x_total['Far'] = [int(far) for far in df_train['Fare']]
# Caveat: a missing Cabin is NaN, and NaN is truthy, so this flag is 1 for every row
x_total['Cab'] = [1 if cab else 0 for cab in df_train['Cabin']]
x_total['Emb'] = [emb2index[emb] for emb in df_train['Embarked']]
#x_total['Tic'] = [re.findall('\d', ss) for ss in df_train['Ticket']]
x_total['Tic'] = [len(tic) for tic in df_train['Ticket']]
#x_total['Nam'] = [len(nam) for nam in df_train['Name']]

y_total = df_train['Survived']

In [340]:
x_total[:3]
#np.mean(df_train['Age'])

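| | Pclass | SibSp | Parch | Sex | Age | Far | Cab | Emb | Tic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 0 | 22 | 7 | 1 | 0 | 9 |
| 1 | 1 | 1 | 0 | 1 | 38 | 71 | 1 | 1 | 8 |
| 2 | 3 | 0 | 0 | 1 | 26 | 7 | 1 | 0 | 16 |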
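
The `Cab` column is 1 in all three rows, even though rows 0 and 2 have no cabin recorded. The sketch below shows why in plain Python: pandas stores a missing `Cabin` as the float NaN, and NaN is truthy, so `1 if cab else 0` never yields 0. The helper `has_cabin` is a hypothetical alternative, not something the notebook uses.

```python
import math

nan = float("nan")  # stands in for a missing Cabin value

# NaN is a non-zero float, so it counts as true in a conditional.
flag_truthy = 1 if nan else 0

def has_cabin(value):
    """Hypothetical helper: 1 when a cabin string is present, 0 when it is missing."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 0
    return 1

# Cabin values of the first five training rows, with NaN for the empty ones.
cabins = [nan, "C85", nan, "C123", nan]
flags = [has_cabin(c) for c in cabins]

print(flag_truthy, flags)
```

In pandas, an equivalent flag could be built with `df_train['Cabin'].notna().astype(int)`. Switching to it would change the model inputs, so the outputs and accuracy shown in this notebook would no longer apply as printed.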


In [341]:
x_train = np.array(x_total[:650])
y_train = np.array(y_total[:650])

x_test = np.array(x_total[650:])
y_test = np.array(y_total[650:])

x_train

array([[ 3,  1,  0, ...,  1,  0,  9],
       [ 1,  1,  0, ...,  1,  1,  8],
       [ 3,  0,  0, ...,  1,  0, 16],
       ..., 
       [ 1,  0,  0, ...,  1,  1,  5],
       [ 3,  0,  0, ...,  1,  0, 13],
       [ 3,  0,  0, ...,  1,  0,  8]])

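## 2. Examining the data

Each passenger is now a vector of nine integer features, in the order `Pclass`, `SibSp`, `Parch`, `Sex`, `Age`, `Far`, `Cab`, `Emb`, `Tic`.

The output is a vector of 1s and 0s, where 1 means the passenger survived and 0 means the passenger did not.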

In [342]:
print(x_train[0])
print(y_train[0])

[ 3  1  0  0 22  7  1  0  9]
0
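
As a reading aid, the printed row can be paired with the column names used when building `x_total`. This is a small sketch: the list `feature_names` is typed out by hand here and is an assumption about the column order, taken from the order in which the columns were added.

```python
# Column order as the features were added to x_total (typed by hand for this sketch).
feature_names = ['Pclass', 'SibSp', 'Parch', 'Sex', 'Age', 'Far', 'Cab', 'Emb', 'Tic']

# The first training row and its label, copied from the printed output.
first_row = [3, 1, 0, 0, 22, 7, 1, 0, 9]
first_label = 0

labelled = dict(zip(feature_names, first_row))
for name, value in labelled.items():
    print(f"{name:>6}: {value}")
print("Survived:", first_label)
```

Read this way, the first passenger travelled in third class, is encoded as male (`Sex` 0), is 22 years old, and did not survive.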


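## 3. One-hot encoding the output

The tokenizer-based step that would turn each input vector into a (0,1)-vector is commented out below. For example, if a preprocessed vector contained the number 14, the 14th entry of the processed vector would be 1. The Titanic features are already numeric, so they are passed to the network unchanged.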

In [343]:
# One-hot encoding the output into vector mode, each of length 1000
#tokenizer = Tokenizer(num_words=1000)
#x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
#x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
#print(x_train[0])

We also one-hot encode the output.

In [344]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(650, 2)
(241, 2)
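
To show what the encoding does to each label, here is a minimal pure-Python sketch of the same transformation. The function `to_one_hot` is a hypothetical stand-in written for this example, not the Keras implementation.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 (hypothetical helper)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows

# Survived values of the first five training rows.
labels = [0, 1, 1, 1, 0]
encoded = to_one_hot(labels, 2)

for label, row in zip(labels, encoded):
    print(label, "->", row)
```

Each label becomes a row of length `num_classes`, which is why `y_train` changes shape from `(650,)` to `(650, 2)`.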


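## 4. Building the model

Here we build the model with `Sequential`. Feel free to experiment with different layers and sizes! You can also try adding dropout layers to reduce overfitting.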

In [345]:
# Building the model architecture with one hidden layer of 512 units
model = Sequential()
model.add(Dense(512, activation='relu', input_dim=9))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

# Compiling the model using categorical_crossentropy loss, and rmsprop optimizer.
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_49 (Dense)             (None, 512)               5120      
_________________________________________________________________
dropout_25 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_50 (Dense)             (None, 2)                 1026      
Total params: 6,146
Trainable params: 6,146
Non-trainable params: 0
_________________________________________________________________
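
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, where `dense_params` is a hypothetical helper for this example:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(9, 512)   # 9 input features into 512 hidden units
output = dense_params(512, 2)   # 512 hidden units into 2 classes
total = hidden + output         # Dropout adds no parameters

print(hidden, output, total)
```

The three values match the `Param #` column and the `Total params` line above.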


## 5. Training the model

Run the model. You can try different values for `batch_size` and the number of epochs!

In [348]:
# Running and evaluating the model
hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=20,
          validation_data=(x_test, y_test), 
          verbose=2)

Train on 650 samples, validate on 241 samples
Epoch 1/20
 - 0s - loss: 0.7134 - acc: 0.7400 - val_loss: 0.6397 - val_acc: 0.6929
Epoch 2/20
 - 0s - loss: 0.6723 - acc: 0.7215 - val_loss: 0.9407 - val_acc: 0.6639
Epoch 3/20
 - 0s - loss: 0.6956 - acc: 0.7277 - val_loss: 0.5784 - val_acc: 0.7884
Epoch 4/20
 - 0s - loss: 0.7206 - acc: 0.7108 - val_loss: 0.3914 - val_acc: 0.8382
Epoch 5/20
 - 0s - loss: 0.6943 - acc: 0.7354 - val_loss: 0.8012 - val_acc: 0.7303
Epoch 6/20
 - 0s - loss: 0.7399 - acc: 0.7123 - val_loss: 0.6573 - val_acc: 0.7676
Epoch 7/20
 - 0s - loss: 0.8268 - acc: 0.7185 - val_loss: 0.4223 - val_acc: 0.8091
Epoch 8/20
 - 0s - loss: 0.7240 - acc: 0.7323 - val_loss: 0.7273 - val_acc: 0.7510
Epoch 9/20
 - 0s - loss: 0.7925 - acc: 0.7369 - val_loss: 0.7316 - val_acc: 0.7801
Epoch 10/20
 - 0s - loss: 0.7840 - acc: 0.7431 - val_loss: 0.4140 - val_acc: 0.8257
Epoch 11/20
 - 0s - loss: 0.7425 - acc: 0.7385 - val_loss: 0.6153 - val_acc: 0.7842
Epoch 12/20
 - 0s - loss: 0.6591 - acc:
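
The `loss` values in the log are categorical cross-entropy computed on the softmax outputs and averaged over samples. The sketch below works through one sample in plain Python. The functions are simplified illustrations written for this example, not the Keras implementations, and the `eps` clipping value is an assumption.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class (one-hot y_true)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# A model with no preference between the two classes outputs 0.5 / 0.5.
probs = softmax([0.0, 0.0])
loss = categorical_crossentropy([1.0, 0.0], probs)

print(probs, round(loss, 4))
```

For two classes, an undecided prediction gives a loss of ln 2, roughly 0.693, which is a useful reference point when reading the loss columns in the log.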

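## 6. Evaluating the model

You can evaluate the model on the held-out rows, which gives you its accuracy. Note that these are the same 241 rows passed as `validation_data` during training. Can you get a result above 85%?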

In [349]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.838174274354
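
Accuracy here is the fraction of rows whose highest-probability class matches the true class. A minimal sketch of that computation on made-up predictions follows. The helpers `argmax` and `accuracy` and the toy probabilities are assumptions for illustration only.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true_one_hot, y_prob):
    """Share of rows where the predicted class equals the true class."""
    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_prob))
    return correct / len(y_true_one_hot)

# Made-up one-hot labels and softmax outputs for four passengers.
y_true = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_prob = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]

acc = accuracy(y_true, y_prob)
print(acc)

# The reported score corresponds to 202 correct rows out of 241.
print(round(202 / 241, 6))
```

With 241 held-out rows, the 85% target mentioned above means at least 205 correct predictions.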
