<hr>
import urllib.request: used to download the file<br>
import os: used to check whether the file exists
<hr>

In [1]:
import urllib.request
import os

In [2]:
url = 'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls'
filepath = 'data/titanic3.xls'
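if not os.path.isfile(filepath):
    # create the data directory first; urlretrieve does not create it
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded', result)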

In [3]:
import numpy
import pandas as pd

In [4]:
all_df = pd.read_excel(filepath)

<hr>
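pclass: passenger class<br>
sibsp: number of siblings or spouses also aboard<br>
parch: number of parents or children also aboard<br>
fare: passenger fare<br>
cabin: cabin number<br>
embarked: port of embarkation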
<hr>

In [5]:
all_df[:2]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


<hr>
Ignore the columns with little relevance and keep only the relevant ones
<hr>

In [6]:
cols = ['survived', 'name', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
all_df = all_df[cols]

In [7]:
all_df[:2]

Unnamed: 0,survived,name,pclass,sex,age,sibsp,parch,fare,embarked
0,1,"Allen, Miss. Elisabeth Walton",1,female,29.0,0,0,211.3375,S
1,1,"Allison, Master. Hudson Trevor",1,male,0.9167,1,2,151.55,S


<hr>
Use drop to remove the name column
<hr>

In [8]:
df = all_df.drop(['name'], axis = 1)

<hr>
Find which columns contain null values (missing data)
<hr>

In [9]:
all_df.isnull().sum()

survived      0
name          0
pclass        0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64

<hr>
For deep learning, every column must be numeric and cannot contain null values,<br>
so we fill the null values with the column mean
<hr>

In [10]:
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean)

In [11]:
fare_mean = df['fare'].mean()
df['fare'] = df['fare'].fillna(fare_mean)
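
<hr>
A minimal sketch of the same mean-fill step on toy data. The values below are made up for illustration and are not taken from titanic3.xls. It relies on mean() skipping NaN by default, so the fill value is computed from the observed rows only.
<hr>

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from titanic3.xls)
toy = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                    'fare': [10.0, 30.0, np.nan]})

# mean() ignores NaN, so each fill value is the mean of the observed rows
for col in ['age', 'fare']:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
print(toy.isnull().sum())  # every count should now be 0
```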

<hr>
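The sex column is originally text and must be converted to 0 and 1 before it can be used for machine learning training<br>
Use the map method to convert female to 0 and male to 1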
<hr>

In [12]:
df['sex'] = df['sex'].map({'female': 0, 'male': 1}).astype(int)
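
<hr>
A small sketch of how map behaves, using made-up values. Any value that is missing from the dictionary becomes NaN, which would make a following astype(int) fail, so it can be worth checking for unmapped values first. The 'unknown' entry is a hypothetical value added only to show that case.
<hr>

```python
import pandas as pd

mapping = {'female': 0, 'male': 1}

# All values are covered by the mapping, so the result is integer codes
sex = pd.Series(['female', 'male', 'male', 'female'])
codes = sex.map(mapping).astype(int)
print(codes.tolist())

# 'unknown' is a hypothetical value that the mapping does not cover
unmapped = pd.Series(['female', 'unknown']).map(mapping)
print(unmapped.isnull().tolist())  # flags the row that became NaN
```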

<hr>
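embarked has 3 categories (C, Q, S) and must be converted with one-hot encoding<br>
pandas provides one-hot encoding through get_dummies, which takes the following parameters:<br>
data: the DataFrame to convert<br>
columns: the columns to convert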
<hr>

In [13]:
x_One_Hot_df = pd.get_dummies(data = df, columns = ['embarked'])

In [14]:
x_One_Hot_df[:2]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S
0,1,1,0,29.0,0,0,211.3375,0,0,1
1,1,1,1,0.9167,1,2,151.55,0,0,1
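
<hr>
A toy illustration of get_dummies with made-up rows. Passing dtype=int is an addition that the notebook does not use: recent pandas versions return boolean indicator columns by default, and dtype=int keeps the 0/1 values shown above.
<hr>

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'fare': [7.25, 71.28, 8.05],
                    'embarked': ['S', 'C', 'Q']})

# Each category becomes its own indicator column named embarked_<category>
encoded = pd.get_dummies(data=toy, columns=['embarked'], dtype=int)
print(encoded)
```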


<hr>
For the deep learning training that follows, convert the DataFrame to an array
<hr>

In [15]:
ndarry = x_One_Hot_df.values

In [16]:
ndarry.shape

(1309, 10)

In [17]:
ndarry[:2]

array([[  1.    ,   1.    ,   0.    ,  29.    ,   0.    ,   0.    ,
        211.3375,   0.    ,   0.    ,   1.    ],
       [  1.    ,   1.    ,   1.    ,   0.9167,   1.    ,   2.    ,
        151.55  ,   0.    ,   0.    ,   1.    ]])

<hr>
Extract the label and the features
<hr>

In [18]:
Label = ndarry[:, 0]
Features = ndarry[:, 1:]
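
<hr>
The slicing above works because survived is the first column of the array. A tiny sketch with made-up rows shows the resulting shapes.
<hr>

```python
import numpy as np

# Two toy rows: column 0 is the label, the remaining columns are features
data = np.array([[1.0, 1.0, 0.0, 29.0],
                 [0.0, 3.0, 1.0, 22.0]])

label = data[:, 0]      # every row, first column only
features = data[:, 1:]  # every row, all columns after the first

print(label.shape, features.shape)
```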

In [19]:
from sklearn import preprocessing

<hr>
Use preprocessing.MinMaxScaler to normalize the features; the feature_range parameter sets the range after scaling to between 0 and 1
<hr>

In [20]:
minmax_scale = preprocessing.MinMaxScaler(feature_range = (0,1))

In [21]:
scaledFeatures = minmax_scale.fit_transform(Features)

In [22]:
scaledFeatures[:2]

array([[0.        , 0.        , 0.36116884, 0.        , 0.        ,
        0.41250333, 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.00939458, 0.125     , 0.22222222,
        0.2958059 , 0.        , 0.        , 1.        ]])
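
<hr>
For feature_range = (0, 1), min-max scaling maps each column with (x - min) / (max - min). The sketch below reproduces that formula with NumPy on a made-up column. It does not guard against a constant column, where max - min would be zero.
<hr>

```python
import numpy as np

# One toy feature column (hypothetical ages)
ages = np.array([[0.0], [20.0], [80.0]])

# Column-wise minimum and maximum
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)

# Min-max scaling to the range [0, 1]
scaled = (ages - col_min) / (col_max - col_min)
print(scaled.ravel())
```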

<hr>
Randomly split the data into training data and test data
<hr>

In [23]:
msk = numpy.random.rand(len(all_df)) < 0.8
train_df = all_df[msk]
test_df = all_df[~msk]

In [24]:
print('total: ', len(all_df),
      '\ntrain: ', len(train_df), 
      '\ntest: ', len(test_df))

total:  1309 
train:  1035 
test:  274
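
<hr>
Because the mask is random and unseeded, the train and test counts change from run to run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable; the seed value 42 and the toy DataFrame are arbitrary choices for illustration.
<hr>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed, chosen only for the example
toy = pd.DataFrame({'x': range(1000)})

# Each row lands in the training set with probability 0.8
msk = rng.random(len(toy)) < 0.8
train = toy[msk]
test = toy[~msk]

print('train:', len(train), 'test:', len(test))
```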


<hr>
Collect all of the preceding preprocessing commands into the PreprocessData function for later use
<hr>

In [29]:
def PreprocessData(raw_df):
    df = raw_df.drop(['name'], axis = 1)
    age_mean = df['age'].mean()
    df['age'] = df['age'].fillna(age_mean)
    fare_mean = df['fare'].mean()
    df['fare'] = df['fare'].fillna(fare_mean)
    df['sex'] = df['sex'].map({'female': 0, 'male': 1}).astype(int)
    y_One_Hot_df = pd.get_dummies(data = df, columns = ['embarked'])
    
    ndarray = y_One_Hot_df.values
    Features = ndarray[:, 1:]
    Label = ndarray[:, 0]
    
    minmax_scale = preprocessing.MinMaxScaler(feature_range = (0, 1))
    scaledFeature = minmax_scale.fit_transform(Features)
    
    return scaledFeature, Label

In [30]:
train_Features, train_Label = PreprocessData(train_df)
test_Features, test_Label = PreprocessData(test_df)

In [31]:
train_Features[:2]

array([[0.        , 0.        , 0.36116884, 0.        , 0.        ,
        0.41250333, 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.00939458, 0.125     , 0.22222222,
        0.2958059 , 0.        , 0.        , 1.        ]])

In [32]:
train_Label[:2]

array([1., 1.])
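
<hr>
PreprocessData fits a new MinMaxScaler on every call, so the training and test sets are scaled with their own minimum and maximum values, and get_dummies can produce fewer columns if a port is absent from one subset. One possible alternative, which is not what this notebook does, is to fit the scaler on the training data only and reuse it for the test data. The sketch below uses made-up rows; reindex with fill_value = 0 is an assumed way to align the columns.
<hr>

```python
import pandas as pd
from sklearn import preprocessing

# Toy data (hypothetical values); the test set contains only port S
train = pd.DataFrame({'age': [10.0, 30.0, 50.0], 'embarked': ['S', 'C', 'Q']})
test = pd.DataFrame({'age': [20.0, 40.0], 'embarked': ['S', 'S']})

train_x = pd.get_dummies(train, columns=['embarked'], dtype=int)
# Give the test frame the same columns, in the same order, as the training frame
test_x = (pd.get_dummies(test, columns=['embarked'], dtype=int)
          .reindex(columns=train_x.columns, fill_value=0))

# Fit on the training features only, then apply the same scaling to the test features
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_x.values)
test_scaled = scaler.transform(test_x.values)

print(test_scaled)
```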