# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

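# [Assignment Goal]
- Following the example notebook, observe how missing-value imputation and standardization / min-max scaling affect the numeric values in the Titanic survival prediction task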

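# [Assignment Focus]
- Observe how swapping in different imputation methods affects the features (In[4]~In[6], Out[4]~Out[6])
- Observe how swapping in different feature scaling methods affects the features (In[7]~In[8], Out[7]~Out[8])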

In [1]:
# Code block A: import everything needed and set the data path.
import os
import pandas            as pd
import numpy             as np
import seaborn           as sns
import matplotlib.pyplot as plt
import math
import warnings
from sklearn.preprocessing   import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model    import LogisticRegression
from sklearn.model_selection import cross_val_score
warnings.filterwarnings('ignore')
%matplotlib inline

# Path to the data folder, stored as data_folder
data_folder = 'C:/Users/Ynitsed/Documents/GitHub/2nd-ML100Days/data'

In [2]:
# t001 holds the path to the training data file
# t002 holds the DataFrame that pd.read_csv loads from that path
t001_train = os.path.join(data_folder, 'titanic_train.csv')
t002_train = pd.read_csv(t001_train)
print('Path of read in data: %s' %t001_train)
print(t002_train.shape)
t002_train.head()

Path of read in data: C:/Users/Ynitsed/Documents/GitHub/2nd-ML100Days/data\titanic_train.csv
(891, 12)


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [3]:
# t001 holds the path to the test data file
# t002 holds the DataFrame that pd.read_csv loads from that path
t001_test  = os.path.join(data_folder,  'titanic_test.csv')
t002_test  = pd.read_csv(t001_test)
print('Path of read in data: %s' %t001_test)
print(t002_test.shape)
t002_test.head()

Path of read in data: C:/Users/Ynitsed/Documents/GitHub/2nd-ML100Days/data\titanic_test.csv
(418, 11)


|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |


In [4]:
# Code block B-1: build t003; keep Survived from train (the labels) and PassengerId from test.
t003_train = t002_train['Survived']
t003_test  = t002_test['PassengerId']
# Code block B-2: build t004; drop PassengerId and Survived from train, and PassengerId from test.
t004_train = t002_train.drop(['PassengerId', 'Survived'] , axis=1)
t004_test  = t002_test.drop(['PassengerId'] , axis=1)
print(t004_train.shape)
print(t004_test.shape)
# Code block B-3: build t005 by stacking train and test row-wise (similar to SQL UNION ALL).
t005 = pd.concat([t004_train,t004_test])
print(t005.shape)

(891, 10)
(418, 10)
(1309, 10)


In [5]:
# Keep only the int64 and float64 numeric columns and store their names in num_features
num_features = []
for a, b in zip(t005.dtypes, t005.columns):
    if a == 'float64' or a == 'int64':
        num_features.append(b)
print(f'{len(num_features)} Numeric Features : {num_features}\n')

5 Numeric Features : ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
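
The `zip` loop above filters columns by comparing each dtype to `'int64'` or `'float64'`. As a minimal sketch of an alternative, pandas' `select_dtypes` can pick numeric columns directly. The frame `demo` and the name `num_features_demo` below are hypothetical stand-ins for `t005` and `num_features`, with illustrative values only.

```python
import pandas as pd

# Hypothetical miniature frame standing in for t005 (values are illustrative).
demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Age": [22.0, float("nan"), 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# select_dtypes(include="number") keeps every numeric column,
# including integer and float widths other than int64 / float64.
num_features_demo = demo.select_dtypes(include="number").columns.tolist()
print(num_features_demo)
```

Note that this selects a slightly broader set than the loop: any numeric dtype qualifies, not only `int64` and `float64`.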



In [6]:
# Drop the text columns so that only numeric columns remain
t006 = t005[num_features]
t006.head()

|   | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 |


In [7]:
# Count the number of training rows
train_cnt = t003_train.shape[0]
train_cnt

891

# Exercise 1
* In the imputation block, swap in and run at least two different fill values for the missing data. Which works better?

In [8]:
# Fill missing values with -1
t007 = t006.fillna(-1)

# Fit logistic regression
t008_train = t007[:train_cnt]
LR = LogisticRegression()
cross_val_score(LR, t008_train, t003_train, cv=5).mean()

0.6960299128976762

In [9]:
# Fill missing values with 0
t007 = t006.fillna(0)

# Fit logistic regression
t008_train = t007[:train_cnt]
LR = LogisticRegression()
cross_val_score(LR, t008_train, t003_train, cv=5).mean()

0.6971535084032942

In [10]:
# Fill missing values with the column mean
t007 = t006.fillna(t006.mean())

# Fit logistic regression
t008_train = t007[:train_cnt]
LR = LogisticRegression()
cross_val_score(LR, t008_train, t003_train, cv=5).mean()

0.6981761033723469
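
To see what each fill value does to a column before it reaches the model, here is a small sketch on a made-up `age` series (the values are illustrative and not taken from the Titanic data). It shows that filling with -1 or 0 pulls the column mean down, while filling with the mean leaves it unchanged.

```python
import pandas as pd

# Made-up column with two missing entries (illustrative values only).
age = pd.Series([22.0, float("nan"), 38.0, float("nan"), 26.0])

# The three strategies tried in the cells above.
fills = {
    "-1": age.fillna(-1),
    "0": age.fillna(0),
    "mean": age.fillna(age.mean()),
}

# Compare how each strategy shifts the column's mean and minimum.
for name, filled in fills.items():
    print(f"fill with {name}: mean={filled.mean():.2f}, min={filled.min():.2f}")
```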

# Exercise 2
* Use different scaling methods (raw values / min-max scaling / standardization) with a logistic regression model. Which performs best?

In [11]:
# Fill missing values with the column mean
t007 = t006.fillna(t006.mean())

# Apply min-max scaling
t008 = MinMaxScaler().fit_transform(t007)
t009_train = t008[:train_cnt]
LR = LogisticRegression()
cross_val_score(LR, t009_train, t003_train, cv=5).mean()

0.6993501991462476

In [14]:
# Fill missing values with the column mean
t007 = t006.fillna(t006.mean())

# Apply standardization
t008 = StandardScaler().fit_transform(t007)
t009_train = t008[:train_cnt]
LR = LogisticRegression()
cross_val_score(LR, t009_train, t003_train, cv=5).mean()

0.6959413955734954
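
The cells above compute the fill value and fit the scaler on the combined train and test rows, then cross-validate on the training rows. One alternative, sketched here as an assumption rather than as the course's method, is to wrap imputation and scaling in a scikit-learn pipeline so that both are re-fit inside each fold. The data below is synthetic (`X`, `y`, and `rng` are hypothetical names), so the score it prints says nothing about the Titanic results.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (not the real Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values

# Within each fold, the imputer and scaler are fit on the training part only.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Swapping `StandardScaler()` for `MinMaxScaler()`, or removing the scaling step, gives the other two variants of the comparison.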

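### Day 19: Material Direction and Goals
1. `.fillna(x)` fills null values; `x` can be a fixed number, the mean, or another variable.
2. Three scaling options are compared: raw values, `MinMaxScaler()`, and `StandardScaler()`. `LogisticRegression()` is the model used to evaluate them, not a scaling method.
3. `cv=5` means 5-fold cross-validation: the data is split into 5 folds.
4. How to use `for a, b in zip(case_a, case_b)`.

### Day 19: Skipped Parts
1. The other parameters of `cross_val_score`

### Day 19: Other Notes
Today's material was good: simple and clear.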
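
As a small worked example of the two scalers named in item 2, the sketch below applies the same formulas by hand with NumPy to a made-up array `x`. Min-max scaling maps values to the range [0, 1] via `(x - min) / (max - min)`, and standardization computes `(x - mean) / std` using the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature values (illustrative only).
x = np.array([0.0, 5.0, 10.0])

# Min-max scaling: smallest value becomes 0, largest becomes 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit variance (np.std defaults to ddof=0).
standard = (x - x.mean()) / x.std()

print(min_max)
print(standard)
```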