"""
# Framework of Machine Learning

Purpose: choose (or construct) the model with the best score among the candidates.

0. Preprocessing (the most important and most difficult step)

Handle missing data.
Extract or select features.
For example, which variables are relevant to the prediction? Can you build a new variable (feature) from the given ones?


1. Separate the given data into train data and test data

 To choose a model, we need a score (such as accuracy) on held-out data so that one model can be compared with another.
 The ratio len(train data):len(test data) is often set to 7:3 or 8:2 (but adjust it to the size of the data).

2. candidates for model

 model_candidates={model_1,model_2,...,model_M}
 A model can be deep learning, SVM or XGBoost and so on.

 for model in {model_1,model_2,...,model_M}:
     Do 3 and 4 as stated below.

3. Preparation for k-fold cross validation (if your model has hyper parameters)

 k is often set to 10 (but adjust it to the size of the data).
 Divide the "train data" into k pieces of roughly equal size.
 (If you have 101 samples and k=10, then one of the ten pieces has 11 samples and the others have 10.)
 Name them D_1, D_2,..., D_k.
 Candidates for the hyper parameter: {a_1,a_2,...,a_L}

 # PROCEDURE
 for alpha in {a_1,a_2,...,a_L}:
    
     cross_validation_score=0
    
     for i in {1,2,...,k}:
       
         train the model using alpha and {D_1,D_2,...,D_k}-{D_i}
         test the model using alpha and D_i
         cross_validation_score+=test score
        
     record cross_validation_score/k (the mean score for hyper parameter alpha)
 # END

 The candidate with the highest mean cross-validation score is your hyper parameter.


4. training and test

 Train the model on all of the train data (using the selected hyper parameter, if any) and evaluate it on the test data.


5. GOAL

 A model with the highest score is your model.
 
"""
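
The procedure in step 3 above can be sketched in plain Python. This is a minimal illustration, not the code used later in this notebook: `fold_sizes`, `k_fold_indices`, `select_hyperparameter`, and the `fit_and_score` callback are hypothetical names, and the toy scoring function stands in for actually training a model.

```python
def fold_sizes(n_samples, k):
    """Sizes of k roughly equal folds: the first n_samples % k folds get one extra sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes(n_samples, k):
        validation = indices[start:start + size]           # D_i
        train = indices[:start] + indices[start + size:]   # all folds except D_i
        yield train, validation
        start += size


def select_hyperparameter(candidates, n_samples, k, fit_and_score):
    """Return (best_alpha, best_mean_score) over the candidates.

    fit_and_score(alpha, train_idx, val_idx) is a hypothetical callback that
    trains a model with hyper parameter alpha on train_idx and returns its
    score on val_idx.
    """
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        total = 0.0
        for train_idx, val_idx in k_fold_indices(n_samples, k):
            total += fit_and_score(alpha, train_idx, val_idx)
        mean_score = total / k
        if mean_score > best_score:
            best_alpha, best_score = alpha, mean_score
    return best_alpha, best_score


sizes = fold_sizes(101, 10)  # one fold of 11 samples, nine folds of 10

# Toy scoring function (an assumption for illustration): peaks at alpha = 1.
toy_score = lambda alpha, train_idx, val_idx: -abs(alpha - 1)
best_alpha, best_score = select_hyperparameter([0.1, 1, 10], 101, 10, toy_score)
```

In a real run, `fit_and_score` would fit the model on the training indices and return its accuracy on the validation indices.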

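## First, import pandas, numpy, and scipy

- **numpy**: a library for working with N-dimensional arrays
- **pandas**: a library for reading CSV files, handling data with mixed data types, and preprocessing data
- **scipy**: a library providing a wide range of mathematical functions
    -  **stats**: the statistics module inside scipy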
In [1]:
import pandas as pd 
import numpy as np
from scipy import stats

`import pandas as pd`: from here on, functions in the pandas library can be called as `pd.`.

`import numpy as np`: likewise, as `np.`.

`from scipy import stats`: from here on, functions in scipy's `stats` module can be called as `stats.`.

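## Load the train and test data

- `pd.read_csv()`: reads a comma-separated values (CSV) file into a DataFrame
    - `index_col`: the column to use as the index, given as an int (position) or str (name)
- `DataFrame.head()`: returns the first N rows of the DataFrame (5 by default)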
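
As a small illustration of what `index_col` changes, the sketch below reads the same CSV text twice. The three-row sample is made up for the example and only mimics the layout of `train.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as train.csv.
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)  # PassengerId becomes the index
without_index = pd.read_csv(io.StringIO(csv_text))            # default integer index 0, 1, 2

print(with_index.head(2))
print(list(with_index.columns))     # PassengerId is no longer a regular column
print(list(without_index.columns))  # PassengerId stays a regular column
```

This is why `PassengerId` appears as the index for the train data but as an ordinary column for the test data in the cells that follow.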

In [2]:
D=pd.read_csv('train.csv',index_col=0) 
test=pd.read_csv('test.csv')

In [3]:
D.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


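## Separate the Survived column of the train data from the rest
- `DataFrame[column name]`: selects a single column (the result is one-dimensional, so it is returned as a Series)
- `DataFrame.drop`: removes the given labels
    - `axis`: 0 drops rows, 1 drops columns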

In [5]:
t=D['Survived'] 
X=D.drop('Survived',axis=1)

In [6]:
# t.head() on a Series does not render as a table, so convert it to a DataFrame first
pd.DataFrame(t).head()

PassengerId,Survived
1,0
2,1
3,1
4,1
5,0


In [7]:
X.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Drop the columns not used this time


In [8]:
delete_list=['Name','Ticket','Cabin','Embarked']
X=X.drop(delete_list,axis=1) 

Drop the same columns from the test data, and set PassengerId aside for the submission file.

In [9]:
test=test.drop(delete_list,axis=1) 
passenser_id=test['PassengerId'] 
test=test.drop('PassengerId',axis=1) 

In [10]:
test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,male,34.5,0,0,7.8292
1,3,female,47.0,1,0,7.0
2,2,male,62.0,0,0,9.6875
3,3,male,27.0,0,0,8.6625
4,3,female,22.0,1,1,12.2875


- `np.array()`: creates an N-dimensional array; needed for the machine-learning computations below.

In [11]:
X=np.array(X)
t=np.array(t)
test=np.array(test) 

## Convert sex to 0/1
- `len()`: returns the length of the first axis, i.e. the number of rows here.

In [12]:
for i in range(len(X)):  
    
    if X[i,1]=='male':
        X[i,1]=1
        
    else:
        X[i,1]=0

for i in range(len(test)):
    if test[i,1]=='male':
        test[i,1]=1
    else:
        test[i,1]=0
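
The row-by-row loops above can also be written as a single column-wise operation. The sketch below is an alternative under the same assumption as the loops (column 1 holds the strings `'male'` / `'female'`); `X_demo` is a made-up array, not the notebook's data.

```python
import numpy as np

# Hypothetical object array with the same leading columns as X: Pclass, Sex, Age.
X_demo = np.array([[3, 'male', 22.0],
                   [1, 'female', 38.0],
                   [3, 'female', 26.0]], dtype=object)

# Compare the whole Sex column at once: 1 where 'male', 0 otherwise.
X_demo[:, 1] = np.where(X_demo[:, 1] == 'male', 1, 0)

print(X_demo[:, 1])
```

As in the loop version, any value other than `'male'` is encoded as 0.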

In [13]:
pd.DataFrame(X).head()

Unnamed: 0,0,1,2,3,4,5
0,3,1,22,1,0,7.25
1,1,0,38,1,0,71.2833
2,3,0,26,0,0,7.925
3,1,0,35,1,0,53.1
4,3,1,35,0,0,8.05


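## Convert X and test to float for the computations that follow

- `np.asarray()`: converts the input to an ndarray. When the input is already an array and `dtype` is given, the dtype is converted.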

In [14]:
X=np.asarray(X,dtype=float)                        
test=np.asarray(test,dtype=float)
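
A short sketch of what the conversion does, using a made-up object array in place of `X`: the values are reinterpreted as floats, NaN stays NaN, and calling `np.asarray` on an array that already has the requested type returns the same object rather than a copy.

```python
import numpy as np

# Hypothetical mixed object array, similar to X after the 0/1 encoding of Sex.
mixed = np.array([[3, 1, 22.0],
                  [1, 0, np.nan]], dtype=object)

as_float = np.asarray(mixed, dtype=float)   # object -> float64
same = np.asarray(as_float, dtype=float)    # already float64: no copy is made

print(mixed.dtype, as_float.dtype)
print(same is as_float)
```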

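## Replace missing values with the mean
- `np.nanmean()`: computes the mean along an axis while ignoring NaN (missing values). Without `axis`, the mean is taken over all elements; `axis=0` gives one mean per column.
- `np.shape()`: returns the array dimensions as (rows, columns); `[0]` is the number of rows.
- `np.isnan()`: returns True where the value is NaN.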

In [15]:
X_mean=np.nanmean(X,axis=0)
        
for i in range(np.shape(X)[0]):
    for j in range(np.shape(X)[1]):
        if np.isnan(X[i,j]):
            X[i,j]=X_mean[j]

for i in range(np.shape(test)[0]):
    for j in range(np.shape(test)[1]):
        if np.isnan(test[i,j]):
            test[i,j]=X_mean[j]            
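
The double loops above can be expressed with `np.where`, which picks the column mean wherever an entry is NaN. This is a sketch on made-up arrays (`X_demo`, `test_demo`); as in the cell above, the test data is filled with the means computed from the train data.

```python
import numpy as np

X_demo = np.array([[22.0, 7.25],
                   [np.nan, 71.28],
                   [26.0, np.nan]])
test_demo = np.array([[np.nan, 8.05]])

col_mean = np.nanmean(X_demo, axis=0)   # per-column mean, ignoring NaN

# Where an entry is NaN, take the column mean instead (col_mean broadcasts across rows).
X_filled = np.where(np.isnan(X_demo), col_mean, X_demo)
test_filled = np.where(np.isnan(test_demo), col_mean, test_demo)

print(X_filled)
print(test_filled)
```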

## Normalization
- `stats.zscore()`: standardizes each column to mean 0 and variance 1

In [16]:
X=stats.zscore(X)
test=stats.zscore(test)
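
Note that the cell above standardizes `test` with its own mean and standard deviation. A common alternative, sketched here on made-up arrays, is to compute the statistics on the train data only and reuse them for the test data, so that both are on the same scale. This is an assumption about good practice, not what the notebook's outputs below were produced with.

```python
import numpy as np

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])
test_demo = np.array([[2.0, 30.0]])

mu = X_demo.mean(axis=0)     # per-column mean of the train data
sigma = X_demo.std(axis=0)   # per-column population std (ddof=0), the same default as stats.zscore

X_std = (X_demo - mu) / sigma          # equivalent to stats.zscore(X_demo)
test_std = (test_demo - mu) / sigma    # reuse the train statistics

print(X_std.mean(axis=0), X_std.std(axis=0))
print(test_std)
```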

## Look at the finished data

In [17]:
pd.DataFrame(X,columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']).head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,0.827377,0.737695,-0.592481,0.432793,-0.473674,-0.502445
1,-1.566107,-1.355574,0.638789,0.432793,-0.473674,0.786845
2,0.827377,-1.355574,-0.284663,-0.474545,-0.473674,-0.488854
3,-1.566107,-1.355574,0.407926,0.432793,-0.473674,0.42073
4,0.827377,0.737695,0.407926,-0.474545,-0.473674,-0.486337


In [18]:
pd.DataFrame(t,columns=['Survived']).head()

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0


In [19]:
pd.DataFrame(test,columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']).head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,0.873482,0.755929,0.344284,-0.49947,-0.400248,-0.498258
1,0.873482,-1.322876,1.334655,0.616992,-0.400248,-0.513125
2,-0.315819,0.755929,2.523099,-0.49947,-0.400248,-0.46494
3,0.873482,0.755929,-0.249938,-0.49947,-0.400248,-0.483317
4,0.873482,-1.322876,-0.646086,0.616992,0.619896,-0.418323


In [20]:
pd.DataFrame(passenser_id).head()

Unnamed: 0,PassengerId
0,892
1,893
2,894
3,895
4,896


## Split the given train data further into train and test data
- `sklearn.model_selection.train_test_split`: splits arrays randomly. `test_size` sets the fraction assigned to the test part.

In [21]:
# 1. separate the given train data into train data and test data
from sklearn.model_selection import train_test_split

test_ratio=0.2
X_train, X_test, t_train, t_test =train_test_split(X, t, test_size=test_ratio,shuffle=True) 
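
Because the split above is random, the scores printed later change from run to run. The sketch below shows two optional arguments of `train_test_split` that make the split repeatable and class-balanced; the seed value 0 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)
t_demo = np.array([0] * 10 + [1] * 10)

X_tr, X_te, t_tr, t_te = train_test_split(
    X_demo, t_demo, test_size=0.2, shuffle=True,
    random_state=0,    # fixed seed (an arbitrary choice) makes the split repeatable
    stratify=t_demo,   # keep the 0/1 ratio the same in both parts
)

print(X_tr.shape, X_te.shape)
print(t_te)
```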

## Import the modules needed for machine learning

In [22]:
# 2. candidates for models
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import datetime
dt_now = datetime.datetime.now()

## Prepare for k-fold cross-validation

Store the indices for the 10 (`split_size`) train/validation splits in `train_v` and `test_v`.

In [35]:
from sklearn.model_selection import KFold

split_size=10
k_fold=KFold(n_splits=split_size,shuffle=True)

train_v=[]
test_v=[]

for train_indices, validation_indices in k_fold.split(X_train):
    train_v.append(train_indices)
    test_v.append(validation_indices)
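
To check what `KFold` hands back, the sketch below splits 101 made-up samples into 10 folds and confirms that every sample lands in exactly one validation fold. The `random_state=0` seed is an arbitrary addition for repeatability and is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(101).reshape(-1, 1)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

val_sizes = []
seen = []
for train_idx, val_idx in k_fold.split(X_demo):
    val_sizes.append(len(val_idx))   # size of this validation fold
    seen.extend(val_idx.tolist())    # collect validation indices across folds

print(val_sizes)
print(len(seen), len(set(seen)))
```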



## LogisticRegression

In [23]:
validation_score=-100
hyperpara_opt=0
for alpha in [0.001,0.01,0.1,1,10,100,1000]:
    
    validation_score_sum=0
    model=LogisticRegression(C=alpha,solver='lbfgs',max_iter=10000) 
   
    for i in range(split_size):
    
        model.fit(X_train[train_v[i]],t_train[train_v[i]])
        validation_score_sum+=model.score(X_train[test_v[i]],t_train[test_v[i]])
        
    if validation_score_sum/split_size>validation_score:
        validation_score=validation_score_sum/split_size
        hyperpara_opt=alpha        

print('LR_validation score:'+str(validation_score))
print('LR_hyperparameter:'+str(hyperpara_opt))

LR_validation score:0.7894170579029736
LR_hyperparameter:0.1
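
The manual loop above can also be written with scikit-learn's `cross_val_score`, which fits and scores the model on every fold. This is a sketch on synthetic data from `make_classification` (not the Titanic data), with a fixed seed added for repeatability, so its numbers are not comparable with the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / t_train.
X_demo, t_demo = make_classification(n_samples=200, n_features=6, random_state=0)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

mean_scores = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=10000)
    # One accuracy per fold; the mean is the cross-validation score for this C.
    scores = cross_val_score(model, X_demo, t_demo, cv=k_fold)
    mean_scores[C] = scores.mean()

best_C = max(mean_scores, key=mean_scores.get)
print(best_C, mean_scores[best_C])
```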


In [24]:
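# 4. training and test
# Note: this cell uses C=1 and max_iter=100, not the cross-validated hyperpara_opt from the previous cell;
# pass C=hyperpara_opt to train with the selected value.
model=LogisticRegression(C=1,solver='lbfgs',max_iter=100)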
model.fit(X_train,t_train)
print('score:',model.score(X_test,t_test))

outputLR=pd.DataFrame(index=passenser_id)
pred=model.predict(test)
outputLR['Survived']=pred
outputLR.to_csv(dt_now.strftime('%Y-%m-%d-%H-%M-%S')+'outputLR'+'.csv')
outputLR.head()

score: 0.8491620111731844


PassengerId,Survived
892,0
893,0
894,0
895,0
896,1


## Multi-layer perceptron classifier

In [25]:
model=MLPClassifier(hidden_layer_sizes=(80,80),alpha=0.001,max_iter=100000,activation='relu')
model.fit(X_train,t_train)
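# Note: this prints the accuracy on the training data; the other models report model.score(X_test,t_test).
print(model.score(X_train,t_train))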

outputMLP=pd.DataFrame(index=passenser_id)
pred=model.predict(test)
outputMLP['Survived']=pred
outputMLP.to_csv(dt_now.strftime('%Y-%m-%d-%H-%M-%S')+'outputMLP'+'.csv')
outputMLP.head()

0.8764044943820225


PassengerId,Survived
892,0
893,0
894,0
895,0
896,0


## Support vector classifier

In [26]:
validation_score=-100
param_list=[0.001,0.01,0.1,1,10,100]

for C in param_list:
    for gamma in param_list:
    
        validation_score_sum=0
        model=SVC(C=C,kernel='rbf',gamma=gamma,degree=3)
   
        for i in range(split_size):
    
            model.fit(X_train[train_v[i]],t_train[train_v[i]])
            validation_score_sum+=model.score(X_train[test_v[i]],t_train[test_v[i]])
        
        if validation_score_sum/split_size>validation_score:
            validation_score=validation_score_sum/split_size
            hyperpara_C=C
            hyperpara_gamma=gamma
            
print('SVC_validation score:'+str(validation_score))
print('SVC_hyperpara C:'+str(hyperpara_C))
print('SVC_hyperpara gamma:'+str(hyperpara_gamma))

SVC_validation score:0.8175078247261345
SVC_hyperpara C:10
SVC_hyperpara gamma:0.1
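
The nested search over `C` and `gamma` is what scikit-learn's `GridSearchCV` automates. The sketch below runs the same grid on synthetic data from `make_classification` rather than the Titanic data. One difference to be aware of: passing `cv=10` for a classifier uses stratified folds without shuffling, whereas the loop above uses a shuffled `KFold`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train / t_train.
X_demo, t_demo = make_classification(n_samples=200, n_features=6, random_state=0)

param_list = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': param_list, 'gamma': param_list},
    cv=10,   # 10 folds, as in the notebook
)
search.fit(X_demo, t_demo)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is a model refit on all of the supplied data with the best parameters, so it can be scored on held-out data directly.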


In [27]:
model=SVC(C=hyperpara_C,kernel='rbf',gamma=hyperpara_gamma,degree=3)
model.fit(X_train,t_train)
print(model.score(X_test,t_test))

outputSVC=pd.DataFrame(index=passenser_id)
pred=model.predict(test)
outputSVC['Survived']=pred
outputSVC.to_csv(dt_now.strftime('%Y-%m-%d-%H-%M-%S')+'outputSVC'+'.csv')
outputSVC.head()

0.8715083798882681


PassengerId,Survived
892,0
893,0
894,0
895,0
896,0


## Random forest classifier

In [28]:
validation_score=-100

for n_estimators in [10,20,30,40,50,60,70,80,100]:
    
    validation_score_sum=0
    model=RandomForestClassifier(n_estimators=n_estimators)
   
    for i in range(split_size):
    
        model.fit(X_train[train_v[i]],t_train[train_v[i]])
        validation_score_sum+=model.score(X_train[test_v[i]],t_train[test_v[i]])
        
    if validation_score_sum/split_size>validation_score:
        validation_score=validation_score_sum/split_size
        hyperpara_n_estimators=n_estimators        

print('RFC_validation score:'+str(validation_score))
print('RFC_hyperparameter:'+str(hyperpara_n_estimators))

RFC_validation score:0.8077464788732394
RFC_hyperparameter:60


In [29]:
model=RandomForestClassifier(n_estimators=hyperpara_n_estimators)
model.fit(X_train,t_train)
print(model.score(X_test,t_test))

outputRFC=pd.DataFrame(index=passenser_id)
pred=model.predict(test)
outputRFC['Survived']=pred
outputRFC.to_csv(dt_now.strftime('%Y-%m-%d-%H-%M-%S')+'outputRFC'+'.csv')
outputRFC.head()

0.8603351955307262


PassengerId,Survived
892,0
893,0
894,1
895,0
896,0
