<a href="https://colab.research.google.com/github/alicelindel3/ibm5100/blob/main/cp1/cross_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Validation
We use cross-validation to address the problem of overfitting.
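
As a rough illustration of why a held-out estimate matters, the sketch below compares a model's accuracy on its own training data with its cross-validated accuracy. The synthetic dataset, the `DecisionTreeClassifier`, and every parameter value are assumptions chosen for the illustration; none of them are part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 10% label noise (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)

# An unconstrained decision tree can memorize the training set
clf = DecisionTreeClassifier(random_state=0)
train_acc = clf.fit(X, y).score(X, y)

# 5-fold cross-validation scores each fold with a model that never saw that fold
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(f"training accuracy:        {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f}")
```

A large gap between the two numbers is the symptom of overfitting that cross-validation makes visible.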

## Preparing the Data
We import the required libraries, then load and preprocess the data.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

import lightgbm as lgb

path = "/content/drive/My Drive/Colab Notebooks/"

train_data = pd.read_csv(path+'train_Perished.csv')  # training data
test_data = pd.read_csv(path+'test_Perished.csv')  # test data

test_id = test_data["PassengerId"]  # used when submitting the results

data = pd.concat([train_data, test_data], sort=False)  # combine training and test data

# Convert categorical data
# (assign the result back instead of calling inplace=True on a column selection,
#  which is unreliable in recent pandas versions)
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Embarked"] = data["Embarked"].fillna("S")
data["Embarked"] = data["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Fill in missing values with the column mean
data["Fare"] = data["Fare"].fillna(data["Fare"].mean())
data["Age"] = data["Age"].fillna(data["Age"].mean())

# Create a new feature
data["Family"] = data["Parch"] + data["SibSp"]

# Drop unneeded features
data.drop(["Name", "PassengerId", "SibSp", "Parch", "Ticket", "Cabin"],
          axis=1, inplace=True)

# Build the inputs and the labels
train_data = data[:len(train_data)]
test_data = data[len(train_data):]
t = train_data["Perished"]  # labels
x_train = train_data.drop("Perished", axis=1)  # training inputs
x_test = test_data.drop("Perished", axis=1)  # test inputs

x_train.head()

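| | Pclass | Sex | Age | Fare | Embarked | Family |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1 | 0 | 1 |
| 4 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |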


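## Cross-Validation
We run cross-validation with scikit-learn's `StratifiedKFold`.  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html  
`StratifiedKFold` keeps the proportion of 0s and 1s in each validation fold roughly the same as in the full training set.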
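
To make the stratification concrete, here is a minimal sketch that splits a toy label array and measures the share of class 1 in every validation fold. The labels, the placeholder features, and the `random_state` value are hypothetical and unrelated to the Titanic data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 60 zeros and 40 ones (hypothetical data)
labels = np.array([0] * 60 + [1] * 40)
features = np.zeros((len(labels), 1))  # placeholder features; only the labels drive the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_ratios = []
for train_idx, valid_idx in skf_demo.split(features, labels):
    # Fraction of class 1 in this validation fold
    fold_ratios.append(float(labels[valid_idx].mean()))

print(fold_ratios)
```

With a 60/40 class balance and five folds, each validation fold should hold 20 samples with a class-1 fraction of 0.4, matching the full dataset.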

In [3]:
y_valids = np.zeros((len(x_train),))  # predictions on the validation data
y_tests = []  # predictions on the test data, one array per fold

skf = StratifiedKFold(n_splits=5, shuffle=True)

# Hyperparameter settings
params = {
    "objective": "binary",  # binary classification
    "max_bin": 300,  # maximum number of bins for feature values
    "learning_rate": 0.05,  # learning rate
    "num_leaves": 32  # maximum number of leaves per tree
}

categorical_features = ["Embarked", "Pclass", "Sex"]

for ids_train, ids_valid in skf.split(x_train, t):
    # split() yields positional indices, so select rows with iloc
    x_tr = x_train.iloc[ids_train]
    x_val = x_train.iloc[ids_valid]
    t_tr = t.iloc[ids_train]
    t_val = t.iloc[ids_valid]

    # Create the datasets
    lgb_train = lgb.Dataset(x_tr, t_tr, categorical_feature=categorical_features)
    lgb_val = lgb.Dataset(x_val, t_val, reference=lgb_train, categorical_feature=categorical_features)

    # Train the model
    # (the verbose_eval and early_stopping_rounds arguments were removed from
    #  lgb.train in LightGBM 4.0; the callbacks below replace them)
    model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_val],
                      num_boost_round=500,  # maximum number of boosting rounds
                      callbacks=[
                          lgb.log_evaluation(period=20),  # logging interval
                          lgb.early_stopping(stopping_rounds=10),  # stop after 10 rounds without improvement
                      ])

    # Store the results
    y_valids[ids_valid] = model.predict(x_val, num_iteration=model.best_iteration)
    y_test = model.predict(x_test, num_iteration=model.best_iteration)
    y_tests.append(y_test)



Training until validation scores don't improve for 10 rounds.
[20]	training's binary_logloss: 0.431602	valid_1's binary_logloss: 0.435809
[40]	training's binary_logloss: 0.35522	valid_1's binary_logloss: 0.389465
[60]	training's binary_logloss: 0.312695	valid_1's binary_logloss: 0.38188
[80]	training's binary_logloss: 0.280017	valid_1's binary_logloss: 0.38258
Early stopping, best iteration is:
[70]	training's binary_logloss: 0.294628	valid_1's binary_logloss: 0.379567
Training until validation scores don't improve for 10 rounds.
[20]	training's binary_logloss: 0.429995	valid_1's binary_logloss: 0.431091
[40]	training's binary_logloss: 0.352603	valid_1's binary_logloss: 0.374046
Early stopping, best iteration is:
[47]	training's binary_logloss: 0.3356	valid_1's binary_logloss: 0.368723
Training until validation scores don't improve for 10 rounds.
[20]	training's binary_logloss: 0.417613	valid_1's binary_logloss: 0.4924
[40]	training's binary_logloss: 0.340479	valid_1's binary_logloss: 

## Accuracy
Using the predictions on the validation data and the ground-truth labels, we compute the accuracy.

In [4]:
y_valids_bin = (y_valids>0.5).astype(int)  # convert the predictions to 0 or 1
accuracy_score(t, y_valids_bin)  # compute the accuracy

0.813692480359147
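
The calculation above amounts to thresholding the predicted probabilities and counting matches. A small sketch with made-up numbers (the `probs` and `truth` arrays are hypothetical) shows that `accuracy_score` agrees with the manual fraction of correct labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical predicted probabilities and true labels (made-up values)
probs = np.array([0.91, 0.20, 0.65, 0.40, 0.08, 0.77])
truth = np.array([1, 0, 0, 0, 0, 1])

preds = (probs > 0.5).astype(int)  # threshold at 0.5
acc = accuracy_score(truth, preds)  # fraction of predictions that match the labels

# Equivalent manual computation
acc_manual = float((preds == truth).mean())

print(acc, acc_manual)
```

Here five of the six predictions match, so both values come out to 5/6.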

## Submission Data
We format the submission data and save it to a CSV file.

In [5]:
y_test_subm = sum(y_tests) / len(y_tests)  # average the predictions across folds
y_test_subm = (y_test_subm > 0.5).astype(int)  # convert the averaged predictions to 0 or 1

# Arrange the format
perished_test = pd.Series(y_test_subm, name="Perished")
subm_data = pd.concat([test_id, perished_test], axis=1)

# Save the CSV file for submission
subm_data.to_csv("submission_cv.csv", index=False)

subm_data

| | PassengerId | Perished |
|---|---|---|
| 0 | 892 | 1 |
| 1 | 893 | 1 |
| 2 | 894 | 1 |
| 3 | 895 | 1 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 1 |
| 414 | 1306 | 0 |
| 415 | 1307 | 1 |
| 416 | 1308 | 1 |
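
The fold-averaging step can be checked on a toy example. In this sketch, the three per-fold probability arrays are hypothetical values. It averages them element-wise and then thresholds the mean, so a single fold that disagrees does not decide the label on its own.

```python
import numpy as np

# Hypothetical per-fold probabilities for four test rows (made-up values)
fold_preds = [
    np.array([0.9, 0.4, 0.2, 0.6]),
    np.array([0.8, 0.6, 0.1, 0.4]),
    np.array([0.7, 0.3, 0.3, 0.4]),
]

# Element-wise mean across folds; same result as sum(fold_preds) / len(fold_preds)
avg_pred = np.mean(fold_preds, axis=0)

# Label taken from the averaged probability
avg_label = (avg_pred > 0.5).astype(int)

print(avg_pred)
print(avg_label)
```

For the second row, one fold predicts above 0.5 and two predict below, and the averaged probability falls below the threshold.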


In [6]:
from google.colab import files
files.download('submission_cv.csv')
