<a href="https://colab.research.google.com/github/fm-yodai/kaggle-training/blob/main/titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h3>Preparation</h3>
<ul>
<li>Mount Google Drive</li>
<li>Set up the Kaggle API</li>
</ul>

**Mount Google Drive**

Before running the cell below, create a folder named "kaggle" directly under "My Drive" in Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


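**Set up the Kaggle API**

1. Sign in to [Kaggle](https://www.kaggle.com/), then click the icon in the upper right and select Account.
2. In the API section, click "Create New API Token". A file named kaggle.json is downloaded.
3. Place kaggle.json in the "kaggle" folder directly under "My Drive" in Google Drive.
4. Run the cell below. The Kaggle API client reads the `username` and `key` fields from ~/.kaggle/kaggle.json.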

In [2]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
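
If the Kaggle commands later fail with an authentication error, it can help to inspect the copied token first. The following is a minimal sketch, not part of the original notebook: the helper name `check_kaggle_credentials` is hypothetical, and it only checks that the file is valid JSON with non-empty `username` and `key` fields and that group and others have no permission bits set (the effect of `chmod 600`).

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper for illustration; it does not contact Kaggle.
    """
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    with open(path, encoding="utf-8") as f:
        try:
            creds = json.load(f)
        except json.JSONDecodeError:
            return [f"{path} is not valid JSON"]

    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # The token file is expected to hold these two fields.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")

    # chmod 600 leaves no permission bits for group or others.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


if __name__ == "__main__":
    token_path = os.path.expanduser("~/.kaggle/kaggle.json")
    for problem in check_kaggle_credentials(token_path):
        print(problem)
```

An empty result means the file looks structurally fine; it does not prove that the key is still valid on the Kaggle side.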

<h3>Downloading the data</h3>
<ul>
<li>Download the data with the Kaggle API</li>
<li>Extract the downloaded archive</li>
</ul>

**Download the data with the Kaggle API**

In [3]:
!kaggle competitions download -c titanic

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 9.09MB/s]


**Extract the downloaded archive**

In [4]:
!unzip -o titanic.zip -d /content/drive/MyDrive/kaggle/titanic/data

Archive:  titanic.zip
  inflating: /content/drive/MyDrive/kaggle/titanic/data/gender_submission.csv  
  inflating: /content/drive/MyDrive/kaggle/titanic/data/test.csv  
  inflating: /content/drive/MyDrive/kaggle/titanic/data/train.csv  
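
The extraction step can also be done from Python with the standard `zipfile` module, which overwrites files that already exist instead of prompting. This is a minimal sketch; the function name `extract_archive` is an assumption made for illustration.

```python
import os
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites existing files without asking.
        archive.extractall(dest_dir)
        return archive.namelist()
```

In this notebook the call would be `extract_archive('titanic.zip', '/content/drive/MyDrive/kaggle/titanic/data')`, so the cell can be re-run without any interactive input.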


<h3>Loading the data</h3>
<ul>
<li>Load the data</li>
<li>Inspect the data</li>
</ul>

**Load the data**

In [5]:
import pandas as pd
train = pd.read_csv('/content/drive/MyDrive/kaggle/titanic/data/train.csv')
test = pd.read_csv('/content/drive/MyDrive/kaggle/titanic/data/test.csv')

**Inspect the data**

In [6]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
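
`head()` shows only the first five rows, so gaps in other rows are easy to miss. A per-column count of missing values makes the preprocessing needs explicit. The sketch below is an illustration only: `missing_summary` is a hypothetical helper, and the toy frame uses made-up values rather than the Titanic data.

```python
import pandas as pd


def missing_summary(df):
    """Return missing counts and ratios for the columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only; these values are not from the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, None, 8.05],
    "Pclass": [3, 1, 3, 3],
})
print(missing_summary(toy))
```

In the notebook, `missing_summary(train)` and `missing_summary(test)` would list the columns that need imputation in each file.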


<h3>Preprocessing the data</h3>
<ul>
<li>Fill in missing values</li>
<li>Encode categorical variables</li>
<li>Select features</li>
</ul>

**Fill in missing values**

In [8]:
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna('S')
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
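
The cell above fills the test set with statistics computed on the test set itself. A common alternative, shown here as a sketch rather than as the notebook's method, is to compute fill values on the training data only and reuse them for both frames, so the test data never influences preprocessing. The helper names and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def fit_fill_values(train_df, numeric_cols, categorical_cols):
    """Compute fill values from the training frame only."""
    fill = {col: train_df[col].median() for col in numeric_cols}
    for col in categorical_cols:
        fill[col] = train_df[col].mode().iloc[0]  # most frequent value
    return fill


def apply_fill_values(df, fill):
    """Return a copy of df with missing values replaced, column by column."""
    usable = {col: value for col, value in fill.items() if col in df.columns}
    return df.fillna(value=usable)


# Toy frames for illustration only; these values are not from the Titanic data.
train_toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Fare": [10.0, 20.0, 30.0],
    "Embarked": ["S", None, "S"],
})
test_toy = pd.DataFrame({
    "Age": [None, 50.0],
    "Fare": [None, 5.0],
    "Embarked": ["C", "Q"],
})

fill_values = fit_fill_values(train_toy, ["Age", "Fare"], ["Embarked"])
train_filled = apply_fill_values(train_toy, fill_values)
test_filled = apply_fill_values(test_toy, fill_values)
print(fill_values)
print(test_filled)
```

Whether this changes the score on such a small dataset is not something the notebook measures; the point is only to keep one set of fill values for both files.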

**Encode categorical variables**

In [9]:
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
test['Sex'] = test['Sex'].map({'female': 0, 'male': 1}).astype(int)
test['Embarked'] = test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
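
`Series.map` with a dict returns NaN for any label that is missing from the dict, and the following `.astype(int)` then fails with an error that does not name the offending value. A small wrapper can report unmapped labels up front. This is a sketch; `encode_category` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def encode_category(series, mapping):
    """Map labels to integers, raising a clear error on unmapped or missing labels."""
    encoded = series.map(mapping)
    # Rows that became NaN had no entry in the mapping (or were missing to begin with).
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"{series.name}: values without a code: {unmapped}")
    return encoded.astype(int)
```

For example, `encode_category(train['Embarked'], {'S': 0, 'C': 1, 'Q': 2})` would raise if `Embarked` still contained missing values because the imputation cell had not been run.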

**Select features**

In [10]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X_train = train[features]
X_test = test[features]
y_train = train['Survived']

<h3>Training the model</h3>
<ul>
<li>Define the model</li>
<li>Train the model</li>
</ul>

**Define the model**

In [11]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

**Train the model**

In [12]:
model.fit(X_train, y_train)
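
Fitting on all of the training data gives no estimate of how the model will score before submitting. One way to get a rough estimate is k-fold cross-validation with scikit-learn's `cross_val_score`. The sketch below uses synthetic data (`X_demo`, `y_demo`) as a stand-in so that it runs on its own; in the notebook, `X_train` and `y_train` would take their place. The choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train; not the Titanic data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Same hyperparameters as the model defined in the notebook.
demo_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# Each fold trains on 4/5 of the rows and scores accuracy on the remaining 1/5.
scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The printed numbers describe the synthetic data only and say nothing about the Titanic score.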

<h3>Prediction</h3>
<ul>
<li>Predict on the test data</li>
</ul>

**Predict on the test data**

In [13]:
predictions = model.predict(X_test)

<h3>Submission</h3>
<ul>
<li>Create the submission file</li>
<li>Submit</li>
</ul>

**Create the submission file**

In [14]:
output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('/content/drive/MyDrive/kaggle/titanic/my_submission.csv', index=False)
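
Before uploading, the submission frame can be checked for format problems locally. The sketch below assumes the file should contain exactly the `PassengerId` and `Survived` columns used above, the same IDs as the test set in the same order, and only 0 or 1 as predictions. `check_submission` is a hypothetical helper.

```python
import pandas as pd


def check_submission(df, expected_ids):
    """Return a list of format problems in a submission frame (empty if none)."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {list(df.columns)}"]

    problems = []
    if df["PassengerId"].tolist() != list(expected_ids):
        problems.append("PassengerId values do not match the test set")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems
```

In the notebook this could be called as `check_submission(output, test['PassengerId'])`; an empty list means none of these checks failed.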

**Submit**

In [15]:
!kaggle competitions submit -c titanic -f /content/drive/MyDrive/kaggle/titanic/my_submission.csv -m "First submission"

100% 2.77k/2.77k [00:00<00:00, 5.63kB/s]
Successfully submitted to Titanic - Machine Learning from Disaster

<h3>References</h3>
<ul>
<li><a href="https://www.kaggle.com/c/titanic">Titanic: Machine Learning from Disaster</a></li>
<li><a href="https://www.kaggle.com/alexisbcook/titanic-tutorial">Titanic Tutorial</a></li>
</ul>
