Predicting Survived on the Titanic dataset

## Variable list

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex |  |
| age | Age | in years |
| sibsp | # of siblings / spouses aboard the Titanic |  |
| parch | # of parents / children aboard the Titanic |  |
| ticket | Ticket number |  |
| fare | Passenger fare |  |
| cabin | Cabin number |  |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

## Loading modules

In [1]:
import pandas as pd
import math

In [2]:
train_df=pd.read_csv('train.csv').drop(['PassengerId','Name','Ticket','Cabin'],axis=1)
train_df.head(5)

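|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |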


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [4]:
test_df=pd.read_csv('test.csv').drop(['PassengerId','Name','Ticket','Cabin'],axis=1)
test_df.head(5)

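|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |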


In [5]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB


## Preprocessing

### Imputing missing Age values

In [6]:
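# The missing value for Pclass=3, SibSp=1, Parch=9 was filled with the value for Pclass=3, SibSp=1, Parch=6
# Select the column with a plain key: indexing a groupby with a bare tuple of names fails on recent pandas
age_df=pd.concat([train_df,test_df]).groupby(
    ['Pclass','SibSp','Parch']
)['Age'].mean().reset_index().fillna(41.5)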
age_df.head(5)

|   | Pclass | SibSp | Parch | Age |
| --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 40.488281 |
| 1 | 1 | 0 | 1 | 36.769231 |
| 2 | 1 | 0 | 2 | 27.454545 |
| 3 | 1 | 1 | 0 | 38.732394 |
| 4 | 1 | 1 | 1 | 48.217391 |


In [7]:
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 4 columns):
Pclass    47 non-null int64
SibSp     47 non-null int64
Parch     47 non-null int64
Age       47 non-null float64
dtypes: float64(1), int64(3)
memory usage: 1.5 KB


In [8]:
"""
Function that fills a missing Age value.
It uses the mean age of passengers with the same Pclass, SibSp, and Parch values.
"""
def fill_age_nan(ser,age_df):

    if math.isnan(ser.Age): 
        return age_df[(age_df.Pclass==ser.Pclass) & (age_df.SibSp==ser.SibSp) & (age_df.Parch==ser.Parch)].Age.values[0].round(2)
    else:
        return ser.Age
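
The row-by-row lookup above can also be written with `groupby(...).transform('mean')`, which broadcasts each group's mean back onto its rows. The following is only a minimal sketch on made-up data: the `toy` frame and the `fill_age_by_group` helper are hypothetical and not part of the notebook. Groups with no observed age fall back to a caller-supplied default, mirroring the `fillna(41.5)` step.

```python
import numpy as np
import pandas as pd


def fill_age_by_group(df, keys=("Pclass", "SibSp", "Parch"), default=41.5):
    """Return a copy of df with missing Age filled by the group mean.

    Hypothetical helper: `default` is used when a group has no known age.
    """
    out = df.copy()
    # Mean age per group, aligned with the original row order.
    group_mean = out.groupby(list(keys))["Age"].transform("mean")
    # Round only the imputed values, as the notebook's function does.
    imputed = group_mean.fillna(default).round(2)
    out["Age"] = out["Age"].fillna(imputed)
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "SibSp":  [0, 0, 0, 1, 1],
    "Parch":  [0, 0, 0, 9, 9],
    "Age":    [40.0, 30.0, np.nan, np.nan, np.nan],
})
filled = fill_age_by_group(toy)
print(filled["Age"].tolist())
```

The notebook averages over the training and test rows together, so the same means would only result if the helper were applied to the concatenated frame.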

In [9]:
# Apply to the training data
train_df['Age']=train_df.apply(lambda x: fill_age_nan(x,age_df),axis=1)
# Check that the missing Age values were filled
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [10]:
# Apply to the test data
test_df['Age']=test_df.apply(lambda x: fill_age_nan(x,age_df),axis=1)
# Check that the missing Age values were filled
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         418 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB


### Imputing the missing Fare value (test data)

In [11]:
# Mean fare per (Pclass, Sex, SibSp, Parch) group over the training and test rows
fare_df=pd.concat([train_df,test_df]).groupby(
    ['Pclass','Sex','SibSp','Parch']
)['Fare'].mean().reset_index()
fare_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 5 columns):
Pclass    84 non-null int64
Sex       84 non-null object
SibSp     84 non-null int64
Parch     84 non-null int64
Fare      84 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 3.4+ KB


In [12]:
def fill_fare_nan(ser,fare_df):
    """Fill a missing Fare with the mean fare of passengers sharing Pclass, Sex, SibSp, and Parch."""

    if math.isnan(ser.Fare): 
        return fare_df[(fare_df.Pclass==ser.Pclass) & (fare_df.Sex==ser.Sex) & (fare_df.SibSp==ser.SibSp) & (fare_df.Parch==ser.Parch)].Fare.values[0].round(2)
    else:
        return ser.Fare

In [13]:
test_df['Fare']=test_df.apply(lambda x: fill_fare_nan(x,fare_df),axis=1)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         418 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        418 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB


### Converting categorical features to dummy variables

In [14]:
# The Embarked column has two missing values, so drop those rows first
train_df=train_df.dropna()
print(train_df.info())
# Convert to dummy variables
train_df=pd.concat([train_df,pd.get_dummies(train_df[['Sex','Embarked']])],axis=1).drop(['Sex','Embarked'],axis=1)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Sex         889 non-null object
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
Survived      889 non-null int64
Pclass        889 non-null int64
Age           889 non-null float64
SibSp         889 non-null int64
Parch         889 non-null int64
Fare          889 non-null float64
Sex_female    889 non-null uint8
Sex_male      889 non-null uint8
Embarked_C    889 non-null uint8
Embarked_Q    889 non-null uint8
Embarked_S    889 non-null uint8
dtypes: float64(2), int64(4), uint8(5)
memory usage: 53.0 KB


In [15]:
test_df=pd.concat([test_df,pd.get_dummies(test_df[['Sex','Embarked']])],axis=1).drop(['Sex','Embarked'],axis=1)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
Pclass        418 non-null int64
Age           418 non-null float64
SibSp         418 non-null int64
Parch         418 non-null int64
Fare          418 non-null float64
Sex_female    418 non-null uint8
Sex_male      418 non-null uint8
Embarked_C    418 non-null uint8
Embarked_Q    418 non-null uint8
Embarked_S    418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.4 KB
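
Here both frames happen to contain every category of `Sex` and `Embarked`, so the dummy columns match. That is not guaranteed in general. As a sketch on made-up data (the `train_toy` and `test_toy` frames are hypothetical), the test columns can be aligned to the training columns with `DataFrame.reindex`:

```python
import pandas as pd

# Made-up frames: the test frame lacks the category "Q".
train_toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["S", "C"]})

train_dummies = pd.get_dummies(train_toy[["Sex", "Embarked"]])
test_dummies = pd.get_dummies(test_toy[["Sex", "Embarked"]])

# Add any column seen only in training (filled with 0) and drop extras,
# so both frames expose the same features in the same order.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```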


### Removing duplicate rows

In [17]:
train_df=train_df.drop_duplicates()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 778 entries, 0 to 890
Data columns (total 11 columns):
Survived      778 non-null int64
Pclass        778 non-null int64
Age           778 non-null float64
SibSp         778 non-null int64
Parch         778 non-null int64
Fare          778 non-null float64
Sex_female    778 non-null uint8
Sex_male      778 non-null uint8
Embarked_C    778 non-null uint8
Embarked_Q    778 non-null uint8
Embarked_S    778 non-null uint8
dtypes: float64(2), int64(4), uint8(5)
memory usage: 46.3 KB
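
Because `PassengerId`, `Name`, `Ticket`, and `Cabin` were dropped when the data was loaded, rows that compare equal at this point may belong to different passengers; the step above reduces 889 rows to 778. A small sketch, on made-up data, of inspecting such rows with `DataFrame.duplicated(keep=False)` before deciding whether to drop them:

```python
import pandas as pd

# Made-up rows: the first two are identical once identifiers are removed.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 0],
    "Pclass":   [3, 3, 1, 3],
    "Age":      [22.0, 22.0, 38.0, 22.0],
    "Fare":     [7.25, 7.25, 71.28, 8.05],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

deduped = toy.drop_duplicates()
print(len(toy), "->", len(deduped))
```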


## SVM
Cross-validation and grid search

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# Module for standardization
from sklearn.preprocessing import StandardScaler

# Parameter settings
param_grid = { 'C':[0.001,0.01,0.1,1,10,100]
              ,'gamma':[0.001,0.01,0.1,1,10,100]}

grid_search = GridSearchCV(SVC(),param_grid,cv=5)

X_train, X_validation, y_train, y_validation = train_test_split(
    train_df.drop('Survived',axis=1), train_df['Survived'], stratify = train_df['Survived'], random_state=0)

ss=StandardScaler()
ss.fit(X_train)
X_train_scaled=ss.transform(X_train)
X_validation_scaled=ss.transform(X_validation)

grid_search.fit(X_train_scaled,y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_validation_scaled,y_validation)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best cross-validation score:{:.2f}".format(grid_search.best_score_))

Test set score:0.79
Best parameters:{'C': 100, 'gamma': 0.01}
Best cross-validation score:0.82
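
In the cell above the scaler is fitted on all of `X_train` before the grid search, so each cross-validation fold is scaled with statistics that include its own held-out part. One common alternative is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, so that the scaler is refitted inside each fold. A minimal sketch on synthetic data from `make_classification` (the data and the reduced grid are assumptions for illustration, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# The "svc__" prefix routes each parameter to the SVC step.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Validation score: {:.2f}".format(grid.score(X_val, y_val)))
print("Best parameters: {}".format(grid.best_params_))
```

With this layout the raw (unscaled) features are passed to `fit` and `score`; the pipeline applies the scaler itself.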
