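# Assignment 6
Using preprocessing and feature selection, achieve a test score about 10% better than the SVM's initial score.
Light tweaking did not change the linear regression score, so treat linear regression as a reference only.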

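### Required
- Preprocessing: normalization, standardization, outlier removal, etc.
- Feature selection: validation is required. Whether you end up adding, removing, or keeping features is up to you.
- Test score improvement: an MSE of around 0.41 should be reachable.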
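
Outlier removal is listed above but not demonstrated in this notebook. The following is a minimal sketch of one common approach, the 1.5 × IQR rule, on toy data. The helper name `iqr_mask`, the toy column values, and the injected outlier are all assumptions made for illustration. The mask should be computed from the training rows only; test rows are normally left untouched.

```python
import numpy as np
import pandas as pd


def iqr_mask(frame: pd.DataFrame, factor: float = 1.5) -> pd.Series:
    """Return True for rows whose every column lies within the IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    inside = (frame >= lower) & (frame <= upper)  # compared column by column
    return inside.all(axis=1)


# Toy training data (hypothetical values): 200 ordinary rows plus one outlier.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, size=200),
    "sulphates": rng.normal(0.65, 0.15, size=200),
})
toy_X.loc[0, "sulphates"] = 5.0  # injected outlier
toy_y = pd.Series(rng.integers(3, 9, size=200), name="quality")

keep = iqr_mask(toy_X)
X_clean, y_clean = toy_X[keep], toy_y[keep]
print(len(toy_X), "->", len(X_clean))
```

Because the same boolean mask filters both the features and the target, the two stay aligned after the outliers are dropped.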

### Optional
- Changing the metric
- Changing parameters (changing the model itself is not intended)

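### Aside
While writing the solution, I was reminded how hard it is to prepare a single correct answer for feature selection. The solution includes only a minimal amount of analysis and validation, yet it still took 4 hours ( ;∀;)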

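## Caveats
Analyzing the features without restricting the analysis to the split (training) data is cheating, or rather simply wrong. It is ignored here, but in a real setting you cannot do this.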
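
As a rough illustration of the point above, the sketch below splits first and then inspects feature/target correlations on the training rows only. The toy columns and the way `quality` is generated are assumptions for the example, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): "alcohol" drives the target, "pH" is pure noise.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, size=n),
    "pH": rng.normal(3.3, 0.15, size=n),
})
toy["quality"] = 0.5 * toy["alcohol"] + rng.normal(0.0, 0.5, size=n)

# Split before any analysis so the test rows never influence decisions.
train_df, test_df = train_test_split(toy, test_size=0.3, random_state=0)

# Correlation with the target, computed from the training rows only.
train_corr = train_df.corr()["quality"].drop("quality")
print(train_corr.sort_values(ascending=False))
```

Any plot or statistic used to choose features can be restricted to `train_df` in the same way.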

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, make_scorer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

Output variable (based on sensory data):
- quality (score between 0 and 10)

In [2]:
wine_quality_df = pd.read_csv("winequality-red.csv",delimiter=";")
print(wine_quality_df.shape)
wine_quality_df.head()

(1599, 12)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
wine_quality_df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [4]:
feature_names = list(np.copy(wine_quality_df.columns))
feature_names.remove("quality")

In [5]:
X_train, X_test, y_train, y_test = \
    train_test_split(wine_quality_df[feature_names], wine_quality_df["quality"], 
                     test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((1119, 11), (480, 11))

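## Note
The function below also prints the test score, purely because running everything in one function was easier; the two really should be kept separate.
When reasoning about parameters or features, tune using CV alone.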
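
One way to keep the two apart is sketched below, on synthetic data from `make_regression`. The function names `cv_mse` and `holdout_mse` are hypothetical and not part of the notebook; the idea is only that tuning looks at the first and the second is called once at the end.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split


def cv_mse(model, X, y, n_splits=5):
    """Mean cross-validated MSE, computed on the training data only."""
    scores = cross_val_score(model, X, y, cv=KFold(n_splits=n_splits),
                             scoring="neg_mean_squared_error")
    return -np.mean(scores)  # the scorer returns negated MSE


def holdout_mse(model, X_tr, y_tr, X_te, y_te):
    """Fit on the whole training set and report MSE on the held-out test set."""
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


# Synthetic stand-in for the wine data (assumption for illustration).
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0,
                               random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3,
                                          random_state=0)

model = Ridge()
cv_score = cv_mse(model, X_tr, y_tr)                      # use while tuning
final_score = holdout_mse(model, X_tr, y_tr, X_te, y_te)  # look at once, last
print(cv_score, final_score)
```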

In [6]:
kfold = KFold(n_splits=5)  # random_state only applies when shuffle=True
def cross_validation(model, test=True):
    global X_train, X_test, y_train, y_test, feature_names
    scores = cross_val_score(model, X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    # Score (MSE) for each fold
    print('Cross-Validation scores: {}'.format(scores))
    # Mean of the fold scores
    print('Average score: {}'.format(np.mean(scores)))
    if test:
        model.fit(X_train[feature_names], y_train)
        pred = model.predict(X_test[feature_names])
        print('Test score: {}'.format(mean_squared_error(y_test, pred)))

# Prediction with the models

In [7]:
linear_reg = Ridge(random_state=0)
cross_validation(linear_reg)

Cross-Validation scores: [0.48812538 0.48841541 0.42327305 0.42566563 0.36728215]
Average score: 0.43855232598926486
Test score: 0.4010466305154148


In [8]:
svm_clf = SVC(kernel="rbf", random_state=0)
cross_validation(svm_clf)

Cross-Validation scores: [0.67857143 0.82142857 0.71428571 0.75446429 0.62780269]
Average score: 0.719310538116592
Test score: 0.68125


In [9]:
svm_reg = SVR(kernel="rbf")
cross_validation(svm_reg)

Cross-Validation scores: [0.5715145  0.68175944 0.57909369 0.61098986 0.50245379]
Average score: 0.5891622546486783
Test score: 0.5085468469038228


# Add your code below

In [10]:
X_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
92,8.6,0.49,0.29,2.0,0.11,19.0,133.0,0.9972,2.93,1.98,9.8
1017,8.0,0.18,0.37,0.9,0.049,36.0,109.0,0.99007,2.89,0.44,12.7
1447,6.8,0.67,0.0,1.9,0.08,22.0,39.0,0.99701,3.4,0.74,9.7
838,10.1,0.31,0.35,1.6,0.075,9.0,28.0,0.99672,3.24,0.83,11.2
40,7.3,0.45,0.36,5.9,0.074,12.0,87.0,0.9978,3.33,0.83,10.5


In [11]:
y_train.head()

92      5
1017    6
1447    5
838     7
40      5
Name: quality, dtype: int64

In [12]:
feature_names

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

# Standardization

In [13]:
from sklearn.preprocessing import StandardScaler

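standard_scaler = StandardScaler()
for col in feature_names:
    X_train[col] = standard_scaler.fit_transform(np.array(X_train[col]).reshape(-1, 1))
    # Caution: fit_transform re-fits the scaler on the test column, so X_test is
    # scaled with its own mean and std instead of the training statistics.
    X_test[col] = standard_scaler.fit_transform(np.array(X_test[col]).reshape(-1, 1))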
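
The usual convention is to learn the scaling statistics from the training data and reuse them for the test data, so that both sets go through the same transformation. A minimal sketch on toy arrays (the array shapes and values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test arrays (hypothetical values).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=10.0, scale=2.0, size=(40, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learn mean/std from train
test_scaled = scaler.transform(toy_test)        # reuse them for test

print(train_scaled.mean(axis=0).round(6))  # approximately 0 by construction
print(test_scaled.mean(axis=0).round(3))   # close to 0, but not exactly
```

`StandardScaler` works column by column on a 2-D input, so no per-column loop is needed in this form.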

In [14]:
X_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
92,0.148946,-0.21806,0.091982,-0.371725,0.450263,0.256185,2.498011,0.234476,-2.444015,7.385412,-0.573058
1017,-0.195652,-1.890462,0.502669,-1.108115,-0.780595,1.838977,1.797243,-3.520329,-2.701983,-1.234462,2.151676
1447,-0.884848,0.753012,-1.396757,-0.438669,-0.155077,0.535501,-0.246662,0.134418,0.587115,0.444734,-0.667015
838,1.010441,-1.189132,0.399998,-0.639503,-0.255967,-0.674868,-0.567847,-0.018302,-0.444759,0.948493,0.742331
40,-0.597683,-0.433854,0.451333,2.239114,-0.276145,-0.395552,1.154873,0.550448,0.13567,0.948493,0.084636


In [15]:
linear_reg = Ridge(random_state=0)
cross_validation(linear_reg)

Cross-Validation scores: [0.49239137 0.48985979 0.43040695 0.42268956 0.36400115]
Average score: 0.43986976362616464
Test score: 0.405475240519006


In [16]:
svm_clf = SVC(kernel="rbf", random_state=0)
cross_validation(svm_clf)

Cross-Validation scores: [0.56696429 0.52678571 0.5        0.52678571 0.4529148 ]
Average score: 0.5146901024983984
Test score: 0.4583333333333333


In [17]:
svm_reg = SVR(kernel="rbf")
cross_validation(svm_reg)

Cross-Validation scores: [0.45148575 0.42211498 0.38000374 0.4269028  0.3493321 ]
Average score: 0.40596787342893403
Test score: 0.3924823681610074


# Feature selection

In [33]:
from sklearn.feature_selection import SelectKBest, f_classif

kfold = KFold(n_splits=5)  # random_state only applies when shuffle=True
feature_names = list(np.copy(wine_quality_df.columns))
feature_names.remove("quality")
copy_feature = feature_names.copy()
n = len(feature_names)

best_feature_ridge = []
best_feature_svc = []
best_feature_svr = []

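# Start from +inf so that any MSE counts as an improvement
best_score_ridge = np.inf
best_score_svc = np.inf
best_score_svr = np.inf

# Try k = 1 .. n so that the full feature set is evaluated as well
for i in range(1, n + 1):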
    selector = SelectKBest(score_func=f_classif, k=i)
    selector.fit(X_train, y_train)
    mask = selector.get_support()
    feature_names = [feature for feature, m in zip(feature_names, mask) if m]
    
    print('-' * 70)
    print(feature_names)
    
    # Ridge
    scores = cross_val_score(Ridge(random_state=0), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_ridge:
        best_score_ridge = np.mean(scores)
        best_feature_ridge = feature_names.copy()
    print('Average score of linear_reg: {}'.format(np.mean(scores)))
    
    # SVC
    scores = cross_val_score(SVC(kernel='rbf', random_state=0), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_svc:
        best_score_svc = np.mean(scores)
        best_feature_svc = feature_names.copy()
    print('Average score of svm_clf: {}'.format(np.mean(scores)))
    
    # SVR
    scores = cross_val_score(SVR(kernel='rbf'), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_svr:
        best_score_svr = np.mean(scores)
        best_feature_svr = feature_names.copy()
    print('Average score of svm_reg: {}'.format(np.mean(scores)))
    feature_names = copy_feature.copy()

----------------------------------------------------------------------
['alcohol']
Average score of linear_reg: 0.5352654334958145
Average score of svm_clf: 0.6353058936579117
Average score of svm_reg: 0.5521921014969338
----------------------------------------------------------------------
['volatile acidity', 'alcohol']
Average score of linear_reg: 0.4581833735265362
Average score of svm_clf: 0.5593649903907751
Average score of svm_reg: 0.47090991856097497
----------------------------------------------------------------------
['volatile acidity', 'total sulfur dioxide', 'alcohol']
Average score of linear_reg: 0.4538502962885794
Average score of svm_clf: 0.558472133247918
Average score of svm_reg: 0.47092343697693134
----------------------------------------------------------------------
['volatile acidity', 'citric acid', 'total sulfur dioxide', 'alcohol']
Average score of linear_reg: 0.4534056343769528
Average score of svm_clf: 0.5566904228058938
Average score of svm_reg: 0.467697456

In [36]:
best_feature_ridge

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

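<p>For Ridge, the best score came from the 10 features that exclude "residual sugar"; the gain over using all 11 features is negligible (average CV MSE 0.4392 vs. 0.4399).</p>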

In [37]:
best_feature_svc

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

<p>For SVC, the best score came from the 9 features that exclude "residual sugar" and "chlorides".</p>

In [38]:
best_feature_svr

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

<p>For SVR, as with Ridge, the best score came from the 10 features that exclude "residual sugar".</p>

In [39]:
feature_names = best_feature_ridge.copy()
cross_validation(linear_reg)

Cross-Validation scores: [0.48761245 0.48790686 0.43066806 0.42442699 0.36526994]
Average score: 0.4391768591191891
Test score: 0.40446614443168244


In [40]:
feature_names = best_feature_svc.copy()
cross_validation(svm_clf)

Cross-Validation scores: [0.53571429 0.50446429 0.49107143 0.48660714 0.4573991 ]
Average score: 0.49505124919923127
Test score: 0.45625


In [57]:
feature_names = best_feature_svr.copy()
cross_validation(svm_reg)

Cross-Validation scores: [0.4415309  0.42435346 0.38567597 0.42725716 0.34471665]
Average score: 0.4047068287331017
Test score: 0.4413478906841522


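<p>
    For SVC, dropping "chlorides" as well improved the score slightly; Ridge and SVR scored best with every feature except "residual sugar".<br>
    For this dataset, there seems to be little need to remove features.
</p>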
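
The search over the number of features can also be written as a pipeline, so that scaling and selection are re-fit inside every CV fold. The sketch below is an alternative formulation on synthetic data, not the notebook's method: it swaps in `f_regression` as the score function (the notebook uses `f_classif`), and the step names `scale`, `select`, and `model` are arbitrary labels chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the wine data (assumption for illustration).
toy_X, toy_y = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", SVR(kernel="rbf")),
])

# Evaluate every k from 1 up to the full feature count.
search = GridSearchCV(
    pipe,
    param_grid={"select__k": list(range(1, toy_X.shape[1] + 1))},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5),
)
search.fit(toy_X, toy_y)
print(search.best_params_, -search.best_score_)
```

Since the selector only ever sees the training part of each fold, the CV score is not flattered by features chosen with knowledge of the validation rows.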

# Normalization

In [42]:
feature_names = list(np.copy(wine_quality_df.columns))
feature_names.remove("quality")
X_train, X_test, y_train, y_test = \
    train_test_split(wine_quality_df[feature_names], wine_quality_df["quality"], 
                     test_size=0.3, random_state=0)

In [44]:
from sklearn.preprocessing import MinMaxScaler

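minimax_scaler = MinMaxScaler()
for col in feature_names:
    X_train[col] = minimax_scaler.fit_transform(np.array(X_train[col]).reshape(-1, 1))
    # Caution: fit_transform re-fits the scaler on the test column, so X_test is
    # scaled with its own min and max instead of the training statistics.
    X_test[col] = minimax_scaler.fit_transform(np.array(X_test[col]).reshape(-1, 1))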

In [45]:
X_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
92,0.348214,0.253425,0.29,0.075342,0.163606,0.253521,0.448763,0.523495,0.149606,0.98773,0.215385
1017,0.294643,0.041096,0.37,0.0,0.06177,0.492958,0.363958,0.0,0.11811,0.042945,0.661538
1447,0.1875,0.376712,0.0,0.068493,0.113523,0.295775,0.116608,0.509545,0.519685,0.226994,0.2
838,0.482143,0.130137,0.35,0.047945,0.105175,0.112676,0.077739,0.488253,0.393701,0.282209,0.430769
40,0.232143,0.226027,0.36,0.342466,0.103506,0.15493,0.286219,0.567548,0.464567,0.282209,0.323077


In [46]:
linear_reg = Ridge(random_state=0)
cross_validation(linear_reg)

Cross-Validation scores: [0.4892121  0.48830701 0.42580411 0.42564279 0.36217685]
Average score: 0.4382285727903389
Test score: 0.44920164403284163


In [47]:
svm_clf = SVC(kernel="rbf", random_state=0)
cross_validation(svm_clf)

Cross-Validation scores: [0.55803571 0.51785714 0.52232143 0.54910714 0.4529148 ]
Average score: 0.5200472453555414
Test score: 0.5145833333333333


In [48]:
svm_reg = SVR(kernel="rbf")
cross_validation(svm_reg)

Cross-Validation scores: [0.4538687  0.42541329 0.38508794 0.43251404 0.3540613 ]
Average score: 0.4101890514465902
Test score: 0.436649611258576


# Feature selection

In [49]:
from sklearn.feature_selection import SelectKBest, f_classif

kfold = KFold(n_splits=5)  # random_state only applies when shuffle=True
feature_names = list(np.copy(wine_quality_df.columns))
feature_names.remove("quality")
copy_feature = feature_names.copy()
n = len(feature_names)

best_feature_ridge = []
best_feature_svc = []
best_feature_svr = []

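# Start from +inf so that any MSE counts as an improvement
best_score_ridge = np.inf
best_score_svc = np.inf
best_score_svr = np.inf

# Try k = 1 .. n so that the full feature set is evaluated as well
for i in range(1, n + 1):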
    selector = SelectKBest(score_func=f_classif, k=i)
    selector.fit(X_train, y_train)
    mask = selector.get_support()
    feature_names = [feature for feature, m in zip(feature_names, mask) if m]
    
    print('-' * 70)
    print(feature_names)
    
    # Ridge
    scores = cross_val_score(Ridge(random_state=0), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_ridge:
        best_score_ridge = np.mean(scores)
        best_feature_ridge = feature_names.copy()
    print('Average score of linear_reg: {}'.format(np.mean(scores)))
    
    # SVC
    scores = cross_val_score(SVC(kernel='rbf', random_state=0), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_svc:
        best_score_svc = np.mean(scores)
        best_feature_svc = feature_names.copy()
    print('Average score of svm_clf: {}'.format(np.mean(scores)))
    
    # SVR
    scores = cross_val_score(SVR(kernel='rbf'), X_train[feature_names], y_train, cv=kfold, 
                             scoring=make_scorer(mean_squared_error))
    if np.mean(scores) < best_score_svr:
        best_score_svr = np.mean(scores)
        best_feature_svr = feature_names.copy()
    print('Average score of svm_reg: {}'.format(np.mean(scores)))
    feature_names = copy_feature.copy()

----------------------------------------------------------------------
['alcohol']
Average score of linear_reg: 0.5353931487661371
Average score of svm_clf: 0.6353058936579117
Average score of svm_reg: 0.5521921014969348
----------------------------------------------------------------------
['volatile acidity', 'alcohol']
Average score of linear_reg: 0.45842947424522273
Average score of svm_clf: 0.563821268417681
Average score of svm_reg: 0.4706794059775817
----------------------------------------------------------------------
['volatile acidity', 'total sulfur dioxide', 'alcohol']
Average score of linear_reg: 0.45406745271684856
Average score of svm_clf: 0.5629404228058936
Average score of svm_reg: 0.4700983583054191
----------------------------------------------------------------------
['volatile acidity', 'citric acid', 'total sulfur dioxide', 'alcohol']
Average score of linear_reg: 0.4536232835180769
Average score of svm_clf: 0.560281870595772
Average score of svm_reg: 0.4678074904

In [53]:
best_feature_ridge

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

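<p>For Ridge, the best score came from the 10 features that exclude "residual sugar"; the gain over using all 11 features is negligible (average CV MSE 0.4379 vs. 0.4382).</p>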

In [51]:
best_feature_svc

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

<p>For SVC, the best score came from the 9 features that exclude "residual sugar" and "chlorides".</p>

In [52]:
best_feature_svr

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

<p>For SVR, as with Ridge, the best score came from the 10 features that exclude "residual sugar".</p>

In [54]:
feature_names = best_feature_ridge.copy()
cross_validation(linear_reg)

Cross-Validation scores: [0.48364241 0.48765627 0.42648673 0.42763401 0.3640758 ]
Average score: 0.43789904332131047
Test score: 0.44868337262902164


In [55]:
feature_names = best_feature_svc.copy()
cross_validation(svm_clf)

Cross-Validation scores: [0.5625     0.47767857 0.52678571 0.51785714 0.4529148 ]
Average score: 0.5075472453555413
Test score: 0.5083333333333333


In [56]:
feature_names = best_feature_svr.copy()
cross_validation(svm_reg)

Cross-Validation scores: [0.4415309  0.42435346 0.38567597 0.42725716 0.34471665]
Average score: 0.4047068287331017
Test score: 0.4413478906841522


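<p>
    As with standardization, SVC improved slightly when "chlorides" was dropped as well, while Ridge and SVR scored best with every feature except "residual sugar".<br>
    For this dataset, there seems to be little need to remove features.<br>
    Standardization gave better scores than normalization; for example, SVR with all features reached a test MSE of about 0.39 after standardization versus about 0.44 after normalization.
</p>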
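
A comparison between the two scalers is easier to trust when both go through the same CV splits and are fit only on the training part of each fold. The following is a minimal sketch of such a comparison on synthetic data; the column scale factors are assumptions chosen to give the features very different ranges, and the printed numbers say nothing about the wine data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR

# Synthetic data whose columns live on very different scales (assumption).
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=5.0,
                               random_state=0)
toy_X = toy_X * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])

results = {}
for name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    # The scaler is re-fit on the training part of every fold.
    pipe = make_pipeline(scaler, SVR(kernel="rbf"))
    scores = cross_val_score(pipe, toy_X, toy_y, cv=KFold(n_splits=5),
                             scoring="neg_mean_squared_error")
    results[name] = -scores.mean()

print(results)
```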