# Hyperparameter Search

For details on cross-validation, see the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

First, load the data.

In [1]:
import pandas as pd
from sklearn import model_selection

# Frame the task as detecting negative reviews among positive, neutral, and negative ones:
# map negative to 1 and the remaining labels (positive, neutral) to 0.
data = pd.read_csv("input/pn_same_judge.csv")
data["label_num"] = data["label"].map({"positive": 0, "neutral": 0, "negative": 1})
train, test = model_selection.train_test_split(data, test_size=0.1, random_state=0)

Define the tokenizer used to split the text into tokens.

In [2]:
import spacy

nlp = spacy.load("ja_core_news_md")

def tokenize(text):
    return [token.text for token in nlp(text)]



## Hyperparameter search that maximizes a metric

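GridSearchCV is a convenient way to search for the parameters that maximize a given metric.

The search is not limited to constructor arguments: the estimator class itself can also be varied, for example to compare MultinomialNB with LogisticRegression.

Reference: [Stack Overflow](https://stackoverflow.com/questions/64258622/gridsearchcv-with-tfidf-and-count-vectorizer)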



In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline


pipe = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=tokenize)),
    ("clf", LogisticRegression())
])

params = [
    {
        "clf": [MultinomialNB()],
    },
    {
        "clf": [LogisticRegression()],
        "clf__class_weight": [None, "balanced"],
    }
]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, params, scoring="average_precision", cv=cv, verbose=2, n_jobs=2)
search.fit(X=train["text"], y=train["label_num"])

Fitting 3 folds for each of 3 candidates, totalling 9 fits




GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=0, shuffle=True),
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(tokenizer=<function tokenize at 0x7fc86ca539d0>)),
                                       ('clf', LogisticRegression())]),
             n_jobs=2,
             param_grid=[{'clf': [MultinomialNB()]},
                         {'clf': [LogisticRegression(class_weight='balanced')],
                          'clf__class_weight': [None, 'balanced']}],
             scoring='average_precision', verbose=2)

In [4]:
pd.DataFrame(search.cv_results_).T

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| mean_fit_time | 41.165774 | 43.328188 | 29.202276 |
| std_fit_time | 1.474656 | 0.592522 | 2.558648 |
| mean_score_time | 20.369568 | 20.659912 | 15.043999 |
| std_score_time | 1.378666 | 1.915008 | 1.882787 |
| param_clf | MultinomialNB() | LogisticRegression(class_weight='balanced') | LogisticRegression(class_weight='balanced') |
| param_clf__class_weight | | | balanced |
| params | {'clf': MultinomialNB()} | {'clf': LogisticRegression(class_weight='balan... | {'clf': LogisticRegression(class_weight='balan... |
| split0_test_score | 0.7273 | 0.81205 | 0.818982 |
| split1_test_score | 0.70588 | 0.744122 | 0.749205 |
| split2_test_score | 0.744426 | 0.821502 | 0.825885 |


In [5]:
search.best_params_

{'clf': LogisticRegression(class_weight='balanced'),
 'clf__class_weight': 'balanced'}

To train the model once more on the entire training set with the best parameters, use set_params.

Reference: [Stack Overflow](https://stackoverflow.com/questions/60608474/scikit-pipeline-parameters-fit-got-an-unexpected-keyword-argument-gamma)

In [6]:
pipe.set_params(**search.best_params_)

pipe.fit(X=train["text"], y=train["label_num"])

Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7fc86ca539d0>)),
                ('clf', LogisticRegression(class_weight='balanced'))])
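
As a side note, GridSearchCV refits the best candidate on the full training data by default (refit=True) and exposes it as best_estimator_, so the set_params route is mainly useful when you want to keep working with your own pipeline object. The sketch below illustrates this on synthetic data. The dataset from make_classification, the StandardScaler step, and all names ending in or starting with `demo` are assumptions for illustration, not part of the review data or pipeline used here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the review data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_params = {"clf__class_weight": [None, "balanced"]}
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

demo_search = GridSearchCV(
    demo_pipe, demo_params, scoring="average_precision", cv=demo_cv
)
demo_search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already fitted on all of X_demo.
refit_proba = demo_search.best_estimator_.predict_proba(X_demo)[:, 1]

# Manual route from the text: copy the best parameters into the pipeline and fit again.
demo_pipe.set_params(**demo_search.best_params_)
demo_pipe.fit(X_demo, y_demo)
manual_proba = demo_pipe.predict_proba(X_demo)[:, 1]
```

Both routes should give the same fitted model here, since the solver is deterministic on identical data.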

In [7]:
proba = pipe.predict_proba(X=test["text"])
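
The second column of proba holds the predicted probability of the negative-review class (label 1), so it can be scored against the held-out labels with the same metric that was used during the search. A minimal sketch, assuming made-up toy values: y_true_demo and proba_demo are hypothetical stand-ins for test["label_num"] and proba.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy stand-ins (assumptions): true labels and a two-column predict_proba output.
y_true_demo = np.array([0, 0, 1, 1])
proba_demo = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Column 1 corresponds to class 1 (the positive class for the metric).
positive_scores = proba_demo[:, 1]
ap = average_precision_score(y_true_demo, positive_scores)
print(f"average precision: {ap:.4f}")
```

With the notebook's variables, the equivalent call would be average_precision_score(test["label_num"], proba[:, 1]).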

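## Custom loop

When maximizing a single metric is not enough, for example when you want to plot the PR curve of every candidate,
you need to implement the training and inference code for each hyperparameter combination yourself.

The hyperparameter combinations can be enumerated with [ParameterGrid](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html#sklearn.model_selection.ParameterGrid).

When given a parameter grid, ParameterGrid returns the same parameter combinations that GridSearchCV passes to the Pipeline.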

In [8]:
from sklearn.model_selection import ParameterGrid

params = [
    {"a": ["a1", "a2"], "b": ["b1", "b2"]},
    {"a": ["a3", "a4"], "b": ["b3", "b4"]},
]
list(ParameterGrid(params))

[{'a': 'a1', 'b': 'b1'},
 {'a': 'a1', 'b': 'b2'},
 {'a': 'a2', 'b': 'b1'},
 {'a': 'a2', 'b': 'b2'},
 {'a': 'a3', 'b': 'b3'},
 {'a': 'a3', 'b': 'b4'},
 {'a': 'a4', 'b': 'b3'},
 {'a': 'a4', 'b': 'b4'}]
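
The same holds for a pipeline grid like the one used earlier: it expands to three candidates, matching the "3 candidates" that GridSearchCV reported, and each dictionary can be handed directly to Pipeline.set_params. A small sketch, assuming a toy pipeline with a MinMaxScaler step in place of the TF-IDF vectorizer (toy_pipe, toy_params, and applied are illustrative names):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy pipeline (assumption): MinMaxScaler stands in for the TF-IDF step.
toy_pipe = Pipeline([("scale", MinMaxScaler()), ("clf", LogisticRegression())])
toy_params = [
    {"clf": [MultinomialNB()]},
    {"clf": [LogisticRegression()], "clf__class_weight": [None, "balanced"]},
]

candidates = list(ParameterGrid(toy_params))
n_candidates = len(candidates)

applied = []
for candidate in candidates:
    # "clf" swaps the whole step; "clf__class_weight" is then set on the new step.
    toy_pipe.set_params(**candidate)
    clf = toy_pipe.named_steps["clf"]
    applied.append((type(clf).__name__, getattr(clf, "class_weight", None)))

print(n_candidates, applied)
```

Because the same LogisticRegression instance appears in both of its candidates, set_params updates it in place. This is presumably why the cv_results_ output above shows class_weight='balanced' in param_clf for both LogisticRegression columns.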

By combining ParameterGrid and set_params, you can run cross-validation yourself
for each parameter combination, as follows.

In [9]:
from sklearn.model_selection import ParameterGrid
import numpy as np


def run_cv(pipe, params, cv, X, y):
    """Train and predict on every fold for each combination in params, and return the results.
    """
    result = []
    for param in ParameterGrid(params):
        pipe.set_params(**param)
        print(pipe)
        pred = np.zeros((len(X), ))
        for fold_id, (train_idx, test_idx) in enumerate(cv.split(X=X, y=y)):
            print("Fold:", fold_id)
            pipe.fit(X=X.iloc[train_idx], y=y.iloc[train_idx])
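            # This assumes that the estimator provides predict_proba.
            # Note that this implementation does not work with, for example,
            # SVC in its default configuration (probability=False).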
            pred[test_idx] = pipe.predict_proba(X.iloc[test_idx])[:,1]

        result.append((param, pred))
    return result
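
Since run_cv returns out-of-fold probabilities for every candidate, one PR curve per candidate can be computed with precision_recall_curve. The sketch below assumes a list in the same (param, pred) format. The labels, scores, and candidate names are made-up toy values, and plotting is left as a comment so that the block depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy stand-ins (assumptions): true labels and out-of-fold scores per candidate.
y_demo = np.array([0, 0, 1, 1, 0, 1])
toy_result = [
    ({"clf": "candidate_a"}, np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])),
    ({"clf": "candidate_b"}, np.array([0.50, 0.60, 0.40, 0.30, 0.70, 0.20])),
]

curves = []
for param, pred in toy_result:
    # One curve per candidate, computed over all out-of-fold predictions.
    precision, recall, thresholds = precision_recall_curve(y_demo, pred)
    ap = average_precision_score(y_demo, pred)
    curves.append((param, precision, recall, ap))
    # With matplotlib available, each curve could be drawn here, e.g.
    # plt.plot(recall, precision, label=f"{param} (AP={ap:.3f})")

for param, _, _, ap in curves:
    print(param, round(ap, 4))
```

Keeping the full precision and recall arrays, rather than only the summary score, is what GridSearchCV alone does not give you and what motivates the custom loop.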

In [10]:
pipe = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=tokenize)),
    ("clf", LogisticRegression())
])

params = [
    {
        "clf": [MultinomialNB()],
    },
    {
        "clf": [LogisticRegression()],
        "clf__class_weight": [None, "balanced"],
    }
]
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
result = run_cv(pipe=pipe, params=params, cv=cv, X=train["text"], y=train["label_num"])

Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7fc86ca539d0>)),
                ('clf', MultinomialNB())])
Fold: 0
Fold: 1
Fold: 2
Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7fc86ca539d0>)),
                ('clf', LogisticRegression())])
Fold: 0
Fold: 1
Fold: 2
Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7fc86ca539d0>)),
                ('clf', LogisticRegression(class_weight='balanced'))])
Fold: 0
[CV] END ................................clf=MultinomialNB(); total time=  59.2s
[CV] END ................................clf=Mult