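# Exercises

We want to build a classifier that sorts news articles into categories (a category classifier). This exercise uses the [livedoor news corpus](https://www.rondhuit.com/download.html) to build a classifier that predicts the source of an article from its body text and title.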

+ トピックニュース: http://news.livedoor.com/category/vender/news/
+ Sports Watch: http://news.livedoor.com/category/vender/208/
+ ITライフハック: http://news.livedoor.com/category/vender/223/
+ 家電チャンネル: http://news.livedoor.com/category/vender/kadench/
+ MOVIE ENTER: http://news.livedoor.com/category/vender/movie_enter/
+ 独女通信: http://news.livedoor.com/category/vender/90/
+ エスマックス: http://news.livedoor.com/category/vender/smax/
+ livedoor HOMME: http://news.livedoor.com/category/vender/homme/
+ Peachy: http://news.livedoor.com/category/vender/ldgirls/


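News categorization is a technique that is [actually used in news curation services and similar products](https://webtan.impress.co.jp/e/2015/04/14/19666). In this exercise, the classifier is restricted to linear classifiers such as multi-class logistic regression; multi-layer neural networks and nonlinear support vector machines must not be used.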

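## 1. Downloading and formatting the data

The [livedoor news corpus](https://www.rondhuit.com/download.html) is distributed under the [Creative Commons Attribution-NoDerivs license](https://creativecommons.org/licenses/by-nd/2.1/jp/), so processed versions of the data cannot be redistributed. Everything from download to formatting therefore has to be run in each participant's own environment. The formatting procedure is given below and only needs to be executed as is. Because everyone in this exercise should use the same training, validation, and test data, run the steps below without modifying them.

### 1.1. Preparing the training, validation, and test data

Download the corpus.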

In [1]:
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz

--2020-12-26 12:29:57--  https://www.rondhuit.com/download/ldcc-20140209.tar.gz
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.rondhuit.com (www.rondhuit.com)... 59.106.19.174
Connecting to www.rondhuit.com (www.rondhuit.com)|59.106.19.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8855190 (8.4M) [application/x-gzip]
Saving to: `ldcc-20140209.tar.gz'


2020-12-26 12:29:57 (32.0 MB/s) - `ldcc-20140209.tar.gz' saved [8855190/8855190]



Extract the downloaded corpus.

In [2]:
!tar -zxvf ldcc-20140209.tar.gz > /dev/null

Read the extracted files and build the dataset `D` as a list of articles.

In [3]:
import pathlib

D = []
p = pathlib.Path('text')
for d in p.iterdir():
    if not d.is_dir():
        continue
    source = d.name
    for fname in d.glob('*.txt'):
        with open(fname) as fi:
            url = fi.readline().strip()
            timestamp = fi.readline().strip()
            title = fi.readline().strip()
            text = [line.strip() for line in fi if line.strip()]
            D.append(
                dict(source=source, url=url, timestamp=timestamp, title=title, text=text)
                )

Split the data into training data `Dtrain`, validation data `Ddev`, and test data `Dtest`.

In [4]:
D.sort(key=lambda x: x['url'])

Dtrain = []
Ddev = []
Dtest = []

for i, d in enumerate(D):
    if i % 10 == 8:
        Ddev.append(d)
    elif i % 10 == 9:
        Dtest.append(d)
    else:
        Dtrain.append(d)
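
The rule above sends every article whose position ends in 8 to the validation set, every article whose position ends in 9 to the test set, and the rest to the training set. A minimal sketch, using a hypothetical list of 100 placeholder items in place of the real corpus, shows the resulting 80/10/10 proportions:

```python
# Hypothetical stand-in for the sorted corpus: 100 placeholder articles.
items = list(range(100))

# Same index rule as the split above.
train = [x for i, x in enumerate(items) if i % 10 < 8]
dev = [x for i, x in enumerate(items) if i % 10 == 8]
test = [x for i, x in enumerate(items) if i % 10 == 9]

print(len(train), len(dev), len(test))
```

Because the rule depends only on the position after sorting by URL, the split is deterministic as long as the same set of files is read.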

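Use hash values to check that the corpus was split correctly into training, validation, and test data. If the code below **raises an AssertionError, one of the earlier steps may have been modified** (on success it prints "OK"). If the AssertionError persists, please get in touch.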

In [35]:
import hashlib

def compute_hash(D):
    m = hashlib.sha256()
    for d in D:
        m.update(d['url'].encode('utf-8'))
    return m.hexdigest()

assert compute_hash(Dtrain) == 'f1294a19b25952e5b18510e3eb74c21be9d5d18a86c369d2d2639c9e5ea93d6c'
assert compute_hash(Ddev) == '64f709e1e739ac880b8b7acc49ce342b60e80b804279bac68c5f27d08b5fb141'
assert compute_hash(Dtest) == '4acf6822099a9e4cc5794cade26ae0ddd8df88ccc99690e7b48cdd8aa3bf1bcd'
print("OK")

OK


### 1.2. Japanese word segmentation

To segment the Japanese titles and body text into words, install [MeCab](https://taku910.github.io/mecab/). The latest version of mecab-python3 raises errors in some environments, so v0.996.5 is pinned here.

In [7]:
!pip install mecab-python3==0.996.5

Collecting mecab-python3==0.996.5
  Downloading mecab_python3-0.996.5-cp38-cp38-manylinux2010_x86_64.whl (17.1 MB)
     |████████████████████████████████| 17.1 MB 10.8 MB/s eta 0:00:01
Installing collected packages: mecab-python3
Successfully installed mecab-python3-0.996.5


Split the titles and body text into words (morphemes).

In [9]:
import MeCab
tagger = MeCab.Tagger('-Owakati')
def tokenize(s):
    return tagger.parse(s).split()

def add_tokenization(D):
    for d in D:
        d['title.tokenized'] = tokenize(d['title'])
        d['text.tokenized'] = [tokenize(s) for s in d['text']]

add_tokenization(Dtrain)
add_tokenization(Ddev)
add_tokenization(Dtest)

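### Saving the formatted data to a file

To keep the formatted data, run the code below to save it to a file named "livedoor.json". On Google Colaboratory, however, saved files disappear when the instance is discarded, so one of the following approaches is needed.

1. Rerun all the steps so far every time a new instance is started (or the instance is reset)
1. Save "livedoor.json" to your own PC and upload it every time an instance is started
1. Save "livedoor.json" to your own Google Drive, then mount the drive and load the file when an instance is started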

In [176]:
import json

with open('livedoor.json', 'w') as fo:
    json.dump(
        dict(train=Dtrain, test=Dtest, dev=Ddev),
        fo
        )

### Loading the saved formatted data

In [1]:
import json

with open('livedoor.json') as fi:
    D = json.load(fi)
    
Dtrain = D['train']
Ddev = D['dev']
Dtest = D['test']
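
The save and load cells above can be combined so that the formatting steps are rerun only when no cached file exists. The following is a minimal sketch of that idea; `load_or_build` and the `build` callable are hypothetical helpers introduced for illustration and are not part of the exercise:

```python
import json
import pathlib
import tempfile


def load_or_build(path, build):
    """Return the cached dataset at `path`, building and saving it if absent.

    `build` is a hypothetical zero-argument callable that returns a dict with
    the keys 'train', 'dev', and 'test' (for example, by rerunning Section 1).
    """
    path = pathlib.Path(path)
    if path.exists():
        with open(path, encoding='utf-8') as fi:
            return json.load(fi)
    data = build()
    with open(path, 'w', encoding='utf-8') as fo:
        json.dump(data, fo, ensure_ascii=False)
    return data


# Usage with a toy builder and a temporary directory.
def toy_build():
    return dict(train=[{'title': '記事A'}], dev=[], test=[])


with tempfile.TemporaryDirectory() as tmp:
    cache = pathlib.Path(tmp) / 'livedoor.json'
    first = load_or_build(cache, toy_build)   # builds the data and writes the file
    second = load_or_build(cache, toy_build)  # reads the cached file
```

On Colaboratory, `path` could point to a mounted Google Drive folder so that the cache survives instance resets.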

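## Inspecting the resulting data

Display one instance from the training data (index 4749 in the cell below). Each training instance is represented as a dictionary like the one shown, and its fields have the following meanings.

+ `source`: category of the article
+ `url`: URL of the article
+ `timestamp`: publication date and time of the article
+ `title`: title of the article
+ `text`: body of the article (a list whose elements are paragraphs, each a string)
+ `title.tokenized`: title of the article segmented into words with MeCab
+ `text.tokenized`: body of the article segmented into words; each paragraph is a list of words, and the field stores the list of those paragraphs

The aim of this exercise is to treat the class in the `source` field as the target variable and to build a high-performing model that predicts it from the information in the other fields.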

In [18]:
Dtrain[4749]

{'source': 'peachy',
 'url': 'http://news.livedoor.com/article/detail/6684605/',
 'timestamp': '2000-06-25T15:45:00+0900',
 'title': 'キーワードは「カワイク＆賢く！」イマドキスマホ女子に人気のスマホグッズ紹介',
 'text': ['・カカオチョコレート',
  '・ウサギケース ラビットしっぽ',
  '・フォンピアス\u3000イヤホンジャックアクセサリー\u3000クマ',
  '・ブラウンポンポンゴールド スマートフォンアクセ'],
 'title.tokenized': ['キーワード',
  'は',
  '「',
  'カワイク',
  '＆',
  '賢く',
  '！',
  '」',
  'イマドキスマホ',
  '女子',
  'に',
  '人気',
  'の',
  'スマホグッズ',
  '紹介'],
 'text.tokenized': [['・', 'カカオ', 'チョコレート'],
  ['・', 'ウサギ', 'ケース', 'ラビット', 'しっぽ'],
  ['・', 'フォン', 'ピアス', 'イヤホンジャックアクセサリー', 'クマ'],
  ['・', 'ブラウンポンポンゴールド', 'スマートフォンアクセ']]}

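## 2. Feature vectors based on word frequency

For any instance in the training data `Dtrain`, count how often each word occurs in the segmented text (`text.tokenized`) and store the counts in an associative array (dictionary) that maps words to frequencies (see Short Report 1, 7-1). As an example, part of the word-frequency result for the text of training instance `Dtrain[3521]` is shown below.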

```
{'4': 1,
 '月': 2,
 '19': 1,
 '日': 1,
 '（': 3,
 '）': 3,
 'より': 1,
 ...
 '類': 1,
 'と': 1,
 'なる': 1}
```
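
One possible way to produce such a dictionary is `collections.Counter` from the standard library. The sketch below uses a small hypothetical input with the same shape as `text.tokenized` (a list of tokenized paragraphs) and is not taken from the corpus:

```python
from collections import Counter
from itertools import chain

# Toy input in the same shape as `text.tokenized`: a list of tokenized paragraphs.
paragraphs = [['4', '月', '19', '日'], ['月', 'と', 'なる']]

# Flatten the paragraphs into one token stream and count each word.
freq = Counter(chain.from_iterable(paragraphs))
```

A `Counter` behaves like a dictionary and returns 0 for words that never occurred, so it can be passed to code that expects a word-to-frequency mapping.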

In [3]:
from collections import defaultdict

In [4]:
def token2vec(token):
    """Count word occurrences over a list of tokenized paragraphs."""
    vec = defaultdict(int)
    for sentence in token:
        for word in sentence:
            vec[word] += 1
    return vec

In [126]:
token = Dtrain[238]['text.tokenized']
token2vec(token)

defaultdict(int,
            {'4': 1,
             '月': 2,
             '19': 1,
             '日': 1,
             '（': 3,
             '）': 3,
             'より': 1,
             '、': 6,
             '神戸': 1,
             '大丸': 1,
             'インテリア': 1,
             '専門': 1,
             '館': 1,
             '「': 2,
             'ミュゼエール': 1,
             '＠': 1,
             '六甲': 1,
             'アイランド': 1,
             '」': 2,
             '2': 1,
             '階': 1,
             'で': 2,
             'は': 3,
             'ヨーロッパ': 1,
             '最大': 1,
             'の': 2,
             'フィットネスマシンメーカー': 1,
             'テクノ': 1,
             'ジム': 1,
             '社': 1,
             'パートナー': 1,
             'ショールーム': 1,
             'が': 1,
             '関西': 1,
             '初めて': 1,
             '開設': 1,
             'さ': 1,
             'れる': 1,
             '。': 2,
             '展示': 1,
             '品': 1,
             'キネシス・パーソナル': 1,
             'Vision': 1,
            

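## 3. Training a linear classification model

Using the program built in Section 2, train a linear classifier that predicts the target variable (the `source` field, i.e. the source of the article) from a feature vector $\pmb{x}$ holding the frequencies of the words in the segmented text (`text.tokenized`) of the training data `Dtrain`.

Hints
+ When [sklearn.linear_model.SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) is used to implement the linear classifier, words and class names must be converted to integer IDs and the training instances to an `np.array`. Short Report 1, 7-2 is a useful reference for this conversion, and [sklearn.feature_extraction.DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) and [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) may also be used.
+ The feature space, i.e. the set of words the linear classifier can handle, may simply be all words that occur in the training data.
+ Some words never occur in the training data and appear only in the validation or test data. Such out-of-vocabulary (OOV) words can be ignored.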
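
The last two hints can be sketched in plain Python as follows. The helper names `build_vocab` and `to_dense` and the toy data are hypothetical; `DictVectorizer` performs an equivalent mapping and likewise ignores features it did not see during fitting:

```python
def build_vocab(docs):
    """Map each word seen in the training documents to a column ID."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_dense(doc, vocab):
    """Convert a word->count dict into a dense count vector; OOV words are skipped."""
    x = [0] * len(vocab)
    for word, count in doc.items():
        if word in vocab:  # words unseen in training are ignored
            x[vocab[word]] = count
    return x


# Toy word-frequency dicts (hypothetical data, not taken from the corpus).
train_docs = [{'映画': 2, '公開': 1}, {'スマホ': 3, '公開': 1}]
vocab = build_vocab(train_docs)
x_dev = to_dense({'映画': 1, '新語': 5}, vocab)  # '新語' is OOV and is dropped
```

The vocabulary is built from the training documents only, so validation and test vectors always have the same dimensionality as the training vectors.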

In [5]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing

import numpy as np

In [68]:
def token2vec(token):
    vec = defaultdict(int)
    for sentence in token:
        for word in sentence:
            vec[word] += 1
    return vec

In [87]:
train_vec = [token2vec(data['text.tokenized']) for data in Dtrain]
dev_vec = [token2vec(data['text.tokenized']) for data in Ddev]
test_vec = [token2vec(data['text.tokenized']) for data in Dtest]

In [88]:
# Convert word-frequency dicts to count vectors (the vocabulary is fitted on the training data only)
VX = DictVectorizer()
Xtrain = VX.fit_transform(train_vec).toarray()
Xdev = VX.transform(dev_vec).toarray()
Xtest = VX.transform(test_vec).toarray()

In [98]:
# source to ID
VY = preprocessing.LabelEncoder()
Ytrain = VY.fit_transform([data['source'] for data in Dtrain])
Ydev = VY.transform([data['source'] for data in Ddev])
Ytest = VY.transform([data['source'] for data in Dtest])

In [138]:
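# Logistic-regression loss; scikit-learn 1.1 renamed 'log' to 'log_loss', and later releases removed 'log'
model = SGDClassifier(loss='log')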
model.fit(Xtrain, Ytrain)

SGDClassifier(loss='log')

In [139]:
# Pickle the model
import pickle

with open('SGD_loss-log_1226.pickle', 'wb') as f:
    pickle.dump(model, f)

In [140]:
Ytrain_pred=model.predict(Xtrain)
model.score(Xtrain, Ytrain)

0.994747543205693

## 4. Accuracy on the validation data

Compute the accuracy on the validation data of the model trained in Section 3.

In [141]:
model.score(Xdev,Ydev)

0.937584803256445

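## 5. Macro-averaged precision, recall, and F1 score on the validation data

Compute the precision, recall, and F1 score on the validation data of the model trained in Section 3, using macro averaging for all three metrics.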

In [6]:
from sklearn.metrics import recall_score, precision_score, f1_score

In [142]:
Ydev_pred=model.predict(Xdev)

In [143]:
precision_score(Ydev, Ydev_pred, average='macro')

0.9360393590673417

In [144]:
recall_score(Ydev, Ydev_pred, average='macro')

0.9283801510999806

In [145]:
f1_score(Ydev, Ydev_pred, average='macro')

0.930737047396014

## 6. Confusion matrix on the validation data

Compute the confusion matrix on the validation data of the model trained in Section 3.
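
For reference, a confusion matrix simply counts each (true class, predicted class) pair. The sketch below builds one by hand from hypothetical toy labels, following the scikit-learn convention that rows index the true class and columns the predicted class:

```python
def confusion(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels for three classes (hypothetical data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion(y_true, y_pred, 3)
```

The diagonal holds the correctly classified instances, so off-diagonal entries show which classes are confused with each other.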

In [7]:
from sklearn.metrics import confusion_matrix

In [146]:
confusion_matrix(Ydev, Ydev_pred)

array([[90,  0,  0,  1,  1,  2,  0,  0,  0],
       [ 0, 78,  2,  0,  1,  0,  1,  2,  0],
       [ 0,  1, 85,  0,  0,  0,  0,  1,  1],
       [ 0,  1,  4, 39,  5,  3,  1,  1,  0],
       [ 0,  0,  1,  0, 77,  2,  0,  1,  1],
       [ 2,  0,  0,  3,  0, 73,  0,  0,  0],
       [ 0,  0,  1,  0,  0,  1, 81,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 92,  3],
       [ 0,  0,  1,  0,  0,  1,  0,  1, 76]])
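
The accuracy from Section 4 and the macro-averaged scores from Section 5 can be recomputed from this matrix alone. The sketch below copies the matrix from the output above; `macro_scores` is a hypothetical helper name:

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [90, 0, 0, 1, 1, 2, 0, 0, 0],
    [0, 78, 2, 0, 1, 0, 1, 2, 0],
    [0, 1, 85, 0, 0, 0, 0, 1, 1],
    [0, 1, 4, 39, 5, 3, 1, 1, 0],
    [0, 0, 1, 0, 77, 2, 0, 1, 1],
    [2, 0, 0, 3, 0, 73, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 81, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 92, 3],
    [0, 0, 1, 0, 0, 1, 0, 1, 76],
]


def macro_scores(cm):
    """Return macro-averaged (precision, recall, F1) for a square confusion matrix."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
macro_p, macro_r, macro_f1 = macro_scores(cm)
print(accuracy, macro_p, macro_r, macro_f1)
```

Row 3 (`livedoor-homme`) has only 39 of its 54 instances on the diagonal, which is what pulls the macro-averaged recall below the accuracy.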

## 7. Classifying an instance (*)

Using the model trained in Section 3, predict and display the class of the first instance in the validation data.

In [147]:
VY.classes_

array(['dokujo-tsushin', 'it-life-hack', 'kaden-channel',
       'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch',
       'topic-news'], dtype='<U14')

In [148]:
print(f'pred: {Ydev_pred[0]}, {VY.classes_[Ydev_pred[0]]}')
print(f'true: {Ydev[0]}, {VY.classes_[Ydev[0]]}')

pred: 5, peachy
true: 5, peachy


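## 8. Displaying classes and probabilities (*)

Using the model trained in Section 3, compute the probability (conditional probability) of each class for the first instance in the validation data.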

In [149]:
model.predict_proba(Xdev[0:1])

array([[1.32249729e-025, 2.43165634e-186, 5.60295358e-124,
        9.06614257e-289, 1.38627054e-226, 1.00000000e+000,
        0.00000000e+000, 9.65786427e-278, 1.22413911e-109]])
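
The probabilities above are extremely close to 0 or 1. According to the scikit-learn documentation, `SGDClassifier` with the logistic loss fits one binary classifier per class, and its multi-class probabilities are obtained by normalizing the per-class sigmoid outputs. With raw word counts as features, the decision scores can be very large in magnitude, which saturates the sigmoid. The following is a minimal sketch of that normalization under these assumptions, with hypothetical decision scores; it is not the library's actual implementation:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ovr_probabilities(scores):
    """Normalize per-class sigmoid outputs so that they sum to one."""
    probs = [sigmoid(z) for z in scores]
    total = sum(probs)
    # Design choice of this sketch: fall back to a uniform distribution
    # if every sigmoid underflows to zero.
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [p / total for p in probs]


# Hypothetical decision scores for three classes.
probs = ovr_probabilities([-50.0, 30.0, -400.0])
```

Scaling the features (for example with TF-IDF and L2 normalization, as tried in Section 9) keeps the decision scores in a range where the probabilities are less saturated.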

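## 9. Improving performance on the validation data (**)

Refine the hyperparameters of the category classifier and the feature extraction from articles to find the classifier with the highest F1 score on the validation data. Explain the strategy used to consider and test models, and which model performed best (write in Markdown). In addition, report the program that trains the best-performing classifier among those considered, together with its precision, recall, and F1 score on the test data.

+ In addition to the training parameters of the classifier (such as the L2 regularization coefficient), there is room to refine the article features as well.
+ If necessary, the processing from Section 1.2 onward may be changed, including the word segmentation method.
+ The split into training, validation, and test data must not be changed (the steps up to Section 1.1 may not be modified).
+ Needless to say, the classifier must not be trained on the test data.

The category classifier here was trained with `SGDClassifier` using the following processing.
- Preprocessing
    - Ignore particles, auxiliary verbs, and similar function words
    - Convert everything except numbers to its base form
    - Use the title as well as the body (the title is prepended to the body)
    - Remove URLs with a regular expression
- Vectorize the words with `TfidfVectorizer`
    - Of the methods tried, including `CountVectorizer`, TF-IDF gave the best results
    - Prune unneeded vocabulary by setting `max_df` and `min_df`
- Tune the parameters with Optuna
    - Tune the regularization parameters `alpha` and `l1_ratio`
    - Both tuning against the validation data and tuning by cross-validation on the training data alone were tried; the former gave the better score and was adopted
- Fit 100 times with the best parameters and keep the model with the highest F1 score on the validation data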
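
To make the TF-IDF step concrete, the sketch below reimplements the weighting in plain Python. It follows the formulas that scikit-learn documents for `sublinear_tf=True`, the default `smooth_idf=True`, and `norm='l2'`; the `min_df` and `max_df` filtering is omitted for brevity, and the toy documents are hypothetical:

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with sublinear TF, smoothed IDF, and L2 normalization."""
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Sublinear TF: 1 + ln(tf)
        vec = {w: (1 + math.log(c)) * idf[w] for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else vec)
    return vectors


# Toy tokenized documents (hypothetical data).
docs = [['映画', '映画', '公開'], ['スマホ', '公開'], ['スマホ', 'アプリ']]
vectors = tfidf(docs)
```

In the first document, `映画` receives a larger weight than `公開` because it occurs more often there and appears in fewer documents overall.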

The best result on the validation data was obtained with `params = {'alpha': 4.2842570078130996e-07, 'l1_ratio': 0.0014625643848466556}`:

|       | accuracy | precision | recall | f1_score |
| ----- | -------- | --------- | ------ | -------- |
| train | 1.0      | 1.0       | 1.0    | 1.0      |
| dev   | 0.972    | 0.971     | 0.967  | 0.968    |
| test  | 0.970    | 0.968     | 0.965  | 0.967    |

The confusion matrix on the validation data is as follows.

In [68]:
confusion_matrix(Ydev, Ydev_pred)

array([[90,  0,  0,  0,  1,  3,  0,  0,  0],
       [ 0, 84,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 88,  0,  0,  0,  0,  0,  0],
       [ 1,  0,  0, 47,  2,  3,  1,  0,  0],
       [ 0,  0,  0,  0, 81,  1,  0,  0,  0],
       [ 0,  0,  0,  1,  2, 75,  0,  0,  0],
       [ 0,  0,  2,  0,  0,  0, 81,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 95,  0],
       [ 0,  0,  0,  1,  0,  1,  0,  2, 75]])

In [107]:
print(f'precision:{precision_score(Ydev, Ydev_pred, average=None)}')
print(f'recall_score:{recall_score(Ydev, Ydev_pred, average=None)}')
print(f'f1_score:{f1_score(Ydev, Ydev_pred, average=None)}')

precision:[0.96703297 0.98809524 0.96666667 0.93877551 0.95294118 0.90123457
 0.97590361 0.96907216 0.96103896]
recall_score:[0.93617021 0.98809524 0.98863636 0.85185185 0.98780488 0.93589744
 0.97590361 0.98947368 0.93670886]
f1_score:[0.95135135 0.98809524 0.97752809 0.89320388 0.97005988 0.91823899
 0.97590361 0.97916667 0.94871795]


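As shown above, the recall of class 3 (`livedoor-homme`) is low.

`livedoor-homme` also has the fewest instances in the training data, which is the likely reason it is learned less well than the other classes.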

In [123]:
for i in range(len(VY.classes_)):
    print(f'{i}: {len(np.where(Ytrain == i)[0])}\t{VY.classes_[i]}')

0: 685	dokujo-tsushin
1: 694	it-life-hack
2: 697	kaden-channel
3: 405	livedoor-homme
4: 694	movie-enter
5: 687	peachy
6: 712	smax
7: 714	sports-watch
8: 614	topic-news


This suggested that training in the same way with `class_weight='balanced'` would improve performance, but as shown below, performance on the test data dropped instead.

(`{'alpha': 6.246369486906041e-07, 'l1_ratio': 0.0005504654723293009}`)

|       | accuracy | precision | recall | f1_score |
| ----- | -------- | --------- | ------ | -------- |
| train | 1.000    | 1.000     | 1.000  | 1.000    |
| dev   | 0.973    | 0.972     | 0.969  | 0.970    |
| test  | 0.966    | 0.965     | 0.961  | 0.963    |

In the confusion matrix, class 3 shows a very slight improvement, but the change is within the margin of error.

In [147]:
confusion_matrix(Ydev, Ydev_pred)

array([[91,  0,  0,  1,  0,  2,  0,  0,  0],
       [ 0, 84,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 88,  0,  0,  0,  0,  0,  0],
       [ 0,  2,  0, 48,  2,  2,  0,  0,  0],
       [ 0,  0,  0,  0, 81,  1,  0,  0,  0],
       [ 1,  0,  0,  1,  2, 74,  0,  0,  0],
       [ 0,  0,  2,  0,  0,  0, 81,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 94,  1],
       [ 0,  0,  0,  0,  0,  1,  0,  2, 76]])
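
For reference, scikit-learn computes the `'balanced'` class weights as `n_samples / (n_classes * count)`. The sketch below applies that formula to the training-set class counts printed earlier in this section:

```python
# Training-set class counts as printed earlier in this section.
counts = {
    'dokujo-tsushin': 685, 'it-life-hack': 694, 'kaden-channel': 697,
    'livedoor-homme': 405, 'movie-enter': 694, 'peachy': 687,
    'smax': 712, 'sports-watch': 714, 'topic-news': 614,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' weight: n_samples / (n_classes * number of instances in the class)
weights = {name: n_samples / (n_classes * c) for name, c in counts.items()}

for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f'{name}\t{w:.3f}')
```

`livedoor-homme` receives the largest weight (about 1.6), while the other classes stay close to 1, so the reweighting changes the loss only moderately.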

## TRIAL

In [2]:
# Libraries used (some of them are not actually used)

from sklearn.metrics import recall_score, precision_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_validate, StratifiedKFold

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing

from sklearn.preprocessing import StandardScaler

import optuna
from collections import defaultdict
import numpy as np
import re

import neologdn
import umap
from sklearn.decomposition import TruncatedSVD

In [69]:
# Load the data
import json

with open('livedoor.json') as fi:
    D = json.load(fi)
    
Dtrain = D['train']
Ddev = D['dev']
Dtest = D['test']

In [23]:
# Standardization
# Not used in the end because it had little effect
ss_X = StandardScaler()
Xtrain_ = ss_X.fit_transform(Xtrain)
Xdev_ = ss_X.transform(Xdev)
Xtest_ = ss_X.transform(Xtest)

In [3]:
import MeCab
tagger = MeCab.Tagger('-Owakati')

# Using mecab-ipadic-neologd did not improve performance
# neologd = MeCab.Tagger('-d /usr/lib/mecab/dic/mecab-ipadic-neologd')
# neologd.parse('') 

def tokenize(s):
    return tagger.parse(s).split()

def token2vec(token):
    vec = defaultdict(int)
    for sentence in token:
        for word in sentence:
            vec[word] += 1
    return vec

In [22]:
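urlre = re.compile(r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?')
symbolre = re.compile('[，．、。]')
# numre = re.compile(r'\d+')

def mytokenize(d):
    """Join the paragraphs of an article and return its content words in base form."""
    s = ''.join(d)
    token = []
    s = s.replace('\u3000', '')  # drop ideographic spaces
    s = neologdn.normalize(s)    # normalize() returns the normalized string, so assign it
    s = urlre.sub('', s)         # remove URLs
    s = symbolre.sub(' ', s)     # replace punctuation with spaces
    # s = numre.sub('0', s)
    node = tagger.parseToNode(s)
    while node:
        features = node.feature.split(',')
        pos = features[0]
        # pos_sub1 = features[1]
        base = features[6]
        if node.surface == '':
            node = node.next
            continue
        # Keep content words and symbols; particles and auxiliary verbs are dropped
        if pos in ['名詞', '動詞', '形容詞', '連体詞', '副詞', '感動詞', '記号']:  # and pos_sub1 not in ['非自立', '接尾']
            # Fall back to the surface form when no base form is available
            token.append(node.surface if base == '*' else base)
        node = node.next

    return token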
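
One property of the URL pattern above is worth illustrating. In Python 3, `\w` also matches Japanese characters, so text that directly follows a URL without a space is removed together with it. The sketch below compares the pattern as written with a variant compiled with `re.ASCII`; the variant and the sample sentence are assumptions for illustration and were not part of the reported experiments:

```python
import re

PATTERN = r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?'
url_unicode = re.compile(PATTERN)          # same pattern as `urlre` above
url_ascii = re.compile(PATTERN, re.ASCII)  # assumption: restrict \w to ASCII characters

# Hypothetical sentence in which Japanese text directly follows a URL.
s = '詳細はhttp://example.com/a/b?id=1を参照'
removed_unicode = url_unicode.sub('', s)
removed_ascii = url_ascii.sub('', s)

print(removed_unicode)
print(removed_ascii)
```

With the Unicode-aware pattern the trailing `を参照` is consumed as part of the URL path, whereas the ASCII variant stops at the end of the URL.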

In [88]:
# Vectorization with TF-IDF, which weights each word by the product of its term frequency and its inverse document frequency (the rarity of the word)
# min_df=3: drop terms that appear in fewer than 3 documents
# max_df=0.7: drop terms that appear in more than 70% of the documents
vectorizer = TfidfVectorizer(analyzer=mytokenize, min_df=3, max_df=0.7, norm='l2', sublinear_tf=True)
vectorizer.fit([[d['title']] + d['text'] for d in Dtrain])

# Input data
# The title and the body are combined before vectorization
Xtrain_tfidf = vectorizer.transform([[d['title']] + d['text'] for d in Dtrain])
Xdev_tfidf = vectorizer.transform([[d['title']] + d['text'] for d in Ddev])
Xtest_tfidf = vectorizer.transform([[d['title']] + d['text'] for d in Dtest])

# Class labels
VY = preprocessing.LabelEncoder()
Ytrain = VY.fit_transform([data['source'] for data in Dtrain])
Ydev = VY.transform([data['source'] for data in Ddev])
Ytest = VY.transform([data['source'] for data in Dtest])

In [695]:
# opt1 (not adopted)
# Tuning by cross-validation on the training data
Xtrain_opt = Xtrain_tfidf
Xdev_opt = Xdev_tfidf

def objective(trial):
    alpha = trial.suggest_loguniform('alpha', 1e-10, 1e-2)
    l1_ratio = trial.suggest_loguniform('l1_ratio', 1e-10, 1e-2)
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    # Note: scikit-learn uses l1_ratio only when penalty='elasticnet'; with the default penalty='l2' it is ignored
    clf = SGDClassifier(loss='log', alpha=alpha, l1_ratio=l1_ratio)
    scores = cross_validate(clf, X=Xtrain_opt, y=Ytrain, scoring='f1_macro', cv=skf)
    return scores['test_score'].mean()

In [24]:
# opt2
# Parameter tuning against the validation data
Xtrain_opt = Xtrain_tfidf
Xdev_opt = Xdev_tfidf

def objective(trial):
    alpha = trial.suggest_loguniform('alpha', 1e-10, 1e-2)
    l1_ratio = trial.suggest_loguniform('l1_ratio', 1e-10, 1)
    # Note: scikit-learn uses l1_ratio only when penalty='elasticnet'; with the default penalty='l2' it is ignored
    clf = SGDClassifier(loss='log', alpha=alpha, l1_ratio=l1_ratio)
    clf.fit(Xtrain_opt, Ytrain)
    Ydev_pred = clf.predict(Xdev_opt)
    return f1_score(Ydev, Ydev_pred, average="macro")
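
Both objectives sample `alpha` and `l1_ratio` on a logarithmic scale, which spreads the trials evenly across orders of magnitude (newer Optuna releases express the same idea as `suggest_float(..., log=True)`). The following standard-library sketch illustrates log-uniform sampling; `sample_loguniform` is a hypothetical helper and not Optuna's implementation:

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
samples = [sample_loguniform(rng, 1e-10, 1e-2) for _ in range(10000)]

# Roughly half of the samples fall below the geometric midpoint 1e-6.
share_below = sum(s < 1e-6 for s in samples) / len(samples)
```

A plain uniform draw on the same interval would place almost every sample above 1e-4 and would rarely explore small regularization strengths.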

In [27]:
# Run 1000 trials and adopt the best parameters
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)
params=study.best_params

[I 2020-12-27 01:47:04,087] A new study created in memory with name: no-name-124fff66-dac8-4e3b-8d5f-cfa2ab9c0e46
[I 2020-12-27 01:47:04,478] Trial 0 finished with value: 0.9341645729816757 and parameters: {'alpha': 0.00016174656759979948, 'l1_ratio': 4.66143678726133e-06}. Best is trial 0 with value: 0.9341645729816757.
[I 2020-12-27 01:47:04,824] Trial 1 finished with value: 0.9549495631913903 and parameters: {'alpha': 4.05765377420326e-06, 'l1_ratio': 1.424288322961808e-06}. Best is trial 1 with value: 0.9549495631913903.
[I 2020-12-27 01:47:05,198] Trial 2 finished with value: 0.9536654596686677 and parameters: {'alpha': 3.921209246786544e-06, 'l1_ratio': 3.076522524539712e-06}. Best is trial 1 with value: 0.9549495631913903.
[I 2020-12-27 01:47:05,600] Trial 3 finished with value: 0.9629073166762476 and parameters: {'alpha': 8.258553326504105e-07, 'l1_ratio': 7.693428924884304e-09}. Best is trial 3 with value: 0.96290731

[I 2020-12-27 01:47:21,998] Trial 37 finished with value: 0.9517440642364056 and parameters: {'alpha': 2.4816955822843077e-07, 'l1_ratio': 9.879270696502463e-09}. Best is trial 17 with value: 0.9648780123506495.
[I 2020-12-27 01:47:22,626] Trial 38 finished with value: 0.9448840794725444 and parameters: {'alpha': 1.1220525794198014e-08, 'l1_ratio': 5.466243572632491e-06}. Best is trial 17 with value: 0.9648780123506495.
[I 2020-12-27 01:47:23,109] Trial 39 finished with value: 0.9573675632420877 and parameters: {'alpha': 6.916069692258957e-06, 'l1_ratio': 2.6234169333329677e-09}. Best is trial 17 with value: 0.9648780123506495.
[I 2020-12-27 01:47:23,932] Trial 40 finished with value: 0.9547293363139823 and parameters: {'alpha': 2.5651893933187358e-09, 'l1_ratio': 0.0005444253926552143}. Best is trial 17 with value: 0.9648780123506495.
[I 2020-12-27 01:47:24,504] Trial 41 finished with value: 0.9566712151618274 and parameters

[I 2020-12-27 01:47:40,812] Trial 74 finished with value: 0.953055373431822 and parameters: {'alpha': 1.7952982775265987e-07, 'l1_ratio': 3.055031849005231e-09}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:47:41,312] Trial 75 finished with value: 0.9605413546736915 and parameters: {'alpha': 3.720340309765598e-07, 'l1_ratio': 1.0879059245572077e-05}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:47:41,846] Trial 76 finished with value: 0.9622621359915509 and parameters: {'alpha': 3.9548191010171895e-07, 'l1_ratio': 1.9275508780825363e-05}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:47:42,332] Trial 77 finished with value: 0.958345098851547 and parameters: {'alpha': 2.209300684537242e-06, 'l1_ratio': 3.108864297826192e-05}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:47:42,938] Trial 78 finished with value: 0.939822265399572 and parameters: {

[I 2020-12-27 01:47:59,372] Trial 111 finished with value: 0.9659728490635965 and parameters: {'alpha': 5.220060572219642e-07, 'l1_ratio': 1.3225695024173962e-06}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:47:59,799] Trial 112 finished with value: 0.9522495053026583 and parameters: {'alpha': 3.211357913472072e-07, 'l1_ratio': 3.76121660155949e-07}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:48:00,327] Trial 113 finished with value: 0.9632182151870319 and parameters: {'alpha': 1.5916007330937159e-06, 'l1_ratio': 1.838575344335788e-06}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:48:00,718] Trial 114 finished with value: 0.9599511952612035 and parameters: {'alpha': 1.6749502100577711e-06, 'l1_ratio': 1.206379353141307e-06}. Best is trial 50 with value: 0.9682699321380732.
[I 2020-12-27 01:48:01,111] Trial 115 finished with value: 0.9582094662378755 and paramet

[I 2020-12-27 01:48:16,961] Trial 148 finished with value: 0.9543002622459922 and parameters: {'alpha': 1.1399299680981366e-06, 'l1_ratio': 6.124547355163848e-09}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:17,606] Trial 149 finished with value: 0.9517304491422328 and parameters: {'alpha': 2.674566518842881e-07, 'l1_ratio': 1.0782378016010889e-10}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:18,107] Trial 150 finished with value: 0.9572920463015012 and parameters: {'alpha': 3.3271478222760886e-06, 'l1_ratio': 1.95222590910821e-05}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:18,637] Trial 151 finished with value: 0.9533291639462518 and parameters: {'alpha': 5.846548011137138e-07, 'l1_ratio': 1.2721138782575881e-05}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:19,041] Trial 152 finished with value: 0.9538982442187603 and parame

[I 2020-12-27 01:48:32,451] Trial 185 finished with value: 0.9631494177587437 and parameters: {'alpha': 3.1803007797486807e-07, 'l1_ratio': 3.384475447378584e-07}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:32,884] Trial 186 finished with value: 0.9612403194878864 and parameters: {'alpha': 3.286463298267112e-07, 'l1_ratio': 4.606021742467956e-07}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:33,303] Trial 187 finished with value: 0.9650301251783309 and parameters: {'alpha': 2.0305365077803043e-07, 'l1_ratio': 6.396071026992771e-08}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:33,704] Trial 188 finished with value: 0.9569280259336144 and parameters: {'alpha': 1.0242488853982225e-07, 'l1_ratio': 5.730466259117196e-08}. Best is trial 122 with value: 0.968464444980014.
[I 2020-12-27 01:48:34,118] Trial 189 finished with value: 0.9560873994562294 and parame

[I 2020-12-27 01:48:47,187] Trial 222 finished with value: 0.9573295578005268 and parameters: {'alpha': 6.699895937107699e-07, 'l1_ratio': 1.9761557787239824e-07}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:48:47,543] Trial 223 finished with value: 0.9671689305373936 and parameters: {'alpha': 1.0705681723999051e-06, 'l1_ratio': 0.00972342160344605}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:48:47,918] Trial 224 finished with value: 0.9601849679028999 and parameters: {'alpha': 4.6392612722402934e-07, 'l1_ratio': 0.006996474560807697}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:48:48,297] Trial 225 finished with value: 0.9557767227702799 and parameters: {'alpha': 1.008686171092715e-06, 'l1_ratio': 0.03247132179174259}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:48:48,686] Trial 226 finished with value: 0.9677108444183169 and paramet

[I 2020-12-27 01:49:02,204] Trial 259 finished with value: 0.9602756846504216 and parameters: {'alpha': 1.208359003161694e-06, 'l1_ratio': 7.371326225496188e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:02,544] Trial 260 finished with value: 0.9635744427987579 and parameters: {'alpha': 2.2269422293653257e-06, 'l1_ratio': 3.0448722663248823e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:02,903] Trial 261 finished with value: 0.9564274010396431 and parameters: {'alpha': 1.9709967088961716e-06, 'l1_ratio': 2.8613385714278102e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:03,259] Trial 262 finished with value: 0.9586924782446195 and parameters: {'alpha': 2.249587653582013e-06, 'l1_ratio': 1.530310836952946e-08}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:03,676] Trial 263 finished with value: 0.9565694192657364 and p

[I 2020-12-27 01:49:16,687] Trial 296 finished with value: 0.9577533328426375 and parameters: {'alpha': 1.3333883108734458e-07, 'l1_ratio': 1.2082126127754019e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:17,060] Trial 297 finished with value: 0.9546332700959683 and parameters: {'alpha': 1.135175046527549e-06, 'l1_ratio': 0.016372822670203055}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:17,485] Trial 298 finished with value: 0.9594576399414458 and parameters: {'alpha': 6.158357121801793e-07, 'l1_ratio': 3.568539177272455e-08}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:17,913] Trial 299 finished with value: 0.9567756313808301 and parameters: {'alpha': 2.320051715863372e-07, 'l1_ratio': 2.0041587001898983e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:18,282] Trial 300 finished with value: 0.9598010233497057 and par

[I 2020-12-27 01:49:31,380] Trial 333 finished with value: 0.9601593161747721 and parameters: {'alpha': 2.8656456385361035e-07, 'l1_ratio': 0.0032180791896817586}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:31,752] Trial 334 finished with value: 0.9613359770692598 and parameters: {'alpha': 3.9509867887237576e-07, 'l1_ratio': 0.006406264831236288}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:32,106] Trial 335 finished with value: 0.9606638646434287 and parameters: {'alpha': 1.1387344009369466e-06, 'l1_ratio': 0.014167545577942087}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:32,510] Trial 336 finished with value: 0.9647620280636061 and parameters: {'alpha': 1.877118729471922e-07, 'l1_ratio': 3.765777277425439e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:32,922] Trial 337 finished with value: 0.9499382761824917 and para

[I 2020-12-27 01:49:46,209] Trial 370 finished with value: 0.9600104975889191 and parameters: {'alpha': 2.5134828752814205e-07, 'l1_ratio': 4.010963651915077e-08}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:46,588] Trial 371 finished with value: 0.9616825890913003 and parameters: {'alpha': 8.775038771733773e-07, 'l1_ratio': 2.9051473747370977e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:46,956] Trial 372 finished with value: 0.9596914701551422 and parameters: {'alpha': 1.408610630970264e-06, 'l1_ratio': 0.00013158567454195133}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:47,361] Trial 373 finished with value: 0.9523731821552421 and parameters: {'alpha': 3.1108533714685697e-07, 'l1_ratio': 3.5121152677185018e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:49:47,753] Trial 374 finished with value: 0.9649702778381509 and 

[I 2020-12-27 01:50:02,592] Trial 407 finished with value: 0.806486144583562 and parameters: {'alpha': 0.005372285153192789, 'l1_ratio': 0.00045250594450848875}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:02,973] Trial 408 finished with value: 0.9586802571294092 and parameters: {'alpha': 1.0446879270534534e-06, 'l1_ratio': 8.168206675903832e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:03,382] Trial 409 finished with value: 0.9605739113488655 and parameters: {'alpha': 4.1900362185662955e-07, 'l1_ratio': 2.98517388605919e-07}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:03,761] Trial 410 finished with value: 0.9593478050155626 and parameters: {'alpha': 6.318797497175431e-07, 'l1_ratio': 7.883359281589192e-07}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:04,117] Trial 411 finished with value: 0.9574315734330605 and param

[I 2020-12-27 01:50:18,985] Trial 444 finished with value: 0.9517260709099382 and parameters: {'alpha': 8.754574674298228e-08, 'l1_ratio': 3.1392650806733104e-06}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:19,570] Trial 445 finished with value: 0.9557325659262433 and parameters: {'alpha': 1.5621158291909428e-06, 'l1_ratio': 9.03606648446171e-06}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:20,037] Trial 446 finished with value: 0.956432800180238 and parameters: {'alpha': 4.5281806692985887e-07, 'l1_ratio': 1.6732064183813906e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:20,454] Trial 447 finished with value: 0.9580623417415796 and parameters: {'alpha': 9.206849431185254e-07, 'l1_ratio': 2.6575580582751577e-06}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:20,909] Trial 448 finished with value: 0.9564270365075921 and pa

[I 2020-12-27 01:50:36,368] Trial 481 finished with value: 0.9599879696698513 and parameters: {'alpha': 2.472606156858938e-06, 'l1_ratio': 2.9550864041918417e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:36,790] Trial 482 finished with value: 0.9633714082364436 and parameters: {'alpha': 4.682398044740363e-07, 'l1_ratio': 0.027815636910908517}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:37,183] Trial 483 finished with value: 0.9522264076330517 and parameters: {'alpha': 2.7499849327235783e-07, 'l1_ratio': 8.47643604657135e-06}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:37,563] Trial 484 finished with value: 0.9621486040922791 and parameters: {'alpha': 7.827055069379123e-07, 'l1_ratio': 2.151080442257579e-09}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:37,923] Trial 485 finished with value: 0.9519038432221935 and param

[I 2020-12-27 01:50:50,836] Trial 518 finished with value: 0.9619491433082294 and parameters: {'alpha': 1.3954790689751097e-06, 'l1_ratio': 1.1972891416564965e-05}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:51,241] Trial 519 finished with value: 0.956578222636517 and parameters: {'alpha': 3.3187345813966776e-07, 'l1_ratio': 4.026025647530634e-08}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:51,627] Trial 520 finished with value: 0.9592447237402403 and parameters: {'alpha': 1.704132551951636e-07, 'l1_ratio': 4.560899195612813e-06}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:52,015] Trial 521 finished with value: 0.9581843545911816 and parameters: {'alpha': 7.11663998582011e-07, 'l1_ratio': 1.2385993559085906e-07}. Best is trial 214 with value: 0.9691920683625413.
[I 2020-12-27 01:50:52,393] Trial 522 finished with value: 0.9554712619267344 and par

[32m[I 2020-12-27 01:51:06,224][0m Trial 555 finished with value: 0.952480590387958 and parameters: {'alpha': 2.560625028113753e-07, 'l1_ratio': 6.841827871008456e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:06,612][0m Trial 556 finished with value: 0.9576553475089453 and parameters: {'alpha': 8.150219304520064e-07, 'l1_ratio': 4.459956989986585e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:07,064][0m Trial 557 finished with value: 0.9429860271830783 and parameters: {'alpha': 6.902604412577363e-08, 'l1_ratio': 9.862842306840238e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:07,487][0m Trial 558 finished with value: 0.9553187461910932 and parameters: {'alpha': 1.4705566033417888e-07, 'l1_ratio': 4.231094493573552e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:07,906][0m Trial 559 finished with value: 0.9548886433854543 and param

[32m[I 2020-12-27 01:51:20,734][0m Trial 592 finished with value: 0.955171292145911 and parameters: {'alpha': 1.1373211056210178e-06, 'l1_ratio': 3.9964461476427455e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:21,119][0m Trial 593 finished with value: 0.9632021789757234 and parameters: {'alpha': 7.628812238661993e-07, 'l1_ratio': 1.5008253787309505e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:21,512][0m Trial 594 finished with value: 0.9482669618575628 and parameters: {'alpha': 4.5223586013290005e-07, 'l1_ratio': 2.6924867564650146e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:21,886][0m Trial 595 finished with value: 0.9645372730489125 and parameters: {'alpha': 1.4699663319517123e-06, 'l1_ratio': 9.977355095752331e-08}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:22,279][0m Trial 596 finished with value: 0.9640246458458341 and 

[32m[I 2020-12-27 01:51:35,521][0m Trial 629 finished with value: 0.9640365589499884 and parameters: {'alpha': 9.861829203979364e-07, 'l1_ratio': 6.216512474404294e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:35,881][0m Trial 630 finished with value: 0.9604632880909842 and parameters: {'alpha': 2.247753018729187e-06, 'l1_ratio': 1.080397381724308e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:36,294][0m Trial 631 finished with value: 0.9590141619188738 and parameters: {'alpha': 9.32264233926365e-08, 'l1_ratio': 3.804076245253861e-09}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:36,740][0m Trial 632 finished with value: 0.9552988141738165 and parameters: {'alpha': 2.048513016019897e-07, 'l1_ratio': 1.3047001089233108e-09}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:37,141][0m Trial 633 finished with value: 0.9569807066446369 and param

[32m[I 2020-12-27 01:51:49,814][0m Trial 666 finished with value: 0.9562119152353223 and parameters: {'alpha': 4.620993178220396e-07, 'l1_ratio': 1.0573097481686496e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:50,196][0m Trial 667 finished with value: 0.9490082061013775 and parameters: {'alpha': 7.080846438333022e-07, 'l1_ratio': 4.2966303503341396e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:50,514][0m Trial 668 finished with value: 0.9549615488040912 and parameters: {'alpha': 7.3980825612012255e-06, 'l1_ratio': 3.6557790317694427e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:50,908][0m Trial 669 finished with value: 0.959541268645936 and parameters: {'alpha': 3.381510815425196e-07, 'l1_ratio': 6.5142183927986e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:51:51,273][0m Trial 670 finished with value: 0.9608268648741076 and para

[32m[I 2020-12-27 01:52:04,646][0m Trial 703 finished with value: 0.9599454553869937 and parameters: {'alpha': 8.375118912048078e-07, 'l1_ratio': 0.011532030015820078}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:04,993][0m Trial 704 finished with value: 0.9535989723867083 and parameters: {'alpha': 3.4562624582347084e-06, 'l1_ratio': 2.145011952690577e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:05,416][0m Trial 705 finished with value: 0.9493718334857976 and parameters: {'alpha': 2.092199430663181e-07, 'l1_ratio': 4.422494591765705e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:05,786][0m Trial 706 finished with value: 0.9628860563009043 and parameters: {'alpha': 1.4442492255436575e-06, 'l1_ratio': 1.376868182627343e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:06,183][0m Trial 707 finished with value: 0.9482345126346723 and para

[32m[I 2020-12-27 01:52:18,969][0m Trial 740 finished with value: 0.9618248059849441 and parameters: {'alpha': 1.04699373663088e-06, 'l1_ratio': 0.006483925140495018}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:19,360][0m Trial 741 finished with value: 0.9504356739668169 and parameters: {'alpha': 7.909524672565416e-07, 'l1_ratio': 0.0050640304257813015}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:19,726][0m Trial 742 finished with value: 0.9626575828956313 and parameters: {'alpha': 1.5651020587367506e-06, 'l1_ratio': 0.002679968573918007}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:20,111][0m Trial 743 finished with value: 0.9595243128985117 and parameters: {'alpha': 6.897519237079814e-07, 'l1_ratio': 0.010978090409979949}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:20,482][0m Trial 744 finished with value: 0.9568519294194426 and paramete

[32m[I 2020-12-27 01:52:33,250][0m Trial 777 finished with value: 0.9564996704223293 and parameters: {'alpha': 4.778533211448791e-07, 'l1_ratio': 1.5274930708413164e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:33,628][0m Trial 778 finished with value: 0.9556283814549251 and parameters: {'alpha': 1.104090602432534e-06, 'l1_ratio': 0.001934713381800173}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:34,042][0m Trial 779 finished with value: 0.9523427366406152 and parameters: {'alpha': 1.1428741327073431e-07, 'l1_ratio': 2.8935925371138915e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:34,401][0m Trial 780 finished with value: 0.9591249177986244 and parameters: {'alpha': 2.818968039154277e-06, 'l1_ratio': 0.036106721214162696}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:34,827][0m Trial 781 finished with value: 0.9637407584354943 and para

[32m[I 2020-12-27 01:52:47,935][0m Trial 814 finished with value: 0.9607997740658855 and parameters: {'alpha': 2.291711737577322e-06, 'l1_ratio': 5.849503599613816e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:48,435][0m Trial 815 finished with value: 0.9566340812568165 and parameters: {'alpha': 1.2287076190750484e-06, 'l1_ratio': 2.9945482147709944e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:48,846][0m Trial 816 finished with value: 0.9515415550667482 and parameters: {'alpha': 5.889470463318363e-07, 'l1_ratio': 6.632496728428655e-07}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:49,232][0m Trial 817 finished with value: 0.9585596875043336 and parameters: {'alpha': 8.388480177896972e-07, 'l1_ratio': 1.5136845877352137e-06}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:52:49,616][0m Trial 818 finished with value: 0.9563279360687008 and pa

[32m[I 2020-12-27 01:53:03,376][0m Trial 851 finished with value: 0.9620619758436759 and parameters: {'alpha': 7.751616992139034e-07, 'l1_ratio': 0.004865194191767008}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:03,807][0m Trial 852 finished with value: 0.9602808874293451 and parameters: {'alpha': 1.2991682932164096e-06, 'l1_ratio': 2.036817944343516e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:04,200][0m Trial 853 finished with value: 0.9609420359719049 and parameters: {'alpha': 5.107184441497404e-07, 'l1_ratio': 0.00017999581868006521}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:04,560][0m Trial 854 finished with value: 0.9580583558174968 and parameters: {'alpha': 1.89259938017136e-06, 'l1_ratio': 4.846088199602143e-10}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:04,930][0m Trial 855 finished with value: 0.9570757604313975 and param

[32m[I 2020-12-27 01:53:17,844][0m Trial 888 finished with value: 0.9488998806463955 and parameters: {'alpha': 5.318562030479874e-07, 'l1_ratio': 0.029881193028136667}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:18,224][0m Trial 889 finished with value: 0.9644141498905527 and parameters: {'alpha': 1.2267104948139495e-06, 'l1_ratio': 0.004511942565723841}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:18,611][0m Trial 890 finished with value: 0.9622970763692478 and parameters: {'alpha': 7.552770507972276e-07, 'l1_ratio': 0.005019696310007065}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:19,011][0m Trial 891 finished with value: 0.9560746834238615 and parameters: {'alpha': 3.402068219122442e-07, 'l1_ratio': 0.009360564892552301}. Best is trial 214 with value: 0.9691920683625413.[0m
[32m[I 2020-12-27 01:53:19,386][0m Trial 892 finished with value: 0.9545276342141737 and paramete

[32m[I 2020-12-27 01:53:32,851][0m Trial 925 finished with value: 0.9597000136528524 and parameters: {'alpha': 1.9046436149809141e-07, 'l1_ratio': 0.0016877051477186074}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:33,237][0m Trial 926 finished with value: 0.9541325585736808 and parameters: {'alpha': 4.523196296106364e-07, 'l1_ratio': 0.009227479707147384}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:33,643][0m Trial 927 finished with value: 0.9614042671375217 and parameters: {'alpha': 6.526779815402007e-07, 'l1_ratio': 0.006051559736345411}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:33,993][0m Trial 928 finished with value: 0.9413301336863442 and parameters: {'alpha': 9.757827396683872e-05, 'l1_ratio': 0.0031332367757568942}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:34,392][0m Trial 929 finished with value: 0.9546261776045054 and parame

[32m[I 2020-12-27 01:53:47,341][0m Trial 962 finished with value: 0.9525211411722653 and parameters: {'alpha': 4.580233512924963e-07, 'l1_ratio': 5.205312735450015e-07}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:47,718][0m Trial 963 finished with value: 0.9582956598498922 and parameters: {'alpha': 1.2227950946427216e-06, 'l1_ratio': 2.4500041961282285e-07}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:48,139][0m Trial 964 finished with value: 0.9542371716917047 and parameters: {'alpha': 2.932930370786567e-07, 'l1_ratio': 0.001435946318142203}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:48,525][0m Trial 965 finished with value: 0.8366619899965516 and parameters: {'alpha': 0.0024228572855517654, 'l1_ratio': 0.006988505770394634}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:53:48,927][0m Trial 966 finished with value: 0.9617835641639461 and param

[32m[I 2020-12-27 01:54:02,428][0m Trial 999 finished with value: 0.9546674131765768 and parameters: {'alpha': 2.820775927017957e-07, 'l1_ratio': 4.6551485496073425e-06}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:02,826][0m Trial 1000 finished with value: 0.9581597774454218 and parameters: {'alpha': 4.6863174021725865e-07, 'l1_ratio': 0.0024422693999005406}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:03,218][0m Trial 1001 finished with value: 0.9607507143301236 and parameters: {'alpha': 6.003029346982473e-07, 'l1_ratio': 0.00022979025854738612}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:03,643][0m Trial 1002 finished with value: 0.9514597749855587 and parameters: {'alpha': 2.04745651630705e-07, 'l1_ratio': 0.0035471969473356664}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:04,057][0m Trial 1003 finished with value: 0.9586655984734137 and

[32m[I 2020-12-27 01:54:17,012][0m Trial 1036 finished with value: 0.9552016070907143 and parameters: {'alpha': 1.9782387753816634e-06, 'l1_ratio': 2.041344554826877e-06}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:17,384][0m Trial 1037 finished with value: 0.9646689176278941 and parameters: {'alpha': 1.0469095216504847e-06, 'l1_ratio': 6.476517770799823e-06}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:17,753][0m Trial 1038 finished with value: 0.957683112344982 and parameters: {'alpha': 1.48525609652618e-06, 'l1_ratio': 1.4032203624446674e-06}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:18,160][0m Trial 1039 finished with value: 0.948967673285946 and parameters: {'alpha': 2.120740253163237e-07, 'l1_ratio': 0.007431235900716517}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:18,529][0m Trial 1040 finished with value: 0.953253624224015 and pa

[32m[I 2020-12-27 01:54:32,563][0m Trial 1073 finished with value: 0.9620000062143063 and parameters: {'alpha': 7.574639537982776e-07, 'l1_ratio': 0.006789857342584852}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:32,957][0m Trial 1074 finished with value: 0.9594060597606036 and parameters: {'alpha': 5.403297693725514e-07, 'l1_ratio': 0.0018573449904575653}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:33,404][0m Trial 1075 finished with value: 0.883212739065433 and parameters: {'alpha': 0.0008589158745808311, 'l1_ratio': 0.01593865178083446}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:33,851][0m Trial 1076 finished with value: 0.9552906214490144 and parameters: {'alpha': 1.6083621313509292e-06, 'l1_ratio': 0.0005359599665017151}. Best is trial 898 with value: 0.9692415780683468.[0m
[32m[I 2020-12-27 01:54:34,401][0m Trial 1077 finished with value: 0.9508925330553455 and par

In [60]:
# With the hyperparameters fixed, train 100 times and keep the model with the highest macro-F1 on the validation data
best_score = 0
for i in range(100):
    clf = SGDClassifier(loss='log')  # logistic regression; newer scikit-learn versions name this loss 'log_loss'
    clf.set_params(**params)
    clf.fit(Xtrain_tfidf, Ytrain)
    Ydev_pred = clf.predict(Xdev_tfidf)
    score = f1_score(Ydev, Ydev_pred, average="macro")  # compute the score once per run
    if best_score < score:
        best_model = clf
        best_score = score
print(best_score)
clf = best_model

0.9684165110263153
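The loop above gives a different model on every run because `SGDClassifier` shuffles the training data randomly. A minimal sketch of the same best-of-N selection that also records which seed produced the winner is shown below. `train_and_score` is a hypothetical stand-in for "fit with `random_state=seed`, then compute macro-F1 on the validation data"; it is not part of the original notebook and only simulates a score.

```python
import random


def train_and_score(seed):
    """Hypothetical stand-in: in the notebook this would fit
    SGDClassifier(random_state=seed, ...) and return the validation macro-F1.
    Here it only simulates a seed-dependent score."""
    rng = random.Random(seed)
    return 0.95 + 0.02 * rng.random()


def best_of_n(n_runs, score_fn):
    """Run score_fn for seeds 0..n_runs-1 and return (best_seed, best_score)."""
    best_seed, best_score = None, float("-inf")
    for seed in range(n_runs):
        score = score_fn(seed)
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score


best_seed, best_score = best_of_n(100, train_and_score)
print(best_seed, best_score)
```

Keeping the seed makes it possible to retrain the selected model later. Note that picking the best of many runs on the validation data tends to make the validation score slightly optimistic, so the test score remains the figure to report.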


In [None]:
# Pickle the model
import pickle

with open('SGD_best.pickle', 'wb') as f:
    pickle.dump(clf, f)
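For completeness, a small sketch of the save/load round trip. The helper names `save_model` and `load_model` are assumptions introduced here, and a plain dictionary stands in for the trained classifier so that the example runs without scikit-learn; in the notebook the object would be `clf` and the path `'SGD_best.pickle'`.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted classifier) to path."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model.
    Only unpickle files you trust: unpickling can execute arbitrary code."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the trained model (assumption: any picklable object works the same way)
stand_in = {"alpha": 1e-6, "classes": ["it-life-hack", "movie-enter"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pickle"
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)
```

When loading a pickled scikit-learn model, it is safest to use the same scikit-learn version that was used to save it.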

In [65]:
# Print the results (model with the best validation score)
Ytrain_pred=clf.predict(Xtrain_tfidf)
Ydev_pred=clf.predict(Xdev_tfidf)
Ytest_pred=clf.predict(Xtest_tfidf)
print(params)
print(f'train: {clf.score(Xtrain_tfidf, Ytrain)}')
print(f'precision:{precision_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'recall_score:{recall_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'f1_score:{f1_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'dev: {clf.score(Xdev_tfidf, Ydev)}')
print(f'precision:{precision_score(Ydev, Ydev_pred, average="macro")}')
print(f'recall_score:{recall_score(Ydev, Ydev_pred, average="macro")}')
print(f'f1_score:{f1_score(Ydev, Ydev_pred, average="macro")}')
print(f'test: {clf.score(Xtest_tfidf, Ytest)}')
print(f'test precision:{precision_score(Ytest, Ytest_pred, average="macro")}')
print(f'test recall_score:{recall_score(Ytest, Ytest_pred, average="macro")}')
print(f'test f1_score:{f1_score(Ytest, Ytest_pred, average="macro")}')

{'alpha': 0.003808901322441786, 'l1_ratio': 5.332697964899795e-08}
train: 1.0
precision:1.0
recall_score:1.0
f1_score:1.0
dev: 0.9715061058344641
precision:0.9709592982837232
recall_score:0.9669368023926308
f1_score:0.9684165110263153
test: 0.9701492537313433
test precision:0.968393445978932
test recall_score:0.9652803286692152
test f1_score:0.9665045737809737
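All scores above use `average="macro"`: the metric is computed per class and the per-class values are averaged with equal weight, so small categories count as much as large ones. The following is a plain-Python sketch of macro-F1 for illustration; the function name `macro_f1` and the toy labels are assumptions, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.
    A class with no true and no predicted samples cannot occur here because
    labels are taken from the union of y_true and y_pred; a class whose F1
    is undefined (zero denominator) contributes 0.0."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three classes (hypothetical labels)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(macro_f1(y_true, y_pred))
```

In the toy example the per-class F1 values are 1, 2/3 and 4/5, and the macro score is their mean.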



In [612]:
# Dimensionality reduction with SVD (not adopted)
svd=TruncatedSVD(n_components=1000)
svd.fit(Xtrain_tfidf)

Xtrain_svd=svd.transform(Xtrain_tfidf)
Xdev_svd=svd.transform(Xdev_tfidf)
Xtest_svd=svd.transform(Xtest_tfidf)

In [62]:
# Dimensionality reduction with UMAP (not adopted)
um=umap.UMAP()
um.fit(Xtrain_tfidf)

Xtrain_um=um.transform(Xtrain_tfidf)
Xdev_um=um.transform(Xdev_tfidf)
Xtest_um=um.transform(Xtest_tfidf)

---

In [124]:
# opt2
# Hyperparameter tuning using the validation data
Xtrain_opt = Xtrain_tfidf
Xdev_opt = Xdev_tfidf
def objective(trial):
    # Sample both hyperparameters on a log scale
    alpha = trial.suggest_loguniform('alpha', 1e-10, 1e-2)
    l1_ratio = trial.suggest_loguniform('l1_ratio', 1e-10, 1)
    # Note: l1_ratio only takes effect with penalty='elasticnet'; with the default penalty='l2' it is ignored
    clf = SGDClassifier(loss='log', alpha=alpha, l1_ratio=l1_ratio, class_weight='balanced')
    clf.fit(Xtrain_opt, Ytrain)
    Ydev_pred = clf.predict(Xdev_opt)
    # Objective: macro-F1 on the validation data
    return f1_score(Ydev, Ydev_pred, average="macro")
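The objective draws `alpha` and `l1_ratio` on a log scale, which means every order of magnitude in the range is equally likely to be tried. A rough sketch of what log-uniform sampling does is given below; `sample_loguniform` is a hypothetical helper written for illustration. It only mimics the search space, not Optuna's default TPE sampler, which concentrates later trials around promising regions.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Same range as 'alpha' in the objective: 1e-10 to 1e-2
alphas = [sample_loguniform(rng, 1e-10, 1e-2) for _ in range(1000)]

# 1e-6 is the geometric midpoint of the range, so roughly half of the
# samples should fall below it
share_small = sum(a < 1e-6 for a in alphas) / len(alphas)
print(min(alphas), max(alphas), share_small)
```

In recent Optuna versions `trial.suggest_loguniform` is deprecated in favor of `trial.suggest_float(name, low, high, log=True)`, which expresses the same search space.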

In [125]:
# Run 1000 trials and adopt the best parameters
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)
params=study.best_params

[32m[I 2020-12-27 03:46:39,008][0m A new study created in memory with name: no-name-130256c0-6778-4d9e-afaf-7feda2ab9026[0m
[32m[I 2020-12-27 03:46:39,440][0m Trial 0 finished with value: 0.9588744527324197 and parameters: {'alpha': 8.279140196285479e-09, 'l1_ratio': 9.915855115470507e-06}. Best is trial 0 with value: 0.9588744527324197.[0m
[32m[I 2020-12-27 03:46:39,899][0m Trial 1 finished with value: 0.9465959130077801 and parameters: {'alpha': 4.22002398811827e-10, 'l1_ratio': 0.0007274646929666491}. Best is trial 0 with value: 0.9588744527324197.[0m
[32m[I 2020-12-27 03:46:40,417][0m Trial 2 finished with value: 0.9512609071743614 and parameters: {'alpha': 3.0526300755634705e-08, 'l1_ratio': 0.00870163738357796}. Best is trial 0 with value: 0.9588744527324197.[0m
[32m[I 2020-12-27 03:46:40,994][0m Trial 3 finished with value: 0.9565732809719001 and parameters: {'alpha': 4.283831129552136e-09, 'l1_ratio': 1.0045453469452455e-05}. Best is trial 0 with value: 0.95887445

[32m[I 2020-12-27 03:46:56,689][0m Trial 37 finished with value: 0.9491310051638246 and parameters: {'alpha': 1.1391849458135147e-07, 'l1_ratio': 2.0019582300922656e-05}. Best is trial 25 with value: 0.9599879696698513.[0m
[32m[I 2020-12-27 03:46:57,203][0m Trial 38 finished with value: 0.9588662861082857 and parameters: {'alpha': 1.1408618913238915e-06, 'l1_ratio': 3.976882475880406e-08}. Best is trial 25 with value: 0.9599879696698513.[0m
[32m[I 2020-12-27 03:46:57,715][0m Trial 39 finished with value: 0.9507077534962016 and parameters: {'alpha': 1.5346578588374277e-08, 'l1_ratio': 9.28294749357772e-06}. Best is trial 25 with value: 0.9599879696698513.[0m
[32m[I 2020-12-27 03:46:58,085][0m Trial 40 finished with value: 0.9514758125735872 and parameters: {'alpha': 5.115264599959507e-07, 'l1_ratio': 0.007197900814424878}. Best is trial 25 with value: 0.9599879696698513.[0m
[32m[I 2020-12-27 03:46:58,491][0m Trial 41 finished with value: 0.9533274436317049 and parameters: 

[32m[I 2020-12-27 03:47:13,827][0m Trial 74 finished with value: 0.9623937443347685 and parameters: {'alpha': 9.501312727911903e-07, 'l1_ratio': 1.9053080237051481e-07}. Best is trial 63 with value: 0.9629312452339016.[0m
[32m[I 2020-12-27 03:47:14,233][0m Trial 75 finished with value: 0.9539678942601382 and parameters: {'alpha': 9.137989872931339e-07, 'l1_ratio': 1.9134007107048416e-07}. Best is trial 63 with value: 0.9629312452339016.[0m
[32m[I 2020-12-27 03:47:14,737][0m Trial 76 finished with value: 0.9516201485665634 and parameters: {'alpha': 2.735275544803073e-07, 'l1_ratio': 1.3375376667423002e-06}. Best is trial 63 with value: 0.9629312452339016.[0m
[32m[I 2020-12-27 03:47:15,265][0m Trial 77 finished with value: 0.9550429229234124 and parameters: {'alpha': 1.610172031761822e-07, 'l1_ratio': 3.9110062212722565e-07}. Best is trial 63 with value: 0.9629312452339016.[0m
[32m[I 2020-12-27 03:47:15,659][0m Trial 78 finished with value: 0.9640866933254993 and parameters

[32m[I 2020-12-27 03:47:31,406][0m Trial 111 finished with value: 0.9566453414177682 and parameters: {'alpha': 1.828987594858314e-06, 'l1_ratio': 2.1655133653305747e-07}. Best is trial 91 with value: 0.9644819934909046.[0m
[32m[I 2020-12-27 03:47:31,887][0m Trial 112 finished with value: 0.9653795099535228 and parameters: {'alpha': 2.7803637602534817e-06, 'l1_ratio': 4.739993378772792e-07}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:32,289][0m Trial 113 finished with value: 0.9537395994831392 and parameters: {'alpha': 2.8164616990266004e-06, 'l1_ratio': 1.0761559099267666e-07}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:32,634][0m Trial 114 finished with value: 0.9560079427787787 and parameters: {'alpha': 1.2781865311744027e-05, 'l1_ratio': 6.082589751582932e-07}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:32,982][0m Trial 115 finished with value: 0.9541785293697226 and p

[32m[I 2020-12-27 03:47:48,197][0m Trial 148 finished with value: 0.9537168642833402 and parameters: {'alpha': 2.17175902478213e-07, 'l1_ratio': 6.581254280166436e-08}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:48,647][0m Trial 149 finished with value: 0.9408208157477264 and parameters: {'alpha': 8.629548674306009e-08, 'l1_ratio': 1.0302409992526122e-06}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:49,068][0m Trial 150 finished with value: 0.9589040235790647 and parameters: {'alpha': 3.2095317341153773e-07, 'l1_ratio': 3.1102492395227473e-07}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:49,577][0m Trial 151 finished with value: 0.9588937708666602 and parameters: {'alpha': 1.9419706699441e-06, 'l1_ratio': 5.233435614349159e-07}. Best is trial 112 with value: 0.9653795099535228.[0m
[32m[I 2020-12-27 03:47:49,992][0m Trial 152 finished with value: 0.9603120479663811 and param

[32m[I 2020-12-27 03:48:05,318][0m Trial 185 finished with value: 0.9607200175570113 and parameters: {'alpha': 1.205368192884539e-06, 'l1_ratio': 1.404941505625587e-08}. Best is trial 161 with value: 0.9657451915245068.[0m
[32m[I 2020-12-27 03:48:05,716][0m Trial 186 finished with value: 0.9583882670198172 and parameters: {'alpha': 1.7166195713485288e-06, 'l1_ratio': 8.676175201552179e-08}. Best is trial 161 with value: 0.9657451915245068.[0m
[32m[I 2020-12-27 03:48:06,158][0m Trial 187 finished with value: 0.9570262562055761 and parameters: {'alpha': 2.7148189572559643e-06, 'l1_ratio': 2.5731608357470895e-07}. Best is trial 161 with value: 0.9657451915245068.[0m
[32m[I 2020-12-27 03:48:06,589][0m Trial 188 finished with value: 0.9636239131601201 and parameters: {'alpha': 1.072068171635266e-06, 'l1_ratio': 2.4388818028213604e-08}. Best is trial 161 with value: 0.9657451915245068.[0m
[32m[I 2020-12-27 03:48:07,101][0m Trial 189 finished with value: 0.9593664115675791 and p

[32m[I 2020-12-27 03:48:23,548][0m Trial 222 finished with value: 0.9554287716475566 and parameters: {'alpha': 1.8361990856105154e-06, 'l1_ratio': 1.582640671241746e-08}. Best is trial 206 with value: 0.9658632450008527.[0m
[32m[I 2020-12-27 03:48:24,009][0m Trial 223 finished with value: 0.9571351640243838 and parameters: {'alpha': 8.913724409664681e-07, 'l1_ratio': 3.4840570237194606e-08}. Best is trial 206 with value: 0.9658632450008527.[0m
[32m[I 2020-12-27 03:48:24,558][0m Trial 224 finished with value: 0.9581743977731156 and parameters: {'alpha': 1.6335199972827912e-06, 'l1_ratio': 7.452708731882525e-08}. Best is trial 206 with value: 0.9658632450008527.[0m
[32m[I 2020-12-27 03:48:25,138][0m Trial 225 finished with value: 0.9574649694321299 and parameters: {'alpha': 6.966230275323811e-07, 'l1_ratio': 5.312091109469128e-08}. Best is trial 206 with value: 0.9658632450008527.[0m
[32m[I 2020-12-27 03:48:25,673][0m Trial 226 finished with value: 0.9552385970025775 and pa

[32m[I 2020-12-27 03:48:43,765][0m Trial 259 finished with value: 0.9601609386184828 and parameters: {'alpha': 1.576876684235112e-06, 'l1_ratio': 2.9092205355860393e-07}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:48:44,227][0m Trial 260 finished with value: 0.9572137782424862 and parameters: {'alpha': 7.448745845327644e-07, 'l1_ratio': 1.6354634367270256e-08}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:48:44,687][0m Trial 261 finished with value: 0.954269479068119 and parameters: {'alpha': 5.180486755764038e-07, 'l1_ratio': 0.0010992506284771418}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:48:45,130][0m Trial 262 finished with value: 0.9581819366592457 and parameters: {'alpha': 1.208520692185047e-06, 'l1_ratio': 4.134976969670472e-07}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:48:45,576][0m Trial 263 finished with value: 0.9591802747298797 and para

[32m[I 2020-12-27 03:49:00,883][0m Trial 296 finished with value: 0.958665929936399 and parameters: {'alpha': 3.595104130522809e-06, 'l1_ratio': 0.0028150412462927556}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:49:01,313][0m Trial 297 finished with value: 0.9539054857849121 and parameters: {'alpha': 2.101838033085326e-06, 'l1_ratio': 0.001825159374129318}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:49:01,715][0m Trial 298 finished with value: 0.960699084753873 and parameters: {'alpha': 1.5783727520710288e-06, 'l1_ratio': 0.002362505388027155}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:49:02,069][0m Trial 299 finished with value: 0.955511589910097 and parameters: {'alpha': 5.741664077313661e-06, 'l1_ratio': 3.938514380906948e-08}. Best is trial 258 with value: 0.9684776104190888.[0m
[32m[I 2020-12-27 03:49:02,537][0m Trial 300 finished with value: 0.959867229565806 and parameters

[32m[I 2020-12-27 03:49:16,474][0m Trial 333 finished with value: 0.9586822908538795 and parameters: {'alpha': 5.717485552207598e-07, 'l1_ratio': 0.0011910053178792494}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:16,997][0m Trial 334 finished with value: 0.9574319269006308 and parameters: {'alpha': 7.505038866340179e-07, 'l1_ratio': 0.0004970156713597926}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:17,453][0m Trial 335 finished with value: 0.9565345857655451 and parameters: {'alpha': 5.487553341202902e-07, 'l1_ratio': 7.501693419088837e-08}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:17,870][0m Trial 336 finished with value: 0.9597707430381672 and parameters: {'alpha': 9.28158859532132e-07, 'l1_ratio': 0.0008047725049297985}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:18,264][0m Trial 337 finished with value: 0.95767307484854 and paramete

[32m[I 2020-12-27 03:49:33,957][0m Trial 370 finished with value: 0.9566620330655284 and parameters: {'alpha': 4.458674469789401e-07, 'l1_ratio': 4.07121652776738e-09}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:34,493][0m Trial 371 finished with value: 0.9628040334504147 and parameters: {'alpha': 8.15696356872492e-07, 'l1_ratio': 5.667217360633619e-09}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:34,967][0m Trial 372 finished with value: 0.9570662272863648 and parameters: {'alpha': 1.3540186587893057e-06, 'l1_ratio': 1.0805330905851606e-08}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:35,381][0m Trial 373 finished with value: 0.9562385140941956 and parameters: {'alpha': 6.255960103942196e-07, 'l1_ratio': 3.1424499685922795e-08}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:35,778][0m Trial 374 finished with value: 0.9562139323382114 and para

[32m[I 2020-12-27 03:49:50,194][0m Trial 407 finished with value: 0.9582463610237915 and parameters: {'alpha': 2.0361980194963323e-06, 'l1_ratio': 7.102580164305053e-09}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:50,614][0m Trial 408 finished with value: 0.9594180204825477 and parameters: {'alpha': 1.2912535441271999e-06, 'l1_ratio': 0.010562609388441615}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:51,082][0m Trial 409 finished with value: 0.9632188525588852 and parameters: {'alpha': 7.245176195214796e-07, 'l1_ratio': 4.7734963432577654e-05}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:51,536][0m Trial 410 finished with value: 0.9611669330054146 and parameters: {'alpha': 3.8189813749067694e-07, 'l1_ratio': 5.224923350368098e-08}. Best is trial 332 with value: 0.9714564177069506.[0m
[32m[I 2020-12-27 03:49:51,979][0m Trial 411 finished with value: 0.9585338362229449 and pa

[I 2020-12-27 03:50:05,905] Trial 444 finished with value: 0.9561288357085748 and parameters: {'alpha': 2.2528596827058924e-06, 'l1_ratio': 1.4228710475777994e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:06,317] Trial 445 finished with value: 0.9617449948539867 and parameters: {'alpha': 1.438379193185054e-06, 'l1_ratio': 0.06932812036810986}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:06,751] Trial 446 finished with value: 0.9614830976672688 and parameters: {'alpha': 7.512758045203961e-07, 'l1_ratio': 3.6555585545957505e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:07,211] Trial 447 finished with value: 0.9481248676926669 and parameters: {'alpha': 2.21615730652001e-07, 'l1_ratio': 5.59046686137136e-09}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:07,684] Trial 448 finished with value: 0.9600657557750645 and parame

[I 2020-12-27 03:50:22,330] Trial 481 finished with value: 0.9629419779401426 and parameters: {'alpha': 6.456424723299003e-07, 'l1_ratio': 2.278429320266155e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:22,735] Trial 482 finished with value: 0.9586432826780613 and parameters: {'alpha': 1.1970780132348796e-06, 'l1_ratio': 5.365495414766894e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:23,121] Trial 483 finished with value: 0.9563475434099542 and parameters: {'alpha': 3.398444860865301e-06, 'l1_ratio': 0.11520397902071976}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:23,523] Trial 484 finished with value: 0.9649904007740481 and parameters: {'alpha': 1.5579985070148925e-06, 'l1_ratio': 0.0023926053616261465}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:23,904] Trial 485 finished with value: 0.9565254326110365 and param

[I 2020-12-27 03:50:38,945] Trial 518 finished with value: 0.9544592302457213 and parameters: {'alpha': 4.46284033524597e-07, 'l1_ratio': 4.0683522713222436e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:39,464] Trial 519 finished with value: 0.9586241754512097 and parameters: {'alpha': 2.641915382799293e-07, 'l1_ratio': 0.006962839083935995}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:39,883] Trial 520 finished with value: 0.958253324646498 and parameters: {'alpha': 3.692345330918693e-07, 'l1_ratio': 0.019329363050046115}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:40,394] Trial 521 finished with value: 0.9635718839591509 and parameters: {'alpha': 5.59267937780064e-07, 'l1_ratio': 7.490073101644888e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:40,957] Trial 522 finished with value: 0.9605163698714164 and parameter

[I 2020-12-27 03:50:56,540] Trial 555 finished with value: 0.954995456078183 and parameters: {'alpha': 2.5756196500747704e-06, 'l1_ratio': 1.1387309196176373e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:56,960] Trial 556 finished with value: 0.9613906101102899 and parameters: {'alpha': 1.1832790206338446e-06, 'l1_ratio': 1.9360787357437734e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:57,435] Trial 557 finished with value: 0.9622396071153039 and parameters: {'alpha': 4.6060838202721856e-06, 'l1_ratio': 4.7168208593314104e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:57,945] Trial 558 finished with value: 0.9532021442775009 and parameters: {'alpha': 1.7802067497868722e-06, 'l1_ratio': 9.277695301563987e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:50:58,407] Trial 559 finished with value: 0.9528069241060518 and

[I 2020-12-27 03:51:14,455] Trial 592 finished with value: 0.9605818084449379 and parameters: {'alpha': 1.460620138545715e-06, 'l1_ratio': 7.239539885853753e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:15,070] Trial 593 finished with value: 0.9561039909916589 and parameters: {'alpha': 9.516555870789981e-07, 'l1_ratio': 1.3824924876742152e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:15,827] Trial 594 finished with value: 0.9549168065262644 and parameters: {'alpha': 5.261893672403603e-07, 'l1_ratio': 0.005032711853118277}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:16,624] Trial 595 finished with value: 0.9574375049954705 and parameters: {'alpha': 2.4692318533900297e-07, 'l1_ratio': 3.314244685438128e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:17,209] Trial 596 finished with value: 0.9587014757961965 and para

[I 2020-12-27 03:51:35,634] Trial 629 finished with value: 0.9619382652782529 and parameters: {'alpha': 4.099699564773921e-07, 'l1_ratio': 0.002177700122972292}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:36,097] Trial 630 finished with value: 0.9603162374477551 and parameters: {'alpha': 9.650754489513867e-07, 'l1_ratio': 9.04466783655712e-09}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:36,565] Trial 631 finished with value: 0.9604801829111689 and parameters: {'alpha': 2.1664047023591462e-06, 'l1_ratio': 2.0078567745355068e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:36,972] Trial 632 finished with value: 0.9596529217820683 and parameters: {'alpha': 6.485747473208413e-07, 'l1_ratio': 0.003792970043554905}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:37,452] Trial 633 finished with value: 0.9571295155176315 and parame

[I 2020-12-27 03:51:53,616] Trial 666 finished with value: 0.9550773433934658 and parameters: {'alpha': 1.8178551314728642e-06, 'l1_ratio': 2.5707121801445888e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:54,232] Trial 667 finished with value: 0.960993546159025 and parameters: {'alpha': 1.1046996552863097e-06, 'l1_ratio': 1.093322661724056e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:54,930] Trial 668 finished with value: 0.955538552752857 and parameters: {'alpha': 3.854348862810837e-07, 'l1_ratio': 5.454184351132483e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:55,367] Trial 669 finished with value: 0.9574172462400503 and parameters: {'alpha': 2.643323411766192e-06, 'l1_ratio': 2.463870315528602e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:51:55,869] Trial 670 finished with value: 0.9604801829111689 and para

[I 2020-12-27 03:52:11,692] Trial 703 finished with value: 0.9622639121493466 and parameters: {'alpha': 4.255523728862125e-07, 'l1_ratio': 3.2757257958411567e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:12,082] Trial 704 finished with value: 0.9587493432209406 and parameters: {'alpha': 2.379668867488198e-06, 'l1_ratio': 7.327364718953162e-09}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:12,508] Trial 705 finished with value: 0.957009846319828 and parameters: {'alpha': 6.438834078463873e-07, 'l1_ratio': 0.03006014764623839}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:12,972] Trial 706 finished with value: 0.9566862964033488 and parameters: {'alpha': 1.1570228497307538e-06, 'l1_ratio': 0.009059149387814273}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:13,376] Trial 707 finished with value: 0.96006446226767 and parameter

[I 2020-12-27 03:52:30,236] Trial 740 finished with value: 0.9440222188204179 and parameters: {'alpha': 1.402656719293848e-07, 'l1_ratio': 3.0647697632168915e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:30,688] Trial 741 finished with value: 0.9536217069314807 and parameters: {'alpha': 5.206220650192598e-07, 'l1_ratio': 1.2290852615763315e-05}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:31,131] Trial 742 finished with value: 0.9510484280758938 and parameters: {'alpha': 3.20196581051973e-07, 'l1_ratio': 0.0003670750241064495}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:31,567] Trial 743 finished with value: 0.9645499153317465 and parameters: {'alpha': 5.589981788698317e-07, 'l1_ratio': 1.532293588082099e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:32,030] Trial 744 finished with value: 0.9631878576060733 and para

[I 2020-12-27 03:52:47,979] Trial 777 finished with value: 0.9613972489575819 and parameters: {'alpha': 1.0862399302772088e-06, 'l1_ratio': 7.212619565707621e-09}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:48,357] Trial 778 finished with value: 0.9577901914106418 and parameters: {'alpha': 7.190480006465787e-06, 'l1_ratio': 1.0609730370452176e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:48,934] Trial 779 finished with value: 0.9607324950275703 and parameters: {'alpha': 7.134974236574777e-07, 'l1_ratio': 1.7230115777122887e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:49,443] Trial 780 finished with value: 0.9597677147123179 and parameters: {'alpha': 1.2277271008043374e-06, 'l1_ratio': 1.9969869619072496e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:52:49,920] Trial 781 finished with value: 0.9550592752066369 and 

[I 2020-12-27 03:53:05,118] Trial 814 finished with value: 0.9549138967785556 and parameters: {'alpha': 1.404561643774008e-06, 'l1_ratio': 4.07084204833897e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:05,513] Trial 815 finished with value: 0.953394723792306 and parameters: {'alpha': 4.068857100739254e-06, 'l1_ratio': 2.0106791587852035e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:05,951] Trial 816 finished with value: 0.9551869426196813 and parameters: {'alpha': 2.213688300929763e-06, 'l1_ratio': 1.0646362910348084e-08}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:06,395] Trial 817 finished with value: 0.9581296183317681 and parameters: {'alpha': 9.003586300401129e-07, 'l1_ratio': 3.002997444485092e-05}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:06,889] Trial 818 finished with value: 0.9624813526377154 and param

[I 2020-12-27 03:53:24,696] Trial 851 finished with value: 0.9575982432206706 and parameters: {'alpha': 9.751885981852297e-07, 'l1_ratio': 0.026427720471411578}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:25,141] Trial 852 finished with value: 0.9594295561944267 and parameters: {'alpha': 4.346917064735131e-06, 'l1_ratio': 0.03739299574651865}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:25,757] Trial 853 finished with value: 0.9488416388050731 and parameters: {'alpha': 1.650892996054099e-09, 'l1_ratio': 0.11309140768655357}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:26,398] Trial 854 finished with value: 0.9601530957218719 and parameters: {'alpha': 5.176857493631945e-07, 'l1_ratio': 0.05291350945334005}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:26,926] Trial 855 finished with value: 0.9567960248958052 and parameters: 

[I 2020-12-27 03:53:43,157] Trial 888 finished with value: 0.958665929936399 and parameters: {'alpha': 2.4249922989215395e-06, 'l1_ratio': 0.04352580131196756}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:43,730] Trial 889 finished with value: 0.9649390020939982 and parameters: {'alpha': 7.845104952333479e-07, 'l1_ratio': 0.1005982319790811}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:44,346] Trial 890 finished with value: 0.9507819757770799 and parameters: {'alpha': 4.3391328319055557e-07, 'l1_ratio': 0.05381084692905791}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:45,013] Trial 891 finished with value: 0.9492572883809929 and parameters: {'alpha': 2.505611486729317e-07, 'l1_ratio': 0.064487614135821}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:53:45,545] Trial 892 finished with value: 0.9603716424096496 and parameters: {'a

[I 2020-12-27 03:54:02,928] Trial 925 finished with value: 0.9577901914106418 and parameters: {'alpha': 8.584987904000447e-06, 'l1_ratio': 0.0006167495145664192}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:03,463] Trial 926 finished with value: 0.9591004448854696 and parameters: {'alpha': 9.590875677285794e-07, 'l1_ratio': 0.0013667284084714678}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:03,961] Trial 927 finished with value: 0.958248389648014 and parameters: {'alpha': 1.910633917296141e-06, 'l1_ratio': 0.00026680471416254306}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:04,519] Trial 928 finished with value: 0.9622396071153039 and parameters: {'alpha': 4.9819866352613374e-06, 'l1_ratio': 0.002281369462365549}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:05,064] Trial 929 finished with value: 0.957781060352311 and parame

[I 2020-12-27 03:54:22,905] Trial 962 finished with value: 0.958298808939909 and parameters: {'alpha': 1.0268387669707223e-06, 'l1_ratio': 0.00719615428024436}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:23,553] Trial 963 finished with value: 0.9555826195385809 and parameters: {'alpha': 1.904004824251437e-07, 'l1_ratio': 1.305133067565599e-07}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:24,349] Trial 964 finished with value: 0.9545573339588098 and parameters: {'alpha': 4.447521838223859e-07, 'l1_ratio': 0.00010924270238148436}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:24,992] Trial 965 finished with value: 0.9557418212198682 and parameters: {'alpha': 1.5683438472606547e-06, 'l1_ratio': 0.014866810461976599}. Best is trial 332 with value: 0.9714564177069506.
[I 2020-12-27 03:54:25,555] Trial 966 finished with value: 0.958665929936399 and paramet

[I 2020-12-27 03:54:44,093] Trial 999 finished with value: 0.9609059878607291 and parameters: {'alpha': 6.749940774743254e-07, 'l1_ratio': 0.010355739078595693}. Best is trial 332 with value: 0.9714564177069506.


In [126]:
print(params)

{'alpha': 6.246369486906041e-07, 'l1_ratio': 0.0005504654723293009}
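
One caveat about these two parameters: scikit-learn documents `l1_ratio` of `SGDClassifier` as being used only when `penalty='elasticnet'`. Whether the search objective set that penalty is not visible in this part of the notebook, so the following is only a sketch on toy data (the data, the helper `fit_coef`, and the `alpha` value are assumptions for illustration) showing how the behavior can be checked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy binary data; it merely stands in for the TF-IDF features (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_coef(penalty, l1_ratio):
    """Fit with a fixed seed so that only penalty / l1_ratio can change the result."""
    clf = SGDClassifier(penalty=penalty, alpha=1e-2, l1_ratio=l1_ratio,
                        random_state=0)
    return clf.fit(X, y).coef_


# With the default L2 penalty, changing l1_ratio should leave the weights untouched.
l2_same = np.array_equal(fit_coef('l2', 0.1), fit_coef('l2', 0.9))
# With the elastic-net penalty, l1_ratio changes the regularizer and hence the weights.
en_same = np.array_equal(fit_coef('elasticnet', 0.1), fit_coef('elasticnet', 0.9))
print(l2_same, en_same)
```

If the first comparison is true on the installed version, a tuned `l1_ratio` only matters for models created with `penalty='elasticnet'`.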


In [142]:
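# With the hyperparameters fixed, train 100 times and keep the model whose
# macro F1 on the validation data is highest. No random_state is set, so SGD
# shuffles the training data differently on every fit and the score varies.
best_score = 0
best_model = None
for i in range(100):
    # Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1.
    clf = SGDClassifier(loss='log', class_weight='balanced')
    clf.set_params(**params)
    clf.fit(Xtrain_tfidf, Ytrain)
    Ydev_pred = clf.predict(Xdev_tfidf)
    score = f1_score(Ydev, Ydev_pred, average="macro")
    if score > best_score:
        best_model = clf
        best_score = score
print(best_score)
clf = best_model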

0.9701996303242613
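
Because the repeated fits above are unseeded, the selected model cannot be reproduced exactly on a rerun. A minimal sketch of one way to make the selection repeatable, assuming it is acceptable to give each run its own `random_state`; the helper `select_best_by_seed`, the toy data, and the `alpha` value are hypothetical and not part of the original notebook.

```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# The logistic loss is spelled 'log_loss' from scikit-learn 1.1 on, 'log' before that.
_major, _minor = (int(v) for v in sklearn.__version__.split('.')[:2])
LOGISTIC = 'log_loss' if (_major, _minor) >= (1, 1) else 'log'


def select_best_by_seed(X_train, y_train, X_dev, y_dev, params, n_runs=10):
    """Fit one model per seed; return (best_model, best_macro_f1, best_seed)."""
    best_model, best_score, best_seed = None, -1.0, None
    for seed in range(n_runs):
        clf = SGDClassifier(loss=LOGISTIC, class_weight='balanced',
                            random_state=seed, **params)
        clf.fit(X_train, y_train)
        score = f1_score(y_dev, clf.predict(X_dev), average='macro')
        if score > best_score:
            best_model, best_score, best_seed = clf, score, seed
    return best_model, best_score, best_seed


# Toy 3-class data standing in for Xtrain_tfidf / Xdev_tfidf (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, random_state=0)
toy_params = {'alpha': 1e-4}

best_model, best_score, best_seed = select_best_by_seed(
    X_tr, y_tr, X_dev, y_dev, toy_params, n_runs=5)
print(best_seed, best_score)
```

Recording the winning seed alongside the hyperparameters lets the same model be retrained later without keeping the pickle. Note that picking the best of many runs on the validation data makes the validation score optimistic, so the test score remains the figure to report.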


In [143]:
# Pickle the model
import pickle

with open('SGD_best_balanced.pickle', 'wb') as f:
    pickle.dump(clf, f)
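
To use the saved model later it has to be loaded back. A minimal round-trip sketch, assuming a toy model in place of the notebook's `clf` and a temporary directory in place of the working directory:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data and model standing in for the notebook's clf (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), 'SGD_best_balanced.pickle')
with open(path, 'wb') as f:
    pickle.dump(clf, f)

# Only unpickle files you trust: loading a pickle can execute arbitrary code.
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = bool((restored.predict(X) == clf.predict(X)).all())
print(same)
```

A pickled estimator is only expected to load reliably under the scikit-learn version that wrote it. To classify new articles, the fitted vectorizer that produced the TF-IDF features would have to be saved as well, since the classifier alone cannot turn raw text into features.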

In [146]:
# Report the results (model with the best validation score)
Ytrain_pred=clf.predict(Xtrain_tfidf)
Ydev_pred=clf.predict(Xdev_tfidf)
Ytest_pred=clf.predict(Xtest_tfidf)
print(params)
print(f'train: {clf.score(Xtrain_tfidf, Ytrain)}')
print(f'precision:{precision_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'recall_score:{recall_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'f1_score:{f1_score(Ytrain, Ytrain_pred, average="macro")}')
print(f'dev: {clf.score(Xdev_tfidf, Ydev)}')
print(f'precision:{precision_score(Ydev, Ydev_pred, average="macro")}')
print(f'recall_score:{recall_score(Ydev, Ydev_pred, average="macro")}')
print(f'f1_score:{f1_score(Ydev, Ydev_pred, average="macro")}')
print(f'test: {clf.score(Xtest_tfidf, Ytest)}')
print(f'test precision:{precision_score(Ytest, Ytest_pred, average="macro")}')
print(f'test recall_score:{recall_score(Ytest, Ytest_pred, average="macro")}')
print(f'test f1_score:{f1_score(Ytest, Ytest_pred, average="macro")}')

{'alpha': 6.246369486906041e-07, 'l1_ratio': 0.0005504654723293009}
train: 0.9998305659098611
precision:0.9998193315266486
recall_score:0.9998439450686641
f1_score:0.9998315099623546
dev: 0.9728629579375848
precision:0.9719748031952378
recall_score:0.9689888263514056
f1_score:0.9701996303242613
test: 0.966078697421981
test precision:0.9647403127029965
test recall_score:0.9610233444545434
test f1_score:0.9626538472838951
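
All precision, recall and F1 values above use `average="macro"`, i.e. the unweighted mean of the per-class scores. A pure-Python sketch of that computation on a tiny made-up label set (the helper `macro_prf` and the labels are illustrative assumptions) shows why the macro F1 is generally not the harmonic mean of the macro precision and macro recall:

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over classes with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = ['a', 'a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'a', 'b', 'b', 'b', 'c']
p, r, f = macro_prf(y_true, y_pred)
print(p, r, f)  # per-class F1 is 0.8, 0.8, 1.0, so macro F1 is about 0.867
```

Here macro precision and macro recall are both 8/9, whose harmonic mean is also 8/9, while the macro F1 is 2.6/3. The same effect explains why the F1 values reported above differ slightly from what the reported precision and recall alone would give.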
