This notebook tackles a binary classification task that labels each tweet as either "activity" or "no activity".

# Table of Contents
1. [Basic Setup](#section1)
    - [Libraries](#section1-1)
    - [Dataset and Output Paths](#section1-2)
    - [Parameter Settings](#section1-3)
2. [Experimental Prediction](#section2)
    - 2.1 [Loading the Data](#section2-1)
    - 2.2 [Model](#section2-2)
    - 2.3 [Splitting the Data into Training and Evaluation Sets (No Test Set)](#section2-3)
    - 2.4 [Model Training](#section2-4)
    - 2.5 [Evaluation on the Evaluation Data](#section2-5)
    - 2.6 [Finishing the wandb Run](#section2-6)
3. [Cross Validation](#section3-1)

<a id="section1"></a>
## 1. Basic Setup

In [1]:
cd ..

/home/is/akiyoshi-n/my-project


<a id='section1-1'></a>
### Libraries

In [2]:
from pathlib import Path
from datetime import datetime
from src.my_project.dataset import load_dataset_2class_classification, split_test_data, load_text_dataset, load_dataset_2class_classification_v2
from src.my_project.train_v2 import ActClassifier
from sklearn.model_selection import train_test_split
import wandb

<a id='section1-2'></a>
### Dataset and Output Paths

In [3]:
DATASET_PATH = Path('/home/is/akiyoshi-n/my-project/data')
# Today's date
timestamp = datetime.now().strftime("%Y-%m-%d")
# Output directory
output_dir = Path('/home/is/akiyoshi-n/my-project/outputs/{}'.format(timestamp))

<a id='section1-3'></a>
### Parameter Settings

In [4]:
# Maximum number of tokens
MAX_LEN = 128
# Batch size
BATCH_SIZE = 16
# Number of epochs
NUM_EPOCHS = 100
# Learning rate
LEARNING_RATE = 2e-5
# Number of folds for cross validation
NUM_FOLDS = 5
# Patience for early stopping
PATIENCE = 2
# Random seed
SEED = 2023
# Number of classes
NUM_LABELS = 2
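
To make the role of `PATIENCE` concrete, here is a minimal sketch of patience-based early stopping. How `ActClassifier` actually applies `PATIENCE` is not shown in this notebook, so the monitored quantity (validation loss here) and the counting convention are assumptions, and `early_stopping_epoch` is a hypothetical helper. The loss values are toy numbers, not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Assumption: training stops once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best loss: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Toy validation losses (illustrative only).
toy_losses = [0.70, 0.65, 0.60, 0.62, 0.61, 0.59]
stop_epoch = early_stopping_epoch(toy_losses, patience=2)
print(stop_epoch)
```

With `NUM_EPOCHS = 100` acting only as an upper bound, a rule of this kind is what decides how many epochs are actually run.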

<a id='section2'></a>
## 2. Experimental Prediction

<a id='section2-1'></a>
### 2.1 Loading the Data

In [5]:
data = load_dataset_2class_classification(f"{DATASET_PATH}/act_classification_final.xlsx")

<a id='section2-2'></a>
### 2.2 Model

In [6]:
# Tohoku University BERT v3
MODEL_NAME = 'cl-tohoku/bert-base-japanese-v3'
Classifier_model = ActClassifier(model_name=MODEL_NAME, num_labels=NUM_LABELS, seed=SEED)

<a id='section2-3'></a>
### 2.3 Splitting the Data into Training and Evaluation Sets (No Test Set)

In [7]:
# Extract the training and evaluation data as dictionaries
train_dataset = {
    'texts': [data['texts'][i] for i in range(900)],
    'labels': [data['labels'][i] for i in range(900)]
}
eval_dataset = {
    'texts': [data['texts'][i] for i in range(900, 1100)],
    'labels': [data['labels'][i] for i in range(900, 1100)]
}
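
The fixed index ranges above could also be expressed as a small reusable helper. The sketch below assumes the dataset is a dict of equal-length lists (as `data['texts']` and `data['labels']` suggest); `split_dataset` and its optional `seed` argument are hypothetical and are not part of `src.my_project`.

```python
import random


def split_dataset(data, n_train, n_eval, seed=None):
    """Split a dict of equal-length lists into train and eval dicts.

    Without a seed the first n_train items go to train and the next
    n_eval items to eval, as in the cell above. With a seed the indices
    are shuffled first so the split does not depend on file order.
    """
    n_total = len(data["texts"])
    if n_train + n_eval > n_total:
        raise ValueError("n_train + n_eval exceeds the dataset size")
    indices = list(range(n_total))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    train_idx = indices[:n_train]
    eval_idx = indices[n_train:n_train + n_eval]

    def take(idx):
        # Apply the same index selection to every column of the dict.
        return {key: [values[i] for i in idx] for key, values in data.items()}

    return take(train_idx), take(eval_idx)


# Toy data standing in for the tweet dataset.
toy = {
    "texts": [f"tweet {i}" for i in range(10)],
    "labels": [i % 2 for i in range(10)],
}
train_part, eval_part = split_dataset(toy, n_train=8, n_eval=2)
print(len(train_part["texts"]), eval_part["texts"])
```

Passing a seed is worth considering when the rows of the source file are ordered in some way, since a contiguous split would then give the two sets different label distributions.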

<a id='section2-4'></a>
### 2.4 Model Training

In [8]:
trainer = Classifier_model.train_model(train_dataset, eval_dataset, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, output_dir, project_name='ActClassification', run_name='test')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Parameter 'fn_kwargs'={'tokenizer': BertJapaneseTokenizer(name_or_path='cl-tohoku/bert-base-japanese-v3', vocab_size=32768, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False,

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7198 | 0.701722 | 0.55 | 0.415584 |
| 2 | 0.6574 | 0.659056 | 0.625 | 0.409449 |
| 3 | 0.5978 | 0.638591 | 0.69 | 0.436364 |
| 4 | 0.4943 | 0.650766 | 0.695 | 0.460177 |
| 5 | 0.3749 | 0.655998 | 0.7 | 0.558824 |


In [11]:
# Predict with the predict method
prediction = Classifier_model.predict(trainer, eval_dataset, MAX_LEN)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [12]:
prediction

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0])

In [13]:
Classifier_model.extract_prediction_activity_data(eval_dataset, prediction)

AttributeError: 'ActClassifier' object has no attribute 'extract_prediction_activity_data'

In [31]:
b

{'texts': ['[USR] 私も飲んでますがw\u3000こういうのは「ついのみ」とかいうハッシュタグつけるといいんでしたっけ．',
  'おおー！ぜひお話をうかがってみたいです！司会進行頑張って下さい！ RT [USR] おお！\u3000RT [USR] 仙台にて東北観光のキーパーソン達に集って頂き座談会開催。第一回目なのでテーマは「東北観光を磨くには？」司会進行役なので、しゃべり過ぎないように注意しよう\ue409',
  '[USR] こんにちは～！わ、みられてましたか！とりあえずこれで。ほんとはもうちょっと拡大したほうがバランスいいんですけどおそれおおいのでこれでw',
  '起きてた！バス混んでました…。今日もクソゲー頑張ってきま',
  '奥の方がエビだと思ってグロ画像だと空目\u3000「サンタエビ」話題に\u3000 [URL]',
  'とりあえずごはんごはんーーーー',
  'こっから銀座までなんぷんかね',
  'うわ！\u3000エア始まってるやんwww\u3000はい\u3000かんぱーい！\u3000ってラス１ですよw',
  '仕入れなくちゃ！ RT [USR] ショコラブルワリー(サッポロ)とショコラカクテル(アサヒ) [URL]',
  'フォローしようと思ってた人がフォローしてくれてた',
  '[USR] きゃーーーー！！こじゅーーー！！斬り捨ててーーーー！！！ｗｗｗ',
  '牛めし290円最強＼(^o^)／',
  'どちらにしても、夜まで秋葉にはいるつもりだったのでだいじょぶですよー。またーり待ってるので、気をつけて来てくださいなヽ(´ー｀)ノ',
  'あれ、しかもRetweetぼたんできとるわ。ん？何がなくなったんだ？',
  '足湯なう[USR] #touhokutrip  [URL]',
  'なんでこんな煙草ＴＬなのｗｗｗ',
  '汗かくような運動してる？体と精神のバランスだから・・・私が抱きしめてあげる～ぅRT [USR] ここんとこ、睡眠の質が悪い。疲れが取れないよ…',
  '[USR] ザッと読んでみたが，まぁこのくらいでは楽して食えてる方じゃないのかなw',
  '映画ＴＬに嫉妬＾＾＾// いいなあはやくみたいー！ [mb]',
  '綿アメかと思うくらい雪質が柔らかい！RT [USR] 綿

<a id='section2-5'></a>
### 2.5 Evaluation on the Evaluation Data

In [9]:
Classifier_model.evaluation(trainer, eval_dataset, MAX_LEN)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

{'eval_loss': 0.5638209581375122,
 'eval_accuracy': 0.69,
 'eval_f1': 0.5079365079365079,
 'eval_runtime': 0.7559,
 'eval_samples_per_second': 264.598,
 'eval_steps_per_second': 6.615,
 'epoch': 7.0}
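
As a reference for reading `eval_accuracy` and `eval_f1`, the sketch below computes both from label and prediction lists. How `ActClassifier.evaluation` computes its F1 is not shown here; this assumes binary F1 with label `1` as the positive class, and `accuracy_and_f1` is a hypothetical helper with toy inputs.

```python
def accuracy_and_f1(labels, preds, positive=1):
    """Return (accuracy, binary F1) for two equal-length label lists."""
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and equal in length")
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 1, 1, 0]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_labels, toy_preds)
print(toy_accuracy, toy_f1)
```

Accuracy counts every correct prediction equally, while F1 ignores true negatives, so the two can diverge noticeably when one class is predicted far more often than the other.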

In [9]:
# add_data
add_dataset = load_text_dataset(f"{DATASET_PATH}/add_data_sub.txt.xlsx")

In [10]:
add_dataset

{'texts': ['ディズニーランドホテルなう。',
  '舞浜地ビールなう。奈緒ちゃんの粋な計らい。',
  'ディズニーランドホテルは仕掛けがいろいろあって面白い。喫煙所が一つしかないけど…',
  '喫煙所は完全に隔離され、ちゃちじゃないがホテル全体のトンマナに悪影響を及ぼさない設計',
  'レヴィ＝ストロース氏死去､ 残念だ。ご冥福を祈ります。',
  '相反する二つの目的を同時に達成するようなルール作り。これはクリエイティブ。サッカーにおけるオフサイドのようなやつね。',
  'やべ。いい企画おもいついったー…',
  'バズマン、その調子だ。',
  'ディズニーのサービスクオリティって、アタマから安心できるよね。',
  '昨日、昔のプロフェッショナル仕事の流儀がやってて、DNAのﾅﾝﾊﾞ社長が｢仕事が人を育てる｣と頑なに言ってたけど、八割くらいそうだと思う。',
  'バズマン、今日の進捗全部メールしといてね。',
  'まぢで？RT伊藤直樹がGTを卒業し、wieden+kenedyの東京オフィス代表に就任しました。',
  'いろいろ、悩むなぁ。',
  'デスクの上の本を整理しはじめて、早2時間。。。',
  '今月は消費が激しいが、なんとか10万貯金する。',
  '最近、セミナー講師をやることが多く、とっても勉強になっている件。',
  '「会食」ってコトバはやっぱりすきじゃないね。',
  '「できること」と「できないこと」の境界線をどれだけしっているか。という点はプランナーにとって不可欠。もちろん「できる」前提で「どうすればできるか」という発想も大切なのは言うまでもないが、「境界線」を知らなければ「どうすれば・・・」という発想すら生まれないわけで。',
  '若くして出世できる会社（この言い方、すごく違和感あるけど）は、ものすごいメリットがある反面、頭ごなしに否定してくれる人間がいないので、胃の中のなんちゃらになりがち。きちんと市場対応できるようになるためには、外に開いていないと。裸の王様になっちまう。',
  'ガスガスっと、こう、上からグシャって感じでつぶしたい。',
  '「代理店連結育成プログラム」ってのを代理店連結でやっているらしいのだが、代理店連結だけでやることに大して意味はないので、本部側から色々歩み寄るべきだと思う。

In [11]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
add_data = preprocess_for_Trainer(add_dataset, tokenizer, max_len=MAX_LEN)

Map:   0%|          | 0/6887 [00:00<?, ? examples/s]

In [19]:
add_data

Dataset({
    features: ['texts', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 6887
})
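
The `input_ids` and `attention_mask` features can be illustrated with a simplified sketch of fixed-length padding and truncation. Whether `preprocess_for_Trainer` pads to `max_len` is an assumption; `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, and a real tokenizer also keeps the closing special token when truncating, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len.

    pad_id=0 matches the [PAD] id shown in the tokenizer log above.
    """
    ids = list(token_ids[:max_len])
    # Real tokens get mask 1, padding positions get mask 0.
    attention_mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# Illustrative ids for a short sequence.
short_ids, short_mask = pad_or_truncate([2, 100, 200, 3], max_len=6)
print(short_ids, short_mask)
```

The attention mask is what lets the model ignore padding positions, so texts shorter than `MAX_LEN` are not affected by the filler tokens.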

In [27]:
import numpy as np

In [15]:
prediction = trainer.predict(add_data)

In [20]:
import torch

In [21]:
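logits = torch.from_numpy(prediction.predictions)
# NOTE: sigmoid squashes each logit independently, so the two columns below are
# per-class scores that need not sum to 1. Normalized class probabilities for a
# single-label classifier would come from torch.softmax(logits, dim=-1) instead.
predictions_proba = torch.sigmoid(logits)
predictions_proba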

tensor([[0.2034, 0.5783],
        [0.1482, 0.6651],
        [0.5333, 0.4985],
        ...,
        [0.4281, 0.6999],
        [0.1768, 0.6997],
        [0.6294, 0.5123]])
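
The rows in the output above do not sum to 1 because sigmoid is applied to each logit separately. The standard-library sketch below contrasts that with softmax, which normalizes across the two classes. It assumes the model is a single-label classifier trained with cross-entropy, which this notebook does not confirm, and the logits are illustrative values.

```python
import math


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for one example: [class 0, class 1].
toy_logits = [-1.0, 0.5]
independent_scores = [sigmoid(v) for v in toy_logits]
class_probs = softmax(toy_logits)

print(independent_scores, sum(independent_scores))
print(class_probs, sum(class_probs))
```

Both functions are monotonic, so the predicted class is the same either way. The difference matters only when the values are read as probabilities, for example when thresholding on confidence.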

<a id='section2-6'></a>
### 2.6 Finishing the wandb Run

In [10]:
wandb.finish()

<a id='section3-1'></a>
## 3. Cross Validation

In [6]:
# Tohoku University BERT v3
MODEL_NAME = 'cl-tohoku/bert-base-japanese-v3'
Classifier_model = ActClassifier(model_name=MODEL_NAME, num_labels=NUM_LABELS, seed=SEED)

In [7]:
# Split into test data and the data used for training
dataset, test_data = split_test_data(data=data, test_size=0.1, SEED=SEED)
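
The split sizes that appear in the fold logs (792 training, 198 validation, 110 test) follow from plain index arithmetic, sketched below. This assumes 1,100 labelled examples in total, which is consistent with those counts. Whether `cross_validation` shuffles or stratifies the folds is not shown, so `kfold_indices` is a hypothetical, unshuffled stand-in.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, eval_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds get one additional sample each.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        eval_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, eval_idx
        start = stop


n_total = 1100           # assumed dataset size
n_test = 110             # 10% held out as test data
n_cv = n_total - n_test  # 990 examples left for cross validation

for fold, (train_idx, eval_idx) in enumerate(kfold_indices(n_cv, 5), start=1):
    print(f"Fold {fold}: train={len(train_idx)}, eval={len(eval_idx)}")
```

Each example in the 990-example pool is used for validation exactly once, while the 110 test examples stay outside every fold and are scored after each fold finishes.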

In [8]:
result = Classifier_model.cross_validation(dataset, test_data, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, NUM_FOLDS, output_dir, project_name='ActClassification_cross_validation_weight')

-----------------Fold: 1-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Parameter 'fn_kwargs'={'tokenizer': BertJapaneseTokenizer(name_or_path='cl-tohoku/bert-base-japanese-v3', vocab_size=32768, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False,

Map:   0%|          | 0/792 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7112 | 0.710597 | 0.530303 | 0.480447 |
| 2 | 0.6695 | 0.656976 | 0.656566 | 0.622222 |
| 3 | 0.5895 | 0.600233 | 0.737374 | 0.686747 |
| 4 | 0.4874 | 0.595034 | 0.742424 | 0.68323 |
| 5 | 0.3711 | 0.608887 | 0.747475 | 0.6875 |
| 6 | 0.2611 | 0.649163 | 0.717172 | 0.705263 |
| 7 | 0.156 | 0.790321 | 0.732323 | 0.653595 |


Map:   0%|          | 0/110 [00:00<?, ? examples/s]

{'eval_loss': 0.6291524171829224, 'eval_accuracy': 0.7090909090909091, 'eval_f1': 0.627906976744186, 'eval_runtime': 0.4645, 'eval_samples_per_second': 236.835, 'eval_steps_per_second': 6.459, 'epoch': 7.0}
-----------------Fold: 2-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/792 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7485 | 0.697589 | 0.525253 | 0.45977 |
| 2 | 0.6732 | 0.660961 | 0.621212 | 0.452555 |
| 3 | 0.6045 | 0.63258 | 0.661616 | 0.578616 |
| 4 | 0.4996 | 0.634274 | 0.686869 | 0.602564 |
| 5 | 0.3916 | 0.626672 | 0.676768 | 0.619048 |
| 6 | 0.2744 | 0.764754 | 0.666667 | 0.592593 |
| 7 | 0.1598 | 0.843573 | 0.671717 | 0.644809 |
| 8 | 0.1096 | 1.043405 | 0.666667 | 0.565789 |


Map:   0%|          | 0/110 [00:00<?, ? examples/s]

{'eval_loss': 0.6080381870269775, 'eval_accuracy': 0.6545454545454545, 'eval_f1': 0.6346153846153846, 'eval_runtime': 0.463, 'eval_samples_per_second': 237.59, 'eval_steps_per_second': 6.48, 'epoch': 8.0}
-----------------Fold: 3-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/792 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7524 | 0.707752 | 0.505051 | 0.542056 |
| 2 | 0.6755 | 0.686234 | 0.555556 | 0.526882 |
| 3 | 0.6119 | 0.622533 | 0.661616 | 0.659898 |
| 4 | 0.5053 | 0.603069 | 0.686869 | 0.643678 |
| 5 | 0.3805 | 0.607134 | 0.69697 | 0.655172 |
| 6 | 0.2493 | 0.783901 | 0.666667 | 0.547945 |
| 7 | 0.1446 | 0.766034 | 0.676768 | 0.676768 |


Map:   0%|          | 0/110 [00:00<?, ? examples/s]

{'eval_loss': 0.5835622549057007, 'eval_accuracy': 0.6545454545454545, 'eval_f1': 0.6274509803921569, 'eval_runtime': 0.4784, 'eval_samples_per_second': 229.954, 'eval_steps_per_second': 6.271, 'epoch': 7.0}
-----------------Fold: 4-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/792 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7572 | 0.683052 | 0.590909 | 0.552486 |
| 2 | 0.6694 | 0.64915 | 0.626263 | 0.559524 |
| 3 | 0.6048 | 0.617457 | 0.681818 | 0.631579 |
| 4 | 0.5027 | 0.605807 | 0.70202 | 0.654971 |
| 5 | 0.3914 | 0.651602 | 0.671717 | 0.670051 |
| 6 | 0.2777 | 0.683358 | 0.676768 | 0.627907 |
| 7 | 0.1768 | 0.763322 | 0.691919 | 0.666667 |


Map:   0%|          | 0/110 [00:00<?, ? examples/s]

{'eval_loss': 0.6545231342315674, 'eval_accuracy': 0.6363636363636364, 'eval_f1': 0.5348837209302325, 'eval_runtime': 0.4757, 'eval_samples_per_second': 231.235, 'eval_steps_per_second': 6.306, 'epoch': 7.0}
-----------------Fold: 5-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/792 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7593 | 0.687292 | 0.590909 | 0.547486 |
| 2 | 0.6757 | 0.639361 | 0.661616 | 0.524823 |
| 3 | 0.6174 | 0.588679 | 0.70202 | 0.609272 |
| 4 | 0.538 | 0.544966 | 0.722222 | 0.674556 |
| 5 | 0.4376 | 0.526422 | 0.722222 | 0.708995 |
| 6 | 0.3335 | 0.533907 | 0.762626 | 0.711656 |
| 7 | 0.2455 | 0.537196 | 0.777778 | 0.741176 |
| 8 | 0.1731 | 0.545736 | 0.747475 | 0.736842 |


Map:   0%|          | 0/110 [00:00<?, ? examples/s]

{'eval_loss': 0.6337300539016724, 'eval_accuracy': 0.6545454545454545, 'eval_f1': 0.6041666666666666, 'eval_runtime': 0.4631, 'eval_samples_per_second': 237.555, 'eval_steps_per_second': 6.479, 'epoch': 8.0}


In [9]:
result

[{'eval_loss': 0.6291524171829224,
  'eval_accuracy': 0.7090909090909091,
  'eval_f1': 0.627906976744186,
  'eval_runtime': 0.4645,
  'eval_samples_per_second': 236.835,
  'eval_steps_per_second': 6.459,
  'epoch': 7.0},
 {'eval_loss': 0.6080381870269775,
  'eval_accuracy': 0.6545454545454545,
  'eval_f1': 0.6346153846153846,
  'eval_runtime': 0.463,
  'eval_samples_per_second': 237.59,
  'eval_steps_per_second': 6.48,
  'epoch': 8.0},
 {'eval_loss': 0.5835622549057007,
  'eval_accuracy': 0.6545454545454545,
  'eval_f1': 0.6274509803921569,
  'eval_runtime': 0.4784,
  'eval_samples_per_second': 229.954,
  'eval_steps_per_second': 6.271,
  'epoch': 7.0},
 {'eval_loss': 0.6545231342315674,
  'eval_accuracy': 0.6363636363636364,
  'eval_f1': 0.5348837209302325,
  'eval_runtime': 0.4757,
  'eval_samples_per_second': 231.235,
  'eval_steps_per_second': 6.306,
  'epoch': 7.0},
 {'eval_loss': 0.6337300539016724,
  'eval_accuracy': 0.6545454545454545,
  'eval_f1': 0.6041666666666666,
  'eval_r

In [10]:
average_accuracy = sum(d['eval_accuracy'] for d in result)/len(result)
average_f1 = sum(d['eval_f1'] for d in result)/len(result)
print("Average accuracy:", average_accuracy)
print("Average f1:", average_f1)

Average accuracy: 0.6618181818181819
Average f1: 0.6058047458697253
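
With only five folds, the spread between folds is as informative as the mean. The sketch below recomputes the averages with the standard library and adds the sample standard deviation. The per-fold values are copied from the test-set results printed above, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold test-set metrics, copied from the cross-validation output above.
fold_accuracy = [
    0.7090909090909091,
    0.6545454545454545,
    0.6545454545454545,
    0.6363636363636364,
    0.6545454545454545,
]
fold_f1 = [
    0.627906976744186,
    0.6346153846153846,
    0.6274509803921569,
    0.5348837209302325,
    0.6041666666666666,
]

for name, values in (("accuracy", fold_accuracy), ("f1", fold_f1)):
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    print(f"{name}: mean={mean(values):.4f}, sample std={stdev(values):.4f}")
```

Because all five models are scored on the same 110 test examples, the spread reflects differences between the trained models rather than differences between evaluation sets.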
