This notebook tackles a two-class classification task that labels tweet data as "activity" or "no activity".<br>

# Table of Contents
1. [Basic setup](#section1)
    - [Libraries](#section1-1)
    - [Dataset and output paths](#section1-2)
    - [Parameter settings](#section1-3)
2. [Experimental prediction](#section2)
    - 2.1 [Loading the data](#section2-1)
    - 2.2 [Model](#section2-2)
    - 2.3 [Splitting the experiment data into training and evaluation sets (no test set)](#section2-3)
    - 2.4 [Model training](#section2-4)
    - 2.5 [Evaluation on the evaluation data](#section2-5)
    - 2.6 [Finishing wandb](#section2-6)
3. [Cross validation](#section3-1)

<a id="section1"></a>
## 1. Basic setup

In [1]:
cd ..

/home/is/akiyoshi-n/my-project


<a id='section1-1'></a>
### Libraries

In [2]:
import os
# Select the GPU to use. This environment variable must be set before PyTorch is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from pathlib import Path
from datetime import datetime
from src.my_project.dataset import (
    load_dataset_2class_classification,
    split_test_data,
    load_text_dataset,
    load_dataset_2class_classification_v2,
    split_test_data_stratify,
)
from src.my_project.train_v2 import ActClassifier
from sklearn.model_selection import train_test_split
import wandb
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

<a id='section1-2'></a>
### Dataset and output paths

In [3]:
DATASET_PATH = Path('/home/is/akiyoshi-n/my-project/data')
# Today's date
timestamp = datetime.now().strftime("%Y-%m-%d")
# Output directory
output_dir = Path('/home/is/akiyoshi-n/my-project/outputs/{}'.format(timestamp))
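
The dated output directory can also be built with `pathlib`'s `/` operator and created up front so that later writes do not fail. This is a minimal sketch: the helper name `make_output_dir` and the temporary base directory are assumptions for illustration and are not part of the project code.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def make_output_dir(base_dir, now=None):
    """Return base_dir/<YYYY-MM-DD>, creating it if it does not exist."""
    now = now or datetime.now()
    out = Path(base_dir) / now.strftime("%Y-%m-%d")
    # parents=True creates missing parent directories; exist_ok=True makes reruns safe.
    out.mkdir(parents=True, exist_ok=True)
    return out


# Hypothetical usage with a throwaway base directory and a fixed date.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = make_output_dir(tmp, now=datetime(2023, 11, 3))
    demo_name = demo_dir.name
    demo_exists = demo_dir.is_dir()

print(demo_name, demo_exists)  # 2023-11-03 True
```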

<a id='section1-3'></a>
### Parameter settings

In [4]:
# Maximum number of tokens
MAX_LEN = 128
# Batch size
BATCH_SIZE = 16
# Number of epochs
NUM_EPOCHS = 100
# Learning rate
LEARNING_RATE = 2e-5
# Number of folds for cross validation
NUM_FOLDS = 5
# Patience for early stopping
PATIENCE = 2
# Random seed
SEED = 2023
# Number of classes
NUM_LABELS = 2

<a id='section2'></a>
## 2. Experimental prediction

<a id='section2-1'></a>
### 2.1 Loading the data

In [5]:
# Load the data as a dict
data = load_dataset_2class_classification(f"{DATASET_PATH}/act_classification_final.xlsx")

<a id='section2-2'></a>
### 2.2 Model

In [6]:
# Tohoku University BERT v3
MODEL_NAME = 'cl-tohoku/bert-base-japanese-v3'
Classifier_model = ActClassifier(model_name=MODEL_NAME, num_labels=NUM_LABELS, seed=SEED)

<a id='section2-3'></a>
### 2.3 Splitting the experiment data into training and evaluation sets (no test set)

In [7]:
# Extract the training and evaluation data as dicts:
# the first 900 rows for training, the next 200 for evaluation
train_dataset = {
    'texts': data['texts'][:900],
    'labels': data['labels'][:900]
}
eval_dataset = {
    'texts': data['texts'][900:1100],
    'labels': data['labels'][900:1100]
}
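
A fixed index split like this assumes the file holds at least 1100 rows and that the rows are not sorted by label. The sketch below wraps the same idea in a helper that checks those sizes first; `split_by_index` and the toy data are hypothetical names used only for illustration.

```python
def split_by_index(data, n_train, n_eval):
    """Slice a dict of parallel lists into train and eval parts, in row order."""
    n_rows = len(data["texts"])
    if len(data["labels"]) != n_rows:
        raise ValueError("texts and labels must have the same length")
    if n_train + n_eval > n_rows:
        raise ValueError(f"need {n_train + n_eval} rows, got {n_rows}")
    train = {key: values[:n_train] for key, values in data.items()}
    evaluation = {
        key: values[n_train:n_train + n_eval] for key, values in data.items()
    }
    return train, evaluation


# Toy data standing in for the real dataset (assumption).
toy = {
    "texts": [f"tweet {i}" for i in range(10)],
    "labels": [i % 2 for i in range(10)],
}
toy_train, toy_eval = split_by_index(toy, n_train=7, n_eval=3)
print(len(toy_train["texts"]), toy_eval["labels"])  # 7 [1, 0, 1]
```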

<a id='section2-4'></a>
### 2.4 Model training

In [8]:
trainer = Classifier_model.train_model(train_dataset, eval_dataset, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, output_dir, project_name='ActClassification', run_name='test')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.71,0.691379,0.505,0.497462
2,0.6444,0.623425,0.66,0.476923
3,0.5345,0.617915,0.655,0.524138
4,0.4099,0.628376,0.67,0.576923
5,0.2297,0.800343,0.69,0.575342
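
Training stopped after epoch 5 even though `NUM_EPOCHS = 100`, which is consistent with early stopping at `PATIENCE = 2`. The sketch below replays that rule on the validation losses from the table above. It assumes the monitored quantity is validation loss; the notebook does not show how `train_model` implements early stopping, so `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Validation losses copied from the table above.
losses = [0.691379, 0.623425, 0.617915, 0.628376, 0.800343]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

Under this rule the best checkpoint is epoch 3, and epochs 4 and 5 are the two non-improving epochs that trigger the stop.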


In [7]:
import pandas as pd
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
import numpy as np
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
column_names = ['Num_ID','Name','text','time']
df = pd.read_csv(f"{DATASET_PATH}/urso_users.txt", sep='\t', names=column_names)
# # Get the text data
# texts = df['text'].values.tolist()
# # Wrap the texts in a dict-style dataset
# test_dataset = {
#     'texts': texts
# }
# # Convert the dict into a Dataset object
# test_dataset = preprocess_for_Trainer(test_dataset, tokenizer, max_len=MAX_LEN)

In [10]:
df

Unnamed: 0,Num_ID,Name,text,time
0,5365228407,468251793,同意！ RT @yniimi: @468251793 自分は吸わないですが愛煙家の中で他人の...,2009-11-03 00:25:04
1,5365512833,468251793,みなさんおやすみなさい～。また明日。,2009-11-03 00:37:35
2,5375233193,468251793,おはようございます。今日は休みなので少し寝坊。,2009-11-03 07:45:41
3,5375293031,468251793,"RT @TweetMedmail: Ukraine西部Ternopil, Lviv, Iva...",2009-11-03 07:48:12
4,5375348952,468251793,@Hurdy @mizu34 @mie_sama おはようございます。寒い！,2009-11-03 07:50:33
...,...,...,...,...
510264,11034722800,you1,仕事は俺と同じ夢を見てくれているんだろうか…。 RT @Consaneko: 仕事とともに泣...,2010-03-25 22:50:04
510265,11034768259,you1,@hitomi5310 いや絶対この発言するように仕向けたでしょ姐さんw,2010-03-25 22:51:05
510266,11035587060,you1,@hitomi5310 姐さんやっぱSだわ,2010-03-25 23:09:28
510267,11035699513,you1,かかか、かえるー,2010-03-25 23:12:00


In [16]:
test_dataset

Dataset({
    features: ['texts', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 510269
})

In [18]:
import pickle
# Save the Dataset object with pickle
with open(f"{DATASET_PATH}/urso_Dataset_type.pkl", 'wb') as f:
    pickle.dump(test_dataset, f)

In [19]:
with open(f'{DATASET_PATH}/urso_Dataset_type.pkl', 'rb') as f:
    a = pickle.load(f)

In [22]:
a['texts']

['同意！ RT @yniimi: @468251793 自分は吸わないですが愛煙家の中で他人の煙は嫌いって自己中心的な方も居るそうでイラッとします。',
 'みなさんおやすみなさい～。また明日。',
 'おはようございます。今日は休みなので少し寝坊。',
 'RT @TweetMedmail: Ukraine西部Ternopil, Lviv, Ivano-Frankivsk, Chernivtsiなどで新型インフルエンザによる入院、死者急増(WHO確認)。子ども1100人以上を含む2300人超が入院、ICU利用は子ども32人を含む13',
 '@Hurdy @mizu34 @mie_sama おはようございます。寒い！',
 '伊集院光さんがtwitter始めたようです。 RT @morishi_ss: 伊集院光だ！本人？とりあえずフォロー @HikaruIjuin',
 '雑誌なんかは全部電子書籍でもいいですね。 RT @jungalian: 免許更新の際に自動車教習所でもらうハンドブックなど講習の際に一度しか開かないので、そういうものが電子化されるといいですねRT @maruyama3: @mao3mao3 例えば、日本において現在発行されている',
 '予想通り風邪発症であります。これ以上悪化しないよう、今日は休養に専念します。',
 'ありがとうございます。本当は悪寒、関節の違和感があったあたりで休めば良かったのですが、仕事で無理をしてしまいました。RT @yniimi: @468251793 お大事に',
 '上が30！よくぞご無事で・・・。RT @suizou: 嫌な感じがしたので、部屋持ちさんに報告。血圧と脈を取ってもらう。上が130を超えるくらい、普段は90。前にショックを起こした時はほとんど何も覚えていない（当たり前）。その時は上が30を切っていたから、ショックではないのかも。',
 '@maruyama3 元々デジタルデータなのですから、電子書籍にした方が出版社も利点が多いはず。',
 '@pluself アロマとかハーブは勤めてる所でも使っているので興味ありです。よろしくお願いします。',
 'RT @47news: 速報:新型インフルに感染した６０代女性死亡と名古屋市が発表。国内の死者は疑い例も含め４７人目。 http://bit.ly/1

In [17]:
# Predicted probabilities for activity / no activity
prediction = trainer.predict(test_dataset)
prediction

KeyboardInterrupt: 

In [9]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
import numpy as np
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
eval_dataset_use = preprocess_for_Trainer(eval_dataset, tokenizer, max_len=MAX_LEN)
predictions = trainer.predict(eval_dataset_use)
predictions

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

PredictionOutput(predictions=array([[-5.98845840e-01,  5.86128831e-01],
       [ 1.04071844e+00, -9.57929552e-01],
       [ 3.40541527e-02, -2.72699505e-01],
       [ 8.02536905e-01, -1.37396419e+00],
       [ 5.76453581e-02, -7.32637048e-01],
       [ 8.35022688e-01, -1.42254376e+00],
       [-5.05978823e-01, -6.35256097e-02],
       [-6.91351295e-01, -2.39208695e-02],
       [ 6.21732548e-02, -9.45274115e-01],
       [ 5.61031401e-01, -1.04228950e+00],
       [-2.72949219e-01,  4.43204612e-01],
       [ 4.24563706e-01, -7.72331297e-01],
       [-1.26413852e-01,  3.86720806e-01],
       [-9.61326063e-01,  1.25460362e+00],
       [-7.29861975e-01,  9.27565768e-02],
       [ 7.76768267e-01, -1.08607018e+00],
       [ 1.24802709e+00, -1.50126135e+00],
       [ 8.02649975e-01, -1.10072803e+00],
       [ 6.91271424e-01, -1.41748726e+00],
       [ 6.49327159e-01, -6.78548455e-01],
       [ 3.36950779e-01, -1.76611632e-01],
       [-7.22840667e-01,  5.72938286e-02],
       [ 2.21539289e-03, 

In [14]:
len(predictions)

200

In [17]:
predictions

array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1])

In [21]:
prediction_label = [[0 for i in range(4)] for j in range(10)]
prediction_label[0][-1]= 100
prediction_label

[[0, 0, 0, 100],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0]]


In [13]:
predictions = np.argmax(predictions.predictions, axis=-1) # Use the label with the largest value as the prediction
predictions

AttributeError: 'numpy.ndarray' object has no attribute 'predictions'
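
The error message shows that `predictions` was already a NumPy array when this cell ran, presumably because an earlier run of the same line had replaced the `PredictionOutput` with its argmax. Keeping the raw logits and the derived labels under separate names avoids this. The sketch below uses plain lists in place of `PredictionOutput.predictions`; the names `logits` and `predicted_labels` are hypothetical.

```python
def argmax_labels(logits):
    """Return the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# First three logit rows from the PredictionOutput shown earlier (rounded).
logits = [
    [-0.5988, 0.5861],
    [1.0407, -0.9579],
    [0.0341, -0.2727],
]
# A separate name keeps the raw logits available for later cells.
predicted_labels = argmax_labels(logits)
print(predicted_labels)  # [1, 0, 0]
```

With NumPy the same result comes from `np.argmax(logits, axis=-1)`, assigned to a new variable rather than back to `predictions`.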

In [11]:
trainer2 = Classifier_model.train_model(train_dataset, eval_dataset, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, output_dir, project_name='ActClassification', run_name='test')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6997,0.678303,0.545,0.461538
2,0.6382,0.621693,0.68,0.542857
3,0.5366,0.613573,0.645,0.517007
4,0.3946,0.673403,0.64,0.526316
5,0.2117,0.804602,0.7,0.583333


In [14]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
import numpy as np
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
eval_dataset_use = preprocess_for_Trainer(eval_dataset, tokenizer, max_len=MAX_LEN)
predictions = trainer.predict(eval_dataset_use)
predictions = np.argmax(predictions.predictions, axis=-1) # Use the label with the largest value as the prediction

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [15]:
predictions

array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1])

In [16]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
import numpy as np
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
eval_dataset_use = preprocess_for_Trainer(eval_dataset, tokenizer, max_len=MAX_LEN)
predictions = trainer2.predict(eval_dataset_use)
predictions = np.argmax(predictions.predictions, axis=-1) # Use the label with the largest value as the prediction

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [17]:
predictions

array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1])

In [10]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
import numpy as np
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
eval_dataset_use = preprocess_for_Trainer(eval_dataset, tokenizer, max_len=MAX_LEN)
predictions = trainer.predict(eval_dataset_use)
predictions = np.argmax(predictions.predictions, axis=-1) # Use the label with the largest value as the prediction

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [12]:
# Predict with the predict method
prediction = Classifier_model.predict(trainer, eval_dataset, MAX_LEN)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [11]:
prediction

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1])

<a id='section2-5'></a>
### 2.5 Evaluation on the evaluation data

In [13]:
# Compute accuracy and F1 between the trainer's predictions and eval_dataset['labels']
from sklearn.metrics import accuracy_score, f1_score
accuracy = accuracy_score(eval_dataset['labels'], predictions)
f1 = f1_score(eval_dataset['labels'], predictions)
print(f'Accuracy: {accuracy:.4f}')
print(f'F1: {f1:.4f}')

Accuracy: 0.6800
F1: 0.5294


In [14]:
Classifier_model.evaluation(trainer, eval_dataset, MAX_LEN)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

{'eval_loss': 0.5914315581321716,
 'eval_accuracy': 0.68,
 'eval_f1': 0.5294117647058824,
 'eval_runtime': 0.6022,
 'eval_samples_per_second': 332.127,
 'eval_steps_per_second': 21.588,
 'epoch': 5.0}

In [None]:
# add_data
add_dataset = load_text_dataset(f"{DATASET_PATH}/add_data_sub.txt.xlsx")

In [None]:
add_dataset

{'texts': ['ディズニーランドホテルなう。',
  '舞浜地ビールなう。奈緒ちゃんの粋な計らい。',
  'ディズニーランドホテルは仕掛けがいろいろあって面白い。喫煙所が一つしかないけど…',
  '喫煙所は完全に隔離され、ちゃちじゃないがホテル全体のトンマナに悪影響を及ぼさない設計',
  'レヴィ＝ストロース氏死去､ 残念だ。ご冥福を祈ります。',
  '相反する二つの目的を同時に達成するようなルール作り。これはクリエイティブ。サッカーにおけるオフサイドのようなやつね。',
  'やべ。いい企画おもいついったー…',
  'バズマン、その調子だ。',
  'ディズニーのサービスクオリティって、アタマから安心できるよね。',
  '昨日、昔のプロフェッショナル仕事の流儀がやってて、DNAのﾅﾝﾊﾞ社長が｢仕事が人を育てる｣と頑なに言ってたけど、八割くらいそうだと思う。',
  'バズマン、今日の進捗全部メールしといてね。',
  'まぢで？RT伊藤直樹がGTを卒業し、wieden+kenedyの東京オフィス代表に就任しました。',
  'いろいろ、悩むなぁ。',
  'デスクの上の本を整理しはじめて、早2時間。。。',
  '今月は消費が激しいが、なんとか10万貯金する。',
  '最近、セミナー講師をやることが多く、とっても勉強になっている件。',
  '「会食」ってコトバはやっぱりすきじゃないね。',
  '「できること」と「できないこと」の境界線をどれだけしっているか。という点はプランナーにとって不可欠。もちろん「できる」前提で「どうすればできるか」という発想も大切なのは言うまでもないが、「境界線」を知らなければ「どうすれば・・・」という発想すら生まれないわけで。',
  '若くして出世できる会社（この言い方、すごく違和感あるけど）は、ものすごいメリットがある反面、頭ごなしに否定してくれる人間がいないので、胃の中のなんちゃらになりがち。きちんと市場対応できるようになるためには、外に開いていないと。裸の王様になっちまう。',
  'ガスガスっと、こう、上からグシャって感じでつぶしたい。',
  '「代理店連結育成プログラム」ってのを代理店連結でやっているらしいのだが、代理店連結だけでやることに大して意味はないので、本部側から色々歩み寄るべきだと思う。

In [None]:
from transformers import AutoTokenizer
from src.my_project.dataset import preprocess_for_Trainer
# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Preprocess the dataset
add_data = preprocess_for_Trainer(add_dataset, tokenizer, max_len=MAX_LEN)

Map:   0%|          | 0/6887 [00:00<?, ? examples/s]

In [None]:
add_data

Dataset({
    features: ['texts', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 6887
})

In [None]:
import numpy as np

In [None]:
prediction = trainer.predict(add_data)

In [None]:
import torch

In [None]:
logits = torch.from_numpy(prediction.predictions)
predictions_proba = torch.sigmoid(logits)
predictions_proba

tensor([[0.2034, 0.5783],
        [0.1482, 0.6651],
        [0.5333, 0.4985],
        ...,
        [0.4281, 0.6999],
        [0.1768, 0.6997],
        [0.6294, 0.5123]])
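
The rows above do not sum to 1 because `torch.sigmoid` squashes each logit independently. If the classifier head is trained as a single-label, two-class model with cross-entropy loss (the usual setup for `num_labels=2`, but an assumption here since `train_v2` is not shown), class probabilities come from a softmax over the class axis instead. The sketch below contrasts the two in pure Python on the first two logit rows of the evaluation output shown earlier (rounded); the helper names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [[-0.5988, 0.5861], [1.0407, -0.9579]]
softmax_probs = [softmax(row) for row in logits]
sigmoid_scores = [[sigmoid(x) for x in row] for row in logits]

for probs, scores in zip(softmax_probs, sigmoid_scores):
    # Softmax rows sum to 1; sigmoid rows generally do not.
    print(f"softmax sum={sum(probs):.3f}  sigmoid sum={sum(scores):.3f}")
```

In PyTorch the equivalent call is `torch.softmax(logits, dim=-1)`. The predicted label is the same either way, since both functions preserve the ordering of the logits within a row.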

<a id='section2-6'></a>
### 2.6 Finishing wandb

In [None]:
wandb.finish()

<a id='section3-1'></a>
## 3. Cross Validation

In [5]:
# Load the data as a dict
data = load_dataset_2class_classification(f"{DATASET_PATH}/act_classification_final_ChatGPT4.xlsx")

In [6]:
# Tohoku University BERT v3
MODEL_NAME = 'cl-tohoku/bert-base-japanese-v3'
Classifier_model = ActClassifier(model_name=MODEL_NAME, num_labels=2, seed=SEED)

In [7]:
# Split into test data and the data used for training
dataset, test_data, a, b = split_test_data_stratify(data=data, test_size=0.1, SEED=SEED)
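
A stratified split keeps the class ratio the same in the held-out part as in the rest. The internals of `split_test_data_stratify` are not shown in this notebook, so the following is only a sketch of the idea in pure Python; `stratified_split` and the toy data are hypothetical, and unlike the project function it returns just two dicts.

```python
import random
from collections import defaultdict


def stratified_split(data, test_size, seed):
    """Split a dict of parallel lists, keeping label proportions in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(data["labels"]):
        by_label[label].append(idx)

    rng = random.Random(seed)
    test_idx = []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(data["labels"])) if i not in test_set]

    def take(indices):
        return {key: [values[i] for i in indices] for key, values in data.items()}

    return take(train_idx), take(sorted(test_idx))


# Toy data with a 40/60 class ratio (assumption).
toy = {
    "texts": [f"tweet {i}" for i in range(100)],
    "labels": [1] * 40 + [0] * 60,
}
toy_train, toy_test = stratified_split(toy, test_size=0.1, seed=2023)
print(len(toy_train["labels"]), len(toy_test["labels"]))  # 90 10
print(toy_test["labels"].count(1), toy_test["labels"].count(0))  # 4 6
```

scikit-learn's `train_test_split(..., stratify=labels)` provides the same behavior for array-like inputs.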

In [8]:
result = Classifier_model.cross_validation(dataset, test_data, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, NUM_FOLDS, output_dir, project_name='ChatGPT_data_2class_weight')

-----------------Fold: 1-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Map:   0%|          | 0/864 [00:00<?, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.7895,0.677569,0.592593,0.5315
2,0.6344,0.619041,0.62963,0.62963
3,0.5239,0.58936,0.699074,0.698292
4,0.3689,0.63966,0.699074,0.699068
5,0.1437,0.923452,0.694444,0.694025


Map:   0%|          | 0/120 [00:00<?, ? examples/s]

{'eval_loss': 0.5667014718055725, 'eval_accuracy': 0.7333333333333333, 'eval_f1': 0.7314685314685314, 'eval_runtime': 0.3797, 'eval_samples_per_second': 316.054, 'eval_steps_per_second': 21.07, 'epoch': 5.0}
-----------------Fold: 2-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/864 [00:00<?, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.7029,0.646559,0.648148,0.641227
2,0.6221,0.562556,0.703704,0.702786
3,0.5326,0.554245,0.736111,0.73606
4,0.3814,0.631185,0.671296,0.67112
5,0.2065,0.869023,0.689815,0.687396


Map:   0%|          | 0/120 [00:00<?, ? examples/s]

{'eval_loss': 0.5440957546234131, 'eval_accuracy': 0.7166666666666667, 'eval_f1': 0.7163515016685206, 'eval_runtime': 0.3828, 'eval_samples_per_second': 313.514, 'eval_steps_per_second': 20.901, 'epoch': 5.0}
-----------------Fold: 3-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/864 [00:00<?, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.698,0.684259,0.546296,0.531806
2,0.6149,0.627648,0.62037,0.620338
3,0.5437,0.600899,0.652778,0.650614
4,0.3785,0.686562,0.662037,0.654929
5,0.2225,0.966857,0.62963,0.61039


Map:   0%|          | 0/120 [00:00<?, ? examples/s]

{'eval_loss': 0.5412334203720093, 'eval_accuracy': 0.7083333333333334, 'eval_f1': 0.7081509276631228, 'eval_runtime': 0.3753, 'eval_samples_per_second': 319.709, 'eval_steps_per_second': 21.314, 'epoch': 5.0}
-----------------Fold: 4-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/864 [00:00<?, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6983,0.654303,0.675926,0.657578
2,0.6306,0.57425,0.703704,0.702786
3,0.5607,0.55101,0.75463,0.747212
4,0.4046,0.510164,0.777778,0.77709
5,0.2513,0.633037,0.75463,0.7531
6,0.1149,0.785499,0.75463,0.754624


Map:   0%|          | 0/120 [00:00<?, ? examples/s]

{'eval_loss': 0.5839369297027588, 'eval_accuracy': 0.7, 'eval_f1': 0.6945701357466063, 'eval_runtime': 0.3819, 'eval_samples_per_second': 314.22, 'eval_steps_per_second': 20.948, 'epoch': 6.0}
-----------------Fold: 5-----------------


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/864 [00:00<?, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6857,0.675721,0.592593,0.592278
2,0.6327,0.639561,0.615741,0.601379
3,0.5239,0.631208,0.652778,0.651875
4,0.3723,0.722571,0.666667,0.665951
5,0.2211,0.887898,0.666667,0.666409


Map:   0%|          | 0/120 [00:00<?, ? examples/s]

{'eval_loss': 0.544661283493042, 'eval_accuracy': 0.7, 'eval_f1': 0.6996662958843158, 'eval_runtime': 0.3761, 'eval_samples_per_second': 319.043, 'eval_steps_per_second': 21.27, 'epoch': 5.0}
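
Each fold in the log above trains on 864 rows and validates on 216, which is what a 5-fold split of the 1080 training rows produces. The sketch below generates such index splits in pure Python. Whether `cross_validation` shuffles or stratifies its folds is not shown in this notebook, so `kfold_indices` is a hypothetical helper that illustrates plain shuffled k-fold only.

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by n_folds.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, val_idx


# 1080 training rows and 5 folds, as in the log above.
fold_sizes = [(len(tr), len(va)) for tr, va in kfold_indices(1080, 5, seed=2023)]
print(fold_sizes[0])  # (864, 216)
```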


In [9]:
# Results without weights (cv=5)
average_accuracy = sum(d['eval_accuracy'] for d in result)/len(result)
average_macro_f1 = sum(d['eval_f1'] for d in result)/len(result)
print("Average accuracy:", average_accuracy)
print("Average Macro f1:", average_macro_f1)

Average accuracy: 0.7116666666666667
Average Macro f1: 0.7103996199279953
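
With only five folds, the spread across folds is as informative as the mean. The sketch below summarizes per-fold metrics with the standard library; the values are copied from the per-fold evaluation dicts in the log above, and `summarize` is a hypothetical helper rather than part of the project code.

```python
from statistics import mean, stdev

# Per-fold test-set metrics copied from the cross-validation log above.
fold_results = [
    {"eval_accuracy": 0.7333333333333333, "eval_f1": 0.7314685314685314},
    {"eval_accuracy": 0.7166666666666667, "eval_f1": 0.7163515016685206},
    {"eval_accuracy": 0.7083333333333334, "eval_f1": 0.7081509276631228},
    {"eval_accuracy": 0.7, "eval_f1": 0.6945701357466063},
    {"eval_accuracy": 0.7, "eval_f1": 0.6996662958843158},
]


def summarize(results, key):
    """Return (mean, sample standard deviation) of one metric across folds."""
    values = [r[key] for r in results]
    return mean(values), stdev(values)


acc_mean, acc_std = summarize(fold_results, "eval_accuracy")
f1_mean, f1_std = summarize(fold_results, "eval_f1")
print(f"accuracy: {acc_mean:.4f} +/- {acc_std:.4f}")
print(f"macro F1: {f1_mean:.4f} +/- {f1_std:.4f}")
```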


In [10]:
# Results with weights (cv=5)
# average_accuracy = sum(d['eval_accuracy'] for d in result)/len(result)
# average_macro_f1 = sum(d['eval_f1'] for d in result)/len(result)
# print("Average accuracy:", average_accuracy)
# print("Average Macro f1:", average_macro_f1)

Average accuracy: 0.7216666666666667
Average Macro f1: 0.7199262505130635


### Building a model with all the data

In [9]:
# Load the data as a dict
data = load_dataset_2class_classification(f"{DATASET_PATH}/act_classification_final_ChatGPT4.xlsx")

In [11]:
# Tohoku University BERT v3
MODEL_NAME = 'cl-tohoku/bert-base-japanese-v3'
Classifier_model = ActClassifier(model_name=MODEL_NAME, num_labels=2, seed=SEED)

In [14]:
# Split off held-out data (used here as the evaluation set) from the data used for training
dataset, eval_dataset, a, b = split_test_data_stratify(data=data, test_size=0.2, SEED=SEED)

In [16]:
model = Classifier_model.train_model(dataset, eval_dataset, MAX_LEN, NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE, PATIENCE, output_dir, project_name='ActClassification_2class_5_7', run_name='basic_model')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/960 [00:00<?, ? examples/s]

Map:   0%|          | 0/240 [00:00<?, ? examples/s]



VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011112601165142325, max=1.0…

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.7114,0.653258,0.6375,0.635672
2,0.6216,0.589315,0.691667,0.69029
3,0.5102,0.565273,0.704167,0.703544
4,0.3644,0.623763,0.683333,0.68254
5,0.1867,0.816322,0.704167,0.703296


In [17]:
model

<src.my_project.train_v2.SinglelabelTrainer at 0x7f69b0c8a010>


### Accuracy of the majority class

In [8]:
# Count the 1s and 0s in dataset['labels'] (a list) to find the majority class
print(dataset['labels'].count(1))
print(dataset['labels'].count(0))

530
550


In [9]:
majority_pred = [0] * len(test_data['labels'])

In [10]:
# Accuracy of the majority class
accuracy = accuracy_score(y_true=test_data['labels'], y_pred=majority_pred)
macro_f1 = f1_score(y_true=test_data['labels'], y_pred=majority_pred, average='macro')
print("accuracy:", accuracy)
print("Macro f1:", macro_f1)

accuracy: 0.5083333333333333
Macro f1: 0.33701657458563533


In [11]:
f1_score(y_true=test_data['labels'], y_pred=majority_pred, average=None, zero_division=0)

array([0.67403315, 0.        ])

In [13]:
# Compute per-class recall
class_recall = recall_score(y_true=test_data['labels'], y_pred=majority_pred, average=None, zero_division=0)
# Compute per-class precision
class_precision = precision_score(y_true=test_data['labels'], y_pred=majority_pred, average=None, zero_division=0)

In [14]:
print(class_recall)
print(class_precision)

[1. 0.]
[0.50833333 0.        ]
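
The macro F1 of about 0.337 is the average of the two per-class scores: roughly 0.674 for class 0 and 0 for class 1, which is never predicted. The sketch below recomputes these numbers without scikit-learn. The label counts (61 zeros and 59 ones) are an assumption inferred from the 120 test rows and the reported accuracy of 0.5083, and `per_class_f1` is a hypothetical helper.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, defined as 0 when the class has no true positives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed label counts: 61 zeros and 59 ones, inferred from 120 test rows
# and the reported accuracy of 0.5083.
y_true = [0] * 61 + [1] * 59
y_pred = [0] * len(y_true)  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_per_class = [per_class_f1(y_true, y_pred, label) for label in (0, 1)]
macro_f1 = sum(f1_per_class) / len(f1_per_class)
print(round(accuracy, 4), [round(f, 4) for f in f1_per_class], round(macro_f1, 4))
# 0.5083 [0.674, 0.0] 0.337
```

Any trained model should clear this baseline on both metrics before its scores are read as evidence of learning.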
