# Sentiment Analysis with FinBERT

This notebook shows how to classify the sentiment of messages about stocks with [FinBERT](https://github.com/ProsusAI/finBERT).

## Setup

### Installing packages

In [1]:
!pip install -q pandas==1.1.5 transformers==4.10.2 scikit-learn==0.23.2 datasets==1.12.1


### Imports

In [2]:
import numpy as np
import pandas as pd
from datasets import load_metric
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

### Uploading the data

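The dataset for this notebook can be built from the [StockTwits](https://api.stocktwits.com/developers/docs) API. The API allows at most 200 requests per hour, so please respect that limit. A dataset already created with this API (`FinBERT_Data.csv`) is in this chapter's `Data` folder; upload it here.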
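
A limit of 200 requests per hour works out to one request every 18 seconds. The following is a minimal sketch of a throttle that enforces that spacing. `Throttle` and its parameters are hypothetical helpers written for this illustration, not part of the StockTwits API, and the request itself is left out because the endpoint is not shown here.

```python
import time


class Throttle:
    """Keep successive calls at least `min_interval` seconds apart."""

    def __init__(self, max_requests, period_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 200 requests per hour -> 3600 / 200 = 18 seconds between requests.
throttle = Throttle(max_requests=200, period_seconds=3600)
print(throttle.min_interval)  # 18.0
```

Calling `throttle.wait()` before each request would keep a collection loop under the limit.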

In [3]:
from google.colab import files

uploaded = files.upload()

Saving FinBERT_Data.csv to FinBERT_Data.csv


### Loading the data

In [4]:
df = pd.read_csv("FinBERT_Data.csv")
df.head()

Unnamed: 0,symbol,sentiment,message,message_id
0,GOOGL,Bullish,$GOOGL want to know how to get more day trades...,216640409
1,GOOGL,Bullish,$googl $amzn $fb\nWhy we still bullish? Good v...,216632254
2,GOOGL,Bullish,$tsla $aapl $googl\nTsla ideas \nhttps://youtu...,216621697
3,GOOGL,Bearish,$AAPL $AMZN $GOOGL $FB $MSFT \nTime to short t...,216598529
4,FB,Bearish,2020 is not even close to being over.. \nBig B...,216673119


In [5]:
display(df["symbol"].value_counts())
df["sentiment"].value_counts()

FB       823
AMZN     671
GOOGL    620
Name: symbol, dtype: int64

Bullish    1426
Bearish     688
Name: sentiment, dtype: int64

## Preprocessing

Use `LabelEncoder` to convert the label strings to integers.

In [6]:
le = LabelEncoder()
df["sentiment"] = le.fit_transform(df["sentiment"])
df.head()

Unnamed: 0,symbol,sentiment,message,message_id
0,GOOGL,1,$GOOGL want to know how to get more day trades...,216640409
1,GOOGL,1,$googl $amzn $fb\nWhy we still bullish? Good v...,216632254
2,GOOGL,1,$tsla $aapl $googl\nTsla ideas \nhttps://youtu...,216621697
3,GOOGL,0,$AAPL $AMZN $GOOGL $FB $MSFT \nTime to short t...,216598529
4,FB,0,2020 is not even close to being over.. \nBig B...,216673119
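
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why `Bearish` became 0 and `Bullish` became 1 above. The sketch below reproduces that mapping in plain Python for illustration; `encode_labels` is a hypothetical helper, not a scikit-learn function. On the fitted encoder itself, `le.classes_` holds the same ordering.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(labels))
    code_of = {name: code for code, name in enumerate(classes)}
    return classes, [code_of[name] for name in labels]


# Same label sequence as the first five rows of the DataFrame.
classes, codes = encode_labels(["Bullish", "Bullish", "Bullish", "Bearish", "Bearish"])
print(classes)  # ['Bearish', 'Bullish']
print(codes)    # [1, 1, 1, 0, 0]
```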


Split the data into three parts: training, validation, and test sets (80%, 10%, and 10%).

In [7]:
x_train, x_test, y_train, y_test = train_test_split(
    list(df["message"].values),
    list(df["sentiment"].values),
    test_size=0.2,
    random_state=2021
)

x_valid, x_test, y_valid, y_test = train_test_split(
    x_test,
    y_test,
    test_size=0.5,
    random_state=2021
)
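
As a worked check of the two-stage split, the sketch below computes the resulting set sizes for the 2,114 rows. It assumes scikit-learn's behavior of rounding the held-out portion up when `test_size` is a fraction; `split_sizes` is a hypothetical helper written for this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_total = 1426 + 688                             # Bullish + Bearish rows
n_train, n_holdout = split_sizes(n_total, 0.2)   # first split: 80% / 20%
n_valid, n_test = split_sizes(n_holdout, 0.5)    # second split: halve the 20%
print(n_train, n_valid, n_test)                  # 1691 211 212
```

These sizes agree with the `Num examples` values that appear in the training and evaluation logs.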

Instantiate the tokenizer with the `BertTokenizerFast.from_pretrained` method.

In [8]:
model_name = "ProsusAI/finbert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)


Pass the texts to the tokenizer to encode them.

In [9]:
train_encodings = tokenizer(x_train, truncation=True, padding=True)
val_encodings = tokenizer(x_valid, truncation=True, padding=True)
test_encodings = tokenizer(x_test, truncation=True, padding=True)
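
`truncation=True` cuts each sequence at the model's maximum length, and `padding=True` pads the remaining sequences to the longest one in the batch, with an attention mask marking the real tokens. The toy function below illustrates the idea on lists of token IDs. It is a simplified stand-in, not the tokenizer's implementation: the IDs are made up, `pad_id=0` is an assumption, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(encoded["input_ids"])       # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```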

Create the datasets from the labels and the encoded inputs.

In [10]:
import torch

class StockDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = StockDataset(train_encodings, y_train)
val_dataset = StockDataset(val_encodings, y_valid)
test_dataset = StockDataset(test_encodings, y_test)
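
To see what `__getitem__` returns, the sketch below applies the same dictionary comprehension to a toy, column-oriented encoding. `build_item` and the toy values are hypothetical and stand in for the real `StockDataset` and tokenizer output.

```python
def build_item(encodings, labels, idx):
    """Pick example `idx` out of column-oriented encodings and attach its label."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["labels"] = labels[idx]
    return item


toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
item = build_item(toy_encodings, [1, 0], idx=1)
print(item)  # {'input_ids': [101, 9, 102], 'attention_mask': [1, 1, 1], 'labels': 0}
```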

## Training the model

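Use the `load_metric` function from the Datasets library to prepare the evaluation function. Here we simply use accuracy. The `compute_metrics` function converts the logits to predictions and passes them to the `compute` method of `metric`.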

In [11]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
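
The conversion from logits to predictions can be checked on a small example. The logits and labels below are made up for illustration, and the accuracy is computed directly with NumPy rather than through `metric.compute`.

```python
import numpy as np

# Toy logits for four examples and two classes (0 = Bearish, 1 = Bullish).
logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=-1)          # index of the largest logit per row
accuracy = float((predictions == labels).mean())  # fraction of matching predictions
print(predictions)  # [0 1 1 0]
print(accuracy)     # 0.75
```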


In [12]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    weight_decay=0.01,               # strength of weight decay
    evaluation_strategy="steps",     # evaluate at a fixed step interval
    logging_dir='./logs',            # directory for logs
    logging_steps=100,               # log the training loss every 100 steps
    eval_steps=100,                  # run evaluation every 100 steps
    save_steps=100,                  # save a checkpoint every 100 steps
    load_best_model_at_end=True,     # reload the best checkpoint after training
)

model = AutoModelForSequenceClassification.from_pretrained(
    "ProsusAI/finbert",
    num_labels=2,                  # two classes: Bearish and Bullish
    ignore_mismatched_sizes=True   # replace the checkpoint's 3-class head with a new 2-class head
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
    compute_metrics=compute_metrics,
)

trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ProsusAI/finbert and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
***** Running training *****
  Num examples = 1691
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1060


Step,Training Loss,Validation Loss,Accuracy
100,0.5866,0.554538,0.748815
200,0.4005,0.617676,0.701422
300,0.2414,0.743033,0.767773
400,0.1526,1.097764,0.748815
500,0.1121,1.223887,0.763033
600,0.0474,1.464489,0.772512
700,0.0271,1.600546,0.781991
800,0.0368,1.566037,0.772512
900,0.0291,1.570817,0.753555
1000,0.0108,1.674281,0.763033


***** Running Evaluation *****
  Num examples = 211
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-100
Configuration saved in ./results/checkpoint-100/config.json
Model weights saved in ./results/checkpoint-100/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 211
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 211
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-300
Configuration saved in ./results/checkpoint-300/config.json
Model weights saved in ./results/checkpoint-300/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 211
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-400
Configuration saved in ./results/checkpoint-400/config.json
Model weights saved in ./results/checkpoint-400/pytorch_model.bin
***** Ru

TrainOutput(global_step=1060, training_loss=0.15629789919223425, metrics={'train_runtime': 2076.649, 'train_samples_per_second': 8.143, 'train_steps_per_second': 0.51, 'total_flos': 3102279759318600.0, 'train_loss': 0.15629789919223425, 'epoch': 10.0})
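
The numbers in the training log above can be verified by hand. The sketch below recomputes the total number of optimization steps and then picks out the best evaluation rows from the table. It assumes the default behavior in which `load_best_model_at_end=True` ranks checkpoints by validation loss unless `metric_for_best_model` is set; the `history` list is simply the logged table copied in.

```python
import math

# Values taken from the training log.
num_examples, batch_size, num_epochs = 1691, 16, 10
steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 106 1060

# (step, validation loss, accuracy) rows from the evaluation table.
history = [
    (100, 0.554538, 0.748815),
    (200, 0.617676, 0.701422),
    (300, 0.743033, 0.767773),
    (400, 1.097764, 0.748815),
    (500, 1.223887, 0.763033),
    (600, 1.464489, 0.772512),
    (700, 1.600546, 0.781991),
    (800, 1.566037, 0.772512),
    (900, 1.570817, 0.753555),
    (1000, 1.674281, 0.763033),
]
best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])
print(best_by_loss[0], best_by_accuracy[0])  # 100 700
```

Under that assumption, the model reloaded at the end of training would be the checkpoint from step 100, even though validation accuracy peaks at step 700.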

In [13]:
trainer.evaluate(test_dataset)

***** Running Evaluation *****
  Num examples = 212
  Batch size = 64


{'epoch': 10.0,
 'eval_accuracy': 0.7924528301886793,
 'eval_loss': 0.4584706723690033,
 'eval_runtime': 7.8488,
 'eval_samples_per_second': 27.011,
 'eval_steps_per_second': 0.51}

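The test accuracy of about 79% is passable. Better preprocessing and hyperparameter tuning would likely improve it a little. Note that the validation loss rises from step 200 onward while the training loss keeps falling, which suggests the model is overfitting.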
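
One way to put the accuracy in context is to compare it with a baseline that always predicts the majority class. The sketch below uses the class counts of the full dataset reported earlier; the class balance of the test split itself may differ slightly, so treat the comparison as approximate.

```python
counts = {"Bullish": 1426, "Bearish": 688}   # from df["sentiment"].value_counts()
majority_baseline = max(counts.values()) / sum(counts.values())
test_accuracy = 0.7924528301886793           # eval_accuracy on the test set

print(round(majority_baseline, 3))                  # 0.675
print(round(test_accuracy - majority_baseline, 3))  # 0.118
```

The fine-tuned model is therefore roughly 12 percentage points better than always answering `Bullish`.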

## References

- [Fine-tuning a pretrained model](https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/training.ipynb)