<a href="https://colab.research.google.com/github/tomonari-masada/course-nlp2020/blob/master/08_document_classification_with_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 08 Document classification with an RNN
* Use the RNN's output as a latent representation of each document and classify documents with it.

## 08-01 Loading the IMDb data with torchtext
* For `torchtext.datasets`, which is used here to load the IMDb dataset, see:
 * https://torchtext.readthedocs.io/en/latest/datasets.html

### Settings for reproducible experiments
* https://pytorch.org/docs/stable/notes/randomness.html

In [1]:
import os
# cuBLAS needs this workspace setting to behave deterministically on CUDA 10.2 or later.
os.environ['CUBLAS_WORKSPACE_CONFIG']=":4096:8"


In [2]:
import random
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext import datasets
from torchtext.data import Field, LabelField, BucketIterator

SEED = 123

random.seed(SEED)
torch.manual_seed(SEED)
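# Available in PyTorch 1.7; deprecated from 1.8 in favor of torch.use_deterministic_algorithms(True).
torch.set_deterministic(True)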

### torchtext fields
* Create instances of two kinds of Field object: a TEXT field and a LABEL field.
 * For details of the Field class, see [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).
* The TEXT field specifies how the text is preprocessed.
* The LABEL field is used to preprocess the labels.

In [3]:
TEXT = Field(tokenize="spacy")
LABEL = LabelField()

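### Download the IMDb dataset, then preprocess it while loading
* The download finishes quickly, but decompressing the archive takes a little while.
* Loading also takes some time because the TEXT field is configured to use spaCy tokenization.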

In [4]:
train_valid_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

### Split the non-test portion into training and validation data

In [5]:
train_data, valid_data = train_valid_data.split(split_ratio=0.8)
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000


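### Build the vocabulary and the label set for the dataset
* For the TEXT field, specify the maximum vocabulary size. The resulting vocabulary has two more entries than `MAX_VOCAB_SIZE` because `<unk>` and `<pad>` are added.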

In [6]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
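
The relation between `max_size` and the final vocabulary size can be illustrated with a small standard-library sketch. It is not torchtext's implementation: `build_vocab` here is a hypothetical helper, the toy documents are made up, and it only assumes that the special tokens come first and the most frequent tokens follow.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Hypothetical helper: specials first, then the max_size most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Toy corpus (made up for illustration)
docs = [["the", "movie", "was", "great"],
        ["the", "plot", "was", "thin"],
        ["the", "end"]]

itos, stoi = build_vocab(docs, max_size=3)
print(len(itos))  # max_size plus the two special tokens

# Tokens outside the vocabulary are mapped to the index of <unk>
unk = stoi["<unk>"]
ids = [stoi.get(tok, unk) for tok in ["the", "acting", "was"]]
print(ids)
```

With `max_size=3` the list has five entries, in the same way that `MAX_VOCAB_SIZE = 25000` yields 25002 above.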


In [7]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']


In [8]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 231466), (',', 220572), ('.', 189097), ('a', 125017), ('and', 125012), ('of', 114875), ('to', 106992), ('is', 87340), ('in', 70294), ('I', 61900), ('it', 61164), ('that', 56231), ('"', 50199), ("'s", 49490), ('this', 48385), ('-', 42420), ('/><br', 40924), ('was', 40198), ('as', 34764), ('with', 34223)]


### Get the device

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Create iterators that yield mini-batches

In [10]:
BATCH_SIZE = 100

train_iterator = BucketIterator(train_data, batch_size=BATCH_SIZE, device=device,
                                sort_within_batch=True, shuffle=True,
                                sort_key=lambda x: len(x.text))
valid_iterator = BucketIterator(valid_data, batch_size=BATCH_SIZE, device=device)
test_iterator = BucketIterator(test_data, batch_size=BATCH_SIZE, device=device)
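
The motivation for bucketing by length can be sketched without torchtext. The code below is only an illustration of the idea, not how `BucketIterator` is implemented: `bucket_batches` and `padding_cost` are hypothetical helpers, and the document lengths are made up.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Hypothetical helper: group indices of similar-length examples into batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle the batch order, not the contents
    return batches

def padding_cost(lengths, batches):
    """Pad tokens needed when every batch is padded to its longest example."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
               for b in batches)

lengths = [5, 120, 7, 118, 6, 121, 8, 119]  # made-up document lengths
naive = [list(range(i, i + 4)) for i in range(0, len(lengths), 4)]
bucketed = bucket_batches(lengths, batch_size=4)

print(padding_cost(lengths, naive))     # batches taken in the original order
print(padding_cost(lengths, bucketed))  # batches of similar-length examples
```

Grouping similar lengths leaves far fewer `<pad>` positions for the RNN to process.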

### Set constants

In [11]:
INPUT_DIM = len(TEXT.vocab)
NUM_CLASS = len(LABEL.vocab)
EMBED_DIM = 64
HIDDEN_DIM = 64
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

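### Define the model
* The model uses an LSTM. A GRU would also work, but `nn.GRU` returns no cell state, so the unpacking of the return value in `forward` would have to change as well.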

In [12]:
class RNNTextSentiment(nn.Module):
  def __init__(self, emb_dim, hid_dim,
               num_class, vocab_size, padding_idx, p=0.0):
    super().__init__()

    self.input_dim = vocab_size
    self.emb_dim = emb_dim
    self.hid_dim = hid_dim

    self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx)
    self.rnn = nn.LSTM(emb_dim, hid_dim)
    self.fc = nn.Linear(hid_dim * 2, num_class)
    self.dropout = nn.Dropout(p=p)

  def forward(self, src):
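    # src has shape [sequence length, batch size]

    embedded = self.dropout(self.embedding(src))
    # embedded has shape [sequence length, batch size, embedding dim]

    outputs, (hidden, _) = self.rnn(embedded)
    # outputs has shape [sequence length, batch size, hidden dim]
    # hidden has shape [1, batch size, hidden dim]

    # Average the outputs over time steps (padded positions are included)
    mean_outputs = outputs.mean(0)
    # squeeze(0) drops only the layer axis, so a batch of size 1 keeps its batch axis
    hidden = hidden.squeeze(0)
    # mean_outputs has shape [batch size, hidden dim]
    # hidden has shape [batch size, hidden dim]
    # Concatenating the two gives hid_dim * 2 features, matching the input size of self.fc
    output = self.fc(torch.cat((mean_outputs, hidden), dim=1))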

    return output

### Weights can also be initialized this way
* Define a function and pass it to `apply`.

In [13]:
def init_weights(m):
  for name, param in m.named_parameters():
    if 'weight' in name:
      nn.init.normal_(param.data, mean=0, std=0.01)
    else:
      nn.init.constant_(param.data, 0)

* Create an instance of the model.

In [14]:
model = RNNTextSentiment(EMBED_DIM, HIDDEN_DIM, NUM_CLASS, INPUT_DIM,
                         padding_idx=PAD_IDX, p=0.5).to(device)

* Run the weight initialization.

In [15]:
model.apply(init_weights)

RNNTextSentiment(
  (embedding): Embedding(25002, 64, padding_idx=1)
  (rnn): LSTM(64, 64)
  (fc): Linear(in_features=128, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

* Check that the biases have been initialized to zero.

In [16]:
for name, param in model.named_parameters():
  if 'weight' not in name:
    print(param.data)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

### Set up the optimization algorithm

In [17]:
optimizer = optim.Adam(model.parameters(), lr=0.0001)

Count the number of parameters.

In [18]:
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,633,666 trainable parameters
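
The reported total can be checked by hand from the layer sizes printed for the model above. The sketch below assumes the standard single-layer, unidirectional `nn.LSTM` layout: four gates, each with input weights, recurrent weights, and two bias vectors.

```python
vocab_size, emb_dim, hid_dim, num_class = 25002, 64, 64, 2

# Embedding table: one emb_dim vector per vocabulary entry
embedding = vocab_size * emb_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and two biases
lstm = 4 * (emb_dim * hid_dim + hid_dim * hid_dim + hid_dim + hid_dim)

# Linear layer on the concatenation of mean output and last hidden state
fc = (hid_dim * 2) * num_class + num_class

total = embedding + lstm + fc
print(f"{embedding:,} + {lstm:,} + {fc:,} = {total:,}")
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and `EMBED_DIM` dominate the model size.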


### Set the loss function for document classification

In [19]:
criterion = nn.CrossEntropyLoss()
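
`nn.CrossEntropyLoss` applies a log-softmax to the logits and takes the negative log-probability of the correct class, averaged over the batch by default. The following standard-library sketch computes the same quantity for a single example; `cross_entropy` is a hypothetical helper written for illustration, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Two classes with equal logits: the loss is ln 2
uniform_loss = cross_entropy([0.0, 0.0], 1)
print(round(uniform_loss, 4))

# A confident, correct prediction gives a smaller loss
confident_loss = cross_entropy([2.0, -1.0], 0)
print(confident_loss < uniform_loss)
```

The value ln 2 ≈ 0.6931 is the loss of a two-class model that cannot yet tell the classes apart. It matches the loss reported for Epoch 1 in the training log, which is plausibly a consequence of initializing the weights close to zero.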

### Training function

In [20]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0.
  epoch_acc = 0.
  for batch in iterator:

    optimizer.zero_grad()
    output = model(batch.text)
    loss = criterion(output, batch.label)
    loss.backward()

    # Gradient clipping is commonly applied when training RNNs
    nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()

    epoch_loss += loss.item()
    epoch_acc += (output.argmax(1) == batch.label).sum().item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator.dataset)
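
What clipping by norm does to the gradients can be sketched in plain Python. This is a simplified illustration of the idea behind `clip_grad_norm_`, not its implementation: `clip_by_global_norm` is a hypothetical helper, and the gradients are made-up lists of floats rather than tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Hypothetical helper: rescale all gradients so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale up
    return [[g * scale for g in vec] for vec in grads], total_norm

def global_norm(grads):
    return math.sqrt(sum(g * g for vec in grads for g in vec))

# Made-up gradients for two parameter vectors; their joint norm is 5
large = [[3.0, 0.0], [0.0, 4.0]]
clipped_large, norm_large = clip_by_global_norm(large, max_norm=1.0)
print(norm_large, global_norm(clipped_large))

# Gradients already inside the limit are left unchanged
small = [[0.1, 0.2]]
clipped_small, norm_small = clip_by_global_norm(small, max_norm=1.0)
print(clipped_small == small)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is limited.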

### Evaluation function

In [21]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0.
  epoch_acc = 0.
  with torch.no_grad():
    for batch in iterator:
      output = model(batch.text)
      loss = criterion(output, batch.label)
      epoch_loss += loss.item()
      epoch_acc += (output.argmax(1) == batch.label).sum().item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator.dataset)

In [22]:
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time // 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 100
CLIP = 1.

for epoch in range(1, N_EPOCHS + 1):

  start_time = time.time()
  train_loss, train_acc = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  print(f'Epoch {epoch} | time in {epoch_mins} minutes, {epoch_secs} seconds')
  print(f'\tLoss {train_loss:.4f} (train)\t|\tAcc {train_acc * 100:.1f}% (train)')
  print(f'\tLoss {valid_loss:.4f} (valid)\t|\tAcc {valid_acc * 100:.1f}% (valid)')

Epoch 1 | time in 0 minutes, 9 seconds
	Loss 0.6931 (train)	|	Acc 52.5% (train)
	Loss 0.6931 (valid)	|	Acc 61.9% (valid)
Epoch 2 | time in 0 minutes, 9 seconds
	Loss 0.5621 (train)	|	Acc 76.6% (train)
	Loss 0.4897 (valid)	|	Acc 81.5% (valid)
Epoch 3 | time in 0 minutes, 9 seconds
	Loss 0.3633 (train)	|	Acc 85.8% (train)
	Loss 0.4286 (valid)	|	Acc 85.1% (valid)
Epoch 4 | time in 0 minutes, 9 seconds
	Loss 0.2851 (train)	|	Acc 89.3% (train)
	Loss 0.4155 (valid)	|	Acc 86.5% (valid)
Epoch 5 | time in 0 minutes, 9 seconds
	Loss 0.2382 (train)	|	Acc 91.4% (train)
	Loss 0.4236 (valid)	|	Acc 87.6% (valid)
Epoch 6 | time in 0 minutes, 9 seconds
	Loss 0.2030 (train)	|	Acc 92.7% (train)
	Loss 0.4265 (valid)	|	Acc 88.1% (valid)
Epoch 7 | time in 0 minutes, 9 seconds
	Loss 0.1781 (train)	|	Acc 93.9% (train)
	Loss 0.4902 (valid)	|	Acc 88.5% (valid)
Epoch 8 | time in 0 minutes, 9 seconds
	Loss 0.1507 (train)	|	Acc 95.2% (train)
	Loss 0.4945 (valid)	|	Acc 89.0% (valid)
Epoch 9 | time in 0 minutes, 9 s

Epoch 69 | time in 0 minutes, 9 seconds
	Loss 0.0011 (train)	|	Acc 100.0% (train)
	Loss 2.0102 (valid)	|	Acc 87.1% (valid)
Epoch 70 | time in 0 minutes, 9 seconds
	Loss 0.0036 (train)	|	Acc 99.9% (train)
	Loss 2.2077 (valid)	|	Acc 86.6% (valid)
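
In the log above the training loss keeps falling while the validation loss rises after the first few epochs, which suggests overfitting. One common response is to stop training once the validation loss stops improving. The sketch below applies such a rule to the validation losses of epochs 1 to 8 as printed in the log; `early_stop_epoch` and the patience value are assumptions made for illustration, not part of the notebook.

```python
# Validation losses for epochs 1-8, copied from the log above
valid_loss = {1: 0.6931, 2: 0.4897, 3: 0.4286, 4: 0.4155,
              5: 0.4236, 6: 0.4265, 7: 0.4902, 8: 0.4945}

def early_stop_epoch(losses, patience):
    """Hypothetical helper: return (best_epoch, stop_epoch) for a patience-based rule."""
    best_epoch, best = None, float("inf")
    for epoch in sorted(losses):
        if losses[epoch] < best:
            best_epoch, best = epoch, losses[epoch]
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs
            return best_epoch, epoch
    return best_epoch, None

result = early_stop_epoch(valid_loss, patience=3)
print(result)
```

In a real training loop the model parameters would also be saved at the best epoch so that they can be restored afterwards.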
