# paper : TopicBERT for Energy Efficient Document Classification
(link : https://arxiv.org/abs/2010.16407)

### Summray

- Goal
    - Complementary Fine-tuning
    - Efficient Fine-tuning : Speed(CO2 Estimation) + Accuracy

### Architecture
NVDM + BERT

VAE로 구성된 NVDM의 latent vector를 BERT의 CLS token과 concat 후 MLP를 지나 classification 진행한다.


### Dataset

본 논문에서는 BERT에 대해서는 총 5개의 datasets (Reuter8, Imdb, 20NS, Ohsumed, AGnews) 을 사용하였으며, DistilBERT에 대해서는 총 2개의 datasets(Reuters8, 20NS)를 사용하였다.
본 코드에서는 2개의 dataset (IMDB, 20NS(newsgroup))를 dataset으로 선정하였으며, DistilBERT에 대해서 tuning 진행한다.

### Project
- Architecture : DistilBERT + NVDM
- Baseline : BERT, DistilBERT
- Dataset : imdb, 20NS
- Evaluate Metric : accuracy(micro-F1), macro-F1, $T_{epoch}$ \
(논문에서는 $T_{epoch}$, $T$를 통해 CO_2 emission을 구하였으나, $T_{epoch}$와 유사한 지표로 판단해 대체하였다. 또한 Retention은 Macro-F1과 사실상 동일한 지표)

### Objective Function
- NVDM objective : beta-VAE의 objective function과 동일하다\
VAE : https://arxiv.org/abs/1312.6114 \
Beta-VAE : https://openreview.net/forum?id=Sy2fzU9gl
- TopicBERT objective : CE Loss와 NVDM Loss의 결합
- 각 architecture에 구현해두었다.

### Environment
- google colab GPU : A100

In [1]:
!pip install transformers datasets
!pip install git+https://github.com/huggingface/transformers.git
!pip install accelerate>=0.20.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-kxt7tii7
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-kxt7tii7
  Resolved https://github.com/huggingface/transformers.git to commit ee88ae59940fd4b2c8fc119373143d7a1175c651
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [2]:
from datasets import load_dataset
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer
import torch
from torch import LongTensor
from torch.utils.data import DataLoader
from torch.optim import Adam
import nltk
import pandas as pd
from torch import nn
from torch.nn import functional as F
from transformers import DistilBertModel

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Dataset (imdb, 20NS)

In [4]:
from sklearn.datasets import fetch_20newsgroups
from datasets import Dataset

def split_dataset(data):
    if data=="imdb":
        imdb = load_dataset("imdb")
        train_dataset = imdb["train"].shuffle(seed=42).select([i for i in list(range(4000))])
        test_dataset = imdb["test"].shuffle(seed=42).select([i for i in list(range(400))])

    elif data=="20NS":
        train_20ns = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
        test_20ns = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
        train_df = pd.DataFrame(data={"text":train_20ns["data"], "label":train_20ns["target"]})
        test_df = pd.DataFrame(data={"text":test_20ns["data"], "label":test_20ns["target"]})
        train_dataset = Dataset.from_pandas(train_df).shuffle(seed=42).select([i for i in list(range(4000))])
        test_dataset = Dataset.from_pandas(test_df).shuffle(seed=42).select([i for i in list(range(400))])

    else:
        print("check data name")
        return None, None

    return train_dataset, test_dataset

Hyperparameter

In [5]:
HYPERPAREMTER = {
    "imdb": {
        "NVDM_Hidden_Size": 256,
        "NVDM_Latent_Size": 100,
        "BATCH_SIZE": 16,
        "VAE_BETA": 1,
        "TOPIC_BERT_ALPHA": 0.5,
        "TOPIC_BERT_HIDDEN_SIZE": 768,
        "TOPIC_BERT_OUTPUT_SIZE": 2,
        "EPOCHS": 10,
        "LR": 2e-5,
        "EVAL":2
    },
    "20NS": {
        "NVDM_Hidden_Size": 256,
        "NVDM_Latent_Size": 100,
        "BATCH_SIZE": 16,
        "VAE_BETA": 1,
        "TOPIC_BERT_ALPHA": 0.5,
        "TOPIC_BERT_HIDDEN_SIZE": 768,
        "TOPIC_BERT_OUTPUT_SIZE": 20,
        "EPOCHS": 10,
        "LR": 2e-5,
        "EVAL":2
    },
}

Tokenizer (BERT, distilBERT)

In [6]:
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

distilBERT_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def distilBERT_tokenize_function(examples):
    return distilBERT_tokenizer(examples["text"], padding='max_length', max_length=512, truncation=True)

BERT_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def BERT_tokenize_function(examples):
    return BERT_tokenizer(examples["text"], padding='max_length', max_length=512, truncation=True)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


NVDM

In [7]:
class NVDM(nn.Module):
    def __init__(self, vocab_size, hidden_size, latent_size, beta=1):
        super(NVDM, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.latent_size = latent_size
        self.beta = beta

        self.fc1 = nn.Linear(vocab_size, hidden_size)
        self.fc21 = nn.Linear(hidden_size, latent_size)
        self.fc22 = nn.Linear(hidden_size, latent_size)
        self.fc3 = nn.Linear(latent_size, hidden_size)
        self.fc4 = nn.Linear(hidden_size, vocab_size)
        self.epsilon = 1e-8

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        return mu + eps*std

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        h4 = self.fc4(h3)
        return torch.softmax(h4, dim=1)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

    def loss_function(self, recon_x, x, mu, logvar):
        recon_loss = -torch.sum(torch.log(recon_x + self.epsilon) * x, dim=1)
        latent_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return torch.mean(recon_loss + self.beta * latent_loss)

TopicBERT (Distil BERT)

In [8]:
class TopicBERT(nn.Module):
    def __init__(self, nvdm, hidden_size, output_size, alpha=0.5):
        super(TopicBERT, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.nvdm = nvdm
        self.alpha = alpha
        self.output_size = output_size

        self.fc1 = nn.Linear(nvdm.latent_size + hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.epsilon = 1e-8

    def forward(self, bert_input, nvdm_input):
        outputs = self.bert(**bert_input)
        cls_token = outputs.last_hidden_state[:, 0, :]
        mu, logvar = self.nvdm.encode(nvdm_input)
        z = self.nvdm.reparameterize(mu, logvar)
        nvdm_output = self.nvdm.decode(z)
        nvdm_loss = self.nvdm.loss_function(nvdm_output, nvdm_input, mu, logvar)

        z_cls = torch.cat([z, cls_token], dim=1)
        h = F.relu(self.fc1(z_cls))
        logits = self.fc2(h)

        return logits, nvdm_loss

    def loss_function(self, logits, label, nvdm_loss):
        ce_loss = F.cross_entropy(logits, label)
        loss = self.alpha * ce_loss + (1 - self.alpha) * nvdm_loss
        return loss

Baseline : DistilBERT + MLP

In [9]:
class DistilBERT_MLP(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DistilBERT_MLP, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.output_size = output_size

        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, bert_input):
        outputs = self.bert(**bert_input)
        cls_token = outputs.last_hidden_state[:, 0, :]

        h = F.relu(self.fc1(cls_token))
        logits = self.fc2(h)

        return logits

    def loss_function(self, logits, label):
        ce_loss = F.cross_entropy(logits, label)
        return ce_loss

Baseline : BERT + MLP

In [10]:
from transformers import BertModel

class BERT_MLP(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(BERT_MLP, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.output_size = output_size

        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, bert_input):
        outputs = self.bert(**bert_input)
        cls_token = outputs.last_hidden_state[:, 0, :]

        h = F.relu(self.fc1(cls_token))
        logits = self.fc2(h)

        return logits

    def loss_function(self, logits, label):
        ce_loss = F.cross_entropy(logits, label)
        return ce_loss

Metric

In [11]:
from sklearn.metrics import f1_score

def get_f1_score(logits, labels):
    predictions = torch.argmax(logits, dim=-1)

    predictions_np = predictions.cpu().numpy()
    labels_np = labels.cpu().numpy()

    f1 = f1_score(labels_np, predictions_np, average='macro')
    return f1


def get_accuracy(logits, labels):
    preds = torch.argmax(logits, dim=1)
    f1 = f1_score(labels.cpu(), preds.cpu(), average='micro')
    return f1

Train and Eval

In [12]:
import time
import datetime

def train_and_evaluate_model(model, train_dataloader, test_dataloader, optimizer, epochs, eval_interval):
    total_t0 = time.time()
    for epoch in range(epochs):
        t0 = time.time()
        for batch in train_dataloader:
            optimizer.zero_grad()

            input_ids = torch.stack(batch["input_ids"], dim=1).long().to(DEVICE)
            attention_mask = torch.stack(batch["attention_mask"], dim=1).long().to(DEVICE)
            bert_input = {"input_ids":input_ids, "attention_mask":attention_mask}
            labels = batch["label"].to(DEVICE)

            if isinstance(model, TopicBERT):
                nvdm_input = torch.stack(batch["BoW"], dim=1).float().to(DEVICE)
                logits, nvdm_loss = model(bert_input, nvdm_input)
                loss = model.loss_function(logits, labels, nvdm_loss)
            else:
                logits = model(bert_input)
                loss = model.loss_function(logits, labels)

            loss.backward()
            optimizer.step()

        if (epoch + 1) % eval_interval == 0:
            elapsed = format_time(time.time() - t0)
            print(f"Epoch {epoch+1}, Loss: {loss.item()}, Elapsed time: {elapsed}")

            model.eval()
            total_eval_loss = 0
            total_eval_f1_macro = 0
            accuracy = 0
            total_eval_accuracy = 0

            for batch in test_dataloader:
                with torch.no_grad():
                    input_ids = torch.stack(batch["input_ids"], dim=1).long().to(DEVICE)
                    attention_mask = torch.stack(batch["attention_mask"], dim=1).long().to(DEVICE)
                    bert_input = {"input_ids":input_ids, "attention_mask":attention_mask}
                    labels = batch["label"].to(DEVICE)

                    if isinstance(model, TopicBERT):
                        nvdm_input = torch.stack(batch["BoW"], dim=1).float().to(DEVICE)
                        logits, nvdm_loss = model(bert_input, nvdm_input)
                        loss = model.loss_function(logits, labels, nvdm_loss)
                    else:
                        logits = model(bert_input)
                        loss = model.loss_function(logits, labels)

                    total_eval_loss += loss.item()
                    f1_macro = get_f1_score(logits, labels)
                    total_eval_f1_macro += f1_macro
                    accuracy = get_accuracy(logits, labels)
                    total_eval_accuracy += accuracy

            avg_val_f1_macro = total_eval_f1_macro / len(test_dataloader)
            avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
            avg_val_loss = total_eval_loss / len(test_dataloader)

            print(f"Test Accuracy: {avg_val_accuracy:.3f}")
            print(f"Test Macro F1 Score: {avg_val_f1_macro:.3f}")
            print(f"Test Loss: {avg_val_loss:.3f}")

    total_elapsed = format_time(time.time() - total_t0)
    print(f"Total training took {total_elapsed}")

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

DATA : IMDB

In [14]:
data = "imdb"

In [15]:
train_dataset, test_dataset = split_dataset(data)

NVDM_Hidden_Size = HYPERPAREMTER[data]['NVDM_Hidden_Size']
NVDM_Latent_Size = HYPERPAREMTER[data]['NVDM_Latent_Size']
BATCH_SIZE = HYPERPAREMTER[data]['BATCH_SIZE']
VAE_BETA = HYPERPAREMTER[data]['VAE_BETA']
TOPIC_BERT_ALPHA = HYPERPAREMTER[data]['TOPIC_BERT_ALPHA']
TOPIC_BERT_HIDDEN_SIZE = HYPERPAREMTER[data]['TOPIC_BERT_HIDDEN_SIZE']
TOPIC_BERT_OUTPUT_SIZE = HYPERPAREMTER[data]['TOPIC_BERT_OUTPUT_SIZE']
EPOCHS = HYPERPAREMTER[data]['EPOCHS']
LR = HYPERPAREMTER[data]['LR']
EVAL = HYPERPAREMTER[data]['EVAL']



  0%|          | 0/3 [00:00<?, ?it/s]



IMDB bow tokenize

In [16]:
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# Define additional stop words
additional_stopwords = ["br", "''", "``", "n't", "...", "--", "'s", "movie", "film", "one"]

# Combine default and additional stop words
stop_words = set(stopwords.words('english') + list(string.punctuation) + additional_stopwords)

# Tokenize the text
tokenized_text = [word_tokenize(text.lower()) for text in train_dataset["text"]]

# Remove stop words and punctuation, then calculate word frequencies
filtered_tokenized_text = [[word for word in text if not word in stop_words] for text in tokenized_text]
freq_dist = FreqDist(word for text in filtered_tokenized_text for word in text)

# Get 2000 most common BoW_words
most_common = freq_dist.most_common(2000)
BoW_words = [word for word,_ in most_common]

def BoW_tokenize_function(examples):
    text = examples["text"]
    BoW = [0] * 2000
    tokenized_text = word_tokenize(text.lower())

    # Remove stop words and punctuation
    filtered_text = [word for word in tokenized_text if not word in stop_words]

    # Calculate count of each word in BoW_words in the text
    for i, word in enumerate(BoW_words):
        BoW[i] = filtered_text.count(word)

    # Calculate total count of the words
    total_count = sum(BoW)

    # Normalize the BoW with the total count
    BoW = [count / total_count for count in BoW]

    return {"BoW": BoW}

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
print(most_common)



Train Topic BERT (Dataset : imdb)

In [18]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 3.74112606048584, Elapsed time: 0:00:57
Test Accuracy: 0.895
Test Macro F1 Score: 0.891
Test Loss: 3.852
Epoch 4, Loss: 3.6525015830993652, Elapsed time: 0:00:57
Test Accuracy: 0.875
Test Macro F1 Score: 0.867
Test Loss: 3.864
Epoch 6, Loss: 3.5952839851379395, Elapsed time: 0:00:57
Test Accuracy: 0.897
Test Macro F1 Score: 0.892
Test Loss: 3.801
Epoch 8, Loss: 3.558215379714966, Elapsed time: 0:00:57
Test Accuracy: 0.890
Test Macro F1 Score: 0.884
Test Loss: 3.822
Epoch 10, Loss: 3.532688617706299, Elapsed time: 0:00:57
Test Accuracy: 0.890
Test Macro F1 Score: 0.884
Test Loss: 3.818
Total training took 0:09:46


Train DistilBERT_MLP (Dataset : imdb)

In [19]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(distilBERT_tokenize_function)
total_test = test_dataset.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
model = DistilBERT_MLP(hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE).to(DEVICE)
optimizer = Adam(model.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(model, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)




Epoch 2, Loss: 0.04989732801914215, Elapsed time: 0:00:48
Test Accuracy: 0.910
Test Macro F1 Score: 0.906
Test Loss: 0.248
Epoch 4, Loss: 0.01041967049241066, Elapsed time: 0:00:48
Test Accuracy: 0.890
Test Macro F1 Score: 0.884
Test Loss: 0.439
Epoch 6, Loss: 0.008572302758693695, Elapsed time: 0:00:48
Test Accuracy: 0.895
Test Macro F1 Score: 0.888
Test Loss: 0.422
Epoch 8, Loss: 0.00042274565203115344, Elapsed time: 0:00:48
Test Accuracy: 0.885
Test Macro F1 Score: 0.879
Test Loss: 0.524
Epoch 10, Loss: 0.00036452588392421603, Elapsed time: 0:00:48
Test Accuracy: 0.877
Test Macro F1 Score: 0.872
Test Loss: 0.606
Total training took 0:08:09


Train BERT_MLP (Dataset : imdb)

In [20]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(BERT_tokenize_function)
total_test = test_dataset.map(BERT_tokenize_function)

# Initialize Models and Optimizer
model = BERT_MLP(hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE).to(DEVICE)
optimizer = Adam(model.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(model, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)




Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 0.015240914188325405, Elapsed time: 0:01:32
Test Accuracy: 0.915
Test Macro F1 Score: 0.910
Test Loss: 0.243
Epoch 4, Loss: 0.0011954039800912142, Elapsed time: 0:01:32
Test Accuracy: 0.907
Test Macro F1 Score: 0.901
Test Loss: 0.393
Epoch 6, Loss: 0.010879521258175373, Elapsed time: 0:01:32
Test Accuracy: 0.902
Test Macro F1 Score: 0.897
Test Loss: 0.370
Epoch 8, Loss: 0.0002310553245479241, Elapsed time: 0:01:31
Test Accuracy: 0.897
Test Macro F1 Score: 0.891
Test Loss: 0.508
Epoch 10, Loss: 0.00011663106852211058, Elapsed time: 0:01:32
Test Accuracy: 0.897
Test Macro F1 Score: 0.891
Test Loss: 0.563
Total training took 0:15:34


DATA : 20NS

In [21]:
data = "20NS"

In [22]:
train_dataset, test_dataset = split_dataset(data)

NVDM_Hidden_Size = HYPERPAREMTER[data]['NVDM_Hidden_Size']
NVDM_Latent_Size = HYPERPAREMTER[data]['NVDM_Latent_Size']
BATCH_SIZE = HYPERPAREMTER[data]['BATCH_SIZE']
VAE_BETA = HYPERPAREMTER[data]['VAE_BETA']
TOPIC_BERT_ALPHA = HYPERPAREMTER[data]['TOPIC_BERT_ALPHA']
TOPIC_BERT_HIDDEN_SIZE = HYPERPAREMTER[data]['TOPIC_BERT_HIDDEN_SIZE']
TOPIC_BERT_OUTPUT_SIZE = HYPERPAREMTER[data]['TOPIC_BERT_OUTPUT_SIZE']
EPOCHS = HYPERPAREMTER[data]['EPOCHS']
LR = HYPERPAREMTER[data]['LR']
EVAL = HYPERPAREMTER[data]['EVAL']

20NS bow tokenize

In [23]:
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# Define additional stop words
additional_stopwords = ["br", "''", "``", "n't", "...", "--", "'s", "movie", "film", "one", "'ax", "subject", "lines"]

# Combine default and additional stop words
stop_words = set(stopwords.words('english') + list(string.punctuation) + additional_stopwords)

# Tokenize the text
tokenized_text = [word_tokenize(text.lower()) for text in train_dataset["text"]]

# Remove stop words and punctuation, then calculate word frequencies
filtered_tokenized_text = [[word for word in text if not word in stop_words] for text in tokenized_text]
freq_dist = FreqDist(word for text in filtered_tokenized_text for word in text)

# Get 2000 most common BoW_words
most_common = freq_dist.most_common(2000)
BoW_words = [word for word,_ in most_common]

def BoW_tokenize_function(examples):
    text = examples["text"]
    BoW = [0] * 2000
    tokenized_text = word_tokenize(text.lower())

    # Remove stop words and punctuation
    filtered_text = [word for word in tokenized_text if not word in stop_words]

    # Calculate count of each word in BoW_words in the text
    for i, word in enumerate(BoW_words):
        BoW[i] = filtered_text.count(word)

    # Calculate total count of the words
    total_count = sum(BoW)

    # Normalize the BoW with the total count
    BoW = [count / total_count for count in BoW]

    return {"BoW": BoW}

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
print(most_common)



Train Topic BERT (Dataset : 20NS)

In [25]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 3.928354024887085, Elapsed time: 0:00:58
Test Accuracy: 0.780
Test Macro F1 Score: 0.678
Test Loss: 4.100
Epoch 4, Loss: 3.691959857940674, Elapsed time: 0:00:58
Test Accuracy: 0.792
Test Macro F1 Score: 0.700
Test Loss: 4.000
Epoch 6, Loss: 3.6298272609710693, Elapsed time: 0:00:58
Test Accuracy: 0.833
Test Macro F1 Score: 0.748
Test Loss: 3.952
Epoch 8, Loss: 3.5947704315185547, Elapsed time: 0:00:58
Test Accuracy: 0.825
Test Macro F1 Score: 0.729
Test Loss: 3.940
Epoch 10, Loss: 3.5744593143463135, Elapsed time: 0:00:58
Test Accuracy: 0.830
Test Macro F1 Score: 0.738
Test Loss: 3.933
Total training took 0:09:53


Train DistilBERT_MLP (Dataset : 20NS)

In [26]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(distilBERT_tokenize_function)
total_test = test_dataset.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
model = DistilBERT_MLP(hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE).to(DEVICE)
optimizer = Adam(model.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(model, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)


Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 0.3916989862918854, Elapsed time: 0:00:48
Test Accuracy: 0.795
Test Macro F1 Score: 0.696
Test Loss: 0.711
Epoch 4, Loss: 0.03720424696803093, Elapsed time: 0:00:48
Test Accuracy: 0.820
Test Macro F1 Score: 0.723
Test Loss: 0.604
Epoch 6, Loss: 0.032865021377801895, Elapsed time: 0:00:48
Test Accuracy: 0.823
Test Macro F1 Score: 0.721
Test Loss: 0.714
Epoch 8, Loss: 0.007771733216941357, Elapsed time: 0:00:48
Test Accuracy: 0.835
Test Macro F1 Score: 0.738
Test Loss: 0.689
Epoch 10, Loss: 0.004873966798186302, Elapsed time: 0:00:48
Test Accuracy: 0.843
Test Macro F1 Score: 0.740
Test Loss: 0.768
Total training took 0:08:12


Train BERT_MLP (Dataset : 20NS)

In [27]:
# Train and Evaluation Tokenization
total_train = train_dataset.map(BERT_tokenize_function)
total_test = test_dataset.map(BERT_tokenize_function)

# Initialize Models and Optimizer
model = BERT_MLP(hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE).to(DEVICE)
optimizer = Adam(model.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(model, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)


Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 0.1956147700548172, Elapsed time: 0:01:32
Test Accuracy: 0.815
Test Macro F1 Score: 0.721
Test Loss: 0.651
Epoch 4, Loss: 0.02384248748421669, Elapsed time: 0:01:32
Test Accuracy: 0.835
Test Macro F1 Score: 0.752
Test Loss: 0.610
Epoch 6, Loss: 0.010395611636340618, Elapsed time: 0:01:32
Test Accuracy: 0.828
Test Macro F1 Score: 0.736
Test Loss: 0.677
Epoch 8, Loss: 0.004651889204978943, Elapsed time: 0:01:32
Test Accuracy: 0.848
Test Macro F1 Score: 0.766
Test Loss: 0.642
Epoch 10, Loss: 0.00277692754752934, Elapsed time: 0:01:32
Test Accuracy: 0.860
Test Macro F1 Score: 0.785
Test Loss: 0.683
Total training took 0:15:37


Additional (VAE_BETA = 10 & 0.1) DATA : 20NS

In [28]:
VAE_BETA = 10

# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 3.9142658710479736, Elapsed time: 0:00:58
Test Accuracy: 0.805
Test Macro F1 Score: 0.716
Test Loss: 4.079
Epoch 4, Loss: 3.686746835708618, Elapsed time: 0:00:58
Test Accuracy: 0.815
Test Macro F1 Score: 0.706
Test Loss: 3.978
Epoch 6, Loss: 3.652850866317749, Elapsed time: 0:00:58
Test Accuracy: 0.818
Test Macro F1 Score: 0.719
Test Loss: 3.948
Epoch 8, Loss: 3.587318181991577, Elapsed time: 0:00:58
Test Accuracy: 0.812
Test Macro F1 Score: 0.713
Test Loss: 3.963
Epoch 10, Loss: 3.5773727893829346, Elapsed time: 0:00:58
Test Accuracy: 0.833
Test Macro F1 Score: 0.741
Test Loss: 3.940
Total training took 0:09:53


In [29]:
VAE_BETA = 0.1

# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 3.89190411567688, Elapsed time: 0:00:58
Test Accuracy: 0.780
Test Macro F1 Score: 0.665
Test Loss: 4.077
Epoch 4, Loss: 3.691689968109131, Elapsed time: 0:00:58
Test Accuracy: 0.823
Test Macro F1 Score: 0.722
Test Loss: 3.982
Epoch 6, Loss: 3.6274285316467285, Elapsed time: 0:00:58
Test Accuracy: 0.833
Test Macro F1 Score: 0.727
Test Loss: 3.949
Epoch 8, Loss: 3.58308744430542, Elapsed time: 0:00:58
Test Accuracy: 0.828
Test Macro F1 Score: 0.723
Test Loss: 3.962
Epoch 10, Loss: 3.578399896621704, Elapsed time: 0:00:58
Test Accuracy: 0.833
Test Macro F1 Score: 0.720
Test Loss: 3.949
Total training took 0:09:54


Additional (TOPIC_BERT_ALPHA = 0.1 & 0.9) DATA : 20NS

In [30]:
VAE_BETA = 1
TOPIC_BERT_ALPHA = 0.1

# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 6.744850158691406, Elapsed time: 0:00:58
Test Accuracy: 0.810
Test Macro F1 Score: 0.720
Test Loss: 6.775
Epoch 4, Loss: 6.644150257110596, Elapsed time: 0:00:58
Test Accuracy: 0.805
Test Macro F1 Score: 0.713
Test Loss: 6.646
Epoch 6, Loss: 6.531432628631592, Elapsed time: 0:00:58
Test Accuracy: 0.825
Test Macro F1 Score: 0.729
Test Loss: 6.550
Epoch 8, Loss: 6.460189342498779, Elapsed time: 0:00:58
Test Accuracy: 0.830
Test Macro F1 Score: 0.739
Test Loss: 6.481
Epoch 10, Loss: 6.435951232910156, Elapsed time: 0:00:58
Test Accuracy: 0.830
Test Macro F1 Score: 0.738
Test Loss: 6.436
Total training took 0:09:53


In [31]:
VAE_BETA = 1
TOPIC_BERT_ALPHA = 0.9

# Train and Evaluation Tokenization
total_train = train_dataset.map(BoW_tokenize_function)
total_train = total_train.map(distilBERT_tokenize_function)
total_test = test_dataset.map(BoW_tokenize_function)
total_test = total_test.map(distilBERT_tokenize_function)

# Initialize Models and Optimizer
nvdm_model = NVDM(vocab_size=len(BoW_words), hidden_size=NVDM_Hidden_Size, latent_size=NVDM_Latent_Size, beta=VAE_BETA).to(DEVICE)
topic_bert = TopicBERT(nvdm_model, hidden_size=TOPIC_BERT_HIDDEN_SIZE, output_size=TOPIC_BERT_OUTPUT_SIZE, alpha=TOPIC_BERT_ALPHA).to(DEVICE)
optimizer = Adam(topic_bert.parameters(), lr=LR)

# Initialize Dataloader
train_dataloader = DataLoader(total_train, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(total_test, batch_size=BATCH_SIZE)

# Train and Evaluate the model
train_and_evaluate_model(topic_bert, train_dataloader, test_dataloader, optimizer, EPOCHS, EVAL)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Epoch 2, Loss: 1.023411512374878, Elapsed time: 0:00:58
Test Accuracy: 0.800
Test Macro F1 Score: 0.702
Test Loss: 1.355
Epoch 4, Loss: 0.7796066999435425, Elapsed time: 0:00:58
Test Accuracy: 0.815
Test Macro F1 Score: 0.724
Test Loss: 1.335
Epoch 6, Loss: 0.7416485548019409, Elapsed time: 0:00:58
Test Accuracy: 0.815
Test Macro F1 Score: 0.719
Test Loss: 1.434
Epoch 8, Loss: 0.7204241156578064, Elapsed time: 0:00:58
Test Accuracy: 0.825
Test Macro F1 Score: 0.733
Test Loss: 1.408
Epoch 10, Loss: 0.7184390425682068, Elapsed time: 0:00:58
Test Accuracy: 0.830
Test Macro F1 Score: 0.732
Test Loss: 1.469
Total training took 0:09:53


# Result

- $T_{epoch}$ (google colab A100 기준)
    - TopicBERT : 58s
    - DistilBERT : 48s
    - BERT : 1m 32s

- Accuracy (after 10 epochs)
    - IMDB : DistilBERT < TopicBERT < BERT
    - 20NS : TopicBERT < DistilBERT < BERT

- macro-F1 score (after 10 epochs)
    - IMDB : DistilBERT < TopicBERT < BERT
    - 20NS : TopicBERT < DistilBERT < BERT
        - DistilBERT < TopicBERT (VAE_BETA = 10)

현재까지의 결과로 보았을 때 TopicBERT가 DistilBERT보다 큰 메리트는 없었다. 하지만, 본 project는 dataset size(train dataset = 4000)가 실제 paper보다 훨씬 작아서 그럴 수 있을 것으로 추정된다. NVDM(VAE)이 학습하기에는 너무 적은 dataset으로 판단된다.

또한 TopicBERT에 맞게 하이퍼파라미터 튜닝을 한다면 더 좋은 결과를 얻을 수 있을 것으로 생각한다.

한편 VAE_BETA라는 hyperparameter를 추가시켜보았는데 VAE_BETA가 커짐에 따라 F1 score가 좋아짐을 확인할 수 있긴했지만, 이 역시도 많은 차이를 띄지는 않은 것을 확인할 수 있다. 본 논문에서는 epochs = 15로 했으며 project에서는 epochs = 10으로 진행함에 따라 이 역시도 더 큰 epoch로 학습하면 다른 결과를 얻을 수 있을 것으로 판단된다.
