<a href="https://colab.research.google.com/github/VanHoann/Yelp_Dataset_Challenges/blob/main/Sentiment_Analysis/Final_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
!nvidia-smi

Fri Sep  9 04:53:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install -q tqdm
!pip install -q torch
!pip install -q transformers
!pip install -q tensorflow-gpu
!pip install -q nltk
!pip install -q scikit-learn
!pip install -q absl-py
!pip install -q pandas

[K     |████████████████████████████████| 4.7 MB 37.1 MB/s 
[K     |████████████████████████████████| 6.6 MB 54.6 MB/s 
[K     |████████████████████████████████| 120 kB 50.2 MB/s 
[K     |████████████████████████████████| 578.0 MB 16 kB/s 
[K     |████████████████████████████████| 438 kB 60.5 MB/s 
[K     |████████████████████████████████| 5.9 MB 45.9 MB/s 
[K     |████████████████████████████████| 1.7 MB 56.5 MB/s 
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220719082949 requires keras<2.9,>=2.8.0rc0, but you have keras 2.10.0 which is incompatible.
tensorflow 2.8.2+zzzcolab20220719082949 requires tensorboard<2.9,>=2.8, but you have tensorboard 2.10.0 which is incompatible.
tensorflow 2.8.2+zzzcolab20220719082949 requires tensorflow-estimator<2.9,>=2.8, but you have tensorflow-estimator 2.10.0 which is incompatible.[

Useful link:
* https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/


# Preprocess

In [4]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import os
import numpy as np

nltk.download("stopwords")
nltk.download("punkt")

stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
def lower(s):
    """
    :param s: a string.
    return a string with lower characters
    Note that we allow the input to be nested string of a list.
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: 'text mining is to identify useful information.'
    """
    if isinstance(s, list):
        return [lower(t) for t in s]
    if isinstance(s, str):
        return s.lower()
    else:
        raise NotImplementedError("unknown datatype")


def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)


def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]


def n_gram(tokens, n=1):
    """
    :param tokens: a list of tokens, type: list
    :param n: the corresponding n-gram, type: int
    return a list of n-gram tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.'], 2
    Output: ['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
    """
    if n == 1:
        return tokens
    else:
        results = list()
        for i in range(len(tokens) - n + 1):
            # tokens[i:i+n] will return a sublist from i th to i+n th (i+n th is not included)
            results.append(" ".join(tokens[i:i + n]))
        return results


def filter_stopwords(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of filtered tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    Output: ['text', 'mine', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     if token not in stopwords and not token.isnumeric():
    #         results.append(token)
    # return results

    return [token for token in tokens if token not in stopwords and not token.isnumeric()]


def get_pretrained_embedding(file_path, tokenizer, embedding_dim):
    if not os.path.exists(file_path):
        return None
    embeddings_index = {}
    with open(file_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

# Dataset

In [7]:
data = 'https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data'

In [9]:
import pandas as pd
pd.read_csv(f"{data}/train.csv")[:2]

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,JCZEK7wiazoM6xiq8YeZyw,1,2018-01-16 20:13:13,1,oxj0_2jKOqQFIWEYRjWi6g,5,I've been here a handful of times now and I've...,1,1fq-gL1i_8xKhc9VgOZDGw
1,ALn_0f-Usn3n0a9WBcjhhg,0,2018-04-10,0,gZITaUSvzBUijZvNGXO_Cg,1,The service was terrible. The food was just ok...,0,wqG3PCf8ufXId2RG0oBufA


In [8]:
import os

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight
import torch
from transformers import BertTokenizer, RobertaTokenizer, XLNetTokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [9]:
class SentimentDataset(torch.utils.data.Dataset):

    def __init__(self,
                 root,
                 mode,
                 model_name,
                 framework="pt",
                 max_length=256,
                 columns=["cool", "funny", "useful"],
                 tokenizer=None,
                 use_uncased=False):
        self.root = root
        self.mode = mode
        self.data_file = pd.read_csv(os.path.join(self.root, f"{self.mode}.csv"))
        self.framework = framework
        if use_uncased:
            self.data_file['text'] = self.data_file['text'].map(lower)

        self.review_texts = None

        if "roberta" in model_name:
            tokenizer_base = RobertaTokenizer
        elif "bert" in model_name:
            tokenizer_base = BertTokenizer
        elif "xlnet" in model_name:
            tokenizer_base = XLNetTokenizer
        else:
            raise NotImplementedError
            
        self.tokenizer = tokenizer_base.from_pretrained(model_name)
        self.max_length = max_length

        if self.review_texts is None:
            self.review_texts = self.data_file["text"].to_list()
        if mode != "test":
            self.stars = self.data_file["stars"].to_numpy()
            self.stars -= 1  # 1~5 -> 0~4

        if len(columns) == 0:
            self.other_features = None
            return

        # normalize other features to 0~1
        self.other_features = MinMaxScaler().fit_transform(
            self.data_file[columns].to_numpy())

    def __len__(self):
        return len(self.review_texts)

    def __getitem__(self, idx):
        text = self.review_texts[idx]
        if self.mode != "test":
            label = self.stars[idx]

        encoded = self.tokenizer.encode_plus(text,
                                             add_special_tokens=True,
                                             max_length=self.max_length,
                                             return_token_type_ids=False,
                                             padding='max_length',
                                             return_attention_mask=True,
                                             return_tensors=self.framework,
                                             truncation=True)

        data = {
            "input_ids": encoded["input_ids"][0],
            "attention_mask": encoded["attention_mask"][0]
        }
        if self.mode != "test":
            data["label"] = label

        if self.other_features is not None:
            data["features"] = torch.FloatTensor(self.other_features[idx])
        return data

    def get_class_weights(self):
        if self.mode == "test":
            return None
        return compute_class_weight('balanced',
                                    classes=np.unique(self.stars),
                                    y=self.stars)

    def get_keras_data(self):
        data = self.tokenizer.texts_to_sequences(self.review_texts)
        data = [pad_sequences(data, maxlen=self.max_length), self.other_features]

        return data, self.stars


def create_dataloader(root,
                      mode,
                      model_name,
                      batch_size=32,
                      max_length=256,
                      columns=["cool", "funny", "useful"],
                      use_uncased=False):
    review_ds = SentimentDataset(root,
                                 mode,
                                 model_name,
                                 max_length=max_length,
                                 columns=columns,
                                 use_uncased=use_uncased)

    # shuffle the dataset if it is not test dataset
    dataloader = torch.utils.data.DataLoader(review_ds,
                                             batch_size=batch_size,
                                             shuffle=mode == "train")

    class_weights = review_ds.get_class_weights()

    return dataloader, class_weights

# Model

In [10]:
from transformers import BertModel, RobertaModel, XLNetModel
import torch
from torch import nn
from tensorflow import keras
import tensorflow as tf

In [11]:
class TransformerSentimentAnalyzer(nn.Module):

    def __init__(self,
                 model_name,
                 num_class=5,
                 num_other_features=3,
                 hidden_size=10,
                 dropout_rate=0.3,
                 use_pooled=True):
        super().__init__()
        self.use_pooled = use_pooled

        if "roberta" in model_name:
            transformer_base = RobertaModel
        elif "bert" in model_name:
            transformer_base = BertModel
        elif "xlnet" in model_name:
            transformer_base = XLNetModel
            self.use_pooled = False  # no pooler for xlnet

        self.transformer = transformer_base.from_pretrained(model_name)
        if not self.use_pooled:
            self.hidden = nn.Linear(self.transformer.config.hidden_size,
                                    self.transformer.config.hidden_size)
            nn.init.xavier_uniform_(self.hidden.weight, gain=nn.init.calculate_gain('relu'))

        if num_other_features > 0:
            self.fc1 = nn.Linear(num_other_features, hidden_size)
            nn.init.xavier_uniform_(self.fc1.weight, gain=nn.init.calculate_gain('relu'))
            self.other_relu = nn.ReLU()
            self.classifier = nn.Linear(self.transformer.config.hidden_size + hidden_size,
                                        num_class)
        else:
            self.classifier = nn.Linear(self.transformer.config.hidden_size, num_class)

        nn.init.xavier_uniform_(self.classifier.weight)
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, input_ids, attention_mask, other_features): 
        transformer_out = self.transformer(input_ids=input_ids,
                                           attention_mask=attention_mask)
        if self.use_pooled:
            output = transformer_out["pooler_output"]
        else:
            cls_token = transformer_out["last_hidden_state"][:, 0]  # get the [CLS] token
            output = self.hidden(cls_token)
        dropped = self.dropout(output)  # [batch_size, 768]

        if hasattr(self, "fc1"):
            feat = self.fc1(other_features)  # [batch_size, num_other_features]
            feat = self.other_relu(feat)
            final = torch.cat([dropped, feat], axis=1)
        else:
            final = dropped
        return self.classifier(final)

    def count_parameters(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def fix_transformer_stem(self, yes=True):
        if yes:
            for param in self.transformer.parameters():
                param.requires_grad = False
            print(f'Fixed Transformer stem. Total head trainable parameters {self.count_parameters()}')
        else:
            for param in self.transformer.parameters():
                param.requires_grad = True
            print(f'Trained Transformer stem. Total head trainable parameters {self.count_parameters()}')

# Main

In [16]:
import os
import numpy as np
import random

from absl import app
from sklearn.metrics import classification_report, confusion_matrix
import torch
from torch import nn
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm

In [12]:
data

'https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data'

convert this class to `flags` later on

In [37]:
class parameters: #best set
  model_name = 'roberta-base'
  batch_size = 32
  max_len = 256
  use_pooled = True
  other_hidden_dim = 32
  epochs = 3
  eval_every = 300
  lr = 1e-05
  dropout = 0.4
  other_features = []
  data_path = data
  save_path = 'models/{}_bs{}_lr{}_drop{}_hidden{}_seed{}_lpft.pth'
  use_uncased = 0
  use_lpft = 1
  lp_step = 10
  seed = 101

FLAGS = parameters

Set the seed

In [17]:
seed = FLAGS.seed
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True

In [18]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda')

In [19]:
def train(model, data_train, data_val, epochs, device, criterion, 
          optimizer, scheduler, save_path, eval_every, use_lpft, lp_step):
    step = 0
    curr_best_val_f1_macro = 0
    best_val_at_step = 0

    if use_lpft:
        model.fix_transformer_stem(True)

    for epoch in range(epochs):
        model.train()
        train_bar = tqdm(data_train,
                         total=int(len(data_train)),
                         desc=f"train: {epoch + 1} / {epochs}")

        correct_num = 0
        total_num = 0
        running_loss = 0
        for batch in train_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            other_features = batch["features"].to(device) if "features" in batch else None
            label = batch["label"].to(device)
            step += 1
            logits = model(input_ids, attention_mask, other_features)
            predicted = torch.max(logits, dim=1)[1]

            loss = criterion(logits, label)
            running_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            correct_num += (predicted == label).sum().item()
            total_num += label.shape[0]

            train_bar.set_postfix(acc=(correct_num / total_num),
                                  loss=(running_loss / total_num))

            del batch, input_ids, attention_mask, other_features, label, logits, loss, predicted

            if step == lp_step and use_lpft:
                model.fix_transformer_stem(False)

            if step%eval_every == 0 or (step == lp_step and use_lpft):
                model.eval()
                y_pred = []
                y_true = []
                val_running_loss = 0
                
                with torch.no_grad():
                    for batch in data_val:
                        input_ids = batch["input_ids"].to(device)
                        attention_mask = batch["attention_mask"].to(device)
                        other_features = batch["features"].to(
                            device) if "features" in batch else None
                        label = batch["label"].to(device)

                        logits = model(input_ids, attention_mask, other_features)
                        predicted = torch.max(logits, dim=1)[1]

                        loss = criterion(logits, label)
                        val_running_loss += loss.item()

                        y_pred.extend(predicted.tolist())
                        y_true.extend(label.tolist())

                        del batch, input_ids, attention_mask, label, logits, predicted
                
                report = classification_report(y_true, y_pred, output_dict=True)
                print(
                    f"[valid] epoch: {epoch}, global step: {step}, loss: {val_running_loss / len(data_val)},"
                    f" report:\n{classification_report(y_true, y_pred, digits=4)}"
                    f"confusion_matrix:\n{confusion_matrix(y_true, y_pred)}"
                    )

                if report['macro avg']['f1-score'] > curr_best_val_f1_macro:
                    curr_best_val_f1_macro = report["macro avg"]['f1-score']
                    best_val_at_step = step
                    model_dir, name = save_path.rsplit("/", 1)
                    # name = f"acc{curr_best_val_f1_macro}_{name}"
                    os.makedirs(model_dir, exist_ok=True)
                    torch.save(model.state_dict(), os.path.join(model_dir, name))

        print(
            f"[train] epoch: {epoch}, global step: {step}, loss: {running_loss / total_num},"
            f" accuracy: {correct_num / total_num}")

    print(f"[finish] best valid macro avg is {curr_best_val_f1_macro}, achieved at global step {best_val_at_step}")

## Load data


In [23]:
train_dataloader, class_weights = create_dataloader(FLAGS.data_path,
                                                        "train",
                                                        FLAGS.model_name,
                                                        batch_size=FLAGS.batch_size,
                                                        max_length=FLAGS.max_len,
                                                        columns=FLAGS.other_features,
                                                        use_uncased=FLAGS.use_uncased)

Downloading vocab.json:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

In [24]:
print(class_weights) #balanced among #data per class

[1.34730539 2.47083047 1.80995475 0.90657265 0.45506257]


In [25]:
type(train_dataloader)

torch.utils.data.dataloader.DataLoader

In [26]:
len(train_dataloader) #an empty shell to load later on

563

In [27]:
val_dataloader, _ = create_dataloader(FLAGS.data_path,
                                          "valid",
                                          FLAGS.model_name,
                                          batch_size=FLAGS.batch_size,
                                          max_length=FLAGS.max_len,
                                          columns=FLAGS.other_features,
                                          use_uncased=FLAGS.use_uncased)

In [28]:
len(val_dataloader)

63

In [29]:
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[    0,  1779,    38,  ...,     5,  1883,     2],
        [    0, 32136,   460,  ...,     1,     1,     1],
        [    0,   100,   393,  ...,     1,     1,     1],
        ...,
        [    0,  2387,   122,  ...,     1,     1,     1],
        [    0, 40113,    77,  ...,     1,     1,     1],
        [    0, 40907,   636,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'label': tensor([1, 1, 4, 4, 3, 4, 3, 4, 4, 4, 1, 4, 4, 3, 3, 2, 0, 3, 2, 1, 4, 4, 0, 2,
        4, 0, 0, 4, 2, 4, 4, 4])}


## Build model
 

In [30]:
model = TransformerSentimentAnalyzer(FLAGS.model_name,
                                         num_class=5,
                                         num_other_features=len(FLAGS.other_features),
                                         hidden_size=FLAGS.other_hidden_dim,
                                         dropout_rate=FLAGS.dropout,
                                         use_pooled=FLAGS.use_pooled).to(DEVICE)

Downloading pytorch_model.bin:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Training

In [34]:
loss_fn = nn.CrossEntropyLoss(weight=torch.FloatTensor(class_weights).to(DEVICE))
bert_optim = AdamW(model.parameters(), lr=FLAGS.lr, correct_bias=False, no_deprecation_warning=True)

total_steps = len(train_dataloader) * FLAGS.epochs
scheduler = get_linear_schedule_with_warmup(bert_optim,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

model_save_path = FLAGS.save_path.format(FLAGS.model_name, FLAGS.batch_size, FLAGS.lr,
                                         FLAGS.dropout, FLAGS.other_hidden_dim, FLAGS.seed)

print(f'Fixed Transformer stem. Total head trainable parameters {model.count_parameters()}')

Fixed Transformer stem. Total head trainable parameters 124649477


In [38]:
with tf.device('/gpu:0'):
    history = train(model, train_dataloader, val_dataloader, FLAGS.epochs, DEVICE, loss_fn,
          bert_optim, scheduler, model_save_path,
          FLAGS.eval_every, FLAGS.use_lpft, FLAGS.lp_step)

Fixed Transformer stem. Total head trainable parameters 3845


train: 1 / 3:   2%|▏         | 9/563 [00:05<04:52,  1.89it/s, acc=0.662, loss=0.0313]

Trained Transformer stem. Total head trainable parameters 124649477
[valid] epoch: 0, global step: 10, loss: 1.2422451963500372, report:
              precision    recall  f1-score   support

           0     0.5336    0.9858    0.6924       282
           1     0.1622    0.0441    0.0694       136
           2     0.3491    0.4528    0.3943       212
           3     0.5379    0.1524    0.2375       466
           4     0.7691    0.8805    0.8210       904

    accuracy                         0.6235      2000
   macro avg     0.4704    0.5031    0.4429      2000
weighted avg     0.5962    0.6235    0.5706      2000
confusion_matrix:
[[278   0   3   1   0]
 [114   6  14   0   2]
 [ 75  18  96  10  13]
 [ 31   8 132  71 224]
 [ 23   5  30  50 796]]


train: 1 / 3:  53%|█████▎    | 299/563 [06:41<05:37,  1.28s/it, acc=0.675, loss=0.026]

[valid] epoch: 0, global step: 300, loss: 0.8593490208898272, report:
              precision    recall  f1-score   support

           0     0.7647    0.8759    0.8165       282
           1     0.3707    0.5588    0.4457       136
           2     0.3858    0.4858    0.4301       212
           3     0.5042    0.3906    0.4401       466
           4     0.8318    0.7765    0.8032       904

    accuracy                         0.6550      2000
   macro avg     0.5714    0.6175    0.5871      2000
weighted avg     0.6673    0.6550    0.6566      2000
confusion_matrix:
[[247  33   2   0   0]
 [ 43  76  16   1   0]
 [ 13  71 103  22   3]
 [  8  19 118 182 139]
 [ 12   6  28 156 702]]


train: 1 / 3: 100%|██████████| 563/563 [12:46<00:00,  1.36s/it, acc=0.684, loss=0.0255]


[train] epoch: 0, global step: 563, loss: 0.025500442501571442, accuracy: 0.6841111111111111


train: 2 / 3:   6%|▋         | 36/563 [00:49<11:44,  1.34s/it, acc=0.738, loss=0.0224]

[valid] epoch: 1, global step: 600, loss: 0.8130658983238159, report:
              precision    recall  f1-score   support

           0     0.8750    0.7943    0.8327       282
           1     0.4409    0.6029    0.5093       136
           2     0.5440    0.4953    0.5185       212
           3     0.5301    0.6245    0.5734       466
           4     0.8480    0.7655    0.8047       904

    accuracy                         0.6970      2000
   macro avg     0.6476    0.6565    0.6477      2000
weighted avg     0.7178    0.6970    0.7043      2000
confusion_matrix:
[[224  52   3   2   1]
 [ 22  82  30   1   1]
 [  3  46 105  56   2]
 [  5   2  48 291 120]
 [  2   4   7 199 692]]


train: 2 / 3:  60%|█████▉    | 336/563 [07:45<04:51,  1.29s/it, acc=0.759, loss=0.0192]

[valid] epoch: 1, global step: 900, loss: 0.8061559413160596, report:
              precision    recall  f1-score   support

           0     0.8351    0.8262    0.8307       282
           1     0.5124    0.4559    0.4825       136
           2     0.5451    0.6557    0.5953       212
           3     0.5349    0.5923    0.5621       466
           4     0.8396    0.7699    0.8032       904

    accuracy                         0.7030      2000
   macro avg     0.6534    0.6600    0.6548      2000
weighted avg     0.7145    0.7030    0.7071      2000
confusion_matrix:
[[233  36  10   1   2]
 [ 33  62  38   3   0]
 [  5  20 139  46   2]
 [  4   2  55 276 129]
 [  4   1  13 190 696]]


train: 2 / 3: 100%|██████████| 563/563 [13:05<00:00,  1.40s/it, acc=0.763, loss=0.0189]


[train] epoch: 1, global step: 1126, loss: 0.01894877203471131, accuracy: 0.7631666666666667


train: 3 / 3:  13%|█▎        | 73/563 [01:38<10:57,  1.34s/it, acc=0.771, loss=0.0172]

[valid] epoch: 2, global step: 1200, loss: 0.8670355441078307, report:
              precision    recall  f1-score   support

           0     0.8716    0.7943    0.8312       282
           1     0.4611    0.6103    0.5253       136
           2     0.5259    0.6226    0.5702       212
           3     0.5174    0.6073    0.5587       466
           4     0.8614    0.7290    0.7897       904

    accuracy                         0.6905      2000
   macro avg     0.6475    0.6727    0.6550      2000
weighted avg     0.7199    0.6905    0.7005      2000
confusion_matrix:
[[224  52   4   1   1]
 [ 22  83  29   2   0]
 [  3  39 132  36   2]
 [  5   3  72 283 103]
 [  3   3  14 225 659]]


train: 3 / 3:  66%|██████▋   | 373/563 [08:34<04:04,  1.29s/it, acc=0.812, loss=0.0145]

[valid] epoch: 2, global step: 1500, loss: 0.9068262179692587, report:
              precision    recall  f1-score   support

           0     0.8633    0.7837    0.8216       282
           1     0.4702    0.5809    0.5197       136
           2     0.5482    0.5896    0.5682       212
           3     0.5363    0.6030    0.5677       466
           4     0.8447    0.7699    0.8056       904

    accuracy                         0.7010      2000
   macro avg     0.6525    0.6654    0.6565      2000
weighted avg     0.7185    0.7010    0.7078      2000
confusion_matrix:
[[221  52   5   1   3]
 [ 22  79  32   3   0]
 [  5  31 125  48   3]
 [  5   3  55 281 122]
 [  3   3  11 191 696]]


train: 3 / 3: 100%|██████████| 563/563 [13:07<00:00,  1.40s/it, acc=0.821, loss=0.0138]

[train] epoch: 2, global step: 1689, loss: 0.013813368017474811, accuracy: 0.8207222222222222
[finish] best valid macro avg is 0.6565424643618735, achieved at global step 1500





In [39]:
model.count_parameters()

124649477

## Load best model

In [41]:
FLAGS.save_path

'models/{}_bs{}_lr{}_drop{}_hidden{}_seed{}_lpft.pth'

In [42]:
!ls models

roberta-base_bs32_lr1e-05_drop0.4_hidden32_seed101_lpft.pth


In [44]:
model_path = "models/roberta-base_bs32_lr1e-05_drop0.4_hidden32_seed101_lpft.pth"

In [45]:
model.load_state_dict(torch.load(model_path))

<All keys matched successfully>

## Evaluate

In [46]:
def evaluate(model, test_data, device, mode="test", save_name="pred.csv"):
    test_bar = tqdm(test_data, total=int(len(test_data)))

    model.eval()
    preds = []
    if mode == "valid":
        y_true = []
    with torch.no_grad():
        for batch in test_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            other_features = batch["features"].to(device) if "features" in batch else None
            logits = model(input_ids, attention_mask, other_features)
            predicted = torch.max(logits, dim=1)[1]
            preds.extend(predicted.tolist())
            if mode == "valid":
                y_true.extend(batch["label"].tolist())

    if mode == "valid":
        print(classification_report(y_true, preds, digits=4))
    else:
        review_ids = test_data.dataset.data_file["review_id"]
        save_preds(review_ids, np.array(preds), save_name)

def save_preds(review_ids, preds, save_name="pred.csv"):
    answer_df = pd.DataFrame(data={
        'review_id': review_ids,
        'stars': preds + 1,
    })
    answer_df.to_csv(save_name, index=False)

In [47]:
evaluate(model,
             val_dataloader,
             DEVICE,
             mode='valid',
             save_name=FLAGS.save_path)

100%|██████████| 63/63 [00:29<00:00,  2.13it/s]

              precision    recall  f1-score   support

           0     0.8633    0.7837    0.8216       282
           1     0.4702    0.5809    0.5197       136
           2     0.5482    0.5896    0.5682       212
           3     0.5363    0.6030    0.5677       466
           4     0.8447    0.7699    0.8056       904

    accuracy                         0.7010      2000
   macro avg     0.6525    0.6654    0.6565      2000
weighted avg     0.7185    0.7010    0.7078      2000






# Write .py scripts 

## requirements.txt

In [2]:
%%writefile requirements.txt
tqdm
torch
transformers
tensorflow-gpu
nltk
scikit-learn
absl-py
pandas

Overwriting requirements.txt


## preprocess.py

In [4]:
%%writefile preprocess.py
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import os
import numpy as np

nltk.download("stopwords")
nltk.download("punkt")

stopwords = set(stopwords.words('english'))
ps = PorterStemmer()


def lower(s):
    """
    :param s: a string.
    return a string with lower characters
    Note that we allow the input to be nested string of a list.
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: 'text mining is to identify useful information.'
    """
    if isinstance(s, list):
        return [lower(t) for t in s]
    if isinstance(s, str):
        return s.lower()
    else:
        raise NotImplementedError("unknown datatype")


def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)


def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]


def n_gram(tokens, n=1):
    """
    :param tokens: a list of tokens, type: list
    :param n: the corresponding n-gram, type: int
    return a list of n-gram tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.'], 2
    Output: ['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
    """
    if n == 1:
        return tokens
    else:
        results = list()
        for i in range(len(tokens) - n + 1):
            # tokens[i:i+n] will return a sublist from i th to i+n th (i+n th is not included)
            results.append(" ".join(tokens[i:i + n]))
        return results


def filter_stopwords(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of filtered tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    Output: ['text', 'mine', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     if token not in stopwords and not token.isnumeric():
    #         results.append(token)
    # return results

    return [token for token in tokens if token not in stopwords and not token.isnumeric()]


def get_pretrained_embedding(file_path, tokenizer, embedding_dim):
    if not os.path.exists(file_path):
        return None
    embeddings_index = {}
    with open(file_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

Overwriting preprocess.py


## dataset.py

In [5]:
%%writefile dataset.py
import os

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight
import torch
from transformers import BertTokenizer, RobertaTokenizer, XLNetTokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from preprocess import *

class SentimentDataset(torch.utils.data.Dataset):

    def __init__(self,
                 root,
                 mode,
                 model_name,
                 framework="pt",
                 max_length=256,
                 columns=["cool", "funny", "useful"],
                 tokenizer=None,
                 use_uncased=False):
        self.root = root
        self.mode = mode
        self.data_file = pd.read_csv(os.path.join(self.root, f"{self.mode}.csv"))
        self.framework = framework
        if use_uncased:
            self.data_file['text'] = self.data_file['text'].map(lower)

        self.review_texts = None
        if model_name == "lstm-cnn":
            self.review_texts = self.data_file["text"].map(lower).map(tokenize).map(stem)
            if mode != "train":
                assert tokenizer is not None
                assert isinstance(tokenizer, Tokenizer)
                self.tokenizer = tokenizer
            else:
                self.tokenizer = Tokenizer(split=' ', oov_token="[OOV]")
                self.tokenizer.fit_on_texts(self.review_texts)
        else:
            if "roberta" in model_name:
                tokenizer_base = RobertaTokenizer
            elif "bert" in model_name:
                tokenizer_base = BertTokenizer
            elif "xlnet" in model_name:
                tokenizer_base = XLNetTokenizer
            else:
                raise NotImplementedError
            self.tokenizer = tokenizer_base.from_pretrained(model_name)
        self.max_length = max_length

        if self.review_texts is None:
            self.review_texts = self.data_file["text"].to_list()
        if mode != "test":
            self.stars = self.data_file["stars"].to_numpy()
            self.stars -= 1  # 1~5 -> 0~4

        if len(columns) == 0:
            self.other_features = None
            return

        # normalize other features to 0~1
        self.other_features = MinMaxScaler().fit_transform(
            self.data_file[columns].to_numpy())

    def __len__(self):
        return len(self.review_texts)

    def __getitem__(self, idx):
        text = self.review_texts[idx]
        if self.mode != "test":
            label = self.stars[idx]

        encoded = self.tokenizer.encode_plus(text,
                                             add_special_tokens=True,
                                             max_length=self.max_length,
                                             return_token_type_ids=False,
                                             padding='max_length',
                                             return_attention_mask=True,
                                             return_tensors=self.framework,
                                             truncation=True)

        data = {
            "input_ids": encoded["input_ids"][0],
            "attention_mask": encoded["attention_mask"][0]
        }
        if self.mode != "test":
            data["label"] = label

        if self.other_features is not None:
            data["features"] = torch.FloatTensor(self.other_features[idx])
        return data

    def get_class_weights(self):
        if self.mode == "test":
            return None
        return compute_class_weight('balanced',
                                    classes=np.unique(self.stars),
                                    y=self.stars)

    def get_keras_data(self):
        data = self.tokenizer.texts_to_sequences(self.review_texts)
        data = [pad_sequences(data, maxlen=self.max_length), self.other_features]

        return data, self.stars

def create_dataloader(root,
                      mode,
                      model_name,
                      batch_size=32,
                      max_length=256,
                      columns=["cool", "funny", "useful"],
                      use_uncased=False):
    review_ds = SentimentDataset(root,
                                 mode,
                                 model_name,
                                 max_length=max_length,
                                 columns=columns,
                                 use_uncased=use_uncased)

    # shuffle the dataset if it is not test dataset
    dataloader = torch.utils.data.DataLoader(review_ds,
                                             batch_size=batch_size,
                                             shuffle=mode == "train")

    class_weights = review_ds.get_class_weights()

    return dataloader, class_weights

Writing dataset.py


## model.py

In [6]:
%%writefile model.py
from transformers import BertModel, RobertaModel, XLNetModel
import torch
from torch import nn
from tensorflow import keras
import tensorflow as tf


class TransformerSentimentAnalyzer(nn.Module):

    def __init__(self,
                 model_name,
                 num_class=5,
                 num_other_features=3,
                 hidden_size=10,
                 dropout_rate=0.3,
                 use_pooled=True):
        super().__init__()
        self.use_pooled = use_pooled

        if "roberta" in model_name:
            transformer_base = RobertaModel
        elif "bert" in model_name:
            transformer_base = BertModel
        elif "xlnet" in model_name:
            transformer_base = XLNetModel
            self.use_pooled = False  # no pooler for xlnet

        self.transformer = transformer_base.from_pretrained(model_name)
        if not self.use_pooled:
            self.hidden = nn.Linear(self.transformer.config.hidden_size,
                                    self.transformer.config.hidden_size)
            nn.init.xavier_uniform_(self.hidden.weight, gain=nn.init.calculate_gain('relu'))

        if num_other_features > 0:
            self.fc1 = nn.Linear(num_other_features, hidden_size)
            nn.init.xavier_uniform_(self.fc1.weight, gain=nn.init.calculate_gain('relu'))
            self.other_relu = nn.ReLU()
            self.classifier = nn.Linear(self.transformer.config.hidden_size + hidden_size,
                                        num_class)
        else:
            self.classifier = nn.Linear(self.transformer.config.hidden_size, num_class)

        nn.init.xavier_uniform_(self.classifier.weight)
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, input_ids, attention_mask, other_features):
        transformer_out = self.transformer(input_ids=input_ids,
                                           attention_mask=attention_mask)
        if self.use_pooled:
            output = transformer_out["pooler_output"]
        else:
            cls_token = transformer_out["last_hidden_state"][:, 0]  # get the [CLS] token
            output = self.hidden(cls_token)
        dropped = self.dropout(output)  # [batch_size, 768]

        if hasattr(self, "fc1"):
            feat = self.fc1(other_features)  # [batch_size, num_other_features]
            feat = self.other_relu(feat)
            final = torch.cat([dropped, feat], axis=1)
        else:
            final = dropped
        return self.classifier(final)

    def count_parameters(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def fix_transformer_stem(self, yes=True):
        if yes:
            for param in self.transformer.parameters():
                param.requires_grad = False
            print(f'Fixed Transformer stem. Total head trainable parameters {self.count_parameters()}')
        else:
            for param in self.transformer.parameters():
                param.requires_grad = True
            print(f'Trained Transformer stem. Total head trainable parameters {self.count_parameters()}')

Writing model.py


## main.py

In [17]:
%%writefile main.py
import os
import numpy as np
import random

from absl import flags
from absl import app
from sklearn.metrics import classification_report, confusion_matrix
import torch
from torch import nn
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm

from model import TransformerSentimentAnalyzer
from dataset import create_dataloader

flags.DEFINE_string("model_name", "bert-base-cased", "which transformer to use")
flags.DEFINE_integer("batch_size", 16, "batch size: 16 or 32 preferred")
flags.DEFINE_integer("max_len", 256, "max sentence length. max value is 512 for bert")
flags.DEFINE_bool("use_pooled", True, "whether to use pooled output of Bert")
flags.DEFINE_integer("other_hidden_dim", 10, "hidden dim for other features")

flags.DEFINE_integer("epochs", 3, "number of training epochs")
flags.DEFINE_integer("eval_every", 50, "number of training steps after each the model is evaluated")
flags.DEFINE_float("lr", 2e-5, "learning rate. Preferred 2e-5, 3e-5, 5e-5")

flags.DEFINE_float("dropout", 0.3, "dropout rate")
flags.DEFINE_list("other_features", [],
                  "other feature aggregations to use")

flags.DEFINE_string("data_path", "data", "data directory path")
flags.DEFINE_string("save_path", "models/{}_bs{}_lr{}_drop{}_hidden{}_seed{}.pth",
                    "where to save the model")

flags.DEFINE_integer("use_uncased", 0, "help to experiment with RoBERTa uncased")
flags.DEFINE_integer("use_lpft", 0, "whether to apply the method of Linear Probing and Finetuning. If True, in the lp_step steps, only train classifier head, after that finetune the whole model.")
flags.DEFINE_integer("lp_step", 100, "number of Linear Probing")

flags.DEFINE_integer("seed", 101, "to reproduce the experiment")

FLAGS = flags.FLAGS
#HP

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def train(model, data_train, data_val, epochs, device, criterion, 
          optimizer, scheduler, save_path, eval_every, use_lpft, lp_step):
    step = 0
    curr_best_val_f1_macro = 0
    best_val_at_step = 0

    if use_lpft:
        model.fix_transformer_stem(True)

    for epoch in range(epochs):
        model.train()
        train_bar = tqdm(data_train,
                         total=int(len(data_train)),
                         desc=f"train: {epoch + 1} / {epochs}")

        correct_num = 0
        total_num = 0
        running_loss = 0
        for batch in train_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            other_features = batch["features"].to(device) if "features" in batch else None
            label = batch["label"].to(device)
            step += 1
            logits = model(input_ids, attention_mask, other_features)
            predicted = torch.max(logits, dim=1)[1]

            loss = criterion(logits, label)
            running_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            correct_num += (predicted == label).sum().item()
            total_num += label.shape[0]

            train_bar.set_postfix(acc=(correct_num / total_num),
                                  loss=(running_loss / total_num))

            del batch, input_ids, attention_mask, other_features, label, logits, loss, predicted

            if step == lp_step and use_lpft:
                model.fix_transformer_stem(False)

            if step%eval_every == 0 or (step == lp_step and use_lpft):
                model.eval()
                y_pred = []
                y_true = []
                val_running_loss = 0
                
                with torch.no_grad():
                    for batch in data_val:
                        input_ids = batch["input_ids"].to(device)
                        attention_mask = batch["attention_mask"].to(device)
                        other_features = batch["features"].to(
                            device) if "features" in batch else None
                        label = batch["label"].to(device)

                        logits = model(input_ids, attention_mask, other_features)
                        predicted = torch.max(logits, dim=1)[1]

                        loss = criterion(logits, label)
                        val_running_loss += loss.item()

                        y_pred.extend(predicted.tolist())
                        y_true.extend(label.tolist())

                        del batch, input_ids, attention_mask, label, logits, predicted
                
                report = classification_report(y_true, y_pred, output_dict=True)
                print(
                    f"[valid] epoch: {epoch}, global step: {step}, loss: {val_running_loss / len(data_val)},"
                    f" report:\n{classification_report(y_true, y_pred, digits=4)}"
                    f"confusion_matrix:\n{confusion_matrix(y_true, y_pred)}"
                    )

                if report['macro avg']['f1-score'] > curr_best_val_f1_macro:
                    curr_best_val_f1_macro = report["macro avg"]['f1-score']
                    best_val_at_step = step
                    model_dir, name = save_path.rsplit("/", 1)
                    # name = f"acc{curr_best_val_f1_macro}_{name}"
                    os.makedirs(model_dir, exist_ok=True)
                    torch.save(model.state_dict(), os.path.join(model_dir, name))

        print(
            f"[train] epoch: {epoch}, global step: {step}, loss: {running_loss / total_num},"
            f" accuracy: {correct_num / total_num}")

    print(f"[finish] best valid macro avg is {curr_best_val_f1_macro}, achieved at global step {best_val_at_step}")


def main(args):
    del args  # not used
        
    seed = FLAGS.seed
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

    train_dataloader, class_weights = create_dataloader(FLAGS.data_path,
                                                        "train",
                                                        FLAGS.model_name,
                                                        batch_size=FLAGS.batch_size,
                                                        max_length=FLAGS.max_len,
                                                        columns=FLAGS.other_features,
                                                        use_uncased=FLAGS.use_uncased)
    val_dataloader, _ = create_dataloader(FLAGS.data_path,
                                          "valid",
                                          FLAGS.model_name,
                                          batch_size=FLAGS.batch_size,
                                          max_length=FLAGS.max_len,
                                          columns=FLAGS.other_features,
                                          use_uncased=FLAGS.use_uncased)

    model = TransformerSentimentAnalyzer(FLAGS.model_name,
                                         num_class=5,
                                         num_other_features=len(FLAGS.other_features),
                                         hidden_size=FLAGS.other_hidden_dim,
                                         dropout_rate=FLAGS.dropout,
                                         use_pooled=FLAGS.use_pooled).to(DEVICE)

    loss_fn = nn.CrossEntropyLoss(weight=torch.FloatTensor(class_weights).to(DEVICE))
    bert_optim = AdamW(model.parameters(), lr=FLAGS.lr, correct_bias=False)

    total_steps = len(train_dataloader) * FLAGS.epochs
    scheduler = get_linear_schedule_with_warmup(bert_optim,
                                                num_warmup_steps=0,
                                                num_training_steps=total_steps)

    model_save_path = FLAGS.save_path.format(FLAGS.model_name, FLAGS.batch_size, FLAGS.lr,
                                             FLAGS.dropout, FLAGS.other_hidden_dim, FLAGS.seed)

    print(f'Fixed Transformer stem. Total head trainable parameters {model.count_parameters()}')
    train(model, train_dataloader, val_dataloader, FLAGS.epochs, DEVICE, loss_fn,
          bert_optim, scheduler, model_save_path, 
          FLAGS.eval_every, FLAGS.use_lpft, FLAGS.lp_step)


if __name__ == "__main__":
    app.run(main)

Overwriting main.py


## evaluate.py

In [22]:
%%writefile evaluate.py
import os

from absl import flags
from absl import app
import pandas as pd
import torch
from tqdm import tqdm
import numpy as np
from sklearn.metrics import classification_report

from model import TransformerSentimentAnalyzer
from dataset import create_dataloader

flags.DEFINE_string("data_path", "data_2021_spring", "data directory path")
flags.DEFINE_integer("max_len", 256, "max sentence length. max value is 512 for bert")
flags.DEFINE_list("other_features", [],
                  "other feature aggregations to use")
flags.DEFINE_string("model_path", None, "where to save the model", required=True)
flags.DEFINE_bool("use_pooled", True, "whether to use pooled output of Bert")
flags.DEFINE_string("save_path", "preds/pred.csv", "name of the file to save predictions")
flags.DEFINE_string("which_data", "test", "which data to evaluate on. valid or test")

FLAGS = flags.FLAGS

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def evaluate(model, test_data, device, mode="test", save_name="pred.csv"):
    test_bar = tqdm(test_data, total=int(len(test_data)))

    model.eval()
    preds = []
    if mode == "valid":
        y_true = []
    with torch.no_grad():
        for batch in test_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            other_features = batch["features"].to(device) if "features" in batch else None
            logits = model(input_ids, attention_mask, other_features)
            predicted = torch.max(logits, dim=1)[1]
            preds.extend(predicted.tolist())
            if mode == "valid":
                y_true.extend(batch["label"].tolist())

    if mode == "valid":
        print(classification_report(y_true, preds, digits=4))
    else:
        review_ids = test_data.dataset.data_file["review_id"]
        save_preds(review_ids, np.array(preds), save_name)


def save_preds(review_ids, preds, save_name="pred.csv"):
    answer_df = pd.DataFrame(data={
        'review_id': review_ids,
        'stars': preds + 1,
    })
    answer_df.to_csv(save_name, index=False)


def main(args):
    del args  # unused

    ckpt_name = os.path.basename(FLAGS.model_path)
    ckpt_name = ckpt_name.rsplit(".", 1)[0]
    try:
        model_name, batch_size, lr, dropout, hidden, seed, _ = ckpt_name.split("_")
    except:
        model_name, batch_size, lr, dropout, hidden, seed = ckpt_name.split("_")
    batch_size = int(batch_size[2:])
    dropout = float(dropout[4:])
    hidden = int(hidden[6:])

    test_dataloader, _ = create_dataloader(FLAGS.data_path,
                                           FLAGS.which_data,
                                           model_name,
                                           batch_size=batch_size,
                                           max_length=FLAGS.max_len,
                                           columns=FLAGS.other_features)

    model = TransformerSentimentAnalyzer(model_name,
                                         num_class=5,
                                         num_other_features=len(FLAGS.other_features),
                                         hidden_size=hidden,
                                         dropout_rate=dropout,
                                         use_pooled=FLAGS.use_pooled).to(DEVICE)
    model.load_state_dict(torch.load(FLAGS.model_path))
    model.eval()

    evaluate(model,
             test_dataloader,
             DEVICE,
             mode=FLAGS.which_data,
             save_name=FLAGS.save_path)


if __name__ == "__main__":
    app.run(main)

Writing evaluate.py


## Run

In [12]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
[K     |████████████████████████████████| 4.7 MB 5.1 MB/s 
[?25hCollecting tensorflow-gpu
  Downloading tensorflow_gpu-2.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (578.0 MB)
[K     |████████████████████████████████| 578.0 MB 16 kB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
[K     |████████████████████████████████| 120 kB 62.1 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 29.0 MB/s 
Collecting gast<=0.4.0,>=0.2.1
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting tensorflow-estimator<2.11,>=2.10.0
  Downloading tensorflow_estimator-2.10.0-py2.py3-none-any.whl (

In [13]:
!python main.py --help

2022-09-09 06:00:12.923382: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-09 06:00:13.136061: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-09 06:00:14.027915: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2022-09-09 06:00:14.028092: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open 

### Train

In [14]:
data

'https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data'

In [20]:
!CUDA_VISIBLE_DEVICES=0 python main.py --data_path https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data \
  --model_name roberta-base \
  --batch_size 32 \
  --max_len 256 \
  --epochs 3 \
  --lr 1e-5 \
  --dropout 0.4 \
  --save_path models/{}_bs{}_lr{}_drop{}_hidden{}_seed{}_lpft.pth \
  --use_pooled \
  --other_hidden_dim 32 \
  --eval_every 50 \
  --use_lpft 1 \
  --lp_step 10 \
  --seed 101 \
  > log_roberta_lpft.txt

2022-09-09 06:03:54.861863: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-09 06:03:55.066780: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-09 06:03:55.802013: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2022-09-09 06:03:55.802147: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open 

### Evaluate

In [21]:
!ls models

roberta-base_bs32_lr1e-05_drop0.4_hidden32_seed101_lpft.pth


In [24]:
!python evaluate.py --data_path https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data \
  --model_path models/roberta-base_bs32_lr1e-05_drop0.4_hidden32_seed101_lpft.pth \
  --which_data valid

2022-09-09 06:59:38.152135: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-09 06:59:38.354252: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-09 06:59:39.164833: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2022-09-09 06:59:39.164986: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open 

### Predict

In [26]:
!python evaluate.py --data_path https://raw.githubusercontent.com/VanHoann/Yelp_Dataset_Challenges/main/Sentiment_Analysis/data \
  --model_path models/roberta-base_bs32_lr1e-05_drop0.4_hidden32_seed101_lpft.pth \
  --which_data test \
  --save_path pred.csv

2022-09-09 07:02:05.574903: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-09 07:02:05.768320: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-09 07:02:06.750276: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2022-09-09 07:02:06.750421: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open 

In [27]:
import pandas as pd
pd.read_csv("pred.csv")[:5]

Unnamed: 0,review_id,stars
0,I77zZlSdCFAClxdjHwPcxw,5
1,ioFNKarf29KGjRZdH0qC8Q,5
2,9429anmcYIcaEcMptJCNKQ,1
3,PsUCdt7PKjzgBC0c7xXhJA,5
4,GQBlykKyShQcNeu2ivLdSA,4
