<a href="https://colab.research.google.com/github/andrewdge/CSE354-Final-Project/blob/main/CSE354_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Document Level Predictions Using Paragraph Level Sentimient

Using the PerSent dataset, we will be training our model based on paragraph-level sentiments. These will be aggregated to produce document-level sentiments.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 31.0 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
[K     |████████████████████████████████| 880 kB 65.2 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 66.2 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 67.5 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
[K     |████████████████████████████████| 77 kB 8.5 MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacr

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import torch
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader, random_split
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
import os
from sklearn.metrics import precision_score, recall_score, f1_score
from transformers import DistilBertModel, DistilBertConfig, DistilBertTokenizer, AdamW, DistilBertForSequenceClassification
from collections import Counter
torch.manual_seed(42)
np.random.seed(42)

# Constants

Constants we will use in our experiments. These may be subjected to change as hyperparameters

In [4]:
DISTILBERT_DROPOUT = 0.2
DISTILBERT_ATT_DROPOUT = 0.2
BATCH_SIZE = 16
EPOCHS = 3

# Andrew PATH
TEST_PATH = '/content/drive/MyDrive/CSE354/random_test.csv'
TRAIN_PATH = '/content/drive/MyDrive/CSE354/train.csv'
VAL_PATH = '/content/drive/MyDrive/CSE354/dev.csv'
SAVE_PATH = '/content/drive/MyDrive/CSE354/models'

test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)
# fixed_test will be used for validation
val_data = pd.read_csv(VAL_PATH)



# Initializing Our Model

Here is where we set up our DistilBERT model.

In [5]:
class DistillBERT():
  def __init__(self):
    # TODO(students): start
    self.tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    config = DistilBertConfig(dropout=DISTILBERT_DROPOUT, 
                          attention_dropout=DISTILBERT_ATT_DROPOUT, 
                          output_hidden_states=True)
    self.model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
    # TODO(students): end
  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer

# DataLoader

This class handles loading, preprocessing, and tokenizing the data.

Each row in the dataframe contains text with some number of paragraphs, as well as a number as labels per paragraph. We add another column in the dataframe, paragraphs per document. This will be used later to test our predictions as compare paragraph-level predictions to paragraph labels, as well as document-level predictions to document labels. We also remove data without paragraph-level labels.

For the labels create new columns for each.

This format is largely takes inspiration from Assignment 3.

In [7]:
class DatasetLoader(Dataset):
  def __init__(self, data, tokenizer):
    # Data is the uncleaned data, as a dataframe.
    self.data = data
    self.tokenizer = tokenizer

  def preprocess_data(self):
    # Combine labels into list.
    df = self.data
    df = df[df['Paragraph0'].notna()]
    df['Paragraph Labels'] = df.iloc[:, 6:].values.tolist() #Includes Nans, remove them
    self.data = df

  def tokenize_data(self, mode="paragraph"):
    # Tokenizing
    tokens = []
    labels = []
    label_dict = {'Negative': 0,
                  'Neutral': 1,
                  'Positive': 2}
    document_list = self.data['DOCUMENT']

    # Tokenizes documents by paragraphs. Tokens = paragraph, labels = paragraph sentiment
    if mode == "paragraph":
      label_list = self.data['Paragraph Labels']
      for (document, doc_labels) in tqdm(zip(document_list, label_list), total=len(document_list)):
        paragraphs = document.split('\n')
        for paragraph, label in zip(paragraphs, doc_labels):
          encoding = self.tokenizer(text=paragraph, truncation='longest_first', max_length=512, return_tensors='pt')
          labels.append(label_dict[label])
          tokens.append(encoding.input_ids[0]) # Might need to CUDA

    # Tokenizes documents by document. Tokens = document, labels = document sentiment
    if mode == "document":
      label_list = self.data['TRUE_SENTIMENT']
      for (document, true_label) in tqdm(zip(document_list, label_list), total=len(document_list)):
        encoding = self.tokenizer(text=paragraph, truncation='longest_first', max_length=512, return_tensors='pt')
        labels.append(label_dict[label])
        tokens.append(encoding.input_ids[0]) # Might need to CUDA
    
    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    labels.to("cuda:0" if torch.cuda.is_available() else "cpu")
    tokens.to("cuda:0" if torch.cuda.is_available() else "cpu")
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, shuffle=True):
    self.preprocess_data()
    processed_dataset = self.tokenize_data()
    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=BATCH_SIZE
    )
    return data_loader

In [11]:
class Trainer():

  def __init__(self, args):
    self.train_data = args['train_data']
    self.val_data = args['val_data']
    self.batch_size = args['batch_size']
    self.epochs = args['epochs']
    self.save_path = args['save_path']
    self.training_type = args['training_type']
    self.device = args['device']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)
    self.val_preds = []
    self.train_preds = []

  def get_performance_metrics(self, preds, labels, mode=None):
    pred_flat = np.argmax(preds, axis=1).flatten()
    # First portion from training preds, second from val preds
    if mode == "training":
      self.train_preds.extend(pred_flat)
    if mode == "val":
      self.val_preds.extend(pred_flat)
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0, average='macro')
    recall = recall_score(labels_flat, pred_flat, zero_division=0, average='macro')
    f1 = f1_score(labels_flat, pred_flat, zero_division=0, average='macro')
    return precision, recall, f1

  def set_training_parameters(self):
    
    if self.training_type == 'frozen_layers':
      for name, param in self.model.named_parameters(): 
          # print(f'{name},  {param.requires_grad}')
          if "distilbert." in name:
            self.model.get_parameter(name).requires_grad = False
    elif self.training_type == 'all_training':
      pass

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start
      output = self.model(reviews.to(self.device), labels=labels.to(self.device)) 
      loss = output.loss
      logits = output.logits
      with torch.no_grad():
        precision, recall, f1 = self.get_performance_metrics(logits.cpu(), labels.cpu(), mode='training')
      loss.backward()
      optimizer.step()
      total_loss += loss
      total_recall += recall
      total_precision += precision
      total_f1 += f1
      # TODO(students): end
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        output = self.model(reviews.to(self.device), labels=labels.to(self.device)) 
        prec, rec, f1 = self.get_performance_metrics(output.logits.cpu(), labels.cpu(), mode='val')
        total_recall += rec
        total_precision += prec
        total_f1 += f1
        total_loss += output.loss
        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss
  
  def eval_doc(self, data_loader, mode):
    # Predictions per paragraph
    preds = []
    index = 0
    if mode == "training":
      preds = self.train_preds
    if mode == "val":
      preds = self.val_preds
    df = data_loader.data
    
    docs_and_paragraph_labels = pd.Series(df['TRUE_SENTIMENT'].values, index=df['Paragraph Labels']).to_dict()
    doc_preds = []
    doc_labels = df['TRUE_SENTIMENT'].values
    # For each document, check the number of predicted labels corresponding with its paragraphs.
    for true_sentiment, p_labels in docs_and_paragraph_labels.items():
      doc_paragraph_preds = []
      # For the number of paragraphs (denoted by p_labels) per document, append the prediction list for that document
      # based on the entire list of paragraph predictions, preds.
      for i in range(len(p_labels)):
        doc_paragraph_preds.append(preds[index])
        index += 1
      # Find the most common and append that to the document predictions.
      c = Counter(doc_paragraph_preds)
      doc_preds.append(c.most_common(1)[0][0])
    print("index value")
    print(index)
    print()
    return self.get_performance_metrics(doc_preds, doc_labels)
    

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      doc_precision, doc_recall, doc_f1 = self.eval_doc(train_data_loader, mode='training')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')
      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

In [None]:
import os
import gc
gc.collect()
torch.cuda.empty_cache()
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
args = {}
args['batch_size'] = BATCH_SIZE
args['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
args['train_data'] = train_data
args['val_data'] = val_data
args['save_path'] = SAVE_PATH + '_frozen_layers'
args['epochs'] = EPOCHS
args['training_type'] = 'frozen_layers'
print(args['device'])
CUDA_LAUNCH_BLOCKING=1
trainer = Trainer(args)

trainer.execute()

cuda:0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

In [None]:
import os
import gc
gc.collect()
torch.cuda.empty_cache()
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
args = {}
args['batch_size'] = BATCH_SIZE
args['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
args['train_data'] = train_data
args['val_data'] = val_data
args['save_path'] = SAVE_PATH + '_all_training'
args['epochs'] = EPOCHS
args['training_type'] = 'all_training'
print(args['device'])
CUDA_LAUNCH_BLOCKING=1
trainer = Trainer(args)

trainer.execute()

# DO NOT DELETE
##Training all layer results:
Epoch 3: train_loss: 0.9514 train_precision: 0.2719 train_recall: 0.3635 train_f1: 0.2743
100%|██████████| 261/261 [01:09<00:00,  3.77it/s]Epoch 3: val_loss: 0.9390 val_precision: 0.1770 val_recall: 0.3678 val_f1: 0.2352

##Freeze all distilbert layer results:
Epoch 3: train_loss: 0.9435 train_precision: 0.3501 train_recall: 0.3875 train_f1: 0.3319
100%|██████████| 261/261 [01:09<00:00,  3.75it/s]Epoch 3: val_loss: 0.9261 val_precision: 0.3844 val_recall: 0.3959 val_f1: 0.3225