# Sentiment Analysis and Stance Classification
## Modified from CIS 530 Homework Option 2 - Spring 2021




## **Part I:** Relevance Classification with BERT fine-tuning

#### Why do we need to fine-tune BERT?
Here is a nice [demo](https://demo.allennlp.org/masked-lm?text=The%20doctor%20ran%20to%20the%20emergency%20room%20to%20see%20%5BMASK%5D%20patient.) 



We will be using the [transformers](https://github.com/huggingface/transformers) package developed by Hugging Face, based on PyTorch. It is a widely used library for BERT and other transformer-based language models like GPT-2.


**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in Runtime > Change runtime type before running this Colab.**

### Installing the Hugging Face🤗 transformers package

In [2]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "0"
!pip install transformers
!pip3 install sentencepiece

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
Collecting sentencepiece
  Downloading https://files.p

### Import the important packages that we need

In [2]:
import torch 
import numpy as np

### Mount your Google Drive

We will be saving trained checkpoints to your Google Drive so that they can be accessed even if the Colab session dies. Make sure to log in with your UPenn credentials, as you will be saving several gigabytes of data, and Penn gives you unlimited Drive storage.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Download the Review Dataset

Note that with the default code, the files are not saved to your Google Drive, which means they will be deleted after the session closes. You can either re-run this cell in each new Colab session, or save the files to the mounted drive at `/content/drive`.

In [4]:
import json
import gzip

def parse(path):
  # Each line is evaluated with eval(); `true` and `false` are defined here so
  # that JSON-style booleans resolve. eval() executes arbitrary code, so only
  # run this on files you trust.
  true = True
  false = False
  with gzip.open(path, 'r') as g:
    for l in g:
      yield json.dumps(eval(l))

def strict(path):
  # Convert <path>.json.gz (one record per line) into <path>.json (a single JSON array)
  f_path = path + ".json.gz"
  # The `with` block closes the output file, so everything is flushed to disk
  with open(path + ".json", 'w') as f:
    f.write("[" + ",\n".join(parse(f_path)) + "]")

In [5]:
dataset_dir = '/content/drive/MyDrive/cis519_project'
strict('/content/drive/MyDrive/cis519_project/gift_Cards_5')
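
If every line of the compressed file is strict JSON (an assumption; the cell above uses `eval` instead), the same conversion can be sketched without `eval`. The function name `json_lines_gz_to_array` and the toy records below are made up for illustration, and the sketch writes to a temporary directory rather than to your Drive.

```python
import gzip
import json
import os
import tempfile

def json_lines_gz_to_array(gz_path, out_path):
    """Read one JSON object per line from a gzip file and write them as one JSON array."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as fin:
        # Assumption: every non-empty line is a valid JSON object
        records = [json.loads(line) for line in fin if line.strip()]
    with open(out_path, "w", encoding="utf-8") as fout:
        json.dump(records, fout)
    return len(records)

# Toy data standing in for the review file (made-up records).
toy = [{"overall": 5.0, "verified": True, "reviewText": "Nice present"},
       {"overall": 4.0, "verified": False, "reviewText": "Fine as a gift."}]

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy.json.gz")
    out_path = os.path.join(tmp, "toy.json")
    # Write the toy records as gzip-compressed JSON lines
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        for rec in toy:
            f.write(json.dumps(rec) + "\n")
    n = json_lines_gz_to_array(gz_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        loaded = json.load(f)

print(n, loaded[0]["reviewText"])
```

Parsing with `json.loads` fails loudly on a malformed line instead of executing it, which is the main reason to prefer it when the input format allows.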

### Load the dataset and see what it looks like

For now let's first load the training dataset and see what it looks like. We will worry about the dev/test sets later...


In [6]:
import json
import os
with open(os.path.join(dataset_dir, 'gift_Cards_5.json')) as fin:
    train_set = json.load(fin)

print(train_set[0]['reviewText'])
print("Number of reviews in training set: {}".format(len(train_set)))
print("Here's what one of the examples looks like: {}".format(json.dumps(train_set[100])))

Another great gift.
Number of reviews in training set: 2972
Here's what one of the examples looks like: {"overall": 5.0, "verified": true, "reviewTime": "04 9, 2018", "reviewerID": "A2CM3SWOMP3A9C", "asin": "B005DHN6E2", "style": {"Gift Amount:": " 0"}, "reviewerName": "Mike", "reviewText": "Once again, who but Donald Trump (perhaps The Walton family) could hate an Amazon Gift Card?", "summary": "who but Donald Trump (perhaps The Walton family) could hate an Amazon Gift Card", "unixReviewTime": 1523232000}


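TODO: do some sampling. For now, we keep only the reviews that have text of at most 500 characters, and use the star rating (`overall`) as the label.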

In [None]:
import random

# def negative_sample(train_set, claim_id, claim_text, sample_size):
#     """
#     Given a perspective (a dictionary with keys "id" and "text"), randomly sample {sample_size} negative examples from the dataset. E.g. get a perspective from a different claim
#     """
#     # Each perspective object in the list should be a dictionary with two keys "id", "text".
#     other_examples = [ex for ex in train_set if ex["cid"] != claim_id]
    
#     negative_examples = []
#     for i in range(sample_size):
#         rand_claim = random.choice(other_examples)
#         all_persps = rand_claim["perspective_for"] + rand_claim["perspective_against"]
#         random_persp = random.choice(all_persps)
#         negative_examples.append(random_persp)
    
#     return negative_examples

# training_sentence_pairs = []

# for claim in train_set:
#     positive_perspectives = claim["perspective_for"] + claim["perspective_against"]
    
#     # We keep the number of negative examples equal to positive, so that we will have a balanced training set
#     negative_perspectives = negative_sample(train_set, claim['cid'], claim['claim_text'], len(positive_perspectives)) 
    
#     for persp in positive_perspectives:
#         training_sentence_pairs.append({
#             "claim_id": claim["cid"],
#             "claim_text": claim["claim_text"],
#             "perspective_id": persp["id"],
#             "perspective_text": persp["text"],
#             "label": True
#         })

#     for persp in negative_perspectives:
#         training_sentence_pairs.append({
#             "claim_id": claim["cid"],
#             "claim_text": claim["claim_text"],
#             "perspective_id": persp["id"],
#             "perspective_text": persp["text"],
#             "label": False
#         })
training_reviews = []
review_id = 0      # renamed from `id` to avoid shadowing the built-in
max_len = 0        # length (in characters) of the longest review that we keep
for review in train_set:
  # Skip reviews that have no review text
  if 'reviewText' not in review:
    continue
  # Skip reviews longer than 500 characters
  if len(review['reviewText']) > 500:
    continue
  training_reviews.append({
            "review_id": review_id,
            "review_text": review['reviewText'],
            "label": review['overall']
        })
  max_len = max(max_len, len(review['reviewText']))
  review_id = review_id + 1
print("Number of reviews for training: {}".format(len(training_reviews)))
print(max_len)

Now would be a good time to load our dev/test examples, which are already organized in the same sentence-pair format as the examples you just built.

In [27]:
# with open(os.path.join(dataset_dir, 'perspectrum_relevance_dev.json')) as fin:
#     dev_sentence_pairs = json.load(fin)

# with open(os.path.join(dataset_dir, 'perspectrum_relevance_test_no_label.json')) as fin:
#     test_sentence_pairs = json.load(fin)

# print("Number of claim-perspective sentence pairs in dev set: {}".format(len(dev_sentence_pairs)))
# print("Number of claim-perspective sentence pairs in test set: {}".format(len(test_sentence_pairs)))

### Load Pretrained BERT Model
 

You can search for the available models [here](https://huggingface.co/models?search=bert).

You can find more examples of different use cases for BERT in the transformers GitHub repo README -- https://github.com/huggingface/transformers


In [3]:
from transformers import InputExample
from transformers import (WEIGHTS_NAME, BertConfig,
                          BertForSequenceClassification, BertTokenizer)
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
import tqdm

from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
# BERT-mini. Specs of BERT models with different sizes can be found at
# https://github.com/google-research/bert/
# You can experiment with models of different sizes to see how size affects performance.
bert_model_type = 'google/bert_uncased_L-4_H-256_A-4'
# bert_model_type = "albert-base-v2"

# The review ratings are mapped to five classes, so the classification head
# needs five outputs (the default is two).
num_labels = 5
bert_model = BertForSequenceClassification.from_pretrained(bert_model_type, num_labels=num_labels)
config = BertConfig.from_pretrained(bert_model_type, num_labels=num_labels)
tokenizer = BertTokenizer.from_pretrained(bert_model_type)
# from transformers import AutoTokenizer, AutoModelForPreTraining, AutoConfig
# import sentencepiece

# tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
# config = AutoConfig.from_pretrained("google/bigbird-roberta-base")
# model = AutoModelForPreTraining.from_pretrained("google/bigbird-roberta-base")

Some weights of the model checkpoint at google/bert_uncased_L-4_H-256_A-4 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification w


### Convert examples to BERT input features
Much like with every other neural network, you need to (1) tokenize your input sentences and (2) have a vocabulary/dictionary that converts each token to a vector/tensor. Luckily, BERT offers a very nice set of interfaces through which you can do these steps easily.

In this homework we provide this function to you. However, in case you would like to use BERT in the future, it is really important to understand BERT's input format and the word-piece tokenization strategy that BERT adopts. Here are a few resources that we suggest -- 

1. The ["What is BERT" section](https://github.com/google-research/bert#what-is-bert) in the official BERT code repo by Google
2. Section 3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)
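
To make the word-piece idea concrete, here is a minimal sketch of greedy longest-match-first tokenization plus the `[CLS]`/`[SEP]` wrapping that BERT expects for a single sentence. The vocabulary `TOY_VOCAB` and the helper names are made up for this illustration; the real tokenizer also handles punctuation, casing options and a vocabulary of roughly 30,000 entries, so treat this as a simplification, not as the library's implementation.

```python
# Toy vocabulary, made up for this illustration only.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "gift", "card", "##s",
             "un", "##believ", "##able", "great"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

def to_bert_tokens(sentence):
    """Lowercase, split on whitespace, word-piece each word, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens

print(to_bert_tokens("Unbelievable gift cards"))
# ['[CLS]', 'un', '##believ', '##able', 'gift', 'card', '##s', '[SEP]']
```

Splitting rare words into known pieces is what lets a fixed-size vocabulary cover words it has never seen as a whole.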


In [10]:
relevance_label_mapping = {
    1.0: 0,
    2.0: 1,
    3.0: 2,
    4.0: 3,
    5.0: 4
} # If you are working on stance classification, create a different label mapping

def convert_sentence_pair_to_tensor_input(sentence_reviews, label_mapping):

    # STEP 1: wrap each review in an InputExample
    # Unlabeled (test) examples fall back to a dummy label so that they can still be converted
    dummy_label = next(iter(label_mapping))
    input_examples = []
    for review in sentence_reviews:
        current_label = review.get("label", dummy_label)
        input_examples.append(
            InputExample(guid="", # We don't really need this
                         text_a=review["review_text"], 
                        #  text_b="", 
                         label=label_mapping[current_label])
        )
    print(input_examples)
    label_list = [val for _, val in label_mapping.items()]

    features = convert_examples_to_features(input_examples,
                                                   tokenizer,
                                                   label_list=label_list,
                                                   max_length=512,  
                                                   output_mode="classification")
    
    input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    labels = torch.tensor([f.label for f in features], dtype=torch.long)

    dataset = TensorDataset(input_ids, attention_mask, token_type_ids, labels)

    return dataset
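
The three input tensors built above can be illustrated for a single example in plain Python. This is only a sketch of the idea: `pad_features` is a hypothetical helper, the token ids are made up, and the truncation here is naive (the real conversion function takes care of keeping the special tokens in place).

```python
def pad_features(token_ids, max_length, pad_id=0):
    """Truncate or pad one sequence of token ids and build the matching masks."""
    ids = token_ids[:max_length]                      # naive truncation
    attention_mask = [1] * len(ids)                   # 1 = real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding
    attention_mask = attention_mask + [0] * padding   # 0 = padding position
    token_type_ids = [0] * max_length                 # single-sentence input: all segment 0
    return ids, attention_mask, token_type_ids

# Hypothetical ids for a four-token input such as "[CLS] great gift [SEP]";
# real ids depend on the vocabulary file of the chosen model.
ids, mask, types = pad_features([101, 2307, 5592, 102], max_length=8)
print(ids)    # [101, 2307, 5592, 102, 0, 0, 0, 0]
print(mask)   # [1, 1, 1, 1, 0, 0, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Every example is padded to the same length so that a batch can be stacked into one tensor; the attention mask tells the model which positions are padding.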

In [11]:
train_dataset = convert_sentence_pair_to_tensor_input(training_reviews, relevance_label_mapping)

[InputExample(guid='', text_a='Another great gift.', text_b=None, label=4), InputExample(guid='', text_a='Gift card for my daughter', text_b=None, label=3), InputExample(guid='', text_a='Nice present', text_b=None, label=4), InputExample(guid='', text_a='My niece loved this birthday greeting/gift card.', text_b=None, label=4), InputExample(guid='', text_a='fine as a gift.', text_b=None, label=4), InputExample(guid='', text_a='I would have preferred some more choices.', text_b=None, label=4), InputExample(guid='', text_a='great', text_b=None, label=4), InputExample(guid='', text_a='Very cute design and enjoyed by recipient.', text_b=None, label=4), InputExample(guid='', text_a="I used the text option to send these last minute gift cards to my Granddaughters (via their mom's phone). Works really well, you get a confirmation email that it has been received, and a confirmation that the cards have been redeemed. Granddaughter's very happy with the card design. I love the options you have wh



### Choose your hyperparameters + model output directory
Before we get into training, we need to set our hyperparameters, e.g. the learning rate, the mini-batch sizes for training/testing, etc.

In [12]:
HYPER_PARAMS = {
    "num_training_epoch": 3,
    "learning_rate": 3e-5,        # Suggested values -- [1e-5, 3e-5, 5e-5]
    "training_batch_size": 16,    # Suggested values -- [16, 32]
    "eval_batch_size": 8,
    "max_grad_norm": 1.0,
    "num_warmup_steps": 0.1
}

model_output_dir = "/content/drive/" # Model + prediction results will be saved to your GDrive, 
                                     # so you don't lose them after session closes
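
To see how the epoch count, batch size and warmup setting interact, here is a small sketch of a linear warmup followed by a linear decay, which is the rule that `get_linear_schedule_with_warmup` is documented to apply. The helper name `linear_schedule_multiplier` and the dataset size are made up for illustration, and the warmup value of 0.1 is assumed to mean a fraction of the total number of steps.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup from 0 to 1, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical sizes for illustration only.
num_examples, batch_size, num_epochs = 160, 16, 3
steps_per_epoch = -(-num_examples // batch_size)     # ceiling division: 10
num_training_steps = steps_per_epoch * num_epochs    # 30 optimizer steps in total
num_warmup_steps = int(0.1 * num_training_steps)     # 10% warmup: 3 steps

base_lr = 3e-5
for step in (0, 3, 15, 30):
    lr = base_lr * linear_schedule_multiplier(step, num_warmup_steps, num_training_steps)
    print(step, lr)
```

Note that if `num_training_steps` only covered a single epoch (10 steps here), the multiplier would already be 0 at step 15, so later epochs would train with a learning rate of zero.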

### Fine-tune BERT model

Remember NOT to re-run this cell multiple times without re-initializing the BERT model. Multiple runs will effectively train your model for more epochs than you intended!

In [1]:
import tqdm.notebook  # import the submodule explicitly so that tqdm.notebook.tqdm is available

bert_model.to('cuda')

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, 
                              sampler=train_sampler, 
                              batch_size=HYPER_PARAMS["training_batch_size"])

optimizer = AdamW(bert_model.parameters(), 
                  lr=HYPER_PARAMS['learning_rate'], 
                  correct_bias=False)

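# The schedule must span every optimizer step across all epochs, not just one epoch.
num_training_steps = len(train_dataloader) * HYPER_PARAMS["num_training_epoch"]
# HYPER_PARAMS['num_warmup_steps'] is 0.1, which only makes sense as a fraction of
# the total number of steps, so it is converted to a step count here.
num_warmup_steps = int(HYPER_PARAMS['num_warmup_steps'] * num_training_steps)

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=num_warmup_steps, 
                                            num_training_steps=num_training_steps)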


global_step = 0
tr_loss = 0.0
bert_model.zero_grad()
bert_model.train()

for epc in range(HYPER_PARAMS["num_training_epoch"]):
    print("Epoch #{}: \n".format(epc))
    epoch_iterator = tqdm.notebook.tqdm(train_dataloader, desc="Training Steps")
    avg_loss_over_epoch = []
    for step, batch in enumerate(epoch_iterator):
        batch = tuple(t.to("cuda") for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2],
                  'labels': batch[3]}

        outputs = bert_model(**inputs)
        loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(bert_model.parameters(), HYPER_PARAMS["max_grad_norm"])
        tr_loss += loss.item()

        optimizer.step()
        scheduler.step()
        bert_model.zero_grad()


### Save the fine-tuned model
It is good practice to save your tokenizer + config for BERT at the same location, for best reproducibility

In [None]:
import os

# This is where we mounted your google drive. 
# You might need to re-mount it if your session was closed half way through
output_dir = "/content/drive/My Drive/cis530_perspective_hw/relevance_model_large/" 

os.makedirs(output_dir, exist_ok=True)  # no error if the directory already exists

bert_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
config.save_pretrained(output_dir)

### Test if you can load the model back!

In [None]:
bert_model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)

# Don't forget to move your model to GPU/CUDA after loading back from disk!
bert_model = bert_model.to("cuda")

### Evaluate the fine-tuned model on dev set
Now we want to know how good our model is. Let's test it on the dev set!

We need to go through the same process -- convert sentence pairs into feature vectors/tensors

In [None]:
# Putting this here again, just so you don't forget what it is...
relevance_label_mapping = {
    True: 1,
    False: 0
} 

dev_dataset = convert_sentence_pair_to_tensor_input(dev_sentence_pairs, relevance_label_mapping)

# We are not random sampling anymore when evaluating... As we want to keep the order 
dev_sampler = SequentialSampler(dev_dataset)
dev_dataloader = DataLoader(dev_dataset, 
                            sampler=dev_sampler, 
                            batch_size=HYPER_PARAMS["eval_batch_size"])

predictions = None
out_label_ids = None

for batch in tqdm.notebook.tqdm(dev_dataloader, desc="Evaluating on Dev set..."):
    bert_model.eval()
    batch = tuple(t.to("cuda") for t in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              'token_type_ids': batch[2],
              'labels': batch[3]}

    with torch.no_grad():
        outputs = bert_model(**inputs)
        logits = outputs[1] # Shape (batch_size, num_labels): one score per label for each example

    if predictions is None:
        predictions = logits.detach().cpu().numpy()
        out_label_ids = inputs['labels'].detach().cpu().numpy()
    else:
        predictions = np.append(predictions, logits.detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

# whichever label gets higher score, we will predict that label
predictions = np.argmax(predictions, axis=1)


# We will simply use accuracy as our measure here 
def accuracy(preds, labels):
    return (preds == labels).mean()

acc = accuracy(predictions, out_label_ids)

print("The accuracy on dev set = {}".format(acc))





The accuracy on dev set = 0.384
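
The two evaluation steps, picking the highest-scoring label and comparing it against the gold label, can be traced by hand on a tiny example. The sketch below uses plain Python lists instead of NumPy; the logits and gold labels are made up, and the helper names are hypothetical.

```python
def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy_from_lists(preds, labels):
    """Fraction of predictions that equal the gold label."""
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

# Made-up logits for four examples and five rating classes.
logits = [
    [0.1, 0.2, 0.0, 1.5, 3.0],
    [2.0, 0.3, 0.1, 0.0, -1.0],
    [0.0, 0.1, 0.2, 2.2, 2.1],
    [0.5, 0.4, 0.3, 0.2, 0.9],
]
gold = [4, 0, 4, 4]

preds = [argmax(row) for row in logits]
print(preds)                             # [4, 0, 3, 4]
print(accuracy_from_lists(preds, gold))  # 0.75
```

The third example shows a near miss: class 3 narrowly beats class 4, and accuracy counts it as fully wrong.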





The TAs were able to get around 70-80% accuracy on the dev set, with the provided set of parameters and model. 

### Now it's your turn - Evaluate on the test data, and submit your results

**Important Note**: the labels of the test data are NOT given to you in this homework. However, the helper functions will still generate a dummy label for each input sentence pair. The only way to measure the correct accuracy on the test set is to submit your test results `relevance_test_predictions.txt` to Gradescope. 

Other than that, this should be almost identical to what we just did for the dev set.

Please download `relevance_test_predictions.txt` and follow the guide on the homework webpage to make a submission.

In [None]:
def predict_on_test_set():
    """
    Return a list of 0/1 prediction for each test example, in sequential order.
    Please use the same label mapping as we have so far.
    1 = True (Relevant)
    0 = False (Not relevant)
    """

    test_dataset = convert_sentence_pair_to_tensor_input(test_sentence_pairs, relevance_label_mapping)

    # We are not random sampling anymore when evaluating... As we want to keep the order 
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(test_dataset, 
                                sampler=test_sampler, 
                                batch_size=HYPER_PARAMS["eval_batch_size"])

    predictions = None
    out_label_ids = None

    for batch in tqdm.notebook.tqdm(test_dataloader, desc="Evaluating on Test set..."):
        bert_model.eval()
        batch = tuple(t.to("cuda") for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2],
                  'labels': batch[3]}

        with torch.no_grad():
            outputs = bert_model(**inputs)
            logits = outputs[1] # Shape (batch_size, num_labels): one score per label for each example

        if predictions is None:
            predictions = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            predictions = np.append(predictions, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

    # whichever label gets higher score, we will predict that label
    predictions = np.argmax(predictions, axis=1)
    
    # Convert the NumPy array to a plain Python list, as the docstring promises
    list_of_predictions = predictions.tolist()
    
    return list_of_predictions


# Feel free to change the save location as you like,
# but please keep the file name as "relevance_test_predictions.txt"
# So that the autograder will know what file to look for...
test_result_output_path = "/content/drive/My Drive/cis530_perspective_hw/relevance_test_predictions.txt"

test_predictions = predict_on_test_set()

with open(test_result_output_path, 'w') as fout:
    for pred in test_predictions: 
        fout.write("{}\n".format(int(pred)))







## **Part II:** DIY for stance classification (Optional, Extra Credit)

Now that you are becoming an expert in BERT (hopefully), why don't you try to tackle our second task -- stance classification: predicting whether a relevant perspective is either **supporting or refuting** the claim.

Since this is a different task, you will be generating positive and negative sentence pairs in a slightly different way. Specifically --

1.   In `perspectrum_train.json`, for each given claim, both supporting and refuting perspectives have been given to you. So you don't need to do negative sampling. Instead, you should take the claim with a "supporting" perspective as a positive sentence pair and the claim with a "refuting" perspective as a negative pair.   

2.   The task assumes that for every input claim-perspective pair, the perspective is relevant to the claim. So when generating training pairs, you should make sure of that.

But once you have generated sentence pairs from the training data, the training/evaluation procedure should be almost identical. For the most part you will be re-using code that we just went through.
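
One possible way to build the stance pairs is sketched below. It assumes that each claim carries the fields `cid`, `claim_text`, `perspective_for` and `perspective_against` (the names used in the commented-out sampling code earlier in this notebook), and that `perspective_for` holds the supporting perspectives; check both assumptions against the actual file. The function name and the toy claim are made up for illustration.

```python
def make_stance_pairs(claims):
    """Turn each claim into (claim, perspective, stance) training pairs."""
    pairs = []
    for claim in claims:
        # Assumption: "perspective_for" = supporting, "perspective_against" = refuting
        for stance, key in (("support", "perspective_for"),
                            ("refute", "perspective_against")):
            for persp in claim.get(key, []):
                pairs.append({
                    "claim_id": claim["cid"],
                    "claim_text": claim["claim_text"],
                    "perspective_id": persp["id"],
                    "perspective_text": persp["text"],
                    "label": stance,
                })
    return pairs

# A made-up claim in the assumed layout of perspectrum_train.json.
toy_claims = [{
    "cid": 1,
    "claim_text": "Gift cards make good presents.",
    "perspective_for": [{"id": 10, "text": "The recipient can pick what they want."}],
    "perspective_against": [{"id": 11, "text": "They can feel impersonal."},
                            {"id": 12, "text": "Unused balances go to waste."}],
}]

pairs = make_stance_pairs(toy_claims)
print(len(pairs), [p["label"] for p in pairs])
# 3 ['support', 'refute', 'refute']
```

Because every perspective listed under a claim is relevant to that claim, pairing only within a claim keeps the relevance assumption of the task intact.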

### **What you need to submit**:
As with the perspective relevance classification, we want you to train a model and write your stance classification predictions on the test data to a file named `stance_test_predictions.txt`. 




In [None]:
with open(os.path.join(dataset_dir, 'perspectrum_train.json')) as fin:
    train_set = json.load(fin)

# TODO: start from here

In [None]:
# The dev and test sets are, again, made into sentence pairs format for you already
with open(os.path.join(dataset_dir, 'perspectrum_stance_dev.json')) as fin:
    dev_sentence_pairs = json.load(fin)

with open(os.path.join(dataset_dir, 'perspectrum_stance_test_no_label.json')) as fin:
    test_sentence_pairs = json.load(fin)

print("Number of claim-perspective sentence pairs in dev set: {}".format(len(dev_sentence_pairs)))
print("Number of claim-perspective sentence pairs in test set: {}".format(len(test_sentence_pairs)))

stance_label_mapping = {
    "support": 1,
    "refute": 0
} 

# TODO: start from here