# Assignment 2
Bradley Thompson - CS 510 Large Language Models PDX Winter 2024

## Experimental Setting 1
To start, I spent a long time just trying to figure out how to implement the classifier for this assignment. I got started right away, but burned many hours without finding any really good resources online. One resource threw me off course, because it led me to look into Hugging Face's provided AutoModelClassifiers. After talking to the professor, I learned that this was not the intended solution, so I switched back to basing my work on assignment 1. I didn't make any significant headway until the TA linked some articles in Slack, which gave me a base for further research. Eventually I realized that I could check the output scores for all tokens in the tokenizer's vocabulary, and then take the highest log probability among the tokens that map to my target labels as the model's classification.
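
The idea in the last sentence can be sketched without a model at all. This is a minimal, standard-library illustration and not the notebook's actual implementation: `log_softmax`, `pick_label`, the toy scores, and the toy token-to-label map are all hypothetical names and values chosen for the example.

```python
import math


def log_softmax(scores):
    """Convert raw next-token scores into log probabilities (numerically stable)."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def pick_label(scores, token_to_label):
    """Return the label whose candidate token has the highest log probability.

    scores: one score per vocabulary entry for the next token.
    token_to_label: hypothetical map from vocabulary index to label name.
    """
    log_probs = log_softmax(scores)
    best_token = max(token_to_label, key=lambda tok: log_probs[tok])
    return token_to_label[best_token]


# Toy 6-entry vocabulary; indices 1, 3 and 4 stand in for the label tokens.
toy_scores = [0.2, 1.5, -0.3, 2.1, 0.9, 3.0]
toy_labels = {1: "negative", 3: "neutral", 4: "positive"}
print(pick_label(toy_scores, toy_labels))  # prints "neutral"
```

Index 5 has the highest score overall, but it is not a label token, so it is ignored; only the ranking among the label tokens decides the classification.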

At this point I ran into a few more issues around calculating metrics for the multi-class predictions, and also ran into an issue with the vocabulary, because apparently BLOOM's tokenizer has no single token for "Neutral". I wanted to make my labels case-insensitive in the hope that it would improve accuracy, so I made a map that treats both component tokens ("Neut" and "ral") the same as "neutral".

Finally, I was able to play around with hyperparameters. The work above accounts for almost all waking hours of a single weekend, plus many multi-hour sessions after work across 2 weeks. Luckily, I have a pretty nice GPU with CUDA support at home, so I am able to run the models pretty fast at no added Colab cost.

I had no fancy configuration to start, other than setting the model to generate only 1 new token. My F1 score was around `0.16` with BLOOM 560m on the training set. I set out to improve this to some extent before running my classifier implementation for all models across the test set.

The first hyperparameter tweak was to try `top_k` sampling; `top_k=2` had no effect on performance.
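
A possible explanation, not verified against the run above: top-k filtering only masks scores outside the top `k` and never reorders the survivors, so an argmax over the label tokens is unchanged whenever the best label token survives the filter. The sketch below shows this with the standard library only. `top_k_filter`, `best_label_position`, and the toy values are hypothetical, and it assumes the classifier reads scores after filtering.

```python
import math


def top_k_filter(scores, k):
    """Keep the k highest scores and mask the rest with -inf (ties at the cutoff are kept)."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s if s >= threshold else -math.inf for s in scores]


def softmax(scores):
    """Normalize scores into probabilities; masked (-inf) entries become 0.0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_label_position(scores, label_token_ids):
    """Return (position of the most probable label token, label probabilities)."""
    probs = softmax(scores)
    label_probs = [probs[tok] for tok in label_token_ids]
    return label_probs.index(max(label_probs)), label_probs


toy_scores = [0.2, 1.5, -0.3, 2.1, 0.9, 3.0]
label_ids = [1, 3, 4]  # toy stand-ins for the label tokens

print(best_label_position(toy_scores, label_ids)[0])                   # unfiltered
print(best_label_position(top_k_filter(toy_scores, 2), label_ids)[0])  # best label survives
print(best_label_position(top_k_filter(toy_scores, 1), label_ids))     # no label survives
```

In the last call every label probability is 0, so the tie is broken by position and the first label wins regardless of the input. If that happens often, a small `top_k` would push predictions toward whichever label is listed first.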

In [35]:
from transformers import AutoModelForCausalLM, AutoTokenizer
MODELS = (
    "bigscience/bloom-560m",
    "bigscience/bloom-1b1",
    "bigscience/bloom-1b7",
    "bigscience/bloomz-560m",
    "bigscience/bloomz-1b1",
    "bigscience/bloomz-1b7",
)

model_name = MODELS[0]
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [2]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "english")
target_dataset = dataset["train"]

# Check out what the data looks like:
target_dataset[2]

{'text': '"Frank Gaffrey\\u002c Cliff May\\u002c Steve Emerson: Brilliant. \\""""Looming Threats: Iran\\u002c Hezbollah Hamas\\"""" is the best #cufidc session I\\u2019ve had thus far." ',
 'label': 2}

In [3]:
import math

POSITIVE="positive"
NEUTRAL="neutral"
NEGATIVE="negative"


# Note: we have 2 options so we can be case insensitive
LABELS = "negative Negative neutral Neutral positive Positive"
ID_TO_LABEL = {
    0: NEGATIVE,
    1: NEUTRAL,
    2: POSITIVE,
}
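# Note: Found that bloom apparently doesn't have `Neutral` in its vocabulary; so, will consider either `Neut` or `ral`.
# Keys are positions in `target_label_tokens` (7 tokens, since `Neutral` splits in two); values are label ids.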
TOKEN_INDEX_TO_LABEL_ID = {
    0: 0,
    1: 0,
    2: 1,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
}

target_label_tokens = tokenizer(LABELS, add_special_tokens=False)["input_ids"]
print("Gathering vocab indices for target tokens:", tokenizer.decode(target_label_tokens))
for token in target_label_tokens:
    print(f"Token {token}: {tokenizer.decode(token)}")
target_label_tokens

Gathering vocab indices for target tokens: negative Negative neutral Neutral positive Positive
Token 111017: negative
Token 149414:  Negative
Token 40979:  neutral
Token 76420:  Neut
Token 4343: ral
Token 18121:  positive
Token 139904:  Positive


[111017, 149414, 40979, 76420, 4343, 18121, 139904]
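
The position-to-label map above is written by hand, which is easy to get wrong if a variant splits into a different number of pieces under another tokenizer. As a sketch, the same map can be derived from per-variant token lists. The ids below are copied from the output above, while `VARIANT_TOKENS` and `build_index_to_label` are hypothetical names. In a real run the per-variant ids would come from the tokenizer, and tokenizing each variant on its own may give different ids than tokenizing the joined string, because the leading space is part of the token.

```python
# (label id, token ids) per label variant, copied from the tokenizer output above.
# "Neutral" is the only variant that splits into two pieces (" Neut" + "ral").
VARIANT_TOKENS = [
    (0, [111017]),       # negative
    (0, [149414]),       # Negative
    (1, [40979]),        # neutral
    (1, [76420, 4343]),  # Neutral -> " Neut", "ral"
    (2, [18121]),        # positive
    (2, [139904]),       # Positive
]


def build_index_to_label(variant_tokens):
    """Flatten the token ids and map each position in the flat list to its label id."""
    flat_tokens = []
    index_to_label = {}
    for label_id, token_ids in variant_tokens:
        for token_id in token_ids:
            index_to_label[len(flat_tokens)] = label_id
            flat_tokens.append(token_id)
    return flat_tokens, index_to_label


flat_tokens, index_to_label = build_index_to_label(VARIANT_TOKENS)
print(flat_tokens)
print(index_to_label)
```

With these inputs the flat list matches `target_label_tokens` and the map matches the hand-written `TOKEN_INDEX_TO_LABEL_ID`.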

In [36]:
from transformers import GenerationConfig
import torch as t

config = {
    "min_new_tokens": 1,
    "max_new_tokens": 1,
    "do_sample": True,
    "top_k": 4,
}

HYPOTHESIS = "This text has a positive sentiment, true or false:"

def classify(premise: str) -> int:
    """
    Use model to generate output scores across entire vocab for the next token based on probability
    of entailment for the given premise/hypothesis pair.
    https://joeddav.github.io/blog/2020/05/29/ZSL.html#Classification-as-Natural-Language-Inference
    :param premise: some input text string to be classified based on `LABELS`.
    :returns: classification label id
    """
    inputs = tokenizer.encode(premise, HYPOTHESIS, return_tensors="pt")
    gen_config: GenerationConfig = GenerationConfig.from_dict(config)
    # Generate exactly one token and keep its scores over the whole vocab (shape: [1, vocab_size]).
    # Note: with `do_sample` and `top_k` set, these scores should be post-filtering (tokens outside the top-k are -inf).
    scores = model.generate(
        inputs, generation_config=gen_config, return_dict_in_generate=True, output_scores=True
    )["scores"][0]
    # Normalize with softmax to get a probability for every vocab entry
    vocab_probs = scores.softmax(dim=1)
    # Pick out the probabilities (not log probs) of our target label tokens by vocab index
    label_probs = t.index_select(vocab_probs, 1, t.tensor(target_label_tokens))[0]
    # Position within `target_label_tokens` of the most probable label token
    labels_index = t.argmax(label_probs).item()
    return TOKEN_INDEX_TO_LABEL_ID[labels_index]

sample = target_dataset[0]["text"]
print(f"Input text: {sample}\nClassification: {ID_TO_LABEL[classify(sample)]}")

Input text: okay i\u2019m sorry but TAYLOR SWIFT LOOKS NOTHING LIKE JACKIE O SO STOP COMPARING THE TWO. c\u2019mon America aren\u2019t you sick of her yet? (sorry) 
Classification: negative


In [29]:
import evaluate

recall_metric = evaluate.load("recall")
precision_metric = evaluate.load("precision")
f1_metric = evaluate.load("f1")

print(recall_metric.compute(references=[1, 1, 1, 2, 0], predictions=[1, 1, 0, 0, 2], average="weighted", labels=list(ID_TO_LABEL.keys())))
print(precision_metric.compute(references=[1, 1, 1, 2, 0], predictions=[1, 1, 0, 0, 2], average="weighted", labels=list(ID_TO_LABEL.keys())))
print(f1_metric.compute(references=[1, 1, 1, 2, 0], predictions=[1, 1, 0, 0, 2], average="weighted", labels=list(ID_TO_LABEL.keys())))

{'recall': 0.4}
{'precision': 0.6}
{'f1': 0.4800000000000001}
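
To make the `average="weighted"` numbers above less opaque, here is a plain-Python re-computation of the same toy example. It is a sketch of the usual definition (per-label scores averaged with weights proportional to each label's count in the references), not the `evaluate` library's code, and `weighted_scores` is a hypothetical helper.

```python
def weighted_scores(references, predictions, labels):
    """Compute support-weighted recall, precision and F1 over the given labels."""
    total = len(references)
    recall = precision = f1 = 0.0
    for label in labels:
        support = sum(1 for r in references if r == label)
        predicted = sum(1 for p in predictions if p == label)
        true_pos = sum(1 for r, p in zip(references, predictions) if r == p == label)
        # Undefined ratios (no support or no predictions) are treated as 0.
        label_recall = true_pos / support if support else 0.0
        label_precision = true_pos / predicted if predicted else 0.0
        denom = label_precision + label_recall
        label_f1 = 2 * label_precision * label_recall / denom if denom else 0.0
        weight = support / total
        recall += weight * label_recall
        precision += weight * label_precision
        f1 += weight * label_f1
    return {"recall": recall, "precision": precision, "f1": f1}


result = weighted_scores(
    references=[1, 1, 1, 2, 0],
    predictions=[1, 1, 0, 0, 2],
    labels=[0, 1, 2],
)
print({name: f"{value:.2f}" for name, value in result.items()})
```

Only label 1 contributes here: it has precision 1.0, recall 2/3 and F1 0.8, and it makes up 3 of the 5 references, which gives the 0.4, 0.6 and 0.48 shown above up to floating-point rounding.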


In [None]:
references = target_dataset["label"]
predictions = [classify(text) for text in target_dataset["text"]]

print(recall_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys())))
print(precision_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys())))
print(f1_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys())))