# Assignment 2
Bradley Thompson - CS 510 Large Language Models PDX Winter 2024

## Experimental Setting 1 - Preamble
Note: Included a bunch of my long-form thinking as I iterated on the implementation to figure out how to get this classifier to work correctly.

To start, I spent a long time just trying to figure out how to implement the classifier for this assignment. I got started right away, but blasted through many hours just trying to figure it out, not finding any really good resources online. I found one that threw me off base, because I started looking into using huggingface's provided AutoModelClassifiers. After talking to the professor, I learned that this was not the intention for the solution, so I switched back to basing my work on assignment 1. I didn't gain any significant headway until the TA linked some articles in Slack, which gave me a base to research on. I eventually got to the point that I realized how I could check the output scores for all tokens in the tokenizer's vocabulary, and then get the highest log probability for the tokens that mapped to my target labels, to get the models classification.

At this point I ran into a few more residual issues around calculating metrics for the multi-class predictions, and also ran into an issue with the vocab because apparently bloom doesn't have "Neutral" in its embeddings. I wanted to make my labels case-insensitive with the hope that it would improve accuracy, so I went ahead and made a map to consider both component tokens ("Neut" and "ral") the same as "neutral".

Finally, I was able to play around with hyper parameters. The above work quantifies almost all waking hours for a single weekend, and then many multi-hour sessions after work across 2 weeks. Luckily, I have a pretty nice GPU at home, and have CUDA support, so I am able to run the models pretty fast for no added Colab cost.

I started with no fancy configuration other than setting the model to only generate 1 new token to start. My f1 score was around `0.16` with Bloom 560m on the training set. I set out with the goal of trying to maximize this to some extent, before running my classifier implementation for all models across the test set.

First hyperparameter tweak was to try out `top_k` sampling; `top_k=2` had no effect on performance. Increasing to 4 saw an extremely marginal improvement, so I left it in before trying out `temperature` tuning. After decreasing temperature to `0.6` I noticed performance was unchanged. So I tried removing `top_k` and f1 score jumped up to `0.278`. So, for some reason, the combination of those sampling parameters was not good. I tested out two other generation configuration settings that I found as well: `epsilon_cutoff` and `eta_cutoff`. I didn't really look into what these do, just wanted to play around and see if there was any observable effect. First, `epsilon_cutoff` dropped my f1 score from `0.278` to around `0.2`; subsequently tested out `eta_cutoff`, with roughly the same result.

After tweaking the above parameters, I decided that simply adjusting `temperature` and my input prompt would be sufficient for achieving decent results. Decreasing temperature from `0.6` to `0.3` had a negligible effect. As a final test to confirm what parameters I'd like to stick with before testing across all models, to compare and see the effect of model size on performance, I tried out a `temperature` value greater than the default. At a value of `1.5` my f1 score on the training set with Bloom 560m was roughly unchanged, which was surprising. It maybe suggested that the sampling method change, from greedy decoding, was what actually resulted in some improvement. So, using sampling was probably what caused the increase in f1 score. After realizing that, I tried out beam sampling to see if it had any effect. It also caused the performance to decrease. So, I stuck with sampling as my decoding strategy, vs. greedy search, and left the temperature at its default value.

### Experimental Setting 1

Finally, I let all models run on the test set and got these result metrics:

```
[
    ('bigscience/bloom-560m', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805}),
    ('bigscience/bloom-1b1', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805}),
    ('bigscience/bloom-1b7', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805}),
    ('bigscience/bloomz-560m', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805}),
    ('bigscience/bloomz-1b1', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805}),
    ('bigscience/bloomz-1b7', {'recall': 0.406896551724138, 'precision': 0.2733517463851896, 'f1': 0.32601205857019805})
]
```

So, clearly something was wrong with my implementation as the performance was unaffected by the model. At this point, I was starting to get close on time with the assignment deadline, though I'd spent a large sum of hours on the assignment already. Still, I tried to take a step back and see what was wrong.

I retried the approach of focusing on the first model and checking the outputs. What I immediately noticed is that the log probs for the token was either wrong, or inhibiting the approach entirely. On a random sample I tried to classify and check the output probability for each token:

```
111017 - negative: 0.0
149414 -  Negative: 0.0
40979 -  neutral: 0.0
76420 -  Neut: 0.0
4343 - ral: 0.0
18121 -  positive: 0.0
139904 -  Positive: 0.0
```

So, clearly the model isn't going to classify well if it isn't assigning any value to my target tokens at all. In fact, I checked all output scores across the entire vocab, and the only tokens with values greater than zero were: `'</s>)\n    a c b the ( 1 A T I "\n\n J is - i it * In [ if The “ \\ It what This Is true false You If How There Do "S What \n\n True "The "I Yes Your yes Does "It'`. I was honestly surprised; I had assumed that all tokens in the vocab would have some likelihood that the model would consider, yet here I was seeing that the considered tokens were highly dependent on the input text (premise and hypothesis). My starting hypothesis was "This text has a positive sentiment, true or false:", which apparently resulted in none of the target label tokens being present in the model's guess for the next token. I started playing around with the hypothesis and found that this string was able to at least get all target label tokens identified and at least slightly considered for the next token: "The text is neutral. Of the labels positive, negative or neutral, the text is:". I had to double up the mention of "neutral" to even get it on the board!

So, with this revelation, I looped back around to reconsidering my prompt a.k.a hypothesis, with the plan of retesting some parameters as well as comparing models. I settled on this string, which reiterated all my label options and then request one of the 3 target labels:

```
f"Text categories: {LABELS}. Of the labels neutral, negative or positive, the text is:"
```

This prompt was able always include at least some token that would represent one of each class. After viewing the output for a few samples, I decided to include more label tokens to additionally identify "Pos" and "Neg" as positive and negative, respectively.

With that said, when I retried parameter tuning, as well as running on all models, and there was no effect on model performance -- all stayed the same. So in the end I just had to move on to the second experimental setting.

### Experimental Setting 2




In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
MODELS = (
    "bigscience/bloom-560m",
    "bigscience/bloom-1b1",
    "bigscience/bloom-1b7",
    "bigscience/bloomz-560m",
    "bigscience/bloomz-1b1",
    "bigscience/bloomz-1b7",
)

model_name = MODELS[0]
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [5]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "english")
target_dataset = dataset["train"]

# Check out what the data looks like:
target_dataset[2]

{'text': '"Frank Gaffrey\\u002c Cliff May\\u002c Steve Emerson: Brilliant. \\""""Looming Threats: Iran\\u002c Hezbollah Hamas\\"""" is the best #cufidc session I\\u2019ve had thus far." ',
 'label': 2}

In [6]:
import math

POSITIVE="positive"
NEUTRAL="neutral"
NEGATIVE="negative"


# Note: we have 2 options so we can be case insensitive
LABELS = "negative Negative neutral Neutral positive Positive Pos Neg"
ID_TO_LABEL = {
    0: NEGATIVE,
    1: NEUTRAL,
    2: POSITIVE,
}
# Note: Found that bloom apparently doesn't have `Neutral` in its vocabulary; so, will consider either `Neut` or `ral`.
TOKEN_INDEX_TO_LABEL_ID = {
    0: 0,
    1: 0,
    2: 1,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
    7: 2,
    8: 0,
}

target_label_tokens = tokenizer(LABELS, add_special_tokens=False)["input_ids"]
print("Gathering vocab indices for target tokens:", tokenizer.decode(target_label_tokens))
for token in target_label_tokens:
    print(f"Token {token}: {tokenizer.decode(token)}")
target_label_tokens

Gathering vocab indices for target tokens: negative Negative neutral Neutral positive Positive Pos Neg
Token 111017: negative
Token 149414:  Negative
Token 40979:  neutral
Token 76420:  Neut
Token 4343: ral
Token 18121:  positive
Token 139904:  Positive
Token 18683:  Pos
Token 9775:  Neg


[111017, 149414, 40979, 76420, 4343, 18121, 139904, 18683, 9775]

In [7]:
from transformers import GenerationConfig
import torch as t

config = {
    "min_new_tokens": 1,
    "max_new_tokens": 1,
    # "do_sample": True,
    # "temperature": 0.7,
}

HYPOTHESIS = f"Text categories: {LABELS}. Of the labels neutral, negative or positive, the text is:"

VERY_VERBOSE = False

def classify(premise: str) -> int:
    """
    Use model to generate output scores across entire vocab for the next token based on probability
    of entailment for the given premise/hypothesis pair.
    https://joeddav.github.io/blog/2020/05/29/ZSL.html#Classification-as-Natural-Language-Inference
    :param premise: some input text string to be classified based on `LABELS`.
    :returns: classification label id
    """
    inputs = tokenizer.encode(premise, HYPOTHESIS, return_tensors="pt")
    gen_config: GenerationConfig = GenerationConfig.from_dict(config)
    # Get prediction scores across vocab for the only token generated b/c of gen config and normalize w/ softmax
    output = model.generate(inputs, gen_config, return_dict_in_generate=True, output_scores=True)["scores"][0]
    vocab_probs = output.softmax(dim=1)
    # Get probabilities of our target labels by index in our vocab
    labels_log_probs = t.index_select(vocab_probs, 1, t.tensor(target_label_tokens))[0]
    if VERY_VERBOSE:
        for token, prob in zip(target_label_tokens, labels_log_probs):
            print(f"{token} - {tokenizer.decode(token)}: {prob}")
    # Get highest log prob label
    labels_index = t.argmax(labels_log_probs).item()
    if VERY_VERBOSE:
        print("Selection index: ", labels_index)
    return TOKEN_INDEX_TO_LABEL_ID[labels_index]

sample = target_dataset[0]["text"]
print(f"Input text: {sample}\nClassification: {ID_TO_LABEL[classify(sample)]}")

Input text: okay i\u2019m sorry but TAYLOR SWIFT LOOKS NOTHING LIKE JACKIE O SO STOP COMPARING THE TWO. c\u2019mon America aren\u2019t you sick of her yet? (sorry) 
Classification: negative


In [107]:
%%script false --no-raise-error
"""
Note: Uncomment the above line to run this cell
This is used to test out what tokens are considered possible as the next token after premise/hypothesis.
"""

from transformers import GenerationConfig
import torch as t

config = {
    "min_new_tokens": 1,
    "max_new_tokens": 1,
    "do_sample": True,
    # "temperature": 1.5,
}


sample = target_dataset[0]["text"]
HYPOTHESIS = f"Text categories: {LABELS}. Of the labels neutral, negative or positive, the text is:"

inputs = tokenizer.encode(sample, HYPOTHESIS, return_tensors="pt")
gen_config: GenerationConfig = GenerationConfig.from_dict(config)
output = model.generate(inputs, gen_config, return_dict_in_generate=True, output_scores=True)["scores"][0]

mask = output[0] >= 0
indices = mask.nonzero()
tokenizer.decode(indices.transpose(0, 1)[0])

In [8]:
import evaluate
from typing import List

recall_metric = evaluate.load("recall")
precision_metric = evaluate.load("precision")
f1_metric = evaluate.load("f1")

def calculate_metrics(references: List[int], predictions: List[int]):
    """
    Calculate recall, precision and f1 scores.
    :returns: Dict of metric name to value across all predictions
    """
    return recall_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys())) | \
        precision_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys())) | \
        f1_metric.compute(references=references, predictions=predictions, average="weighted", labels=list(ID_TO_LABEL.keys()))

calculate_metrics([1, 1, 1, 2, 0], [1, 1, 0, 0, 2])

{'recall': 0.4, 'precision': 0.6, 'f1': 0.4800000000000001}

In [9]:
VERBOSE = False

def classify_with_step_reports(sample: str, model_name: str, step: int) -> int:
    if VERBOSE:
        print(f"[{model_name}] classification step {step} / {len(target_dataset)}")
    return classify(sample)

In [106]:
# Testing out different hyper parameters on a single model (Bloom 560m) for the train set.
references = target_dataset["label"]
predictions = [ classify_with_step_reports(sample, "bloom 560m", i) for i, sample in enumerate(target_dataset["text"]) ]

calculate_metrics(references, predictions)

{'recall': 0.32735182164219684,
 'precision': 0.359562136553287,
 'f1': 0.20588972452252285}

In [10]:
# Run across all models 
from typing import Dict

def run(model_name: str) -> Dict[str, float]:
    """
    Get a tokenizer and model by model name, then classify all of `target_dataset` and calculate metrics.
    :returns: Dict of calculated metrics for this model
    """
    if VERBOSE:
        print(f"Starting classification on {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    target_dataset = dataset["test"]
    references = target_dataset["label"]
    predictions = [ classify_with_step_reports(sample, model_name, i) for i, sample in enumerate(target_dataset["text"]) ]
    return calculate_metrics(references, predictions)

result_metrics = [ (model, run(model)) for model in MODELS ]
result_metrics

[('bigscience/bloom-560m',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356}),
 ('bigscience/bloom-1b1',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356}),
 ('bigscience/bloom-1b7',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356}),
 ('bigscience/bloomz-560m',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356}),
 ('bigscience/bloomz-1b1',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356}),
 ('bigscience/bloomz-1b7',
  {'recall': 0.3528735632183908,
   'precision': 0.48830283101382843,
   'f1': 0.24746735557369356})]

In [None]:
import matplotlib.pyplot as plt
from typing import List, Dict

def plot(all_metrics: List[Dict[str, float]]):
    pass