<a href="https://colab.research.google.com/github/aakarsh/rl-llm-calibration-test/blob/main/LLAMA_Setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Idea: ##

Whether LMs are calibrated, and whether they can correctly indicate uncertainty about their generations, is an important question that bears on the reliability and correctness of LM output. In class, we discussed at length the work on calibration by Kadavath et al. [2022], which showed that large base LMs (by Anthropic) are well calibrated in the sense that the probabilities assigned to answer options on BIG-Bench and MMLU (and other benchmark datasets) correlate well with the probability that these options are correct (across trials). For RL-LMs, several observations suggest that calibration deteriorates with RL fine-tuning [OpenAI, 2023, Kadavath et al., 2022], but few consistent comparisons have been performed. Another line of work has focused on actually including indications of uncertainty in LM generations [Mielke et al., 2022, Lin et al., 2022], but all of this work focuses on factual correctness and calibration on abstractive tasks, not on context-dependent expressions, and little comparison to human behavior regarding the expression of knowledge and certainty has been done. Again, RL-LMs have not been closely inspected in this regard. Finally, an issue related to the correctness and confidence of statements generated by LLMs is so-called hallucination.
One intuitive way to avoid hallucinations is to fine-tune LMs with RL such that responses like “I don’t know” also receive high rewards, but it has been argued that it would be rather difficult to keep the models from becoming evasive under this procedure. In addition, there are few datasets or approaches for including “I don’t know” in the first place.
The ideas within this project all concern calibration and uncertainty and constitute a bit of a brainstorm. The focus can be chosen depending on interest. The goals of the subprojects are listed below.

## Goals:
1. Provide a more comprehensive understanding of the calibration of RL-LMs by replicating the first part of the analysis by Kadavath et al. [2022] on a few of the same datasets (e.g., MMLU, TriviaQA), comparing Llama-2 base vs. chat.
2. Investigate LMs’ zero- or few-shot performance on naturalistic uncertainty expressions in context. Specifically, in this project the model will be tested on contexts describing situations where the exact state of the world is uncertain. Given these contexts, we will test whether the expressions “certainly”, “possibly” and “probably” (not) are used in a human-like way depending on the context (e.g., Llama 7b base or chat can be tested). The project will replicate the experiment by Herbstritt and Franke [2016]. This will extend beyond previous work (e.g., Mielke et al. [2022]) by focusing on contextually induced uncertainty (rather than factual QA) and by comparing the results to human data.
3. Try to construct a dataset of inputs and outputs that contains both informative responses and expressions of “I don’t know”. In particular, an existing dataset could be combined with manually constructed examples, e.g., questions about things after the training data cutoff of a particular model, or cases where an evasive response would be more polite. This project idea is completely open-ended and left to your ideas and creativity.

## Hypotheses:

 The testable hypotheses for two of the subprojects would be:

 1. If previous observations are further borne out, we will observe worse calibration of RL-LMs on benchmark datasets compared to their base LM counterparts. However, strategies like increased sampling temperature could improve calibration (cf. Kadavath et al. [2022], but note that this was done for evaluating the correctness of a particular answer option, given a question).
 2. If the tested model is well aligned with human behavior, the probabilities assigned to the different uncertainty expressions in the different contexts will match the human data from the reported experiment.
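
The temperature idea in hypothesis 1 can be illustrated with a small sketch. It assumes that per-option scores (such as log probabilities for options A, B and C) are available as a dictionary, and that temperature is applied by dividing the scores before a softmax; the helper name `option_probabilities` and the example scores are hypothetical and not part of the project description.

```python
import math


def option_probabilities(log_probs, temperature=1.0):
    """Turn per-option log scores into a normalized distribution.

    Scores are divided by `temperature` before the softmax: values above 1
    flatten the distribution, values below 1 sharpen it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {option: score / temperature for option, score in log_probs.items()}
    # subtract the maximum before exponentiating for numerical stability
    max_score = max(scaled.values())
    exp_scores = {option: math.exp(score - max_score) for option, score in scaled.items()}
    total = sum(exp_scores.values())
    return {option: value / total for option, value in exp_scores.items()}


# hypothetical per-option scores, for illustration only
scores = {"A": -1.0, "B": -2.0, "C": -3.0}
probs_t1 = option_probabilities(scores, temperature=1.0)
probs_t2 = option_probabilities(scores, temperature=2.0)
print(probs_t1)
print(probs_t2)
```

A higher temperature lowers the probability of the top-ranked option without changing the ranking, which is why it may reduce overconfidence; whether it actually improves calibration for a given model is an empirical question.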

## Work Steps

Work steps: Calibration:

1. Replicate the calibration experiment by Kadavath et al. [2022] on Llama-2 7b base and chat. Select a few of the same benchmarks, get the probabilities of the different response options, bin them following the reported procedure, and compute the frequencies of correct responses in each bin.

2. Compare the calibration plots of the base and the chat model.

3. *Run the same analyses for 13b and 70b models. Instead of the 70b model (which will require more intricate compute set up), one or two other analyses from the paper can be replicated.*
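
The binning in step 1 could be sketched as follows. This is a minimal illustration that assumes equal-width confidence bins and adds an expected calibration error (ECE) summary; the procedure reported by Kadavath et al. [2022] may differ, and the function names and toy data are hypothetical.

```python
def calibration_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns (mean_confidence, accuracy, count) for every non-empty bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    # per bin: [sum of confidences, number correct, number of items]
    stats = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, is_correct in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        stats[idx][0] += p
        stats[idx][1] += int(is_correct)
        stats[idx][2] += 1
    return [(s / n, k / n, n) for s, k, n in stats if n > 0]


def expected_calibration_error(bins):
    """Count-weighted mean gap between confidence and accuracy."""
    total = sum(n for _, _, n in bins)
    return sum(n / total * abs(conf - acc) for conf, acc, n in bins)


# toy data, for illustration only
confidences = [0.95, 0.85, 0.55, 0.65, 0.35, 0.25]
correct = [True, True, True, False, False, False]
bins = calibration_bins(confidences, correct, n_bins=5)
ece = expected_calibration_error(bins)
for mean_conf, accuracy, count in bins:
    print(f"confidence {mean_conf:.2f}  accuracy {accuracy:.2f}  n={count}")
print(f"ECE: {ece:.3f}")
```

Plotting accuracy against mean confidence per bin gives the calibration plot compared in step 2; a perfectly calibrated model lies on the diagonal.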


In [None]:
# note: if you are running the code on Colab, you may need to install the HuggingFace 'transformers' library
# and the other dependencies used below; the following lines do that:

!pip install transformers==4.31.0
!pip install torch
!pip install trl datasets evaluate nltk
!pip install bitsandbytes-cuda110
!pip install bitsandbytes
!pip install accelerate

Collecting transformers==4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.2
    Uninstalling tokenizers-0.15.2:
      Successfully uninstalled tokenizers-0.15.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed tokenizers-0.13.3 transformers-4.31.0
Collecting trl
  Downloading 

In [None]:
from transformers.utils import is_accelerate_available, is_bitsandbytes_available

print("accelerate", is_accelerate_available())
print("is_bitsandbytes_available", is_bitsandbytes_available())

accelerate True
is_bitsandbytes_available True


In [None]:
# import libraries
import torch
import evaluate
from datasets import (
    load_dataset,
    Dataset
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LlamaForCausalLM,
    LlamaTokenizer
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.c

In [None]:
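from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization settings, collected in a single config object
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)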

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [None]:
prompt_templates = ["""
You are at a {country} supermarket. You walk up to the cashier and greet them by saying:
A. Hello, how are you?
B. Buy
C. Hello
What would  you say A / B / C?""",
"""You are {country}. You find out that your friend is late to your meeting, you greet him by saying:
A. I am glad you came.
B. Late.
C. You are late, don't you respect my time.""",
"""
You are at a {country} train station, you realize your train is late by 10 minutes. You
A. Make small talk with other passengers.
B. Stand on train tracks to protest.
C. Decide to wait patiently.
""",
]

In [None]:
def get_log_prob_of_completion(
        model,
        tokenizer,
        prompt,
        completion,
        device=torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
):
        """
        Convenience function for computing the mean per-token log probability
        of a completion given a prompt.
        """
        # tokenize the prompt and the completion together;
        # truncate to 1024 tokens, which is well within the
        # 4096-token context window of Llama-2
        input_ids = tokenizer(
                prompt + completion,
                return_tensors='pt',
                truncation=True,
                max_length=1024,
        )['input_ids'].to(device)

        # separately tokenize the prompt to find where the completion starts
        # (this assumes the prompt tokens form a prefix of the joint tokenization)
        input_ids_prompt = tokenizer(
                prompt,
                return_tensors='pt',
                truncation=True,
                max_length=1024
        )['input_ids'].to(device)
        prompt_length = input_ids_prompt.shape[-1]

        # create attention mask and position ids
        attention_mask = (input_ids != tokenizer.eos_token_id).to(dtype=torch.int64)
        position_ids = attention_mask.cumsum(-1)-1
        # run the model without tracking gradients
        with torch.no_grad():
                out = model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        position_ids=position_ids
                )

        # the logits at position i predict token i+1, so drop the last position;
        # also drop the batch dimension (the batch size is always 1 here)
        logits = out.logits[:, :-1].squeeze(0)
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

        # retrieve the log probability of each token that actually follows
        target_ids = input_ids[:, 1:].squeeze(0).unsqueeze(-1)
        log_probs_tokens = log_probs.gather(dim=-1, index=target_ids).squeeze(-1)

        # keep only the positions that score completion tokens
        continuationConditionalLogProbs = log_probs_tokens[(prompt_length - 1):]
        completion_log_prob = torch.mean(
                continuationConditionalLogProbs
        ).cpu()

        return completion_log_prob
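
The shift-by-one indexing used when slicing out the completion scores is easy to get wrong, so here is a small stand-alone illustration of the index arithmetic. It is a sketch under the assumption that the prompt tokens form a prefix of the joint tokenization; the helper name `completion_positions` and the token counts are hypothetical.

```python
def completion_positions(num_prompt_tokens, num_total_tokens):
    """Indices into the shifted log-prob sequence that score completion tokens.

    The shifted sequence has num_total_tokens - 1 entries: entry i holds the
    log probability of token i + 1 given tokens 0..i.
    """
    if not 0 < num_prompt_tokens <= num_total_tokens:
        raise ValueError("need 0 < num_prompt_tokens <= num_total_tokens")
    return list(range(num_prompt_tokens - 1, num_total_tokens - 1))


# hypothetical example: 5 prompt tokens followed by 2 completion tokens
positions = completion_positions(num_prompt_tokens=5, num_total_tokens=7)
# tokens 5 and 6 are the completion; they are scored by shifted entries 4 and 5
print(positions)
```

The slice start `num_prompt_tokens - 1` corresponds to the `input_ids_prompt.shape[-1]-1` offset in the function above.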

In [None]:
model_results = []
results = []
for prompt_template in prompt_templates:
  context_options = [{"country": "German"}, {"country":"American"}]
  context_results = []
  for context in context_options:
    prompt = prompt_template.format(**context)
    selections = ["A", "B", "C"]
    selection_log_prob_opt_option = []
    for selection  in selections:
      log_prob_opt_option = get_log_prob_of_completion(
          model=model,
          tokenizer=tokenizer,
          prompt=prompt,
          completion=selection,
      )
      selection_log_prob_opt_option.append(log_prob_opt_option.detach().numpy())
    context_results.append(dict(zip(selections, selection_log_prob_opt_option)))
  results.append({"model": model.__class__.__name__, "prompt_template": prompt_template,   "context_results": list(zip(context_options, context_results)) })
model_results.append({"model_tag": "llama", "results": results})

# use a separate loop variable so that the `results` list above is not shadowed
for model_result in model_results:
  for result in model_result["results"]:
    print("Model: {model}".format(**result))
    print("Prompt:{prompt_template} ".format(**result))
    for context_result in result["context_results"]:
        print("Context Options", context_result)

Model: LlamaForCausalLM
Prompt:
You are at a {country} supermarket. You walk up to the cashier and greet them by saying:
A. Hello, how are you?
B. Buy
C. Hello
What would  you say A / B / C? 
Context Options ({'country': 'German'}, {'A': array(-9.659343, dtype=float32), 'B': array(-10.971843, dtype=float32), 'C': array(-11.073405, dtype=float32)})
Context Options ({'country': 'American'}, {'A': array(-9.85505, dtype=float32), 'B': array(-11.120675, dtype=float32), 'C': array(-11.112863, dtype=float32)})
Model: LlamaForCausalLM
Prompt:You are {country}. You find out that your friend is late to your meeting, you greet him by saying:
A. I am glad you came.
B. Late.
C. You are late, don't you respect my time. 
Context Options ({'country': 'German'}, {'A': array(-12.845149, dtype=float32), 'B': array(-13.337337, dtype=float32), 'C': array(-12.962337, dtype=float32)})
Context Options ({'country': 'American'}, {'A': array(-12.873283, dtype=float32), 'B': array(-13.068596, dtype=float32), 'C':

In [22]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
from datasets import load_dataset

dataset = load_dataset("cais/mmlu")

ValueError: Config name is missing.
Please pick one among the available configs: ['abstract_algebra', 'all', 'anatomy', 'astronomy', 'auxiliary_train', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']
Example of usage:
	`load_dataset('mmlu', 'abstract_algebra')`
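
As the error message indicates, the call needs a config name as its second argument, for example `load_dataset("cais/mmlu", "all")` or a single subject such as `"abstract_algebra"`. Once rows are available, each one has to be rendered as a lettered multiple-choice prompt before it can be scored. The sketch below assumes that a row is a dictionary with the fields `question`, `choices` (a list of four strings) and `answer` (an integer index); the helper names, the prompt layout and the example row are hypothetical.

```python
OPTION_LABELS = ["A", "B", "C", "D"]


def format_mmlu_prompt(row):
    """Render one multiple-choice row as a lettered prompt ending in 'Answer:'."""
    lines = [row["question"].strip()]
    for label, choice in zip(OPTION_LABELS, row["choices"]):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def gold_label(row):
    """Letter of the correct option, assuming `answer` is an index into `choices`."""
    return OPTION_LABELS[row["answer"]]


# hypothetical row for illustration, not taken from the dataset
example_row = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}
prompt = format_mmlu_prompt(example_row)
print(prompt)
print("Gold:", gold_label(example_row))
```

The resulting prompt could then be passed to `get_log_prob_of_completion` once per option letter, with the gold label used to mark each prediction as correct or incorrect.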