# Minimal code to run an LLM

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

model_name = '/home/xxh584/data/hf_models/Meta-Llama-3.1-8B-Instruct'
# model_name = '/home/xxh584/data/hf_models/Llama-3.2-1B'
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    attn_implementation="flash_attention_2",  # needs the flash-attn package and a supported GPU
    torch_dtype=torch.bfloat16,
).cuda()
model.eval()  # inference mode: disables dropout


input_text = "What is the capital of France? The answer is: "
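# Tokenize the prompt and move the input IDs and attention mask to the model's device
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Run the model
tokens = model.generate(**inputs, max_new_tokens=20, use_cache=True)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs['input_ids'].shape[1]
output_text = tokenizer.decode(tokens[0][prompt_len:], skip_special_tokens=True)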
print(output_text)

## How do we quantify language quality?

N-Grams
- A contiguous sequence of n items (e.g. words or characters) from a given sample of text or speech
- Challenges: storage limitations, longer sequences are less likely to be repeated (sparsity), and a lack of semantic understanding
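As a minimal sketch of the definition above, the snippet below extracts word-level n-grams from a toy sentence. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence (illustrative only), split on whitespace
tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(3))
```

Even in this tiny example every bigram occurs only once, although the unigram "the" occurs twice, which hints at the sparsity problem as n grows.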

BLEU score
- A way to use n-grams to measure "precision"
- For each n-gram length n, precision is calculated as the ratio of the number of overlapping n-grams to the total number of n-grams in the candidate translation
- Brevity Penalty
  - Penalizes the model for being too brief, since a very short candidate could otherwise reach a high precision
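The two BLEU ingredients above can be sketched as follows. This is a simplified illustration, not a full BLEU implementation: it handles a single reference and does not combine several n-gram lengths. Clipping the overlap counts and the exponential form of the brevity penalty follow the standard BLEU definition rather than these notes, and the function names are assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Overlapping n-grams divided by the total n-grams in the candidate."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate count by its count in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def brevity_penalty(candidate, reference):
    """1.0 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

p1 = ngram_precision(candidate, reference, 1)
bp = brevity_penalty(candidate, reference)
print(p1, bp)
```

Here the short candidate reaches perfect unigram precision, and it is only the brevity penalty that lowers its score.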

ROUGE score
- measures overlap of n-grams between the generated text and the reference text
- ROUGE-N = (Number of overlapping n-grams) / (Total n-grams in the reference text)
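The ROUGE-N formula above can be sketched in a few lines. This is a recall-only illustration with a single reference; the function name `rouge_n_recall` and the toy sentences are assumptions, and the clipped overlap count is one common way to count matches.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Overlapping n-grams divided by the total n-grams in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it appears in the candidate
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

rouge_1 = rouge_n_recall(candidate, reference, 1)
print(rouge_1)
```

The candidate covers three of the six reference unigrams, so the score reflects how much of the reference was captured rather than how much of the candidate was correct.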

BLEU vs. ROUGE
- Precision vs. recall
- BLEU is precision-oriented. It measures how many n-grams in the generated text match the reference text, but it doesn't consider whether the reference text contains additional important information that's missing in the generated text
- ROUGE is more recall-oriented. It measures how much of the reference text is captured by the generated text. It is more forgiving of shorter generated texts and useful for summarization tasks, where it's crucial that the generated text includes key information

To address the limitations of traditional NLP eval:
- BERTScore and word vectors (semantics)

![BERT Image](/Users/efang/Desktop/coding/Intro-to-ML/CSDS600/info/bert.png)

BERT operates on (subword) tokens, so BERTScore tolerates spelling mistakes and other small errors. However, BERTScore still has issues: for each token, it keeps only the maximum similarity score from the similarity matrix between the reference and candidate tokens. This means that semantics will still not be represented well.

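Example: the two sentences below share almost every token, so they score as highly similar despite having opposite meanings.
- I hate playing sports and have never wanted to visit the Olympics.
- I love playing sports and have always wanted to visit the Olympics.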

Problems of BERTScore and its descendants
- Maximum cosine similarity values are selected as representational values
- "Differences" will be under-represented unless the texts are catastrophically different
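The maximum-similarity matching described above can be sketched with toy vectors. The 4-dimensional "embeddings" below are made up for illustration (real BERTScore uses contextual BERT embeddings and optional IDF weighting), and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1: each token keeps only its best match on the other side."""
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical embeddings: "hate" and "love" are assumed to lie close together
reference = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]        # "I", "hate", "sports"
candidate = [[1, 0, 0, 0], [0, 0.8, 0.6, 0], [0, 0, 0, 1]]    # "I", "love", "sports"

score = greedy_match_f1(candidate, reference)
print(round(score, 3))
```

Two of the three tokens match perfectly and the third still finds a fairly similar partner, so the overall score stays high even though the sentences mean opposite things. With longer sentences the single differing token is diluted even further.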

BLEURT
- built using multiple phases of transfer learning starting from pretrained BERT model
- A regression model trained on ratings data, so inputs that differ from its training data may receive unreliable scores

T5
- Text-to-Text Transfer Transformer; treats every task as a text generation problem

Datasets and Correlation studies
- Correlation studies have already been done for the existing metrics (BERTScore, BLEURT, etc.)
- Datasets are as close to a gold standard as you can get at scale (Rajpurkar et al. 2018)
- SQuAD dataset (Stanford Question Answering Dataset)
  - Performance is judged by "exact match" between the actual answer and the predicted answer, alongside a token-overlap F1 score
- Translation: WMT dataset (Tran et al. 2021)
  - Contains several translation pairs; widely used for training and testing on the translation task
  - Translation evaluation largely suffers from the same issues as semantic similarity prediction
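The SQuAD-style "exact match" check mentioned in the list above can be sketched as follows. The normalization steps (lowercasing, stripping punctuation and English articles) are loosely modeled on the usual SQuAD evaluation and are an assumption here, as are the function names and example answers.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    """Token-overlap F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(exact_match("Paris, France", "Paris"), token_f1("Paris, France", "Paris"))
```

The second example shows why exact match alone is harsh: a prediction that contains the right answer plus one extra word fails exact match but still earns partial credit under token F1.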





## LLM Safety and Responsible AI

How do we measure safety in LLMs?
- Language models predict the next token in a sequence
- Can we see if there is a bias in the distribution of tokens?

![LLM Safety Image](/Users/efang/Desktop/coding/Intro-to-ML/CSDS600/info/llmsafety.png)

As seen in the equation, we compare the probability the model assigns to female tokens given a context against the probability it assigns to male tokens given the same context. This is a measurement of gender bias.
- In one experiment, this measurement returned -2.14 for the "competent" variant, -1.14 for the "incompetent" variant, and -1.1 for the neutral variant.
- There was a lot of bias towards males.
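A rough sketch of this kind of measurement is shown below. The exact equation lives in the figure and is not reproduced in the text, so the log-ratio form used here is an assumption (suggested by the negative values reported above). The token probabilities are made-up numbers, not model outputs, and `log_ratio_bias` is a hypothetical helper name.

```python
import math

def log_ratio_bias(p_female, p_male):
    """Assumed form of the bias measure: log of P(female | context) / P(male | context)."""
    return math.log(p_female / p_male)

# Hypothetical next-token probabilities for some context (illustrative values only)
next_token_probs = {"she": 0.05, "he": 0.43}

score = log_ratio_bias(next_token_probs["she"], next_token_probs["he"])
print(score)
```

Under this assumed form, a score of 0 means both genders are equally likely, and a negative score means the model assigns more probability to the male token for that context.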


## Reasoning

### What is Chain of Thought? (CoT)
- A series of intermediate reasoning steps, like walking yourself through the logic of a question step by step

### But what is reasoning?
- Reasoning is the cognitive process of drawing conclusions, making decisions, and solving problems based on logical thinking, evidence, and analysis. It involves critical thinking, as well as the use of deduction and induction, to arrive at sound judgements
- Reasoning depends on contextual information, which is what CoT attempts to supply in language modeling
- CoT gives that context to LLMs, such that the probability of each generated token is conditioned on the chain of previous tokens
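One simple way to elicit a chain of thought is to change the prompt so that the model writes out intermediate steps before its answer. The sketch below builds a plain prompt and a zero-shot CoT prompt for the same question; the helper `build_prompt`, the prompt template, and the example question are illustrative assumptions.

```python
def build_prompt(question, chain_of_thought=False):
    """Build a plain prompt, or a zero-shot chain-of-thought prompt, for a question."""
    prompt = f"Q: {question}\nA:"
    if chain_of_thought:
        # Cue the model to generate intermediate reasoning before the final answer
        prompt += " Let's think step by step."
    return prompt

question = "A shop has 23 apples, sells 9, then receives 12 more. How many apples does it have?"
plain_prompt = build_prompt(question)
cot_prompt = build_prompt(question, chain_of_thought=True)

print(plain_prompt)
print(cot_prompt)
```

Either string could be tokenized and passed to a `generate` call like the one at the top of these notes. With the CoT prompt, the final answer tokens are conditioned on the reasoning tokens the model has just produced.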

### Bounded Rationality
- Idea that when individuals make decisions, their rationality is limited by the information they have, the cognitive limitations of their minds, and the finite amount of time they have to make a decision.
- <a href="https://thedecisionlab.com/biases/bounded-rationality">link</a>

### LLMs and Cognitive Biases
- human survival
- survivorship bias
- bounded rationality
- mental shortcuts


# Interpretability, Robustness, and Trustworthiness

## What is explainable AI?
- At its core, it is about making AI understandable to people
- Challenging because not all models are directly explainable
- Explainability-performance tradeoff

### Jailbreaking and Mitigation Strategies
- Few-shot and many-shot jailbreaking
    - Giving an LLM examples of malicious use cases in the prompt, thereby "jailbreaking" it into providing malicious answers.

### Standard Benchmarks
- There exist benchmarks for LLM safety, such as SafetyBench and R-Judge