<a href="https://colab.research.google.com/github/bhav380-2/Deep-Learning/blob/main/fewShots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import numpy as np
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
from google.colab import userdata
from huggingface_hub import notebook_login
import pandas as pd
from datasets import Dataset

In [None]:
# !pip install -U bitsandbytes>=0.46.1
# !pip install -U transformers accelerate


In [65]:
notebook_login()

# Load Model

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

In [None]:
save_path = "./llama3-8b-4bit"
def save_model():
  model.save_pretrained(save_path)
  tokenizer.save_pretrained(save_path)

def load_saved_model():
  model = AutoModelForCausalLM.from_pretrained(
    save_path,
    device_map="auto",
    quantization_config=bnb_config
  )


# GPT-3 Paper

**Language Models are few shot Learners**

## RACE Benchmark

### Load Dataset

In [None]:
test_ds = load_dataset("race", "high", split="test")
# train_ds = load_dataset("race", "high", split="train")

In [None]:
test_ds[0]

{'example_id': 'high19432.txt',
 'article': 'The rain had continued for a week and the flood had created a big river which were running by Nancy Brown\'s farm. As she tried to gather her cows to a higher ground, she slipped and hit her head on a fallen tree trunk. The fall made her unconscious for a moment or two. When she came to, Lizzie, one of her oldest and favorite cows, was licking her face. \nAt that time, the water level on the farm was still rising. Nancy gathered all her strength to get up and began walking slowly with Lizzie. The rain had become much heavier, and the water in the field was now waist high. Nancy\'s pace got slower and slower because she felt a great pain in her head. Finally, all she could do was to throw her arm around Lizzie\'s neck and try to hang on. About 20 minutes later, Lizzie managed to pull herself and Nancy out of the rising water and onto a bit of high land, which seemed like a small island in the middle of a lake of white water. \nEven though it 

In [None]:
train_ds[7]

{'example_id': 'high4558.txt',
 'article': "Understanding the process of making career choices and managing your career is a basic life skill that everyone should understand.\nYour career decisions have such a profound effect on all aspects of your life. It's important to have the knowledge and resources needed to make smart, informed decisions. Whether you are looking for a new job, aiming to take the next step at your current job or planning your retirement options, you are making career decisions. Using good resources and the guidance of a career counselor can help you to make those decisions well.\nMany people mistakenly believe that choosing a career is a one-time event that happens some time in early adulthood. However, career management is actually a life-long process, and we continue to make consequential   career choices over the years. When people want to take action in their career, career management and job search are about so much more than writing a good resume. If you le

In [None]:
# def preprocess_dataset(dataset):
#   pre_processed = {}
#   for row in dataset:
#     example_id = row['example_id']
#     if not pre_processed.get(example_id):
#       pre_processed[example_id] = {'article': row['article'], 'mcq':[]}
#     mcq = {
#         'question': row['question'],
#         'options': row['options'],
#         'answer': row['answer']}

#     pre_processed[example_id]['mcq'].append(mcq)

#   reformatted_data = list(pre_processed.values())
#   return Dataset.from_list(reformatted_data)

# test_ds_processed1 = preprocess_dataset(test_ds)


In [None]:
# optimised
def preprocess_dataset(dataset):
  df = dataset.to_pandas()
  # aggregate questions, options, and answers into a list of dicts
  grouped = df.groupby(['example_id', 'article']).apply(
      lambda x: x[['question', 'options', 'answer']].to_dict('records')
  ).reset_index(name='mcq')
  return Dataset.from_pandas(grouped)
test_ds_processed2 = preprocess_dataset(test_ds)

  grouped = df.groupby(['example_id', 'article']).apply(


In [None]:
test_ds_processed2[0]

{'example_id': 'high10001.txt',
 'article': 'Studies show that you may be lied to every day anywhere from 10 to 200 times. We say, "Nice song." "Honey, you don\'t look fat in that, no." But another study showed that strangers lied three times within the first 10 minutes of meeting each other. We lie more to strangers than we lie to coworkers. Men lie eight times more about themselves than they do other people. Women lie more to protect other people. If you\'re married, you\'re going to lie to your wife/ husband in one out of every 10 communications. If you\'re unmarried, that number drops to three. But look, if at some point you got lied to, it\'s because you agreed to get lied to. Truth about lying: lying\'s a cooperative act. Not all lies are harmful. Sometimes we\'re willing to lie for the sake of social dignity  , maybe to keep a private secret.\nLying is complex. It\'s woven into the fabric of our daily and business lives. We\'re deeply disturbed by the truth. We explain it, somet

### Build Prompts

In [None]:

def format_race(item):
  """
  Replicates the phrasing from GPT-3 Appendix G, Figure G.1.
  """
  article_text = item['article'].replace('\n', ' ')
  prompt = f"Article:\n{article_text}\n\n"

  mcqs = item['mcq']
  mcqs_count = len(mcqs)

  for i in range(mcqs_count-1):
    prompt += f"Q: {mcqs[i]['question']}\n"
    prompt += "A:"
    answer_index = ['A','B','C','D'].index(mcqs[i]['answer'])
    answer_text = mcqs[i]['options'][answer_index]
    prompt += f" {answer_text}\n\n"

  prompt += f"Q: {mcqs[mcqs_count-1]['question']}\n"
  prompt += "A:"
  return prompt

### Eval


In [None]:
# def get_loglikelihood(prompt, choice, model, tokenizer):
#     """Calculates the log-probability of choice given the prompt on CUDA."""
#     device = "cuda" if torch.cuda.is_available() else "cpu"
#     model.to(device)

#     prompt_ids = tokenizer.encode(prompt, add_special_tokens=True)
#     full_ids = tokenizer.encode(prompt + choice, add_special_tokens=True)
#     prompt_len = len(prompt_ids)
#     inputs_ids = torch.tensor([full_ids]).to(device)

#     with torch.no_grad():
#         outputs = model(inputs_ids)
#         logits = outputs.logits[:, :-1, :]
#         target_ids = inputs_ids[:, 1:]

#         log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
#         choice_log_probs = torch.gather(
#             log_probs[:, prompt_len-1:],
#             2,
#             target_ids[:, prompt_len-1:].unsqueeze(-1)
#         ).squeeze()

#         return torch.sum(choice_log_probs).item()

# def evaluate_race_pmi(test_ds,n_samples=50, k_shots=1):
#   correct = 0
#   test_subset = test_ds.select(range(n_samples))

#   for item in tqdm(test_subset):
#     mcq = item['mcq'][-1]
#     candidate_answers = mcq['options']
#     full_prompt = format_gpt3_race(item)
#     pmi_scores = []

#     for choice in candidate_answers:
#       ll_context = get_loglikelihood(full_prompt, " " + choice, model, tokenizer)
#       ll_null = get_loglikelihood("A:", " " + choice, model, tokenizer)
#       pmi_scores.append(ll_context - ll_null)

#     prediction_idx = np.argmax(pmi_scores)

#     if ['A', 'B', 'C', 'D'][prediction_idx] == mcq['answer']:
#       correct += 1

#   return (correct / n_samples) * 100

In [None]:
# optimised version using kv cache
def get_loglikelihood(prompt, choices, model, tokenizer):
  device = next(model.parameters()).device
  prompt_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

  # check above commented code to understand working
  # compute logits of prompt text only once
  with torch.no_grad():
    outputs = model(prompt_ids, use_cache=True)
    past_key_values = outputs.past_key_values
    last_prompt_logit = outputs.logits[:, -1, :]

  choice_ll = []

  # compute logits of each choice and combine with precomputes logits of prompt text
  for choice in choices:
    choice_ids = tokenizer.encode(choice, add_special_tokens=False, return_tensors="pt").to(device)

    with torch.no_grad():
      choice_outputs = model(choice_ids, past_key_values=past_key_values, use_cache=True)
      combined_logits = torch.cat([last_prompt_logit.unsqueeze(1), choice_outputs.logits[:, :-1, :]], dim=1)
      log_probs = torch.nn.functional.log_softmax(combined_logits, dim=-1)
      target_log_probs = torch.gather(log_probs, 2, choice_ids.unsqueeze(-1)).squeeze(-1)
      choice_ll.append(target_log_probs.sum().item())

  return choice_ll

def evaluate_race_pmi(test_ds, n_samples=50):
  correct = 0
  test_subset = test_ds.select(range(n_samples))

  for item in tqdm(test_subset):
    mcq = item['mcq'][-1]
    candidate_answers = [" " + opt for opt in mcq['options']]

    full_prompt = format_race(item)

    # check commented evaluate_race_pmi function to understand differnce
    # get loglikelihood of all candidate answers at once
    ll_contexts = get_loglikelihood(full_prompt, candidate_answers, model, tokenizer)

    ll_nulls = get_loglikelihood("A:", candidate_answers, model, tokenizer)

    pmi_scores = [ctx - null for ctx, null in zip(ll_contexts, ll_nulls)]
    prediction_idx = np.argmax(pmi_scores)

    if ['A', 'B', 'C', 'D'][prediction_idx] == mcq['answer']:
        correct += 1

  return (correct / n_samples) * 100

### Run Experiment

In [None]:

acc = evaluate_race_pmi(test_ds_processed2,n_samples=50)
print(f"\nFinal Few-Shot Accuracy: {acc}%")

100%|██████████| 50/50 [07:50<00:00,  9.42s/it]


Final Few-Shot Accuracy: 38.0%





# Conclusion

- Compared to gpt3-175b model used in Paper Language Models are Few-Shot Learners , LLama-3-8b with 4 bit quantization performs poor on Comprehension test - race-h benchmark

- Few Shot Accuracy RACE-h dataset

  - gpt3-175b few shot accuracy -> 46.8% (source - Paper Language Models are Few-Shot Learners)

  - LLama-3-8b 4 bit quanitzed accuracy -> 38%

  - 4 bit quanitzation may have caused drastic fall in performance of LLama-3-8b.