# Reviewing Model-Generated Answers

In this notebook we load and review answers generated by both the base and the fine-tuned models. The GPU has only enough memory to hold one of these models at a time, so memory must be cleared between model runs.
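As a rough check on that memory constraint, the sketch below estimates the size of the model weights alone. The parameter count (about 6.74 billion for Llama-2-7b) and the 2 bytes per parameter for float16 are assumptions stated here, not figures taken from this notebook; the 23,028 MiB total is the value reported for the A10G further down. Activations, the KV cache and the CUDA context all add to the estimate.

```python
# Rough estimate of the GPU memory needed for the model weights alone.
# ASSUMPTIONS: ~6.74e9 parameters (approximate size of Llama-2-7b) and
# 2 bytes per parameter (float16).
N_PARAMS = 6.74e9
BYTES_PER_PARAM = 2
GPU_TOTAL_MIB = 23028  # total memory reported for the A10G in this notebook


def weights_mib(n_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the weights in MiB."""
    return n_params * bytes_per_param / 2**20


one_model = weights_mib(N_PARAMS, BYTES_PER_PARAM)
print(f"One model:  ~{one_model:,.0f} MiB")
print(f"Two models: ~{2 * one_model:,.0f} MiB (GPU total: {GPU_TOTAL_MIB:,d} MiB)")
print("Both fit at once:", 2 * one_model < GPU_TOTAL_MIB)
```

Under these assumptions a single set of weights fits comfortably, while two sets exceed the card's total memory, which is consistent with loading only one model at a time.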

# Set Base Directories

In [1]:
DATA_DIRECTORY = "/opt/enrichment/github/Tuning-Retrieval-Augmented-Question-Answering/data"
MODEL_DIRECTORY = "/opt/enrichment/github/Tuning-Retrieval-Augmented-Question-Answering/model"

# Load Train/Test Data Frame

In [2]:
import pandas as pd

path = f"{DATA_DIRECTORY}/train-test-df.csv"
df = pd.read_csv(path, na_filter=False)
print(f"Loaded {df.shape[0]:,d} Train/Test records.")
#print(df.fold.value_counts())
df.sample(n=1)

Loaded 1,050 Train/Test records.


Unnamed: 0,fold,excerpt,question,answer,hashID
49,6,"The Vatican described the visit as a ""further ...",Why does the Defence Secretary believe that fu...,The Defence Secretary believes funding defence...,5f9149a396b2a0d1eed4ed2d4cb4401f


# Load Evaluation Prompt Prefix

In [3]:
path = f"{DATA_DIRECTORY}/leval-1-prompt-prefix.txt"
with open(path) as ifp: prefix = ifp.read()

print(prefix)

Carefully read the excerpt below and then provide a clear concise answer to the follow-up question.


# Load Model, Tokenizer and Generator

In [4]:
import torch
from utilities import display_sample, load_base_model, load_lora

model_name = "meta-llama/Llama-2-7b-chat-hf"



# Clear memory between different model runs!

In [6]:
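import gc
# Drop all references first, then return PyTorch's cached GPU memory to the driver.
tokenizer, model, generator = None, None, None
gc.collect()
torch.cuda.empty_cache()
!gpustat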

[1m[37mip-172-25-5-124    [m  Wed Oct  4 20:22:59 2023  [1m[30m535.54.03[m
[36m[0][m [34mNVIDIA A10G     [m |[31m 36'C[m, [32m  0 %[m | [36m[1m[33m14746[m / [33m23028[m MB | [1m[30mubuntu[m([33m14434M[m)


## Base Model

In [None]:
tokenizer, model, generator = load_base_model(model_name)

## Fine-tuned Model

In [6]:
fold = 5
directory = f"{MODEL_DIRECTORY}/Llama-2-7b-qa-level-1-{fold:02d}"
tokenizer, model, generator = load_lora(model_name, directory)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 22.19 GiB total capacity; 7.74 GiB already allocated; 60.56 MiB free; 7.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
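
The error shows that the second load was attempted while most of the GPU was still occupied. One simple guard is to read the used and total figures from a `gpustat` line before loading another model. The sketch below parses a line in the format printed above; `parse_gpustat_memory` and the 14,000 MB threshold are hypothetical choices made for illustration and are not part of the notebook's `utilities` module.

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")           # terminal colour codes
MEMORY_FIELD = re.compile(r"(\d+)\s*/\s*(\d+)\s*MB")  # "used / total MB"


def parse_gpustat_memory(line: str) -> tuple[int, int]:
    """Return (used_mb, total_mb) parsed from one gpustat device line."""
    match = MEMORY_FIELD.search(ANSI_ESCAPE.sub("", line))
    if match is None:
        raise ValueError(f"no memory field found in: {line!r}")
    return int(match.group(1)), int(match.group(2))


line = "[0] NVIDIA A10G | 36'C, 0 % | 14746 / 23028 MB | ubuntu(14434M)"
used_mb, total_mb = parse_gpustat_memory(line)
free_mb = total_mb - used_mb

REQUIRED_MB = 14000  # hypothetical threshold for one float16 7B model
print(f"free: {free_mb} MB, enough to load: {free_mb >= REQUIRED_MB}")
```

With the figures reported in this notebook the check fails, which matches the out-of-memory error: the memory held by the earlier process has to be released before the fine-tuned model can load.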

## Sample Predictions

In [None]:
x = df.sample(n=1).iloc[0]
display_sample(x, prefix, generator)

In [7]:
!gpustat

[1m[37mip-172-25-5-124    [m  Wed Oct  4 22:43:12 2023  [1m[30m535.54.03[m
[36m[0][m [34mNVIDIA A10G     [m |[31m 35'C[m, [32m  0 %[m | [36m[1m[33m14746[m / [33m23028[m MB | [1m[30mubuntu[m([33m14434M[m)


In [8]:
!nvidia-smi

Wed Oct  4 22:44:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   35C    P0              60W / 300W |  14442MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Add Base Model Predictions

## Load model, tokenizer and generator

In [None]:
!/opt/enrichment/miniconda3/envs/argilla/bin/huggingface-cli login --token ...

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
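# device_map="auto" has already placed the weights on the GPU, so an explicit
# model.cuda() call is not needed.

# Load the tokenizer and configure its padding and end-of-sequence tokens.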
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "right"
tokenizer.eos_token = "</s>"
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)


In [None]:
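import time

def generate(x, generator, verbose=False):
    """Generate an answer for one record and return only the answer text.

    Assumes x.evalPrompt holds the full prompt and that the prompt contains
    the "%%ANSWER: " marker.
    """
    start = time.time()
    completion = generator(x.evalPrompt, max_new_tokens=256)[0]["generated_text"]
    delta = time.time() - start
    if verbose:
        print(f"\nProcessing time = {delta:0.3f} seconds.")
    # The pipeline returns the prompt plus the completion; keep only the text
    # after the last answer marker.
    return completion.split("%%ANSWER: ")[-1].strip()

# Try the function on one randomly sampled record.
x = df.sample(n=1).iloc[0]
completion = generate(x, generator, verbose=True)
print(completion)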
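
The `split` call relies on the prompt containing an `%%ANSWER: ` marker. How `evalPrompt` is built is not shown in this notebook, so the sketch below is only one plausible layout. The `%%EXCERPT:` and `%%QUESTION:` markers, the `build_eval_prompt` and `extract_answer` helpers, and the sample excerpt are all hypothetical and introduced for illustration; only the `%%ANSWER: ` marker and the prefix text come from the notebook itself.

```python
ANSWER_MARKER = "%%ANSWER: "  # marker used by generate() in this notebook


def build_eval_prompt(prefix: str, excerpt: str, question: str) -> str:
    """Assemble a prompt that ends with the answer marker (hypothetical layout)."""
    return (
        f"{prefix.strip()}\n\n"
        f"%%EXCERPT: {excerpt.strip()}\n\n"    # hypothetical marker
        f"%%QUESTION: {question.strip()}\n\n"  # hypothetical marker
        f"{ANSWER_MARKER}"
    )


def extract_answer(generated_text: str) -> str:
    """Return the text after the last answer marker, stripped of whitespace."""
    return generated_text.split(ANSWER_MARKER)[-1].strip()


prompt = build_eval_prompt(
    "Carefully read the excerpt below and then provide a clear concise answer "
    "to the follow-up question.",
    "The bridge opened to traffic in 1932.",  # made-up excerpt
    "When did the bridge open?",              # made-up question
)

# A text-generation pipeline returns the prompt followed by the completion by
# default, so that behaviour is simulated here with a made-up completion.
generated_text = prompt + "The bridge opened in 1932.\n"
print(extract_answer(generated_text))
```

If the marker is missing from the generated text, `split` returns the whole string unchanged, so the prompt itself would end up in the stored answer. It is worth confirming that every prompt carries the marker before running the full loop.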

In [None]:
from tqdm import tqdm

# Generate a base-model answer for every record in the frame.
predictions = []
with tqdm(total=len(df)) as pbar:
    for index, x in df.iterrows():
        prediction = generate(x, generator, verbose=False)
        predictions.append(prediction)
        pbar.update()
df["basePredictedAnswer"] = predictions

path = f"{DATA_DIRECTORY}/eval-qa-df.csv"
df.to_csv(path, index=False)
