<a href="https://colab.research.google.com/github/ashweta1/interp/blob/main/cs230_rectifying_facts_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rectifying Factual Knowledge through RAG (Retrieval-Augmented Generation)

This Colab uses RAG over a Wikipedia-derived knowledge dataset to prepend retrieved context to prompts.

The RAG functionality comes from a library I wrote, installed from git+https://github.com/ashweta1/rag_wiki.git

## Prepare environment

In [1]:
%%bash

# exit early unless running on Colab
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

# recreate the local home for this colab run
cd /content && rm -rf /content/home && mkdir home && cd home

# install the known facts dataset.
pip install git+https://github.com/kmeng01/rome.git/tree/main/dsets >> install.log 2>&1

# install the Hugging Face datasets library
pip install datasets >> install.log 2>&1

pip uninstall -y rag_wiki >> install.log 2>&1
pip install git+https://github.com/ashweta1/rag_wiki.git >> install.log 2>&1
pip list | grep rag_wiki

# install latest torch and faiss-gpu
pip uninstall -y torch faiss-cpu faiss-gpu >> install.log 2>&1
pip install torch faiss-gpu >> install.log 2>&1

# pip uninstall -y torch torchaudio torchvision torchtext torchdata faiss-gpu >> install.log 2>&1
# pip install torch torchaudio torchvision torchtext torchdata faiss-gpu >> install.log 2>&1

rag_wiki                           0.1.0


In [2]:
# Report the Python, torch, faiss, and numpy versions in this runtime
!python --version
!python -c "import torch; print(torch.__version__)"
!python -c "import faiss; print(faiss.__version__)"
!python -c "import numpy; print(numpy.__version__)"

Python 3.10.12
2.5.1+cu124
1.7.2
1.26.4


In [3]:
import torch

IS_COLAB = False

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print("Device = ", device)

torch.set_grad_enabled(False)  # inference only: no model parameter updates

try:
    import google.colab, os

    IS_COLAB = True
    os.chdir("/content/home")  # working directory created in the setup cell
except ModuleNotFoundError:
    pass

Device =  cuda


In [4]:
# IPython magic to automatically reload imported modules when they change
%load_ext autoreload
%autoreload 2


In [5]:
!nvidia-smi

Sat Nov 23 08:03:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              11W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
# Get wikiqa embeddings loaded
import datasets
import torch
from rag_wiki import rag

print("torch.cuda.is_available()", torch.cuda.is_available())
print(torch.__version__)

# Load dataset
print("Loading dataset...")
# pdframe = rag.load_wikiqa(debug=True)
dataset = datasets.load_dataset("wiki_qa", split="train")
print("Loading dataset...done")
print("")

# Preprocess the dataset
print("Preprocessing dataset...")
index, texts = rag.preprocess_wikiqa(dataset, batch_size=500, debug=True)
print("Preprocessing dataset...done")
print("")

torch.cuda.is_available() True
2.5.1+cu124
Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading dataset...done

Preprocessing dataset...
device =  cuda
Getting the model and tokenizer for embeddings...
Embedding size =  384
Getting the model and tokenizer for embeddings... done
Creating FAISS index for storing and searching the embeddings...
Moving the index to GPU
Creating FAISS index... done
Batch 1
torch embeddings tensor shape:  torch.Size([17, 384])
numpy embeddings shape:  (17, 384)
Batch 2
torch embeddings tensor shape:  torch.Size([21, 384])
numpy embeddings shape:  (21, 384)
Batch 3
torch embeddings tensor shape:  torch.Size([38, 384])
numpy embeddings shape:  (38, 384)
Batch 4
torch embeddings tensor shape:  torch.Size([18, 384])
numpy embeddings shape:  (18, 384)
Batch 5
torch embeddings tensor shape:  torch.Size([24, 384])
numpy embeddings shape:  (24, 384)
Batch 6
torch embeddings tensor shape:  torch.Size([20, 384])
numpy embeddings shape:  (20, 384)
Batch 7
torch embeddings tensor shape:  torch.Size([20, 384])
numpy embeddings shape:  (20, 384)
Batch 8
torc
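
The log above shows batches of only 17 to 38 embeddings even though `batch_size=500` was requested. One plausible reading, which is an assumption and not confirmed from the `rag_wiki` source, is that each batch of rows is filtered down to the answers labeled as correct before embedding, and that each stored text is the question followed by its answer (the retrieved texts printed later have that shape). A minimal sketch of that idea, using the `wiki_qa` field names `question`, `answer`, and `label`; the helpers `batched` and `build_texts` are hypothetical and the rows are toy data:

```python
from typing import Dict, Iterator, List


def batched(rows: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def build_texts(batch: List[Dict]) -> List[str]:
    """Keep rows labeled as correct answers and join question and answer."""
    return [f"{row['question']} {row['answer']}" for row in batch if row["label"] == 1]


# Toy rows that reuse the wiki_qa field names; the contents are placeholders.
toy_rows = [
    {"question": "q1", "answer": "right answer", "label": 1},
    {"question": "q1", "answer": "wrong answer", "label": 0},
    {"question": "q2", "answer": "wrong answer", "label": 0},
    {"question": "q2", "answer": "right answer", "label": 1},
    {"question": "q3", "answer": "wrong answer", "label": 0},
]

texts_per_batch = [build_texts(batch) for batch in batched(toy_rows, batch_size=2)]
for number, batch_texts in enumerate(texts_per_batch, start=1):
    print(f"Batch {number}: {len(batch_texts)} texts to embed")
```

Under this reading, the number of texts per batch depends on how many correct answers fall into each slice, which would explain the uneven shapes in the log.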

In [None]:
# Get wikipedia embeddings loaded
import torch
from rag_wiki import rag

print("torch.cuda.is_available()", torch.cuda.is_available())
print(torch.__version__)

# Load dataset
print("Loading dataset...")
dataset = rag.load_wiki_dataset(num_examples=1000, debug=True)
print("Loading dataset...done")
print("")

# Preprocess the dataset
print("Preprocessing dataset...")
index, texts = rag.preprocess(dataset, batch_size=200, debug=True)
print("Preprocessing dataset...done")
print("")

In [31]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

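def get_gpt2_model():
    """Load the pretrained GPT-2 model ('gpt2' checkpoint) and its tokenizer."""
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    # GPT-2 has no pad token, so reuse the EOS token for padding.
    model = GPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
    # The model stays on the CPU here (the .to(device) call is disabled).
    return model, tokenizer


model, tokenizer = get_gpt2_model()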

In [48]:
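def generate_text(model, tokenizer, prompt, max_length=1024):
    """Generate one sentence continuing `prompt`; returns prompt plus continuation."""
    # Keep the inputs on the same device as the model.
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

    outputs = model.generate(input_ids,
                             # Total length (prompt + new tokens); GPT-2's context is 1024 tokens.
                             max_length=max_length,
                             do_sample=True,
                             num_beams=3,
                             temperature=0.7,
                             no_repeat_ngram_size=2,
                             early_stopping=True,
                             # Stop generating at the first period.
                             eos_token_id=tokenizer.encode(".")[0])

    # Decode the generated sequence back to text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return generated_text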

In [54]:
# Query the index and retrieve relevant texts
TOP_K_TEXTS = 2
# General-knowledge prompts, kept for reference but not used in this run.
# prompts = ["What is the capital of India?",
#            "Who is the president of the United States?",
#            "What is the population of China?",
#            "The capital of France is ",
#            "Where is the Eiffel Tower located?"]
# WikiQA-style questions used for the runs shown below.
prompts = ["how long was i love lucy on the air",
           "how did apollo creed die",
           "how much is 1 tablespoon of water",
           "how much are the harry potter movies worth"]

print("Retrieving relevant texts...")
print("Length of texts = ", len(texts))
retrieved_texts = rag.retrieve(prompts, index, texts, top_k=TOP_K_TEXTS, debug=False)
print("Retrieving relevant texts...done")
print("")

for p, ts in zip(prompts, retrieved_texts):
    print(f"Prompt: {p}")
    print(f"Retrieved texts: {len(ts)}")
    print("")

Retrieving relevant texts...
Length of texts =  1040
Retrieving relevant texts...done

Prompt: how long was i love lucy on the air
Retrieved texts: 2

Prompt: how did apollo creed die
Retrieved texts: 2

Prompt: how much is 1 tablespoon of water
Retrieved texts: 2

Prompt: how much are the harry potter movies worth
Retrieved texts: 2
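
`rag.retrieve` returns `top_k` texts per prompt, but its internals are not shown in this notebook. As a rough illustration of the nearest-neighbor lookup that a vector index performs, here is a brute-force sketch using cosine similarity over tiny hand-written vectors. The helper `top_k_texts` is hypothetical, the 3-dimensional vectors stand in for the real 384-dimensional embeddings, and the similarity metric actually used by the `rag_wiki` index is an assumption:

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_texts(query_vec: Sequence[float],
                text_vecs: List[Sequence[float]],
                candidates: List[str],
                top_k: int = 2) -> List[str]:
    """Return the top_k candidates whose vectors are most similar to query_vec."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query_vec, text_vecs[i]),
                    reverse=True)
    return [candidates[i] for i in ranked[:top_k]]


# Hand-written 3-dimensional stand-ins for real embeddings.
toy_texts = ["passage about television", "passage about cooking", "passage about films"]
toy_vecs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.0], [0.8, 0.1, 0.6]]
toy_query = [1.0, 0.0, 0.0]

print(top_k_texts(toy_query, toy_vecs, toy_texts, top_k=2))
```

A dedicated index such as FAISS does the same ranking far more efficiently, which matters once the corpus grows well beyond the 1040 texts used here.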



In [49]:
for p in prompts:
    print("Prompt: ", p)
    print("----")
    print(generate_text(model, tokenizer, p))
    print("=====")
    print("")

Prompt:  how long was i love lucy on the air
----
how long was i love lucy on the air, t on a on t the e on n on d on on w on o on r on g on h on e the on b on l on i on s on y on , on air.
=====

Prompt:  how did apollo creed die
----
how did apollo creed die out?"

"No, it didn't.
=====

Prompt:  how much is 1 tablespoon of water
----
how much is 1 tablespoon of water and 1/2 teaspoon of sugar in a large bowl?

The answer is no.
=====

Prompt:  how much are the harry potter movies worth
----
how much are the harry potter movies worth?"

"No, I don't think so.
=====
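
Each generation above begins with the prompt itself, because for a decoder-only model such as GPT-2 the sequence returned by `model.generate` contains the input tokens followed by the new ones. A small sketch of separating the continuation from the prompt at the string level; `strip_prompt` is a hypothetical helper, and it assumes the decoded text starts with the prompt verbatim, falling back to the full text otherwise:

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation if generated_text starts with prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Fall back to the full text when the prompt is not a verbatim prefix.
    return generated_text


# Example taken from the second generation printed above.
example_prompt = "how did apollo creed die"
example_generated = 'how did apollo creed die out?"\n\n"No, it didn\'t.'
print(strip_prompt(example_generated, example_prompt))
```

An alternative that avoids string matching is to decode only the tokens after the prompt, for example by slicing `outputs[0]` from `input_ids.shape[1]` onward before calling `tokenizer.decode`.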



In [55]:
prepend = "Based on the information provided, answer the question concisely. Do not repeat the prompt."
section1 = "[INFO]"
section2 = "[QUESTION]"
section3 = "[ANSWER]"
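# Layout: instruction, [INFO] + retrieved context (first 1000 characters),
# [QUESTION] + prompt, then [ANSWER] for the model to complete.
qa_prompt_with_context = [
    prepend + "\n"
    + section1 + "\n" + "\n".join(ts)[:1000] + "\n"
    + section2 + "\n" + p + "\n"
    + section3 + "\n"
    for p, ts in zip(prompts, retrieved_texts)
]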

for p in qa_prompt_with_context:
    print("Prompt: ", p)
    print("----")
    print(generate_text(model, tokenizer, p))
    print("=====")
    print("")

Prompt:  Based on the information provided, answer the question concisely. Do not repeat the prompt.
[INFO]
who is victoria jackson from saturday night live Victoria Jackson (born August 2, 1959) is an American comedian, actress, satirist, singer and internet blogger best known as a cast member of the NBC television sketch comedy series Saturday Night Live (SNL) from 1986 to 1992.
who played guitar on the kiss album, creatures of the night It is also the band's last album recorded with Ace Frehley credited as an official member (until 1998's Psycho Circus ), and its first album with Vinnie Vincent as the initially uncredited lead guitarist (Vincent would later be credited, but not featured pictorially on the cover, of 1985's reissue of the album ).
[QUESTION]
how long was i love lucy on the air
[ANSWER]

----
Based on the information provided, answer the question concisely. Do not repeat the prompt.
[INFO]
who is victoria jackson from saturday night live Victoria Jackson (born August 2
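
The context in these prompts is cut by character count (`[:1000]` here, `[:200]` in the next cell), which can slice a passage mid-sentence. A possible alternative, sketched under the assumption that dropping a whole passage is preferable to truncating it, keeps only complete passages that fit a character budget. `fit_passages` is a hypothetical helper and is not part of `rag_wiki`:

```python
from typing import List


def fit_passages(passages: List[str], max_chars: int, sep: str = "\n") -> str:
    """Join passages in order, stopping before the budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for passage in passages:
        # Account for the separator that precedes every passage but the first.
        extra = len(passage) + (len(sep) if kept else 0)
        if used + extra > max_chars:
            break
        kept.append(passage)
        used += extra
    return sep.join(kept)


toy_passages = ["first passage.", "second passage.", "third passage."]
print(fit_passages(toy_passages, max_chars=32))
```

Because retrieved texts arrive ranked by similarity, stopping at the first passage that does not fit keeps the most relevant ones. The budget is in characters, not tokens, so it only approximates how much of GPT-2's 1024-token context the prompt will use.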

In [51]:
prompt_with_context = [" ".join(ts)[:200] + "\n" + p for p, ts in zip(prompts, retrieved_texts)]

for p in prompt_with_context:
    print("Prompt: ", p)
    print("----")
    print(generate_text(model, tokenizer, p))
    print("=====")
    print("")

Prompts with contexts:  ['who is victoria jackson from saturday night live Victoria Jackson (born August 2, 1959) is an American comedian, actress, satirist, singer and internet blogger best known as a cast member of the NBC t\nhow long was i love lucy on the air', 'how was the moon formed The Moon is thought to have formed nearly 4.5 billion years ago, not long after the Earth.\nhow did apollo creed die', 'how much is 1 tablespoon of water This tablespoon has a capacity of about 15 mL.\nhow much is 1 tablespoon of water', 'how much are the harry potter movies worth The series also originated much tie-in merchandise, making the Harry Potter brand worth in excess of $15 billion.\nhow much are the harry potter movies worth']
Prompt:  who is victoria jackson from saturday night live Victoria Jackson (born August 2, 1959) is an American comedian, actress, satirist, singer and internet blogger best known as a cast member of the NBC t
how long was i love lucy on the air
----
who is victoria 