<a href="https://colab.research.google.com/github/ayomight96/Trivia-RAG-LLM/blob/main/Trivia_Rag_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trivia-Rag-LLM

This notebook implements a retrieval-augmented generation (RAG) pipeline around an LLM. It reads a set of multiple-choice trivia questions from a CSV file and writes the answers to another CSV file. The following packages are installed:
1. transformers (by Hugging Face)
2. torch
3. scikit-learn
4. accelerate
5. faiss-gpu
6. sentence-transformers

The following libraries, classes, and functions are imported:
1. pandas
2. torch
3. SentenceTransformer
4. AutoTokenizer
5. AutoModel
6. tqdm
7. numpy
8. faiss
9. drive (from google.colab)
10. AutoModelForCausalLM
11. pipeline

In [12]:
!pip uninstall -y transformers
!pip install git+https://github.com/huggingface/transformers
#!pip install openai==0.28
!pip install torch
!pip install scikit-learn
!pip install accelerate==0.31.0 # pinned to fix the error "cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'"
!pip install faiss-gpu
!pip install sentence-transformers

Found existing installation: transformers 4.48.0.dev0
Uninstalling transformers-4.48.0.dev0:
  Successfully uninstalled transformers-4.48.0.dev0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-e4ptawbk
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-e4ptawbk
ERROR: Operation cancelled by user
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Downloading transformers-4.47.0-py3-none-any.whl.metadata (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.5/43.5 kB 1.9 MB/s eta 0:00:00
Downloading transformers-4.47.0-py3-none-any.whl (10.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.1/10.1 MB 69.8 MB/s eta 0:00:00
Installing collected packages: transformers
Successfully installed transformers-4.47.0


## The data

A total of 53,800 trivia questions with their answers were collected from multiple sources, including web scraping and datasets from Kaggle. The data was combined into a single Excel file and uploaded to Google Drive for easy access from Colab.

## Preprocessing of the data

Each trivia question and its answer are stored in separate columns, so every pair is concatenated into a single sentence. This gives the embedding model more context to work with during retrieval.
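
A minimal sketch of this concatenation on two made-up rows, using plain dictionaries in place of DataFrame rows. The sample questions are illustrative and are not taken from the dataset; the `Question` and `Answer` keys match the column names used in the cell below.

```python
def row_to_sentence(row):
    """Join one question/answer pair into a single retrievable sentence."""
    return f"The question is: {row['Question']}. The answer is: {row['Answer']}."

# Hypothetical sample rows; the real data comes from the Excel file
rows = [
    {"Question": "What is the capital of France", "Answer": "Paris"},
    {"Question": "How many legs does a spider have", "Answer": "8"},
]

sentences = [row_to_sentence(r) for r in rows]
print(sentences[0])
# The question is: What is the capital of France. The answer is: Paris.
```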

In [None]:
import numpy as np
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

# Load Excel file
df = pd.read_excel("/content/drive/My Drive/LLM/Trivia_new.xlsx")

# Convert rows into sentences
def row_to_sentence(row):
    return f"The question is: {row['Question']}. The answer is: {row['Answer']}."

sentences = df.apply(row_to_sentence, axis=1).tolist()


documents = [
  "On 14 April, ESA launched the Jupiter Icy Moons Explorer (JUICE) spacecraft to explore Jupiter and its large ice-covered moons following an eight-year transit.",
  "ISRO launched its third lunar mission Chandrayaan-3 on 14 July 2023 at 9:05 UTC; it consists of lander, rover and a propulsion module, and successfully landed in the south pole region of the Moon on 23 August 2023.",
  "Russian lunar lander Luna 25 was launched on 10 August 2023, 23:10 UTC, atop a Soyuz-2.1b rocket from the Vostochny Cosmodrome, it was the first Russian attempt to land a spacecraft on the Moon since the Soviet lander Luna 24 in 1974, it crashed on the Moon on 19 August after technical glitches.",
  "JAXA launched SLIM (Smart Lander for Investigating Moon) lunar lander (carrying a mini rover) and a space telescope (XRISM) on 6 September.",
  "The OSIRIS-REx mission returned to Earth on 24 September with samples collected from asteroid Bennu.",
  "NASA launched the Psyche spacecraft on 13 October 2023, an orbiter mission that will explore the origin of planetary cores by studying the metallic asteroid 16 Psyche, on a Falcon Heavy launch vehicle."
]
documents.extend(sentences)
# Load document_index from Google Drive
loaded_document_index = np.load("/content/drive/My Drive/LLM/document_index.npy").astype(np.float32)
print(f"Loaded embeddings shape: {loaded_document_index.shape}")
loaded_document_index.shape
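
Because the embeddings are loaded from disk rather than recomputed, retrieval is only correct if row `i` of the matrix corresponds to `documents[i]`. The following is a small sanity check one might add, sketched under that assumption; `check_alignment` is a hypothetical helper and not part of the original notebook.

```python
def check_alignment(num_documents, embeddings_shape):
    """Verify that the embedding matrix has exactly one row per document."""
    num_rows = embeddings_shape[0]
    if num_rows != num_documents:
        raise ValueError(
            f"{num_documents} documents but {num_rows} embedding rows; "
            "the saved index was probably built from a different document list"
        )
    return num_rows

# Toy values standing in for len(documents) and loaded_document_index.shape
print(check_alignment(4, (4, 768)))
```

In the notebook this would be called as `check_alignment(len(documents), loaded_document_index.shape)` right after loading the `.npy` file.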

## Embedding

To make retrieval over the large trivia dataset fast, the following functions embed the documents and search the embeddings with a FAISS index.

In [2]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
import numpy as np


def embed_documents(docs, model_name, batch_size=16, use_gpu=True):
  """Embed the provided documents to create a document index"""
  # Use the GPU only when it is requested and actually available
  device = "cuda" if use_gpu and torch.cuda.is_available() else "cpu"

  # SentenceTransformer handles tokenization and pooling internally,
  # so a separate AutoTokenizer/AutoModel is not needed here.
  # Note: the model is reloaded on every call to this function.
  model = SentenceTransformer(model_name, device=device)

  # Initialize list to store embeddings
  all_embeddings = []

  # Process documents in batches
  for i in tqdm(range(0, len(docs), batch_size), desc="Embedding Documents"):
    batch_docs = docs[i:i + batch_size]
    batch_embeddings = model.encode(batch_docs, batch_size=batch_size, show_progress_bar=False)
    all_embeddings.append(batch_embeddings)

  # Concatenate all embeddings
  return np.vstack(all_embeddings)


In [4]:
import faiss

def create_faiss_index(doc_index):
    """
    Create a FAISS index for efficient similarity search.

    Parameters:
    - doc_index (np.ndarray): Precomputed document embeddings.

    Returns:
    - faiss.IndexFlatL2: The FAISS index.
    """
    dimension = doc_index.shape[1]  # Embedding dimension
    index = faiss.IndexFlatL2(dimension)
    index.add(doc_index)  # Add embeddings to the FAISS index
    return index

def retrieve_documents(query_string, faiss_index, docs, model_name="BAAI/bge-base-en", k=5):
    """
    Retrieve the top-k most similar documents using FAISS.

    Parameters:
    - query_string (str): The query text.
    - faiss_index (faiss.IndexFlatL2): The FAISS index.
    - docs (list): List of original documents.
    - model_name (str): Hugging Face model name for embedding.
    - k (int): Number of documents to retrieve.

    Returns:
    - List of the top-k most similar documents.
    """
    # Embed the query string
    query_vector = embed_documents([query_string], model_name=model_name).reshape(1, -1)

    # Query the FAISS index
    distances, indices = faiss_index.search(query_vector, k)

    # Retrieve top-k documents
    return [docs[i] for i in indices[0]]
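
`faiss.IndexFlatL2` performs an exact search: it compares the query against every stored vector and ranks the results by squared Euclidean distance. The sketch below reproduces that ranking in plain Python on toy 2-D vectors, purely to illustrate what `retrieve_documents` gets back from the index. The function name `l2_top_k` and the sample vectors are assumptions made for this example.

```python
def l2_top_k(doc_vectors, query_vector, k=5):
    """Return the k (squared_distance, index) pairs closest to query_vector."""
    scored = []
    for i, vec in enumerate(doc_vectors):
        # Squared Euclidean distance between the document and the query
        sq_dist = sum((a - b) ** 2 for a, b in zip(vec, query_vector))
        scored.append((sq_dist, i))
    scored.sort()  # smallest distance first
    return scored[:k]

# Toy embeddings standing in for the real document index
doc_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query_vector = [0.9, 0.1]

top = l2_top_k(doc_vectors, query_vector, k=2)
print([i for _, i in top])
# [1, 0]
```

FAISS does the same comparison in optimized native code, which is what makes it practical for tens of thousands of documents.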

In [5]:
# Load the FAISS index
loaded_index = faiss.read_index("/content/drive/My Drive/LLM/faiss_index.index")
print("FAISS index loaded successfully.")

FAISS index loaded successfully.


In [6]:
def create_augmented_prompt(query_string, docs):
  # concatenate the retrieved docs as context for the LLM
  # you could do other pre-processing here too
  context = "\n".join(docs)
  # define your prompt template
  prompt_template = """Here is some relevant information:
  {context}

  Q: {query}
  Provide only the correct option letter (e.g., A, B, C, or D). Do not include any explanation.
  A:
  """
  # render the prompt template
  return prompt_template.format(context=context, query=query_string)
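
The triple-quoted template above keeps the two-space indentation of the function body inside the prompt text. A possible variant, sketched here with `textwrap.dedent` from the standard library, removes that indentation before the context is inserted. The sample question and document are made up for illustration, and this version ends the prompt at `A:` without a trailing newline.

```python
import textwrap

# Dedent once, before formatting, so multi-line context is left untouched
PROMPT_TEMPLATE = textwrap.dedent("""\
    Here is some relevant information:
    {context}

    Q: {query}
    Provide only the correct option letter (e.g., A, B, C, or D). Do not include any explanation.
    A:""")

def create_augmented_prompt(query_string, docs):
    """Render the prompt with the retrieved docs as context."""
    context = "\n".join(docs)
    return PROMPT_TEMPLATE.format(context=context, query=query_string)

# Hypothetical query and retrieved document
prompt = create_augmented_prompt(
    "Which planet is known as the Red Planet? A) Venus B) Mars",
    ["The question is: Which planet is known as the Red Planet. The answer is: Mars."],
)
print(prompt)
```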

In [7]:
#import torch
from transformers import AutoModelForCausalLM, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 10,
    "return_full_text": False,
    # Greedy decoding: temperature only applies when do_sample=True, so it is omitted
    "do_sample": False,
}

def generate_response(query_string, chosen_model,generation_arguments):
  messages = [{"content": query_string, "role": "user"}]
  output = chosen_model(messages,**generation_arguments)
  return output[0]['generated_text'].strip()

def generate_rag_response(
    query_string,
    docs,
    faiss_index,
    chosen_model=pipe,
    generation_arguments=generation_args,
    k=3
):

  # R: retrieve documents (use the docs and k passed in, not the global list or the default k)
  retrieved_docs = retrieve_documents(
      query_string, faiss_index, docs, k=k
  )
  # A: create augmented prompt
  augmented_prompt = create_augmented_prompt(query_string, retrieved_docs)

  # G: generate response!
  generated_response = generate_response(augmented_prompt, chosen_model=chosen_model, generation_arguments=generation_arguments)
  return generated_response

def process_questions_and_save(input_csv, output_csv):
    """
    Reads questions from a CSV file, generates answers, and saves them to another CSV file.
    Args:
        input_csv (str): Path to the input CSV containing 'Number' and 'Question' columns.
        output_csv (str): Path to save the answered questions.
    """
    # Load the input CSV
    df = pd.read_csv(input_csv)

    # List to store the results
    results = []

    # Loop through questions and generate responses
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing Questions"):
        question_number = row['Number']
        question_text = row['Question']

        # Generate response using RAG
        response = generate_rag_response(
            query_string=question_text,
            docs=documents,
            faiss_index=loaded_index,
            chosen_model=pipe,
            generation_arguments=generation_args
        )
        # Clean the response to ensure it's just the option letter
        tokens = response.strip().split()
        clean_response = tokens[0] if tokens else ""  # First word/letter; empty if nothing was generated

        print(f"Q{question_number}: {clean_response}")
        results.append({"Number": question_number, "Question": clean_response})

    # Save responses to CSV
    output_df = pd.DataFrame(results)
    output_df.to_csv(output_csv, index=False)
    print(f"Results saved to {output_csv}")



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
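
Taking the first whitespace-separated token of the response works when the model answers with a bare letter, but it would keep punctuation in replies such as `B.` or `(B)`. A stricter cleanup could look like the sketch below. `extract_option_letter` and its `valid_letters` parameter are hypothetical and not part of the notebook, and the sketch assumes the options are always uppercase A to D.

```python
import re

def extract_option_letter(response, valid_letters="ABCD"):
    """Return the leading option letter of a model response, or "" if none."""
    # Allow leading whitespace and an optional "(", then require a standalone letter
    match = re.match(rf"\s*\(?([{valid_letters}])\b", response)
    return match.group(1) if match else ""

for raw in ["B", "B.", "(C)", "  D) Mars", "Apple", ""]:
    print(repr(raw), "->", repr(extract_option_letter(raw)))
```

Responses that do not start with a valid letter map to an empty string, so they can be spotted in the output CSV instead of being recorded as an arbitrary word.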


In [11]:
# Process CSV
input_csv = "/content/drive/My Drive/LLM/input.csv"
output_csv = "/content/drive/My Drive/LLM/output.csv"
process_questions_and_save(
    input_csv=input_csv,
    output_csv=output_csv
)

answers = pd.read_csv(output_csv)
answers

Processing Questions:   0%|          | 0/2 [00:00<?, ?it/s]
Embedding Documents:   0%|          | 0/1 [00:00<?, ?it/s]
Embedding Documents: 100%|██████████| 1/1 [00:01<00:00,  1.63s/it]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
Processing Questions:  50%|█████     | 1/2 [00:06<00:06,  6.64s/it]

Q1: B



Embedding Documents: 100%|██████████| 1/1 [00:00<00:00, 67.06it/s]
Processing Questions: 100%|██████████| 2/2 [00:10<00:00,  5.05s/it]

Q2: B
Results saved to /content/drive/My Drive/LLM/output.csv





| Unnamed: 0 | Number | Question |
|---|---|---|
| 0 | 1 | B |
| 1 | 2 | B |
