<a href="https://colab.research.google.com/github/connectchayan/ViBe/blob/main/MICRO_LLM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Build a RAG system using a micro LLM.

## Install necessary libraries

### Subtask:
Install libraries such as `transformers`, `datasets`, `torch`, and `langchain` that will be used to build the RAG system.


**Reasoning**:
Install the necessary libraries using pip.



In [None]:
%pip install transformers datasets torch langchain



## Load a micro LLM

### Subtask:
Load a pre-trained micro LLM model from the `transformers` library.


**Reasoning**:
Import the necessary classes from the `transformers` library and load a pre-trained micro LLM model and its tokenizer.



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Specify the name of a pre-trained micro LLM model
model_name = "distilgpt2"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f"Tokenizer for {model_name} loaded successfully.")
print(f"Model {model_name} loaded successfully.")

Tokenizer for distilgpt2 loaded successfully.
Model distilgpt2 loaded successfully.


## Load and process data

### Subtask:
Load the data that will be used as the knowledge base for the RAG system. Process the data into a suitable format for the LLM.


**Reasoning**:
Load a dataset from the `datasets` library, inspect its structure, select relevant columns, and process it into a suitable format for a RAG system.



In [None]:
from datasets import load_dataset

# Load a suitable dataset for RAG
# Using 'squad' as it's a question-answering dataset
dataset = load_dataset("squad")

# Inspect the dataset structure
print(dataset)

# Select relevant columns and process data
# For SQuAD, we can use 'context', 'question', and 'answers'
# We'll focus on 'context' as the knowledge base text
def process_squad_dataset(examples):
    # Copy each context paragraph into a 'text' column.
    # For a simple RAG knowledge base, the context alone is sufficient.
    return {"text": examples["context"]}

# Apply the processing function to the dataset
processed_dataset = dataset.map(process_squad_dataset, remove_columns=['id', 'title', 'question', 'answers'])

# Inspect the processed dataset
print(processed_dataset)
print(processed_dataset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


DatasetDict({
    train: Dataset({
        features: ['context', 'text'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['context', 'text'],
        num_rows: 10570
    })
})
{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'text': 'Architecturally, the school has a Catholic character. Atop the Main Buildin
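
SQuAD pairs several questions with the same paragraph, so the `text` column built above contains repeated passages; the test output later in this notebook shows the same paragraph retrieved more than once. A minimal sketch of order-preserving deduplication follows. The helper name `unique_texts` and the sample rows are hypothetical and are not part of the original notebook.

```python
def unique_texts(rows):
    """Return the distinct 'text' values from rows, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so duplicates are dropped
    # without reshuffling the passages.
    return list(dict.fromkeys(row["text"] for row in rows))


# Toy rows standing in for processed_dataset['train'] (illustrative only).
sample_rows = [
    {"text": "Passage A"},
    {"text": "Passage A"},
    {"text": "Passage B"},
    {"text": "Passage A"},
]
print(unique_texts(sample_rows))  # ['Passage A', 'Passage B']
```

Applying something like this before building the vector store would keep identical passages from taking up several of the retrieved slots.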

## Build the RAG system

### Subtask:
Build the RAG system using the loaded LLM and processed data.


**Reasoning**:
Import the necessary classes from langchain, create embeddings for the processed data, build a vector store from the embeddings, and set up the RetrievalQA system.



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

# 2. Create embeddings for the text data
# Using a smaller, faster sentence transformer model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Extract text from the processed dataset
texts = [example['text'] for example in processed_dataset['train']]

# 3. Create a vector store from the embeddings and text data
# Using FAISS for efficient similarity search
vectorstore = FAISS.from_texts(texts, embeddings)

# 4. Set up a retrieval mechanism
retriever = vectorstore.as_retriever()

# 5. Combine the loaded micro LLM with the retrieval mechanism
# Create a HuggingFace pipeline for the loaded model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)
hf_llm = HuggingFacePipeline(pipeline=pipe)

# Create the RetrievalQA system
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
    retriever=retriever
)

print("RAG system (RetrievalQA chain) created successfully.")

ModuleNotFoundError: No module named 'langchain_community'

**Reasoning**:
The previous command failed because the `langchain_community` module was not found. These integrations ship in a separate package, `langchain-community`, which the earlier `pip install langchain` did not install here. The fix is to install `langchain-community`.



In [None]:
%pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.3.27-py3-none-any.whl (2.5 MB)
Downloading dataclasses_json-0.6.7-py3-none-any.whl (2

**Reasoning**:
Now that `langchain-community` is installed, retry the previous command to import necessary classes, create embeddings, build the vector store, and set up the RetrievalQA system.



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

# 2. Create embeddings for the text data
# Using a smaller, faster sentence transformer model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Extract text from the processed dataset
# Taking a smaller subset for demonstration to avoid excessive memory usage
sample_size = 1000
texts = [example['text'] for example in processed_dataset['train'].select(range(sample_size))]

# 3. Create a vector store from the embeddings and text data
# Using FAISS for efficient similarity search
vectorstore = FAISS.from_texts(texts, embeddings)

# 4. Set up a retrieval mechanism
retriever = vectorstore.as_retriever()

# 5. Combine the loaded micro LLM with the retrieval mechanism
# Create a HuggingFace pipeline for the loaded model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)
hf_llm = HuggingFacePipeline(pipeline=pipe)

# Create the RetrievalQA system
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
    retriever=retriever
)

print("RAG system (RetrievalQA chain) created successfully.")

ImportError: Could not import faiss python package. Please install it with `pip install faiss-gpu` (for CUDA supported GPU) or `pip install faiss-cpu` (depending on Python version).

**Reasoning**:
The previous command failed because the `faiss` package was not found. The traceback suggests installing either `faiss-gpu` or `faiss-cpu`. Since we are not guaranteed to have a GPU environment, `faiss-cpu` is the safer choice.



In [None]:
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0
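
Both failures above were missing-module errors that only surfaced partway through the build cell. A minimal sketch of checking for the required modules up front follows. It assumes the three import names listed are the ones this notebook needs, and `missing_modules` is a hypothetical helper rather than part of any library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# These are import names, not pip names: the pip package `faiss-cpu`
# provides the module `faiss`, and `langchain-community` provides
# `langchain_community`.
required = ["langchain_community", "faiss", "sentence_transformers"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found; anything listed can be installed in one step before running the longer cell.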


**Reasoning**:
Now that `faiss-cpu` is installed, retry the command to import necessary classes, create embeddings, build the vector store, and set up the RetrievalQA system.



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

# 2. Create embeddings for the text data
# Using a smaller, faster sentence transformer model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Extract text from the processed dataset
# Taking a smaller subset for demonstration to avoid excessive memory usage
sample_size = 1000
texts = [example['text'] for example in processed_dataset['train'].select(range(sample_size))]

# 3. Create a vector store from the embeddings and text data
# Using FAISS for efficient similarity search
vectorstore = FAISS.from_texts(texts, embeddings)

# 4. Set up a retrieval mechanism
retriever = vectorstore.as_retriever()

# 5. Combine the loaded micro LLM with the retrieval mechanism
# Create a HuggingFace pipeline for the loaded model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)
hf_llm = HuggingFacePipeline(pipeline=pipe)

# Create the RetrievalQA system
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
    retriever=retriever
)

print("RAG system (RetrievalQA chain) created successfully.")

Device set to use cpu


RAG system (RetrievalQA chain) created successfully.
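
To make the retrieval step concrete, here is a toy sketch of similarity search in plain Python. It is only an illustration of the idea: it uses bag-of-words counts and cosine similarity as stand-ins, and it is not how `HuggingFaceEmbeddings` or `FAISS` work internally. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (not a neural model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


docs = [
    "the grotto is a replica of the grotto at lourdes",
    "the wind tunnel was built in 1882",
    "the library collection was destroyed by a fire",
]
print(top_k("who built the wind tunnel", docs, k=1))
# ['the wind tunnel was built in 1882']
```

The retriever in the chain plays the role of `top_k`: it returns the stored passages closest to the query, and those passages are then pasted into the prompt.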



## Test the RAG system

### Subtask:
Test the RAG system with some example queries to ensure it is working correctly.


**Reasoning**:
Define example queries and use the qa_chain to get responses for each query.



In [None]:
# Define example queries
example_queries = [
    "What is the capital of France?",
    "What is the purpose of the lymphatic system?",
    "Who was the first man on the moon?",
    "What is the function of photosynthesis?",
    "What is machine learning?"
]

# Get responses for each query using the qa_chain
for query in example_queries:
    response = qa_chain.invoke({"query": query})
    print(f"Query: {query}")
    print(f"Response: {response['result']}")
    print("-" * 20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Token indices sequence length is longer than the specified maximum sequence length for this model (1026 > 1024). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the capital of France?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is

IndexError: index out of range in self

**Reasoning**:
The previous attempt resulted in an IndexError because the input sequence exceeded the model's maximum length: the tokenizer warning reports 1026 tokens against a limit of 1024. As a first attempt at a fix, I will reduce the `max_new_tokens` parameter in the HuggingFace pipeline so that the total sequence is shorter, and then re-run the testing with the example queries.



In [None]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

# Re-create the HuggingFace pipeline with a smaller max_new_tokens
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50 # Reduced from 100
)
hf_llm = HuggingFacePipeline(pipeline=pipe)

# Re-create the RetrievalQA system with the updated LLM
# The retriever remains the same
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
    retriever=retriever
)

print("RAG system (RetrievalQA chain) updated with reduced max_new_tokens.")

# Re-run the testing with example queries
example_queries = [
    "What is the capital of France?",
    "What is the purpose of the lymphatic system?",
    "Who was the first man on the moon?",
    "What is the function of photosynthesis?",
    "What is machine learning?"
]

# Get responses for each query using the qa_chain
for query in example_queries:
    response = qa_chain.invoke({"query": query})
    print(f"Query: {query}")
    print(f"Response: {response['result']}")
    print("-" * 20)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RAG system (RetrievalQA chain) updated with reduced max_new_tokens.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the capital of France?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is

IndexError: index out of range in self

**Reasoning**:
The IndexError persists even after reducing `max_new_tokens`. This indicates that the prompt itself (retrieved context plus query) exceeds the model's limit, so shortening the generated output cannot help. To address this, I will limit the number of documents returned by the retriever (LangChain retrieves 4 by default) to reduce the context size provided to the LLM. I will set `search_kwargs={"k": 1}` so that only one document is retrieved and then re-run the tests.
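
The length constraint described above comes down to simple arithmetic: the prompt tokens plus the tokens to be generated must fit within the model's position limit. A minimal sketch follows, using the 1026 and 1024 figures from the tokenizer warning earlier in this section. The helper names are hypothetical.

```python
def fits_context(prompt_tokens, max_new_tokens, max_positions=1024):
    """True if the prompt plus the generated tokens fit in the context window."""
    return prompt_tokens + max_new_tokens <= max_positions


def prompt_budget(max_new_tokens, max_positions=1024):
    """Largest prompt length (in tokens) that still leaves room to generate."""
    return max_positions - max_new_tokens


# 1026 is the prompt length reported in the tokenizer warning;
# 1024 is the limit reported in the same warning.
print(fits_context(1026, 100))  # False
print(fits_context(1026, 50))   # False: a smaller max_new_tokens does not help
print(prompt_budget(50))        # 974
```

Because the prompt alone is already over the limit, only a shorter prompt (fewer or shorter retrieved passages) can make the sequence fit.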



In [None]:
# Set up a retrieval mechanism with limited search results
retriever = vectorstore.as_retriever(search_kwargs={"k": 1}) # Limit to retrieving only 1 document

# Re-create the RetrievalQA system with the updated retriever
# The LLM pipeline remains the same as updated in the previous step
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
    retriever=retriever
)

print("RAG system (RetrievalQA chain) updated with limited document retrieval.")

# Re-run the testing with example queries
example_queries = [
    "What is the capital of France?",
    "What is the purpose of the lymphatic system?",
    "Who was the first man on the moon?",
    "What is the function of photosynthesis?",
    "What is machine learning?"
]

# Get responses for each query using the qa_chain
for query in example_queries:
    response = qa_chain.invoke({"query": query})
    print(f"Query: {query}")
    print(f"Response: {response['result']}")
    print("-" * 20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RAG system (RetrievalQA chain) updated with limited document retrieval.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the capital of France?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: What is the capital of France?
Helpful Answer: The capital of France is the capit

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the purpose of the lymphatic system?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The Lobund Institute grew out of pioneering research in germ-free-life which began in 1928. This area of research originated in a question posed by Pasteur as to whether animal life was possible without bacteria. Though others had taken up this idea, their research was short lived and inconclusive. Lobund was the first research organization to answer definitively, that such life is possible and that it can be prolonged through generations. But the objective was not merely to answer Pasteur's question but also to produce the germ free animal as a new tool for biological and medical research. This objective was reached and for years Lobund was a unique center for the study and production of germ free animals and for their use in biological and medical investigations

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: Who was the first man on the moon?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

In 1882, Albert Zahm (John Zahm's brother) built an early wind tunnel used to compare lift to drag of aeronautical models. Around 1899, Professor Jerome Green became the first American to send a wireless message. In 1931, Father Julius Nieuwland performed early work on basic reactions that was used to create neoprene. Study of nuclear physics at the university began with the building of a nuclear accelerator in 1936, and continues now partly through a partnership in the Joint Institute for Nuclear Astrophysics.

Question: Who was the first man on the moon?
Helpful Answer:
Albert Zahm was a scientist at the University of California, Berkeley, and a professor of applied physics at the University of California, Berkeley. He was the first physicist who attempted to prove the ex

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the function of photosynthesis?
Response: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

This Main Building, and the library collection, was entirely destroyed by a fire in April 1879, and the school closed immediately and students were sent home. The university founder, Fr. Sorin and the president at the time, the Rev. William Corby, immediately planned for the rebuilding of the structure that had housed virtually the entire University. Construction was started on the 17th of May and by the incredible zeal of administrator and workers the building was completed before the fall semester of 1879. The library collection was also rebuilt and stayed housed in the new Main Building for years afterwards. Around the time of the fire, a music hall was opened. Eventually becoming known as Washington Hall, it hosted plays and musical acts put on by the school. By 18

## Summary:

### Data Analysis Key Findings

*   The necessary libraries (`transformers`, `datasets`, `torch`, `langchain`, `langchain-community`, `faiss-cpu`) were successfully installed.
*   A pre-trained micro LLM model, "distilgpt2", and its tokenizer were successfully loaded from the `transformers` library.
*   The SQuAD dataset was successfully loaded using the `datasets` library and processed to extract the 'context' information into a 'text' column.
*   A RAG system was successfully built using `RetrievalQA` from `langchain`, incorporating a `HuggingFaceEmbeddings` model, a `FAISS` vector store created from a subset of the processed data, and a `HuggingFacePipeline` for the loaded micro LLM.
*   Initial testing of the RAG system resulted in an `IndexError`, which was resolved by limiting the number of retrieved documents to one (`search_kwargs={"k": 1}`) to manage the input length for the micro LLM.
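*   The RAG system was able to process queries after the input length issue was addressed, but the quality of the generated answers was poor. The indexed 1,000-row subset did not contain passages relevant to the general-knowledge test queries, so the retriever returned unrelated context, and the micro LLM echoed the prompt and then produced inaccurate or truncated text. This highlights the limitations of using a micro LLM and a dataset like SQuAD for generative QA.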
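
Part of the noise in the printed responses is that each one repeats the full prompt before the model's continuation. A minimal post-processing sketch follows. It assumes the prompt ends with the `Helpful Answer:` marker seen in the test output, and `extract_answer` is a hypothetical helper, not a LangChain function.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last marker, or the whole text if absent."""
    _, found, answer = generated_text.rpartition(marker)
    return answer.strip() if found else generated_text.strip()


# Shortened stand-in for a response printed in the test section.
sample = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved passage.\n\n"
    "Question: What is the capital of France?\n"
    "Helpful Answer: The capital of France is the capit"
)
print(extract_answer(sample))  # The capital of France is the capit
```

This only trims the echo; it does not improve the answer itself, which still depends on the retrieved context and the model.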

### Insights or Next Steps

*   Consider using a larger, more capable language model if improved answer quality is required, or fine-tune the current micro LLM on a task-specific dataset.
*   Explore alternative datasets or pre-processing techniques that are better suited for generative question answering with a RAG system.
