# Build a Retrieval-Augmented Generation (RAG) System

We will create a functional RAG system that can answer questions based on a dataset loaded from Hugging Face Datasets. This system will:

  * Load the databricks/databricks-dolly-15k dataset.
  * Index the dataset content into a vector store.
  * Use a pre-trained question-answering model from Hugging Face.
  * Answer user queries by retrieving relevant documents and using the model to produce an answer from them.


1. Set up your environment:

  * Install all required libraries by running these commands:

In [3]:
!pip install -q langchain
!pip install -q torch
!pip install -q transformers
!pip install -q sentence-transformers
!pip install -q datasets
!pip install -q faiss-cpu
!pip install langchain_community
!pip install huggingface_hub

In [10]:
import torch
import langchain
import transformers
import faiss
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
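
Before running the imports, it can help to confirm that every package actually installed. The following is a minimal sketch added for illustration, not part of the original notebook: the helper name missing_packages is hypothetical, and it relies only on importlib from the standard library. Note that some import names differ from the pip package names (sentence-transformers is imported as sentence_transformers, faiss-cpu as faiss).

```python
import importlib.util


def missing_packages(module_names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used by this notebook (hypothetical checklist for illustration)
required = ["torch", "langchain", "transformers", "sentence_transformers", "datasets", "faiss"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, rerun the matching pip install line before continuing.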

2. Load the dataset:

  * Import HuggingFaceDatasetLoader from langchain.document_loaders.
  * Specify the dataset name and content column:

In [17]:
# Load Dataset from Hugging Face
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"


loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
data = loader.load()  # Load documents as a list

In [18]:
# Verify dataset contents before indexing
print("\n First 3 original documents before FAISS indexing:")
for i in range(min(3, len(data))):  # Ensure dataset is not empty
    print(data[i].page_content)
    print("-" * 50)


 First 3 original documents before FAISS indexing:
"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
--------------------------------------------------
""
--------------------------------------------------
""
--------------------------------------------------


Before cleaning, let's count the number of empty rows in the "context" column to see how many documents need to be filtered.

In [16]:
# Count number of empty context fields
empty_context_count = sum(1 for doc in data if doc.page_content.strip() == "")

print(f"\n - Number of empty context entries: {empty_context_count}")
print(f" - Total documents before filtering: {len(data)}")



 - Number of empty context entries: 0
 - Total documents before filtering: 15011

The count is 0 even though the second and third documents looked empty: an empty context is stored as the two-character string "" (a pair of double quotes) rather than as an empty string, most likely because the loader serializes each value as JSON. The raw documents printed next confirm this.


In [19]:
print("\n Checking raw dataset structure:")
for i in range(5):  # Print first 5 documents
    print(f"Document {i+1}: {data[i]}")
    print("-" * 50)



 Checking raw dataset structure:
Document 1: page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."' metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
--------------------------------------------------
Document 2: page_content='""' metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'}
--------

In [22]:
# Count documents where `page_content` is exactly '""' (including potential newlines)
double_quote_empty_count = sum(1 for doc in data if doc.page_content.strip() in ['""', '""\n', '""\n(', '""\n()'])

print(f"\n - Number of entries formatted as empty quotes ('\"\"'): {double_quote_empty_count}")
print(f" - Total documents before filtering: {len(data)}")



 - Number of entries formatted as empty quotes ('""'): 10544
 - Total documents before filtering: 15011


In [23]:
# Remove documents whose `page_content` is empty once the surrounding double quotes are removed
filtered_data = [doc for doc in data if doc.page_content.strip().replace('"', '') != ""]

print(f"\n Filtered dataset: {len(filtered_data)} non-empty documents available.")



 Filtered dataset: 4467 non-empty documents available.
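
The sketch below illustrates why the plain emptiness check found nothing while the quote-aware filter removed most rows. It is an illustration under assumptions, not the loader's code: the sample strings are hypothetical, json.dumps stands in for whatever serialization the loader applies, and is_blank is a hypothetical helper that is slightly stricter than the filter above because it also drops values that contain only whitespace inside the quotes.

```python
import json

# Hypothetical raw values standing in for rows of the "context" column
raw_contexts = ["Virgin Australia is an airline.", "", "   "]

# Assumption: each value is serialized as JSON, which wraps strings in double quotes
serialized = [json.dumps(value) for value in raw_contexts]


def is_blank(page_content):
    """Return True if nothing remains after removing double quotes and whitespace."""
    return page_content.replace('"', "").strip() == ""


# The naive check misses '""' because the quotes themselves are characters
naive_empty = sum(1 for s in serialized if s.strip() == "")
robust_empty = sum(1 for s in serialized if is_blank(s))
print(f"naive: {naive_empty}, quote-aware: {robust_empty}")
```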


In [26]:
# Check the average and maximum document length
doc_lengths = [len(doc.page_content) for doc in filtered_data]

print(f" - Number of documents: {len(filtered_data)}")
print(f" - Average document length: {sum(doc_lengths) // len(doc_lengths)} characters")
print(f" - Longest document: {max(doc_lengths)} characters")
print(f" - Shortest document: {min(doc_lengths)} characters")


 - Number of documents: 4467
 - Average document length: 1208 characters
 - Longest document: 23780 characters
 - Shortest document: 5 characters


3. Split the documents:

  * Import RecursiveCharacterTextSplitter from langchain.text_splitter.
  * Create a RecursiveCharacterTextSplitter instance with a chunk_size of 1000 and chunk_overlap of 150:

In [27]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

In [28]:
# Split only long documents
docs = []
for doc in filtered_data:
    if len(doc.page_content) > 1500:  # Only split long docs
        split_chunks = text_splitter.split_documents([doc])
        docs.extend(split_chunks)
    else:
        docs.append(doc)  # Keep short docs unchanged

print(f"\n - Total final documents after smart splitting: {len(docs)}")

# Print first few document chunks to verify
print("\n - Checking first 3 split documents:")
for i in range(min(3, len(docs))):
    print(f"Chunk {i+1}: {docs[i].page_content[:300]}...")  # Print only first 300 characters
    print("-" * 50)


 - Total final documents after smart splitting: 7146

 - Checking first 3 split documents:
Chunk 1: "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major ...
--------------------------------------------------
Chunk 2: "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegated to the J2 League at the end of the 2001 season. In 2002, he...
--------------------------------------------------
Chunk 3: "Stalemate is a situation in chess where the player whose turn it is to move is not in check and has no legal move. Stalemate results in a draw. During the endgame, stalem
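
To make chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function chunk_with_overlap is hypothetical and only shows how consecutive chunks share their boundary characters.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks


# Hypothetical 2000-character document made of repeating digits
sample_text = "".join(str(i % 10) for i in range(2000))
sample_chunks = chunk_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.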

4. Embed the text:

  * Import HuggingFaceEmbeddings from langchain.embeddings.
  * Define the model path, model configurations, and encoding options:

In [29]:
# Define the model path (pre-trained sentence-transformers model) and configurations
modelPath = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}  # Run on CPU (change to 'cuda' for GPU)
encode_kwargs = {'normalize_embeddings': False}  # Keep raw embeddings without normalization


# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


In [30]:
# (Optional) Test embedding on a sample text
text = "This is a test document."
query_result = embeddings.embed_query(text)
print("\n Embedding test result (first 3 values):", query_result[:3])


 Embedding test result (first 3 values): [-0.038338541984558105, 0.12346471846103668, -0.02864297851920128]
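
The normalize_embeddings option controls whether each vector is scaled to unit length. The sketch below, written in plain Python with hypothetical two-dimensional vectors, shows what that scaling changes: after normalization the dot product equals cosine similarity, and squared L2 distance becomes 2 - 2 * cosine, so both measures rank neighbours identically. With normalization turned off, as in this notebook, L2 distance also reflects vector length.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical vectors standing in for two embeddings
a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print("cosine:", cosine)
print("dot of unit vectors:", dot(a_unit, b_unit))
print("squared L2 of unit vectors:", squared_l2(a_unit, b_unit))
```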


5. Create a vector store:

  * Import FAISS from langchain.vectorstores.
  * Create a FAISS vector store from the document chunks and embeddings by calling FAISS.from_documents(docs, embeddings).
  * Note: This step might take some time depending on your dataset size.


In [33]:
# Create FAISS vector store from the processed document chunks and embeddings
db = FAISS.from_documents(docs, embeddings)

# Print confirmation
print("\n - FAISS database created successfully!")
print(f"  - Total documents stored in FAISS: {len(docs)}")


 - FAISS database created successfully!
  - Total documents stored in FAISS: 7146
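
Conceptually, a similarity search compares the query vector with every stored vector and keeps the closest ones. The sketch below is a brute-force illustration in plain Python, assuming L2 distance and an exhaustive scan (which we believe matches the default flat index that LangChain's FAISS wrapper builds, though FAISS itself is far more optimized). The function top_k and the sample vectors are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=4):
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return ranked[:k]


# Hypothetical 2-D "embeddings" for four document chunks
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = top_k([1.0, 1.0], vectors, k=2)
print(nearest)
```

The returned indices map back to document chunks, which is what the vector store hands to the retriever.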


6. Prepare the question-answering model:

  * Load the tokenizer and question-answering model:

In [35]:
tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")

Invalid model-index. Not loading eval results into CardData.


  * Create a question-answering pipeline:

In [38]:
from langchain_community.llms import HuggingFacePipeline

# Define the model name
model_name = "Intel/dynamic_tinybert"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Build the question-answering pipeline from the model and tokenizer
Youtubeer = pipeline(
  "question-answering",
  model=model_name,
  tokenizer=tokenizer,
  return_tensors='pt'
)

Invalid model-index. Not loading eval results into CardData.
Device set to use cpu


In [40]:
# Check if tokenizer and model are loading correctly
print(f" - Model: {model_name} loaded successfully!")
print(f" - Tokenizer: {type(tokenizer)}")


 - Model: Intel/dynamic_tinybert loaded successfully!
 - Tokenizer: <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>


  * Create a LangChain pipeline wrapper:

In [41]:
llm = HuggingFacePipeline(
  pipeline=Youtubeer,
  model_kwargs={"temperature": 0.7, "max_length": 512},
)


7. Build the Retrieval QA Chain:

  * Import RetrievalQA from langchain.chains:

In [42]:
from langchain.chains import RetrievalQA

  * Create a retriever from your FAISS database:

In [43]:
# Convert the FAISS database into a retriever
retriever = db.as_retriever(search_kwargs={"k": 4})  # "k" is the number of documents retrieved per query

  * Build the RetrievalQA chain:

In [44]:
qa = RetrievalQA.from_chain_type(
    llm=llm,  # Language model we defined earlier
    chain_type="refine",  # "stuff" is another option, but "refine" improves answers iteratively
    retriever=retriever,  # Connects FAISS retriever to the LLM
    return_source_documents=True  # Returns retrieved documents for verification
)
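
The chain_type argument decides how the retrieved documents reach the model. The sketch below contrasts the two options in plain Python. It is a conceptual illustration, not LangChain's implementation: answer_stuff, answer_refine, the prompt wording, and fake_llm are all hypothetical stand-ins.

```python
def answer_stuff(question, docs, llm_call):
    """'stuff': place every retrieved document in one prompt and call the model once."""
    context = "\n\n".join(docs)
    return llm_call(f"Context:\n{context}\n\nQuestion: {question}")


def answer_refine(question, docs, llm_call):
    """'refine': answer from the first document, then revise once per remaining document."""
    if not docs:
        raise ValueError("at least one document is required")
    answer = llm_call(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm_call(
            f"Existing answer: {answer}\nNew context:\n{doc}\n\nRefine the answer to: {question}"
        )
    return answer


# Hypothetical stand-in for a model: records each prompt and returns a fixed string
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer after call {len(calls)}"


sample_docs = ["doc one", "doc two", "doc three", "doc four"]
answer_stuff("What is cheesemaking?", sample_docs, fake_llm)
stuff_calls = len(calls)
calls.clear()
answer_refine("What is cheesemaking?", sample_docs, fake_llm)
refine_calls = len(calls)
print(f"stuff: {stuff_calls} call(s), refine: {refine_calls} call(s)")
```

With k set to 4, "stuff" needs the whole context to fit in a single model input, while "refine" trades more model calls for smaller inputs.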


8. Test your RAG system:

  * Define your question:

In [45]:
question = "What is cheesemaking?"

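  * Retrieve the relevant documents, join them into one context string, and pass that context together with the question directly to the question-answering pipeline: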

In [52]:
# Retrieve relevant documents from FAISS
retrieved_docs = retriever.get_relevant_documents(question)

# Extract context (combine multiple retrieved documents)
context = "\n".join([doc.page_content for doc in retrieved_docs])

# Print the retrieved context for verification
print("\n - Retrieved Context:")
print(context[:500])  # Print only the first 500 characters for readability

# Ensure the pipeline receives the correct format
response = Youtubeer({
    "question": question,
    "context": context
})

# Print the generated answer
print("\n - Final Answer:")
print(response)



 - Retrieved Context:
"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be 





 - Final Answer:
{'score': 0.40952709317207336, 'start': 30, 'end': 73, 'answer': 'to control the spoiling of milk into cheese'}
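
The pipeline returns a dictionary in which start and end are character offsets into the context and score is the model's confidence in that span. The sketch below re-extracts the span from the first sentence of the retrieved context and compares it with the reported answer. The helper check_answer and the 0.3 score threshold are hypothetical choices made for illustration.

```python
def check_answer(response, context, min_score=0.3):
    """Re-extract the answer span from the context and flag low-confidence results."""
    span = context[response["start"]:response["end"]]
    return {
        "answer": response["answer"],
        "span_matches": span == response["answer"],
        "confident": response["score"] >= min_score,
    }


# First sentence of the retrieved context (including its leading quote) and the printed response
context = '"The goal of cheese making is to control the spoiling of milk into cheese.'
response = {'score': 0.40952709317207336, 'start': 30, 'end': 73,
            'answer': 'to control the spoiling of milk into cheese'}

result = check_answer(response, context)
print(result)
```

Because the model is extractive, the answer is always a substring of the context, so it can only be as good as the documents the retriever returned.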
