# 📘 Retrieval-Augmented Generation (RAG) System using TinyLlama

## Introduction

This notebook implements a **Retrieval-Augmented Generation (RAG)** system. RAG is a hybrid approach that combines:

- **Information Retrieval** (from documents)
- **Text Generation** (using an LLM)

Compared with using an LLM on its own, this approach:

- Reduces hallucination
- Enables question answering over private documents
- Avoids retraining large language models

The system is built using:

- **TinyLlama (1.1B Chat Model)** from Hugging Face  
- **Vector Database** (Weaviate)  
- **LangChain** for orchestration  

The system enables question answering over a **custom PDF document** (e.g., a Transformer research paper) by retrieving relevant content and generating grounded answers.

## System Architecture

The complete RAG pipeline follows the architecture below:

PDF Document  
↓  
Document Loader  
↓  
Text Chunking  
↓  
Embedding Model  
↓  
Vector Database  
↓  
Retriever  
↓  
Prompt + Context  
↓  
TinyLlama (LLM)  
↓  
Final Answer


## Required Libraries and Environment Setup

The following Python libraries are required to run this project:
- transformers
- torch
- langchain
- langchain-community
- weaviate-client

These libraries enable model loading, document processing, vector search,
and RAG orchestration.


In [6]:
!pip install weaviate-client langchain tiktoken pypdf rapidocr-onnxruntime langchain-community langchain-weaviate bitsandbytes

Collecting weaviate-client
  Using cached weaviate_client-4.18.3-py3-none-any.whl.metadata (3.7 kB)
Using cached weaviate_client-4.18.3-py3-none-any.whl (599 kB)
Installing collected packages: weaviate-client
  Attempting uninstall: weaviate-client
    Found existing installation: weaviate-client 3.26.7
    Uninstalling weaviate-client-3.26.7:
      Successfully uninstalled weaviate-client-3.26.7
Successfully installed weaviate-client-4.18.3


In [10]:
!pip uninstall weaviate-client -y
!pip install "weaviate-client>=3.26.7,<4.0.0"


Found existing installation: weaviate-client 4.18.3
Uninstalling weaviate-client-4.18.3:
  Successfully uninstalled weaviate-client-4.18.3
Collecting weaviate-client<4.0.0,>=3.26.7
  Using cached weaviate_client-3.26.7-py3-none-any.whl.metadata (3.4 kB)
Using cached weaviate_client-3.26.7-py3-none-any.whl (120 kB)
Installing collected packages: weaviate-client
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-weaviate 0.0.6 requires weaviate-client<5.0.0,>=4.0.0, but you have weaviate-client 3.26.7 which is incompatible.
Successfully installed weaviate-client-3.26.7


In [7]:
WEAVIATE_API_KEY = "use your own API key"
WEAVIATE_CLUSTER_URL = "use your own cluster URL"


In [11]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [12]:
# Sentence Transformers embedding model
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## Loading and Chunking the PDF Document

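The PDF document is loaded and split into overlapping text chunks (800 characters each, with a 150-character overlap).
Smaller chunks let the retriever return focused passages, and the overlap reduces the chance that a sentence is cut off at a chunk boundary.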


In [13]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("/content/Transformer.pdf", extract_images=True)
pages = loader.load()

In [14]:
pages

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/Transformer.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kais

In [15]:
# Chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
docs = text_splitter.split_documents(pages)
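
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it ignores natural boundaries such as paragraphs and sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, which is the role `chunk_overlap=150` plays in the notebook at a larger scale.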

In [16]:
# Clean metadata keys: Weaviate property names cannot contain ".", so replace it with "_"
for doc in docs:
    doc.metadata = {k.replace(".", "_"): v for k, v in doc.metadata.items()}


In [17]:
docs

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex_fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/Transformer.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kais

## Vector Database and Retriever

All document chunks are stored as vector embeddings.
The retriever fetches the most relevant chunks for a given query.
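
The idea behind similarity-based retrieval can be sketched in plain Python. This is an illustration under stated assumptions, not Weaviate's implementation: the three-dimensional vectors and chunk labels are made up, cosine similarity is assumed as the scoring function, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector store": (chunk text, made-up embedding) pairs.
toy_store = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about training", [0.1, 0.9, 0.1]),
    ("chunk about results", [0.0, 0.2, 0.9]),
]
query_embedding = [1.0, 0.2, 0.0]  # made-up embedding of a query

top_results = top_k(query_embedding, toy_store, k=2)
for text, score in top_results:
    print(f"{score:.3f}  {text}")
```

A real vector database avoids scoring every stored vector by using an approximate nearest-neighbour index, but the ranking principle is the same.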


In [18]:
from langchain_community.vectorstores import Weaviate
import weaviate

# Connect with the weaviate-client v3 API using API-key authentication
client = weaviate.Client(
    url=WEAVIATE_CLUSTER_URL,
    auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY)
)


In [19]:
vector_db = Weaviate.from_documents(
    documents=docs,
    embedding=embeddings,
    client=client,
    by_text=False
)

In [20]:
vector_db.similarity_search("what is transformer", k=3)

[Document(metadata={'author': '', 'creationdate': '2024-04-10T21:11:43Z', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2024-04-10T21:11:43Z', 'page': 7, 'page_label': '8', 'producer': 'pdfTeX-1.40.25', 'ptex_fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'source': '/content/Transformer.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': '/False'}, page_content='Transformer (big) 28.4 41.8 2.3 · 1019\nResidual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the\nsub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the\npositional encodings in both the encoder and decoder stacks. For the base model, we use a rate of\nPdrop = 0.1.\nLabel Smoothing During training, we employed label smoothing of value ϵls = 0.1 [36]. This\nhurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.\n6 Result

In [21]:
doc, score = vector_db.similarity_search_with_score(
    "what is transformer", k=3
)[0]

print(doc.page_content)
print("Score:", score)


Transformer (big) 28.4 41.8 2.3 · 1019
Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
Pdrop = 0.1.
Label Smoothing During training, we employed label smoothing of value ϵls = 0.1 [36]. This
hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
6 Results
6.1 Machine Translation
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0
Score: 0.4052300956423408


In [22]:
doc, score = vector_db.similarity_search_with_score(
    "what is transformer", k=3
)[1]

print(doc.page_content)
print("Score:", score)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
Score: 0.39995650149855544


In [23]:
doc, score = vector_db.similarity_search_with_score(
    "what is transformer", k=3
)[2]

print(doc.page_content)
print("Score:", score)

is similar to that of single-head attention with full dimensionality.
3.2.3 Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every
position in the decoder to attend over all positions in the input sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence models such as
[38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values
and queries come from the same place, in this case, the output of the previous layer in the
Score: 0.3604415144593106


In [24]:
results = vector_db.similarity_search_with_score(
    "Why does the Transformer model not use recurrence or convolution?",
    k=3
)


In [25]:
for i, (doc, score) in enumerate(results, 1):
    print(f"\n--- Chunk {i} | Score: {score} ---")
    print(doc.page_content[:600])



--- Chunk 1 | Score: 0.4646717071047235 ---
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little

--- Chunk 2 | Score: 0.43008942023755814 ---
is similar to that of single-head attention with full dimensionality.
3.2.3 Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values co

In [26]:
from google.colab import userdata

# Read the Hugging Face token from Colab's secret store
HF_TOKEN = userdata.get('HUGGINGFACE_TOKEN')

## Loading TinyLlama from Hugging Face

We load the **TinyLlama 1.1B Chat model** from Hugging Face.
With `device_map="auto"`, the model is placed on the GPU when one is available.


In [38]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.float16
)

model.eval()
print("TinyLlama loaded ")


TinyLlama loaded 
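
A rough estimate shows what loading in `torch.float16` saves. The arithmetic below assumes about 1.1 billion parameters (taken from the model name) and counts the weights only; activations, the KV cache, and framework overhead are not included. The helper `weight_memory_gib` is hypothetical.

```python
# Back-of-the-envelope weight memory for a 1.1B-parameter model.
# Assumption: the parameter count is approximated as 1.1e9 from the model name.
NUM_PARAMS = 1.1e9
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Half precision halves the weight footprint relative to `float32`, which leaves more headroom on a small Colab GPU.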


## Wrapping TinyLlama for LangChain

LangChain chains are built from runnables, so the model needs a callable interface.
We wrap TinyLlama using `RunnableLambda` so it can be used inside a RAG chain.


In [39]:
from langchain_core.runnables import RunnableLambda

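def tinyllama_llm(prompt: str) -> str:
    """Generate a completion for `prompt` and return only the new text."""
    # Tokenize the prompt; inputs longer than 1024 tokens are truncated
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(model.device)

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            top_p=0.9,
            do_sample=True  # sampling makes the output vary between runs
        )

    # Drop the prompt tokens and decode only the newly generated ones
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    )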

llm = RunnableLambda(tinyllama_llm)
print("Runnable ready")

Runnable ready
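
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, which is why the wrapper slices the output at the prompt length before decoding. The sketch below illustrates that slicing with plain lists; the token IDs are made up and no model is involved.

```python
# Made-up token IDs standing in for tokenizer and model output.
prompt_ids = [101, 2054, 2003, 1037]          # tokens of the prompt
generated_ids = prompt_ids + [7099, 3437, 2]  # generate() output: prompt + new tokens

# Keep only what the model added after the prompt.
prompt_length = len(prompt_ids)
new_token_ids = generated_ids[prompt_length:]

print(new_token_ids)
```

Without this slice, the decoded string would repeat the whole prompt, including the retrieved context, before the answer.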


In [40]:
retriever = vector_db.as_retriever(search_kwargs={"k": 3})
print("Retriever ready")


Retriever ready


In [41]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


## Prompt Template Design

A strict prompt is used to encourage the model to:
- Answer only from the retrieved context
- Avoid hallucinations
- Keep responses concise and accurate
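
A prompt template is text with placeholders that are filled in at query time. The sketch below fills a plain-string version of this notebook's prompt using Python's `str.format`. It is an illustration only: in the real chain `ChatPromptTemplate` performs the substitution, and the example context (a sentence from one retrieved chunk) and question are chosen by hand.

```python
# Plain-string stand-in for the notebook's prompt template (illustration only).
TEMPLATE = """You are a research assistant.

RULES:
- Answer ONLY using the context.
- If the answer is not found in the context, say:
  "Not found in the provided context."

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

# Hand-picked example inputs.
example_context = "The encoder is composed of a stack of N = 6 identical layers."
example_question = "How many layers does the encoder have?"

filled_prompt = TEMPLATE.format(context=example_context, question=example_question)
print(filled_prompt)
```

Ending the prompt with `ANSWER:` leaves the model to continue from that point, so its completion is the answer text.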


In [31]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
You are a research assistant.

RULES:
- Answer ONLY using the context.
- If the answer is not found in the context, say:
  "Not found in the provided context."

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
""")


## Output Parsing

The model output is parsed into a clean string format
using `StrOutputParser`.


In [32]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()


## Building the RAG Chain

The RAG chain connects:
- Question
- Retriever
- Prompt
- TinyLlama model
- Output parser

This forms the complete Retrieval-Augmented Generation pipeline.


In [42]:
from operator import itemgetter

rag_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | prompt
    | (lambda prompt_value: prompt_value.to_string())
    | llm
    | output_parser
)

print(" RAG chain")

 RAG chain
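
The data flow of the chain can be mimicked with ordinary functions, which may help when debugging each stage in isolation. Everything in this sketch is a stand-in: `Doc`, `fake_retriever`, `build_prompt`, `fake_llm`, and the two-entry corpus are hypothetical and replace the LangChain `Document`, Weaviate, the prompt template, and TinyLlama.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str


# Hypothetical corpus replacing the vector database.
CORPUS = [
    Doc("The Transformer relies entirely on an attention mechanism."),
    Doc("Label smoothing hurts perplexity but improves BLEU score."),
]


def fake_retriever(question: str, k: int = 1) -> list[Doc]:
    """Rank documents by how many question words they contain."""
    words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda d: len(words & set(d.page_content.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def format_docs(docs: list[Doc]) -> str:
    """Join chunk texts with blank lines, as in the notebook."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(context: str, question: str) -> str:
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n"


def fake_llm(prompt: str) -> str:
    """Placeholder model: echoes the first context line instead of generating."""
    return prompt.split("CONTEXT:\n", 1)[1].split("\n", 1)[0]


def rag_pipeline(inputs: dict) -> str:
    question = inputs["question"]
    context = format_docs(fake_retriever(question))  # question -> retriever -> format_docs
    prompt = build_prompt(context, question)         # fill the template
    return fake_llm(prompt).strip()                  # model -> plain string


answer = rag_pipeline({"question": "What mechanism does the Transformer rely on?"})
print(answer)
```

The `|` operator in the LangChain chain expresses the same left-to-right hand-off between stages.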


## Running the RAG System

In this step, we ask a question related to the document.
The system retrieves relevant content and generates a grounded answer.


In [36]:
question = "What is the Transformer model and what are its key advantages?"

response = rag_chain.invoke({"question": question})
print(response)


The Transformer is a model for machine translation, which is a task where the goal is to translate
one language into another, usually using a machine. The Transformer is a large model that has
been designed to be very efficient and perform well on this task. The Transformer is a hierarchical
encoder-decoder model, where the encoder encodes the input sequence into a context vector, and the
decoder decodes this context vector to produce a translation. The Transformer is very effective in
this task because it can model long-range dependencies between words in the input sequence. This
model has several key advantages:
• It is very efficient, as it can use a very large number of parameters.
• It can model long-range dependencies, which can help to improve translation quality.
• It is very flexible, as it can be trained on a wide range of data, including language-specific
data, and can be fine-


In [37]:
question = "What is the Transformer model and what are its key advantages?"

response = rag_chain.invoke({"question": question})
print(response)

The Transformer is a new architecture for neural machine translation (NMT) that uses attention to
handle long-range dependencies. It is a transformer network with a hierarchical encoding, where each
encoder layer encodes multiple sub-sequences of the input sequence. The encoder has self-attention,
where the query is a concatenation of all keys from the previous layer, and the value is the
concatenation of all keys from the previous layer. The decoder has a multi-head attention where the
query comes from the previous decoder layer, and the memory keys and values come from the output
of the encoder. The key advantage of the Transformer is that it can handle long-range dependencies
by attending over all possible positions in the input sequence.
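
The two runs above give different answers to the same question. That is expected with `do_sample=True`: tokens are drawn at random from the model's probability distribution, which `temperature` reshapes. The sketch below illustrates temperature scaling and the contrast with greedy decoding on made-up logits. It is not the `transformers` implementation and it omits `top_p` filtering; the helper `softmax_with_temperature` and the three-token vocabulary are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a three-token vocabulary.
vocab = ["attention", "recurrence", "convolution"]
logits = [2.0, 1.0, 0.5]

probs_sampling = softmax_with_temperature(logits, temperature=0.7)
probs_flat = softmax_with_temperature(logits, temperature=2.0)

# Greedy decoding always picks the most likely token, so it is deterministic.
greedy_token = vocab[probs_sampling.index(max(probs_sampling))]

# Sampling draws at random, so repeated calls can return different tokens.
rng = random.Random()
sampled_tokens = [rng.choices(vocab, weights=probs_sampling, k=1)[0] for _ in range(5)]

print("greedy :", greedy_token)
print("sampled:", sampled_tokens)
```

Passing `do_sample=False` to `model.generate` (and dropping the sampling parameters) switches to greedy decoding, which makes repeated runs return the same answer.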
