<a href="https://colab.research.google.com/github/charlottejin95/RAG/blob/main/Retrieval-Augmented%20Generation(RAG)%20Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG)

RAG is an artificial intelligence technique that combines information retrieval with text generation: the model retrieves relevant information from a knowledge source and incorporates it into the generated text.

RAG enables Large Language Models (LLMs) to produce higher-quality, more context-aware output than generation without retrieval. The RAG framework has two parts: a **Retriever Component** and a **Generator Component**.
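
To make the two components concrete, here is a minimal, library-free sketch. The word-overlap scoring and the names `tokenize`, `retrieve`, and `build_prompt` are illustrative assumptions, not part of LlamaIndex: a real retriever compares embeddings, and the generator is an LLM call that receives a prompt like the one built here.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, top_k=1):
    """Toy retriever: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_passages):
    """Build the text a generator (LLM) would receive: retrieved context plus the question."""
    context = "\n".join(context_passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical two-passage knowledge source
passages = [
    "Cotard's syndrome is a condition in which the patient denies his or her own existence.",
    "LlamaIndex is a data framework for building LLM applications.",
]
query = "What is Cotard's syndrome?"
top_passages = retrieve(query, passages, top_k=1)
prompt = build_prompt(query, top_passages)
print(prompt)
```

The retriever narrows the knowledge source down to the passages worth reading, and the generator only ever sees those passages.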

## Environment Set-up

This notebook uses the **LlamaIndex** package, a widely used data framework for building LLM applications.

In [None]:
! pip install llama_index openai

In [2]:
# Set the OPENAI_API_KEY environment variable to your own key
import os
os.environ['OPENAI_API_KEY'] = 'Your Key'


In [3]:
#Import packages
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Document

## Step 1: Loading

Loading external documents and data (Documents loader)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# load documents
documents = SimpleDirectoryReader("/content/drive/MyDrive/Colab Notebooks/LO-Projects/ML Project/RAG/Reference_files").load_data()

print(documents)
print(type(documents),len(documents))
# The reader splits the PDF into 8 Document objects automatically, usually one per page, for faster processing.

[Document(id_='17056afc-b0d2-4b97-9103-b4cc131d9f43', embedding=None, metadata={'page_label': '1', 'file_name': 'A Neuropsychiatric Analysis of the Cotard Delusion.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/LO-Projects/ML Project/RAG/Reference_files/A Neuropsychiatric Analysis of the Cotard Delusion.pdf', 'file_type': 'application/pdf', 'file_size': 613934, 'creation_date': '2025-06-15', 'last_modified_date': '2025-06-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='A Neuropsychiatric Analysis of the Cotard Delusion\nAradhana Sahoo, B.S., Keith A. Josephs, M.D.\nCotard’s syndrome, a condition in which the patient denie

## Step 2: Indexing

Create the index using transformations: they split your data into smaller, manageable chunks (nodes) and structure them so that only the relevant chunks are retrieved during a query.
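
The idea behind `chunk_size` and `chunk_overlap` can be sketched without any library. This is a simplified illustration, assuming whitespace-separated words stand in for tokens; the function name `chunk_tokens` is hypothetical, and real splitters use a tokenizer and also respect separators or sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is how overlap reduces the context lost at chunk boundaries.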

In [6]:
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import SimpleNodeParser
import tiktoken
from llama_index.embeddings.openai import OpenAIEmbedding

from llama_index.core import Settings

from llama_index.core import VectorStoreIndex

In [7]:
# Transformation

# Create TokenTextSplitter to use: Split to appropriate token size
text_splitter = TokenTextSplitter( separator=" ",
                                   chunk_size=1024, #chunk size is 1024 tokens
                                   chunk_overlap=20, # num of overlapping tokens between two chunks, minimize context loss
                                   backup_separators=["\n"],
                                   tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
                                   # use the same tokenizer as OpenAI's gpt-3.5-turbo, and return a list of token IDs
                                  )


# Create SentenceSplitter to use; make sure sentence is not separated
node_parser = SentenceSplitter( chunk_size=1024,
                                chunk_overlap=20)

In [8]:
# Define the large language model to use:
llm = OpenAI( model='gpt-3.5-turbo',
              temperature=0, # higher values give more varied answers; 0 makes the output (nearly) deterministic
              max_tokens=256 # allowed max number of tokens in the generated answer
            )


Settings.llm=llm
# Note: Settings.text_splitter is an alias of Settings.node_parser, so the second assignment replaces the first.
Settings.text_splitter=text_splitter
Settings.node_parser=node_parser
#TokenTextSplitter: ensures that each text chunk does not exceed the model's token limit, crucial for handling long texts.
#SentenceSplitter: ensures that text chunks are split at sentence boundaries to avoid cutting sentences in half,
#                  thereby preserving contextual coherence.


# The keyword is `transformations` (plural); only the splitters belong in it, since the LLM is not a transformation.
index = VectorStoreIndex.from_documents(documents,
                                        transformations=[text_splitter,node_parser] # pipeline of processing steps applied to the documents
                                        )
print(index)

<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x7a2e93f3ba50>


## Step 3: Storing

Save all components of the index to disk. \
This makes the index persistent, so the next run can load it instead of rebuilding it.
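
A common way to use persistence is a "load if present, otherwise build" check. The sketch below shows the pattern with a toy JSON file standing in for the index; `load_or_build`, `build_toy_index`, and `load_toy_index` are hypothetical helpers, not LlamaIndex functions.

```python
import json
import os
import tempfile

def load_or_build(persist_dir, build, load):
    """Reload from persist_dir when it has content; otherwise build (and persist) from scratch."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir), "loaded"
    return build(persist_dir), "built"

def build_toy_index(persist_dir):
    """Stand-in for the expensive step: create the index and write it to disk."""
    os.makedirs(persist_dir, exist_ok=True)
    toy_index = {"index_id": "vector_index", "chunks": ["chunk A", "chunk B"]}
    with open(os.path.join(persist_dir, "index.json"), "w") as f:
        json.dump(toy_index, f)
    return toy_index

def load_toy_index(persist_dir):
    """Stand-in for the cheap step: read the saved index back."""
    with open(os.path.join(persist_dir, "index.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    storage = os.path.join(tmp, "storage")
    first_index, first_action = load_or_build(storage, build_toy_index, load_toy_index)
    second_index, second_action = load_or_build(storage, build_toy_index, load_toy_index)

print(first_action, second_action)
```

With LlamaIndex, `build` would wrap the index construction and the persist call, and `load` would wrap the storage-context and load calls used in this step.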

In [9]:
from llama_index.core import StorageContext, load_index_from_storage

In [10]:
# save index to disk
index.set_index_id("vector_index")
# set the saving path: content/storage
index.storage_context.persist("./storage")

In [11]:
# rebuild storage context, from the given path
storage_context = StorageContext.from_defaults(persist_dir="storage")

# load index, by providing the previously set index_id
index = load_index_from_storage(storage_context, index_id="vector_index")

Loading llama_index.core.storage.kvstore.simple_kvstore from storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from storage/index_store.json.


## Step 4: Querying

Use LlamaIndex to retrieve the most relevant chunk from the indexed documents.
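
Conceptually, vector retrieval scores every stored chunk against the query and keeps the `k` best. The following is a minimal sketch of that idea, assuming cosine similarity and tiny hand-made 3-dimensional vectors in place of real embeddings; `cosine_similarity`, `top_k`, and the chunk names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=1):
    """Return the k (score, chunk_name) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Hypothetical embeddings: one vector per stored chunk
chunk_vecs = {
    "chunk_about_cotard": [0.9, 0.1, 0.0],
    "chunk_about_imaging": [0.2, 0.8, 0.1],
    "chunk_about_treatment": [0.1, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, chunk_vecs, k=1)
print(results)
```

Raising `k` returns more chunks, which gives the LLM more context at the cost of a longer prompt.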

In [12]:
from llama_index.core.retrievers import VectorIndexRetriever

In [22]:
# Retriever, query from previously created index

# Configure retriever
retriever = VectorIndexRetriever( index=index,
                                  similarity_top_k=1, # return the k most similar chunks (here, only the best match)
                                )

# Provide query content
query = 'What is Cotard Delusion?'
# query = 'What are the potential diagnostic associations of Cotard Delusion'

# Find the nodes that best match the query:
nodes=retriever.retrieve(query) # list of the top-k nodes (chunks), each with a similarity score

##### **The most relevant chunk for the query**:

Note: the size of the returned node depends on the chunking settings used during indexing (`chunk_size`, `chunk_overlap`).

In [23]:
print(nodes[0].text)
#print(len(nodes[0].text))

A Neuropsychiatric Analysis of the Cotard Delusion
Aradhana Sahoo, B.S., Keith A. Josephs, M.D.
Cotard’s syndrome, a condition in which the patient denies
his or her own existence or the existence of body parts, is a
rare illness that has been reported in association with several
neuropsychiatric diagnoses. The majority of published lit-
erature on the topic is in the form of case reports, many of
which are several years old. The authors evaluated associated
diagnoses, neuroimaging, and treatments recorded in pa-
tients diagnosed with Cotard’s syndrome at their institution. A
search of the Mayo Clinic database for patients with mention
of signs and symptoms associated with Cotard’s in their re-
cords between 1996 and 2016 was conducted. The electronic
medical records of the identiﬁed patients were then reviewed
for evidence of a true diagnosis of Cotard’s. Clinical and
neuroimaging data were also recorded for these patients. The
search identiﬁed 18 patients, 14 of whom had Cotard delu-

## Step 5: Querying (Assemble the Query Engine)

Building a full question-answering pipeline over the documents provided using LlamaIndex.
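
Between retrieval and answer generation, a similarity cutoff can discard weak matches. The sketch below illustrates that filtering step under the assumption that each retrieved chunk carries a similarity score; `apply_similarity_cutoff` and the scores are made up for illustration.

```python
def apply_similarity_cutoff(scored_chunks, cutoff):
    """Keep only (text, score) pairs whose score is at least the cutoff."""
    return [(text, score) for text, score in scored_chunks if score >= cutoff]

# Hypothetical retrieval results: (chunk text, similarity score)
scored_chunks = [
    ("Cotard's syndrome is a condition in which the patient denies ...", 0.83),
    ("The search identified 18 patients ...", 0.47),
]

kept = apply_similarity_cutoff(scored_chunks, cutoff=0.5)
kept_strict = apply_similarity_cutoff(scored_chunks, cutoff=0.9)
print(len(kept), len(kept_strict))
```

A cutoff that is too strict can remove every chunk, leaving the generator with no context to answer from, so the value is worth tuning against real queries.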

In [24]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

In [25]:
# Design query engine

# configure response synthesizer
# Purpose: Turn retrieved document chunks into a natural language answer using an LLM
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(retriever=retriever, #use the retriever designed before
                                    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
                                    # drop retrieved chunks whose embedding similarity score is below 0.5 (for cosine similarity, 1 is a perfect match)
                                    response_synthesizer=response_synthesizer
                                    )

"""Note：
A response synthesizer generates a response from an LLM, using a user query and a given set of text chunks.
The output of a response synthesizer is a Response object. The method for doing this can take many forms, from as simple
as iterating over text chunks, to as complex as building a tree. The main idea here is to simplify the process of generating
a response using an LLM across your data.

When used in a query engine, the response synthesizer is used after nodes are retrieved from a retriever,
and after any node postprocessors are run.
"""

response = query_engine.query(query)

##### **Natural language answer using the query engine**:

In [34]:
import textwrap

wrapped_text = textwrap.fill(response.response, width=100)  # Wrap each line at a maximum of 100 characters
print(wrapped_text)

Cotard Delusion is a rare neuropsychiatric condition where the patient denies their own existence or
the existence of body parts. It is characterized by nihilistic delusions that can range from denial
of body parts to negation of self-existence.
