# RAG: Retrieval Augmented Generation

**Basic RAG pipeline:**

Ingestion: Document --> Chunking --> Embedding --> Indexing

---

Retrieval: Query (embedded) + Knowledge database (indexed) --> Semantic search --> Top-k

---

Response generation: Query + Top-k --> LLM --> Response
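To make the three stages concrete before bringing in any library, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the `embed` function is a toy word-count vector standing in for a real embedding model, the "index" is a plain list, and the final prompt is only printed rather than sent to an LLM.

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion: document -> chunks -> embeddings -> index
document = "Giraffes eat acacia leaves. Giraffes live in savannahs. Lions prey on giraffes."
chunks = [s.strip() for s in document.split(".") if s.strip()]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Retrieval: embed the query, rank chunks by similarity, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Response generation: the query plus the top-k chunks form the prompt for the LLM
query = "What do giraffes eat"
context = retrieve(query, k=1)
prompt = f"Context: {' '.join(context)}\nQuestion: {query}"
print(prompt)
```

The rest of the notebook replaces each toy piece with a library component: a text splitter for chunking, a sentence-transformer for embedding, and a vector database for indexing and search.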

## Ingestion

In [None]:
!pip install langchain -q
!pip install langchain-community -q


### Loading Documents

---

🛠️ LangChain provides document loaders for text, CSV, PDF, HTML, and other formats. Details can be found here: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/

In [None]:
from langchain_community.document_loaders import TextLoader

In [None]:
# Document containing the additional information that needs to be supplied to the LLM
documents ="""The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns. Seven other extinct species of Giraffa are known from the fossil record.
The giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns.
It is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east.
Giraffes usually inhabit savannahs and woodlands. Their food source is leaves, fruits, and flowers of woody plants, primarily acacia species, which they browse at heights most other herbivores cannot reach.
Lions, leopards, spotted hyenas, and African wild dogs may prey upon giraffes. Giraffes live in herds of related females and their offspring or bachelor herds of unrelated adult males, but are gregarious and may gather in large aggregations. Males establish social hierarchies through "necking", combat bouts where the neck is used as a weapon. Dominant males gain mating access to females, which bear sole responsibility for rearing the young."""

In [None]:
# For completeness, save it as a text file and then load it with a document loader.
with open('sample_doc.txt', 'w') as file:
    file.write(documents)

We could code these steps of the RAG pipeline in pure Python, but several mature libraries already provide these abstractions, so we will use them. **LangChain, LlamaIndex, and Haystack** are some of the well-known ones.

In [None]:
loader = TextLoader('sample_doc.txt')

In [None]:
document = loader.load()

In [None]:
docs = document[0].page_content

### Chunking

---

🛠️ We can perform character-level splitting, recursive splitting, document-based chunking, and semantic chunking with LangChain.
https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

**Character Splitter**

It splits the text on a single separator string.

---

- Splitting based on: a single separator (default "\n\n")

- Chunk size: number of characters
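The idea behind separator-based splitting can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and overlap is left out on purpose.

```python
def split_on_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece fits, or it starts a new chunk (even if oversized).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccc"
print(split_on_separator(sample, chunk_size=10))
```

Note that a single piece longer than `chunk_size` is kept whole, because the only allowed cut points are the separators. That is the main limitation the recursive splitter addresses.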

In [None]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=600,
    length_function=len
)

In [None]:
docs

'The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns. Seven other extinct species of Giraffa are known from the fossil record.\nThe giraffe\'s chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns. \nIt is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east. \nGiraffes usually inhabit savannahs and woodlands. Their food source is lea

In [None]:
text_splitter.split_documents(document)  # returns Document objects with metadata, as shown in the output

[Document(page_content='The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns. Seven other extinct species of Giraffa are known from the fossil record.', metadata={'source': 'sample_doc.txt'}),
 Document(page_content="The giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns. \nIt is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in

In [None]:
# The chunk_overlap argument makes sure a certain number of characters overlap between two adjacent chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    length_function=len,
    chunk_overlap=200
)

In [None]:
text_splitter.split_text(docs)[0]

"The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns. Seven other extinct species of Giraffa are known from the fossil record.\nThe giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns."

In [None]:
text_splitter.split_text(docs)[1]

"The giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns. \nIt is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east. \nGiraffes usually inhabit savannahs and woodlands. Their food source is leaves, fruits, and flowers of woody plants, primarily acacia species, which they browse at heights most other herbivores cannot reach."
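To see what overlap means in isolation, here is a small sketch using fixed-size character windows. It is a simplification: the hypothetical `sliding_chunks` function cuts at arbitrary character positions, whereas in the output above the overlap is a whole paragraph because the splitter only cuts at the separator.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Overlap trades storage for context: information that straddles a chunk boundary appears in both neighbours, so a retrieved chunk is less likely to start or end mid-thought.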

**Recursively split by character**

It recursively splits the text using a list of separators, moving to the next separator whenever a piece is still larger than the specified chunk size.

---
- Text split by: list of characters (default: ["\n\n", "\n", " ", ""])

- Chunk size: number of characters
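The recursion can be sketched in plain Python as follows. This is an illustrative approximation under stated assumptions, not the library's code: `recursive_split` is a hypothetical name, overlap is omitted, and empty pieces are simply skipped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("alpha beta\ngamma delta epsilon", chunk_size=12))
# ['alpha beta', 'gamma delta', 'epsilon']
```

Because coarse separators are tried first, paragraphs and lines stay intact whenever they fit, and words are only broken apart when nothing else works.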

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=20,
    length_function=len,
)

In [None]:
text_splitter.split_text(docs)

['The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine',
 'with nine subspecies. Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns. Seven other',
 'Seven other extinct species of Giraffa are known from the fossil record.',
 "The giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns.",
 'It is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east.',
 'Giraffes usually inhabit savannahs an

**If we have multiple documents and want to keep both the text and its metadata, it is advisable to use `create_documents` instead of `split_text` and pass the documents to it as a list.**
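Why metadata matters is easier to see with a small sketch. The `Chunk` dataclass and `create_chunks` function below are hypothetical stand-ins written for this illustration (they are not LangChain classes); the point is only that every chunk carries a copy of its source document's metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def create_chunks(texts, metadatas, chunk_size=40):
    """Cut each text into fixed-size pieces and copy its metadata onto every piece."""
    chunks = []
    for text, meta in zip(texts, metadatas):
        for start in range(0, len(text), chunk_size):
            piece = text[start:start + chunk_size]
            # dict(meta, ...) makes a copy, so chunks never share one metadata dict.
            chunks.append(Chunk(piece, dict(meta, start_index=start)))
    return chunks

texts = ["a" * 50, "b" * 10]
metadatas = [{"source": "a.txt"}, {"source": "b.txt"}]
for chunk in create_chunks(texts, metadatas):
    print(len(chunk.page_content), chunk.metadata)
```

With the source recorded on each chunk, a retrieved passage can later be traced back to the file it came from, which plain strings from `split_text` cannot offer.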

**Sentence Splitting**

---

- It tries to preserve sentence structure; one basic method is to use '.' as the splitter.
- It can fail on complex sentence structures. In such cases we can use a sentence segmenter such as spaCy.
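A short example shows how splitting on '.' can go wrong. The sample sentence and the regular expression below are assumptions chosen for illustration; the regex is one possible heuristic, not what spaCy does internally.

```python
import re

text = "The giraffe is about 5.5 m tall. It eats acacia leaves."

# Naive approach: every '.' is treated as a sentence boundary, including the decimal point.
naive = [s.strip() for s in text.split(".") if s.strip()]
print(naive)
# ['The giraffe is about 5', '5 m tall', 'It eats acacia leaves']

# Heuristic: split only after ., ! or ? when followed by whitespace and an uppercase letter.
boundary = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
heuristic = boundary.split(text)
print(heuristic)
# ['The giraffe is about 5.5 m tall.', 'It eats acacia leaves.']
```

The heuristic handles the decimal point, but abbreviations followed by a capitalised word (for example "Dr. Smith") would still be split incorrectly, which is where a trained segmenter tends to help.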


In [None]:
docs.split('.') #using '.' as splitter

['The giraffe is a large African hoofed mammal belonging to the genus Giraffa',
 ' It is the tallest living terrestrial animal and the largest ruminant on Earth',
 ' Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies',
 ' Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns',
 ' Seven other extinct species of Giraffa are known from the fossil record',
 "\nThe giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns",
 ' \nIt is classified under the family Giraffidae, along with its closest extant relative, the okapi',
 ' Its scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east',
 ' \nGiraffes usually inhabit savannahs and woo

In [None]:
!pip install spacy -q

In [None]:
from langchain_text_splitters import SpacyTextSplitter  # same package as the other splitters above

text_splitter = SpacyTextSplitter(chunk_size=250, chunk_overlap=20)

In [None]:
text_splitter.split_text(docs)



['The giraffe is a large African hoofed mammal belonging to the genus Giraffa.\n\nIt is the tallest living terrestrial animal and the largest ruminant on Earth.',
 'Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine subspecies.',
 'Most recently, researchers proposed dividing them into four extant species due to new research into their mitochondrial and nuclear DNA, and individual species can be distinguished by their fur coat patterns.',
 "Seven other extinct species of Giraffa are known from the fossil record.\n\n\nThe giraffe's chief distinguishing characteristics are its extremely long neck and legs, its horn-like ossicones, and its spotted coat patterns.",
 'It is classified under the family Giraffidae, along with its closest extant relative, the okapi.\n\nIts scattered range extends from Chad in the north to South Africa in the south, and from Niger in the west to Somalia in the east.',
 'Giraffes usually inhabit savannahs and woodlands

**Document based chunking**

---

- For formatted documents such as Markdown or HTML, we can use specialized chunking that follows the document's own structure.

- LangChain provides `MarkdownHeaderTextSplitter` and `HTMLHeaderTextSplitter` for this.

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/HTML_header_metadata/
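The general idea of header-based chunking can be sketched without any library. The function below is a hypothetical illustration, not the LangChain splitter: it groups lines under their nearest Markdown heading and stores the heading path as metadata. It does not handle `#` characters inside fenced code blocks.

```python
import re

def split_by_markdown_headers(markdown):
    """Group lines under their nearest heading and record the heading path as metadata."""
    sections, path, lines = [], {}, []

    def flush():
        # Emit the text collected so far under the current heading path.
        body = "\n".join(lines).strip()
        if body:
            sections.append({"content": body, "metadata": dict(path)})
        lines.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then record this one.
            for key in [k for k in path if k >= level]:
                del path[key]
            path[level] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return sections

sample = "# Giraffe\nIntro text.\n## Diet\nAcacia leaves.\n## Range\nChad to South Africa."
for section in split_by_markdown_headers(sample):
    print(section)
```

Each chunk then knows which section it belongs to, so a retrieved passage about "Diet" can be shown together with its heading path.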







### Embedding

🛠️ LangChain gives access to many embedding models, including OpenAI embeddings. Here we use an open-source Sentence Transformers model instead.
https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/

In [None]:
!pip install sentence_transformers -q


In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
# Select a Sentence-BERT (sentence-transformers) model; OpenAIEmbeddings or any other embedding model would also work
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


In [None]:
doc_chunked = text_splitter.split_text(docs)

In [None]:
embeds = embeddings.embed_documents(doc_chunked)


In [None]:
len(doc_chunked)

8

In [None]:
len(embeds)

8

In [None]:
len(embeds[0])

384
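Each chunk is now a vector of the same length, and semantic search compares such vectors. A common measure is cosine similarity; the sketch below is a plain-Python version written for illustration (the function name is our own, and real libraries use optimized numeric code).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

With the objects above, `cosine_similarity(embeds[0], embeds[1])` would score how close the first two chunks are in meaning, assuming `embeds` holds the lists of floats produced earlier.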

### Indexing

**Vector DB**

---
- There are many vector databases available; popular options include Chroma, Pinecone, and Elasticsearch.

- Pinecone is a popular commercial vector database that can also store vectors in the cloud.
- Here we will use an open-source database: **chromadb**

🛠️ LangChain currently offers integrations for Chroma, FAISS, Pinecone, and LanceDB, among others.
https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/
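Conceptually, a vector database stores each embedding next to its text and metadata, and answers "which stored vectors are closest to this query vector?". The class below is a hypothetical in-memory sketch of that contract, not how Chroma is implemented: it scans every row, whereas production systems typically rely on approximate nearest-neighbour indexes to stay fast at scale.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory index: unit-normalises vectors on insert, brute-force search."""

    def __init__(self):
        self._rows = []  # each row: (unit_vector, text, metadata)

    @staticmethod
    def _normalise(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, text, metadata=None):
        self._rows.append((self._normalise(vector), text, metadata or {}))

    def search(self, query_vector, k=2):
        """Return the k most similar entries as (score, text, metadata), best first."""
        q = self._normalise(query_vector)
        # On unit vectors the dot product equals the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, vec)), text, meta)
                  for vec, text, meta in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
store.add([1.0, 0.0], "genus", {"source": "sample_doc.txt"})
store.add([0.0, 1.0], "diet")
store.add([1.0, 1.0], "mixed")
print([text for _, text, _ in store.search([1.0, 0.1], k=2)])
# ['genus', 'mixed']
```

Normalising at insert time is a design choice: it moves the cost of computing vector lengths out of the query path, so each search is only a series of dot products.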

In [None]:
!pip install langchain-chroma -q


In [None]:
from langchain_chroma import Chroma

**Note**: LangChain's Chroma integration expects a list of `Document` objects (page content plus metadata), such as those returned by a loader like `TextLoader`. Hence we chunk the loaded document with `.split_documents`, which keeps the metadata, and pass the result to `Chroma.from_documents`.

In [None]:
document_chunked = text_splitter.split_documents(document)

In [None]:
db = Chroma.from_documents(document_chunked, embeddings)

## Retrieval

The database is ready: we can run a query against it and it returns the top-k most similar chunks.

In [None]:
# Search with a query
query = 'What is genus of giraffe?'
results = db.similarity_search(query, k=2)  # separate name so the `docs` string above is not overwritten
print(results[0].page_content)

The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes have been thought of as one species, Giraffa camelopardalis, with nine


In [None]:
retriever = db.as_retriever()

## Response

🛠️ LangChain supports API calls to a large number of LLMs. A comprehensive list can be found here: https://python.langchain.com/v0.2/docs/integrations/llms/

Here we will use Google Generative AI (Gemini) through `langchain-google-genai`.

In [None]:
!pip install --upgrade --quiet  langchain-google-genai pillow


In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

In [None]:
# Gemini API key: if you do not have a working key, generate one in Google AI Studio (formerly MakerSuite)

In [None]:
import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Provide API KEY")



In [None]:
llm = ChatGoogleGenerativeAI(model="gemini-pro") #calling gemini pro llm


In [None]:
%pip install langchainhub --quiet # langchain hub has many prompts saved, we will pull one of them

In [None]:
from langchain import hub  # LangChain Hub client, needed for hub.pull

prompt = hub.pull("rlm/rag-prompt")

In [None]:
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]
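The pulled prompt is essentially a template with two slots, `question` and `context`. A plain-Python sketch of the same idea is shown below; the template text is adapted from the output above, while `RAG_TEMPLATE` and `build_prompt` are names introduced here for illustration only.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know.\n"
    "Question: {question}\nContext: {context}\nAnswer:"
)

def build_prompt(question, context_chunks):
    """Join the retrieved chunks and fill both slots of the template."""
    context = "\n\n".join(context_chunks)
    return RAG_TEMPLATE.format(question=question, context=context)

print(build_prompt(
    "What do giraffes eat?",
    ["Their food source is leaves, fruits, and flowers of woody plants."],
))
```

Keeping the instruction "say that you don't know" in the template is what nudges the model to stay within the retrieved context instead of answering from memory.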

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    # Format the documents for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)  # chain: retrieve context, fill the prompt, call the LLM, parse the reply to a string
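The `|` syntax may look unusual, so here is a tiny sketch of how such chaining can be built with Python's `__or__` operator. The `Step` class, `fake_retriever`, and `fake_llm` are all hypothetical stand-ins for this illustration; LangChain's runnables are considerably richer than this.

```python
class Step:
    """Hypothetical minimal runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

def fake_retriever(question):
    """Stand-in retriever that always returns the same chunk."""
    return ["Giraffes browse acacia leaves."]

def fake_llm(prompt):
    """Stand-in LLM that only reports the prompt length."""
    return f"(answer based on {len(prompt)} prompt characters)"

chain = (
    Step(lambda q: {"context": "\n\n".join(fake_retriever(q)), "question": q})
    | Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
    | Step(fake_llm)
)
print(chain.invoke("What do giraffes eat?"))
```

The first step plays the role of the dictionary in `rag_chain`: it fans the incoming question out into a `context` value and a `question` value before the prompt is filled.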

In [None]:
question = "What is the food source of the giraffe?"
response = rag_chain.invoke(question)
response

'Giraffes primarily feed on leaves, fruits, and flowers of woody plants, particularly acacia species. They browse at heights that most other herbivores cannot reach, giving them a competitive advantage in their habitat.'

## References

----
1. https://www.langchain.com/
2. https://github.com/FullStackRetrieval-com
