# LangChain RAG
LangChain is a framework for building applications powered by large language models (LLMs) such as OpenAI's GPT, Anthropic's Claude, or open-source models. It simplifies the creation of complex AI workflows by combining LLMs with:

* external data (files, databases, APIs),

* tools (calculators, search),

* memory (conversation history),

* and control flow (chains, agents, etc.).



## Environment Setup

In [1]:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()



True

## Set Up the OpenAI API

In [2]:
client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

## Document Loaders
Document loaders handle data ingestion. They take raw data from different sources and convert it into a structured format called a “Document”. Each Document holds the content itself (`page_content`) as well as associated metadata such as the source. Let’s look at the different types of document loaders.

### CSV Data Loader

In [7]:
from langchain_community.document_loaders import CSVLoader

file_path = "/Users/danilofornari/Downloads/rag_poc/data/titanic.csv"

loader = CSVLoader(file_path=file_path)
data = loader.load()
print(type(data))

<class 'list'>


In [6]:
print(data[0])

page_content='PassengerId: 1
Survived: 0
Pclass: 3
Name: Braund, Mr. Owen Harris
Sex: male
Age: 22
SibSp: 1
Parch: 0
Ticket: A/5 21171
Fare: 7.25
Cabin: 
Embarked: S' metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/titanic.csv', 'row': 0}


In [8]:
print(data[0].page_content)

PassengerId: 1
Survived: 0
Pclass: 3
Name: Braund, Mr. Owen Harris
Sex: male
Age: 22
SibSp: 1
Parch: 0
Ticket: A/5 21171
Fare: 7.25
Cabin: 
Embarked: S
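
The output above suggests what the loader does for each row: it joins the columns as `column: value` lines and records the source and row index as metadata. A minimal sketch of that idea using only the standard library follows. The `Document` dataclass and `load_csv` function here are hypothetical stand-ins written for illustration, not LangChain's implementation.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv(text: str, source: str) -> list[Document]:
    """Turn each CSV row into one Document, formatted as 'column: value' lines."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(Document(content, {"source": source, "row": row_index}))
    return documents


# A tiny in-memory CSV so the sketch runs without the Titanic file
sample = 'PassengerId,Survived,Name\n1,0,"Braund, Mr. Owen Harris"\n'
docs_sketch = load_csv(sample, source="titanic.csv")
print(docs_sketch[0].page_content)
print(docs_sketch[0].metadata)
```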


### HTML Data Loader

In [12]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader('/Users/danilofornari/Downloads/rag_poc/data/harry_potter_example.html')
data = loader.load()
print(data[0])

page_content='Harry Potter and the Sorcerer's Stone

CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about th

### Wikipedia Data Loader

In [14]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query='Harry Potter', load_max_docs=1)
data = loader.load()

print(data[0])

page_content='Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends, Ron Weasley and Hermione Granger, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's conflict with Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles (non-magical people).
The series was originally published in English by Bloomsbury in the United Kingdom and Scholastic Press in the United States. A series of many genres, including fantasy, drama, coming-of-age fiction, and the British school story (which includes elements of mystery, thriller, adventure, horror, and romance), the world of Harry Potter explores numerous themes and includes many cultural meanings and references. Major themes in the series include prejudice, corruption,

In [15]:
print(data[0].metadata)

{'title': 'Harry Potter', 'summary': 'Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends, Ron Weasley and Hermione Granger, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry\'s conflict with Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles (non-magical people).\nThe series was originally published in English by Bloomsbury in the United Kingdom and Scholastic Press in the United States. A series of many genres, including fantasy, drama, coming-of-age fiction, and the British school story (which includes elements of mystery, thriller, adventure, horror, and romance), the world of Harry Potter explores numerous themes and includes many cultural meanings and references. Major themes in the series incl

### PDF Data Loader

In [17]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('/Users/danilofornari/Downloads/rag_poc/data/attention_is_all_you_need.pdf')

data = loader.load()

print(data[0])

page_content='Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions

### TXT Data Loader

In [28]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt')
data = loader.load()

print(data[0].page_content)

Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
neighbors. The Dursleys had a small son called Dudley and in their
opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and
their greatest fear was that somebody would discover it. They didn't
think they could bear it if anyone found out about the Potters. Mr

### Creating a bot that can answer questions based on TXT articles


In [22]:
from langchain_openai import ChatOpenAI
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
from langchain_core.prompts import HumanMessagePromptTemplate, ChatPromptTemplate


chat = ChatOpenAI()

# Enable response caching
# If you send the same prompt twice, LangChain will reuse the cached response instead of calling the API again
set_llm_cache(InMemoryCache())
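
To make the caching behavior concrete, here is a minimal sketch of an exact-match, in-memory cache around a prompt-to-response function. `make_cached` and `fake_llm` are hypothetical names used only for illustration, and the cache here is keyed on the prompt text alone; a real cache would typically also take the model settings into account.

```python
from typing import Callable


def make_cached(llm_call: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a prompt -> response function with an in-memory, exact-match cache."""
    cache: dict[str, str] = {}

    def cached_call(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = llm_call(prompt)  # only reached on a cache miss
        return cache[prompt]

    return cached_call


calls: list[str] = []  # records every prompt that actually reaches the "model"


def fake_llm(prompt: str) -> str:
    """Hypothetical model stub standing in for a paid API call."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cached_llm = make_cached(fake_llm)
first = cached_llm("Who lives in number four, Privet Drive?")
second = cached_llm("Who lives in number four, Privet Drive?")
print(len(calls))  # the stub was only called for the first request
```

Because the lookup is an exact string match, even a small change to the prompt results in a new call.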

In [31]:
def bot(file_path, question):
    chat = ChatOpenAI()

    # Load the file and keep only the first 1,000 characters so the prompt stays short
    loader = TextLoader(file_path)
    document = loader.load()[0]
    page_content = document.page_content[:1000]
    metadata = document.metadata

    human_template = "Read about the article {metadata} having content {page_content} and answer the {question}"
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
    chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])

    # Fill in the template and send the resulting messages to the model
    prompt = chat_prompt.format_prompt(metadata=metadata, page_content=page_content, question=question)
    response = chat.invoke(prompt.to_messages())

    return response.content

In [38]:
bot("/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt", "Who lives in number four, Privet Drive?")

'Mr. and Mrs. Dursley, along with their son Dudley, live in number four, Privet Drive.'

## Text Splitters 
Once the documents are loaded, we usually need to split them into smaller chunks that fit into the model’s context window. This is where LangChain’s text splitters come into the picture.

Splitting is harder than it sounds. Ideally, semantically related pieces of text, meaning text with similar contextual meaning, stay together in the same chunk.

Let’s look at the different types of text splitters in LangChain.

#### Splitting by character

In [33]:
filepath = "/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt"
with open(filepath,'r') as f:
    hp_book = f.read()
    
print("Number of characters in the document: ", len(hp_book))
print("Number of words in the document: ", len(hp_book.split()))
print("Number of lines in the document: ", len(hp_book.split("\n")))

Number of characters in the document:  439742
Number of words in the document:  78451
Number of lines in the document:  10703


In [34]:
from collections import Counter
line_len_list = []

for line in hp_book.split("\n"):
    curr_line_len = len(line)
    line_len_list.append(curr_line_len)
    
print(len(line_len_list))
    
Counter(line_len_list) 

10703


Counter({0: 3057,
         71: 881,
         72: 864,
         70: 830,
         69: 710,
         68: 562,
         67: 385,
         66: 285,
         65: 170,
         64: 135,
         63: 89,
         32: 63,
         26: 63,
         9: 62,
         62: 61,
         22: 60,
         31: 59,
         37: 57,
         40: 56,
         35: 56,
         43: 56,
         7: 54,
         36: 53,
         6: 52,
         19: 52,
         34: 52,
         8: 52,
         53: 51,
         27: 50,
         42: 50,
         46: 50,
         13: 50,
         52: 50,
         23: 49,
         21: 48,
         24: 48,
         47: 47,
         18: 47,
         61: 47,
         33: 46,
         16: 46,
         45: 45,
         60: 45,
         41: 45,
         29: 45,
         48: 45,
         17: 44,
         30: 44,
         38: 44,
         28: 44,
         44: 43,
         5: 43,
         10: 42,
         20: 42,
         51: 41,
         54: 41,
         25: 41,
         14: 40,
         

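`CharacterTextSplitter` splits the text on a single separator and then merges the resulting pieces into chunks of at most `chunk_size` characters. A piece that contains no separator is kept whole, even if it is longer than `chunk_size`.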


In [39]:
from langchain.text_splitter import CharacterTextSplitter

def len_func(text):
    return len(text)

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size = 1200,
    chunk_overlap = 100,
    length_function = len_func,
    is_separator_regex= False
    
)

split_text = text_splitter.create_documents(texts = [hp_book])
print(len(split_text))
print(60*"-")
print(split_text[0].page_content)
print(60*"-")
print(split_text)

419
------------------------------------------------------------
Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
neighbors. The Dursleys had a small son called Dudley and in their
opinion there was no finer boy anywhere.
------------------------------------------------------------


In [40]:
first_chunk = split_text[0]

first_chunk.metadata = {"source":filepath}
first_chunk.metadata

{'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}
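
The split-then-merge behavior of the character splitter used above can be approximated in a few lines of plain Python. This is a simplified sketch, not LangChain's code: `split_by_separator` is a hypothetical helper, and it leaves out `chunk_overlap` to keep the logic easy to follow.

```python
def split_by_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # still fits, or it is the first piece of a new chunk
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "Title\n\nFirst paragraph.\n\nSecond paragraph is longer.\n\nEnd."
chunks_sketch = split_by_separator(sample_text, separator="\n\n", chunk_size=25)
for chunk in chunks_sketch:
    print(repr(chunk))
```

Note that the third paragraph is longer than `chunk_size` but is still emitted as one chunk, because this splitter only ever cuts at the separator.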

#### Recursive Character Splitter
The recursive splitter takes a list of separators and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""], so it splits on paragraphs first, then lines, then words, and finally individual characters.
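
The fallback order can be sketched as a short recursive function. This is an illustrative simplification under stated assumptions: `recursive_split` is a hypothetical helper, and unlike the library splitter it does not merge small neighboring pieces or add overlap.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse with the next one on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into individual characters"
    pieces = text.split(separator) if separator else list(text)
    chunks: list[str] = []
    for piece in pieces:
        if not piece:
            continue
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


sample_text = "Para one.\n\nLine A of para two\nLine B of para two"
pieces_sketch = recursive_split(sample_text, ["\n\n", "\n", " "], chunk_size=20)
print(pieces_sketch)
```

The first paragraph already fits, so it is never touched by the finer separators; only the oversized second paragraph is split again at `"\n"`.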


In [41]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n","\n", " "],
    chunk_size = 200,
    chunk_overlap = 100,
    length_function = len_func,
    is_separator_regex=False
)

# The split first happens at "\n\n". If a chunk is still larger than chunk_size, it is split at "\n", and if it is still too large, at " ".

chunk_list = text_splitter.create_documents(texts = [hp_book])

chunk_list

[Document(metadata={}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED"),
 Document(metadata={}, page_content='Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last'),
 Document(metadata={}, page_content="that they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense."),
 Document(metadata={}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did'),
 Document(metadata={}, page_content='drills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had'),
 Document(metadata={}, page_content='have a very large mustache. Mrs. Dursley was thin and blonde and had\

#### Split by tokens
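tiktoken is a Python library developed by OpenAI that tokenizes text locally, so you can count the tokens in a string without making an API call. With a tiktoken-based splitter, `chunk_size` and `chunk_overlap` are measured in tokens rather than characters.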
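
The key idea is that the splitter's length function counts tokens instead of characters. The sketch below illustrates the difference with a deliberately crude stand-in tokenizer; `approx_token_count` is a hypothetical helper based on a regular expression, not tiktoken's byte-pair encoding, so its counts will differ from the real ones.

```python
import re


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: count words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


sentence = "The scar had not pained Harry for nineteen years. All was well."
print("characters:", len(sentence))
print("approximate tokens:", approx_token_count(sentence))
```

A function with this shape (string in, integer out) is what a splitter's `length_function` parameter expects, which is how the same chunking logic can budget by tokens.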

In [42]:
from langchain.text_splitter import CharacterTextSplitter

# model_name selects the tiktoken encoding used to count tokens.
# When model_name is given, encoding_name is not needed.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator = "\n\n",
    chunk_size = 1200,
    chunk_overlap = 100,
    is_separator_regex = False,
    model_name='text-embedding-3-small'
)


doc_list = text_splitter.create_documents([hp_book])
doc_list

[Document(metadata={}, page_content='Harry Potter and the Sorcerer\'s Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you\'d expect to be involved in anything strange or mysterious,\nbecause they just didn\'t hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, and\ntheir greatest fear was that somebody would discover it. They didn\'t\nthi

In [43]:
# split_text returns a list of strings, whereas create_documents returns a list of Document objects

line_list = text_splitter.split_text(hp_book)

line_list

['Harry Potter and the Sorcerer\'s Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you\'d expect to be involved in anything strange or mysterious,\nbecause they just didn\'t hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, and\ntheir greatest fear was that somebody would discover it. They didn\'t\nthink they could bear it if anyone fou

#### Code Splitting

In [44]:
python_code = """def peer_review(article_id):
    chat = ChatOpenAI()
    loader = ArxivLoader(query=article_id, load_max_docs=2)
    data = loader.load()
    first_record = data[0]
    page_content = first_record.page_content
    title = first_record.metadata['Title']
    summary = first_record.metadata['Summary']

    summary_list = []
    for record in data:
        summary_list.append(record.metadata['Summary'])
    full_summary = "\n\n".join(summary_list)

    system_template = "You are a Peer Reviewer"
    human_template = "Read the paper with the title: '{title}'\n\nAnd Content: {content} and critically list down all the issues in the paper"

    systemp_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

    chat_prompt = ChatPromptTemplate.from_messages([systemp_message_prompt, human_message_prompt])
    prompt = chat_prompt.format_prompt(title=title, content=page_content)

    response = chat(messages = prompt.to_messages())

    return response.content"""
    

In [45]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size = 50,
    chunk_overlap = 10
)

text_splitter.create_documents(texts = [python_code])

[Document(metadata={}, page_content='def peer_review(article_id):'),
 Document(metadata={}, page_content='chat = ChatOpenAI()'),
 Document(metadata={}, page_content='loader = ArxivLoader(query=article_id,'),
 Document(metadata={}, page_content='load_max_docs=2)'),
 Document(metadata={}, page_content='data = loader.load()'),
 Document(metadata={}, page_content='first_record = data[0]'),
 Document(metadata={}, page_content='page_content = first_record.page_content'),
 Document(metadata={}, page_content="title = first_record.metadata['Title']"),
 Document(metadata={}, page_content="summary = first_record.metadata['Summary']"),
 Document(metadata={}, page_content='summary_list = []\n    for record in data:'),
 Document(metadata={}, page_content="summary_list.append(record.metadata['Summary'])"),
 Document(metadata={}, page_content='full_summary = "'),
 Document(metadata={}, page_content='".join(summary_list)'),
 Document(metadata={}, page_content='system_template = "You are a Peer Reviewer

## Embeddings
Embeddings are vector representations of text. An embedding function maps each document chunk to a vector, and the vectors are stored alongside their corresponding text in the vector database.
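
To make "vector representation" concrete, here is a toy sketch that maps text to a fixed-length vector and compares two vectors with cosine similarity. `toy_embed` is a hypothetical word-hashing scheme invented for this illustration; real embedding models are learned and capture meaning far better than word overlap does.

```python
import math
import zlib


def toy_embed(text: str, dimensions: int = 16) -> list[float]:
    """Hypothetical embedding: hash each word into one of `dimensions` buckets."""
    vector = [0.0] * dimensions
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vector[zlib.crc32(word.encode("utf-8")) % dimensions] += 1.0
    return vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vector_a = toy_embed("the boy who lived")
vector_b = toy_embed("the boy who lived under the stairs")
print(len(vector_a))
print(cosine_similarity(vector_a, vector_b))
```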



### OpenAI Embedding

In [46]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()



In [48]:
text = "The scar had not pained Harry for nineteen years. All was well."
embedded_text = embeddings.embed_query(text)
print(embedded_text[:5])
print(60*"-")
print(len(embedded_text))

[-0.005937635940751284, -0.0064883171609572115, 0.03387011718571223, -0.021146146934979498, -0.008623015451815258]
------------------------------------------------------------
1536


The embed_query() method makes an actual API call to OpenAI's embedding service.

**Cost Breakdown**

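For OpenAI's text-embedding-3-small (note that `OpenAIEmbeddings()` uses `text-embedding-ada-002` unless a `model` argument is passed):
* Price: $0.00002 per 1K tokens
* Example text: "The scar had not pained Harry for nineteen years. All was well." (≈ 15 tokens)
* Cost: ≈ $0.0000003 (negligible for testing, but it adds up over a large corpus)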
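
The arithmetic behind that estimate is simply tokens divided by 1,000, multiplied by the price per 1K tokens. A small sketch, where `embedding_cost_usd` is a hypothetical helper and the price is the figure quoted above (check current pricing before relying on it):

```python
def embedding_cost_usd(token_count: int, price_per_1k_tokens: float) -> float:
    """Estimated cost in USD for embedding `token_count` tokens."""
    return token_count / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.00002  # assumed price for text-embedding-3-small, taken from the text above

single_sentence_cost = embedding_cost_usd(15, PRICE_PER_1K)
million_token_cost = embedding_cost_usd(1_000_000, PRICE_PER_1K)
print(f"${single_sentence_cost:.7f}")
print(f"${million_token_cost:.2f}")
```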

In [50]:

to_embed = line_list[:5]
# embed_documents sends the texts as a batch instead of making one call per text
embedded_docs = embeddings.embed_documents(to_embed)

print(len(embedded_docs))
print(60*"-")
print(embedded_docs[0])
print(60*"-")
print(embedded_docs[1])
print(60*"-")
print(embedded_docs[2])
print(60*"-")
print(embedded_docs[3])

5
------------------------------------------------------------
[0.023872565033263963, -0.01776624251580446, -0.014910483454989394, -0.02460953547398626, -0.00376381135983063, 0.02734685265725898, 0.004540261727908458, -0.03203187823214568, -0.01867429504765066, -0.03284780619927264, 0.007369700681932744, 0.01334442173411579, -0.0018671004452225216, 0.008330393427360507, 0.013976110816352411, 0.013554984451087144, 0.024912219030386623, -0.0037045904210594566, 0.027952221632704822, -0.02800486184628638, -0.020069272814755245, 0.012929876094040137, -0.002204330131316, -0.01317991943685894, 0.012673252957354279, -0.005629266507340428, 0.019779748846088994, -0.027320531619145643, 0.0281101422734495, -0.008455415098769908, 0.002502079075146711, -0.010508402986224448, -0.0005601301568498204, -0.022714469158598958, -0.03884885015221414, -0.023372478347626357, 0.00021775984672664002, 2.0138673414667504e-05, 0.018463732330679307, -0.02259602728105661, 0.024056808574767094, 0.0014599574224932654,

In [52]:
import numpy as np 

np.array(embedded_docs).shape

(5, 1536)

1536 is the number of dimensions of each vector.

### BGE Embeddings (BAAI General Embedding)

BGE models are published by BAAI (Beijing Academy of Artificial Intelligence). They run entirely on your machine, so there is no per-call API cost.

**What Happens Under the Hood**

* First time: Downloads the BAAI/bge-base-en-v1.5 model from Hugging Face Hub to your local machine (one-time download)
* Every embedding: Runs the model locally on your CPU/GPU
* No network calls: After initial download, everything is offline
* No costs: Zero API calls = zero credits consumed


In [54]:
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {"device":'cpu'}
encode_kwargs = {'normalize_embeddings':True}

hf = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)


In [55]:
baai_embedded_docs = [hf.embed_query(text) for text in to_embed]

np.array(baai_embedded_docs).shape

(5, 768)
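
The `normalize_embeddings` option in `encode_kwargs` above scales each vector to unit length. The practical consequence is that the dot product of two normalized vectors equals their cosine similarity, which keeps similarity search simple. A minimal sketch with made-up two-dimensional vectors (`l2_normalize` is a hypothetical helper):

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


unit_a = l2_normalize([3.0, 4.0])
unit_b = l2_normalize([4.0, 3.0])

# For unit vectors, the plain dot product is already the cosine similarity
dot_product = sum(x * y for x, y in zip(unit_a, unit_b))
print(unit_a, unit_b, dot_product)
```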

## Vector Stores
Vector stores are essentially databases that are designed for a specific kind of data format — vector embeddings. They specialize in storing these embeddings very efficiently. Vector stores enable fast retrieval based on similarity searches.
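
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors are closest to a query vector. The sketch below does this by brute force over hand-written two-dimensional vectors. `TinyVectorStore` is a hypothetical class for illustration only; real stores such as FAISS or Chroma use specialized indexes rather than scanning every entry.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans them all."""
    vectors: list[list[float]] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def similarity_search_by_vector(self, query: list[float], k: int = 2) -> list[str]:
        # Rank every stored entry by similarity to the query and keep the top k
        ranked = sorted(
            range(len(self.texts)),
            key=lambda i: cosine(query, self.vectors[i]),
            reverse=True,
        )
        return [self.texts[i] for i in ranked[:k]]


store = TinyVectorStore()
store.add([1.0, 0.0], "about drills")
store.add([0.0, 1.0], "about owls")
store.add([0.9, 0.1], "about Grunnings")
top_matches = store.similarity_search_by_vector([1.0, 0.0], k=2)
print(top_matches)
```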

### Document Loading

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt')
hp_loader = loader.load()

hp_loader



### Text Split

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, chunk_overlap = 100)
docs = text_splitter.split_documents(documents=hp_loader)
print(len(docs))


1570


In [3]:
docs

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense."),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/

### Defining the embedding function


In [5]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings


model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings':True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)


### Define Query

In [6]:
query = "What is the role and firm of Mr. Dursley?"

### Vector DB - FAISS

FAISS (Facebook AI Similarity Search) is a similarity-search library. The index built here lives in memory.

In [7]:
from langchain_community.vectorstores import FAISS

vector_db = FAISS.from_documents(
    docs,
    embedding_function
)

#### Querying the vector database using plain text


In [8]:
matched_docs = vector_db.similarity_search(query = query, k = 5)
matched_docs

[Document(id='8a4b17e8-f93c-4b39-aac5-cceea170f891', metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(id='07e2acfc-5338-435e-817e-8933bb49afd3', metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecau

#### Querying the vector database using a vector


In [10]:
embedding_query = embedding_function.embed_query(query)
matched_docs = vector_db.similarity_search_by_vector(embedding_query, k = 5)
matched_docs

[Document(id='8a4b17e8-f93c-4b39-aac5-cceea170f891', metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(id='07e2acfc-5338-435e-817e-8933bb49afd3', metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecau

### ChromaDB

Chroma can run purely in memory or persist the collection to disk. Here it is persisted to `output/hp_vector_db`.

In [None]:
from langchain_community.vectorstores import Chroma

chroma_db = Chroma.from_documents(docs, embedding_function, persist_directory="output/hp_vector_db")



#### Loading ChromaDB

In [13]:
loaded_chorma_db = Chroma(persist_directory = "output/hp_vector_db", embedding_function = embedding_function)


#### Querying the DB

In [14]:
matched_docs = loaded_chorma_db.similarity_search(query = query, k = 5)
matched_docs

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense."),
 Document(metadata={'source': '/Users/

## Retrievers

A retriever is an interface that returns documents given an unstructured query. Unlike a vector store, it does not have to store documents; it only has to return them. A retriever accepts a string query as input and returns a list of Documents as output.
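
The contract is small enough to sketch directly: a string goes in, a list of documents comes out, and where the documents live is up to the search function behind it. `CallableRetriever` and `keyword_search` below are hypothetical names written for this illustration, with plain strings standing in for Document objects.

```python
from typing import Callable


class CallableRetriever:
    """Hypothetical retriever: wraps any search function and stores no documents itself."""

    def __init__(self, search: Callable[[str], list[str]], k: int = 2) -> None:
        self.search = search
        self.k = k

    def invoke(self, query: str) -> list[str]:
        # String in, list of documents out: that is the whole contract
        return self.search(query)[: self.k]


corpus = [
    "Mr. Dursley was the director of a firm called Grunnings",
    "Harry lived in a cupboard under the stairs",
    "The firm made drills",
]


def keyword_search(query: str) -> list[str]:
    """Rank corpus entries by how many query words they share (illustrative only)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(text.lower().split())), text) for text in corpus]
    ranked = sorted((item for item in scored if item[0] > 0), key=lambda item: item[0], reverse=True)
    return [text for _, text in ranked]


retriever_sketch = CallableRetriever(keyword_search, k=1)
print(retriever_sketch.invoke("Dursley firm"))
```

Swapping `keyword_search` for a vector-store lookup would change how documents are found without changing how the retriever is called.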

### Making a retriever from vector store


In [53]:
chroma_retriever = loaded_chorma_db.as_retriever()
chroma_retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceBgeEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x320ceb4d0>, search_kwargs={})

### Querying the retriever



In [16]:
matched_docs = chroma_retriever.invoke(query)
matched_docs


[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense."),
 Document(metadata={'source': '/Users/

When creating the retriever, we can also specify how it should search (for example, similarity search or Maximum Marginal Relevance search) and, through `search_kwargs`, how many documents to retrieve.

MMR (Maximum Marginal Relevance) balances relevance to the query against diversity among the returned documents.

In [17]:
chroma_retriever = loaded_chorma_db.as_retriever(search_type='mmr', search_kwargs={"k": 1})

matched_docs = chroma_retriever.get_relevant_documents(query=query)

matched_docs

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the')]
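The trade-off MMR makes can be sketched without any library. The code below is an illustrative implementation under stated assumptions, not the code Chroma or LangChain runs: the vectors are made up, and `lambda_mult=0.3` is chosen here so that the diversity term dominates. Each step picks the candidate that maximises `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],  # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of document 0
    [0.6, 0.80],  # 2: less relevant, but different
]
picked = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
top_by_similarity = sorted(
    range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
)[:2]
print("MMR:", picked, "plain similarity:", top_by_similarity)
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the document that adds new information.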

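Similarity score threshold: only documents whose relevance score reaches `score_threshold` are returned. In the run below no document reaches 0.5, so the result is empty.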

In [44]:
chroma_retriever = loaded_chorma_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5, "k":2})

matched_docs = chroma_retriever.get_relevant_documents(query=query)

matched_docs

No relevant docs were retrieved using the relevance score threshold 0.5


[]

BM25 Retriever (similar to keyword search)

In [20]:
from langchain.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(docs)

In [21]:
matched_docs = bm25_retriever.get_relevant_documents('Mr. Dursley role')
matched_docs

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='"Shoo!" said Mr. Dursley loudly. The cat didn\'t move. It just gave him a\nstern look. Was this normal cat behavior? Mr. Dursley wondered. Trying\nto pull himself together, he let himself into the house. He was still\ndetermined not to mention anything to his wife.'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story\nstarts, there was nothing about the cloudy sky outside to suggest that\nstrange and mysterious things would soon be happening all over the\ncountry. Mr. Dursley hummed as he picked out his most boring tie for\nwork, and Mrs. Dursley gossiped away happily as she wrestled a screaming\nDudley into his high chair.'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Pott
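To show why BM25 behaves like a keyword search, here is a small sketch of one common variant of the Okapi BM25 formula in plain Python. It is an illustration only: the tokenizer is a simple whitespace split, `k1` and `b` are conventional default values assumed here, and the exact formula used by the library behind `BM25Retriever` may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency is dampened and normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


texts = [
    "mr dursley was the director of a firm called grunnings",
    "the cat gave mr dursley a stern look",
    "harry lived in a cupboard under the stairs",
]
scores = bm25_scores("dursley director", texts)
ranking = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

Because scoring depends on exact term matches, a query word such as "role" that never appears in the text contributes nothing, which is consistent with the loosely related chunks returned above.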

### Multi Query Retriever

MultiQueryRetriever automates the process of prompt tuning. As the name suggests, it uses an LLM to generate multiple variants of the user's query. For each variant it retrieves a set of relevant documents, then takes the unique union across all variants to get a larger set of potentially relevant documents.
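The "generate variants, retrieve for each, take the union" flow can be sketched as follows. `fake_generate_queries`, `fake_retrieve`, and `FAKE_INDEX` are hypothetical stand-ins for the LLM and the vector store, so the example runs without any API key; only the de-duplicating union is the point.

```python
def multi_query_retrieve(query, generate_queries, retrieve):
    """Run retrieval for each query variant and return the de-duplicated union."""
    seen = set()
    union = []
    for variant in generate_queries(query):
        for doc in retrieve(variant):
            if doc not in seen:  # keep the first occurrence only
                seen.add(doc)
                union.append(doc)
    return union


# Hypothetical stand-in for the LLM that rewrites the user's question.
def fake_generate_queries(query):
    return ["Where does Mr. Dursley work?", "What is Mr. Dursley's job title?"]


# Hypothetical stand-in for the vector store: variant -> matching chunks.
FAKE_INDEX = {
    "Where does Mr. Dursley work?": ["chunk B", "chunk C"],
    "What is Mr. Dursley's job title?": ["chunk A", "chunk B", "chunk D"],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


docs_union = multi_query_retrieve(
    "What is the role and firm of Mr. Dursley?", fake_generate_queries, fake_retrieve
)
print(docs_union)
```

A chunk that matches several variants appears only once in the result, so the union grows with coverage rather than with repetition.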

In [27]:
from langchain_openai import ChatOpenAI
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache



chat = ChatOpenAI(temperature=0)

# Enable response caching
# If you send the same prompt twice, LangChain will reuse the cached response instead of calling the API again
set_llm_cache(InMemoryCache())

In [28]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [25]:
from langchain.retrievers.multi_query import MultiQueryRetriever

mq_retriever = MultiQueryRetriever.from_llm(retriever = loaded_chorma_db.as_retriever(), llm = chat)

In [29]:
query = "What is the role and firm of Mr. Dursley?"
retrieved_docs = mq_retriever.get_relevant_documents(query=query)
pretty_print_docs(retrieved_docs)

Document 1:

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
----------------------------------------------------------------------------------------------------
Document 2:

Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.
----------------------------------------------------------------------------------------------------
Document 3:

Dursley. The Potters knew very well what he and Petunia thought about
them an

### Contextual Compression

One major challenge with retrieval is that the information most relevant to the query may be buried in a document with a lot of irrelevant text. Passing the entire document to the LLM also makes calls more expensive and can lead to poorer results. This is where Contextual Compression comes into the picture.





#### Retriever -> Documents -> Document Compressor -> Result

#### LLMChainExtractor

It iterates over the initially returned documents and extracts from each one only the content relevant to the query.

In [42]:
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(chat)
for step in compressor.llm_chain.steps:
    print(step)
    print(60*"-")




input_variables=['context', 'question'] input_types={} output_parser=NoOutputParser() partial_variables={} template='Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return NO_OUTPUT. \n\nRemember, *DO NOT* edit the extracted parts of the context.\n\n> Question: {question}\n> Context:\n>>>\n{context}\n>>>\nExtracted relevant parts:'
------------------------------------------------------------
client=<openai.resources.chat.completions.completions.Completions object at 0x327520690> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x327520b90> root_client=<openai.OpenAI object at 0x327520410> root_async_client=<openai.AsyncOpenAI object at 0x3275207d0> temperature=0.0 model_kwargs={} openai_api_key=SecretStr('**********')
------------------------------------------------------------

------------------------------------------------------------


In [46]:
from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=mq_retriever)
compressed_docs = compression_retriever.get_relevant_documents(query = query)
pretty_print_docs(compressed_docs)

Document 1:

Mr. Dursley was the director of a firm called Grunnings
----------------------------------------------------------------------------------------------------
Document 2:

Mr. Dursley
----------------------------------------------------------------------------------------------------
Document 3:

Mr. Dursley always sat with his back to the window in his office on the ninth floor.
----------------------------------------------------------------------------------------------------
Document 4:

Mr. Dursley hummed as he picked out his most boring tie for work


#### LLMChainFilter

It uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.



In [47]:
from langchain.retrievers.document_compressors import LLMChainFilter
compressor = LLMChainFilter.from_llm(chat)

for step in compressor.llm_chain.steps:
    print(step)
    print(60*"-")


input_variables=['context', 'question'] input_types={} output_parser=BooleanOutputParser() partial_variables={} template="Given the following question and context, return YES if the context is relevant to the question and NO if it isn't.\n\n> Question: {question}\n> Context:\n>>>\n{context}\n>>>\n> Relevant (YES / NO):"
------------------------------------------------------------
client=<openai.resources.chat.completions.completions.Completions object at 0x327520690> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x327520b90> root_client=<openai.OpenAI object at 0x327520410> root_async_client=<openai.AsyncOpenAI object at 0x3275207d0> temperature=0.0 model_kwargs={} openai_api_key=SecretStr('**********')
------------------------------------------------------------

------------------------------------------------------------


In [48]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=mq_retriever)
compressed_docs = compression_retriever.get_relevant_documents(query = query)
pretty_print_docs(compressed_docs)

Document 1:

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the
----------------------------------------------------------------------------------------------------
Document 2:

Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.
----------------------------------------------------------------------------------------------------
Document 3:

Mr. Dursley always sat with his back to the window in his office on the
ninth

#### EmbeddingsFilter

Making an extra LLM call over the retrieved documents can be slow and expensive. The EmbeddingsFilter provides a cheaper and faster option. It embeds the documents and the query and returns only the documents whose embeddings are sufficiently similar to the query embedding.
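The filtering step amounts to a cosine-similarity cut-off. The sketch below assumes hand-written three-dimensional vectors in place of real embeddings, and the strict `>` comparison at the threshold is an assumption of this illustration rather than a statement about the library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(query_vec, docs_with_vecs, similarity_threshold=0.6):
    """Keep only the documents whose embedding is close enough to the query."""
    return [
        text
        for text, vec in docs_with_vecs
        if cosine(query_vec, vec) > similarity_threshold
    ]


# Made-up vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
candidates = [
    ("Mr. Dursley was the director of a firm called Grunnings", [0.9, 0.1, 0.0]),
    ("The cat gave him a stern look", [0.1, 0.9, 0.2]),
]
kept = embeddings_filter(query_vec, candidates, similarity_threshold=0.6)
print(kept)
```

No LLM is involved, so the cost is one embedding per document plus a dot product, which is why this option is cheaper than the LLM-based compressors above.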

In [56]:
from langchain.retrievers.document_compressors import EmbeddingsFilter
embeddings_filter  = EmbeddingsFilter(embeddings=embedding_function, similarity_threshold=0.6)

In [57]:
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=chroma_retriever)

compressed_docs = compression_retriever.get_relevant_documents(query = query)
pretty_print_docs(compressed_docs)

Document 1:

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly twice the usual amount of neck, which came in very useful as she
spent so much of her time craning over garden fences, spying on the


#### Parent Document Retriever
When splitting documents for retrieval, two goals conflict:

* You want small chunks, so that their embeddings accurately reflect their meaning. If a chunk is too long, its embedding can lose meaning.

* You want chunks long enough that the context of each one is retained.

The ParentDocumentRetriever balances the two by splitting and storing small chunks. At query time it first fetches the small chunks, looks up the parent IDs of those chunks, and then returns the larger parent documents.

#### Splits documents for retrieval -> fetches small chunks -> looks up the parent ids of those chunks -> returns larger documents
#### Parent Document -> the document that a small chunk originated from

In [58]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=100)
child_splitter = CharacterTextSplitter(separator="\n", chunk_size=200, chunk_overlap=50)

store = InMemoryStore() # parent documents

parent_retriever = ParentDocumentRetriever(vectorstore=chroma_db, docstore=store, child_splitter=child_splitter, parent_splitter=parent_splitter)

parent_retriever.add_documents(docs)

parent_retriever.get_relevant_documents(query=query)

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense.")]
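The child-to-parent lookup can be sketched with two plain data structures. Everything here is a simplification for illustration: `build_index` and `retrieve_parents` are hypothetical helpers, the splitter is a fixed-width slice, and a substring match stands in for the vector search over child chunks.

```python
def build_index(parents, child_size):
    """Return (docstore, child_index) linking every small chunk to its parent."""
    docstore = {}     # parent_id -> full parent text
    child_index = []  # (child_text, parent_id) pairs; stands in for the vector store
    for parent_id, parent in enumerate(parents):
        docstore[parent_id] = parent
        for start in range(0, len(parent), child_size):
            child_index.append((parent[start : start + child_size], parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index):
    """Match small chunks, then return their de-duplicated parent documents."""
    term = query.lower()
    parent_ids = []
    for child_text, parent_id in child_index:
        if term in child_text.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


parents = [
    "Mr. Dursley was the director of a firm called Grunnings, which made drills.",
    "Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck.",
]
docstore, child_index = build_index(parents, child_size=40)
found = retrieve_parents("Grunnings", docstore, child_index)
print(found)
```

The match happens on a 40-character chunk, but the caller receives the full parent text, which is the behaviour the retriever is designed to provide.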

## Hypothetical Document Embeddings (HyDE)

HyDE uses an LLM to generate a “fake” hypothetical document for a given user query. It then embeds that document and uses the embedding to look up real documents that are similar to the hypothetical one.

The underlying concept here is that the hypothetical document may be closer to the real documents in the embedding space than the query.
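A toy example can make this intuition tangible. The sketch below uses a bag-of-words "embedding" instead of a trained model, and the hypothetical document is hard-coded rather than generated by an LLM; both are assumptions made so the example runs offline. It also presumes the generated answer is factually close to the source text, which a real LLM does not guarantee.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real pipeline would use a trained model."""
    for mark in ".,?":
        text = text.replace(mark, "")
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


real_chunk = "Mr. Dursley was the director of a firm called Grunnings, which made drills."
query = "What is the role and firm of Mr. Dursley?"
# Hypothetical answer, hard-coded here; in HyDE an LLM would generate it.
hypothetical_doc = "Mr. Dursley is the director of a firm called Grunnings that makes drills."

query_sim = cosine(embed(query), embed(real_chunk))
hyde_sim = cosine(embed(hypothetical_doc), embed(real_chunk))
print(f"query vs chunk: {query_sim:.3f}, hypothetical doc vs chunk: {hyde_sim:.3f}")
```

The question shares few words with the passage, while an answer-shaped text shares many, so the hypothetical document lands closer to the real chunk.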

In [63]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma

In [64]:
loader = TextLoader('/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt')
hp_loader = loader.load()


text_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, chunk_overlap = 100)
docs = text_splitter.split_documents(documents=hp_loader)

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings':True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)

chroma_db = Chroma.from_documents(documents=docs, embedding=embedding_function, persist_directory="output/hp_vector_db")

chroma_retriever = chroma_db.as_retriever(search_kwargs = {"k":5})


In [65]:
from langchain.prompts.chat import SystemMessagePromptTemplate, ChatPromptTemplate

def get_hyde_doc(query):
    template = """Imagine you are an expert writing a detailed explanation on the topic: '{query}'
    Your response should be comprehensive and include all key points that would be found in the top search result."""

    system_message_prompt = SystemMessagePromptTemplate.from_template(template = template)
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])
    messages = chat_prompt.format_prompt(query = query).to_messages()
    response = chat(messages = messages)
    hyde_doc = response.content
    return hyde_doc

In [66]:
query = "What is the role and firm of Mr. Dursley?"
print(get_hyde_doc(query=query))



Mr. Dursley is a character in J.K. Rowling's Harry Potter series, specifically in the first book, "Harry Potter and the Sorcerer's Stone." He is portrayed as a narrow-minded, materialistic, and self-centered individual who works as a director of a drill-making firm called Grunnings. 

As the head of the firm, Mr. Dursley is depicted as being obsessed with his work and his reputation in the community. He is described as a man who values conformity and despises anything out of the ordinary. This is evident in his reaction to anything related to magic, which he views as strange and unacceptable.

Mr. Dursley's firm, Grunnings, is a symbol of his mundane and ordinary life. It represents his focus on material possessions and his lack of imagination or openness to the magical world that exists alongside the ordinary world of the Muggles (non-magical people).

Throughout the series, Mr. Dursley's firm and his role as its director serve as a contrast to the magical world of Hogwarts and the wi

In [69]:
matched_doc = chroma_retriever.get_relevant_documents(query = get_hyde_doc(query))
matched_doc

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/

## HyDE from Chains



The HypotheticalDocumentEmbedder class takes care of creating hypothetical answers, embedding them and retrieving similar chunks.
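The general pattern behind such a class can be sketched as an embedder that treats queries and documents differently. `HydeEmbedder`, `fake_generate_docs`, and `fake_embed` are hypothetical names for this illustration, and averaging the vectors of several generated documents is an assumption about how they are combined.

```python
class HydeEmbedder:
    """Sketch of an embedder that embeds a generated answer instead of the query."""

    def __init__(self, generate_docs, base_embed):
        self.generate_docs = generate_docs  # query -> list of hypothetical texts
        self.base_embed = base_embed        # text -> vector (list of floats)

    def embed_documents(self, texts):
        # Real documents are embedded directly, with no LLM involved.
        return [self.base_embed(t) for t in texts]

    def embed_query(self, query):
        # Generate hypothetical documents, embed each, then average the vectors.
        vectors = [self.base_embed(d) for d in self.generate_docs(query)]
        return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical stand-ins for the LLM and the embedding model.
def fake_generate_docs(query):
    return ["Mr. Dursley is a director.", "He works at Grunnings."]


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


embedder = HydeEmbedder(fake_generate_docs, fake_embed)
query_vector = embedder.embed_query("What is the role and firm of Mr. Dursley?")
print(query_vector)
```

Because only the query path changes, an object with this shape can be handed to a vector store wherever an ordinary embedding function is expected.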

In [71]:
from langchain.chains import HypotheticalDocumentEmbedder

hyde_embedding_function = HypotheticalDocumentEmbedder.from_llm(llm = chat, base_embeddings = embedding_function, prompt_key = 'web_search' )


`prompt_key="web_search"` selects one of the prompt templates that ship with `HypotheticalDocumentEmbedder`. The `web_search` template asks the LLM to write a passage that answers the question, and that passage is embedded in place of the raw query. Choosing a template whose output resembles the content in your vector database should help retrieval.

In [72]:
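# Note: this reuses the persist_directory of the earlier chroma_db, so the chunks are
# added to the existing collection; that is a likely cause of the duplicate results below.
chain_db = Chroma.from_documents(docs, hyde_embedding_function, persist_directory="output/hp_vector_db")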


In [73]:
matched_docs_new = chain_db.similarity_search(query)

matched_docs_new

[Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/Users/danilofornari/Downloads/rag_poc/data/Harry Potter 1 - Sorcerer_s Stone.txt'}, page_content='Mr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the'),
 Document(metadata={'source': '/