# Day 2: RAG

What is RAG? 

**RAG (Retrieval-Augmented Generation)** is a technique that is widely used in industry to help reduce model hallucination.
For example, have you ever asked ChatGPT a question about something you know well and gotten an answer that was simply made up?
This is what we call a **hallucination**, and RAG is one of the techniques we use to reduce it.

RAG does this by finding relevant chunks of text and adding them to your prompt before sending it off to an LLM.
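
To make the idea concrete, here is a minimal sketch of the "find chunks, then add them to the prompt" step. It is a toy: it scores chunks by counting shared words, whereas real RAG systems usually compare embeddings stored in a vector database. The function names (`tokenize`, `score`, `retrieve`, `build_prompt`), the sample chunks, and the prompt wording are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question, chunk):
    """Toy relevance score: number of words shared by question and chunk."""
    return len(tokenize(question) & tokenize(chunk))


def retrieve(question, chunks, k=2):
    """Return the k chunks with the highest toy score."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]


def build_prompt(question, chunks, k=2):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented sample data standing in for private business documents.
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Office hours run from 9am to 5pm on weekdays.",
    "Refunds are issued to the original payment method.",
]
question = "What is the refund policy?"

prompt = build_prompt(question, chunks, k=2)
print(prompt)
```

The string in `prompt` is what would be sent to the LLM in place of the bare question, so the model can ground its answer in the retrieved text.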

**Why is this useful?** 
Models like ChatGPT are trained on data from the internet. What if you ask a question about something it wasn't trained on? Can you think of such questions?
For example, let's say you have a business with your own data. You don't share any of that data on the internet or with anyone outside your business.
How can you create a chatbot that lets you have a conversation with your data? You use RAG!

Today we will learn how to process documents, put them into a database, and retrieve those documents from the database. Let's get started! 🚀

In [1]:
# import libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader

from langchain_community.document_loaders import DirectoryLoader

# Step 1: Simple document loading

### PDF document loading

+ [Loading PDF files reference](https://python.langchain.com/docs/how_to/document_loader_pdf/)

In [2]:
# Step 1: Set the file path for the documents you want to load
file_path_pdf = "../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf"

# Step 2: Create loader that is specific for your document type
loader_pdf = PyPDFLoader(file_path_pdf)

# Step 3: Load in the document
my_pdf_document = loader_pdf.load()

In [3]:
# Step 4: Examine what we just loaded
#         'my_pdf_document' is a list of Document objects; each one holds metadata (source, page number) and the content of one page

my_pdf_document[0]

Document(metadata={'source': '../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf', 'page': 0}, page_content='Principles For Navigating\nBIG DEBT CRISES\nPart 1:\nThe Archetypal Big Debt Cycle')

In [4]:
print("Source:       ", my_pdf_document[0].metadata["source"])
print("Page number:  ",   my_pdf_document[0].metadata["page"])
print("Page content: ", my_pdf_document[0].page_content)

Source:        ../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf
Page number:   0
Page content:  Principles For Navigating
BIG DEBT CRISES
Part 1:
The Archetypal Big Debt Cycle
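
Because the loader returns a plain list with one `Document` per page, ordinary Python list operations are enough to inspect it. The sketch below prints a one-line preview per page and counts pages that contain text. So that it runs without the PDF, it uses a hypothetical stand-in class `Page` that only mimics the two attributes seen above (`metadata` and `page_content`); the helper name `preview`, the file name `book.pdf`, and the blank second page are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(docs, max_chars=60):
    """Return one summary line per page: source, page number, start of text."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "?")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so the preview fits on one line.
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{source} [page {page}]: {snippet}")
    return lines


sample = [
    Page("Principles For Navigating\nBIG DEBT CRISES", {"source": "book.pdf", "page": 0}),
    Page("", {"source": "book.pdf", "page": 1}),  # invented blank page
]

for line in preview(sample):
    print(line)

# Keep only the pages that actually contain text.
non_empty = [doc for doc in sample if doc.page_content.strip()]
print(len(non_empty), "of", len(sample), "pages contain text")
```

Since `preview` only reads `metadata` and `page_content`, calling it on a slice such as `my_pdf_document[:3]` should behave the same way.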


### HTML document loading

In [5]:
# Step 1: Setting the file path
file_path_html = "../data/Is It Wrong to Remove a Card From Monopoly_ - The New York Times.html"

# Step 2: Create loader that is specific for your document type
loader_html = UnstructuredHTMLLoader(file_path_html)

# Step 3: Load in the document
my_html_document = loader_html.load()

In [6]:
print("Source:       ", my_html_document[0].metadata["source"])
print("Page content: ", my_html_document[0].page_content[:100])

Source:        ../data/Is It Wrong to Remove a Card From Monopoly_ - The New York Times.html
Page content:  Magazine|Is It Wrong to Remove a Card From Monopoly?

https://www.nytimes.com/2025/04/16/magazine/mo


### Markdown document loading

In [7]:
# Step 1: Setting the file path
file_path_markdown = "../data/README.md"

# Step 2: Create loader that is specific for your document type
loader_markdown = UnstructuredMarkdownLoader(file_path_markdown)

# Step 3: Load in the document
my_markdown_document = loader_markdown.load()

In [8]:
my_markdown_document

[Document(metadata={'source': '../data/README.md'}, page_content='Release Notes\n\nCI\n\nPyPI - License\n\nPyPI - Downloads\n\nGitHub star chart\n\nOpen Issues\n\nOpen in Dev Containers\n\n\n\nTwitter\n\nCodSpeed Badge\n\n[!NOTE] Looking for the JS/TS library? Check out LangChain.js.\n\nLangChain is a framework for building LLM-powered applications. It helps you chain together interoperable components and third-party integrations to simplify AI application development — all while future-proofing decisions as the underlying technology evolves.\n\nbash pip install -U langchain\n\nTo learn more about LangChain, check out the docs. If you’re looking for more advanced customization or agent orchestration, check out LangGraph, our framework for building controllable agent workflows.\n\nWhy use LangChain?\n\nLangChain helps developers build applications powered by LLMs through a standard interface for models, embeddings, vector stores, and more.\n\nUse LangChain for: - Real-time data augmenta

## 🦜 LangChain provides ways of loading many types of documents

#### Reference: [Document loaders](https://python.langchain.com/docs/how_to/#document-loaders)
+ [How to: load PDF files](https://python.langchain.com/docs/how_to/document_loader_pdf/)
+ [How to: load web pages](https://python.langchain.com/docs/how_to/document_loader_web/)
+ [How to: load CSV data](https://python.langchain.com/docs/how_to/document_loader_csv/)
+ [How to: load HTML data](https://python.langchain.com/docs/how_to/document_loader_html/)
+ [How to: load JSON data](https://python.langchain.com/docs/how_to/document_loader_json/)
+ [How to: load Markdown data](https://python.langchain.com/docs/how_to/document_loader_markdown/)
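
Since each file type has its own loader, one convenient pattern is to pick the loader from the file extension. The sketch below is one possible way to do that, limited to the three loaders imported at the top of this notebook. The names `LOADER_BY_EXTENSION`, `loader_name_for`, and `make_loader` are assumptions, not part of LangChain; `make_loader` imports `langchain_community` only when it is called.

```python
import importlib
from pathlib import Path

# Extension -> loader class name (only the loaders used in this notebook).
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
}


def loader_name_for(file_path):
    """Return the loader class name that matches the file's extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loader configured for '{suffix}' files")
    return LOADER_BY_EXTENSION[suffix]


def make_loader(file_path):
    """Create the loader object (requires langchain_community to be installed)."""
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, loader_name_for(file_path))
    return loader_cls(str(file_path))


print(loader_name_for("../data/README.md"))
```

With the libraries installed, `make_loader("../data/README.md").load()` would then follow the same load step used in the cells above.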

# Step 2: Loading documents from a folder

**Scenario:** What if you have a folder full of documents and you want to load in all of the PDF documents you have at the same time?

<div style="text-align:center;">
    <img src="../assets/image1.png" width="25%" height="25%"/>
</div>

In [15]:
# Step 1: Set the path to the folder that contains your documents
pdfs_file_path = "../data/"

# Step 2: Use LangChain's DirectoryLoader to load all of the matching documents from the folder

loader_pdfs = DirectoryLoader(pdfs_file_path,                # set your base file path
                              glob="*.pdf",                  # *.pdf means all files with the extension .pdf 
                              show_progress=True,            # progress bar
                              use_multithreading=True,       # load from disk using multiple threads
                              loader_cls=PyPDFLoader         # loader class applied to each matching file
                             )

# Step 3: Load in the documents
my_pdf_documents = loader_pdfs.load()

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:12<00:00,  4.28s/it]
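
The `glob` pattern decides which files get picked up. The sketch below uses only Python's `pathlib` on a throwaway folder to show what `*.pdf` selects compared with `**/*.pdf`. It illustrates glob patterns in general and is not a description of `DirectoryLoader` internals; the file names are invented.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()

    # Create empty placeholder files, one of them inside a subfolder.
    for name in ["a.pdf", "b.pdf", "notes.md", "sub/c.pdf"]:
        (base / name).touch()

    # "*.pdf" matches PDF files directly inside the folder.
    top_level = sorted(p.relative_to(base).as_posix() for p in base.glob("*.pdf"))

    # "**/*.pdf" also descends into subfolders.
    nested = sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.pdf"))

print(top_level)  # ['a.pdf', 'b.pdf']
print(nested)     # ['a.pdf', 'b.pdf', 'sub/c.pdf']
```

Neither pattern matches `notes.md`, which is why the pattern doubles as a file-type filter.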


In [17]:
for doc in my_pdf_documents:
    print(doc.metadata["source"])

../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/react.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../data/trends_in_deep_learning_nlp.pdf
../dat