# Day 2: RAG

### What is RAG? 

**RAG (Retrieval-Augmented Generation)** is a technique widely used in industry to reduce model hallucination.
Have you ever asked an LLM about something you know well, only to get an answer that was simply made up?
That is what we call a **hallucination**, and RAG is one of the main techniques for reducing it.

**RAG** does this by finding relevant chunks of text and adding them to your prompt before it is sent to an LLM.

**Why is this useful?** 
Models like ChatGPT are trained on a wide variety of data from the internet. What if you ask a question about something the model wasn't trained on? Can you think of such questions? 
For example, let's say you have a business with your own data. You don't share any of that data on the internet or with anyone outside your business. 
How can you create a chatbot that lets you have a conversation with your business data? You use RAG!

Today we will learn how to process documents, store them in a database, and retrieve them from that database. Let's get started! 🚀


### Choosing an Embedding Model

+ Embedding models turn our documents into vectors (lists of numbers) that we can store in a database and later retrieve by similarity.

| Model | Dim | License | Notes |
|-------|-----|---------|-------|
| `text-embedding-3-small` (OpenAI) | 1536 | Commercial | Great multilingual coverage, pay‑per‑call |
| `BAAI/bge-small-en-v1.5` | 384 | MIT | Strong open model for its size; retrieval queries can optionally be prefixed with a short instruction |
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | Apache 2.0 | Lightweight; good default |

**Rule of thumb**: Pick the smallest model that meets quality requirements; larger ≠ always better. Measure with *mean reciprocal rank* (MRR) or *precision@k* on a small evaluation set.
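To make the two metrics concrete, here is a minimal sketch in plain Python. The document IDs and relevance labels are invented for illustration; in practice you would build a small evaluation set by hand from your own questions and documents.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


# Hypothetical evaluation set: one entry per query
eval_runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d5", "d9"], {"d2", "d9"}),  # first relevant hit at rank 1
    (["d4", "d6", "d8"], {"d0"}),        # no relevant hit
]

mrr = mean_reciprocal_rank(eval_runs)
p_at_3 = precision_at_k(["d2", "d5", "d9"], {"d2", "d9"}, k=3)
print(f"MRR: {mrr:.2f}, precision@3: {p_at_3:.2f}")
```

Running the same evaluation set against two embedding models gives you a like-for-like comparison before you commit to the larger one.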


In [5]:
!pip3 install rich langchain-community pypdf


In [3]:
# utilities
from rich import inspect

# import core langchain libraries
from langchain_community.document_loaders import PyPDFLoader                    # loader class for PDF files
from langchain_community.document_loaders import UnstructuredHTMLLoader         # loader class for HTML files
from langchain_community.document_loaders import UnstructuredMarkdownLoader     # loader class for Markdown files

from langchain_community.document_loaders import DirectoryLoader                # loads multiple documents from a directory
from langchain_community.document_loaders.merge import MergedDataLoader         # merges several loaders into one

---
# Section 1: Loading Documents

## 📃 Step 1: Loading a single document

### PDF document loading

+ [Loading PDF files reference](https://python.langchain.com/docs/how_to/document_loader_pdf/)

In [8]:
!pip3 install pypdf


In [9]:
# Step 1: Set the file path for the documents you want to load
file_path_pdf = "../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf"

# Step 2: Create loader that is specific for your document type
loader_pdf = PyPDFLoader(file_path_pdf)

# Step 3: Load in the document
my_pdf_document = loader_pdf.load()

In [10]:
# Step 4: Examine what we just did
#         'my_pdf_document' is a list of Document objects; each holds metadata (source, page number, ...) and the text content of one page

my_pdf_document[0]

Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 13.1 (Macintosh)', 'creationdate': '2018-09-26T11:32:24-05:00', 'author': 'Ray Dalio', 'keywords': 'Debt Crises; Bridgewater Associates; Ray Dalio', 'moddate': '2018-10-04T16:16:31-04:00', 'subject': 'Principles for Navigating Big Debt Crises provides a unique template for how debt crises work and the principles for dealing with them well.', 'title': 'Principles for Navigating Big Debt Crises', 'trapped': '/Unknown', 'source': '../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf', 'total_pages': 480, 'page': 0, 'page_label': '1'}, page_content='Principles For Navigating\nBIG DEBT CRISES\nPart 1:\nThe Archetypal Big Debt Cycle')

In [11]:
print("Source:       ",   my_pdf_document[0].metadata["source"])
print("Page number:  ",   my_pdf_document[0].metadata["page"])
print("Page content: ",   my_pdf_document[0].page_content)

Source:        ../data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf
Page number:   0
Page content:  Principles For Navigating
BIG DEBT CRISES
Part 1:
The Archetypal Big Debt Cycle


### HTML document loading

In [13]:
!pip3 install unstructured

In [15]:
# Step 1: Setting the file path
file_path_html = "../data/Is_It_Wrong_to_Remove_a_Card_From_Monopoly_The_New_York_Times.html"

# Step 2: Create loader that is specific for your document type
loader_html = UnstructuredHTMLLoader(file_path_html)

# Step 3: Load in the document
my_html_document = loader_html.load()

In [16]:
print("Source:       ", my_html_document[0].metadata["source"])
print("Page content: ", my_html_document[0].page_content[:100])

Source:        ../data/Is_It_Wrong_to_Remove_a_Card_From_Monopoly_The_New_York_Times.html
Page content:  Magazine|Is It Wrong to Remove a Card From Monopoly?

https://www.nytimes.com/2025/04/16/magazine/mo


### Markdown document loading

In [18]:
!pip install markdown -q


In [20]:
# Step 1: Setting the file path
file_path_markdown = "../data/README.md"

# Step 2: Create loader that is specific for your document type
loader_markdown = UnstructuredMarkdownLoader(file_path_markdown)

# Step 3: Load in the document
my_markdown_document = loader_markdown.load()

In [21]:
my_markdown_document

[Document(metadata={'source': '../data/README.md'}, page_content='Release Notes\n\nCI\n\nPyPI - License\n\nPyPI - Downloads\n\nGitHub star chart\n\nOpen Issues\n\nOpen in Dev Containers\n\n\n\nTwitter\n\nCodSpeed Badge\n\n[!NOTE] Looking for the JS/TS library? Check out LangChain.js.\n\nLangChain is a framework for building LLM-powered applications. It helps you chain together interoperable components and third-party integrations to simplify AI application development — all while future-proofing decisions as the underlying technology evolves.\n\nbash pip install -U langchain\n\nTo learn more about LangChain, check out the docs. If you’re looking for more advanced customization or agent orchestration, check out LangGraph, our framework for building controllable agent workflows.\n\nWhy use LangChain?\n\nLangChain helps developers build applications powered by LLMs through a standard interface for models, embeddings, vector stores, and more.\n\nUse LangChain for: - Real-time data augmenta

## 🦜 LangChain provides loaders for many document types

Here is a reference for loaders that I didn't cover in the examples. When we begin working on our group project, these links can help you load other types of documents.

#### Reference: [Document loaders](https://python.langchain.com/docs/how_to/#document-loaders)
+ [How to: load PDF files](https://python.langchain.com/docs/how_to/document_loader_pdf/)
+ [How to: load web pages](https://python.langchain.com/docs/how_to/document_loader_web/)
+ [How to: load CSV data](https://python.langchain.com/docs/how_to/document_loader_csv/)
+ [How to: load HTML data](https://python.langchain.com/docs/how_to/document_loader_html/)
+ [How to: load JSON data](https://python.langchain.com/docs/how_to/document_loader_json/)
+ [How to: load Markdown data](https://python.langchain.com/docs/how_to/document_loader_markdown/)

## 📂 Step 2: Loading documents from a folder

**Scenario:** What if you have a folder full of documents and you want to load in all of the PDF documents you have at the same time?

<div style="text-align:center;">
    <img src="../assets/image1.png" width="25%" height="25%"/>
</div>

### 2.1: Let's say we want to load in all of the PDF files we have in a folder into a document loader.

In [22]:
# Step 1: Set the path to the folder you want to load from
pdfs_file_path = "../data/"

# Step 2: use Directory loader to load in all of the documents from a folder

loader_pdfs = DirectoryLoader(pdfs_file_path,                # set your base file path
                              glob="*.pdf",                  # *.pdf means all files with the extension .pdf 
                              show_progress=True,            # progress bar
                              use_multithreading=True,       # load from disk using multiple threads
                              loader_cls=PyPDFLoader         # add the langchain PDF class
                             )

# Step 3: Load in the documents
my_pdf_documents = loader_pdfs.load()

100%|██████████| 3/3 [00:10<00:00,  3.55s/it]


#### Let's talk about what just happened.

In this project there is a folder called `data/`. It contains three files that end in `.pdf`:

+ data/principles-for-navigating-big-debt-crises-by-ray-dalio.pdf
+ data/react.pdf
+ data/trends_in_deep_learning_nlp.pdf

`DirectoryLoader` found every file ending in `.pdf`, loaded each one with `PyPDFLoader`, and returned a single list of `Document` objects. `PyPDFLoader` creates one `Document` per page, so the list has one entry per page rather than one per file.

In [23]:
my_pdf_documents[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-03-13T00:09:11+00:00', 'author': '', 'keywords': '', 'moddate': '2023-03-13T00:09:11+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../data/react.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}, page_content='Published as a conference paper at ICLR 2023\nREAC T: S YNERGIZING REASONING AND ACTING IN\nLANGUAGE MODELS\nShunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2\n1Department of Computer Science, Princeton University\n2Google Research, Brain team\n1{shunyuy,karthikn}@princeton.edu\n2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com\nABSTRACT\nWhile large language models (LLMs) have demonstrated impressive performance\nacross tasks in language understanding and interactive decision making, their\nabi

In [24]:
# We can also look at the details of the Document object
print("Source of the text:  " , my_pdf_documents[0].metadata["source"])
print("Page:                " , my_pdf_documents[0].metadata["page"])

Source of the text:   ../data/react.pdf
Page:                 0


#### What if you want to add more data to the `Document` object?

+ For example, your data is missing the author's name, but you know who wrote the document and want to keep that information for a later task, such as citing it as a reference.

In [25]:
my_pdf_documents[0].metadata["author"] = "Andronik"

In [26]:
my_pdf_documents[0].metadata["author"]

'Andronik'

In [27]:
my_pdf_documents[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-03-13T00:09:11+00:00', 'author': 'Andronik', 'keywords': '', 'moddate': '2023-03-13T00:09:11+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../data/react.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}, page_content='Published as a conference paper at ICLR 2023\nREAC T: S YNERGIZING REASONING AND ACTING IN\nLANGUAGE MODELS\nShunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2\n1Department of Computer Science, Princeton University\n2Google Research, Brain team\n1{shunyuy,karthikn}@princeton.edu\n2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com\nABSTRACT\nWhile large language models (LLMs) have demonstrated impressive performance\nacross tasks in language understanding and interactive decision making, th

### 2.2: Let's do something similar but with HTML files.

In [29]:
# Step 1: Set the path to the folder you want to load from
html_file_path = "../data/"

# Step 2: use Directory loader to load in all of the documents from a folder

loader_html = DirectoryLoader(html_file_path,                       
                              glob="*.html",
                              show_progress=True,
                              use_multithreading=True,
                              loader_cls=UnstructuredHTMLLoader              # the only change from the PDF example above
                             )

# Step 3: Load in the documents
my_html_documents = loader_html.load()

100%|██████████| 2/2 [00:00<00:00,  3.60it/s]


In [30]:
# As in the PDF example above, each Document carries its source in the metadata
print("Source of the text:  " , my_html_documents[0].metadata["source"])

Source of the text:   ../data/Is_It_Wrong_to_Remove_a_Card_From_Monopoly_The_New_York_Times.html


In [31]:
my_html_documents[0].page_content[:1000]

'Magazine|Is It Wrong to Remove a Card From Monopoly?\n\nhttps://www.nytimes.com/2025/04/16/magazine/monopoly-games-children-ethics.html\n\nAdvertisement\n\nSKIP ADVERTISEMENT\n\nSubscriber-only Newsletter\n\nThe Ethicist\n\nIs It Wrong to Remove a Card From Monopoly?\n\nThe magazine’s Ethicist columnist on altering board games to teach children ethical behavior.\n\nKwame Anthony Appiah\n\nBy Kwame Anthony Appiah\n\nApril 16, 2025\n\nYou’re reading The Ethicist newsletter, for Times subscribers only. Advice on life’s trickiest situations and moral dilemmas from the philosopher Kwame Anthony Appiah.\n\nMy grandchildren love playing Monopoly. The board game has become a great way for me to interact with them, and also a great way for them to see capitalism in all its imperfect glory. The problem: One of the cards a player may draw when landing on Community Chest is “Bank Error in Your Favor. Collect $200.” Right when we first started playing the game together, I removed that card from th

## 🗂️ Step 3: Loading different documents from the same folder 

In [32]:
# Step 1: Loading in the PDF, HTML and Markdown files
file_path_data = "../data/"

loader_pdfs = DirectoryLoader(file_path_data,              
                              glob="*.pdf",              
                              show_progress=True,            
                              use_multithreading=True,     
                              loader_cls=PyPDFLoader)                   # the only difference between the loaders is the loader class


loader_html = DirectoryLoader(file_path_data,                       
                              glob="*.html",
                              show_progress=True,
                              use_multithreading=True,
                              loader_cls=UnstructuredHTMLLoader)


loader_markdown = DirectoryLoader(file_path_data,                       
                                  glob="*.md",
                                  show_progress=True,
                                  use_multithreading=True,
                                  loader_cls=UnstructuredMarkdownLoader)



# Step 2: Use MergedDataLoader to combine all of the different loader types you have for your data
my_loaders = MergedDataLoader(loaders=[loader_pdfs, loader_html, loader_markdown])

# Step 3: Load all documents into a list of Document objects
my_docs = my_loaders.load()

100%|██████████| 3/3 [00:10<00:00,  3.56s/it]
100%|██████████| 2/2 [00:00<00:00,  3.75it/s]
100%|██████████| 1/1 [00:00<00:00, 90.96it/s]


In [33]:
print("The length of my list of Documents: ", len(my_docs))

The length of my list of Documents:  548
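
The count is far larger than the number of files because `PyPDFLoader` produces one `Document` per page. A quick way to see where the entries come from is to count them per source. The sketch below uses small stand-in objects (an assumption, so that it runs without the data folder); with the real data you would pass `my_docs` to the same function.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for LangChain Document objects: only the .metadata attribute is used here.
toy_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "b.html"}),
]


def count_by_source(docs):
    """Count how many Document entries came from each source file."""
    return Counter(doc.metadata["source"] for doc in docs)


counts = count_by_source(toy_docs)
for source, n in counts.most_common():
    print(f"{n:>4}  {source}")
```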


In [34]:
# Print the metadata of the first 10 documents
for doc in my_docs[:10]:
    print(doc.metadata)

{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-03-13T00:09:11+00:00', 'author': '', 'keywords': '', 'moddate': '2023-03-13T00:09:11+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../data/react.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}

---
# Section 2: RAG

What is RAG and why is it important? Let's say that you are working at a company and your boss wants you to create a chatbot that can answer questions about your business. For example: How much inventory do we have left in the parts department? What types of defects did we have for Part A in the month of March? 

To answer your boss's questions, the chatbot needs to be up to date on the information your business has. Plus, the data your business has is not on the internet!

RAG is a way of adding extra knowledge to an LLM without retraining the entire model, which is expensive, time-consuming, and requires expertise. RAG gives us a database of additional information that we can hook up to an LLM and use to answer questions.

The diagram below shows the overall flow; next we will talk about how it works.

![rag](../assets/rag2.png)


Three important parts of a RAG system:
1. **Document Splitting** - how we chop up the documents into smaller pieces.
2. **Embeddings** - how we turn those smaller pieces into vectors and put them into a database.
3. **Retrieval** - how we retrieve the most relevant pieces of our documents from the database.
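
To see how the three parts fit together before we bring in real libraries, here is a toy sketch in plain Python. It is only an illustration: the "company notes" are invented, sentences stand in for chunks, and word overlap stands in for embeddings. The real pipeline later in this notebook uses an embedding model and a vector database instead.

```python
import re


def split_into_chunks(text):
    """Document splitting: naively treat each sentence as one chunk."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Stand-in for embeddings: represent text as a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Retrieval: rank chunks by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Augmentation: paste the retrieved chunks into the prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Invented example data
company_notes = (
    "The parts department has 120 brake pads in stock. "
    "Part A had three paint defects in March. "
    "The cafeteria is closed on Fridays."
)

chunks = split_into_chunks(company_notes)
question = "What defects did Part A have in March?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The final prompt is what would be sent to the LLM; the model never sees the chunks that were not retrieved.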


If you want more details on RAG, here are a video and a paper you can check out:
+ [What is a RAG?](https://www.youtube.com/watch?v=T-D1OfcDW1M)
+ [Searching for Best Practices in Retrieval-Augmented Generation](https://arxiv.org/pdf/2407.01219)

## Step 1: Document Splitting and Embeddings

Once we know which documents we want to include in our database, we need to process them.
**Chunking** is a way of splitting your documents into smaller pieces so that when you retrieve something from your database you don't pull back an entire document. You retrieve only the pieces that are most relevant to your user's question.

<div style="text-align:center;">
    <img src="../assets/text_splitting.png" width="" height=""/>
</div>
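
As a rough illustration of the idea, here is a minimal fixed-size chunker with overlap written in plain Python. It is a simplified sketch, not how LangChain's splitters are implemented: `RecursiveCharacterTextSplitter`, used below, additionally tries to cut at natural boundaries such as paragraphs and spaces. The function name and sample text are made up for this example.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one neighbor.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Larger chunks keep more context together but make retrieval less precise; the overlap trades some duplicated storage for fewer ideas being cut in half.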

There are many different ways of splitting documents, but let's take a look at a few of the most common.


### Step 1.1: Recursive Chunking

In [35]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [36]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=10,
    chunk_overlap=5
)

In [37]:
test_sentence = "Hello. My name is Andronik and I have a cat named Jinxy"

In [38]:
texts = text_splitter.split_text(test_sentence)

In [39]:
texts

['Hello. My',
 'My name',
 'name is',
 'Andronik',
 'and I',
 'I have a',
 'a cat',
 'cat named',
 'Jinxy']

In [40]:
for n in [4, 8, 16]:
    
    text_splitter = RecursiveCharacterTextSplitter(
        # Set a really small chunk size, just to show.
        chunk_size=n,
        chunk_overlap=n // 2,
    )

    text = text_splitter.split_text(test_sentence)

    for t in text:
        print(f"Chunk length: {len(t)}, Text Split: {t}")

    print("------------------------------------------------------------")

Chunk length: 4, Text Split: Hell
Chunk length: 4, Text Split: llo.
Chunk length: 2, Text Split: My
Chunk length: 3, Text Split: nam
Chunk length: 3, Text Split: ame
Chunk length: 2, Text Split: is
Chunk length: 3, Text Split: And
Chunk length: 4, Text Split: ndro
Chunk length: 4, Text Split: roni
Chunk length: 3, Text Split: nik
Chunk length: 3, Text Split: and
Chunk length: 1, Text Split: I
Chunk length: 3, Text Split: hav
Chunk length: 3, Text Split: ave
Chunk length: 1, Text Split: a
Chunk length: 3, Text Split: cat
Chunk length: 3, Text Split: nam
Chunk length: 4, Text Split: amed
Chunk length: 3, Text Split: Jin
Chunk length: 4, Text Split: inxy
------------------------------------------------------------
Chunk length: 6, Text Split: Hello.
Chunk length: 7, Text Split: My name
Chunk length: 2, Text Split: is
Chunk length: 7, Text Split: Androni
Chunk length: 5, Text Split: ronik
Chunk length: 5, Text Split: and I
Chunk length: 6, Text Split: I have
Chunk length: 5, Text Split: a 

### Step 1.2: Semantic Chunking or Chunking by meaning

In [41]:
# loading in the api keys to connect to OpenAI
import os
from dotenv import load_dotenv
load_dotenv()

True

In [45]:
!pip3 install langchain-openai langchain-experimental -q


In [46]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

In [47]:
# connect to OpenAI 
openai_api_key = os.environ.get("OPENAI_API_KEY")
openai_organization = os.environ.get("OPENAI_ORGANIZATION")

In [48]:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key,
                              model="text-embedding-3-small")

In [49]:
# reuse the embeddings model configured above
text_splitter = SemanticChunker(embeddings)

In [50]:
test_sentence = """
This raises the question: Can function inlining affect the output of a numerical program? 
I’m interested in both the programmer inlining a function manually in his editor and the compiler doing it for you during an optimization pass.
It turns out that yes, there are multiple ways in which inlining can change results, and specifics depend on the interplay of your language spec, 
compiler and hardware. One of the big reasons why compilers do inlining is that it increases the scope for optimizations. Most optimization passes 
run on individual functions, so inlining gives the compiler more code to work with, and more opportunities to apply potentially result-changing optimizations.
Here’s a concrete example that compiled with gcc -O3 -march=haswell produces different results depending on whether the function is inlined or not
"""

In [51]:
docs = text_splitter.split_text(test_sentence)

In [52]:
print("Number of chunks:", len(docs))
for doc in docs:
    print("--------------------------------------------------------------------")
    print(doc)
    print("--------------------------------------------------------------------")

Number of chunks: 2
--------------------------------------------------------------------

This raises the question: Can function inlining affect the output of a numerical program? I’m interested in both the programmer inlining a function manually in his editor and the compiler doing it for you during an optimization pass. It turns out that yes, there are multiple ways in which inlining can change results, and specifics depend on the interplay of your language spec, 
compiler and hardware. One of the big reasons why compilers do inlining is that it increases the scope for optimizations.
--------------------------------------------------------------------
--------------------------------------------------------------------
Most optimization passes 
run on individual functions, so inlining gives the compiler more code to work with, and more opportunities to apply potentially result-changing optimizations. Here’s a concrete example that compiled with gcc -O3 -march=haswell produces differe

### What did the `OpenAIEmbeddings()` do?

The basic idea is this...

The splitter uses the embeddings model to compare neighboring sentences and cuts the text where the meaning shifts the most. Here it decided to divide the text into 2 pieces. We can think of each piece as an "idea": `Chunk 1` represents idea 1 and `Chunk 2` represents idea 2.

```mermaid
flowchart LR
node1[Document]
node2["Chunk 1"]
node3["Chunk 2"]
node1-->node2
node1-->node3
```
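
The sketch below illustrates the general idea of splitting at a percentile breakpoint: measure the distance between each pair of neighboring sentences and start a new chunk wherever that distance is unusually large. It is a simplified illustration under stated assumptions, not the actual `SemanticChunker` implementation, which differs in its details. The two-dimensional vectors are hand-made stand-ins for real embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, larger means more different."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, vectors, pct=70.0):
    """Start a new chunk wherever the neighbor distance exceeds the percentile threshold."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Hand-made toy "embeddings": the first two sentences point one way, the last two another.
sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bonds fell too."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

result = semantic_chunks(sentences, toy_vectors, pct=70.0)
for chunk in result:
    print(chunk)
```

A lower percentile makes more distances count as breakpoints and so produces more, smaller chunks; a higher one produces fewer, larger chunks.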

---

## Step 2: Creating a Database

+ Now that we have taken the documents and split them into smaller pieces, we need to store them in a database so we can use them later.

![vdb](../assets/vdb.png)
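
Conceptually, a vector database stores each chunk next to its vector and answers a query by returning the chunks whose vectors are closest to the query vector. Here is a toy in-memory sketch of that idea; the class name `ToyVectorStore` and the three-dimensional vectors are invented for illustration, and a real store such as Chroma adds persistence, indexing, and metadata handling on top.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


class ToyVectorStore:
    """Minimal in-memory store: keeps (text, vector) pairs and searches by brute force."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def similarity_search(self, query_vector, k=2):
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


# Hand-made vectors standing in for real embeddings
store = ToyVectorStore()
store.add("neural networks", [0.9, 0.1, 0.0])
store.add("debt cycles", [0.0, 0.2, 0.9])
store.add("gradient descent", [0.8, 0.3, 0.1])

results = store.similarity_search([1.0, 0.2, 0.0], k=2)
print(results)
```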

In [53]:
!pip3 install langchain-chroma -q


In [55]:
from rich import inspect
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

In [56]:
# First, initialize an embeddings model to use on your documents
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key,
                              model="text-embedding-3-large",
                              chunk_size=1000)

In [57]:
# Step 1: Initialize the SemanticChunker
text_splitter = SemanticChunker(embeddings,
                                breakpoint_threshold_type="percentile",
                                breakpoint_threshold_amount=70.0)

In [58]:
# Step 2: Split the documents into sections.
#   We are hoping each section contains a separate idea.
docs = text_splitter.split_documents(my_docs)

In [39]:
for doc in docs:
    print(doc)
    break

page_content='Published as a conference paper at ICLR 2023
REAC T: S YNERGIZING REASONING AND ACTING IN
LANGUAGE MODELS
Shunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2
1Department of Computer Science, Princeton University
2Google Research, Brain team
1{shunyuy,karthikn}@princeton.edu
2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
ABSTRACT
While large language models (LLMs) have demonstrated impressive performance
across tasks in language understanding and interactive decision making, their
abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g.' metadata={'source': '../data/react.pdf', 'page': 0}


In [40]:
print("New Length after chunking:", len(docs))

New Length after chunking: 3883


In [41]:
# Step 3: Insert your documents into a database after turning your docs into numbers.
vectorstore = Chroma.from_documents(
    documents=docs,
    collection_name="tumo_2025",
    embedding=embeddings,
    persist_directory="../db/my_first_vdb",  # Where to save data locally, remove if not necessary
)

In [42]:
# Let's run a test and see which documents are most similar to the phrase "Machine Learning"
vectorstore.similarity_search("Machine Learning")

[Document(id='8cd4758a-5fde-4bb3-81c7-828374fca164', metadata={'page': 24, 'source': '../data/trends_in_deep_learning_nlp.pdf'}, page_content='25\nTABLE X: Comparison of ELMo + Baseline with the previous state of the art (SOTA) on various NLP tasks. The table has\nbeen adapted from [41]. SOTA results have been taken from [41]; SQUAD [166]: QA task; SNLI [178]: Stanford Natural\nLanguage Inference task; SRL [153]: Semantic Role Labelling; Coref [179]: Coreference Resolution; NER [180]: Named Entity\nRecognition; SST-5 [4]: Stanford Sentiment Treebank 5-class classiﬁcation;\nTask Previous SOTA Previous\nSOTA Results Baseline ELMo +\nBaseline\nIncrease\n(Absolute/Relative)\nSQuAD Liu et al. [181] 84.4 81.1 85.8 4.7 / 24.9%\nSNLI Qian et al. [182] 88.6 88.0 88.70 ±0.17 0.7 / 5.8%\nSRL Luheng et al. [183] 81.7 81.4 84.6 3.2 / 17.2%\nCoref Kenton et al. [184] 67.2 67.2 70.4 3.2 / 9.8%\nNER Matthew et al. [185] 91.93 ±0.19 90.15 92.22 ±0.10 2.06 / 21%\nSST-5 Bryan et al. [186] 53.7 51.4 54.7 

In [43]:
# Step 4: LangChain provides an `as_retriever()` method that wraps the vector store
#         so it is easier to plug into our chain of functions.

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [44]:
my_retrieved_docs = retriever.invoke("Machine Learning")

In [45]:
print("How many documents we retrieved: ", len(my_retrieved_docs))

How many documents we retrieved:  5


In [46]:
for doc in my_retrieved_docs:
    print(doc)
    print("----------")

page_content='25
TABLE X: Comparison of ELMo + Baseline with the previous state of the art (SOTA) on various NLP tasks. The table has
been adapted from [41]. SOTA results have been taken from [41]; SQUAD [166]: QA task; SNLI [178]: Stanford Natural
Language Inference task; SRL [153]: Semantic Role Labelling; Coref [179]: Coreference Resolution; NER [180]: Named Entity
Recognition; SST-5 [4]: Stanford Sentiment Treebank 5-class classiﬁcation;
Task Previous SOTA Previous
SOTA Results Baseline ELMo +
Baseline
Increase
(Absolute/Relative)
SQuAD Liu et al. [181] 84.4 81.1 85.8 4.7 / 24.9%
SNLI Qian et al. [182] 88.6 88.0 88.70 ±0.17 0.7 / 5.8%
SRL Luheng et al. [183] 81.7 81.4 84.6 3.2 / 17.2%
Coref Kenton et al. [184] 67.2 67.2 70.4 3.2 / 9.8%
NER Matthew et al. [185] 91.93 ±0.19 90.15 92.22 ±0.10 2.06 / 21%
SST-5 Bryan et al. [186] 53.7 51.4 54.7 0.5 3.3 / 6.8%
Task BiLSTM+
ELMo+Attn BERT
QNLI 79.9 91.1
SST-2 90.9 94.9
STS-B 73.3 86.5
RTE 56.8 70.1
SQuAD 85.8 91.1
NER 92.2 92.8
TABLE XI: 

#### How to load in your database after you have created it

In [47]:
loaded_vectorstore = Chroma(
    persist_directory="../db/my_first_vdb",
    embedding_function=embeddings,
    collection_name="tumo_2025",
)

In [48]:
loaded_vectorstore.similarity_search_with_score("Machine Learning")

[(Document(id='8cd4758a-5fde-4bb3-81c7-828374fca164', metadata={'page': 24, 'source': '../data/trends_in_deep_learning_nlp.pdf'}, page_content='25\nTABLE X: Comparison of ELMo + Baseline with the previous state of the art (SOTA) on various NLP tasks. The table has\nbeen adapted from [41]. SOTA results have been taken from [41]; SQUAD [166]: QA task; SNLI [178]: Stanford Natural\nLanguage Inference task; SRL [153]: Semantic Role Labelling; Coref [179]: Coreference Resolution; NER [180]: Named Entity\nRecognition; SST-5 [4]: Stanford Sentiment Treebank 5-class classiﬁcation;\nTask Previous SOTA Previous\nSOTA Results Baseline ELMo +\nBaseline\nIncrease\n(Absolute/Relative)\nSQuAD Liu et al. [181] 84.4 81.1 85.8 4.7 / 24.9%\nSNLI Qian et al. [182] 88.6 88.0 88.70 ±0.17 0.7 / 5.8%\nSRL Luheng et al. [183] 81.7 81.4 84.6 3.2 / 17.2%\nCoref Kenton et al. [184] 67.2 67.2 70.4 3.2 / 9.8%\nNER Matthew et al. [185] 91.93 ±0.19 90.15 92.22 ±0.10 2.06 / 21%\nSST-5 Bryan et al. [186] 53.7 51.4 54.7

## Hands‑On Activity — End‑to‑End RAG Pipeline


**Goal:**  
1. Embed & store a small doc collection  
2. Retrieve top‑k chunks for a user query  
3. Feed those chunks into an LLM prompt and compare the answer quality

We'll use:
* `langchain` (or `langchain-community` ≥ 0.2)  
* `sentence-transformers` or `openai` for embeddings  
* `chromadb` for the vector index  
* Your preferred LLM provider (OpenAI shown below)

> **Tip:** If you expect to be offline or behind a firewall, install the dependencies with `pip install` ahead of time, while you still have network access.


In [50]:
# !pip install -qU langchain-community langchain-text-splitters langchain-chroma langchain-openai python-dotenv
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# 1. Load sample docs (use your own folder path)
docs = "..."

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size="...", 
                                          chunk_overlap="...")

split_docs = splitter.split_documents(docs)

# 3. Embed 
embeddings = OpenAIEmbeddings()  # or HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# 4. To Database (store the chunks, not the unsplit documents)
vectorstore = Chroma.from_documents(documents=split_docs,
                                    collection_name="tumo_2025",
                                    embedding=embeddings,
                                    persist_directory="../db/my_second_vdb")

# 5. Retrieval
user_query = "Explain the mission statement of ACME Corp."
top_k = 4
retrieved_docs = vectorstore.similarity_search(user_query, k=top_k)
context = "\n\n".join(d.page_content for d in retrieved_docs)

# 6. Build prompt & call LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = f"Use the following context to answer the question.\n\nContext:\n{context}\n\nQuestion: {user_query}\nAnswer:"

answer = llm.invoke(prompt).content
print(answer)