---
# 1. RAG Documents Loading and Chunking

---
You will learn about:
- How to load docs using Pandas
- How to load docs from URLs
- How to load docs from Wikipedia
- How to load docs from PDFs
- How to load docs from Directories

- How to split/chunk on Characters
- How to split/chunk using NLTK
- How to split/chunk using TikToken
- How to split/chunk using Transformers Tokenizers

### Setup

In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_community.document_loaders import WikipediaLoader

from dotenv import load_dotenv
import os
import pandas as pd

_ = load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
MODEL_NAME = "gemini-2.5-flash"

llm = ChatGoogleGenerativeAI(
    model=MODEL_NAME,
    api_key=GOOGLE_API_KEY,
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

### Loaders | Pandas

In [4]:
!wget -O youtube-sub.csv https://raw.githubusercontent.com/Petlja/JupyterBookSrCyr/master/podaci/Top%2025%20YouTubers.csv

--2025-10-17 15:14:27--  https://raw.githubusercontent.com/Petlja/JupyterBookSrCyr/master/podaci/Top%2025%20YouTubers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1153 (1.1K) [text/plain]
Saving to: ‘youtube-sub.csv’


2025-10-17 15:14:27 (65.7 MB/s) - ‘youtube-sub.csv’ saved [1153/1153]



In [5]:
df = pd.read_csv("youtube-sub.csv", sep=",")
df.head()

Unnamed: 0,RANK,GRADE,NAME,VIDEOS,SUBSCRIBERS,VIEWES
0,1,A++,T-Series,13629,105783888,76945588449
1,2,A,PewDiePie,3898,97853589,22298927681
2,3,A+,5-Minute Crafts,3341,58629572,14860695079
3,4,A++,Cocomelon - Nursery Rhymes,441,53163816,33519273951
4,5,A++,SET India,31923,51784081,36464793233


In [6]:
dataframe = df[['NAME', 'VIDEOS', 'SUBSCRIBERS', 'VIEWES']]
dataframe.head()

Unnamed: 0,NAME,VIDEOS,SUBSCRIBERS,VIEWES
0,T-Series,13629,105783888,76945588449
1,PewDiePie,3898,97853589,22298927681
2,5-Minute Crafts,3341,58629572,14860695079
3,Cocomelon - Nursery Rhymes,441,53163816,33519273951
4,SET India,31923,51784081,36464793233


In [7]:
dataframe.shape

(25, 4)

In [8]:
llm_loader = DataFrameLoader(
    data_frame=dataframe,
    page_content_column="NAME",
)

llm_data = llm_loader.load()

print(len(llm_data))
llm_data[0].model_dump()

25


{'id': None,
 'metadata': {'VIDEOS': 13629,
  'SUBSCRIBERS': 105783888,
  'VIEWES': 76945588449},
 'page_content': 'T-Series',
 'type': 'Document'}
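
As a rough mental model (not the actual `DataFrameLoader` source), each row becomes one document: the column named by `page_content_column` supplies the text and every other column is carried along as metadata. The helper `rows_to_documents` below is hypothetical and works on plain dictionaries instead of a DataFrame, so the sketch runs without extra dependencies.

```python
# Hypothetical helper: mimics the row -> document mapping with plain dicts.
def rows_to_documents(rows, page_content_column):
    documents = []
    for row in rows:
        # Everything except the content column is kept as metadata.
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        documents.append(
            {"page_content": row[page_content_column], "metadata": metadata}
        )
    return documents


# Two rows copied from the CSV preview above.
rows = [
    {"NAME": "T-Series", "VIDEOS": 13629, "SUBSCRIBERS": 105783888, "VIEWES": 76945588449},
    {"NAME": "PewDiePie", "VIDEOS": 3898, "SUBSCRIBERS": 97853589, "VIEWES": 22298927681},
]

docs = rows_to_documents(rows, page_content_column="NAME")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Choosing `NAME` as the content column means only the channel name is embedded later; the numeric columns stay available for filtering through the metadata.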

### Loaders | URLs

In [9]:
urls = [
    "https://www.apple.com/in/support/products/faqs.html",
    "https://www.apple.com/legal/sales-support/",
]

llm_loader = UnstructuredURLLoader(
    urls=urls
)

llm_data = llm_loader.load()

print(len(llm_data))
llm_data[1].model_dump()

2


{'id': None,
 'metadata': {'source': 'https://www.apple.com/legal/sales-support/'},
 'page_content': 'Sales & Support\n\nGet familiar with the policies, terms, and conditions for the purchase, support, and servicing of your Apple products.\n\nAppleCare One\n\nAppleCare One covers up to three devices (with the option to add more) with all the benefits of AppleCare+ and includes up to three incidents of theft or loss for Apple Watch, iPhone, and iPad every 12 months.\n\nSee the AppleCare One Terms and Conditions\n\nAppleCare+ with Theft and Loss\n\nAppleCare+ with Theft and Loss provides everything included in AppleCare+ and up to two incidents of theft or loss coverage every 12 months.\n\nView the Theft and Loss Insurance Documents\n\nAppleCare+\n\nAppleCare+ provides additional hardware service and technical support from Apple, including coverage for unlimited incidents of accidental damage per device covered.\n\nExplore AppleCare+ Terms and Conditions\n\nAppleCare Plans\n\nSome servic

### Loaders | Wikipedia

In [10]:
query = "عبد الفتاح السيسي"  # Arabic for "Abdel Fattah el-Sisi"

llm_loader = WikipediaLoader(
    query=query,
    load_max_docs=5,
    doc_content_chars_max=20000
)

llm_data = llm_loader.load()

print(len(llm_data))
llm_data[1].model_dump()

# llm_data[1].page_content
# llm_data[1].metadata

5


{'id': None,
 'metadata': {'title': 'List of presidents of Egypt',
  'summary': "The office of President of Egypt was established in 1953. The president is the head of state of Egypt and the Supreme Commander of the Egyptian Armed Forces. The current president is Abdel Fattah el-Sisi, who has effectively controlled the country since the 2013 coup d'état, and was officially elected president in 2014.",
  'source': 'https://en.wikipedia.org/wiki/List_of_presidents_of_Egypt'},
 'page_content': "The office of President of Egypt was established in 1953. The president is the head of state of Egypt and the Supreme Commander of the Egyptian Armed Forces. The current president is Abdel Fattah el-Sisi, who has effectively controlled the country since the 2013 coup d'état, and was officially elected president in 2014.\n\n\n== Background ==\nThe first president of Egypt was Mohamed Naguib, one of the leaders of the Free Officers Movement who led the Egyptian Revolution of 1952, and who took office

### Loaders | PDF

In [11]:
from langchain_community.document_loaders.pdf import PyPDFLoader

file_path = "/home/mango/Coding/Prompt Engineering using LangChain/conda-cheatsheet.pdf"

llm_loader = PyPDFLoader(file_path=file_path)

# pages = llm_loader.load_and_split()

pages = []

async for page in llm_loader.alazy_load():
    pages.append(page)

In [12]:
print(len(pages))


print(f"{pages[0].metadata}\n")
print(pages[0].page_content)

2
{'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'MKITFPdf v2.26 based on PDFKIT', 'creationdate': "D:20250625205157Z00'00'", 'moddate': "D:20250625205157Z00'00'", 'source': '/home/mango/Coding/Prompt Engineering using LangChain/conda-cheatsheet.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}

Quickstart
Tip: It is recommended to create a new environment for any new project or workflow.
verify conda install and check version conda info
conda update --name base conda
conda install anaconda
conda create --name ENVNAME
conda activate ENVNAME
update conda in base environment 
install latest anaconda distribution 
create a new environment
(tip: name environment descriptively) 
activate environment
(do this before installing packages)
Cheatsheet
conda list
conda list --show-channel-urls
conda update --all
conda install --channel CHANNELNAME PKGNAME
conda install CHANNELNAME::PKGNAME
conda install "PKGNAME>2.5,<3.2"
conda install "PKGNAME [version='2.5|3

### Loaders | Directories

In [13]:
from langchain_community.document_loaders import DirectoryLoader

llm_loader = DirectoryLoader(
    path="/home/mango/Coding/Prompt Engineering using LangChain/Data",
    glob="*.txt",
    show_progress=True,
)

llm_data = llm_loader.load()

llm_data

100%|██████████| 4/4 [00:00<00:00, 274.81it/s]


[Document(metadata={'source': '/home/mango/Coding/Prompt Engineering using LangChain/Data/1.txt'}, page_content='Youssef\n\nTaha\n\nBadawi'),
 Document(metadata={'source': '/home/mango/Coding/Prompt Engineering using LangChain/Data/2.txt'}, page_content='Major : AI'),
 Document(metadata={'source': '/home/mango/Coding/Prompt Engineering using LangChain/Data/4.txt'}, page_content='Age : 21'),
 Document(metadata={'source': '/home/mango/Coding/Prompt Engineering using LangChain/Data/3.txt'}, page_content='BD : 12 / 2 / 2004')]
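
The idea behind a directory loader can be sketched with the standard library alone: walk the files that match a glob pattern and emit one record per file, with the file path stored as `source` metadata. The function `load_text_directory` is a hypothetical stand-in, not the `DirectoryLoader` implementation; the real loader hands each file to a parser, while this sketch simply reads the raw text.

```python
import tempfile
from pathlib import Path


# Hypothetical helper: one dict per matching file, with its path as metadata.
def load_text_directory(path, glob="*.txt"):
    documents = []
    # sorted() gives a stable order; a raw directory listing has no guaranteed order.
    for file_path in sorted(Path(path).glob(glob)):
        documents.append(
            {
                "page_content": file_path.read_text(encoding="utf-8"),
                "metadata": {"source": str(file_path)},
            }
        )
    return documents


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "1.txt").write_text("Major : AI", encoding="utf-8")
    Path(tmp_dir, "2.txt").write_text("Age : 21", encoding="utf-8")
    Path(tmp_dir, "notes.md").write_text("ignored by the glob", encoding="utf-8")

    loaded = load_text_directory(tmp_dir)

print(len(loaded))
print([Path(d["metadata"]["source"]).name for d in loaded])
```

Sorting is a choice made for this sketch; the output above lists `4.txt` before `3.txt`, which suggests the loader returns files in whatever order the file system yields them.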

In [14]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

llm_loader = PyPDFDirectoryLoader(
    path="/home/mango/Coding/Prompt Engineering using LangChain/Data",
)

llm_data = llm_loader.load()
llm_data

[Document(metadata={'producer': 'macOS Version 15.3.2 (Build 24D81) Quartz PDFContext', 'creator': 'MKITFPdf v2.26 based on PDFKIT', 'creationdate': "D:20250625205157Z00'00'", 'moddate': "D:20250625205157Z00'00'", 'source': '/home/mango/Coding/Prompt Engineering using LangChain/Data/conda-cheatsheet.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='Quickstart\nTip: It is recommended to create a new environment for any new project or workflow.\nverify conda install and check version conda info\nconda update --name base conda\nconda install anaconda\nconda create --name ENVNAME\nconda activate ENVNAME\nupdate conda in base environment \ninstall latest anaconda distribution \ncreate a new environment\n(tip: name environment descriptively) \nactivate environment\n(do this before installing packages)\nCheatsheet\nconda list\nconda list --show-channel-urls\nconda update --all\nconda install --channel CHANNELNAME PKGNAME\nconda install CHANNELNAME::PKGNAME\nconda install "P

### Text Splitter


In [15]:
from langchain_community.document_loaders import WikipediaLoader

query_1 = "Ahmed Zewail"
query_2 = "Mohamed Salah"

###########################################

docs_1 = WikipediaLoader(
    query=query_1, load_max_docs=1, doc_content_chars_max=20_000
).load()

text_documents_1 = docs_1[0].page_content

############################################

docs_2 = WikipediaLoader(
    query=query_2, load_max_docs=1, doc_content_chars_max=20_000
).load()

text_documents_2 = docs_2[0].page_content

############################################

print(text_documents_1)

Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as the "father of femtochemistry". He was awarded the 1999 Nobel Prize in Chemistry for his work on femtochemistry and became the first Egyptian and Arab to win a Nobel Prize in a scientific field, and also the first African to win a Nobel Prize in Chemistry. He was a professor of chemistry and physics at the California Institute of Technology (Caltech), where he was the first Caltech faculty member to be named the Linus Pauling Chair of Chemical Physics and served as the director of the Physical Biology Center for Ultrafast Science and Technology.


== Early life and education ==
Ahmed Hassan Zewail was born on February 26, 1946, in Damanhur, Egypt, and was raised in Desouk. He received Bachelor of Science and Master of Science degrees in chemistry from Alexandria University before moving to the United States to complete his PhD at the University of Pennsylvania under the supervision of Ro

In [16]:
documents = [text_documents_1, text_documents_2]

metadata = [{"document_title": query_1}, {"document_title": query_2}]

In [17]:
documents[0] , metadata[0]

('Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as the "father of femtochemistry". He was awarded the 1999 Nobel Prize in Chemistry for his work on femtochemistry and became the first Egyptian and Arab to win a Nobel Prize in a scientific field, and also the first African to win a Nobel Prize in Chemistry. He was a professor of chemistry and physics at the California Institute of Technology (Caltech), where he was the first Caltech faculty member to be named the Linus Pauling Chair of Chemical Physics and served as the director of the Physical Biology Center for Ultrafast Science and Technology.\n\n\n== Early life and education ==\nAhmed Hassan Zewail was born on February 26, 1946, in Damanhur, Egypt, and was raised in Desouk. He received Bachelor of Science and Master of Science degrees in chemistry from Alexandria University before moving to the United States to complete his PhD at the University of Pennsylvania under the supervision

In [18]:
len(documents)

2

#### Text Splitters | Characters

In [19]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
)

normal_chunks = text_splitter.create_documents(texts=documents, metadatas=metadata)

In [20]:
len(normal_chunks)

327

In [21]:
normal_chunks[0].page_content

'Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as'

In [22]:
normal_chunks[1].page_content

'known as the "father of femtochemistry". He was awarded the 1999 Nobel Prize in Chemistry for his w'
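
With an empty separator, the splitter behaves roughly like a sliding window over characters: each chunk holds up to `chunk_size` characters and the next one starts `chunk_size - chunk_overlap` characters later. The function `split_by_characters` below is a simplified, hypothetical sketch of that idea; LangChain additionally strips whitespace at chunk edges, so its real boundaries can differ by a few characters.

```python
# Hypothetical helper: fixed-size character windows sharing `chunk_overlap` characters.
def split_by_characters(text, chunk_size, chunk_overlap):
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(250))  # 250 characters: 0123456789...
chunks = split_by_characters(sample, chunk_size=100, chunk_overlap=10)

print([len(c) for c in chunks])
print(chunks[0][-10:] == chunks[1][:10])  # the tail of one chunk reappears in the next
```

Splitting on raw character counts is cheap but cuts words in half, as the `'... for his w'` chunk above shows. The sentence-based and token-based splitters in the following sections address that.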

#### Text Splitters | NLTK


In [23]:
import nltk
from langchain_text_splitters import NLTKTextSplitter

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/mango/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
text_splitter = NLTKTextSplitter(
    chunk_size=500,
    language="english",
)

nltk_chunks = text_splitter.create_documents(texts=documents, metadatas=metadata)

Created a chunk of size 677, which is longer than the specified 500


In [25]:
print(len(nltk_chunks))
print('-' * 50)

print(nltk_chunks[0].page_content) 
print(nltk_chunks[0].metadata)
print('-' * 50)

print(nltk_chunks[1].page_content)
print(nltk_chunks[1].metadata)

91
--------------------------------------------------
Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as the "father of femtochemistry".

He was awarded the 1999 Nobel Prize in Chemistry for his work on femtochemistry and became the first Egyptian and Arab to win a Nobel Prize in a scientific field, and also the first African to win a Nobel Prize in Chemistry.
{'document_title': 'Ahmed Zewail'}
--------------------------------------------------
He was a professor of chemistry and physics at the California Institute of Technology (Caltech), where he was the first Caltech faculty member to be named the Linus Pauling Chair of Chemical Physics and served as the director of the Physical Biology Center for Ultrafast Science and Technology.

== Early life and education ==
Ahmed Hassan Zewail was born on February 26, 1946, in Damanhur, Egypt, and was raised in Desouk.
{'document_title': 'Ahmed Zewail'}
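
A plausible reading of the "Created a chunk of size 677" warning is that sentences are never cut: the splitter first breaks the text into sentences, then packs whole sentences into a chunk until the next one would exceed `chunk_size`. A single sentence longer than the limit therefore becomes an oversized chunk of its own. The sketch below illustrates this with a naive regular-expression sentence splitter as a stand-in for NLTK's tokenizer; `pack_sentences` is a hypothetical helper.

```python
import re


# Hypothetical helper: naive sentence splitter plus greedy packing.
def pack_sentences(text, chunk_size):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate  # sentence still fits (or the chunk was empty)
    if current:
        chunks.append(current)
    return chunks


text = "Short one. Another short one. " + "A very long sentence " * 5 + "ends here. Tail."
chunks = pack_sentences(text, chunk_size=40)

for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The middle chunk exceeds the 40-character limit because it consists of one long sentence, which mirrors the oversized chunk reported in the warning above.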


#### Tokens | TikToken

In [26]:
from langchain_text_splitters import TokenTextSplitter


text_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=100,
    chunk_overlap=20,
)

In [27]:
tokens_chunks = text_splitter.create_documents(
    texts = documents, metadatas=metadata
)

In [28]:
print(len(tokens_chunks))
print('-' * 50)

print(tokens_chunks[0].page_content) 
print(tokens_chunks[0].metadata)
print('-' * 50)

print(tokens_chunks[1].page_content)
print(tokens_chunks[1].metadata)

85
--------------------------------------------------
Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as the "father of femtochemistry". He was awarded the 1999 Nobel Prize in Chemistry for his work on femtochemistry and became the first Egyptian and Arab to win a Nobel Prize in a scientific field, and also the first African to win a Nobel Prize in Chemistry. He was a professor of chemistry and physics at the California Institute
{'document_title': 'Ahmed Zewail'}
--------------------------------------------------
 to win a Nobel Prize in Chemistry. He was a professor of chemistry and physics at the California Institute of Technology (Caltech), where he was the first Caltech faculty member to be named the Linus Pauling Chair of Chemical Physics and served as the director of the Physical Biology Center for Ultrafast Science and Technology.


== Early life and education ==
Ahmed Hassan Zewail was born on February 26, 1946, in Damanhur, Egy
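
Token-based splitting follows an encode, window, decode pattern: the text is turned into a token sequence, windows of `chunk_size` tokens are taken with a stride of `chunk_size - chunk_overlap`, and each window is decoded back into text. The sketch below uses a toy tokenizer (one token per word, including its leading whitespace) as a stand-in for `cl100k_base`; `toy_encode`, `toy_decode`, and `split_by_tokens` are hypothetical names.

```python
import re


def toy_encode(text):
    # Toy stand-in for a real tokenizer: each token is a word plus its leading whitespace.
    return re.findall(r"\s*\S+", text)


def toy_decode(tokens):
    return "".join(tokens)


# Hypothetical helper: windows of `chunk_size` tokens, stepping by chunk_size - chunk_overlap.
def split_by_tokens(text, chunk_size, chunk_overlap, encode, decode):
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


text = "one two three four five six seven eight nine ten"
token_chunks = split_by_tokens(
    text, chunk_size=4, chunk_overlap=1, encode=toy_encode, decode=toy_decode
)
print(token_chunks)
```

Because windows are cut on token boundaries rather than sentence boundaries, a chunk can begin with a space or end inside a word, as in the second chunk above, which ends with `Egy`.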

#### Tokens | Transformers Tokenizers


In [29]:
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

model_id = "stabilityai/stablelm-tuned-alpha-3b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [30]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=400,
    chunk_overlap=0
)

tokens_chunks = text_splitter.create_documents(
    texts = documents, metadatas=metadata
)

Created a chunk of size 518, which is longer than the specified 400
Created a chunk of size 786, which is longer than the specified 400
Created a chunk of size 701, which is longer than the specified 400


In [31]:
print(len(tokens_chunks))
print("-" * 50)

print(tokens_chunks[0].page_content)
print(tokens_chunks[0].metadata)
print("-" * 50)

print(tokens_chunks[1].page_content)
print(tokens_chunks[1].metadata)

18
--------------------------------------------------
Ahmed Hassan Zewail (February 26, 1946 – August 2, 2016) was an Egyptian-American chemist, known as the "father of femtochemistry". He was awarded the 1999 Nobel Prize in Chemistry for his work on femtochemistry and became the first Egyptian and Arab to win a Nobel Prize in a scientific field, and also the first African to win a Nobel Prize in Chemistry. He was a professor of chemistry and physics at the California Institute of Technology (Caltech), where he was the first Caltech faculty member to be named the Linus Pauling Chair of Chemical Physics and served as the director of the Physical Biology Center for Ultrafast Science and Technology.


== Early life and education ==
Ahmed Hassan Zewail was born on February 26, 1946, in Damanhur, Egypt, and was raised in Desouk. He received Bachelor of Science and Master of Science degrees in chemistry from Alexandria University before moving to the United States to complete his PhD at the 

---
# 2. RAG Documents Embedding and Indexing

---
You will learn about:
- OpenAI Embedding
- Cohere Embedding
- HuggingFace Embedding
- Indexing using FAISS
- Indexing using ChromaDB

## Embedding

In [None]:
from langchain_community.document_loaders import WikipediaLoader

query_1 = "Albert Einstein"
query_2 = "Isaac Newton"

###########################################
docs_1 = WikipediaLoader(
    query=query_1, load_max_docs=1, doc_content_chars_max=20_000
).load()

text_documents_1 = docs_1[0].page_content

#############################################

docs_2 = WikipediaLoader(
    query=query_2, load_max_docs=1, doc_content_chars_max=20_000
).load()

text_documents_2 = docs_2[0].page_content

#############################################

documents = [text_documents_1, text_documents_2]
metadata = [{"document_title": query_1}, {"document_title": query_2}]

In [34]:
import nltk
from langchain_text_splitters import NLTKTextSplitter

nltk.download("punkt")

text_splitter = NLTKTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
)
tokens_chunks = text_splitter.create_documents(texts=documents, metadatas=metadata)

[nltk_data] Downloading package punkt to /home/mango/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Created a chunk of size 337, which is longer than the specified 300
Created a chunk of size 336, which is longer than the specified 300
Created a chunk of size 332, which is longer than the specified 300
Created a chunk of size 322, which is longer than the specified 300
Created a chunk of size 353, which is longer than the specified 300
Created a chunk of size 304, which is longer than the specified 300
Created a chunk of size 314, which is longer than the specified 300
Created a chunk of size 383, which is longer than the specified 300
Created a chunk of size 406, which is longer than the specified 300
Created a chunk of size 439, which is longer than the specified 300
Created a chunk of size 651, which is longer than the specified 300
Created a chunk of size 332, which is longer than the specified 300
Created a chunk of size 330, which is longer than the specified 300

#### Embeddings | HuggingFace

In [73]:
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": False}


hf_embeddings_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    show_progress=True
)

# Source Data
docs_text = [chunk.page_content for chunk in tokens_chunks]
docs_embeddings = hf_embeddings_model.embed_documents(docs_text)

len(docs_embeddings[0])      # 384

# query
query_text = "Can you list a number of Nikola Tesla's inventions?"
query_embedding = hf_embeddings_model.embed_query(query_text)

Batches:   0%|          | 0/6 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
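
Once documents and queries are embedded, comparing them comes down to vector arithmetic. A common measure is cosine similarity, sketched below in plain Python. The vectors are made-up three-dimensional examples (the real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_similarity` and `normalize` are hypothetical helpers, not part of the embedding class.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Made-up vectors for illustration only.
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {"doc_a": [2.0, 4.0, 0.0], "doc_b": [0.0, 0.0, 3.0]}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)

# After normalization, the plain dot product equals the cosine similarity.
q, d = normalize(query_vec), normalize(doc_vecs["doc_a"])
normalized_dot = sum(x * y for x, y in zip(q, d))
print(normalized_dot)
```

This is why the `normalize_embeddings` option matters: with unit-length vectors, dot product and cosine similarity produce the same ranking, while with unnormalized vectors (as configured above) a dot-product score also depends on vector length.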

#### Embeddings | Cohere

In [None]:
from langchain_cohere import CohereEmbeddings

from dotenv import load_dotenv
import os

_ = load_dotenv()

COHERE_API_KEY = os.getenv("COHERE_API_KEY")

co_embedding_llm = CohereEmbeddings(
    model="embed-english-v3.0", cohere_api_key=COHERE_API_KEY
)

# source data
docs_text = [ chunk.page_content  for chunk in tokens_chunks ]
docs_embeddings = co_embedding_llm.embed_documents(docs_text)

len(docs_embeddings[0])          # 1024

# query
query_text = "What is the name of Nikola Tesla's mother?"
query_embedding = co_embedding_llm.embed_query(query_text)

## Vector Stores

#### Vector Stores | FAISS

In [None]:
from langchain_community.vectorstores import FAISS

vector_db = FAISS.from_documents(documents=tokens_chunks, embedding=hf_embeddings_model)

query_text = "Albert Einstein and his contributions to physics"

similar_docs = vector_db.similarity_search(query_text)

Batches:   0%|          | 0/6 [00:00<?, ?it/s]

In [103]:
similar_docs[0].page_content

'Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity.\n\nEinstein also made important contributions to quantum theory.'
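
Conceptually, a similarity search embeds the query and returns the `k` stored vectors closest to it. The sketch below does this by brute force with Euclidean distance, which is, as far as I know, the default metric of the LangChain FAISS wrapper (treat that as an assumption). The two-dimensional vectors and the `top_k` helper are made up for illustration; FAISS exists precisely to make this lookup fast at scale.

```python
import math


# Hypothetical helper: exhaustive nearest-neighbour search by Euclidean distance.
def top_k(query_vec, indexed, k=4):
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
    return scored[:k]


# Made-up (text, vector) pairs standing in for embedded chunks.
indexed = [
    ("chunk about relativity", [0.9, 0.1]),
    ("chunk about gravity", [0.6, 0.4]),
    ("chunk about optics", [0.1, 0.9]),
]

results = top_k([1.0, 0.0], indexed, k=2)
print([text for _, text in results])
```

The call to `similarity_search(query_text)` above passes no `k`, so it relies on the library default for the number of results.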

#### Vector Stores | Chroma

In [None]:
from langchain_chroma import Chroma

# store text documents as vectors
save_to_dir = "./wiki_chroma_db"

docs_ids = list(range(len(tokens_chunks)))
docs_ids = [str(d) for d in docs_ids]

vector_db = Chroma.from_documents(
    documents = tokens_chunks,
    embedding = hf_embeddings_model,
    persist_directory=save_to_dir,
    ids=docs_ids
)

Batches:   0%|          | 0/6 [00:00<?, ?it/s]

In [None]:
# search for most similar document to a query
query_text = "Albert Einstein and his contributions to physics"

similar_docs = vector_db.similarity_search(
    query_text,
    k=5,
    filter={"document_title": "Albert Einstein"}
)

similar_docs[0].page_content

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity.\n\nEinstein also made important contributions to quantum theory.'
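
The `filter` argument restricts the search to chunks whose metadata matches, so a chunk from another article cannot be returned even if its vector is closer to the query. A minimal sketch of that filter-then-rank behaviour, using made-up records with precomputed distances and hypothetical helper names (`matches`, `filtered_search`), might look like this:

```python
# Hypothetical helper: a record matches if every filter key has the requested value.
def matches(metadata, where):
    return all(metadata.get(key) == value for key, value in where.items())


# Hypothetical helper: drop non-matching records, then rank the rest by distance.
def filtered_search(records, where, k):
    candidates = [r for r in records if matches(r["metadata"], where)]
    return sorted(candidates, key=lambda r: r["distance"])[:k]


# Made-up records; the distances are illustrative, not real search scores.
records = [
    {"id": "0", "metadata": {"document_title": "Albert Einstein"}, "distance": 0.30},
    {"id": "1", "metadata": {"document_title": "Isaac Newton"}, "distance": 0.10},
    {"id": "2", "metadata": {"document_title": "Albert Einstein"}, "distance": 0.20},
]

hits = filtered_search(records, where={"document_title": "Albert Einstein"}, k=5)
print([r["id"] for r in hits])
```

Fewer than `k` results come back when the filter leaves fewer than `k` candidates, which is worth remembering when a filtered query returns a short list.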

In [112]:
## Load from disk
load_from_dir = "./wiki_chroma_db"

loaded_vector_db = Chroma(
    persist_directory=load_from_dir,
    embedding_function=hf_embeddings_model
)

In [119]:
# search for most similar document to a query
query_text = "Albert Einstein and his contributions to physics"

similar_docs = loaded_vector_db.similarity_search(
    query_text, k=5, filter={"document_title": "Albert Einstein"}
)

print(similar_docs[0].page_content)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity.

Einstein also made important contributions to quantum theory.


---
# 3. RAG QnA Chains

---
You will learn about:
- QnA Stuff Chain
- QnA MapReduce Chain
- QnA Refine Chain
