# 1. Document Loading

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## PDFs

In [2]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/Yetiskin2020.pdf")
pages = loader.load()

In [3]:
len(pages)

18

In [4]:
page = pages[0]

In [5]:
print(page.page_content[0:500])

Full Terms & Conditions of access and use can be found at
https://www.tandfonline.com/action/journalInformation?journalCode=hppc20
Popular Communication
The International Journal of Media and Culture
ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/hppc20
Paratactic commoning: collective knowledge
production networking as political struggle
Ebru Yetiskin
To cite this article: Ebru Yetiskin (2020): Paratactic commoning: collective knowledge production
networking as politic


In [10]:
page.metadata

{'producer': 'iText 4.2.0 by 1T3XT',
 'creator': 'Arbortext Advanced Print Publisher 11.0.3433/W Unicode',
 'creationdate': '2020-06-25T12:52:06+05:30',
 'keywords': 'Commons; commoning; data; control; tactic; paratactic; obfuscation',
 'moddate': '2020-06-30T04:01:46-07:00',
 'source': 'docs/Yetiskin2020.pdf',
 'total_pages': 18,
 'page': 0,
 'page_label': '1'}
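
Each element of `pages` pairs the extracted text with a metadata dictionary, as the outputs above show. The sketch below mirrors that shape with a stand-in dataclass so the structure can be explored without a PDF on disk. `PageDoc` and `describe` are hypothetical names introduced for illustration, not LangChain classes.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for one loaded page: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe(doc: PageDoc, preview_chars: int = 40) -> str:
    """Summarize a page as 'source (page N): first characters of the text'."""
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "n/a")
    preview = doc.page_content[:preview_chars].replace("\n", " ")
    return f"{source} (page {page}): {preview}"


# Metadata values copied from the output above; the text is shortened.
first_page = PageDoc(
    page_content="Popular Communication\nThe International Journal of Media and Culture",
    metadata={"source": "docs/Yetiskin2020.pdf", "page": 0, "total_pages": 18},
)
print(describe(first_page))
```

Note that `page` is zero-based while `page_label` holds the number printed on the page, which is why the first page reports `'page': 0` and `'page_label': '1'`.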

## YouTube

In [6]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
url = "https://www.youtube.com/watch?v=vtLfCO4IGXY&ab_channel=DiEM25"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),  # download the audio from YouTube
    # FileSystemBlobLoader(save_dir, glob="*.m4a"),  # or read audio already saved locally
    OpenAIWhisperParser()  # transcribe the audio with OpenAI Whisper
)
docs = loader.load()

In [25]:
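# `docs` is reassigned by the WebBaseLoader cell in the URLs section. The output
# recorded here is that web page (this cell ran after In [23]), not the transcript.
docs[0].page_content[0:500]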

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nhandbook/titles-for-programmers.md at master · basecamp/handbook · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\nNavigation Menu\n\nToggle navigation\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n            Sign in\n          \n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\n\nGitHub Copilot\n        Write better code with AI\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \n\n\n\n\n\n\n\nActions\n        Automa'

## URLs

In [22]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")

In [23]:
docs = loader.load()

In [None]:
print(docs[0].page_content[:500])

# 2. Document Splitting

![Alt text](images/splitters.png)

### Key Differences

| Feature                           | `RecursiveCharacterTextSplitter`              | `CharacterTextSplitter`         |
|-----------------------------------|--------------------------------|-------------------------|
| **Splitting Strategy**           | Tries to break at meaningful places | Splits at a fixed character (e.g., space) |
| **Handles Word Boundaries?**      | ✅ Yes (falls back through paragraphs, lines, then words) | ❌ Only when the separator is whitespace |
| **Performance**                   | Slightly slower but better structured chunks | Fast but less optimal chunks |
| **Best for**                      | Long documents with structured text (e.g., articles, books) | Simple text that doesn’t require structure |


In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [27]:
chunk_size = 26
chunk_overlap = 4

In [35]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [31]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
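
The repeated `wxyz` in the output above is the effect of `chunk_overlap=4`: each chunk starts four characters before the previous one ended. A minimal sketch of that idea as a plain sliding window follows. `sliding_window_chunks` is a hypothetical helper for illustration, not LangChain's implementation, which also looks for separators before cutting.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_window_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

For these two inputs without separators, the sketch produces the same chunks as the recorded outputs above. For text with spaces or newlines the real splitters differ, because they cut at separators rather than at fixed offsets.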

In [32]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [36]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [37]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [38]:
len(some_text)

496

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [40]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [41]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
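
The two outputs above differ because the recursive splitter tries the coarsest separator first (`"\n\n"`) and only falls back to finer ones for pieces that are still too long. The following is a simplified sketch of that strategy, assuming no overlap and literal (non-regex) separators. `recursive_split` is a hypothetical function written for illustration and is not the library's code.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters (simplified)."""
    if len(text) <= chunk_size:
        stripped = text.strip()
        return [stripped] if stripped else []

    # Use the first separator that occurs in the text; "" means "per character".
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        # No usable separator left: fall back to a hard cut.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    remaining = separators[i + 1:]
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = "aaa bbb ccc\n\nddd eee fff ggg"
print(recursive_split(sample, 11, ["\n\n", "\n", " ", ""]))
```

The first paragraph fits in one chunk, so it is kept whole. The second is too long, so only that paragraph is split again on spaces.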

In [42]:
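# Let's reduce the chunk size a bit and add a period to our separators.
# In recent LangChain versions separators are matched literally unless
# is_separator_regex=True, so r"\. " matches nothing in this text and the
# splitter falls back to spaces, as the output below shows.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]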
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [None]:
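# Split after ". " using a lookbehind so the period stays with its sentence.
# The pattern is a regular expression, so it is written as a raw string and
# is_separator_regex=True is set to have it interpreted as one.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True
)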
r_splitter.split_text(some_text)

In [43]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/NG2023.pdf")
pages = loader.load()

In [46]:
len(pages)

41

In [44]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [47]:
docs = text_splitter.split_documents(pages)
len(docs)

80

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLM context windows are usually specified in tokens.

A token is often about 4 characters of English text.

In [48]:
from langchain.text_splitter import TokenTextSplitter

In [50]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [51]:
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
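
With `chunk_size=1` every chunk above is a single token, which shows that tokens do not line up with words: `bazzyfoo` is broken into several pieces. The sketch below illustrates only the windowing step, using a toy word-level tokenizer as a stand-in. `toy_tokenize` and `split_by_tokens` are hypothetical helpers; a real tokenizer (such as the BPE tokenizer behind `TokenTextSplitter`) produces sub-word pieces instead.

```python
import re


def toy_tokenize(text):
    """Split text into word-like tokens, keeping a leading space with each word.

    This is a toy stand-in for illustration, not a real BPE tokenizer.
    """
    return re.findall(r"\s?\w+|\s?[^\w\s]+|\s+", text)


def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Group tokens into windows of chunk_size tokens and join them back to text."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the remaining tokens
    return chunks


print(split_by_tokens("foo bar bazzyfoo", chunk_size=1))
print(split_by_tokens("foo bar bazzyfoo", chunk_size=2))
```

Because the chunk length is counted in tokens, the character length of each chunk varies, which is the point of splitting this way when the limit that matters is a model's token budget.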

In [52]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

In [53]:
docs[0]

Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 18.0 (Macintosh)', 'creationdate': '2022-12-13T16:08:00-05:00', 'moddate': '2022-12-13T16:08:04-05:00', 'trapped': '/False', 'source': 'docs/NG2023.pdf', 'total_pages': 41, 'page': 0, 'page_label': '1'}, page_content='PAGE 1\nFounder, DeepLearning.')

In [54]:
docs[5]

Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 18.0 (Macintosh)', 'creationdate': '2022-12-13T16:08:00-05:00', 'moddate': '2022-12-13T16:08:04-05:00', 'trapped': '/False', 'source': 'docs/NG2023.pdf', 'total_pages': 41, 'page': 1, 'page_label': '2'}, page_content='\nelectricity. It will \ntransform and')

In [55]:
pages[0].metadata

{'producer': 'Adobe PDF Library 17.0',
 'creator': 'Adobe InDesign 18.0 (Macintosh)',
 'creationdate': '2022-12-13T16:08:00-05:00',
 'moddate': '2022-12-13T16:08:04-05:00',
 'trapped': '/False',
 'source': 'docs/NG2023.pdf',
 'total_pages': 41,
 'page': 0,
 'page_label': '1'}

In [56]:
pages[10].metadata

{'producer': 'Adobe PDF Library 17.0',
 'creator': 'Adobe InDesign 18.0 (Macintosh)',
 'creationdate': '2022-12-13T16:08:00-05:00',
 'moddate': '2022-12-13T16:08:04-05:00',
 'trapped': '/False',
 'source': 'docs/NG2023.pdf',
 'total_pages': 41,
 'page': 10,
 'page_label': '11'}
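
The outputs above show that each chunk keeps the metadata of the page it was cut from (for example, `docs[5]` reports `'page': 1`). A minimal sketch of that propagation, using plain dictionaries in place of `Document` objects, could look as follows. `split_pages` is a hypothetical helper and the fixed-width cut is a simplification.

```python
import copy


def split_pages(pages, chunk_size):
    """Cut each page into fixed-size chunks, copying the page's metadata to each."""
    chunks = []
    for page in pages:
        text = page["page_content"]
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "page_content": text[start:start + chunk_size],
                # Copy so that editing one chunk's metadata does not affect others.
                "metadata": copy.deepcopy(page["metadata"]),
            })
    return chunks


# Made-up page texts; only the metadata keys mirror the outputs above.
pages_demo = [
    {"page_content": "abcdefgh", "metadata": {"source": "docs/NG2023.pdf", "page": 0}},
    {"page_content": "ijkl", "metadata": {"source": "docs/NG2023.pdf", "page": 1}},
]
for chunk in split_pages(pages_demo, chunk_size=3):
    print(chunk["metadata"]["page"], repr(chunk["page_content"]))
```

Keeping the page number on every chunk is what later allows a retrieved chunk to be traced back to its location in the source file.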

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""

In [59]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [60]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [61]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [62]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
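
The metadata in the two chunks above reflects the headers that were "open" when the text appeared. One way to picture this is a stack of active headers that is trimmed whenever a header of the same or a higher level starts. The sketch below is a simplified illustration under that assumption. `split_by_headers` is a hypothetical function, not the `MarkdownHeaderTextSplitter` source, and it returns plain dictionaries.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group Markdown text under its enclosing headers (simplified sketch)."""
    # Check longer prefixes first so "###" is not mistaken for "#".
    by_length = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = []   # stack of (level, metadata key, title) for the open headers
    buffer = []   # text lines collected since the last header
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, name, title in active}
            chunks.append({"metadata": metadata, "content": content})
        buffer.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for prefix, name in by_length:
            if line.startswith(prefix + " "):
                flush()
                level = len(prefix)
                # Close headers at the same or a deeper level.
                while active and active[-1][0] >= level:
                    active.pop()
                active.append((level, name, line[len(prefix):].strip()))
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return chunks


markdown_document = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
chunks = split_by_headers(markdown_document, headers)
for chunk in chunks:
    print(chunk)
```

When `## Chapter 2` starts, both `Chapter 1` and `Section` are closed, so the last chunk carries only `Header 1` and the new `Header 2`.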

# 3. Vectorstores and Embeddings

![overview.png](images/overview.png)

![](images/vectorstore.png)

In [2]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Source PDFs to index
    PyPDFLoader("docs/Geron2017.pdf"),
    PyPDFLoader("docs/Zheng2018.pdf"),
    PyPDFLoader("docs/NG2018.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [5]:
print(f"Total documents loaded: {len(docs)}")
print(f"First document: {docs[0] if docs else 'No documents loaded'}")

Total documents loaded: 944
First document: page_content='Aurélien Géron
Hands-On  
Machine Learning  
with Scikit-Learn  
& TensorFlow  
CONCEPTS, TOOLS, AND TECHNIQUES  
TO BUILD INTELLIGENT SYSTEMS
 D o w n l o a d   f r o m   f i n e l y b o o k   w w w . f i n e l y b o o k . c o m' metadata={'producer': 'Antenna House PDF Output Library 6.2.609 (Linux64)', 'creator': 'AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)', 'creationdate': '2017-03-10T21:55:34+00:00', 'author': 'Aurélien Géron', 'moddate': '2017-05-16T09:54:54+08:00', 'title': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow', 'trapped': '/False', 'source': 'docs/Geron2017.pdf', 'total_pages': 564, 'page': 0, 'page_label': 'Cover'}


In [6]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [7]:
splits = text_splitter.split_documents(docs)

In [10]:
print(f"Total Chunks: {len(splits)}")
print(f"Fourth Chunk: {splits[3].page_content}")  # splits[3] is the fourth chunk

Total Chunks: 1605
Fourth Chunk: 978-1-491-96229-9
[LSI]
Hands-On Machine Learning with Scikit-Learn and TensorFlow
by Aurélien Géron
Copyright © 2017 A
urélien Géron. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐
tutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: N
icholas Adams
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Wendy Catalano
Interior Designer: Da
vid Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
March 2017:  First Edition
Revision History for the First Edition
2017-03-10: First Release
See h
ttp://oreilly.com/catalog/errata.csp?isbn=9781491962299 for release de

In [17]:
print(f"Type of first element in splits: {type(splits[0])}")

Type of first element in splits: <class 'langchain_core.documents.base.Document'>


## Embeddings

In [11]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=openai.api_key)



In [6]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [7]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [9]:
import numpy as np

In [10]:
np.dot(embedding1, embedding2)

0.9631510802407727

In [11]:
np.dot(embedding1, embedding3)

0.7702031204123162

In [12]:
np.dot(embedding2, embedding3)

0.759053971445477
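
The dot products above rank the two dog sentences as closest. A dot product is a fair similarity score only when the vectors have the same length. If the embeddings are unit-length (OpenAI states this for its embedding models), the dot product equals the cosine similarity. The sketch below checks that relationship on small made-up vectors. The helper names are hypothetical and the numbers are not real embeddings.

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, independent of their lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors for illustration only.
v1, v2 = [3.0, 4.0, 0.0], [4.0, 3.0, 1.0]
print(dot(v1, v2))                        # depends on vector length
print(cosine_similarity(v1, v2))          # length-independent
print(dot(normalize(v1), normalize(v2)))  # matches the cosine similarity
```

For vectors that are not normalized, `cosine_similarity` is the safer choice, since a long vector would otherwise score high against everything.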

## Vectorstores

In [34]:
from langchain.vectorstores import Chroma

In [35]:
persist_directory = 'docs/chroma/'

In [36]:
!rm -rf ./docs/chroma  # remove old database files if any

In [37]:

# Keep only the chunks that contain non-whitespace text
documents = [doc for doc in splits if doc.page_content.strip()]


In [38]:
print(f"Total valid documents: {len(documents)}")
print(f"First document content: {documents[0].page_content[:200]}")  # Print first 200 characters
print(f"First document metadata: {documents[0].metadata}")


Total valid documents: 1605
First document content: Aurélien Géron
Hands-On  
Machine Learning  
with Scikit-Learn  
& TensorFlow  
CONCEPTS, TOOLS, AND TECHNIQUES  
TO BUILD INTELLIGENT SYSTEMS
 D o w n l o a d   f r o m   f i n e l y b o o k   w w w 
First document metadata: {'producer': 'Antenna House PDF Output Library 6.2.609 (Linux64)', 'creator': 'AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)', 'creationdate': '2017-03-10T21:55:34+00:00', 'author': 'Aurélien Géron', 'moddate': '2017-05-16T09:54:54+08:00', 'title': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow', 'trapped': '/False', 'source': 'docs/Geron2017.pdf', 'total_pages': 564, 'page': 0, 'page_label': 'Cover'}


In [39]:
# Normalize to NFKD and drop characters that cannot be encoded as UTF-8 (such as lone surrogates)
import unicodedata
from langchain.schema import Document

def clean_text(text):
    return unicodedata.normalize("NFKD", text).encode("utf-8", "ignore").decode("utf-8")

documents = [
    Document(page_content=clean_text(doc.page_content), metadata=doc.metadata)
    for doc in documents
]
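
To see what `clean_text` changes, it helps to run it on a few small strings. The sketch below repeats the function so it runs on its own. The sample strings are made up for illustration.

```python
import unicodedata


def clean_text(text):
    # NFKD decomposes compatibility characters; "ignore" drops what UTF-8 cannot encode.
    return unicodedata.normalize("NFKD", text).encode("utf-8", "ignore").decode("utf-8")


samples = {
    "ligature": "\ufb01ne",         # the single-character "fi" ligature followed by "ne"
    "accent": "G\u00e9ron",         # precomposed e with acute accent
    "lone surrogate": "a\ud800b",   # a code point that UTF-8 cannot encode
}
for name, raw in samples.items():
    cleaned = clean_text(raw)
    # unicode_escape makes invisible differences visible without printing raw surrogates.
    print(name, len(raw), len(cleaned), cleaned.encode("unicode_escape"))
```

One side effect worth knowing: NFKD splits accented letters into a base letter plus a combining mark, so the cleaned text can be longer than the input and no longer compares equal to the precomposed form. If that matters for later string matching, normalizing with `"NFKC"` instead would recompose the characters.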

In [None]:

vectordb = Chroma.from_documents(
    documents=documents,
    embedding=embedding,
    persist_directory=persist_directory
)

# If you get a tenant error, try persist_directory=None, which runs Chroma in memory.

In [42]:
print(vectordb._collection.count())  # Number of stored embeddings
print(vectordb.similarity_search("What is AI?", k=3))  # Test retrieval

1605
[Document(metadata={'creationdate': '', 'creator': 'PyPDF', 'page': 1, 'page_label': '2', 'producer': 'Skia/PDF m71', 'source': 'docs/NG2018.pdf', 'total_pages': 118}, page_content='Machine Learning Yearning is a\n \n \ndeeplearning.ai project.\n \n \n \n \n \n \n \n \n \n \n \n© 2018 Andrew Ng. All Rights Reserved.\n \n \n \n \nPage 2\nMachine Learning Yearning-Draft\nAndrew Ng'), Document(metadata={'author': 'Aurélien Géron', 'creationdate': '2017-03-10T21:55:34+00:00', 'creator': 'AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)', 'moddate': '2017-05-16T09:54:54+08:00', 'page': 25, 'page_label': '4', 'producer': 'Antenna House PDF Output Library 6.2.609 (Linux64)', 'source': 'docs/Geron2017.pdf', 'title': 'Hands-On Machine Learning with Scikit-Learn and TensorFlow', 'total_pages': 564, 'trapped': '/False'}, page_content='If you already know all the Machine Learning basics, you may want\nto skip directly to Chapter 2 . If you are not sure, tr\ny to answe

In [44]:
result = vectordb.similarity_search("What are neural networks?", k=3)
print(result)


[Document(metadata={'creationdate': '', 'creator': 'PyPDF', 'page': 91, 'page_label': '92', 'producer': 'Skia/PDF m71', 'source': 'docs/NG2018.pdf', 'total_pages': 118}, page_content='Neural networks are commonly used in end-to-end learning systems. The term “end-to-end”\n \nrefers to the fact that we are asking the learning algorithm to go directly from the input to\n \nthe desired output. I.e., the learning algorithm directly connects the “input end” of the\n \nsystem to the “output end.”\n \nIn problems where data is abundant, end-to-end systems have been remarkably successful.\n \nBut they are not always a good choice. The next few chapters will give more examples of\n \nend-to-end systems as well as give advice on when you should and should not use them.\n \n \n \n \n \n \n \n \n \n \n \nPage 92\nMachine Learning Yearning-Draft\nAndrew Ng'), Document(metadata={'author': 'Aurélien Géron', 'creationdate': '2017-03-10T21:55:34+00:00', 'creator': 'AH CSS Formatter V6.2 MR4 for Linux64

In [45]:
result[0].page_content

'Neural networks are commonly used in end-to-end learning systems. The term “end-to-end”\n \nrefers to the fact that we are asking the learning algorithm to go directly from the input to\n \nthe desired output. I.e., the learning algorithm directly connects the “input end” of the\n \nsystem to the “output end.”\n \nIn problems where data is abundant, end-to-end systems have been remarkably successful.\n \nBut they are not always a good choice. The next few chapters will give more examples of\n \nend-to-end systems as well as give advice on when you should and should not use them.\n \n \n \n \n \n \n \n \n \n \n \nPage 92\nMachine Learning Yearning-Draft\nAndrew Ng'
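
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force scan over a tiny in-memory list. It is an illustration only: `top_k` is a hypothetical helper, the vectors are made up, and Chroma itself relies on an index rather than comparing against every vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=3):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs with made-up 3-dimensional "embeddings".
store = [
    ("neural networks", [0.9, 0.1, 0.0]),
    ("weather report", [0.0, 0.2, 0.9]),
    ("deep learning", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]
print(top_k(query, store, k=2))
```

A full scan like this costs one comparison per stored vector for every query, which is why vector stores build an index once the collection grows.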

In [48]:
import textwrap


# Print formatted results
for i, doc in enumerate(result):
    print(f"\n🔹 **Result {i+1}:**")
    print(f"📄 Source: {doc.metadata.get('source', 'Unknown')}, Page: {doc.metadata.get('page', 'N/A')}")
    print("📝 Content:")
    print(textwrap.fill(doc.page_content, width=80))  # Wrap text to 80 characters
    print("-" * 100)  # Divider


🔹 **Result 1:**
📄 Source: docs/NG2018.pdf, Page: 91
📝 Content:
Neural networks are commonly used in end-to-end learning systems. The term “end-
to-end”   refers to the fact that we are asking the learning algorithm to go
directly from the input to   the desired output. I.e., the learning algorithm
directly connects the “input end” of the   system to the “output end.”   In
problems where data is abundant, end-to-end systems have been remarkably
successful.   But they are not always a good choice. The next few chapters will
give more examples of   end-to-end systems as well as give advice on when you
should and should not use them.                       Page 92 Machine Learning
Yearning-Draft Andrew Ng
----------------------------------------------------------------------------------------------------

🔹 **Result 2:**
📄 Source: docs/Geron2017.pdf, Page: 277
📝 Content:
4 In the context of Machine Learning, the phrase “neural networks” generally
refers to ANNs, not BNNs. 5 Drawing of a co