# LangChain: Chat with Your Data

Main topics:<br>
1. Retrieval Augmented Generation (RAG): a common LLM application pattern that retrieves relevant documents from an external dataset and supplies them as context - useful for asking questions about specific documents (e.g. PDFs, videos)

![Alt text](pics/rag.png)

2. Build a chatbot that answers queries based on the content of your documents rather than on information learned during training


## Document Loading

LangChain provides >80 document loaders that give access to diverse data sources, including audio and video

Accessing: 
+ Websites
+ Databases
+ YouTube
+ arXiv...

Data Types:
+ PDF
+ HTML
+ JSON
+ Word, PowerPoint...

-> Returns a list of `Document` objects

![Alt text](pics/loaders.png)

In [1]:
import os
import sys
import openai

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

### Load PDFs

Load a PDF transcript of a lecture from Andrew Ng's CS229 course

In [3]:
#pip install pypdf
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [4]:
len(pages)

# 22 pages in the pdf
# Each page is a "document"

22

Each `Document` contains `page_content` and `metadata`

In [5]:
page = pages[0]
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [6]:
page.metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
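To make the `page_content` / `metadata` pairing concrete, here is a minimal plain-Python stand-in. It is only a sketch, not LangChain's `Document` class: `SimpleDocument` and `load_pages` are hypothetical names, and the file name and page texts are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output: text plus provenance."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_pages(source, page_texts):
    """Wrap each page's text in a document that records where it came from."""
    return [
        SimpleDocument(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(page_texts)
    ]


# Made-up input: two "pages" from a hypothetical file
pages = load_pages("example.pdf", ["First page text", "Second page text"])
print(len(pages))         # one document per page
print(pages[0].metadata)  # source and zero-based page index
```

Keeping the source and page number next to the text is what later allows filtering results by metadata.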

### YouTube

In [7]:
# pip install yt_dlp
# pip install pydub
# conda install -c conda-forge ffmpeg
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser # speech-to-text model 
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader # load audio file from YT vid

In [87]:
# Save YT vid in specified directory
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"

loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.76MiB


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location

In [None]:
docs[0].page_content[0:500]

### URLs

In [9]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [11]:
docs = loader.load()

In [12]:
print(docs[0].page_content[:500])

# Output (runs of blank lines collapsed; the raw page text is mostly whitespace and GitHub navigation):
handbook/37signals-is-you.md at master · basecamp/handbook · GitHub
Skip to content
Toggle navigation
            Sign up
        Product
Actions
        Automate any workflow
Packages
        Host and manage packages
Security
        Find and fix vulnerabilities
Code

### Notion

Follow the steps at https://python.langchain.com/docs/integrations/document_loaders/notion for an example Notion site such as this one: https://www.notion.so/Blendle-s-Employee-Handbook-41828bca9bb043e9ab39e12adc9799c8

+ Duplicate the page into your own Notion space and export as Markdown / CSV.
+ Unzip it and save it as a folder that contains the markdown file for the Notion page.

In [13]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [14]:
print(docs[0].page_content[0:200])

# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that


In [15]:
docs[0].metadata

{'source': "docs\\Notion_DB\\Blendle's Employee Handbook 41828bca9bb043e9ab39e12adc9799c8.md"}

## Document Splitting

Split documents into smaller chunks so that retrieval can return only the most relevant pieces

- Aim for chunks that are semantically coherent and meaningful on their own

- Text splitters in LangChain split text into chunks with a specified chunk size and some overlap
  + Chunk size: the maximum size of a chunk, measured by a length function - units are characters or tokens
  + Chunk overlap: the same piece of text appears at the end of one chunk and at the start of the next, which keeps context consistent across chunk boundaries
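Chunk size and overlap can be sketched as a sliding window over characters. This is a simplified illustration, not LangChain's splitter (which also respects separators); `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts with `wxyz`, the last 4 characters of the first chunk, which is the overlap.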

![Alt text](pics/splitters.png)

![Alt text](pics/splitter_type.png)


### Character Split

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [17]:
chunk_size = 26
chunk_overlap = 4

In [18]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Split on a single character - by default: split on newline character
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [23]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)
# doesn't split text above since the text length = chunk size = 26

['abcdefghijklmnopqrstuvwxyz']

In [24]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

In [25]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
# spaces count as characters

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [26]:
c_splitter.split_text(text3)
# Doesn't even split
# character splitter splits on a single character - default: newline character
# no newline in the text, thus, no split

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [27]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '  # split on a single space
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

#### Recursive Split

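`RecursiveCharacterTextSplitter` is recommended for generic text: it tries a list of separators in order (paragraph breaks, then line breaks, then spaces, then individual characters), so larger units stay together wherever they fit in the chunk size.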
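The idea of falling back through a hierarchy of separators can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: it has no overlap and drops the separator at each cut, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "para one is here.\n\npara two is a bit longer than that."
print(recursive_split(sample, chunk_size=40))
```

With a chunk size of 40 the sample splits at the paragraph break; neither paragraph needs to be cut on spaces.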

In [28]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [29]:
len(some_text)

496

In [30]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""] 
)

In [31]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [32]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [33]:
# Reduce the chunk size further and add a period separator, i.e. try to split between sentences
# A raw string avoids Python's invalid-escape warning for "\."; the string value is unchanged

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [34]:
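# Use a lookbehind so that the period stays with the preceding sentence
# Note: the output is identical to the previous cell and still breaks mid-sentence on spaces,
# which suggests the separators were matched literally here. In LangChain releases that provide
# `is_separator_regex`, passing is_separator_regex=True should enable regex matching.

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)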
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

##### Split on PDF

In [35]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [36]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [37]:
docs = text_splitter.split_documents(pages)

In [38]:
len(docs)

77

In [39]:
len(pages)

# More documents than pages after the split

22

##### Split on Notion

In [44]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [45]:
docs = text_splitter.split_documents(notion_db)

In [46]:
len(notion_db)

1

In [47]:
len(docs)

# More documents after the split

7

### Token Split

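LLM context windows are usually specified in tokens rather than characters, so splitting on tokens matches the model's limits more closely. A token is often ~4 characters of English text.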
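A rough sketch of token-based splitting follows. The tokenizer here is a toy that treats each whitespace-delimited word (with its leading whitespace) as one token; real tokenizers split words into sub-word pieces, so the counts differ. `toy_tokenize`, `split_by_tokens` and `estimate_tokens` are hypothetical names, and any trailing whitespace in the input is dropped.

```python
import math
import re


def estimate_tokens(text):
    """Rule-of-thumb estimate: roughly 4 characters per token."""
    return math.ceil(len(text) / 4)


def toy_tokenize(text):
    """Toy tokenizer: one token per word, keeping the whitespace before it."""
    return re.findall(r"\s*\S+", text)


def split_by_tokens(text, chunk_size, chunk_overlap=0, tokenize=toy_tokenize):
    """Group tokens into chunks of chunk_size tokens with chunk_overlap shared tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the remaining tokens
    return chunks


print(split_by_tokens("foo bar bazzyfoo", chunk_size=1))
print(split_by_tokens("a b c d e", chunk_size=3, chunk_overlap=1))
```

Swapping in a model-specific tokenizer for `tokenize` would make the chunk sizes line up with that model's token counts.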

In [48]:
from langchain.text_splitter import TokenTextSplitter

In [49]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [52]:
text1 = "foo bar bazzyfoo"

text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [53]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [57]:
docs = text_splitter.split_documents(pages)

docs[4]

Document(page_content=' is ju st spend a little time going over the', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})

In [59]:
pages[0].metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

### Context Aware Split
Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [60]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [61]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [62]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [64]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [65]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

In [66]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
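The header-tracking behaviour above can be approximated in plain Python. This is a sketch of the idea only, not `MarkdownHeaderTextSplitter`'s actual implementation; `split_markdown_by_headers` is a hypothetical name and chunks are plain dicts rather than `Document` objects.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group body lines under the headers currently in effect and record them as metadata."""
    # Check longer prefixes first so "###" is not mistaken for "#"
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}  # header level -> (metadata key, header title)
    chunks, buffer = [], []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"page_content": content, "metadata": metadata})
        buffer.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        for prefix, name in headers:
            if line.startswith(prefix + " "):
                flush()  # a new header closes the chunk collected so far
                level = len(prefix)
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]  # a new header resets same-level and deeper headers
                active[level] = (name, line[len(prefix):].strip())
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return chunks


markdown_text = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
splits = split_markdown_by_headers(
    markdown_text, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for chunk in splits:
    print(chunk)
```

Each chunk carries the full header path it sits under, so a retriever can later filter or cite by chapter and section.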

#### Split Notion 

In [67]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [68]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [69]:
md_header_splits = markdown_splitter.split_text(txt)

In [70]:
md_header_splits[0]

Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn fro

## Embeddings & Vector Stores

Check notebook 2

## Retrieval

Check notebook 3

Retrieval Augmented Generation (RAG): a common LLM application pattern that retrieves relevant documents from an external dataset and supplies them as context - useful for asking questions about specific documents (e.g. PDFs, videos)

![Alt text](pics/rag.png)


1. Methods for accessing/indexing data in the vector store
- Basic semantic similarity

- Maximum marginal relevance (MMR)
  + We don't always want only the most similar results; more diverse results can add information

  + MMR algorithm: 
    Query the vector store -> choose the `fetch_k` most similar responses -> within those responses, choose the `k` most diverse

- Including metadata

2. LLM Aided Retrieval

- SelfQuery: use an LLM to split the user question into two parts, a metadata filter and a search term - useful for filtering results based on metadata

![Alt text](pics/selfquery.png)

- Compression: run all retrieved documents through an LLM and extract the most relevant segments 
  -> pass only those segments into the final language model call

![Alt text](pics/compress.png)



In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

## Vectorstore retrieval

### Similarity Search
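Basic semantic similarity ranks stored vectors by how close they are to the query vector. A minimal sketch, assuming tiny hand-made 2-D vectors in place of real embeddings; `cosine_similarity` and `similarity_search` are hypothetical helpers, not a vector store API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings": documents 0 and 1 point roughly the same way as the query
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(similarity_search([1.0, 0.05], doc_vecs, k=2))  # [0, 1]
```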



### Maximum marginal relevance: Addressing Diversity

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
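The fetch-then-diversify steps of MMR can be sketched as a greedy selection. This is an illustration under assumptions, not a vector store's implementation: the vectors are made-up 2-D directions, `mmr_select` and `unit` are hypothetical helpers, and `lambda_mult` is this sketch's weight between relevance (1.0) and diversity (0.0).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Take the fetch_k most similar vectors, then greedily pick k that balance relevance and diversity."""
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Penalise candidates that resemble something already selected
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def unit(degrees):
    """2-D unit vector at the given angle, used as a made-up embedding."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


doc_vecs = [unit(10), unit(12), unit(-40), unit(90)]  # 0 and 1 are near-duplicates
query = unit(0)
print(mmr_select(query, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, doc_vecs, k=2, fetch_k=3, lambda_mult=1.0))  # [0, 1]
```

With equal weighting the near-duplicate (index 1) is skipped in favour of the less similar but more diverse index 2; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.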

### Working with metadata: Addressing Specificity

Many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
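As a rough illustration of what a metadata filter does, the sketch below keeps only chunks whose metadata matches every requested key. It uses plain dicts and a hypothetical `filter_by_metadata` helper; real vector stores apply such filters inside the search call, and the file names are made up.

```python
def filter_by_metadata(chunks, where):
    """Keep chunks whose metadata equals `where` on every key in `where`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]


# Made-up chunks for illustration
chunks = [
    {"page_content": "intro to regression", "metadata": {"source": "lecture01.pdf", "page": 0}},
    {"page_content": "more on regression", "metadata": {"source": "lecture03.pdf", "page": 4}},
]
hits = filter_by_metadata(chunks, {"source": "lecture03.pdf"})
print([chunk["page_content"] for chunk in hits])  # ['more on regression']
```

Both chunks mention regression, so similarity alone could return either; the filter restricts results to the requested source.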

### LLM-aided Retrieval

#### Self-query retriever: Using metadata

An interesting challenge is that we often want to infer the metadata filter from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
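The two outputs of a self-query step can be pictured with a toy extractor. A regular expression stands in for the LLM here, which is a deliberate simplification: `StructuredQuery` and `toy_self_query` are hypothetical names, the "in lecture N" pattern is an assumed phrasing, and the generated `source` path for lectures other than the one loaded earlier is an assumption about file naming.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    query: str                                   # search term for the vector search
    filter: dict = field(default_factory=dict)   # metadata filter to pass along


def toy_self_query(question):
    """Regex stand-in for the LLM: pull 'in/from lecture N' out of the question as a filter."""
    match = re.search(r"\b(?:in|from) lecture (\d+)\b", question, flags=re.IGNORECASE)
    if not match:
        return StructuredQuery(query=question.strip())
    number = int(match.group(1))
    remainder = question[:match.start()] + question[match.end():]
    search_term = re.sub(r"\s+", " ", remainder).strip()
    # Assumed naming scheme, modelled on the PDF path used earlier in these notes
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    return StructuredQuery(query=search_term, filter={"source": source})


result = toy_self_query("What did they say about regression in lecture 3?")
print(result.query)
print(result.filter)
```

An LLM-based version handles free-form phrasing and richer filters; the shape of the output (a search term plus a filter) is the part this sketch is meant to show.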

#### Compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
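The effect of compression can be illustrated with a keyword-overlap stand-in for the LLM extractor: keep only the sentences that share a non-stopword term with the query. This is a toy sketch, not how an LLM-based compressor decides relevance; `compress_document`, `content_words` and the small `STOPWORDS` set are assumptions made for the example.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "of", "to", "in", "on"}


def content_words(text):
    """Lower-cased words of the text, minus stopwords."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS


def compress_document(text, query):
    """Keep only sentences that share at least one content word with the query."""
    query_terms = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s for s in sentences if content_words(s) & query_terms)


doc = (
    "The course meets on Mondays. "
    "Linear regression fits a line to data. "
    "Homework is due weekly."
)
print(compress_document(doc, "What is regression?"))
# Linear regression fits a line to data.
```

Only the extracted sentence would be passed to the final LLM call, which shortens the prompt and removes unrelated text.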

### Combining various techniques

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents. 

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
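To show how retrieval can work without embeddings, here is a small TF-IDF ranking sketch in plain Python. It is an illustration only, not LangChain's TF-IDF retriever; `tfidf_rank` is a hypothetical name, the smoothed IDF formula is one common variant chosen for this example, and the documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices ordered by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms never seen in the corpus are ignored
        return {term: counts[term] * idf[term] for term in counts if term in idf}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


docs = [
    "the cat sat on the mat",
    "gradient descent minimizes the loss",
    "dogs chase cats",
]
ranking = tfidf_rank("gradient descent", docs)
print(docs[ranking[0]])  # gradient descent minimizes the loss
```

Term-based ranking like this matches exact words only, whereas embedding search can also match paraphrases.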