# [Introduction To LangChain](https://docs.langchain.com/docs/)

<br>

- LangChain is a framework for developing applications powered by language models.

## Topics Covered

1. Document Loading
2. Document Splitting
3. Using LangChain to create output parsers.
4. Using LangChain to create memory.
5. Using LangChain to create chains.

## Installation

```sh
pip install langchain

# OR
pip install 'langchain[all]'

# Other dependencies
pip install python-dotenv
pip install openai
```

In [1]:
# Built-in library
import itertools
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
from pprint import pprint
import pandas as pd

# Visualization
import matplotlib.pyplot as plt


# pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black
# auto reload imports
%load_ext autoreload
%autoreload 2

## 1. Document Loading

### Retrieval augmented generation

```text
- In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.
- This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
```
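
The retrieval step described above can be illustrated with a deliberately simplified sketch. It assumes a toy word-overlap score in place of the embeddings and vector store that a real RAG pipeline would normally use; `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names, not LangChain APIs.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents, key=lambda d: len(q_words & tokenize(d)), reverse=True
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved documents in front of the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up example documents (assumption, for illustration only)
documents = [
    "CS229 is a machine learning class taught at Stanford.",
    "The handbook describes how the company hires people.",
    "Text splitters break documents into chunks.",
]
question = "Who teaches the machine learning class?"

context = retrieve(question, documents, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

The prompt built here is what would be sent to the LLM; the model call itself is left out so the sketch runs without an API key.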

<br>

[![image.png](https://i.postimg.cc/bJpQfq5y/image.png)](https://postimg.cc/nsSsvfZg)

```sh
# Install dependency
pip install pypdf
```


In [3]:
from langchain.document_loaders import PyPDFLoader


fp = "../../data/cs229-data/MachineLearning-Lecture01.pdf"
loader = PyPDFLoader(file_path=fp)
pages = loader.load()

# Each page is a Document.
# A Document contains text (page_content) and metadata.
len(pages)

22

In [4]:
first_page = pages[0]

# First 500 characters
pprint(first_page.page_content[:500])

('MachineLearning-Lecture01  \n'
 'Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \n'
 'learning class. So what I wanna do today is ju st spend a little time going '
 'over the logistics \n'
 "of the class, and then we'll start to  talk a bit about machine learning.  \n"
 "By way of introduction, my name's  Andrew Ng and I'll be instru ctor for "
 'this class. And so \n'
 "I personally work in machine learning, and I' ve worked on it for about 15 "
 'years now, and \n'
 'I actually think that machine learning i')


In [5]:
len(first_page.page_content[:500])

500

#### YouTube

```sh
pip install yt_dlp
pip install pydub
```

In [6]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [7]:
# Andrew Ng Stanford lecture
URL = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
SAVE_DIR = "../../data/docs/youtube/"
loader = GenericLoader(YoutubeAudioLoader([URL], SAVE_DIR), OpenAIWhisperParser())

In [8]:
# This takes a while to complete!
docs = loader.load()
docs[0].page_content[0:500]

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading iframe API JS
[youtube] jGwO_UgTS7I: Downloading player 0e6aaa83
[youtube] jGwO_UgTS7I: Downloading web player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[youtube] jGwO_UgTS7I: Downloading initial data API JSON
[youtube] jGwO_UgTS7I: Downloading initial data API JSON
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] ../../data/docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.76MiB
[ExtractAudio] Not converting audio ../../data/docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 2!
Transcribing part 3!
Transcribing part 4!


"Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s"

#### URLs


In [9]:
from langchain.document_loaders import WebBaseLoader


URL = "https://github.com/basecamp/handbook/blob/master/37signals-is-you.md"
loader = WebBaseLoader(URL)
docs = loader.load()

# Print the first 500 characters of the doc.
# The raw page text is mostly blank lines and GitHub navigation boilerplate.
print(docs[0].page_content[:500])

handbook/37signals-is-you.md at master · basecamp/handbook · GitHub
Skip to content
Toggle navigation
Sign up
Product
Actions
Automate any workflow
Packages
Host and manage packages
Security
Find and fix vulnerabilities
Codes


## 2. Document Splitting

In [10]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)


chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [11]:
text1 = "abcdefghijklmnopqrstuvwxyz"  # 26 chars
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [12]:
text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"  # 34 chars
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

In [13]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [14]:
# The default separator is "\n\n", which text3 does not contain, so nothing is split.
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [15]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=" "
)

c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
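
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of greedy merging with overlap. It is written in the spirit of what the splitters above appear to do, not copied from LangChain; `merge_with_overlap` is a hypothetical name.

```python
def merge_with_overlap(pieces, sep, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters,
    carrying up to chunk_overlap trailing characters into the next chunk."""
    chunks = []
    current = []  # pieces in the chunk being built
    total = 0  # always equals len(sep.join(current))
    for piece in pieces:
        extra = len(piece) + (len(sep) if current else 0)
        if current and total + extra > chunk_size:
            chunks.append(sep.join(current))
            # Drop leading pieces until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while current and (
                total > chunk_overlap
                or total + len(piece) + len(sep) > chunk_size
            ):
                removed = current.pop(0)
                total -= len(removed) + (len(sep) if current else 0)
        current.append(piece)
        total += len(piece) + (len(sep) if len(current) > 1 else 0)
    if current:
        chunks.append(sep.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
chunks = merge_with_overlap(text3.split(" "), " ", chunk_size=26, chunk_overlap=4)
print(chunks)
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Each chunk after the first starts with the last few pieces of the previous one (`l m`, then `w x`), which is the overlap visible in the output above.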

In [17]:
# Recursive splitting details
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
print(len(some_text))

c_splitter = CharacterTextSplitter(chunk_size=450, chunk_overlap=0, separator=" ")
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
)

496


In [22]:
result = c_splitter.split_text(some_text)
pprint(result)
len(result)

['When writing documents, writers will use document structure to group '
 "content. This can convey to the reader, which idea's are related. For "
 'example, closely related ideas are in sentances. Similar ideas are in '
 'paragraphs. Paragraphs form a document. \n'
 '\n'
 ' Paragraphs are often delimited with a carriage return or two carriage '
 'returns. Carriage returns are the "backslash n" you see embedded in this '
 'string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']


2

In [24]:
print(f"Split_1: {result[0]}\n\nSplit_2: {result[1]}")

Split_1: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. 

 Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,

Split_2: have a space.and words are separated by space.


In [25]:
result = r_splitter.split_text(some_text)
pprint(result)
len(result)

['When writing documents, writers will use document structure to group '
 "content. This can convey to the reader, which idea's are related. For "
 'example, closely related ideas are in sentances. Similar ideas are in '
 'paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage '
 'returns. Carriage returns are the "backslash n" you see embedded in this '
 'string. Sentences have a period at the end, but also, have a space.and words '
 'are separated by space.']


2

In [26]:
print(f"Split_1: {result[0]}\n\nSplit_2: {result[1]}")

Split_1: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.

Split_2: Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.
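
The recursive idea can be sketched without LangChain: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This is a minimal illustration under simplifying assumptions (no overlap, separators are dropped rather than kept); `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator; recurse with the remaining
    separators on any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "one two three\n\nfour five six seven"
parts = recursive_split(sample, ["\n\n", " ", ""], chunk_size=15)
print(parts)
# ['one two three', 'four five six', 'seven']
```

The first paragraph fits within 15 characters and is kept whole; only the second, longer paragraph is split further on spaces.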


In [33]:
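# Let's reduce the chunk size a bit and add a period to our separators.
# The raw string avoids Python's invalid-escape warning for "\. ".
# Newer LangChain releases treat separators as literal text unless
# is_separator_regex=True is passed; the output below assumes regex matching.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""],
)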

result = r_splitter.split_text(some_text)
pprint(result)
len(result)

['When writing documents, writers will use document structure to group '
 "content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in '
 'paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage '
 'returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are '
 'separated by space.']


5

In [34]:
print(
    f"\nSplit_1: {result[0]}\n\nSplit_2: {result[1]}\n\nSplit_3: {result[2]}"
    f"\n\nSplit_4: {result[3]}\n\nSplit_5: {result[4]}"
)


Split_1: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related

Split_2: . For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.

Split_3: Paragraphs are often delimited with a carriage return or two carriage returns

Split_4: . Carriage returns are the "backslash n" you see embedded in this string

Split_5: . Sentences have a period at the end, but also, have a space.and words are separated by space.


In [35]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    # The lookbehind keeps the period at the end of the preceding chunk.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
)

result = r_splitter.split_text(some_text)
print(len(result))

print(
    f"\nSplit_1: {result[0]}\n\nSplit_2: {result[1]}\n\nSplit_3: {result[2]}"
    f"\n\nSplit_4: {result[3]}\n\nSplit_5: {result[4]}"
)

5

Split_1: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.

Split_2: For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.

Split_3: Paragraphs are often delimited with a carriage return or two carriage returns.

Split_4: Carriage returns are the "backslash n" you see embedded in this string.

Split_5: Sentences have a period at the end, but also, have a space.and words are separated by space.
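
The difference between the two period separators can be reproduced with Python's `re` module alone. The sketch below uses a made-up sentence and assumes the splitter keeps each matched separator at the start of the following piece, which is consistent with the `. For example` output seen earlier.

```python
import re

text = "First idea. Second idea. Third idea."

# Plain pattern: the matched ". " is consumed and lost.
plain = re.split(r"\. ", text)
print(plain)
# ['First idea', 'Second idea', 'Third idea.']

# Capturing group: the separator is returned as its own item, so it can be
# re-attached to the FOLLOWING piece (period ends up at the chunk start).
parts = re.split(r"(\. )", text)
attached = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
print(attached)
# ['First idea', '. Second idea', '. Third idea.']

# Lookbehind: the match is zero-width and sits AFTER ". ", so the period
# stays with the preceding sentence.
lookbehind = re.split(r"(?<=\. )", text)
print(lookbehind)
# ['First idea. ', 'Second idea. ', 'Third idea.']
```

Splitting on a zero-width match requires Python 3.7 or later.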


In [4]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the PDF
fp = "../../data/cs229-data/MachineLearning-Lecture01.pdf"
loader = PyPDFLoader(file_path=fp)
pages = loader.load()

# Split the document
text_splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=150, length_function=len
)
docs = text_splitter.split_documents(pages)
print(f"Length of split docs: {len(docs)}\nLength of original docs: {len(pages)}")

Length of split docs: 77
Length of original docs: 22
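
Splitting produces more documents than pages because each page is cut into several chunks, and each chunk carries its page's metadata. A minimal sketch of that idea, assuming a stand-in `Doc` dataclass and a naive fixed-width `split_docs` helper (both hypothetical, not LangChain classes):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, chunk_size):
    """Cut each document into fixed-width chunks, copying its metadata."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start : start + chunk_size]
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


# Made-up pages (assumption, for illustration only)
pages_demo = [
    Doc("a" * 25, {"source": "demo.pdf", "page": 0}),
    Doc("b" * 10, {"source": "demo.pdf", "page": 1}),
]
chunks_demo = split_docs(pages_demo, chunk_size=10)

print(len(pages_demo), len(chunks_demo))  # 2 4
print([c.metadata["page"] for c in chunks_demo])  # [0, 0, 0, 1]
```

Because the metadata is copied onto every chunk, each chunk can still be traced back to its source file and page.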


<br>

### Notion Database

In [19]:
from langchain.document_loaders import NotionDirectoryLoader

fp = "../../data/Notion_DB"
loader = NotionDirectoryLoader(path=fp)
notion_db = loader.load()
docs = text_splitter.split_documents(notion_db)
len(notion_db)
len(docs)

308

#### Note

```text
Token splitting
- We can also split on token count explicitly, if we want.
- This can be useful because LLMs often have context windows designated in tokens.
- Tokens are often ~4 characters.
```
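
The "~4 characters per token" rule of thumb gives a quick way to estimate whether a text fits a context window without running a tokenizer. The sketch below is only an approximation; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and real tokenizers will give different counts.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the note above; real tokenizers vary


def estimate_tokens(text):
    """Rough token count based on character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, context_window):
    """True if the estimated token count is within the window."""
    return estimate_tokens(text) <= context_window


sample = "foo bar bazzyfoo"  # 16 characters
print(estimate_tokens(sample))  # 4
print(fits_in_context(sample, context_window=3))  # False
```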

In [20]:
from langchain.text_splitter import TokenTextSplitter


# chunk_size=1 puts every token in its own chunk
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

# Split the PDF pages into chunks of at most 10 tokens
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

pprint(docs[0])
print()
# Metadata of the docs
pprint(pages[0].metadata)

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': '../../data/cs229-data/MachineLearning-Lecture01.pdf', 'page': 0})

{'page': 0, 'source': '../../data/cs229-data/MachineLearning-Lecture01.pdf'}


#### Note

```text
Context aware splitting
- Chunking aims to keep text with common context together.
- Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.
- We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.
```

In [21]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
pprint(md_header_splits[0])

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})


In [22]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
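
The header-tracking behaviour seen in these outputs can be approximated in plain Python: remember the most recent header at each level and attach that state to every block of text. This is a simplified sketch, not the library's implementation; `split_by_headers` is a hypothetical name and chunks are plain dicts rather than `Document` objects.

```python
def split_by_headers(markdown, headers):
    """Group text under the Markdown headers currently in effect.

    headers is a list of (prefix, name) pairs such as ("#", "Header 1").
    """
    # Check longer prefixes first so "###" is not mistaken for "#".
    by_length = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    chunks, buffer, current_meta = [], [], {}

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(current_meta)})
        buffer.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        match = next(
            ((p, n) for p, n in by_length if line.startswith(p + " ")), None
        )
        if match:
            flush()
            prefix, name = match
            # A new header invalidates headers at the same or deeper level.
            for p, n in headers:
                if len(p) >= len(prefix):
                    current_meta.pop(n, None)
            current_meta[name] = line[len(prefix):].strip()
        elif line:
            buffer.append(line)
    flush()
    return chunks


demo_md = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
demo_headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
header_chunks = split_by_headers(demo_md, demo_headers)
for chunk in header_chunks:
    print(chunk)
```

Note how the `Chapter 2` chunk no longer carries `Header 3`: starting a new level-2 header clears the deeper level.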

In [23]:
# Try on a real Markdown file, like a Notion database.
fp = "../../data/Notion_DB"
loader = NotionDirectoryLoader(path=fp)
docs = loader.load()
txt = " ".join([d.page_content for d in docs])
md_header_splits = markdown_splitter.split_text(txt)

In [25]:
md_header_splits

[Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn fr

In [26]:
md_header_splits[0]

Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn fro