In [29]:
# Embedding model passed to SemanticChunker in the "Semantic split" section
from package import embedding

# Chunking Strategy  
Chunking (or splitting) is a critical step in a RAG-LLM chatbot.  
The reasons are:
- if chunk boundaries fall in the wrong place, useful information is cut apart and may not be returned during retrieval
- if the chunks are too big, each one carries a lot of noise
- if the chunks are too small, they lose the surrounding context (the big picture)

Hence, choose your chunking strategy wisely and test as many configurations as you can.  
Also prepare a debugging method for your application in advance, so you don't have to suffer in production.
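
One simple debugging aid is a summary of chunk lengths, which makes oversized or tiny chunks easy to spot. The sketch below is a minimal example under stated assumptions: `summarize_chunks` and `sample_chunks` are hypothetical names introduced here, and the input is a plain list of strings rather than LangChain `Document` objects.

```python
from statistics import mean


def summarize_chunks(chunks, chunk_size):
    """Return basic length statistics for a list of chunk strings."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        # Chunks longer than the requested size usually deserve a closer look.
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


# Hypothetical sample data, only to show the shape of the result.
sample_chunks = ["short", "a" * 120, "b" * 90]
stats = summarize_chunks(sample_chunks, chunk_size=100)
print(stats)
```

With the `docs` lists built later in this notebook, the same helper could be called as `summarize_chunks([d.page_content for d in docs], chunk_size)`.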

## Character split  
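This is only a baseline example; do not use it in a real application. It splits on a single separator, so any piece longer than `chunk_size` is kept whole, as the warnings in the output below show.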

In [8]:
from langchain_text_splitters import CharacterTextSplitter

chunk_size=100

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=chunk_size,
    chunk_overlap=int(chunk_size*0.1),
    length_function=len,
    is_separator_regex=False,
)

In [9]:
%%time

from langchain_community.document_loaders import SeleniumURLLoader

urls = [
    "https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/",
]

loader = SeleniumURLLoader(urls=urls)
docs = loader.load_and_split(text_splitter)
docs

Created a chunk of size 119, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 147, which is longer than the specified 100


[Document(page_content='Components\n\nDocument loaders\n\nURL\n\nURL', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/', 'title': 'URL | 🦜️🔗 LangChain', 'description': 'This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.', 'language': 'en'}),
 Document(page_content='This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/', 'title': 'URL | 🦜️🔗 LangChain', 'description': 'This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.', 'language': 'en'}),
 Document(page_content='Unstructured URL Loader\u200b\n\nYou have to install the unstructured library:\n\n!pip install', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loader

In [10]:
len(docs)

25

In [15]:
print(docs[0].page_content)

Components

Document loaders

URL

URL


In [19]:
doc_len = []
for doc in docs:
    doc_len.append(len(doc.page_content))

", ".join([str(doc) for doc in doc_len])

'38, 119, 85, 90, 43, 101, 101, 89, 83, 87, 73, 90, 95, 86, 98, 28, 89, 67, 86, 147, 87, 88, 100, 97, 94'
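
The lengths above include values over 100 (119, 101, 101, 147), matching the warnings printed earlier. The following pure-Python sketch illustrates why a single-separator splitter can produce such chunks. It is a simplified illustration, not LangChain's implementation: `split_on_separator` is a hypothetical helper, and it ignores `chunk_overlap` entirely.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece may itself exceed chunk_size; with only one
            # separator there is nothing finer to split it on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical text with one 119-character paragraph.
text = "Title\n\nIntro\n\n" + "x" * 119 + "\n\nEnd"
chunks = split_on_separator(text, "\n\n", chunk_size=100)
lengths = [len(c) for c in chunks]
print(lengths)
```

The 119-character paragraph comes out as one oversized chunk because it contains no `"\n\n"` to split on.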

## Recursive split  
It's a go-to method when you need a quick test.

In [21]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [22]:
chunk_size = 100

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=chunk_size,
    chunk_overlap=int(chunk_size*0.1),
    length_function=len,
    is_separator_regex=False,
)

In [23]:
%%time

loader = SeleniumURLLoader(urls=urls)
docs = loader.load_and_split(text_splitter)
docs

[Document(page_content='Components\n\nDocument loaders\n\nURL\n\nURL', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/', 'title': 'URL | 🦜️🔗 LangChain', 'description': 'This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.', 'language': 'en'}),
 Document(page_content='This example covers how to load HTML documents from a list of URLs into the Document format that we', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/', 'title': 'URL | 🦜️🔗 LangChain', 'description': 'This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.', 'language': 'en'}),
 Document(page_content='that we can use downstream.', metadata={'source': 'https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/', 'title': 'URL | 🦜️🔗 LangChain', 'description': 'This example covers how to loa

In [24]:
len(docs)

29

In [25]:
print(docs[0].page_content)

Components

Document loaders

URL

URL


In [26]:
doc_len = []
for doc in docs:
    doc_len.append(len(doc.page_content))

", ".join([str(doc) for doc in doc_len])

'38, 99, 27, 85, 90, 43, 99, 12, 99, 12, 89, 83, 87, 73, 90, 95, 86, 98, 28, 89, 67, 86, 98, 52, 87, 88, 92, 97, 94'
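
This time every chunk stays within 100 characters, because a piece that is still too long is split again with a finer separator. The sketch below shows that idea in plain Python. It is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, the separator order `("\n\n", "\n", " ", "")` is assumed for the example, and overlap is left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the first separator; retry oversized pieces with finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then split the long piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sentence = (
    "This example covers how to load HTML documents from a list of URLs "
    "into the Document format that we can use downstream."
)
chunks = recursive_split("URL\n\nURL\n\n" + sentence, chunk_size=100)
lengths = [len(c) for c in chunks]
print(lengths)
```

The long sentence is first isolated by `"\n\n"`, then broken at a space so that both halves fit within the limit.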

## Semantic split  
This is the slowest strategy to run (about 32 s wall time here), but the best of the three for now.

In [27]:
from langchain_experimental.text_splitter import SemanticChunker

In [30]:
text_splitter = SemanticChunker(embedding)

In [35]:
%%time
loader = SeleniumURLLoader(urls=urls)
docs = loader.load_and_split(text_splitter)
docs

CPU times: total: 62.5 ms
Wall time: 32.4 s


[Document(page_content='\n\nComponents\n\nDocument loaders\n\nURL\n\nURL\n\nThis example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream. Unstructured URL Loader\u200b\n\nYou have to install the unstructured library:\n\n!pip install\n\nU unstructured\n\nfrom\n\nlangchain_community\n\ndocument_loaders\n\nimport\n\nUnstructuredURLLoader\n\nAPI Reference:\n\nUnstructuredURLLoader\n\nurls\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"\n\n"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"\n\nPass in ssl_verify=False with headers=headers to get past ssl_verification error. loader\n\nUnstructuredURLLoader\n\nurls\n\nurls\n\ndata\n\nloader\n\nload\n\nSelenium URL Loader\u200b\n\nThis covers how to load HTML documents from a list of URLs using the SeleniumURLLoader. Using Selenium allows us to load pages that require JavaScript to

In [32]:
len(docs)

2

In [33]:
print(docs[0].page_content)



Components

Document loaders

URL

URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream. Unstructured URL Loader​

You have to install the unstructured library:

!pip install

U unstructured

from

langchain_community

document_loaders

import

UnstructuredURLLoader

API Reference:

UnstructuredURLLoader

urls

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"

"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"

Pass in ssl_verify=False with headers=headers to get past ssl_verification error. loader

UnstructuredURLLoader

urls

urls

data

loader

load

Selenium URL Loader​

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader. Using Selenium allows us to load pages that require JavaScript to render. To use the SeleniumURLLoader, you have to install selenium and unstructured. !p

In [34]:
doc_len = []
for doc in docs:
    doc_len.append(len(doc.page_content))

", ".join([str(doc) for doc in doc_len])

'1565, 603'
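
The general idea behind semantic splitting is to compare neighbouring sentences and start a new chunk where their meaning diverges. The sketch below is a toy illustration of that idea only. It is not how `SemanticChunker` is implemented: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the fixed `threshold` is an assumption made for the example.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def semantic_split(text, threshold):
    """Start a new chunk wherever adjacent sentences differ by more than threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, next_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, next_vec) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "The loader fetches the page. The loader parses the page. "
    "Cats sleep all day. Cats sleep at night."
)
chunks = semantic_split(text, threshold=0.6)
print(chunks)
```

Because the split points depend on content rather than on a character budget, chunk sizes can vary widely, which is consistent with the two chunks of 1565 and 603 characters above.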

# Playground  
[Look at this reference on LangChain](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)