###### Source: 'Building LLMs for Production' by Louis-Francois Bouchard, Louie Peters

#### 0. Install related packages 

In [None]:
!pip install langchain==0.0.208 deeplake openai==0.27.8 tiktoken 

#### 1. Create a LangChain Object

In [1]:
from langchain.document_loaders import TextLoader 

text =""" Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It's similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)
"""

# Write text to local file 
with open('my_file.txt','w') as file: 
    file.write(text)

In [2]:
# Use TextLoader to load text from local file 

loader = TextLoader('my_file.txt')      # Create an instance of TextLoader class 
docs_from_file = loader.load()          # Call the load method to load the text from the file

print(len(docs_from_file))              # Print the number of documents loaded from the file

1


#### 2. Split the documents into Chunks with Text Splitter

`chunk_overlap` is the number of characters shared between two consecutive chunks.

>It preserves context and improves coherence by ensuring that important information is not cut off at the boundaries of chunks.
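To make the overlap idea concrete, here is a minimal plain-Python sketch of a fixed-size sliding window. It is an illustration only: `chunk_with_overlap` is a hypothetical helper, and LangChain's `CharacterTextSplitter` works differently (it splits on a separator first and then merges the pieces).

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so text near a boundary appears in both neighbours.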

In [3]:
from langchain.text_splitter import CharacterTextSplitter 

# Create a text splitter instance 
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
# Split the text into chunks
docs = text_splitter.split_documents(docs_from_file)
print(len(docs))

Created a chunk of size 369, which is longer than the specified 200


2


#### 3. Setup a vector store & Create an embedding for each chunk

> A vector store is a system to store embeddings, allowing us to query them.

> Chunking is done because LLMs typically have a limited context window, and storing smaller chunks improves retrieval accuracy and ensures relevant sections are retrieved.
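The store-then-query idea can be sketched without any external service. The example below is a toy under stated assumptions: `toy_embed` is a hypothetical word-count "embedding" standing in for a real model, and `ToyVectorStore` is not how Deep Lake or any production vector store is implemented.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def search(self, query, k=1):
        query_vec = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Google launches an API for PaLM",
    "PaLM is a large language model",
    "The dog is in the yard",
])
print(store.search("which API did Google launch", k=1))
```

A real vector store adds persistence, approximate nearest-neighbour indexing, and metadata filtering on top of this basic pattern.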

We'll utilize the Deep Lake vector store, offered by Activeloop. They provide a cloud-based vector store solution, but other options like Chroma DB would also be suitable.

In [4]:
from langchain.embeddings import OpenAIEmbeddings

# Before executing the following code, make sure to have
# your OpenAI key saved in the "OPENAI_API_KEY" environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [5]:
from langchain.vectorstores import DeepLake 

# Sign up at https://app.activeloop.ai/ to get your Deep Lake API token, then save it
# (for example in the ACTIVELOOP_TOKEN environment variable) before running this cell.

# create a DeepLake dataset 

my_activeloop_org_id = "bichpham102" # TODO: use your organization id here. (by default, org id is your username)
my_activeloop_dataset_name = 'langchain_course_indexers_retrievers'

dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
db.add_documents(docs)
# the documents (docs) and their corresponding embeddings will be generated and stored in the new DeepLake dataset

Deep Lake Dataset in hub://bichpham102/langchain_course_indexers_retrievers already exists, loading from the storage


Creating 2 embeddings in 1 batches of size 2:: 100%|██████████| 1/1 [00:42<00:00, 42.41s/it]

Dataset(path='hub://bichpham102/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (8, 1536)  float32   None   
    id        text      (8, 1)      str     None   
 metadata     json      (8, 1)      str     None   
   text       text      (8, 1)      str     None   





['000c4df6-78c3-11ef-89ac-6045bd1d6a3b',
 '000c4f36-78c3-11ef-89ac-6045bd1d6a3b']

#### 4. Create & work with a LangChain retriever

> Once the retriever is created, we can use the `RetrievalQA` class to define a question-answering chain over an external data source. The `chain_type` argument controls how the retrieved documents are passed to the LLM.

###### 1. **`chain_type='stuff'`**
   - **How it works**: This method takes **all retrieved documents** and "stuffs" them into a single prompt to the language model (LLM). It concatenates the documents and the query into one large input and then sends that to the LLM to generate a response.
   - **Use case**: Suitable when the **total size of documents is small** enough to fit within the LLM’s context window (the maximum input size the model can process).
   - **Advantages**: It's **simple and fast** because it only makes one call to the LLM.
   - **Limitations**: This approach **fails** when the combined document size exceeds the context length of the model, leading to truncation or incomplete processing of the documents.
   - **Best for**: Short documents or queries that need to be processed all at once.

###### 2. **`chain_type='map_reduce'`**
   - **How it works**: Once the relevant chunks are retrieved from the vector store, the map-reduce process treats each retrieved chunk **independently**. It runs the LLM on each chunk to generate partial answers (the map step), and then **aggregates** (reduces) these partial results to form a final answer (the reduce step).
   - **Use case**: Ideal for scenarios where the documents are **too large** to fit into the context window at once.
   - **Advantages**: It allows for **scaling** to larger datasets because each chunk can be processed separately. It reduces the risk of important information being missed due to context limitations.
   - **Limitations**: This approach may lead to **inconsistencies** across chunks or loss of context between parts of the document.
   - **Best for**: Handling larger documents that exceed the model’s context window.

###### 3. **`chain_type='refine'`**
   - **How it works**: In this method, the LLM processes the first document, generates an answer, and then **iteratively refines** that answer by incorporating information from each subsequent document. Each document is used to improve or adjust the previous answer.
   - **Use case**: Best for situations where you want the model to **build upon** or **refine** an answer by looking at documents one-by-one.
   - **Advantages**: This method is beneficial when **context continuity** is essential because the model refines its response with each new document. It ensures that **new insights** from each document are incorporated into the final answer.
   - **Limitations**: It can be **slow** as the model processes each document individually and performs multiple iterations.
   - **Best for**: Complex queries where iterative improvements in the answer based on additional information are crucial.

###### Summary of Use Cases:
- **`stuff`**: Use when documents are small enough to fit into the context window. Simple and fast but limited by input size.
- **`map_reduce`**: Ideal for large documents that need to be broken into chunks. Efficient for handling large datasets but may lead to slight inconsistencies.
- **`refine`**: Best for complex answers that require refinement over time by integrating information from multiple documents. Provides thoroughness but is slower due to multiple iterations.

These methods allow you to tailor the retrieval and answering process based on the size and complexity of the documents you're working with.
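The three strategies differ mainly in how many LLM calls they make and what each prompt contains. The sketch below illustrates that control flow with a stub; the prompt wording and the `fake_llm` helper are assumptions made for illustration, not LangChain's actual prompts or implementation.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def stuff(llm: LLM, docs: List[str], question: str) -> str:
    # One call: every document is concatenated into a single prompt.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(llm: LLM, docs: List[str], question: str) -> str:
    # Map: one call per document. Reduce: one final call over the partial answers.
    partials = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    summaries = "\n".join(partials)
    return llm(f"Partial answers:\n{summaries}\n\nCombine them to answer: {question}")


def refine(llm: LLM, docs: List[str], question: str) -> str:
    # The first document produces a draft; each later document revises it.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\n\nNew context:\n{doc}\n\n"
            f"Refine the answer to: {question}"
        )
    return answer


calls = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stub that records prompts instead of calling a model."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


docs = ["chunk A", "chunk B", "chunk C"]
for strategy in (stuff, map_reduce, refine):
    calls.clear()
    strategy(fake_llm, docs, "What is PaLM?")
    print(f"{strategy.__name__}: {len(calls)} LLM call(s)")
```

With three documents, the stub records one call for `stuff`, four for `map_reduce` (three map calls plus one reduce call), and three for `refine`, which is why the latter two cost more time and tokens.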

##### 4a. RetrievalQA with base retriever

In [6]:
retriever = db.as_retriever() 
# calling the method as_retriever() on the vector store instance 

In [7]:
from langchain.chains import RetrievalQA 
from langchain.chat_models import ChatOpenAI


llm = ChatOpenAI(model_name='gpt-3.5-turbo') 

# create a retrieval chain 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm
    ,chain_type='stuff'
    ,retriever=retriever 
)

In [8]:
query = 'How Google plans to challenge OpenAI?'
response = qa_chain.run(query)
print(response)

Google plans to challenge OpenAI by opening up its advanced AI language model, PaLM, to developers. By launching an API for PaLM and providing a range of AI enterprise tools, Google aims to help businesses generate various forms of content from simple natural language prompts. This move positions Google to compete with OpenAI and its renowned GPT-3 model in the field of AI language processing.


##### 4b. RetrievalQA with base retriever and additional LLMChainExtractor as compressor 

In [9]:
from langchain.retrievers import ContextualCompressionRetriever 
from langchain.retrievers.document_compressors import LLMChainExtractor 
from langchain.chat_models import ChatOpenAI

# create a chat model wrapper (gpt-3.5-turbo)
llm = ChatOpenAI(model_name='gpt-3.5-turbo') 

# create compressor for the retriever 
compressor = LLMChainExtractor.from_llm(llm)

# Wrap the original retriever with ContextualCompressionRetriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor
    ,base_retriever=retriever
)

In [10]:
#  retrieves and compresses relevant documents -> a list of document objects
query = 'How Google plans to challenge OpenAI?'
# You are working directly with the retrieved and compressed documents, 
# no Q&A or further language model processing happens at this stage.
retrieved_docs = compression_retriever.get_relevant_documents(query)
print(retrieved_docs[0].page_content)



Google opens up its AI language model PaLM to challenge OpenAI and GPT-3. Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."


> Output: It returns a list of document objects. Each document in this list represents a relevant text retrieved from the underlying data store (like a vector store), compressed for efficiency.
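The idea of compressing retrieved documents can be sketched in plain Python. This is only an approximation: `LLMChainExtractor` asks an LLM to pull out the relevant passages, whereas the hypothetical `extract_relevant` helper below uses simple keyword overlap as a stand-in.

```python
import re


def extract_relevant(page_content: str, query: str) -> str:
    """Keep only the sentences that share a keyword with the query.

    Hypothetical stand-in for an LLM-based extractor.
    """
    keywords = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", page_content.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


def compress(docs: list[str], query: str) -> list[str]:
    """Compress every retrieved document and drop the ones left empty."""
    compressed = (extract_relevant(doc, query) for doc in docs)
    return [doc for doc in compressed if doc]


retrieved = [
    "Google is launching an API for PaLM. The weather was sunny that day.",
    "Cats sleep most of the day.",
]
print(compress(retrieved, "What API is Google launching?"))
```

Irrelevant sentences are dropped from the first document and the second document disappears entirely, so less text reaches the final prompt.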

In [15]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm 
    ,chain_type='stuff'
    ,retriever=compression_retriever 
)
response = qa_chain.run(query)
print(response)



Google plans to challenge OpenAI by opening up its AI language model PaLM. This move allows researchers and developers to access and utilize Google's powerful language model, potentially competing with OpenAI's models in the field of artificial intelligence.


---

#### 5. Data Ingestion 

##### 5a. TextLoader

In [18]:
from langchain.document_loaders import TextLoader 

loader = TextLoader('my_file.txt', encoding='utf-8')    # Create an instance of TextLoader class
documents = loader.load()   

> An example of what the documents list would contain after running the TextLoader `[Document(page_content='<FILE_CONTENT>', metadata={'source': 'file_path.txt'})]`



##### 5b. PyPDFLoader

In [1]:
!pip install -q pypdf 

In [16]:
from langchain.document_loaders import PyPDFLoader 

loader = PyPDFLoader('Storage Options GCP.pdf')
pages = loader.load_and_split()

print('number of pages: ' ,len(pages))

number of pages:  3


In [None]:
print(pages[0])

##### 5c. SeleniumURLLoader (URL)

When the `load()` method is used with the `SeleniumURLLoader`, it returns a collection of Document instances, each containing the content fetched from the web pages. These Document instances have a `page_content` attribute, which includes the text extracted from the HTML, and a metadata attribute that stores the source URL.

The `SeleniumURLLoader` class in LangChain accepts the following parameters:

• `urls (List[str])`: A list of URLs that the loader will access.

• `continue_on_failure (bool, default=True)`: Determines whether the loader should continue processing other URLs in case of a failure.

• `browser (str, default="chrome")`: Choice of browser for loading the URLs. The supported values are `"chrome"` and `"firefox"`.

• `executable_path (Optional[str], default=None)`: The path to the browser’s executable file.

• `headless (bool, default=True)`: Specifies whether the browser should operate in headless mode, meaning it runs without a visible user interface.



In [20]:
!pip install -q unstructured selenium 

In [21]:
from langchain.document_loaders import SeleniumURLLoader 

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls = urls)
data = loader.load()

print(data[0])


page_content='' metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'}


In [22]:
print(data[0].page_content)





##### 5d. Google Drive Loader

To use the `GoogleDriveLoader`, you need to set up the necessary credentials and tokens:

• By default, the loader looks for the credentials file at `~/.credentials/credentials.json`. You can specify a different location using the `credentials_path` keyword argument.

• For the token, the token.json file is created automatically on the loader’s first use and follows a similar path convention.

`recursive=False`: The loader only reads files located directly inside the folder identified by `folder_id`; it does not descend into subfolders.

In [None]:
from langchain.document_loaders import GoogleDriveLoader 

# instantiate the GoogleDriveLoader 
loader = GoogleDriveLoader(
    folder_id = 'folder_id'
    ,recursive=False # will not look into subfolders 
)

In [None]:
docs = loader.load() 

#### 6. Text Splitter 

##### 6a. `CharacterTextSplitter` 

In [1]:
# Load/Ingest documents 

from langchain.document_loaders import PyPDFLoader 
loader = PyPDFLoader('The One Page Linux Manual.pdf') 
pages = loader.load_and_split() 

In [2]:
# Chunking with `CharacterTextSplitter`

from langchain.text_splitter import CharacterTextSplitter 
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(pages)

In [3]:
print(texts[0]) 

print(f'You have {len(texts)} documents')
print('Preview:')
print(texts[0].page_content)

page_content='THE ONE     PAGE LINUX MANUALA summary of useful Linux commands\nVersion 3.0 May 1999 squadron@powerup.com.au\nStarting & Stopping\nshutdown -h now Shutdown the system now and do not\nreboot\nhalt Stop all processes - same as above\nshutdown -r 5 Shutdown the system in 5 minutes and\nreboot\nshutdown -r now Shutdown the system now and reboot\nreboot Stop all processes and then reboot - same\nas above\nstartx Start the X system\nAccessing & mounting file systems\nmount -t iso9660 /dev/cdrom\n/mnt/cdromMount the device cdrom\nand call it cdrom under the\n/mnt directory\nmount -t msdos /dev/hdd\n/mnt/ddriveMount hard disk “d” as a\nmsdos file system and call\nit ddrive under the /mnt\ndirectory\nmount -t vfat /dev/hda1\n/mnt/cdriveMount hard disk “a” as a\nVFAT file system and call it\ncdrive under the /mnt\ndirectory\numount /mnt/cdrom Unmount the cdrom\nFinding files and text within files\nfind / -name  fname Starting with the root directory, look\nfor the file called fnam

##### 6b. `RecursiveCharacterTextSplitter`

In [1]:
from langchain.document_loaders import PyPDFLoader 
from langchain.text_splitter import RecursiveCharacterTextSplitter 

loader = PyPDFLoader('The One Page Linux Manual.pdf') 
pages = loader.load_and_split() 

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50 
    ,chunk_overlap=10 
    ,length_function=len # calculates the length of chunks
    # The default is len, which counts the number of characters
)

docs = text_splitter.split_documents(pages)

In [2]:
for doc in docs:
    print(doc)

page_content='THE ONE     PAGE LINUX MANUALA summary of useful' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='of useful Linux commands' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='Version 3.0 May 1999 squadron@powerup.com.au' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='Starting & Stopping' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='shutdown -h now Shutdown the system now and do' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='and do not' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='reboot\nhalt Stop all processes - same as above' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='shutdown -r 5 Shutdown the system in 5 minutes' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='5 minutes and' metadata={'source': 'The One Page Linux Manual.

> `RecursiveCharacterTextSplitter` can also split by tokens if used with a Token Counter.


```python

# install `transformers` 

!pip install transformers

# modify the `length_function` to use a token-based counter

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Load a pre-trained tokenizer (e.g., BERT tokenizer)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

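# Function to count tokens. Special tokens such as [CLS] and [SEP] are excluded
# so they do not inflate every length measurement.
def token_counter(text):
    return len(tokenizer.encode(text, add_special_tokens=False))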

# Load and split the PDF
loader = PyPDFLoader('The One Page Linux Manual.pdf')
pages = loader.load_and_split()

# Use the token counter for the length function
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,  # This is now 50 tokens
    chunk_overlap=10,  # 10 tokens of overlap
    length_function=token_counter  # Counts tokens instead of characters
)

docs = text_splitter.split_documents(pages)
```


##### 6c. `NLTKTextSplitter`

In [1]:
!pip install nltk 



In [2]:
# SeleniumURLLoader: Loads the text from the specified URL
from langchain.document_loaders import SeleniumURLLoader 

urls = [
    "https://python.langchain.com/v0.1/docs/use_cases/web_scraping/"
    ,"https://www.restack.io/docs/langchain-knowledge-langchain-web-scraper"
]

loader = SeleniumURLLoader(urls = urls)
data = loader.load()

print(data[0])

page_content='This is documentation for LangChain v0.1, which is no longer actively maintained.\n\nFor the current stable version, see this version (Latest).\n\nWeb scraping\n\nUse case\u200b\n\nWeb research is one of the killer LLM applications:\n\nUsers have highlighted it as one of his top desired AI tools.\n\nOSS repos like gpt-researcher are growing in popularity.\n\nOverview\u200b\n\nGathering content from the web has a few components:\n\nSearch: Query to url (e.g., using GoogleSearchAPIWrapper).\n\nLoading: Url to HTML (e.g., using AsyncHtmlLoader, AsyncChromiumLoader, etc).\n\nTransforming: HTML to formatted text (e.g., using HTML2Text or Beautiful Soup).\n\nQuickstart\u200b\n\npip install -q langchain-openai langchain playwright beautifulsoup4\nplaywright install\n\n# Set env var OPENAI_API_KEY or load from a .env file:\n# import dotenv\n# dotenv.load_dotenv()\n\nScraping HTML content using a headless instance of Chromium.\n\nThe async nature of the scraping process is handled

In [8]:
# Initialize the NLTKTextSplitter
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=500)

# Iterate over the documents and split them into chunks
all_texts = []
for i, doc in enumerate(data):
    doc_content = doc.page_content
    texts = text_splitter.split_text(doc_content) 
    all_texts.extend(texts) 
    print(f'Document {i+1} has {len(texts)} chunks')

Created a chunk of size 789, which is longer than the specified 500
Created a chunk of size 1300, which is longer than the specified 500
Created a chunk of size 512, which is longer than the specified 500
Created a chunk of size 1240, which is longer than the specified 500
Created a chunk of size 1291, which is longer than the specified 500
Created a chunk of size 745, which is longer than the specified 500
Created a chunk of size 757, which is longer than the specified 500
Created a chunk of size 517, which is longer than the specified 500


Document 1 has 46 chunks
Document 2 has 70 chunks


In [9]:
print(len(all_texts))

116


In [10]:
# print the third chunk (index 2) of the combined list
print(all_texts[2])

Loading: Url to HTML (e.g., using AsyncHtmlLoader, AsyncChromiumLoader, etc).

Transforming: HTML to formatted text (e.g., using HTML2Text or Beautiful Soup).

Quickstart​

pip install -q langchain-openai langchain playwright beautifulsoup4
playwright install

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

Scraping HTML content using a headless instance of Chromium.


In [13]:
print(all_texts[1])

OSS repos like gpt-researcher are growing in popularity.

Overview​

Gathering content from the web has a few components:

Search: Query to url (e.g., using GoogleSearchAPIWrapper).

Loading: Url to HTML (e.g., using AsyncHtmlLoader, AsyncChromiumLoader, etc).

Transforming: HTML to formatted text (e.g., using HTML2Text or Beautiful Soup).


##### 6d. `SpacyTextSplitter`

In [4]:
!pip install spacy



> The SpaCy model (like en_core_web_sm) is needed because the SpacyTextSplitter relies on SpaCy's natural language processing (NLP) capabilities to analyze and understand the structure of the text.
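Sentence-aware splitters first segment the text into sentences and then pack whole sentences into chunks. The sketch below imitates that with a naive regex segmenter. It is an assumption-laden illustration (no overlap, a hypothetical `pack_sentences` helper), not the logic of `SpacyTextSplitter` or `NLTKTextSplitter`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive regex sentence splitter, a stand-in for an NLP sentence segmenter."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Greedily merge whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one; it may itself exceed chunk_size
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Short one. Another short one. "
    "This single sentence is far longer than the limit we picked. End."
)
chunks = pack_sentences(split_sentences(text), chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Because sentences are never cut in the middle, a sentence longer than `chunk_size` becomes an oversized chunk on its own. That is one plausible reading of the "Created a chunk of size ..., which is longer than the specified ..." warnings in this notebook.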

In [8]:
# download the language model 
!python -m spacy download en_core_web_sm 

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [9]:
# SeleniumURLLoader: Loads the text from the specified URL
from langchain.document_loaders import SeleniumURLLoader 

urls = [
    "https://python.langchain.com/docs/introduction/"
    ,"https://python.langchain.com/docs/versions/v0_3/"
]

loader = SeleniumURLLoader(urls = urls)
data = loader.load()

print(data[0])

page_content="Introduction\n\nLangChain is a framework for developing applications powered by large language models (LLMs).\n\nLangChain simplifies every stage of the LLM application lifecycle:\n\nDevelopment: Build your applications using LangChain's open-source building blocks, components, and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.\n\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\n\nDeployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Cloud.\n\nConcretely, the framework consists of the following open-source libraries:\n\nlangchain-core: Base abstractions and LangChain Expression Language.\n\nlangchain-community: Third party integrations.\n\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into t

In [10]:
from langchain.text_splitter import SpacyTextSplitter 

text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=25) 

all_texts = []  
for i, doc in enumerate(data): 
    doc_content = doc.page_content 
    texts = text_splitter.split_text(doc_content) 
    all_texts.extend(texts) 
    print(f'Document {i+1} has {len(texts)} chunks') 

Document 1 has 9 chunks


Created a chunk of size 1186, which is longer than the specified 500
Created a chunk of size 509, which is longer than the specified 500
Created a chunk of size 645, which is longer than the specified 500


Document 2 has 20 chunks


In [11]:
print(len(all_texts))

29


##### 6e. `MarkdownTextSplitter`

In [12]:
markdown_text = """
#

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""




Summary:

`split_text()`: Returns plain text chunks without any additional information.

`create_documents()`: Returns a list of documents that may contain both text chunks and associated metadata, making it more useful for structured use cases like document processing, search, or indexing.


> If you need just the text split into chunks, use split_text(). If you need structured documents with metadata (e.g., for further NLP processing or indexing), use create_documents().
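The difference between the two methods, and the reason `create_documents()` must receive a list, can be shown with a toy splitter. Everything here is hypothetical: the dict stands in for LangChain's `Document`, the `source_index` metadata key is invented, and the fixed-size slicing is far simpler than `MarkdownTextSplitter`.

```python
def split_text(text: str, chunk_size: int = 20) -> list[str]:
    """Hypothetical toy splitter: fixed-size character slices, no metadata."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def create_documents(texts: list[str], chunk_size: int = 20) -> list[dict]:
    """Split every input text and wrap each chunk with a metadata dict."""
    documents = []
    for index, text in enumerate(texts):
        for chunk in split_text(text, chunk_size):
            documents.append(
                {"page_content": chunk, "metadata": {"source_index": index}}
            )
    return documents


blog = "# Welcome to My Blog!\n\n## Introduction"

print(split_text(blog))                 # plain strings
print(len(create_documents([blog])))    # list containing one text
print(len(create_documents(blog)))      # bare string: iterated character by character
```

Passing the bare string makes the outer loop visit one character at a time, so this toy version returns one document per character instead of a handful of chunks.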

In [18]:
# CORRECT: pass a list of strings ([markdown_text]). create_documents() iterates over the list, so the whole text is split as one document.
from langchain.text_splitter import MarkdownTextSplitter 

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
print(docs)
print(len(docs))

# INCORRECT: passing the bare string (no []) makes create_documents() iterate over it character by character, producing a flood of single-character documents instead of meaningful chunks.

[Document(page_content='#\n\n# Welcome to My Blog!', metadata={}), Document(page_content='## Introduction', metadata={}), Document(page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,', metadata={}), Document(page_content='Java, and JavaScript.', metadata={}), Document(page_content="Here's a list of my favorite programming languages:\n\n1. Python\n2. JavaScript\n3. Java", metadata={}), Document(page_content='You can check out some of my projects on [GitHub](https://github.com).', metadata={}), Document(page_content='## About this Blog', metadata={}), Document(page_content="In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on", metadata={}), Document(page_content='the latest technology trends, and occasional book reviews.', metadata={}), Document(page_content="Here's a small piece of Python code to say hello:", metadata={}), Document(page_content='\\``` python\ndef say_hello(name):\n 

##### 6f. `TokenTextSplitter`

In [20]:
# SeleniumURLLoader: Loads the text from the specified URL
from langchain.document_loaders import SeleniumURLLoader 

urls = [
    "https://python.langchain.com/docs/introduction/"
    ,"https://python.langchain.com/docs/versions/v0_3/"
]

loader = SeleniumURLLoader(urls = urls)
data = loader.load()

print(data[0])

page_content="Introduction\n\nLangChain is a framework for developing applications powered by large language models (LLMs).\n\nLangChain simplifies every stage of the LLM application lifecycle:\n\nDevelopment: Build your applications using LangChain's open-source building blocks, components, and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.\n\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\n\nDeployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Cloud.\n\nConcretely, the framework consists of the following open-source libraries:\n\nlangchain-core: Base abstractions and LangChain Expression Language.\n\nlangchain-community: Third party integrations.\n\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into t

In [21]:
from langchain.text_splitter import TokenTextSplitter 

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

all_texts = []
for i, doc in enumerate(data): 
    doc_content = doc.page_content 
    texts = text_splitter.split_text(doc_content)
    all_texts.extend(texts) 
    print(f'Document {i+1} has {len(texts)} chunks')

Document 1 has 18 chunks
Document 2 has 61 chunks


#### 7. Embeddings - Similarity Search and Vector Embeddings

##### 7a. OpenAI 

In [1]:
import openai 
import numpy as np 
from sklearn.metrics.pairwise import cosine_similarity 
from langchain.embeddings import OpenAIEmbeddings 

In [2]:
# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

In [3]:
# Initialize the OpenAIEmbeddings instance 
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# Generate embeddings for the documents 
document_embeddings = embeddings.embed_documents(documents)

In [5]:
# get embeddings for the query 
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query) 

# perform a similarity search between the query's embedding and the documents' embeddings

## similarity scores between the query and each document 
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]
# the [0] is to get the first row of the cosine similarity matrix -> flatten the matrix
# the matrix is of shape (1, len(documents)) 
print(cosine_similarity([query_embedding], document_embeddings)) 
print(similarity_scores) 

[[0.97331318 0.96955366 0.82846392 0.82959776]]
[0.97331318 0.96955366 0.82846392 0.82959776]
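For reference, the cosine similarity used above is cos(a, b) = (a · b) / (|a| |b|). The following sketch computes it in plain Python on made-up 3-dimensional vectors. They stand in for the 1536-dimensional embeddings and are not real model output.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only.
query_vec = [1.0, 2.0, 0.0]
doc_vecs = [
    [2.0, 4.0, 0.0],  # same direction as the query, different length
    [0.0, 0.0, 3.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # partially aligned
]

scores = [cosine_sim(query_vec, d) for d in doc_vecs]
best = max(range(len(scores)), key=scores.__getitem__)  # plain-Python argmax
print([round(s, 3) for s in scores], best)
```

The first vector scores highest because it points in the same direction as the query. Cosine similarity ignores vector length and compares direction only.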


In [6]:
# find the most similar document 
most_similar_index = np.argmax(similarity_scores)  # argmax returns the index of the maximum value in a given array or sequence
most_similar_document = documents[most_similar_index] 

print(f"Most similar document to the query '{query}':")
print(most_similar_document)

Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.


##### 7b. Open-source Embedding Models - HuggingFace

> We chose `sentence-transformers/all-mpnet-base-v2`, a pre-trained model for converting sentences into semantically meaningful vectors.

In [2]:
!pip install sentence_transformers 

Collecting sentence_transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.38.0 (from sentence_transformers)
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub>=0.19.3 (from sentence_transformers)
  Downloading huggingface_hub-0.25.1-py3-none-any.whl.metadata (13 kB)
Collecting safetensors>=0.4.1 (from transformers<5.0.0,>=4.38.0->sentence_transformers)
  Downloading safetensors-0.4.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers<5.0.0,>=4.38.0->sentence_transformers)
  Downloading tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
Downloading huggingface_hub-0.25.1-py3-none-any.whl (436 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)

In [3]:
from langchain.llms import HuggingFacePipeline 
from langchain.embeddings import HuggingFaceEmbeddings 

model_name = 'sentence-transformers/all-mpnet-base-v2'
model_kwargs = {'device':'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name,
                           model_kwargs=model_kwargs)

documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents) 



In [5]:
len(doc_embeddings)

3

##### 7c. Cohere Embeddings

> Cohere's multilingual model maps text into a semantic vector space, which improves text-similarity comparisons in multilingual applications.

> Unlike Cohere's English-language model, this model employs dot product computations for improved performance.
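
> To make the difference between dot product and the cosine similarity used earlier concrete, here is a small sketch with made-up two-dimensional vectors. It illustrates the two measures only and says nothing about how Cohere computes similarity internally. The helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(dot(a, b))                        # grows with vector length
print(cosine(a, b))                     # depends only on direction
print(dot(normalize(a), normalize(b)))  # matches cosine once vectors are unit length
```

> Dot product skips the normalization step, so it is cheaper to compute, but the score also reflects vector magnitude unless the embeddings are already unit length.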

In [3]:
%pip install -U langchain-cohere

Collecting langchain-cohere
  Downloading langchain_cohere-0.3.0-py3-none-any.whl.metadata (6.7 kB)
Collecting langchain-experimental>=0.3.0 (from langchain-cohere)
  Downloading langchain_experimental-0.3.0-py3-none-any.whl.metadata (1.7 kB)
Collecting tabulate<0.10.0,>=0.9.0 (from langchain-cohere)
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading langchain_cohere-0.3.0-py3-none-any.whl (43 kB)
Downloading langchain_experimental-0.3.0-py3-none-any.whl (206 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate, langchain-experimental, langchain-cohere
Successfully installed langchain-cohere-0.3.0 langchain-experimental-0.3.0 tabulate-0.9.0
Note: you may need to restart the kernel to use updated packages.


In [5]:
from langchain_cohere import CohereEmbeddings


# Initialize the CohereEmbeddings object 
cohere = CohereEmbeddings(model='embed-multilingual-v2.0')


In [7]:
# Define a list of texts
texts = [
    "Hello from Cohere!",
    "مرحبًا من كوهير!",
    "Hallo von Cohere!",  
    "Bonjour de Cohere!",
    "¡Hola desde Cohere!",
    "Olá do Cohere!",  
    "Ciao da Cohere!",
    "您好，来自 Cohere！",
    "कोहेरे से नमस्ते!",
    "Xin chào Cohere!"
]

# generate embeddings for the texts 
document_embeddings = cohere.embed_documents(texts)

# print the embeddings 
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:5]}") # print the first 5 elements of the embedding 

Text: Hello from Cohere!
Embedding: [0.23461914, 0.50146484, -0.048828125, 0.13989258, -0.18029785]
Text: مرحبًا من كوهير!
Embedding: [0.25317383, 0.30004883, 0.0104904175, 0.12573242, -0.18273926]
Text: Hallo von Cohere!
Embedding: [0.10266113, 0.28320312, -0.050201416, 0.23706055, -0.07159424]
Text: Bonjour de Cohere!
Embedding: [0.15185547, 0.28173828, -0.057281494, 0.11743164, -0.04385376]
Text: ¡Hola desde Cohere!
Embedding: [0.25146484, 0.43139648, -0.0859375, 0.24682617, -0.11706543]
Text: Olá do Cohere!
Embedding: [0.18664551, 0.39038086, -0.045898438, 0.14562988, -0.11254883]
Text: Ciao da Cohere!
Embedding: [0.115722656, 0.43310547, -0.026168823, 0.14575195, 0.07080078]
Text: 您好，来自 Cohere！
Embedding: [0.24609375, 0.30859375, -0.111694336, 0.26635742, -0.051086426]
Text: कोहेरे से नमस्ते!
Embedding: [0.1932373, 0.6352539, 0.03213501, 0.117370605, -0.26098633]
Text: Xin chào Cohere!
Embedding: [0.29492188, 0.38793945, -0.013412476, 0.19311523, -0.0993042]


##### 7d. Deep Lake Vector Store

In [8]:
%pip install deeplake 

Collecting deeplake
  Downloading deeplake-3.9.23.tar.gz (617 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pillow~=10.2.0 (from deeplake)
  Downloading pillow-10.2.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting click (from deeplake)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting pathos (from deeplake)
  Downloading pathos-0.3.2-py3-none-any.whl.metadata (11 kB)
Collecting humbug>=0.3.1 (from deeplake)
  Downloading humbug-0.3.2-py3-none-any.whl.metadata (6.8 kB)
Collecting lz4 (from deeplake)
  Downloading lz4-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting pyjwt (from deeplake)
  Downloading PyJWT-2.9.0-py3-none-any.whl.metadata (3.0 kB

In [None]:
%pip install tiktoken

In [9]:
from langchain.embeddings import OpenAIEmbeddings 
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter 
from langchain.chat_models import ChatOpenAI 
from langchain.chains import RetrievalQA 

In [10]:
# 1. split documents into chunks

texts = [
    "Napoleon Bonaparte was born on 15 August 1769",
    "Louis XIV was born on 5 September 1638",
    "Lady Gaga was born on 28 March 1986",
    "Michael Jeffrey Jordan was born on 17 February 1963"
]


text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0) 
docs = text_splitter.create_documents(texts)

In [13]:
# 2. Get and Store embeddings in DeepLake

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create a DeepLake dataset 
my_activeloop_org_id = "bichpham102" # TODO: use your organization id here. (by default, org id is your username)
my_activeloop_dataset_name = 'langchain_course_embeddings'
dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'
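# Note: `embedding_function` is deprecated in newer LangChain releases; the warning
# printed below recommends passing the `embedding` argument instead.
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)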

# add documents to the DeepLake dataset 
db.add_documents(docs)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Deep Lake Dataset in hub://bichpham102/langchain_course_embeddings already exists, loading from the storage


Creating 4 embeddings in 1 batches of size 4:: 100%|██████████| 1/1 [00:26<00:00, 26.69s/it]

Dataset(path='hub://bichpham102/langchain_course_embeddings', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   





['ed38329e-7a89-11ef-a33a-6045bd2238a6',
 'ed38342e-7a89-11ef-a33a-6045bd2238a6',
 'ed383500-7a89-11ef-a33a-6045bd2238a6',
 'ed3835b4-7a89-11ef-a33a-6045bd2238a6']

In [14]:
# 3. Create a Retriever, and incorporate it into a RetrievalQA chain
retriever = db.as_retriever()

# instantiate the llm wrapper 
model = ChatOpenAI(model_name='gpt-3.5-turbo')
# create the chain 
qa_chain = RetrievalQA.from_llm(model, retriever=retriever)



In [15]:
# 4. Query the chain 
qa_chain.run('When was Lady Gaga born?') 



'Lady Gaga was born on 28 March 1986.'
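
> Conceptually, a retrieval QA chain embeds the question, fetches the most similar stored documents, and places them in the prompt sent to the LLM. The sketch below imitates those steps without any API calls. It is a toy illustration, not LangChain's or Deep Lake's implementation. The bag-of-words `embed` function stands in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helpers.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, context_docs):
    """Place the retrieved documents into a single prompt string."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "Napoleon Bonaparte was born on 15 August 1769",
    "Louis XIV was born on 5 September 1638",
    "Lady Gaga was born on 28 March 1986",
    "Michael Jeffrey Jordan was born on 17 February 1963",
]

question = "When was Lady Gaga born?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

> In the real chain, the final prompt would be sent to the chat model, which produces the answer shown above.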

#### 8. LangChain Chains

##### 8a. LLMChain

> Create a bot that suggests contextually appropriate replacement words.

LLMChain - one input per prompt

In [31]:
from langchain import PromptTemplate, LLMChain 
from langchain.chat_models import ChatOpenAI

prompt_template=PromptTemplate.from_template("What is a word to replace the following: {word}?")
# prompt_template=PromptTemplate.from_template("Correct the spelling of this Vietnamese word/phrase: {word}?")


llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

# Use the new RunnableSequence approach
llm_chain = prompt_template | llm 

In [39]:
# Use `invoke` for a single input; the dict key must match the template variable
word = 'rare'
response = llm_chain.invoke({"word": word})
print(response)
print(response.content)

content='scarce' additional_kwargs={} response_metadata={'token_usage': {'completion_tokens': 2, 'prompt_tokens': 18, 'total_tokens': 20, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-0807cedc-2ee0-4d8f-8831-d24072a2691f-0'
scarce


In [37]:
# Use `batch` for multiple inputs
input_list = [
    {"word": "analogous"},
    {"word": "intelligence"},
    {"word": "robot"}
]

response = llm_chain.batch(input_list)
for r in response:
    print(r.content)

Similar
Cleverness
android
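
> The `prompt_template | llm` syntax composes two steps so that the output of the first becomes the input of the second. The sketch below shows one way such piping could work in plain Python. `Step` is a hypothetical stand-in, not LangChain's runnable class, and `fake_llm` simply echoes its prompt instead of calling a model.

```python
class Step:
    """Hypothetical minimal runnable; illustrates piping, not LangChain's class."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        """Run the step on a single input."""
        return self.func(value)

    def batch(self, values):
        """Run the step on each input in turn and collect the results."""
        return [self.invoke(v) for v in values]

# First step: fill the template from a dict of variables
prompt_step = Step(
    lambda inputs: "What is a word to replace the following: {word}?".format(**inputs)
)
# Second step: a fake model that echoes the prompt it received
fake_llm = Step(lambda prompt: f"[model reply to: {prompt}]")

chain = prompt_step | fake_llm

print(chain.invoke({"word": "rare"}))
for reply in chain.batch([{"word": "analogous"}, {"word": "robot"}]):
    print(reply)
```

> This toy `batch` runs the inputs one after another; a real implementation may process them differently.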


LLMChain - two inputs per prompt

In [44]:
from langchain import PromptTemplate, LLMChain 
from langchain.chat_models import ChatOpenAI

template = """Looking at the context of '{context}'. \
What is an appropriate word to replace the following: {word}?""" 
prompt_template=PromptTemplate.from_template(template)


llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)
llm_chain = prompt_template | llm 

# Example input data
input_data = {
    "context": "object",
    "word": "fan"
}

# Invoke the chain for the input data
response = llm_chain.invoke(input_data)
print(response.content)

air conditioner
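
> With more than one placeholder, every variable named in the template needs a matching key in the input dict. The sketch below uses only the standard library to list a template's variables and fill them in. `template_variables` and `render` are hypothetical helpers written for illustration and do not reflect how `PromptTemplate` is implemented.

```python
from string import Formatter

template = (
    "Looking at the context of '{context}'. "
    "What is an appropriate word to replace the following: {word}?"
)

def template_variables(text):
    """List the placeholder names that appear in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(text) if name]

def render(text, **values):
    """Fill the template, failing loudly if a variable is missing."""
    missing = [name for name in template_variables(text) if name not in values]
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return text.format(**values)

print(template_variables(template))
print(render(template, context="object", word="fan"))
```

> Calling `render(template, word="fan")` without `context` raises a `KeyError`, which makes a missing input obvious.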


> Another option: pass the prompt string directly to the chain and initialize it with the `.from_string()` function: `LLMChain.from_string(llm=llm, template=template)`

In [45]:
from langchain import LLMChain
from langchain.chat_models import ChatOpenAI

# Define your template directly as a string
template = """Looking at the context of '{context}', what is an appropriate word to replace the following: {word}?"""

# Initialize the language model
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

# Initialize the chain using .from_string()
llm_chain = LLMChain.from_string(llm=llm, template=template)

# Example input data
input_data = {
    "context": "The car was driving fast on the highway.",
    "word": "fast"
}

# Run the chain with the input data
response = llm_chain.run(input_data)

# Print the response
print(response)


quickly


##### 8b. Conversational Chain (Memory)

##### 8c. Sequential Chain

##### 8d. Debug

##### 8e. Custom Chain