<a href="https://www.kaggle.com/code/lorentzyeung/a-langchain-openai-complete-tutorial-02?scriptVersionId=161982580" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **A LangChain + OpenAI Complete Tutorial for Beginner - Lesson 2 Advanced Chatbot with RAG and Vector Databases**

For the full Lesson 2, please visit here:

https://pub.towardsai.net/a-langchain-openai-complete-tutorial-for-beginner-lesson-2-advanced-chatbot-with-rag-and-vector-b6fe524909cb

## Content:

1. **Introduction to Advanced Concepts**
   - Brief overview of Retrieval-Augmented Generation (RAG)
   - The role of vector databases in document management

2. **Setting Up the Environment for Advanced Features**
   - Installing additional libraries for RAG and vector databases

3. **Loading and Preparing Documents**
   - Utilizing loaders for different document types
      - PDF,
      - CSV,
      - Wikipedia
   - Splitting and processing documents for RAG
      - Splitting text
      - Loading and Splitting HTML

4. **Implementing Vector Databases**
   - Selecting the right vector database for your needs
   - Selecting the Right Embeddings Model
   - Load the PDF and Store it in RAG
   - Utilizing a Vector Database in Retrieval-Augmented Generation

5. **Integrating RAG with Vector Databases**
   - Crafting Queries for RAG
   - Querying and Retrieving Information

6. **Conclusion and Further Exploration**
   - Recap of the advanced features implemented
   - Coming Up Next

In Lesson 1, you learned the basics of building chatbot applications using LangChain, OpenAI, and Hugging Face. We started by setting up the environment and choosing the right language model. We then created a simple chatbot and enhanced it with prompt templates for structured interactions. We also covered how to manage chat model memory and introduced advanced features such as Conversation Chains and Summary Memory.

In Lesson 2, we build a more advanced system with RAG and document loaders. With them, your chatbot can tap into external information or knowledge and give better-grounded answers to your questions.

## 1. Introduction to Advanced Concepts
Retrieval-Augmented Generation (RAG) is an approach in AI that combines the power of language models with external knowledge sources. RAG enhances chatbots by letting them pull in information from a variety of documents, making responses more informative and contextually rich. This is particularly useful in commercial settings, e.g. a chatbot that retrieves information from a company's client database.

## 2. Setting Up the Environment for Advanced Features

As in the previous tutorial, we stick to langchain-openai 0.0.5, langchain 0.1.4, and openai 1.10.0. We also pin langchain-community 0.0.16 and add lark 1.1.9 and chromadb 0.4.22 for the vector database.

In [1]:
!pip install --quiet langchain-openai==0.0.5
!pip install --quiet langchain==0.1.4
!pip install --quiet langchain-community==0.0.16

!pip install --quiet openai==1.10.0

!pip install --upgrade --quiet  lark==1.1.9
!pip install --upgrade --quiet  chromadb==0.4.22

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 23.2 which is incompatible.
jupyterlab 4.0.11 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.0.2 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
libpysal 4.9.2 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
momepy 0.7.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
osmnx 1.8.1 requires shapely>=2.0, but you have shapely 1.8.5.post1 which is incompatible.
spopt 0.6.0 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This 

In [2]:
import sys
import subprocess
import json

# Function to get the version of a package using pip
def get_package_version(package_name):
    result = subprocess.run([sys.executable, '-m', 'pip', 'list', '--format', 'json'], capture_output=True, text=True)
    packages = json.loads(result.stdout)
    for package in packages:
        if package["name"].lower() == package_name.lower():
            return package["version"]
    return "Package not found"

# Get versions
langchain_version = get_package_version("langchain")
langchain_openai_version = get_package_version("langchain-openai")
langchain_community_version = get_package_version("langchain-community")

openai_version = get_package_version("openai")
python_version = sys.version

lark_version = get_package_version("lark")
chromadb_version = get_package_version("chromadb")


# Display versions
print(f"Langchain version: {langchain_version}")
print(f"langchain openai version: {langchain_openai_version}")
print(f"langchain-community version: {langchain_community_version}")
print(f"OpenAI version: {openai_version}")
print(f"Python version: {python_version}")
print(f"Lark version: {lark_version}")
print(f"Chromadb version: {chromadb_version}")

Langchain version: 0.1.4
langchain openai version: 0.0.5
langchain-community version: 0.0.16
OpenAI version: 1.10.0
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Lark version: 1.1.9
Chromadb version: 0.4.22
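
As an alternative to calling `pip list` in a subprocess, the standard library can report installed versions directly. The sketch below assumes Python 3.8 or later; the helper name `get_installed_version` is made up for this illustration and is not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def get_installed_version(package_name: str) -> str:
    """Return the installed version of a package, or a fallback message."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "Package not found"


# Same packages as in the cell above
for name in ["langchain", "langchain-openai", "langchain-community",
             "openai", "lark", "chromadb"]:
    print(f"{name}: {get_installed_version(name)}")
```

This avoids parsing JSON output and spawning a separate process, but either approach should report the same versions.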


## 3. Loading and Preparing Documents


First, let's import the necessary libraries from LangChain and set up our chat model (ChatOpenAI) before loading our documents.

In [5]:
import sys
sys.path.append("/kaggle/input/api-py/")
#import api
#openai_api_key = api.openai_api_key

In [6]:
from langchain_openai import ChatOpenAI

# Set your API Key from OpenAI
# your api key should be something like this:
# openai_api_key = 'sk-5Y9BbKFBbte5aghtOXRvT3BlbkFJLwwUvc4hjAf4YUw9KOabc'
openai_api_key = ''

chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai_api_key)

If you haven't yet installed the pypdf library, install it now. pypdf is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.

For more detailed information, you are welcome to visit its official page: https://pypi.org/project/pypdf/. If you are only interested in RAG and creating chatbots, you can skip this.

In [7]:
!pip install pypdf



In [8]:
from pypdf import PdfReader

reader = PdfReader("/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

# Print the first page's text and the total number of pages
print(text)
print(number_of_pages)

T-Entz  (pronounce: t-ants) TYRANT LIZARD T-Entz is a close rela=ve of T-Rex, it lives up to its reputa=on as one of the most admired no-eaters of all =me. Its powerful jaw had 2 teeth, each one about 0.5 inches long, and its smile was about three =mes more powerful than the Minions. Bite marks found on Triceratops fossils show that T-Entz could be playful, and its laughter remained in the fossil of trees. It could use its good sense of humour to melt the invisible walls between animals. It would have been able to scare oﬀ any other toxic animals, so the world was much beRer with it. We do not know whether T-Entz laughed alone or in packs, as no groups of skeletons have been found together.  LENGTH: 40 W DIET: water WHEN IT LIVED: Late Jurassic period FOUND IN: Hong Kong and the back of the Moon  Excerpt From Dic=onary of Dinosaurs By Lorentz Yeung This material is not protected by copyright. 
1


LangChain actually has its own PDF loader and text loader. We showed the pypdf reader first to verify that LangChain's PDF loader does the job just as well.

In [9]:
from langchain_community.document_loaders import PyPDFLoader # this is for pdf
from langchain_community.document_loaders import TextLoader # this is for txt files

reader = PyPDFLoader('/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.pdf')
# reader = TextLoader('t-entz.txt', encoding='utf-8')

# Load the document
data = reader.load()
print(data[0])

page_content='T-Entz  (pronounce: t-ants) TYRANT LIZARD T-Entz is a close rela=ve of T-Rex, it lives up to its reputa=on as one of the most admired no-eaters of all =me. Its powerful jaw had 2 teeth, each one about 0.5 inches long, and its smile was about three =mes more powerful than the Minions. Bite marks found on Triceratops fossils show that T-Entz could be playful, and its laughter remained in the fossil of trees. It could use its good sense of humour to melt the invisible walls between animals. It would have been able to scare oﬀ any other toxic animals, so the world was much beRer with it. We do not know whether T-Entz laughed alone or in packs, as no groups of skeletons have been found together.  LENGTH: 40 W DIET: water WHEN IT LIVED: Late Jurassic period FOUND IN: Hong Kong and the back of the Moon  Excerpt From Dic=onary of Dinosaurs By Lorentz Yeung This material is not protected by copyright. ' metadata={'source': '/kaggle/input/langchain-openai-a-complete-tutorial/t-entz

See the results? They are basically the same. 

How about CSV?

In [10]:
# Import library
from langchain_community.document_loaders.csv_loader import CSVLoader

# Create a document loader for sample.csv
reader = CSVLoader(file_path='/kaggle/input/langchain-openai-a-complete-tutorial/sample.csv')

# Load the document
data = reader.load()
print(data[0])

page_content='Component: Christmas Day\nCoefficient: -6.52339e-13' metadata={'source': '/kaggle/input/langchain-openai-a-complete-tutorial/sample.csv', 'row': 0}
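
The output above suggests that each CSV row becomes one document, with `column: value` lines as the content and the row index in the metadata. The sketch below imitates that behaviour with the standard library only. It is not the `CSVLoader` implementation: `rows_to_documents` is a made-up helper, and the second CSV row is invented for illustration.

```python
import csv
import io

# Hypothetical CSV text: the first row mirrors the output above,
# the second row is invented for this example.
csv_text = "Component,Coefficient\nChristmas Day,-6.52339e-13\nNew Year,0.25\n"


def rows_to_documents(text: str, source: str) -> list:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" line per column, joined with newlines
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": page_content,
            "metadata": {"source": source, "row": row_index},
        })
    return documents


documents = rows_to_documents(csv_text, source="sample.csv")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```

Keeping one document per row means a retriever can later return only the rows relevant to a question instead of the whole file.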


How about webpages? LangChain has many more loaders, covering images, BigQuery, Reddit, subtitles, and more.
Feel free to check out the official page: https://python.langchain.com/docs/integrations/document_loaders/

Our last example is to load the content from Wikipedia.


In [11]:
%pip install --upgrade --quiet  wikipedia

Note: you may need to restart the kernel to use updated packages.


In [12]:
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="dragon ball z", load_max_docs=2).load()
len(docs)

2

In [13]:
docs[0].metadata  # meta-information of the Document

{'title': 'Dragon Ball Z',
 'summary': "Dragon Ball Z is a Japanese anime television series produced by Toei Animation. Part of the Dragon Ball media franchise, it is the sequel to the 1986 Dragon Ball television series and adapts the latter 325 chapters of the original Dragon Ball manga series created by Akira Toriyama. The series aired in Japan on Fuji TV from April 1989 to January 1996, and was later dubbed for broadcast in at least 81 countries worldwide.Dragon Ball Z continues the adventures of Son Goku in his adult life as he and his companions defend the Earth against villains including aliens (Vegeta, Freeza), androids (Cell), and magical creatures (Majin Boo). At the same time, the story parallels the life of Goku's son, Gohan, as well as the development of his rivals, Piccolo and Vegeta.\nDue to the success of the series in the United States, the manga chapters making up its story were initially released by Viz Media under the Dragon Ball Z title. The anime's popularity has a

In [14]:
docs[0].page_content[:400]  # the first 400 characters of the Document's content

'Dragon Ball Z is a Japanese anime television series produced by Toei Animation. Part of the Dragon Ball media franchise, it is the sequel to the 1986 Dragon Ball television series and adapts the latter 325 chapters of the original Dragon Ball manga series created by Akira Toriyama. The series aired in Japan on Fuji TV from April 1989 to January 1996, and was later dubbed for broadcast in at least '

### Splitting and processing documents for RAG

#### Splitting text

In [15]:
# Import library
from langchain.text_splitter import CharacterTextSplitter

doc = 'Lorentz, is the English name of Pui Yeung, who is a data scientist. \n Here is some statistics of it. \n\nIn 1939, unpaid Domestic Duties was the top reported job for people in the United Kingdom named Lorentz.'
chunk_size = 30
chunk_overlap = 3
separator = "\n" # split on single newlines instead of the default "\n\n"

# Create an instance of the splitter class
splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=separator
    )

# Split the document and print the chunks
docs = splitter.split_text(doc)
docs

['Lorentz, is the English name of Pui Yeung, who is a data scientist.',
 'Here is some statistics of it.',
 'In 1939, unpaid Domestic Duties was the top reported job for people in the United Kingdom named Lorentz.']
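
Notice that the chunks above are longer than `chunk_size = 30`. A separator-based splitter only cuts at the separator, so a piece with no separator inside it is kept whole even when it exceeds the limit. The sketch below is a simplified stand-in for that logic, not LangChain's code: `split_on_separator` is a made-up helper and it ignores `chunk_overlap`.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list:
    """Split on a separator, then greedily merge pieces up to chunk_size."""
    # Cut only at the separator, trim whitespace, drop empty pieces
    pieces = [piece.strip() for piece in text.split(separator)]
    pieces = [piece for piece in pieces if piece]

    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Still within the limit: keep merging
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the current chunk
            if current:
                chunks.append(current)
            # An oversized piece is kept whole because it has no separator to cut at
            current = piece
    if current:
        chunks.append(current)
    return chunks


doc = 'Lorentz, is the English name of Pui Yeung, who is a data scientist. \n Here is some statistics of it. \n\nIn 1939, unpaid Domestic Duties was the top reported job for people in the United Kingdom named Lorentz.'

chunks = split_on_separator(doc, separator="\n", chunk_size=30)
for chunk in chunks:
    print(len(chunk), chunk)
```

With a much larger `chunk_size`, the same function merges all three pieces into a single chunk, which is why the limit should be tuned to the size of the natural units in your text.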

#### Loading and Splitting HTML

Let's split an HTML file. We do this exercise because the need comes up very often in the real world.

In [16]:
!pip install --quiet unstructured

In [17]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the HTML document into memory

reader = UnstructuredHTMLLoader("/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.html")
doc = reader.load()

# Define variables
chunk_size = 200
chunk_overlap = 50

# Split the HTML
splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=['.'])

docs = splitter.split_documents(doc) 
docs

[Document(page_content='T-Entz\n\n(tie-ants) TYRANT LIZARD\n\nT-Entz is a close relative of T-Rex, it lives up to its reputation as one of the most admired no-eaters of all time. Its powerful jaw had 2 teeth, each one about 0', metadata={'source': '/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.html'}),
 Document(page_content='. Its powerful jaw had 2 teeth, each one about 0.5 inches long, and its smile was about three times more powerful than the Minions', metadata={'source': '/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.html'}),
 Document(page_content='. Bite marks found on Triceratops fossils show that T-Entz could be playful, and its laughter remained in the fossil of trees', metadata={'source': '/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.html'}),
 Document(page_content='. It could use its good sense of humour to melt the invisible walls between animals. It would have been able to scare off any other toxic animals, so the world was much better

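The recursive splitter follows similar logic: it splits on the separators you specify (here, only the period) and merges the pieces into chunks of up to `chunk_size` characters, with up to `chunk_overlap` characters shared between neighbouring chunks. With its default separators, it tries paragraph breaks first, then line breaks, then spaces, so chunks tend to end at natural breaks.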

## 4. Implementing RAG with Vector Databases for Document Management

To make document chunks easy to find and reuse in a RAG (Retrieval-Augmented Generation) system, we convert each chunk into an embedding: a vector of numbers that captures what the chunk means.

A vector database stores and indexes these embeddings. Chunks with similar meanings end up close to each other, so at query time the database can embed the question, compare it against the stored vectors, and quickly return the most similar chunks for the model to use in its response.
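
To make the idea of comparing vectors concrete, here is a minimal sketch that ranks chunks by cosine similarity, one common similarity measure. The 3-dimensional vectors are invented for illustration (real embedding models produce far more dimensions), `top_k` is a made-up helper, and the vector store used later may rely on a different distance metric by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three chunks (values invented for this example)
chunk_vectors = {
    "T-Entz is a close relative of T-Rex": [0.9, 0.1, 0.0],
    "DIET: water": [0.1, 0.9, 0.1],
    "FOUND IN: Hong Kong": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the question "what is T-entz?"
query_vector = [0.8, 0.2, 0.1]


def top_k(query, vectors, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda text: cosine_similarity(query, vectors[text]),
        reverse=True,
    )
    return ranked[:k]


print(top_k(query_vector, chunk_vectors, k=2))
```

A vector database does the same kind of ranking, but with an index that avoids comparing the query against every stored vector.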


### Load the PDF and Store it in RAG

Let's get our hands dirty now. First, we import the required libraries as usual.

In [18]:
# Prepare the documents and vector database
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

Then we load the PDF, split it, and store the chunks in a list called "doc".

In [19]:
# Load the PDF; each page becomes one Document
loader = PyPDFLoader('/kaggle/input/langchain-openai-a-complete-tutorial/t-entz.pdf')
text = loader.load()
chunk_size = 20
chunk_overlap = 5

# Split the pdf
splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    separators=['.']
    )
doc = splitter.split_documents(text)

Then we store the chunks in our vector database, which is saved in a folder named "embedding/chroma" inside your working directory.

In [20]:
# Use the OpenAI embeddings model to turn each chunk of text into a vector
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
# embedding = OpenAIEmbeddings(openai_api_key=openai_api_key, model='text-embedding-3-small')

persist_directory = "embedding/chroma"

# Create a (still empty) Chroma vector database backed by persist_directory
vectordb = Chroma(
    persist_directory = persist_directory,
    embedding_function = embedding)
vectordb.persist()  # save the current state and any data it holds to the specified directory (embedding/chroma/)
# You will now see a folder named "embedding" in your working directory.

By persisting the state of the vector database, we ensure that the data and settings don't get lost when the program is closed. This way, everything stays the same when you use the program again later. This is really important in situations where getting the database ready or filling it up takes a lot of computer power or time.
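
The benefit of persistence can be shown with a small "load if present, otherwise build and save" pattern. The sketch below uses a JSON file as a stand-in for the database folder; Chroma uses its own on-disk format, and `load_or_build` and `expensive_embedding_step` are made-up names for this illustration.

```python
import json
import os
import tempfile


def load_or_build(path, build_fn):
    """Load cached vectors from disk, or build and save them if missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


build_calls = []


def expensive_embedding_step():
    # Stand-in for a slow or paid embedding call (vector values are invented)
    build_calls.append(1)
    return {"T-Entz is a close relative of T-Rex": [0.9, 0.1, 0.0]}


cache_path = os.path.join(tempfile.mkdtemp(), "embedding", "vectors.json")
first = load_or_build(cache_path, expensive_embedding_step)   # builds and saves
second = load_or_build(cache_path, expensive_embedding_step)  # loads from disk
print(f"Embedding step ran {len(build_calls)} time(s)")
```

The second call reads the saved file instead of recomputing, which is the same saving you get when reopening a persisted vector database.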

In [22]:
# Embed the chunks and store them in the same folder. You will now see a sqlite3 file and other binary files there.
database = Chroma.from_documents(doc, embedding=embedding, persist_directory = persist_directory)
database

<langchain_community.vectorstores.chroma.Chroma at 0x7a3e55b32230>

## 5. Integrating RAG with Vector Databases
### Crafting Queries for RAG

In [23]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Reload the persisted database from disk and list the stored chunks
vectordb2 = Chroma(persist_directory = persist_directory, embedding_function=embedding)
vectordb2.get()['documents']


['T-Entz  (pronounce: t-ants) TYRANT LIZARD T-Entz is a close rela=ve of T-Rex, it lives up to its reputa=on as one of the most admired no-eaters of all =me',
 '. Its powerful jaw had 2 teeth, each one about 0',
 '.5 inches long, and its smile was about three =mes more powerful than the Minions',
 '. Bite marks found on Triceratops fossils show that T-Entz could be playful, and its laughter remained in the fossil of trees',
 '. It could use its good sense of humour to melt the invisible walls between animals',
 '. It would have been able to scare oﬀ any other toxic animals, so the world was much beRer with it',
 '. We do not know whether T-Entz laughed alone or in packs, as no groups of skeletons have been found together',
 '.  LENGTH: 40 W DIET: water WHEN IT LIVED: Late Jurassic period FOUND IN: Hong Kong and the back of the Moon  Excerpt From Dic=onary of Dinosaurs By Lorentz Yeung This material is not protected by copyright',
 '.']

Let's test whether RetrievalQA has connected the LLM to our own data. T-Entz is a term I made up by mixing T-Rex with my nickname, Entz, so the GPT model has nowhere to learn about it except from my t-entz.pdf. If our model knows what T-Entz is, our setup works.

In [24]:
retriever = vectordb2.as_retriever() # search_kwargs={"k": 4}

qa = RetrievalQA.from_chain_type(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai_api_key), 
                                 chain_type="stuff",
                                 retriever=retriever)
# Run the chain on the query provided
query = "what is T-entz?"
qa(query)


{'query': 'what is T-entz?',
 'result': 'T-Entz is a dinosaur, specifically a close relative of the T-Rex. It is known for being a fearsome predator and one of the most admired carnivores of its time.'}
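
With `chain_type="stuff"`, the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that step in plain Python. The prompt wording and the helper `build_stuff_prompt` are assumptions for illustration, not the exact template LangChain sends; the two chunks are copied from the database listing above.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Concatenate all retrieved chunks into one prompt with the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Two of the chunks stored in the vector database
retrieved_chunks = [
    ". It could use its good sense of humour to melt the invisible walls between animals",
    ". We do not know whether T-Entz laughed alone or in packs, as no groups of skeletons have been found together",
]

prompt = build_stuff_prompt("what is T-entz?", retrieved_chunks)
print(prompt)
```

Because everything goes into one prompt, this approach suits a handful of short chunks; many or long chunks can exceed the model's context window.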

## 6. Conclusion
### Recap of the advanced features implemented
In this second tutorial of our series, we dove into the more advanced aspects of chatbot development, exploring the integration of Retrieval-Augmented Generation (RAG) with vector databases. This journey has equipped us with valuable insights and skills essential for creating sophisticated and knowledgeable chatbots.

We encourage you to continue experimenting, exploring, and pushing the boundaries of what you can achieve with LangChain, OpenAI, and Hugging Face. The field of chatbot development is ever-evolving, and staying at the forefront of these advancements will ensure your chatbots are not just functional, but truly groundbreaking.

### Coming Up Next
In the next lesson, we will learn LCEL (LangChain Expression Language). It simplifies the construction of complex chains from basic components. It is particularly useful for integrating external data sources with LLMs through a process like RetrievalQA, facilitating efficient data retrieval and interaction within these models. Stay tuned!