# Personalize AI with LangChain Agents, OpenAI and FAISS

Source code for [blog post](https://romantech.io/personalized-ai-hands-on-with-langchain-openai-pinecone/)

### Install All Dependencies

Install the Unstructured library and all its dependencies for parsing PDF files. These commands are specific to the Google Colab environment. To run the notebook locally on Windows, see the Unstructured [installation instructions](https://unstructured-io.github.io/unstructured/installing.html). Note that this installation takes a few minutes.

In [1]:
!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q unstructured[local-inference]
%pip install -q "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
# %pip install -q --user --upgrade pillow


Selecting previously unselected package poppler-utils.
(Reading database ... 122518 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.86.1-0ubuntu1.1_amd64.deb ...
Unpacking poppler-utils (0.86.1-0ubuntu1.1) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2build2_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2build2) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Setting up poppler-utils (0.86.1-0ubuntu1.1) ...
Setting up tesseract-ocr (4.1.1-2b

In [2]:
# NLTK is also required by Unstructured
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Install LangChain, the FAISS vector database and OpenAI.

In [3]:
!pip install langchain faiss-cpu openai tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.154-py3-none-any.whl (709 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting openai
  Downloading openai-0.27.5-py3-none-any.whl (71 kB)
Collecting tiktoken
  Downloading tiktoken-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting aioh

If you run this notebook in Google Colab, you might run into a `'PIL.Image' has no attribute 'Resampling'` error. If that happens, uninstalling and re-installing the Pillow library should fix the issue. Make sure to restart the runtime after running the code below.

In [None]:
# -y skips the confirmation prompt so the cell does not wait for input
!pip uninstall -y Pillow
!pip install Pillow

### Tokenize the PDF File into Chunks

We are going to parse a PDF file using the Unstructured open-source library. This library provides components for pre-processing text-based documents, such as PDFs, HTML and Word documents, into semantically structured elements. I have created a PDF version of the same LangChain readme file that was used in the [last blog post](https://romantech.io/personalized-ai-hands-on-with-langchain-openai-pinecone/) and placed it on GitHub. We are going to use the wget utility to download the file from GitHub into the local environment for parsing.

Note that the wget utility is available in the Google Colab environment, but not in a Windows environment. If you run this notebook locally in Windows, you can download a wget executable for Windows from the GNUWin32 project at http://gnuwin32.sourceforge.net/packages/wget.htm.

In [1]:
!wget  https://raw.githubusercontent.com/enrtrav/blog/main/personalize_ai/langchain_readme.pdf

--2023-05-01 02:09:14--  https://raw.githubusercontent.com/enrtrav/blog/main/personalize_ai/langchain_readme.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22526 (22K) [application/octet-stream]
Saving to: ‘langchain_readme.pdf’


2023-05-01 02:09:14 (14.4 MB/s) - ‘langchain_readme.pdf’ saved [22526/22526]
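
If wget is not available, the same download can be done from Python with the standard library. The snippet below is a minimal sketch, not part of the original notebook; the helper name `download_file` is made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Download `url` to the local path `dest` and return that path."""
    dest_path = Path(dest)
    # Stream the response to disk instead of reading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Calling `download_file("https://raw.githubusercontent.com/enrtrav/blog/main/personalize_ai/langchain_readme.pdf", "langchain_readme.pdf")` should leave the PDF in the working directory, just as the wget command does.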



Once the document is available in our local environment, we can split it into chunks. This is called partitioning in Unstructured, and it consists of reading in the source document, splitting it into sections, categorizing those sections, and extracting the text associated with those sections. The code below uses the `partition_pdf` function from Unstructured to do this.

In [12]:
from unstructured.partition.pdf import partition_pdf

#parse PDF into chunks and get the count
chunks = partition_pdf("langchain_readme.pdf")
len(chunks)

14

But because we want to use LangChain agents to interact with our PDF and the OpenAI LLM, we are going to use LangChain's `UnstructuredPDFLoader`, which wraps the Unstructured partition functionality. The `mode='elements'` argument tells Unstructured to split the file into semantic elements.

In [2]:
from langchain.document_loaders import UnstructuredPDFLoader
pdf_loader = UnstructuredPDFLoader('langchain_readme.pdf',mode='elements')
chunks = pdf_loader.load()
len(chunks)    

14

Our PDF file got split into 14 chunks of semantically structured data. The output below is a list of the chunks that includes only the semantic category and content of each chunk.

In [3]:
[(chunk.metadata['category'], chunk.page_content) for chunk in chunks]

[('Title', 'LangChain'),
 ('NarrativeText', 'What is this?'),
 ('NarrativeText',
  'Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using these LLMs in isolation is often not enough to create a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.'),
 ('NarrativeText',
  'This library is aimed at assisting in the development of those types of applications. There are six main areas that LangChain is designed to help with.'),
 ('Title', 'LLMs and Prompts:'),
 ('NarrativeText',
  'This includes prompt management, prompt optimization, generic interface for all and common utilities for working with LLMs.'),
 ('Title', 'Chains:'),
 ('NarrativeText',
  'Chains go beyond just a single LLM call, and are sequences of calls (whether to an or a different utility). LangChain provides a standard interface for chains, lots of integration
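
To get a quick feel for how the document was categorized, we can tally the elements by category. The sketch below hard-codes the categories of the first eight elements shown in the output above so that it runs on its own; in the notebook, the list would instead be built from `chunk.metadata['category']` for each chunk.

```python
from collections import Counter

# Categories of the first eight elements, in the order shown in the output above.
# In the notebook these would come from chunk.metadata['category'].
categories = [
    "Title", "NarrativeText", "NarrativeText", "NarrativeText",
    "Title", "NarrativeText", "Title", "NarrativeText",
]

# Count how many elements fall into each semantic category.
category_counts = Counter(categories)
print(category_counts.most_common())
```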

### Generate Embeddings for the Chunks

We use OpenAI to generate embeddings, and LangChain's `OpenAIEmbeddings` class provides a wrapper around the OpenAI embedding model. To use this class, you must first get an API key from OpenAI by going to [platform.openai.com](https://platform.openai.com/account/api-keys) and then paste it below. Note that API keys should not be hardcoded; they should be loaded from the environment instead. In this example we simply pass the API key as a parameter to the constructor.

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings

OPENAI_API_KEY = ""
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
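
One way to avoid hardcoding the key is to read it from an environment variable and fall back to a hidden prompt. This is a minimal sketch rather than part of the original notebook; the helper name `get_openai_api_key` is an assumption made for illustration.

```python
import os
from getpass import getpass


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value so the key does not end up in the notebook output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    return key
```

With this in place, the cell above could use `OPENAI_API_KEY = get_openai_api_key()` instead of an empty string literal.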

### Store the Embeddings in a FAISS Index

For this example, we use FAISS (Facebook AI Similarity Search) to store the embeddings. FAISS is an open-source similarity search library developed by Facebook AI and written in C++ with bindings for Python. FAISS is not a vector database that can permanently store embeddings, but rather an in-memory index of embeddings. However, we can save the index to disk and reload it to make it persistent across sessions.

In [5]:
import os
from langchain.vectorstores import FAISS

idx_name= 'faiss_idx'
if os.path.exists(idx_name):
    faiss_idx = FAISS.load_local(folder_path=idx_name, embeddings=embeddings)
    print("Index loaded from disk")    
else:
    faiss_idx = FAISS.from_documents(documents=chunks, embedding=embeddings)
    faiss_idx.save_local(folder_path=idx_name)
    print("New index created and saved to disk")

New index created and saved to disk


We can confirm that our chunks were stored in the index by checking the size of the index's internal document store, which should match the number of chunks generated.

In [6]:
# Avoid naming the variable `dict`, which would shadow the Python built-in
docstore_dict = faiss_idx.docstore._dict
len(docstore_dict)

14

Finally, we expose the index as a `retriever`, which is a generic LangChain interface that makes it easy to combine a vector store with language models. The `k` argument specifies the maximum number of results to return, which in our example is 3, meaning we only want the top three most relevant text chunks.

In [66]:
retriever = faiss_idx.as_retriever(search_kwargs={"k": 3})
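
Conceptually, a top-`k` lookup compares the query embedding against every stored embedding and keeps the `k` closest ones. The pure-Python sketch below illustrates that idea, assuming Euclidean (L2) distance and an exhaustive scan. The vectors, chunk labels and function names are made up for illustration, and this is not how FAISS is implemented internally.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Made-up 3-dimensional "embeddings"; real OpenAI embeddings have far more dimensions.
indexed = [
    ("chunk about chains", [0.9, 0.1, 0.0]),
    ("chunk about agents", [0.1, 0.9, 0.0]),
    ("chunk about memory", [0.0, 0.1, 0.9]),
    ("chunk about prompts", [0.8, 0.2, 0.1]),
]

results = top_k([1.0, 0.0, 0.0], indexed, k=3)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

With `k=3`, the most distant of the four toy chunks (the one about memory) is left out of the results.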

### Create a LangChain Agent

Our agent is provided with a tool that it can use to answer questions about LangChain. The tool uses a LangChain `RetrievalQA` chain for question-answering. The QA chain automatically generates the embeddings for the user query, uses the retriever interface of the FAISS index to find the most relevant chunks of text from the index, and injects the chunks into the prompt to provide context to the LLM for answering the question.

The agent will also need to answer general questions, so we provide it with a second tool backed by an LLM. The agent will decide which tool to use based on the question.

We start by importing the necessary modules.

In [47]:
from langchain import OpenAI
from langchain.chains import RetrievalQA, LLMChain
from langchain.agents import Tool, AgentType, initialize_agent
from langchain.prompts import PromptTemplate

Then we create an OpenAI LLM, to be used as the underlying language model for the agent and all its tools, and a LangChain `RetrievalQA` chain to answer questions based on the PDF file. To create the QA chain, we specify `openai_llm` as the LLM that answers questions and the FAISS retriever that performs the similarity search on the user query. We also supply a chain type, which defines how context (the relevant chunks) is injected into the prompt. Here we use the `stuff` chain type, which means all the relevant chunks are placed into a single prompt without regard for size. We use `stuff` in this example because we limit the number of injected chunks to 3 and our chunks are small, so we know we will not exceed the model's token limit. A real application requires more thought to ensure it does not exceed the maximum prompt length.

In [67]:
openai_llm = OpenAI(
    openai_api_key=OPENAI_API_KEY,
    temperature=0
)
qa_llm_chain = RetrievalQA.from_chain_type(llm=openai_llm, chain_type="stuff", retriever=retriever)
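
To make the idea of "stuffing" concrete, here is a minimal sketch of how retrieved chunks could be concatenated into one prompt. The function name `stuff_prompt`, the template wording and the character-based size guard are all assumptions for illustration; LangChain's built-in prompt differs, and a real size check would count tokens rather than characters.

```python
def stuff_prompt(question, chunks, max_chars=8000):
    """Build one prompt by concatenating every retrieved chunk ahead of the question."""
    context = "\n\n".join(chunks)
    # Hypothetical template; the wording of LangChain's built-in prompt may differ.
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # Crude guard: character count stands in for a real token count here.
    if len(prompt) > max_chars:
        raise ValueError("stuffed prompt is too long; retrieve fewer or smaller chunks")
    return prompt


# Sample chunks written for this sketch, loosely based on the readme text.
sample_chunks = [
    "Chains go beyond just a single LLM call.",
    "LangChain provides a standard interface for chains.",
]
print(stuff_prompt("What are chains?", sample_chunks))
```

Because every chunk goes into the same prompt, the prompt grows with both `k` and the chunk size, which is why this approach only suits small contexts.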

Next, we wrap the QA retrieval chain in a tool that the agent can use to answer questions about LangChain. We give the tool a description, which the agent's LLM reads to determine when to use the tool.

In [82]:
langchain_tool = Tool(
    name='LangChain',
    func=qa_llm_chain.run,
    description='You must use this tool to answer questions about LangChain'
)

And then we create the second tool to answer general questions using `LLMChain`. The LLM chain is provided a basic prompt template and an underlying model.

In [None]:
prompt = PromptTemplate(
    input_variables=["query"],
    template="{query}"
)

llm_chain = LLMChain(llm=openai_llm, prompt=prompt)

llm_tool = Tool(
    name='Language Model',
    func=llm_chain.run,
    description='Use this tool for general purpose queries not about LangChain'
)

Now that we have all the components in place, we create our agent, giving it a type, a list of tools to use, and an LLM that will be used to decide which tool to use. Our agent type is zero-shot ReAct, meaning the agent chooses a tool based solely on each tool's description and has no memory of previous interactions. It may still call the LLM several times while working through a single question, as the verbose output later in this notebook shows.

**It is important to note that in this example we are using the same LLM (`openai_llm`) for the agent and for both tools. The LangChain tool's RetrievalQA chain uses its LLM to answer questions about LangChain, providing it context from the index. The Language Model tool uses its LLM to answer general purpose questions. And the agent uses its LLM to decide which tool to use to answer a question. However, in practice agents will often use tools backed by domain-specific LLMs or no LLMs at all.**

In [83]:
agent = initialize_agent(
	agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
	tools=[langchain_tool, llm_tool],
	llm=openai_llm,
	verbose=True,
)
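
The thought, action and observation cycle that such an agent runs can be sketched in plain Python. The code below is a simplified illustration under stated assumptions, not LangChain's actual implementation: `run_agent`, `fake_llm` and the two regular expressions are hypothetical, and a scripted function stands in for the real LLM so that the example runs without an API key.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def run_agent(llm, tools, question, max_steps=5):
    """Alternate between LLM output and tool calls until a final answer appears."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(scratchpad)
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            raise ValueError(f"could not parse LLM output: {output!r}")
        tool_name, tool_input = action.group(1).strip(), action.group(2).strip()
        observation = tools[tool_name](tool_input)
        # Feed the tool result back so the next LLM call can build on it.
        scratchpad += f"{output}\nObservation: {observation}\nThought:"
    raise RuntimeError("agent stopped without reaching a final answer")


# Scripted stand-in for a real LLM: it replays canned responses in order.
scripted = iter([
    "I should look this up.\nAction: LangChain\nAction Input: What are chains?",
    "I now know the final answer.\nFinal Answer: Chains are sequences of calls.",
])


def fake_llm(prompt):
    return next(scripted)


tools = {"LangChain": lambda query: "Chains are sequences of calls."}

answer = run_agent(fake_llm, tools, "What are chains?")
print(answer)
```

Each pass through the loop is one LLM call, so a single question can involve several calls even though the agent keeps no memory between questions.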

### Evaluate the Agent's Performance

Let's take our agent for a spin and ask it a couple of questions. First, we ask the agent a general question and make sure it uses the Language Model tool to provide the answer. Because we used `verbose=True` when creating the agent, you can see the detailed cyclical process the agent used to arrive at the answer.

In [52]:
agent.run("What is Artificial Intelligence?")


> Entering new AgentExecutor chain...
 Artificial Intelligence is a broad field of study.
Thought: I should use a tool to help me answer this question.
Action: Language Model
Action Input: What is Artificial Intelligence?
Observation: 

Artificial Intelligence (AI) is a branch of computer science that focuses on creating intelligent machines that can think and act like humans. AI systems are designed to learn from their environment and experiences, and to use that knowledge to solve problems and make decisions. AI can be used to automate tasks, improve decision-making, and create new products and services.
Thought: I now know the final answer.
Final Answer: Artificial Intelligence is a branch of computer science that focuses on creating intelligent machines that can think and act like humans. AI systems are designed to learn from their environment and experiences, and to use that knowledge to solve problems and make decisions. AI

'Artificial Intelligence is a branch of computer science that focuses on creating intelligent machines that can think and act like humans. AI systems are designed to learn from their environment and experiences, and to use that knowledge to solve problems and make decisions. AI can be used to automate tasks, improve decision-making, and create new products and services.'

And now let's ask it a second question about LangChain. As you can see below, at first the agent used the Language Model tool to determine what an LLM is, and although the answer it got is correct, it is completely unrelated to our topic of language models. Fortunately, it followed up with the LangChain tool, which queried the PDF index for context, and was able to provide a fairly decent answer based on the minimal LangChain PDF documentation that it had available.

In [85]:
agent.run("How can LangChain harness the real power of LLMs?")


> Entering new AgentExecutor chain...
 I need to understand what LLMs are and how LangChain can use them.
Action: Language Model
Action Input: What is an LLM?
Observation: 

LLM stands for "Master of Laws," and it is a postgraduate degree in law. It is typically a one-year program that is designed for students who already have a law degree and want to specialize in a particular area of law. LLM programs are offered by many law schools around the world.
Thought: I now understand what LLMs are and how LangChain can use them.
Action: LangChain
Action Input: How can LangChain harness the real power of LLMs?
Observation:  LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications, allowing developers to combine LLMs with other sources of computation or knowledge to create powerful applications.
Thought: I now know the final 

'LangChain can harness the real power of LLMs by providing a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications, allowing developers to combine LLMs with other sources of computation or knowledge to create powerful applications.'