# **Demo: MultiPDF QA Retriever with FAISS and LangChain**

In this demo, you will use LangChain to build a retriever over multiple PDFs with FAISS. The demo runs on a set of recent generative AI research papers in PDF form. You will load and split the documents, embed the chunks, store them in a FAISS vector database, create a retriever, and query it to find the passages most relevant to a question.

## **Steps to Perform:**

*   Step 1: Importing the Necessary Libraries
*   Step 2: Loading and Splitting
*   Step 3: Loading the OpenAI Embeddings
*   Step 4: Creating and Loading the Database
*   Step 5: Creating and Using the Retriever
*   Step 6: Passing the Query



### **Step 1: Importing the Necessary Libraries**

In [1]:
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader, PyPDFLoader, DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
import os
import openai


### **Step 2: Loading and Splitting**


*   Create a directory named `Gen_AI_Papers` and place the PDF files in it.
*   Load the PDF documents in the directory.
*   Split the documents into smaller chunks using the **RecursiveCharacterTextSplitter**.
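To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification, not the algorithm **RecursiveCharacterTextSplitter** uses (which also tries to break on separators such as paragraphs and sentences), and `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.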

In [3]:
# Loading the documents
doc_loader = DirectoryLoader('Gen_AI_Papers', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = doc_loader.load()

# Splitting the documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)




In [5]:
texts

[Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 0}, page_content='Advanced Generative AI : Building LLM Application   \nLab Guide'),
 Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 1}, page_content='Note: The screenshots are only for your reference. Your LMS may look \ndifferent depending on  the course content.   \n  \nThis section will guide you to:   \n● Use labs for executing all the demos included in this course  \n  \nStep 1:  Log into the Simplilearn LMS . Click on Practice Labs , and then click on \nLaunch Lab  \n \n \n \n \n \n \n \n \n \nClick on Practice Labs  Click  on Launch Lab  to \nlaunch  it'),
 Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 2}, page_content='Step 2:  A small screen will pop up in the middle of your screen with essential \ninformation  about the lab ; click on the Launch Lab  button . \n \nStep 3:  End your lab by clicking the End Lab  button'),
 Document(metadata={'source': 'Gen_AI_Papers/

In [6]:
# Re-splitting the chunks with a different chunk size and overlap
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(texts)
print(len(split_texts))  # Prints the number of chunks the PDFs have been split into


20


In [7]:
split_texts

[Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 0}, page_content='Advanced Generative AI : Building LLM Application   \nLab Guide'),
 Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 1}, page_content='Note: The screenshots are only for your reference. Your LMS may look \ndifferent depending on  the course content.   \n  \nThis section will guide you to:   \n● Use labs for executing all the demos included in this course  \n  \nStep 1:  Log into the Simplilearn LMS . Click on Practice Labs , and then click on \nLaunch Lab  \n \n \n \n \n \n \n \n \n \nClick on Practice Labs  Click  on Launch Lab  to \nlaunch  it'),
 Document(metadata={'source': 'Gen_AI_Papers/Lab_Guide.pdf', 'page': 2}, page_content='Step 2:  A small screen will pop up in the middle of your screen with essential \ninformation  about the lab ; click on the Launch Lab  button . \n \nStep 3:  End your lab by clicking the End Lab  button'),
 Document(metadata={'source': 'Gen_AI_Papers/

### **Step 3: Loading the OpenAI Embeddings**

In [4]:
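# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
# (or from the openai_api_key named parameter)
openai_embeddings = OpenAIEmbeddings()

If the key is not set, the cell fails with the following error:

ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)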
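The validation error above is raised when `OPENAI_API_KEY` is missing from the environment. One way to catch this earlier, with a clearer message, is to check for the key before constructing the embeddings object. The sketch below assumes the key is supplied through an environment variable; `get_openai_api_key` is a hypothetical helper, not part of LangChain or the OpenAI SDK, and the key shown is a placeholder.

```python
import os
from typing import Mapping


def get_openai_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI API key from the given environment mapping, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it in your shell or set "
            "os.environ['OPENAI_API_KEY'] before creating OpenAIEmbeddings()."
        )
    return key


# Demonstration with a stand-in mapping so the example does not depend on a real key
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(get_openai_api_key(fake_env))
```

In the notebook, calling `get_openai_api_key()` with no arguments before Step 3 would check the real environment.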

### **Step 4: Creating and Loading the Database**

*   Create a database to store the embedded text.
*   Load the database to bring it back into memory from the disk.
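The save-then-load round trip described in the steps above can be pictured with a toy store that keeps chunk texts next to their vectors. This is only a sketch of the idea: it writes a JSON file, which is not the on-disk format FAISS uses, and `save_index` and `load_index` are hypothetical names introduced for this illustration.

```python
import json
import tempfile
from pathlib import Path


def save_index(path, texts, vectors):
    """Write the chunk texts and their vectors to a single JSON file."""
    Path(path).write_text(json.dumps({"texts": texts, "vectors": vectors}))


def load_index(path):
    """Read the chunk texts and vectors back from disk."""
    data = json.loads(Path(path).read_text())
    return data["texts"], data["vectors"]


texts_in = ["chunk one", "chunk two"]
vectors_in = [[0.1, 0.2], [0.3, 0.4]]  # made-up 2-dimensional embeddings

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "toy_index.json"
    save_index(index_path, texts_in, vectors_in)
    texts_out, vectors_out = load_index(index_path)

print(texts_out, vectors_out)
```

Persisting the index means the documents do not have to be embedded again in a later session, which matters when every embedding call is a paid API request.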



In [7]:
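from langchain.vectorstores import FAISS

# Create embeddings for texts (shown for illustration only: FAISS.from_texts
# below computes the embeddings itself, so this call can be skipped)
text_embeddings = openai_embeddings.embed_documents([text.page_content for text in texts])

# Creating the FAISS database from the chunk texts
faiss_index = FAISS.from_texts([text.page_content for text in texts], openai_embeddings)

# Saving the FAISS index to disk
faiss_index.save_local('faiss_index')

# Loading the FAISS index back into memory
# Note: newer LangChain releases may also require allow_dangerous_deserialization=True here
faiss_index = FAISS.load_local('faiss_index', openai_embeddings)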


### **Step 5: Creating and Using the Retriever**

*   Create a retriever using the vector database.
*   Use the retriever to get relevant documents for a specific query.
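Conceptually, a retriever embeds the query, scores every stored chunk against it, and returns the best matches. The sketch below shows that loop with a toy word-count embedding and cosine similarity. Both are stand-ins chosen for illustration: the demo itself uses OpenAI embeddings with a FAISS index, and the functions `embed`, `cosine`, and `retrieve` as well as the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "toolformer teaches language models to call external tools",
    "faiss stores dense vectors for similarity search",
    "pdf loaders extract text page by page",
]
top = retrieve("what is toolformer", chunks, k=1)
print(top)
```

A real vector index avoids scoring every chunk one by one, but the contract is the same: a query goes in, and a ranked list of chunks comes out.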



In [8]:
# Creating retriever
retriever = faiss_index.as_retriever()

# Using retriever
docs = retriever.get_relevant_documents("What is toolformer?")


### **Step 6: Passing the Query**

*   Pass the query to the vector database.
*   Print the content of the most relevant document.



In [9]:
query = "A fundamental limitation of HMMs"
docs = faiss_index.similarity_search(query)
print(docs[0].page_content)


with size, measured by the number of trainable parameters: f or example, Wei et al. (2022b ) demonstrate
that LLMs become able to perform some BIG-bench tasks3via few-shot prompting once a certain scale is
attained. Although a recent line of work yielded smaller LMs that retain some capabilities from their largest
counterpart ( Hoﬀmann et al. ,2022), the size and need for data of LLMs can be impractical for tra ining
but also maintenance: continual learning for large models r emains an open research question ( Scialom et al. ,
2022). Other limitations of LLMs are discussed by Goldberg (2023) in the context of ChatGPT , a chatbot
built upon GPT3 .
We argue these issues stem from a fundamental defect of LLMs: they are generally trained to perform
statistical language modeling given (i) a single parametri c model and (ii) a limited context, typically the n
previous or surrounding tokens. While nhas been growing in recent years thanks to software and hardw are


### **Conclusion**

In this demo, you built a multi-PDF retriever with LangChain and FAISS. You loaded and split the documents, embedded the chunks, stored them in a vector database, created a retriever, and queried it for relevant passages. You can apply the same steps to build retrieval over your own document collections.