In [None]:
### Installs ###
# %pip install langchain openai pypdf faiss-cpu python-dotenv tiktoken

# Documentation

__An OpenAI API key is required to run this notebook.__

## Environment Setup
- Load environment variables using `dotenv`.
- Import necessary libraries like `os`, `glob`, and various modules from `langchain`.
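
A minimal sketch of a fail-fast check for the API key, assuming the key is stored under `OPENAI_API_KEY` (the variable the OpenAI client reads by default). The helper `require_env` is hypothetical and is not part of this notebook or of `dotenv`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would surface a missing key before any embedding request is made.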

## Finding PDF Files
- Function `find_pdf_files(directory)` lists all PDF files in a specified directory and subdirectories.
- Uses `glob.glob` for searching and sorts the absolute paths of the PDF files.
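
For comparison, the same recursive search can be sketched with `pathlib`. This is an alternative under the assumption that only files ending in lowercase `.pdf` are wanted; `find_pdf_files_pathlib` is a hypothetical name, not a function defined in this notebook.

```python
from pathlib import Path


def find_pdf_files_pathlib(directory):
    """Return sorted absolute paths (as strings) of all PDFs under `directory`.

    Hypothetical pathlib variant; it does not change the working directory.
    """
    root = Path(directory).resolve()
    # rglob walks the directory tree recursively
    return sorted(str(p) for p in root.rglob("*.pdf") if p.is_file())
```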

## Loading and Splitting Text
- `load_and_split(path)` loads a PDF and splits its text into chunks.
- Utilizes `PyPDFLoader` for reading PDF content.
- `CharacterTextSplitter` splits the text on newlines (`\n`) and merges the pieces into chunks of up to 1000 characters, with no overlap between chunks.
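
The splitting step can be illustrated with a simplified, pure-Python sketch: split on the separator, then greedily merge pieces until the next one would exceed the chunk size. This is an approximation for illustration, not LangChain's actual implementation, and `split_on_separator` is a hypothetical name. In this sketch a single line longer than `chunk_size` is kept whole, so a chunk can exceed the limit.

```python
def split_on_separator(text, chunk_size=1000, separator="\n"):
    """Greedily merge separator-delimited pieces into chunks with no overlap.

    Simplified illustration only; the real splitter may differ in details.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty lines
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it starts a new (possibly oversized) chunk
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```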

## Setup PDF Files and Embeddings
- Defines a list of file paths to various PDF documents.
- Creates an embedding model using `OpenAIEmbeddings`.

## Creating and Merging Vector Stores
- For each PDF, the text is loaded, split, embedded with `OpenAIEmbeddings`, and indexed in its own `FAISS` vector store.
- Merges these vector stores into one primary store.
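
Conceptually, merging appends the vectors and documents of each additional store to the first one, which is modified in place. The toy class below is a hypothetical stand-in that only mirrors that idea; it is not the FAISS API and performs no indexing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory store: parallel lists of vectors and texts."""
    vectors: list = field(default_factory=list)
    texts: list = field(default_factory=list)

    def add(self, vector, text):
        self.vectors.append(list(vector))
        self.texts.append(text)

    def merge_from(self, other):
        # Append the other store's entries; `other` itself is left untouched
        self.vectors.extend(other.vectors)
        self.texts.extend(other.texts)


def merge_all(stores):
    """Merge every store into the first one and return it."""
    if not stores:
        raise ValueError("need at least one store to merge")
    primary = stores[0]
    for other in stores[1:]:
        primary.merge_from(other)
    return primary
```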

## Saving and Loading Vector Store
- Saves the merged vector store locally.
- Reloads it using `FAISS.load_local`.

## Demonstration with QA System
- Sets up a QA system using `RetrievalQA` with an `OpenAI` language model and the loaded vector store.
- Runs example queries to demonstrate information retrieval from the processed PDF documents.
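
The `stuff` chain type places all retrieved chunks into a single prompt. A rough sketch of that flow follows, under two assumptions: similarity is measured with cosine similarity (the FAISS index used here may rank by a different metric), and the prompt wording is hypothetical rather than LangChain's template. The names `cosine`, `top_k`, and `build_stuff_prompt` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, vectors, texts, k=4):
    """Return the k texts whose vectors are most similar to the query."""
    scored = sorted(
        zip(vectors, texts),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    return [text for _, text in scored[:k]]


def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one prompt (hypothetical wording)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Because every retrieved chunk is included verbatim, the combined chunk size has to stay within the model's context window.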

This setup establishes a document retrieval system using embeddings and a QA model, capable of answering questions based on content from the loaded PDF documents.


In [1]:
from dotenv import load_dotenv
load_dotenv()

import os
import glob

In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Setup

In [3]:
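def find_pdf_files(directory):
    # Build the search pattern from `directory` instead of calling os.chdir(),
    # so the notebook's working directory is left unchanged
    pattern = os.path.join(directory, '**', '*.pdf')

    pdf_files = []

    # recursive=True lets '**' match nested subdirectories
    for file in glob.glob(pattern, recursive=True):
        absolute_path = os.path.abspath(file)
        pdf_files.append(absolute_path)

    pdf_files.sort()

    return pdf_files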

In [4]:
def load_and_split(path):
    loader = PyPDFLoader(file_path=path)
    documents = loader.load()

    # A chunk size of 1000 characters balances granularity against manageability:
    # it is large enough to hold meaningful units of text (sentences or paragraphs),
    # but small enough that several retrieved chunks fit into a single LLM prompt.
    # With no overlap, each character of the document is embedded only once.
    text_splitter = CharacterTextSplitter(
            chunk_size=1000, chunk_overlap=0, separator="\n"
        )
    return text_splitter.split_documents(documents=documents)

In [5]:
pdf_files = [
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/gray-city.pdf",
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/irl.pdf",
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/singing-peddler.pdf",
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/memoirs.pdf",
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/small-little-circle.pdf",
    "/Users/aditkapoor/Local Documents/Work/Cognizant/langchain-proj/assets/data/veracious.pdf",
]

In [6]:
embeddings = OpenAIEmbeddings()
vectorstores = []

for file_path in pdf_files:
    docs = load_and_split(file_path)
    
    # I chose the FAISS vectorstore because it is the fastest to work with locally, and I often
    # needed to rebuild the vectorstore after changing the code. With Pinecone, changes were harder
    # because I had to reset the vectorstore on the website instead of just deleting the locally
    # saved store (which is how FAISS manages it)
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstores.append(vectorstore)

In [None]:
### DEBUG CELL ###
# vectorstores[0].docstore._dict

In [7]:
for i in range(1, len(vectorstores)):
    vectorstores[0].merge_from(vectorstores[i])
    
vectorstores[0].save_local("faiss_index_project")

# Demonstration

In [8]:
embeddings = OpenAIEmbeddings()
new_vectorstore = FAISS.load_local("faiss_index_project", embeddings)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type='stuff', retriever=new_vectorstore.as_retriever())

In [9]:
result = qa.run("Who did Clyde yell at about giving third person narrators too much information? If you don't know, say you don't know.")
print(result)

 Clyde yelled at his mom.


In [10]:
result = qa.run("In the memoir, why was the father sad after receiving the letter?")
print(result)

 The father was sad after receiving the letter because it was from his son, who had been shot two weeks earlier. He was writing to tell his parents he was coming home, but he wouldn't make it back before the letter arrived.


In [11]:
result = qa.run("In the crowded streets of Delhi, what stories do the patches on people's clothing tell, as described in the poem? Reflect on the diversity and history embedded in these patches.")
print(result)

 The patches on people's clothing in the crowded streets of Delhi tell stories of the hardships and struggles of life in India. They symbolize the diversity of the city and represent different times in history, from the air-conditioned car safe from the scalding ground to the hand-me-downs that are too small. They also emphasize the contrast between the huge mansions and the slums, highlighting the disparity between the wealthy and the poor.


In [12]:
result = qa.run("How does \"IRL\" portray the world post-internet and modern relationships?")
print(result)



IRL portrays the world post-internet and modern relationships by showing how people can build relationships and connections through the internet and then continue those relationships in real life. The story follows Sahil and Magnus who build a friendship online and eventually meet IRL (in real life) to continue their connection.


In [13]:
result = qa.run("In \"Veracious\", why is the author so mad at his Uncle despite his untimely death? Give me around 5 sentences.")
print(result)

 In "Veracious", the author is mad at his Uncle despite his untimely death because his death has caused so much pain for his family. His death has left his grandmother alone in a large house and has reduced his mother to a vulnerable state. In addition, his death came only two years after the passing of his older brother, making the pain even greater. Furthermore, his mother had recently told the author that Uncle Scott was doing better, sobering up and planning to work for Uber. The author resents his Uncle for leaving his family in such a difficult and painful situation, feeling that he was being selfish in death.


In [14]:
result = qa.run("Describe the Author's relationship with Maggi. Why was he so attached to her?")
print(result)

 The author was very attached to Maggi because she provided him comfort during a difficult time. He was grieving the death of one of his students and felt responsible for her death. Maggi was understanding and comforting, and the author felt like he could talk to her in a way he couldn't talk to anyone else.


# Testing

In [None]:
prompt = ""  # Fill in with your own prompt
result = qa.run(prompt)
print(result)