# Chat with PDFs using ChatGPT & OpenAI GPT API

This is a supplementary Python notebook for the blog post https://nanonets.com/blog/chat-with-pdfs-using-chatgpt-and-openai-gpt-api/. It is a detailed code tutorial on how to chat with all kinds of PDF files using the OpenAI GPT API, and how to use it for PDF automations and chatbots.

* We will chat with PDFs using just a few lines of Python code.
* We will chat with large PDF files using ChatGPT API and LangChain.
* We will build an automation to sort PDF files based on their contents.
* We will go through examples of building more automations for tasks involving PDFs.

## Chat with PDF using ChatGPT API

Let us now chat with our first PDF using OpenAI's GPT models.

We are going to converse with a resume PDF to demonstrate this.

#### Step 1 - Read the PDF File

We follow different approaches based on whether the PDF is scanned or digital.
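One way to choose between the two approaches automatically is to try digital extraction first and fall back to OCR when very little text comes back. The sketch below is only a heuristic under stated assumptions: the helper name `looks_scanned` and the threshold of 50 characters per page are hypothetical and not part of the original notebook.

```python
def looks_scanned(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is scanned from the text extracted per page.

    page_texts: one string (or None) per page, e.g. collected from
    PyPDF2's page.extract_text(). The threshold is an illustrative guess.
    """
    if not page_texts:
        return True  # nothing was extracted at all: treat the file as scanned
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# Made-up page texts standing in for real extraction results
digital_pages = ["SUMMARY OF QUALIFICATIONS " * 10, "EDUCATION " * 20]
scanned_pages = ["", "  \n", None]

print(looks_scanned(digital_pages))  # False: plenty of embedded text
print(looks_scanned(scanned_pages))  # True: pages carry no usable text
```

If the check returns True, Approach 2 (OCR) is the one to use; otherwise Approach 1 is usually enough.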

##### Approach 1 : Read Digital PDF

In [1]:
!pipenv install PyPDF2
!pipenv install pdf2image
# PIL is published on PyPI as Pillow; getpass ships with Python and needs no install
!pipenv install Pillow
!pipenv install pytesseract
# This notebook uses the pre-1.0 openai interface (openai.ChatCompletion), hence the pin
!pipenv install "openai<1.0"
# pdf2image also needs the Poppler utilities, and pytesseract the Tesseract binary, installed on the system


In [2]:
import PyPDF2

pdf_file_obj = open('resume-sample.pdf', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
num_pages = len(pdf_reader.pages)
detected_text = ''

for page_num in range(num_pages):
    page_obj = pdf_reader.pages[page_num]
    detected_text += page_obj.extract_text() + '\n\n'

pdf_file_obj.close()

print(detected_text)

FUNCTIONAL  (EXPERIENCED)   
IM A . SAMPLE I  
1234 North 55 Street  
Bellevue, Nebraska 68005  
(402) 292 -2345  
imasample1@xxx.com  
 
SUMMARY OF QUALIFICATIONS  
Exceptionally well organized and resourceful Professional  with more than six years experience and a 
solid academic background in accounting and financial management; excellent analytical and problem 
solving skills; able to handle multiple projects while pr oducing high quality work in a fast -paced, 
deadline -oriented environment.  
 
EDUCATION  
Bachelor of Science , Bellevue University, Bellevue, NE (In Progress)  
 Major:  Accounting  Minor:  Computer Information Systems  
 Expected Graduation Date:  January, 20xx  GPA  to date:  3.95/4.00  
 
PROFESSIONAL ACCOMPLISHMENTS  
Accounting and Financial Management  
 Developed and maintained accounting records for up to fifty bank accounts.  
 Formulated monthly and year -end financial statements and generated various payroll records, 
including federal and state payro

##### Approach 2 : Read Scanned PDF

In [None]:
import pdf2image
from PIL import Image
import pytesseract

# Convert each PDF page to an image, then OCR every page and join the results
images = pdf2image.convert_from_path('resume-sample.pdf')
detected_text = ''
for page in images:
    detected_text += pytesseract.image_to_string(page) + '\n\n'

print(detected_text)

#### Step 2 - First Chat with PDF

Let us ask the LLM to suggest jobs that this person would be suited for, based on their resume.

First, we import the os and openai libraries and define our OpenAI API key.

In [4]:
# os and getpass are part of the Python standard library, so only openai needs installing.
# The pin reflects that this notebook uses the pre-1.0 interface (openai.ChatCompletion).
!pipenv install "openai<1.0"

In [3]:
import os
import openai
from getpass import getpass 

openai.api_key = getpass()
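For repeated runs, it can be convenient to read the key from an environment variable and only prompt when it is missing. The following is a minimal sketch, assuming the variable is called `OPENAI_API_KEY`; the helper `load_api_key` and its `prompt_fn` parameter are hypothetical names introduced here for illustration.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive, non-echoing prompt
        key = prompt_fn(f"{env_var}: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key

# Usage in the notebook (not executed here):
# openai.api_key = load_api_key()
```

Either way, the key never has to be written into the notebook itself.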

Choosing the right model depends on your use case and requirements. We recommend going through the list of available models and learning the pros and cons of each. You can access the list as follows -

In [None]:
import pandas as pd
models = openai.Model.list()
modelsdf = pd.DataFrame(models["data"])
modelsdf.head(10)

Next, we append our query - "give a list of jobs suitable for the above resume" to the extracted PDF text and send this as the user_msg. The detected_text variable already contains the data extracted from the PDF. We will simply append our query here.

In [None]:
query = 'give a list of jobs suitable for the above resume.'

user_msg = detected_text + '\n\n' + query

We also add a relevant system_msg to refine the behavior of the AI assistant. In our case, a useful system message can be "You are a helpful career advisor."

In [None]:
system_msg = 'You are a helpful career advisor.'
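The two pieces defined so far, the system message and the user message, end up as a list of role/content dictionaries. As a rough sketch of that structure, the hypothetical helper `build_messages` below (not part of the original notebook) assembles it from placeholder text, so it runs without an API key.

```python
def build_messages(system_msg, document_text, query):
    """Assemble the chat payload: system message first, then document + query."""
    user_msg = document_text + "\n\n" + query
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]


messages = build_messages(
    "You are a helpful career advisor.",
    "IM A. SAMPLE I\nBellevue, Nebraska",  # placeholder standing in for detected_text
    "give a list of jobs suitable for the above resume.",
)
for message in messages:
    print(message["role"], "->", len(message["content"]), "characters")
```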

We send the request to get our first response.

In [None]:
response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_msg},
                    {"role": "user", "content": user_msg}]) 

Once the request is complete, the response object will contain the response from the LLM. We can view it by accessing the 'choices' attribute in the response object as follows -

In [None]:
print(response.choices[0].message.content)    

#### Step 3 - Continuing the Conversation

Often, we want a conversation with the LLM that goes beyond a single prompt and a single response. Let us now learn how to use the conversation history to continue the conversation.

To simplify the implementation, we define the following function for calling the OpenAI GPT API from now on -

In [None]:
def continue_chat(system_message, user_assistant_messages):
    # The system message always comes first
    system_msg = [{"role": "system", "content": system_message}]

    # Even positions are user prompts, odd positions are model responses
    user_assistant_msgs = [
        {"role": "assistant" if i % 2 else "user", "content": message}
        for i, message in enumerate(user_assistant_messages)
    ]

    allmsgs = system_msg + user_assistant_msgs
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                            messages=allmsgs)

    return response["choices"][0]["message"]["content"]

The function accepts -

* system_message (string) : This acts as the system_msg
* user_assistant_messages (list) : This list contains user prompts and model responses in alternating order. This is also the order in which they occur in the conversation.

The function internally makes the API call to generate and return a new response based on the conversation history.
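The alternating-order convention can be checked without calling the API. The sketch below isolates only the role-assignment step; `to_chat_messages` is a hypothetical helper, and the odd-length check is an added assumption (the list should end with a user prompt awaiting a reply).

```python
def to_chat_messages(system_message, turns):
    """Convert alternating [user, assistant, user, ...] strings to role dicts."""
    if len(turns) % 2 == 0:
        # An even length means the list is empty or ends with a model response
        raise ValueError("turns must end with a user message (odd length)")
    messages = [{"role": "system", "content": system_message}]
    for index, content in enumerate(turns):
        role = "assistant" if index % 2 else "user"
        messages.append({"role": role, "content": content})
    return messages


example = to_chat_messages(
    "You are a helpful career advisor.",
    ["resume text + first question", "first answer", "follow-up question"],
)
print([m["role"] for m in example])  # ['system', 'user', 'assistant', 'user']
```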

Let us now use this function to continue our previous conversation, and find out the highest paying jobs out of the ones recommended in the first response.

We will use the same system message (system_msg) used in the previous call.

We create the user_assistant_msgs list as follows -

In [None]:
user_msg1 = user_msg
model_response1 = response["choices"][0]["message"]["content"]
user_msg2 = 'based on the suggestions, choose the 3 jobs with highest average salary'
user_assistant_msgs = [user_msg1, model_response1, user_msg2]

Note that we used the original prompt as the first user message (user_msg1), the response to that prompt as the first model response message (model_response1), and our new prompt as the second user message (user_msg2).

Finally, we add them to the user_assistant_msgs list in order of their occurrence in the conversation.

We now call the continue_chat() function to get the next response in the conversation.

In [None]:
response = continue_chat(system_msg, user_assistant_msgs)

In [None]:
print(response)

## Chat with Large PDFs using ChatGPT API and LangChain

The code tutorial shown above fails for very large PDFs. Let us illustrate this with an example. We will try to chat with BCG's "2022 Annual Sustainability Report", a large PDF published by the Boston Consulting Group (BCG) on their general impact in the industry. We execute the code shown below -

In [None]:
import PyPDF2

detected_text = ''

# PdfReader, .pages and extract_text() match the PyPDF2 interface used in Step 1
with open('bcg-2022-annual-sustainability-report-apr-2023.pdf', 'rb') as pdf_file_obj:
    pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
    for page_obj in pdf_reader.pages:
        detected_text += page_obj.extract_text() + '\n\n'

print(len(detected_text))

We can see that the PDF is very large: the detected_text string is roughly 250k characters long.

Let us now try chatting with the PDF -

In [None]:
system_msg = ''

query = '''
summarize this PDF in 500 words.
'''

user_msg = detected_text + '\n\n' + query

response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=[{"role": "system", "content": system_msg},
                                         {"role": "user", "content": user_msg}])

We get an error message saying that we have exceeded the model's maximum prompt length.

For large PDFs with lots of text, the request payload we send to OpenAI is larger than the model can accept in a single request.
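The size of the problem can be estimated before sending anything. The sketch below rests on two loudly stated assumptions: roughly 4 characters per token (a common rule of thumb for English text) and a 4,096-token context window for the model in use. Both numbers and both helper names are illustrative; a tokenizer library such as tiktoken would give exact counts.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is a rule of thumb, not exact."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(text, context_window=4096, reserved_for_reply=500):
    """Check the estimate against an assumed context window, leaving room for the reply."""
    return estimate_tokens(text) <= context_window - reserved_for_reply


# A stand-in string of the same length as the extracted report text
report_like_text = "x" * 250_000

print(estimate_tokens(report_like_text))  # 62500
print(fits_in_context(report_like_text))  # False
```

Under these assumptions the report is more than ten times too large for a single request, which is why it has to be split and searched rather than sent whole.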

Let us now learn how to remove this bottleneck.

Enter LangChain. LangChain is a framework that acts as a bridge, linking large language models (LLMs) with practical applications such as Python code, PDFs, CSV files, or databases.

Let us import the required dependencies and get started.

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os

We load the PDF using LangChain's PyPDFLoader.

In [None]:
loader = PyPDFLoader("bcg-2022-annual-sustainability-report-apr-2023.pdf")

We will perform chunking and split the text using LangChain text splitters.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# Load the pages through the PyPDFLoader defined above, then split them into overlapping chunks
texts = text_splitter.split_documents(loader.load())
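To make the roles of chunk_size and chunk_overlap concrete, here is a simplified, plain-Python sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = chunk_text(sample)

print([len(chunk) for chunk in chunks])     # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.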

We create a vector database using the chunks. We also save the database for future use.

In [None]:
directory = 'index_store'
vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

We now load the database. Using the database, we configure a retriever and then create a chat object. This chat object (qa_interface) will be used to chat with the PDF.

In [None]:
vector_index = FAISS.load_local('index_store', OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})
qa_interface = RetrievalQA.from_chain_type(llm=ChatOpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)
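Conceptually, the retriever embeds the question, compares it with the stored chunk embeddings, and passes the k closest chunks to the LLM. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, chunk labels and helper names are made up, and the distance metric actually used by the FAISS index may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=6):
    """Return the texts of the k chunks most similar to the query vector."""
    scored = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# (chunk text, embedding) pairs; real embeddings have far more dimensions
toy_index = [
    ("chunk about health programs", [0.9, 0.1, 0.0]),
    ("chunk about carbon emissions", [0.1, 0.9, 0.0]),
    ("chunk about hiring", [0.0, 0.2, 0.9]),
]

print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
# ['chunk about health programs', 'chunk about carbon emissions']
```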

We can now start chatting with the PDF. Let us ask the PDF to list measures taken to address diseases occurring in developing industries.

In [None]:
response = qa_interface("List measures taken to address diseases occurring in developing industries")

In [None]:
print(response['result'])

So far, we've used the RetrievalQA chain, a LangChain chain type that pulls relevant document chunks from a vector store and asks one question about them. Sometimes, however, we need a full conversation about a document, including references to topics we've already discussed.

Thankfully, LangChain has us covered. To make this possible, our system needs a memory of the conversation history. Instead of the RetrievalQA chain, we'll use the ConversationalRetrievalChain.

In [None]:
conv_interface = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), retriever=retriever)

Let's ask the PDF to reveal the context in which Morocco is mentioned in the report.

The 'chat_history' parameter is a list containing the past conversation history. For the first message, this list is empty.

The 'question' parameter is used to send our message.

In [None]:
chat_history = []
query = "in what context is Morocco mentioned in the report?"
result = conv_interface({"question": query, "chat_history": chat_history})
print(result["answer"])

Let us now continue the conversation by updating the chat_history variable and ask the PDF to give some statistics around this. We append the messages in order of appearance in the conversation. We first append our initial message followed by the first response.

In [None]:
chat_history.append((query, result["answer"]))

We now add our new question along with the updated chat_history to continue the conversation.

In [None]:
query = "give some statistics around this."
result = conv_interface({"question": query, "chat_history": chat_history})
print(result["answer"])

The result uses the context gained by knowing the conversation history, and provides another great response! We can keep updating the chat_history variable and further continue our conversation using this method.
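The ask-then-append pattern can be wrapped in a small helper so the history is never forgotten. The sketch below uses a stand-in function in place of conv_interface so that it runs without an API key; `chat_turn` and `fake_interface` are hypothetical names, and the answers are dummy strings rather than model output.

```python
def chat_turn(ask, question, chat_history):
    """Ask one question, then record the (question, answer) pair in the history."""
    result = ask({"question": question, "chat_history": list(chat_history)})
    chat_history.append((question, result["answer"]))
    return result["answer"]


def fake_interface(inputs):
    # Stand-in for conv_interface: it only reports how much history it received
    turn_number = len(inputs["chat_history"]) + 1
    return {"answer": f"answer #{turn_number} to: {inputs['question']}"}


history = []
first = chat_turn(fake_interface, "in what context is Morocco mentioned in the report?", history)
second = chat_turn(fake_interface, "give some statistics around this.", history)

print(second)        # answer #2 to: give some statistics around this.
print(len(history))  # 2
```

With the real chain, passing `conv_interface` as `ask` would follow the same flow as the cells above.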

## Build PDF Automations using OpenAI GPT API

Let us now explore automations involving PDF tasks that can be implemented using GPT API. 

#### Automation 1 - Document Data Extraction

GPT-3.5 is excellent at extracting data from documents. Let us try to extract data from an invoice using it. We are going to extract the following fields in JSON format - invoice_date, invoice_number, seller_name, seller_address, total_amount, and each line item present in the invoice.

In [None]:
import os
import openai
from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook
openai.api_key = getpass()

import pdf2image
from PIL import Image
import pytesseract

# OCR every page of the invoice and join the text
images = pdf2image.convert_from_path('invoice.pdf')
detected_text = ''
for page in images:
    detected_text += pytesseract.image_to_string(page) + '\n\n'
    
system_msg = 'You are an invoice processing solution.'

query = '''
extract data from above invoice and return only the json containing the following -
invoice_date, invoice_number, seller_name, seller_address, total_amount, and each line item present in the invoice.
json=
'''

user_msg = detected_text + '\n\n' + query

response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=[{"role": "system", "content": system_msg},
                                         {"role": "user", "content": user_msg}])

print(response.choices[0].message.content)

The response here is a JSON string: text in valid JSON format, but not yet a Python object.

Let us parse it into a Python dictionary, which takes just one line of code.

In [None]:
import json
invoice_json = json.loads(response["choices"][0]["message"]["content"])

In [None]:
pretty_json = json.dumps(invoice_json, indent=2)
print(pretty_json)
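In practice, the model sometimes wraps the JSON in extra text, in which case json.loads fails on the raw reply. A defensive sketch, assuming the reply contains exactly one top-level JSON object, is shown below; `parse_json_reply` is a hypothetical helper and the sample reply is made up for illustration.

```python
import json


def parse_json_reply(reply):
    """Parse the first '{' ... last '}' span of a reply as a JSON object."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


# Made-up reply with chatter around the JSON payload
reply = 'Sure! json=\n{"invoice_number": "INV-001", "total_amount": 120.5}\nLet me know if you need more.'

invoice = parse_json_reply(reply)
print(invoice["invoice_number"])  # INV-001
print(invoice["total_amount"])    # 120.5
```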

#### Automation 2 - Document Classification

Let us consider an example. Say we have a lot of files which are either invoices or receipts. We want to classify and sort these documents based on their type.

Doing this is easy using GPT API.

We create simple python functions to do this.

In [None]:
import shutil
import os
import openai
import pdf2image
import pytesseract
from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook
openai.api_key = getpass()


def list_files_only(directory_path):
    # Return the PDF file names in directory_path (an empty list if it is not a directory)
    if not os.path.isdir(directory_path):
        print(f"{directory_path} is not a directory")
        return []
    file_list = [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]
    return [file for file in file_list if file.lower().endswith(".pdf")]

def classify(file_name):
    
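    # OCR every page and join the text, so multi-page documents are classified on their full content
    images = pdf2image.convert_from_path(file_name)
    detected_text = ''
    for page in images:
        detected_text += pytesseract.image_to_string(page) + '\n\n'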
    
    system_msg = 'You are an accounts payable expert.'

    query = '''
    Classify this document and return one of these two document types as response - [Invoices, Receipts]
    Return only the document type in the response.

    Document Type = 
    '''

    user_msg = detected_text + '\n\n' + query

    response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                            messages=[{"role": "system", "content": system_msg},
                                             {"role": "user", "content": user_msg}])
    
    # Strip stray whitespace so the label can be used as a folder name
    return response["choices"][0]["message"]["content"].strip()

def move_file(current_path, new_folder):
    if os.path.isfile(current_path) and os.path.isdir(new_folder):
        file_name = os.path.basename(current_path)
        new_path = os.path.join(new_folder, file_name)
        shutil.move(current_path, new_path)
        print(f'File moved to {new_path}')

We create two folders labelled 'Invoices' & 'Receipts', in the folder where the unclassified invoices & receipts are present.

Let us execute the code now to classify these files and sort them into separate folders based on the document type.

In [None]:
list_of_files = list_files_only('invoices and receipts/')
for doc in list_of_files:
    current_path = 'invoices and receipts/' + doc
    doc_type = classify(current_path)
    new_path = 'invoices and receipts/' + doc_type
    move_file(current_path, new_path)
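The loop above relies on the model returning exactly 'Invoices' or 'Receipts', since move_file silently skips a file when the target folder does not exist. A small normalization step can make this more forgiving. The sketch below is an assumption-based addition, not part of the original notebook; `normalize_label` is a hypothetical helper.

```python
def normalize_label(raw_label, allowed=("Invoices", "Receipts")):
    """Map a model reply onto one of the allowed folder names, or None."""
    cleaned = raw_label.strip().strip(".'\"").lower()
    for label in allowed:
        # Accept the exact label or its singular form, ignoring case
        if cleaned == label.lower() or cleaned == label.lower().rstrip("s"):
            return label
    return None


print(normalize_label(" Invoices.\n"))    # Invoices
print(normalize_label("receipt"))         # Receipts
print(normalize_label("Purchase order"))  # None
```

Files for which the helper returns None could be left in place for manual review instead of being moved.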

Upon execution, the code sorts these files perfectly!

#### Automation 3 - Recipe Recommendations

We can even feed our favorite cookbooks to GPT API, and ask it to give recipe recommendations based on our inputs. Let us look at an example. We use the Brakes' Meals n More recipe cookbook, and talk to it using LangChain. Let us ask it to give recommendations based on the ingredients we have at home.

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os

from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook
os.environ["OPENAI_API_KEY"] = getpass()
directory = 'index_store'

# Load the cookbook itself and split its pages into overlapping chunks
loader = PyPDFLoader("meals-more-recipes.pdf")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(loader.load())

vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

vector_index = FAISS.load_local('index_store', OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})
qa_interface = RetrievalQA.from_chain_type(llm=ChatOpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)

response = qa_interface("""
I have a lot of broccoli and tomatoes at home. 
Recommend recipe for some meal I can make at home using these.
""")

print(response['result'])

Upon execution, the PDF recommends a recipe for a meal that can be prepared using the mentioned ingredients!

#### Automation 4 - Automated Test Assistant

You can feed textbooks and automate creation of complete question papers and tests using GPT API. The LLM can even generate the marking scheme for you!
We use the textbook Advanced High-School Mathematics by David B. Surowski and ask the LLM to create a question paper with a marking scheme for a particular chapter in the textbook.

We execute the code below -

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os

from getpass import getpass

# Prompt for the key instead of hard-coding it in the notebook
os.environ["OPENAI_API_KEY"] = getpass()
directory = 'index_store'

# Load the textbook itself and split its pages into overlapping chunks
loader = PyPDFLoader("further.pdf")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(loader.load())

vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

vector_index = FAISS.load_local('index_store', OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})
qa_interface = RetrievalQA.from_chain_type(llm=ChatOpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)

response = qa_interface("""
list 5 questions of 20 marks total of varying difficulty and weightage based on the topic "Euclidean Geometry"
""")

print(response['result'])

The LLM reads the PDF textbook and creates the question paper for us!