## PDF Query Using LangChain

In [3]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from keys import openai_api_key, serp_api_key


In [5]:
import os, warnings
os.environ["SERPAPI_API_KEY"] = serp_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key
warnings.filterwarnings("ignore")

In [6]:
# Path to the PDF file to query. Forward slashes avoid accidental
# backslash escapes in the string and also work on Windows.
pdfreader = PdfReader('data/going_beyond_abm_whitepaper.pdf')

In [7]:
# Extract the text of every page and concatenate it into one string.
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [9]:
raw_text[:1000]

'Introduction\nAccount -based marketing (ABM) has been growing steadily as B2B marketers embrace this valuable technique to \nproactively reach high-value accounts and buyers. A survey from ITSMA indicated that 73% of companies expect to \nincrease their ABM budgets in the next year by an average of 21%.1 In a year that has brought on a disruption in the \neconomic landscape and a likelihood for budgetary cuts, this is a major statement about the value of ABM and its \nincreasing role in the B2B marketer’s toolset. \nToday, work and home life are converging due to the pandemic and a greater acceptance of remote work as a \nbusiness practice. In addition, the same individual attention and personalized approach that people demand from the \nB2C companies they interact with as consumers is now being demanded from the B2B companies they engage with \nas business buyers. As we hunker down at home, connecting with business prospects now means reaching them the \nsame way B2C companies reach 

In [10]:
# Split the text into overlapping chunks so that each chunk stays well within the model's token limit.
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [11]:
len(texts)

55
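
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small pure-Python sketch. This is not LangChain's implementation; `split_with_overlap` is a hypothetical helper that assumes a simple greedy strategy: pack separator-delimited pieces into a chunk until the size limit is reached, then carry the last few pieces forward as overlap.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy splitter (illustrative only, not LangChain's code).

    Packs separator-delimited pieces into chunks of at most chunk_size
    characters and starts each new chunk with up to chunk_overlap
    characters from the end of the previous one.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail that fits inside the overlap budget.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Make sure the tail plus the new piece still fits.
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        # A single piece longer than chunk_size becomes its own oversized chunk.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: ten 7-character lines, small limits so the overlap is visible.
sample = "\n".join(f"line {i:02d}" for i in range(10))
for chunk in split_with_overlap(sample, chunk_size=25, chunk_overlap=8):
    print(repr(chunk))
```

With these toy settings each chunk holds at most three lines, and every chunk starts with the last line of the previous one, which is the overlap that helps a retrieved chunk keep some surrounding context.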

In [12]:
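# Create the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment).
# No text is embedded yet; that happens when the vector store is built.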
embeddings = OpenAIEmbeddings()

In [13]:
# Embed every chunk and build an in-memory FAISS index over the vectors.
document_search = FAISS.from_texts(texts, embeddings)

In [14]:
document_search


<langchain.vectorstores.faiss.FAISS at 0x21d3786c850>
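
Conceptually, the vector store maps each chunk to an embedding vector and answers a query by returning the chunks whose vectors lie closest to the query's vector. The sketch below shows that idea with a brute-force search in plain Python. It assumes Euclidean distance as the metric, and the three-dimensional vectors, the texts, and the helper names are made up for illustration; real OpenAI embeddings have far more dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vec, indexed, k=4):
    """Return the k texts whose vectors are closest to query_vec.

    indexed is a list of (text, vector) pairs.
    """
    ranked = sorted(indexed, key=lambda pair: l2_distance(query_vec, pair[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical toy index: hand-written vectors standing in for embeddings.
toy_index = [
    ("ABM targets named accounts", [0.9, 0.1, 0.0]),
    ("Remote work is now common", [0.1, 0.8, 0.2]),
    ("Personalized B2B campaigns", [0.7, 0.2, 0.3]),
]
toy_query = [1.0, 0.0, 0.0]

print(nearest_texts(toy_query, toy_index, k=2))
```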

In [15]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [16]:
# chain_type="stuff" places all retrieved chunks into a single prompt for the LLM.
chain = load_qa_chain(OpenAI(), chain_type="stuff")

In [17]:
query = "What is ABM?"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' ABM is an account-based marketing approach that focuses marketing and sales efforts onto specific key accounts. It is a strategic approach that seeks to find best-fit accounts and turn them into new customers, or grow existing relationships.'
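
The "stuff" chain type used above works by stuffing every retrieved chunk into one prompt and sending that prompt to the model in a single call. A minimal sketch of the idea follows. The helper `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Join all chunks into one context block and append the question.

    Illustrative only: the instruction text here is an assumption,
    not the template LangChain ships with.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy chunks standing in for the documents returned by similarity_search.
example_chunks = [
    "ABM focuses marketing and sales effort on specific accounts.",
    "B2B buyers increasingly expect a personalized approach.",
]
print(build_stuff_prompt(example_chunks, "What is ABM?"))
```

Because everything goes into one prompt, this approach is simple and needs only one model call, but the combined chunks plus the question must fit within the model's context window. That constraint is why the text was split into bounded chunks earlier.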

In [18]:
query = "Why ABM is the wave of the future?"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' ABM is the wave of the future because it provides a solid foundation for B2B companies and it amplifies the benefits of ABM, such as higher ROI than other marketing approaches, shorter sales cycles, better conversion rates, and more. Additionally, the demand for individual attention and personalized approach from B2B companies is increasing.'

In [19]:
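# Note: similarity_search returns only the top matching chunks (k=4 by default),
# so this "summary" covers those chunks rather than the whole document.
query = "Summarize the document"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)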

' This document discusses how to employ an account-based strategy to gain insights and build personalized campaigns at scale. It provides tips on how to use data and design to craft campaigns tailored to the needs of different contacts within an account, as well as advice on how to use technology and data science to gain further insights. It gives an example of how Microsoft used the Strategic Executive Board to gain insights from its customers.'