## Load PDF files using the PyPDF document loader from LangChain

In [14]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../pdfs/Olympics.pdf")
pages = loader.load()

In [15]:
len(pages)

17

In [None]:
pages[0]

In [16]:
loader = PyPDFLoader("../pdfs/Global Warming.pdf")
pages += loader.load()

In [17]:
len(pages)

21

In [None]:
pages[18]

In [18]:
len(pages[18].page_content)

2391

In [None]:
pages[13].page_content

In [21]:
len(pages)

21

## Split pages into smaller chunks

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, 
    chunk_overlap=200, 
    length_function=len,
    add_start_index=True)

chunked_documents = text_splitter.split_documents(pages)

In [22]:
len(chunked_documents)

54
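The 21 pages become 54 chunks because pages longer than `chunk_size` are cut into overlapping windows. The sketch below is a simplified stand-in for that idea, not the `RecursiveCharacterTextSplitter` algorithm itself: the real splitter prefers to break on paragraph, line, and word separators, so its boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # start_index mirrors what add_start_index=True records in the metadata
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A dummy page with the same length as pages[18].page_content above
sample = "x" * 2391
chunks = split_with_overlap(sample, chunk_size=2000, chunk_overlap=200)
print([(c["start_index"], len(c["text"])) for c in chunks])
# [(0, 2000), (1800, 591)]
```

The second chunk starts 200 characters before the first one ends, so a sentence that straddles the boundary still appears whole in at least one chunk.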

## Create the DB + Azure OpenAI Embeddings

In [6]:
import os

# replace with your own values 

os.environ["OPENAI_API_KEY"] = "..."
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "..."
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"
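Since every later cell depends on these four variables, it can help to fail early when one is unset or still holds the `"..."` placeholder. This is a minimal sketch; the helper name `missing_settings` is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os

REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
]


def missing_settings(environ=os.environ):
    """Return the names of required variables that are unset, empty, or still '...'."""
    return [name for name in REQUIRED_VARS if environ.get(name, "") in ("", "...")]


# Example with a plain dict standing in for os.environ
example_env = {
    "OPENAI_API_KEY": "...",  # placeholder that was never replaced
    "OPENAI_API_TYPE": "azure",
    "OPENAI_API_VERSION": "2023-03-15-preview",
}
print(missing_settings(example_env))
# ['OPENAI_API_KEY', 'OPENAI_API_BASE']
```

Calling `missing_settings()` with no argument checks the real environment of the running kernel.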

In [28]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import AzureOpenAI
from langchain.chains import RetrievalQA

In [29]:
import chromadb

In [30]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = '../vector_db'

## here we are using OpenAI embeddings
embedding = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1)

vectordb = Chroma.from_documents(documents=chunked_documents,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [31]:
# persist the db to disk
vectordb.persist()
vectordb = None

In [32]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

## Make a retriever

In [33]:
retriever = vectordb.as_retriever()

In [34]:
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

In [35]:
# The search type is configured in as_retriever(search_type=...), not per query,
# and 'similarity_search_with_score' is not one of its valid values.
relevant_docs = retriever.get_relevant_documents(
    "Which country organized summer olympics more than once?")



In [36]:
len(relevant_docs)

4

In [None]:
relevant_docs
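Conceptually, the retriever embeds the query, scores it against every stored chunk embedding, and returns the `k` best matches. The toy sketch below uses cosine similarity on hand-written 2-D vectors purely for illustration; the distance metric Chroma actually applies depends on how the collection is configured, and the chunk ids here are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Invented embeddings: first axis ~ "olympics", second axis ~ "climate"
doc_vecs = {
    "olympics_chunk_a": [1.0, 0.0],
    "olympics_chunk_b": [0.8, 0.6],
    "warming_chunk_a": [0.0, 1.0],
}
print(top_k([1.0, 0.1], doc_vecs, k=2))
# ['olympics_chunk_a', 'olympics_chunk_b']
```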

## Make a chain

In [38]:
# create the chain to answer questions
# "refine" drafts an answer from the first retrieved chunk, then revises it with each further chunk
llm = AzureOpenAI(deployment_name="text-davinci-003")
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="refine",
                                       retriever=retriever,
                                       return_source_documents=True)
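The `"refine"` chain type makes one LLM call per retrieved document: the first call drafts an answer, and each later call revises that draft in light of the next document. The sketch below shows only the control flow, with a stub in place of the model; the prompt wording is invented for illustration and is not LangChain's actual prompt.

```python
def refine_answer(question, documents, llm):
    """Draft an answer from the first document, then refine it with each remaining one."""
    if not documents:
        raise ValueError("at least one document is required")
    answer = llm(f"Context: {documents[0]}\nQuestion: {question}\nAnswer:")
    for doc in documents[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {doc}\n"
            f"Question: {question}\n"
            "Refine the existing answer only if the new context helps."
        )
    return answer


# Stub model that just records how often it was called
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"draft {len(calls)}"


result = refine_answer("Which country hosted twice?", ["doc A", "doc B", "doc C", "doc D"], fake_llm)
print(result, len(calls))
# draft 4 4
```

With `k=4` in the retriever, this pattern means four sequential model calls per question, which is slower than putting all chunks into one prompt but keeps each prompt short.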

In [49]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'], ", page: ", source.metadata['page'])
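Because pages were split into chunks, several retrieved chunks can come from the same page, so the source list may repeat entries. A possible variant that removes duplicates is sketched below, using plain dicts in place of `source.metadata`. The `+ 1` assumes the loader's `page` value is 0-based; check that against your own PDFs before relying on it. The helper name `format_sources` is hypothetical.

```python
def format_sources(metadata_list):
    """Return one 'source, page: N' line per distinct (source, page) pair, in first-seen order."""
    seen = set()
    lines = []
    for meta in metadata_list:
        key = (meta["source"], meta["page"])
        if key in seen:
            continue  # another chunk from a page that is already listed
        seen.add(key)
        # Assumption: 'page' is 0-based, so add 1 for a human-readable page number
        lines.append(f"{meta['source']}, page: {meta['page'] + 1}")
    return lines


example_metadata = [
    {"source": "../pdfs/Olympics.pdf", "page": 2},
    {"source": "../pdfs/Olympics.pdf", "page": 2},
    {"source": "../pdfs/Olympics.pdf", "page": 5},
]
for line in format_sources(example_metadata):
    print(line)
# ../pdfs/Olympics.pdf, page: 3
# ../pdfs/Olympics.pdf, page: 6
```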

In [43]:
# example
query = "Which country organized summer olympics more than once?"
llm_response = qa_chain(query)



In [50]:
process_llm_response(llm_response)


Australia, France, Germany, Greece, Japan, Great Britain, Sweden, Belgium, the Netherlands, Finland, Italy, Mexico, Canada, the Soviet Union, South Korea, Spain, China, Brazil, and the upcoming hosts in 2024 and 2032 (taking both countries to three each) have all organized the Summer Olympic Games more than once. Tokyo, Japan, hosted the 2020 Games and became the first city outside the predominantly English-speaking and European nations to have hosted the Summer Olympics twice, having already hosted the Games in 1964. It is the largest city ever to have hosted the Olympics, having grown considerably since 1964. The other countries to have hosted the Summer Olympics are Belgium, Brazil, Canada, China, Finland, Italy, Mexico, Netherlands, South Korea, Soviet Union, Spain, and Sweden, with each of these countries having hosted the Summer Games on one occasion. Asia has hosted the Summer Olympics four times: in Tokyo (1964 and 2020), Seoul (1988), and Beijing (2008). The 2016 Games in Rio

In [52]:
# another example
query = "What are the causes of global warming?"
llm_response = qa_chain(query)

process_llm_response(llm_response)



Global warming occurs when carbon dioxide (CO2) and other air pollutants, known as greenhouse gases, collect in the atmosphere and absorb sunlight and solar radiation that have bounced off the Earth's surface. Normally this radiation would escape into space, but these pollutants, which can last for years to centuries in the atmosphere, trap the heat and cause the planet to get hotter. These heat-trapping pollutants —specifically carbon dioxide, methane, nitrous oxide, water vapor, and synthetic fluorinated compounds— absorb the sun’s energy, preventing it from escaping the atmosphere and trapping heat on the earth’s surface. This phenomenon is known as the greenhouse effect and is directly attributable to human activity, such as burning fossil fuels such as coal, oil, gasoline, and natural gas. In the United States, transportation, electricity production, and industrial activity are the largest sources of greenhouse gases.

The effects of global warming are being felt everywhere. Ext