![image](puffin.jpg)

[image source](https://images.discerningassets.com/image/upload/q_auto:best/c_limit,w_1000/v1705258979/xz2yui2oqtqhv0yhrgto.jpg)

# Ask Questions about PDF Files Using the LangChain Library

- We will use several PDF files to ask an AI model the question: **"Where are the main locations of puffin colonies?"**

- PDF files about puffins were downloaded and then processed with the LangChain library: they were loaded as documents with `PyPDFLoader` and split into chunks with `RecursiveCharacterTextSplitter`.

- The Pinecone and OpenAI API keys were stored as environment variables.

- Using `OpenAIEmbeddings` and `PineconeVectorStore`, the text of every chunk was converted into an embedding vector. The texts, along with their vectors, metadata, and generated UUIDs, were stored in the Pinecone cloud database.

- Using the `similarity_search` method, the question was converted into an embedding vector, and Pinecone used cosine similarity to find the 10 chunks (k=10) most similar to the question.

- The top 10 chunks were passed as context to the `ChatOpenAI` class, the question was asked, and the model provided an answer.

Useful Resources:
* [Pinecone in Langchain](https://python.langchain.com/docs/integrations/vectorstores/pinecone/)
* [Free Articles in PDF](https://www.freefullpdf.com/)

## Importing Libraries

In [1]:
import os
import time
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain_pinecone import PineconeVectorStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

from tqdm.autonotebook import tqdm


## Loading Environment Variables

In [2]:
# Environment variables will be loaded from a .env file located in the same root directory as this notebook file.
load_dotenv()

# test one of the env keys
os.environ.get("PINECONE_API_ENV")

'us-east-1'

## Create Documents

In [3]:
add = False  # Set to True to add the documents to the Pinecone database; set it back to False afterwards.

In [4]:
!ls data

25_BSSC_Tufted Puffin.pdf
4 - Trophic interactions under climate fluctuations.pdf
4458-Article Text-33734-1-10-20200626.pdf
52_2_235-245.pdf
Tufted-Puffin-Coast-Wide-Colony-Survey-2021-508-compliant.pdf
tupu_recovery__planfinalsept2019.pdf


In [5]:
files_dir = "data"

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

In [7]:
documents = []
for file in os.listdir(files_dir):
    if file.endswith('.pdf'):
        loader = PyPDFLoader(os.path.join(files_dir , file))
        data = loader.load()
        documents.extend(text_splitter.split_documents(data))
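
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter tries to break on separators such as paragraph breaks, newlines, and spaces before falling back to smaller pieces. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, each chunk would repeat up to the last 200 characters of the previous one.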

In [8]:
len(documents)

538

In [9]:
documents[10].metadata

{'source': 'data/52_2_235-245.pdf', 'page': 1}

In [10]:
print(documents[11].page_content)

2022), and marine mammals (Ford et al. 2016, Thomas et al. 2017, 
Michaux et al. 2021). Results from eDNA studies can often be 
correlated with results from studies using conventional sampling 
methods (Port et al. 2016, Thomsen et al. 2016, Kelly et al. 2017, 
Sigsgaard et al. 2017, Pont et al. 2018), most of which involve 
greater time and energy investment. One limitation of eDNA 
methods is that secondary prey are difficult to distinguish from 
primary prey.
For seabirds, diets have been assessed using DNA analysis of 
feces from several species, including penguins (Deagle et al. 2007, 
Jarman et al. 2013, Cavallo et al. 2018), albatrosses (McInnes et 
al. 2016a, 2017a, 2017b), shearwaters (Komura et al. 2018, Nimz 
et al. 2022), terns (Bogantes et al. 2024), and cormorants (Oehm 
et al. 2017). The eDNA methods perform as well as or better than 
conventional dietary characterization methods for many species 
(Deagle et al. 2007; Bowser et al. 2013; Jarman et al. 2013;


## Initializing the Pinecone Database and Index

In [11]:
index_name = "puffin"
pc = Pinecone()
existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

print("Indexes: ", existing_indexes)

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Poll once per second until the newly created index reports that it is ready.
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

print(f"Using {index_name} index ")

index = pc.Index(index_name) # instantiate an index object


Indexes:  ['langzam', 'puffin']
Using puffin index 


## Adding Documents to the Vector Store

In [12]:
from uuid import uuid4
uuids = [str(uuid4()) for _ in range(len(documents))]
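
Because `uuid4()` returns new random IDs on every run, adding the same documents twice would store every chunk twice, which is presumably why the `add` flag is needed. As an alternative, the sketch below derives each ID from the chunk itself, so the same chunk always maps to the same ID. It assumes that writing a vector under an existing ID overwrites the stored record, which is how Pinecone upserts behave. The helper name `deterministic_id` and the ID layout are hypothetical.

```python
import hashlib
import uuid


def deterministic_id(source, page, text):
    """Build a stable UUID from a chunk's source file, page number, and text."""
    # Hash the text so that very long chunks still give a short, fixed-size key.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # uuid5 is a name-based UUID: the same name always yields the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#page={page}#{digest}"))


# Hypothetical use with the documents list from this notebook:
# uuids = [
#     deterministic_id(d.metadata["source"], d.metadata["page"], d.page_content)
#     for d in documents
# ]
print(deterministic_id("data/example.pdf", 1, "some chunk text"))
```

Two chunks with the same source, page, and text would collapse into one ID, so this only suits data where such duplicates are not meaningful.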

In [14]:
embeddings=OpenAIEmbeddings(model="text-embedding-3-small")

In [15]:
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

In [16]:
if add:
    vector_store.add_documents(documents=documents, ids=uuids)
    print("Documents (text+vector+metadata+uuid) were added to the pinecone database")
else:
    print("Nothing will be added to the vector database")

Nothing will be added to the vector database


## Query Vector Store

In [17]:
question = "Where are the main locations of puffin colonies?"
results = vector_store.similarity_search(
    question,
    k=10,
    # filter={"source": "xxx.pdf"},  # optional: restrict the search to a single source file
)
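
The index was created with `metric="cosine"`, so the search above ranks chunks by the cosine similarity between the question vector and each stored vector. The sketch below shows that ranking on toy 3-dimensional vectors. It is a brute-force illustration of the metric only: Pinecone uses its own index structures internally, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings"; the real ones in this notebook have 1536 dimensions.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # prints [0, 1]
```

Cosine similarity depends only on the direction of the vectors, not their length, so a long chunk and a short chunk about the same topic can score equally well.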

In [18]:
resutls_content = "\n\n".join([result.page_content for result in results])
print(resutls_content)

at the South Farallon Islands. On the north coast, 
the principal breeding sites were Prince Island 
(27 birds) and Castle Rock (82 birds), Del Norte 
County, and Green Rock, Humboldt County 
(29 birds). Between Cape Mendocino and the 
Farallon Islands, puffins were found only at Goat 
Island Area (8 birds) and Fish Rocks (15 birds), 
Mendocino County, and Point Reyes (4 birds). In 
1991, small numbers of puffins (about 10 birds) 
were rediscovered at Prince Island, Santa Barbara 
County, after an absence of up to several decades 
from the Channel Islands (Carter et al. 1992, 
McChesney et al. 1995). This was the only breed-
ing location south of the Farallon Islands where 
puffins were found in 1989–1991. No overall

September 2019  3                               Washington Department of Fish and Wildlife 
   
 
1995), and along the Asian coast as far 
south as Hokkaido, Japan (Brazil 1991, 
Osa and Watanuki 2002).  Of the 1,031 
nesting colonies known worldwide, 802 
(78%) occur in 

## Ask OpenAI

In [19]:
model = ChatOpenAI(model="gpt-3.5-turbo")

In [20]:
messages = [  # order matters: the system message carrying the context comes first
    SystemMessage(content=resutls_content),
    HumanMessage(content=question)
]
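
The messages above pass the retrieved text as the system message with no instructions around it. One possible variation, sketched below, wraps the context in an explicit instruction so the model is told to answer from the retrieved chunks only. The helper name `build_messages` and the prompt wording are assumptions, not part of the original notebook. The function returns plain `(role, content)` tuples, a form that LangChain chat models also accept as input to `invoke`.

```python
def build_messages(context, question):
    """Return a (role, content) message list that grounds the answer in context."""
    system_text = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    # System message first, then the user's question.
    return [("system", system_text), ("human", question)]


example = build_messages(
    context="Puffins nest on offshore islands.",
    question="Where do puffins nest?",
)
for role, content in example:
    print(f"[{role}] {content}")
```

In this notebook the call would look like `model.invoke(build_messages(resutls_content, question))`, assuming the variables defined in the earlier cells.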

In [21]:
response=model.invoke(messages)

In [22]:
print(question , response.content)

Where are the main locations of puffin colonies? The main locations of Tufted Puffin colonies in Washington are along the outer coast from Point Grenville north to Cape Flattery. They are also found in colonies on Protection and Smith Islands in the eastern Strait of Juan de Fuca. In Oregon, puffin colonies are located at sites such as Haystack Rock, Three Arch Rocks, and Cape Meares. In California, puffin colonies are mainly found on the northern coast, including locations like Prince Island, Castle Rock, and Green Rock. Additionally, puffins breed in significant numbers in the Aleutian Islands and along the Alaskan Peninsula.
