### PDF Analyzer

This analyzer uses Azure OpenAI's `text-embedding-ada-002` embedding model together with Pinecone as the vector database.

First, we import the necessary packages and initialize three objects that will be used later: `chat_model`, `embedding_model`, and `pinecone_client`.

In [19]:
# pip install pinecone-client
from dotenv import find_dotenv, load_dotenv
import os
from langchain_openai import AzureChatOpenAI
from pinecone.grpc import PineconeGRPC
from pinecone import ServerlessSpec
from langchain_openai import AzureOpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader

load_dotenv(find_dotenv())

chat_model = AzureChatOpenAI(
    openai_api_type="azure",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

embedding_model = AzureOpenAIEmbeddings(
    openai_api_type="azure",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

pinecone_client = PineconeGRPC(
    api_key = os.getenv("PINECONE_API_KEY")
)
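Every constructor above reads its settings through `os.getenv`, so a missing variable silently becomes `None` and only surfaces later as an authentication or connection error. A minimal sketch of an upfront check, assuming the variable names used in this notebook (the helper `missing_env_vars` is hypothetical, not part of any library):

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_API_VERSION",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv(find_dotenv())` shows at a glance whether the `.env` file is complete.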

Next, load the PDF document using `PyPDFLoader`.

In [6]:
loader = PyPDFLoader(file_path="../docs/TOM_AI 2.0_ForRedaction.pdf")
data = loader.load()  # one Document per PDF page

# Concatenate the text of every page into a single string.
pdf_content = ''.join(page.page_content for page in data)

print(pdf_content)

GEN AI: TOO MUCH SPEND, 
TOO LITTLE BENEFIT? 
ISSUE 129 | June 25, 2024 | 5:10 PM EDT
Global Macro  
Research
Investors should consider this report as only a single factor in making their investment decision. For 
Reg AC certiﬁcation and other important disclosures, see the Disclosure Appendix, or go to 
www.gs.com/research/hedge.html.
The Goldman Sachs Group, Inc.
Tech giants and beyond are set to spend over $1tn on AI capex in coming years, 
with so far little to show for it. So, will this large spend ever pay off? MIT’s Daron 
Acemoglu and GS’ Jim Covello are skeptical, with Acemoglu seeing only limited US 
economic upside from AI over the next decade and Covello arguing that the 
technology isn’t designed to solve the complex problems that would justify the costs, 
which may not decline as many expect. But GS’ Joseph Briggs, Kash Rangan, and 
Eric Sheridan remain more optimistic about AI’s economic potential and its ability to 
ultimately generate returns b

Now we can split the text into smaller pieces, a step known as chunking. We use `RecursiveCharacterTextSplitter` for this, with the following parameters:
* `chunk_size`: maximum size of each chunk.
* `chunk_overlap`: the number of characters that should overlap between consecutive chunks.
* `length_function`: function that takes in a string as input and returns its length.
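
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks). The function name `split_fixed` is hypothetical and used only for illustration, and length is measured with the built-in `len`.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0, as in this notebook, no text is repeated across chunks. A non-zero overlap repeats the tail of each chunk at the head of the next, which can help when a sentence straddles a chunk boundary.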

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2048,
    chunk_overlap = 0,
    length_function = len
)

pdf_content_splitted = [c.page_content for c in text_splitter.create_documents([pdf_content])]
print(len(pdf_content_splitted))
print(pdf_content_splitted[3])

74
one quarter given ample near-term liquidity, and now expect
a 25bp RRR cut in Q3 and a 10bp policy rate cut in Q4.
Datapoints/trends we’re focused on 
• China’s economy, which remains bifurcated between
strength in exports and manufacturing activity and weakness
in housing and credit, coupled with very low inflation.
• EM easing cycle; we think the fundamental case for further
EM rate cuts remains strong, though the recent unwind in
EM FX carry trades following electoral surprises in Mexico,
India, and South Africa could impede policy normalization.
French election: upside risk to debt trajectory 
French government debt, % of GDP 
China: a bifurcated economy 
China activity indicator, % change, yoy 
Source: Goldman Sachs GIR. Source: Haver Analytics, Goldman Sachs GIR. 
[chart axis tick values omitted]
Additional interest receipt (lhs)
Addition

Now create an index in Pinecone. Its name is read from the `PINECONE_INDEX_NAME` environment variable (`pdf-index` in this run).

In [21]:
index_name = os.getenv("PINECONE_INDEX_NAME")

# print(pinecone_client.list_indexes().names())
if index_name not in pinecone_client.list_indexes().names():
    pinecone_client.create_index(
        name = index_name,
        dimension = 1536,  # vector length of text-embedding-ada-002
        metric = "cosine",
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )
    print("Index {} created.".format(index_name))
else:
    print("Index {} already exists.".format(index_name))

Index pdf-index created.
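
The `metric = "cosine"` setting means the index ranks vectors by the cosine of the angle between them, and `dimension = 1536` matches the length of the vectors that `text-embedding-ada-002` returns. As a rough illustration of the metric only (Pinecone computes it on the server side; the helper below is a hypothetical stand-in that uses tiny 3-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why the metric suits comparing text embeddings by direction rather than magnitude.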


Now, let's create a `PineconeVectorStore` object named `pdf_docsearch`, passing in:
1. The chunked content;
2. The embedding model used to generate an embedding for each chunk;
3. The name of the index; behind the scenes, LangChain connects to Pinecone and inserts the embeddings into the vector database.

To verify after execution, log in to [Pinecone](https://app.pinecone.io/) and open the index created above. It should already contain records, and the record count should match the length of `pdf_content_splitted`.

In [22]:
pdf_docsearch = PineconeVectorStore.from_texts(
    texts=pdf_content_splitted,
    embedding=embedding_model,
    index_name=index_name
)

Now we can run searches. Bear in mind that a real-world use case would include evaluation and tuning steps before production use; this is just a demo.

In [26]:
question = "Who is Brian Janous?"

docs = pdf_docsearch.similarity_search(question)

chain = load_qa_chain(llm=chat_model, chain_type="stuff")
chain.run(
    input_documents=docs,
    question = question
)

'Brian Janous is the Co-founder of Cloverleaf Infrastructure, a company that develops strategies to help utilities unlock new grid capacity. Before this, he served as the Vice President of Energy at Microsoft.'
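
Conceptually, the cell above does two things: `similarity_search` fetches the chunks closest to the question (LangChain returns the top 4 by default), and the `"stuff"` chain type places all of them into a single prompt for the chat model. The sketch below imitates that flow in plain Python with made-up 2-dimensional vectors in place of real embeddings. `top_k` and `build_stuff_prompt` are hypothetical helpers, the toy chunk texts are placeholders, and the prompt wording is an assumption rather than LangChain's actual template.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, records, k=4):
    """Return the texts of the k records most similar to query_vec."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_stuff_prompt(question, docs):
    """Concatenate all retrieved docs into one context block."""
    context = "\n\n".join(docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy records: (chunk text, made-up embedding).
records = [
    ("chunk about energy", [0.9, 0.1]),
    ("chunk about markets", [0.1, 0.9]),
    ("chunk about power grids", [0.8, 0.3]),
]
docs = top_k([1.0, 0.0], records, k=2)
print(build_stuff_prompt("Who is Brian Janous?", docs))
```

Because every retrieved chunk goes into one prompt, the combined chunk length has to fit within the chat model's context window, which is one reason `chunk_size` and the number of retrieved chunks are worth tuning together.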

In [27]:
question = "Who is Agent Smith?" # From Matrix LOL

docs = pdf_docsearch.similarity_search(question)

chain = load_qa_chain(llm=chat_model, chain_type="stuff")
chain.run(
    input_documents=docs,
    question = question
)

"I'm sorry, but the provided context does not contain any information on an individual named Agent Smith."