# AI PDF Reader Assistant

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-OpenAI-API-KEY" data-toc-modified-id="1.-OpenAI-API-KEY-1">1. OpenAI API KEY</a></span></li><li><span><a href="#2.-Testing-GPT4-from-LangChain" data-toc-modified-id="2.-Testing-GPT4-from-LangChain-2">2. Testing GPT4 from LangChain</a></span></li><li><span><a href="#3.-Loading-PDF-file" data-toc-modified-id="3.-Loading-PDF-file-3">3. Loading PDF file</a></span></li><li><span><a href="#4.-Chunks" data-toc-modified-id="4.-Chunks-4">4. Chunks</a></span></li><li><span><a href="#5.-Embedding-Model" data-toc-modified-id="5.-Embedding-Model-5">5. Embedding Model</a></span></li><li><span><a href="#6.-Store-in-ChromaDB" data-toc-modified-id="6.-Store-in-ChromaDB-6">6. Store in ChromaDB</a></span></li><li><span><a href="#7.-Load-from-storage" data-toc-modified-id="7.-Load-from-storage-7">7. Load from storage</a></span></li><li><span><a href="#8.-Prompt-template" data-toc-modified-id="8.-Prompt-template-8">8. Prompt template</a></span></li><li><span><a href="#9.-Chain" data-toc-modified-id="9.-Chain-9">9. Chain</a></span></li><li><span><a href="#10.-Code-Summary" data-toc-modified-id="10.-Code-Summary-10">10. Code Summary</a></span></li></ul></div>

## 1. OpenAI API KEY

To carry out this project, we will need an API KEY from OpenAI to use the GPT-4 Turbo model. This API KEY can be obtained at https://platform.openai.com/api-keys. It is only displayed once, so it must be saved at the moment it is obtained. Of course, we will need to create an account to get it.

We store the API KEY in a `.env` file to load it with the dotenv library and use it as an environment variable. This file is added to the `.gitignore` to ensure that it cannot be seen if we upload the code to GitHub, for example.

In [1]:
# import API KEY

import os                           # operating system library
from dotenv import load_dotenv      # load environment variables  


load_dotenv()


OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
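
`os.getenv` returns `None` when a variable is missing, so a `.env` file that failed to load would only show up later as an authentication error. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


# Usage, after load_dotenv() has run:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```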

## 2. Testing GPT4 from LangChain

We are going to test the connection from LangChain to the GPT-4 model. We'll just ask who Apple's CEO is.

In [2]:
from langchain_openai.chat_models import ChatOpenAI   # LangChain connection to OpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4-turbo")

response = model.invoke("Who is Apple's CEO?")

response.content

"As of my last update in 2023, Apple's CEO is Tim Cook. He has held the position since August 2011, succeeding Steve Jobs."

## 3. Loading PDF file

Now we load the PDF of [Apple's 2023 Form 10-K](https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf), downloaded beforehand. A 10-K is a comprehensive report that a publicly traded company files annually about its financial performance, as required by the U.S. Securities and Exchange Commission (SEC). It contains much more detail than the company's annual report, which is sent to shareholders before the annual meeting to elect company directors.

In [3]:
os.listdir("../pdfs")

['_10-K-Q4-2023-As-Filed.pdf']

In [4]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

In [5]:
# loads PDF file page by page

loader = PyPDFDirectoryLoader("../pdfs/")

pages = loader.load()

In [6]:
len(pages)

80

In [7]:
pages[0]  # first pdf page

Document(page_content='UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K\n(Mark One)\n☒    ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended September\xa030, 2023\nor\n☐    TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0  to \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 .\nCommission File Number: 001-36743\nApple Inc.\n(Exact name of Registrant as specified in its charter)\nCalifornia 94-2404110\n(State or other jurisdiction\nof incorporation or organization)(I.R.S. Employer Identification No.)\nOne Apple Park Way\nCupertino , California 95014\n(Address of principal executive offices) (Zip Code)\n(408) 996-1010\n(Registrant’s telephone number, including area code)\nSecurities registered pursuant to Section 12(b) of the Act:\nTitle of each classTrading \nsymbol(s) Name

## 4. Chunks

The loader's `load_and_split()` method splits the loaded pages with a text splitter; when none is passed, it uses `RecursiveCharacterTextSplitter` with its default settings. Splitting breaks large PDF files or collections of files into manageable chunks for further processing. Each chunk retains its metadata, such as the page number, which matters for referencing and for tracing content back to the source document.

In [8]:
chunks = loader.load_and_split()

In [9]:
len(chunks)

106
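
To make the idea of chunking concrete, here is a simplified sketch of a fixed-window splitter with overlap. It is not the recursive algorithm LangChain uses, which prefers to break on separators such as paragraph and line breaks; the function name `split_text` and the sizes are assumptions chosen for illustration.

```python
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghij" * 3  # 30 characters
pieces = split_text(sample, chunk_size=12, overlap=4)
print([len(p) for p in pieces])  # [12, 12, 12, 6]
```

The overlap trades some duplicated storage for a lower chance that a relevant passage is split across two chunks.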

In [10]:
chunks[55]

Document(page_content='The gross fair values of the Company’s derivative assets and liabilities as of September\xa024, 2022  were as follows (in millions):\n2022\nFair Value of\nDerivatives Designated\nas Accounting HedgesFair Value of\nDerivatives Not Designated\nas Accounting HedgesTotal\nFair Value\nDerivative assets (1):\nForeign exchange contracts $ 4,317 $ 2,819 $ 7,136 \nDerivative liabilities (2):\nForeign exchange contracts $ 2,205 $ 2,547 $ 4,752 \nInterest rate contracts $ 1,367 $ — $ 1,367 \n(1) Derivative assets are measured using Level 2 fair value inputs and are included in other current assets and other non-\ncurrent assets in the Consolidated Balance Sheet.\n(2) Derivative liabilities are measured using Level 2 fair value inputs and are included in other current liabilities and other non-\ncurrent liabilities in the Consolidated Balance Sheet.\nThe derivative assets above represent the Company’s gross credit exposure if all counterparties failed to perform. To mitigate

## 5. Embedding Model

Embeddings transform data, especially textual data, into a format, usually a vector of numbers, that ML algorithms can process effectively. These embeddings capture the contextual relationships and semantic meanings of words, phrases, or documents, enabling various applications in AI.

In [11]:
from langchain_openai.embeddings import OpenAIEmbeddings


vectorizer = OpenAIEmbeddings()
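
As a rough illustration of how vectors capture semantic closeness, the sketch below compares hand-made three-dimensional vectors with cosine similarity. The numbers are hypothetical values invented for the example; real embeddings have hundreds or thousands of dimensions and come from the model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any model)
hedge = [0.9, 0.1, 0.0]
derivative = [0.8, 0.2, 0.1]
iphone = [0.0, 0.1, 0.9]

# Related finance terms should score higher than an unrelated product term
print(cosine_similarity(hedge, derivative) > cosine_similarity(hedge, iphone))  # True
```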

## 6. Store in ChromaDB

Chroma DB is an open-source vector database designed to store and retrieve vector embeddings efficiently. It is particularly useful for enhancing LLMs by providing relevant context to user inquiries. Chroma DB allows for the storage of embeddings along with metadata, which can later be utilized by LLMs or for semantic search engines over text data.

Now, we store the chunks in the vector database.

In [12]:
from langchain_community.vectorstores import Chroma

chroma_db = Chroma.from_documents(chunks, vectorizer, persist_directory="../chroma_db")
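
Conceptually, a vector database keeps each embedding next to its text and metadata and answers nearest-neighbour queries. The in-memory class below is a toy stand-in for that idea, not Chroma's implementation; `ToyVectorStore`, the vectors, and the page numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""
    records: list = field(default_factory=list)

    def add(self, vector: list[float], text: str, metadata: dict) -> None:
        # Store the embedding together with its text and metadata
        self.records.append((vector, text, metadata))

    def search(self, query: list[float], k: int = 2) -> list[tuple[str, dict]]:
        # Rank records by squared Euclidean distance to the query vector
        def sq_dist(vector):
            return sum((a - b) ** 2 for a, b in zip(vector, query))

        ranked = sorted(self.records, key=lambda record: sq_dist(record[0]))
        return [(text, meta) for _, text, meta in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "hedging with forwards", {"page": 30})
store.add([0.9, 0.1], "foreign exchange options", {"page": 41})
store.add([0.0, 1.0], "iPhone product line", {"page": 4})

print(store.search([1.0, 0.05], k=2))
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index and persists the records to disk.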

## 7. Load from storage

Once the data is saved, we can search for the documents most relevant to our query. We can call similarity search directly, which ranks chunks by how close their embeddings are to the query embedding, or we can instantiate a retriever object for later use.

By default, the retriever's `search_type` is similarity, which returns the results closest to the query. We can also use similarity with a threshold, which retrieves only documents whose similarity to the query exceeds a given level. The retriever also supports MMR (maximal marginal relevance), which selects documents that are similar to the query while also optimizing for diversity. It does this by finding the candidates whose embeddings have the highest cosine similarity to the query and then adding them iteratively, applying a penalty for closeness to documents already selected.

We will use the MMR algorithm to return 15 documents. The `lambda_mult` parameter refers to the diversity of the results returned by MMR, with 1 being for minimal diversity and 0 for maximum. The default is 0.5. We will ask for a bit more diversity in its response.
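
The selection rule described above can be sketched in plain Python. This is a minimal illustration of the MMR idea under the usual formulation (relevance weighted by `lambda_mult`, redundancy by `1 - lambda_mult`), not LangChain's actual code; the function names and the two-dimensional vectors are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Penalty: similarity to the closest document already selected
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # index 1 nearly duplicates index 0

print(mmr(query_vec, vectors, k=2, lambda_mult=1.0))   # [0, 1] pure relevance
print(mmr(query_vec, vectors, k=2, lambda_mult=0.25))  # [0, 2] favors diversity
```

With `lambda_mult=0.25` the near-duplicate at index 1 is skipped in favor of the less similar but more distinct vector at index 2.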

In [13]:
query = "What can you tell me about foreign exchange contracts?"

chroma_db = Chroma(persist_directory="../chroma_db", embedding_function=vectorizer)

docs = chroma_db.similarity_search(query, k=10)

len(docs)

10

In [14]:
docs[5]

Document(page_content='The gross fair values of the Company’s derivative assets and liabilities as of September\xa024, 2022  were as follows (in millions):\n2022\nFair Value of\nDerivatives Designated\nas Accounting HedgesFair Value of\nDerivatives Not Designated\nas Accounting HedgesTotal\nFair Value\nDerivative assets (1):\nForeign exchange contracts $ 4,317 $ 2,819 $ 7,136 \nDerivative liabilities (2):\nForeign exchange contracts $ 2,205 $ 2,547 $ 4,752 \nInterest rate contracts $ 1,367 $ — $ 1,367 \n(1) Derivative assets are measured using Level 2 fair value inputs and are included in other current assets and other non-\ncurrent assets in the Consolidated Balance Sheet.\n(2) Derivative liabilities are measured using Level 2 fair value inputs and are included in other current liabilities and other non-\ncurrent liabilities in the Consolidated Balance Sheet.\nThe derivative assets above represent the Company’s gross credit exposure if all counterparties failed to perform. To mitigate

In [15]:
retriever = chroma_db.as_retriever(search_type="mmr", search_kwargs={'k': 15, 'lambda_mult': 0.25})

In [16]:
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x10b19c280>, search_type='mmr', search_kwargs={'k': 15, 'lambda_mult': 0.25})

## 8. Prompt template

Prompt templates are predefined recipes for generating instructions for language models.

A template can include instructions, context, and specific questions suitable for a given task. LangChain provides tools for creating and working with instruction templates and also strives to create model-agnostic templates to facilitate the reuse of existing templates across different language models.

In [17]:
from langchain.prompts import ChatPromptTemplate

In [18]:
template = """
            Answer the question based on the context below. If you can't 
            answer the question, reply "I don't know".

            Context: {context}

            Question: {question}
            """


prompt = ChatPromptTemplate.from_template(template)
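
Filling the placeholders works much like Python string formatting. The sketch below shows the idea with a plain string rather than `ChatPromptTemplate`; `toy_template` and the sample context and question are placeholders made up for the example.

```python
toy_template = "Context: {context}\n\nQuestion: {question}"

# Each {placeholder} is replaced by the value passed under the same name
filled = toy_template.format(
    context="(retrieved chunks would go here)",
    question="What can you tell me about foreign exchange contracts?",
)

print(filled)
```

In the real chain, the retriever supplies the value for `context` and the user's query supplies `question`.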

## 9. Chain

A "chain" is a sequence of components linked together to perform a specific task involving an LLM. LangChain is a library for building and deploying language applications by chaining together components such as models, databases, and custom logic. Each component handles one part of the task, and the output of one component serves as the input for the next. A chain therefore acts as a pipeline: data flows through each component and is transformed, enriched, or consumed at each step.

In LangChain, the StrOutputParser parses the model's output directly into a string format. We will use this parser when creating the LangChain sequence; it will be an additional link in the chain, allowing us to directly obtain the LLM's response in string format.

RunnablePassthrough allows inputs to pass through unchanged.
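
To show how the `|` operator wires components together, here is a toy sketch in plain Python. It imitates the pattern only: `Step`, `parallel`, and the fake components are hypothetical stand-ins, not LangChain classes. In LangChain, a plain dict at the head of a chain is coerced into a parallel step in a similar spirit, with each value receiving the same input.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Run every step on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


passthrough = Step(lambda value: value)                  # forwards the input unchanged
fake_retriever = Step(lambda q: ["chunk about " + q])    # pretends to fetch documents
fake_prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")
fake_model = Step(lambda text: text.upper())             # pretends to be an LLM

toy_chain = parallel(context=fake_retriever, question=passthrough) | fake_prompt | fake_model

print(toy_chain.invoke("fx"))  # CONTEXT: ['CHUNK ABOUT FX'] | QUESTION: FX
```

The passthrough step is what lets the original question reach the prompt alongside the retrieved context.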

In [19]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

In [20]:
from langchain_core.runnables import RunnablePassthrough

In [21]:
query

'What can you tell me about foreign exchange contracts?'

In [22]:
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model | parser


response = chain.invoke(query)


response

"Foreign exchange contracts are used by the Company as a form of derivative instrument to manage its exposure to fluctuations in foreign exchange rates. These contracts help to hedge against potential adverse movements in currency exchange rates that could impact the Company's financial performance. The types of foreign exchange derivatives mentioned include forward and option contracts. These instruments are part of the Company's broader strategy to mitigate the financial risks associated with changes in exchange rates. For instance, the fair values of these derivatives as of September 24, 2022, were reported with derivative assets in foreign exchange contracts amounting to $4,317 million designated as accounting hedges and $2,819 million not designated as accounting hedges, and derivative liabilities in foreign exchange contracts were $2,205 million designated as accounting hedges and $2,547 million not designated as accounting hedges."

## 10. Code Summary

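The complete pipeline in one cell, step by step. It assumes the chunks have already been embedded and persisted to `../chroma_db`, as done in section 6.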

In [23]:
import os                          
from dotenv import load_dotenv      
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.chat_models import ChatOpenAI   
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


load_dotenv()


OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")



vectorizer = OpenAIEmbeddings()

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4-turbo")

chroma_db = Chroma(persist_directory="../chroma_db", embedding_function=vectorizer)

retriever = chroma_db.as_retriever(search_type="mmr", search_kwargs={'k': 15, 'lambda_mult': 0.25})

template = """
            Answer the question based on the context below. If you can't 
            answer the question, reply "I don't know".

            Context: {context}

            Question: {question}
            """


prompt = ChatPromptTemplate.from_template(template)


parser = StrOutputParser()

chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model | parser


In [24]:
query = "What can you tell me about foreign exchange contracts?"

chain.invoke(query)

'Foreign exchange contracts are utilized by the Company as derivative instruments to manage exposure to fluctuations in foreign exchange rates. These contracts include both foreign exchange forward and option contracts. The Company uses these derivatives to hedge against adverse movements in exchange rates that could negatively impact its financial results, specifically its gross margins and net sales, when expressed in U.S. dollars. However, the effectiveness of these hedging activities may not fully offset the financial effects of unfavorable movements in foreign exchange rates during the periods the hedges are in place. The Company also faces risks related to the fair values of these contracts, which can be significantly affected by changes in exchange rates.'

In [25]:
query = "What are the main products?"

chain.invoke(query)

'The main products of the Company are smartphones, personal computers, tablets, wearables, and accessories.'

In [26]:
query = "What can you tell me about legal proceedings?"

chain.invoke(query)

"Epic Games filed a lawsuit against Apple Inc. in the U.S. District Court for the Northern District of California, alleging violations of federal and state antitrust laws and California’s unfair competition law due to Apple's operation of its App Store. The District Court ruled mostly in favor of Apple on September 10, 2021, except for finding certain App Store Review Guidelines in violation of California’s law, leading to an injunction against Apple's prohibition of external purchasing links in apps. The U.S. Court of Appeals for the Ninth Circuit affirmed this ruling, and after further appeals were denied, Apple's motion to stay enforcement of the injunction was granted pending a potential U.S. Supreme Court appeal.\n\nAdditionally, Masimo Corporation and Cercacor Laboratories filed a complaint with the U.S. International Trade Commission (ITC) alleging that Apple infringed on patents related to the blood oxygen feature in Apple Watch Series 6 and 7. The ITC issued a limited exclusio