# 📄 AI Financial Analyst: Document Q\&A with RAG

Natural Language Processing | 🧠 Retrieval-Augmented Generation | 🤖 LLMs

---

## 📌 Notebook Summary

This notebook implements a complete **Retrieval-Augmented Generation (RAG)** pipeline from scratch. The goal is to take a large, unstructured PDF document — such as a company's 10-K annual report — and turn it into a **searchable knowledge base**.

We can then ask **natural language questions** and receive **accurate, source-backed answers** from a locally-run Large Language Model (LLM).

The entire process, from **data extraction** to **final answer generation**, is self-contained within this notebook.

---

## 🚀 Pipeline Stages

This notebook is divided into four main parts that execute the full RAG workflow:

### 1. 🧾 Text Extraction

* **Goal**: Process the source PDF document
* **Method**: Uses the `PyMuPDF` library to parse the PDF page by page and extract all its text content into a single string

### 2. ✂️ Text Chunking

* **Goal**: Break down the massive text into smaller, meaningful pieces for analysis
* **Method**: Employs LangChain's `RecursiveCharacterTextSplitter` to divide the text into overlapping chunks of a fixed size, ensuring context is preserved across splits

### 3. 📦 Vector Store Creation

* **Goal**: Create a searchable "knowledge base" from the text chunks
* **Method**:

  * Uses the `sentence-transformers/all-MiniLM-L6-v2` model to convert each chunk into a numerical embedding
  * Stores these embeddings in a **FAISS** index, a highly efficient similarity search library

### 4. ❓ Q\&A Inference

* **Goal**: Ask a question and generate an answer based on the document
* **Method**:

  * The user's question is embedded using the same model
  * FAISS retrieves the most relevant chunks from the vector store
  * The question and retrieved chunks are passed as context to `google/flan-t5-base`, which generates the final answer

---


In [12]:
import fitz  
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pickle
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
import transformers
import torch

In [2]:
pdf_path = "NASDAQ_AAPL_2024.pdf" 
output_text_file = "extracted_report_text.txt"

In [3]:
try:
    doc = fitz.open(pdf_path)
    full_text = ""
    
    num_pages = len(doc)
    
    for page_num in range(num_pages):
        page = doc.load_page(page_num)
        page_text = page.get_text()

        full_text += f"\n--- Page {page_num + 1} ---\n"
        full_text += page_text
    doc.close()

    with open(output_text_file, "w", encoding="utf-8") as f:
        f.write(full_text)
        
    print(f" Success! Extracted {num_pages} pages.")
    print(f"Full text saved to '{output_text_file}'")
    print("\n--- PREVIEW OF EXTRACTED TEXT ---")
    print(full_text[:1000])

except FileNotFoundError:
    print(f" Error: The file '{pdf_path}' was not found.")
    print("Please make sure the PDF file is in the same directory as this script and the filename is correct.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

 Success! Extracted 121 pages.
Full text saved to 'extracted_report_text.txt'

--- PREVIEW OF EXTRACTED TEXT ---

--- Page 1 ---
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended September 28, 2024
or
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from              to             .
Commission File Number: 001-36743
Apple Inc.
(Exact name of Registrant as specified in its charter)
California
94-2404110
(State or other jurisdiction
of incorporation or organization)
(I.R.S. Employer Identification No.)
One Apple Park Way
Cupertino, California
95014
(Address of principal executive offices)
(Zip Code)
(408) 996-1010
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading symbol(s)
Name

In [4]:
input_text_file = "extracted_report_text.txt"
output_chunks_file = "report_chunks.pkl"

chunk_size = 1000
chunk_overlap = 200

In [5]:
try:
    with open(input_text_file, "r", encoding="utf-8") as f:
        full_text = f.read()
    print("Successfully loaded the extracted text file.")

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_text(full_text)
    with open(output_chunks_file, 'wb') as f:
        pickle.dump(chunks, f)

    print(f"\n Success! The document was split into {len(chunks)} chunks.")
    print(f"Chunks saved to '{output_chunks_file}'")
    print("\n--- PREVIEW OF FIRST CHUNK ---")
    print(chunks[0])
    
except FileNotFoundError:
    print(f" Error: The file '{input_text_file}' was not found.")
    print("Please make sure you have successfully run the previous extraction step.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Successfully loaded the extracted text file.

 Success! The document was split into 558 chunks.
Chunks saved to 'report_chunks.pkl'

--- PREVIEW OF FIRST CHUNK ---
--- Page 1 ---
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended September 28, 2024
or
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from              to             .
Commission File Number: 001-36743
Apple Inc.
(Exact name of Registrant as specified in its charter)
California
94-2404110
(State or other jurisdiction
of incorporation or organization)
(I.R.S. Employer Identification No.)
One Apple Park Way
Cupertino, California
95014
(Address of principal executive offices)
(Zip Code)
(408) 996-1010
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of t

In [6]:
input_chunks_file = "report_chunks.pkl"
faiss_index_path = "faiss_index"
model_name = "sentence-transformers/all-MiniLM-L6-v2"

In [7]:
try:
    with open(input_chunks_file, 'rb') as f:
        chunks = pickle.load(f)
    print(f"Successfully loaded {len(chunks)} chunks from '{input_chunks_file}'.")
    print(f"Initializing embedding model: {model_name}")

    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    print("Creating vector store from chunks. This may take a moment...")

    vectorstore = FAISS.from_texts(texts=chunks, embedding=embeddings)
    vectorstore.save_local(faiss_index_path)
    
    print(f"\n Success! Vector store created and saved to '{faiss_index_path}'.")
    
except FileNotFoundError:
    print(f" Error: The file '{input_chunks_file}' was not found.")
    print("Please make sure you have successfully run the previous chunking step.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")


Successfully loaded 558 chunks from 'report_chunks.pkl'.
Initializing embedding model: sentence-transformers/all-MiniLM-L6-v2


  embeddings = HuggingFaceEmbeddings(model_name=model_name)





modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Creating vector store from chunks. This may take a moment...

 Success! Vector store created and saved to 'faiss_index'.


In [10]:
faiss_index_path = "faiss_index"
model_name = "sentence-transformers/all-MiniLM-L6-v2"
llm_model_name = "google/flan-t5-base"

In [14]:
try:
    print("Loading embedding model...")
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    print("Loading FAISS vector store...")
    vectorstore = FAISS.load_local(faiss_index_path, embeddings, allow_dangerous_deserialization=True)
    print("Vector store loaded successfully.")

    print(f"Loading LLM: {llm_model_name}. This might take a while...")

    tokenizer = transformers.AutoTokenizer.from_pretrained(llm_model_name)
    model = transformers.AutoModelForSeq2SeqLM.from_pretrained(llm_model_name)

    pipe = transformers.pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    print("LLM pipeline created.")

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="refine",  
        retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True
    )
    
    print("Q&A chain is ready.")
    question = "What were the main risks identified by the company?"
    print(f"\nAsking question: {question}")
    result = qa_chain.invoke({"query": question})
    
    print("\n--- ANSWER ---")
    print(result['result'])
    
    print("\n--- RELEVANT SOURCES ---")
    for i, source in enumerate(result['source_documents']):
        try:
            page_info = f"(from Page {source.page_content.splitlines()[1].split()[-1]})"
        except IndexError:
            page_info = "(Page number not found)"
            
        print(f"Source {i+1} {page_info}:")
        print(source.page_content[:300] + "...")
        print("-" * 20)

except Exception as e:
    print(f"An unexpected error occurred: {e}")

Loading embedding model...
Loading FAISS vector store...
Vector store loaded successfully.
Loading LLM: google/flan-t5-base. This might take a while...


Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


LLM pipeline created.
Q&A chain is ready.

Asking question: What were the main risks identified by the company?


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



--- ANSWER ---
design and manufacturing defects that could materially adversely affect the Company’s business and result in harm to the Company’s reputation ------------

--- RELEVANT SOURCES ---
Source 1 (from Page to):
The Company’s operations are also subject to the risks of industrial accidents at its suppliers and contract manufacturers. While the Company’s suppliers are
required to maintain safe working environments and operations, an industrial accident could occur and could result in serious injuries or loss...
--------------------
Source 2 (from Page affect):
--- Page 12 ---
The Company’s products and services may be affected from time to time by design and manufacturing defects that could materially adversely affect
the Company’s business and result in harm to the Company’s reputation.
The Company offers complex hardware and software products and servic...
--------------------
Source 3 (from Page periods.):
Because of the following factors, as well as other factors affecting