# PDF Investor Summarizer

This notebook demonstrates how to use the [pdf-investor-summarizer](https://github.com/ashishki/pdf-investor-summarizer) project to automatically extract investment-relevant insights from company PDF reports using an LLM (OpenAI GPT).

**Features:**
- Loads and parses PDF files (local or public URL)
- Cleans and splits text for efficient LLM extraction
- Asks the LLM for a structured summary (growth, changes, triggers, material factors)
- Supports cost/token accounting and error handling


In [None]:
#  1. Install dependencies (for Colab)

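# Option 1: pip install directly from GitHub
!pip install git+https://github.com/ashishki/pdf-investor-summarizer.git

# Option 2: clone the repo and pip install from the local checkout.
# Cloning also puts the bundled demo PDF (assets/company_report.pdf) in the working directory.
!git clone https://github.com/ashishki/pdf-investor-summarizer.git
%cd pdf-investor-summarizer
!pip install .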

# (Optional) For OCR support in Colab (-y skips the interactive confirmation prompt):
# !apt-get install -y tesseract-ocr

In [None]:
#  2. Import project modules

# All necessary modules are now available from the installed package.
import os
import asyncio
from src.pdf_investor_summarizer.report_analyzer import ReportAnalyzer

# (If you want to edit or debug, you can clone the repo instead of pip install)
# !git clone https://github.com/ashishki/pdf-investor-summarizer.git
# %cd pdf-investor-summarizer


In [None]:
#  3. Configure your API key and input PDF

# Set your OpenAI API key (required for the LLM)
os.environ["OPENAI_API_KEY"] = "sk-..."  # <-- Replace with your key!

# Path to the default demo PDF (bundled with the repo)
PDF_SOURCE = "assets/company_report.pdf"
print(f"Default demo PDF: {PDF_SOURCE}")

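# Optional: upload another PDF to use instead of the demo file (Colab only)
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    uploaded = {}  # not running in Colab, so there is no upload widget

if uploaded:
    PDF_SOURCE = next(iter(uploaded))  # name of the first uploaded file
    print(f"Using uploaded PDF: {PDF_SOURCE}")
else:
    print("No upload detected, using demo PDF.")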
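
Pasting a real key into a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads the key from the environment and only prompts (with hidden input) when it is missing. The name `ensure_api_key` and its parameters are hypothetical and not part of pdf-investor-summarizer.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the API key from the environment, prompting only if it is missing.

    `prompt` is injectable so the helper can be exercised without a terminal.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Hidden input: the key never appears in the saved notebook source.
        key = prompt(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} is required")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the hard-coded assignment leaves `os.environ["OPENAI_API_KEY"]` set for the rest of the session.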



In [None]:
#  4. Run the summarization pipeline

# Instantiate the main analyzer. Chunk size and overlap can be tuned for your document size.
analyzer = ReportAnalyzer(chunk_size=2000, overlap=200)

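# Run the analysis pipeline asynchronously (required for LLM API calls).
# Colab/Jupyter already runs an event loop, so asyncio.run() would raise
# "RuntimeError: asyncio.run() cannot be called from a running event loop".
# Use top-level await instead (in a plain .py script, use asyncio.run(...)).
result = await analyzer.analyze_async(PDF_SOURCE)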

# Show the structured summary output
import json
print(json.dumps(result, indent=2, ensure_ascii=False))
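
To make the `chunk_size` and `overlap` parameters concrete, here is a rough sketch of overlapping chunking. It assumes both values are measured in characters and that splitting is purely positional; the project itself may count tokens or split at sentence boundaries, so treat `split_with_overlap` as a hypothetical illustration rather than the library's implementation.

```python
def split_with_overlap(text, chunk_size=2000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk repeats the last `overlap` characters of the previous one,
    so a sentence cut at a boundary still appears whole in one chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]
```

With these defaults a 5,000-character document yields three chunks starting at offsets 0, 1800, and 3600. A larger overlap preserves more context across boundaries but sends more duplicated text to the LLM.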


In [None]:
#  5. (Optional) How much did this cost?

# Token/cost stats are already logged to console by the pipeline.
# If you want to analyze further, you can add custom print statements or save stats from result.
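
If you record token counts yourself, a back-of-the-envelope cost estimate is simple arithmetic. The sketch below assumes you have one `(prompt_tokens, completion_tokens)` pair per LLM call and that pricing is quoted per 1,000 tokens; the function name and the rates in the usage note are placeholders, not the project's API or real OpenAI prices.

```python
def estimate_cost_usd(calls, usd_per_1k_prompt, usd_per_1k_completion):
    """Estimate total cost from per-call (prompt_tokens, completion_tokens) pairs.

    The per-1k rates must be supplied by the caller from the provider's
    current price list; nothing here is looked up automatically.
    """
    prompt_total = sum(prompt for prompt, _ in calls)
    completion_total = sum(completion for _, completion in calls)
    return (prompt_total / 1000 * usd_per_1k_prompt
            + completion_total / 1000 * usd_per_1k_completion)
```

For example, with made-up rates, `estimate_cost_usd([(8000, 1000), (4000, 500)], 0.5, 1.5)` sums to 12,000 prompt tokens and 1,500 completion tokens, giving 6.00 + 2.25 = 8.25.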


## Notes & Tips

- The pipeline supports both local files and public PDF URLs (except direct links from Google Drive/Dropbox/OneDrive — see README).
- If you get a PDF parsing error, make sure your input is a real PDF file, not a web page.
- For large files, cost and runtime increase with document length, so try a short report first.
- Adjust chunk size and overlap for your use case: smaller values are cheaper but may give the LLM less context per chunk.
- All token/cost stats and errors are printed in the output cells for transparency and easy debugging.
- For Google Drive, download your PDF to local disk and then upload via Colab.
- Poetry is used for development, but for Colab you only need pip install.
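
One quick way to tell a real PDF from a saved web page, as mentioned in the tips above, is to look for the `%PDF-` signature near the start of the file. This is a minimal sketch; the helper names are hypothetical, and the pipeline may perform its own validation.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF starts with this signature and a version number


def looks_like_pdf(data: bytes) -> bool:
    """True if the PDF signature appears within the first 1024 bytes.

    Some files carry a few junk bytes before the header, so the check
    scans a small prefix instead of requiring an exact match at offset 0.
    """
    return PDF_MAGIC in data[:1024]


def file_looks_like_pdf(path) -> bool:
    """Read only the head of the file and check it for the signature."""
    with Path(path).open("rb") as fh:
        return looks_like_pdf(fh.read(1024))
```

Running `file_looks_like_pdf(PDF_SOURCE)` on a local file before starting the pipeline catches the common case where a download link returned an HTML page instead of the document.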

