In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import base64
import pymupdf
from openai import OpenAI

In [3]:
from pricing import calculate_cost_response
from models import Page, PageResponse

In [5]:
openai_client = OpenAI()
doc = pymupdf.open(r"C:\Users\alexe\Downloads\books\math-engineers.pdf")

In [6]:
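def to_base64_url(page):
    """Render a PDF page to PNG and return it as a base64 data URL."""
    # Zoom 1.5x on both axes for a sharper image than the 72 dpi default.
    matrix = pymupdf.Matrix(1.5, 1.5)

    pixmap = page.get_pixmap(matrix=matrix)
    png_bytes = pixmap.tobytes(output="png")

    b64 = base64.b64encode(png_bytes).decode("utf-8")
    image_url = f"data:image/png;base64,{b64}"

    return image_url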
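
As a rough check on what `pymupdf.Matrix(1.5, 1.5)` does to the image size: PDF page dimensions are measured in points (72 per inch), and the zoom factor scales both axes. The sketch below is plain arithmetic rather than a PyMuPDF call. `rendered_size` is a hypothetical helper, and the US Letter dimensions are only an example, not the size of the book's pages.

```python
def rendered_size(width_pt: float, height_pt: float, zoom: float = 1.5):
    """Return approximate (width_px, height_px, dpi) for a page rendered at `zoom`."""
    dpi = 72 * zoom  # 72 points per inch is the PDF default resolution
    return round(width_pt * zoom), round(height_pt * zoom), dpi


# Example: a US Letter page is 612 x 792 points (assumed page size).
print(rendered_size(612, 792))
```

A larger zoom gives the model more pixels to read small subscripts from, at the price of a bigger base64 payload.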

In [7]:
instructions = """
You are extracting a textbook page into structured page blocks.

Extract text and formulas verbatim.

Use LaTeX for all math.
Use "$" for inline equations and the EquationBlock type for block equations.

Treat an inline equation as a block equation
if there is little text around it.

Important: do not skip any text. If something cannot be
recognized, include a placeholder.

Extraction rules:
1) Preserve reading order. The blocks list must match the order a human reads the page.
2) Do NOT include OCR or layout details (no coordinates, fonts, line breaks, or scan artifacts).
3) Prefer fewer, larger TextBlocks over many tiny ones. Group adjacent paragraphs when they belong together.
4) Use LaTeX for all math in EquationBlock.latex.
5) Section headings must be SectionHeadingBlock only; do not include body text in them.
6) FigureBlock.description should explain what the figure conveys conceptually (graphs, curves, relationships),
   not how it looks on the page.
7) TableBlock should capture semantic columns and rows. Include units in column names if shown.
8) Store the running page header (if any) in Page.header.
9) If uncertain, make a best-faith concise extraction; do not invent content.
""".strip()

def extract_page_information(page, detail="low", model_name="gpt-4o-mini") -> PageResponse:
    """Send one rendered page to the model; return the parsed Page and the call cost."""
    image_url = to_base64_url(page)

    # The page image is passed inline as a data URL.
    image_content = {
        "type": "input_image",
        "image_url": image_url,
        "detail": detail
    }

    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": [image_content]}
    ]

    response = openai_client.responses.parse(
        model=model_name,
        input=messages,
        text_format=Page
    )

    cost = calculate_cost_response(response)

    return PageResponse(
        page=response.output_parsed,
        cost=cost
    )
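
The `models` module is not shown in this notebook. The sketch below is one guess at the data shape that `Page` and `PageResponse` might have, inferred only from the block names in the prompt and from the printed output further down. Every field name here (`title`, `text`, `label`, `columns`, `rows`, and so on) and the `render` helper are assumptions, not the actual definitions, and the sketch has not been checked against the structured-output schema requirements.

```python
from typing import List, Literal, Optional, Union

from pydantic import BaseModel


class SectionHeadingBlock(BaseModel):
    type: Literal["section_heading"]
    title: str


class TextBlock(BaseModel):
    type: Literal["text"]
    text: str  # inline math stays inside "$...$"


class EquationBlock(BaseModel):
    type: Literal["equation"]
    latex: str


class FigureBlock(BaseModel):
    type: Literal["figure"]
    label: str
    description: str  # what the figure conveys, not how it looks


class TableBlock(BaseModel):
    type: Literal["table"]
    columns: List[str]
    rows: List[List[str]]


Block = Union[SectionHeadingBlock, TextBlock, EquationBlock, FigureBlock, TableBlock]


class Page(BaseModel):
    header: Optional[str]  # running page header, if any
    blocks: List[Block]    # in reading order

    def render(self) -> str:
        """Build a plain-text view of the page, one block per paragraph."""
        parts = [self.header] if self.header else []
        for block in self.blocks:
            if isinstance(block, SectionHeadingBlock):
                parts.append(block.title)
            elif isinstance(block, TextBlock):
                parts.append(block.text)
            elif isinstance(block, EquationBlock):
                parts.append(f"$${block.latex}$$")
            elif isinstance(block, FigureBlock):
                parts.append(f"{block.label}\n{block.description}")
            elif isinstance(block, TableBlock):
                rows = [" | ".join(block.columns)] + [" | ".join(r) for r in block.rows]
                parts.append("\n".join(rows))
        return "\n".join(parts)

    def print(self) -> None:
        print(self.render())


class PageResponse(BaseModel):
    page: Page
    cost: float  # cost of the API call in USD


# Hand-built example; no API call involved.
example = Page(
    header="83",
    blocks=[
        SectionHeadingBlock(type="section_heading", title="ADDITIONAL RULES OF DIFFERENTIATION"),
        EquationBlock(type="equation", latex="a = b"),
    ],
)
example.print()
```

Tagging each block with a literal `type` keeps the union unambiguous when the model's JSON is parsed back into objects.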

In [8]:
page_response = extract_page_information(doc[100])

In [9]:
page_response.cost

0.00081765
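
The `pricing` module is not shown either. A cost like the one above is typically the token usage multiplied by per-token rates, so a minimal sketch might look like the following. `cost_from_usage` is a hypothetical helper, and the rates in the example call are placeholders to be replaced with the current price list for the chosen model. With the Responses API, the token counts would presumably come from `response.usage.input_tokens` and `response.usage.output_tokens`.

```python
def cost_from_usage(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the cost in USD of one call, given token counts and per-million rates."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Placeholder token counts and rates (USD per million tokens), for illustration only.
print(cost_from_usage(1_000, 500, input_price_per_million=0.15, output_price_per_million=0.60))
```

Multiplying the per-page figure by the number of pages gives a quick estimate for a whole book before committing to a full run.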

In [11]:
page_response.page.print()

83
ADDITIONAL RULES OF DIFFERENTIATION
If $z$ is a function of $x$ and $y$, i.e., $z = f(x, y)$, the total differential $dz$ is obtained from the partial differentials $dx$ and $dy$ by the use of the following relationship:

$$dz = \frac{dz}{dx}dx + \frac{dz}{dy}dy$$

The reason for this is more clearly seen if we work from the fundamental idea of change, and introduce the actually measurable quantities like $x$, $y$, and $z$.

Figure 21
The figure illustrates the relationship between the change in $z$ and the changes in $x$ and $y$ as $P$ moves to $Q$ on a surface. It shows how the differential change in one variable affects the corresponding change in another variable.
Fig. 21

Thus:

$$\frac{dz}{dx} = \frac{\Delta z}{\Delta x}$$

The change in $z$ due to a change in $x$ can be measured by the product of the change in $x$ multiplied by the rate of change of $z$ with respect to $x$; and this fact can be better illustrated by referring to a diagram (Fig. 21).

Let $P$ be a point $(x, y