In [10]:
import torch

from transformers import AutoProcessor, AutoModelForVision2Seq

In [11]:
model_name = "ibm-granite/granite-docling-258M"
# Use Apple's Metal (MPS) backend when it is available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")

granite_docling_processor = AutoProcessor.from_pretrained(model_name)

granite_docling_model = AutoModelForVision2Seq.from_pretrained(
    pretrained_model_name_or_path=model_name,
    dtype=torch.bfloat16,
).to(device)

Device: mps


In [12]:
from transformers.image_utils import load_image

image_path = "data/images/image.png"
image = load_image(image=image_path)


In [13]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]

prompt = granite_docling_processor.apply_chat_template(
    conversation=messages, add_generation_prompt=True
)
inputs = granite_docling_processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(device)

In [14]:
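generated_ids = granite_docling_model.generate(
    **inputs, max_new_tokens=8192, use_cache=True
)
# generate() returns the prompt tokens followed by the new tokens,
# so drop the prompt before decoding.
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
# Keep special tokens so the tag markup is not stripped from the output.
doc_tags = granite_docling_processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()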
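
The decoded string is markup rather than plain text: each page element appears as a named tag followed by four `<loc_…>` tokens and then its text. The cell below is a minimal sketch of pulling those elements out with a regular expression. The helper name `parse_doctag_elements`, the pattern, and the reading of the four numbers as left, top, right, bottom are assumptions inferred from output of this shape, not part of the model's or the library's API. It only handles flat elements and ignores nested structures such as tables; the docling-core package is likely the better route to a full document object.

```python
import re

# Assumed layout: <name><loc_x0><loc_y0><loc_x1><loc_y1>text</name>
ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)><loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_doctag_elements(doc_tags):
    """Hypothetical helper: return one dict per flat, located element."""
    elements = []
    for match in ELEMENT_RE.finditer(doc_tags):
        elements.append(
            {
                "tag": match.group("tag"),
                # Coordinates are kept in the model's own location units.
                "bbox": tuple(int(match.group(k)) for k in ("x0", "y0", "x1", "y1")),
                "text": match.group("text").strip(),
            }
        )
    return elements


# Short sample in the same shape as the model output; in the notebook,
# pass `doc_tags` instead.
sample = (
    "<doctag><page_header><loc_115><loc_27><loc_385><loc_34>"
    "Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>\n"
    "<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>\n"
)
elements = parse_doctag_elements(sample)
for element in elements:
    print(element["tag"], element["bbox"], element["text"])
```

The outer `<doctag>` wrapper is skipped automatically because it is not followed by location tokens.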

In [15]:
doc_tags

"<doctag><page_header><loc_115><loc_27><loc_385><loc_34>Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>\n<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>\n<text><loc_41><loc_42><loc_239><loc_88>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.</text>\n<text><loc_41><loc_89><loc_239><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨ıvely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the refl