<a href="https://colab.research.google.com/github/ajiayi-debug/munichreassignment/blob/main/extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Extraction with Qwen3 VL

This notebook extracts text and layout information from PDF documents using the Qwen3-VL-8B-Instruct vision-language model.

## Setup (Google Colab)

Install the dependencies required for the Qwen3-VL model and mount Google Drive to access the PDF. Save the PDF in your Drive root (MyDrive).

If you would like to use a different PDF, make sure to change the file name in pdf_path.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

pdf_path="/content/drive/MyDrive/Principles of Public Health.pdf"
output_dir="/content/drive/MyDrive/pdf_extraction"

Mounted at /content/drive


In [2]:
import os

output_dir="/content/drive/MyDrive/pdf_extraction"
os.makedirs(output_dir, exist_ok=True)
print(f"Output directory ensured at: {output_dir}")

Output directory ensured at: /content/drive/MyDrive/pdf_extraction


In [3]:
# Install system dependencies for PDF processing
!apt-get install -y poppler-utils
!pip install git+https://github.com/huggingface/transformers
!pip install pdf2image

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 41 not upgraded.
Need to get 186 kB of archives.
After this operation, 697 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.12 [186 kB]
Fetched 186 kB in 1s (178 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 121689 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.12_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.12) ...
Setting up poppler-utils (22.02.0-2ubuntu0.12) ...
Processing triggers for man-db (2.10.2-1) ...
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-alteg6m1
  Running command git clone --filter=blob

In [4]:
import os
import torch
import json
from pathlib import Path
from pdf2image import convert_from_path
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

print("All libraries imported successfully!")

All libraries imported successfully!


## Load Qwen3-VL Model

Load the Qwen3-VL-8B-Instruct model and processor.

Add your Hugging Face token to Colab Secrets under the name HF_TOKEN. You can create a token here:
https://huggingface.co/settings/tokens

In [5]:
from google.colab import userdata
token=userdata.get("HF_TOKEN")
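
The cell above only works inside Colab. A minimal sketch of one way to make the lookup portable, assuming the token may also be exported as an environment variable with the same name; `get_hf_token` is a hypothetical helper, not part of the notebook.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab Secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook: fall back to an environment variable
        return os.environ.get(name)
```

Outside Colab this returns `None` when the variable is unset, which `from_pretrained` accepts for public models.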

In [6]:
from transformers import AutoProcessor, AutoModelForImageTextToText

model_name = "Qwen/Qwen3-VL-8B-Instruct"

processor = AutoProcessor.from_pretrained(model_name, token=token)
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    dtype=torch.bfloat16,  # torch_dtype is deprecated in favor of dtype
    device_map="auto",
    token=token
)

print("Qwen3-VL model loaded successfully!")

Qwen3-VL model loaded successfully!


## Extraction

Because Qwen3-VL is a vision-language model, I was also able to ask it to describe each image and explain how it relates to the surrounding text, which gives the chatbot additional context.

In [7]:
prompt = """Please extract all the text content from this PDF page image.

Extract the following information:
1. All text content in reading order
2. Tables (format as markdown tables)
3. Headings and their hierarchy
4. Lists and bullet points
5. Any formulas or equations
6. Any Images, describe the images and what you can extract from them in terms of relevancy to the text

Organize the output in a clear, structured format that maintains the original document's layout and hierarchy. Return in markdown format"""

## Convert PDF to Images

Convert the PDF file into individual page images for processing.

In [8]:
print(f"Converting PDF to images: {pdf_path}")
images = convert_from_path(pdf_path)
print(f"Total pages: {len(images)}")

Converting PDF to images: /content/drive/MyDrive/Principles of Public Health.pdf
Total pages: 123
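
Rendering all 123 pages in one call keeps every page image in memory at once. A minimal sketch of one way to bound that, assuming pages are rendered in batches: the hypothetical helper `page_ranges` computes 1-based inclusive `(first, last)` ranges, and `render_in_batches` passes them to `convert_from_path` through its `first_page` and `last_page` arguments. The batch size of 10 is an arbitrary assumption.

```python
def page_ranges(total_pages, batch_size=10):
    """Yield 1-based inclusive (first, last) page ranges covering total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def render_in_batches(pdf_path, total_pages, batch_size=10):
    """Yield (page_num, image) pairs, rendering one batch of pages at a time."""
    # Imported here so page_ranges stays usable without pdf2image installed
    from pdf2image import convert_from_path

    for first, last in page_ranges(total_pages, batch_size):
        batch = convert_from_path(pdf_path, first_page=first, last_page=last)
        for offset, image in enumerate(batch):
            yield first + offset, image
```

The page count has to be known before rendering; `pdfinfo_from_path(pdf_path)["Pages"]` from pdf2image is one way to get it.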


## Process Each Page

Loop through each page and extract layout and text information using the Qwen3-VL-8B-Instruct model.

In [9]:
# Process Each Page with Checkpointing
page_outputs = []
output_dir_path = Path(output_dir)

# Ensure the directory exists
output_dir_path.mkdir(parents=True, exist_ok=True)

print(f"Processing {len(images)} pages with checkpointing...")

for page_num, image in enumerate(images, start=1):
    # Define the output path for this specific page
    page_output_path = output_dir_path / f"page_{page_num}_extraction.txt"

    # CHECKPOINT: reuse the saved result if this page was already processed
    if page_output_path.exists():
        print(f"✓ Page {page_num} already exists. Skipping processing.")
        with open(page_output_path, "r", encoding="utf-8") as f:
            content = f.read()
        # Drop the "=== Page X ===" header and the blank line after it so the
        # cached entry matches what freshly processed pages store
        if content.startswith("=== Page"):
            content = "\n".join(content.split("\n")[2:])
        page_outputs.append({"page": page_num, "output": content})
        continue

    print(f"\nProcessing page {page_num}/{len(images)}...")

    # Save temp image for the model to read
    temp_image_path = output_dir_path / f"temp_page_{page_num}.png"
    image.save(temp_image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": str(temp_image_path)
                },
                {"type": "text", "text": prompt}
            ]
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = processor(
        text=[text],
        images=[image],
        padding=True,
        return_tensors="pt",
    )

    # Move inputs to the device the model was loaded on (set by device_map="auto")
    inputs = inputs.to(model.device)

    # Generate output
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    with open(page_output_path, "w", encoding="utf-8") as f:
        f.write(f"=== Page {page_num} ===\n\n")
        f.write(output_text)
    page_outputs.append({
        "page": page_num,
        "output": output_text
    })

    temp_image_path.unlink()

    print(f"✓ Page {page_num} saved/completed")
    print(f"Preview: {output_text[:100]}...")


Processing 123 pages with checkpointing...
✓ Page 1 already exists. Skipping processing.
✓ Page 2 already exists. Skipping processing.
✓ Page 3 already exists. Skipping processing.
✓ Page 4 already exists. Skipping processing.
✓ Page 5 already exists. Skipping processing.
✓ Page 6 already exists. Skipping processing.
✓ Page 7 already exists. Skipping processing.
✓ Page 8 already exists. Skipping processing.
✓ Page 9 already exists. Skipping processing.
✓ Page 10 already exists. Skipping processing.
✓ Page 11 already exists. Skipping processing.
✓ Page 12 already exists. Skipping processing.
✓ Page 13 already exists. Skipping processing.
✓ Page 14 already exists. Skipping processing.
✓ Page 15 already exists. Skipping processing.
✓ Page 16 already exists. Skipping processing.
✓ Page 17 already exists. Skipping processing.
✓ Page 18 already exists. Skipping processing.
✓ Page 19 already exists. Skipping processing.
✓ Page 20 already exists. Skipping processing.
✓ Page 21 already exists. 
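
Because the loop treats any existing `page_N_extraction.txt` as finished, a session that disconnects mid-write could leave a truncated file that is skipped on the next run. A minimal sketch of one possible guard, assuming a write-then-rename pattern: the hypothetical helper `write_checkpoint` writes to a temporary sibling file and renames it with `os.replace`, so the final path only appears once the content is fully written. Whether the rename is atomic on a mounted Drive folder is an assumption here, not something the notebook confirms.

```python
import os
from pathlib import Path


def write_checkpoint(path, page_num, text):
    """Write one page's extraction so the final file never exists half-written."""
    path = Path(path)
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        # Same header format the notebook's loop writes
        f.write(f"=== Page {page_num} ===\n\n")
        f.write(text)
    # Rename over the final name only after the write has finished
    os.replace(tmp_path, path)
    return path
```

In the loop, this would stand in for the `with open(page_output_path, "w", ...)` block, called as `write_checkpoint(page_output_path, page_num, output_text)`.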

## Save Results

Combine the per-page extraction files saved during processing into a single full_extraction.txt file.

In [10]:
# Combine Extraction Results
# This reads the actual files from disk.

print("Generating full extraction file from checkpoints...")

output_dir_path = Path(output_dir)
combined_output_path = output_dir_path / "full_extraction.txt"

with open(combined_output_path, "w", encoding="utf-8") as outfile:
    # Iterate through the total count of images to maintain order
    for i in range(len(images)):
        page_num = i + 1
        page_file_path = output_dir_path / f"page_{page_num}_extraction.txt"

        outfile.write(f"\n{'='*60}\n")
        outfile.write(f"PAGE {page_num}\n")
        outfile.write(f"{'='*60}\n\n")

        if page_file_path.exists():
            with open(page_file_path, "r", encoding="utf-8") as infile:
                # Read the file content
                content = infile.read()
                lines = content.split('\n')
                if lines and lines[0].startswith("=== Page"):
                    content = "\n".join(lines[2:]) # Skip header and first newline

                outfile.write(content)
        else:
            outfile.write(f"[ERROR: Page {page_num} extraction file not found]\n")

        outfile.write("\n\n")

print(f"✓ Combined output saved to: {combined_output_path}")
print("=== Extraction Complete ===")

Generating full extraction file from checkpoints...
✓ Combined output saved to: /content/drive/MyDrive/pdf_extraction/full_extraction.txt
=== Extraction Complete ===
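
The combined file writes an error placeholder for any page whose checkpoint is missing. A small sketch, assuming the same `page_{n}_extraction.txt` naming, of listing those pages up front so they can be re-run before combining; `missing_pages` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def missing_pages(output_dir, total_pages):
    """Return the 1-based page numbers that have no checkpoint file yet."""
    output_dir = Path(output_dir)
    return [
        n for n in range(1, total_pages + 1)
        if not (output_dir / f"page_{n}_extraction.txt").exists()
    ]
```

Calling `missing_pages(output_dir, len(images))` before the combine step returns an empty list when every page has been extracted.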
