# DeepSeek OCR Pipeline - Google Cloud Run GPU

This notebook runs a three-stage OCR pipeline on Google Cloud Run with GPU support:

1. **Extract** – Run DeepSeek OCR over a dataset, save Markdown, and crop detected figures
2. **Describe** – Generate captions for extracted figures
3. **Assemble** – Enrich Markdown with figure captions

All three stages write a **Hugging Face dataset** to GCS. Each stage pulls the current dataset, adds new columns with its outputs, and saves the updated dataset back, so by the end you have one dataset with all intermediate and final results.

**Prerequisites:**
- Google Cloud credentials configured
- Cloud Run GPU enabled in your project
- Required packages: `google-cloud-run`, `google-cloud-storage`

**References:**
- [Cloud Run GPU Documentation](https://cloud.google.com/run/docs/configuring/services/gpu)
- [Cloud Run Pricing](https://cloud.google.com/run/pricing)


## 📑 Table of Contents

1. [Setup](#Setup)
   - [Google Cloud Run](#☁️-Google-Cloud-Run)
   - [The dataset](#📚-The-dataset)
   - [Inference Backend: vLLM](#⚡-Inference-Backend:-vLLM)
2. [Authentication](#🔐-Authentication)
3. [Configuration](#Configuration)
4. [Create GCS Bucket](#Create-GCS-Bucket)
5. [Build Container Image](#Build-Container-Image)
6. [Helper Functions](#Helper-Functions-(Python-SDK))
7. [Stage 1: Extract](#Stage-1:-Extract)
8. [Stage 2: Describe](#Stage-2:-Describe)
9. [Stage 3: Assemble](#Stage-3:-Assemble)
10. [Load Final Dataset](#Load-Final-Dataset)
11. [Cleanup](#Cleanup)


## Setup

### ☁️ Google Cloud Run

[Cloud Run](https://cloud.google.com/run) is a fully managed serverless platform that automatically scales containers. With [GPU support](https://cloud.google.com/run/docs/configuring/services/gpu), you can run ML inference workloads without managing infrastructure.

**Key features for this pipeline:**
- **L4 GPUs**: 24 GB VRAM, good for 7B-parameter models
- **Pay per use**: billed per second while the job runs
- **Auto-scaling**: scales to zero when idle
- **Container-based**: package your code and dependencies in a Docker image


### 📦 How the pipeline code is shipped

For every Cloud Run Job we launch, the logic is similar:

**From this notebook**, we build a Docker image containing:
1. The entrypoint script (`gcr_job_runner.py`)
2. The pipeline code in `llm_ocr/`

The image is pushed to Google Artifact Registry.

**Then we launch a Cloud Run Job** that pulls the Docker image and runs it. The job:
1. Starts a vLLM server with the DeepSeek-OCR model
2. Imports `llm_ocr.cli` and calls `main()` to run the requested pipeline stage
3. Saves results to GCS (Google Cloud Storage)
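
Step 1 implies the runner has to wait until the vLLM server accepts requests before it starts the pipeline stage. The sketch below shows one way such a wait could look. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the actual logic in `gcr_job_runner.py` may differ; the 900-second default simply mirrors the `VLLM_STARTUP_TIMEOUT` value used later in this notebook.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: the server is not ready yet
        return False


def wait_until_ready(probe, timeout_s=900.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Assuming the server listens on vLLM's default port 8000 inside the container, a call could look like `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`.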


### 📚 The dataset

This pipeline uses **FineVision** (`HuggingFaceM4/FineVision`) as a large, mixed **image+text** corpus. FineVision aggregates many public sub-datasets into one unified interface.

- **Dataset**: [`HuggingFaceM4/FineVision`](https://huggingface.co/datasets/HuggingFaceM4/FineVision)

#### The `olmOCR` subsets

The [`olmOCR-mix-0225`](https://arxiv.org/pdf/2502.18443) dataset from Allen AI contains **260,000 crawled PDF pages** from over 100,000 diverse PDFs: academic papers, legal documents, public domain books, brochures, and more. It includes challenging content: graphics, handwritten text, multi-column layouts, tables, equations, and poor quality scans.

Available configs:
- `olmOCR-mix-0225-documents`: general documents
- `olmOCR-mix-0225-books`: book pages

> 📄 **Note**: In this pipeline, **one document = one page** of a PDF.

These mirror real-world enterprise use cases: contracts, invoices, reports, forms, and scanned documents that organizations need to digitize and extract structured information from.


### ⚡ Inference Backend: vLLM

This pipeline uses [**vLLM**](https://github.com/vllm-project/vllm) as the inference backend for DeepSeek-OCR. vLLM provides:

- **High throughput** via continuous batching and PagedAttention
- **OpenAI-compatible API**: easy to integrate with existing code
- **Efficient memory management**: run large models on limited GPU memory

The Cloud Run job uses the [`vllm/vllm-openai:latest`](https://hub.docker.com/r/vllm/vllm-openai) Docker image as the base. The pipeline sends batched requests (32 concurrent) to keep the L4 GPU busy; the benchmark in the Cost Analysis section measured about 26 docs/min.
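
To illustrate what "32 concurrent" means in practice, here is a minimal sketch of bounding in-flight requests with an `asyncio.Semaphore`. It is not the pipeline's actual client code: `run_bounded` is a hypothetical helper, and `fake_ocr_request` stands in for a real call to the vLLM server.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency=32):
    """Run `worker(item)` for every item with at most `max_concurrency` in flight.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # The semaphore blocks here once `max_concurrency` workers are active
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_ocr_request(page_id):
    """Placeholder for a real request to the inference server (assumption)."""
    await asyncio.sleep(0.001)
    return f"markdown for page {page_id}"
```

From a script this can be driven with `asyncio.run(run_bounded(range(128), fake_ocr_request))`; inside a notebook, where an event loop is already running, use `await run_bounded(...)` instead.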


### 📝 DeepSeek-OCR Prompts

DeepSeek-OCR supports different prompts for various OCR tasks. See the [official config.py](https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py) for examples:

| Use Case | Prompt |
|----------|--------|
| **Document → Markdown** | `<image>\n<\|grounding\|>Convert the document to markdown.` |
| **General OCR** | `<image>\n<\|grounding\|>OCR this image.` |
| **Free OCR (no layout)** | `<image>\nFree OCR.` |
| **Parse figures** | `<image>\nParse the figure.` |
| **Describe image** | `<image>\nDescribe this image in detail.` |

We configure these prompts via environment variables `DOC_PROMPT` and `FIGURE_PROMPT` in our job configuration, re-using the special tokens from the [official DeepSeek-OCR config](https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py).
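
As a rough sketch of how a job could pick up these prompts, the snippet below reads `DOC_PROMPT` and `FIGURE_PROMPT` from the environment and falls back to entries from the table above. The fallback choices and the `load_prompts` helper are assumptions for illustration; the defaults used by `llm_ocr` may be different.

```python
import os

# Fallbacks taken from the prompt table above; which ones the pipeline
# actually defaults to is an assumption.
DEFAULT_DOC_PROMPT = "<image>\n<|grounding|>Convert the document to markdown."
DEFAULT_FIGURE_PROMPT = "<image>\nDescribe this image in detail."


def load_prompts(environ=os.environ):
    """Return the document and figure prompts, preferring environment overrides."""
    return {
        "doc": environ.get("DOC_PROMPT", DEFAULT_DOC_PROMPT),
        "figure": environ.get("FIGURE_PROMPT", DEFAULT_FIGURE_PROMPT),
    }
```

To override a prompt for a single stage, pass it in the stage-specific environment, for example `{"FIGURE_PROMPT": "<image>\nParse the figure."}`.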


In [None]:
# Install required packages
!pip install -q google-cloud-run google-cloud-storage google-cloud-build google-auth

In [3]:
import os
import json
import time
from pathlib import Path

from google.cloud import run_v2
from google.cloud import storage
from google.protobuf import duration_pb2

### 🔐 Authentication

Authenticate to Google Cloud before running jobs. You can use:
- **Application Default Credentials (ADC)**: recommended for local development
- **Service account key**: for production or CI/CD

See:
- [Authentication methods](https://docs.cloud.google.com/docs/authentication)
- [Install the Google Cloud SDK](https://cloud.google.com/sdk/docs/install-sdk)


In [4]:
# Option 1: Set credentials from a service account key file
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your/service-account-key.json"

# Option 2: If running locally, run this in terminal first:
# gcloud auth application-default login

# Verify credentials
from google.auth import default
credentials, project = default()
print(f"Authenticated with project: {project}")

Authenticated with project: huggingface-ml


## Configuration

In [5]:
# Set your Hugging Face token
# Get your token at: https://huggingface.co/settings/tokens
import os
from getpass import getpass

if not os.environ.get("HF_TOKEN"):
    # getpass keeps the token out of the notebook output
    os.environ["HF_TOKEN"] = getpass("Enter your HF token: ")

In [7]:
# GCP settings (update these for your project)
PROJECT_ID = project  # From authentication cell above
REGION = "us-east4"  # Cloud Run GPU available regions: us-central1, us-east4, europe-west4
BUCKET_NAME = f"{PROJECT_ID}-ocr"  # GCS bucket name

# Container image settings
IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/deepseek-ocr/deepseek-ocr:latest"

# Project settings
PROJECT_NAME = "deepseek-ocr"

# Model and dataset settings
MODEL_NAME = "deepseek-ai/DeepSeek-OCR"
DATASET_NAME = "HuggingFaceM4/FineVision"
DATASET_CONFIG = "olmOCR-mix-0225-documents"
MAX_SAMPLES = 128

# GPU configuration
GPU_TYPE = "nvidia-l4"  # Cloud Run supports L4 GPUs
GPU_COUNT = 1
MEMORY = "32Gi"
CPU = "8"

# GCS output path (single location for all stages - dataset gets updated in place)
GCS_OUTPUT_URI = f"gs://{BUCKET_NAME}/{PROJECT_NAME}/pipeline"

# Base environment variables (passed to all stages)
BASE_ENV = {
    "MODEL_ID": MODEL_NAME,
    "DATASET_NAME": DATASET_NAME,
    "DATASET_CONFIG": DATASET_CONFIG,
    "MAX_SAMPLES": str(MAX_SAMPLES),
    "HF_HUB_ENABLE_HF_TRANSFER": "1",
    # Performance tuning
    "EXTRACT_BATCH_SIZE": "32",
    "GPU_MEMORY_UTILIZATION": "0.90",
    "VLLM_STARTUP_TIMEOUT": "900",

}

# Add HF token to job environment
if os.environ.get("HF_TOKEN"):
    BASE_ENV["HF_TOKEN"] = os.environ.get("HF_TOKEN")

print(f"Project: {PROJECT_NAME}")
print(f"GCS Output URI: {GCS_OUTPUT_URI}")

Project: deepseek-ocr
GCS Output URI: gs://huggingface-ml-ocr/deepseek-ocr/pipeline


## Create GCS Bucket

In [8]:
# Create GCS bucket if it doesn't exist
from google.cloud.exceptions import NotFound

storage_client = storage.Client(project=PROJECT_ID)

try:
    bucket = storage_client.get_bucket(BUCKET_NAME)
    print(f"Bucket already exists: gs://{BUCKET_NAME}")
except NotFound:
    # Only create the bucket when it is missing; let other errors (e.g. permissions) surface
    bucket = storage_client.create_bucket(BUCKET_NAME, location=REGION)
    print(f"Created bucket: gs://{BUCKET_NAME}")

Bucket already exists: gs://huggingface-ml-ocr


## Build Container Image

**Note:** Building the container requires one of the following:
1. Build on a machine with Docker installed and push to Artifact Registry
2. Use Cloud Build (requires `gcloud` CLI or Cloud Build API)
3. Use a pre-built image

For simplicity, we'll create the Dockerfile here and you can build it separately.

In [9]:
%%writefile Dockerfile.cloudrun
FROM vllm/vllm-openai:latest

# Install uv for fast dependency management
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

# Copy pipeline code (from parent directory context)
WORKDIR /app
COPY llm_ocr/ /app/llm_ocr/
COPY google-cloud-run/gcr_job_runner.py /app/

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV PYTHONPATH=/app:$PYTHONPATH

# Entry point - uv run reads deps from inline script metadata (PEP 723)
ENTRYPOINT ["uv", "run", "/app/gcr_job_runner.py"]


Overwriting Dockerfile.cloudrun


### 🔨 Build and Push Instructions

Run these commands from the `batch-ocr-inference/` directory:

```bash
# 1. Authenticate Docker with Artifact Registry
gcloud auth configure-docker us-east4-docker.pkg.dev

# 2. Create Artifact Registry repository (if needed)
gcloud artifacts repositories create deepseek-ocr \
    --repository-format=docker \
    --location=us-east4 \
    --project=$PROJECT_ID

# 3. Build the Docker image
docker build -f google-cloud-run/Dockerfile.cloudrun \
    -t us-east4-docker.pkg.dev/$PROJECT_ID/deepseek-ocr/deepseek-ocr:latest .

# 4. Push to Artifact Registry
docker push us-east4-docker.pkg.dev/$PROJECT_ID/deepseek-ocr/deepseek-ocr:latest
```

> 💡 Replace `$PROJECT_ID` with your GCP project ID (e.g., `huggingface-ml`)


## Helper Functions ([Python SDK](https://cloud.google.com/python/docs/reference))

These functions use the [Cloud Run Python client library](https://cloud.google.com/python/docs/reference/run/latest) to create and manage jobs.
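
Both polling helpers in this section derive a status from an execution's `succeeded_count`, `failed_count`, and `running_count`. A minimal sketch of that decision as a pure function makes the precedence explicit and easy to test without a Cloud Run client. The name `classify_execution` is hypothetical; it is not part of this notebook or of the Cloud Run SDK.

```python
def classify_execution(succeeded: int, failed: int, running: int) -> str:
    """Map task counters of one execution to a coarse status string.

    Success is checked first, then failure, then running, matching the
    order of the if/elif chains in the helper functions.
    """
    if succeeded > 0:
        return "SUCCEEDED"
    if failed > 0:
        return "FAILED"
    if running > 0:
        return "RUNNING"
    # No task has started yet, or the counters are not populated
    return "UNKNOWN"
```

With a single task per job, as configured here, at most one of the three counters is non-zero once the execution has started.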

In [14]:
def launch_stage(stage: str, env: dict = None, use_gpu: bool = True):
    """Launch a pipeline stage as a Cloud Run job.
    
    Args:
        stage: Pipeline stage (extract, describe, assemble)
        env: Stage-specific environment variables (optional)
        use_gpu: Whether to use GPU (default True). Set False for CPU-only jobs.
        
    Returns:
        job_name: Name of the created/running job
    """
    from google.cloud import run_v2
    from google.protobuf import duration_pb2
    
    job_name = f"{PROJECT_NAME}-{stage}"
    
    # Create client with regional endpoint
    client_options = {"api_endpoint": f"{REGION}-run.googleapis.com"}
    client = run_v2.JobsClient(client_options=client_options)
    
    parent = f"projects/{PROJECT_ID}/locations/{REGION}"
    job_path = f"{parent}/jobs/{job_name}"
    
    # Merge base env with stage-specific env
    full_env = {**BASE_ENV, "PIPELINE_STAGE": stage}
    if env:
        full_env.update(env)
    
    # Build environment variables list
    env_vars = [run_v2.EnvVar(name=k, value=str(v)) for k, v in full_env.items()]
    
    # Build resource limits (GPU or CPU-only)
    if use_gpu:
        resource_limits = {
            "cpu": CPU,
            "memory": MEMORY,
            "nvidia.com/gpu": str(GPU_COUNT),
        }
    else:
        # CPU-only: use less resources
        resource_limits = {
            "cpu": "4",
            "memory": "8Gi",
        }
    
    # Build task template
    task_template_kwargs = {
        "containers": [
            run_v2.Container(
                image=IMAGE_URI,
                env=env_vars,
                resources=run_v2.ResourceRequirements(limits=resource_limits),
            )
        ],
        "timeout": duration_pb2.Duration(seconds=3600),
        "max_retries": 0,
    }
    
    # Add GPU-specific settings only if using GPU
    if use_gpu:
        task_template_kwargs["gpu_zonal_redundancy_disabled"] = True
        task_template_kwargs["node_selector"] = run_v2.NodeSelector(accelerator=GPU_TYPE)
    
    # Create job configuration
    job = run_v2.Job(
        template=run_v2.ExecutionTemplate(
            template=run_v2.TaskTemplate(**task_template_kwargs),
        ),
        labels={"stage": stage},
    )
    
    # Delete existing job if it exists
    import time
    from google.api_core.exceptions import NotFound

    try:
        client.get_job(name=job_path)
        print(f"Deleting existing job: {job_name}")
        delete_op = client.delete_job(name=job_path)
        delete_op.result()  # Block until the deletion has finished
        time.sleep(5)  # Brief pause before re-creating a job with the same name
    except NotFound:
        pass  # No previous job with this name
    except Exception as e:
        print(f"Warning: {e}")
    
    # Create job
    gpu_str = "GPU" if use_gpu else "CPU-only"
    print(f"Creating job: {job_name} ({gpu_str})")
    request = run_v2.CreateJobRequest(parent=parent, job=job, job_id=job_name)
    operation = client.create_job(request=request)
    operation.result()
    
    # Run job
    print(f"Launching {stage} stage...")
    run_operation = client.run_job(name=job_path)
    
    print(f"Started job: {job_name}")
    print(f"Console: https://console.cloud.google.com/run/jobs/details/{REGION}/{job_name}/executions?project={PROJECT_ID}")
    
    return job_name


def wait_for_job(job_name: str, poll_interval: int = 30, timeout: int = 3600):
    """Wait for a Cloud Run job to complete."""
    from google.cloud import run_v2
    import time
    
    client_options = {"api_endpoint": f"{REGION}-run.googleapis.com"}
    exec_client = run_v2.ExecutionsClient(client_options=client_options)
    
    parent = f"projects/{PROJECT_ID}/locations/{REGION}/jobs/{job_name}"
    start_time = time.time()
    
    print(f"Waiting for job {job_name}...")
    
    while time.time() - start_time < timeout:
        executions = list(exec_client.list_executions(parent=parent))
        if not executions:
            time.sleep(poll_interval)
            continue
        
        latest = executions[0]
        
        if latest.succeeded_count > 0:
            print(f"  {job_name}: Completed ✓")
            return {"status": "Completed", "execution": latest}
        elif latest.failed_count > 0:
            print(f"  {job_name}: Failed ✗")
            for cond in latest.conditions:
                if cond.type_ == "Completed" and cond.state.name == "CONDITION_FAILED":
                    print(f"  Reason: {cond.message}")
            return {"status": "Failed", "execution": latest}
        else:
            print(f"  {job_name}: Running... (running={latest.running_count})")
        
        time.sleep(poll_interval)
    
    raise TimeoutError(f"Job {job_name} did not complete within {timeout}s")


def check_job_status(job_name: str):
    """Check the status of recent executions for a job."""
    from google.cloud import run_v2
    
    client_options = {"api_endpoint": f"{REGION}-run.googleapis.com"}
    exec_client = run_v2.ExecutionsClient(client_options=client_options)
    
    parent = f"projects/{PROJECT_ID}/locations/{REGION}/jobs/{job_name}"
    
    print(f"Recent executions for {job_name}:")
    for execution in exec_client.list_executions(parent=parent):
        status = "UNKNOWN"
        if execution.succeeded_count > 0:
            status = "SUCCEEDED ✓"
        elif execution.failed_count > 0:
            status = "FAILED ✗"
        elif execution.running_count > 0:
            status = "RUNNING..."
        
        print(f"  {execution.name.split('/')[-1]}: {status}")


# Import IO and rendering utilities from llm_ocr
import sys; sys.path.insert(0, "..")
from llm_ocr.gcr_io import load_dataset_from_gcs
from llm_ocr.document import render_sample_markdown, display_markdown, display_samples



## Stage 1: Extract

Run DeepSeek OCR on each document image to produce Markdown and extract figure crops. This stage runs on GPU (L4).

![Extract Stage](../assets/extract-gcr.png)
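
To make the "extract figure crops" step more concrete, here is a hedged sketch of turning grounded model output into pixel boxes. It assumes that regions are marked as `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` with coordinates on a 0 to 999 grid, as in the reference DeepSeek-OCR scripts; verify this against your model version. `parse_regions` is a hypothetical helper and not the code used in `llm_ocr`.

```python
import ast
import re

# Assumed grounding markup: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
REGION_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_regions(text, image_width, image_height, grid=999):
    """Extract labelled boxes from model output and scale them to pixels."""
    regions = []
    for label, raw_boxes in REGION_PATTERN.findall(text):
        # raw_boxes is a literal list of [x1, y1, x2, y2] lists
        for x1, y1, x2, y2 in ast.literal_eval(raw_boxes):
            regions.append({
                "label": label,
                "box": (
                    int(x1 / grid * image_width),
                    int(y1 / grid * image_height),
                    int(x2 / grid * image_width),
                    int(y2 / grid * image_height),
                ),
            })
    return regions
```

Boxes labelled `image` are the figure candidates. With Pillow, each one could then be cropped via `page_image.crop(region["box"])`, since `Image.crop` takes a `(left, upper, right, lower)` tuple.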


In [11]:
# Stage 1: Extract
# Output dataset will be saved to GCS
stage1_env = {
    "GCS_OUTPUT_URI": GCS_OUTPUT_URI,
}

stage1_job = launch_stage("extract", stage1_env)

Deleting existing job: deepseek-ocr-extract
Creating job: deepseek-ocr-extract (GPU)
Launching extract stage...
Started job: deepseek-ocr-extract
Console: https://console.cloud.google.com/run/jobs/details/us-east4/deepseek-ocr-extract/executions?project=huggingface-ml


In [None]:
# Wait for extract to complete
stage1_result = wait_for_job(stage1_job)
print(f"Extract stage completed: {stage1_result['status']}")

Waiting for job deepseek-ocr-extract...
  deepseek-ocr-extract: Running... (running=1)
  deepseek-ocr-extract: Running... (running=1)
  deepseek-ocr-extract: Running... (running=1)


In [None]:
# Load and display samples after Extract
ds_extract = load_dataset_from_gcs(f"{GCS_OUTPUT_URI}/dataset")
display_samples(ds_extract, num_samples=2)

## Stage 2: Describe

Generate captions for each extracted figure using vision-language inference. This stage also runs on GPU.

![Describe Stage](../assets/describe-gcr.png)


In [None]:
# Stage 2: Describe
# Updates dataset in place (same location as extract)
stage2_env = {
    "GCS_OUTPUT_URI": GCS_OUTPUT_URI,
    "GCS_INPUT_URI": f"{GCS_OUTPUT_URI}/dataset",
}

stage2_job = launch_stage("describe", stage2_env)

In [None]:
# Check Stage 2 status without blocking.
# To block until completion instead, uncomment:
# stage2_result = wait_for_job(stage2_job)
# print(f"Describe stage completed: {stage2_result['status']}")

check_job_status(stage2_job)

In [None]:
# Wait for describe to complete
describe_result = wait_for_job("deepseek-ocr-describe")
print(f"Describe stage completed: {describe_result['status']}")

In [None]:
# Load and display samples after Describe
ds_describe = load_dataset_from_gcs(f"{GCS_OUTPUT_URI}/dataset")
#display_samples(ds_describe, num_samples=2)

## Stage 3: Assemble

Combine the original Markdown with figure captions to produce the final enriched document. This stage is CPU-only.

![Assemble Stage](../assets/assemble-gcr.png)
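
The enrichment step can be pictured as a text substitution over the extracted Markdown. The sketch below assumes figures appear as image references of the form `![alt](figure:<id>)` and that captions are available as a mapping from figure id to text. Both the reference format and the `add_captions` helper are illustrative assumptions, not the actual `llm_ocr` implementation.

```python
import re

# Assumed figure reference format: ![alt text](figure:<id>)
FIGURE_REF = re.compile(r"!\[(?P<alt>[^\]]*)\]\(figure:(?P<id>[^)]+)\)")


def add_captions(markdown, captions):
    """Append an italic caption after every figure reference that has one."""

    def replace(match):
        caption = captions.get(match.group("id"))
        if not caption:
            # Leave figures without a caption untouched
            return match.group(0)
        return f"{match.group(0)}\n\n*{caption}*"

    return FIGURE_REF.sub(replace, markdown)
```

Because this is plain string processing, it needs no model and no GPU, which is consistent with running the stage on CPU only.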


In [None]:
# Stage 3: Assemble
# Updates dataset in place + saves final markdown files
stage3_env = {
    "GCS_OUTPUT_URI": GCS_OUTPUT_URI,
    "GCS_INPUT_URI": f"{GCS_OUTPUT_URI}/dataset",
}

stage3_job = launch_stage("assemble", stage3_env, use_gpu=False)  # CPU-only

In [None]:
# Check Stage 3 status without blocking.
# To block until completion instead, uncomment:
# stage3_result = wait_for_job(stage3_job)
# print(f"Assemble stage completed: {stage3_result['status']}")

check_job_status(stage3_job)

In [None]:
# Load and display final samples after Assemble
ds_final = load_dataset_from_gcs(f"{GCS_OUTPUT_URI}/dataset")
display_samples(ds_final, num_samples=2)

In [None]:
# Display rendered markdown with images for sample 1
# This properly renders figure: URIs using images from extracted_figures column
display_markdown(ds_final[1])


## Load Final Dataset

In [None]:
from datasets import load_from_disk
import tempfile

# Download and load final dataset
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(BUCKET_NAME)

# Download dataset files
# The object prefix mirrors GCS_OUTPUT_URI: gs://<bucket>/<PROJECT_NAME>/pipeline/dataset
dataset_prefix = f"{PROJECT_NAME}/pipeline/dataset"
local_dir = tempfile.mkdtemp()

blobs = bucket.list_blobs(prefix=dataset_prefix)
for blob in blobs:
    rel_path = blob.name[len(dataset_prefix):].lstrip("/")
    if rel_path:
        local_path = Path(local_dir) / rel_path
        local_path.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(local_path))

dataset = load_from_disk(local_dir)
print(f"Loaded dataset: {dataset}")
dataset

In [None]:
# View a sample
sample = dataset[0]
print("Sample keys:", list(sample.keys()))
print("\nFinal markdown preview:")
print(sample.get("document_final_markdown_text", "")[:1000])
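
The download cell above needs the object prefix inside the bucket, while the rest of the notebook works with full `gs://` URIs. A small sketch of deriving one from the other avoids keeping the two in sync by hand. `split_gcs_uri` is a hypothetical helper and is not provided by `google-cloud-storage` or `llm_ocr`.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, 'path/to/object')."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # netloc holds the bucket name; the path keeps a leading slash we drop
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_gcs_uri(f"{GCS_OUTPUT_URI}/dataset")` yields the bucket name and a prefix that can be passed to `bucket.list_blobs(prefix=...)`.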

## 💰 Cost Analysis

| Metric | Value |
|--------|-------|
| 🖥️ **Hardware** | NVIDIA L4 (24 GB), 8 vCPU, 32 GiB RAM |
| ⚡ **Throughput** | ~26 docs/min |
| 🔄 **Concurrency** | 32 parallel requests |

**Pricing** ([Cloud Run GPU Pricing](https://cloud.google.com/run/pricing)):

| Resource | Rate | Per 40 min |
|----------|------|------------|
| L4 GPU (no zonal redundancy) | $0.0001867/sec | $0.45 |
| CPU (8 vCPU) | $0.000018/vCPU-sec | $0.35 |
| Memory (32 GiB) | $0.000002/GiB-sec | $0.15 |
| **Total** | | **~$1** |

| Scale | ⏱️ Time | 💲 Cost |
|-------|------|------|
| 1,024 docs | ~40 min | ~$1 |
| 10,000 docs | ~6.5 hours | ~$10 |
| 100,000 docs | ~65 hours | ~$100 |

> 💡 Costs include GPU + CPU + Memory. Using [Committed Use Discounts](https://cloud.google.com/run/pricing) can reduce CPU/Memory costs by up to 46%.

> 📄 **Note**: 1 doc = 1 PDF page in these benchmarks.
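
The figures in the tables above can be reproduced with a few lines of arithmetic. This sketch uses only the rates and the 1,024 docs in 40 minutes benchmark quoted in this section; the `estimate` helper is hypothetical, and the rates may have changed since, so check the pricing page before relying on the output.

```python
# Rates quoted in the pricing table above (USD)
GPU_RATE_PER_SEC = 0.0001867       # L4 GPU, no zonal redundancy
CPU_RATE_PER_VCPU_SEC = 0.000018
MEM_RATE_PER_GIB_SEC = 0.000002

VCPUS = 8
MEMORY_GIB = 32
DOCS_PER_MIN = 1024 / 40           # benchmark: 1,024 docs in about 40 minutes


def estimate(num_docs):
    """Return (hours, cost_usd) for processing `num_docs` pages on one L4 job."""
    seconds = num_docs / DOCS_PER_MIN * 60
    rate_per_sec = (
        GPU_RATE_PER_SEC
        + CPU_RATE_PER_VCPU_SEC * VCPUS
        + MEM_RATE_PER_GIB_SEC * MEMORY_GIB
    )
    return seconds / 3600, seconds * rate_per_sec


for docs in (1_024, 10_000, 100_000):
    hours, cost = estimate(docs)
    print(f"{docs:>7} docs: {hours:6.2f} h, ${cost:7.2f}")
```

For 1,024 docs this comes to roughly $0.95, which the table rounds to ~$1. The estimate scales linearly and ignores model download and server startup time.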


## Cleanup

In [None]:
# Delete jobs (optional)
def delete_job(job_name: str):
    from google.cloud.run_v2 import JobsClient
    # Use the same regional endpoint as the helper functions
    client = JobsClient(client_options={"api_endpoint": f"{REGION}-run.googleapis.com"})
    job_path = f"projects/{PROJECT_ID}/locations/{REGION}/jobs/{job_name}"
    try:
        client.delete_job(name=job_path).result()  # Wait for the deletion to finish
        print(f"Deleted job: {job_name}")
    except Exception as e:
        print(f"Could not delete {job_name}: {e}")

# Uncomment to delete:
# delete_job("deepseek-ocr-extract")
# delete_job("deepseek-ocr-describe")
# delete_job("deepseek-ocr-assemble")