<a href="https://colab.research.google.com/github/davidhuiky/isom5240/blob/main/OCR_byDeepSeek.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Packages & Prepare Environment

In [1]:
# Import libraries
import os
import warnings

import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Check the version after install
print(f"New Transformers version: {transformers.__version__}")

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set device: prefer CUDA, then Apple MPS, then fall back to CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
PyTorch version: 2.9.0+cu126
CUDA version: 12.6
GPU: Tesla T4
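
The device line in the cell above encodes a priority order: CUDA first, then Apple MPS, then CPU. A minimal sketch of that same order as a small function is shown below. `choose_device` is a hypothetical helper introduced here for illustration; it takes plain booleans so the logic can be checked without a GPU.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Each call mirrors one branch of the nested conditional in the notebook cell.
print(choose_device(True, False))   # cuda
print(choose_device(False, True))   # mps
print(choose_device(False, False))  # cpu
```

In the notebook, the equivalent call would be `choose_device(torch.cuda.is_available(), torch.backends.mps.is_available())`.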


In [8]:
# Load the model and tokenizer
model_name = "deepseek-ai/DeepSeek-OCR"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Patch LlamaFlashAttention2 to fix ImportError in DeepSeek-OCR remote code
# The remote code imports this class unconditionally, causing failure on systems without flash-attn support.
if not hasattr(transformers.models.llama.modeling_llama, "LlamaFlashAttention2"):
    print("Injecting dummy LlamaFlashAttention2 to satisfy remote code imports...")
    class LlamaFlashAttention2(transformers.models.llama.modeling_llama.LlamaAttention):
        pass
    transformers.models.llama.modeling_llama.LlamaFlashAttention2 = LlamaFlashAttention2

# Try FlashAttention-2 first for better performance (requires a CUDA GPU and the flash-attn package).
# If that fails (for example on CPU/MPS), fall back to the default attention implementation.
try:
    model = AutoModel.from_pretrained(
        model_name,
        _attn_implementation='flash_attention_2',
        trust_remote_code=True,
        use_safetensors=True,
        torch_dtype=torch.bfloat16
    )
except Exception as e:
    print(f"Flash attention not available: {e}")
    print("Loading with default attention...")
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_safetensors=True,
        torch_dtype=torch.bfloat16
    )

model = model.eval().to(device)
print(f"Model loaded successfully on {device}")

You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.
Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully on cuda
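
The `LlamaFlashAttention2` patch in the cell above follows a general pattern: when a module lacks a name that downstream code imports unconditionally, define a placeholder subclass and attach it to the module before that import runs. The sketch below shows the same idea on a stand-in module built with `types.ModuleType`. The names `fake_llama`, `Attention`, `FlashAttention2`, and `ensure_attr` are hypothetical and used only for illustration; they are not part of Transformers.

```python
import types

# Stand-in for a library module that does not export an optional class.
fake_llama = types.ModuleType("fake_llama")


class Attention:
    """Stand-in for the base attention class the module still provides."""

    def forward(self, x):
        return x


fake_llama.Attention = Attention


def ensure_attr(module, name, base):
    """Attach an empty subclass of `base` as `module.<name>` if it is missing.

    Returns True when a placeholder was injected, False when the name
    already existed and nothing was changed.
    """
    if hasattr(module, name):
        return False
    placeholder = type(name, (base,), {})
    setattr(module, name, placeholder)
    return True


injected = ensure_attr(fake_llama, "FlashAttention2", fake_llama.Attention)
print(injected)  # True
print(issubclass(fake_llama.FlashAttention2, fake_llama.Attention))  # True
```

A placeholder like this only satisfies the import. It inherits the base class behavior unchanged, so it does not by itself provide a flash-attention code path.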


In [9]:
from datasets import load_dataset

ds = load_dataset("mychen76/invoices-and-receipts_ocr_v1")

README.md:   0%|          | 0.00/782 [00:00<?, ?B/s]

data/train-00000-of-00001-76ffc8319f74dd(…):   0%|          | 0.00/249M [00:00<?, ?B/s]

data/test-00000-of-00001-af2d92d1cee2851(…):   0%|          | 0.00/18.8M [00:00<?, ?B/s]

data/valid-00000-of-00001-894b4e1f736b57(…):   0%|          | 0.00/14.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/125 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/70 [00:00<?, ? examples/s]
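
The progress output above reports three splits (train, test, valid). A quick way to confirm their sizes after loading might look like the sketch below. It assumes the loaded object behaves like a dictionary that maps split names to sized datasets, which is how `datasets.DatasetDict` works. `summarize_splits` is a hypothetical helper, and `stand_in` is a placeholder that reuses the split sizes shown in the output so the snippet runs without downloading anything.

```python
def summarize_splits(splits):
    """Return {split_name: number_of_examples} for a dict-like of sized splits."""
    return {name: len(split) for name, split in splits.items()}


# Placeholder with the split sizes reported in the output above.
# In the notebook, pass the loaded `ds` object instead of `stand_in`.
stand_in = {"train": range(2043), "test": range(125), "valid": range(70)}

sizes = summarize_splits(stand_in)
total = sum(sizes.values())

# Print each split with its share of the full dataset.
for name, n in sizes.items():
    print(f"{name}: {n} examples ({n / total:.1%})")
```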