## 1️⃣ Install the SDK

Run the cell below to install the PageIndex OCR SDK. This only needs to be done once in your environment.

In [115]:
!pip install --upgrade pageindex
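
To confirm the install worked before moving on, one option is to look up the installed distribution's version with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper name, not part of the SDK, and it assumes the distribution name matches the `pageindex` name used in the `pip` command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version if the package is installed, otherwise a short notice.
print(installed_version("pageindex") or "pageindex is not installed")
```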



## 2️⃣ Initialize the Client

Import the PageIndex client class and authenticate using your API key. Be sure to keep your API key secret and never share it publicly.

In [None]:
from pageindex import PageIndexClient

# Paste your API key here. You can get one from https://dash.pageindex.ai/api-keys
API_KEY = "API_KEY"
pi_client = PageIndexClient(api_key=API_KEY)
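
Since the key should stay secret, a common alternative to pasting it into the notebook is to read it from an environment variable. The sketch below assumes a variable named `PAGEINDEX_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the SDK defines.

```python
import os


def load_api_key(env_var="PAGEINDEX_API_KEY"):
    """Read the API key from the environment; fail loudly if it is absent."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (commented out so this block runs even when no key is set):
# pi_client = PageIndexClient(api_key=load_api_key())
```

Keeping the key out of the notebook means the file can be shared or committed without leaking credentials.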

## 3️⃣ Submit a PDF Document for OCR

Use the client to upload a PDF file for OCR processing (currently supports PDF files only).

After submission, you'll receive a `doc_id` that you can use to check status and get OCR results.

> Replace the file path below with your own PDF file if needed.

In [135]:
import os

import requests

pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # Stop here if the download failed
with open(pdf_path, "wb") as f:
    f.write(response.content)

print(f"Downloaded file to: {pdf_path}")
result = pi_client.submit_document(pdf_path)
doc_id = result["doc_id"]
print(f"Document submitted. Document ID: {doc_id}")

Downloaded file to: ../data/2501.12948.pdf
Document submitted. Document ID: pi-cmdx9gua3003y08pou6noqdro
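
Because only PDF files are accepted, it can help to check a file locally before uploading it, for example when a download silently returned an HTML error page. The sketch below relies on the fact that PDF files normally begin with the bytes `%PDF-`; `looks_like_pdf` and `check_pdf_file` are hypothetical helper names, and the check is a quick heuristic rather than full validation.

```python
def looks_like_pdf(data):
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


def check_pdf_file(path):
    """Return `path` unchanged, or raise ValueError if it lacks a PDF header."""
    with open(path, "rb") as f:
        header = f.read(5)
    if not looks_like_pdf(header):
        raise ValueError(f"{path} does not look like a PDF (header: {header!r})")
    return path


# Usage with the client from the cell above (commented out: needs a real file):
# result = pi_client.submit_document(check_pdf_file(pdf_path))
```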


## 4️⃣ Check Status and Get OCR Results

OCR processing may take anywhere from a few seconds (for small files) to several minutes (for larger files).

This code polls the service every 3 seconds, for up to 5 minutes. Once finished, it previews the extracted text from the first page.

In [136]:
import time

# Simple polling
for attempt in range(100):  # Try up to 300s (100 x 3s)
    ocr_result = pi_client.get_ocr(doc_id)
    if ocr_result["status"] == "completed":
        print("OCR Results ready!")
        break
    elif ocr_result["status"] == "failed":
        print("OCR failed.")
        break
    time.sleep(3)
else:
    print("Still processing after 5 minutes. Try again later.")

# Preview the first page's markdown
if ocr_result.get("status") == "completed":
    if ocr_result["result"]:
        first_page = ocr_result["result"][0]
        print(f"Page {first_page['page_index']} (partial content):\n")
        print(first_page["markdown"][:1000])  # Print first 1000 chars
    else:
        print("No pages found in OCR result.")
else:
    print("OCR not completed yet. Try again later.")

OCR Results ready!
Page 1 (partial content):

# DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI<br>research@deepseek.com


## Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeekR1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B,
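
The polling loop above can also be wrapped in a reusable function with an explicit deadline. The following is a sketch, not an SDK feature: `poll_until_done` is a hypothetical helper that accepts any zero-argument callable returning a dict with a `"status"` key, and it treats every status other than `"completed"` or `"failed"` as still pending.

```python
import time


def poll_until_done(fetch, interval=3.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `fetch()` until its status is "completed" or "failed".

    Returns the final response dict. Raises TimeoutError if another wait
    of `interval` seconds would run past the `timeout` deadline.
    """
    deadline = clock() + timeout
    while True:
        response = fetch()
        if response.get("status") in ("completed", "failed"):
            return response
        if clock() + interval > deadline:
            raise TimeoutError(f"Still processing after {timeout:.0f} seconds.")
        sleep(interval)


# Usage with the client from this notebook (commented out: needs a live doc_id):
# ocr_result = poll_until_done(lambda: pi_client.get_ocr(doc_id))
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.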

## 5️⃣ Get the Document Tree Structure

You can also get the document's PageIndex tree structure using the method below. If the tree is not ready yet, try again later.

In [None]:
from pprint import pprint

def extract_toc_from_json(data, indent_size=2):
    toc_lines = []
    
    def process_node(node, level=0):
        if isinstance(node, dict):
            if 'title' in node:
                indent = ' ' * (indent_size * level)
                toc_lines.append(f"{indent}- {node['title']}")
            if 'nodes' in node:
                for child in node['nodes']:
                    process_node(child, level + 1)
        elif isinstance(node, list):
            for item in node:
                process_node(item, level)
    
    process_node(data)
    return '\n'.join(toc_lines)

def remove_text_fields(data):
    if isinstance(data, dict):
        return {k: remove_text_fields(v) for k, v in data.items() if k != 'text'}
    elif isinstance(data, list):
        return [remove_text_fields(item) for item in data]
    return data

tree_result = pi_client.get_tree(doc_id)
if tree_result.get("status") == "completed":
    print("Document tree structure loaded!")
    print("\n## Tree Structure (drop text fields)\n")
    pprint(remove_text_fields(tree_result.get("result")), sort_dicts=False)
else:
    print(f"Tree status: {tree_result.get('status')}. Try again later if still processing.")

Document tree structure loaded!

## Tree Structure (drop text fields)

[{'title': 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via '
           'Reinforcement Learning',
  'node_id': '0000',
  'page_index': 1,
  'nodes': [{'title': 'Abstract', 'node_id': '0001', 'page_index': 1},
            {'title': 'Contents', 'node_id': '0002', 'page_index': 2},
            {'title': '1. Introduction',
             'node_id': '0003',
             'page_index': 3,
             'nodes': [{'title': '1.1. Contributions',
                        'node_id': '0004',
                        'page_index': 4},
                       {'title': '1.2. Summary of Evaluation Results',
                        'node_id': '0005',
                        'page_index': 4}]},
            {'title': '2. Approach',
             'node_id': '0006',
             'page_index': 5,
             'nodes': [{'title': '2.1. Overview',
                        'node_id': '0007',
                        'page_index': 5},
 

In [138]:
if tree_result.get("status") == "completed":
    print("Document tree structure loaded!")
    toc = extract_toc_from_json(tree_result.get("result"))
    print(toc)
else:
    print(f"Tree status: {tree_result.get('status')}. Try again later if still processing.")

Document tree structure loaded!
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - Abstract
  - Contents
  - 1. Introduction
    - 1.1. Contributions
    - 1.2. Summary of Evaluation Results
  - 2. Approach
    - 2.1. Overview
    - 2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
      - 2.2.1. Reinforcement Learning Algorithm
      - 2.2.2. Reward Modeling
      - 2.2.3. Training Template
      - 2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
    - 2.3. DeepSeek-R1: Reinforcement Learning with Cold Start
      - 2.3.1. Cold Start
      - 2.3.2. Reasoning-oriented Reinforcement Learning
      - 2.3.3. Rejection Sampling and Supervised Fine-Tuning
      - 2.3.4. Reinforcement Learning for all Scenarios
    - 2.4. Distillation: Empower Small Models with Reasoning Capability
  - 3. Experiment
    - 3.1. DeepSeek-R1 Evaluation
    - 3.2. Distilled Model Evaluation
  - 4. Discussion
    - 4.1. Distillation 
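
Beyond printing a table of contents, the nested tree can be flattened into rows for filtering or lookup. This sketch assumes each node carries the `title`, `node_id`, `page_index`, and optional `nodes` keys visible in the output above; `flatten_tree` is a hypothetical helper, and the sample data is a shortened copy of that output.

```python
def flatten_tree(nodes, depth=0):
    """Yield (depth, node_id, title, page_index) for every node, depth-first."""
    for node in nodes:
        yield depth, node.get("node_id"), node.get("title"), node.get("page_index")
        yield from flatten_tree(node.get("nodes", []), depth + 1)


# Shortened sample copied from the tree output shown earlier.
sample_tree = [
    {
        "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via "
                 "Reinforcement Learning",
        "node_id": "0000",
        "page_index": 1,
        "nodes": [
            {"title": "Abstract", "node_id": "0001", "page_index": 1},
            {
                "title": "1. Introduction",
                "node_id": "0003",
                "page_index": 3,
                "nodes": [
                    {"title": "1.1. Contributions", "node_id": "0004", "page_index": 4},
                ],
            },
        ],
    }
]

# Titles of all sections that start on page 4.
on_page_4 = [title for _, _, title, page in flatten_tree(sample_tree) if page == 4]
print(on_page_4)  # ['1.1. Contributions']
```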

## 6️⃣ Delete the Document (Cleanup)

If you no longer need the document, delete it by running the code below.

In [139]:
pi_client.delete_document(doc_id)
print("Document deleted successfully.")

Document deleted successfully.
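
To make sure cleanup happens even when a later step raises an exception, submission and deletion can be paired in a context manager. This is a sketch built only on the `submit_document` and `delete_document` calls used in this notebook; `submitted_document` is a hypothetical helper name, not part of the SDK.

```python
from contextlib import contextmanager


@contextmanager
def submitted_document(client, pdf_path):
    """Submit a PDF, yield its doc_id, and delete the document on exit."""
    doc_id = client.submit_document(pdf_path)["doc_id"]
    try:
        yield doc_id
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        client.delete_document(doc_id)


# Usage with the client from this notebook (commented out: needs a live API key):
# with submitted_document(pi_client, pdf_path) as doc_id:
#     ocr_result = pi_client.get_ocr(doc_id)
```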


---

# 💬 Notes & Support

- Only **PDF files** are supported at this time.
- If you have any questions or need help:
    - 🤝 [Join the PageIndex Discord](https://discord.gg/VuXuf29EUj)
    - 📨 [Contact support via Typeform](https://ii2abc2jejf.typeform.com/to/meB40zV0)

---

### Full SDK Reference
See the [PageIndex OCR SDK Reference](https://pageindex.ai/ocr/sdk) for advanced usage and all available parameters.