# PageIndex OCR and Tree Generation
This notebook shows how to use PageIndex OCR to convert a PDF document into markdown and to retrieve the document's tree structure.

---

## 1. Install the SDK


In [None]:
!pip install --upgrade pageindex

Collecting pageindex
  Downloading pageindex-0.1.5-py3-none-any.whl.metadata (600 bytes)
Downloading pageindex-0.1.5-py3-none-any.whl (2.9 kB)
Installing collected packages: pageindex
Successfully installed pageindex-0.1.5


## 2. Initialize the Client

You can get your API key in the [Dashboard](https://dash.pageindex.ai/api-keys).

In [None]:
from pageindex import PageIndexClient

API_KEY = "Your API Key"  # replace with your key from https://dash.pageindex.ai/api-keys
pi_client = PageIndexClient(api_key=API_KEY)
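
To keep the key out of a shared notebook, one option is to read it from an environment variable instead of hardcoding it. The sketch below is a minimal example of that idea; the variable name `PAGEINDEX_API_KEY` and the helper `load_api_key` are hypothetical choices for illustration and are not defined by the SDK.

```python
import os


def load_api_key(env_var="PAGEINDEX_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch; any name works.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Usage in this notebook (requires the variable to be set beforehand):
# pi_client = PageIndexClient(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by an API call.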

## 3. Submit a PDF Document for OCR

- Use the client to upload a PDF file for OCR processing (currently supports PDF files only).

- After submission, you'll receive a `doc_id` that you can use to check status and get OCR results.


In [None]:
import os

import requests

pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split("/")[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

# Download the PDF and stop early if the request failed
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()
with open(pdf_path, "wb") as f:
    f.write(response.content)

print(f"Downloaded file to: {pdf_path}")

# Submit the local PDF; the returned doc_id identifies it in later calls
result = pi_client.submit_document(pdf_path)
doc_id = result["doc_id"]
print(f"Document submitted. Document ID: {doc_id}")

Downloaded file to: ../data/2501.12948.pdf
Document submitted. Document ID: pi-cmdxdh35f000x0bpoiaihrsn5


## 4. Get Markdown Results

- OCR processing may take anywhere from a few seconds (for small files) to several minutes (for larger files).
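
Because processing time varies, a single status check may run before the document is ready. The sketch below polls until the status is `"completed"` or a timeout expires. It is a generic helper, not part of the SDK: `wait_until_completed`, `interval`, and `timeout` are hypothetical names, and `"completed"` is the only status value this notebook shows, so any other status is simply treated as "not ready yet".

```python
import time


def wait_until_completed(fetch, interval=5.0, timeout=600.0):
    """Call fetch() repeatedly until its result reports status "completed".

    fetch is any zero-argument callable returning a dict with a "status" key.
    Raises TimeoutError if the status is still not "completed" once
    `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result.get("status") == "completed":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("Document was not ready before the timeout")
        time.sleep(interval)


# Usage in this notebook, once doc_id is available:
# ocr_result = wait_until_completed(lambda: pi_client.get_ocr(doc_id))
```

Passing the API call in as a callable keeps the helper independent of the client, so the same loop could also wrap `pi_client.get_tree(doc_id)`, assuming that response reports its status the same way.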

In [None]:
from IPython.display import Markdown, display

ocr_result = pi_client.get_ocr(doc_id)

if ocr_result.get("status") == "completed":
    # Show one entry of the result list (index 3, which here holds page 4)
    markdown_text = ocr_result["result"][3]["markdown"]
    display(Markdown(markdown_text))
else:
    print("Processing...")

### 1.1. Contributions

#### Post-Training: Large-Scale Reinforcement Learning on the Base Model

- We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.


#### Distillation: Smaller Models Can Be Powerful Too

- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5\% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6\% on AIME 2024, 94.3\% on MATH-500, and $57.2 \%$ on LiveCodeBench. These results significantly outperform previous opensource models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.


### 1.2. Summary of Evaluation Results

- Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8\% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of $97.3 \%$, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming $96.3 \%$ human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.
- Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeekR1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of $90.8 \%$ on MMLU, $84.0 \%$ on MMLU-Pro, and $71.5 \%$ on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 40 on this benchmark.
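
The step 4 cell displays only one entry of the OCR result. To produce a single markdown file for the whole document, the entries can be joined and written to disk. This is a sketch under the assumption that `ocr_result["result"]` is a list of dicts, each with a `markdown` key, as the indexing in step 4 suggests; the helpers `pages_to_markdown` and `save_markdown` and the output path are hypothetical.

```python
from pathlib import Path


def pages_to_markdown(pages, separator="\n\n"):
    """Join the "markdown" field of each page entry into one string."""
    return separator.join(page.get("markdown", "") for page in pages)


def save_markdown(pages, out_path):
    """Write the combined markdown to out_path, creating parent folders."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(pages_to_markdown(pages), encoding="utf-8")
    return out_path


# Usage in this notebook, after OCR has completed:
# save_markdown(ocr_result["result"], "../data/2501.12948.md")
```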

## 5. Get the PageIndex Tree Structure

- Use the same `doc_id` to retrieve the document's hierarchical tree structure.



In [None]:
from pprint import pprint

def remove_text_fields(data):
    """Recursively drop every 'text' key so only the tree outline remains."""
    if isinstance(data, dict):
        return {k: remove_text_fields(v) for k, v in data.items() if k != 'text'}
    elif isinstance(data, list):
        return [remove_text_fields(item) for item in data]
    return data

tree_result = pi_client.get_tree(doc_id)
print("\n Raw Tree Structure without text fields\n")
pprint(remove_text_fields(tree_result.get("result")), sort_dicts=False)



 Raw Tree Structure without text fields

[{'title': 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via '
           'Reinforcement Learning',
  'node_id': '0000',
  'page_index': 1,
  'nodes': [{'title': 'Abstract', 'node_id': '0001', 'page_index': 1},
            {'title': 'Contents', 'node_id': '0002', 'page_index': 2},
            {'title': '1. Introduction',
             'node_id': '0003',
             'page_index': 3,
             'nodes': [{'title': '1.1. Contributions',
                        'node_id': '0004',
                        'page_index': 4,
                        'nodes': [{'title': 'Post-Training: Large-Scale '
                                            'Reinforcement Learning on the '
                                            'Base Model',
                                   'node_id': '0005',
                                   'page_index': 4},
                                  {'title': 'Distillation: Smaller Models Can '
                           

## 6. Print a Simplified Tree Structure

In [None]:
def print_toc_from_json(data, indent_size=2):
    """Print each node title as an indented bullet, nesting by tree depth."""
    def print_node(node, level=0):
        if isinstance(node, dict):
            if 'title' in node:
                indent = ' ' * (indent_size * level)
                print(f"{indent}- {node['title']}")
            if 'nodes' in node:
                for child in node['nodes']:
                    print_node(child, level + 1)
        elif isinstance(node, list):
            for item in node:
                print_node(item, level)
    print_node(data)


print_toc_from_json(tree_result.get("result"))

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - Abstract
  - Contents
  - 1. Introduction
    - 1.1. Contributions
      - Post-Training: Large-Scale Reinforcement Learning on the Base Model
      - Distillation: Smaller Models Can Be Powerful Too
    - 1.2. Summary of Evaluation Results
  - 2. Approach
    - 2.1. Overview
    - 2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
      - 2.2.1. Reinforcement Learning Algorithm
      - 2.2.2. Reward Modeling
      - 2.2.3. Training Template
    - 2.3. DeepSeek-R1: Reinforcement Learning with Cold Start
      - 2.3.1. Cold Start
      - 2.3.2. Reasoning-oriented Reinforcement Learning
      - 2.3.3. Rejection Sampling and Supervised Fine-Tuning
      - 2.3.4. Reinforcement Learning for all Scenarios
    - 2.4. Distillation: Empower Small Models with Reasoning Capability
  - 3. Experiment
    - 3.1. DeepSeek-R1 Evaluation
    - 3.2. Distilled Model Evaluation
  - 4. Discussion
    - 4.1
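
The tree shown in sections 5 and 6 is a nested list of dicts with `title`, `node_id`, `page_index`, and an optional `nodes` list of children. For lookups such as "which sections start on page 4", it can help to flatten that structure first. The sketch below is one way to do so; `flatten_tree` and `titles_on_page` are hypothetical helpers, and `sample` is a trimmed copy of the output in section 5 rather than a full API response.

```python
def flatten_tree(nodes, depth=0):
    """Yield (depth, node_id, title, page_index) for every node, in document order."""
    for node in nodes:
        yield depth, node.get("node_id"), node.get("title"), node.get("page_index")
        # Child sections, when present, live under the "nodes" key
        yield from flatten_tree(node.get("nodes", []), depth + 1)


def titles_on_page(nodes, page):
    """Return the titles of all nodes whose page_index equals `page`."""
    return [title for _, _, title, p in flatten_tree(nodes) if p == page]


# A trimmed sample copied from the tree output shown in section 5
sample = [{
    "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning",
    "node_id": "0000",
    "page_index": 1,
    "nodes": [
        {"title": "Abstract", "node_id": "0001", "page_index": 1},
        {"title": "Contents", "node_id": "0002", "page_index": 2},
        {"title": "1. Introduction", "node_id": "0003", "page_index": 3,
         "nodes": [{"title": "1.1. Contributions", "node_id": "0004", "page_index": 4}]},
    ],
}]

print(titles_on_page(sample, 4))  # ['1.1. Contributions']
```

With a real response, pass `tree_result.get("result")` in place of `sample`.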

---
# Get Started with the PageIndex OCR API
- 🔑 [Get API Key](https://dash.pageindex.ai/api-keys)
- 📖 [SDK Reference](https://pageindex.ai/ocr/sdk)
- 🤝 [Join the PageIndex Discord](https://discord.gg/VuXuf29EUj)
- 📨 [Contact Support](https://ii2abc2jejf.typeform.com/to/meB40zV0)
