# 📄 PageIndex OCR SDK: Python Quickstart

Welcome to the PageIndex OCR SDK tutorial!

This notebook guides you step by step through installing the SDK, authenticating with your API key, uploading a PDF for OCR, checking the processing status, retrieving the OCR results, and finally cleaning up by deleting your document.

> **You’ll need your [API Key](https://dash.pageindex.ai/api-keys) to run this notebook.**

## 1️⃣ Install the SDK

Run the cell below to install the PageIndex OCR SDK. This only needs to be done once in your environment.

In [None]:
!pip install pageindex

Collecting pageindex
  Downloading pageindex-0.1.3-py3-none-any.whl.metadata (600 bytes)
Downloading pageindex-0.1.3-py3-none-any.whl (2.5 kB)
Installing collected packages: pageindex
Successfully installed pageindex-0.1.3


## 2️⃣ Initialize the Client

Import the PageIndex client class and authenticate using your API key. Be sure to keep your API key secret and never share it publicly.

In [None]:
from pageindex import PageIndexClient

API_KEY = "YOUR_API_KEY"  # Paste your API key here
pi_client = PageIndexClient(api_key=API_KEY)
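
Pasting the key directly into a cell makes it easy to leak when the notebook is shared. One option is to read it from an environment variable instead. The sketch below is illustrative only: the variable name `PAGEINDEX_API_KEY` and the helper `load_api_key` are assumptions of this example, not part of the SDK.

```python
import os


def load_api_key(env_var: str = "PAGEINDEX_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The variable name is a hypothetical convention chosen for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable set, the client could then be created as `PageIndexClient(api_key=load_api_key())`.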

## 3️⃣ Submit a PDF Document for OCR

Use the client to upload a PDF file for OCR processing (only PDF files are currently supported).

After submission, you'll receive a `doc_id` that you can use to check status and get OCR results.

> Replace the file path below with your own PDF file if needed.

In [None]:
pdf_path = "./2023-annual-report.pdf"

result = pi_client.submit_document(pdf_path)
doc_id = result["doc_id"]
print(f"Document submitted. Document ID: {doc_id}")
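
Since only PDF files are accepted, it can help to check the file locally before uploading. The helper below is a minimal sketch, not an SDK feature: `looks_like_pdf` is a hypothetical name, and it relies on the convention that PDF files normally begin with the bytes `%PDF-`.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if `path` is an existing file that starts with the PDF magic bytes."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        # PDF files normally start with the header "%PDF-" followed by a version.
        return handle.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(pdf_path)` before `submit_document` catches a wrong path or a mislabeled file without a round trip to the service.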

## 4️⃣ Check Status and Get OCR Results

OCR processing may take anywhere from a few seconds (for small files) to several minutes (for larger files).

This code polls the service every 3 seconds, for up to 5 minutes. Once finished, it previews the extracted text from the first page.

In [None]:
import time

# Simple polling: check every 3 s, for up to 300 s (100 attempts x 3 s)
for attempt in range(100):
    ocr_result = pi_client.get_ocr(doc_id)
    if ocr_result["status"] == "completed":
        print("OCR results ready!")
        break
    elif ocr_result["status"] == "failed":
        print("OCR failed.")
        break
    time.sleep(3)
else:
    print("Still processing after 5 minutes. Try again later.")

# Preview the first page's markdown
if ocr_result.get("status") == "completed":
    if ocr_result["result"]:
        first_page = ocr_result["result"][0]
        print(f"Page {first_page['page_index']} (partial content):\n")
        print(first_page["markdown"][:1000])  # Print first 1000 chars
    else:
        print("No pages found in OCR result.")
else:
    # Covers both a failed job and one that is still processing
    print(f"OCR status: {ocr_result.get('status')}. No results to preview.")
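
The polling loop above can also be wrapped in a reusable function with an explicit time budget. This is a sketch under stated assumptions: `poll_until_done` is a hypothetical helper, it takes any zero-argument callable that returns a dict with a `"status"` key, and it treats every status other than `"completed"` or `"failed"` as still in progress.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch: Callable[[], Dict[str, Any]],
    timeout_s: float = 300.0,
    interval_s: float = 3.0,
) -> Dict[str, Any]:
    """Call `fetch` until its "status" is "completed" or "failed".

    Returns the terminal response; raises TimeoutError if the time budget
    would be exceeded before the next attempt.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = fetch()
        if response.get("status") in ("completed", "failed"):
            return response
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Still processing after {timeout_s:.0f} s.")
        time.sleep(interval_s)
```

In this notebook it could be called as `ocr_result = poll_until_done(lambda: pi_client.get_ocr(doc_id))`. Using `time.monotonic()` keeps the deadline correct even if the system clock changes while waiting.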


## 5️⃣ Get the Document Tree Structure

You can also get the document's PageIndex tree structure using the method below. If the tree is not ready yet, try again later.

In [None]:
tree_result = pi_client.get_tree(doc_id)
if tree_result.get("status") == "completed":
    print("Document tree structure loaded!")
    print(tree_result.get("result"))
else:
    print(f"Tree status: {tree_result.get('status')}. Try again later if still processing.")
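
Printing the raw tree can be hard to read for long documents. The sketch below flattens a nested tree into an indented outline. The schema is an assumption, since this notebook does not show it: each node is taken to be a dict with a `"title"` key and an optional `"nodes"` list of children. Both key names and the helper `outline` are hypothetical, so adjust them to what `tree_result["result"]` actually contains.

```python
from typing import Any, Dict, List


def outline(nodes: List[Dict[str, Any]], depth: int = 0) -> List[str]:
    """Flatten a nested tree into indented title lines.

    Assumes (hypothetically) that each node is a dict with a "title" key and
    an optional "nodes" list of children; adjust the keys to the real schema.
    """
    lines: List[str] = []
    for node in nodes:
        lines.append("  " * depth + str(node.get("title", "<untitled>")))
        # Recurse into children, indenting one level deeper.
        lines.extend(outline(node.get("nodes") or [], depth + 1))
    return lines
```

If the assumed keys match, `print("\n".join(outline(tree_result["result"])))` would show the document's section hierarchy at a glance.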

## 6️⃣ Delete the Document (Cleanup)

If you no longer need the document, you can delete it by running the code below.

In [None]:
pi_client.delete_document(doc_id)
print("Document deleted successfully.")
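
If an earlier cell raises an error, the cleanup step is easy to forget. One way to make deletion automatic is a context manager built on the `submit_document` and `delete_document` calls used in this notebook. This is a sketch, not an SDK feature: `temporary_document` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def temporary_document(client, pdf_path):
    """Submit `pdf_path`, yield its doc_id, and delete the document on exit."""
    doc_id = client.submit_document(pdf_path)["doc_id"]
    try:
        yield doc_id
    finally:
        # Runs even if the body raises, so the document is not left behind.
        client.delete_document(doc_id)
```

Usage would look like `with temporary_document(pi_client, pdf_path) as doc_id: ...`, with the polling and preview steps placed inside the `with` block.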

---

# 💬 Notes & Support

- Only **PDF files** are supported at this time.
- If you have any questions or need help:
    - 🤝 [Join the PageIndex Discord](https://discord.gg/VuXuf29EUj)
    - 📨 [Contact support via Typeform](https://ii2abc2jejf.typeform.com/to/meB40zV0)

---

### Full SDK Reference  
See: [PageIndex OCR SDK Reference](https://pageindex.ai/ocr/sdk) for advanced usage and all available parameters.