# Document Processing API Walkthrough

This notebook demonstrates how to use the Document Processing API to extract information from PDF files. We'll walk through the complete workflow from project creation to data extraction.

## 1. Setup and Configuration

First, we'll set up our environment by configuring:
- Test files to process
- Base URL for the API endpoints

These values point the notebook at a backend service running locally; no request is sent until the next step.

In [1]:
files = ['resume.pdf']
base_url = 'http://localhost:8000'
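
Before sending any requests, it can help to confirm that every test file exists and to allow the base URL to be overridden without editing the notebook. The following is a minimal sketch under two assumptions: the environment variable `DOC_API_BASE_URL` and the helper `missing_files` are hypothetical names chosen for illustration and are not part of the API.

```python
import os
from pathlib import Path

# Hypothetical override (assumption); falls back to the local backend used in this notebook.
base_url = os.environ.get("DOC_API_BASE_URL", "http://localhost:8000").rstrip("/")


def missing_files(paths):
    """Return the entries of `paths` that are not existing regular files."""
    return [p for p in paths if not Path(p).is_file()]
```

Calling `missing_files(files)` and stopping when the result is non-empty avoids a less obvious failure later, when the upload cell tries to open the files.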

## 2. Creating a New Project

Now we'll create a new project to organize our document processing:
- Send a POST request to create the project
- Configure project name and description
- Specify the data source type

The response will include important details like project ID and organization information.

In [2]:
import requests

url = f'{base_url}/core/project/'
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
data = {
    "name": "Resume Project",
    "description": "resume extract",
    "data_source_type": "UPLOAD",
}

response = requests.post(url, headers=headers, json=data)

print(response.status_code)
project = response.json()


201


In [3]:
project

{'id': '10875952-9dcf-4f50-afaa-e4fa7b194702',
 'created_by_email': 'bypass@example.com',
 'updated_by_email': 'bypass@example.com',
 'organization_name': 'personal',
 'organization_id': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
 'is_deleted': False,
 'deleted_at': None,
 'created_at': '2025-02-21T23:00:45.851937Z',
 'updated_at': '2025-02-21T23:00:45.856588Z',
 'name': 'Resume Project',
 'description': 'resume extract',
 'data_source_type': 'UPLOAD',
 'created_by': '00000000-0000-0000-0000-000000000001',
 'updated_by': '00000000-0000-0000-0000-000000000001',
 'organization': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
 'owner': '00000000-0000-0000-0000-000000000001',
 'collaborators': []}
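
Later cells rely on `project['id']` and `project['organization_id']`, so it may be worth failing early if the creation request did not succeed. A small sketch of such a check follows; the helper name `require_created` and its default list of required keys are illustrative assumptions, not part of the API.

```python
def require_created(status_code, body, required=("id", "organization_id")):
    """Return `body` if the status is 201 and all `required` keys are present.

    Raises RuntimeError for any other status and KeyError for missing keys.
    """
    if status_code != 201:
        raise RuntimeError(f"expected 201 Created, got {status_code}: {body!r}")
    missing = [key for key in required if key not in body]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return body
```

In this notebook it could be used as `project = require_created(response.status_code, response.json())`.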

## 3. File Upload Process

With our project created, we can now upload files:
- Upload PDF files to the project
- Handle multipart form data for file uploads
- Receive asset details including IDs and URLs

This step creates the assets that we'll process in later steps.

In [4]:
import requests

url = f"{base_url}/core/asset/assets/"
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

data = {
    "project_id": project['id'],
    "upload_source": "UPLOAD",
}
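# Open each file for the multipart upload and close the handles afterwards.
handles = [open(file, "rb") for file in files]
try:
    fs = [("files", (file, fh, "application/pdf")) for file, fh in zip(files, handles)]
    response = requests.post(url, headers=headers, data=data, files=fs)
finally:
    for fh in handles:
        fh.close()
print(response.json())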

[{'id': 'cf4a6756-5a02-4fd4-9d84-9e1d285d8f43', 'is_deleted': False, 'deleted_at': None, 'created_at': '2025-02-21T23:00:45.900070Z', 'updated_at': '2025-02-21T23:00:45.903140Z', 'name': 'resume.pdf', 'description': 'Uploaded file: resume.pdf', 'url': '/tmp/unstruct/assets/cf4a6756-5a02-4fd4-9d84-9e1d285d8f43/resume.pdf', 'upload_source': 'UPLOAD', 'file_type': 'PDF', 'mime_type': None, 'size': None, 'source_file_id': None, 'source_credentials': None, 'metadata': None, 'created_by': None, 'updated_by': None, 'organization': None, 'owner': None, 'project': '10875952-9dcf-4f50-afaa-e4fa7b194702'}]


In [5]:
assets = response.json()

In [6]:
assets

[{'id': 'cf4a6756-5a02-4fd4-9d84-9e1d285d8f43',
  'is_deleted': False,
  'deleted_at': None,
  'created_at': '2025-02-21T23:00:45.900070Z',
  'updated_at': '2025-02-21T23:00:45.903140Z',
  'name': 'resume.pdf',
  'description': 'Uploaded file: resume.pdf',
  'url': '/tmp/unstruct/assets/cf4a6756-5a02-4fd4-9d84-9e1d285d8f43/resume.pdf',
  'upload_source': 'UPLOAD',
  'file_type': 'PDF',
  'mime_type': None,
  'size': None,
  'source_file_id': None,
  'source_credentials': None,
  'metadata': None,
  'created_by': None,
  'updated_by': None,
  'organization': None,
  'owner': None,
  'project': '10875952-9dcf-4f50-afaa-e4fa7b194702'}]
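
The upload cell hard-codes `application/pdf` for every file. If other file types are ever mixed in, the multipart entries could be built with a guessed content type instead. This is a sketch only: `build_file_parts` is a hypothetical helper, and whether the backend accepts non-PDF uploads is not shown in this notebook. It reads each file into memory, which `requests` accepts in place of an open file object.

```python
import mimetypes
from pathlib import Path


def build_file_parts(paths, field_name="files"):
    """Build `(field, (filename, content, content_type))` tuples for `requests`."""
    parts = []
    for path in map(Path, paths):
        # Guess from the file extension; fall back to a generic binary type.
        mime, _encoding = mimetypes.guess_type(path.name)
        parts.append(
            (field_name, (path.name, path.read_bytes(), mime or "application/octet-stream"))
        )
    return parts
```

The result can be passed as the `files=` argument of `requests.post`, in the same way as `fs` in the cell above.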

## 4. Defining Extraction Actions

Now we'll define what information to extract:
- Create extraction actions for specific fields (name and address)
- Configure output columns and types
- Set up multiple extraction parameters

These actions define what data we want to extract from our documents.

In [7]:
import requests

url = f'{base_url}/core/action/'

headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

actions_details = [
    {
        "output_column_name": "name",
        "output_column_type": "TEXT",
        "action_type": "EXTRACT",
        "description": "name",
    },
    {
        "output_column_name": "address",
        "output_column_type": "TEXT",
        "action_type": "EXTRACT",
        "description": "address",
    },
]

actions = []
for act in actions_details:
    response = requests.post(url, headers=headers, json=act)
    response.raise_for_status()  # stop early if an action could not be created
    actions.append(response.json())
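
Both payloads above share the same shape and differ only in the field name, so the list could also be generated from the names. A minimal sketch, where `build_extract_actions` is a hypothetical helper and the use of the field name as the description simply mirrors the payloads in this notebook:

```python
def build_extract_actions(fields, column_type="TEXT"):
    """Return one EXTRACT action payload per field name."""
    return [
        {
            "output_column_name": field,
            "output_column_type": column_type,
            "action_type": "EXTRACT",
            "description": field,
        }
        for field in fields
    ]
```

`build_extract_actions(["name", "address"])` produces the same list as `actions_details`.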


In [8]:
actions

[{'id': '9fe8246c-f3ca-40a5-91ce-2415b7c89a67',
  'created_by_email': 'bypass@example.com',
  'updated_by_email': 'bypass@example.com',
  'organization_name': 'personal',
  'organization_id': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
  'is_deleted': False,
  'deleted_at': None,
  'created_at': '2025-02-21T23:00:45.942804Z',
  'updated_at': '2025-02-21T23:00:45.942819Z',
  'output_column_name': 'name',
  'output_column_type': 'TEXT',
  'action_type': 'EXTRACT',
  'description': 'name',
  'created_by': '00000000-0000-0000-0000-000000000001',
  'updated_by': '00000000-0000-0000-0000-000000000001',
  'organization': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
  'owner': None},
 {'id': 'd6ba62a1-fa70-497a-8600-624120267fc5',
  'created_by_email': 'bypass@example.com',
  'updated_by_email': 'bypass@example.com',
  'organization_name': 'personal',
  'organization_id': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
  'is_deleted': False,
  'deleted_at': None,
  'created_at': '2025-02-21T23:00:45.965987Z',


## 5. Task Creation and Configuration

With our assets and actions ready, we'll create a processing task:
- Create a new task linked to our project
- Associate assets and actions with the task
- Configure task parameters and initial status

This step brings together all the components we've set up.

In [9]:
import requests

url = f'{base_url}/core/task/'

headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

data = {
    "name": "Resume Extract",
    "description": "Resume Extract",
    "system_prompt": "Resume Extract",
    "status": "PENDING",
    "project": project['id'],
    "assets": [
        asset['id'] for asset in assets
    ],
    "actions": [
        action['id'] for action in actions
    ],
    "organization_id": project["organization_id"]
}

response = requests.post(url, headers=headers, json=data)

print(response.status_code)
print(response.text)


201
{"id":"149dcfa1-756d-4ca3-92d0-6f1e77a648a0","created_by_email":"bypass@example.com","updated_by_email":"bypass@example.com","organization_name":"personal","organization_id":"843ea37d-9974-4ce4-b0f9-d82c585b2493","is_deleted":false,"deleted_at":null,"created_at":"2025-02-21T23:00:46.010700Z","updated_at":"2025-02-21T23:00:46.010720Z","name":"Resume Extract","system_prompt":"Resume Extract","status":"PENDING","description":"Resume Extract","result_file_url":null,"process_results":"[]","total_files":0,"processed_files":0,"failed_files":0,"started_at":null,"completed_at":null,"created_by":"00000000-0000-0000-0000-000000000001","updated_by":"00000000-0000-0000-0000-000000000001","organization":"843ea37d-9974-4ce4-b0f9-d82c585b2493","owner":null,"project":"10875952-9dcf-4f50-afaa-e4fa7b194702","assets":["cf4a6756-5a02-4fd4-9d84-9e1d285d8f43"],"actions":["9fe8246c-f3ca-40a5-91ce-2415b7c89a67","d6ba62a1-fa70-497a-8600-624120267fc5"]}


In [10]:
task = response.json()
task

{'id': '149dcfa1-756d-4ca3-92d0-6f1e77a648a0',
 'created_by_email': 'bypass@example.com',
 'updated_by_email': 'bypass@example.com',
 'organization_name': 'personal',
 'organization_id': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
 'is_deleted': False,
 'deleted_at': None,
 'created_at': '2025-02-21T23:00:46.010700Z',
 'updated_at': '2025-02-21T23:00:46.010720Z',
 'name': 'Resume Extract',
 'system_prompt': 'Resume Extract',
 'status': 'PENDING',
 'description': 'Resume Extract',
 'result_file_url': None,
 'process_results': '[]',
 'total_files': 0,
 'processed_files': 0,
 'failed_files': 0,
 'started_at': None,
 'completed_at': None,
 'created_by': '00000000-0000-0000-0000-000000000001',
 'updated_by': '00000000-0000-0000-0000-000000000001',
 'organization': '843ea37d-9974-4ce4-b0f9-d82c585b2493',
 'owner': None,
 'project': '10875952-9dcf-4f50-afaa-e4fa7b194702',
 'assets': ['cf4a6756-5a02-4fd4-9d84-9e1d285d8f43'],
 'actions': ['9fe8246c-f3ca-40a5-91ce-2415b7c89a67',
  'd6ba62a1-fa70-497
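
The task response lists the asset and action IDs attached to the task. A quick way to confirm that nothing was dropped is to compare those lists with what was sent. This is an illustrative sketch; `unlinked_ids` is a hypothetical helper name.

```python
def unlinked_ids(task, assets, actions):
    """Return the asset and action IDs that were sent but are absent from `task`."""
    return {
        "assets": sorted({a["id"] for a in assets} - set(task.get("assets", []))),
        "actions": sorted({a["id"] for a in actions} - set(task.get("actions", []))),
    }
```

With the objects from this notebook, `unlinked_ids(task, assets, actions)` should contain two empty lists when every asset and action was linked.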

## 6. Processing and Results

Finally, we'll process our documents and get results:
- Trigger the extraction process
- Monitor task status
- View extracted data
- Access the generated CSV results

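This is where the results of the pipeline would appear. In the run recorded below, the backend responds with HTTP 500 because its Gemini API key is not set, so no extracted data or CSV is produced.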
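
Once processing has been triggered, monitoring the task status could be done with a simple polling loop. The sketch below rests on several assumptions that this notebook does not confirm: the terminal status names `COMPLETED` and `FAILED` are placeholders (only `PENDING` appears here), and `fetch_task` stands for whatever call returns the current task object.

```python
import time


def wait_for_task(fetch_task, task_id, done=("COMPLETED", "FAILED"),
                  interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll `fetch_task(task_id)` until its status is in `done` or time runs out.

    The status names in `done` are assumptions; adjust them to the backend's values.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_task(task_id)
        if task.get("status") in done:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {task.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

If the backend supports a GET on the task detail path, which is an assumption, `fetch_task` might be `lambda tid: requests.get(f"{base_url}/core/task/{tid}/").json()`.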

In [11]:
import requests

url = f'{base_url}/core/task/{task["id"]}/process/'

headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.post(url, headers=headers)

print(response.status_code)


500


In [12]:
response.json()

{'error': 'Gemini API key is not set in the environment variables.'}
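
The 500 response carries a readable reason in its `error` field, so the processing step can report it directly instead of only printing the status code. A minimal sketch, where `explain_failure` is a hypothetical helper and the `error` key is taken from the response shown above:

```python
def explain_failure(status_code, body):
    """Return None for a 2xx status, otherwise a message built from `body['error']`."""
    if 200 <= status_code < 300:
        return None
    detail = body.get("error") if isinstance(body, dict) else None
    return f"processing failed with HTTP {status_code}: {detail or 'no error detail returned'}"
```

For the run above, `explain_failure(response.status_code, response.json())` reports the missing Gemini API key. Setting that key in the backend's environment and re-running the processing cell is the likely next step.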