# Flexible key/value pair extraction with Box AI

In this notebook, we will learn how to use the Box AI API `/ai/extract` endpoint to extract key/value pairs from a single file in Box. The file used in this exercise is a PDF containing a W-2 form.

## Prerequisites

Complete the Setup notebook before starting this one. It creates the Box objects, folders, and files you need, along with an environment file that supplies the settings used below, and it prepares the libraries this exercise imports.

## Workshop

The first step is to import all of the environment variables we need for this exercise.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)

BOX_CLIENT_ID=os.getenv('BOX_CLIENT_ID')
BOX_CLIENT_SECRET=os.getenv('BOX_CLIENT_SECRET')
BOX_USER_ID=os.getenv('BOX_USER_ID')
BOX_FOLDER_ID=os.getenv('EXERCISE4_FOLDER')
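
`os.getenv` returns `None` when a variable is unset, and that only surfaces later as an authentication or lookup error. As a minimal sketch, assuming you want to fail early, a small helper can check the settings right after `load_dotenv`. The name `require_env` is hypothetical and not part of the workshop code.

```python
import os


def require_env(names, env=os.environ):
    """Return {name: value} for each name, or raise if any is unset or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Usage in the notebook, after load_dotenv(override=True):
# settings = require_env(
#     ["BOX_CLIENT_ID", "BOX_CLIENT_SECRET", "BOX_USER_ID", "EXERCISE4_FOLDER"]
# )
```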

Next, we will create a `BoxClient` object from the Python SDK to authenticate to the API. We'll print the current user's information to confirm that authentication succeeded.

In [2]:
from box_sdk_gen import BoxClient, CCGConfig, BoxCCGAuth

ccg_config = CCGConfig(
    client_id=BOX_CLIENT_ID,
    client_secret=BOX_CLIENT_SECRET,
    user_id=BOX_USER_ID,
)

ccg_auth = BoxCCGAuth(ccg_config)

client = BoxClient(ccg_auth)

print(f"{client.users.get_user_me()}")

<class 'box_sdk_gen.schemas.user_full.UserFull'> {'id': '19498290761', 'type': 'user', 'name': 'Scott Hurrey', 'login': 'shurrey+eplusadmin@boxdemo.com', 'created_at': '2022-05-26T10:57:52-07:00', 'modified_at': '2025-08-25T13:51:12-07:00', 'language': 'en', 'timezone': 'America/Los_Angeles', 'space_amount': 999999999999999, 'space_used': 3609201911, 'max_upload_size': 536870912000, 'status': 'active', 'job_title': '', 'phone': '', 'address': '', 'avatar_url': 'https://hurrey.app.box.com/api/avatar/large/19498290761'}
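
Printing the whole user object works as an authentication check, but it also dumps the full profile into the notebook output. A lighter alternative, sketched below, prints only a few attributes that appear in the output above (`id`, `name`, `login`). The helper name `describe_user` and the stand-in user are hypothetical. In the notebook you would pass the result of `client.users.get_user_me()` instead.

```python
from types import SimpleNamespace


def describe_user(user):
    """Build a one-line summary from an object exposing id, name, and login."""
    return f"Authenticated as {user.name} ({user.login}), user ID {user.id}"


# Stand-in object for illustration only; it is not a real Box user.
demo_user = SimpleNamespace(id="12345", name="Example User", login="user@example.com")
print(describe_user(demo_user))
```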


Now let's find the file in the exercise folder we'll need for this demo.

In [None]:
files = client.folders.get_folder_items(BOX_FOLDER_ID)

file_id = ""
for file in files.entries:
    if file.name == "WAYNE_GATSBY_W2_2023.pdf":
        file_id = file.id
        break

print(f"File ID: {file_id}")

File ID: 1965326694867
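
The loop above leaves `file_id` as an empty string when no file matches, and the next step would then fail with a less obvious error. One possible way to make that case explicit is sketched here. `find_file_id` is a hypothetical helper that accepts any iterable of objects with `name` and `id` attributes, such as `files.entries`. It only sees the entries it is given, so a folder large enough to be returned in several pages would need each page checked.

```python
from types import SimpleNamespace


def find_file_id(entries, file_name):
    """Return the id of the first entry named file_name, or raise LookupError."""
    for entry in entries:
        if entry.name == file_name:
            return entry.id
    raise LookupError(f"No item named {file_name!r} in the folder listing")


# Stand-in entries for illustration; in the notebook, pass files.entries.
demo_entries = [
    SimpleNamespace(id="111", name="README.txt"),
    SimpleNamespace(id="222", name="WAYNE_GATSBY_W2_2023.pdf"),
]
print(find_file_id(demo_entries, "WAYNE_GATSBY_W2_2023.pdf"))
```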


Now that we have our file ID, we'll use the `create_ai_extract` method to extract the data we need from the W-2. We'll pass a comma-separated list of field names as the prompt, but the prompt can be any string that tells Box AI which fields to extract.

In [4]:
from box_sdk_gen import (
    AiItemBase
)

prompt = """
firstName, lastName, wages, federalTaxWithheld, 
socialSecurityWages, socialSecurityTaxWithheld, 
medicareWagesAndTips, medicareTaxWithheld, 
stateWages, stateTaxWithheld, 
localWagesAndTips, localTaxWithheld
"""

ai_response = client.ai.create_ai_extract(
    prompt,
    # Use the ID found in the previous cell rather than the leftover loop variable.
    [AiItemBase(id=file_id)],
)

print(ai_response.answer)

{"firstName": "Wayne", "lastName": "Gatsby", "wages": "355000", "federalTaxWithheld": "125600", "socialSecurityWages": "355000", "socialSecurityTaxWithheld": "12300", "medicareWagesAndTips": "355000", "medicareTaxWithheld": "13200", "stateWages": "355,000", "stateTaxWithheld": "15000", "localWagesAndTips": "355000", "localTaxWithheld": "8000"}
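
In the output above, every value comes back as a string, and the formatting is not uniform (`"355,000"` next to `"355000"`). Before loading the result into another system, you will likely want to parse and normalize it. The following is a minimal sketch, assuming the answer is a JSON object string like the one printed above. The helper name `parse_extracted_fields` is hypothetical, and the response may not always have this shape.

```python
import json
import re
from decimal import Decimal

# Matches plain numbers such as "355000" or "12.50" once separators are removed.
NUMBER_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def parse_extracted_fields(answer):
    """Parse a JSON answer string and convert numeric-looking strings to Decimal."""
    parsed = {}
    for key, value in json.loads(answer).items():
        if isinstance(value, str):
            cleaned = value.replace(",", "").replace("$", "").strip()
            if NUMBER_PATTERN.fullmatch(cleaned):
                parsed[key] = Decimal(cleaned)
                continue
        # Leave names and any other non-numeric values untouched.
        parsed[key] = value
    return parsed


sample_answer = '{"firstName": "Wayne", "wages": "355000", "stateWages": "355,000"}'
fields = parse_extracted_fields(sample_answer)
print(fields["wages"] == fields["stateWages"])
```

`Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding.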


Flexible extraction gives you access to Box's extraction capabilities without requiring strictly structured input. The prompt can be a blurb of text, a stringified JSON or XML object, or an object definition from one of your third-party systems. Imagine extracting exactly the fields your production database needs, or a Salesforce object ready to be serialized and inserted into an opportunity.
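
To illustrate the stringified JSON option, the sketch below serializes a mapping of field names to short descriptions and uses the result as the prompt. The helper name `build_extract_prompt` and the descriptions are hypothetical; Box AI does not require this particular shape, since the prompt can be any string.

```python
import json


def build_extract_prompt(fields):
    """Serialize a field-name -> description mapping into a JSON prompt string."""
    return json.dumps(fields, indent=2)


# Hypothetical descriptions, written only for this example.
w2_fields = {
    "firstName": "Employee first name",
    "lastName": "Employee last name",
    "wages": "Wages, tips, other compensation",
    "federalTaxWithheld": "Federal income tax withheld",
}

prompt = build_extract_prompt(w2_fields)
print(prompt)

# In the notebook, the prompt is passed exactly as before:
# ai_response = client.ai.create_ai_extract(prompt, [AiItemBase(id=file_id)])
```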

To get you started, we've provided a full, runnable file for this exercise. Running the following cell will generate the file for you in the exercise folder. Use it as-is, or use it as inspiration or a starting point for your own workflows and applications.

In [5]:
%%writefile box_ai_flexible_extract.py
import os
import asyncio
from dotenv import load_dotenv

from box_sdk_gen import (
    BoxClient,
    CCGConfig,
    BoxCCGAuth,
    AiItemBase
)

load_dotenv(override=True)

BOX_CLIENT_ID=os.getenv('BOX_CLIENT_ID')
BOX_CLIENT_SECRET=os.getenv('BOX_CLIENT_SECRET')
BOX_USER_ID=os.getenv('BOX_USER_ID')
BOX_FOLDER_ID=os.getenv('EXERCISE4_FOLDER')

def get_box_client():
    ccg_config = CCGConfig(
        client_id=BOX_CLIENT_ID,
        client_secret=BOX_CLIENT_SECRET,
        user_id=BOX_USER_ID,
    )

    ccg_auth = BoxCCGAuth(ccg_config)

    client = BoxClient(ccg_auth)

    return client

def get_file_id(client):
    files = client.folders.get_folder_items(BOX_FOLDER_ID)

    file_id = ""
    for file in files.entries:
        if file.name == "WAYNE_GATSBY_W2_2023.pdf":
            file_id = file.id
            break

    return file_id

async def extract_with_ai():
    # Authenticate and confirm which user we are acting as.
    client = get_box_client()
    print(f"{client.users.get_user_me()}")

    # Locate the W-2 PDF in the exercise folder.
    file_id = get_file_id(client)
    print(f"File ID: {file_id}")

    # Free-form prompt: a comma-separated list of the fields to extract.
    prompt = """
    firstName, lastName, wages, federalTaxWithheld,
    socialSecurityWages, socialSecurityTaxWithheld,
    medicareWagesAndTips, medicareTaxWithheld,
    stateWages, stateTaxWithheld,
    localWagesAndTips, localTaxWithheld
    """

    ai_response = client.ai.create_ai_extract(
        prompt,
        [AiItemBase(id=file_id)],
    )

    print(ai_response.answer)

if __name__ == "__main__":
    asyncio.run(extract_with_ai())

Writing box_ai_flexible_extract.py
