## Page Classification

Page classification is a core task in PDF and document processing. It is often the first step in a data ingestion pipeline, because it labels each page so that later steps know how to process it, and it lets us tag pages with useful metadata.

### Environment Setup

To start, let's make sure the environment is set up correctly. Depending on which service provider you use, you will need to set a few environment variables, or you can pass the credentials as kwargs at run-time instead.

**For Anthropic**:
- `ANTHROPIC_API_KEY`: The API key for your Anthropic account

In [1]:
import os
from dotenv import load_dotenv

load_dotenv("../.env")

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", None)
assert ANTHROPIC_API_KEY is not None, "ANTHROPIC_API_KEY is not set"
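
If you work with more than one provider, the same check can be applied to several variables at once. The sketch below uses only the standard library; `require_env` is a hypothetical helper written for this example and is not part of DocPrompt.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return the requested variables, raising if any are missing or empty.

    NOTE: hypothetical helper for illustration, not a DocPrompt API.
    """
    env = os.environ if environ is None else environ
    values = {name: env.get(name) for name in names}

    # Treat unset and empty-string values the same way
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

Calling `require_env(["ANTHROPIC_API_KEY"])` after `load_dotenv` would then fail with a single message naming every variable that is absent, rather than stopping at the first one.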


### Load Document Resources

With the environment configured, we can load our PDF documents. There are several ways to store and retrieve documents, but we will use the simplest out-of-the-box method: the `load_document_node` utility.

For this exercise, we will load a PDF of a legal deposition. It is about 40 pages long and contains the deposition transcript along with some title, index, and metadata pages.

In [None]:
from docprompt import load_document_node

# A PDF with a legal deposition
node = load_document_node("../data/example-1.pdf")

### Setup Classification Task

The first step in many document processing pipelines is to classify each page being ingested. The classification may be binary, single-label, or multi-label; whichever form it takes, it is a crucial step in nearly every document processing workflow.

To begin, let's use the `ClassificationConfig` class to set up the parameters of our classification task.

In [None]:
from docprompt.tasks.classification.base import ClassificationConfig

# Setup our classification task
singel_label_config = ClassificationConfig(
    # Declare the task type as 'single_label' for single-label classification
    type='single_label',

    # Define the label categories for the classification task
    labels=['title_page', 'index_page', 'body_page', "other_page"],

    # Add your own custom instructions for the model, if you find it needs additional guidance for your domain
    instructions="Classify the page of the legal deposition carefully. Be sure to read the page carefully and select the most appropriate label. If you are unsure, select 'other_page'.",

    # Provide the model with detailed descriptions of each label category (optional, but recommended)
    descriptions=[
        "A title page of the deposition, containing the title, participants, and other metadata.",
        "An index page of the deposition, containing a table of contents or other reference aids.",
        "A page containing the transcript or dialogue of the deposition.",
        "Any other page in the deposition that doesn't fit into the other categories."
    ]
)

### Execute Classification Task

Now that we have our classification task configured, we need to use the Anthropic Factory to create an Anthropic Page Classification Provider.

In [8]:
from docprompt.tasks.factory import AnthropicTaskProviderFactory

factory = AnthropicTaskProviderFactory()
classification_provider = factory.get_page_classification_provider()


In [9]:

# Run the page classification
single_label_results = await classification_provider.aprocess_document_node(
    node, # The document node to process
    task_config=singel_label_config # Pass classification config at runtime
)

Processing messages: 42it [00:03, 11.17it/s]


In [None]:
for page_num, result in single_label_results.items():
    print(f"Page {page_num}: {result.labels}")

Our classification results look sensible. The model identified a few title pages at the beginning of the document, an index page at page 3 (likely a table of contents, list of exhibits, or similar), and then body pages all the way up to the final page of the document.
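
With 40-odd pages, a per-page printout is harder to scan than a summary of consecutive pages that share a label. The sketch below collapses a `{page_number: label}` mapping into runs. The helper `label_runs` and the sample data are made up for illustration; they are not the notebook's actual output or a DocPrompt API.

```python
from itertools import groupby


def label_runs(labels_by_page):
    """Collapse {page_number: label} into (label, first_page, last_page) runs.

    Pages are visited in ascending order; gaps in page numbering are not
    detected. Hypothetical helper for illustration only.
    """
    runs = []
    pages = sorted(labels_by_page)
    # groupby groups *adjacent* pages whose label is identical
    for label, run in groupby(pages, key=labels_by_page.get):
        run_pages = list(run)
        runs.append((label, run_pages[0], run_pages[-1]))
    return runs


# Made-up labels for a six-page document
example_labels = {
    1: "title_page",
    2: "title_page",
    3: "index_page",
    4: "body_page",
    5: "body_page",
    6: "other_page",
}

print(label_runs(example_labels))
# [('title_page', 1, 2), ('index_page', 3, 3), ('body_page', 4, 5), ('other_page', 6, 6)]
```

Assuming each result exposes its label through `.labels`, as in the printing loop above, the input mapping could be built with `{page_num: result.labels for page_num, result in single_label_results.items()}`.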

### Another Approach for the Same Task

While single-label classification suits this task well, there are cases where we only need to distinguish body pages from non-body pages. There, a binary classification task can be used instead; it should be more token-efficient and faster, and it is less likely to confuse the model. Suppose we only want to identify every body page in the deposition so that we can process those pages further. Let's see how this task would be set up:

In [5]:

# Setup our binary classification task
binary_config = ClassificationConfig(
    # Declare the task type as 'binary' for binary (YES/NO) classification
    type='binary',

    # Required for binary - Tell the model how to make the binary decision.
    # NOTE: When providing instructions for a binary task, the default labels are 'YES' and 'NO'
    instructions="Determine whether or not the page is a body page of the deposition. If the page contains the transcript or dialogue of the deposition, select 'YES'. Otherwise, select 'NO'.",

    # Confidence score can also be requested from the model
    # This will be returned as 'high', 'medium', or 'low' confidence
    confidence=True
)


In [None]:

binary_results = await classification_provider.aprocess_document_node(
    node,
    task_config=binary_config
)

In [None]:
for page_num, result in binary_results.items():
    print(f"Page {page_num}: {result.labels}")

That appears to give us the same answer, with a reduced token count and faster inference time. Let's verify that the two sets of results do indeed agree:

In [None]:
pages_equal = []
for single_res, bin_res in zip(single_label_results.values(), binary_results.values()):
    # A body page should be 'YES' in the binary task; every other page should be 'NO'
    expected = "YES" if single_res.labels == "body_page" else "NO"
    pages_equal.append(bin_res.labels == expected)

assert all(pages_equal), "Binary and single label classifications do not match"

It looks like our classification task was successful! Running the same classification through multiple task configurations, as shown above, can be an effective way to reduce errors in pipelines where high accuracy is a top priority. Thankfully, DocPrompt makes this process easy!
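
In a real pipeline, a bare assertion is less useful than knowing which pages the two configurations disagree on, so that those pages can be re-run or reviewed by hand. The sketch below works on plain `{page_number: label}` dictionaries; `find_disagreements` and the sample data are hypothetical and shown only to illustrate the idea.

```python
def find_disagreements(single_labels, binary_labels, positive_label="body_page"):
    """Return sorted page numbers where the two classifications disagree.

    A page agrees when the single-label result equals `positive_label` and the
    binary result is 'YES', or when it is any other label and the binary result
    is 'NO'. Pages present in only one mapping count as disagreements.
    Hypothetical helper for illustration only.
    """
    disagreements = []
    for page in sorted(set(single_labels) | set(binary_labels)):
        if page not in single_labels or page not in binary_labels:
            disagreements.append(page)
            continue
        expected = "YES" if single_labels[page] == positive_label else "NO"
        if binary_labels[page] != expected:
            disagreements.append(page)
    return disagreements


# Made-up results in which the two tasks disagree about page 3
example_single = {1: "title_page", 2: "body_page", 3: "body_page"}
example_binary = {1: "NO", 2: "YES", 3: "NO"}

print(find_disagreements(example_single, example_binary))
# [3]
```

An empty list means the two configurations agree on every page, which is the condition the assertion above checks.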

This notebook spends more code on checking and displaying the results than on generating them in the first place!