# Workshop Notebook 1 - Document data model

This notebook will guide you through the Document data model, using `aryn_sdk` to partition a document
with the Aryn DocParse service.

## Are you set up correctly?

First, let's make sure you've downloaded the PDFs to the expected location and installed the poppler library.

In [None]:
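# Import only the two helpers this notebook uses, rather than a wildcard import.
from error_messages import aryn_no_api_key_error, poppler_failed_error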
from pathlib import Path

repo_root = Path.cwd()
pdf_dir = repo_root / "files" / "earnings_calls"
one_pdf = pdf_dir / "broadcom-avgo-q1-2024-earnings-call-transcript.pdf"
assert one_pdf.exists()
one_pdf

## Display a document

We can use the pdf2image library to convert the PDF into a list of images, one per page, and then display the first page in the notebook.

In [None]:
from pdf2image import convert_from_path
from IPython.display import display

try:
    ims = convert_from_path(one_pdf)
    display(ims[0])
except Exception as e:
    poppler_failed_error()
    print(e)

This earnings call document is a transcript of a conversation between several people. It focuses on Broadcom's earnings in Q1 2024.
In this quarter, Broadcom's VMware acquisition is a hot topic, and analysts are asking the CEO (Hock Tan) and the CFO (Kirsten Spears) about Broadcom's strategy
behind the VMware acquisition. In the next section of the workshop, we will discuss and implement a data processing job that pulls from this document the information required
to answer the question:

0. In the Broadcom earnings call, what details did the CFO, Kirsten Spears, discuss about the VMware acquisition?

## Partition a Document

For now, though, let's just explore the Document data model by partitioning the document. We'll use `aryn_sdk` to send the document to Aryn DocParse, which will break it
into Elements. Then I have a few exercises to help you get familiar with elements and what you can do with them.

In [None]:
# Get started with aryn_sdk. 
# This will also make sure your credentials are set correctly.
from aryn_sdk.partition import partition_file

try:
    data = partition_file(one_pdf)
    elements = data['elements']
except Exception as e:
    aryn_no_api_key_error()
    print(e)

We can visualize the elements by drawing their bounding boxes onto the PDF pages. `aryn_sdk` has a function for that.

In [None]:
from aryn_sdk.partition import draw_with_boxes

graffitied_pages = draw_with_boxes(one_pdf, data)
graffitied_pages[1]

Here, we've displayed one of the pages of the Broadcom earnings call (index 1 in the list of pages). If you scroll through, you'll notice
several bounding boxes that denote the elements that DocParse detected. Each element contains a bunch of
information. Core information includes `type`, `bbox`, and `text_representation`. Additional information,
such as the page number the element is on, is stored in a `properties` dict. Let's look at the JSON
representation of the first element that DocParse detected.

In [None]:
import json
print(json.dumps(elements[0], indent=2))

You'll notice that DocParse detected an image at the top of the page and returned some information about that element, such as its bounding box.
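
To make the shape of an element more concrete, here is a minimal sketch that works on a small hand-written list rather than on real DocParse output. The sample values, the `page_number` key inside `properties`, and the helper name `summarize` are assumptions made for illustration, so check them against the JSON printed above.

```python
from collections import Counter

# Hypothetical elements, shaped like the fields described above.
# The values are made up; only the field names follow the text.
sample_elements = [
    {"type": "Image", "bbox": [0.1, 0.05, 0.9, 0.2],
     "properties": {"page_number": 1}},
    {"type": "Title", "bbox": [0.1, 0.22, 0.9, 0.27],
     "text_representation": "Earnings Call Transcript",
     "properties": {"page_number": 1}},
    {"type": "Text", "bbox": [0.1, 0.3, 0.9, 0.5],
     "text_representation": "Thank you, operator.",
     "properties": {"page_number": 2}},
]


def summarize(elts: list[dict]) -> dict:
    """Tally element types and collect the text of elements that have any."""
    type_counts = Counter(e["type"] for e in elts)
    # Not every element carries text (images, for example), so use .get()
    # instead of indexing directly.
    texts = [e.get("text_representation") for e in elts]
    return {
        "type_counts": dict(type_counts),
        "texts": [t for t in texts if t],
    }


summary = summarize(sample_elements)
print(summary["type_counts"])
print(summary["texts"])
```

Using `.get()` avoids a `KeyError` on elements without text, which matters for the exercises that follow.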

Let's have a quick quiz to introduce elements. I've created a bunch of functions that operate on the list of elements returned by the partitioner. Your job is to implement them.

In [None]:
def number_of_footnotes(elts: list[dict]) -> int:
    """Return the number of elements of type 'Footnote'"""

    raise NotImplementedError("Finish this yourself")
    
def number_of_elements_after_page_4(elts: list[dict]) -> int:
    """Return the number of elements that fall after page 4. Note that page numbers are 1-indexed."""

    raise NotImplementedError("Finish this yourself")

def number_of_vmware_mentions(elts: list[dict]) -> int:
    """Return the number of elements that mention 'vmware' (this is case insensitive, so count 'VMware' and 'vmware')
    Note: some elements do not have a 'text_representation' key."""

    raise NotImplementedError("Finish this yourself")
    

def number_of_elements_that_cover_a_third_of_the_page(elts: list[dict]) -> int:
    """For this you'll need the bbox property. bboxes are represented as 4 floats, [x1, y1, x2, y2]. Each 
    coordinate ranges from 0 to 1, representing the fraction of the page (left-to-right for x, top-to-bottom for y) 
    where the point lies. So [0, 0, 1, 1] is the whole page, and [0, 0.5, 0.5, 1] is the lower-left quadrant.
    
    Return the number of elements that cover more than a third of the page. An element covers more than a third
    of the page if its bbox area is greater than 1/3."""
    
    raise NotImplementedError("Finish this yourself")


assert number_of_footnotes(elements) == 2, f"Got {number_of_footnotes(elements)}. Make sure your capitalization is correct."

assert number_of_elements_after_page_4(elements) == 232, f"Got {number_of_elements_after_page_4(elements)}. If you got 241, 'after page 4' does not include page 4, and page numbers are 1-indexed. (use > 4, not >= 4)"

assert number_of_vmware_mentions(elements) == 24, f"Got {number_of_vmware_mentions(elements)}. A 'vmware mention' is defined as an element whose text contains the string 'vmware', ignoring case."

assert number_of_elements_that_cover_a_third_of_the_page(elements) == 1, f"Got {number_of_elements_that_cover_a_third_of_the_page(elements)}"

print("All correct! Nice")
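
Once your own implementations pass, you can double-check the bbox convention from the last docstring against the two boxes it mentions. This is only a sketch: the helper name `bbox_area` is hypothetical and not part of `aryn_sdk`, and it assumes `x1 <= x2` and `y1 <= y2`.

```python
def bbox_area(bbox: list[float]) -> float:
    """Area of an [x1, y1, x2, y2] box, as a fraction of the page."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1) * (y2 - y1)


whole_page = [0.0, 0.0, 1.0, 1.0]
lower_left_quadrant = [0.0, 0.5, 0.5, 1.0]

print(bbox_area(whole_page))           # 1.0
print(bbox_area(lower_left_quadrant))  # 0.25
```

Because the coordinates are already fractions of the page, the product of width and height is directly the fraction of the page covered, with no need to know the page size in pixels.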

When you get here, stand up so we can tell when everyone's done. Also feel free to help your neighbors!

In the next section of the workshop, we will process documents and elements to derive metadata that will
allow us to answer questions. We will use sycamore to structure that processing job and efficiently
execute it.