# Experiment comparing Azure Document Intelligence and LangChain

We worked with a PDF containing the tabulated financial information from Tesla's 10-K report ([link](https://github.com/castillosebastian/genai0/blob/main/exp/exp4_doc_recognizer/Tesla_10k_short.pdf)).

In [16]:
import polars as pl
import pandas as pd
pl.Config(tbl_rows=-1)

<polars.config.Config at 0x7fe6eea12460>

# Azure Document Intelligence

Processing with Azure DI, following this [script](https://github.com/castillosebastian/genai0/blob/main/exp/exp4_doc_recognizer/exp.py) (based on [this quickstart](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?view=doc-intel-4.0.0&pivots=programming-language-python)).
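
The linked script is not reproduced here. As a rough sketch of the step that turns a recognized table into rows for a CSV, assume each table arrives as a flat list of cells carrying `row_index`, `column_index`, and `content` (the shape exposed by the table objects in the Azure SDK's layout result). The `Cell` class and the `cells_to_grid` helper are hypothetical stand-ins for illustration, not part of the SDK or of the linked script.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for one recognized table cell."""
    row_index: int
    column_index: int
    content: str


def cells_to_grid(cells, row_count, column_count):
    """Place each cell's text at its (row, column) position; gaps stay empty."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Sample cells mimicking the first rows of the balance sheet (values taken
# from the notebook output; the cell layout itself is an assumption).
sample_cells = [
    Cell(0, 1, "December 31,2022"),
    Cell(0, 2, "December 31,2021"),
    Cell(1, 0, "Assets"),
    Cell(2, 0, "Cash and cash equivalents"),
    Cell(2, 1, "$16,253"),
    Cell(2, 2, "$17,576"),
]
grid = cells_to_grid(sample_cells, row_count=3, column_count=3)
for row in grid:
    print(row)
```

Writing such a grid out with `pd.DataFrame(grid).to_csv(...)` would produce a file with integer column names and the real header as its first data row, which is the shape of the CSV files loaded in the next cells.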

In [18]:
import pandas as pd
balance_sheet = pd.read_csv('balance_sheet.csv')
balance_sheet = balance_sheet.replace('\n', '', regex=True)


### Tesla Balance Sheet 2022 (10-K)

In [19]:
balance_sheet

Unnamed: 0,0,1,2
0,,"December 31,2022","December 31,2021"
1,Assets,,
2,Current assets,,
3,Cash and cash equivalents,"$16,253","$17,576"
4,Short-term investments,5932,131
5,"Accounts receivable, net",2952,1913
6,Inventory,12839,5757
7,Prepaid expenses and other current assets,2941,1723
8,Total current assets,40917,27100
9,"Operating lease vehicles, net",5035,4511


### Income Statement 

In [22]:
statement_operation = pd.read_csv('statement_operation.csv')
statement_operation = statement_operation.replace('\n', '', regex=True)

In [23]:
statement_operation

Unnamed: 0,0,1,2,3
0,,"Year Ended December 31,",,
1,2022,2021,2020,
2,Revenues,,,
3,Automotive sales,"$67,210","$44,125","$24,604"
4,Automotive regulatory credits,1776,1465,1580
5,Automotive leasing,2476,1642,1052
6,Total automotive revenues,71462,47232,27236
7,Energy generation and storage,3909,2789,1994
8,Services and other,6091,3802,2306
9,Total revenues,81462,53823,31536


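**Observation**: the columns need post-processing. The header row is read as data (the column names are `0`, `1`, `2`), and the first amount in each column keeps its `$` sign and thousands separator.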
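
A minimal sketch of what that post-processing could look like for the balance sheet, assuming a single header row and amounts stored as text. The `raw` frame is a small hand-made sample with the same shape as the output above, and `tidy_statement` and the `"Item"` column label are hypothetical names chosen for illustration.

```python
import pandas as pd

# Small frame mimicking the shape of the extracted balance sheet:
# integer-like column names, the real header in the first row,
# and amounts stored as text, some with "$" and thousands separators.
raw = pd.DataFrame(
    {
        "0": [None, "Cash and cash equivalents", "Short-term investments"],
        "1": ["December 31,2022", "$16,253", "5932"],
        "2": ["December 31,2021", "$17,576", "131"],
    }
)


def tidy_statement(df):
    """Promote the first row to header and convert amounts to numbers."""
    header = df.iloc[0].tolist()
    header[0] = "Item"  # assumed label for the line-item column
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    for col in header[1:]:
        # Drop "$" and "," before converting; unparseable cells become NaN.
        cleaned = body[col].astype(str).str.replace(r"[$,]", "", regex=True)
        body[col] = pd.to_numeric(cleaned, errors="coerce")
    return body


tidy = tidy_statement(raw)
print(tidy)
```

The income statement spreads its header over two rows, so it would need an extra step to merge them before applying the same idea.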

# LangChain + OpenAI

In [24]:
import time
from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# References
# Requires Tesseract to be installed!
# https://python.langchain.com/docs/integrations/providers/unstructured
# https://github.com/Unstructured-IO/unstructured

# Process PDF----------------------------------------------------------------------------
# See: https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf
start_time_partitionpdf = time.perf_counter()
raw_pdf_elements = partition_pdf(    
    filename= 'Tesla_10k_short.pdf',
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # hard cap of 4000 chars per chunk, start a new chunk after 3800 chars,
    # and merge text blocks shorter than 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path='bd/image',
)
end_time_partitionpdf = time.perf_counter()
duration_partition_pdf = end_time_partitionpdf - start_time_partitionpdf

In [25]:
# collect element by type
class Element(BaseModel):
    type: str
    text: Any

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))
# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

7
6


In [75]:
balance_sheet_l = table_elements[0]

In [76]:
import os
import openai
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage
#read .env file
from dotenv import load_dotenv
load_dotenv()

# os.environ["OPENAI_API_KEY"] #= os.getenv('OPENAI_API_KEY')
# os.environ["AZURE_OPENAI_ENDPOINT"] #= os.getenv('AZURE_OPENAI_ENDPOINT')

def embeddings():
    open_ai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",
        openai_api_version="2023-05-15",
        chunk_size=10,
    )
    return open_ai_embeddings
    
def llm():
    return AzureChatOpenAI(model_name="gtp35turbo-latest")


In [77]:
model = llm()

In [78]:
message = HumanMessage(
    content= f'You are a financial assistant, convert this raw text into a financial report in tabular format {balance_sheet_l}. Return it as json format'
)

In [79]:
balance_sheet_procesed = model([message])

In [80]:
balance_sheet_procesed.content

'{\n  "December 31, 2022": {\n    "Cash and cash equivalents": 16253,\n    "Short-term investments": 5932,\n    "Accounts receivable, net": 2952,\n    "Inventory": 12839,\n    "Prepaid expenses and other current assets": 2941,\n    "Total current assets": 40917,\n    "Accounts payable": 15255,\n    "Accrued liabilities and other": 7142,\n    "Deferred revenue": 1747,\n    "Customer deposits": 1063,\n    "Current portion of debt and finance leases": 1502,\n    "Total current liabilities": 26709,\n    "Total liabilities": 36440,\n    "Preferred stock; $0.001 par value; 100 shares authorized; no shares issued and outstanding": 0,\n    "Common stock; $0.001 par value; 6,000 shares authorized; 3,164 and 3,100 shares issued and outstanding as of December 31, 2022 and December 31, 2021, respectively (1)": 0,\n    "Additional paid-in capital": 32,\n    "Accumulated other comprehensive (loss) income": -361,\n    "Retained earnings (1)": 12885,\n    "Total stockholders\' equity": 44704,\n    "To

In [81]:
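from io import StringIO
# Wrap the JSON string in StringIO: recent pandas versions deprecate passing literal JSON to read_json
balance_sheet_f = pd.read_json(StringIO(balance_sheet_procesed.content))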


In [82]:
balance_sheet_f

Unnamed: 0,2022-12-31,2021-12-31
Cash and cash equivalents,16253,5035
Short-term investments,5932,5489
"Accounts receivable, net",2952,23548
Inventory,12839,2563
Prepaid expenses and other current assets,2941,184
Total current assets,40917,215
Accounts payable,15255,1597
Accrued liabilities and other,7142,2804
Deferred revenue,1747,5330
Customer deposits,1063,36440


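**Observation**: the table has tabulation errors. The 2022 column matches the Azure DI output for the rows shown, but the 2021 column is misaligned: for example, Cash and cash equivalents reads 5,035 where Azure DI extracted $17,576.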
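
One way to surface such errors systematically is to compare the LLM table against the Azure DI values for the same line items. The following is a minimal sketch using a few 2021 figures copied from the two outputs in this notebook; the `find_mismatches` helper is hypothetical, and treating the Azure DI values as the reference is an assumption made only for this check.

```python
# 2021 column as extracted by Azure DI (copied from the balance sheet output).
azure_di_2021 = {
    "Cash and cash equivalents": 17576,
    "Short-term investments": 131,
    "Accounts receivable, net": 1913,
    "Inventory": 5757,
}

# 2021 column as returned by the LLM (copied from the loaded JSON table).
llm_2021 = {
    "Cash and cash equivalents": 5035,
    "Short-term investments": 5489,
    "Accounts receivable, net": 23548,
    "Inventory": 2563,
}


def find_mismatches(reference, candidate):
    """Return {item: (reference_value, candidate_value)} where values differ."""
    return {
        item: (value, candidate.get(item))
        for item, value in reference.items()
        if candidate.get(item) != value
    }


mismatches = find_mismatches(azure_di_2021, llm_2021)
for item, (expected, got) in mismatches.items():
    print(f"{item}: Azure DI={expected}, LLM={got}")
```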

# Provisional Result

- Best performance, Document Intelligence: it recognizes and structures the information with a high degree of accuracy. Its tabulation errors can be fixed with post-processing.
- LangChain + LLM: in the baseline implementation, it has trouble tabulating.