# Review: Azure DI vs. Langchain+Unstructured+OpenAI
## Financial Tables from the Tesla 10-K 2022 Report

We worked with a PDF that contains the tabulated financial information from Tesla's 10-K report ([link](https://github.com/castillosebastian/genai0/blob/main/exp/exp4_doc_recognizer/Tesla_10k_short.pdf)).

We import the libraries.

In [36]:
import polars as pl
import pandas as pd
import json
import os
import openai
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain_core.documents import Document
import time
from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf
from dotenv import load_dotenv
load_dotenv()

True

# Azure Document Intelligence

We process the document with the Azure Document Intelligence API, following this [script](https://github.com/castillosebastian/genai0/blob/main/exp/exp4_doc_recognizer/exp.py) (based on the Azure quickstart, [see here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?view=doc-intel-4.0.0&pivots=programming-language-python)).

### Balance Sheet 2022 (10-K)

In [3]:
balance_sheet = pd.read_csv('balance_sheet.csv')
balance_sheet = balance_sheet.replace('\n', '', regex=True)
balance_sheet

Unnamed: 0,0,1,2
0,,"December 31,2022","December 31,2021"
1,Assets,,
2,Current assets,,
3,Cash and cash equivalents,"$16,253","$17,576"
4,Short-term investments,5932,131
5,"Accounts receivable, net",2952,1913
6,Inventory,12839,5757
7,Prepaid expenses and other current assets,2941,1723
8,Total current assets,40917,27100
9,"Operating lease vehicles, net",5035,4511


**Observation**: The extracted table contains no errors. It still needs post-processing before it can be passed to the LLM.
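One possible post-processing step is to normalize the amount cells, since the output above mixes `$`-prefixed strings such as `"$16,253"` with bare numbers such as `5932`. The following is a minimal sketch using only the standard library; the helper name `parse_amount` is hypothetical, and the handling of parenthesized negatives is an assumption about how the remaining rows of the statement are written.

```python
import math
import re
from typing import Optional


def parse_amount(cell: object) -> Optional[int]:
    """Turn a cell like '$16,253' or '5932' into an int.

    Returns None for blanks, NaN, and text labels such as 'Assets'.
    Assumption: values wrapped in parentheses, e.g. '(361)', are negative.
    """
    if cell is None:
        return None
    if isinstance(cell, float):
        # pandas represents empty cells as NaN floats
        return None if math.isnan(cell) else int(cell)
    if isinstance(cell, int):
        return cell
    text = str(cell).strip()
    negative = text.startswith("(") and text.endswith(")")
    # Drop currency symbols, thousands separators, parentheses and spaces
    cleaned = re.sub(r"[$,()\s]", "", text)
    if not re.fullmatch(r"\d+", cleaned):
        return None
    value = int(cleaned)
    return -value if negative else value


print(parse_amount("$16,253"))  # 16253
print(parse_amount("5932"))     # 5932
print(parse_amount("Assets"))   # None
```

With the DataFrame above, this could be applied column by column, for instance `balance_sheet["1"].map(parse_amount)`, assuming the value columns are named `"1"` and `"2"` as the printed header suggests.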

### Income Statement 2022 (10-K)

In [4]:
statement_operation = pd.read_csv('statement_operation.csv')
statement_operation = statement_operation.replace('\n', '', regex=True)
statement_operation

Unnamed: 0,0,1,2,3
0,,"Year Ended December 31,",,
1,2022,2021,2020,
2,Revenues,,,
3,Automotive sales,"$67,210","$44,125","$24,604"
4,Automotive regulatory credits,1776,1465,1580
5,Automotive leasing,2476,1642,1052
6,Total automotive revenues,71462,47232,27236
7,Energy generation and storage,3909,2789,1994
8,Services and other,6091,3802,2306
9,Total revenues,81462,53823,31536


**Observation**: The columns need post-processing. For example, the cells in row 1 are shifted one column to the left, which matters because they hold the years.
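The year row could be realigned with a small helper like the one below. This is only a sketch: the name `shift_right` is hypothetical, and it assumes the misaligned row can be recognized by an empty last cell, as is the case for row 1 in the output above.

```python
from typing import List, Optional


def is_blank(cell: object) -> bool:
    """True for None, empty strings, and NaN (whose string form is 'nan')."""
    return cell is None or str(cell).strip() in ("", "nan")


def shift_right(row: List[Optional[str]]) -> List[Optional[str]]:
    """Move every cell one column to the right when the last cell is empty.

    Rows whose last cell holds a value are returned as an unchanged copy.
    """
    if not row or not is_blank(row[-1]):
        return list(row)
    return [None] + list(row[:-1])


year_row = ["2022", "2021", "2020", None]
print(shift_right(year_row))  # [None, '2022', '2021', '2020']
```

When used on the DataFrame, the helper should receive only the data columns (`0` to `3` in the printed header), not the index column.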

# Langchain + Unstructured + OpenAI

In [61]:
elements = partition_pdf(filename='Tesla_10k_short.pdf', infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)

December31, 2022 December31, 2021 Cashandcashequivalents Short-terminvestments Accountsreceivable,net Inventory Prepaidexpensesandothercurrentassets Totalcurrentassets $ 16,253 $ 5,932 2,952 12,839 2,941 40,917 5,035 5,489 23,548 2,563 184 215 194 4,193 82,338 $ Totalassets $ Accountspayable Accruedliabilitiesandother Deferredrevenue Customerdeposits Currentportionofdebtandfinanceleases Totalcurrentliabilities $ 15,255 $ 7,142 1,747 1,063 1,502 26,709 1,597 2,804 5,330 36,440 Totalliabilities 409 Preferredstock;$0.001parvalue;100sharesauthorized; nosharesissuedandoutstanding Commonstock;$0.001parvalue;6,000sharesauthorized; 3,164and3,100sharesissuedandoutstandingasof December31,2022andDecember31,2021,respectively(1) Additionalpaid-incapital — 3 32,177 Accumulatedothercomprehensive(loss)income Retainedearnings(1) Totalstockholders’equity Totalliabilitiesandequity $ (361 ) 12,885 44,704 785 82,338 $


In [62]:
print(tables[0].metadata.text_as_html)

<table><thead><th rowspan="2">sets</th><th>December 31, 2022</th><th colspan="2">December 31, 2021</th></thead><thead><th></th><th></th><th></th><th></th></thead><tr><td>irrent assets</td><td></td><td></td><td></td></tr><tr><td>Cash and cash equivalents</td><td>16,253</td><td>$</td><td>17,576</td></tr><tr><td>Short-term investments</td><td>5,932</td><td></td><td>131</td></tr><tr><td>Accounts receivable, net</td><td>2,952</td><td></td><td>1,913</td></tr><tr><td>Inventory</td><td>12,839</td><td></td><td>5,757</td></tr><tr><td>Prepaid expenses and other current assets</td><td>2,941</td><td></td><td>1,723</td></tr><tr><td>Total current assets</td><td>40,917</td><td></td><td>27,100</td></tr><tr><td>yerating lease vehicles, net</td><td>5,035</td><td></td><td>4,511</td></tr><tr><td>lar energy systems, net</td><td>5,489</td><td></td><td>5,765</td></tr><tr><td>operty, plant and equipment, net</td><td>23,548</td><td></td><td>18,884</td></tr><tr><td>lease</td><td></td><td></td><td></td></tr><tr><

In [63]:
balance_sheet_text = tables[0].text
balance_sheet_html = tables[0].metadata.text_as_html
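For comparison with the LLM-based conversion in the next cells, the HTML string can also be split into rows deterministically. The sketch below is not what this notebook does; it is an alternative that uses the standard library's `html.parser`. The names `TableRows` and `html_table_to_rows` are hypothetical, and treating `<thead>` as a row boundary is an assumption based on the markup printed above, where header cells are not wrapped in `<tr>`.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "thead"):
            self.flush_row()
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag in ("tr", "thead"):
            self.flush_row()

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def flush_row(self) -> None:
        # Keep the row only if at least one cell has text
        cells = [cell.strip() for cell in self._row]
        if any(cells):
            self.rows.append(cells)
        self._row = []


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    parser.flush_row()  # in case the last row was never closed
    return parser.rows


sample = (
    '<table><thead><th rowspan="2">sets</th><th>December 31, 2022</th>'
    '<th colspan="2">December 31, 2021</th></thead>'
    "<thead><th></th><th></th><th></th><th></th></thead>"
    "<tr><td>Cash and cash equivalents</td><td>16,253</td><td>$</td><td>17,576</td></tr>"
    "</table>"
)
for row in html_table_to_rows(sample):
    print(row)
```

Note that this only recovers the cell layout. It does not repair the clipped labels (`sets`, `irrent assets`) or the stray `$` cells that come from the extraction itself.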

In [64]:
def llm():
    return AzureChatOpenAI(model_name="gtp35turbo-latest")

model = llm()

message = HumanMessage(
    content= f'You are a financial assistant, convert this html table related to 10-k report in tabular format in simple text: "{balance_sheet_html}". Only return the text'
)

html_result = model([message])

In [67]:
print(html_result.content)

sets  December 31, 2022  December 31, 2021 
irrent assets           16,253              $             17,576 
Cash and cash equivalents            5,932                                131 
Short-term investments            2,952                                1,913 
Accounts receivable, net           12,839                               5,757 
Inventory           2,941                                1,723 
Total current assets           40,917                               27,100 
yerating lease vehicles, net           5,035                                4,511 
lar energy systems, net           5,489                                5,765 
operty, plant and equipment, net           23,548                               18,884 
lease           
yerating right-of-use assets           2,563                                2,016 
gital assets, net           184                                1,260 
tangible assets, net           215                                257 
odwill           194    

In [69]:
message = HumanMessage(
    content= f'You are an assistant tasked with summarizing tables. Give a concise summary of Tesla Balance Sheet: {html_result.content}'
)

summary = model([message])

In [72]:
print(summary.content)

The Tesla balance sheet as of December 31, 2022, and December 31, 2021, shows the following key figures:
- Current assets decreased from $17,576 million in 2021 to $16,253 million in 2022.
- Cash and cash equivalents significantly increased from $131 million in 2021 to $5,932 million in 2022.
- Short-term investments also increased from $1,913 million in 2021 to $2,952 million in 2022.
- Accounts receivable, net increased from $5,757 million in 2021 to $12,839 million in 2022.
- Inventory increased from $1,723 million in 2021 to $2,941 million in 2022.
- Total current assets increased from $27,100 million in 2021 to $40,917 million in 2022.
- Property, plant, and equipment, net increased from $18,884 million in 2021 to $23,548 million in 2022.
- Total assets increased from $62,131 million in 2021 to $82,338 million in 2022.
- Current liabilities increased from $19,705 million in 2021 to $26,709 million in 2022.
- Long-term debt and finance leases decreased from $5,245 million in 2021 t

# Conclusion

| Technology                        | Status                          | Observations |
|-----------------------------------|---------------------------------|---------------|
| Azure Document Intelligence       | ✅+, with non-critical errors.  | Best performance: it recognizes and structures the information with a high degree of accuracy. The tabulation errors require post-processing. |
| Langchain-Unstructured-OpenAI     | ✅-, with non-critical errors.  | It recognizes and structures the information. The implementation works only when pre-processing converts the PDF to HTML; when the PDF was converted to plain text, the table structure was lost. Requires prompt engineering. |



