<a href="https://colab.research.google.com/github/duper203/upstage_official_cookbook/blob/main/cookbooks/upstage/financial_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import userdata
upstage_api_key = userdata.get('upstage_api_key')

In [None]:
!pip install langchain-chroma

In [None]:
!pip install langchain_upstage

In [None]:
!pip install langchain

# INTRO
### Financial Analysis over 10-K documents
In finance, extracting critical insights from lengthy documents such as 10-K forms is essential but time-consuming for analysts. The 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance. These documents can run to hundreds of pages and are dense with domain-specific terminology. This notebook shows how Upstage, combined with LangChain and Chroma, can help financial analysts extract and synthesize insights from such documents with minimal coding effort.

## 1. Extract data from Document & Split
The first step in our process involves loading the 10-K document and splitting it into manageable chunks of text. This allows for efficient processing and analysis. Here’s how you can do it:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
from langchain_upstage import UpstageLayoutAnalysisLoader
from langchain_chroma import Chroma
from langchain_upstage import UpstageEmbeddings

In [None]:
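# Check whether a chunk is already in the vector store.
# Note: `get(ids=...)` looks records up by ID, so this only finds a match
# when the chunk text itself was used as the record ID.
def is_in_vectorstore(vectorstore, text):
    search_results = vectorstore.get(ids=[text])
    return bool(search_results and search_results["ids"])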


In [None]:
file_path = "c3_k-10.pdf"
# For image files, set use_ocr to True to perform OCR inference on the document before layout detection.
loader = UpstageLayoutAnalysisLoader(file_path, split="page", api_key=upstage_api_key, use_ocr=True)

# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = loader.load()  # or loader.lazy_load()

In [None]:
print(docs[0].page_content[:1000])


<p id='1' data-category='paragraph' style='font-size:22px'>UNITED STATES<br>SECURITIES AND EXCHANGE COMMISSION<br>Washington, D.C. 20549<br>FORM 10-K</p> <br><p id='2' data-category='paragraph' style='font-size:16px'>(Mark One)</p> <br><p id='3' data-category='paragraph' style='font-size:14px'>ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934<br>For the fiscal year ended April 30, 2022</p> <p id='4' data-category='paragraph' style='font-size:14px'>OR</p> <p id='5' data-category='paragraph' style='font-size:14px'>TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</p> <br><p id='6' data-category='paragraph' style='font-size:14px'>For the transition period from to<br>Commission File Number: 001-39744</p> <p id='7' data-category='paragraph' style='font-size:20px'>C3.ai, Inc.<br>(Exact name of registrant as specified in its charter)</p> <p id='8' data-category='paragraph' style='font-size:14px'>Delaware<br>(State or ot

In [None]:
# Split the parsed pages into overlapping, HTML-aware chunks
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 985
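
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators, which this sketch does not do, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.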


## 2. Store in a vector database
Once the document is split, the next step is to store these chunks in a vector database. We’ll use Chroma to create a vector store and Upstage for generating embeddings.



In [None]:
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large", upstage_api_key=upstage_api_key),
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]
print(len(unique_splits))

985


In [None]:
import hashlib

def generate_unique_id(content, index):
    # Hash the chunk's position and text into a hex ID
    return hashlib.md5(f"{index}-{content}".encode()).hexdigest()

if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[generate_unique_id(split.page_content, i) for i, split in enumerate(unique_splits)],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large", upstage_api_key=upstage_api_key),
    )
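
The cells above check for existing chunks by text but store them under a hash of index and text, so a second run would not recognize chunks that are already stored. One possible alternative, sketched here under the assumption that identical text should always map to the same ID, is to derive the ID from the content alone. The helper names `content_id` and `select_new_chunks` are hypothetical and the sketch uses only the standard library.

```python
import hashlib


def content_id(text: str) -> str:
    """Deterministic ID: the same chunk text always yields the same hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks, existing_ids):
    """Return (id, chunk) pairs for chunks whose ID is not already known."""
    seen = set(existing_ids)
    new_chunks = []
    for chunk in chunks:
        cid = content_id(chunk)
        if cid not in seen:
            seen.add(cid)  # also drops duplicates within the same batch
            new_chunks.append((cid, chunk))
    return new_chunks


already_stored = {content_id("Interest Rate Risk")}
batch = ["Market risk", "Interest Rate Risk", "Market risk"]
print([chunk for _, chunk in select_new_chunks(batch, already_stored)])
# ['Market risk']
```

With IDs built this way, the same ID can be used both for the existence check and for the `ids` argument when adding documents.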

## 3. Retrieve relevant docs
Now that our vector store is set up, we can easily retrieve relevant sections of the document based on specific queries. For example, we can ask about the company's financial risks.



In [None]:
# Retrieve the chunks of the company's financial document that are relevant to the query
search_result = retriever.invoke("Tell me about financial risks")

# Store the relevant documents based on the query
doc_base = []

# Preview the top hit: its first 100 characters, then the full Document object
print(search_result[0].page_content[:100])
print(search_result[0])


<p id='83' data-category='paragraph' style='font-size:16px'>ITEM 7A. QUANTITATIVE AND QUALITATIVE DI
page_content='<p id='83' data-category='paragraph' style='font-size:16px'>ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK</p> <p id='84' data-category='paragraph' style='font-size:14px'>We are exposed to market risks in the ordinary course of our business. Market risk represents the risk of loss that may impact our financial position due to<br>adverse changes in financial market prices and rates. Our market risk exposure is primarily the result of fluctuations in interest rates and foreign currency<br>exchange rates. We do not hold or issue financial instruments for trading purposes.</p> <h1 id='85' style='font-size:16px'>Interest Rate Risk</h1>' metadata={'page': 87}
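
As a rough picture of what a retriever does with embeddings, the sketch below ranks a few toy vectors by cosine similarity to a query vector. This is an illustration under assumptions: the vectors are made up, the function names are hypothetical, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 2-dimensional "embeddings"; real embedding vectors are much longer
docs = {"risk": [0.9, 0.1], "revenue": [0.0, 1.0], "mixed": [0.5, 0.5]}
print(top_k([1.0, 0.0], docs, k=2))
# ['risk', 'mixed']
```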


In [None]:
from bs4 import BeautifulSoup

# Strip the HTML tags from each retrieved chunk and collect the plain text
for search in search_result:
    print(search.page_content)
    soup = BeautifulSoup(search.page_content, 'html.parser')
    text = soup.get_text(separator="\n")
    doc_base.append(text)

<p id='83' data-category='paragraph' style='font-size:16px'>ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK</p> <p id='84' data-category='paragraph' style='font-size:14px'>We are exposed to market risks in the ordinary course of our business. Market risk represents the risk of loss that may impact our financial position due to<br>adverse changes in financial market prices and rates. Our market risk exposure is primarily the result of fluctuations in interest rates and foreign currency<br>exchange rates. We do not hold or issue financial instruments for trading purposes.</p> <h1 id='85' style='font-size:16px'>Interest Rate Risk</h1>
<p id='59' data-category='paragraph' style='font-size:16px'>SELECTED RISKS AFFECTING OUR BUSINESS</p> <p id='60' data-category='paragraph' style='font-size:16px'>Investing in our Class A common stock involves numerous risks, including the risks described under "Risk Factors" in Part 1, Item 1A of this Annual<br>Report on Form 10-K. Below 
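
To show what the tag-stripping step above produces, here is a minimal standard-library sketch that collects only the text nodes of an HTML fragment. It is not BeautifulSoup: unlike `get_text(separator="\n")`, it also drops whitespace-only nodes, and the names `TextCollector` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:  # skip whitespace between tags
            self.parts.append(stripped)


def html_to_text(html, separator="\n"):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return separator.join(collector.parts)


sample = "<p id='84'>We are exposed to market risks.<br>Rates may change.</p> <h1 id='85'>Interest Rate Risk</h1>"
print(html_to_text(sample))
# We are exposed to market risks.
# Rates may change.
# Interest Rate Risk
```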

## 4. Analyze Financial Document
Finally, we can use Upstage’s LLM capabilities to generate a summary report that includes an overview of identified risks, categorized by type, with a severity ranking.

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage

llm = ChatUpstage(api_key=upstage_api_key)

prompt_template = PromptTemplate.from_template(
    """
    Generate mainly two things from the following context.
    1. Summary of Risks: Generate a summary report that includes an overview of identified risks, categorized by type, with a severity ranking.
    2. Detailed Analysis: Provide a detailed breakdown of each identified risk, including the specific language from the document, context, and potential impact.
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()
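
The `{Context}` placeholder is filled with whatever value is passed in, so a Python list is rendered with its brackets and quotes. A possible refinement, sketched here as an assumption rather than part of the original notebook, is to join the retrieved passages into one string and cap its length first. The helper `build_context` and its `max_chars` budget are hypothetical.

```python
def build_context(passages, max_chars=8000):
    """Join non-empty passages with blank lines, stopping before the character budget is exceeded."""
    parts = []
    used = 0
    for passage in passages:
        passage = passage.strip()
        if not passage:
            continue
        # Account for the two-character separator between passages
        extra = len(passage) + (2 if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(passage)
        used += extra
    return "\n\n".join(parts)


print(build_context(["  Market risk exposure ", "", "Limited operating history"]))
# Market risk exposure
#
# Limited operating history
```

The result could then be passed as the `Context` value when invoking the chain, in place of the raw list.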

In [None]:
chain.invoke({"Context": doc_base})


"Summary of Risks:\n\n1. Market risk exposure primarily due to fluctuations in interest rates and foreign currency exchange rates.\n2. Limited operating history, making it difficult to evaluate prospects and future results of operations.\n3. Risks related to the business and industry, including potential harm to business, financial condition, operating results, and prospects.\n4. Concentration of credit risk in cash and cash equivalents held by one financial institution, potentially exceeding FDIC insurance limits.\n\nDetailed Analysis:\n\n1. Market risk exposure: The company is exposed to market risks in the ordinary course of business, including fluctuations in interest rates and foreign currency exchange rates. However, the company does not hold or issue financial instruments for trading purposes.\n2. Limited operating history: The company has a limited operating history, which makes it difficult to evaluate its prospects and future results of operations.\n3. Risks related to the bu