<a href="https://colab.research.google.com/github/duper203/upstage_cookbook/blob/main/financial_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Install requirements
!pip install langchain
!pip install langchain-chroma
!pip install langchain_upstage
!pip install -q python-dotenv

# Introduction
## Financial Analysis over 10-K documents
In the world of finance, extracting critical insights from lengthy documents like 10-K forms is an essential but often time-consuming task for analysts. The 10-K form is an annual report required by the U.S. Securities and Exchange Commission (SEC), which provides a comprehensive summary of a company's financial performance. These documents can run hundreds of pages and are filled with complex, domain-specific terminology. To address this challenge, we showcase how Upstage, combined with LangChain and Chroma, can assist financial analysts in quickly extracting and synthesizing insights from a document with minimal coding effort.

In [None]:
#@title 0. Set API key
from pprint import pprint
import os

import warnings
warnings.filterwarnings('ignore')

from IPython import get_ipython

upstage_api_key_env_name = 'UPSTAGE_API_KEY'
def load_env():
    if 'google.colab' in str(get_ipython()):
        # Running in Google Colab
        from google.colab import userdata
        upstage_api_key = userdata.get(upstage_api_key_env_name)
        return os.environ.setdefault('UPSTAGE_API_KEY', upstage_api_key)
    else:
        # Running in local Jupyter Notebook
        from dotenv import load_dotenv
        load_dotenv()
        return os.environ.get(upstage_api_key_env_name)

UPSTAGE_API_KEY = load_env()

## 1. Extract data from Document & Split
The first step in our process involves loading the 10-K document and splitting it into manageable chunks of text.

To extract text from the document, we use the [Upstage Layout Analysis API](https://developers.upstage.ai/docs/apis/layout-analysis). The API segments each page into meaningful elements such as headings and paragraphs, and marks non-text elements like images and tables so they are easy to identify. This streamlines data extraction considerably.
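
The loader returns each page as HTML in which elements carry an `id` and, for most elements, a `data-category` attribute. The following is a minimal sketch of grouping element text by category using only the standard library. The `CategoryCollector` class is a hypothetical helper, not part of the Upstage or LangChain APIs, and the sample string is abbreviated from the first page of this notebook's 10-K.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CategoryCollector(HTMLParser):
    """Group element text by its data-category attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.by_category = defaultdict(list)
        self._current = None  # category of the element currently being read

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "id" in attr_map:
            # Elements without data-category (e.g. the <h1>) fall back to the tag name.
            self._current = attr_map.get("data-category", tag)
            self.by_category[self._current].append("")
        elif tag == "br" and self._current is not None:
            # Treat a line break inside an element as a single space.
            self.by_category[self._current][-1] += " "

    def handle_endtag(self, tag):
        if tag != "br":
            self._current = None

    def handle_data(self, data):
        # Text between elements (outside any id-bearing tag) is ignored.
        if self._current is not None:
            self.by_category[self._current][-1] += data


sample = (
    "<h1 id='1' style='font-size:20px'>UNITED STATES<br>"
    "SECURITIES AND EXCHANGE COMMISSION</h1> <br>"
    "<p id='2' data-category='paragraph' style='font-size:14px'>(Mark One)</p>"
)

collector = CategoryCollector()
collector.feed(sample)
collector.close()

for category, texts in collector.by_category.items():
    print(category, texts)
```

Grouping by category this way makes it easy to route, for example, tables and paragraphs to different downstream handling.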

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
from langchain_upstage import UpstageLayoutAnalysisLoader
from langchain_chroma import Chroma
from langchain_upstage import UpstageEmbeddings

In [None]:
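# Check whether a document is already in the vector store.
# Note: Chroma's `get(ids=...)` looks up by document ID, so `text` must be
# the ID the document was stored under for this check to find a match.
def is_in_vectorstore(vectorstore, text):
    search_results = vectorstore.get(ids=[text])
    return bool(search_results and search_results["ids"])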


In [None]:
file_path = "c3_k-10.pdf"

loader = UpstageLayoutAnalysisLoader(file_path, split="page", api_key=UPSTAGE_API_KEY)

# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = loader.load()  # or loader.lazy_load()

In [None]:
print(docs[0].page_content[:1000])


<h1 id='1' style='font-size:20px'>UNITED STATES<br>SECURITIES AND EXCHANGE COMMISSION<br>Washington, D.C. 20549<br>FORM 10-K</h1> <br><p id='2' data-category='paragraph' style='font-size:14px'>(Mark One)</p> <br><p id='3' data-category='paragraph' style='font-size:14px'>☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934<br>For the fiscal year ended April 30, 2022</p> <p id='4' data-category='paragraph' style='font-size:14px'>OR</p> <p id='5' data-category='paragraph' style='font-size:14px'>☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</p> <br><p id='6' data-category='paragraph' style='font-size:14px'>For the transition period from __________ to __________<br>Commission File Number: 001-39744</p> <br><p id='7' data-category='paragraph' style='font-size:16px'>C3.ai, Inc.<br>(Exact name of registrant as specified in its charter)</p> <p id='8' data-category='paragraph' style='font-size:14px'>Delaware<br>(State

In [None]:
# Split the loaded pages into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 1001
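
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also prefers to break at separators such as HTML tags); `sliding_window_chunks` is a hypothetical function for illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two characters of the previous one, which is the purpose of the overlap: text that straddles a boundary still appears intact in at least one chunk.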


## 2. Store in a vectordb
Once the document is split, the next step is to store these chunks in a vector database. We’ll use Chroma to create a vector store and Upstage for generating embeddings.
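
As a rough illustration of what a vector store does at query time, the sketch below ranks a few hand-written toy vectors by cosine similarity to a query vector. The vectors, IDs, and function names are made up for illustration. Real embeddings come from the embedding model, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_ids(query_vec, store, k=2):
    """Return the IDs of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values).
toy_store = {
    "interest-rate": [0.9, 0.1, 0.0],
    "liquidity": [0.1, 0.9, 0.0],
    "credit": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]

nearest = top_k_ids(query_vec, toy_store, k=2)
print(nearest)
```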



In [None]:
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large", upstage_api_key=UPSTAGE_API_KEY),
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]
print(len(unique_splits))

1001


In [None]:
import hashlib

def generate_unique_id(content, index):
    return hashlib.md5(f"{index}-{content}".encode()).hexdigest()

if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[generate_unique_id(split.page_content, i) for i, split in enumerate(unique_splits)],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large", upstage_api_key=UPSTAGE_API_KEY),
    )
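
A membership check by ID only works if the same ID can be recomputed from the chunk later. One possible approach, sketched below with a plain dictionary standing in for the vector store, is to derive the ID from the content alone. This is an assumption for illustration and differs from the index-prefixed IDs used in this notebook; `content_id` and `add_new_chunks` are hypothetical helpers.

```python
import hashlib


def content_id(text):
    """Deterministic ID derived only from the chunk text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def add_new_chunks(store, chunks):
    """Insert chunks whose content-derived ID is not yet in store; return how many were added."""
    added = 0
    for chunk in chunks:
        key = content_id(chunk)
        if key not in store:
            store[key] = chunk
            added += 1
    return added


store = {}  # stand-in for a vector store keyed by document ID
first_run = add_new_chunks(store, ["alpha", "beta"])
second_run = add_new_chunks(store, ["beta", "gamma"])
print(first_run, second_run, len(store))
```

With content-derived IDs, re-running the ingestion step on an unchanged document adds nothing, while identical chunks collapse into a single entry.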

## 3. Develop and Test Financial-Specific Prompts

In [None]:
# Define financial information categories
# Identify the key financial metrics and sections you want to extract from the 10-K documents.

financial_categories = {
    "Interest Rate Risk": [
        "Summarize the company's exposure to interest rate risks.",
        "What is mentioned about the company's strategy for mitigating interest rate risks?",
        "Extract details on how interest rate fluctuations have impacted the company's financial performance."
    ],
    "Liquidity Ratios": [
        "Provide the liquidity ratios mentioned in the document.",
        "Explain how the company's current liquidity is evaluated.",
        "List the ratios used to assess the company's liquidity."
    ],
    "Credit Risk": [
        "Describe the company's exposure to credit risk.",
        "What measures has the company taken to mitigate credit risk?",
        "Detail the impact of credit risk on the company's financial health."
    ],
    "Market Risk": [
        "Outline the market risks the company is exposed to.",
        "How does the company manage market risk?",
        "Discuss the effects of market risk on the company's operations."
    ]
}



In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage

llm = ChatUpstage(api_key=UPSTAGE_API_KEY)


In [None]:
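def generate_responses(category, prompts, context):
    # `context` may be a list of text chunks; join it so the template receives plain text.
    context_text = "\n\n".join(context) if isinstance(context, list) else context
    # The template must include {prompt}; otherwise the question never reaches the model.
    prompt_template = PromptTemplate.from_template(
        """
        Generate the requested information based on the following context.
        ---
        Context: {Context}
        ---
        Request: {prompt}
        """
    )
    chain = prompt_template | llm | StrOutputParser()
    responses = {}
    for i, prompt in enumerate(prompts, 1):
        response = chain.invoke({"Context": context_text, "prompt": prompt})
        responses[f"Prompt_{i}"] = response
    return responses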
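
For a question to influence the answer, it has to appear in the rendered prompt text alongside the retrieved context. The sketch below shows that assembly with plain `str.format` as a stand-in for `PromptTemplate`. The template wording and the `build_prompt` helper are hypothetical, and the sample chunks are paraphrased placeholders rather than retrieved text.

```python
TEMPLATE = (
    "Answer the request using only the context below.\n"
    "---\n"
    "Context: {context}\n"
    "---\n"
    "Request: {request}"
)


def build_prompt(context_chunks, request):
    """Join retrieved chunks and place them in the prompt next to the request."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, request=request)


rendered = build_prompt(
    ["The company does not use derivatives to manage interest rate risk.",
     "The company does not currently hedge foreign currency risk."],
    "Summarize the company's exposure to interest rate risks.",
)
print(rendered)
```

Printing the rendered prompt before sending it to the model is a quick way to confirm that both the context and the request made it into the text.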


In [None]:
from bs4 import BeautifulSoup

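def retrieve_documents(query, retriever, top_k=5):
    # `invoke` has no `top_k` argument, so query the underlying vector store
    # directly to make sure `k` is honored. This assumes `retriever` was
    # created with `vectorstore.as_retriever()`.
    search_result = retriever.vectorstore.similarity_search(query, k=top_k)
    extracted_texts = []
    for search in search_result:
        # Strip the HTML markup produced by the layout analysis loader.
        soup = BeautifulSoup(search.page_content, 'html.parser')
        text = soup.get_text(separator="\n")
        extracted_texts.append(text)
    return extracted_texts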


In [None]:
# Store all responses
all_responses = {}

for category, prompts in financial_categories.items():
    print(f"\nProcessing Category: {category}")
    # Retrieve relevant documents for the category
    query = f"Tell me about {category.lower()}"
    context = retrieve_documents(query, retriever)

    # Generate responses for each prompt in the category
    responses = generate_responses(category, prompts, context)
    all_responses[category] = responses

    # Display responses
    for prompt_name, response in responses.items():
        print(f"\n{prompt_name}: {response}")



Processing Category: Interest Rate Risk

Prompt_1: The context provided discusses the market risks faced by the company, which include interest rate risk and foreign currency risk. The company does not hold or issue financial instruments for trading purposes and does not use derivative financial instruments to manage its interest rate risk exposure. As of April 30, 2022, a hypothetical 10% relative change in interest rates would not have had a material impact on the value of the company's cash equivalents or investment portfolio. The company also mentions that it does not currently hedge its foreign currency risk, but may do so in the future if its exposure to foreign currencies becomes more significant. The company also mentions that it does not have any exposure to inflation risk.

Prompt_2: Based on the context provided, the company has exposure to market risks in the ordinary course of its business, primarily due to fluctuations in interest rates and foreign currency exchange rate