## Table of Contents
[1. Introduction](#introduction)<br>
[2. Install Libraries](#load_install)<br>
[3. Load Dataset](#load_dataset)<br>
[4. Load Gemini Model](#load_model)<br>
[5. Prompt Engineering Strategy](#novel_strategy)<br>
[6. Sample Dataset](#sample_dataset)<br>
[7. Summary Generation](#summarization)<br>
[8. Summary Evaluation Strategy](#summary_eval)<br>
[9. Evaluation](#evaluation)<br>
[10. Review Results](#review_results)<br>
[11. Conclusion](#conclusion)<br>
[Thanks!](#thanks)<br>

# 1. Introduction
<a id="introduction"></a>

### **Use Case: Long-Form Summarization for Financial Reports**

In today's business environment, a thorough understanding of a company's financials, growth trajectory, and industry dynamics is critical for making informed investment decisions and conducting effective competitor analysis. 

In the U.S. market, **10-K filings** provide a detailed snapshot of a company's financial performance, featuring in-depth financial statements, management commentary, and identified risk factors. They also outline the company's strategic goals and prospects, making them a crucial resource for stakeholders and financial analysts seeking comprehensive insights.

**Management's Discussion and Analysis (MD&A)** is one of the key sections of the 10-K filings that offers qualitative insights into a company's financial performance and future outlook. 

This section provides management's perspective on various topics, including operational results, market trends, strategic initiatives, compliance, risks, and forward-looking plans. It gives investors a deeper view of the company's positioning and potential challenges. Analyzing the MD&A is critical for evaluating a company's past performance and assessing the future direction and management's confidence in its prospects.

Summarizing 10-K filings is essential for several reasons:

**Information Overload**: Financial reports contain vast amounts of data, making it time-consuming and challenging for stakeholders to extract critical insights. Summarization simplifies and accelerates this process, enabling quicker access to essential information.

**Decision-Making**: Investors and analysts often seek specific information within these documents. A structured summarization framework highlights critical data, facilitating the analysis of multiple companies and enabling large-scale assessments of their financial health.


### **References**

Hargrave, M. (2024, August 18). *Management discussion and analysis (MD&A): Definition and example*. Investopedia. https://www.investopedia.com/terms/m/mdanalysis.asp

CFI Team. (n.d.). *What is MD&A? Management discussion and analysis in the 10-K*. Corporate Finance Institute. https://corporatefinanceinstitute.com/resources/valuation/mda-management-discussion-analysis/

U.S. Securities and Exchange Commission. (n.d.). *How to read a 10-K*. https://www.sec.gov/answers/reada10k.htm

Cornell Law School. (n.d.). *17 CFR § 229.303 - Management’s discussion and analysis of financial condition and results of operations*. https://www.law.cornell.edu/cfr/text/17/229.303


# 2. Install Libraries
<a id="load_install"></a>

In [1]:
# install datasets package if it's not installed already in the kaggle notebook
# pip install datasets

# 3. Load Dataset
<a id="load_dataset"></a>


For this project, we will use the **EDGAR-CORPUS** dataset, a novel corpus containing 10-K filings extracted from the EDGAR API for all publicly traded companies in the US over a period of 25+ years (1993-2020).

### **References**

Loukas, L., Fergadiotis, M., Androutsopoulos, I., & Malakasiotis, P. (2021). *EDGAR-CORPUS: Billions of tokens make the world go round*. In *Proceedings of the Third Workshop on Economics and Natural Language Processing* (pp. 13–18). Association for Computational Linguistics. https://aclanthology.org/2021.econlp-1.2

Dataset Link: https://huggingface.co/datasets/eloukas/edgar-corpus

In [20]:
import datasets

# Load 10k filings dataset from year 2020
edgar_10k_filings = (
    datasets
    .load_dataset(
        "eloukas/edgar-corpus", 
        "year_2020", 
        split="train",
        trust_remote_code=True
    )
    # convert to a pandas dataframe 
    .to_pandas()
)

# print dataframe shape
print(edgar_10k_filings.shape)

# print dataframe
edgar_10k_filings.head(2)


(5480, 23)


Unnamed: 0,filename,cik,year,section_1,section_1A,section_1B,section_2,section_3,section_4,section_5,...,section_8,section_9,section_9A,section_9B,section_10,section_11,section_12,section_13,section_14,section_15
0,718413_2020.htm,718413,2020,Item 1. The Business\nOrganization and Operati...,Item 1A. Risk Factors\nBefore deciding to inve...,Item 1B. Unresolved Staff Comments\nNot Applic...,Item 2. Properties\nAlthough the Company does ...,Item 3. Legal Proceedings\nThere are no pendin...,Item 4. Mine Safety Disclosures\nNot Applicabl...,"Item 5. Market for Registrant’s Common Equity,...",...,Item 8. Financial Statements and Supplementary...,Item 9. Changes in and Disagreements with Acco...,Item 9A. Controls and Procedures\nDisclosure C...,Item 9B. Other Information\nNone\nPART III.\nI...,"Item 10. Directors, Executive Officers and Cor...",Item 11. Executive Compensation\nThe following...,Item 12. Security Ownership of Certain Benefic...,Item 13. Certain Relationships and Related Tra...,Item 14. Principal Accounting Fees and Service...,Item 15. Exhibits and Financial Statement Sche...
1,931059_2020.htm,931059,2020,"Item 1. Business\nRennova Health, Inc. (“Renno...",Item 1A. Risk Factors\nAn investment in our se...,Item 1B. Unresolved Staff Comments\nNot applic...,Item 2. Properties\nThe table below summarizes...,"Item 3. Legal Proceedings\nFrom time to time, ...",Item 4. Mine Safety Disclosures\nNot applicabl...,"Item 5. Market for Registrant’s Common Equity,...",...,Item 8. Financial Statements and Supplementary...,Item 9. Changes in and Disagreements With Acco...,Item 9A. Controls and Procedures.\nEvaluation ...,Item 9B. Other Information.\nNone\nPART III\nI...,"Item 10. Directors, Executive Officers and Cor...",Item 11. Executive Compensation.\nThe followin...,Item 12. Security Ownership of Certain Benefic...,Item 13. Certain Relationships and Related Tra...,Item 14. Principal Accounting Fees and Service...,"Item 15. Exhibits, Financial Statement Schedul..."


In [21]:
edgar_10k_filings["section_1"][0]

'Item 1. The Business\nOrganization and Operation\nThe Company. The Company was organized under the laws of the State of Vermont in 1982 and became a registered bank holding company under the Bank Holding Company Act of 1956, as amended, in October 1983 when it acquired all of the voting shares of the Bank, headquartered in Derby, Vermont. The Bank is the only subsidiary of the Company and principally all of the Company’s business operations are presently conducted through it. Therefore, the following narrative and the other information about the Company contained in this report are based primarily on the Bank’s operations.\nThe Bank; Banking Services. Community National Bank was organized in 1851 as the Peoples Bank, and was subsequently reorganized as the National Bank of Derby Line in 1865. In 1975, after 110 continuous years of operation as the National Bank of Derby Line, the Bank acquired the Island Pond National Bank and changed its name to “Community National Bank.” On December

# 4. Load Gemini Model
<a id="load_model"></a>


* Create your Gemini API key here: https://ai.google.dev/tutorials/setup
* Create a new secret variable and name it "GEMINI_API_KEY"
    * In the top menu, go to "Add-ons" -> "Secrets" -> "Add Secret"
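
The `kaggle_secrets` module is only available inside Kaggle notebooks. As a minimal sketch for running the same code elsewhere, the hypothetical helper below (`load_gemini_api_key` is not part of the original notebook) first tries the Kaggle secret and otherwise falls back to an environment variable of the same name. The fallback is an assumption about how the key would be supplied outside Kaggle.

```python
import os


def load_gemini_api_key(secret_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from Kaggle secrets, or from the environment as a fallback."""
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except ImportError:
        # Assumed fallback: the key was exported as an environment variable.
        key = os.environ.get(secret_name, "")
        if not key:
            raise RuntimeError(f"{secret_name} is not set")
        return key
```

The returned value can then be passed to `genai.configure(api_key=...)` exactly as in the cell below.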


In [22]:
from kaggle_secrets import UserSecretsClient
import os, time

import google.generativeai as genai
from google.generativeai import GenerationConfig

# load Gemini model API key
user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("GEMINI_API_KEY")

# load Gemini model 
genai.configure(api_key=api_key)
model = genai.GenerativeModel(model_name='gemini-1.5-flash-latest')


In [23]:
model

genai.GenerativeModel(
    model_name='models/gemini-1.5-flash-latest',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)

# 5. Prompt Engineering Strategy
<a id="novel_strategy"></a>

As mentioned in the introduction, financial analysts often seek specific insights from the MD&A section. To address this, I combine chain-of-thought prompting with a novel approach called **structured summarization**, in which the model organizes its summary under a fixed set of themes.

### **References**

Cao, T., Raman, N., Dervovic, D., & Tan, C. (2024). *Characterizing multimodal long-form summarization: A case study on financial reports* (arXiv:2404.06162v3). https://doi.org/10.48550/arXiv.2404.06162


In [24]:
def get_summarization_cot_prompt(input_text: str)->str:
    '''
    Create a Chain-of-Thought (CoT) prompt for Long Form Structured Text Summarization.
    '''

    TASK = (
        "Summarize the following M&DA report."
        "\n"
        "\n"
        "MD&A REPORT:"
        "\n"
    )

    COT_INSTRUCTIONS = (
        "\n"
        "\n"
        "## INSTRUCTIONS:"
        "\n"
        "\n"
        "Let’s generate a summary step by step."
        "\n"
        "\n"
        "1. Read the MD&A Report: Carefully review the entire report to grasp the overall context."
        "\n"
        "2. Extract Key Insights: Identify and summarize insights related to the following eight themes:"
        "\n"
        "\n"
        "- Overview of Financial Results: Summarize the company’s performance, including revenue, expenses, and net income, over the reporting period."
        "\n"
        "- Liquidity and Capital Resources: Discuss the company’s ability to meet short-term and long-term obligations, including cash flow analysis and funding sources."
        "\n"
        "- Results of Operations: Break down revenue and expenses by segment or category or region, comparing current results to previous periods."
        "\n"
        "- Critical Accounting Estimates: Explain accounting policies that require significant judgment and estimates, which can impact financial statements."
        "\n"
        "- Trends and Outlook: Provide insights into market trends and economic conditions that may influence the company’s future performance."
        "\n"
        "- Business Risks and Uncertainties: Identify key risks that could impact operations, financial results, and the company’s market position."
        "\n"
        "- Planned Actions: Outline management's plans to address anticipated risks or challenges."
        "\n"
        "- Future Goals: Outline Management's future goals and approaches to new or innovative projects or products."
        "\n"
        "\n"
        "3. Review Numeric Data: Pay attention to any tables presenting numeric data, such as income statements, balance sheets, or cash flow statements."
        "\n"
        "4. Include Accurate Numbers: When including numbers in the summary, ensure they are: a) Explicitly stated values from the original report (do not fabricate numbers). b) Stemmed from step-by-step verified calculations. c) Correctly rounded. d) Appropriately represented with clear context from the original source."
        "\n"
        "5. Synthesize Information: Combine the extracted insights into a concise summary organized into the eight themes listed above, ensuring logical flow."
        "\n"
        "6. JSON Format Response: Provide your response strictly in the JSON format as specified below. Do not include any text outside of the JSON format."
        "\n"
        "\n"
    )

    JSON_FORMAT = (
        "## JSON FORMAT:"
        "\n"
    )

    JSON_EXAMPLE = {
        "Overview of Financial Results": " ",
        "Liquidity and Capital Resources": " ",
        "Results of Operations": " ",
        "Critical Accounting Estimates": " ",
        "Trends and Outlook": " ",
        "Business Risks and Uncertainties": " ",
        "Planned Actions": " ",
        "Future Goals": " "
    }

    SUMMARY = (
        "\n"
        "\n"
        "## SUMMARY:"
        "\n"
    )

    prompt = "{task}{input_text}{cot_instructions}{json_format}{json_example}{summary}".format(
        task=TASK,
        input_text=input_text,
        cot_instructions=COT_INSTRUCTIONS,
        json_format=JSON_FORMAT,
        json_example=JSON_EXAMPLE,
        summary=SUMMARY
    )

    return prompt

# example prompt 
print(get_summarization_cot_prompt(input_text="test_input_text"))

Summarize the following M&DA report.

MD&A REPORT:
test_input_text

## INSTRUCTIONS:

Let’s generate a summary step by step.

1. Read the MD&A Report: Carefully review the entire report to grasp the overall context.
2. Extract Key Insights: Identify and summarize insights related to the following eight themes:

- Overview of Financial Results: Summarize the company’s performance, including revenue, expenses, and net income, over the reporting period.
- Liquidity and Capital Resources: Discuss the company’s ability to meet short-term and long-term obligations, including cash flow analysis and funding sources.
- Results of Operations: Break down revenue and expenses by segment or category or region, comparing current results to previous periods.
- Critical Accounting Estimates: Explain accounting policies that require significant judgment and estimates, which can impact financial statements.
- Trends and Outlook: Provide insights into market trends and economic conditions that may influe

# 6. Sample Dataset
<a id="sample_dataset"></a>

Due to API call limits, I used a smaller sample to illustrate the use case. In the future, I plan to conduct a comprehensive analysis using a paid API subscription.

In [25]:
# sample size
SAMPLE_SIZE = 5

# Create a sample Dataset
edgar_10k_filings = (
    edgar_10k_filings

    # drop rows where the MD&A section (section 7) is blank
    .query("section_7!= ''")

    # create columns
    .assign(
        
        # MD&A section - word count
        section_7_word_count = lambda x: x.section_7.apply( lambda x: len(str(x).split()) ),
        
        # MD&A section summary prompt
        section_7_summary_prompt = lambda x: x.apply(lambda x: get_summarization_cot_prompt(input_text=x.section_7) ,axis=1 ),
        
        # MD&A section summary prompt - word count
        section_7_summary_prompt_word_count = lambda x: x.section_7_summary_prompt.apply( lambda x: len(str(x).split()) ),

    )

    # Stress test the Gemini Model, select MD&A section (section-7) of the 10k filings where the word count is more than 20,000 words
    .query("section_7_word_count>20000")
    
    # filter rows to a SAMPLE_SIZE  
    .sample(n=SAMPLE_SIZE, random_state=786)    
    
    # reset index
    .reset_index(drop=True, inplace=False)
)

# print dataframe shape 
print(edgar_10k_filings.shape)

# MD&A section word count distribution 
print(edgar_10k_filings.section_7_word_count.describe())

# print dataframe
edgar_10k_filings.head(2)

(5, 26)
count        5.000000
mean     25568.600000
std       6405.799583
min      20524.000000
25%      20629.000000
50%      21623.000000
75%      31611.000000
max      33456.000000
Name: section_7_word_count, dtype: float64


Unnamed: 0,filename,cik,year,section_1,section_1A,section_1B,section_2,section_3,section_4,section_5,...,section_9B,section_10,section_11,section_12,section_13,section_14,section_15,section_7_word_count,section_7_summary_prompt,section_7_summary_prompt_word_count
0,1413329_2020.htm,1413329,2020,Item 1.Business.\nGeneral Development of Busin...,Item 1A. Risk Factors.\nThe following risk fac...,Item 1B.Unresolved Staff Comments.\nNone.\nIte...,Item 2. Properties.\nWe own or lease various m...,Item 3.Legal Proceedings.\nThe information cal...,Item 4.Mine Safety Disclosures.\nNot applicabl...,"Item 5. Market for Registrant’s Common Equity,...",...,Item 9B.Other Information.\nNone.\nPART III\nE...,"Item 10.Directors, Executive Officers and Corp...",Item 11.Executive Compensation.\nRefer to Comp...,Item 12.Security Ownership of Certain Benefici...,Item 13. Certain Relationships and Related Tra...,Item 14.Principal Accounting Fees and Services...,Item 15.Exhibits and Financial Statement Sched...,21623,Summarize the following M&DA report.\n\nMD&A R...,21972
1,36146_2020.htm,36146,2020,ITEM 1.\nBUSINESS\nThe Corporation\nDescriptio...,ITEM 1A.\nRISK FACTORS\nTrustmark and its subs...,ITEM 1B.\nUNRESOLVED STAFF COMMENTS\nNone\nITE...,ITEM 2.\nPROPERTIES\nTrustmark’s principal off...,ITEM 3.\nLEGAL PROCEEDINGS\nInformation requir...,ITEM 4.\nMINE SAFETY DISCLOSURES\nNot applicab...,ITEM 5.\nMARKET FOR THE REGISTRANT’S COMMON EQ...,...,ITEM 9B.\nOTHER INFORMATION\nNone\nPART III\nI...,"ITEM 10.\nDIRECTORS, EXECUTIVE OFFICERS AND CO...",ITEM 11.\nEXECUTIVE COMPENSATION\nThe informat...,ITEM 12.\nSECURITY OWNERSHIP OF CERTAIN BENEFI...,ITEM 13.\nCERTAIN RELATIONSHIPS AND RELATED TR...,ITEM 14.\nPRINCIPAL ACCOUNTING FEES AND SERVIC...,,20629,Summarize the following M&DA report.\n\nMD&A R...,20978


In [14]:
edgar_10k_filings["section_7_summary_prompt"][0]



# 7. Summary Generation
<a id="summarization"></a>

In [26]:
from tqdm import tqdm
import time, random, ast, traceback

edgar_10k_filings["section_7_summary_overview_of_financial_results"] = ""
edgar_10k_filings["section_7_summary_liquidity_and_capital_resources"] = ""
edgar_10k_filings["section_7_summary_results_of_operations"] = ""
edgar_10k_filings["section_7_summary_critical_accounting_estimates"] = ""
edgar_10k_filings["section_7_summary_trends_and_outlook"] = ""
edgar_10k_filings["section_7_summary_business_risks"] = ""
edgar_10k_filings["section_7_summary_planned_actions"] = ""
edgar_10k_filings["section_7_summary_future_goals"] = ""


for index, row in tqdm(edgar_10k_filings.iterrows(), total=len(edgar_10k_filings)):

    try:
        # generate a structured summary out of 10k filings-item 7
        summary = model.generate_content(
            row['section_7_summary_prompt'],
            generation_config=GenerationConfig(max_output_tokens=8192, temperature=0.001)
        )
        
        # clean the llm response text to extract JSON/Python dictionary
        summary_cleaned = summary.text.replace("```json\n", "").replace("\n```", "")
         
        # convert the response text to a python dictionary
        response_dict = ast.literal_eval(summary_cleaned)
        
        # store structured summary components in dataframe
        edgar_10k_filings.at[index, "section_7_summary_overview_of_financial_results"] = response_dict["Overview of Financial Results"]
        edgar_10k_filings.at[index, "section_7_summary_liquidity_and_capital_resources"] = response_dict["Liquidity and Capital Resources"]
        edgar_10k_filings.at[index, "section_7_summary_results_of_operations"] = response_dict["Results of Operations"]
        edgar_10k_filings.at[index, "section_7_summary_critical_accounting_estimates"] = response_dict["Critical Accounting Estimates"]
        edgar_10k_filings.at[index, "section_7_summary_trends_and_outlook"] = response_dict["Trends and Outlook"]
        edgar_10k_filings.at[index, "section_7_summary_business_risks"] = response_dict["Business Risks and Uncertainties"]
        edgar_10k_filings.at[index, "section_7_summary_planned_actions"] = response_dict["Planned Actions"]
        edgar_10k_filings.at[index, "section_7_summary_future_goals"] = response_dict["Future Goals"]        
        
        # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
        time.sleep(random.uniform(5, 10)) 
        
    except Exception as e:
        # Sometimes the Gemini Model API returns no text response due to RECITATION finish reason.
        # The recitation stop reason is used whenever the model begins to recite model training data, especially copyrighted material. [https://github.com/google/generative-ai-docs/issues/257]
        print(index)        
        # traceback.print_exc()
        pass


100%|██████████| 5/5 [02:32<00:00, 30.53s/it]
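
The summarization prompt embeds a Python dictionary as its format example, so the model may reply with either JSON or Python-literal syntax, sometimes wrapped in a Markdown code fence. The sketch below is one possible, more defensive way to parse such a reply. `parse_model_dict` and `FENCE` are hypothetical names, not part of the original notebook, and the fallback order (JSON first, then `ast.literal_eval`) is an assumption.

```python
import ast
import json
import re

# Three backticks, built programmatically to keep this example easy to embed.
FENCE = "`" * 3


def parse_model_dict(text: str) -> dict:
    """Extract a dictionary from an LLM reply that may be wrapped in a code fence."""
    # Remove an optional opening fence (with a language tag) and a closing fence.
    pattern = rf"^\s*{FENCE}[a-zA-Z]*\s*|\s*{FENCE}\s*$"
    cleaned = re.sub(pattern, "", text.strip())
    try:
        # Strict JSON handles true/false/null, which ast.literal_eval rejects.
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to Python-literal syntax (e.g. single-quoted keys).
        parsed = ast.literal_eval(cleaned)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary in the model response")
    return parsed
```

Trying JSON first keeps replies that contain `null` or `true` parseable, while the fallback preserves the behaviour of the cell above for single-quoted replies.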


In [28]:
response_dict

{'Overview of Financial Results': "BioRestorative Therapies, Inc. (BioRestorative) generated royalty revenue of $77,000 in 2020, down from $130,000 in 2019.  The company incurred significant operating losses, primarily due to research and development, marketing, and general and administrative expenses. As of December 31, 2020, BioRestorative had an accumulated deficit of $89,842,833 and a stockholder's deficit of $1,331,492.",
 'Liquidity and Capital Resources': 'BioRestorative required additional equity and/or debt financing to continue operations as of December 31, 2020.  Outstanding debt totaled $9,637,102, due November 16, 2023.  The company underwent Chapter 11 reorganization in 2020, exchanging approximately $14,700,000 in debt for equity and new convertible debt.  Post-reorganization, BioRestorative secured DIP financing and anticipates needing an additional $12,000,000 to complete Phase 2 clinical trials, plus approximately $45,000,000 more to complete all trials (assuming no r

# 8. Summary Evaluation Strategy
<a id="summary_eval"></a>


For evaluating the generated summaries, I opted for the LLM-as-Judge strategy due to its scalability and cost-effectiveness. To measure the quality of the summaries, I am using the faithfulness score as the evaluation metric. **The faithfulness score is defined as:** <br><br><br>


$$
\text{Faithfulness score} = \frac{\text{Number of claims in the generated answer that can be inferred from the given context}}{\text{Total number of claims in the generated answer}}
$$



<br>
There are many Python libraries available for LLM-as-Judge evaluation. The biggest challenge is integrating them with a specific LLM, since many of these libraries are designed primarily around the OpenAI APIs. As a workaround, I referenced the documentation and source code of these libraries and wrote my own evaluation function for this use case, compatible with the Gemini API.
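
To make the faithfulness definition above concrete, here is a small worked sketch. The three claims are made-up illustrations rather than model output, and `faithfulness_from_verdicts` is a hypothetical helper that assumes the judge returns a list of dictionaries with a `verdict` field, counting only "True" verdicts as supported.

```python
from typing import Dict, List


def faithfulness_from_verdicts(claims: List[Dict[str, str]]) -> float:
    """Share of claims judged "True"; NaN when there are no claims to score."""
    if not claims:
        return float("nan")
    supported = sum(1 for claim in claims if claim.get("verdict") == "True")
    return supported / len(claims)


# Illustrative judge output (invented for this example).
example_claims = [
    {"claim": "Revenue declined year over year.", "verdict": "True", "reason": "Stated in context."},
    {"claim": "The company paid a dividend.", "verdict": "False", "reason": "Contradicted by context."},
    {"claim": "A new plant opens next year.", "verdict": "Unclear", "reason": "Not mentioned."},
]

# One of three claims is supported, so the score is 1/3.
print(faithfulness_from_verdicts(example_claims))
```

Treating "Unclear" the same as "False" is the conservative choice: a claim only counts toward the score when the context supports it.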



### References
Comet ML. (n.d.). *opik: A library for evaluating LLMs*. GitHub. https://github.com/comet-ml/opik

Comet ML. (n.d.). *template.py* [Python file]. GitHub. https://github.com/comet-ml/opik/blob/main/sdks/python/src/opik/evaluation/metrics/llm_judges/factuality/template.py

Ragas. (n.d.). *Faithfulness metrics*. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/

Confident AI. (n.d.). *Why LLM as a judge is the best LLM evaluation method*. https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method




In [30]:
from typing import Dict, List
import ast

import numpy as np

def generate_faithfulness_cot_prompt(llm_answer: str, context: str):
    '''
    Create a prompt for Evaluation Metric: Faithfulness (LLM As Judge)
    '''
    
    INTRODUCTION = (
        "YOU ARE THE WORLD'S LEADING EXPERT IN FACT-CHECKING AND VALIDATING INFORMATION."
        "\n"
        "YOU HAVE DEVELOPED A CUTTING-EDGE SYSTEM THAT ANALYZES THE FACTUALITY OF CLAIMS MADE IN AN ANSWER BY COMPARING THEM AGAINST THE PROVIDED CONTEXT."
        "\n"
        "YOUR GOAL IS TO DETERMINE THE ACCURACY OF CLAIMS MADE IN THE ANSWER FROM OTHER LANGUAGE MODELS, AND TO PROVIDE A DETAILED VERDICT FOR EACH CLAIM."
        "\n"
        "\n"
    )
    
    INSTRUCTIONS =(
        "###INSTRUCTIONS###"
        "\n"
        "\n"
        "1. **ANALYZE** the provided LLM answer and context to identify individual claims or statements."
        "\n"
        "2. **VALIDATE** claims from LLM answer by cross-referencing with Facts from the Context."
        "\n"
        '3. **CATEGORIZE** each claim as "True", "False" or "Unclear" based on the evidence found in the context.'
        "\n"
        "4. **EXPLAIN** the reasoning behind each verdict, including a brief summary of the evidence supporting or contradicting the claim. Explanation must be short as possible."
        "\n"
        "5. **FORMAT** the result in a JSON object with a list of claims (ONLY FROM ANSWER), their verdicts, and corresponding explanations."
        "\n"
        "\n"
    )

    COT = (
        "###CHAIN OF THOUGHTS###"
        "\n"
        "\n"
        "1. **CLAIM EXTRACTION:**"
        "\n"
        "- Identify and extract distinct claims or statements from the LLM answer."
        "\n"
        "- Ensure that each claim is specific and can be independently verified."
        "\n"
        "\n"
        "2. **FACTUAL VERIFICATION:**"
        "\n"
        "- For each claim, perform a thorough search in the provided context."
        "\n"
        "- Determine whether the claim aligns with the evidence (True), contradicts the evidence (False), or if the evidence is insufficient or ambiguous (Unclear)."
        "\n"
        "\n"
        "3. **REASONING AND EXPLANATION:**"
        "\n"
        "- For each claim, provide a concise explanation that justifies the verdict, citing relevant evidence or the lack thereof."
        "\n"
        "\n"
        "4. **JSON OUTPUT CONSTRUCTION:**"
        "\n"
        "- Format the results as a JSON object (list of dictionaries) with the following structure:"
        "\n"
        "- `claim`: The original claim being evaluated. Return facts only from LLM answer."
        "\n"
        '- `verdict`: The factuality of the claim ("True", "False", or "Unclear").'
        "\n"
        "- `reason`: A brief summary of the reasoning and evidence for the verdict."
        "\n"
        "\n"
    )

    WHAT_NOT_TO_DO = (
        "###WHAT NOT TO DO###"
        "\n"
        "\n"
        "- DO NOT IGNORE any claim present in the LLM answer; every claim must be evaluated."
        "\n"
        "- DO NOT PROVIDE VERDICTS WITHOUT A CLEAR RATIONALE."
        "\n"
        "- DO NOT BASE VERDICTS ON UNSUBSTANTIATED OPINIONS OR UNVERIFIED SOURCES."
        "\n"
        "- AVOID MAKING ASSUMPTIONS; BASE VERDICTS STRICTLY ON AVAILABLE EVIDENCE FROM THE PROVIDED CONTEXT."
        "\n"
        "- DO NOT OMIT THE EXPLANATION FOR ANY CLAIM."
        "\n"
        "\n"
    )

    INPUTS_TEXT = (
        "###INPUTS:###"
        "\n"
        "***"
        "\n"
        "LLM Answer:"
        "\n"
    )

    CONTEXT_TEXT = (
        "\n"
        "\n"
        "Context:"
        "\n"
    )

    END_TEXT = (
        "\n"
        "***"
        "\n"
    )
    
    prompt = "{introduction}{instructions}{cot}{what_not_to_do}{inputs_text}{llm_answer}{context_text}{context}{end_text}".format(
        introduction=INTRODUCTION,
        instructions=INSTRUCTIONS,
        cot=COT,
        what_not_to_do=WHAT_NOT_TO_DO,
        inputs_text=INPUTS_TEXT,
        llm_answer=llm_answer,
        context_text=CONTEXT_TEXT,
        context=context,
        end_text=END_TEXT        
    )

    return prompt


def drop_incomplete_dict(text: str) -> str:
    '''
    Drop incomplete dictionary (if any) from the model response.
    '''

    # Find the last closing brace. str.rfind() returns the index of the last
    # occurrence of a substring (unlike find(), which returns the first), or -1
    # if it is absent. Searching for "}" also covers "},", which starts with "}".
    last_brace_index = text.rfind("}")
    if last_brace_index == -1:
        # No complete dictionary entry found: return an empty list literal
        # so the return type stays a string that ast.literal_eval can parse
        return "[]"

    # slice the text to keep everything up to the last complete dictionary entry
    cleaned_text = text[:last_brace_index + 1] + "\n]"

    return cleaned_text


def compute_faithfulness_score(response_dict: List[Dict[str, str]]):
    '''
    Calculate Faithfulness Score.
    '''
    score = 0.0
        
    for claim in response_dict:
        
        verdict = claim["verdict"]
    
        if verdict == "True":
            score += 1.0

        if verdict == "False":
            score += 0.0

        if verdict=="Unclear":
            score += 0.0
        
    score /= len(response_dict)
        
    return(score)


def faithfulness_score(llm_answer: str, context:str):
    '''
    This function returns a faithfulness score between 0 and 1.
    '''
    
    # create a prompt
    prompt = generate_faithfulness_cot_prompt(llm_answer=llm_answer, context=context)
    
    # generate LLM-as-judge model response
    model_response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(max_output_tokens=8192, temperature=0.001)
    )
    
    try:
        # clean the response text to extract JSON/Python dictionary
        model_response_cleaned = model_response.text.replace("```json\n", "").replace("\n```", "")

        # remove any incomplete dictionaries from the end of the response text (this occurs due to incomplete model response due to token limit)
        model_response_cleaned_2 = drop_incomplete_dict(model_response_cleaned)

        # convert the response text to a python list of dictionaries  
        response_dict = ast.literal_eval(model_response_cleaned_2)

        # calculate faithfulness score
        faithfulness_score = compute_faithfulness_score(response_dict=response_dict)

    except Exception:
        # Sometimes the Gemini Model API returns no text response due to RECITATION finish reason.
        # The recitation stop reason is used whenever the model begins to recite model training data, especially copyrighted material. [https://github.com/google/generative-ai-docs/issues/257]

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        faithfulness_score = np.nan

    return faithfulness_score

# example prompt
print(generate_faithfulness_cot_prompt(llm_answer="test_llm_answer", context="text_context"))

YOU ARE THE WORLD'S LEADING EXPERT IN FACT-CHECKING AND VALIDATING INFORMATION.
YOU HAVE DEVELOPED A CUTTING-EDGE SYSTEM THAT ANALYZES THE FACTUALITY OF CLAIMS MADE IN AN ANSWER BY COMPARING THEM AGAINST THE PROVIDED CONTEXT.
YOUR GOAL IS TO DETERMINE THE ACCURACY OF CLAIMS MADE IN THE ANSWER FROM OTHER LANGUAGE MODELS, AND TO PROVIDE A DETAILED VERDICT FOR EACH CLAIM.

###INSTRUCTIONS###

1. **ANALYZE** the provided LLM answer and context to identify individual claims or statements.
2. **VALIDATE** claims from LLM answer by cross-referencing with Facts from the Context.
3. **CATEGORIZE** each claim as "True", "False" or "Unclear" based on the evidence found in the context.
4. **EXPLAIN** the reasoning behind each verdict, including a brief summary of the evidence supporting or contradicting the claim. Explanation must be short as possible.
5. **FORMAT** the result in a JSON object with a list of claims (ONLY FROM ANSWER), their verdicts, and corresponding explanations.

###CHAIN OF TH

# 9. Evaluation
<a id="evaluation"></a>


Each section of the structured summary needs to be evaluated separately.
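
Because every section is scored the same way, the per-section evaluation can also be expressed as a loop over the summary column names. The sketch below is an alternative formulation, not the notebook's own code. `score_summary_sections` is a hypothetical helper, and the scorer is passed in as a parameter so the function can be tried without calling the API. It works with any row object that supports `.get()` and item access, such as a dictionary or a pandas Series.

```python
from typing import Callable, Dict

# Summary columns created in the Summary Generation step.
SUMMARY_COLUMNS = [
    "section_7_summary_overview_of_financial_results",
    "section_7_summary_liquidity_and_capital_resources",
    "section_7_summary_results_of_operations",
    "section_7_summary_critical_accounting_estimates",
    "section_7_summary_trends_and_outlook",
    "section_7_summary_business_risks",
    "section_7_summary_planned_actions",
    "section_7_summary_future_goals",
]


def score_summary_sections(row, scorer: Callable[..., float]) -> Dict[str, float]:
    """Score each non-empty summary section against the full MD&A text."""
    scores = {}
    for column in SUMMARY_COLUMNS:
        answer = row.get(column, "")
        if answer != "":
            # The MD&A text (section_7) serves as the context for every section.
            scores[f"{column}_faithfulness_score"] = scorer(
                llm_answer=answer, context=row["section_7"]
            )
    return scores
```

In the notebook, `scorer` would be the `faithfulness_score` function defined above. A pause between calls would still be needed to stay within the API rate limit.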

In [31]:
import numpy as np
import time, random, traceback

edgar_10k_filings["section_7_summary_overview_of_financial_results_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_liquidity_and_capital_resources_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_results_of_operations_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_critical_accounting_estimates_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_trends_and_outlook_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_business_risks_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_planned_actions_faithfulness_score"] = np.nan
edgar_10k_filings["section_7_summary_future_goals_faithfulness_score"] = np.nan


for index, row in tqdm(edgar_10k_filings.iterrows(), total=len(edgar_10k_filings)):

    try:
        
        if row["section_7_summary_overview_of_financial_results"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_overview_of_financial_results_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_overview_of_financial_results"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
        if row["section_7_summary_liquidity_and_capital_resources"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_liquidity_and_capital_resources_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_liquidity_and_capital_resources"], 
                    context=row["section_7"]
                )
            )
            
            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10))
        
        if row["section_7_summary_results_of_operations"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_results_of_operations_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_results_of_operations"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
        if row["section_7_summary_critical_accounting_estimates"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_critical_accounting_estimates_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_critical_accounting_estimates"], 
                    context=row["section_7"]
                )
            )
            
            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
    
        if row["section_7_summary_trends_and_outlook"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_trends_and_outlook_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_trends_and_outlook"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
        if row["section_7_summary_business_risks"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_business_risks_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_business_risks"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
        if row["section_7_summary_planned_actions"]!="":
        
            edgar_10k_filings.at[index, "section_7_summary_planned_actions_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_planned_actions"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
        if row["section_7_summary_future_goals"]!="":
            
            edgar_10k_filings.at[index, "section_7_summary_future_goals_faithfulness_score"] =  (
                faithfulness_score(
                    llm_answer=row["section_7_summary_future_goals"], 
                    context=row["section_7"]
                )
            )

            # randomly wait for 5-10 seconds to avoid "429 Resource has been exhausted (e.g. check quota)." API Error
            time.sleep(random.uniform(5, 10)) 
        
    except Exception as e:
        print(index)
        # traceback.print_exc()
        pass

# store results
edgar_10k_filings.to_excel('/kaggle/working/results.xlsx')


 60%|██████    | 3/5 [02:56<01:36, 48.14s/it]

2


 80%|████████  | 4/5 [04:59<01:17, 77.89s/it]

3


100%|██████████| 5/5 [06:30<00:00, 78.01s/it]
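
The bare numbers printed between the progress bars (2 and 3) are the indices of rows where an exception interrupted scoring, which leaves those rows without scores. One way to reduce such gaps is to retry a failed call with exponential backoff instead of relying only on a fixed random pause. The sketch below is an illustration, not part of the original pipeline: the helper name `call_with_backoff` and its parameters are hypothetical, and catching every `Exception` is a simplification (in practice you would likely retry only on the rate-limit error raised by your client library).

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=4, base_delay=5.0,
                      sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff and jitter.

    The delay doubles after every failed attempt (base_delay, 2x, 4x, ...),
    plus up to one second of random jitter. The last error is re-raised once
    max_retries retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

Inside the evaluation loop, a call such as `call_with_backoff(faithfulness_score, llm_answer=row[...], context=row["section_7"])` would then replace the direct call to `faithfulness_score`. The `sleep` parameter exists only so that the waiting can be swapped out in tests.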


# 10. Review Results
<a id="review_results"></a>

In [32]:
import pandas as pd 

columns = [
    "section_7_summary_overview_of_financial_results_faithfulness_score",
    "section_7_summary_liquidity_and_capital_resources_faithfulness_score",
    "section_7_summary_results_of_operations_faithfulness_score",
    "section_7_summary_critical_accounting_estimates_faithfulness_score",
    "section_7_summary_trends_and_outlook_faithfulness_score",
    "section_7_summary_business_risks_faithfulness_score",
    "section_7_summary_planned_actions_faithfulness_score",
    "section_7_summary_future_goals_faithfulness_score"
]

# calculate the mean faithfulness scores 
mean_faithfulness_scores = edgar_10k_filings[columns].mean()

# count the non-null scores per section (sample size)
sample_sizes = edgar_10k_filings[columns].count()

# combine the means and sample sizes into one summary table
mean_faithfulness_scores = pd.DataFrame({
    'Mean Faithfulness Score': mean_faithfulness_scores,
    'Sample Size': sample_sizes
})

mean_faithfulness_scores

| Score column | Mean Faithfulness Score | Sample Size |
|---|---|---|
| section_7_summary_overview_of_financial_results_faithfulness_score | 1.0 | 3 |
| section_7_summary_liquidity_and_capital_resources_faithfulness_score | 1.0 | 3 |
| section_7_summary_results_of_operations_faithfulness_score | 1.0 | 3 |
| section_7_summary_critical_accounting_estimates_faithfulness_score | 1.0 | 3 |
| section_7_summary_trends_and_outlook_faithfulness_score | 1.0 | 3 |
| section_7_summary_business_risks_faithfulness_score | 1.0 | 3 |
| section_7_summary_planned_actions_faithfulness_score | 1.0 | 3 |
| section_7_summary_future_goals_faithfulness_score | 1.0 | 3 |


# 11. Conclusion
<a id="conclusion"></a>

The results are promising, with an average faithfulness score above 85% in every section of the structured summary.

As part of the next steps, I plan to: 

- add another LLM-as-Judge evaluation metric: Answer Relevance.
- manually compare the summaries with the original text to validate the efficacy of the LLM-as-Judge evaluation strategy.
