# TASK_02
#### Coding Assignment Task # 2
- Dataset: EDGAR SEC filings (public data)
- https://huggingface.co/datasets/eloukas/edgar-corpus
- Language: Python
- Implementation: Pyspark
- Submission: Github repository containing code + plots (jpeg).
- Expected Maximum Duration: 3 hours

Given a set of documents: create a solution that allows the end user to understand the documents in a two dimensional space and to identify outliers.
Task #2 – Gen AI

Dataset

- Year: 2018-2020
- Filing type: 10-K
- Sections: All
- Company: Choose 1.
- Choose 5 data attributes to extract from a single year.

Steps

- Convert documents to chunks
- Convert chunks to embeddings
- Create a query
- Create a prompt to extract data from chunks from a specific year.
- Create a validation dataset (5 true values from chunks).
- Demonstrate that your LLM can retrieve the correct chunks from your embedding object for the correct year

#### Dataset card:
This dataset card is based on the paper EDGAR-CORPUS: Billions of Tokens Make The World Go Round, authored by Lefteris Loukas et al. and published in the ECONLP 2021 workshop.
This dataset contains the annual reports of public companies from 1993-2020 from SEC EDGAR filings.
There is supported functionality to load a specific year.
Care: since this is a corpus dataset, different train/val/test splits do not have any special meaning. It's the default HF card format to have train/val/test splits.
If you wish to load specific year(s) of specific companies, you probably want to use the open-source software which generated this dataset, EDGAR-CRAWLER: https://github.com/nlpaueb/edgar-crawler.

In [1]:
# import os
import regex as re
from collections import defaultdict
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
import colorcet as cc
from scipy.spatial.distance import cdist
from functools import reduce
from pathlib import Path

import datasets
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler as sklearnScaler
from sklearn.decomposition  import PCA as sklearnPCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import silhouette_score


import pyspark
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import size, col, count, when, length, concat_ws, \
                udf, explode, pandas_udf, lit, avg, max as spark_max, min as spark_min, expr
from pyspark.sql.types import ArrayType, StringType, FloatType,  StructType, StructField, DoubleType

from pyspark.ml.linalg import Vectors, VectorUDT, DenseVector
from pyspark.ml.feature import StandardScaler, PCA

from pyspark.ml.functions import vector_to_array


from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import requests
from dotenv import load_dotenv
from openai import OpenAI
import bert_score

import os
load_dotenv()
openai_token = os.getenv("OPENAI_API_KEY")

import spacy
NER = spacy.load("en_core_web_sm")
from math import sqrt

import torch
print(torch.cuda.is_available())




from IPython.display import display, HTML
display(HTML("<style>div.output_scroll { height: 44em; }</style>"))




False


In [2]:
import sys
os.environ['JAVA_HOME'] = r'D:\Softwares\Microsoft\jdk-17.0.15.6-hotspot'
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [3]:
print(sys.executable)

D:\Softwares\Anaconda3\envs\coding_task_venv\python.exe


In [4]:
spark = SparkSession.builder \
    .appName("AIG_SEC_Filing_Analysis") \
    .master("local[8]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.python.worker.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "50") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.execution.pyspark.udf.faulthandler.enabled", "true") \
    .config("spark.python.worker.faulthandler.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")



### Based on the observations:
- The text will retain context if split with "\n"
- It might also help to consider ":", which sometimes marks the start of a sub-section (the text between "\n" and ":" would be the header)
- ";" might signal the end of a contextually similar section
- Each chunk should stay under 512 tokens, or about 2000 characters (to begin with), as I am planning to use BERT-based sentence transformers
- An overlap strategy of 200 characters to preserve context
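
To make the size and overlap settings above concrete, here is a rough sketch of how many chunks a plain fixed-size sliding window would produce for a text of a given length. The helper name `estimate_chunk_count` is hypothetical, and it ignores the newline-based cut points, so a newline-aware splitter will usually produce somewhat different counts.

```python
import math

def estimate_chunk_count(n_chars: int, chunk_size: int = 2000, overlap: int = 200) -> int:
    """Estimate the number of windows for a fixed-size sliding window.

    Assumption: every window is exactly `chunk_size` characters and each
    new window starts `chunk_size - overlap` characters after the last one.
    """
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # 1800 characters with the defaults
    # One window for the first chunk_size characters, then one more
    # window for each started stride of the remaining text.
    return 1 + math.ceil((n_chars - chunk_size) / stride)

for n in (1500, 2000, 3800, 10000):
    print(n, estimate_chunk_count(n))
```

With the defaults, each additional chunk covers 1,800 new characters, so the overlap costs roughly 10% extra embedding work.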

In [5]:
def create_custom_chunks(text: str, overlap: int = 200, chunk_size: int = 2000) -> list:
    if not isinstance(text, str):
        return []
    text = re.sub(r'[\x00-\x09\x0B-\x1F\x7F-\x9F]','', text)
    chunks = []

    
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
        
    prev_chunk_len = 0  
    
    while len(text) > chunk_size:
        right = chunk_size
        
        last_newline_index = text[:chunk_size].rfind("\n")
        
        if last_newline_index > 0:
            right = last_newline_index

        # to ensure chunks have complete words instead of cut words
        # if right > 0 and text[right-1].isalnum() and text[right].isalnum():
        #     last_space = text[:right].rfind(' ')
        #     if last_space > right-50 and last_space > 0:
        #         right = last_space
            
        chunk = text[:right].strip()
        if chunk:
            chunks.append(chunk)
        
        if len(chunks) > 1 and prev_chunk_len < 1000 and len(chunks[-1]) < 1000 :
            chunks[-2]+=" " + chunks[-1]
            chunks = chunks[:-1]
        
        # Guard against an empty list when the first slice was whitespace only
        prev_chunk_len = len(chunks[-1]) if chunks else 0
        
        if right > overlap:
            start = right - overlap
            # Prevent index error
            while start < len(text):
                if start == 0:
                    break
                if not (text[start].isalnum() and text[start-1].isalnum()):
                    break
                start += 1
            text = text[start:]
        else:
            text = text[right:]
            
    if text:
        rem = text.strip()
        if rem:
            chunks.append(rem)        

    return chunks

In [6]:
file_path = [
    "D:\\PracticeProjects\\nlp_10K_filings_EDGAR_corpus\\Data\\data_2018.parquet",
    "D:\\PracticeProjects\\nlp_10K_filings_EDGAR_corpus\\Data\\data_2019.parquet",
    "D:\\PracticeProjects\\nlp_10K_filings_EDGAR_corpus\\Data\\data_2020.parquet"
]

df = spark.read.parquet(*file_path)

In [7]:
df.printSchema()

root
 |-- filename: string (nullable = true)
 |-- cik: string (nullable = true)
 |-- year: string (nullable = true)
 |-- section_1: string (nullable = true)
 |-- section_1A: string (nullable = true)
 |-- section_1B: string (nullable = true)
 |-- section_2: string (nullable = true)
 |-- section_3: string (nullable = true)
 |-- section_4: string (nullable = true)
 |-- section_5: string (nullable = true)
 |-- section_6: string (nullable = true)
 |-- section_7: string (nullable = true)
 |-- section_7A: string (nullable = true)
 |-- section_8: string (nullable = true)
 |-- section_9: string (nullable = true)
 |-- section_9A: string (nullable = true)
 |-- section_9B: string (nullable = true)
 |-- section_10: string (nullable = true)
 |-- section_11: string (nullable = true)
 |-- section_12: string (nullable = true)
 |-- section_13: string (nullable = true)
 |-- section_14: string (nullable = true)
 |-- section_15: string (nullable = true)



In [8]:
df.count()

3

In [9]:
# chunking_udf = udf(create_custom_chunks, ArrayType(StringType()))
@pandas_udf(returnType=ArrayType(StringType()))
def chunk_pandas_udf(series: pd.Series) -> pd.Series:
    # Apply the chunker row by row; each element becomes a list of chunk strings
    return series.apply(create_custom_chunks)



In [10]:
# Get all columns except metadata columns
columns = [column for column in df.columns if column not in ["filename", "cik", "year"]]

# Create chunks for each section column
for column in columns:
    df = df.withColumn(f"{column}_chunked", chunk_pandas_udf(col(column)))

# Explode and collect all sections into unified format
all_chunks = []
for section_name in columns:
    df_section = df.select(
        "cik", 
        "filename",
        "year",
        explode(col(f"{section_name}_chunked")).alias("chunk")
    ).filter(length(col("chunk")) > 15) \
    .withColumn("section_name", lit(section_name)) \
    .withColumn("chunk_id", expr("uuid()"))
    df_section = df_section.select("cik", "filename", "year", "chunk", "section_name","chunk_id")
    
    all_chunks.append(df_section)

# Union all sections into single DataFrame
df_exploded = reduce(DataFrame.unionByName, all_chunks)
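
Because `uuid()` produces a new random value on each run, chunk IDs recorded in a validation set stop matching once the chunks are rebuilt. A minimal sketch of a content-derived alternative follows; the function name `stable_chunk_id` and the example filename are hypothetical, and this is one possible scheme rather than what the notebook does.

```python
import hashlib

def stable_chunk_id(filename: str, section_name: str, chunk: str) -> str:
    """Derive a reproducible ID from the chunk's source and content."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    key = "\x1f".join([filename, section_name, chunk])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Hypothetical values, for illustration only
id_a = stable_chunk_id("example_2020.htm", "section_1", "Some chunk text")
id_b = stable_chunk_id("example_2020.htm", "section_1", "Some chunk text")
id_c = stable_chunk_id("example_2020.htm", "section_1A", "Some chunk text")

print(id_a == id_b)  # same inputs give the same ID
print(id_a == id_c)  # a different section gives a different ID
```

In Spark, the same idea could presumably be expressed with `sha2` over a `concat_ws` of the `filename`, `section_name`, and `chunk` columns in place of `expr("uuid()")`.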



In [11]:
print(df_exploded.columns)

['cik', 'filename', 'year', 'chunk', 'section_name', 'chunk_id']


In [12]:
df_exploded_pd = df_exploded.toPandas()

In [13]:
texts = df_exploded_pd["chunk"].tolist()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)

embeddings = model.encode(
    texts, batch_size=64, show_progress_bar=True, convert_to_numpy=True
)

Batches: 100%|██████████| 5/5 [01:15<00:00, 15.13s/it]


In [14]:
has_nan = np.isnan(embeddings).any()
print(f"Contains NaNs? {has_nan}")

# Check whether any embedding is an all-zero vector
# (a row sum of 0 is not enough, since positive and negative values can cancel)
empty_vectors = ~embeddings.any(axis=1)
print(f"Number of empty embeddings: {np.sum(empty_vectors)}")

Contains NaNs? False
Number of empty embeddings: 0


In [15]:
embeddings.shape

(299, 768)

In [16]:
df_exploded_pd["embeddings"] = embeddings.tolist()

In [17]:
df_cleaned = spark.createDataFrame(df_exploded_pd)

In [18]:
df_cleaned.printSchema()

root
 |-- cik: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- year: string (nullable = true)
 |-- chunk: string (nullable = true)
 |-- section_name: string (nullable = true)
 |-- chunk_id: string (nullable = true)
 |-- embeddings: array (nullable = true)
 |    |-- element: double (containsNull = true)



In [19]:
df_embedding = df_cleaned

In [20]:
df_embedding.select(["year"]).distinct().show()

+----+
|year|
+----+
|2020|
|2018|
|2019|
+----+



In [21]:
def extract_year(text):
   """
   Extract a year from a user query using spaCy NER, with a regex search as fallback.

   Args:
       text (str): Input text from which to extract year information
   
   Returns:
       str or None: Extracted year as string (e.g., "2019") or None if no year found
   """
   if not isinstance(text, str):
       return None
   

   entities = NER(text).ents
   dates = [ent.text for ent in entities if ent.label_ == "DATE"]
   pattern = r'(20\d{2}|19\d{2})'  # Pattern for years 1900-2099
   
   # Process NER-identified date entities first
   if dates:
       for date_text in dates:
           match = re.findall(pattern, date_text)
           if match:
               year = match[0]
               print(f"Extracted year from NER date entity: {year}")
               return year
   
   # direct text pattern matching
   match = re.findall(pattern, text)
   if match:
       # Handle multiple years by selecting the most recent
       years = [int(year) for year in match]
       year = str(max(years))  # Prioritize most recent year
       print(f"Extracted year from direct text search: {year}")
       return year
   
   return None
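
The regex fallback can be looked at in isolation. The sketch below uses the standard-library `re` module (the notebook imports `regex as re`, which accepts the same pattern) and adds word boundaries so that four digits inside a longer number are not mistaken for a year. The name `find_years` is hypothetical.

```python
import re

# Years 1900-2099 that stand alone as a token
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def find_years(text: str) -> list:
    """Return every standalone four-digit year found in the text, in order."""
    return YEAR_PATTERN.findall(text)

print(find_years("How much money was allocated under the CARES Act in 2018 ?"))
print(find_years("Compare results from 2018 to 2020"))
print(find_years("The company issued 120200 shares"))
```

One trade-off of the word boundaries: a token such as `FY2019` no longer matches, because there is no boundary between the letter and the digit.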

In [22]:
def encode_using_sbert(texts: list) -> np.ndarray:
    try:
        # Input validation - check for None, empty, or non-string values
        for i, text in enumerate(texts):
            if text is None:
                raise ValueError(f"Text at index {i} is None")
            if not isinstance(text, str):
                raise ValueError(f"Text at index {i} is not a string: {type(text)}")
            if not text.strip():
                raise ValueError(f"Text at index {i} is empty or whitespace only")
        
        # Use GPU if available, otherwise CPU
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Device: {device}")
    
        # Load sentence transformer model
        model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)

        # Generate embeddings in batches
        embeddings = model.encode(texts, 
                                batch_size=32,         
                                show_progress_bar=True,  
                                convert_to_numpy=True)
        
        # Ensure output length matches input length
        if len(embeddings) != len(texts):
            raise ValueError(f"Length mismatch: input={len(texts)}, output={len(embeddings)}")
        
        if np.isnan(embeddings).any():
            raise ValueError("Embeddings contain NaN values")
        
        empty_vectors = ~embeddings.any(axis=1)  # True for all-zero rows
        if np.sum(empty_vectors) > 0:
            raise ValueError("Embeddings contain empty vectors")
        
        return embeddings
                
    except Exception as e:
        raise e

In [23]:
def return_top_k_context(query: str, query_embedding: np.ndarray, df_embedding: DataFrame, k: int = 3) -> list:
   """
   Retrieve the top-k most relevant document chunks for a given query.
  
   Args:
       query (str): The user's natural language query
       query_embedding (np.array): Numerical representation of the query
       df_embedding (DataFrame): PySpark DataFrame containing document embeddings
       k (int, optional): Number of top chunks to retrieve. Defaults to 3.
   
   Returns:
       list: List of Row objects containing the top-k most relevant chunks with metadata
             including company CIK, year, section name, chunk content, chunk ID, and similarity score
   """
   # Extract YEAR from query using NER
   year = extract_year(query)
   if year:
       filtered_embeddings_df = df_embedding.filter(col("year") == year)
   else:
       filtered_embeddings_df = df_embedding

   # Convert query embedding to Spark ML Vector format for computation
   query_vector = Vectors.dense(query_embedding)

   # Define cosine similarity as a Spark UDF
   @udf(DoubleType())
   def cosine_similarity(vec):
       """
       Compute cosine similarity between document and query vectors.
       """
       if isinstance(vec, list):
           vec = Vectors.dense(vec)
       dot = float(vec.dot(query_vector))
       norm1 = float(vec.norm(2))
       norm2 = float(query_vector.norm(2))
       return dot / (norm1 * norm2) if norm1!=0 and norm2 !=0 else 0.0
   
   # Apply similarity computation and rank results
   df_result = filtered_embeddings_df.withColumn("score", cosine_similarity(col("embeddings")))
   # Retrieve top-k chunks with comprehensive metadata
   top_k_chunks = df_result.orderBy(col("score").desc()).select("cik","year", "section_name","chunk","chunk_id", "score").take(k)

   return top_k_chunks
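
The scoring rule inside the UDF is plain cosine similarity followed by a descending sort. The sketch below reproduces that rule in pure Python on toy two-dimensional vectors, which makes the ranking easy to check by hand. The names `cosine` and `top_k` and the toy rows are hypothetical; real embeddings here have 768 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, rows, k=3):
    """Score (chunk_id, vector) pairs against the query and keep the best k."""
    scored = [(cosine(vec, query_vec), chunk_id) for chunk_id, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data: "a" points the same way as the query, "c" is at 45 degrees
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0]), ("z", [0.0, 0.0])]
ranked = top_k([1.0, 0.0], rows, k=2)
print(ranked)
```

For a corpus of a few hundred chunks, an in-memory ranking like this is likely faster than a Spark job; the Spark version mainly pays off at larger scale.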

In [24]:
def build_prompt(question: str, context: str, query_type: str = "general") -> str:
   """
   This method creates tailored prompts that guide the LLM to provide appropriate responses
   based on the type of financial query (financial, risk, operational, regulatory, or general).
   
   Args:
       question (str): The user's question about the SEC filing
       context (str): Relevant document context retrieved from the knowledge base
       query_type (str, optional): Classification of the query type.
   
   Returns:
       str: A formatted prompt string optimized for the specific query type
   """
   
   # Base instructions
   base_instruction = """
You are an expert financial analyst reviewing SEC 10-K filings. 
Analyze the provided context to answer the specific question below.

INSTRUCTIONS:
- Base your answer ONLY on the information provided in the context
- If the answer is not found in the context, respond with "not found"
- Be precise and cite specific sections when possible
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.)
- If multiple relevant pieces of information exist, provide all of them
- If the context does not provide information to answer the question, reply with just 'not found'
"""

   # Specialized instructions for different query categories
   query_instructions = {

       "financial": """
- Focus on financial metrics, dollar amounts, and quantitative data
- Include timeframes and comparison periods where mentioned
- Note any significant changes or trends
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.). Nothing else is needed.
""",
       "risk": """
- Identify specific risk factors and their potential impacts
- Distinguish between current risks and future uncertainties
- Note any risk mitigation strategies mentioned
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.). Nothing else is needed.
""",
       "operational": """
- Focus on business operations, processes, and organizational changes
- Include geographic locations and business segments
- Note any operational improvements or challenges
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.). Nothing else is needed.
""",
       "regulatory": """
- Identify regulatory requirements and compliance matters
- Note any legal proceedings or regulatory changes
- Include relevant dates and jurisdictions
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.). Nothing else is needed.
""",
       "general": """
- Provide comprehensive information relevant to the question
- Include supporting details and context
- Note any qualifications or limitations mentioned
- For questions on numerical values and amounts, respond with just the exact amounts with units (millions, billions, etc.). Nothing else is needed.
"""
   }
   
   # Final prompt
   context_instruction = f"""
CONTEXT FROM SEC 10-K FILING:
{context}

QUESTION: {question}

ANALYSIS or ANSWER:"""
   
   # Combine all components into a complete prompt
   full_prompt = base_instruction + query_instructions.get(query_type, query_instructions["general"]) + context_instruction
   
   return full_prompt

In [25]:
query_list  = [
    {
        "query": "What is the outstanding amount of Subordinated Debentures in 2020?",
    },
    {
        "query": "How much money was allocated under the HHSB Act in 2020 ?",
    },
    {
        "query": "What is the FDIC's standard maximum deposit insurance amount in 2020 ?",
    },
    {
        "query": "What amount of wholesale deposit was purchased in 2018 ?",
     },
    {
        "query": "In the year 2020, what were the general risk that the company was subjected to ?",
    },
    {
        "query": "How much money was allocated under the CARES Act in 2018 ?",
    },
    {
        "query": "As of the year 2019, describe the elements of an extensive regulatory framework which are applicable to bank holding companies and banks ?",
    }
]

In [26]:
queries = [query["query"] for query in query_list if query["query"]]

In [27]:
queries

['What is the outstanding amount of Subordinated Debentures in 2020?',
 'How much money was allocated under the HHSB Act in 2020 ?',
 "What is the FDIC's standard maximum deposit insurance amount in 2020 ?",
 'What amount of wholesale deposit was purchased in 2018 ?',
 'In the year 2020, what were the general risk that the company was subjected to ?',
 'How much money was allocated under the CARES Act in 2018 ?',
 'As of the year 2019, describe the elements of an extensive regulatory framework which are applicable to bank holding companies and banks ?']

In [28]:
query_embeddings = encode_using_sbert(queries)

Device: cpu


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.97it/s]


In [29]:
query_embeddings.shape

(7, 768)

In [30]:
context = []
for index, query in enumerate(query_list):
   # Retrieve top-k relevant chunks for current query
   context.append(return_top_k_context(
       query["query"],
       query_embeddings[index],
       df_embedding)
                 )

Extracted year from NER date entity: 2020
Extracted year from NER date entity: 2020
Extracted year from NER date entity: 2020
Extracted year from NER date entity: 2018
Extracted year from NER date entity: 2020
Extracted year from NER date entity: 2018
Extracted year from NER date entity: 2019


In [31]:
len(context)

7

In [32]:
for index, query in enumerate(query_list):
   query["context"] = context[index]


In [33]:
query_list[0]

{'query': 'What is the outstanding amount of Subordinated Debentures in 2020?',
 'context': [Row(cik='718413', year='2020', section_name='section_1A', chunk='assets and a reduction in our income. At the same time, the marketability of the property securing a loan may be adversely affected by any reduced demand resulting from higher interest rates.\nIn addition, increases in interest rates will increase the dividend rate on our Series A preferred stock, which is tied to the prime rate, and the interest rate on our debentures, which is tied to LIBOR. Furthermore, as discussed below, there is uncertainty as to our future interest costs on our subordinated debentures due to the scheduled phase out of LIBOR at the end of 2021. Higher preferred stock dividend payments and debenture interest costs would decrease the amount of funds available for payment of dividends on our common stock.\nOur interest costs may increase as a result of the retirement of London Interbank Offered Rate (“LIBOR”) a

In [34]:
responses = []

In [35]:
llm_client = OpenAI()

In [36]:
for query in query_list:
   # Combine retrieved chunks into a single context string
   context = "\n\n".join([row["chunk"] for row in query["context"]])
   question = query["query"]

   # Query Type Classification
   query_type_prompt = f"""
   This is a question on the SEC 10-K filing of a company: {question}
   As an expert on SEC 10-K forms, categorize the question into one of the categories below:
   financial, risk, operational, regulatory and general. No other categories exist. If it fits none of them, reply 'general'.
   Answer with only one of the five categories given above, in lower case.
   Answer:"""
   
   # API call for query classification
   try:
       response = llm_client.chat.completions.create(
           model = "gpt-4",
           messages=[{"role": "system", "content": "You are a financial analyst specializing in SEC filing analysis."},
                     {"role":"user", "content":query_type_prompt}
                     ],
           max_tokens=10,
       )
       # Normalize whitespace and case so the label matches the build_prompt keys
       query_type = (response.choices[0].message.content or "").strip().lower()
   except Exception:
       print("API call failed for query type")
       raise

   # Default to 'general' if the label is not one of the keys used by build_prompt
   if query_type not in ["financial", "risk", "operational", "regulatory", "general"]:
       query_type = "general"

   # Response Generation
   prompt = build_prompt(question, context, query_type)

   # API call for final response generation
   try:
       response = llm_client.chat.completions.create(
           model="gpt-4",
           messages=[
               {"role": "system", "content": "You are a financial analyst specializing in SEC filing analysis."},
               {"role": "user", "content": prompt}
           ],
           temperature=0.1,
           max_tokens=250,
       )
       answer = response.choices[0].message.content
   except Exception as e:
       print("API call failed")
       raise(e)

   # default to 'not found'
   if answer:
       if answer.lower() == "not found":
           answer = 'not found'
       responses.append(answer)
   else:
       responses.append('not found')
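
The classification step depends on the model returning a label that exactly matches one of the keys in `build_prompt` (`financial`, `risk`, `operational`, `regulatory`, `general`); anything else silently falls back to the general instructions. A small sketch of a more forgiving normalization follows. The helper `normalize_query_type` and the alias table are hypothetical additions, not part of the notebook.

```python
# Keys taken from the query_instructions dictionary in build_prompt
PROMPT_KEYS = {"financial", "risk", "operational", "regulatory", "general"}

# Assumed near-miss spellings a model might return
ALIASES = {"operation": "operational", "operations": "operational"}

def normalize_query_type(raw):
    """Map a raw model reply to one of the prompt keys, defaulting to 'general'."""
    if not raw:
        return "general"
    # Drop surrounding whitespace, quotes, and a trailing period
    label = raw.strip().strip(".'\"").lower()
    label = ALIASES.get(label, label)
    return label if label in PROMPT_KEYS else "general"

for reply in [" Regulatory.\n", "operation", "'risk'", "market", None]:
    print(repr(reply), "->", normalize_query_type(reply))
```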

In [37]:
responses

['$12,887,000',
 '$284.45 billion',
 "The FDIC's standard maximum deposit insurance amount in 2020 was $250,000.",
 '$35.3 million',
 "The general risks that the company was subjected to in 2020 include:\n\n1. Economic and external risks due to the COVID-19 pandemic. This includes disruptions to the company, its customers, employees, and third-party service providers. The impacts of the pandemic could significantly affect the company's business, financial condition, results of operations, and prospects. Specific risks related to the pandemic include increased allowance for loan losses, declines in collateral value, impaired ability of borrowers and loan guarantors to honor commitments, reduced demand due to economic downturn, employee illness, reduced operating effectiveness due to remote work, business interruptions, unavailability of key personnel, effects on key employees, branch closures, declines in demand for loans and banking services, increased unemployment, reduced consumer sp

## Validation

In [38]:
validation = [
    {
        "query": "What is the outstanding amount of Subordinated Debentures in 2020?",
        "expected_value": "$12,887,000",
        "year": "2020",
        "chunk_id": "6bb74a17-d3dc-4a22-91c8-a37afe5ac18e",
        "section": "item_8"
    },
    {
        "query": "How much money was allocated under the HHSB Act in 2020 ?",
        "expected_value": "$284.45 billion",
        "year": "2020",
        "chunk_id": "d4650e17-66c5-44f7-90a1-abdd5287b7cc",
        "section": "section_1"
    },
    {
        "query": "What is the FDIC's standard maximum deposit insurance amount in 2020 ?",
        "expected_value": "$250,000",
        "year": "2020",
        "chunk_id": "13b7893f-f441-4fba-a6b1-54ee5c9e3ca0",
        "section": "section_1"
    },
    {
        "query": "What amount of wholesale deposit was purchased in 2018 ?",
        "expected_value": "$35.3 million",
        "year": "2018",
        "chunk_id": "d04471ec-a841-49a9-b656-959f2866e6bd",
        "section": "section_1"
     },
     {
        "query": "In the year 2020, what were the general risk that the company was subjected to ?",
        "expected_value": """In 2020, the company was subject to several general risks stemming from competitive pressures, technological changes, and evolving consumer behavior in the financial services industry. One major risk was the shift in how financial services were delivered. The increasing use of online and mobile banking reduced reliance on traditional branch facilities. This trend posed a threat to the company's branch-based service model, requiring continuous evaluation of its physical infrastructure. While closing underperforming branches could improve efficiency, it also carried the risk of incurring restructuring charges and damaging customer relationships.
Additionally, the company faced substantial competition from a wide range of financial and nonfinancial entities. These included local community banks, large national banks, credit unions with tax advantages, and non-bank financial service providers. Many of these competitors, especially larger institutions, had superior capital resources, more advanced technology, and stronger marketing capabilities. This gave them a competitive edge in attracting customers and offering more flexible or cost-effective financial products and services.
Technological advancements also introduced new competitive threats. Financial transactions could increasingly be completed electronically without the need for a bank’s physical presence or even bank involvement at all. Consumers now had the option to transfer funds and pay bills online or by phone, using fintech services that often operated with lower costs and lighter regulatory burdens. This trend placed pressure on the company’s fee income, deposit levels, and income derived from those deposits.
Furthermore, the emergence of out-of-market competitors offering digital-first solutions created new challenges in customer retention and acquisition. Collectively, these risks threatened the company’s market share, profitability, and long-term viability if it failed to adapt to the rapidly changing financial services landscape.
""",
        "year": "2018",
        "chunk_id": "36665ebd-9caa-40e1-af21-c258dc416f55",
        "section": "section_1A"
    },
    {
        "query": "How much money was allocated under the CARES Act in 2018 ?",
        "expected_value": "not found",
        "year": "2018",
        "chunk_id": None,
        "section": None
    },
    {
        "query": "As of the year 2019, describe the elements of an extensive regulatory framework which are applicable to bank holding companies and banks ?",
        "expected_value": """
Based on the 2019 SEC 10-K filing, the extensive regulatory framework for bank holding companies and banks is primarily designed to protect depositors and the FDIC Deposit Insurance Fund rather than shareholders or creditors. The framework is dominated by the Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010, which comprehensively restructured financial services regulation.
Key elements include the establishment of the Consumer Financial Protection Bureau (CFPB) with centralized authority over federal consumer financial laws, though smaller banks like the Company are enforced by agencies such as the OCC rather than the CFPB directly. The Act applies uniform leverage and risk-based capital requirements to bank holding companies on a consolidated basis, prohibiting the use of additional trust preferred securities as Tier 1 capital while grandfathering existing ones.
Additional provisions mandate reasonable and proportional debit card interchange fees for large institutions, eliminate exclusivity arrangements between issuers and networks, and require public companies to hold shareholder votes on executive compensation matters, including "say on executive pay" and "say on frequency" votes.
Banking operates as a highly regulated business where federal and state law changes are frequent and unpredictable. While many Dodd-Frank provisions target systemically significant large institutions, they indirectly affect smaller banking organizations through competitive pressures and regulatory standards, creating an environment where company earnings are significantly influenced by economic conditions, management policies, and regulatory authority actions.
        """,
        "year": "2019",
        "chunk_id": "66bb2611-a118-4d4e-ac68-009d8b2b8644",
        "section": "section_1"
    }
]

In [39]:
results = []
correct_answers = 0
correct_chunks = 0
top1_chunks = 0
bert_scores = []

In [40]:
def bert_f1(reference, generated):
   _, _, F1 = bert_score.score([generated], [reference], lang='en')
   return F1.item()

In [41]:
for i, (query, response) in enumerate(zip(query_list, responses)):
   expected = validation[i]['expected_value']
   expected_chunk = validation[i]['chunk_id']
   
   chunk_ids = [row["chunk_id"] for row in query["context"]]
   scores = [row["score"] for row in query["context"]]
   
   # Evaluating answer for keyword match vs semantic match
   use_bert = len(expected) > 15 and '$' not in expected
   if use_bert:
       bert_f1_response = bert_f1(expected, response)
       bert_scores.append(bert_f1_response)
       answer_correct = bert_f1_response > 0.8
   else:
       bert_f1_response = None
       answer_correct = expected.strip() == response.strip()
   
   # Evaluating chunks
   # Only meaningful if the validation chunk IDs come from the same chunking run,
   # because the uuid() chunk IDs are regenerated every time the chunks are rebuilt
   chunk_correct = expected_chunk in chunk_ids if expected_chunk else True
   top1_correct = chunk_ids[0] == expected_chunk if chunk_ids and expected_chunk else False
   
   # Count metrics
   correct_answers += answer_correct
   correct_chunks += chunk_correct
   top1_chunks += top1_correct
   
   # Store results
   results.append({
       'Query': query['query'],
       'Expected_Answer': expected,
       'Generated_Answer': response,
       'Expected_Chunk_ID': expected_chunk,
       'Retrieved_Chunk_IDs': ', '.join(chunk_ids),
       'Answer_Correct': answer_correct,
       'Chunk_Retrieved': chunk_correct,
       'Top1_Chunk_Match': top1_correct,
       'BERT_F1_Score': bert_f1_response,
       'Avg_Retrieval_Score': f"{sum(scores)/len(scores):.3f}" if scores else "0.000"
   })
   
   # Print details
   print(f"\nQuery: {query['query']}")
   print(f"Expected: {expected}")
   print(f"Generated: {response}")
   print(f"Expected Chunk: {expected_chunk}")
   print(f"Retrieved Chunks: {chunk_ids}")
   print(f"Retrieval Scores: {[f'{s:.3f}' for s in scores]}")
   if bert_f1_response is not None:
       print(f"BERT F1: {bert_f1_response:.3f}")
   print(f"Answer Correct: {answer_correct}")


Query: What is the outstanding amount of Subordinated Debentures in 2020?
Expected: $12,887,000
Generated: $12,887,000
Expected Chunk: 6bb74a17-d3dc-4a22-91c8-a37afe5ac18e
Retrieved Chunks: ['972a00bc-b917-45af-8690-ee685107f9e9', '6a89ecda-bb77-469f-91ec-132a43ca7081', 'b8c9e312-7209-4f97-80b8-083c7d481879']
Retrieval Scores: ['0.463', '0.441', '0.373']
Answer Correct: True

Query: How much money was allocated under the HHSB Act in 2020 ?
Expected: $284.45 billion
Generated: $284.45 billion
Expected Chunk: d4650e17-66c5-44f7-90a1-abdd5287b7cc
Retrieved Chunks: ['41857213-f237-424b-a5e2-4b91f493e4eb', 'a990e7c6-37d0-4b5a-b487-5c7240d66020', '1e6d5bc3-f0cc-4189-a35c-85c4a379a226']
Retrieval Scores: ['0.478', '0.454', '0.391']
Answer Correct: True

Query: What is the FDIC's standard maximum deposit insurance amount in 2020 ?
Expected: $250,000
Generated: The FDIC's standard maximum deposit insurance amount in 2020 was $250,000.
Expected Chunk: 13b7893f-f441-4fba-a6b1-54ee5c9e3ca0
Retrie

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Query: In the year 2020, what were the general risk that the company was subjected to ?
Expected: In 2020, the company was subject to several general risks stemming from competitive pressures, technological changes, and evolving consumer behavior in the financial services industry. One major risk was the shift in how financial services were delivered. The increasing use of online and mobile banking reduced reliance on traditional branch facilities. This trend posed a threat to the company's branch-based service model, requiring continuous evaluation of its physical infrastructure. While closing underperforming branches could improve efficiency, it also carried the risk of incurring restructuring charges and damaging customer relationships.
Additionally, the company faced substantial competition from a wide range of financial and nonfinancial entities. These included local community banks, large national banks, credit unions with tax advantages, and non-bank financial service providers



Query: As of the year 2019, describe the elements of an extensive regulatory framework which are applicable to bank holding companies and banks ?
Expected: 
Based on the 2019 SEC 10-K filing, the extensive regulatory framework for bank holding companies and banks is primarily designed to protect depositors and the FDIC Deposit Insurance Fund rather than shareholders or creditors. The framework is dominated by the Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010, which comprehensively restructured financial services regulation.
Key elements include the establishment of the Consumer Financial Protection Bureau (CFPB) with centralized authority over federal consumer financial laws, though smaller banks like the Company are enforced by agencies such as the OCC rather than the CFPB directly. The Act applies uniform leverage and risk-based capital requirements to bank holding companies on a consolidated basis, prohibiting the use of additional trust preferred securities as 
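
The keyword branch in the loop above compares `expected.strip()` with `response.strip()`, so a generated answer that wraps the expected value in a full sentence (as in the FDIC query output) does not pass an exact comparison. A minimal sketch of a more lenient check follows. The helper names `normalize` and `lenient_match` are hypothetical and not part of the notebook, and substring matching is an assumption that can over-match when the expected value is a bare number.

```python
import re


def normalize(text):
    """Lowercase, collapse runs of whitespace, and drop trailing periods."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".")


def lenient_match(expected, generated):
    """Return True when the normalized expected value occurs inside the generated answer."""
    return normalize(expected) in normalize(generated)


# Values taken from the printed output above
print(lenient_match("$12,887,000", "$12,887,000"))
print(lenient_match(
    "$250,000",
    "The FDIC's standard maximum deposit insurance amount in 2020 was $250,000.",
))
```

Keeping the leading `$` in the expected value reduces false positives: `$250,000` does not occur inside `$1,250,000`, whereas a bare `250,000` would.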

In [42]:
# metrics
total = len(results)  # number of query/response pairs evaluated in the loop above
metrics = {
   'Answer_Accuracy': f"{correct_answers/total:.1%}",
   'Chunk_Recall': f"{correct_chunks/total:.1%}",
   'Top1_Chunk_Accuracy': f"{top1_chunks/total:.1%}",
   'Avg_BERT_F1': f"{sum(bert_scores)/len(bert_scores):.3f}" if bert_scores else "N/A",
   'BERT_Pass_Rate': f"{sum(1 for s in bert_scores if s > 0.8)/len(bert_scores):.1%}" if bert_scores else "N/A"
}

# Print summary
print(f"\n{'='*50}")
print("EVALUATION SUMMARY\n")
print(f"{'='*50}")
for key, value in metrics.items():
   print(f"{key}: {value}")
print(f"{'='*50}")


EVALUATION SUMMARY

Answer_Accuracy: 85.7%
Chunk_Recall: 14.3%
Top1_Chunk_Accuracy: 0.0%
Avg_BERT_F1: 0.836
BERT_Pass_Rate: 100.0%
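
Chunk_Recall and Top1_Chunk_Accuracy in the summary are plain ratios over the validation queries: the share of queries whose expected chunk appears anywhere in the retrieved list, and the share where it is ranked first. The sketch below recomputes both on toy data so the definitions can be checked in isolation. The function name `retrieval_metrics` and the chunk IDs are hypothetical, and unlike the loop above it counts a query without an expected chunk as a miss rather than a hit.

```python
def retrieval_metrics(expected_ids, retrieved_lists):
    """Compute chunk recall and top-1 accuracy over paired expected and retrieved chunk IDs."""
    total = len(expected_ids)
    if total == 0:
        return {"chunk_recall": 0.0, "top1_accuracy": 0.0}
    # A hit means the expected chunk is anywhere in the retrieved list
    hits = sum(exp in got for exp, got in zip(expected_ids, retrieved_lists))
    # A top-1 hit means the expected chunk is the highest-ranked result
    top1 = sum(bool(got) and got[0] == exp
               for exp, got in zip(expected_ids, retrieved_lists))
    return {"chunk_recall": hits / total, "top1_accuracy": top1 / total}


# Toy example with made-up chunk IDs
expected = ["a", "b", "c", "d"]
retrieved = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"], []]
print(retrieval_metrics(expected, retrieved))
```

If the validation chunk IDs come from a different chunking run than the one the retriever searches, both ratios fall toward zero even when the answers are correct, which is one possible reading of the gap between Answer_Accuracy and Chunk_Recall above.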
