### LLM recommendations notebook

In this notebook, I use the downloaded and lightly preprocessed financial statements of ~1490 companies to obtain buy/sell/hold recommendations from Google's Gemini 2.5 Flash model family (the code cells below currently set `gemini-2.5-flash-lite`).

Overall, three functions are employed:
- get_llm_ratings(): For a given CIK, this function loops over the reporting dates, collects the financial statements for each date and formats them as text that can be passed to an LLM.
- get_llm_ratings_with_previous_quarters(): Very similar to the above function, but it additionally includes the income and cash flow statements of the same CIK from up to three previous quarters.
- llm_ratings_loop(): This function loops over an input list of CIKs, calls one of the above functions for each CIK (retrying on server errors), then concatenates and saves the results.

In [None]:
import pandas as pd
import numpy as np 
from google import genai
from google.genai import types
import json
import re
from tqdm import tqdm
import time
from google.genai.errors import ServerError  

In [2]:
# Import Gemini API key
with open("../proton_google_api_key.txt", "r") as f:
    key = f.read().strip()    

# Initialize the Gemini client with the API key
client = genai.Client(api_key = key)

- Read in financial statements

In [4]:
labeled_balance_sheets = pd.read_csv("../data/balance_sheets_with_labels.csv", dtype={"CIK": str})
labeled_income_statements = pd.read_csv("../data/income_statements_with_labels.csv", dtype={"CIK": str})
labeled_cash_flow_statements = pd.read_csv("../data/cash_flow_statements_with_labels.csv", dtype={"CIK": str})


In [5]:
# Minor validity check
bs_uniques = labeled_balance_sheets["CIK"].unique()
is_uniques = labeled_income_statements["CIK"].unique()
cs_uniques = labeled_cash_flow_statements["CIK"].unique()

# Find overlap 
ciks = set(bs_uniques) & set(is_uniques) & set(cs_uniques)
len(ciks)

1490
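
The overlap check above only reports how many CIKs survive. As a minimal sketch, it could be extended to show which CIKs are dropped and from which table. The helper name `cik_coverage` and the toy CIK values are hypothetical and not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the three statement tables (hypothetical CIK values)
toy_bs = pd.DataFrame({"CIK": ["0001", "0002", "0003"]})
toy_is = pd.DataFrame({"CIK": ["0002", "0003", "0004"]})
toy_cf = pd.DataFrame({"CIK": ["0003", "0002"]})


def cik_coverage(**tables):
    """Return the CIKs present in every table and, per table, the CIKs missing elsewhere."""
    sets = {name: set(df["CIK"].unique()) for name, df in tables.items()}
    common = set.intersection(*sets.values())
    dropped = {name: sorted(s - common) for name, s in sets.items()}
    return common, dropped


common, dropped = cik_coverage(bs=toy_bs, is_=toy_is, cf=toy_cf)
print(sorted(common))  # CIKs that have all three statements
print(dropped)         # CIKs that appear in one table but not in all three
```

Applied to the real tables, the `dropped` dictionary would make it easy to see whether one statement type is responsible for most of the lost companies.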

In [11]:
# Convert the set to a DataFrame
cik_df = pd.DataFrame(list(ciks), columns=["CIK"])
ciks1 = cik_df.iloc[:500]
ciks2 = cik_df.iloc[500:1000]
ciks3 = cik_df.iloc[1000:]
len(ciks1) + len(ciks2) + len(ciks3) == len(ciks)

True
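
One caveat with the split above: `ciks` is a set of strings, and the iteration order of such a set can differ between Python sessions, so `ciks1`, `ciks2` and `ciks3` may contain different companies after a kernel restart. A minimal sketch of a reproducible split, assuming it is acceptable to sort the CIKs first (the helper `split_ciks` and the toy CIKs are hypothetical):

```python
import pandas as pd


def split_ciks(ciks, chunk_size=500):
    """Sort the CIKs, then cut them into consecutive DataFrames of at most chunk_size rows."""
    frame = pd.DataFrame(sorted(ciks), columns=["CIK"])
    return [
        frame.iloc[start:start + chunk_size]
        for start in range(0, len(frame), chunk_size)
    ]


# Toy set of 1490 zero-padded CIK strings (made up for illustration)
toy_ciks = {f"{n:010d}" for n in range(1, 1491)}
chunks = split_ciks(toy_ciks)
print([len(chunk) for chunk in chunks])
```

Because the chunks are derived from a sorted list, a failed run of one chunk can be repeated later with exactly the same companies.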

---

Functions to be used in the process

In [None]:
def get_llm_ratings(cik: str, balance_sheets = None, income_statements = None, cash_flow_statements = None):
    
    """
    Function that returns a DataFrame with LLM ratings for a given CIK.
    For every reporting date, the function fetches the most recent financial statements, i.e.
    - Balance Sheet
    - Cash Flow Statement
    - Income Statement
    and calls the LLM to get a buy/sell/hold recommendation. In order to avoid issues with reports that were filed slightly apart,
    a window of 10 days around a given reporting date is used. This helps to ensure that the LLM has access to all relevant financial information for a given reporting date.
    Furthermore, reports that were filed slightly apart will not lead to recommendations that are based on partial information only and will also not cause multiple
    recommendations that only lie within the window of 10 days around a given reporting date.
    
    Parameters: 
    cik: str, CIK of the company (Can be looked up on the SEC website)
    balance_sheets: Balance Sheets DataFrame with columns: STD Balance Sheet All, position_label, FCC Item Name, CIK and Date (among others)
    income_statements: Income Statements DataFrame with columns: STD Income Statement All, position_label, FCC Item Name, CIK and Date (among others)
    cash_flow_statements: Cash Flow Statements DataFrame with columns: STD Cash Flow All, position_label, FCC Item Name, CIK and Date (among others)
    """
    
    # First filter dfs for input CIK
    balance_sheets = balance_sheets[balance_sheets["CIK"] == cik].copy()
    income_statements = income_statements[income_statements["CIK"] == cik].copy()
    cash_flow_statements = cash_flow_statements[cash_flow_statements["CIK"] == cik].copy()
    
    # Convert the date columns to datetime objects
    for df in [balance_sheets, income_statements, cash_flow_statements]:
        df["Report Date"] = pd.to_datetime(df["Date"])

    # Determine unique dates
    reporting_dates = pd.concat([
        balance_sheets["Report Date"],
        income_statements["Report Date"],
        cash_flow_statements["Report Date"]
    ]).unique()

    # Sort dates just to be safe
    reporting_dates = np.sort(reporting_dates)

    # To handle reports that were filed slightly apart, a window of 10 days around a given reporting date is used
    window = pd.Timedelta(days=10)

    # Loop over reporting dates to obtain LLM ratings
    llm_ratings = []
    for date in reporting_dates:

        # Subset all financial statements for the given dates +- window days
        # Current quarter
        bs = balance_sheets[(balance_sheets["Report Date"] >= date - window) & (balance_sheets["Report Date"] <= date + window)]
        is_ = income_statements[(income_statements["Report Date"] >= date - window) & (income_statements["Report Date"] <= date + window)]
        cf = cash_flow_statements[(cash_flow_statements["Report Date"] >= date - window) & (cash_flow_statements["Report Date"] <= date + window)]

        # If any of the DataFrames is empty, skip this date
        if bs.empty or is_.empty or cf.empty:
            continue

        # Concatenate reports into a string with correct labels
        bs_str = "\n".join(bs.apply(lambda row: f"{row['position_label']}: {row['STD Balance Sheet All']}", axis=1).astype(str))
        is_str = "\n".join(is_.apply(lambda row: f"{row['position_label']}: {row['STD Income Statement All']}", axis=1).astype(str))
        cf_str = "\n".join(cf.apply(lambda row: f"{row['position_label']}: {row['STD Cash Flow All']}", axis=1).astype(str))




        # Call the LLM to get the rating
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite", # "gemini-2.5-flash"
            config=types.GenerateContentConfig(
                temperature=0, # Aim for deterministic output
                system_instruction="""You are an experienced, data-driven financial analyst, that provides concise and accurate answers.""",
                
                thinking_config=types.ThinkingConfig(thinking_budget=0),# Disables thinking, but only required for Gemini 2.5
            ),
            
            contents=[f"""
            Based on the following financial reports only, please provide an investment recommendation for the underlying company.
                      
            Balance Sheet: 
            {bs_str}

            Income Statement: 
            {is_str}

            Cash Flow Statement: 
            {cf_str}

            Provide your answer using only one of the following signals: 'strong buy', 'buy', 'hold', 'sell', or 'strong sell'.
            """]
        )

        # Extract rating from the response
        rating = response.text.strip().lower()
        llm_ratings.append({
            "cik": str(cik),  # Ensure CIK is a string
            "date": date,
            "rating": rating
        })
    
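    # If no ratings were generated, return None (llm_ratings_loop skips None results)
    if not llm_ratings:
        return None

    # Convert the list of dictionaries to a DataFrame
    llm_ratings_df = pd.DataFrame(llm_ratings)

    # Keep only the calendar date of each reporting timestamp
    llm_ratings_df["date"] = pd.to_datetime(llm_ratings_df["date"]).dt.date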

    # Sort by Report Date
    llm_ratings_df.sort_values(by="date", inplace=True)

    # Reset index
    llm_ratings_df.reset_index(drop=True, inplace=True)
    
    return llm_ratings_df
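
The function stores `response.text.strip().lower()` as the rating, so any extra wording, quotes or markdown in the model's answer ends up in the `rating` column. Below is a minimal sketch of a clean-up step, assuming the five signals from the prompt are the only valid values. The helper `normalize_rating` is hypothetical and not part of the original notebook. It returns the first allowed signal it finds, which is a simplification for answers that mention more than one.

```python
import re

# Longer labels come first so that "strong buy" is not matched as plain "buy"
_RATING_PATTERN = re.compile(r"\b(strong buy|strong sell|buy|hold|sell)\b")


def normalize_rating(raw_text):
    """Return the first allowed signal found in raw_text, or None if there is none."""
    if raw_text is None:
        return None
    # Lowercase, replace everything except letters and spaces, collapse whitespace
    cleaned = re.sub(r"[^a-z ]+", " ", raw_text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    match = _RATING_PATTERN.search(cleaned)
    return match.group(1) if match else None


examples = ["**Strong Buy**", "'hold'.", "Recommendation: Sell\n", "n/a"]
print([normalize_rating(text) for text in examples])
```

Rows for which the helper returns `None` could be flagged for manual inspection instead of being silently kept.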

In [None]:
def get_llm_ratings_with_previous_quarters(cik: str, balance_sheets = None, income_statements = None, cash_flow_statements = None):
    
    """
    Function that returns a DataFrame with LLM ratings for a given CIK.
    For every reporting date, the function fetches the most recent financial statements, i.e.
    - Balance Sheet
    - Cash Flow Statement
    - Income Statement
    and, where available, the Income Statements and Cash Flow Statements from the three preceding quarters
    (3, 6 and 9 months before the reporting date). It then calls the LLM to get a buy/sell/hold recommendation.
    In order to avoid issues with reports that were filed slightly apart, a window of 10 days around each of these dates is used.
    This helps to ensure that the LLM has access to all relevant financial information for a given reporting date.
    Reporting dates without a current Balance Sheet, Income Statement or Cash Flow Statement are skipped.
    Returns None if no rating could be generated.
    
    Parameters: 
    cik: str, CIK of the company (Can be looked up on the SEC website)
    balance_sheets: Balance Sheets DataFrame with columns: STD Balance Sheet All, position_label, FCC Item Name, CIK and Date (among others)
    income_statements: Income Statements DataFrame with columns: STD Income Statement All, position_label, FCC Item Name, CIK and Date (among others)
    cash_flow_statements: Cash Flow Statements DataFrame with columns: STD Cash Flow All, position_label, FCC Item Name, CIK and Date (among others)
    """
    
    # First filter dfs for input CIK
    balance_sheets = balance_sheets[balance_sheets["CIK"] == cik].copy()
    income_statements = income_statements[income_statements["CIK"] == cik].copy()
    cash_flow_statements = cash_flow_statements[cash_flow_statements["CIK"] == cik].copy()
    
    # Convert the date columns to datetime objects
    for df in [balance_sheets, income_statements, cash_flow_statements]:
        df["Report Date"] = pd.to_datetime(df["Date"])

    # Determine unique dates
    reporting_dates = pd.concat([
        balance_sheets["Report Date"],
        income_statements["Report Date"],
        cash_flow_statements["Report Date"]
    ]).unique()

    # Sort dates just to be safe
    reporting_dates = np.sort(reporting_dates)

    # To handle reports that were filed slightly apart, a window of 10 days around a given reporting date is used
    window = pd.Timedelta(days=10)

    # Loop over reporting dates to obtain LLM ratings
    llm_ratings = []
    for date in reporting_dates:

        # Reference dates of the three preceding quarters (Q-1, Q-2, Q-3)
        qminus1_date = date - pd.DateOffset(months=3)
        qminus2_date = date - pd.DateOffset(months=6)
        qminus3_date = date - pd.DateOffset(months=9)


        # Subset all financial statements for the given dates +- window days
        # Current quarter
        bs = balance_sheets[
            (balance_sheets["Report Date"] >= date - window) &
            (balance_sheets["Report Date"] <= date + window)
        ]
        is_0 = income_statements[
            (income_statements["Report Date"] >= date - window) &
            (income_statements["Report Date"] <= date + window)
        ]
        cf_0 = cash_flow_statements[
            (cash_flow_statements["Report Date"] >= date - window) &
            (cash_flow_statements["Report Date"] <= date + window)
        ]

        # Previous quarter (Q-1)
        is_qminus1 = income_statements[
            (income_statements["Report Date"] >= qminus1_date - window) &
            (income_statements["Report Date"] <= qminus1_date + window)
        ]
        cf_qminus1 = cash_flow_statements[
            (cash_flow_statements["Report Date"] >= qminus1_date - window) &
            (cash_flow_statements["Report Date"] <= qminus1_date + window)
        ]

        # Two quarters ago (Q-2)
        is_qminus2 = income_statements[
            (income_statements["Report Date"] >= qminus2_date - window) &
            (income_statements["Report Date"] <= qminus2_date + window)
        ]
        cf_qminus2 = cash_flow_statements[
            (cash_flow_statements["Report Date"] >= qminus2_date - window) &
            (cash_flow_statements["Report Date"] <= qminus2_date + window)
        ]

        # Three quarters ago (Q-3)
        is_qminus3 = income_statements[
            (income_statements["Report Date"] >= qminus3_date - window) &
            (income_statements["Report Date"] <= qminus3_date + window)
        ]
        cf_qminus3 = cash_flow_statements[
            (cash_flow_statements["Report Date"] >= qminus3_date - window) &
            (cash_flow_statements["Report Date"] <= qminus3_date + window)
        ]

        # If no reports are available for the given date, skip to next date
        if bs.empty or is_0.empty or cf_0.empty:
            print(f"No reports available for date {date}. Skipping...")
            continue
        
        # Concatenate reports into strings with correct labels
        bs_str = "\n".join(bs.apply(lambda row: f"{row['position_label']}: {row['STD Balance Sheet All']}", axis=1).astype(str))
        is_str = "\n".join(is_0.apply(lambda row: f"{row['position_label']}: {row['STD Income Statement All']}", axis=1).astype(str))
        cf_str = "\n".join(cf_0.apply(lambda row: f"{row['position_label']}: {row['STD Cash Flow All']}", axis=1).astype(str))

        # Append previous quarters if available, even if only one of IS or CF is present
        if not is_qminus1.empty:
            is_qminus1_str = "\n".join(is_qminus1.apply(lambda row: f"{row['position_label']}: {row['STD Income Statement All']}", axis=1).astype(str))
            is_str += f"\n\nIncome Statement from previous quarter:\n{is_qminus1_str}"
        if not cf_qminus1.empty:
            cf_qminus1_str = "\n".join(cf_qminus1.apply(lambda row: f"{row['position_label']}: {row['STD Cash Flow All']}", axis=1).astype(str))
            cf_str += f"\n\nCash Flow Statement from previous quarter:\n{cf_qminus1_str}"

        if not is_qminus2.empty:
            is_qminus2_str = "\n".join(is_qminus2.apply(lambda row: f"{row['position_label']}: {row['STD Income Statement All']}", axis=1).astype(str))
            is_str += f"\n\nIncome Statement from two quarters ago:\n{is_qminus2_str}"
        if not cf_qminus2.empty:
            cf_qminus2_str = "\n".join(cf_qminus2.apply(lambda row: f"{row['position_label']}: {row['STD Cash Flow All']}", axis=1).astype(str))
            cf_str += f"\n\nCash Flow Statement from two quarters ago:\n{cf_qminus2_str}"

        if not is_qminus3.empty:
            is_qminus3_str = "\n".join(is_qminus3.apply(lambda row: f"{row['position_label']}: {row['STD Income Statement All']}", axis=1).astype(str))
            is_str += f"\n\nIncome Statement from three quarters ago:\n{is_qminus3_str}"
        if not cf_qminus3.empty:
            cf_qminus3_str = "\n".join(cf_qminus3.apply(lambda row: f"{row['position_label']}: {row['STD Cash Flow All']}", axis=1).astype(str))
            cf_str += f"\n\nCash Flow Statement from three quarters ago:\n{cf_qminus3_str}"
            

        # Call the LLM to get the rating
        response = client.models.generate_content(
           # model="gemini-2.5-flash", 
            model="gemini-2.5-flash-lite",
            config=types.GenerateContentConfig(
                temperature=0, # Aim for deterministic output
                system_instruction="""You are an experienced, data-driven financial analyst, that provides concise and accurate answers.""",
                
                thinking_config=types.ThinkingConfig(thinking_budget=0),# Disables thinking, but only required for Gemini 2.5
            ),
            
            contents=[f"""
            Based on the following financial reports only, please provide an investment recommendation for the underlying company.
                      
            Balance Sheet: 
            {bs_str}

            Income Statement: 
            {is_str}

            Cash Flow Statement: 
            {cf_str}

            Provide your answer using only one of the following signals: 'buy', 'hold' or 'sell'.
            """]
        )

        # Extract rating from the response
        rating = response.text.strip().lower()
        llm_ratings.append({
            "cik": str(cik), 
            "date": date,
            "rating": rating
        })

    # If no ratings were generated, return None
    if not llm_ratings:
        return None
    
    # Convert the list of dictionaries to a DataFrame
    llm_ratings_df = pd.DataFrame(llm_ratings)

    llm_ratings_df["date"] = pd.to_datetime(llm_ratings_df["date"]).dt.date

    # Sort by Report Date
    llm_ratings_df.sort_values(by="date", inplace=True)

    # Reset index
    llm_ratings_df.reset_index(drop=True, inplace=True)
    
    return llm_ratings_df
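
The function above repeats the same date-window filter for every statement and quarter. A minimal sketch of how that filter could be factored into one helper, assuming the `Report Date` column created earlier in the function. The helper `select_window` and the toy data are hypothetical.

```python
import pandas as pd


def select_window(df, center, window=pd.Timedelta(days=10), date_col="Report Date"):
    """Return the rows of df whose date_col lies within center +/- window (inclusive)."""
    return df[df[date_col].between(center - window, center + window)]


# Toy income statement rows (made-up dates and labels)
toy = pd.DataFrame({
    "Report Date": pd.to_datetime(["2023-03-31", "2023-06-30", "2023-07-05", "2023-09-30"]),
    "position_label": ["Revenue", "Revenue", "Net Income", "Revenue"],
})

q0 = pd.Timestamp("2023-09-30")
current = select_window(toy, q0)                              # rows around the reporting date
previous = select_window(toy, q0 - pd.DateOffset(months=3))   # rows around the previous quarter
print(len(current), len(previous))
```

With such a helper, each of the eleven filters in the function would shrink to a single call, which also makes it harder for the window bounds to drift apart between quarters.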

In [None]:
def llm_ratings_loop(
    cik_list,
    balance_sheets,
    income_statements,
    cash_flow_statements,
    function_to_use,
    output_path_ratings="../data/ciklist1_ratings_with_previous_quarters.csv",
    output_path_failed="../data/failed_ciks1.csv",
    retries=5,
    retry_delay=30
):
    """
    Process a list of CIKs to retrieve LLM ratings with retry logic on server errors.

    Args:
        cik_list (pd.DataFrame): DataFrame with a "CIK" column.
        balance_sheets (pd.DataFrame): Balance sheet data.
        income_statements (pd.DataFrame): Income statement data.
        cash_flow_statements (pd.DataFrame): Cash flow data.
        function_to_use (callable): Function to call for each CIK. (Either get_llm_ratings_with_previous_quarters or get_llm_ratings)
        output_path_ratings (str): File path to save ratings CSV.
        output_path_failed (str): File path to save failed CIKs.
        retries (int): Number of retry attempts on server error.
        retry_delay (int): Seconds to wait between retries.

    Returns:
        pd.DataFrame: Combined ratings DataFrame.
    """
    list_ratings = []
    failed_ciks = []
    progress_bar = tqdm(cik_list["CIK"], desc="Processing CIKs")

    for i, cik in enumerate(progress_bar):
        progress_bar.set_description(
            f"Processing CIK {i+1}/{len(cik_list)}: {cik} | Time: {pd.Timestamp.now().strftime('%H:%M:%S')}"
        )

        for attempt in range(retries):
            try:
                ratings = function_to_use(
                    cik,
                    balance_sheets,
                    income_statements,
                    cash_flow_statements
                )
                if ratings is not None:
                    list_ratings.append(ratings)
                break  # Success: leave the retry loop and move on to the next CIK

            except ServerError as e:
                print(f"ServerError for CIK {cik} (Attempt {attempt + 1}/{retries}): {e}")
                if attempt < retries - 1:
                    time.sleep(retry_delay)
                else:
                    # All retries failed: record the CIK and move on to the next one
                    failed_ciks.append(cik)

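    # Save results to CSV (pd.concat raises a ValueError on an empty list)
    if list_ratings:
        cik_ratings_df = pd.concat(list_ratings, ignore_index=True)
    else:
        cik_ratings_df = pd.DataFrame(columns=["cik", "date", "rating"])
    cik_ratings_df.to_csv(output_path_ratings, index=False)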

    # Save failed CIKs to CSV
    if failed_ciks:
        pd.Series(failed_ciks).to_csv(output_path_failed, index=False)

    return cik_ratings_df
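
In the loop above, a server error restarts the whole CIK, so LLM calls that already succeeded for earlier reporting dates are repeated. A minimal sketch of an alternative, assuming the retry is wrapped around a single call instead and the waiting time doubles after each failure. The helper `call_with_retries`, the `FakeServerError` class and the flaky function are hypothetical stand-ins, not the google-genai API.

```python
import time


def call_with_retries(fn, retries=5, base_delay=30.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on an exception listed in retry_on, wait and retry, doubling the delay each time."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # Out of attempts: let the caller record the failure
            sleep(base_delay * 2 ** attempt)


class FakeServerError(Exception):
    """Stand-in for a transient server-side error."""


calls = {"n": 0}


def flaky():
    # Fails on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeServerError("temporarily unavailable")
    return "hold"


waits = []  # Record the delays instead of actually sleeping
result = call_with_retries(flaky, retries=5, base_delay=1.0,
                           retry_on=(FakeServerError,), sleep=waits.append)
print(result, waits)
```

In the notebook, `fn` would be a small function or lambda wrapping the `client.models.generate_content(...)` call and `retry_on` would be `(ServerError,)`.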

---

### Downloading recommendations, including only most recent financial statements

In [None]:
ratings1 = llm_ratings_loop(
    cik_list=ciks1,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings,
    output_path_ratings="../data/new_llm_ratings/ciklist1_ratings.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks1.csv",
    retries=5,
    retry_delay=30)

In [None]:
ratings2 = llm_ratings_loop(
    cik_list=ciks2,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings,
    output_path_ratings="../data/new_llm_ratings/ciklist2_ratings.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks2.csv",
    retries=5,
    retry_delay=30)

In [None]:
ratings3 = llm_ratings_loop(
    cik_list=ciks3,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings,
    output_path_ratings="../data/new_llm_ratings/ciklist3_ratings.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks3.csv",
    retries=5,
    retry_delay=30)

---
### Downloading recommendations, including financial data from previous quarters

In [None]:
ratings_w_prev_quarters1 = llm_ratings_loop(
    cik_list=ciks1,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings_with_previous_quarters,
    output_path_ratings="../data/new_llm_ratings/ciklist1_ratings_w_prev_quarters.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks1_w_prev_quarters.csv",
    retries=5,
    retry_delay=30)

In [None]:
ratings_w_prev_quarters2 = llm_ratings_loop(
    cik_list=ciks2,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings_with_previous_quarters,
    output_path_ratings="../data/new_llm_ratings/ciklist2_ratings_w_prev_quarters.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks2_w_prev_quarters.csv",
    retries=5,
    retry_delay=30)

In [None]:
ratings_w_prev_quarters3 = llm_ratings_loop(
    cik_list=ciks3,
    balance_sheets=labeled_balance_sheets,
    income_statements=labeled_income_statements,
    cash_flow_statements=labeled_cash_flow_statements,
    function_to_use=get_llm_ratings_with_previous_quarters,
    output_path_ratings="../data/new_llm_ratings/ciklist3_ratings_w_prev_quarters.csv",
    output_path_failed="../data/new_llm_ratings/failed_ciks3_w_prev_quarters.csv",
    retries=5,
    retry_delay=30)