# LLM Applications

---
### What you'll build
1. **Earnings calls:** Compare two consecutive transcripts (FedEx) and extract **surprising changes** vs prior call.
2. **FOMC press conference:** Surface statements most likely to **surprise markets**.
3. **Forecasting pitfalls:** Demonstrate **leakage** and why naive LLM+ML pipelines can overstate accuracy.

> This notebook assumes an internet connection. It includes robust fallbacks (short excerpts) so it still runs if websites block scraping during class.

## Learning Objectives
- Use **Gemini (AI Studio)** from Python to classify/extract structured insights from financial text.
- Combine **statistical novelty** (TF‑IDF) with **LLM judgment** to find *surprising* statements.
- Run a small **FOMC surprise** detector with hawkish/dovish cues + LLM vetting.
- See a concrete **look‑ahead bias** failure case and how to fix it.

## 0) Setup
**Gemini API (AI Studio) is free for prototyping** in supported regions. Create a key in AI Studio and set it as `GEMINI_API_KEY`.
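A minimal sketch of reading the key from the environment instead of pasting it into a cell. `get_gemini_key` is a hypothetical helper name, not part of any SDK; it only assumes the key was exported as `GEMINI_API_KEY` as described above.

```python
import os

def get_gemini_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Create a key in AI Studio and export it "
            "before starting the notebook."
        )
    return key
```

Failing early with a clear message is easier to debug in class than an authentication error from the first model call.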

### Install (if needed)

In [1]:
# If running on a fresh environment, uncomment:
#%pip install google-generativeai google-genai pandas numpy matplotlib scikit-learn beautifulsoup4 requests tqdm
%pip install pdfminer.six pypdf


Collecting pdfminer.six
  Downloading pdfminer_six-20251107-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdf
  Downloading pypdf-6.4.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pdfminer_six-20251107-py3-none-any.whl (5.6 MB)
Downloading pypdf-6.4.0-py3-none-any.whl (329 kB)
Installing collected packages: pypdf, pdfminer.six
Successfully installed pdfminer.six-20251107 pypdf-6.4.0


In [2]:
import os, re, json, time, math, warnings, textwrap
warnings.filterwarnings('ignore')
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai


Versions:
Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
google-generativeai 0.8.5


## 1) Utilities: fetch & preprocess transcripts
These helpers try to download real transcripts. If blocked, they fall back to short built‑in excerpts (for demo only).

In [3]:
from bs4 import BeautifulSoup
import requests, re
HEADERS = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36'}

def fetch_text(url, min_len=2000, timeout=15):
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        if r.status_code!=200:
            return None
        soup = BeautifulSoup(r.text, 'html.parser')
        for tag in soup(['script','style','noscript']): tag.extract()
        txt = ' '.join(soup.get_text('\n').split())
        return txt if len(txt)>=min_len else None
    except Exception:
        return None

def split_sentences(text):
    s = re.split(r'(?<=[.!?])\s+(?=[A-Z\[])', text)
    return [x.strip() for x in s if len(x.strip())>0]






import io  # re, requests, and HEADERS are already defined earlier in this cell

def fetch_pdf_bytes(url, timeout=40):
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    print(f"Fetched {url} | status={r.status_code} | bytes={len(r.content)} | type={r.headers.get('Content-Type')}")
    return r.content

def pdf_bytes_to_text(data: bytes) -> str:
    # Try pdfminer.six
    try:
        from pdfminer.high_level import extract_text
        return extract_text(io.BytesIO(data))
    except Exception as e:
        print("pdfminer.six failed:", repr(e))
    # Fallback: PyPDF
    try:
        from pypdf import PdfReader
        reader = PdfReader(io.BytesIO(data))
        return "\n".join((p.extract_text() or "") for p in reader.pages)
    except Exception as e:
        print("PyPDF failed:", repr(e))
        return ""

def parse_quarter_key(url: str):
    # Prefer explicit "Q4-FY25" pattern
    m = re.search(r'Q([1-4])[-_]?FY(\d{2,4})', url, re.I)
    if m:
        q = int(m.group(1)); fy = int(m.group(2))
        fy = fy + 2000 if fy < 100 else fy
        return (fy, q)
    # Fallback "/2025/q4/" pattern in path
    m2 = re.search(r'/(\d{4})/q([1-4])/', url, re.I)
    if m2:
        return (int(m2.group(1)), int(m2.group(2)))
    return (0, 0)  # unknown → sorts earliest
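`parse_quarter_key` returns `(0, 0)` for any URL without a quarter marker, which includes the FOMC PDFs used in section 3, so sorting those with it simply preserves input order. A minimal sketch of a date-based key, assuming the file name carries an eight-digit `YYYYMMDD` stamp; `parse_date_key` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import date

def parse_date_key(url):
    """Return the date from a standalone 8-digit YYYYMMDD stamp, or date.min if none."""
    # Require exactly eight digits so longer numeric IDs in the path are ignored
    m = re.search(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)", url)
    if not m:
        return date.min
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:  # eight digits that are not a real calendar date
        return date.min

urls = [
    "https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf",
    "https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf",
]
for u in sorted(urls, key=parse_date_key):
    print(parse_date_key(u), u)
```

URLs without a date stamp sort first, mirroring how `parse_quarter_key` treats unknown quarters.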

## 2) Earnings Call: What’s surprising vs the previous call? (FedEx)
**Method:** TF‑IDF novelty (current vs prior) → top candidates → Gemini JSON classification (surprise category/direction/relevance).

In [4]:
# Ensure chronological order (oldest -> newest), then take the last two

fedex_urls = [
  'https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf',
  'https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf'
]
fedex_urls_sorted = sorted(fedex_urls, key=parse_quarter_key)
prev_url, curr_url = fedex_urls_sorted[-2], fedex_urls_sorted[-1]
print("Prev URL:", prev_url)
print("Curr URL:", curr_url)

# Extract text
prev_text = pdf_bytes_to_text(fetch_pdf_bytes(prev_url))
curr_text = pdf_bytes_to_text(fetch_pdf_bytes(curr_url))



print('Lengths -> prev:', len(prev_text), 'curr:', len(curr_text))
print("\n--- preview prev_text ---\n", prev_text[:600])
print("\n--- preview curr_text ---\n", curr_text[:600])

Prev URL: https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf
Curr URL: https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf
Fetched https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf | status=200 | bytes=762180 | type=application/pdf
Fetched https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf | status=200 | bytes=192666 | type=application/pdf
Lengths -> prev: 65222 curr: 57980

--- preview prev_text ---
 FedEx Q3 FY25 Earnings Call Transcript – March 20, 2025 

Jenifer Hollander 
Vice President-Investor Relations, FedEx Corp. 

Good afternoon, and welcome to FedEx Corporation's third quarter earnings conference call. The third quarter earnings 
release, Form 10-Q and stat book are on our website at investors.fedex.com. This call and the accompanying slides are 
being streamed 

In [5]:
# 2.2 Novelty scoring
prev_s = split_sentences(prev_text)
curr_s = split_sentences(curr_text)
vec = TfidfVectorizer(stop_words='english', max_features=20000)
Xp = vec.fit_transform(prev_s)
Xc = vec.transform(curr_s)
sims = cosine_similarity(Xc, Xp).max(axis=1)
nov = 1 - sims
df_curr = pd.DataFrame({'sentence': curr_s, 'novelty': nov}).sort_values('novelty', ascending=False)
df_curr.head(10)


Unnamed: 0,sentence,novelty
527,"So, condolences to family, friends and colleag...",1.0
487,"So, apologize for that slight delay.",1.0
353,And that U.S.,1.0
280,It \nruns across and is part of our culture here.,1.0
6,But Fred was a man grounded by a mission.,1.0
3,It feels strange to be here with you all so \n...,1.0
5,Smith.,1.0
74,We continue to apply our digital platform-base...,0.831813
75,These solutions support a wide range \nof stak...,0.828557
541,The scale of FedEx comes into play in these ki...,0.82825
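Most of the rows with novelty 1.0 above are short fragments ("Smith.", "And that U.S.") that share no vocabulary with the prior call, not substantive statements. A minimal sketch of one way to screen them out before ranking, assuming a word-count threshold is acceptable; `keep_substantive` and `min_words=6` are illustrative choices, not part of the notebook.

```python
import re

def keep_substantive(sentences, scores, min_words=6):
    """Pair each sentence with its score, dropping short fragments.

    Whitespace (including PDF line breaks) is collapsed before counting words.
    Returns (sentence, score) pairs sorted by score, highest first.
    """
    kept = []
    for sent, score in zip(sentences, scores):
        clean = re.sub(r"\s+", " ", sent).strip()
        if len(clean.split()) >= min_words:
            kept.append((clean, float(score)))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Example with made-up scores; in the notebook the inputs would be curr_s and nov.
demo = keep_substantive(
    ["Smith.", "And that U.S.", "We expect \nrevenue growth to slow in the first quarter."],
    [1.0, 1.0, 0.83],
)
print(demo)
```

A length filter is only a partial fix: longer off-topic sentences still pass, which is one reason the LLM step is used to vet candidates.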


### Let's initialize the LLM

In [7]:

from google import genai
import os
import json
import pandas as pd

# Read the key from the environment (set GEMINI_API_KEY as described in Setup);
# never hard-code an API key in a notebook
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


### Putting the LLM to work

Each request has two pieces:

1. **The how (system instruction):** what the LLM should do. This needs to be very specific.

2. **The what (user prompt):** the text the LLM will analyze.

In [8]:

# Select model
model_id = "gemini-2.5-flash"
print(f"Using model: {model_id}")

# Combine texts into a single string
user_prompt = f"""
--- PREVIOUS EARNINGS CALL ---
{prev_text}

--- CURRENT EARNINGS CALL ---
{curr_text}
"""

# Updated system message to request specific JSON structure
system_msg = (
  'Act as an equity analyst. Comparing with the previous_call text, for each sentence from the current earnings call, decide if it is surprising vs the prior call AND likely to be market-moving. '
  'Use categories: guidance, demand, margins, capital_allocation, network/operations, macro, costs, other. '
  'Return a JSON object with a single key "surprising_claims" containing an array of objects with fields: claim, category, direction (up/down/neutral), is_surprising (bool), market_relevance (low/med/high), rationale, confidence. '
  'Only include sentences that are both surprising AND have medium or high market_relevance.'
)

try:
    response = client.models.generate_content(
        model=model_id,
        contents=user_prompt,
        config={
            'system_instruction': system_msg,
            'response_mime_type': 'application/json'
        }
    )
    # Parse the JSON response into a dict
    fedex_res = json.loads(response.text)
    # print(json.dumps(fedex_res, indent=2))
except Exception as e:
    print("Error generating content:", e)
    if 'response' in locals():
        print("Raw response text:", response.text)
    # Fall back to an empty result so the DataFrame step below does not raise a NameError
    fedex_res = {'surprising_claims': []}

fedex_res = pd.DataFrame(fedex_res['surprising_claims'])
fedex_res

Using model: gemini-2.5-flash


Unnamed: 0,claim,category,direction,is_surprising,market_relevance,rationale,confidence
0,"For FY 2026, we expect to achieve $1 billion o...",costs,up,True,high,The previous call indicated an expectation of ...,high
1,We will also continue to repurchase shares and...,capital_allocation,up,True,high,While the previous call provided FY25 capital ...,high
2,We're currently planning for FY 2026 CapEx to ...,capital_allocation,up,True,high,The previous call guided FY25 CapEx down to $4...,high
3,We are only providing first quarter outlook at...,guidance,down,True,high,The CFO explicitly stated in the previous call...,high
4,This translates to a Q1 adjusted EPS range of ...,guidance,neutral,False,high,"This is new, concrete guidance for the upcomin...",high
5,Europe remains a significant opportunity for l...,costs,up,True,medium,While the previous call noted ongoing European...,high
6,"last month, we named Brad Martin as Chairman o...",network/operations,up,True,high,The previous call only mentioned a 'comprehens...,high
7,Our current expectation is for flat to 2% reve...,guidance,down,True,high,"While the USPS contract expiration was known, ...",high
8,"Following the April 2 tariff announcement, cus...",demand,down,True,high,The previous call downplayed the impact of de ...,high
9,The top-end of the range assumes current favor...,demand,neutral,True,medium,"In Q4, US domestic volumes held up well with a...",medium
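Row 4 in the table above has `is_surprising` set to `False` even though the system message asked for surprising sentences only, so the model's output is worth checking rather than trusting. A minimal sketch of a post-filter, assuming the response text follows the `surprising_claims` schema requested in the prompt; `filter_claims` is a hypothetical helper.

```python
import json

def filter_claims(raw_text, allowed_relevance=("med", "medium", "high")):
    """Parse the model's JSON text and keep only rows that meet the stated filter."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return []  # malformed JSON: treat as no usable claims
    claims = payload.get("surprising_claims", []) if isinstance(payload, dict) else []
    kept = []
    for row in claims:
        if not isinstance(row, dict):
            continue
        relevance = str(row.get("market_relevance", "")).lower()
        if row.get("is_surprising") is True and relevance in allowed_relevance:
            kept.append(row)
    return kept

# Made-up response text for illustration only.
raw = json.dumps({"surprising_claims": [
    {"claim": "A", "is_surprising": True, "market_relevance": "high"},
    {"claim": "B", "is_surprising": False, "market_relevance": "high"},
    {"claim": "C", "is_surprising": True, "market_relevance": "low"},
]})
print([row["claim"] for row in filter_claims(raw)])
```

In the notebook this would be called on `response.text` before building the DataFrame.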


## 3) FOMC Press Conference: likely market‑moving lines
**Method:** novelty + hawk/dove tone → Gemini vetting to flag likely market movers.

In [9]:
# 3.1 Fetch two FOMC transcripts (or fallback) — edit URLs for specific dates when teaching
fomc_urls = ['https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf','https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf']



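# These file names end in YYYYMMDD, so a plain string sort is chronological.
# (parse_quarter_key finds no quarter in these URLs and returns (0, 0) for both.)
fomc_urls_sorted = sorted(fomc_urls)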
prev_url, curr_url = fomc_urls_sorted[-2], fomc_urls_sorted[-1]
print("Prev URL:", prev_url)
print("Curr URL:", curr_url)

# Extract text
prev_text = pdf_bytes_to_text(fetch_pdf_bytes(prev_url))
curr_text = pdf_bytes_to_text(fetch_pdf_bytes(curr_url))



print('Lengths -> prev:', len(prev_text), 'curr:', len(curr_text))
print("\n--- preview prev_text ---\n", prev_text[:600])
print("\n--- preview curr_text ---\n", curr_text[:600])


Prev URL: https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf
Curr URL: https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf
Fetched https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf | status=200 | bytes=217624 | type=application/pdf
Fetched https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf | status=200 | bytes=222261 | type=application/pdf
Lengths -> prev: 56626 curr: 47774

--- preview prev_text ---
 June 18, 2025 

  Chair Powell’s Press Conference 

FINAL 

Transcript of Chair Powell’s Press Conference 
June 18, 2025 

CHAIR POWELL.  Good afternoon.  My colleagues and I remain squarely focused on 

achieving our dual-mandate goals of maximum employment and stable prices for the benefit of 

the American people.  Despite elevated uncertainty, the economy is in a solid position.  The 

unemployment rate remains low, and the labor market is at or near maximum employment.  

Inflation has come do

In [10]:
# 3.2 Compute novelty + simple hawk/dove tone
prev_s = split_sentences(prev_text)
curr_s = split_sentences(curr_text)
vec = TfidfVectorizer(stop_words='english', max_features=20000)
Xp = vec.fit_transform(prev_s)
Xc = vec.transform(curr_s)
sims = cosine_similarity(Xc, Xp).max(axis=1)
nov = 1 - sims

hawk = set('tighten tightening restrictive inflation persistent upside overheating strong labor vigilantly price stability hikes higher longer'.split())
dove = set('ease easing lower cut disinflation confidence balanced downside progress softening slack'.split())
def tone(s):
    w = re.findall(r'[A-Za-z]+', s.lower())
    return sum(1 for x in w if x in hawk) - sum(1 for x in w if x in dove)
tones = np.array([tone(s) for s in curr_s])

df_f = pd.DataFrame({'sentence': curr_s, 'novelty': nov, 'tone': tones}).sort_values(['novelty','tone'], ascending=[False, False])
df_f.head(12)


Unnamed: 0,sentence,novelty,tone
193,That’s inefficient.,1.0,0
198,"Ideally, we do it efficiently.",1.0,0
215,And that’s what we do.,1.0,0
229,It was an honor to host him.,1.0,0
305,"Washing machines were tariffed, but, but, but ...",1.0,0
360,What will it take?,1.0,0
401,And we do all of \n\nthat.,1.0,0
406,Jay O’Brien.,1.0,0
407,JAY O’BRIEN.,1.0,0
412,He has personally pressured you.,1.0,0


In [11]:



user_prompt = f"""
--- PREVIOUS FOMC PRESS CONFERENCE ---
{prev_text}

--- CURRENT FOMC PRESS CONFERENCE ---
{curr_text}
"""

# System message requesting a specific JSON structure
system_msg = (
  'Act as a macro-rates analyst. Comparing with the previous transcript, for each sentence from the current transcript, decide if it is surprising vs the prior transcript AND likely to be market-moving. '
  'Use categories: forward path of policy, balance‑sheet pace, confidence about inflation path, changes in risk balance, financial conditions. '
  'Return a JSON object with a single key "surprising_claims" containing an array of objects with fields: claim, category, positive_or_negative_for_market, positive_or_negative_for_long_term_bonds, is_surprising (bool), market_relevance (low/med/high), rationale, confidence. '
  'Only include sentences that are both surprising AND have medium or high market_relevance.'
)

try:
    response = client.models.generate_content(
        model=model_id,
        contents=user_prompt,
        config={
            'system_instruction': system_msg,
            'response_mime_type': 'application/json'
        }
    )
    # Parse the JSON response into a dict
    fomc_res = json.loads(response.text)
    # print(json.dumps(fomc_res, indent=2))
except Exception as e:
    print("Error generating content:", e)
    if 'response' in locals():
        print("Raw response text:", response.text)
    # Fall back to an empty result so the DataFrame step below does not raise a NameError
    fomc_res = {'surprising_claims': []}

fomc_res = pd.DataFrame(fomc_res['surprising_claims'])
fomc_res



Unnamed: 0,claim,category,positive_or_negative_for_market,positive_or_negative_for_long_term_bonds,is_surprising,market_relevance,rationale,confidence
0,"Financial conditions are accommodative, and th...",financial conditions,negative,negative,True,high,"In June, Chair Powell characterized policy as ...",high
1,"In coming months, we’ll receive a good amount ...",forward path of policy,positive,positive,True,high,"In June, Powell was non-committal on the timin...",high
2,"All that said, there’s also downside risk to t...",changes in risk balance,positive,positive,True,high,"In June, Powell described labor market cooling...",high
3,Higher tariffs have begun to show through more...,confidence about inflation path,negative,negative,True,medium,"In June, Powell stated that tariffs were 'like...",high
4,"We will, through our tools, make sure that thi...",confidence about inflation path,negative,negative,True,medium,While the Fed's commitment to price stability ...,high
5,The fact that it’s getting into balance due to...,changes in risk balance,positive,positive,True,high,"In June, the labor market was generally charac...",high
6,"You could argue we are, a bit, “looking throug...",forward path of policy,positive,positive,True,medium,"In June, the discussion around tariffs focused...",medium


## 4) Forecasting Pitfalls: look‑ahead bias & leakage demo
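The demo below asks Gemini to act as if it only has information through 2019 and to pick pandemic winners and losers. The model's training data extends past 2019, so it cannot truly forget the outcome, and its picks look far more prescient than a real 2019 forecast could be. The same look‑ahead problem appears in classic pipelines when TF‑IDF is fit on the full sample or when cross‑validation folds are random rather than time‑ordered.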

In [28]:


user_prompt = 'Act as a hedge fund manager. Assume you only have information up to 2019. Predict which 10 stocks would perform worst and which 10 would perform best if a global pandemic were to hit the economy. Return a JSON object with keys "best_performers" and "worst_performers", each a list of objects with fields ticker, qualitative_prediction, quantitative_prediction, and rational (your rationale).'

# Variant: ask the model to try harder
#user_prompt = 'Act as a hedge fund manager. Assume you only have information up to 2019. Predict which 10 stocks would perform worst and which 10 would perform best if a global pandemic were to hit the economy. Do the very best you can in pursuing this forecast. Return a JSON object with keys "best_performers" and "worst_performers", each a list of objects with fields ticker, qualitative_prediction, quantitative_prediction, and rational (your rationale).'

# Variant: move the information cutoff to 2021
#user_prompt = 'Act as a hedge fund manager. Assume you only have information up to 2021. Predict which 10 stocks would perform worst and which 10 would perform best if a global pandemic were to hit the economy. Return a JSON object with keys "best_performers" and "worst_performers", each a list of objects with fields ticker, qualitative_prediction, quantitative_prediction, and rational (your rationale).'

response = client.models.generate_content(
    model=model_id,
    contents=user_prompt,
    config={
        'response_mime_type': 'application/json'
    }
)

result = json.loads(response.text)







In [29]:
# Create DataFrame for best performers
best_performers_df = pd.DataFrame(result['best_performers'])

# Create DataFrame for worst performers
worst_performers_df = pd.DataFrame(result['worst_performers'])

# Concatenate the two DataFrames using pd.concat
# ignore_index=True ensures the new DataFrame has a continuous index
final_predictions_df = pd.concat([best_performers_df, worst_performers_df], ignore_index=True)

final_predictions_df

Unnamed: 0,ticker,qualitative_prediction,quantitative_prediction,rational
0,AMZN,Best Performer,+45%,E-commerce will become the primary way for peo...
1,NFLX,Best Performer,+30%,"As people are confined to their homes, demand ..."
2,MSFT,Best Performer,+25%,Microsoft's cloud services (Azure) will be cru...
3,ZM,Best Performer,+80%,"Zoom, having recently IPO'd, is perfectly posi..."
4,WMT,Best Performer,+20%,"As an essential retailer, Walmart will see a s..."
5,PG,Best Performer,+15%,Procter & Gamble produces essential consumer s...
6,NVDA,Best Performer,+35%,"NVIDIA's GPUs power data centers, which are es..."
7,MRNA,Best Performer,+70%,As a biotech company focused on mRNA technolog...
8,CHGG,Best Performer,+40%,"With schools and universities likely to close,..."
9,ATVI,Best Performer,+25%,Video games provide entertainment and social i...


## How to fix it?

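- Always **time‑split** (train on `t<=T`, test on `t>T`).
- This is hard to enforce with sophisticated LLMs, because their training data usually extends past the cutoff `T`.
- Fit tokenizers/embeddings on **train only** (or use historical corpora).
- For an LLM, a strict fix means training your own GPT‑style model on text dated before the cutoff.
- An alternative is not to use the LLM directly for prediction. Use it to construct signals from text, then apply sample splits when modeling those signals.
- When constructing the signals, be extra careful about what you ask: the prompt can invite the model to draw on knowledge from after the cutoff.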
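The first and third bullets can be sketched in a few lines. This is a minimal illustration using plain word counts as a stand-in for TF‑IDF, with toy records invented for the example; `time_split`, `fit_vocabulary`, and `vectorize` are hypothetical helpers. With scikit-learn the same pattern is `fit` on the training rows and `transform` on the test rows.

```python
from datetime import date

def time_split(records, cutoff):
    """Split (date, text, label) records into train (<= cutoff) and test (> cutoff)."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] <= cutoff]
    test = [r for r in ordered if r[0] > cutoff]
    return train, test

def fit_vocabulary(texts):
    """Build a word -> column index map from the given texts only."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Bag-of-words counts; words unseen at fit time are dropped, not added."""
    vec = [0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

# Toy records, invented for illustration only.
records = [
    (date(2020, 4, 1), "pandemic lockdown demand collapse", 0),
    (date(2019, 6, 1), "demand steady margins stable", 0),
    (date(2019, 12, 1), "guidance raised demand strong", 1),
]
train, test = time_split(records, cutoff=date(2019, 12, 31))
vocab = fit_vocabulary([text for _, text, _ in train])  # train only
X_test = [vectorize(text, vocab) for _, text, _ in test]
print(sorted(vocab), X_test)
```

Because the vocabulary is fit before the cutoff, "pandemic" and "lockdown" never become features, which is what a forecaster in 2019 would have faced. This guards the classical pipeline only; it does nothing about what an LLM already absorbed during training.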