# **Lecture: NLP and LLMs in Data Driven Investing**
**Prof. [Your Name]** | *Fin418: Data Driven Investing*

## **Overview**
In this lecture, we explore how Natural Language Processing (NLP) and Large Language Models (LLMs) can be used to extract signals from unstructured financial data.

**Agenda:**
1.  **Document Similarity:** Analyzing the shift in FedEx's narrative during the early pandemic (March vs. June 2020 earnings calls) using TF-IDF.
2.  **Sentiment Analysis with LLMs:** Using `FinBERT` to quantify the tone of Federal Reserve (FOMC) statements during the crisis.
3.  **The Look-Ahead Bias Trap:** A simulation demonstrating the most common pitfall when backtesting LLM-based strategies.

In [None]:
# @title 1. Setup and Libraries
# We install 'transformers' for the LLM and 'wordcloud' for visualization
!pip install transformers torch wordcloud matplotlib seaborn pandas scikit-learn -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from wordcloud import WordCloud
from transformers import pipeline

# Set plotting style for the course
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
print("Libraries installed and imported successfully.")

---
## **2. Earnings Call Analysis: FedEx (FDX)**

Earnings calls provide management's raw perspective on operations. We will compare two critical periods for FedEx using **Cosine Similarity** to measure how "distant" the narratives are.

* **Period A (March 2020):** Peak uncertainty, start of lockdowns.
* **Period B (June 2020):** Adaptation, rise of e-commerce.
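
As a minimal sketch of what cosine similarity measures, the toy example below builds raw term-count vectors for short, made-up sentences and computes the cosine of the angle between them. The sentences and the helper name `cosine_from_counts` are illustrative assumptions; the lecture code uses TF-IDF weights from scikit-learn rather than raw counts.

```python
import math
from collections import Counter

def cosine_from_counts(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary only (other terms contribute 0)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up sentences for illustration only
same = cosine_from_counts("trade slowed sharply", "trade slowed sharply")
partial = cosine_from_counts("trade slowed sharply", "trade volume grew")
disjoint = cosine_from_counts("trade slowed sharply", "ecommerce volume grew")
print(same, partial, disjoint)
```

Identical texts score (up to rounding) 1, texts with no words in common score 0, and partial overlap lands in between.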

In [None]:
# 2.1 The Data: Excerpts from FedEx Earnings Calls
# In a real application, you would retrieve full transcripts from a data provider (e.g., FMP or SeekingAlpha)

fdx_march_2020 = """
We said on this call last year that FY 2020 would be a year of challenge and change.
Then beginning in January, we began to deal with COVID-19 in China then in Europe and then, of course, in the United States.
We've made every effort to keep our team members and the public safe as we've dealt with this terrible disease.
We are suspending our earnings forecast. The economic impact of the coronavirus is hard to predict.
Global trade has slowed significantly due to the shutdowns.
"""

fdx_june_2020 = """
Here in the United States, the COVID pandemic has accelerated e-commerce adoption.
In mid-March, Asia-Pacific outbound average daily volume grew substantially over pre-COVID-19 levels fueled by PPE demand surge.
We are also experiencing Europe outbound growth on the transatlantic lane due to limited capacity and surging e-commerce volume.
As we enter fiscal 21, there are signs of tentative economic recovery under way.
We have continued to champion small and medium businesses and support their recovery.
"""

# 2.2 Vectorization (TF-IDF) and Similarity
# We convert each text to a vector of term weights; words that appear in only
# one excerpt (like "PPE" or "suspending") get more weight than shared words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform([fdx_march_2020, fdx_june_2020])

# Calculate Similarity
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

print(f"Cosine Similarity between March and June Calls: {similarity[0][0]:.4f}")
print("\nInterpretation: A low score means the two excerpts share little distinctive vocabulary, consistent with a large shift in the company's narrative.")
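
TF-IDF gives more weight to terms that appear in fewer documents. The sketch below computes the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` applies by default (`smooth_idf=True`), for a two-document corpus like the one above. The helper name `smoothed_idf` is hypothetical, and the sketch leaves out the term-frequency factor and the final L2 normalization.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Two-document corpus, as in the March vs June comparison
shared = smoothed_idf(n_docs=2, doc_freq=2)   # term appears in both excerpts
unique = smoothed_idf(n_docs=2, doc_freq=1)   # term appears in only one
print(f"shared term idf: {shared:.3f}")  # 1.000
print(f"unique term idf: {unique:.3f}")  # 1.405
```

With only two documents the spread between shared and unique terms is modest, which is one reason similarity scores on short excerpts should be read with caution.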

---
## **3. Application: FOMC Sentiment with FinBERT**

Standard dictionary approaches (counting "positive" vs "negative" words) often fail in finance because context matters. We will use **FinBERT**, a BERT-based language model further trained on financial text and fine-tuned for sentiment classification, to analyze Federal Reserve statements.
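
To see why word counting can mislead, here is a minimal sketch of a dictionary scorer. The word lists and example sentences are made up for illustration (they are not a published financial lexicon), and `dictionary_score` is a hypothetical helper.

```python
import re

# Tiny illustrative word lists (assumptions, not a real financial lexicon)
POSITIVE = {"growth", "strong", "improved", "recovery"}
NEGATIVE = {"losses", "decline", "weak", "crisis"}

def dictionary_score(text: str) -> int:
    """Count positive words minus negative words, ignoring all context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

# Good news phrased with a "negative" word: losses got smaller
good_news = "Losses narrowed sharply this quarter."
# Bad news phrased with a "positive" word: growth is negated
bad_news = "There was no growth in the quarter."

print(dictionary_score(good_news))  # -1
print(dictionary_score(bad_news))   # 1
```

Both sentences are scored with the wrong sign, because the counter cannot see that "narrowed" and "no" reverse the meaning of the words it matches.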

In [None]:
# 3.1 Load the FOMC Text Data
fomc_march_15 = """
The coronavirus outbreak has harmed communities and disrupted economic activity in many countries, including the United States.
The Federal Reserve is prepared to use its full range of tools to support the flow of credit to households and businesses.
"""

fomc_april_29 = """
The Federal Reserve is committed to using its full range of tools to support the U.S. economy in this challenging time.
The ongoing public health crisis will weigh heavily on economic activity, employment, and inflation in the near term.
"""

fomc_data = pd.DataFrame({
    'Date': ['2020-03-15', '2020-04-29'],
    'Text': [fomc_march_15, fomc_april_29]
})

# 3.2 Load FinBERT
# Note: This downloads the model weights from Hugging Face
classifier = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# 3.3 Apply the Model
print("Running LLM inference...")
def get_sentiment(text):
    # The model returns a label ('positive', 'negative', or 'neutral') and a confidence score
    result = classifier(text)[0]
    return result['label'], result['score']

fomc_data[['Label', 'Score']] = fomc_data['Text'].apply(lambda x: pd.Series(get_sentiment(x)))

display(fomc_data)
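
For use as a trading signal, the label and confidence score are often collapsed into a single signed number. The sketch below shows one simple way to do that, assuming lowercase labels (`positive`, `negative`, `neutral`); `signed_sentiment` is a hypothetical helper and the inputs are made-up values, not actual model output.

```python
def signed_sentiment(label: str, score: float) -> float:
    """Map a (label, confidence) pair to a number in [-1, 1]."""
    sign = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
    return sign[label.lower()] * score

# Hypothetical model outputs, for illustration only
print(signed_sentiment("negative", 0.9))  # -0.9
print(signed_sentiment("neutral", 0.8))   # 0.0
```

Applied to the table above, this could look like `fomc_data.apply(lambda r: signed_sentiment(r['Label'], r['Score']), axis=1)`.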

---
## **4. The Pitfall: Look-Ahead Bias**

One of the most damaging errors in Data Driven Investing is **Look-Ahead Bias**: using information in a backtest that was not yet available when the trade would have been placed.

When using LLMs, this happens if you align today's news with today's return. In reality, news is often released *during* or *after* market hours. If you trade at the **Open**, you cannot use news released at **Noon**.
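
The timing rule above can be sketched as a small helper that maps each news timestamp to the first market open at which it could be traded. The 09:30 open, the helper name `first_tradable_open`, and the example timestamps are assumptions for illustration; the sketch skips weekends but ignores exchange holidays and time zones.

```python
from datetime import datetime, time, timedelta

MARKET_OPEN = time(9, 30)  # assumed exchange open, in exchange-local time

def first_tradable_open(news_time: datetime) -> datetime:
    """Return the earliest market open strictly after the news timestamp.

    Simplification: weekends are skipped, exchange holidays are ignored.
    """
    day = news_time.date()
    # News released at or after the open has to wait for the next session
    if news_time.time() >= MARKET_OPEN:
        day += timedelta(days=1)
    # Roll forward over Saturday (5) and Sunday (6)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, MARKET_OPEN)

pre_open = first_tradable_open(datetime(2020, 3, 16, 8, 0))    # Monday 08:00
noon = first_tradable_open(datetime(2020, 3, 16, 12, 0))       # Monday 12:00
friday_pm = first_tradable_open(datetime(2020, 3, 20, 17, 0))  # Friday 17:00
print(pre_open, noon, friday_pm)
```

Only the pre-open item can be traded at Monday's open; the noon item waits until Tuesday, and the Friday evening item waits until the following Monday.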

Below is a simulation of this error.

In [None]:
# 4.1 Simulate Data
np.random.seed(42)
n_days = 200
dates = pd.date_range(start='2020-01-01', periods=n_days)

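# Generate synthetic returns and sentiment
# Assumption: prices react to news on the SAME day, and sentiment also has
# (weaker) predictive power for TOMORROW's return
sentiment = np.random.choice([-1, 1], n_days) # -1 (Bad News), 1 (Good News)
noise = np.random.normal(0, 0.01, n_days)

# The Return generation process:
# Return(t) is driven by Sentiment(t-1) [Predictive] AND Sentiment(t) [Reactionary]
# Note: the 1.0 same-day coefficient is an illustrative choice, not an estimate
predictive = 0.5 * np.roll(sentiment, 1) * 0.01  # np.roll(x, 1)[t] == x[t-1]
reactionary = 1.0 * sentiment * 0.01             # same-day price reaction
returns = predictive + reactionary + noise
# Drop the first row: np.roll wraps the last sentiment value around to day 0
df = pd.DataFrame({'Date': dates, 'Return': returns, 'Sentiment': sentiment}).iloc[1:]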

# 4.2 The "Flawed" Strategy (Look-Ahead)
# We trade using the sentiment of the SAME day (assuming we knew the news before the market opened)
df['Strategy_LookAhead'] = df['Sentiment'] * df['Return']

# 4.3 The "Correct" Strategy (Lagged)
# We trade using YESTERDAY'S sentiment (we read the news, then trade the next open)
df['Strategy_Real'] = df['Sentiment'].shift(1) * df['Return']

# 4.4 Visualization
df['Cum_Ret_LookAhead'] = (1 + df['Strategy_LookAhead']).cumprod()
df['Cum_Ret_Real'] = (1 + df['Strategy_Real']).cumprod()
df['Cum_Ret_Market'] = (1 + df['Return']).cumprod()

plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Cum_Ret_LookAhead'], label='Look-Ahead Bias (Flawed)', color='red', linewidth=2, linestyle='--')
plt.plot(df['Date'], df['Cum_Ret_Real'], label='Realistic Strategy (Lagged)', color='green', linewidth=2)
plt.plot(df['Date'], df['Cum_Ret_Market'], label='Market (Buy & Hold)', color='gray', alpha=0.5)

plt.title("The Danger of Look-Ahead Bias in NLP Strategies", fontsize=16)
plt.ylabel("Growth of $1 Invested")
plt.legend()
plt.show()

print("Observation: The Red line is a fantasy. The Green line is reality.")