# Introduction to Earnings Call Sentiment Analysis
In this project, I use earnings call data from S&P 500 companies spanning 2015 to 2021 to build and compare sentiment measures. My goal is to see how well these measures, built with Bag-of-Words techniques and Large Language Models, predict stock market reactions. In this notebook I load the data, clean the text, remove overly common words, and save the result for analysis. Here’s how I’ll approach it step by step.

# Step 1: Bringing in the Earnings Call Data
First, I load the earnings call dataset along with the presentation and Q&A texts from Dropbox. I then join the answers and the questions of each call into single strings, and combine the presentation with the answers into a `PA` column to work with later.

In [68]:
### Starting counter for run time
import time
start_time = time.time()

In [69]:
### Core packages
import numpy as np
import pandas as pd
import sqlite3 as sql
import nltk

### Load the Sample of Earnings Calls for the S&P500 from 2015 to 2021 with financials
ECs = pd.read_csv("https://www.dropbox.com/scl/fi/2p7ahxroqj9pwf98ni5an/Sample_Calls.csv?rlkey=zfieicvz891u4e3z0aroeg0u7&dl=1")

### Load the Sample's Presentation texts
Presentations = pd.read_feather("https://www.dropbox.com/scl/fi/uceh2xva5g4apbmt92cgt/Sample_Calls_Presentations.feather?rlkey=ln4nzsa4nenqyvm0pg2cur9sp&dl=1")

### Load the Q&A session textual data for the sample
QAs = pd.read_feather("https://www.dropbox.com/scl/fi/iq4111nlmsykp2tzxk9xg/Sample_Calls_QA.feather?rlkey=xabjqmwhesx05jivrlfzkgj6m&dl=1")

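### Filtering just the answers and joining them into one string per call
### NOTE: the assignments in this cell align on row position, so they assume ECs,
### Presentations and the grouped Q&A rows all list the calls in the same order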
temp = QAs[QAs['QA'] == 'a']
temp = temp.groupby('file_name')['QA_text'].apply(lambda x: ' '.join(x)).reset_index()
ECs['A'] = temp['QA_text']

### Filtering just the questions
temp = QAs[QAs['QA'] == 'q']
temp = temp.groupby('file_name')['QA_text'].apply(lambda x: ' '.join(x)).reset_index()
ECs['Q'] = temp['QA_text']

### Labeling questions and answers
QAs['QA_text'] = np.where(QAs['QA'] == 'a', 'a: ' + QAs['QA_text'], 'q: ' + QAs['QA_text'])
temp = QAs.groupby('file_name')['QA_text'].apply(lambda x: ' '.join(x)).reset_index()
ECs['QA'] = temp['QA_text']

### Combining
ECs['P'] = Presentations['presentation']
ECs['PA'] = ECs['P'] + ' ' + ECs['A']

### Deleting temp
del temp

### Dividing niq by 1,000 (Compustat reports niq in millions of dollars, so this converts it to billions)
ECs['niq'] = ECs['niq'] / 1000

### Looking at missing data in our sample
missing = ECs.isnull().sum()
print(missing[missing > 0])

CAR-11-Carhart     60
CAR-11-ff3         60
CAR01-Carhart      60
CAR01-ff3          60
IV                101
hvol              101
IV_l1d            101
IV_l2d            101
IV_f1d            101
actq              305
rectq              40
invtq              92
lctq              305
apq                42
dpq               153
cogsq               1
oiadpq              3
dlcq               60
xintq             241
dtype: int64
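
The assignments in the cell above (for example `ECs['A'] = temp['QA_text']`) match rows by index position, so they are only correct if every frame lists the calls in the same order. A more defensive option is to look up each call by `file_name`. The sketch below uses made-up toy frames, not the real data, and it assumes `file_name` identifies a call uniquely, which the source does not confirm.

```python
import pandas as pd

# Toy stand-ins for ECs and QAs (hypothetical values, not the real data)
calls = pd.DataFrame({"file_name": ["c.xml", "a.xml", "b.xml"]})
qa = pd.DataFrame({
    "file_name": ["a.xml", "a.xml", "b.xml", "c.xml"],
    "QA": ["q", "a", "a", "a"],
    "QA_text": ["How was Q1?", "Strong.", "Flat.", "Weak."],
})

# Join all answers per call; the result is a Series indexed by file_name
answers = qa[qa["QA"] == "a"].groupby("file_name")["QA_text"].apply(" ".join)

# Positional assignment: groupby sorts by file_name, so row 0 receives a.xml's text
calls["A_positional"] = answers.reset_index()["QA_text"]

# Key-based assignment: each row looks up its own file_name
calls["A_by_key"] = calls["file_name"].map(answers)

print(calls)
```

In the toy frame the first row is `c.xml`, yet the positional column gives it the answers of `a.xml`; the key-based column gives it its own. The same `map` pattern would apply to the presentation texts, provided `Presentations` also carries a `file_name` column (an assumption).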


In [70]:
### Printing the shape before removing missing data
print(ECs.shape)

### Removing any rows where our control variables and dependent variables are missing
ECs = ECs.dropna(subset=['CAR-11-Carhart', 'niq', 'SurpDec', 'NUMUP', 'NUMDOWN'])

### Resetting the index
ECs = ECs.reset_index(drop=True)

### Looking at the shape after removing missing data
print(ECs.shape)

### Creating a signed square of SurpDec (the magnitude is squared, the sign is kept)
ECs['SurpDec2'] = ECs['SurpDec'] ** 2
ECs['SurpDec2'] = np.where(ECs['SurpDec'] > 0, ECs['SurpDec2'], ECs['SurpDec2'] * -1)

### Looking at the first 5 rows
ECs.head()

(2877, 50)
(2817, 50)


Unnamed: 0,GVKEY,date_rdq,co_conm,file_name,CAR-11-Carhart,CAR-11-ff3,CAR01-Carhart,CAR01-ff3,IV,hvol,...,prccq,cshoq,dvpq,xintq,A,Q,QA,P,PA,SurpDec2
0,16101.0,2016-07-29 13:00:00+00:00,ABBVIE INC,Download ECC/SE/TRANSCRIPT/XMLStd/Archive/2016...,0.011886,0.014261,0.014261,0.021246,0.179151,0.129186,...,61.91,1628.542,0.0,245.0,Sure. So in terms of sales on the Life Planner...,"Hello, thank you. I just wanted to start with ...","q: Hello, thank you. I just wanted to start wi...",Good day and welcome to the Linear Technol...,Good day and welcome to the Linear Technol...,4.0
1,16101.0,2016-04-28 13:00:00+00:00,ABBVIE INC,Download ECC/SE/TRANSCRIPT/XMLStd/Archive/2016...,0.026387,0.023499,0.023499,0.02177,0.289777,0.114447,...,57.12,1617.359,0.0,215.0,"Ryan, this is Steve. We've long encouraged peo...","Hey, thanks. Good morning. I had a question ab...","q: Hey, thanks. Good morning. I had a question...",Welcome to Cerner Corporation's first quar...,Welcome to Cerner Corporation's first quar...,1.0
2,16101.0,2016-10-28 13:00:00+00:00,ABBVIE INC,Download ECC/SE/TRANSCRIPT/XMLStd/Archive/2016...,-0.078668,-0.07929,-0.07929,-0.092594,0.253269,0.381002,...,63.07,1624.908,0.0,271.0,"Jimmy, it's Rob. I'll take the first of those ...","Hi, good morning. I had a couple questions. Fi...","q: Hi, good morning. I had a couple questions....",Welcome to Cerner Corporation's second qua...,Welcome to Cerner Corporation's second qua...,1.0
3,16101.0,2017-01-27 14:00:00+00:00,ABBVIE INC,Download ECC/SE/TRANSCRIPT/XMLStd/Archive/2017...,-0.010152,-0.000737,-0.000737,-0.005279,0.18208,0.145941,...,62.62,1592.513,0.0,277.0,"Great, Gregg, thank you for the questions. Dav...","Thank you. First on sola, how would you charac...","q: Thank you. First on sola, how would you cha...",Welcome to Cerner Corporation's third quar...,Welcome to Cerner Corporation's third quar...,1.0
4,16101.0,2017-04-27 13:00:00+00:00,ABBVIE INC,Download ECC/SE/TRANSCRIPT/XMLStd/Archive/2017...,0.010397,0.010672,0.010672,0.012819,0.192822,0.112189,...,65.16,1591.366,0.0,273.0,"Great, Mike, thanks for the question. For the ...","Hi, guys, this is Mike DiFiore in for Mark Sch...","q: Hi, guys, this is Mike DiFiore in for Mark ...",Welcome to Cerner Corporation's fourth qua...,Welcome to Cerner Corporation's fourth qua...,1.0
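
`SurpDec2` is a signed square rather than a plain square: the magnitude of `SurpDec` is squared while its sign is preserved, so negative surprises stay negative. Here is a minimal standard-library sketch of that transformation; `signed_square` is a name I made up for illustration.

```python
import math

def signed_square(x: float) -> float:
    """Square the magnitude of x but keep its sign."""
    return math.copysign(x * x, x)

for value in (-2.0, -0.5, 0.0, 1.0, 2.0):
    print(value, "->", signed_square(value))
```

On a DataFrame column the same idea can be written in one line as `np.sign(ECs['SurpDec']) * ECs['SurpDec'] ** 2`.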


## Columns:

### Identifiers
- **GVKEY**: "A unique company identifier used by Compustat."
- **date_rdq**: "The reporting date of the quarterly earnings or a related key event date."
- **co_conm**: "The company’s name in CRSP."

### Earnings Call Columns
- **file_name**: "The identifier or filename of the earnings call transcript."
- **CAR-11-Carhart**: "Cumulative Abnormal Return over an event window using the Carhart 4-factor model." (`-11` denotes the window from day -1 to day +1, i.e. from the day before the report through the day after it)
- **CAR-11-ff3**: "Cumulative Abnormal Return over an event window using the Fama-French 3-factor model."
- **CAR01-Carhart**: "Cumulative Abnormal Return (alternative window) using the Carhart 4-factor model." (`01` denotes the window from day 0 to day +1, i.e. the report day and the day after it)
- **CAR01-ff3**: "Cumulative Abnormal Return (alternative window) using the Fama-French 3-factor model."
- **IV**: "Implied volatility (often from options) reflecting expected future stock price volatility."
- **hvol**: "Historical volatility of the stock, based on past price movements."
- **IV_l1d**: "Implied volatility lagged by one day."
- **IV_l2d**: "Implied volatility lagged by two days."
- **IV_f1d**: "Implied volatility one day forward (a one-day lead)."
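
To make the window naming concrete: a cumulative abnormal return is the sum of the daily abnormal returns over the event window, with day 0 being the report date. The sketch below uses invented abnormal returns. How the abnormal returns in this dataset were estimated (estimation window, factor loadings) is not stated here, so the example only illustrates the windowing.

```python
# Hypothetical daily abnormal returns keyed by event day (0 = report date)
abnormal_returns = {-2: 0.001, -1: 0.004, 0: 0.020, 1: -0.006, 2: 0.002}

def car(ar_by_day: dict, start: int, end: int) -> float:
    """Sum abnormal returns over event days start..end (inclusive)."""
    return sum(ar for day, ar in ar_by_day.items() if start <= day <= end)

car_m1_p1 = car(abnormal_returns, -1, 1)  # the "CAR-11" window
car_0_p1 = car(abnormal_returns, 0, 1)    # the "CAR01" window

print(f"CAR[-1,+1] = {car_m1_p1:.3f}")
print(f"CAR[0,+1]  = {car_0_p1:.3f}")
```

The only difference between the two windows is whether the day before the report is included.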

### I/B/E/S Columns
- **NUMEST**: "The number of analyst estimates contributing to the consensus."
- **NUMUP**: "The number of analysts who have revised their EPS estimates upward."
- **NUMDOWN**: "The number of analysts who have revised their EPS estimates downward."
- **MEDEST**: "The median of analyst EPS estimates."
- **MEANEST**: "The mean of analyst EPS estimates."
- **ACTUAL**: "The I/B/E/S standardized actual EPS figure, often adjusted for comparability."
- **surp**: "The earnings surprise, typically ACTUAL minus MEANEST."
- **SurpDec**: "A scaled or decimalized version of the earnings surprise."

### Compustat Columns
- **atq**: "Total Assets (Quarterly)"
- **actq**: "Current Assets (Quarterly)"
- **cheq**: "Cash and Short-Term Investments (Quarterly)"
- **rectq**: "Accounts Receivable (Quarterly)"
- **invtq**: "Inventory (Quarterly)"
- **ltq**: "Total Liabilities (Quarterly)"
- **lctq**: "Current Liabilities (Quarterly)"
- **apq**: "Accounts Payable (Quarterly)"
- **ceqq**: "Common/Ordinary Equity - Total (Quarterly)"
- **seqq**: "Stockholders' Equity - Total (Quarterly)"
- **capxy**: "Capital Expenditures (Year-to-Date; the `y` suffix marks cumulative fiscal-year-to-date items rather than single-quarter flows)"
- **dpq**: "Depreciation and Amortization (Quarterly)"
- **saleq**: "Revenue (Quarterly)"
- **cogsq**: "Cost of Goods Sold (Quarterly)"
- **oiadpq**: "Operating Income (Quarterly)"
- **niq**: "Net Income (Quarterly)"
- **epspxq**: "Basic Earnings Per Share Excluding Extraordinary Items (Quarterly)"
- **epspiq**: "Basic Earnings Per Share Including Extraordinary Items (Quarterly)"
- **dlttq**: "Long-Term Debt (Quarterly)"
- **dlcq**: "Debt in Current Liabilities (Quarterly)"
- **prccq**: "Price Close - Fiscal Quarter"
- **cshoq**: "Common Shares Outstanding (Quarterly)"
- **dvpq**: "Dividends - Preferred/Preference (Quarterly)"
- **xintq**: "Interest Expense (Quarterly)"

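### Text Variables
*(Raw text as loaded; the tokenized, stemmed versions with stopwords and punctuation removed are created in Step 2 as `P_cleaned`, `A_cleaned`, and `PA_cleaned`)*
- **P**: "Presentation text"
- **A**: "All answers by management to questions"
- **Q**: "All questions asked in the Q&A session"
- **QA**: "Full Q&A session, with each turn prefixed by `q:` or `a:`"
- **PA**: "Presentation text + Answer text"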

## Step 1.1: Picking a smaller sample
While developing the code, we work on a smaller random sample of the data. The sampling cell below is commented out (#) for the final run on the entire dataset.

In [71]:
# ### Getting a sample of the data
# ECs = ECs.sample(100, random_state=123)

# ### Resetting the index
# ECs = ECs.reset_index(drop=True)

# #Looking at the shape of the data
# print(ECs.shape)
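
Instead of commenting the sampling cell in and out, a flag can switch between the development sample and the full run. This is only a sketch of that alternative: `USE_SAMPLE` and `SAMPLE_SIZE` are hypothetical names, and the toy frame stands in for `ECs`.

```python
import pandas as pd

USE_SAMPLE = True    # hypothetical switch: set to False for the full dataset
SAMPLE_SIZE = 100    # same size as in the commented-out cell above

# Toy stand-in for ECs
df = pd.DataFrame({"file_name": [f"call_{i}" for i in range(500)]})

if USE_SAMPLE:
    # A fixed random_state makes the draw reproducible across runs
    df = df.sample(SAMPLE_SIZE, random_state=123).reset_index(drop=True)

print(df.shape)
```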

# Step 2: Preprocessing the Text
At this point, I’ll clean up the text data. I’ll tokenize it, remove stopwords, and apply stemming to simplify the words in the presentations, answers, and full transcripts, making them ready for sentiment analysis.

In [72]:
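### Importing packages for cleaning
### (word_tokenize and stopwords rely on NLTK's tokenizer and stopword data, which must be fetched once with nltk.download)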
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

### Adding a progress bar
from tqdm import tqdm
tqdm.pandas()

### Initialize the stemmer and stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

### Define a function to clean text
def clean_text(text):
    """
    Cleans the input text by:
    - Tokenizing
    - Converting to lowercase
    - Removing non-alphabetic tokens
    - Removing stopwords
    - Applying stemming
    """
    if not isinstance(text, str):  # Check if the input is a string
        return []
    tokens = word_tokenize(text)                                  # Tokenize the text
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Lowercase & keep alphabetic tokens
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(word) for word in tokens]              # Apply stemming
    return tokens

### Apply the cleaning function to the specified columns
ECs['P_cleaned'] = ECs['P'].progress_apply(clean_text)
ECs['A_cleaned'] = ECs['A'].progress_apply(clean_text)
ECs['PA_cleaned'] = ECs['P_cleaned'] + ECs['A_cleaned']

100%|██████████| 2817/2817 [00:57<00:00, 49.32it/s] 
100%|██████████| 2817/2817 [01:00<00:00, 46.39it/s]
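
To see what each stage of `clean_text` does, here is a dependency-free stand-in that follows the same order of operations on one sentence. It is not the NLTK pipeline: the regex tokenizer, the four-word stopword set, and the crude suffix stripper are simplifications I made up for illustration, so its output will differ from what `word_tokenize` and `PorterStemmer` produce.

```python
import re

TOY_STOPWORDS = {"we", "the", "in", "our"}   # tiny illustrative subset, not NLTK's list

def toy_stem(word: str) -> str:
    """Crude suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_clean(text):
    if not isinstance(text, str):                          # same guard as clean_text
        return []
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text)   # words, plus number/punctuation runs
    tokens = [t.lower() for t in tokens if t.isalpha()]    # lowercase, keep alphabetic tokens
    tokens = [t for t in tokens if t not in TOY_STOPWORDS] # drop stopwords
    return [toy_stem(t) for t in tokens]                   # stem what is left

print(toy_clean("We increased margins 3.5% in the quarter, exceeding our targets."))
print(toy_clean(None))
```

The number `3.5%` and the punctuation are dropped by the `isalpha` filter, and a missing transcript (`None` or `NaN`) comes back as an empty list, which is why the later list concatenation `P_cleaned + A_cleaned` works even for calls without text.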


# Step 3: Removing Overly Common Words
Now, I’ll identify and remove words that appear in more than 50% of the documents (the `threshold` value set in the code). Words this frequent carry little distinguishing sentiment, so removing them focuses the measures on more informative terms.

In [73]:
### Calculate the document frequency for each unique word
from collections import Counter
doc_freq = Counter()

### Iterate over each cleaned transcript (presentation + answers) and count every word once per document
for presentation in ECs.PA_cleaned:
    cleaned_words = set(presentation)
    doc_freq.update(cleaned_words)

### Retrieve and display the 1,000 most common words by document frequency
top_common_words = doc_freq.most_common(1000)
print(top_common_words)

### Removing all words that appear in more than `threshold` (here 50%) of the documents
### (only the 1,000 most common words are checked; a set keeps the later lookups fast)
threshold = 0.50
words_to_remove = {word for word, count in top_common_words if count > threshold * len(ECs)}
print(f"Removing {len(words_to_remove)} words that appear in more than {threshold*100}% of the documents.")

### Removing the words in words_to_remove
def remove_common_words(text):
    return [word for word in text if word not in words_to_remove]
ECs['P_cleaned'] = ECs['P_cleaned'].progress_apply(remove_common_words)
ECs['A_cleaned'] = ECs['A_cleaned'].progress_apply(remove_common_words)
ECs['PA_cleaned'] = ECs['P_cleaned'] + ECs['A_cleaned']

[('go', 2817), ('thank', 2817), ('also', 2817), ('see', 2817), ('quarter', 2817), ('us', 2817), ('look', 2817), ('year', 2817), ('continu', 2817), ('forward', 2817), ('busi', 2816), ('good', 2816), ('call', 2816), ('well', 2816), ('first', 2815), ('time', 2815), ('would', 2815), ('last', 2814), ('like', 2814), ('oper', 2814), ('expect', 2814), ('think', 2813), ('result', 2813), ('question', 2811), ('point', 2811), ('one', 2810), ('today', 2810), ('make', 2808), ('get', 2808), ('growth', 2806), ('realli', 2806), ('new', 2806), ('take', 2804), ('includ', 2803), ('work', 2802), ('end', 2802), ('million', 2802), ('strong', 2800), ('increas', 2799), ('earn', 2799), ('product', 2799), ('market', 2798), ('term', 2798), ('turn', 2795), ('share', 2795), ('start', 2795), ('posit', 2793), ('impact', 2793), ('come', 2790), ('rate', 2786), ('financi', 2785), ('back', 2785), ('compani', 2782), ('perform', 2781), ('cost', 2781), ('thing', 2779), ('right', 2778), ('invest', 2778), ('base', 2777), ('se

100%|██████████| 2817/2817 [00:08<00:00, 333.51it/s]
100%|██████████| 2817/2817 [00:07<00:00, 364.44it/s]
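
Document frequency counts in how many documents a word appears, not how often it appears overall. The toy example below shows the difference and applies the same threshold rule. The three token lists are invented, and unlike the cell above it checks every word rather than only the 1,000 most common.

```python
from collections import Counter

# Three toy "documents", already tokenized (invented data)
docs = [
    ["quarter", "growth", "growth", "strong"],
    ["quarter", "cost", "declin"],
    ["quarter", "growth", "risk"],
]

# set() makes each word count at most once per document
doc_freq = Counter()
for tokens in docs:
    doc_freq.update(set(tokens))

threshold = 0.50
too_common = {w for w, n in doc_freq.items() if n > threshold * len(docs)}

filtered = [[w for w in tokens if w not in too_common] for tokens in docs]
print(sorted(too_common))
print(filtered)
```

Here `growth` occurs twice in the first document but has a document frequency of 2, not 3. With three documents and a 50% threshold, any word found in two or more documents is removed.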


# Step 4: Turning Tokens Back into Strings
After cleaning and filtering, I’ll join the tokenized words back into strings. Since the data is saved to SQL, these columns need to be strings, not lists.

In [74]:
### Turning the tokens back into a string
def tokens_to_string(tokens):
    return ' '.join(tokens)

ECs['P_cleaned'] = ECs['P_cleaned'].progress_apply(tokens_to_string)
ECs['A_cleaned'] = ECs['A_cleaned'].progress_apply(tokens_to_string)
ECs['PA_cleaned'] = ECs['PA_cleaned'].progress_apply(tokens_to_string)

### Display the first rows of the cleaned data
ECs[['P', 'A', 'PA', 'P_cleaned', 'A_cleaned', 'PA_cleaned']].head()

100%|██████████| 2817/2817 [00:00<00:00, 56990.63it/s]
100%|██████████| 2817/2817 [00:00<00:00, 24251.00it/s]
100%|██████████| 2817/2817 [00:00<00:00, 40253.86it/s]


Unnamed: 0,P,A,PA,P_cleaned,A_cleaned,PA_cleaned
0,Good day and welcome to the Linear Technol...,Sure. So in terms of sales on the Life Planner...,Good day and welcome to the Linear Technol...,linear fiscal zerio financ sir linear bob swan...,life planner gibraltar life planner life plann...,linear fiscal zerio financ sir linear bob swan...
1,Welcome to Cerner Corporation's first quar...,"Ryan, this is Steve. We've long encouraged peo...",Welcome to Cerner Corporation's first quar...,cerner variou constitut prospect health client...,ryan steve roa annuiti gradual pdi fee bip roa...,cerner variou constitut prospect health client...
2,Welcome to Cerner Corporation's second qua...,"Jimmy, it's Rob. I'll take the first of those ...",Welcome to Cerner Corporation's second qua...,cerner august variou constitut prospect health...,jimmi rob steve annuiti stori minut recal thre...,cerner august variou constitut prospect health...
3,Welcome to Cerner Corporation's third quar...,"Great, Gregg, thank you for the questions. Dav...",Welcome to Cerner Corporation's third quar...,cerner novemb variou constitut prospect health...,gregg dave regulatori interact solanezumab jan...,cerner novemb variou constitut prospect health...
4,Welcome to Cerner Corporation's fourth qua...,"Great, Mike, thanks for the question. For the ...",Welcome to Cerner Corporation's fourth qua...,cerner februari variou constitut prospect conc...,mike solanezumab scenario function cognit comp...,cerner februari variou constitut prospect conc...
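
The reason for converting back to strings is that SQLite has no native list type: without a registered adapter, a Python list cannot be bound to a column. A small standard-library sketch, using an in-memory database and a made-up `demo` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (tokens TEXT)")

tokens = ["strong", "growth", "margin"]

try:
    conn.execute("INSERT INTO demo VALUES (?)", (tokens,))   # a list is rejected
    list_accepted = True
except sqlite3.Error:
    list_accepted = False

conn.execute("INSERT INTO demo VALUES (?)", (" ".join(tokens),))  # a string is fine
stored = conn.execute("SELECT tokens FROM demo").fetchone()[0]
conn.close()

print(list_accepted, stored)
print(stored.split())
```

Because the tokens contain no spaces, `str.split()` recovers the original list when the text is loaded again.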


# Step 5: Storing the Processed Data
Finally, I’ll save the processed dataset to a SQLite database (`data.db`, table `ECs1`) so that the next phase of sentiment analysis and prediction can load it without repeating the preprocessing. A backup file is also kept for each notebook.

In [75]:
### Saving the data to a SQLite database
conn = sql.connect('data.db')
ECs.to_sql('ECs1', conn, if_exists='replace')
conn.close()
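
For reference, here is a sketch of how a later notebook could read the table back. It uses a toy frame and an in-memory database so it runs on its own; with the real file the connection would point at `data.db` instead.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the processed ECs frame
toy = pd.DataFrame({
    "file_name": ["a.xml", "b.xml"],
    "PA_cleaned": ["strong growth", "cost declin"],
})

conn = sqlite3.connect(":memory:")              # in-memory database for the demo
toy.to_sql("ECs1", conn, if_exists="replace")   # the DataFrame index is written as a column named "index"
loaded = pd.read_sql("SELECT * FROM ECs1", conn)
conn.close()

print(loaded.columns.tolist())
```

Since `to_sql` writes the index by default, the loaded frame has an extra `index` column. Passing `index=False` to `to_sql` avoids it.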

In [76]:
### Stopping the timer
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

Execution time: 155.57 seconds
