# Predicting Earnings Surprises

## Task

We want to predict the size of the surprise in a company's upcoming earnings announcement using a machine learning classification model. The model is trained on three types of data: earnings, pricing, and technical price action data. The optimized model assigns each announcement to one of three classes: positive, neutral, or negative. A 'positive' classification indicates a predicted surprise >15% of the estimated EPS, a 'negative' classification indicates a predicted surprise <-15% of the estimated EPS, and a 'neutral' classification indicates no predicted surprise (-15% < x < 15%).
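
As a minimal sketch of the three-class rule, the helper below maps a reported and an estimated EPS to a label. The name `classify_surprise` and its `threshold` parameter are illustrative assumptions and do not appear in the notebook. The percentage formula mirrors the `perc_change` expression in the SQL query further down, including treating a zero estimate as a 0% surprise. A surprise of exactly ±15% is assumed to count as neutral.

```python
def classify_surprise(eps, eps_estimated, threshold=15.0):
    """Label an earnings result as 'positive', 'neutral', or 'negative'.

    Hypothetical helper: the surprise is measured as a percentage of the
    absolute estimated EPS.
    """
    if eps_estimated == 0:
        # Assumption: mirror COALESCE(..., 0) in the SQL, where dividing by zero yields NULL
        surprise_pct = 0.0
    else:
        surprise_pct = (eps - eps_estimated) / abs(eps_estimated) * 100

    if surprise_pct > threshold:
        return "positive"
    if surprise_pct < -threshold:
        return "negative"
    return "neutral"


print(classify_surprise(1.20, 1.00))  # about +20% of the estimate: positive
print(classify_surprise(0.80, 1.00))  # about -20% of the estimate: negative
print(classify_surprise(1.05, 1.00))  # about +5% of the estimate: neutral
```

Dividing by the absolute value of the estimate keeps the sign of the surprise meaningful when the estimated EPS is negative.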

## Data

The data for training and testing the model came from several external data providers. Earnings and pricing data were collected from Financial Modeling Prep's historical earnings calendar and daily indicator endpoints. Technical data were collected from FMP Cloud's daily technical indicator endpoint.

The schema below outlines how the data is stored in an AWS RDS MySQL database:

![Untitled Workspace (1)](https://user-images.githubusercontent.com/45079557/150410944-eb8c8e30-ac2d-4f23-bb03-cb5c3f489cfb.png)

## Code

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymysql
import seaborn as sns
from decouple import config
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from scipy import stats

In [None]:
# Set parameters for AWS database
aws_hostname = config("AWS_HOST")
aws_database = config("AWS_DB")
aws_username = config("AWS_USER")
aws_password = config("AWS_PASS")
aws_port = config("AWS_PORT")

# Pull API keys from .env file
FMP_API_KEY = config("FMP_API_KEY")
FMP_CLOUD_API_KEY = config("FMP_CLOUD_API_KEY")

In [None]:
db = pymysql.connect(host=aws_hostname,user=aws_username, password=aws_password, database='rds-python', charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
cursor = db.cursor()

### Retrieve Data from MySQL Database

In [None]:
cursor.execute("""
SELECT *, 
LAG(perc_change) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS lastSurp, 
LAG(perc_change, 2) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS last2Surp,

LAG(eps) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS lastEps,
LAG(eps, 2) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS last2Eps,

LAG(epsEstimated) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS lastEst,
LAG(epsEstimated, 2) OVER(PARTITION BY symbol ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')) AS last2Est
FROM (
    SELECT *, COALESCE((eps - epsEstimated) / ABS(epsEstimated) * 100,0) AS perc_change
    FROM train_agg
    ORDER BY STR_TO_DATE(`date`, '%c/%e/%y')
)x
""")
train = cursor.fetchall()

In [None]:
train_df = pd.DataFrame(train)
train_df.info()

In [None]:
cursor.close()
db.close()

### Cleaning Data

In [None]:
print("...Start...")
print(train_df.head(2))
print("...End...")
print(train_df.tail(2))

In [None]:
is_NaN = train_df.isnull()
row_has_NaN = is_NaN.any(axis=1)
forecast_these = train_df[row_has_NaN]
print(len(forecast_these))
print(forecast_these.head())

In [None]:
forecast_these

In [None]:
# Keep only announcements with a reported EPS; copy so later edits do not touch train_df
train_df = train_df[train_df["eps"].notna()]
df = train_df.copy()
df.info()

### EDA/Feature Engineering

In [None]:
plt.scatter(pd.to_datetime(df["date"]), df["perc_change"])
plt.show()

A few outliers are present in the percentage difference between EPS and estimated EPS. Therefore, we filter out rows that fall below the 1st percentile or above the 99th percentile of that difference.

In [None]:
# Keep rows between the 1st and 99th percentiles; copy so later column assignments are safe
df = df[df.perc_change.between(df.perc_change.quantile(.01), df.perc_change.quantile(.99))].copy()
print("Earnings Surprise Average: {}".format(df["perc_change"].mean()))
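
The cell above trims by quantile. For comparison, the sketch below applies a standard-deviation (z-score) rule next to the quantile rule on synthetic data. The names `toy`, `z_filtered`, and `q_filtered` are illustrative, and the injected value of 5000 is made up to stand in for an extreme surprise.

```python
import numpy as np
import pandas as pd

# Synthetic surprise percentages, used only for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({"perc_change": rng.normal(loc=0.0, scale=10.0, size=1000)})
toy.loc[0, "perc_change"] = 5000.0  # one made-up extreme value

# Rule 1: drop rows more than 3 standard deviations from the mean
z = (toy["perc_change"] - toy["perc_change"].mean()) / toy["perc_change"].std()
z_filtered = toy[z.abs() <= 3]

# Rule 2: keep rows between the 1st and 99th percentiles (the rule used in the cell above)
low, high = toy["perc_change"].quantile([0.01, 0.99])
q_filtered = toy[toy["perc_change"].between(low, high)]

print(len(toy), len(z_filtered), len(q_filtered))
```

The two rules are not interchangeable: the quantile rule always removes about 2% of the rows, while the z-score rule removes only values that are extreme relative to a mean and standard deviation that the outliers themselves inflate.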

#### Distribution of Historical Earnings Surprises

In [None]:
num_bins = 100
plt.hist(df["perc_change"], num_bins, facecolor='blue', alpha=0.5)
plt.xlabel('Earnings Surprise (% of Estimated EPS)')
plt.ylabel('Number of Earnings')
plt.title('Histogram of Earnings Surprises')
plt.grid(True)
plt.tight_layout()
plt.savefig('visuals/histogram_eps_diff.png', facecolor='white', transparent=False)
plt.show()

After removing outliers, the historical earnings surprises look roughly normally distributed around a mean of about 9%.
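
The histogram gives a visual impression only. A hedged way to check the shape numerically is to look at skewness, excess kurtosis, and a normality test from `scipy.stats`, which the notebook already imports. The helper name `describe_shape` and the synthetic samples below are assumptions for illustration; in the notebook the input would be `df["perc_change"]`.

```python
import numpy as np
from scipy import stats


def describe_shape(values):
    """Return skewness, excess kurtosis, and the D'Agostino-Pearson normality p-value."""
    values = np.asarray(values, dtype=float)
    _statistic, p_value = stats.normaltest(values)
    return {
        "skew": float(stats.skew(values)),
        "excess_kurtosis": float(stats.kurtosis(values)),
        "normaltest_p": float(p_value),
    }


# Synthetic stand-ins for df["perc_change"], used only for illustration
rng = np.random.default_rng(42)
symmetric = rng.normal(loc=9.0, scale=20.0, size=2000)
right_skewed = rng.exponential(scale=20.0, size=2000)

print(describe_shape(symmetric))
print(describe_shape(right_skewed))
```

A small p-value from `normaltest` suggests the sample departs from a normal distribution. With thousands of rows even mild departures can be flagged, so the skewness and kurtosis values are worth reading alongside it.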

#### Significant Earnings Surprise Breakdown

In [None]:
pos_surp_thres = 15
neg_surp_thres = -15

pos_surp = df[(df.perc_change > pos_surp_thres)]
neg_surp = df[(df.perc_change < neg_surp_thres)]
neu_surp = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres)]

x = ["Positive", "Neutral", "Negative"]
surprises = [len(pos_surp), len(neu_surp), len(neg_surp)]

print(len(pos_surp))
print("-------------")
print(len(neg_surp))
print("-------------")
print(len(neu_surp))

In [None]:
# Breakdown of the total number of each type of surprise in the dataset
# Positive: >15% surprise
# Neutral: <15% surprise and >-15% surprise
# Negative: <-15% surprise
plt.bar(x, surprises, color=['green', 'yellow', 'red'], alpha=0.5)
plt.ylabel('Number of Earnings')
plt.title('Earnings Surprise Breakdown')
plt.tight_layout()
plt.savefig('visuals/earn_bar.png', facecolor='white', transparent=False)
plt.show()

The bar chart above shows considerably more positive earnings surprises than negative ones. It might therefore be more lucrative to take only long positions around earnings.
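
The class counts can also be computed in one pass by labelling each row and counting the labels. The sketch below uses made-up surprise values and a hypothetical `surprise_class` column. Unlike the strict inequalities in the cell above, it assigns a surprise of exactly ±15% to the neutral class, which is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up surprise percentages, used only for illustration
toy = pd.DataFrame({"perc_change": [40.0, 18.0, 3.0, -2.0, 15.0, -30.0, 22.0, 0.0]})

# Label each row; anything that is neither > 15 nor < -15 falls through to Neutral
conditions = [toy["perc_change"] > 15, toy["perc_change"] < -15]
toy["surprise_class"] = np.select(conditions, ["Positive", "Negative"], default="Neutral")

counts = toy["surprise_class"].value_counts()
shares = toy["surprise_class"].value_counts(normalize=True)
print(counts)
print(shares)
```

The `shares` series makes any class imbalance explicit, which matters later when judging a model against a baseline that always predicts the most common class.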

#### Earnings Surprise Breakdown Based on Earnings Time

In [None]:
print(df["time"].unique())

In [None]:
pos_bmo = df[(df.perc_change > pos_surp_thres) & (df["time"] == 'bmo')]
pos_amc = df[(df.perc_change > pos_surp_thres) & (df["time"] == 'amc')]
neu_bmo = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["time"] == 'bmo')]
neu_amc = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["time"] == 'amc')]
neg_bmo = df[(df.perc_change < neg_surp_thres) & (df["time"] == 'bmo')]
neg_amc = df[(df.perc_change < neg_surp_thres) & (df["time"] == 'amc')]

In [None]:
x = ["Positive", "Neutral", "Negative"]
surprises_bmo = [len(pos_bmo), len(neu_bmo), len(neg_bmo)]
surprises_amc = [len(pos_amc), len(neu_amc), len(neg_amc)]

In [None]:
ind = np.arange(3) 
width = 0.35       
plt.bar(ind, surprises_bmo, width, label='Before Market Open', color='blue', alpha=0.5)
plt.bar(ind+width, surprises_amc, width,
    label='After Market Close', color='red', alpha=0.5)

plt.ylabel('Number of Earnings')
plt.title('Earnings Surprise Breakdown By Earnings Time')

plt.xticks(ind + width / 2, ('Positive', 'Neutral', 'Negative'))
plt.legend(loc='best')
plt.tight_layout()
plt.savefig('visuals/earn_bar_time.png', facecolor='white', transparent=False)
plt.show()


The chart shows no obvious difference in the mix of earnings surprises between announcements made before market open and those made after market close.
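
One way to put a number on that visual comparison is a chi-square test of independence between announcement time and surprise class. The sketch below is an illustration under stated assumptions: the helper `surprise_independence`, the `surprise_class` column, and the counts are all made up, with the toy counts chosen so that both groups have identical class proportions.

```python
import pandas as pd
from scipy import stats


def surprise_independence(frame, group_col, class_col):
    """Cross-tabulate two label columns and run a chi-square test of independence."""
    table = pd.crosstab(frame[group_col], frame[class_col])
    chi2, p_value, dof, _expected = stats.chi2_contingency(table)
    return table, chi2, p_value, dof


# Made-up data: 100 'bmo' and 200 'amc' announcements with the same class mix
toy = pd.DataFrame({
    "time": ["bmo"] * 100 + ["amc"] * 200,
    "surprise_class": (
        ["Positive"] * 30 + ["Neutral"] * 60 + ["Negative"] * 10
        + ["Positive"] * 60 + ["Neutral"] * 120 + ["Negative"] * 20
    ),
})

table, chi2, p_value, dof = surprise_independence(toy, "time", "surprise_class")
print(table)
print("chi2={:.3f}, p={:.3f}, dof={}".format(chi2, p_value, dof))
```

A large p-value is consistent with the surprise class being independent of announcement time; a small one would suggest the mix differs between the two groups.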

#### Earnings Surprise Breakdown by Day of Week

In [None]:
# Derive the day of week (0 = Monday, 4 = Friday) from the earnings date column
dates = pd.to_datetime(df["date"])
df["dow"] = dates.dt.dayofweek
print(df["dow"])


In [None]:
pos_m = df[(df.perc_change > pos_surp_thres) & (df["dow"] == 0)]
pos_tu = df[(df.perc_change > pos_surp_thres) & (df["dow"] == 1)]
pos_w = df[(df.perc_change > pos_surp_thres) & (df["dow"] == 2)]
pos_th = df[(df.perc_change > pos_surp_thres) & (df["dow"] == 3)]
pos_f = df[(df.perc_change > pos_surp_thres) & (df["dow"] == 4)]

neu_m = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["dow"] == 0)]
neu_tu = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["dow"] == 1)]
neu_w = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["dow"] == 2)]
neu_th = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["dow"] == 3)]
neu_f = df[(df.perc_change < pos_surp_thres) & (df.perc_change > neg_surp_thres) & (df["dow"] == 4)]

neg_m = df[(df.perc_change < neg_surp_thres) & (df["dow"] == 0)]
neg_tu = df[(df.perc_change < neg_surp_thres) & (df["dow"] == 1)]
neg_w = df[(df.perc_change < neg_surp_thres) & (df["dow"] == 2)]
neg_th = df[(df.perc_change < neg_surp_thres) & (df["dow"] == 3)]
neg_f = df[(df.perc_change < neg_surp_thres) & (df["dow"] == 4)]

In [None]:
x = ["Positive", "Neutral", "Negative"]
surprises_m = [len(pos_m), len(neu_m), len(neg_m)]
surprises_tu = [len(pos_tu), len(neu_tu), len(neg_tu)]
surprises_w = [len(pos_w), len(neu_w), len(neg_w)]
surprises_th = [len(pos_th), len(neu_th), len(neg_th)]
surprises_f = [len(pos_f), len(neu_f), len(neg_f)]

In [None]:
ind = np.arange(3) 
width = 0.15
plt.figure(figsize=(8, 4))
plt.bar(ind, surprises_m, width, label='Monday', alpha=0.5)
plt.bar(ind+width, surprises_tu, width, label='Tuesday', alpha=0.5)
plt.bar(ind+(2*width), surprises_w, width, label='Wednesday', alpha=0.5)
plt.bar(ind+(3*width), surprises_th, width, label='Thursday', alpha=0.5)
plt.bar(ind+(4*width), surprises_f, width, label='Friday', alpha=0.5)

plt.ylabel('Number of Earnings')
plt.title('Earnings Surprise Breakdown By Day of Week')

plt.xticks(ind + width*2, ('Positive', 'Neutral', 'Negative'))

plt.legend(loc='best')
plt.tight_layout()
plt.savefig('visuals/earn_bar_dow.png', facecolor='white', transparent=False)
plt.show()
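
The fifteen filtered frames behind this chart can also be produced as a single table with `pd.crosstab`. The following is a sketch on made-up rows; the `surprise_class` column is hypothetical, and a surprise of exactly ±15% is assumed to be neutral.

```python
import numpy as np
import pandas as pd

# Made-up announcements: one Monday, two Tuesday, one Thursday, one Friday
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-04", "2021-01-05", "2021-01-05", "2021-01-07", "2021-01-08"]
    ),
    "perc_change": [20.0, -3.0, -40.0, 16.0, 2.0],
})
toy["dow"] = toy["date"].dt.dayofweek  # 0 = Monday

# Label each row with its surprise class
toy["surprise_class"] = np.select(
    [toy["perc_change"] > 15, toy["perc_change"] < -15],
    ["Positive", "Negative"],
    default="Neutral",
)

# Rows are days of the week, columns are surprise classes
counts = pd.crosstab(toy["dow"], toy["surprise_class"])
print(counts)
```

Calling `counts.T.plot(kind="bar")` on such a table would draw one group of bars per surprise class with one bar per day, similar to the chart above.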

### Feature Engineering

#### Create lagging features

The lagging features derived from earnings data were added within the SQL statement, and the technical indicators already provide lagging features for pricing data. With per-company history captured in those columns, we can now drop `symbol` so the model does not use it as a feature.
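
For readers more familiar with pandas than SQL window functions, `LAG(...) OVER (PARTITION BY symbol ORDER BY date)` corresponds roughly to a grouped `shift`. The sketch below uses a made-up frame with two hypothetical symbols; it is an equivalent illustration, not the code that produced the notebook's features.

```python
import pandas as pd

# Made-up earnings history for two hypothetical symbols
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": ["1/15/21", "4/15/21", "7/15/21", "2/1/21", "5/1/21"],
    "eps": [1.0, 1.2, 0.9, 0.5, 0.7],
})

# Parse the month/day/two-digit-year strings, then order rows within each symbol
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")
toy = toy.sort_values(["symbol", "date"]).reset_index(drop=True)

# shift(1) and shift(2) within each symbol play the role of LAG(eps) and LAG(eps, 2)
by_symbol = toy.groupby("symbol")["eps"]
toy["lastEps"] = by_symbol.shift(1)
toy["last2Eps"] = by_symbol.shift(2)
print(toy)
```

As with the SQL version, the first announcement of each symbol has no previous value, so `lastEps` is null there, and `last2Eps` is null for the first two. That is the source of the null counts checked in the cells below.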

In [None]:
# Null values within one lag
numer = df["lastEps"].isnull().sum()
denom = len(df["perc_change"])

null_perc = numer/denom * 100
print(null_perc)
print(numer)
print(denom)

In [None]:
# Null values within two lags
numer = df["last2Eps"].isnull().sum()
denom = len(df["perc_change"])

null_perc = numer/denom * 100
print(null_perc)
print(numer)
print(denom)

In [None]:
# Drop symbol
df = df.drop(["symbol"], axis=1)
print(df)

#### Create date features

In [None]:
df["month"] = pd.DatetimeIndex(df["date"]).month
print(df["month"])

In [None]:
df["day"] = pd.DatetimeIndex(df["date"]).day
print(df["day"])

In [None]:
df["year"] = pd.DatetimeIndex(df["date"]).year
print(df["year"])

In [None]:
# Drop date
df = df.drop(["date"], axis=1)
print(df)

#### Encode time variable

In [None]:
# Encode announcement time: 0 = before market open (bmo), 1 = after market close (amc)
df["time"] = df["time"].map({"bmo": 0, "amc": 1}).astype(int)

In [None]:
# Drop rows with any remaining nulls (for example, missing lag features)
df = df.dropna()
df.info()

In [None]:
# Drop id and percentage change
# Percentage change is removed because it won't be known at prediction time
df = df.drop(["id", "perc_change"], axis=1)
print(df)

### Split Dataset

In [None]:
# Split dataset: the first remaining column is the target, all other columns are features
X = df.iloc[:, 1:]
y = df.iloc[:, :1]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train.values.ravel())
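
A fitted regressor is usually followed by a check on the held-out split. The sketch below shows one way to do that with mean absolute error and R², using synthetic data so it runs on its own. The names `X_toy`, `y_toy`, and `model` are illustrative; in the notebook the corresponding objects would be `X_test`, `y_test`, and `regressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target, used only so the example is self-contained
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same estimator settings as the notebook cell above
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_tr, y_tr)

# Score predictions on rows the model has not seen
predictions = model.predict(X_te)
mae = mean_absolute_error(y_te, predictions)
r2 = r2_score(y_te, predictions)
print("MAE: {:.3f}".format(mae))
print("R^2: {:.3f}".format(r2))
```

Because earnings data is ordered in time, a random split can place later announcements in the training set and earlier ones in the test set, so these scores may look better than performance on genuinely future announcements.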