## Table of Contents
- [Introduction](#introduction)
- [Data Wrangling](#wrangling)
    - [Gather](#gather)
    - [Assess](#assess)
    - [Clean](#clean)
    - [Analyze](#analyze)
    - [Visualize](#visualize)
- [Conclusions](#conclusions)

In [1]:
PATH = './data/'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

## I) Introduction <a id="introduction"></a>

**Aim:** Analyze the absolute difference and (possibly) the margin of error between forecasted and actual stock market price returns.

I will be analyzing quarterly price returns within the past 20 years for the firms present in the S&P 500 2019 Index.

> At first, I wanted to analyze the forecasted vs. actual price earnings of the S&P in its entirety for the past 20 years. However, because firms continuously enter and leave stock indices every year, comparing annual S&P returns alone would introduce inconsistencies and errors. To address this, I identified two approaches:
- Analyze the historical earnings of *only* the firms present in the S&P 2019 Index
- Keep track of all firms that were present in the S&P for the past 20 years. Keep track of how many times each firm appeared in the Index and for those with the least count, analyze them individually on how they differ from the firms that stayed for longer.


## II) Data Wrangling <a id="wrangling"></a>

To gather the data stored in the `./data` folder, I used Bloomberg Excel functions.

### A) Gather <a id = "gather"></a>
> **APPROACH 1:** Focus on the firms that appear in the 2019 S&P Index and analyze their forecasted vs. actual price earnings for the last 20 years.

To keep the analysis consistent across firms, I group both the forecasted and actual earnings figures by *calendar period* instead of fiscal period. Fiscal periods differ by firm, whereas calendar periods cover the same dates for every firm.
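As a rough illustration of calendar-period grouping, the sketch below maps a date to a label such as `16Q4`, the style that appears in the `Term Forecasted` column later on. The helper name `calendar_quarter_label` and the sample dates are hypothetical; this is not how Bloomberg produces its period labels.

```python
import pandas as pd

def calendar_quarter_label(date):
    """Return a label such as '16Q4' for the calendar quarter containing `date`."""
    ts = pd.Timestamp(date)
    # Two-digit year followed by the calendar quarter (1-4)
    return f"{ts.year % 100:02d}Q{ts.quarter}"

# Hypothetical dates, for illustration only
labels = [calendar_quarter_label(d) for d in ["2016-12-30", "2000-09-29", "2015-03-31"]]
print(labels)  # ['16Q4', '00Q3', '15Q1']
```

Because the label depends only on the date, two firms with different fiscal year ends still land in the same bucket.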

#### Through the Bloomberg Excel functions, I gathered four datasets with different purposes:

- historical forecasted EPS
- historical actual EPS
- historical actual EOD price
- historical forecasted EPS relying on terms

---
Before delving into the data, let's define the terms above:

**EPS**

> EPS stands for ***Earnings Per Share.*** The formal definition of EPS given by Investopedia is this:

Earnings per share is the portion of a company's profit that is allocated to each outstanding share of a common stock, serving as an indicator of the company's financial health.

In other words, EPS is the portion of a company's **net income**, after preferred dividends are paid, that is attributable to each share of common stock. Dividends are profits paid out to a company's shareholders. EPS is one of the most widely used financial measurements because it ***helps determine a stock's worth.*** The higher the EPS, the more profit the company generates per share, and the more it can afford to pay out in dividends.

$$ EPS = \frac{\text{Net Income} - \text{Preferred Dividends}}{\text{Weighted Average Common Shares Outstanding}} $$
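To make the formula concrete, here is a minimal sketch that applies it directly. The function name `earnings_per_share` and the input figures are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def earnings_per_share(net_income, preferred_dividends, weighted_avg_shares):
    """Basic EPS: income available to common shareholders, per common share."""
    if weighted_avg_shares <= 0:
        raise ValueError("weighted_avg_shares must be positive")
    return (net_income - preferred_dividends) / weighted_avg_shares

# Hypothetical figures, for illustration only
example_eps = earnings_per_share(
    net_income=10_000_000,
    preferred_dividends=1_000_000,
    weighted_avg_shares=4_500_000,
)
print(example_eps)  # 2.0
```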

**EOD**

> EOD stands for the ***End of Day*** price. For any given day, the EOD marks the ***price at which the stock was valued*** at the end of the day's trading period.

---

### Now let's summarize the contents of the following DataFrames:
> All DataFrames consist of the 505 firms found in the 2019 S&P Index, with EPS and EOD data spanning 21 calendar years (84 quarters): from January 1999 through December 2019.

**Historic forecasted EPS**
> According to Investopedia, a consensus estimate is normally an average or median of all the forecasts from individual analysts tracking a particular stock. In this case, the consensus estimate is for ***EPS for each firm present in the index as of 2019.*** Forecasted EPS is calculated from ***quarterly earnings,*** usually by each firm's fiscal period. Estimates of quarterly earnings are published at the beginning of each quarterly period.

**Historic actual EPS**
> Unlike forecasted EPS, actual EPS are the real numbers denoting Earnings-per-Share for a singular firm. Historic actual EPS will be compared to forecasted EPS to draw correlations and comparisons.

**Historic actual EOD**
> Though this is not directly related to EPS data, EOD is a useful metric to have when exploring additional questions. The historic actual EOD data records each firm's end-of-day price on the last trading day of each calendar quarter.

**Historic forecasted EPS 3 months prior**

In [4]:
#historic forecasted EPS 
df_eps_fc = pd.read_csv(PATH + 'sp-eps-fc.csv')

#historic actual EPS
df_eps_act = pd.read_csv(PATH + 'sp-eps-act.csv')

#historic actual EOD
df_eod_act = pd.read_csv(PATH + 'sp-eod-act.csv')

#historic forecasted EPS with terms
df_eps_fc_terms = pd.read_csv(PATH + 'sp-eps-fc-terms.csv')

### B) Assess <a id="assess"></a>

In [5]:
dict_dfs = {'eps_fcast' : df_eps_fc,
           'eps_actual' : df_eps_act,
           'eod_actual' : df_eod_act,
           'eps_fcast_terms' : df_eps_fc_terms}
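One way to use a dictionary of DataFrames like the one above is to loop over it and build a single overview table. The sketch below uses small made-up frames in place of the real CSVs; `summarize_frames` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_frames(frames):
    """Return one row per DataFrame: row count, column count, missing cells."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "n_rows": df.shape[0],
            "n_cols": df.shape[1],
            "n_missing": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("name")

# Toy stand-ins for the real datasets (hypothetical values)
toy_frames = {
    "eps_fcast": pd.DataFrame({"AAA": [1.0, np.nan], "BBB": [0.5, 0.7]}),
    "eps_actual": pd.DataFrame({"AAA": [1.1, 0.9], "BBB": [np.nan, np.nan]}),
}
summary = summarize_frames(toy_frames)
print(summary)
```

Passing `dict_dfs` instead of `toy_frames` would give the same overview for the four loaded datasets.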



# CUTOFF

### Historic Forecasted EPS

In [3]:
df_eps_fc.sample(5)

Unnamed: 0,Term Forecasted,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
71,16Q4,0.521,0.919,1.085,1.656,1.197,1.228,0.19,0.646,1.3,...,0.437,0.52,0.704,0.65,1.004,0.649,0.734,2.136,0.521,0.453
47,10Q4,0.599,-0.321,0.544,0.586,,0.482,-0.102,1.291,0.636,...,0.309,0.449,1.631,0.5,,,0.598,1.189,,
64,15Q1,0.407,1.704,2.483,2.596,0.851,0.971,0.018,0.419,1.199,...,0.493,0.61,0.829,0.573,,0.313,0.719,1.583,,0.365
7,00Q4,,,,0.022,,,-0.18,0.479,,...,0.47,0.211,0.647,0.198,,,0.198,,0.758,
6,00Q3,,,,0.031,,,,0.419,,...,0.723,0.189,0.571,0.148,,,0.181,,0.731,


In [4]:
df_eps_fc.shape

(84, 506)

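**Observation:** There are 84 quarterly forecast periods since 1999 and 506 columns. Excluding `Unnamed: 0` and `Term Forecasted`, that leaves 504 firm columns, one fewer than the 505 firms expected.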

> Since there are 4 quarters in a year, 84 quarterly forecast periods equate to 21 years. This is correct since we are analyzing the years from 1999 until the end of 2019.

In [5]:
#largest number of missing values in any single column
df_eps_fc.isna().sum().max()

82

**Observation:** The firm with the most missing data lacks forecasts for 82 of the 84 quarters. Note that this figure is the largest per-column count of missing values, not the number of incomplete rows, so on its own it does not show how many quarterly calendar periods are complete across all firms.
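The difference between "most missing values in one column" and "rows with any missing value" is easy to blur, so here is a small sketch on made-up data that computes both. The frame `toy` and its tickers are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: three quarters, three hypothetical firms
toy = pd.DataFrame({
    "Term Forecasted": ["99Q1", "99Q2", "99Q3"],
    "AAA": [np.nan, np.nan, 1.2],
    "BBB": [0.4, 0.5, 0.6],
    "CCC": [1.0, 2.0, np.nan],
})

# Largest count of missing values found in any one column
max_missing_per_column = toy.isna().sum().max()

# Number of rows (quarters) with at least one missing value
rows_with_any_missing = toy.isna().any(axis=1).sum()

# Number of rows that are complete across every column
complete_rows = len(toy) - rows_with_any_missing

print(max_missing_per_column, rows_with_any_missing, complete_rows)  # 2 3 0
```

Here the worst column is missing 2 values, yet all 3 rows are incomplete, which is why the two quantities should be reported separately.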

In [6]:
#check for rows where all columns are NaN values.
columns_to_check = df_eps_fc.columns[2:]
df_eps_fc[columns_to_check].isnull().all(axis = 1).value_counts()

False    84
dtype: int64

In [7]:
#check for columns where all rows are NaN values.
df_eps_fc[columns_to_check].isnull().all(axis = 0).value_counts()

False    504
dtype: int64

**Observations:**
- No quarterly calendar period is empty of data for all firms.
- No firm is empty of data for all calendar periods.

This means that for historical forecasted EPS, ***no single calendar period is missing data for every firm, and no single firm is missing data for every calendar period.***

### Historic Actual EPS


In [6]:
#generate 5 random samples
df_eps_act.sample(5)

Unnamed: 0,Quarter,Year,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
67,Q4,2015,0.42,5.24,0.75,1.97,0.93,1.71,2.4,0.51,...,0.41,0.52,0.67,0.268807,1.011072,0.636516,0.64,0.624693,0.43,0.04
71,Q4,2016,0.38,0.56,0.84,1.68,0.85,0.66,0.26,0.53,...,0.445745,0.57,0.41,0.46,-3.35,0.278707,0.84,0.347305,0.61,0.31
30,Q3,2006,0.55,0.07,0.56,0.078571,,0.29,-0.17,0.47,...,0.55,0.23,1.79,0.32,2.2,,0.43,0.76,1.45,
7,Q4,2000,0.67,0.31,,0.037143,,0.135,-0.17,0.49,...,0.4,1.47,0.75,0.2,,,0.22,,0.76,
51,Q4,2011,0.83,-3.268657,0.92,1.018571,,0.55,-0.05,1.02,...,0.29,0.61,1.97,0.28688,1.08,0.278671,0.77,0.88,0.24,


In [9]:
df_eps_act.shape

(84, 507)

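**Observation:** There are 84 calendar periods and 507 columns. Excluding `Unnamed: 0`, `Quarter`, and `Year`, that leaves 504 firm columns, one fewer than the 505 firms expected.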

In [10]:
#largest number of missing values in any single column
df_eps_act.isna().sum().max()

84

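**Observation:** At least one column is missing a value in all 84 calendar periods, meaning it contains no data at all. This figure is a per-column maximum, so it does not show how many quarterly calendar periods are incomplete.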

In [11]:
#check for rows where all columns are NaN values.
columns_to_check = df_eps_act.columns
df_eps_act[columns_to_check].isnull().apply(lambda x: all(x), axis = 1).value_counts()

False    84
dtype: int64

In [12]:
#check for columns where all rows are NaN values.
df_eps_act[columns_to_check].isnull().apply(lambda x: all(x), axis = 0).value_counts()

False    506
True       1
dtype: int64

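**Observations:** 

- There is no calendar period that's empty of data for all firms. Note that `columns_to_check` here also includes the identifier columns (`Unnamed: 0`, `Quarter`, `Year`), which are never null, so this check cannot flag a row.
- One column is empty of data for all calendar periods (the single `True` above), so one firm has no actual EPS data at all.

**Therefore, every calendar period has some historical actual EPS data, but one firm has none.**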
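A count of all-NaN columns does not say which columns they are. The sketch below, on made-up data, excludes the identifier columns and then lists the empty firm columns by name. The tickers and the `id_cols` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the actual-EPS data (hypothetical tickers and values)
toy = pd.DataFrame({
    "Quarter": ["Q1", "Q2"],
    "Year": [1999, 1999],
    "AAA UN Equity": [0.3, 0.4],
    "BBB UN Equity": [np.nan, np.nan],
})

# Identifier columns are never null, so leave them out of the check
id_cols = ["Quarter", "Year"]
firm_cols = toy.columns.difference(id_cols)

# True for each firm column whose values are all missing
all_nan = toy[firm_cols].isna().all(axis=0)
empty_firms = all_nan[all_nan].index.tolist()
print(empty_firms)  # ['BBB UN Equity']
```

On the real data, the same pattern would use `df_eps_act` with `Unnamed: 0`, `Quarter`, and `Year` as the identifier columns.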

In [14]:
#count how many rows are duplicated
df_eps_act.duplicated().sum()

0

**Observation:** There are no duplicated rows in `df_eps_act`.

In [None]:
#count number of repeated firm names
df_eod_act.columns.duplicated().sum()

### Historic Actual EOD

In [7]:
df_eod_act.sample(10)

Unnamed: 0,date,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
79,12/31/2018,67.46,32.11,157.46,157.74,92.19,74.4,325.04,72.33,141.01,...,49.27,85.17,68.19,37.21,19.76,66.72,91.92,103.72,40.74,85.54
17,6/30/2003,13.1722,,20.3,1.3657,,16.8232,5.47,19.5843,18.09,...,,25.31,35.91,20.45,27.9003,,10.6276,45.05,50.61,
69,6/30/2016,44.36,28.31,161.63,95.6,61.91,79.32,109.29,39.31,113.29,...,,46.13,93.74,62.04,25.0023,44.65,59.624,120.38,25.13,47.46
25,6/30/2005,15.5102,,43.033,5.2586,,16.7747,8.55,23.45,22.67,...,,25.5,57.47,27.0,36.331,,18.7242,76.17,73.53,
4,3/31/2000,70.0721,,,4.8504,,3.6388,20.25,15.7478,,...,,82.8125,38.9063,9.4583,68.4994,,5.5839,,41.625,
10,9/28/2001,13.1722,,,1.1079,,17.2114,17.47,23.2049,12.75,...,,23.53,39.4,15.3133,20.4181,,7.0503,27.75,53.66,
78,9/28/2018,70.54,41.33,168.33,225.74,94.58,92.22,449.75,73.36,170.2,...,47.21,80.17,85.02,37.74,26.98,79.87,90.91,131.47,50.15,91.56
77,6/29/2018,61.84,37.96,135.7,185.11,92.65,85.27,409.05,60.99,163.59,...,45.68,65.26,82.73,43.77,24.0,67.38,78.22,111.44,52.69,85.19
32,3/30/2007,24.0913,,38.55,13.2729,,25.5927,13.66,26.6988,38.54,...,,25.73,75.45,32.75,44.4983,,20.7663,85.41,84.52,
30,9/29/2006,22.0256,,32.94,11.0043,,21.9296,14.79,23.2347,31.71,...,,21.95,67.1,30.11,40.9943,,18.7134,67.5,79.81,


### All Data

In [27]:
#count how many rows are duplicated
#count number of repeated firm names

**Observation:** There are no duplicated firms in `df_eod_act`.
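One caveat when checking for repeated firm names: `pd.read_csv` renames repeated column names by default (for example `X` and `X.1`), so `columns.duplicated()` on a loaded DataFrame will not reveal them. A minimal sketch of checking the raw header instead, using an in-memory CSV with made-up tickers and assuming no quoted commas in the header:

```python
import io
import pandas as pd

# In-memory CSV with a deliberately repeated column and a repeated row (hypothetical data)
raw_csv = (
    "date,AAA UN Equity,BBB UN Equity,AAA UN Equity\n"
    "3/31/2000,1.0,2.0,1.0\n"
    "3/31/2000,1.0,2.0,1.0\n"
)

# Inspect the header line directly, before pandas renames anything
header = pd.Series(io.StringIO(raw_csv).readline().rstrip("\n").split(","))
duplicated_names = header[header.duplicated()].tolist()

# After loading, repeated names have been renamed, but repeated rows remain visible
df = pd.read_csv(io.StringIO(raw_csv))
duplicated_rows = int(df.duplicated().sum())

print(duplicated_names)                 # ['AAA UN Equity']
print(df.columns.duplicated().sum())    # 0
print(duplicated_rows)                  # 1
```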

### Quality

**Missing Data**

-  N/A
--- 

- firm names across both DataFrames are capitalized


### Tidiness

- both DataFrames need to be merged with firm names transposed into rows
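The tidiness step above could be sketched as follows: reshape each wide DataFrame so that firm names become row values, then merge on period and firm. The column name `term`, the tickers, and the values are hypothetical. The real files use different period columns (`Term Forecasted` versus `Quarter` and `Year`), which would need to be aligned first.

```python
import pandas as pd

# Toy wide frames: one row per period, one column per firm (hypothetical values)
fc_wide = pd.DataFrame({
    "term": ["16Q3", "16Q4"],
    "AAA UN Equity": [1.0, 1.2],
    "BBB UN Equity": [0.5, 0.6],
})
act_wide = pd.DataFrame({
    "term": ["16Q3", "16Q4"],
    "AAA UN Equity": [1.1, 1.0],
    "BBB UN Equity": [0.5, 0.9],
})

def to_long(df, value_name):
    """Turn firm columns into rows: one row per (term, firm) pair."""
    return df.melt(id_vars="term", var_name="firm", value_name=value_name)

# Merge forecast and actual values on period and firm
tidy = to_long(fc_wide, "eps_fc").merge(
    to_long(act_wide, "eps_act"), on=["term", "firm"], how="inner"
)

# Absolute difference between forecasted and actual EPS
tidy["abs_diff"] = (tidy["eps_fc"] - tidy["eps_act"]).abs()
print(tidy)
```

The long layout makes per-firm and per-period comparisons a matter of `groupby` rather than column arithmetic.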

### C) Clean <a id="clean"></a>

## III) Store Data