## Table of Contents
- [Introduction](#introduction)
- [Data Wrangling](#wrangling)
    - [Gather](#gather)
    - [Import](#import)
    - [Assess](#assess)
    - [Clean](#clean)
    - [Store](#store)

## I) Introduction <a id = "introduction"></a>

**Broad question:** How do price forecasts for each firm in the 2019 S&P 500 Index compare to their corresponding actual prices?

**Approach:** Analyze the difference in means between the average forecasted EPS and the average actual EPS for each firm.

I will be analyzing quarterly price returns within the past 20 years for the firms present in the S&P 500 2019 Index.

> At first, I wanted to analyze the forecasted vs. actual price earnings of the S&P in its entirety for the past 20 years. However, considering that firms continuously enter and leave stock indices every year, there would be varying levels of inconsistencies and marginal errors when comparing annual S&P returns alone. To combat this problem, I have isolated these two approaches:
- Analyze the historical earnings of *only* the firms present in the S&P 2019 Index
- Keep track of all firms that were present in the S&P for the past 20 years. Keep track of how many times each firm appeared in the Index and for those with the least count, analyze them individually on how they differ from the firms that stayed for longer.

[TK] Here is a breakdown of my final clean CSV's features, TK.csv.

## II) Data Wrangling <a id="wrangling"></a>

To gather the data depicted under the `./data` folder, I used Bloomberg Excel functions.

### A) Gather <a id = "gather"></a>
> **APPROACH 1:** Focus on the firms that appear in the 2019 S&P Index and analyze their forecasted vs. actual price earnings for the last 20 years.

To ensure consistency in analysis among multiple firms, I divide both the forecasted and actual price earning dates by *calendar period* instead of fiscal period. This is because fiscal period differs by firm whereas calendar period is consistent by dates. 
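As a rough illustration of the calendar-period idea, the sketch below maps dates to calendar quarters with pandas. The dates are hypothetical examples, and the `M/D/YYYY` format is an assumption about how the dates are stored.

```python
import pandas as pd

# hypothetical quarter-end dates (assumed M/D/YYYY format)
dates = pd.Series(['3/31/1999', '6/30/1999', '9/28/2001', '12/31/2018'])

# parse the strings, then map each date to its calendar quarter
calendar_quarters = pd.to_datetime(dates, format='%m/%d/%Y').dt.to_period('Q')
print(calendar_quarters.astype(str).tolist())
# ['1999Q1', '1999Q2', '2001Q3', '2018Q4']
```

Because the quarter is derived from the date itself, every firm lands in the same bucket regardless of when its fiscal year starts.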

#### Through the Bloomberg Excel functions, I gathered four datasets with different purposes:

- historical forecasted EPS
- historical actual EPS
- historical actual EOD price
- historical forecasted EPS relying on terms

---
Before delving into the data, let's define the terms above:

**EPS**

> EPS stands for ***Earnings Per Share.*** The formal definition of EPS given by Investopedia is this:

Earnings per share is the portion of a company's profit that is allocated to each outstanding share of a common stock, serving as an indicator of the company's financial health.

In other words, EPS is the portion of the company's **net income**, after preferred dividends are paid, that is attributable to each share of common stock. Dividends are profits that are paid out to shareholders of the company. EPS is one of the most useful and valuable financial measurements because it ***helps determine a stock's worth.*** The higher the EPS, the more profit the company generates per share, and the more it can pay out in dividends to its shareholders.

$$ EPS = \frac{\text{Net Income} - \text{Preferred Dividends}}{\text{Weighted Average Common Shares Outstanding}} $$
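As a quick worked example of the formula, the sketch below computes EPS for a hypothetical firm. The function name and all figures are made up for illustration and do not come from the Bloomberg data.

```python
def earnings_per_share(net_income, preferred_dividends, weighted_avg_shares):
    """Return (net income - preferred dividends) / weighted average common shares."""
    if weighted_avg_shares <= 0:
        raise ValueError("weighted_avg_shares must be positive")
    return (net_income - preferred_dividends) / weighted_avg_shares

# hypothetical figures: dollars for income and dividends, share count for the denominator
eps_example = earnings_per_share(net_income=10_000_000,
                                 preferred_dividends=1_000_000,
                                 weighted_avg_shares=4_500_000)
print(eps_example)  # 2.0
```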

**EOD**

> EOD stands for the ***End of Day*** price. For any given day, the EOD marks the ***price at which the stock was valued*** at the end of the day's trading period.

In [3]:
PATH = './data/'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

### C) Importing the Data <a id = "import"></a>

### Let's summarize the contents of the following DataFrames as we import them.
> All DataFrames consist of the 505 firms found in the 2019 S&P Index with EPS and EOD data encompassing 20 years: from January 1999 until December 2019.

**Historic forecasted EPS**
> According to Investopedia, a consensus estimate is normally an average or median of all the forecasts from individual analysts tracking a particular stock. In this case, the consensus estimate is for ***EPS for each firm present in the index as of 2019.*** Forecasted EPS is based on ***quarterly earnings,*** usually following each firm's fiscal period. Estimates of quarterly earnings are published at the beginning of each quarterly period.

In [4]:
#historic forecasted EPS 
df_eps_fc = pd.read_csv(PATH + 'sp-eps-fc.csv')

In [5]:
df_eps_fc.head()

Unnamed: 0,Term Forecasted,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
0,1999Q1,,,,0.025,,,-0.1,0.423,,...,0.456,0.09,0.239,0.123,,,0.123,,0.622,
1,1999Q2,,,,0.02,,,-0.13,0.42,,...,0.213,0.09,0.264,0.135,,,0.15,,0.661,
2,1999Q3,,,,0.023,,,-0.11,0.381,,...,0.767,0.099,0.3,0.129,,,0.171,,0.693,
3,1999Q4,,,,0.016,,,,0.432,,...,0.432,0.124,0.386,0.178,,,0.164,,0.667,
4,00Q1,,,,0.032,,,,0.441,,...,0.283,0.143,0.436,0.138,,,0.138,,0.668,


**Historic actual EPS**
> Unlike forecasted EPS, actual EPS is the reported earnings per share for a single firm. Historic actual EPS will be compared to forecasted EPS to draw correlations and comparisons.

In [6]:
#historic actual EPS
df_eps_act = pd.read_csv(PATH + 'sp-eps-act.csv')

In [7]:
df_eps_act.head()

Unnamed: 0,Quarter,Year,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
0,Q1,1999,0.16,0.99,,0.04,,0.0925,-0.03,0.44,...,0.34,0.0,0.21,0.123333,1.64,,0.1725,,0.56,
1,Q2,1999,0.35,1.76,,0.035357,,0.1,-0.145,0.42,...,0.06,0.0975,0.285,0.133333,2.28,,0.29,,0.6,
2,Q3,1999,0.3,1.86,,0.050357,,0.1075,-0.135,0.3,...,0.63,0.0325,0.315,0.13,1.96,,0.32,,0.64,
3,Q4,1999,0.32,1.89,,0.024643,,0.0375,-0.08,0.43,...,0.43,0.13,0.6,0.18,1.72,,0.2375,,0.49,
4,Q1,2000,0.29,0.89,,0.040585,,0.105,-0.045,0.45,...,0.45,0.165,0.5,0.14,-1.48,,0.2025,,-0.33,


**Historic actual EOD**
> Though this is not directly related to EPS data, EOD would be an interesting measure to use when generating visualizations and analyses. Who knows what visuals and conclusions I will arrive at with this measure.

In [8]:
#historic actual EOD
df_eod_act = pd.read_csv(PATH + 'sp-eod-act.csv')

In [9]:
df_eod_act.head()

Unnamed: 0,date,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
0,3/31/1999,,,,1.2835,,8.2934,6.25,20.9504,,...,,20.2813,35.2813,7.75,140.6214,,12.6284,,66.5,
1,6/30/1999,,,,1.654,,6.1859,6.875,20.363,,...,,28.625,38.5625,9.625,155.6057,,9.7297,,63.5,
2,9/30/1999,,,,2.2612,,5.7462,7.75,16.4471,,...,,32.7656,37.9688,7.5833,110.4883,,7.3591,,55.125,
3,12/31/1999,52.0909,,,3.6719,,3.6843,18.375,16.2513,,...,,45.4688,40.2813,7.875,59.7723,,6.9434,,59.1875,
4,3/31/2000,70.0721,,,4.8504,,3.6388,20.25,15.7478,,...,,82.8125,38.9063,9.4583,68.4994,,5.5839,,41.625,


**Historic forecasted EPS 3 months prior**
> Instead of using forecast data collected at the beginning of the fiscal period, this feature contains EPS data projected 3 months before the current fiscal period. This is an interesting metric to see how differently forecasters make their predictions at different times.

In [10]:
#historic forecasted EPS 3-months prior
df_eps_fc_terms = pd.read_csv(PATH + 'sp-eps-fc-terms.csv')

In [11]:
df_eps_fc_terms.head()

Unnamed: 0,Forecast Made,Term Forecasted,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
0,10/1/1999,00Q1,,,,0.03,,,,0.466,...,,0.143,0.398,0.14,,,0.144,,0.777,
1,1/1/2000,00Q2,,,,0.026,,,,0.437,...,,0.168,0.378,0.153,,,0.184,,0.79,
2,4/1/2000,00Q3,,,,0.029,,,,0.414,...,0.84,0.189,0.425,0.149,,,0.218,,0.803,
3,7/1/2000,00Q4,,,,0.032,,,-0.18,0.477,...,0.53,0.211,0.487,0.2,,,0.244,,0.777,
4,9/1/2000,01Q1,,,,0.041,,,-0.09,0.495,...,0.4,0.259,0.475,0.158,,,0.178,,0.75,


### B) Assess <a id = "assess"></a>

> The following DataFrames contain data for each firm across various dates. To account for all firm averages, my goal is to generate a CSV file where each row contains the firm average, with the features as columns.
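To make that target shape concrete, here is a minimal sketch on toy data. The firm names, values, and the `firm_means` helper are hypothetical; the real tables would be `df_eps_fc`, `df_eps_act`, and so on.

```python
import pandas as pd

nan = float('nan')

# toy stand-ins for the forecast and actual EPS tables (values are made up)
toy_fc = pd.DataFrame({'Term Forecasted': ['1999Q1', '1999Q2'],
                       'AAA UN Equity': [1.0, 3.0],
                       'BBB UW Equity': [0.5, nan]})
toy_act = pd.DataFrame({'Quarter': ['Q1', 'Q2'], 'Year': [1999, 1999],
                        'AAA UN Equity': [2.0, 2.0],
                        'BBB UW Equity': [1.0, 2.0]})

def firm_means(df, label_cols):
    """Average each firm column over all periods, skipping NaN values."""
    return df.drop(columns=label_cols).mean()

# one row per firm, one column per feature
firm_avgs = pd.DataFrame({
    'eps_fcast_avg': firm_means(toy_fc, ['Term Forecasted']),
    'eps_actual_avg': firm_means(toy_act, ['Quarter', 'Year']),
})
firm_avgs.index.name = 'firm'
print(firm_avgs)
```

Note that `mean()` skips NaN by default, so a firm with gaps is averaged over fewer periods than a firm with complete data.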

In [12]:
dict_dfs = {'eps_fcast' : df_eps_fc,
           'eps_actual' : df_eps_act,
           'eod_actual' : df_eod_act,
           'eps_fcast_terms' : df_eps_fc_terms}

***CHECK FOR MISSING DATA***

In [13]:
for key, df in dict_dfs.items():
    print(key, df.shape)

eps_fcast (84, 506)
eps_actual (84, 507)
eod_actual (84, 506)
eps_fcast_terms (80, 507)


***We need to make sure the number of firms in each DataFrame is consistent.***

---

**Observation 1:** for `eps_fcast`, there are 505 firms encompassing 84 quarterly fiscal periods since 1999.

> There are 506 columns: 1 column being `Term Forecasted`, the rest firm names.

**Observation 2:** for `eps_actual`, there are 505 firms encompassing 84 quarterly fiscal periods since 1999.

> There are 507 columns: 2 columns being `Quarter` and `Year`, the rest firm names.

**Observation 3:** for `eod_actual`, there are 505 firms encompassing 84 quarterly calendar periods since 1999.

> There are 506 columns: 1 column being `date`, the rest firm names.

**Observation 4:** for `eps_fcast_terms`, there are 505 firms encompassing 80 quarterly calendar periods since 1999.

> There are 507 columns: 2 columns being `Forecast Made` and `Term Forecasted`, the rest firm names.
 
Since there are only 80 quarterly calendar periods, that ***implies an entire year is missing.***

**Observation 5:** For `eps_fcast`, `eps_actual`, and `eod_actual`, since there are 4 quarters in a year, 84 quarterly forecast periods equate to 21 years. This is correct since we are analyzing the years from 1999 until the end of 2019.

### Most importantly, the number of firms across all DataFrames is consistent.

----

***Check which year is missing from `eps_fcast_terms`***.

> Each **Term Forecasted** entry under `eps_fcast_terms` records the year with 2 digits, so Quarter 1 of the year 2000 becomes 00Q1.
- isolate the first 2 characters to get the year
- join a '20' in front of the string so 00 becomes 2000
- list the number of unique values.

In [14]:
#iterate over all unique years in eps_fcast_terms, prepend '20' to each string
list(map(lambda x: '20' + x, df_eps_fc_terms['Term Forecasted'].str[:2].unique()))

['2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019']

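**Observation**: The year 1999 is missing from `df_eps_fc_terms`. This makes sense because each forecast is made roughly three months before the term it targets: the earliest forecast was made in October 1999, the last quarter of 1999, so the first term forecasted is the first quarter of 2000.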
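The same check can be made less manual by converting each label to a quarterly period and comparing against the expected range of years. This is only a sketch: `term_to_period` is a hypothetical helper, the labels are generated rather than read from the CSV, and it assumes two-digit years always mean 20xx.

```python
import pandas as pd

def term_to_period(term):
    """Convert labels such as '00Q1' or '1999Q1' to a quarterly pandas Period.

    Assumes two-digit years refer to 2000-2099, which fits this
    1999-2019 range but is not true in general.
    """
    year, quarter = term.split('Q')
    if len(year) == 2:
        year = '20' + year
    return pd.Period(year + 'Q' + quarter, freq='Q')

# hypothetical labels '00Q1' ... '19Q4' (80 terms, two-digit years)
terms = ['{:02d}Q{}'.format(y, q) for y in range(0, 20) for q in range(1, 5)]
years_present = {term_to_period(t).year for t in terms}

# years we expect to see if the data covered 1999 through 2019
expected_years = set(range(1999, 2020))
print(sorted(expected_years - years_present))  # [1999]
```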

---

***CHECK NULLS***

In [102]:
for key, df in dict_dfs.items():
    #display the total number of NaN values per DataFrame
    print('# NaN in {}: {}'.format(key, df.isna().sum().sum()))

# NaN in eps_fcast: 7055
# NaN in eps_actual: 5021
# NaN in eod_actual: 6921
# NaN in eps_fcast_terms: 7080


**Observation:** All four datasets contain null values.
> In order to combat this, we'll have to look at both the **number of rows** and **number of columns** with missing data, separately. This way, we can isolate which firms and/or time periods contain complete or incomplete data.

In [103]:
#for each DataFrame, find the largest NaN count in any single column,
#i.e. how many time periods the most incomplete column is missing
for key, df in dict_dfs.items():
    max_missing = df.isna().sum().max()
    print('{}: the most incomplete column is missing {} of {} time periods.'.format(key, max_missing, df.shape[0]))

eps_fcast: the most incomplete column is missing 82 of 84 time periods.
eps_actual: the most incomplete column is missing 84 of 84 time periods.
eod_actual: the most incomplete column is missing 83 of 84 time periods.
eps_fcast_terms: the most incomplete column is missing 80 of 80 time periods.
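Note that `df.isna().sum().max()` returns the largest NaN count found in any single column, which is not the same as the number of rows that contain a NaN. A minimal sketch of the different counts side by side, on a toy DataFrame with made-up values:

```python
import pandas as pd

nan = float('nan')
toy = pd.DataFrame({'date': ['3/31/1999', '6/30/1999', '9/30/1999'],
                    'AAA UN Equity': [1.0, nan, 3.0],
                    'BBB UW Equity': [nan, 5.0, 2.0]})

# largest NaN count found in any single column (the most incomplete firm)
max_nan_per_column = toy.isna().sum().max()

# number of rows (time periods) holding at least one NaN
rows_with_nan = toy.isna().any(axis=1).sum()

# number of columns (firms) holding at least one NaN
cols_with_nan = toy.isna().any(axis=0).sum()

print(max_nan_per_column, rows_with_nan, cols_with_nan)  # 1 2 2
```

Here each firm is missing only one period, yet two of the three periods are affected, so the two measures can disagree.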


In [104]:
len(df_eps_fc.columns[df_eps_fc.isnull().any()])

254

In [105]:
#check columns for missing data
for key, df in dict_dfs.items():
    cols_missing = df.columns[df.isnull().any()]
    num_cols_missing = len(cols_missing)
    print('{} has {} firms containing missing data, out of {} total columns.'
         .format(key, num_cols_missing, df.shape[1]))

eps_fcast has 254 firms containing missing data, out of 506 total columns.
eps_actual has 432 firms containing missing data, out of 507 total columns.
eod_actual has 505 firms containing missing data, out of 506 total columns.
eps_fcast_terms has 340 firms containing missing data, out of 507 total columns.


**Observation 1:** In **actual EPS** and **forecasted EPS 3 months prior**, the reported count equals the total number of rows (84 of 84 and 80 of 80), so at least one column in each is missing data for every time period.

> For the other two datasets, **forecasted EPS** and **actual EOD price**, it'd be helpful to isolate the time period ranges in which data is missing.

**Observation 2:** Every dataset contains firms with incomplete data.
> This is expected, as financial history spanning over 20 years will naturally be rife with missing and inaccurate data. The ***good news is that `eps_fcast`, `eps_actual`, and `eps_fcast_terms` have fewer columns with gaps (254, 432, and 340), while in `eod_actual` all 505 firms have at least one missing value.***


**Moving forward, we need to make sure that these inconsistencies won't clash with our analysis.**
> **My approach:** instead of looking at rows and columns ***with*** missing data, we'll be looking at rows and columns that ***are all missing data.***

I figured that if there is some missing data here and there scattered throughout the matrix, then that should not skew our analysis too much.

However, if there is a significant number of rows/columns that are entirely empty, then we ***might have to get ready to drop some dates and firms from our data overall.***

In [106]:
#check for empty rows, return False if row contains at least one non-null value, True if all are null
for key, df in dict_dfs.items():
    #True for a row only when every value in it is null
    num_empty_rows = df.isnull().all(axis = 1).value_counts()
    print(key, '\n', num_empty_rows)

eps_fcast 
 False    84
dtype: int64
eps_actual 
 False    84
dtype: int64
eod_actual 
 False    84
dtype: int64
eps_fcast_terms 
 False    80
dtype: int64


**Observation:** None of the datasets contain empty rows.
> This is good news, since we can rely on the firms' averages per row instead of having to drop or limit time periods.

In [107]:
#check for empty columns, return False if column contains at least one non-null value, True if all are null
for key, df in dict_dfs.items():
    #True for a column only when every value in it is null
    num_empty_cols = df.isnull().all(axis = 0).value_counts()
    print(key, '\n', num_empty_cols)

eps_fcast 
 False    506
dtype: int64
eps_actual 
 False    506
True       1
dtype: int64
eod_actual 
 False    506
dtype: int64
eps_fcast_terms 
 False    506
True       1
dtype: int64


**Observation:** `eps_actual` and `eps_fcast_terms` are the only datasets that have an empty column.
> Let's isolate and look at the singular empty column for both DataFrames.

In [108]:
#create function to return an array of column names containing empty data
def comb_cols(df):
    empty_cols = []
    for column in df:
        if df[column].isnull().all():
            empty_cols.append(column)
            
    return empty_cols

In [109]:
#comb datasets for empty columns
print('In eps_act, the firm {} has no data.'.format(comb_cols(df_eps_act)))
print('In eps_fc_terms, the firm {} has no data.'.format(comb_cols(df_eps_fc_terms)))

In eps_act, the firm ['AMCR UN Equity'] has no data.
In eps_fc_terms, the firm ['AMCR UN Equity'] has no data.


**Observation:** The same firm in both datasets is empty of data.
> Though this is an annoying error to deal with, it still is to our advantage that both datasets ***share one firm*** in common for missing data. This way, we don't have to worry about dropping two entire firms.
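One way this could be handled during cleaning is to drop only the columns that are entirely null. The sketch below uses toy DataFrames with a made-up `EMPTY UN Equity` firm; whether the firm should also be removed from the other datasets for consistency is a separate decision.

```python
import pandas as pd

nan = float('nan')
toy_act = pd.DataFrame({'Quarter': ['Q1', 'Q2'],
                        'AAA UN Equity': [0.1, 0.2],
                        'EMPTY UN Equity': [nan, nan]})
toy_fc_terms = pd.DataFrame({'Term Forecasted': ['00Q1', '00Q2'],
                             'AAA UN Equity': [0.3, nan],
                             'EMPTY UN Equity': [nan, nan]})

# drop only the columns in which every value is NaN;
# columns with some missing values (like AAA in toy_fc_terms) are kept
cleaned = {name: df.dropna(axis=1, how='all')
           for name, df in {'eps_actual': toy_act,
                            'eps_fcast_terms': toy_fc_terms}.items()}

for name, df in cleaned.items():
    print(name, list(df.columns))
```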

---

***CHECK DUPLICATE DATA***

In [111]:
#check for fully duplicated rows (duplicated() compares entire rows, not columns)
for key, df in dict_dfs.items():
    print(key, df.duplicated().sum())

eps_fcast 0
eps_actual 0
eod_actual 0
eps_fcast_terms 0


**Observation:** For all datasets, there are ***no duplicate data.*** This is good news!

**Next, I will check for duplicated firm names.** Although the presence of duplicated firm names will inherently imply duplicated data, sometimes data gets dispersed in weird, unexpected ways, especially when dealing with large datasets.

In [113]:
#check for duplicated firm names
for key, df in dict_dfs.items():
    print(key, df.columns.duplicated().sum())

eps_fcast 0
eps_actual 0
eod_actual 0
eps_fcast_terms 0


**Observation:** For all datasets, there are ***no duplicate firm names.*** This is also good news.
> There is no need to dedupe our data during the cleaning stage.
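One caveat worth keeping in mind: by default `pd.read_csv` renames repeated column names (the second `X` becomes `X.1`), so `df.columns.duplicated()` on a freshly loaded DataFrame can report zero even when the file itself repeats a name. A small sketch on made-up CSV text, reading the header row directly as a cross-check:

```python
import csv
import io
from collections import Counter

import pandas as pd

# toy CSV text with a repeated firm name in the header (made-up data)
raw = "date,AAA UN Equity,AAA UN Equity\n3/31/1999,1.0,2.0\n"

# pandas renames the repeat, so the parsed columns look unique
parsed_cols = list(pd.read_csv(io.StringIO(raw)).columns)
print(parsed_cols)  # ['date', 'AAA UN Equity', 'AAA UN Equity.1']

# reading the header row directly exposes the repeat
header = next(csv.reader(io.StringIO(raw)))
repeated = [name for name, n in Counter(header).items() if n > 1]
print(repeated)  # ['AAA UN Equity']
```

For the real files, the same header check could be run on each CSV under `./data` before relying on the zero counts above.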

***CHECK DATA TYPES***

***REORDER COLUMNS (IF APPLICABLE)***

# Store Data



# CUTOFF

**Observation:** There are 505 firms encompassing 84 calendar periods.

**Observations:** 

- there is no calendar period that's empty of data for all firms.
- there is no firm that's empty of data for all calendar periods.

**Therefore, all calendar periods and firms have data for historical end of day stock price.**

In [14]:
#count duplicated rows
df_eod_act.duplicated().sum()

0

**Observation:** There is no duplicated data among all firms for all calendar periods in `df_eod_act`.

In [None]:
#count number of repeated firm names
df_eod_act.columns.duplicated().sum()

In [7]:
df_eod_act.sample(10)

Unnamed: 0,date,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
79,12/31/2018,67.46,32.11,157.46,157.74,92.19,74.4,325.04,72.33,141.01,...,49.27,85.17,68.19,37.21,19.76,66.72,91.92,103.72,40.74,85.54
17,6/30/2003,13.1722,,20.3,1.3657,,16.8232,5.47,19.5843,18.09,...,,25.31,35.91,20.45,27.9003,,10.6276,45.05,50.61,
69,6/30/2016,44.36,28.31,161.63,95.6,61.91,79.32,109.29,39.31,113.29,...,,46.13,93.74,62.04,25.0023,44.65,59.624,120.38,25.13,47.46
25,6/30/2005,15.5102,,43.033,5.2586,,16.7747,8.55,23.45,22.67,...,,25.5,57.47,27.0,36.331,,18.7242,76.17,73.53,
4,3/31/2000,70.0721,,,4.8504,,3.6388,20.25,15.7478,,...,,82.8125,38.9063,9.4583,68.4994,,5.5839,,41.625,
10,9/28/2001,13.1722,,,1.1079,,17.2114,17.47,23.2049,12.75,...,,23.53,39.4,15.3133,20.4181,,7.0503,27.75,53.66,
78,9/28/2018,70.54,41.33,168.33,225.74,94.58,92.22,449.75,73.36,170.2,...,47.21,80.17,85.02,37.74,26.98,79.87,90.91,131.47,50.15,91.56
77,6/29/2018,61.84,37.96,135.7,185.11,92.65,85.27,409.05,60.99,163.59,...,45.68,65.26,82.73,43.77,24.0,67.38,78.22,111.44,52.69,85.19
32,3/30/2007,24.0913,,38.55,13.2729,,25.5927,13.66,26.6988,38.54,...,,25.73,75.45,32.75,44.4983,,20.7663,85.41,84.52,
30,9/29/2006,22.0256,,32.94,11.0043,,21.9296,14.79,23.2347,31.71,...,,21.95,67.1,30.11,40.9943,,18.7134,67.5,79.81,


**Observation:** There are no duplicated firms in `df_eod_act`.

### Quality

**Missing Data**

-  `df_eps_fc_terms` is missing the year 1999.
- `df_eps_act` and `df_eps_fc_terms` have one firm with empty data (`AMCR UN Equity`).
--- 

- no recorded 20-year averages for each dataset.
- firm names across both DataFrames are capitalized


### Tidiness

- forecast + actual EPS DataFrames need to be merged with firm names transposed into rows
- all firm averages need to be joined in a new DataFrame
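A minimal sketch of the first tidiness step, reshaping so that firm names become rows and then merging forecast and actual EPS. The data is made up, and the shared `term` column is an assumption: the real tables label periods differently (`Term Forecasted` versus `Quarter` and `Year`), so they would need a common key first.

```python
import pandas as pd

nan = float('nan')
toy_fc = pd.DataFrame({'term': ['1999Q1', '1999Q2'],
                       'AAA UN Equity': [0.10, 0.20],
                       'BBB UW Equity': [nan, 0.50]})
toy_act = pd.DataFrame({'term': ['1999Q1', '1999Q2'],
                        'AAA UN Equity': [0.12, 0.18],
                        'BBB UW Equity': [0.40, 0.55]})

def to_long(df, value_name):
    """Turn one-column-per-firm data into one row per (term, firm)."""
    return df.melt(id_vars='term', var_name='firm', value_name=value_name)

# align forecast and actual EPS on the same (term, firm) key
eps_long = to_long(toy_fc, 'eps_fcast').merge(
    to_long(toy_act, 'eps_actual'), on=['term', 'firm'], how='outer')
print(eps_long)
```

An outer merge keeps every (term, firm) pair that appears in either table, so gaps show up as NaN rather than silently disappearing.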

### C) Cleaning <a id = "clean"></a>

### Code


### Test

***CHECK THAT OUR DATA IS CONSISTENT WITH THE ORIGINAL DATA***

# III) Store Data <a id = "store"></a>