## Table of Contents
- [Introduction](#introduction)
- [Data Wrangling](#wrangling)
    - [Gather](#gather)
    - [Assess](#assess)
    - [Clean](#clean)
    - [Analyze](#analyze)
    - [Visualize](#visualize)
- [Conclusions](#conclusions)

In [None]:
PATH = './data/'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

## I) Introduction <a id = "introduction"></a>

**Aim:** Analyze the absolute difference, and possibly the margin of error, between forecasted and actual stock market price returns.

I will be analyzing quarterly price returns within the past 20 years for the firms present in the S&P 500 2019 Index.

> At first, I wanted to analyze the forecasted vs. actual price earnings of the S&P in its entirety for the past 20 years. However, firms enter and leave stock indices every year, so comparing annual S&P returns alone would introduce inconsistencies and marginal errors. To address this problem, I have identified two approaches:
- Analyze the historical earnings of *only* the firms present in the S&P 2019 Index.
- Track every firm that was present in the S&P over the past 20 years, count how many times each firm appeared in the Index, and analyze the firms with the lowest counts individually to see how they differ from the firms that stayed longer.


## II) Data Wrangling <a id="wrangling"></a>

I gathered the data stored under the `./data` folder using Bloomberg Excel functions.

### A) Gather <a id = "gather"></a>
> **APPROACH 1:** Focus on the firms that appear in the 2019 S&P Index and analyze their forecasted vs. actual price earnings for the last 20 years.

To keep the analysis consistent across firms, I group both the forecasted and actual price earnings by *calendar period* rather than fiscal period. Fiscal periods differ from firm to firm, whereas calendar periods cover the same dates for every firm.

**Historic *forecasted* and *actual* data from January 1999 until December 2019**

In [None]:
# historic forecasted EPS
df_eps_fc = pd.read_csv(PATH + 'sp-eps-fc.csv')

# historic actual EPS
df_eps_act = pd.read_csv(PATH + 'sp-eps-act.csv')

# historic actual EOD
df_eod_act = pd.read_csv(PATH + 'sp-eod-act.csv')

# historic forecasted EPS with terms
df_eps_fc_terms = pd.read_csv(PATH + 'sp-eps-fc-terms.csv')
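
The samples in the Assess step show an `Unnamed: 0` column. That usually appears when a CSV was saved together with its row index and then read back without `index_col`. The sketch below assumes this is the case for these files, which I have not confirmed. It uses a small in-memory CSV with made-up values instead of the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the CSV files; values are made up.
csv_text = (
    ",Term Forecasted,A UN Equity,AAL UW Equity\n"
    "0,99Q1,0.1,\n"
    "1,99Q2,0.2,0.3\n"
)

# Default read: the unnamed first column is kept as a data column.
df_default = pd.read_csv(io.StringIO(csv_text))

# Alternative: treat the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

If the real files were loaded this way, the shapes and the `columns[2:]` slices used later would each shift by one column and would need adjusting.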

## B) Assess <a id = "assess"></a>

### Historic Forecasted EPS

In [10]:
df_eps_fc.sample(5)

Unnamed: 0,Term Forecasted,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
17,03Q2,,,0.349,0.001,,,-0.375,0.519,0.239,...,0.23,0.107,0.554,0.26,,,0.23,0.419,0.989,
32,07Q1,0.348,0.303,0.706,0.113,,0.279,-0.168,0.535,0.415,...,0.36,0.298,1.542,0.358,0.84,,0.318,0.933,1.418,
51,11Q4,0.805,-0.994,0.745,1.05,,0.556,-0.014,1.436,0.896,...,0.3,0.522,1.979,0.518,,0.456,0.741,1.34,,
21,04Q2,,,0.467,0.007,,,,0.54,0.268,...,0.166,0.151,0.855,0.295,,,0.261,0.552,1.104,
83,19Q4,0.855,1.269,1.365,2.837,2.2,1.583,1.084,0.947,1.713,...,0.528,1.034,0.747,0.749,1.13,0.9,1.138,2.265,1.102,0.88


In [11]:
df_eps_fc.shape

(84, 506)

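**Observation:** The DataFrame has 84 rows, one per quarterly forecast period since 1999, and 506 columns. Two of those columns (`Unnamed: 0` and `Term Forecasted`) are labels, which leaves 504 firm columns.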

> Since there are 4 quarters in a year, 84 quarterly forecast periods equate to 21 years. This is correct since we are analyzing the years from 1999 until the end of 2019.
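
The 84-period arithmetic can be double-checked by generating the expected quarter labels. This is a minimal sketch that assumes every label follows the two-digit-year `YYQn` style seen in the sample (for example `03Q2` and `19Q4`).

```python
# Build the expected calendar-quarter labels in the "YYQn" style.
first_year, last_year = 1999, 2019  # inclusive range stated in the text

expected_terms = [
    f"{year % 100:02d}Q{quarter}"
    for year in range(first_year, last_year + 1)
    for quarter in range(1, 5)
]

print(len(expected_terms))                    # 21 years * 4 quarters
print(expected_terms[0], expected_terms[-1])  # first and last label
```

Comparing `set(expected_terms)` with the values in `df_eps_fc['Term Forecasted']` would then reveal any missing or unexpected periods, provided the column really uses this format throughout.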

In [12]:
#largest number of missing values in any single column
df_eps_fc.isna().sum().max()

82

**Observation:** `isna().sum()` counts missing values per column, so 82 is the largest number of missing quarters for any single firm. At least one firm has data for ***only two*** of the 84 quarterly calendar periods. This figure alone does not show how many quarterly periods are incomplete across firms.
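
Note that `isna().sum()` counts missing values per column, so taking its `.max()` describes the worst single firm, not the number of incomplete rows. The sketch below contrasts the two measures on a toy frame; the firm names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: rows are quarters, columns are firms (values are made up).
toy = pd.DataFrame(
    {
        "FIRM_A": [1.0, 2.0, 3.0, 4.0],
        "FIRM_B": [np.nan, np.nan, 3.0, 4.0],
        "FIRM_C": [1.0, 2.0, np.nan, 4.0],
    },
    index=["99Q1", "99Q2", "99Q3", "99Q4"],
)

# Per-column view: how many quarters each firm is missing.
missing_per_column = toy.isna().sum()
worst_column_gap = missing_per_column.max()

# Per-row view: how many quarters are missing at least one firm.
rows_with_any_missing = toy.isna().any(axis=1).sum()
complete_rows = len(toy) - rows_with_any_missing

print(worst_column_gap, rows_with_any_missing, complete_rows)
```

Here the worst firm is missing two quarters, while three of the four quarters are incomplete, so the two numbers should not be read interchangeably.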

In [13]:
#check for rows (periods) where every firm column is NaN.
columns_to_check = df_eps_fc.columns[2:]
df_eps_fc[columns_to_check].isnull().all(axis=1).value_counts()

False    84
dtype: int64

In [14]:
#check for columns (firms) where every row is NaN.
df_eps_fc[columns_to_check].isnull().all(axis=0).value_counts()

False    504
dtype: int64

**Observations:**
- No quarterly calendar period is empty of data for all firms.
- No firm is empty of data for all calendar periods.

This means that for historical forecasted EPS, ***no single calendar period is missing data for every firm, and no single firm is missing data for every calendar period.***
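
If either check had found something, the next step would be to see which periods or firms are empty. The sketch below returns their names instead of counts. It runs on a toy frame with made-up firms, where one period and one firm are deliberately empty.

```python
import numpy as np
import pandas as pd

# Toy frame with one label column and two hypothetical firms.
toy = pd.DataFrame(
    {
        "Term Forecasted": ["99Q1", "99Q2", "99Q3"],
        "FIRM_A": [0.5, np.nan, 0.7],
        "FIRM_B": [np.nan, np.nan, np.nan],  # no data in any period
    }
)

firm_columns = toy.columns[1:]  # skip the label column
is_missing = toy[firm_columns].isna()

# Periods in which every firm is missing.
empty_periods = toy.loc[is_missing.all(axis=1), "Term Forecasted"].tolist()

# Firms that are missing in every period.
empty_firms = is_missing.columns[is_missing.all(axis=0)].tolist()

print(empty_periods, empty_firms)
```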

### Historic Actual EPS


In [15]:
#generate 5 random samples
df_eps_act.sample(5)

Unnamed: 0,Quarter,Year,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
50,Q3,2011,0.95,-0.48,1.43,1.127143,,0.67,-0.02,0.19,...,0.69,0.59,2.13,0.43,0.92,0.42,0.82,1.02,0.35,
79,Q4,2018,0.61,0.69,0.74,2.94,-1.23,1.08,0.83,0.37,...,0.42,0.57,1.41,0.01,0.56,1.25139,1.07,-4.42,1.14,0.72
47,Q4,2010,0.67,-0.29,0.58,0.672857,,0.51,0.02,0.93,...,0.29,0.54,1.86,0.48,0.48,0.52546,0.58,0.18,-0.62,
38,Q3,2008,0.47,0.120155,0.59,0.182857,,-0.34,-0.26,0.7,...,0.51,0.36,2.89,0.44,1.2,,0.6,0.96,0.31,
29,Q2,2006,0.27,1.44,0.6,0.07,,0.31,-0.11,0.4,...,0.24,0.25,1.74,0.38,1.08,,0.355,0.82,1.37,


In [18]:
df_eps_act.shape

(84, 507)

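**Observation:** The DataFrame has 84 calendar periods and 507 columns. Three of those columns (`Unnamed: 0`, `Quarter`, and `Year`) are labels, which leaves 504 firm columns.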
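
The column count reported by `shape` includes label columns such as `Quarter` and `Year`, so it helps to count the firm columns explicitly. The sketch below assumes every firm column name ends in ` Equity` and no label column does. That matches the sample above but is not verified for the full file.

```python
import pandas as pd

# Hypothetical header mirroring the layout seen in the sample.
columns = [
    "Unnamed: 0", "Quarter", "Year",
    "A UN Equity", "AAL UW Equity", "AAP UN Equity",
]
toy = pd.DataFrame(columns=columns)

# Assumption: firm columns end with " Equity"; everything else is a label.
firm_columns = [name for name in toy.columns if name.endswith(" Equity")]
label_columns = [name for name in toy.columns if name not in firm_columns]

print(len(firm_columns), label_columns)
```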

In [1]:
#largest number of missing values in any single column
df_eps_act.isna().sum().max()

NameError: name 'df_eps_act' is not defined

**Observation:** Every quarterly calendar period is missing data for at least one firm.

### Historic Actual EOD

In [12]:
#check for rows (periods) where every column is NaN.
columns_to_check = df_eod.columns
df_eod[columns_to_check].isnull().all(axis=1).value_counts()

False    84
dtype: int64

In [13]:
#check for columns where every row is NaN.
df_eod[columns_to_check].isnull().all(axis=0).value_counts()

False    506
dtype: int64

**Observations:**

- No calendar period is empty of data for all firms.
- No firm is empty of data for all calendar periods.

**Therefore, every calendar period and every firm has at least some historical end-of-day stock price data.**

In [14]:
#count how many rows are exact duplicates of an earlier row
df_eod.duplicated().sum()

0

**Observation:** No row (calendar period) in `df_eod` is an exact duplicate of another.

In [15]:
#count number of repeated firm names
df_eod.columns.duplicated().sum()

0

**Observation:** There are no duplicated firms in `df_eod`.
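
One caveat about this check: by default, `pd.read_csv` renames repeated header names by appending a suffix such as `.1`, so `columns.duplicated()` can report zero even when the file itself repeats a firm. The sketch below shows this with made-up CSV text and then inspects the raw header directly.

```python
import csv
import io
import pandas as pd

# Hypothetical CSV text with a repeated firm name in the header.
csv_text = (
    "Term,AAA UN Equity,AAA UN Equity,BBB UW Equity\n"
    "99Q1,1.0,1.1,2.0\n"
)

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())             # the repeat is renamed on load
print(df.columns.duplicated().sum())   # so this check finds nothing

# Inspect the raw header instead, before pandas renames anything.
raw_header = next(csv.reader(io.StringIO(csv_text)))
raw_duplicates = sorted(
    {name for name in raw_header if raw_header.count(name) > 1}
)
print(raw_duplicates)
```

For the real files, the same idea would mean reading only the first line of each CSV, or looking for column names that end in a numeric suffix like `.1`.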

### Quality

**Missing Data**

-  N/A
--- 

- firm names across both DataFrames are capitalized


### Tidiness

- both DataFrames need to be merged with firm names transposed into rows
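
One possible way to do this reshaping is to melt each wide table into one row per firm and period, then merge on those two keys. The sketch below uses toy frames with made-up values. The `term`, `firm`, `eps_forecast`, `eps_actual`, and `abs_diff` names are my own placeholders, and building the period label from `Year` and `Quarter` assumes the forecast labels follow the `YYQn` style.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the wide forecast and actual tables (made-up values).
fc_wide = pd.DataFrame({
    "Term Forecasted": ["19Q3", "19Q4"],
    "AAA UN Equity": [1.0, 1.2],
    "BBB UW Equity": [0.5, np.nan],
})
act_wide = pd.DataFrame({
    "Quarter": ["Q3", "Q4"],
    "Year": [2019, 2019],
    "AAA UN Equity": [0.9, 1.5],
    "BBB UW Equity": [0.4, 0.6],
})

# Give the actuals the same "YYQn" period label the forecasts use.
act_wide["term"] = act_wide["Year"].astype(str).str[-2:] + act_wide["Quarter"]

# Wide to long: firm names move from column headers into rows.
fc_long = fc_wide.rename(columns={"Term Forecasted": "term"}).melt(
    id_vars=["term"], var_name="firm", value_name="eps_forecast"
)
act_long = act_wide.drop(columns=["Quarter", "Year"]).melt(
    id_vars=["term"], var_name="firm", value_name="eps_actual"
)

# One row per (firm, term) with forecast and actual side by side.
tidy = fc_long.merge(act_long, on=["term", "firm"], how="outer")
tidy["abs_diff"] = (tidy["eps_forecast"] - tidy["eps_actual"]).abs()

print(tidy)
```

The outer merge keeps firm and period pairs that appear in only one table, and a missing forecast or actual value leaves `abs_diff` as NaN.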

## C) Clean <a id = "clean"></a>

# III) Store Data

# IV) Explore Data

## Univariate

## Bivariate

## Multivariate

# V) Visualize Data