## Table of Contents
- [Introduction](#introduction)
- [Data Wrangling](#wrangling)
    - [Gather](#gather)
    - [Assess](#assess)
    - [Clean](#clean)
    - [Analyze](#analyze)
    - [Visualize](#visualize)
- [Conclusions](#conclusions)

In [1]:
PATH = './data/'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

## I) Introduction <a id="introduction"></a>

**Aim:** Analyze the absolute difference, and possibly the margin of error, between forecasted and actual stock market price returns.

I will be analyzing quarterly price returns within the past 20 years for the firms present in the S&P 500 2019 Index.
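As a rough illustration of the two quantities named in the aim, the sketch below computes an absolute difference and a relative error on made-up numbers. Treating "margin of error" as the absolute difference scaled by the actual value is an assumption, and the series values are purely illustrative, not taken from the data.

```python
import pandas as pd

# Hypothetical forecast and actual values for one firm (illustrative only)
periods = ["05Q4", "11Q1", "15Q3"]
forecast = pd.Series([0.50, 1.20, -0.10], index=periods)
actual = pd.Series([0.45, 1.50, -0.20], index=periods)

# Absolute difference between forecast and actual, per period
abs_error = (forecast - actual).abs()

# Relative error: absolute difference scaled by the size of the actual value
# (one possible reading of "margin of error"; an assumption, not the
# notebook's definition)
rel_error = abs_error / actual.abs()

print(pd.DataFrame({"abs_error": abs_error, "rel_error": rel_error}))
```

Scaling by the actual value makes errors comparable across firms with very different price or earnings levels, but it is undefined when the actual value is zero.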

> At first, I wanted to analyze the forecasted vs. actual price earnings of the S&P as a whole for the past 20 years. However, because firms enter and leave stock indices every year, comparing annual S&P returns alone would introduce inconsistencies and errors. To address this problem, I have identified two approaches:
- Analyze the historical earnings of *only* the firms present in the S&P 2019 Index
- Keep track of all firms that were present in the S&P for the past 20 years. Keep track of how many times each firm appeared in the Index and for those with the least count, analyze them individually on how they differ from the firms that stayed for longer.
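The second approach amounts to counting index appearances per firm. A minimal sketch, assuming membership is available as a mapping from year to a set of tickers; the `membership` structure and the tickers are hypothetical placeholders, not data from `./data`:

```python
from collections import Counter

# Hypothetical index membership by year (tickers are placeholders)
membership = {
    2017: {"AAA", "BBB", "CCC"},
    2018: {"AAA", "BBB", "DDD"},
    2019: {"AAA", "DDD", "EEE"},
}

# Count the number of years in which each firm appears in the index
appearances = Counter(
    ticker for members in membership.values() for ticker in members
)

# Firms with the fewest appearances are candidates for individual analysis
min_count = min(appearances.values())
least_present = sorted(t for t, n in appearances.items() if n == min_count)

print(appearances.most_common())
print(least_present)
```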


## II) Data Wrangling <a id="wrangling"></a>

To gather the data depicted under the `./data` folder, I used Bloomberg Excel functions.

### A) Gather <a id = "gather"></a>
> **APPROACH 1:** Focus on the firms that appear in the 2019 S&P Index and analyze their forecasted vs. actual price earnings for the last 20 years.

To keep the analysis consistent across firms, I group both the forecasted and actual figures by *calendar period* rather than fiscal period. Fiscal periods differ by firm, whereas calendar periods cover the same dates for every firm.
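To make the calendar-period idea concrete, the sketch below maps arbitrary dates to calendar-quarter labels in the `YYQn` style that appears in the `Term Forecasted` column (for example `05Q4`). The helper `calendar_quarter_label` is hypothetical and the dates are made up; this is not how the Bloomberg export was produced.

```python
import pandas as pd

def calendar_quarter_label(dates: pd.Series) -> pd.Series:
    """Return labels such as '05Q4' for a Series of datetimes."""
    # Two-digit year, zero-padded
    year = (dates.dt.year % 100).astype(str).str.zfill(2)
    # Calendar quarter (1-4), independent of any firm's fiscal year
    quarter = dates.dt.quarter.astype(str)
    return year + "Q" + quarter

# Illustrative dates only
dates = pd.to_datetime(pd.Series(["2005-11-15", "2011-02-01", "2015-09-30"]))
labels = calendar_quarter_label(dates)
print(labels.tolist())
```

Because the label depends only on the calendar date, two firms with different fiscal year ends still land in the same bucket.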

**Historical data from January 1999 until December 2019**

In [2]:
#historic forecasted EPS
df_fc = pd.read_csv(PATH + 'sp-fc.csv')

#historic end of day price
df_eod = pd.read_csv(PATH + 'sp-eod.csv')

## B) Assess <a id="assess"></a>

### Forecasted Historical EPS



In [3]:
df_fc.sample(5)

Unnamed: 0,Forecast Made,Term Forecasted,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
23,7/1/2005,05Q4,0.36,,0.356,0.046,,0.24,0.0,0.742,...,0.305,0.202,1.128,0.362,,,0.409,0.823,1.308,
44,10/1/2010,11Q1,0.51,-1.1,1.403,0.702,,0.552,-0.14,0.966,...,0.435,0.533,1.471,0.466,0.887,,0.632,1.133,-0.111,
62,4/1/2015,15Q3,0.418,3.182,2.14,1.671,1.09,1.128,0.036,0.569,...,0.803,0.609,0.994,0.643,1.052,0.52,1.081,1.704,0.417,0.429
13,1/1/2003,03Q2,,,0.32,0.003,,,-0.375,0.539,...,0.3,0.107,0.493,0.26,,,0.242,0.389,0.984,
59,7/1/2014,14Q4,0.937,1.178,1.447,1.334,0.892,0.964,0.063,0.697,...,0.305,0.555,1.906,0.649,1.244,0.629,0.999,1.752,0.454,0.391


In [4]:
df_fc.shape

(80, 507)

**Observation:** The 507 columns consist of two date columns plus 505 firms, and the 80 rows are quarterly forecast periods since 1999.

In [5]:
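#maximum number of missing values in any single column
df_fc.isna().sum().max()

80

**Observation:** `isna().sum()` counts missing values per column, so a maximum of 80 (equal to the number of rows) means at least one firm column has no forecast in any of the 80 quarterly periods. This figure is a per-column count; it is not the number of rows that contain missing data.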
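Per-column and per-row missing counts answer different questions, and the two are easy to confuse. A minimal sketch on a toy frame (firm names and values are made up) showing that they can disagree:

```python
import numpy as np
import pandas as pd

# Toy frame: each firm is missing exactly one value, in a different row
toy = pd.DataFrame({
    "FIRM_A": [np.nan, 2.0, 3.0],
    "FIRM_B": [1.0, np.nan, 3.0],
    "FIRM_C": [1.0, 2.0, np.nan],
})

# Missing values counted per column, then the largest such count
missing_per_column = toy.isna().sum()
max_missing_in_a_column = missing_per_column.max()

# Number of rows that contain at least one missing value
rows_with_any_missing = toy.isna().any(axis=1).sum()

print(max_missing_in_a_column, rows_with_any_missing)
```

Here no column is missing more than one value, yet every row is incomplete, so the first number says nothing about how many rows are affected.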

In [12]:
#check for rows where all columns are NaN values.
columns_to_check = df_fc.columns[2:]
df_fc[columns_to_check].isnull().apply(lambda x: all(x), axis = 1).value_counts()

False    80
dtype: int64

In [13]:
#check for columns where all rows are NaN values (note: this cell queries df_eod, not df_fc).
df_eod[columns_to_check].isnull().apply(lambda x: all(x), axis = 0).value_counts()

False    505
dtype: int64

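**Observations:**
- No quarterly calendar period in `df_fc` is missing forecasts for every firm.
- The column-wise check was run on `df_eod` rather than `df_fc`, so it only shows that no firm is missing prices for every calendar period. It does not establish the same for the forecasts, and the per-column maximum of 80 missing values reported earlier suggests that at least one firm has no forecast at all.

***In short, no calendar period is missing forecasts for every firm, but whether every firm has at least one forecast still needs to be checked on `df_fc`.***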
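The row-wise and column-wise "entirely empty" checks can also be written with `DataFrame.all`, which gives the same result as `apply(lambda x: all(x), ...)` on a boolean frame. A minimal sketch on a toy frame with made-up firm names; both checks are deliberately run on the same frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FIRM_A": [1.0, np.nan],
    "FIRM_B": [np.nan, np.nan],
})

# True for rows in which every firm is missing
empty_rows = toy.isna().all(axis=1)

# True for columns (firms) in which every period is missing
empty_cols = toy.isna().all(axis=0)

print(empty_rows.value_counts())
print(empty_cols.value_counts())
```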

### Historic Stock Returns

In [10]:
#draw 5 random rows
df_eod.sample(5)

Unnamed: 0,date,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
34,9/28/2007,26.3724,,33.56,21.9343,,22.665,12.43,25.6557,40.25,...,,26.14,92.56,41.64,45.6838,,24.3256,80.99,68.67,
59,12/31/2013,40.8958,25.25,110.68,80.1586,52.81,70.31,26.74,38.33,82.22,...,,45.92,101.2,48.48,32.063,34.6,54.3677,93.19,29.96,32.69
46,9/30/2010,23.8624,,58.68,40.5357,,30.66,10.61,24.9954,42.49,...,,26.61,61.79,31.97,27.268,,33.1196,52.33,21.36,
50,9/30/2011,22.3465,,58.1,54.4543,,37.27,11.03,24.4691,52.68,...,,27.44,72.63,30.69,18.3631,,35.5141,53.5,14.07,
60,3/31/2014,39.9877,36.6,126.5,76.6771,51.4,65.59,26.04,38.51,79.72,...,,54.27,97.68,46.04,29.7709,36.42,54.2095,94.58,30.98,28.94


In [15]:
df_eod.shape

(84, 506)

**Observation:** There are 505 firms encompassing 84 calendar periods.

In [16]:
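#maximum number of missing values in any single column
df_eod.isna().sum().max()

83

**Observation:** `isna().sum()` counts missing values per column, so the sparsest firm column is missing 83 of its 84 values and has a price in only one calendar period. This is a per-column count, not the number of rows with missing data.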

In [57]:
#isolate row with no missing data
df_eod[~(df_eod.isna().any(axis=1))]

Unnamed: 0,date,A UN Equity,AAL UW Equity,AAP UN Equity,AAPL UW Equity,ABBV UN Equity,ABC UN Equity,ABMD UW Equity,ABT UN Equity,ACN UN Equity,...,XEL UW Equity,XLNX UW Equity,XOM UN Equity,XRAY UW Equity,XRX UN Equity,XYL UN Equity,YUM UN Equity,ZBH UN Equity,ZION UW Equity,ZTS UN Equity
82,9/30/2019,76.63,26.97,165.4,223.97,75.72,82.33,177.89,83.67,192.35,...,64.89,95.9,70.61,53.31,29.91,79.62,113.43,137.27,44.52,124.59


**Observation:** September 30, 2019 is the only calendar period in the past 20 years with end-of-day stock prices for every firm in the 2019 S&P Index.
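To see why so few periods are complete, it can help to count how many firms have a price in each period and to find each firm's first period with data. A minimal sketch on a toy frame; the firm names, dates, and prices are made up, and only the `date` column name mirrors `df_eod`:

```python
import numpy as np
import pandas as pd

toy_eod = pd.DataFrame({
    "date": ["6/30/2019", "9/30/2019", "12/31/2019"],
    "FIRM_A": [10.0, 11.0, 12.0],
    "FIRM_B": [np.nan, 20.0, np.nan],
})
prices = toy_eod.set_index("date")

# Number of firms with a price in each calendar period
firms_with_price = prices.notna().sum(axis=1)

# First period in which each firm has a price
first_valid = prices.apply(pd.Series.first_valid_index)

# Periods with a price for every firm
complete_rows = prices.dropna().index

print(firms_with_price)
print(first_valid)
print(list(complete_rows))
```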

In [6]:
#check for rows where all columns are NaN values.
columns_to_check = df_eod.columns
df_eod[columns_to_check].isnull().apply(lambda x: all(x), axis = 1).value_counts()

False    84
dtype: int64

In [94]:
#check for columns where all rows are NaN values.
df_eod[columns_to_check].isnull().apply(lambda x: all(x), axis = 0).value_counts()

False    506
dtype: int64

**Observations:**

- No calendar period is missing data for every firm.
- No firm is missing data for every calendar period.

**Therefore, every calendar period and every firm has at least some historical end-of-day price data.**
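Since the stated aim concerns price returns while `df_eod` holds end-of-day prices, one way to derive quarterly returns is a period-over-period percentage change. This is a sketch under assumptions: simple (not log) returns, rows already sorted by date, and made-up firm names and prices.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {
        "FIRM_A": [100.0, 110.0, 99.0],
        "FIRM_B": [np.nan, 50.0, 55.0],
    },
    index=pd.to_datetime(["2019-03-29", "2019-06-28", "2019-09-30"]),
)

# Simple quarter-over-quarter return; fill_method=None leaves gaps as NaN
# instead of carrying the last known price forward
returns = prices.pct_change(fill_method=None)

print(returns)
```

Leaving missing prices unfilled means a firm's first return appears only once two consecutive quarters have prices.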

In [97]:
#count duplicated rows
df_eod.duplicated().sum()

0

**Observation:** There are no duplicated rows in `df_eod`.

In [117]:
#count number of repeated firm names
df_eod.columns.duplicated().sum()

0

**Observation:** There are no duplicated firms in `df_eod`.

### Quality

**Missing Data**

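- in `df_eod`, only the September 30, 2019 row has a price for every firm; the other 83 rows each contain at least one missing value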
--- 

- firm names across both DataFrames are capitalized


### Tidiness

- both DataFrames need to be merged with firm names transposed into rows
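The reshaping described above could look roughly like the following: melt each wide frame so that firms become rows, then merge on period and firm. This is a sketch on toy frames, not the cleaning step itself. The `period` column on the price side is an assumption (the real `df_eod` has a `date` column that would first need mapping to a calendar quarter), and the output column names `firm`, `forecast_eps`, and `eod_price` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy wide frames: one column per firm (names and values are made up)
fc_wide = pd.DataFrame({
    "Term Forecasted": ["19Q2", "19Q3"],
    "FIRM_A": [1.0, 1.1],
    "FIRM_B": [np.nan, 2.0],
})
eod_wide = pd.DataFrame({
    "period": ["19Q2", "19Q3"],
    "FIRM_A": [10.0, 11.0],
    "FIRM_B": [np.nan, 20.0],
})

# Wide -> long: one row per (period, firm)
fc_long = fc_wide.melt(
    id_vars="Term Forecasted", var_name="firm", value_name="forecast_eps"
).rename(columns={"Term Forecasted": "period"})
eod_long = eod_wide.melt(
    id_vars="period", var_name="firm", value_name="eod_price"
)

# Combine forecasts and prices on the shared keys
tidy = fc_long.merge(eod_long, on=["period", "firm"], how="inner")

print(tidy)
```

In the long layout each observation is one firm in one period, which makes per-firm comparisons of forecast and actual values a plain column operation.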

## C) Cleaning <a id="clean"></a>

## III) Store Data

## IV) Explore Data

### Univariate

### Bivariate

### Multivariate

## V) Visualize Data <a id="visualize"></a>