# Introducing SIREN
**A tool for predicting pullbacks in equity markets**

## Problem statement
XX

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno # Visualising our missing values
import pickle # For exporting our dataset

from pandas_profiling import ProfileReport # Automating the EDA process
from datetime import timedelta

# Custom function(s)
import SIREN_func

In [3]:
# Ensuring our notebook remains tidy
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Customising aesthetics
sns.set_style('darkgrid')
sns.set_palette('icefire')

## Data collection
The bulk of our data was sourced from Bloomberg via its Excel API. We will loop over every sheet in the workbook and store each one in a dictionary for later use.

### Importing dataset(s)
* `spx_fundamentals.xlsx`
* `spx_eri.xlsx`

In [5]:
# Importing our dataset(s)
xlsx = pd.ExcelFile('../data/spx_fundamentals.xlsx')

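# Reading all sheets into a dictionary of dataframes keyed by sheet name
# (despite its name, `empty_list` is a dict)
empty_list = {}
for i in xlsx.sheet_names:
    # Reuse the already-opened ExcelFile rather than re-opening the workbook for each sheet
    empty_list[i] = pd.read_excel(xlsx, sheet_name=i, skiprows=11, parse_dates=['date'])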

In [6]:
# Preview dictionary of dataframes
empty_list.keys()

dict_keys(['econ_sur', 'usd', 'epu', 'finc', 'pe', 'pb', 'eq_indices', 'como', 'credit', 'pct52w', 'vol', 'aaii', 'us_cftc', 'put_call', 'us_yields', 'eu_yields', 'eurdollar', 'fra_ois'])

## Exploring our data
With the help of a handy package (`pandas_profiling`), we can carry out extensive EDA with just a few lines of code. We could loop over every sub-dataset, but it is more useful to examine each one individually so that we can check for multicollinearity within it.

In [7]:
# Exploring Citi Economic surprise indices
cesi_profile = ProfileReport(empty_list['econ_sur'].set_index('date'), title="Citi Economic Surprise Indices", explorative=True)
cesi_profile.to_widgets()

**Thoughts**: US economic surprises are highly correlated with the Euro area, UK, China and global indices. We could technically leave Japan in, but we posit that it is unlikely to have a significant impact on the final model.

**Action plan**: Keep `cesiusd` and drop the rest.
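
The correlation check behind this kind of decision can also be done directly in pandas. The cell below is a minimal sketch on synthetic data, not the Bloomberg series: `highly_correlated_pairs`, the 0.9 threshold and the `series_*` column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose absolute Pearson r exceeds threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'series_a': base,
    'series_b': base + rng.normal(scale=0.05, size=200),  # near-duplicate of series_a
    'series_c': rng.normal(size=200),                      # independent of the others
})

flagged = highly_correlated_pairs(demo)
print(flagged)
```

Only one column from each flagged pair would normally be kept, which mirrors the action plan above.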

In [8]:
# Exploring USD indices
usd_profile = ProfileReport(empty_list['usd'].set_index('date'), title="USD Indices", explorative=True)
usd_profile.to_widgets()

**Thoughts**: No-brainer. We're seeing significant multicollinearity here. There also appear to be repeated values for `usd_dxy` and `usd_bbdxy`.

**Action plan**: Keep the most diverse index (`usd_twi`) and drop the rest.

In [9]:
# Exploring 12-month forward Price-earnings ratios
pe_profile = ProfileReport(empty_list['pe'].set_index('date'), title="PE ratios", explorative=True)
pe_profile.to_widgets()

**Thoughts**: XX

**Action plan**: XX

## Creation of custom functions
We will import our custom .py module (**SIREN_func**), which provides the following functions:
* `eda_clean`: Prints a quick snapshot of a dataset (shape, null values, duplicate rows, column names and types)
* `derive_yield_curves`: Calculates the 30y10ys, 30y5ys, 30y2ys, 30y3ms, 10y5ys, 10y2ys, 10y3ms, 5y2ys, 5y3ms and 2y3ms curves for the US and Euro-area regions
* `fix_credit`: Standardises credit spreads in percentage points (pp) and calculates the spread between US high-yield and investment-grade bonds
* `fix_cftc`: Derives CFTC net non-commercial positions as a % of total open interest
* `eri_diff`: Derives earnings revision indices and their rolling changes across different horizons (4- and 13-week)
* `roll_diff`: Calculates rolling differences across different horizons (1-, 4-, 13- and 26-week)
* `lag_roll_pct_chg`: Lags rolling percentage changes for various equity indices (4-week)
* `roll_pct_chg`: Calculates rolling percentage changes across different horizons (1-, 4-, 13- and 26-week)
* `adjust_dates_only`: Standardises dates so that the dataframes can be merged later
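
The ten curves listed for `derive_yield_curves` are every pairwise combination of five tenors (30y, 10y, 5y, 2y, 3m). The source of `SIREN_func` is not shown here, so the cell below is only a sketch of how such spreads could be derived: the function name `yield_curve_spreads` and the `us_30y`-style column names are assumptions.

```python
from itertools import combinations

import pandas as pd

def yield_curve_spreads(df, tenors, prefix):
    """Add a long-minus-short spread column for every tenor pair.

    `tenors` must be ordered from longest to shortest maturity.
    """
    out = df.copy()
    for long_t, short_t in combinations(tenors, 2):
        out[f'{prefix}_{long_t}{short_t}s'] = out[f'{prefix}_{long_t}'] - out[f'{prefix}_{short_t}']
    return out

# Hypothetical yields in percent (illustrative values only)
demo = pd.DataFrame({
    'us_30y': [3.0, 3.1],
    'us_10y': [2.5, 2.7],
    'us_5y': [2.0, 2.4],
    'us_2y': [1.5, 2.2],
    'us_3m': [1.0, 2.0],
})

curves = yield_curve_spreads(demo, ['30y', '10y', '5y', '2y', '3m'], 'us')
print(curves.filter(like='s').columns.tolist())
```

Using `itertools.combinations` keeps the tenor list as the single source of truth, so adding a tenor extends the set of curves automatically.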

We will once again rely on a combination of loops and custom functions to clean and tidy our dataset(s). We will also create some features for the following datasets:

* No engineering required (values-only; non-stationary)
    - XX
* Rolling differences across different horizons (stationary)
    - Citi economic surprise indices (`econ_sur`)
    - US, Euro-area rates (`us_yields`, `eu_yields`)
        + Yield curves were also calculated
    - Credit spreads (`credit`)
* Rolling **percentage** changes across different horizons (stationary)
    - Equity indices (`eq_indices`)
    - USD indices (`usd`)
    - Commodities (`como`)
    - 12-month forward P/E ratios (`pe`)
    - 12-month forward P/B ratios (`pb`)
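
As a rough illustration of the two transformations above, pandas' `diff` and `pct_change` both take a `periods` argument. The cell below is a sketch under the assumption of weekly data with a `date` column. The `*_sketch` functions, the `px` column and the `_pct_chg` suffix are hypothetical and may differ from what `SIREN_func` actually does.

```python
import pandas as pd

HORIZONS = (1, 4, 13, 26)  # in weeks, as described for roll_diff / roll_pct_chg

def roll_diff_sketch(df, horizons=HORIZONS, date_col='date'):
    """Absolute change over each horizon, one suffixed column per horizon."""
    indexed = df.set_index(date_col)
    frames = [indexed.diff(h).add_suffix(f'_{h}w_chg') for h in horizons]
    return pd.concat(frames, axis=1).reset_index()

def roll_pct_chg_sketch(df, horizons=HORIZONS, date_col='date'):
    """Fractional change over each horizon (0.10 means +10%)."""
    indexed = df.set_index(date_col)
    frames = [indexed.pct_change(periods=h).add_suffix(f'_{h}w_pct_chg') for h in horizons]
    return pd.concat(frames, axis=1).reset_index()

# Hypothetical weekly price series growing 10% per week
demo = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=5, freq='W-FRI'),
    'px': [100.0, 110.0, 121.0, 133.1, 146.41],
})

diffs = roll_diff_sketch(demo, horizons=(1, 2))
pcts = roll_pct_chg_sketch(demo, horizons=(1, 2))
print(diffs.head())
print(pcts.head())
```

The first `h` rows of each `h`-week column are `NaN` because there is no earlier observation to compare against.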

### Feature engineering
**Calculating differences across different horizons**

In [5]:
# Creating an empty dictionary to house our sub-dataset(s)
roll_d = {}

# Datasets of interest
for df in ['econ_sur', 'credit', 'us_yields', 'eu_yields']:
    if df in ['credit']:
        roll_d[f'{df}_chg'] = SIREN_func.roll_diff(SIREN_func.fix_credit(empty_list[df]))
    
    elif df in ['us_yields', 'eu_yields']:
        roll_d[f'{df}_chg'] = SIREN_func.roll_diff(SIREN_func.derive_yield_curves(empty_list[df]))

    else:
        roll_d[f'{df}_chg'] = SIREN_func.roll_diff(empty_list[df])

**Calculating percentage changes across different time horizons**

In [6]:
# Creating another empty dictionary to house sub-dataset(s)
roll_d2 = {}

# Datasets of interest
for df in ['usd', 'pe', 'pb', 'como', 'eq_indices']:
    if df in ['eq_indices']:
        roll_d2[f'{df}_4w_return'] = SIREN_func.lag_roll_pct_chg(empty_list[df], 4)

    else: 
        roll_d2[f'{df}_chg'] = SIREN_func.roll_pct_chg(empty_list[df]) # Tidy up with custom module later

**Storing the remaining time-series in a dictionary**

In [7]:
# Creating another empty dictionary to house sub-datasets
d3 = {}

# Datasets of interest
for df in ['epu', 'finc', 'pct52w', 'vol', 'aaii', 'us_cftc', 'put_call']:
    if df == 'us_cftc':
        d3[df] = SIREN_func.fix_cftc(empty_list[df])

    else:
        d3[df] = SIREN_func.adjust_dates_only(empty_list[df])
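
For reference, the CFTC calculation described earlier (net non-commercial positions as a % of total open interest) can be sketched as below. The column names are taken from the final dataset printed later in this notebook. The function name `cftc_net_pct_oi` and the assumption that the net position is derived as long minus short are ours, not confirmed by `SIREN_func`.

```python
import pandas as pd

def cftc_net_pct_oi(df):
    """Net non-commercial positioning, in contracts and as a % of open interest."""
    out = df.copy()
    # Assumption: net = long - short (the raw data may already carry a net column)
    out['cftc_nc_net'] = out['cftc_nc_long'] - out['cftc_nc_short']
    out['cftc_nc_net_pct_oi'] = 100 * out['cftc_nc_net'] / out['cftc_oi']
    return out

# Hypothetical contract counts (illustrative values only)
demo = pd.DataFrame({
    'cftc_nc_long': [300, 150],
    'cftc_nc_short': [100, 250],
    'cftc_oi': [1000, 2000],
})

result = cftc_net_pct_oi(demo)
print(result)
```

Scaling by open interest makes positioning comparable across periods in which the overall size of the futures market differs.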


#### Bringing everything together

In [8]:
# Merging our dictionaries
final_dict = {**roll_d, **roll_d2, **d3}

In [9]:
# Printing out our keys to our sub-dataset(s)
final_dict.keys()

dict_keys(['econ_sur_chg', 'credit_chg', 'us_yields_chg', 'eu_yields_chg', 'usd_chg', 'pe_chg', 'pb_chg', 'como_chg', 'eq_indices_4w_return', 'epu', 'finc', 'pct52w', 'vol', 'aaii', 'us_cftc', 'put_call'])

**Tidying earnings revision ratios**

In [10]:
# Read .xlsx highlighting S&P 500's earnings revisions
eri = pd.read_excel("../data/spx_eri.xlsx", sheet_name="Combined", parse_dates=['date'])

In [11]:
# Pipe custom function to calculate ERIs and 4, 13-week differences
eri_chg = SIREN_func.eri_diff(eri, 4, 13)

In [12]:
# Combining all our datasets
full = eri_chg.copy()
for df in final_dict.keys():
    full = pd.merge(left=full, right=final_dict[df], how='left', on='date')
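
The same chain of left joins can be written with `functools.reduce`. The sketch below also passes `validate='one_to_one'` to `pd.merge`, so that a sub-dataset with duplicate dates raises an error rather than silently multiplying rows. `merge_on_date` and the toy frames are hypothetical.

```python
from functools import reduce

import pandas as pd

def merge_on_date(base, frames):
    """Left-join each frame onto base by 'date', failing fast on duplicate dates."""
    return reduce(
        lambda left, right: pd.merge(left, right, how='left', on='date', validate='one_to_one'),
        frames,
        base,
    )

# Toy weekly frames (illustrative values only)
dates = pd.date_range('2021-01-01', periods=3, freq='W-FRI')
base = pd.DataFrame({'date': dates, 'eri': [0.1, -0.2, 0.3]})
frame_a = pd.DataFrame({'date': dates, 'a': [1.0, 2.0, 3.0]})
frame_b = pd.DataFrame({'date': dates[:2], 'b': [10.0, 20.0]})  # last week missing

merged = merge_on_date(base, [frame_a, frame_b])
print(merged)
```

Because the join is `how='left'`, dates missing from a sub-dataset show up as `NaN` in its columns rather than dropping the row.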

In [13]:
# Previewing our final dataset
SIREN_func.eda_clean(full)

Dataset Statistics:
Shape of dataframe: (862, 285)
% of Null values in dataframe: 1.19%
% duplicate rows: 0.0%

Column names: Index(['eri', 'eri_1m_chg', 'eri_3m_chg', 'cesiusd_1w_chg', 'cesieur_1w_chg',
       'cesigbp_1w_chg', 'cesijpy_1w_chg', 'cesicny_1w_chg', 'cesiglf_1w_chg',
       'cesiusd_4w_chg',
       ...
       'skew', 'aaii_bull', 'aaii_bear', 'aaii_neut', 'cftc_nc_net',
       'cftc_nc_long', 'cftc_nc_short', 'cftc_oi', 'cftc_nc_net_pct_oi',
       'cboe_us'],
      dtype='object', length=285)
Columns Count: 
float64    285
dtype: int64


In [14]:
# Let's drop every row that contains at least one missing value
full.dropna(inplace=True)

In [15]:
# Let's save this down as a pickle file (.pkl)
full.to_pickle('../data/full.pkl')
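
A pickle file preserves column dtypes (including parsed dates), so the dataframe can be reloaded as-is with `pd.read_pickle`. The round trip below is a small sketch using a temporary directory and a toy dataframe rather than the real `../data/full.pkl`.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for `full` (illustrative values only)
demo = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=3, freq='W-FRI'),
    'eri': [0.1, -0.2, 0.3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'full.pkl')
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# The datetime column survives without re-parsing
print(restored.dtypes)
```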