# Preprocessing
This notebook collects data from Yahoo Finance and prepared Excel sheets, then compiles and enriches the asset prices with indicators for later use in the model.

## FinRL
FinRL is the first open-source framework to demonstrate the great potential of applying deep reinforcement learning in quantitative finance. We help practitioners establish the development pipeline of trading strategies using deep reinforcement learning (DRL). A DRL agent learns by continuously interacting with an environment in a trial-and-error manner, making sequential decisions under uncertainty, and achieving a balance between exploration and exploitation.

In [None]:
## install finrl library
!pip install git+https://github.com/AI4Finance-LLC/FinRL-Library.git

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use('Agg')
%matplotlib inline
import datetime
import calendar

from finrl.apps import config
from finrl.finrl_meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.finrl_meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl.finrl_meta.env_portfolio_allocation.env_portfolio import StockPortfolioEnv
from finrl.drl_agents.stablebaselines3.models import DRLAgent
from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline,convert_daily_return_to_pyfolio_ts


import sys
sys.path.append("../FinRL-Library")

  'Module "zipline.assets" not found; multipliers will not be applied'


In [None]:
import os
if not os.path.exists("./" + config.TRAINED_MODEL_DIR):
    os.makedirs("./" + config.TRAINED_MODEL_DIR)
if not os.path.exists("./" + "data"):
    os.makedirs("./" + "data")

## Download Data

 - Use FinRL to download stock data
  - Yahoo Finance is a website that provides stock data, financial news, financial reports, etc. All the data provided by Yahoo Finance is free. FinRL uses the class YahooDownloader to fetch data from the Yahoo Finance API.
  - Call limit: using the public API (without authentication), you are limited to 2,000 requests per hour per IP (or up to a total of 48,000 requests a day).
 - Download Excel sheets for gold and long-vol data

> Date range: 2004-12-01 to 2021-09-01

### Stocks from Yahoo Finance

In [None]:
# list of tickers required from yahoo finance
tickers = ['^BCOM','^SP500TR', 'EEM', 'IEF' , 'AGG']

In [None]:
df_stocks = YahooDownloader(start_date = '2004-12-01',
                     end_date = '2021-09-01',
                     ticker_list = tickers).fetch_data()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
Shape of DataFrame:  (21083, 8)


In [None]:
df_stocks.tic = df_stocks.tic.replace({'^BCOM': 'COM', '^SP500TR': 'SNP'})

*Based on the above, AGG dividends are not included.

### Indices from Funds and Cash assets

In [None]:
# get dates of stocks valuations
dates = df_stocks.date.unique()

# extract data from git repo
url = [
       'https://github.com/changjulian17/DataSciencePortfolio/blob/main/Investment_Portfolio/modern_portfolio_theory/data/gold.xlsx?raw=true',
       'https://github.com/changjulian17/DataSciencePortfolio/blob/main/Investment_Portfolio/modern_portfolio_theory/data/long-vol.xlsx?raw=true'
]
# extract gold data and format
df_gold = pd.read_excel(url[0],sheet_name='Daily_Indexed')
df_gold = df_gold[['Name', 'US dollar']]
df_gold.columns = ['date','close']       # match col names to stocks
df_gold['tic'] = 'GLD'                   # add ticker data
df_gold = df_gold[df_gold.date.isin(dates)] # slice date range
df_gold.date = df_gold.date.dt.strftime('%Y-%m-%d')  # convert date to string

# extract long-vol data and format
# percentage change is already in excel, so we can skip one step
df_lv = pd.read_excel(url[1])  
df_lv.columns = df_lv.iloc[2]
df_lv = df_lv[3:].set_index('ReturnDate')['Index']
df_lv.index = pd.to_datetime(df_lv.index)# set date as index
df_lv = df_lv.resample('24h').ffill()    # upsample to daily frequency by forward-filling the last known value (no averaging)
df_lv = df_lv.reset_index()              # set date as column
df_lv.columns = ['date','close']         # match col names to stocks
df_lv = df_lv[df_lv.date.isin(dates)]    # slice date range
df_lv['tic'] = 'LOV'                     # add ticker data
df_lv.date = df_lv.date.dt.strftime('%Y-%m-%d')  # convert date to string

# extract cash
df_cash = pd.DataFrame(dates,columns=['date'])
df_cash['tic'], df_cash['close'] = 'CSH', 1
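
The upsampling step used for the long-vol index can be illustrated on a toy series. This is a minimal sketch with made-up values (not the contents of long-vol.xlsx), assuming month-end observations: resampling to a daily grid and forward-filling carries each observation forward until the next one arrives, rather than averaging or interpolating between them.

```python
import pandas as pd

# Hypothetical month-end index levels (illustrative values only)
monthly = pd.Series(
    [100.0, 102.0, 101.0],
    index=pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
)

# Upsample to a daily grid; ffill repeats the last known value on each new day
daily = monthly.resample("D").ffill()

# Values stay flat within a month and step on the observation dates
print(daily.loc["2021-02-26":"2021-03-02"])
```

Because the daily series is a step function, its day-to-day change is zero on most days and the full monthly move lands on a single date. That is worth keeping in mind when daily returns or covariances are later computed from it.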

# Preprocess Data
Data preprocessing is a crucial step for training a high quality machine learning model. We need to check for missing data and do feature engineering in order to convert the data into a model-ready state.

- Add technical indicators. In practical trading, various information needs to be taken into account, for example the historical stock prices, current holding shares, technical indicators, etc.
- Add covariance matrix
- Add cash as an asset

Every dataset must contain a row for every date on which the model runs, so that all tensor calculations are possible.

In [None]:
df_stocks.date.nunique(),df_gold.date.nunique(),df_lv.date.nunique(),df_cash.date.nunique()

(4217, 4217, 4217, 4217)

In [None]:
# combine the datasets
df = pd.concat([df_stocks,df_gold,df_lv,df_cash],axis=0).fillna(0)
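
The effect of stacking frames with different columns and then calling `fillna(0)` can be seen on a small example. This is a sketch with made-up rows: an asset that only has `date`, `close` and `tic` ends up with zeros in the columns it lacks, which matches the zero `open`/`high`/`low`/`volume` values that appear for CSH, GLD and LOV in the combined data.

```python
import pandas as pd

# Toy frames: one asset with several price columns, one with close only (made-up values)
df_full = pd.DataFrame({
    "date": ["2004-12-01"], "open": [21.8], "close": [15.8],
    "volume": [6652800.0], "tic": ["EEM"],
})
df_close_only = pd.DataFrame({"date": ["2004-12-01"], "close": [1], "tic": ["CSH"]})

# Columns missing from a frame become NaN after concat, then 0 after fillna
combined = pd.concat([df_full, df_close_only], axis=0).fillna(0)
print(combined)
```

These zeros are placeholders rather than real prices, so any feature derived from `open`, `high`, `low` or `volume` for those assets should be read with that in mind.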

In [None]:
# BCOM is missing two days: '2005-11-25' and '2014-09-08'
# these dates are removed for every asset so that all tickers share the same dates
dates = [date for date in df_stocks.date.unique() if date not in ['2005-11-25', '2014-09-08']]
df = df[df['date'].isin(dates)]

In [None]:
df.date.nunique()

4215
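
Equal `nunique()` counts across whole frames do not reveal a gap in a single ticker, so a per-ticker check is useful. The following is a minimal sketch on made-up data (the frame `df_toy` and its values are hypothetical) of one way to list the dates each ticker is missing and then drop them for all assets.

```python
import pandas as pd

# Toy long-format prices: ticker "B" has no row for 2021-01-05 (made-up data)
df_toy = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-04", "2021-01-06"],
    "tic": ["A", "A", "A", "B", "B"],
    "close": [10.0, 10.1, 10.2, 20.0, 20.4],
})

all_dates = set(df_toy["date"])

# For each ticker, list the dates that exist somewhere in the data but not for that ticker
missing = {
    tic: sorted(all_dates - set(group["date"]))
    for tic, group in df_toy.groupby("tic")
}
print(missing)  # {'A': [], 'B': ['2021-01-05']}

# Drop every date that is missing for at least one ticker
bad_dates = sorted(set().union(*missing.values()))
df_clean = df_toy[~df_toy["date"].isin(bad_dates)]
print(df_clean["date"].nunique())  # 2
```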

In [None]:
df.head(8)

Unnamed: 0,date,open,high,low,close,volume,tic,day
0,2004-12-01,101.949997,101.949997,101.510002,59.327103,49000.0,AGG,2.0
1,2004-12-01,21.799999,21.872223,21.73889,15.839635,6652800.0,EEM,2.0
2,2004-12-01,84.459999,84.459999,84.099998,54.27771,257600.0,IEF,2.0
3,2004-12-01,153.320007,153.809998,149.880005,149.880005,0.0,COM,2.0
4,2004-12-01,1766.900024,1766.900024,1766.900024,1766.900024,0.0,SNP,2.0
5,2004-12-02,101.669998,101.739998,101.5,59.438148,51500.0,AGG,3.0
6,2004-12-02,22.0,22.0,21.788889,15.816258,2802600.0,EEM,3.0
7,2004-12-02,84.040001,84.169998,83.879997,54.155266,886300.0,IEF,3.0


## Get Standard Indicators

In [None]:
fe = FeatureEngineer(
                    use_technical_indicator=True,
                    use_turbulence=False,
                    user_defined_feature = False)

df = fe.preprocess_data(df)

Successfully added technical indicators
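
To make columns such as `close_30_sma`, `boll_ub` and `boll_lb` concrete, the sketch below computes a simple moving average and Bollinger-style bands with plain pandas on made-up prices. It illustrates what these indicators measure and is not FinRL's internal implementation; the window of 3 and the band width of 2 standard deviations are assumptions chosen to keep the example small.

```python
import pandas as pd

# Made-up closing prices for a single asset
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])

window = 3  # assumption: short window for readability (the notebook's columns use 30 and 60)

# min_periods=1 lets the average start on the first row using whatever history exists
sma = close.rolling(window, min_periods=1).mean()
std = close.rolling(window, min_periods=1).std()

# Bollinger-style bands: moving average plus/minus two rolling standard deviations
upper = sma + 2 * std
lower = sma - 2 * std

print(pd.DataFrame({"close": close, "sma": sma, "upper": upper, "lower": lower}))
```

With `min_periods=1` the first moving-average value equals the first close, which is the same pattern seen in the first rows of `close_30_sma` and `close_60_sma` in the notebook output.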


In [None]:
df.shape

(33720, 16)

In [None]:
df.head(8)

Unnamed: 0,date,open,high,low,close,volume,tic,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma
0,2004-12-01,101.949997,101.949997,101.510002,59.327103,49000.0,AGG,2.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,59.327103,59.327103
4215,2004-12-01,153.320007,153.809998,149.880005,149.880005,0.0,COM,2.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,149.880005,149.880005
8430,2004-12-01,0.0,0.0,0.0,1.0,0.0,CSH,0.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,1.0,1.0
12645,2004-12-01,21.799999,21.872223,21.73889,15.839635,6652800.0,EEM,2.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,15.839635,15.839635
16860,2004-12-01,0.0,0.0,0.0,157.35,0.0,GLD,0.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,157.35,157.35
21075,2004-12-01,84.459999,84.459999,84.099998,54.27771,257600.0,IEF,2.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,54.27771,54.27771
25290,2004-12-01,0.0,0.0,0.0,100.0,0.0,LOV,0.0,0.0,59.539668,59.225583,100.0,-66.666667,100.0,100.0,100.0


## Get Covariance Matrix as States

The covariance is calculated from the price movements over the previous year (252 trading days). Because it cannot be computed for the first year's worth of dates, the first 252 trading days (starting 2004-12-01) are dropped.

In [None]:
# add covariance matrix as states
df=df.sort_values(['date','tic'],ignore_index=True)
df.index = df.date.factorize()[0]

cov_list = []
return_list = []

# look back is one year
lookback=252  # 252 trading days in a year
for i in range(lookback,len(df.index.unique())):
  data_lookback = df.loc[i-lookback:i,:]
  price_lookback=data_lookback.pivot_table(index = 'date',columns = 'tic', values = 'close')
  return_lookback = price_lookback.pct_change().dropna()
  return_list.append(return_lookback)

  covs = return_lookback.cov().values 
  cov_list.append(covs)

  
df_cov = pd.DataFrame({'date':df.date.unique()[lookback:],'cov_list':cov_list,'return_list':return_list})
df = df.merge(df_cov, on='date')
df = df.sort_values(['date','tic']).reset_index(drop=True)
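
The same rolling-covariance pattern can be traced on a small example. This sketch uses made-up prices for two hypothetical tickers and a lookback of 3 instead of 252; otherwise it follows the structure of the cell above.

```python
import pandas as pd

dates = ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08"]

# Toy long-format prices for two assets over five dates (made-up values)
df_toy = pd.DataFrame({
    "date": dates * 2,
    "tic": ["A"] * 5 + ["B"] * 5,
    "close": [100.0, 101.0, 103.0, 102.0, 104.0, 50.0, 50.5, 50.0, 51.0, 51.5],
})

# Index every row by the position of its date (0, 0, 1, 1, ...)
df_toy = df_toy.sort_values(["date", "tic"], ignore_index=True)
df_toy.index = df_toy["date"].factorize()[0]

lookback = 3  # assumption: tiny window so the result is easy to inspect
cov_list = []
for i in range(lookback, df_toy.index.nunique()):
    # .loc slicing is label-based and inclusive: lookback + 1 dates of prices
    window = df_toy.loc[i - lookback:i]
    price_window = window.pivot_table(index="date", columns="tic", values="close")
    # pct_change loses the first row, leaving exactly `lookback` return rows
    returns = price_window.pct_change().dropna()
    cov_list.append(returns.cov().values)

print(len(cov_list))      # one matrix per date from position `lookback` onward
print(cov_list[0].shape)  # (number of assets, number of assets)
```

Each date from position `lookback` onward gets one square matrix whose side equals the number of assets, and earlier dates get none. That is why merging the covariances back on `date` removes the first `lookback` dates.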

## Get log differences and month

Compute daily log differences for the closing prices and their moving averages. Add the month to help account for seasonality.

In [None]:
df_list = []

for tic in df.tic.unique():
  # get log differences for all close prices
  deltas = df[df.tic==tic][['close','close_30_sma','close_60_sma']]\
                                  .pct_change().apply(lambda x: np.log(1+x))
  deltas.columns = ['close_delta','close_30_sma_delta','close_60_sma_delta']

  # add to a list
  df_list.append(deltas)

df_deltas_combined = pd.concat(df_list).sort_index()
df = df.merge(df_deltas_combined,left_index=True,right_index=True)
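
The transformation `log(1 + pct_change)` is the log return: since `1 + (p_t - p_{t-1}) / p_{t-1} = p_t / p_{t-1}`, it equals `log(p_t) - log(p_{t-1})`. The sketch below checks this identity on made-up prices.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for one asset
close = pd.Series([100.0, 110.0, 99.0])

# Form used in the notebook: log of one plus the simple return
via_pct = close.pct_change().apply(lambda x: np.log(1 + x))

# Equivalent form: first difference of the log price
via_log = np.log(close).diff()

print(np.allclose(via_pct.dropna(), via_log.dropna()))  # True

# Log returns add up across time to the log of the total price ratio
print(np.isclose(via_log.sum(), np.log(99.0 / 100.0)))  # True
```

The first row of each ticker has no previous price, so its delta is NaN; those rows are removed by the `dropna()` in the feature selection step.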

In [None]:
df['month'] = df.date.str.split('-',expand=True)[1].astype('category')

# Feature selection
The RL models used assume stationary inputs, so it is important to provide stationary features as states.

In [None]:
df = df[[
         'date',
         'month',
         'close', 
         'volume', 
         'tic', 
         'macd',
         'rsi_30', 
         'cci_30', 
         'dx_30', 
         'close_30_sma',
         'close_60_sma', 
         'cov_list', 
         'return_list',
         'close_delta',
         "close_30_sma_delta",
         "close_60_sma_delta",
         ]].dropna()

df.shape

(31696, 16)

# Export

In [None]:
df.to_pickle('./data/processed_data.pkl')

(continued in train.ipynb)