# Engineering Predictive Alpha Factors

This notebook illustrates the following steps:

1. Select the adjusted open, high, low, and close prices as well as the volume for all tickers from the Quandl Wiki data that you downloaded and simplified for the last milestone for the 2007-2016 time period. Looking ahead, we will use 2014-2016 as our 'out-of-sample' period to test the performance of a strategy based on a machine learning model selected using data from preceding periods.
2. Compute the dollar volume as the product of closing price and trading volume; then select the stocks with at least eight years of data and the lowest average daily rank for this metric. 
3. Compute daily returns and keep only 'inliers' with values between -100% and + 100% as a basic check against data error.
4. Now we're ready to compute financial features. The Alpha Factory Library listed among the resources below illustrates how to compute a broad range of those using pandas and TA-Lib. We will list a few examples; feel free to explore and evaluate the various TA-Lib indicators.
    - Compute **historical returns** for various time ranges such as 1, 3, 5, 10, 21 trading days, as well as longer periods like 2, 3, 6 and 12 months.
    - Use TA-Lib's **Bollinger Band** indicator to create features that anticipate **mean-reversion**.
    - Select some indicators from TA-Lib's **momentum** indicators family such as
        - the Average Directional Movement Index (ADX), 
        - the Moving Average Convergence Divergence (MACD), 
        - the Relative Strength Index (RSI), 
        - the Balance of Power (BOP) indictor, or 
        - the Money Flow Index (MFI).
    - Compute TA-Lib **volume** indicators like On Balance Volume (OBV) or the Chaikin A/D Oscillator (ADOSC)
    - Create volatility metrics such as the Normalized Average True Range (NATR).
    - Compute rolling factor betas using the five Fama-French risk factors for different rolling windows of three and 12 months (see resources below).
    - Compute the outcome variable that we will aim to predict, namely the 1-day forward returns.

## Usage tips

- If you experience resource constraints (suddenly restarting Kernel), increase the memory available for Docker Desktop (> Settings > Advanced). If this not possible or you experienced prolonged execution times, reduce the scope of the exercise. The easiest way to do so is to select fewer stocks or a shorter time period, or both.
- You may want to persist intermediate results so you can recover quickly in case something goes wrong. There's an example under the first 'Persist Results' subsection.

## Imports & Settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline

from pathlib import Path
import numpy as np
import pandas as pd
import pandas_datareader.data as web

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
from sklearn.preprocessing import scale
import talib

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
sns.set_style('whitegrid')
idx = pd.IndexSlice
deciles = np.arange(.1, 1, .1).round(1)

## Load Data

In [4]:
DATA_STORE = Path('.', 'data', 'stock_prices.h5')

In [5]:
with pd.HDFStore(DATA_STORE) as store:
    stock_df = store.select('/us_stocks', where='date >= 20000101 & date < 20170101')
    
stock_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10561466 entries, 30 to 15389004
Data columns (total 9 columns):
 #   Column       Dtype         
---  ------       -----         
 0   ticker       object        
 1   date         datetime64[ns]
 2   ex-dividend  float64       
 3   split_ratio  float64       
 4   open         float64       
 5   high         float64       
 6   low          float64       
 7   close        float64       
 8   volume       float64       
dtypes: datetime64[ns](1), float64(7), object(1)
memory usage: 805.8+ MB


## Select 500 most-traded stocks prior to 2017

Compute the dollar volume as the product of the adjusted close price and the adjusted volume:

In [6]:
stock_df['dollar-volume'] = stock_df['close'] * stock_df['volume']

In [7]:
stock_df.head

<bound method NDFrame.head of          ticker       date  ex-dividend  split_ratio       open       high  \
30            A 2000-01-03          0.0          1.0  53.726454  53.856080   
31            A 2000-01-04          0.0          1.0  46.481058  46.992738   
32            A 2000-01-05          0.0          1.0  45.198445  45.239380   
33            A 2000-01-06          0.0          1.0  42.046493  42.298923   
34            A 2000-01-07          0.0          1.0  40.293135  44.986951   
...         ...        ...          ...          ...        ...        ...   
15389000   ZUMZ 2016-12-23          0.0          1.0  20.950000  21.500000   
15389001   ZUMZ 2016-12-27          0.0          1.0  21.200000  21.700000   
15389002   ZUMZ 2016-12-28          0.0          1.0  21.550000  21.749900   
15389003   ZUMZ 2016-12-29          0.0          1.0  21.550000  22.050000   
15389004   ZUMZ 2016-12-30          0.0          1.0  21.900000  22.190000   

                low      close   

In [8]:
TRADING_DAYS_PER_YEAR = 253
minObs = TRADING_DAYS_PER_YEAR * 8
print(minObs)

nObs = stock_df.groupby('ticker').size()
print(nObs)
keep = nObs[nObs > minObs].index
keep

2024
ticker
A       4277
AA        42
AAL     2836
AAMC    1020
AAN     4277
        ... 
ZNGA    1268
ZOES     687
ZQK     3946
ZTS      987
ZUMZ    2935
Length: 3186, dtype: int64


Index(['A', 'AAL', 'AAN', 'AAON', 'AAP', 'AAPL', 'AAWW', 'ABAX', 'ABC', 'ABCB',
       ...
       'ZEUS', 'ZIGO', 'ZINC', 'ZION', 'ZIOP', 'ZIXI', 'ZLC', 'ZMH', 'ZQK',
       'ZUMZ'],
      dtype='object', name='ticker', length=2528)

In [9]:
stock_df = stock_df[stock_df['ticker'].isin(keep)]

In [10]:
dGrp = stock_df.groupby('date')
#stock_df['dollar-volume'].rank()

In [11]:
stock_df['daily-rank'] = dGrp['dollar-volume'].rank()
stock_df.tail()

Unnamed: 0,ticker,date,ex-dividend,split_ratio,open,high,low,close,volume,dollar-volume,daily-rank
15389000,ZUMZ,2016-12-23,0.0,1.0,20.95,21.5,20.95,21.35,532292.0,11364434.2,1212.0
15389001,ZUMZ,2016-12-27,0.0,1.0,21.2,21.7,21.2,21.45,308004.0,6606685.8,975.0
15389002,ZUMZ,2016-12-28,0.0,1.0,21.55,21.7499,21.325,21.45,165827.0,3556989.15,696.0
15389003,ZUMZ,2016-12-29,0.0,1.0,21.55,22.05,21.4,21.9,322108.0,7054165.2,977.0
15389004,ZUMZ,2016-12-30,0.0,1.0,21.9,22.19,21.6,21.85,295429.0,6455123.65,866.0


In [12]:
avgRank = stock_df.groupby('ticker')['daily-rank'].mean()
avgRank

ticker
A       1977.764555
AAL     2288.401622
AAN      586.835632
AAON     620.805354
AAP     1944.249803
           ...     
ZIXI     684.787351
ZLC     1284.504004
ZMH     2105.309823
ZQK     1289.242524
ZUMZ    1275.064395
Name: daily-rank, Length: 2528, dtype: float64

In [17]:
keep = avgRank.sort_values()[:500].index
stock_df = stock_df[stock_df['ticker'].isin(keep)]
stock_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1914984 entries, 58938 to 15318829
Data columns (total 11 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ticker         object        
 1   date           datetime64[ns]
 2   ex-dividend    float64       
 3   split_ratio    float64       
 4   open           float64       
 5   high           float64       
 6   low            float64       
 7   close          float64       
 8   volume         float64       
 9   dollar-volume  float64       
 10  daily-rank     float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 175.3+ MB


## Remove outliers based on daily returns

In [29]:
stock_df = stock_df[stock_df['close'].pct_change() > -0.99]
stock_df = stock_df[stock_df['close'].pct_change() < 0.99]

In [31]:
with pd.HDFStore(DATA_STORE) as store:
    store.put('clean_us_stocks', stock_df, format='table', data_columns=True)

In [6]:
with pd.HDFStore(DATA_STORE) as store:
    clean_df = store.get('/clean_us_stocks')
    
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1914621 entries, 58940 to 15318829
Data columns (total 11 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ticker         object        
 1   date           datetime64[ns]
 2   ex-dividend    float64       
 3   split_ratio    float64       
 4   open           float64       
 5   high           float64       
 6   low            float64       
 7   close          float64       
 8   volume         float64       
 9   dollar-volume  float64       
 10  daily-rank     float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 175.3+ MB


## Compute returns

In [15]:
lags = {1,3, 5, 10, 21, 42, 63, 126, 253}

for lag in lags :
    print(f"ret_{lag}")
    clean_df[f"ret_{lag}"] = clean_df['close'].pct_change(lag)
    

ret_1
ret_3
ret_5
ret_10
ret_42
ret_21
ret_253
ret_126
ret_63
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1914621 entries, 58940 to 15318829
Data columns (total 21 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ticker         object        
 1   date           datetime64[ns]
 2   ex-dividend    float64       
 3   split_ratio    float64       
 4   open           float64       
 5   high           float64       
 6   low            float64       
 7   close          float64       
 8   volume         float64       
 9   dollar-volume  float64       
 10  daily-rank     float64       
 11  ret_{lag}      float64       
 12  ret_1          float64       
 13  ret_3          float64       
 14  ret_5          float64       
 15  ret_10         float64       
 16  ret_42         float64       
 17  ret_21         float64       
 18  ret_253        float64       
 19  ret_126        float64       
 20  ret_63         float64       
dtypes: datetim

Unnamed: 0,ticker,date,ex-dividend,split_ratio,open,high,low,close,volume,dollar-volume,...,ret_{lag},ret_1,ret_3,ret_5,ret_10,ret_42,ret_21,ret_253,ret_126,ret_63
15318825,ZAZA,2015-09-28,0.0,1.0,0.355,0.37,0.341,0.341,31134.0,10616.694,...,-0.589157,-0.078378,-0.05749,-0.127877,-0.102632,-0.431667,-0.197269,-0.916626,-0.793333,-0.589157
15318826,ZAZA,2015-09-29,0.0,1.0,0.34,0.368,0.34,0.359,20290.0,7284.11,...,-0.54557,0.052786,-0.02973,-0.011019,-0.055263,-0.402662,-0.281713,-0.907235,-0.779755,-0.54557
15318827,ZAZA,2015-09-30,0.0,1.0,0.352,0.3599,0.3437,0.355,7718.0,2739.89,...,-0.561728,-0.011142,-0.040541,-0.018795,-0.089744,-0.388985,-0.27551,-0.90056,-0.758503,-0.561728
15318828,ZAZA,2015-10-01,0.0,1.0,0.3699,0.375,0.36,0.3649,28085.0,10248.2165,...,-0.615895,0.027887,0.070088,-0.013784,-0.013784,-0.399144,-0.087522,-0.899477,-0.753045,-0.615895
15318829,ZAZA,2015-10-02,0.0,1.0,0.369,0.4,0.355,0.3886,92641.0,36000.2926,...,-0.536443,0.064949,0.082451,0.05027,0.128339,-0.362951,0.021019,-0.891453,-0.750897,-0.536443


## Bollinger Bands

In [21]:
from talib import MA_Type

clean_df['bb_upper'], clean_df['bb_middle'], clean_df['bb_lower'] = talib.BBANDS(clean_df['close'], matype=MA_Type.T3)


clean_df.info()
clean_df.tail()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1914621 entries, 58940 to 15318829
Data columns (total 23 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ticker         object        
 1   date           datetime64[ns]
 2   ex-dividend    float64       
 3   split_ratio    float64       
 4   open           float64       
 5   high           float64       
 6   low            float64       
 7   close          float64       
 8   volume         float64       
 9   dollar-volume  float64       
 10  daily-rank     float64       
 11  ret_1          float64       
 12  ret_3          float64       
 13  ret_5          float64       
 14  ret_10         float64       
 15  ret_42         float64       
 16  ret_21         float64       
 17  ret_253        float64       
 18  ret_126        float64       
 19  ret_63         float64       
 20  bb_upper       float64       
 21  bb_middle      float64       
 22  bb_lower       float64       
dtypes:

Unnamed: 0,ticker,date,ex-dividend,split_ratio,open,high,low,close,volume,dollar-volume,...,ret_5,ret_10,ret_42,ret_21,ret_253,ret_126,ret_63,bb_upper,bb_middle,bb_lower
15318825,ZAZA,2015-09-28,0.0,1.0,0.355,0.37,0.341,0.341,31134.0,10616.694,...,-0.127877,-0.102632,-0.431667,-0.197269,-0.916626,-0.793333,-0.589157,0.385555,0.364266,0.342977
15318826,ZAZA,2015-09-29,0.0,1.0,0.34,0.368,0.34,0.359,20290.0,7284.11,...,-0.011019,-0.055263,-0.402662,-0.281713,-0.907235,-0.779755,-0.54557,0.38247,0.361217,0.339965
15318827,ZAZA,2015-09-30,0.0,1.0,0.352,0.3599,0.3437,0.355,7718.0,2739.89,...,-0.018795,-0.089744,-0.388985,-0.27551,-0.90056,-0.758503,-0.561728,0.380129,0.358551,0.336973
15318828,ZAZA,2015-10-01,0.0,1.0,0.3699,0.375,0.36,0.3649,28085.0,10248.2165,...,-0.013784,-0.013784,-0.399144,-0.087522,-0.899477,-0.753045,-0.615895,0.377461,0.35765,0.337839
15318829,ZAZA,2015-10-02,0.0,1.0,0.369,0.4,0.355,0.3886,92641.0,36000.2926,...,0.05027,0.128339,-0.362951,0.021019,-0.891453,-0.750897,-0.536443,0.392117,0.360946,0.329775


## Momentum Indicators

TA-Lib offers the following choices - feel free to experiment with as many as you like (but you don't have to..):

|Function|             Name|
|:---|:---|
|PLUS_DM|              Plus Directional Movement|
|MINUS_DM|             Minus Directional Movement|
|PLUS_DI|              Plus Directional Indicator|
|MINUS_DI|             Minus Directional Indicator|
|DX|                   Directional Movement Index|
|ADX|                  Average Directional Movement Index|
|ADXR|                 Average Directional Movement Index Rating|
|APO|                  Absolute Price Oscillator|
|PPO|                  Percentage Price Oscillator|
|AROON|                Aroon|
|AROONOSC|             Aroon Oscillator|
|BOP|                  Balance Of Power|
|CCI|                  Commodity Channel Index|
|CMO|                  Chande Momentum Oscillator|
|MACD|                 Moving Average Convergence/Divergence|
|MACDEXT|              MACD with controllable MA type|
|MACDFIX|              Moving Average Convergence/Divergence Fix 12/26|
|MFI|                  Money Flow Index|
|MOM|                  Momentum|
|RSI|                  Relative Strength Index|
|STOCH|                Stochastic|
|STOCHF|               Stochastic Fast|
|STOCHRSI|             Stochastic Relative Strength Index|
|TRIX|                 1-day Rate-Of-Change (ROC) of a Triple Smooth EMA|
|ULTOSC|               Ultimate Oscillator|
|WILLR|                Williams' %R|

### Average Directional Movement Index (ADX)

The ADX combines of two other indicators, namely the positive and directional indicators (PLUS_DI and MINUS_DI), which in turn build on the positive and directional movement (PLUS_DM and MINUS_DM). For additional details see [Wikipdia](https://en.wikipedia.org/wiki/Average_directional_movement_index) and [Investopedia](https://www.investopedia.com/articles/trading/07/adx-trend-indicator.asp).

### Absolute Price Oscillator (APO)

The absolute Price Oscillator (APO) is computed as the difference between two exponential moving averages (EMA) of price series, expressed as an absolute value. The EMA windows usually contain 26 and 12 data points, respectively.

### Percentage Price Oscillator (PPO)

The Percentage Price Oscillator (APO) is computed as the difference between two exponential moving averages (EMA) of price series, expressed as a percentage value and thus comparable across assets. The EMA windows usually contain 26 and 12 data points, respectively. 

### Aroon Oscillator

#### Aroon Up/Down Indicator

The indicator measures the time between highs and the time between lows over a time period. It computes an AROON_UP and an AROON_DWN indicator as follows:

$$
\begin{align*}
\text{AROON_UP}&=\frac{T-\text{Periods since T period High}}{T}\times 100\\
\text{AROON_DWN}&=\frac{T-\text{Periods since T period Low}}{T}\times 100
\end{align*}
$$

#### Aroon Oscillator

The Aroon Oscillator is simply the difference between the Aroon Up and Aroon Down indicators.

### Balance Of Power (BOP)

The Balance of Power (BOP) intends to measure the strength of buyers relative to sellers in the market by assessing the ability of each side to drive prices. It is computer as the difference between the close and the open price, divided by the difference between the high and the low price: 

$$
\text{BOP}_t= \frac{P_t^\text{Close}-P_t^\text{Open}}{P_t^\text{High}-P_t^\text{Low}}
$$

### Commodity Channel Index (CCI)

The Commodity Channel Index (CCI) measures the difference between the current *typical* price, computed as the average of current low, high and close price and the historical average price. A positive (negative) CCI indicates that price is above (below) the historic average. When CCI is below zero, the price is below the hsitoric average. It is computed as:

$$
\begin{align*}
\bar{P_t}&=\frac{P_t^H+P_t^L+P_t^C}{3}\\
\text{CCI}_t & =\frac{\bar{P_t} - \text{SMA}(T)_t}{0.15\sum_{t=i}^T |\bar{P_t}-\text{SMA}(N)_t|/T}
\end{align*}
$$

### Moving Average Convergence/Divergence (MACD)

Moving Average Convergence Divergence (MACD) is a trend-following (lagging) momentum indicator that shows the relationship between two moving averages of a security’s price. It is calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA.

The TA-Lib implementation returns the MACD value and its signal line, which is the 9-day EMA of the MACD. In addition, the MACD-Histogram measures the distance between the indicator and its signal line.

### Chande Momentum Oscillator (CMO)

The Chande Momentum Oscillator (CMO) intends to measure momentum on both up and down days. It is calculated as the difference between the sum of gains and losses over at time period T, divided by the sum of all price movement over the same period. It oscillates between +100 and -100.

### Money Flow Index

The Money Flow Index (MFI) incorporates price and volume information to identify overbought or oversold conditions.  The indicator is typically calculated using 14 periods of data. An MFI reading above 80 is considered overbought and an MFI reading below 20 is considered oversold.

### Relative Strength Index

RSI compares the magnitude of recent price changes across stocks to identify stocks as overbought or oversold. A high RSI (usually above 70) indicates overbought and a low RSI (typically below 30) indicates oversold. It first computes the average price change for a given number (often 14) of prior trading days with rising and falling prices, respectively as $\text{up}_t$ and $\text{down}_t$. Then, the RSI is computed as:
$$
\text{RSI}_t=100-\frac{100}{1+\frac{\text{up}_t}{\text{down}_t}}
$$



#### Stochastic RSI (STOCHRSI)

The Stochastic Relative Strength Index (STOCHRSI) is based on the RSI just described and intends to identify crossovers as well as overbought and oversold conditions. It compares the distance of the current RSI to the lowest RSI over a given time period T to the maximum range of values the RSI has assumed for this period. It is computed as follows:

$$
\text{STOCHRSI}_t= \frac{\text{RSI}_t-\text{RSI}_t^L(T)}{\text{RSI}_t^H(T)-\text{RSI}_t^L(T)}
$$

The TA-Lib implementation offers more flexibility than the original "Unsmoothed stochastic RSI" version by Chande and Kroll (1993). To calculate the original indicator, keep the `timeperiod` and `fastk_period` equal. 

The return value `fastk` is the unsmoothed RSI. The `fastd_period` is used to compute a smoothed STOCHRSI, which  is returned as `fastd`. If you do not care about STOCHRSI smoothing, just set `fastd_period` to 1 and ignore the `fastd` output.

Reference: "Stochastic RSI and Dynamic Momentum Index" by Tushar Chande and Stanley Kroll Stock&Commodities V.11:5 (189-199)


### Stochastic (STOCH)

A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time. Stochastic oscillators are based on the idea that closing prices should confirm the trend.

For stochastic (STOCH), there are four different lines: `FASTK`, `FASTD`, `SLOWK` and `SLOWD`. The `D` is the signal line usually drawn over its corresponding `K` function.

$$
\begin{align*}
& K^\text{Fast}(T_K) & = &\frac{P_t-P_{T_K}^L}{P_{T_K}^H-P_{T_K}^L}* 100 \\
& D^\text{Fast}(T_{\text{FastD}}) & = & \text{MA}(T_{\text{FastD}})[K^\text{Fast}]\\
& K^\text{Slow}(T_{\text{SlowK}}) & = &\text{MA}(T_{\text{SlowK}})[K^\text{Fast}]\\
& D^\text{Slow}(T_{\text{SlowD}}) & = &\text{MA}(T_{\text{SlowD}})[K^\text{Slow}]
\end{align*}
$$
  

The $P_{T_K}^L$, $P_{T_K}^H$, and $P_{T_K}^L$ are the extreme values among the last $T_K$ period.
 $K^\text{Slow}$ and $D^\text{Fast}$ are equivalent when using the same period. 

### Ultimate Oscillator (ULTOSC)

The Ultimate Oscillator (ULTOSC), developed by Larry Williams, measures the average difference of the current close to the previous lowest price over three time frames (default: 7, 14, and 28) to avoid overreacting to short-term price changes and incorporat short, medium, and long-term market trends. It first computes the buying pressure, $\text{BP}_t$, then sums it over the three periods $T_1, T_2, T_3$, normalized by the True Range ($\text{TR}_t$.
$$
\begin{align*}
\text{BP}_t & = P_t^\text{Close}-\min(P_{t-1}^\text{Close}, P_t^\text{Low})\\ 
\text{TR}_t & = \max(P_{t-1}^\text{Close}, P_t^\text{High})-\min(P_{t-1}^\text{Close}, P_t^\text{Low})
\end{align*}
$$

ULTOSC is then computed as a weighted average over the three periods as follows:
$$
\begin{align*}
\text{Avg}_t(T) & = \frac{\sum_{i=0}^{T-1} \text{BP}_{t-i}}{\sum_{i=0}^{T-1} \text{TR}_{t-i}}\\
\text{ULTOSC}_t & = 100*\frac{4\text{Avg}_t(7) + 2\text{Avg}_t(14) + \text{Avg}_t(28)}{4+2+1}
\end{align*}
$$

### Williams' %R (WILLR)

Williams %R, also known as the Williams Percent Range, is a momentum indicator that moves between 0 and -100 and measures overbought and oversold levels to identify entry and exit points. It is similar to the Stochastic oscillator and compares the current closing price $P_t^\text{Close}$ to the range of highest ($P_T^\text{High}$) and lowest ($P_T^\text{Low}$) prices over the last T periods (typically 14). The indicators is computed as:

$$
\text{WILLR}_t = \frac{P_T^\text{High}-P_t^\text{Close}}{P_T^\text{High}-P_T^\text{Low}}
$$


In [23]:
clean_df['RSI'] = talib.RSI(clean_df['close'])

In [24]:
clean_df['MFI'] = talib.MFI(clean_df['high'],clean_df['low'],clean_df['close'],clean_df['volume'])

In [27]:
clean_df['macd'], clean_df['macdsignal'], clean_df['macdhist'] = talib.MACD(clean_df['close'])

In [28]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1914621 entries, 58940 to 15318829
Data columns (total 28 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ticker         object        
 1   date           datetime64[ns]
 2   ex-dividend    float64       
 3   split_ratio    float64       
 4   open           float64       
 5   high           float64       
 6   low            float64       
 7   close          float64       
 8   volume         float64       
 9   dollar-volume  float64       
 10  daily-rank     float64       
 11  ret_1          float64       
 12  ret_3          float64       
 13  ret_5          float64       
 14  ret_10         float64       
 15  ret_42         float64       
 16  ret_21         float64       
 17  ret_253        float64       
 18  ret_126        float64       
 19  ret_63         float64       
 20  bb_upper       float64       
 21  bb_middle      float64       
 22  bb_lower       float64       
 23  RS

## Volume Indicators

|Function|             Name|
|:---|:---|
|AD|                   Chaikin A/D Line|
|ADOSC|                Chaikin A/D Oscillator|
|OBV|                  On Balance Volume|

### Chaikin A/D Line

The Chaikin Advance/Decline or Accumulation/Distribution Line (AD) is a volume-based indicator designed to measure the cumulative flow of money into and out of an asset. The indicator assumes that the degree of buying or selling pressure can be determined by the location of the close, relative to the high and low for the period. There is buying (sellng) pressure when a stock closes in the upper (lower) half of a period's range. The intention is to signal a change in direction when the indicator diverges from the security price.

The Accumulation/Distribution Line is a running total of each period's Money Flow Volume. It is calculated as follows:

1. The Money Flow Multiplier (MFI) is the relationship of the close to the high-low range:
2. The MFI is multiplied by the period's volume $V_t$ to come up with a Money Flow Volume (MFV). 
3. A running total of the Money Flow Volume forms the Accumulation Distribution Line:
$$
\begin{align*}
&\text{MFI}_t&=\frac{P_t^\text{Close}-P_t^\text{Low}}{P_t^\text{High}-P_t^\text{Low}}\\
&\text{MFV}_t&=\text{MFI}_t \times V_t\\
&\text{AD}_t&=\text{AD}_{t-1}+\text{MFV}_t
\end{align*}
$$

### Chaikin A/D Oscillator (ADOSC)

The Chaikin A/D Oscillator (ADOSC) is the Moving Average Convergence Divergence indicator (MACD) applied to the Chaikin A/D Line. The Chaikin Oscillator intends to predict changes in the Accumulation/Distribution Line.

It is computed as the difference between the 3-day exponential moving average and the 10-day exponential moving average of the Accumulation/Distribution Line.

### On Balance Volume (OBV)

The On Balance Volume indicator (OBV) is a cumulative momentum indicator that relates volume to price change. It assumes that OBV changes precede price changes because smart money can be seen flowing into the security by a rising OBV. When the public then moves into the security, both the security and OBV will rise.

The current OBV is computed by adding (subtracting) the current volume to the last OBV if the security closes higher (lower) than the previous close.

$$
\text{OBV}_t = 
\begin{cases}
\text{OBV}_{t-1}+V_t & \text{if }P_t>P_{t-1}\\
\text{OBV}_{t-1}-V_t & \text{if }P_t<P_{t-1}\\
\text{OBV}_{t-1} & \text{otherwise}
\end{cases}
$$

In [29]:
clean_df['OBV'] = talib.OBV(clean_df['close'], clean_df['volume'])

## Volatility Indicators

|Function|             Name|
|:---|:---|
|TRANGE|               True Range|
|ATR|                  Average True Range|
|NATR|                 Normalized Average True Range|

### ATR

The Average True Range indicator (ATR) shows volatility of the market. It was introduced by Welles Wilder (1978)  and has been used as a component of numerous other indicators since. It aims to anticipate changes in trend such that the higher its value, the higher the probability of a trend change; the lower the indicator’s value, the weaker the current trend.

It is computed as the simple moving average for a period T of the True Range (TRANGE), which measures volatility as the absolute value of the largest recent trading range:
$$
\text{TRANGE}_t = \max\left[P_t^\text{High} - P_t^\text{low}, \left| P_t^\text{High} - P_{t-1}^\text{Close}\right|, \left| P_t^\text{low} - P_{t-1}^\text{Close}\right|\right]
$$

### NATR

In [30]:
clean_df['NATR'] = talib.NATR(clean_df['high'], clean_df['low'], clean_df['close'])

The Normalized Average True Range (NATR) is a normalized version of the ATR computed as follows:

$$
\text{NATR}_t = \frac{\text{ATR}_t(T)}{P_t^\text{Close}} * 100
$$

Normalization make the ATR function more relevant in the folllowing scenarios:
- Long term analysis where the price changes drastically.
- Cross-market or cross-security ATR comparison.

## Rolling Factor Betas

In [31]:
factors = ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']
factor_data = web.DataReader('F-F_Research_Data_5_Factors_2x3', 'famafrench', start='2000')[0].drop('RF', axis=1)
factor_data.head()

Unnamed: 0_level_0,Mkt-RF,SMB,HML,RMW,CMA
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01,-4.74,4.45,-1.89,-6.29,4.74
2000-02,2.45,18.38,-9.81,-18.76,-0.35
2000-03,5.2,-15.39,8.23,11.82,-1.61
2000-04,-6.4,-4.96,7.25,7.67,5.62
2000-05,-4.42,-3.87,4.83,4.18,1.32


In [33]:
clean_df.set_index(keys=['ticker', 'date'], inplace=True)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1914621 entries, ('ABCB', Timestamp('2000-01-05 00:00:00')) to ('ZAZA', Timestamp('2015-10-02 00:00:00'))
Data columns (total 28 columns):
 #   Column         Dtype  
---  ------         -----  
 0   ex-dividend    float64
 1   split_ratio    float64
 2   open           float64
 3   high           float64
 4   low            float64
 5   close          float64
 6   volume         float64
 7   dollar-volume  float64
 8   daily-rank     float64
 9   ret_1          float64
 10  ret_3          float64
 11  ret_5          float64
 12  ret_10         float64
 13  ret_42         float64
 14  ret_21         float64
 15  ret_253        float64
 16  ret_126        float64
 17  ret_63         float64
 18  bb_upper       float64
 19  bb_middle      float64
 20  bb_lower       float64
 21  RSI            float64
 22  MFI            float64
 23  macd           float64
 24  macdsignal     float64
 25  macdhist       float64
 26  OBV            float64

In [35]:
factor_data.index = factor_data.index.to_timestamp()
factor_data = factor_data.resample('M').last().div(100)
factor_data.index.name = 'date'
factor_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 263 entries, 2000-01-31 to 2021-11-30
Freq: M
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  263 non-null    float64
 1   SMB     263 non-null    float64
 2   HML     263 non-null    float64
 3   RMW     263 non-null    float64
 4   CMA     263 non-null    float64
dtypes: float64(5)
memory usage: 12.3 KB


In [36]:
factor_data = factor_data.join(clean_df['ret_21']).sort_index()
factor_data.info()
factor_data.head()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 64389 entries, ('ABCB', Timestamp('2000-01-31 00:00:00', freq='M')) to ('ZAZA', Timestamp('2015-09-30 00:00:00', freq='M'))
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  64389 non-null  float64
 1   SMB     64389 non-null  float64
 2   HML     64389 non-null  float64
 3   RMW     64389 non-null  float64
 4   CMA     64389 non-null  float64
 5   ret_21  64388 non-null  float64
dtypes: float64(6)
memory usage: 3.2+ MB


Unnamed: 0_level_0,Unnamed: 1_level_0,Mkt-RF,SMB,HML,RMW,CMA,ret_21
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ABCB,2000-01-31,-0.0474,0.0445,-0.0189,-0.0629,0.0474,
ABCB,2000-02-29,0.0245,0.1838,-0.0981,-0.1876,-0.0035,-0.0125
ABCB,2000-03-31,0.052,-0.1539,0.0823,0.1182,-0.0161,0.028922
ABCB,2000-05-31,-0.0442,-0.0387,0.0483,0.0418,0.0132,0.0
ABCB,2000-06-30,0.0464,0.0987,-0.0841,-0.0826,-0.0291,-0.013397


In [41]:
betas12 = (factor_data.groupby(level='ticker',
                             group_keys=False)
         .apply(lambda x: RollingOLS(endog=x.ret_21,
                                     exog=sm.add_constant(x.drop('ret_21', axis=1)),
                                     window=min(12, x.shape[0]-1))
                .fit(params_only=True)
                .params
                .drop('const', axis=1)))

In [45]:
betas6 = (factor_data.groupby(level='ticker',
                             group_keys=False)
         .apply(lambda x: RollingOLS(endog=x.ret_21,
                                     exog=sm.add_constant(x.drop('ret_21', axis=1)),
                                     window=min(6, x.shape[0]-1))
                .fit(params_only=True)
                .params
                .drop('const', axis=1)))

In [47]:
clean_df["fwd_1d"] = clean_df['close'].pct_change(-1)
clean_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,ex-dividend,split_ratio,open,high,low,close,volume,dollar-volume,daily-rank,ret_1,...,bb_middle,bb_lower,RSI,MFI,macd,macdsignal,macdhist,OBV,NATR,fwd_1d
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
ABCB,2000-01-05,0.0,1.0,6.066039,6.066039,6.066039,6.066039,8169.6,49557.110415,307.0,,...,,,,,,,,8169.6,,-0.01223
ABCB,2000-01-06,0.0,1.0,6.175805,6.354898,6.141142,6.141142,1459.2,8961.154577,130.0,0.012381,...,,,,,,,,9628.8,,-0.011163
ABCB,2000-01-07,0.0,1.0,6.285572,6.354898,6.141142,6.210468,12351.6,76709.21994,359.0,0.011289,...,,,,,,,,21980.4,,0.02381
ABCB,2000-01-10,0.0,1.0,6.354898,6.354898,6.066039,6.066039,1848.0,11210.039665,148.0,-0.023256,...,,,,,,,,20132.4,,-0.056604
ABCB,2000-01-11,0.0,1.0,6.141142,6.430001,6.066039,6.430001,20521.2,131951.338708,405.0,0.06,...,,,,,,,,40653.6,,0.029602


## Persist results

In [48]:
with pd.HDFStore(DATA_STORE) as store:
        store.put('us_stocks_features', clean_df, format='table', data_columns=True)
        store.put('us_stocks_beta6', betas6, format='table', data_columns=True)
        store.put('us_stocks_beta12', betas12, format='table', data_columns=True)