# A Generalized Framework for Machine Learning applied to Stock Prediction

## Introduction

Use of machine learning in the quantitative investment field is, by all indications, skyrocketing.  The proliferation of easily accessible data, both traditional and alternative, along with some very approachable frameworks for machine learning models, is encouraging many to explore the arena.

However, usage of machine learning in stock market prediction requires much more than a good grasp of the concepts and techniques for supervised machine learning.  As I describe further in [this post], stock prediction is a challenging domain which requires some special techniques to overcome issues like non-stationarity, collinearity, and low signal-to-noise.  

In this and following posts, I'll present the design of an end-to-end system for developing, testing, and applying machine learning models in a way which addresses each of these problems in a very practical way.  These postings _will not_ offer any "secret sauce" or strategies which I use in live trading, but will offer a more valuable and more generalized set of techniques which will allow you to create your own strategies in a *robust manner*.  

Within other posts in this series, I plan to cover:
* Feature engineering / target engineering
* Feature selection
* Evaluating and comparing model performance (hint: not backtesting)
* Walk-forward out-of-sample training/testing
* Building ensemble models to combine many distinct, weak signals
* Using pandas and scikit-learn, separately and together
* Techniques for improving model predictive power
* Techniques for improving model robustness out-of-sample
... and more 

In this first post - before covering these specifics - I will present a high-level framework which sets the stage for modeling.  For this, I will assume readers have a good working knowledge of [pandas DataFrames] and of basic supervised machine learning concepts.    


## Types of Data Structures

As will become clear as I build out examples, it's crucial that our framework uses a smartly designed set of conventions for how to organize and use data.  My system makes heavy use of three distinct types of data:

* __features:__ This is a dataframe which contains all _features_ or values which we will allow models to use in the course of learning relationships - and later making predictions.  All features must be values which would have been known __at the point in time when the model needed to make predictions__.  In other words, `next_12_months_returns` would be a bad feature since it would not yet be known at the time of prediction.  The `features` dataframe has a multi-index of date/symbol and column names unique to each feature.  More on this later.
* __outcomes:__ This is a dataframe of all possible __future__ outcomes which we may be interested in predicting, magically shifted back in time to T-zero.  For instance, we may want to predict the `total_return` for a symbol over the year following T=0 (the time of prediction).  We would look ahead into the future, calculate what ultimately did happen to this metric, and log it onto time T=0.  I'll explain why in a minute.  Just like `features`, this dataframe has rows indexed by date/symbol and columns named with a convention which describes the outcome.
* __master:__ The final data structure type is the `master` dataframe.  This contains any _static_ information about each symbol in the universe, such as the SIC code, the number of shares outstanding, beta factors, etc...  In practice, things in the master may change over time (SIC codes and shares out can both change...) but I've found it sufficient for my purposes to take the current static values for the current point in time.  This dataframe uses row index of symbol only.  You could, of course, add a date/symbol index if you wanted to reflect changing values over time.  
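
To make the three structures concrete, here is a minimal sketch with made-up numbers.  The names `features_demo`, `outcomes_demo`, `master_demo`, `feat_a`, `ret_1`, and `sector` are placeholders I've invented for illustration; only the index conventions follow the description above.

```python
import pandas as pd

# Hypothetical two-symbol, three-day universe (all values are made up)
dates = pd.to_datetime(['2016-12-28', '2016-12-29', '2016-12-30'])
index = pd.MultiIndex.from_product([dates, ['AAPL', 'CSCO']], names=['date', 'symbol'])

# features: values known as of the close on each date, indexed by date/symbol
features_demo = pd.DataFrame({'feat_a': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=index)

# outcomes: future results logged back onto the prediction date (T=0);
# the last date has no known future yet, so it is null
nan = float('nan')
outcomes_demo = pd.DataFrame({'ret_1': [0.01, -0.02, 0.03, 0.00, nan, nan]}, index=index)

# master: static information, one row per symbol
master_demo = pd.DataFrame({'sector': ['hardware', 'networking']},
                           index=pd.Index(['AAPL', 'CSCO'], name='symbol'))

# static attributes can be broadcast onto the date/symbol rows via the symbol level
symbols = features_demo.index.get_level_values('symbol')
enriched = features_demo.copy()
enriched['sector'] = master_demo['sector'].reindex(symbols).to_numpy()
print(enriched)
```

Because `master_demo` is keyed by symbol alone, the same static row is repeated for every date of that symbol.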

## Why this data structure scheme?
It may seem odd to split the features and outcomes into distinct dataframes, and odd to create a dataframe of several different possible "outcomes".  It may seem odd to record on t=0 what will happen in the _next_ year or whatever.  There are four reasons for this:
1. This makes it trivial to extract the X's and y's when training models.  We only ever use one or more columns of `features` in the X and one column of `outcomes` in y.  
2. This makes it trivial to toggle between various time horizons - just change the column of `outcomes` used for y.
3. This helps us guard against inadvertent "peeking" at the future by being very careful not to let any future information leak into the `features` frame - and then only using subsets of that frame for X.  
4. This allows us to use the incredibly efficient pandas `join`, `merge`, and `concat` methods to quickly align data for purposes of training models.  Trust me.  

This will save you untold grey hairs and hours of debugging.  
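
As a rough illustration of the first two reasons, the sketch below wraps the alignment step in a small helper.  `make_xy` is a hypothetical function of my own naming (not part of pandas or scikit-learn), and the data is random noise standing in for real features and outcomes.

```python
import numpy as np
import pandas as pd

def make_xy(features, outcomes, target):
    """Hypothetical helper: align features with ONE outcome column and drop incomplete rows."""
    xy = features.join(outcomes[target]).dropna()
    return xy[features.columns], xy[target]

# Synthetic stand-in data: 10 business days x 2 made-up symbols
dates = pd.date_range('2016-01-04', periods=10, freq='B')
index = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
rng = np.random.default_rng(0)
features_demo = pd.DataFrame(rng.normal(size=(20, 2)), index=index, columns=['f1', 'f2'])
outcomes_demo = pd.DataFrame(rng.normal(size=(20, 2)), index=index, columns=['close_1', 'close_5'])

# Mimic the look-ahead nulls at the end of the sample:
# the last date for the 1-day outcome, the last five dates for the 5-day outcome
row_dates = index.get_level_values('date')
outcomes_demo.loc[row_dates >= dates[-1], 'close_1'] = np.nan
outcomes_demo.loc[row_dates >= dates[-5], 'close_5'] = np.nan

# Switching the prediction horizon is just a change of column name
X1, y1 = make_xy(features_demo, outcomes_demo, 'close_1')
X5, y5 = make_xy(features_demo, outcomes_demo, 'close_5')
print(X1.shape, X5.shape)  # (18, 2) (10, 2)
```

Only the rows that are null for the chosen outcome are dropped, so the shorter horizon keeps more observations.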

Before going further, let's create simple toy examples of each dataframe using free data from [quandl](https://www.quandl.com/): 

First, we'll make a utility function which downloads one or more symbols from quandl and returns the adjusted OHLC data (I generally find adjusted data to be best).

In [77]:
import pandas_datareader.data as web
import pandas as pd

def get_symbols(symbols,data_source, begin_date=None,end_date=None):
    # returns adjusted OHLCV rows for all requested symbols, indexed by date/symbol
    cols = ['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']
    frames = []
    for symbol in symbols:
        df = web.DataReader(symbol, data_source,begin_date, end_date)[cols].reset_index()
        df.columns = ['date','open','high','low','close','volume'] #my convention: always lowercase
        df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        frames.append(df.set_index(['date','symbol']))
    out = pd.concat(frames,axis=0) #stacks all symbols into one dataframe with a single concat
    return out.sort_index()
        
prices = get_symbols(['AAPL','CSCO'],data_source='quandl',begin_date='2015-01-01',end_date='2017-01-01')



Now, we will create some toy features: 

In [85]:
features = pd.DataFrame(index=prices.index)
by_symbol = prices.groupby(level='symbol') # group once so diffs/shifts never cross from one symbol into another
features['volume_change_ratio'] = by_symbol.volume.diff(1) / by_symbol.volume.shift(1) # day-over-day volume change vs. prior day's volume
features['momentum_5_day'] = by_symbol.close.pct_change(5) # trailing 5-day return
features['intraday_chg'] = (prices.close - prices.open) / prices.open # same-day open-to-close return, so no shift is needed
features['day_of_week'] = features.index.get_level_values('date').weekday
features['day_of_month'] = features.index.get_level_values('date').day
features.dropna(inplace=True)
features.tail(10)

date,symbol,volume_change_ratio,momentum_5_day,intraday_chg,day_of_week,day_of_month
2016-12-23,AAPL,-0.453747,0.004743,0.008046,4,23
2016-12-23,CSCO,-0.291298,-0.001961,-0.000327,4,23
2016-12-27,AAPL,0.284036,0.005316,0.006351,1,27
2016-12-27,CSCO,0.54626,-0.002276,0.001305,1,27
2016-12-28,AAPL,0.142595,-0.001625,-0.006467,2,28
2016-12-28,CSCO,-0.1519,-0.004581,-0.009121,2,28
2016-12-29,AAPL,-0.280609,-0.002819,0.002404,3,29
2016-12-29,CSCO,-0.085396,0.001315,0.002963,3,29
2016-12-30,AAPL,1.033726,-0.004042,-0.007115,4,30
2016-12-30,CSCO,0.836194,-0.007879,-0.011126,4,30


If the syntax or logic of the features isn't immediately clear, I'll cover that in more depth in [the next post].  For now, just note that we've created five features for both symbols using only data that would be available _as of the end of day T_.  

Also note that I've dropped any rows which contain any nulls for simplicity, since scikit-learn can't handle those out of the box.  
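
One way to sanity-check the "available as of the end of day T" property is to recompute features on a truncated price history and confirm that nothing changes.  The sketch below is only an illustration under my own assumptions: `is_point_in_time`, `good_features`, and `leaky_features` are hypothetical names, and the prices are synthetic.

```python
import numpy as np
import pandas as pd

def good_features(prices):
    """Toy feature that only looks backward (trailing 5-day return)."""
    close = prices['close']
    out = pd.DataFrame(index=prices.index)
    out['momentum_5_day'] = close / close.groupby(level='symbol').shift(5) - 1
    return out

def leaky_features(prices):
    """Deliberately broken toy feature: tomorrow's close peeks at the future."""
    out = pd.DataFrame(index=prices.index)
    out['tomorrow_close'] = prices['close'].groupby(level='symbol').shift(-1)
    return out

def is_point_in_time(build_fn, prices, cutoff):
    """Hypothetical check: features up to `cutoff` must not change when later rows are removed."""
    row_dates = prices.index.get_level_values('date')
    full = build_fn(prices)[row_dates <= cutoff]
    truncated = build_fn(prices[row_dates <= cutoff])
    return bool(np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True))

# Synthetic prices for two made-up symbols
dates = pd.date_range('2016-01-04', periods=30, freq='B')
index = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
rng = np.random.default_rng(1)
prices_demo = pd.DataFrame({'close': 100 + rng.normal(size=len(index)).cumsum()}, index=index)

cutoff = dates[19]
print(is_point_in_time(good_features, prices_demo, cutoff))   # True
print(is_point_in_time(leaky_features, prices_demo, cutoff))  # False
```

The leaky feature fails because its value on the cutoff date depends on a row that does not exist yet in the truncated history.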

Next, we'll create outcomes:


In [86]:
outcomes = pd.DataFrame(index=prices.index)
# next day's opening change
outcomes['open_1'] = prices.groupby(level='symbol').open.shift(-1)/prices.groupby(level='symbol').close.shift(0)-1
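# next day's closing change
# note: with a negative period, pct_change(-n) evaluates close[t]/close[t+n] - 1, so the change is
# expressed relative to the future price; the forward return close[t+n]/close[t] - 1 has the opposite sign
outcomes['close_1'] = prices.groupby(level='symbol').close.pct_change(-1)
outcomes['close_5'] = prices.groupby(level='symbol').close.pct_change(-5)

outcomes.tail(15)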

date,symbol,open_1,close_1,close_5
2016-12-20,CSCO,0.004254,0.004602,0.004602
2016-12-21,AAPL,-0.006065,0.006621,0.002827
2016-12-21,CSCO,-0.000657,-0.001313,-0.001313
2016-12-22,AAPL,-0.006019,-0.001974,0.004058
2016-12-22,CSCO,0.002626,-0.002293,0.007942
2016-12-23,AAPL,0.0,-0.006311,
2016-12-23,CSCO,0.003603,-0.004889,
2016-12-27,AAPL,0.002217,0.004282,
2016-12-27,CSCO,0.000652,0.008547,
2016-12-28,AAPL,-0.002655,0.000257,


Note that the shifted periods are negative, which in pandas convention looks _ahead_ in time.  This means that at the end of our time period we will have nulls - and more nulls in the outcome columns that need to look further into the future.  We don't call `dropna()` here since we may want to use `open_1` and there's no reason to throw away data from that column just because _a different_ outcome didn't have data.  But I digress.
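
One subtlety with negative periods is worth checking on a tiny example: `shift(-n)` pulls the future value back onto today's row, while `pct_change(-n)` divides today's value by the future one.  The sketch below uses three made-up closes for a hypothetical symbol `AAA` to compare the two conventions side by side.

```python
import pandas as pd

# Three made-up closes for one hypothetical symbol
dates = pd.to_datetime(['2016-12-27', '2016-12-28', '2016-12-29'])
index = pd.MultiIndex.from_product([dates, ['AAA']], names=['date', 'symbol'])
close = pd.Series([100.0, 110.0, 99.0], index=index, name='close')

by_symbol = close.groupby(level='symbol')
next_close = by_symbol.shift(-1)             # tomorrow's close, logged onto today's row

forward_return = next_close / close - 1      # (close[t+1] - close[t]) / close[t]
relative_to_future = close / next_close - 1  # (close[t] - close[t+1]) / close[t+1]
via_pct_change = by_symbol.pct_change(-1)    # matches relative_to_future

print(forward_return.round(4).tolist())      # [0.1, -0.1, nan]
print(relative_to_future.round(4).tolist())  # [-0.0909, 0.1111, nan]
print(via_pct_change.round(4).tolist())      # [-0.0909, 0.1111, nan]
```

For small daily moves the two differ mainly in sign, so it pays to decide which convention an outcome column uses before interpreting model coefficients.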

Now, to put it together, we'll train a simple linear model in `scikit-learn`, using all features to predict `close_1` 

In [80]:

# first, create y (a series) and X (a dataframe), with only rows where 
# a valid value exists for both y and X
y = outcomes.close_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)

(996,)
(996, 5)


Note that all of these slightly tedious steps have left us with properly sized, identically indexed data objects.  At this point, the modeling is dead simple:

In [81]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X,y)
print("Model RSQ: "+ str(model.score(X,y)))

print("Coefficients: ")
pd.Series(model.coef_,index=X.columns).sort_values(ascending=False)

Model RSQ: 0.01598347165537528
Coefficients: 


intraday_chg           0.150482
volume_change_ratio    0.000976
day_of_month           0.000036
day_of_week           -0.000427
momentum_5_day        -0.005543
dtype: float64

Clearly, this model isn't very useful but illustrates the point. If we wanted to instead create a random forest to predict tomorrow's open, it'd be mostly copy-paste: 

In [82]:
from sklearn.ensemble import RandomForestRegressor

y = outcomes.open_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)

model = RandomForestRegressor(max_features=3)
model.fit(X,y)
print("Model Score: "+ str(model.score(X,y)))

print("Feature Importance: ")
pd.Series(model.feature_importances_,index=X.columns).sort_values(ascending=False)

(996,)
(996, 5)
Model Score: 0.7941872364575131
Feature Importance: 


momentum_5_day         0.269462
intraday_chg           0.266634
volume_change_ratio    0.257447
day_of_month           0.129595
day_of_week            0.076862
dtype: float64

This yields a vastly improved RSQ, but note that the score is measured on the same rows used to train the model, so it is almost certainly ridiculously overfitted, as random forests are prone to be.

We'll cover ways to systematically avoid allowing the model to overfit in future posts, but that requires going a bit further down the rabbit hole.  
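
As a quick illustration of how misleading an in-sample score can be, the sketch below fits the same kind of random forest to pure noise and scores it on a held-out later period.  This is just a single chronological split on synthetic data that I've made up, not the walk-forward approach planned for later posts.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is pure noise and unrelated to the features
rng = np.random.default_rng(42)
dates = pd.date_range('2015-01-01', periods=500, freq='B')
X_demo = pd.DataFrame(rng.normal(size=(500, 5)), index=dates,
                      columns=[f'f{i}' for i in range(5)])
y_demo = pd.Series(rng.normal(size=500), index=dates, name='target')

# Chronological split: train on the first 350 rows, test on the remaining 150
split = 350
X_train, X_test = X_demo.iloc[:split], X_demo.iloc[split:]
y_train, y_test = y_demo.iloc[:split], y_demo.iloc[split:]

model_demo = RandomForestRegressor(max_features=3, random_state=0)
model_demo.fit(X_train, y_train)

in_sample = model_demo.score(X_train, y_train)
out_of_sample = model_demo.score(X_test, y_test)
print(f"in-sample R^2: {in_sample:.2f}, out-of-sample R^2: {out_of_sample:.2f}")
```

Since there is nothing to learn here, any in-sample fit is memorized noise, and the held-out score should land near or below zero.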

One side point: in this example (and often, in real life) we've mixed together all observations from AAPL and CSCO into one dataset.  We could have alternatively trained two different models for the two symbols, which may have achieved better fit, but almost certainly at the cost of worse generalization out of sample.  The bias-variance trade-off in action!
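
To see the "better fit" half of that trade-off, here is a small sketch comparing one pooled linear model against one model per symbol, scored in-sample on each symbol's own rows.  The symbols `AAA` and `BBB` and all data are synthetic placeholders; the sketch says nothing about out-of-sample behavior, which is where the pooled model would be expected to have the edge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and target for two made-up symbols
rng = np.random.default_rng(7)
dates = pd.date_range('2016-01-04', periods=100, freq='B')
index = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
X_demo = pd.DataFrame(rng.normal(size=(200, 2)), index=index, columns=['f1', 'f2'])
y_demo = pd.Series(rng.normal(size=200), index=index, name='target')

# One model trained on all symbols mixed together
pooled_model = LinearRegression().fit(X_demo, y_demo)

# One model per symbol, each compared with the pooled model on that symbol's rows
symbols = X_demo.index.get_level_values('symbol')
scores = {}
for symbol in symbols.unique():
    rows = symbols == symbol
    own_model = LinearRegression().fit(X_demo[rows], y_demo[rows])
    scores[symbol] = {
        'per_symbol': own_model.score(X_demo[rows], y_demo[rows]),
        'pooled': pooled_model.score(X_demo[rows], y_demo[rows]),
    }

print(pd.DataFrame(scores).T)
```

In-sample, the per-symbol least-squares fit can never score worse than the pooled model on its own rows, which is exactly why in-sample fit alone is a poor guide to which design to prefer.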


## Prediction
Once the model is trained, it becomes a one-liner to make predictions from a set of feature values.  In this case, we'll simply feed the same X values used to train the model, but in live usage, of course, we'd want to apply the trained model to _new_ X values.  


In [89]:
pd.Series(model.predict(X),index=X.index).tail(10)

date        symbol
2016-12-22  AAPL     -0.001943
            CSCO      0.003121
2016-12-23  AAPL     -0.000231
            CSCO      0.002466
2016-12-27  AAPL      0.002638
            CSCO      0.001447
2016-12-28  AAPL     -0.002669
            CSCO     -0.000287
2016-12-29  AAPL      0.000690
            CSCO      0.002967
dtype: float64

Let me pause here to emphasize the most critical point to understand about this framework.  Read this twice!

The date of a feature row represents the day when a value would be known _after that day's trading_, using the feature value date as T=0.  The date of an outcome row represents what will happen in the n days _following_ that date.

**Predictions are indexed to the date of the _evening_ when the model could have been run**, _not_ to the day when it could have been traded.

In other words, on 2016-12-23, the prediction value represents what the model believes will happen _after_ 12/23.  In practical usage, we can't start using the trading signal until T+1 (since we can't get predictions until after markets are closed on T+0).  
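
If you want a series indexed by the day a signal can actually be acted on, one option is to shift predictions forward by one row per symbol.  The sketch below uses made-up prediction values and assumes every symbol has a row on every trading day, since the shift is row-based rather than calendar-based; `tradable_signal` is a name I've invented for illustration.

```python
import pandas as pd

# Made-up predictions, indexed by the evening on which they were produced (T)
dates = pd.to_datetime(['2016-12-27', '2016-12-28', '2016-12-29'])
index = pd.MultiIndex.from_product([dates, ['AAPL', 'CSCO']], names=['date', 'symbol'])
predictions_demo = pd.Series([0.01, -0.02, 0.03, 0.00, -0.01, 0.02],
                             index=index, name='prediction')

# A prediction stamped T is only usable from the same symbol's next row (T+1),
# so shift each symbol's predictions forward by one row
tradable_signal = predictions_demo.groupby(level='symbol').shift(1).rename('signal')
print(tradable_signal)
```

The first date for each symbol is null, because no earlier prediction exists to trade on.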

## Summary
This post presented the concept of organizing data into a `features` dataframe and an `outcomes` dataframe, and then showed how simple it is to join these two dataframes together to train a model.

True, the convention may take a few examples to get used to.  However, after trial and error, I've found this to be the most error-resistant, flexible, and high-performance way to go.

In the [next post], I will share some methods of feature engineering and feature selection.  

