# Training a Model

Indicators are a good starting point for developing a trading strategy. But to create a successful strategy, it is likely that a more sophisticated approach using predictive modeling will be needed.

Luckily, one of the main features of **PyBroker** is training and backtesting machine learning models. Models trained in **PyBroker** can use indicators as features. After training, the models can be backtested using a popular technique known as [Walkforward Analysis](https://www.youtube.com/watch?v=WBZ_Vv-iMv4), which will be explained later.

But first, let's get started with some needed imports!

In [1]:
import numpy as np
import pandas as pd
import pybroker
from numba import njit
from pybroker import Strategy, YFinance

Just as with [DataSource](https://www.pybroker.com/en/latest/reference/pybroker.data.html#pybroker.data.DataSource) and [Indicator](https://www.pybroker.com/en/latest/reference/pybroker.indicator.html#pybroker.indicator.Indicator) data, **PyBroker** can also cache trained models to disk. Caching all three is enabled by calling [pybroker.enable_caches](https://www.pybroker.com/en/latest/reference/pybroker.cache.html#pybroker.cache.enable_caches):

In [2]:
pybroker.enable_caches('walkforward_strategy')

Below is the close-minus-moving-average (CMMA) indicator that was implemented in [the last notebook](https://www.pybroker.com/en/latest/notebooks/5.%20Writing%20Indicators.html):

In [3]:
def cmma(bar_data, lookback):
    @njit  # Enable Numba JIT.
    # Define inner function.
    def vec_cmma(values):
        # Initialize the result array.
        n = len(values)
        out = np.array([np.nan for _ in range(n)])
        
        # For all bars starting at lookback:
        for i in range(lookback, n):
            # Calculate the moving average for the lookback.
            ma = 0
            for j in range(i - lookback, i):
                ma += values[j]
            ma /= lookback
            # Subtract the moving average from value.
            out[i] = values[i] - ma
        return out
    
    # Calculate for close prices.
    return vec_cmma(bar_data.close)

cmma_20 = pybroker.indicator('cmma_20', cmma, lookback=20)
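
As a quick sanity check of the indicator logic, the same calculation can be reproduced in plain Python on a tiny series. This is only an illustrative sketch: `cmma_plain` is a hypothetical helper that is not part of **PyBroker**, and it omits Numba. As in `vec_cmma`, the moving average covers the `lookback` bars *before* the current bar.

```python
import math


def cmma_plain(values, lookback):
    """Close-minus-moving-average, mirroring the loop in vec_cmma."""
    n = len(values)
    out = [math.nan] * n
    for i in range(lookback, n):
        # Average of the `lookback` values that precede bar i.
        ma = sum(values[i - lookback:i]) / lookback
        # Subtract the moving average from the current value.
        out[i] = values[i] - ma
    return out


result = cmma_plain([1.0, 2.0, 3.0, 4.0, 5.0], lookback=2)
print(result)  # The first `lookback` entries are NaN.
```

For bar 2, the moving average of the two preceding values is `(1 + 2) / 2 = 1.5`, so the indicator value is `3 - 1.5 = 1.5`.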

## Train and Backtest

Next, we want to build a model that predicts the next day's return using the 20-day CMMA. [Simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression) is a good model to begin experimenting with. Below we import [scikit-learn's](https://scikit-learn.org/stable/) [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model:

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Then the ```LinearRegression``` model is trained in the following function:

In [5]:
def train_slr(symbol, train_data, test_data):
    # Train
    # Previous day close prices.
    train_prev_close = train_data['close'].shift(1)
    # Calculate daily returns.
    train_daily_returns = (train_data['close'] - train_prev_close) / train_prev_close
    # Predict next day's return.
    train_data['pred'] = train_daily_returns.shift(-1)
    train_data = train_data.dropna()
    # Train the LinearRegression model to predict the next day's return
    # given the 20-day CMMA.
    X_train = train_data[['cmma_20']]
    y_train = train_data[['pred']]
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Test
    test_prev_close = test_data['close'].shift(1)
    test_daily_returns = (test_data['close'] - test_prev_close) / test_prev_close
    test_data['pred'] = test_daily_returns.shift(-1)
    test_data = test_data.dropna()
    X_test = test_data[['cmma_20']]
    y_test = test_data[['pred']]
    # Make predictions from test data.
    y_pred = model.predict(X_test)
    # Print goodness of fit.
    r2 = r2_score(y_test, np.squeeze(y_pred))
    print(symbol, f'R^2={r2}')
    
    # Return the trained model.
    return model

**PyBroker** will train a model for every ticker symbol in the backtest. This means our ```train_slr``` function will be called for each symbol and passed ```train``` and ```test``` data as [Pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

The ```train_slr``` function computes next day returns to train a ```LinearRegression``` model, using the 20-day CMMA indicator as the predictor. The test data is then used to compute R-squared. The final step is to return the trained model, making it available to the backtest.
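
To make the target construction concrete, here is a minimal plain-Python sketch of what the two `shift` calls in `train_slr` accomplish. The helper name `next_day_return_targets` and the sample prices are made up for illustration, and `None` stands in for the `NaN` values that Pandas would produce.

```python
def next_day_return_targets(closes):
    """Return (daily_returns, targets), where targets[i] is the return of bar i + 1."""
    n = len(closes)
    daily_returns = [None] * n
    for i in range(1, n):
        # Equivalent to (close - close.shift(1)) / close.shift(1).
        daily_returns[i] = (closes[i] - closes[i - 1]) / closes[i - 1]
    # Equivalent to daily_returns.shift(-1): label each bar with the NEXT bar's return.
    targets = daily_returns[1:] + [None]
    return daily_returns, targets


closes = [100.0, 110.0, 99.0, 108.9]
daily_returns, targets = next_day_return_targets(closes)
print(targets)  # Roughly [0.1, -0.1, 0.1, None]
```

The last bar has no label because its next-day return is not yet known, which is why `train_slr` calls `dropna()` before fitting.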

After defining a function to train a model, the function needs to be registered with **PyBroker**:

In [6]:
model_slr = pybroker.model('slr', train_slr, indicators=[cmma_20])

The model returned by ```train_slr``` will be referenced with the name ```slr``` in backtests. Calling [pybroker.model](https://www.pybroker.com/en/latest/reference/pybroker.model.html#pybroker.model.model) returns a new [ModelSource](https://www.pybroker.com/en/latest/reference/pybroker.model.html#pybroker.model.ModelSource) instance that references the ```train_slr``` function.

Now to create a [Strategy](https://www.pybroker.com/en/latest/reference/pybroker.strategy.html#pybroker.strategy.Strategy) that uses the model:

In [7]:
strategy = Strategy(YFinance(), '3/1/2017', '3/1/2022')
strategy.add_execution(None, ['NVDA', 'AMD'], models=model_slr)

Passing ```None``` instead of a function as the first argument to [add_execution](https://www.pybroker.com/en/latest/reference/pybroker.strategy.html#pybroker.strategy.Strategy.add_execution) will result in only the model being trained during the backtest. 

The model is trained using a 50/50 train/test split:

In [8]:
strategy.backtest(train_size=0.5)

Backtesting: 2017-03-01 00:00:00 to 2022-03-01 00:00:00

Loading bar data...
[*********************100%***********************]  2 of 2 completed
Loaded bar data: 0:00:00 

Computing indicators...


100% (2 of 2) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00



Train split: 2017-03-01 00:00:00 to 2019-08-28 00:00:00
AMD R^2=-0.006808549721842416
NVDA R^2=-0.004416132743176426
Finished training models: 0:00:00 

Finished backtest: 0:00:01
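
The R-squared values printed above are slightly negative, which means the model fit the test data a little worse than simply predicting the mean return. The sketch below computes R-squared by hand, using the standard definition of one minus the ratio of the residual sum of squares to the total sum of squares, on made-up numbers to show how a negative score arises. The helper `r_squared` and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.01, -0.02, 0.03, -0.01]

# Always predicting the mean of y_true scores exactly zero.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
r2_mean = r_squared(y_true, mean_pred)

# Predictions that are further off than the mean score below zero.
bad_pred = [0.02, 0.02, 0.02, 0.02]
r2_bad = r_squared(y_true, bad_pred)

print(r2_mean, r2_bad)
```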


## Walkforward Analysis

**PyBroker** implements an algorithm used for backtesting known as [Walkforward Analysis](https://www.youtube.com/watch?v=WBZ_Vv-iMv4). It works by first dividing the backtest data into a specified number of time windows that each contain a train/test split of data. Then the algorithm "walks forward" in time through the windows in the same way that a strategy would be executed in real life. 

For example, the model is first trained and then evaluated on the test data in the earliest window. When the algorithm walks forward to evaluate the next window in time, the algorithm includes the test data of the previous window for training. This process continues until all of the time windows are evaluated.

![Walkforward Diagram](https://github.com/edtechre/pybroker/blob/master/docs/_static/walkforward.png?raw=true)
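
The windowing idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not **PyBroker's** actual implementation: the function `walkforward_windows` is hypothetical, it works on bar indices rather than dates, and it ignores the `lookahead` adjustment.

```python
def walkforward_windows(n_bars, windows, train_size):
    """Yield (train_range, test_range) index ranges, oldest window first."""
    # Consecutive windows are offset by one test length, so each window's
    # test bars become training bars in the next window.
    window_len = int(n_bars / (1 + (windows - 1) * (1 - train_size)))
    train_len = int(window_len * train_size)
    test_len = window_len - train_len
    for w in range(windows):
        start = w * test_len
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test


splits = list(walkforward_windows(n_bars=1256, windows=3, train_size=0.5))
for train, test in splits:
    print(f"train={train.start}..{train.stop - 1}, test={test.start}..{test.stop - 1}")
```

With 1256 bars, 3 windows, and a 50/50 split, every train and test segment in this sketch is 314 bars long, and the test segment of one window is the train segment of the next.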

Next, consider a trading strategy that generates buy and sell signals from the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model that we trained earlier:

In [9]:
def hold_long(ctx):
    if not ctx.long_pos():
        # Buy if the next bar is predicted to have a positive return:
        if ctx.preds('slr')[-1] > 0:
            ctx.buy_shares = 100
    else:
        # Sell if the next bar is predicted to have a negative return:
        if ctx.preds('slr')[-1] < 0:
            ctx.sell_shares = 100
            
strategy.clear_executions()
strategy.add_execution(hold_long, ['NVDA', 'AMD'], models=model_slr)

The ```hold_long``` function opens a long position when the model predicts a positive return for the next bar, and then closes the position when the model predicts a negative return. 

Calling ```ctx.preds('slr')``` returns a [NumPy array](https://numpy.org/doc/stable/reference/generated/numpy.array.html) of predictions from the ```slr``` model for every bar in the test data. Here, ```slr``` refers to the model instance that was trained for the ticker symbol currently being executed in ```hold_long``` (i.e. ```NVDA``` or ```AMD```). The most recent prediction for ```NVDA``` or ```AMD``` is accessed with ```ctx.preds('slr')[-1]```, which is the ```slr``` model's prediction of the next bar's return.
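
The entry and exit rules in `hold_long` can be traced on a handful of made-up predictions without running a backtest. The sketch below is a simplification: `simulate_hold_long` is a hypothetical helper, and it assumes orders fill immediately, whereas a real backtest decides when and at what price an order is filled.

```python
def simulate_hold_long(preds):
    """Return (bar_index, action) pairs for a sequence of predicted returns."""
    actions = []
    in_position = False
    for i, pred in enumerate(preds):
        if not in_position and pred > 0:
            # No open position and a positive predicted return: buy.
            actions.append((i, "buy"))
            in_position = True
        elif in_position and pred < 0:
            # Open position and a negative predicted return: sell.
            actions.append((i, "sell"))
            in_position = False
    return actions


actions = simulate_hold_long([0.01, 0.02, -0.01, -0.03, 0.005])
print(actions)  # [(0, 'buy'), (2, 'sell'), (4, 'buy')]
```

Bars 1 and 3 produce no action: on bar 1 the position is already open and the prediction is still positive, and on bar 3 there is no position to close.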

But how did **PyBroker** generate the predictions? **PyBroker** will call the ```predict``` method on a trained model, passing input for all of the bars in the test data. By default, the input is a DataFrame containing all of the model's computed indicators.

Next, the [Strategy](https://www.pybroker.com/en/latest/reference/pybroker.strategy.html#pybroker.strategy.Strategy) is run with Walkforward Analysis using 3 time ```windows``` each with 50/50 train/test data. A ```lookahead``` of ```1``` is used since our ```slr``` model makes a prediction for one bar in the future. Passing the correct ```lookahead``` value is needed to prevent training data from leaking into the test boundary. The ```lookahead``` should always be the number of bars in the future being predicted.
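
To see why the ```lookahead``` matters, consider which bars are needed to label each training row. The sketch below uses bar indices and a hypothetical helper, `purge_train_indices`. It only illustrates the leakage problem and is not a description of how **PyBroker** adjusts the split internally.

```python
def purge_train_indices(train_idx, test_start, lookahead):
    """Keep only train bars whose label is fully known before the test split."""
    return [i for i in train_idx if i + lookahead < test_start]


train_idx = list(range(0, 5))  # Bars 0..4 are in the train split.
test_start = 5                 # Bars 5 and later are in the test split.
lookahead = 1                  # The label of bar i uses the close of bar i + 1.

# Train bars whose label would be computed from a test bar.
leaky = [i for i in train_idx if i + lookahead >= test_start]
safe = purge_train_indices(train_idx, test_start, lookahead)

print(leaky)  # [4]
print(safe)   # [0, 1, 2, 3]
```

Here the label of bar 4 depends on the close of bar 5, which belongs to the test split, so that row must not be used for training.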

In [10]:
result = strategy.walkforward(windows=3, train_size=0.5, lookahead=1)

Backtesting: 2017-03-01 00:00:00 to 2022-03-01 00:00:00

Loaded cached bar data.

Loaded cached indicator data.

Train split: 2017-03-06 00:00:00 to 2018-06-01 00:00:00
AMD R^2=-0.007950114729117885
NVDA R^2=-0.04203364470839133
Finished training models: 0:00:00 

Test split: 2018-06-04 00:00:00 to 2019-08-30 00:00:00


100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time:  0:00:00



Train split: 2018-06-04 00:00:00 to 2019-08-30 00:00:00
AMD R^2=0.0006422677593683757
NVDA R^2=-0.023591728578221893
Finished training models: 0:00:00 

Test split: 2019-09-03 00:00:00 to 2020-11-27 00:00:00


100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time:  0:00:00



Train split: 2019-09-03 00:00:00 to 2020-11-27 00:00:00
AMD R^2=-0.015508227883924253
NVDA R^2=-0.4567200095787838
Finished training models: 0:00:00 

Test split: 2020-11-30 00:00:00 to 2022-02-28 00:00:00


100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time:  0:00:00



Calculating bootstrap metrics: sample_size=1000, samples=10000...
Calculated bootstrap metrics: 0:00:02 

Finished backtest: 0:00:03


As you can see, the ``slr`` model was trained on a window's train data, and then our ``hold_long`` function ran on that window's test data.

In [11]:
result.metrics_df

| | name | value |
|---|---|---|
| 0 | trade_count | 129.0 |
| 1 | initial_market_value | 100000.0 |
| 2 | end_market_value | 108479.0 |
| 3 | total_pnl | 33879.0 |
| 4 | total_return_pct | 33.879 |
| 5 | total_profit | 59796.0 |
| 6 | total_loss | -25917.0 |
| 7 | total_fees | 0.0 |
| 8 | max_drawdown | -18151.0 |
| 9 | max_drawdown_pct | -15.898222 |


In [12]:
result.bootstrap.conf_intervals

| name | conf | lower | upper |
|---|---|---|---|
| Log Profit Factor | 97.5% | -0.295939 | 0.230177 |
| Log Profit Factor | 95% | -0.244621 | 0.190217 |
| Log Profit Factor | 90% | -0.190077 | 0.146544 |
| Sharpe Ratio | 97.5% | -0.065667 | 0.059654 |
| Sharpe Ratio | 95% | -0.055891 | 0.049016 |
| Sharpe Ratio | 90% | -0.044904 | 0.037764 |


In [13]:
result.bootstrap.drawdown_conf

| conf | amount | percent |
|---|---|---|
| 99.9% | -55216.5 | -40.780992 |
| 99% | -42949.5 | -33.320717 |
| 95% | -33154.75 | -26.928774 |
| 90% | -28447.5 | -23.517172 |


And we are done! The metrics above were evaluated on the test data from all of the time windows in the Walkforward Analysis. Our strategy needs a lot of improvement, but this gives you the gist of training a model with **PyBroker**.

Even though we based our buy and sell decisions on our model's predicted returns, we could also use model predictions to [rank tickers](https://www.pybroker.com/en/latest/notebooks/4.%20Ranking%20and%20Position%20Sizing.html).

And we are not limited to building linear regression models. We can also train other model types like gradient boosted machines, neural networks, or any other architecture we would like!

There are also some customization options available. We can specify an [input_data_fn](https://www.pybroker.com/en/latest/reference/pybroker.model.html#pybroker.model.model) for our model in case we need to customize how its input data is built. This would be needed when constructing input for autoregressive models (e.g. ARMA or RNN) since they use multiple past values to make predictions. Similarly, we can specify our own [predict_fn](https://www.pybroker.com/en/latest/reference/pybroker.model.html#pybroker.model.model) to customize how predictions are made (by default, the model's ``predict`` function is called).

Now you have enough knowledge to begin developing your own models and trading strategies in **PyBroker**!