# **Quantitative Finance Course**
---
Introduction to Algorithmic Trading with Python using Machine Learning

### What is Machine Learning?

Machine learning (ML) is a method where computers learn from data to make predictions or decisions. Instead of manually programming rules, we provide examples, and the model finds patterns to make accurate guesses. Let’s break this down with an example:




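<center><img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/free-quant-3/house-prices-graph-using-linear-regression_1.jpg"><a href="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/free-quant-3/house-prices-graph-using-linear-regression_1.jpg">Source</a></center>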

Imagine you want to predict house prices. One key factor is the square footage of the house.

**Concept of X (Features) and y (Target Variable):**
- X (Input/Feature): The independent variable(s) we use to predict something (e.g., Square Feet).
- y (Output/Target): The dependent variable we want to predict (e.g., House Price).
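
As a minimal sketch of the X / y idea, assume four made-up houses whose price happens to grow linearly with square footage. The numbers and variable names are hypothetical and only illustrate how a model learns a mapping from X to y; `np.polyfit` is used here as a simple least-squares line fit.

```python
import numpy as np

# Hypothetical houses: X is square footage, y is sale price (made-up numbers)
X = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([200_000.0, 300_000.0, 400_000.0, 500_000.0])

# Fit a straight line y ~ slope * X + intercept by least squares
slope, intercept = np.polyfit(X, y, deg=1)

# Use the fitted line to predict the price of an unseen 1,800 sq ft house
predicted_price = slope * 1800.0 + intercept
print(f"slope={slope:.2f}, predicted price={predicted_price:.0f}")
```

The model never sees a rule such as "200 per square foot"; it recovers that relationship from the examples alone.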

### Why Use Machine Learning?
---

#### 1. Capture Nonlinear Relationships
---


<center><img src="https://vitalflux.com/wp-content/uploads/2022/04/Linearly-vs-Not-linearly-separable-datasets.png"><a href="https://vitalflux.com/wp-content/uploads/2022/04/Linearly-vs-Not-linearly-separable-datasets.png">Source</a></center>

#### 2. Focus on Predictive Accuracy
---

Machine learning models often predict more accurately than a simple linear model.


<center><img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/quant-sc-03/rf_vs_lr.png"></center>

What machine learning can do in finance:

1. **Return Prediction**

   Predicting the returns of a collection of stocks.

2. **Portfolio Construction**

   Allocating capital across stocks or other investments to form a portfolio.

3. **Trade Decision Model**

   Predicting whether to buy or sell an asset.


## Modeling Case : Return Prediction

---

**Case Description** :

We are going to predict the future return of a stock. Why return, and not price?


**Data** :
---

1. Input

   Which factors can be included to predict future returns? Some candidates:

   - Past returns
   - Dividends
   - Fundamental variables from financial reports
   - News data

2. Output

   This is the return we are going to predict.

   Questions:

   - Which type of return? Gross return or percentage (net) return?
   - Which period should we use? Daily, monthly, quarterly, or annually?

- **Input** : Lagged 12-month momentum return, from month t-12 to month t-2 (`LagReturn-12`)

- **Output** : Percentage (net) return of month t (`NettReturn`)

### Preparing Data

---

Goals :    
- Obtaining Input and Output

- Fetch stock data (Astra International Tbk., ticker `ASII.JK`) from Yahoo Finance, from `2010-01-01` to `2024-12-31`

In [1]:
import yfinance as yf
print(yf.__version__)



0.2.66


In [2]:
#!pip install yfinance

In [3]:
import pandas as pd
import yfinance as yf
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
plt.style.use('ggplot')


In [4]:
start_date = '2010-01-01'
end_date = '2024-12-31'

stock_data = yf.Ticker('ASII.JK')\
               .history(start=start_date,
                        end=end_date,auto_adjust=False)

# check the data
stock_data.head(3)

Date,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
2010-01-04 00:00:00+07:00,3530.0,3550.0,3465.0,3530.0,1838.515259,40,0.0,0.0
2010-01-05 00:00:00+07:00,3550.0,3570.0,3485.0,3550.0,1848.931641,40,0.0,0.0
2010-01-06 00:00:00+07:00,3530.0,3580.0,3515.0,3530.0,1838.515259,40,0.0,0.0


In [5]:
print(f'Shape of Data  : \n \n Number of Rows : {stock_data.shape[0]} \n Number of Columns : {stock_data.shape[1]}')
print('\n')


Shape of Data  : 
 
 Number of Rows : 3700 
 Number of Columns : 8




- Resample the data from daily to monthly. Which price should represent each month?
  - The price at the beginning of the month?
  - The average?
  - The median?
  - The price at the end of the month?

- Answer : **the price at the end of the month**

In [6]:
stock_data_monthly = stock_data.copy()
stock_data_monthly = stock_data_monthly.resample('M').last()
stock_data_monthly

Date,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
2010-01-31 00:00:00+07:00,3595.0,3600.0,3485.0,3595.0,1872.369019,40,0.0,0.0
2010-02-28 00:00:00+07:00,3625.0,3680.0,3575.0,3625.0,1887.994019,27700000,0.0,0.0
2010-03-31 00:00:00+07:00,4190.0,4320.0,4190.0,4190.0,2182.260010,34990000,0.0,0.0
2010-04-30 00:00:00+07:00,4715.0,4720.0,4645.0,4715.0,2455.693848,38990000,0.0,0.0
2010-05-31 00:00:00+07:00,4315.0,4335.0,4215.0,4315.0,2247.363525,64990000,0.0,0.0
...,...,...,...,...,...,...,...,...
2024-08-31 00:00:00+07:00,5075.0,5125.0,5025.0,5100.0,4684.548828,74429400,0.0,0.0
2024-09-30 00:00:00+07:00,5100.0,5100.0,5000.0,5050.0,4638.622070,66737800,0.0,0.0
2024-10-31 00:00:00+07:00,5300.0,5300.0,5100.0,5100.0,4776.790039,77263000,0.0,0.0
2024-11-30 00:00:00+07:00,5125.0,5125.0,5025.0,5100.0,4776.790039,48756800,0.0,0.0


- Calculate Net Return

$$
\mathbf{R}_{t} = \left( \frac{\mathbf{P}_{t}-\mathbf{P}_{t-1}}{\mathbf{P}_{t-1}} \right)
$$

- $\mathbf{R}_{t}$ : Percentage Return at time t
- $\mathbf{P}_{t}$ : Close Price at time t

**Question**

- How do we calculate the net return for January 2012?
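
As a worked sketch of the formula, the snippet below applies it to the first three month-end closes of `ASII.JK` shown in the monthly table above. The same logic answers the question: the January 2012 net return divides the change from the December 2011 close by the December 2011 close. The variable names are illustrative only.

```python
import pandas as pd

# Month-end closes for ASII.JK, copied from the monthly table above
close = pd.Series(
    [3595.0, 3625.0, 4190.0],
    index=pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31"]),
    name="Close",
)

# Net return: (P_t - P_{t-1}) / P_{t-1}
previous_close = close.shift(1)
net_return = (close - previous_close) / previous_close

# The first month has no previous close, so its return is NaN
print(net_return.round(6))
```

The February and March values should round to 0.008345 and 0.155862, matching the `NettReturn` column computed in the next cells.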

In [7]:
# we only need close price
cols_to_use = ['Close']
stock_data_monthly = stock_data_monthly[cols_to_use]
stock_data_monthly.head(3)

Date,Close
2010-01-31 00:00:00+07:00,3595.0
2010-02-28 00:00:00+07:00,3625.0
2010-03-31 00:00:00+07:00,4190.0


In [8]:
# previous month's close, needed to calculate the net return
stock_data_monthly['Last Month Close'] = stock_data_monthly['Close'].shift(1)
stock_data_monthly.head(3)

Date,Close,Last Month Close
2010-01-31 00:00:00+07:00,3595.0,
2010-02-28 00:00:00+07:00,3625.0,3595.0
2010-03-31 00:00:00+07:00,4190.0,3625.0


In [9]:
stock_data_monthly['NettReturn'] = (stock_data_monthly['Close']\
                                    - stock_data_monthly['Last Month Close'])\
                                    / (stock_data_monthly['Last Month Close'])
stock_data_monthly.head(3)

Date,Close,Last Month Close,NettReturn
2010-01-31 00:00:00+07:00,3595.0,,
2010-02-28 00:00:00+07:00,3625.0,3595.0,0.008345
2010-03-31 00:00:00+07:00,4190.0,3625.0,0.155862


- Calculate the lagged 12-month return (momentum from month t-12 to month t-2)


$$
\mathbf{LagReturn}_{t-12} = \left( \frac{\mathbf{P}_{t-2}-\mathbf{P}_{t-12}}{\mathbf{P}_{t-12}} \right)
$$

- $\mathbf{P}_{t}$ : Close Price at time t

In [10]:
# get previous 2 month price
stock_data_monthly['Price_T-2'] = stock_data_monthly['Close'].shift(2)
stock_data_monthly.head(3)

Date,Close,Last Month Close,NettReturn,Price_T-2
2010-01-31 00:00:00+07:00,3595.0,,,
2010-02-28 00:00:00+07:00,3625.0,3595.0,0.008345,
2010-03-31 00:00:00+07:00,4190.0,3625.0,0.155862,3595.0


In [11]:
# get previous 12 month price
stock_data_monthly['Price_T-12'] = stock_data_monthly['Close'].shift(12)
stock_data_monthly.head(4)

Date,Close,Last Month Close,NettReturn,Price_T-2,Price_T-12
2010-01-31 00:00:00+07:00,3595.0,,,,
2010-02-28 00:00:00+07:00,3625.0,3595.0,0.008345,,
2010-03-31 00:00:00+07:00,4190.0,3625.0,0.155862,3595.0,
2010-04-30 00:00:00+07:00,4715.0,4190.0,0.125298,3625.0,


In [12]:
# calculate momentum
stock_data_monthly['LagReturn-12'] = (stock_data_monthly['Price_T-2'] -  stock_data_monthly['Price_T-12'])\
                            / (stock_data_monthly['Price_T-12'])
stock_data_monthly.head(4)

Date,Close,Last Month Close,NettReturn,Price_T-2,Price_T-12,LagReturn-12
2010-01-31 00:00:00+07:00,3595.0,,,,,
2010-02-28 00:00:00+07:00,3625.0,3595.0,0.008345,,,
2010-03-31 00:00:00+07:00,4190.0,3625.0,0.155862,3595.0,,
2010-04-30 00:00:00+07:00,4715.0,4190.0,0.125298,3625.0,,


- Drop the missing data

In [13]:
stock_data_monthly = stock_data_monthly.dropna()
stock_data_monthly.head(3)

Date,Close,Last Month Close,NettReturn,Price_T-2,Price_T-12,LagReturn-12
2011-01-31 00:00:00+07:00,4890.0,5455.0,-0.103575,5190.0,3595.0,0.443672
2011-02-28 00:00:00+07:00,5205.0,4890.0,0.064417,5455.0,3625.0,0.504828
2011-03-31 00:00:00+07:00,5700.0,5205.0,0.095101,4890.0,4190.0,0.167064


### Data Splitting

---

Goals :
- Split Data Into Input and Output
- Split Data Into Training and Testing Example

In [14]:
# split into input and output
input_col = ['LagReturn-12']
output_col = ['NettReturn']
X = stock_data_monthly[input_col]
y = stock_data_monthly[output_col]

In [15]:
y.shape

(168, 1)

In [16]:
print(f'Shape of Input  : \n \n Number of Rows : {X.shape[0]} \n Number of Columns : {X.shape[1]}')
print('\n')
print(f'Shape of Output : \n \n Number of Rows : {y.shape[0]} \n Number of Columns : {y.shape[1]}')

Shape of Input  : 
 
 Number of Rows : 168 
 Number of Columns : 1


Shape of Output : 
 
 Number of Rows : 168 
 Number of Columns : 1


- Split data Into Training and Test Data



<center><img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/quant-sc-03/split_data.png"></center>
<center><a href="https://scikit-learn.org/1.5/_images/grid_search_cross_validation.png">Source</a></center>


- Training Data : 80% of All Data
- Test Data : 20% of All Data




In [17]:
train_rows = int(0.8*X.shape[0])
print(train_rows)

134


In [18]:
# split data into training and test samples
X_train = X.iloc[:train_rows]
X_test = X.iloc[train_rows:]

y_train = y.iloc[:train_rows]
y_test = y.iloc[train_rows:]

Question :
- Why can't we randomly choose 80% of the data?
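
One way to think about the question: with time-ordered data, a random split would let the model train on months that come after the months it is tested on, which is information it would not have in practice. The sketch below, using a hypothetical helper `chronological_split` and made-up values, keeps every training observation strictly earlier than every test observation.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split a time-indexed DataFrame without shuffling (hypothetical helper)."""
    frame = frame.sort_index()
    n_train = int(train_fraction * len(frame))
    # the earliest rows train the model, the latest rows test it
    return frame.iloc[:n_train], frame.iloc[n_train:]

# Toy monthly data; the values are made up for illustration
toy = pd.DataFrame(
    {"NettReturn": [0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.04, 0.01, -0.03, 0.02]},
    index=pd.period_range("2020-01", periods=10, freq="M"),
)

train, test = chronological_split(toy)
print("last training month :", train.index.max())
print("first test month    :", test.index.min())
```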

In [19]:
from IPython.display import display
print('Training Data : \n')
display(X_train.head(3))
display(y_train.head(3))
display(y_train.tail(3))
print('\n')
print('Test Data : \n')
display(X_test.head(3))
display(y_test.head(3))
display(y_test.tail(3))

Training Data : 



Date,LagReturn-12
2011-01-31 00:00:00+07:00,0.443672
2011-02-28 00:00:00+07:00,0.504828
2011-03-31 00:00:00+07:00,0.167064


Date,NettReturn
2011-01-31 00:00:00+07:00,-0.103575
2011-02-28 00:00:00+07:00,0.064417
2011-03-31 00:00:00+07:00,0.095101


Date,NettReturn
2021-12-31 00:00:00+07:00,-0.012987
2022-01-31 00:00:00+07:00,-0.039474
2022-02-28 00:00:00+07:00,0.059361




Test Data : 



Date,LagReturn-12
2022-03-31 00:00:00+07:00,0.037915
2022-04-30 00:00:00+07:00,0.054545
2022-05-31 00:00:00+07:00,0.252381


Date,NettReturn
2022-03-31 00:00:00+07:00,0.133621
2022-04-30 00:00:00+07:00,0.152091
2022-05-31 00:00:00+07:00,-0.029703


Date,NettReturn
2024-10-31 00:00:00+07:00,0.009901
2024-11-30 00:00:00+07:00,0.0
2024-12-31 00:00:00+07:00,-0.039216


In [20]:
print(f' X_train  : \n \n Number of Rows : {X_train.shape[0]} \n Number of Columns : {X_train.shape[1]}')
print('\n')
print(f'y_train : \n \n Number of Rows : {y_train.shape[0]} \n Number of Columns : {y_train.shape[1]}')

 X_train  : 
 
 Number of Rows : 134 
 Number of Columns : 1


y_train : 
 
 Number of Rows : 134 
 Number of Columns : 1


In [21]:
print(f' X_test  : \n \n Number of Rows : {X_test.shape[0]} \n Number of Columns : {X_test.shape[1]}')
print('\n')
print(f'y_test : \n \n Number of Rows : {y_test.shape[0]} \n Number of Columns : {y_test.shape[1]}')

 X_test  : 
 
 Number of Rows : 34 
 Number of Columns : 1


y_test : 
 
 Number of Rows : 34 
 Number of Columns : 1


### Train Machine Learning Model

---

- Fit a random forest regressor on the training data (default hyperparameters are used here)

In [23]:
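from sklearn.ensemble import RandomForestRegressor
random_forest_model = RandomForestRegressor()
# pass the target as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
random_forest_model.fit(X_train, y_train['NettReturn'])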

### Evaluate Model

---

Goal :    

- Check the performance of model of unseen data / test data

- Calculate the Error

$$\mathbf{RMSE} =
\sqrt{\cfrac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^{2}}
$$

- $y_i$ : actual value of observation $i$
- $\hat{y}_i$ : predicted value of observation $i$
- $N$ : number of observations
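
To make the formula concrete, here is a small hand-rolled sketch on three made-up monthly returns. The helper `rmse` is hypothetical and only mirrors the equation above; it is not the scikit-learn implementation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # squared difference per observation, averaged, then square-rooted
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up monthly returns: the errors are 0.01, -0.02 and 0.00
actual = [0.02, -0.01, 0.03]
predicted = [0.01, 0.01, 0.03]
print(rmse(actual, predicted))  # roughly 0.0129
```

Because the errors are squared before averaging, the single 0.02 miss contributes four times as much as the 0.01 miss.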

In [25]:
from sklearn.metrics import mean_squared_error

# predict on training data
y_pred_train = random_forest_model.predict(X_train)

# calculate RMSE for training data
# (take the square root of the MSE ourselves; the `squared` argument
#  is not accepted by the scikit-learn version used here and would
#  also apply the square root twice)
error_train = np.sqrt(mean_squared_error(
    y_true=y_train,
    y_pred=y_pred_train
))

# predict on test data
y_pred_test = random_forest_model.predict(X_test)

# calculate RMSE for test data
error_test = np.sqrt(mean_squared_error(
    y_true=y_test,
    y_pred=y_pred_test
))

print('Error in Training Data : ', error_train)
print('Error in Test Data : ', error_test)

#### Plot Error

---

In [None]:
y_train

In [None]:
fig,ax = plt.subplots(nrows=1,
                      ncols=2,
                      figsize=(12,6))
sns.scatterplot(x=y_train['NettReturn'],y=y_pred_train,label='Training Data',ax=ax[0],color='blue')
sns.scatterplot(x=y_test['NettReturn'],y=y_pred_test,label='Test Data',ax=ax[1])
ax[0].set_xlabel('Actual Data')
ax[0].set_ylabel('Predicted Data')
ax[1].set_xlabel('Actual Data')
ax[1].set_ylabel('Predicted Data')
# a figure-level title, so it applies to both panels instead of only the last one
fig.suptitle('Predicted vs Actual Value')

**Experiment** :     

- What if we change the number of lagged returns we use?
  - 4
  - 5
  - 6
  - 1
  - 2

# Building a Machine Learning Portfolio Strategy
---




**Objective**

1. Understand how to build a machine-learning-powered portfolio

**Factor Definitions**

1.  *LagReturn-6M* is defined as the one-month net return observed six months ago, i.e.

  $$
  \mathbf{R}_{t-6} = \left( \frac{\mathbf{P}_{t-6}-\mathbf{P}_{t-7}}{\mathbf{P}_{t-7}} \right)
  $$


   - $\mathbf{R}_{t-6}$ : Lag 6 Month Return
   - $\mathbf{P}_{t}$ : Close Price at time t


2. *Momentum Variable*: net return between the previous 12 months and the previous 2 months


  $$
  \mathbf{MOM}_{t} = \left( \frac{\mathbf{P}_{t-2}-\mathbf{P}_{t-12}}{\mathbf{P}_{t-12}} \right)
  $$

  - $\mathbf{MOM}_{t}$ : Momentum of Return at time t
  - $\mathbf{P}_{t}$ : Close Price at time t

3. *Reversal Variable*: the previous month's return


$$
\mathbf{REV}_{t} = \left( \frac{\mathbf{P}_{t-1}-\mathbf{P}_{t-2}}{\mathbf{P}_{t-2}} \right)
$$

- $\mathbf{REV}_{t}$ : Reversal of Return at time t
- $\mathbf{P}_{t}$ : Close Price at time t

**Methodology**

- We construct the portfolio by taking the **Top N%** of stocks
- The **Top N%** is ranked by the returns predicted by the machine learning model
- We then compare the cumulative returns with the IHSG index

**Assumptions**
- A transaction fee of 0.5%, deducted from each month's portfolio return
- The portfolio is rebalanced every month

**Portfolio Construction Workflow**

For each portfolio construction (rebalance) period:


<center><img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/quant-sc-03/worklow-2.png"></center>

**Example of results**

<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/quant-p01/output_portfolio_indo.png" width=800>
</center>

## Load Data
---

**Data Definition**  

<center>
<table border="1">
  <tr>
    <th>Columns</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Ticker</td>
    <td>Company Code</td>
  </tr>
  <tr>
    <td>Date</td>
    <td>Datetime (End of the Month)</td>
  </tr>
  <tr>
    <td>PriceClose</td>
    <td>Close Price at given date</td>
  </tr>
  <tr>
    <td>PriceOpen</td>
    <td>Open Price at given date</td>
  </tr>
  <tr>
    <td>PriceHigh</td>
    <td>Highest Price at given date</td>
  </tr>
  <tr>
    <td>PriceLow</td>
    <td>Lowest Price at given date</td>
  </tr>
</table>

</center>

In [None]:
# read data
path='price_data_ver1.csv'

data = pd.read_csv(path,
                   parse_dates=['Date'])
data.head()

In [None]:
data.Ticker.nunique()

In [None]:
stocks_data = data[['Ticker','Date','PriceClose']].copy()

### Creating Feature
---



#### 1. **Momentum Variable**


$$
\mathbf{MOM}_{t} = \left( \frac{\mathbf{P}_{t-2}-\mathbf{P}_{t-12}}{\mathbf{P}_{t-12}} \right)
$$

- $\mathbf{MOM}_{t}$ : Momentum of Return at time t
- $\mathbf{P}_{t}$ : Close Price at time t

In [None]:
def calculate_momentum(data) :
  """Calculate Momentum using T-12 Month up to T-2 Month Return"""
  data = data.copy()
  container = []
  # loop all over ticker
  for ticker in data['Ticker'].unique() :
    # copy the slice so the new column is written to an independent frame
    ticker_data = data.loc[
        data['Ticker']==ticker
    ].copy()
    # shift within this ticker only, so prices of other tickers never leak in
    # (rows are assumed to be sorted by date within each ticker)
    ticker_data['FactorMomentum'] = (ticker_data['PriceClose'].shift(2) - ticker_data['PriceClose'].shift(12))\
                                  /( ticker_data['PriceClose'].shift(12))
    # append data
    container.append(ticker_data)
  # combined data again
  combined_data = pd.concat(container,axis=0)
  return combined_data

stocks_data = calculate_momentum(data = stocks_data)
stocks_data.head(13)
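
The per-ticker loop matters because shifting a long-format price column as a whole mixes neighbouring tickers. A minimal sketch with two made-up tickers shows the difference; it uses `groupby(...).shift(...)` as a compact alternative to the loop, and all names and prices here are hypothetical.

```python
import pandas as pd

# Toy long-format prices for two hypothetical tickers (values are made up)
month_ends = list(pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]))
prices = pd.DataFrame({
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "Date": month_ends * 2,
    "PriceClose": [100.0, 110.0, 121.0, 50.0, 55.0, 44.0],
})
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Shifting the whole column leaks AAA's last price into BBB's first row
prices["PrevClose_naive"] = prices["PriceClose"].shift(1)

# Shifting inside each ticker group keeps the two price series separate
prices["PrevClose"] = prices.groupby("Ticker")["PriceClose"].shift(1)
prices["Return"] = (prices["PriceClose"] - prices["PrevClose"]) / prices["PrevClose"]
print(prices)
```

In the naive column, the first `BBB` row inherits 121.0 from `AAA`, which would produce a meaningless return; the grouped version leaves it missing instead.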

#### 2. **Reversal Variable**


$$
\mathbf{REV}_{t} = \left( \frac{\mathbf{P}_{t-1}-\mathbf{P}_{t-2}}{\mathbf{P}_{t-2}} \right)
$$

- $\mathbf{REV}_{t}$ : Reversal of Return at time t
- $\mathbf{P}_{t}$ : Close Price at time t

In [None]:
def calculate_reversal(data) :
  """Calculate Reversal using previous month return """
  data = data.copy()
  container = []
  # loop all over ticker
  for ticker in data['Ticker'].unique() :
    # copy the slice so the new column is written to an independent frame
    ticker_data = data.loc[
        data['Ticker']==ticker
    ].copy()
    # shift within this ticker only, so prices of other tickers never leak in
    ticker_data['FactorReversal'] = (ticker_data['PriceClose'].shift(1) - ticker_data['PriceClose'].shift(2))\
                                  /( ticker_data['PriceClose'].shift(2))
    # append data
    container.append(ticker_data)
  # combined data again
  combined_data = pd.concat(container,axis=0)
  return combined_data

stocks_data = calculate_reversal(data = stocks_data)
stocks_data.head(13)

#### 3. **Lagged Return** Last 6 Month


$$
\mathbf{R}_{t-6} = \left( \frac{\mathbf{P}_{t-6}-\mathbf{P}_{t-7}}{\mathbf{P}_{t-7}} \right)
$$

- $\mathbf{R}_{t-6}$ : Lag 6 Month Return
- $\mathbf{P}_{t}$ : Close Price at time t

In [None]:
def calculate_lag_return(data) :
  """Calculate Lag Return  using previous 6 month return """
  data = data.copy()
  container = []
  # loop all over ticker
  for ticker in data['Ticker'].unique() :
    # copy the slice so the new column is written to an independent frame
    ticker_data = data.loc[
        data['Ticker']==ticker
    ].copy()
    # shift within this ticker only, so prices of other tickers never leak in
    ticker_data['FactorLagReturn-6M'] = (ticker_data['PriceClose'].shift(6) - ticker_data['PriceClose'].shift(7))\
                                  /( ticker_data['PriceClose'].shift(7))
    # append data
    container.append(ticker_data)
  # combined data again
  combined_data = pd.concat(container,axis=0)
  return combined_data

stocks_data = calculate_lag_return(data = stocks_data)
stocks_data.head(13)

### Calculate Label




Our label: next month's net return
$$
\mathbf{R}_{t+1} = \left( \frac{\mathbf{P}_{t+1}-\mathbf{P}_{t}}{\mathbf{P}_{t}} \right)
$$

- $\mathbf{R}_{t+1}$ : Next Month Return
- $\mathbf{P}_{t}$ : Close Price at time t

In [None]:
def calculate_next_month_return(data) :
    """Calculate Next Month Return """
    data = data.copy()
    container = []
    # loop all over ticker
    for ticker in data['Ticker'].unique() :
      # copy the slice so the new column is written to an independent frame
      ticker_data = data.loc[
          data['Ticker']==ticker
      ].copy()
      # shift within this ticker only, so prices of other tickers never leak in
      ticker_data['Rt+1'] = (ticker_data['PriceClose'].shift(-1) - ticker_data['PriceClose'])\
                                    /( ticker_data['PriceClose'])
      # the last row of each ticker has no next-month price; it is labelled -1 here
      ticker_data['Rt+1'] = ticker_data['Rt+1'].fillna(-1)
      # append data
      container.append(ticker_data)
    # combined data again
    combined_data = pd.concat(container,axis=0)
    return combined_data

In [None]:
stocks_data = calculate_next_month_return(data=stocks_data)

### Select Only Necessary Column
---

In [None]:
stocks_data.columns

In [None]:
cols_preserved = ['Ticker', 'Date', 'FactorMomentum',
                  'FactorReversal','FactorLagReturn-6M', 'Rt+1']

stocks_data = stocks_data.loc[:,
                              cols_preserved
                              ]
stocks_data.head(2)

In [None]:
# filter date
start_period = '2009-01-01'
end_period = '2024-12-31'

stocks_data = stocks_data.loc[
    (stocks_data['Date'] >= start_period) &
    (stocks_data['Date'] <= end_period)
]
stocks_data.head(5)

## BackTesting
---

Is backtesting different for a machine learning portfolio strategy?



> **Yes. We need a buffer period of historical data to train the machine learning model before the first portfolio can be constructed.**





<center>
<img src="http://www.mlfactor.com/images/backtestoos.png">
<a href="http://www.mlfactor.com/images/backtestoos.png">Source</a>
</center>

Setup / Assumptions :

- Rebalance : monthly
- Long-only transactions : we only buy assets that we expect to have a positive return

### Backtesting : One Period Example
---

#### Creating Buffer Period
---

Initial Buffer Period : 2-10 Years

In [None]:
start_period = '2009-01-01'
end_period = '2024-12-31'
buffer_n_year = 2

initial_buffer_start = pd.to_datetime(start_period) + pd.offsets.MonthEnd(n=1) 
initial_buffer_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year)

In [None]:
print(initial_buffer_start)

In [None]:
print(initial_buffer_end) # validate if its ending in 2010-12-31
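
The two offsets are easy to misread, so here is a small self-contained check of what they produce for this notebook's settings. The variable names `buffer_start`, `buffer_end` and `first_rebalance` are illustrative only.

```python
import pandas as pd

start_period = "2009-01-01"
buffer_n_year = 2

start = pd.to_datetime(start_period)
# MonthEnd(1) rolls forward to the last day of the start month
buffer_start = start + pd.offsets.MonthEnd(n=1)
# YearEnd(2) rolls forward to the second year-end after the start date
buffer_end = start + pd.offsets.YearEnd(n=buffer_n_year)
# the first rebalance date is the month-end right after the buffer
first_rebalance = buffer_end + pd.offsets.MonthEnd(n=1)

print(buffer_start.date(), buffer_end.date(), first_rebalance.date())
# prints: 2009-01-31 2010-12-31 2011-01-31
```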

In [None]:
buffer_data = stocks_data.loc[
    (stocks_data['Date'] >= initial_buffer_start) &
    (stocks_data['Date'] <= initial_buffer_end)
]

In [None]:
buffer_data

Split the data for training

In [None]:
feature = ['FactorMomentum','FactorReversal','FactorLagReturn-6M']

X_buffer = buffer_data.loc[
    :,feature
]
y_buffer = buffer_data['Rt+1']

Make it into a function

In [None]:
def get_training_data(data,
                      initial_buffer_start,
                      initial_buffer_end,
                      feature_cols) :
  data = data.copy()
  # filter buffer
  training_data = data.loc[
    (data['Date'] >= initial_buffer_start) &
    (data['Date'] <= initial_buffer_end)
  ]


  X_train = training_data.loc[
      :,feature_cols
  ]
  y_train = training_data['Rt+1']
  return X_train,y_train



In [None]:
start_period = '2009-01-01'
end_period = '2024-12-31'
buffer_n_year = 2

initial_buffer_start = pd.to_datetime(start_period) + pd.offsets.MonthEnd(n=1)
initial_buffer_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year) # should end on 2010-12-31

feature = ['FactorMomentum','FactorReversal','FactorLagReturn-6M']

X_buffer,y_buffer = get_training_data(data = stocks_data,
                      initial_buffer_start = initial_buffer_start,
                      initial_buffer_end = initial_buffer_end ,
                      feature_cols = feature)


In [None]:
X_buffer

In [None]:
y_buffer

#### Training Machine Learning Model
---

Next, we are going to simulate training on buffer period

In [None]:
from sklearn.ensemble import RandomForestRegressor


model = RandomForestRegressor()
model.fit(X_buffer,y_buffer)

Create a function to fit the model

In [None]:
def train_ml_model(X_train,
                   y_train) :

   model = RandomForestRegressor(n_estimators=20,
                                 random_state=42) # for reproducible result
   model.fit(X_train,y_train)

   return model


In [None]:
model = train_ml_model(X_train = X_buffer,
                       y_train = y_buffer)


#### Construct Portfolio
---

Construct the portfolio in the month right after the buffer period

In [None]:
first_porto_date = (initial_buffer_end + \
               pd.offsets.MonthEnd(n=1)) # it should yield 2011-01-31
print('first porto construction date',first_porto_date)

In [None]:
# select data
X_porto = stocks_data.loc[
    (stocks_data['Date'] == first_porto_date),feature
]
y_porto  = stocks_data.loc[
    (stocks_data['Date'] == first_porto_date),['Ticker','Date']
]
return_porto = stocks_data.loc[
    stocks_data['Date'] == first_porto_date,
    ['Ticker','Date','Rt+1'] ]

In [None]:
def get_return_pred_data(data,
                         porto_date,
                         feature_cols,
                         ) :
    data = data.copy()

    X_porto = data.loc[
    (data['Date'] == porto_date),feature_cols
    ]

    y_porto  = data.loc[
        (data['Date'] == porto_date),['Ticker','Date']
    ]

    return_porto = data.loc[
    (data['Date'] == porto_date),['Ticker','Date','Rt+1']
    ]
    return X_porto,y_porto,return_porto


In [None]:
feature = ['FactorMomentum','FactorReversal','FactorLagReturn-6M']

X_porto,y_porto,return_porto = get_return_pred_data(data=stocks_data,
                         porto_date = first_porto_date,
                         feature_cols = feature,
                         )

In [None]:
y_porto

In [None]:
X_porto

In [None]:
return_porto

In [None]:
# predict next return

y_porto['Pred Rt+1'] = model.predict(X_porto)

In [None]:
y_porto.head()

In [None]:
# construct porto --> pick top 10% bin
y_porto['is_porto'] = pd.qcut(y_porto['Pred Rt+1'],q=[0,0.9,1.0],labels=[False,True])
y_porto

In [None]:
# show which stocks belong to our porto
y_porto.loc[
    y_porto['is_porto']==True
]
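
As a sketch of the selection rule on made-up predictions, the snippet below keeps the stocks whose predicted return lies above the 90th percentile. It uses a plain quantile threshold rather than `pd.qcut`; in typical cases without ties this should pick the same names. The tickers and values are hypothetical.

```python
import pandas as pd

# Hypothetical predicted returns for ten made-up tickers
predictions = pd.DataFrame({
    "Ticker": [f"T{i:02d}" for i in range(1, 11)],
    "Pred Rt+1": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10],
})

percentile = 0.1  # keep the top 10% of predictions
# value below which 90% of the predictions fall
threshold = predictions["Pred Rt+1"].quantile(1 - percentile)
predictions["is_porto"] = predictions["Pred Rt+1"] > threshold

selected = predictions.loc[predictions["is_porto"], "Ticker"].tolist()
print(selected)
```

With ten candidates and a 10% cut, only the single highest prediction is selected.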

In [None]:
# filter return data based on portofolio constructed
ticker_list = y_porto.loc[
    y_porto['is_porto']==True,'Ticker'
].unique().tolist()

return_portofolio = return_porto.loc[
    return_porto['Ticker'].isin(ticker_list)
]

In [None]:
return_portofolio

In [None]:
def construct_portofolio(data,
                         porto_date,
                         feature_cols,
                         model,
                         percentile) :

    X_porto,y_porto,return_porto = get_return_pred_data(data=data,
                         porto_date = porto_date,
                         feature_cols = feature_cols,
                         )
    # predict return using model
    y_porto['Pred Rt+1'] = model.predict(X_porto)

    # construct portofolio
    bound = 1  - percentile
    y_porto['is_porto'] = pd.qcut(y_porto['Pred Rt+1'],q=[0,bound,1.0],labels=[False,True])

    ticker_list = y_porto.loc[
    y_porto['is_porto']==True,'Ticker'
    ].unique().tolist()

    return_portofolio = return_porto.loc[
        return_porto['Ticker'].isin(ticker_list)
    ]

    return return_portofolio

In [None]:
feature = ['FactorMomentum','FactorReversal','FactorLagReturn-6M']
percentile = 0.1 # top 10%
return_portofolio = construct_portofolio(data=stocks_data,
                         porto_date = first_porto_date,
                         feature_cols = feature,
                         model = model ,
                         percentile=percentile)

In [None]:
return_portofolio

#### Calculate Portfolio Return
---

$$
\mathbf{Portofolio Return}_t = \left(\sum_{i=1}^{N} w_i \, R_{it}\right) - FEE
$$

- $w_i$ : weight of stock **i** in the portfolio
- $R_{it}$ : net return of stock **i** in period **t**
- $N$ : number of stocks in the portfolio at time **t**
- $FEE$ : transaction cost, assumed to be 0.5%
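
A short sketch of the formula on made-up numbers follows. The helper `portfolio_return` is hypothetical; with no weights given it assumes uniform weights, which reduces the weighted sum to a simple average.

```python
def portfolio_return(stock_returns, weights=None, fee=0.005):
    """Weighted sum of stock returns minus a flat fee (hypothetical helper)."""
    n = len(stock_returns)
    if weights is None:
        # uniform weights: every stock gets 1/N of the capital
        weights = [1.0 / n] * n
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    before_fee = sum(w * r for w, r in zip(weights, stock_returns))
    return before_fee - fee

# Made-up next-month returns of three selected stocks
returns = [0.02, -0.01, 0.05]
print(portfolio_return(returns))                           # uniform weights
print(portfolio_return(returns, weights=[0.5, 0.3, 0.2]))  # custom weights
```

With uniform weights the result is the average return (0.02) minus the 0.005 fee, which is what the notebook computes with `.mean() - FEE`.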

In [None]:
# return on portofolio --> since our weight is uniform --> similar to average
FEE = 0.005
return_first_month_portofolio = return_portofolio['Rt+1'].mean() - FEE
print('Return at first month portofolio using uniform weight',return_first_month_portofolio)

Make it into a function

In [None]:
def calculate_portofolio_return(return_porto,FEE) :

    portofolio_return = return_porto['Rt+1'].mean() - FEE
    return portofolio_return

In [None]:
FEE = 0.005
return_first_month_portofolio = calculate_portofolio_return(return_porto=return_portofolio,
                                                            FEE=FEE)
print('Return at first month portofolio using uniform weight',return_first_month_portofolio)

### Backtesting : Multi Period Example
---

Goal : simulate the backtest over multiple periods

In [None]:
# backtest configuration
start_period = '2009-01-01'
end_period = '2024-12-31'
buffer_n_year = 2

percentile = 0.1

In [None]:
feature_cols = ['FactorMomentum','FactorReversal','FactorLagReturn-6M']

In [None]:
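stocks_data = stocks_data.copy()
stocks_data = stocks_data.dropna(subset=['Rt+1'])
# keep only the backtest window (the result must be assigned to take effect);
# inclusive bounds match the date filter applied earlier
stocks_data = stocks_data.loc[
    (stocks_data['Date'] >= start_period ) &
    (stocks_data['Date'] <= end_period)
]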
# create buffer period
initial_buffer_start = pd.to_datetime(start_period) + pd.offsets.MonthEnd(n=1)
initial_buffer_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year)
porto_construction_date = (initial_buffer_end + pd.offsets.MonthEnd(n=1))
print('Backtest Started')


collection_return = []
end_time = pd.to_datetime(end_period)
while porto_construction_date < end_time :

    print('Constructing Portofolio On : ',porto_construction_date)
    # get training data for model
    X_buffer,y_buffer = get_training_data(data = stocks_data,
                          initial_buffer_start = initial_buffer_start,
                          initial_buffer_end = initial_buffer_end ,
                          feature_cols = feature_cols)
    print('-----Training ML Model')
    model = train_ml_model(X_train = X_buffer,
                          y_train = y_buffer)


    # construct porto
    return_portofolio = construct_portofolio(data=stocks_data,
                            porto_date = porto_construction_date,
                            feature_cols = feature_cols,
                            model = model ,
                            percentile = percentile)
    print('-----Calculating Return')
    return_first_month_portofolio = calculate_portofolio_return(return_porto=return_portofolio,
                                                      FEE=FEE)
    porto_return_df = pd.DataFrame(index=[0])
    porto_return_df['ReturnDate'] = porto_construction_date + pd.offsets.MonthEnd(n=1)
    porto_return_df['PortofolioReturn'] = return_first_month_portofolio
    # calculate cumulative Return


    collection_return.append(porto_return_df)
    # shift buffer period
    initial_buffer_start = initial_buffer_start + pd.offsets.MonthEnd(n=1)
    initial_buffer_end = initial_buffer_end + pd.offsets.MonthEnd(n=1)
    porto_construction_date = porto_construction_date + pd.offsets.MonthEnd(n=1)
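
The loop above slides a fixed-length training window forward one month at a time. The sketch below isolates just that date logic, without any model, so the schedule can be inspected on its own. `rolling_schedule` is a hypothetical helper and the shortened end date is chosen only to keep the output small.

```python
import pandas as pd

def rolling_schedule(start_period, end_period, buffer_n_year):
    """Yield (train_start, train_end, rebalance_date) tuples (hypothetical helper)."""
    month = pd.offsets.MonthEnd(n=1)
    train_start = pd.to_datetime(start_period) + month
    train_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year)
    rebalance_date = train_end + month
    end_time = pd.to_datetime(end_period)
    while rebalance_date < end_time:
        yield train_start, train_end, rebalance_date
        # slide the whole window forward by one month
        train_start = train_start + month
        train_end = train_end + month
        rebalance_date = rebalance_date + month

schedule = list(rolling_schedule("2009-01-01", "2011-06-30", 2))
for train_start, train_end, rebalance_date in schedule:
    print(train_start.date(), train_end.date(), rebalance_date.date())
```

Each rebalance date sits one month after the end of its training window, so the model only ever sees data dated before the portfolio is formed.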


#### Report Portfolio Performance
---

- Cumulative Return
- Average Return
- Volatility of Return


##### Calculate Cumulative Return
---

$$
\mathbf{CR}_{t+T} = \mathbf{GR}_{t+1} \times \mathbf{GR}_{t+2} \times \cdots \times \mathbf{GR}_{t+T}
$$

- $\mathbf{CR}_{t+T}$ : compound (cumulative) return of the portfolio up to time $t+T$
- $\mathbf{GR}_{t}$ : gross return of the portfolio at time $t$

where:

$\mathbf{GR}_{t} = \mathbf{Portofolio Return}_t + 1$
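
A worked sketch on three made-up monthly returns shows how the product accumulates. The values are hypothetical; `cumprod` is the same pandas call used in the next cell.

```python
import pandas as pd

# Made-up monthly portfolio returns: +10%, -5%, +2%
monthly_returns = pd.Series([0.10, -0.05, 0.02])

# Gross return is the net return plus one
gross_returns = 1 + monthly_returns

# Running product of gross returns: 1.1, 1.1 * 0.95, 1.1 * 0.95 * 1.02
cumulative = gross_returns.cumprod()
print(cumulative.round(4).tolist())  # approximately [1.1, 1.045, 1.0659]
```

A final value of about 1.0659 means one unit of capital grew by roughly 6.59% over the three months.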


In [None]:
# combine the results of every period (reset the index, since each piece is indexed 0)
collection_return_df = pd.concat(collection_return,axis=0,ignore_index=True)
collection_return_df['PortofolioCumulativeReturn'] = (1 + collection_return_df['PortofolioReturn'])\
                                                      .cumprod(axis=0)

In [None]:
collection_return_df

##### Mean Return
---


In [None]:
mean_return = collection_return_df['PortofolioReturn'].mean()
print(f'Average Monthly Return : {mean_return*100} %')

#####  Return Volatility
---


In [None]:
return_volatility = collection_return_df['PortofolioReturn'].std()
print(f' Monthly Return Volatility : {return_volatility*100} %')

In [None]:
print('Return Best Case : ',mean_return + 2*return_volatility)
print('Return Worst Case : ',mean_return - 2*return_volatility)

In [None]:
def report_performance(return_data) :
    """Add the cumulative return column, then print and return the mean and volatility of monthly returns."""
    return_data = return_data.copy()
    # cumulative return = running product of gross returns (1 + return)
    return_data['PortofolioCumulativeReturn'] = (1 + return_data['PortofolioReturn'])\
                                                          .cumprod(axis=0)
    mean_return = return_data['PortofolioReturn'].mean()
    print(f'Average Monthly Return : {mean_return*100:.2f} %')

    return_volatility = return_data['PortofolioReturn'].std()
    print(f'Monthly Return Volatility : {return_volatility*100:.2f} %')

    return return_data,mean_return,return_volatility

#### Compare with Index Return  
---

We benchmark the portfolio against the return of the IHSG (Jakarta Composite Index, Yahoo Finance ticker `^JKSE`).

In [None]:
ihsg_return = yf.Ticker('^JKSE')\
                .history(period='max')
# resample into monthly data
ihsg_return = ihsg_return.resample('M').last()
ihsg_return  = ihsg_return.reset_index()
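# calculate monthly return
ihsg_return['Return'] =  (ihsg_return['Close'] - ihsg_return['Close'].shift(1))\
                            / (ihsg_return['Close'].shift(1))
# subtract the fee assumption from every monthly return
ihsg_return['Return'] = ihsg_return['Return']- FEE
# keep the backtest period only; .copy() avoids pandas' SettingWithCopyWarning below
ihsg_return = ihsg_return.loc[
    (ihsg_return['Date'] >='2011-02-28')
    & (ihsg_return['Date'] <= '2024-12-31')
].copy()

# cumulative return of the index over the backtest period
ihsg_return['CumReturn'] = (1 + ihsg_return['Return']).cumprod()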


ihsg_return = ihsg_return[['Date','Return','CumReturn']]
ihsg_return


#### Visualize Return
---

In [None]:
fig = plt.figure(figsize=(12,6))

sns.lineplot(x=collection_return_df['ReturnDate'],
             y=collection_return_df['PortofolioCumulativeReturn'],label='ML Porto Uniform (Top 10%)')
sns.lineplot(x=ihsg_return['Date'],
             y=ihsg_return['CumReturn'],label='IHSG Return ')
plt.title('Cumulative Return , FEE 0.5%')

Summary :  
- Our portfolio suffers sizeable declines after periods of gains (drawdowns), for example :  
    - 2011 to 2012
    - 2013 to 2014
    - 2019 to 2020 (Covid crisis)

- The biggest increase occurred from 2020 to 2022

- Our portfolio is not yet good enough, since it goes through so many drawdowns
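Drawdowns can also be measured rather than read off the chart. The sketch below uses a short, made-up cumulative return path (hypothetical numbers, not the backtest result) and computes the drop from the running peak at each point. The same three lines could be applied to a column such as `collection_return_df['PortofolioCumulativeReturn']`.

```python
import pandas as pd

# Hypothetical cumulative return path (growth of 1 unit of capital), illustration only
cumulative = pd.Series([1.00, 1.10, 1.21, 0.968, 1.089, 1.30])

# Highest value reached so far at each point in time
running_peak = cumulative.cummax()

# Drawdown: relative drop from the running peak (0 at a new high, negative otherwise)
drawdown = cumulative / running_peak - 1

# Maximum drawdown: the deepest drop over the whole path
max_drawdown = drawdown.min()
print(f'Maximum Drawdown : {max_drawdown*100:.1f} %')  # -20.0 %
```

Here the path peaks at 1.21 and then falls to 0.968, a 20% decline, before recovering to a new high.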

## Wrapping Up
---


Combining the steps above into a single function :     

- Create Buffer Period
- Train the Machine Learning Model
- Construct Portfolio
- Calculate Portfolio Return
- Report Portfolio Performance
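Before wiring the steps together, it can help to look at the dates alone. The sketch below reproduces only the date arithmetic of the backtest loop (buffer window, portfolio construction date, return date), without any model or data. The function name `iter_backtest_windows` is hypothetical and introduced just for this illustration.

```python
import pandas as pd

def iter_backtest_windows(start_period, end_period, buffer_n_year):
    """Yield (buffer_start, buffer_end, construction_date, return_date) for every month."""
    buffer_start = pd.to_datetime(start_period) + pd.offsets.MonthEnd(n=1)
    buffer_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year)
    construction_date = buffer_end + pd.offsets.MonthEnd(n=1)
    end_time = pd.to_datetime(end_period)

    while construction_date < end_time:
        # the return is realised one month after the portfolio is constructed
        return_date = construction_date + pd.offsets.MonthEnd(n=1)
        yield buffer_start, buffer_end, construction_date, return_date
        # roll every date forward by one month
        buffer_start = buffer_start + pd.offsets.MonthEnd(n=1)
        buffer_end = buffer_end + pd.offsets.MonthEnd(n=1)
        construction_date = construction_date + pd.offsets.MonthEnd(n=1)

windows = list(iter_backtest_windows('2009-01-01', '2024-12-31', buffer_n_year=2))
for window in windows[:2]:
    print([str(d.date()) for d in window])
# ['2009-01-31', '2010-12-31', '2011-01-31', '2011-02-28']
# ['2009-02-28', '2011-01-31', '2011-02-28', '2011-03-31']
```

With a start period of 2009-01-01 and a two-year buffer, the first realised return falls on 2011-02-28, which is why the index comparison also starts on that date.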

In [None]:
def run_backtest(stocks_data,
                 start_period, # start period for backtest
                 end_period, # end period for backtest
                 percentile, # fraction of top stocks to pick (0.3 = top 30%)
                 buffer_n_year, # number of years of buffer data used to train the machine learning model
                 feature_cols, # feature columns to select
                 FEE) :        # fee assumption

    # keep only the rows inside the backtest window
    # (the original filter was never assigned, so it had no effect)
    stocks_data = stocks_data.loc[
        (stocks_data['Date'] >= start_period) &
        (stocks_data['Date'] <= end_period)
    ].copy()
    # create buffer period
    initial_buffer_start = pd.to_datetime(start_period) + pd.offsets.MonthEnd(n=1) # end of month from start period
    initial_buffer_end = pd.to_datetime(start_period) + pd.offsets.YearEnd(n=buffer_n_year) # year end, n years from start period
    porto_construction_date = initial_buffer_end + pd.offsets.MonthEnd(n=1) # month end right after the buffer
    end_time = pd.to_datetime(end_period)

    collection_return = []
    try :
        while porto_construction_date < end_time :

            print('Constructing Portofolio On : ',porto_construction_date)
            # get training data for model
            X_buffer,y_buffer = get_training_data(data = stocks_data,
                                  initial_buffer_start = initial_buffer_start,
                                  initial_buffer_end = initial_buffer_end ,
                                  feature_cols = feature_cols)
            print('-----Training ML Model')
            model = train_ml_model(X_train = X_buffer,
                                   y_train = y_buffer)

            # construct porto
            return_portofolio = construct_portofolio(data=stocks_data,
                                    porto_date = porto_construction_date,
                                    feature_cols = feature_cols,
                                    model = model ,
                                    percentile = percentile)
            print('-----Calculating Return')
            return_portofolio = calculate_portofolio_return(return_porto=return_portofolio,
                                                            FEE=FEE)
            print('-----Return : ',return_portofolio)
            porto_return_df = pd.DataFrame(index=[0])
            porto_return_df['ReturnDate'] = porto_construction_date + pd.offsets.MonthEnd(n=1)
            porto_return_df['PortofolioReturn'] = return_portofolio
            collection_return.append(porto_return_df)

            # shift buffer period
            initial_buffer_start = initial_buffer_start + pd.offsets.MonthEnd(n=1)
            initial_buffer_end = initial_buffer_end + pd.offsets.MonthEnd(n=1)
            porto_construction_date = porto_construction_date + pd.offsets.MonthEnd(n=1)

    except Exception as e:
        # keep the periods computed so far and report on them
        print(f"An error occurred: {e}")

    print('Preparing Performance Report')
    if not collection_return:
        print("No portfolio returns to calculate.")
        # same 3-item shape as the normal return, so callers can always index the result
        return pd.DataFrame(), None, None

    collection_return_df = pd.concat(collection_return, axis=0).reset_index(drop=True)
    # add cumulative return, then compute mean return and volatility
    return report_performance(return_data = collection_return_df)



#### Backtest Example : 1
----

Setups :    
- Invest in the top 30% of stocks
- buffer_n_year = 2
- FEE = 0.5%

In [None]:
backtest_config = {
    'stocks_data' : stocks_data ,
    'start_period' : '2009-01-01',
    'end_period' : '2024-12-31',
    'percentile' : 0.3,  # top 30%
    'buffer_n_year' : 2 ,
    'feature_cols' : ['FactorMomentum','FactorReversal','FactorLagReturn-6M'] ,
    'FEE' : 0.005
}
top30= run_backtest(**backtest_config)

#### Backtest Example : 2
----

Setups :    
- Invest in the top 50% of stocks
- buffer_n_year = 2
- FEE = 0.5%

In [None]:
backtest_config2= {
    'stocks_data' : stocks_data ,
    'start_period' : '2009-01-01',
    'end_period' : '2024-12-31',
    'percentile' : 0.5,  # top 50%
    'buffer_n_year' : 2 ,
    'feature_cols' : ['FactorMomentum','FactorReversal','FactorLagReturn-6M'] ,
    'FEE' : 0.005
}
top50= run_backtest(**backtest_config2)

In [None]:
top50[0]

In [None]:
fig = plt.figure(figsize=(12,6))

sns.lineplot(x=collection_return_df['ReturnDate'],
             y=collection_return_df['PortofolioReturn'],label='Top 10% ML Porto Uniform')
sns.lineplot(x=top30[0]['ReturnDate'],
             y=top30[0]['PortofolioReturn'],label='Top 30% ML Porto Uniform')
sns.lineplot(x=top50[0]['ReturnDate'],
             y=top50[0]['PortofolioReturn'],label='Top 50% ML Porto Uniform')
sns.lineplot(x=ihsg_return['Date'],
              y=ihsg_return['Return'],label='IHSG Return ')
plt.title('Net Return , FEE 0.5%')

Summary :
- In terms of net return, we can see large month-to-month fluctuations
- Since every strategy uses the same features, they all follow the same pattern, only with different depth
- As the number of stocks in the portfolio grows, the volatility gets lower
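The last point is the usual diversification effect. A minimal simulation sketch, under strong assumptions: the returns below are synthetic, independent, and normally distributed (mean 1%, volatility 10% per month), which real stocks are not. Under these assumptions an equal-weight portfolio of N stocks has a volatility of about 10% / sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_months, n_stocks = 5000, 50

# Synthetic, independent monthly returns (assumption: mean 1%, volatility 10%)
returns = rng.normal(loc=0.01, scale=0.10, size=(n_months, n_stocks))

vol_by_size = {}
for n in (5, 15, 50):
    # equal-weight portfolio of the first n stocks
    portfolio_return = returns[:, :n].mean(axis=1)
    vol_by_size[n] = portfolio_return.std(ddof=1)
    print(f'{n:>2} stocks : monthly volatility {vol_by_size[n]*100:.2f} %')
```

Because real stocks are positively correlated, the reduction in an actual portfolio is smaller than in this idealised case, but the direction is the same.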

In [None]:
fig = plt.figure(figsize=(12,6))

sns.lineplot(x=collection_return_df['ReturnDate'],
             y=collection_return_df['PortofolioCumulativeReturn'],label='Top 10% ML Porto Uniform')
sns.lineplot(x=top30[0]['ReturnDate'],
             y=top30[0]['PortofolioCumulativeReturn'],label='Top 30% ML Porto Uniform')
sns.lineplot(x=top50[0]['ReturnDate'],
             y=top50[0]['PortofolioCumulativeReturn'],label='Top 50% ML Porto Uniform')
sns.lineplot(x=ihsg_return['Date'],
             y=ihsg_return['CumReturn'],label='IHSG Return ')
plt.title('Cumulative Return , FEE 0.5%')

Summary :  
- The strategies above show much the same pattern, with a lot of drawdowns
- In this backtest, our strategy beat the market (IHSG)
- For the top 30% and top 50% strategies, the initial investment made in 2011 has grown by a factor of 3
- However, we need to be cautious : even though the strategy is profitable in the long run (>10 years), over shorter cycles the investment value may decline many times
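To put the factor of 3 in perspective, it can be converted into an average compound growth rate. This is a back-of-the-envelope sketch: the 14-year horizon (roughly 2011 through 2024) is an assumption, and the factor is the rounded value quoted above rather than the exact backtest output.

```python
growth_factor = 3.0   # rounded value from the summary above
years = 14            # assumption: roughly 2011 through 2024

# compound annual growth rate implied by the growth factor
annual_rate = growth_factor ** (1 / years) - 1
# equivalent compound monthly rate
monthly_rate = growth_factor ** (1 / (years * 12)) - 1

print(f'Implied annual return  : {annual_rate*100:.1f} %')
print(f'Implied monthly return : {monthly_rate*100:.2f} %')
```

Under these assumptions the implied return is about 8% per year, which is a modest average compared with the size of the drawdowns along the way.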


**Disclaimer** :
- This is not investment advice; it is purely for educational / research purposes
- Backtest results may not reflect real market conditions

## Self Experiment
---

### $1^{st}$ Experiment

- Perform a backtest with the following setup

1. Pick the top 50% of stocks
2. Set FEE = 0.5%
3. buffer_n_year = 3 (years)
4. use feature_cols = ['FactorMomentum','FactorReversal']