## Applying Machine Learning to Trading Strategies: Using Logistic Regression to Build Momentum-based Trading Strategies - **Patrick Beaudan and Shuoyuan He**

Objectives:

1. Address the drawbacks of the classical approach to building investment strategies.
2. Use a machine learning model, logistic regression, to build a time-series dual momentum trading strategy on the S&P 500 Index.
3. Show how the proposed model outperforms both buy-and-hold and several base-case dual momentum strategies, significantly increasing returns and reducing risk.
4. Apply the algorithm to other U.S. and international large-capitalization equity indices.
5. Analyze the resulting improvements in risk-adjusted performance.

### 1. Fetching data

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline 
plt.style.use('seaborn-v0_8-dark-palette')
import yfinance as yf 
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, log_loss

import warnings
warnings.filterwarnings('ignore') 

#### Tickers 
1. S&P 500 Index: **^GSPC**
2. S&P Small Cap 600 Index (SML): **^SML**  ==> Data not available 
3. S&P Mid Cap 400 Index (MID): **^MID**
4. FTSE 100 Index (UKX): **^FTSE**
5. FTSEurofirst 300 Index (E300): **^FTEU3**  ==> Data not available
6. Tokyo Stock Exchange Price Index (TPX): **^TPX**  ==> Data not available
7. Dow Jones Industrial Average Index (INDU): **^DJI**
8. Dow Jones Transportation Average Index (TRAN): **^DJT**

In [2]:
end = '2018-12-12'

# df_sml = yf.download('^SML',start='1993-12-31',end=end)
df_mid = yf.download('^MID',start='1990-12-31',end=end) 
df_ukx = yf.download('^FTSE',start='1997-12-19',end=end)
# df_e300 = yf.download('^FTEU3',start='1985-12-31',end=end)
# df_tpx = yf.download('^TPX',start='1997-12-19',end=end)
df_dji = yf.download('^DJI',start='1920-01-02',end=end)
df_djt = yf.download('^DJT',start='1920-01-02',end=end) 

data = yf.download('^GSPC',start='1927-12-30',end=end) 
print() 
df_21 = data.copy() 
print('Shape of data : ',data.shape) 
data.tail(3) 

[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed


Shape of data :  (22844, 6)





| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2018-12-07 | 2691.26001 | 2708.540039 | 2623.139893 | 2633.080078 | 2633.080078 | 4242240000 |
| 2018-12-10 | 2630.860107 | 2647.51001 | 2583.22998 | 2637.719971 | 2637.719971 | 4162880000 |
| 2018-12-11 | 2664.439941 | 2674.350098 | 2621.300049 | 2636.780029 | 2636.780029 | 3963440000 |


### 2. Defining class to include base-features Momentum and Drawdown

* Momentum features are calculated over windows of 30, 60, 90, 120, 180, 270, 300 and 360 days.
* Drawdown features are calculated over windows of 15, 60, 90 and 120 days.

The features are meant to be calculated while skipping the most recent month. We follow the convention of 252 business days per calendar year and 21 business days per calendar month.

These features are chosen because the change in the shape of the price history, observed through momenta and drawdowns over multiple historical windows, is considered more pertinent for predicting short-term profitability than other metrics. We therefore use momenta and drawdowns over different time frames as features.

In [3]:
class IncludeFeatures:
    def __init__(self,data):
        self.data = data 

    def calculate_momentum(self,window): # absolute change in the adjusted closing price over `window` trading days
        self.data[f'momntm_{window}'] =  self.data['Adj Close'] - self.data['Adj Close'].shift(window) 

    def calculate_drawdown(self,window): # drawdown of the price relative to its running peak
        # cumulative maximum of the price over the whole history up to that day
        self.data['Cumulative_Peak'] = self.data['Adj Close'].cummax()
        # NOTE: `window` only names the column; the peak is not restricted to the last `window` days,
        # so every drwdwn_* column holds the same values
        self.data[f'drwdwn_{window}'] = (self.data['Adj Close']-self.data['Cumulative_Peak'])/self.data['Cumulative_Peak']

    def include_features(self):
        
        momentum_windows = [30, 60, 90, 120, 180, 270, 300, 360]
        drawdwn_windows = [15, 60, 90, 120]    

        for days in momentum_windows:
            self.calculate_momentum(days) 

        for days in drawdwn_windows:
            self.calculate_drawdown(days) 
        
        self.data.drop(columns=['Cumulative_Peak','Open','High','Low','Close','Volume'],axis=1,inplace=True)
        return self.data     

In [4]:
include_feat = IncludeFeatures(data) 
data_feat = include_feat.include_features()
data_feat.dropna(inplace=True)
print(data_feat.shape) 
data_feat.head(3) 

(22484, 13)


| Date | Adj Close | momntm_30 | momntm_60 | momntm_90 | momntm_120 | momntm_180 | momntm_270 | momntm_300 | momntm_360 | drwdwn_15 | drwdwn_60 | drwdwn_90 | drwdwn_120 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1929-06-10 | 25.27 | -0.309999 | -0.459999 | -0.09 | 2.74 | 4.09 | 5.060001 | 6.380001 | 7.610001 | -0.041714 | -0.041714 | -0.041714 | -0.041714 |
| 1929-06-11 | 25.43 | -0.1 | -0.65 | -0.02 | 2.99 | 4.25 | 5.07 | 6.48 | 7.67 | -0.035647 | -0.035647 | -0.035647 | -0.035647 |
| 1929-06-12 | 25.450001 | -0.49 | -0.59 | -0.289999 | 2.75 | 4.230001 | 5.01 | 6.17 | 7.730001 | -0.034888 | -0.034888 | -0.034888 | -0.034888 |


In [5]:
print(f'Null values : {data_feat.isna().sum().sum()}') 

Null values : 0
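As a point of comparison, here is a minimal sketch of windowed features that do skip the most recent month and that restrict the drawdown peak to the given window. It is one possible reading of the feature description in this section, not the implementation used in this notebook; the helper names `momentum` and `rolling_drawdown`, the `SKIP` constant, and the synthetic price path are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

SKIP = 21  # assumed: 21 trading days stand in for the skipped most-recent month


def momentum(prices: pd.Series, window: int, skip: int = SKIP) -> pd.Series:
    """Percentage change over `window` days, ending `skip` days before today."""
    lagged = prices.shift(skip)
    return lagged / lagged.shift(window) - 1.0


def rolling_drawdown(prices: pd.Series, window: int, skip: int = SKIP) -> pd.Series:
    """Drop from the rolling `window`-day peak, measured `skip` days before today."""
    lagged = prices.shift(skip)
    peak = lagged.rolling(window).max()
    return lagged / peak - 1.0


# Synthetic price path, used only so the sketch runs without downloading data.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 600))))

features = pd.DataFrame({
    **{f"momntm_{w}": momentum(prices, w) for w in (30, 60, 90)},
    **{f"drwdwn_{w}": rolling_drawdown(prices, w) for w in (15, 60)},
}).dropna()
```

With a rolling peak, the drawdown columns for different windows generally differ from one another, unlike a drawdown measured against the all-time peak.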


### 3. Analyzing Key Performance Indicators over sample indices over the entire period

KPIs analysed here are Annual Return, Sharpe Ratio, Volatility, Maximum Drawdown, Average Daily Drawdown

In [9]:
class KPIs:
    def __init__(self,data):
        self.datac = data  

    def annual_return(self,datac):
        cumulative_returns = (1+datac['Daily_Return']).prod()-1 
        n_days = datac.shape[0]     # Number of trading days 
        annualized_return = (1+cumulative_returns)**(252/n_days)-1
        return annualized_return 
    
    def sharpe_ratio(self,datac):
        average_return = datac['Daily_Return'].mean() 
        risk_free_rate = 0.01/252  # constant 1% annual risk-free rate
        std_dev = datac['Daily_Return'].std() 
        # print(f'Average Return : {average_return:.4f}') 
        # print(f'Standard Deviation : {std_dev:.4f}') 
        # print() 
        sharpe_ratio = (average_return-risk_free_rate)/std_dev   # daily (not annualized) Sharpe ratio
        return sharpe_ratio 

    def volatility(self,datac):
        daily_volatility = datac['Daily_Return'].std()
        trading_days_per_year = 252 
        annual_volatility = daily_volatility*np.sqrt(trading_days_per_year)   # Annualizing Volatility
        return annual_volatility 
    
    def max_drawdown(self,datac):
        datac['Running_max'] = datac['Adj Close'].cummax() 
        datac['Drawdowns'] = (datac['Adj Close']-datac['Running_max'])/datac['Running_max']

        max_drawdown = datac['Drawdowns'].min() 
        avg_drawdown = datac['Drawdowns'].mean() 

        return max_drawdown, avg_drawdown 

    def calculate_kpi(self):        
        self.datac['Log_Return'] =  np.log(self.datac['Adj Close']/self.datac['Adj Close'].shift(1))
        self.datac['Daily_Return'] = self.datac['Adj Close'].pct_change() 
        self.datac.dropna(inplace=True) 

        annualized_return = self.annual_return(self.datac)
        sharpe_ratio = self.sharpe_ratio(self.datac)
        annual_volatility = self.volatility(self.datac)
        max_drawdown, avg_drawdown = self.max_drawdown(self.datac)

        print(f'Annual Return : {annualized_return*100:.1f}%')
        print(f'Sharpe Ratio : {sharpe_ratio:.4f}')
        print(f'Volatility : {annual_volatility*100:.0f}%')
        print(f'Maximum Drawdown : {max_drawdown*100:.0f}%')
        print(f'Average Drawdown : {avg_drawdown*100:.0f}%') 
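
The Sharpe ratio computed by this class is based on daily returns, which is why the values it reports are small. A common convention, assumed here rather than taken from the notebook, is to annualize it by multiplying by the square root of 252. The function name `annualized_sharpe` and the synthetic returns are illustrative only.

```python
import numpy as np
import pandas as pd


def annualized_sharpe(daily_returns: pd.Series,
                      annual_rf: float = 0.01,
                      periods: int = 252) -> float:
    """Annualized Sharpe ratio from daily returns, assuming a constant risk-free rate."""
    excess = daily_returns - annual_rf / periods   # daily excess return
    daily_sharpe = excess.mean() / daily_returns.std()
    return float(daily_sharpe * np.sqrt(periods))  # scale daily ratio to one year


# Synthetic daily returns, used only for illustration.
rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.0004, 0.01, 2520))
sharpe = annualized_sharpe(returns)
```

Under this convention, a daily ratio of 0.0200 corresponds to roughly 0.32 on an annual basis.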

#### 3.1 Performance Metrics of S&P 500 Index (SPX) - **^GSPC**

In [10]:
calc_kpi = KPIs(data)   
calc_kpi.calculate_kpi() 

Annual Return : 5.3%
Sharpe Ratio : 0.0200
Volatility : 19%
Maximum Drawdown : -86%
Average Drawdown : -22%


#### 3.2 Performance Metrics of S&P Mid Cap 400 Index (MID) - **^MID**

In [11]:
calc_kpi = KPIs(df_mid) 
calc_kpi.calculate_kpi() 

Annual Return : 10.8%
Sharpe Ratio : 0.0368
Volatility : 19%
Maximum Drawdown : -56%
Average Drawdown : -7%


#### 3.3 Performance Metrics of FTSE 100 Index (UKX) - **^FTSE**

In [12]:
calc_kpi = KPIs(df_ukx) 
calc_kpi.calculate_kpi() 

Annual Return : 1.5%
Sharpe Ratio : 0.0074
Volatility : 19%
Maximum Drawdown : -53%
Average Drawdown : -16%


#### 3.4 Performance Metrics of Dow Jones Industrial Average (INDU) - **^DJI**

In [13]:
calc_kpi = KPIs(df_dji) 
calc_kpi.calculate_kpi() 

Annual Return : 7.9%
Sharpe Ratio : 0.0299
Volatility : 17%
Maximum Drawdown : -54%
Average Drawdown : -9%


#### 3.5 Performance Metrics of Dow Jones Transportation Average (TRAN) - **^DJT**

In [14]:
calc_kpi = KPIs(df_djt)  
calc_kpi.calculate_kpi() 

Annual Return : 7.7%
Sharpe Ratio : 0.0249
Volatility : 23%
Maximum Drawdown : -61%
Average Drawdown : -13%


# I. Classical Time Series Dual-Momentum Trading Strategy

#### Strategy

1. Momentum, i.e. the percentage price change of a security, is calculated over a historical time horizon of twelve months, skipping the most recent month.
2. If momentum > threshold (here 5% = 0.05), invest.
3. If momentum < threshold, the portfolio is moved to cash in the long-only strategy, or to a short position in the long-short strategy.
4. This investment decision is revisited at regular intervals of one month.
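
The four rules above can be sketched as a single function. This is an illustrative reading, not the notebook's code: it assumes that "skipping the most recent month" means lagging prices by 21 trading days before measuring a 252-day change, and that the monthly review happens every 21 rows. The name `classical_signal` and the constants are hypothetical.

```python
import numpy as np
import pandas as pd

DAYS_PER_MONTH = 21             # assumed trading days per month
LOOKBACK = 12 * DAYS_PER_MONTH  # twelve-month horizon
THRESHOLD = 0.05                # 5% momentum threshold


def classical_signal(prices: pd.Series, long_short: bool = False) -> pd.Series:
    """1 = invested; 0 = cash (long-only) or -1 = short (long-short)."""
    lagged = prices.shift(DAYS_PER_MONTH)            # skip the most recent month
    mom = lagged / lagged.shift(LOOKBACK) - 1.0      # twelve-month percentage change
    out_of_market = -1 if long_short else 0
    raw = pd.Series(np.where(mom > THRESHOLD, 1, out_of_market),
                    index=prices.index, dtype=float)
    raw[mom.isna()] = np.nan                         # no decision without enough history
    # Revisit the decision once a month and hold it in between.
    decision_days = np.arange(len(prices)) % DAYS_PER_MONTH == 0
    return raw.where(decision_days).ffill()


# Synthetic steadily rising and falling price paths for illustration.
rising = pd.Series(100 * 1.001 ** np.arange(400))
falling = pd.Series(100 * 0.999 ** np.arange(400))
sig_up = classical_signal(rising)
sig_down = classical_signal(falling, long_short=True)
```

On the rising path the signal stays invested once enough history is available, while on the falling path the long-short variant stays short.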

In [15]:
print(f'Shape before slicing : {df_21.shape}')
n = len(df_21)
# Slice the DataFrame to exclude the last 21 rows for skipping most recent month 
df_21 = df_21.iloc[:n-21]
print(f'Shape after slicing : {df_21.shape}') 
df_21.head(3)  

Shape before slicing : (22844, 6)
Shape after slicing : (22823, 6)


| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1927-12-30 | 17.66 | 17.66 | 17.66 | 17.66 | 17.66 | 0 |
| 1928-01-03 | 17.76 | 17.76 | 17.76 | 17.76 | 17.76 | 0 |
| 1928-01-04 | 17.719999 | 17.719999 | 17.719999 | 17.719999 | 17.719999 | 0 |


Calculating momentum as the percentage change in price over the twelve-month horizon:

In [16]:
trading_days_per_month = 21
no_of_months = 12  
time_horizon = trading_days_per_month*no_of_months 

df_21['Momentum'] = df_21['Adj Close'].pct_change(periods=time_horizon)*100
df_21.dropna(inplace=True)
df_21.drop(columns=['Open','High','Low','Close','Volume'],axis=1,inplace=True)

df_21.head(3) 

| Date | Adj Close | Momentum |
|---|---|---|
| 1929-01-03 | 24.860001 | 40.770107 |
| 1929-01-04 | 24.85 | 39.921172 |
| 1929-01-07 | 24.25 | 36.851021 |


In [17]:
threshold = 5 
df_21['Signals'] = (df_21['Momentum']>=threshold).astype(int) 
print('No of invest signals : ',df_21['Signals'].value_counts()) 
print() 
df_21.head(3) 

No of invest signals :  Signals
1    13404
0     9167
Name: count, dtype: int64



| Date | Adj Close | Momentum | Signals |
|---|---|---|---|
| 1929-01-03 | 24.860001 | 40.770107 | 1 |
| 1929-01-04 | 24.85 | 39.921172 | 1 |
| 1929-01-07 | 24.25 | 36.851021 | 1 |
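
A signal column like this can be turned into a long-only return series by holding the index on days when the previous day's signal was 1 and holding cash otherwise. The sketch below is an assumption about how that could be done, not part of the notebook: the one-day lag (to avoid trading on information from the same close), the zero return on cash, and the name `strategy_returns` are all illustrative choices.

```python
import pandas as pd


def strategy_returns(prices: pd.Series, signals: pd.Series) -> pd.Series:
    """Daily returns of a long-only strategy that acts on yesterday's signal."""
    daily = prices.pct_change()       # buy-and-hold daily returns
    position = signals.shift(1)       # trade on the signal known at the prior close
    return (position * daily).dropna()


# Tiny hand-made example: up 10%, down 10%, up 10%.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])
signals = pd.Series([1, 0, 1, 1])
rets = strategy_returns(prices, signals)
growth = (1 + rets).prod()            # growth of 1 unit invested in the strategy
```

In this toy example the strategy sits out the losing day, so one unit grows to about 1.21, compared with 1.089 for buy-and-hold.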


## Machine Learning Approach

### 4. Defining Function to create polynomial features

In [18]:
def degree(data,degree): 

    feature_names = data.columns 
    # feature_names = ['Adj Close', 'momntm_30', 'momntm_60', 'momntm_90', 'momntm_120',
    #                  'momntm_180', 'momntm_270', 'momntm_300', 'momntm_360', 'drwdwn_15',
    #                  'drwdwn_60', 'drwdwn_90', 'drwdwn_120'] 
    
    if data.shape[1] != len(feature_names):
        raise ValueError("The number of features in the data does not match the length of feature names.")

    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly_feat = poly.fit_transform(data) 
    
    feature_names_poly = poly.get_feature_names_out(input_features=feature_names)
    
    df_poly = pd.DataFrame(poly_feat, columns=feature_names_poly, index=data.index) 
    print(f'Shape of df_poly of degree 1 : ',data.shape) 
    print(f'Shape of df_poly of degree {degree} : ',df_poly.shape) 
    print('Number of duplicate columns : ',len(df_poly.columns)-len(set(df_poly.columns))) 
    return df_poly 

In [19]:
x_quad = degree(data_feat,2)  

Shape of df_poly of degree 1 :  (22482, 17)
Shape of df_poly of degree 2 :  (22482, 170)
Number of duplicate columns :  0


In [20]:
x_cubic = degree(data_feat,3) 

Shape of df_poly of degree 1 :  (22482, 17)
Shape of df_poly of degree 3 :  (22482, 1139)
Number of duplicate columns :  0
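
The column counts reported above follow from a standard combinatorial identity: with `n` input features and `include_bias=False`, the number of polynomial terms up to degree `d` is C(n + d, d) - 1. The short check below is an illustrative sketch; `n_poly_features` is a hypothetical helper, and the zero matrix is only a placeholder input.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_inputs: int, degree: int) -> int:
    """Number of monomials of degree 1..degree in n_inputs variables (no bias term)."""
    return comb(n_inputs + degree, degree) - 1


quad_count = n_poly_features(17, 2)    # 170
cubic_count = n_poly_features(17, 3)   # 1139

# Cross-check against scikit-learn on a placeholder matrix with 17 columns.
poly = PolynomialFeatures(degree=3, include_bias=False).fit(np.zeros((1, 17)))
sklearn_count = poly.n_output_features_
```

The cubic expansion is almost seven times wider than the quadratic one, which is worth keeping in mind when comparing the models' cost and risk of overfitting.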


### 5. Creating Datasets for training with Target Variable

#### 5.1 Linear dataset

In [21]:
print('Shape of linear dataset before concatenation : ',data_feat.shape)
x_linear = pd.concat([data_feat, df_21[['Signals']]], axis=1)
x_linear.dropna(inplace=True) 
print('Shape of linear dataset after concatenation : ',x_linear.shape)

Shape of linear dataset before concatenation :  (22482, 17)
Shape of linear dataset after concatenation :  (22461, 18)


#### 5.2 Quadratic dataset

In [22]:
print('Shape of quadratic dataset before concatenation : ',x_quad.shape)
x_quad = pd.concat([x_quad, df_21[['Signals']]], axis=1)
x_quad.dropna(inplace=True) 
print('Shape of quadratic dataset after concatenation : ',x_quad.shape) 

Shape of quadratic dataset before concatenation :  (22482, 170)
Shape of quadratic dataset after concatenation :  (22461, 171)


#### 5.3 Cubic dataset

In [23]:
print('Shape of cubic dataset before concatenation : ',x_cubic.shape)
x_cubic = pd.concat([x_cubic, df_21[['Signals']]], axis=1)
x_cubic.dropna(inplace=True) 
print('Shape of cubic dataset after concatenation : ',x_cubic.shape) 

Shape of cubic dataset before concatenation :  (22482, 1139)
Shape of cubic dataset after concatenation :  (22461, 1140)


### 7. Class for Training and Evaluating the Model

Model metrics calculated are cost function, accuracy, confusion matrix and classification report. 

To calculate the cost function, also known as the loss function, for logistic regression, we need to use the logistic loss function, which is commonly referred to as cross-entropy loss or log loss.
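
For reference, the log loss for binary labels is the negative mean of `y*log(p) + (1-y)*log(1-p)`, where `p` is the predicted probability of class 1. The sketch below computes it by hand on made-up labels and probabilities and compares the result with scikit-learn's `log_loss`; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.3])

# Cross-entropy: confident wrong predictions are penalized most heavily.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
library = log_loss(y_true, p)
```

Because the penalty grows without bound as a wrong prediction becomes more confident, a model can have high accuracy and still a large log loss.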

In [24]:
class logistic_regression:
    def __init__(self):
        self.train_size = 0.4
        self.random_state = 42

    def scaling_x(self,X):
        scaler = StandardScaler()
        scaled_X = scaler.fit_transform(X)
        return scaled_X
    
    def cost_func(self,model,x_test,y_test): 
        probabilities = model.predict_proba(x_test)[:,1] # Getting probabilities for class 1 (positive class)
        cost = log_loss(y_test,probabilities) 
        return cost 

    def model_metrics(self,model,x_test,y_test):
        y_pred = model.predict(x_test) 
        cost_fn = self.cost_func(model,x_test,y_test)
        accuracy = accuracy_score(y_test,y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        class_report = classification_report(y_test, y_pred)

        print(f'Cost function : {cost_fn}') 
        print(f'Accuracy : {accuracy}')
        print('Confusion Matrix : ')
        print(conf_matrix) 
        print('Classification Report : ')
        print(class_report) 
        return y_pred 
    
    def training_model(self,df):

        X = df.drop(columns=['Signals'],axis=1)
        Y = df['Signals'] 

        scaled_X = self.scaling_x(X)
        # Split the data into initial training set (40%) and test set (60%)
        x_train, x_test, y_train, y_test = train_test_split(scaled_X,Y,train_size=0.4, shuffle=False, 
                                                            random_state=42)
        model = LogisticRegression(C=1.0)   # C is the regularization parameter
        
        model.fit(x_train,y_train) 
            
        return self.model_metrics(model,x_test,y_test)   # return the predictions so the caller receives them

logistic = logistic_regression()  
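
The class above standardizes the full feature matrix before splitting it, so the scaler also sees the test period. A common alternative, sketched here on synthetic data as an assumption rather than as the notebook's method, is to put the scaler and the classifier in a scikit-learn pipeline so that the scaling statistics come from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, used only so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Chronological 40/60 split, as in the notebook (no shuffling).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.4, shuffle=False
)

# The scaler is fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```

The same fitted pipeline then applies the training-period mean and standard deviation to any later data passed to `predict` or `score`.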

### 8. Evaluation of Linear, Quadratic and Cubic Combination of features

#### 8.1 Evaluation on Linear Combination of features

In [25]:
y_pred_linear = logistic.training_model(x_linear) 

Cost function : 0.9787919508670793
Accuracy : 0.9158566446538547
Confusion Matrix : 
[[4067  854]
 [ 280 8276]]
Classification Report : 
              precision    recall  f1-score   support

         0.0       0.94      0.83      0.88      4921
         1.0       0.91      0.97      0.94      8556

    accuracy                           0.92     13477
   macro avg       0.92      0.90      0.91     13477
weighted avg       0.92      0.92      0.91     13477



#### 8.2 Evaluation on Quadratic Combination of features

In [26]:
y_pred_quad = logistic.training_model(x_quad)  

Cost function : 1.0260370337067828
Accuracy : 0.9174148549380426
Confusion Matrix : 
[[4088  833]
 [ 280 8276]]
Classification Report : 
              precision    recall  f1-score   support

         0.0       0.94      0.83      0.88      4921
         1.0       0.91      0.97      0.94      8556

    accuracy                           0.92     13477
   macro avg       0.92      0.90      0.91     13477
weighted avg       0.92      0.92      0.92     13477



#### 8.3 Evaluation on Cubic Combination of features

In [27]:
y_pred_cubic = logistic.training_model(x_cubic)  

Cost function : 1.0535235752512844
Accuracy : 0.9129628255546487
Confusion Matrix : 
[[3985  936]
 [ 237 8319]]
Classification Report : 
              precision    recall  f1-score   support

         0.0       0.94      0.81      0.87      4921
         1.0       0.90      0.97      0.93      8556

    accuracy                           0.91     13477
   macro avg       0.92      0.89      0.90     13477
weighted avg       0.92      0.91      0.91     13477



### 9. Sliding Window over Cubic Polynomials

1. Train on an initial set of data (40% of the data).
2. Training calibrates the model parameters before the model is applied to the testing sets.
3. Convergence is monitored through the cost function.
4. Convergence is achieved when the cost function falls below a threshold (threshold = 0.01).
5. After convergence, the training set is slid forward by a window of 5 or 10 years to include more recent data, and the model is retrained. This lets the model train on current market conditions.
6. Retraining is done every 50 days.
7. Continue until the end of the data.

This approach keeps the model updated with recent data so that it can adapt to changing market conditions.
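
The sliding of the training and testing windows can be sketched independently of the model. The generator below is a simplified illustration, not the class used in this notebook: it assumes a fixed slide of 8 years of 252 trading days (the value of `retrain_freq` in the code), ignores the convergence check, and works on integer row positions with half-open ranges. The name `sliding_windows` is hypothetical.

```python
from typing import Iterator, Tuple


def sliding_windows(n_rows: int,
                    train_frac: float = 0.4,
                    step: int = 252 * 8,
                    test_len: int = 252 * 8) -> Iterator[Tuple[int, int, int]]:
    """Yield (train_start, train_end, test_end) row positions.

    Training rows are [train_start, train_end); test rows are [train_end, test_end).
    """
    train_start = 0
    train_end = int(train_frac * n_rows)   # initial training set: first 40% of rows
    while train_end < n_rows:
        test_end = min(train_end + test_len, n_rows)
        yield train_start, train_end, test_end
        train_start += step                # slide both window edges forward
        train_end += step


# Number of rows matches the linear dataset built earlier (22461).
windows = list(sliding_windows(22461))
```

Each tuple can be used with `df.iloc[train_start:train_end]` and `df.iloc[train_end:test_end]` to select the rows for one retraining round, so the test rows always come after the training rows.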

In [37]:
class Strategy:
    def __init__(self,data):
        self.data = data 
        self.conv_interval = 50  # days
        self.retrain_freq = 252*8   # 8 years = 8*252 days
        self.tolerance = 0.0001

    def scaling_x(self, X):
        scaler = StandardScaler()
        scaled_x = scaler.fit_transform(X)
        return scaled_x 

    def cost_funcn(self, model, X, Y):
        y_prob = model.predict_proba(X)[:,1]     # Getting probabilities for class 1 (positive class)
        cost = log_loss(Y, y_prob) 
        return cost     
    
    def data_concat(self, y_pred, start, end): 
        datac = self.data[start:end] 
        y_pred = pd.Series(y_pred,index=datac.index) 

        bnch_df = datac[datac['Signals'] == 1].copy() 
        print('='*20,'Metric for bnch_df','='*20) 
        calc_kpi = KPIs(bnch_df)  
        calc_kpi.calculate_kpi() 

        log_indices = y_pred[y_pred==1].index 
        log_df = self.data.loc[log_indices].copy()  # Filter original DataFrame based on indices
        log_df['y_pred'] = y_pred[log_indices]     # Add y_pred column 
        log_df = log_df[['y_pred', 'Signals', 'Adj Close']] 
        # print(log_df.head(3))    
        print('='*20,'Metric for log_df','='*20) 
        calc_kpi = KPIs(log_df)  
        calc_kpi.calculate_kpi()
        
    def model_metrics(self, model, X, Y, start, end):
        y_pred = model.predict(X) 
        cost_fn = self.cost_funcn(model, X, Y)
        accuracy = accuracy_score(Y, y_pred)
        conf_matrix = confusion_matrix(Y, y_pred)
        class_report = classification_report(Y, y_pred)  
        
        print(f'Cost Function : {cost_fn}')
        print(f'Accuracy Score : {accuracy}')
        print('Confusion Matrix :') 
        print(conf_matrix) 
        print('Classification Report : ')
        print(class_report) 
        
        self.data_concat(y_pred, start, end)
    
    def date_correction(self,indx,df,num):
        idx1 = df.index.get_loc(indx)
        idx2 = idx1 + num 
        if idx2<len(df)-1:
            return idx2 
        else:
            return len(df)-1   
    
    def training_logistic(self, df):

        # Initialize Parameters        
        train_start = df.index[0]
        train_end = df.index[int(0.4*len(df))] 
        test_start = train_end 
        idx = df.index.get_loc(test_start)+252*8 
        test_end = df.index[idx] 

        model = LogisticRegression() 

        # Initial Training set
        x_train = df.loc[train_start:train_end].drop('Signals', axis=1) 
        xs_train = self.scaling_x(x_train) 
        y_train = df.loc[train_start:train_end, 'Signals'] 

        x_test = df.loc[test_start:test_end].drop('Signals', axis=1) 
        xs_test = self.scaling_x(x_test) 
        y_test = df.loc[test_start:test_end, 'Signals'] 

        model.fit(xs_train, y_train)

        print(f"Train set interval: {str(train_start).split(' 00:00:00')[0]} to {str(train_end).split(' 00:00:00')[0]}")       
        print() 
        print('='*20,'Metrics','='*20) 
        self.model_metrics(model, xs_train, y_train, train_start, train_end)

        print(f"Test set interval: {str(test_start).split(' 00:00:00')[0]} to {str(test_end).split(' 00:00:00')[0]}")       
        print()
        print('='*20,'Metrics','='*20) 
        self.model_metrics(model, xs_test, y_test, test_start, test_end)
         

        # Training Loop  
        while test_end<df.index[-1]:

            model.fit(xs_train, y_train)

            # Loop for checking convergence
            previous_cost = None 

            while test_end<df.index[-1]:
                x_test = df.loc[test_start:test_end].drop('Signals', axis=1) 
                xs_test = self.scaling_x(x_test) 
                y_test = df.loc[test_start:test_end, 'Signals'] 
                            
                current_cost = self.cost_funcn(model,xs_test,y_test) 

                if previous_cost is not None and (previous_cost-current_cost)/previous_cost < self.tolerance:
                    print() 
                    print(f"Convergence achieved at {str(test_end).split(' 00:00:00')[0]}") 
                    break   
                
                previous_cost = current_cost
                idx = self.date_correction(test_end,df,self.conv_interval)  
                test_end = df.index[idx] 

            # Slide the training window 
            idxt1 = self.date_correction(train_start, df,self.retrain_freq)
            train_start = df.index[idxt1] 

            idxt2 = self.date_correction(train_end, df,self.retrain_freq)
            train_end = df.index[idxt2]

            test_start = train_end 

            idxs = self.date_correction(test_end, df,self.retrain_freq)
            test_end = df.index[idxs]  

            # Updating training data
            
            if train_end<=df.index[-1]:
                x_train = df.loc[train_start:train_end].drop('Signals', axis=1) 
                xs_train = self.scaling_x(x_train) 
                y_train = df.loc[train_start:train_end, 'Signals'] 

            print(f"Train set interval: {str(train_start).split(' 00:00:00')[0]} to {str(train_end).split(' 00:00:00')[0]}")       
            print()
            print('='*20,'Metrics','='*20) 
            model_m = model.fit(xs_train,y_train) 
            self.model_metrics(model_m, xs_train, y_train, train_start, train_end)   

            # Updating Testing data
            
            if test_end<=df.index[-1]:
                x_test = df.loc[test_start:test_end].drop('Signals', axis=1) 
                xs_test = self.scaling_x(x_test) 
                y_test = df.loc[test_start:test_end, 'Signals'] 

            print(f"Test set interval: {str(test_start).split(' 00:00:00')[0]} to {str(test_end).split(' 00:00:00')[0]}")       
            print()
            print('='*20,'Metrics','='*20) 
            self.model_metrics(model_m, xs_test, y_test, test_start, test_end)   
            print('*'*100)
strategy = Strategy(x_linear)  

In [38]:
strategy.training_logistic(x_linear) 

Train set interval: 1929-06-12 to 1965-04-29

Cost Function : 0.1125953039703724
Accuracy Score : 0.9549248747913188
Confusion Matrix :
[[4054  192]
 [ 213 4526]]
Classification Report : 
              precision    recall  f1-score   support

         0.0       0.95      0.95      0.95      4246
         1.0       0.96      0.96      0.96      4739

    accuracy                           0.95      8985
   macro avg       0.95      0.95      0.95      8985
weighted avg       0.95      0.95      0.95      8985

Annual Return : 6.9%
Sharpe Ratio : 0.0260
Volatility : 25%
Maximum Drawdown : -78%
Average Drawdown : -31%
Annual Return : 6.9%
Sharpe Ratio : 0.0256
Volatility : 24%
Maximum Drawdown : -77%
Average Drawdown : -30%
Test set interval: 1965-04-29 to 1973-06-07

Cost Function : 0.4127768147100789
Accuracy Score : 0.8820029747149232
Confusion Matrix :
[[ 704  206]
 [  32 1075]]
Classification Report : 
              precision    recall  f1-score   support

         0.0       0.96    

### 10. Calculating Key Performance Indicators of various Logistic regression models

#### 10.1 Benchmark SPX

#### 10.2 Logistic Regression Linear Model

#### 10.3 Logistic Regression Quadratic Model

#### 10.4 Logistic Regression Cubic Model

### 11. Retraining the Model

#### Steps:
The training set should be revised at regular intervals as new market data is generated.
1. The retraining period is about 5 or 10 years for a single asset.
2. For a portfolio with multiple assets, however, this approach is not feasible.