## Naive Bayesian Classifier

For each week, the feature set consists of mean return (μ) and volatility (σ).  

- Training Set: Boeing (BA) weekly stock data for years 2020–2022
- Testing Set: Boeing (BA) weekly stock data for years 2023–2024


In [232]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt

In [233]:
ba_volatility = pd.read_csv('../Inertia Trading/ba_weekly_return_volatility.csv')
ba_detailed = pd.read_csv("../Inertia Trading/ba_weekly_return_detailed.csv")

labels = ba_detailed[['Year','Week_Number', 'label']].drop_duplicates()

ba_volatility = ba_volatility.merge(
    labels, 
    on=['Year','Week_Number'], 
    how='inner'
)
ba_volatility.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label
0,2020,0,-0.084,0.118794,green
1,2020,1,-0.1612,1.584772,green
2,2020,2,-0.3456,1.269723,green
3,2020,3,-0.05525,2.818341,green
4,2020,4,-0.2888,1.510424,green


In [234]:
train_df = ba_volatility[ba_volatility['Year'] < 2023]
test_df = ba_volatility[ba_volatility['Year'] >= 2023]

In [235]:
features = ['mean_return', 'volatility']
X_train = train_df[features].values
le = LabelEncoder()
Y_train = le.fit_transform(train_df['label'].values)

X_test = test_df[features].values
Y_test = le.transform(test_df['label'].values)

NB_classifier = GaussianNB().fit(X_train, Y_train)
prediction = NB_classifier.predict(X_test)
acc = accuracy_score(Y_test, prediction)
print(f"Testing accuracy: {acc:.3f}")

Testing accuracy: 0.876


The model achieved 88% accuracy on the test set (2023–2024). This is below k-Nearest Neighbors (97%) and Logistic Regression (95%), but still a reasonable result.

As we observed before, the two features are highly correlated and the classes are easy to separate visually. Naive Bayes assumes that the features are independent given the class, so this is not an ideal scenario for it. However, it still managed to perform decently well.
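
One quick way to check how far the independence assumption is from holding is to compute the correlation between the two features separately within each class. The sketch below uses synthetic data rather than the Boeing data, and the helper name `within_class_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd

def within_class_correlation(df, features, label_col="label"):
    """Pearson correlation between two features, computed separately per class."""
    a, b = features
    return pd.Series({cls: g[a].corr(g[b]) for cls, g in df.groupby(label_col)})

# Synthetic data for illustration only (not the Boeing data).
rng = np.random.default_rng(0)
n = 500
green_x = rng.normal(size=n)
green = pd.DataFrame({
    "mean_return": green_x,
    "volatility": green_x + 0.3 * rng.normal(size=n),  # strongly tied to mean_return
    "label": "green",
})
red = pd.DataFrame({
    "mean_return": rng.normal(size=n),
    "volatility": rng.normal(size=n),                   # unrelated to mean_return
    "label": "red",
})
toy = pd.concat([green, red], ignore_index=True)

corr = within_class_correlation(toy, ["mean_return", "volatility"])
print(corr.round(2))
```

A within-class correlation near zero is consistent with the Naive Bayes assumption, while values near ±1 indicate that the per-feature likelihoods count largely the same evidence twice.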

### Testing Years Performance

In [236]:
cm = confusion_matrix(Y_test, prediction)
tn, fp, fn, tp = cm.ravel()
print(cm)

tpr = tp / (tp + fn)
tnr = tn / (fp + tn)
print(f"TPR: {tpr:.3f}")
print(f"TNR: {tnr:.3f}")

[[89  0]
 [13  3]]
TPR: 0.188
TNR: 1.000


The model classified every "green" week correctly (TNR = 1.000) but identified only 3 of the 16 "red" weeks (TPR = 0.188); here "red" is the positive class. This is expected given the imbalanced dataset, in which green weeks dominate.
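
To put the accuracy in context on an imbalanced test set, it can help to compare it with the accuracy of always predicting the majority class, and to look at balanced accuracy as well. The sketch below recomputes these from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class,
# with 0 = green and 1 = red (LabelEncoder sorts the labels alphabetically).
cm = np.array([[89, 0],
               [13, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
# Accuracy of a model that always predicts the majority class (green).
majority_baseline = cm.sum(axis=1).max() / cm.sum()
# Balanced accuracy averages the recall of the two classes.
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(f"accuracy:          {accuracy:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

The gap between the accuracy and the majority baseline comes entirely from the 3 red weeks that were detected.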

### Buy-and-Hold vs Trading Strategy

In [237]:
# created functions for strategy comparison to avoid code duplication
def make_weekly_prices(ba_detailed):
    return (
        ba_detailed.groupby(['Year','Week_Number'], as_index=False)
        .agg(Open_w=('Open','first'), Close_w=('Close','last'))
        .sort_values(['Year','Week_Number'])
        .reset_index(drop=True)
    )
    
def buy_and_hold(weekly_prices, initial=100.0):
    wp = weekly_prices.sort_values(['Year','Week_Number']).reset_index(drop=True)
    yearly = (
        wp.groupby('Year', as_index=False)
          .agg(Close_y=('Close_w','last'))
          .sort_values('Year')
          .reset_index(drop=True)
    )
    shares = initial / wp.iloc[0]['Open_w']
    yearly['BuyHold'] = (shares * yearly['Close_y']).round(2)
    return yearly

def trading(df, label_col='predicted_label', green_value='green', initial=100.0):
    """Simulate label-based trading; return the portfolio value at the end of each year."""
    cash = initial
    shares = 0
    results = {}
    
    for i in range(len(df)):
        this_week = df.iloc[i]
        next_week = df.iloc[i+1] if i+1 < len(df) else None 
        
        # out of the market and this week is green: buy at the weekly open
        if shares == 0 and this_week[label_col] == green_value:
            shares = cash / this_week['Open_w']
            cash = 0
            
        # in the market and next week is not green (or data ends): sell at the weekly close
        if shares > 0 and ((next_week is None) or next_week[label_col] != green_value):
            cash = shares * this_week['Close_w']
            shares = 0

        year_end = (i == len(df)-1) or (this_week['Year'] != next_week['Year'])
        if year_end:  # store the portfolio value at the end of each year
            wealth = shares*this_week['Close_w'] if shares > 0 else cash
            results[this_week['Year']] = round(wealth, 2)
    # one value per year, indexed by the first row of that year
    return df[['Year']].drop_duplicates().assign(value=df['Year'].map(results))['value']

def compare_strategies(ba_detailed, labels_df, strategy_name, label_col='predicted_label', green_value='green', initial=100.0):
    weekly_prices = make_weekly_prices(ba_detailed)
    weekly_prices = weekly_prices.merge(labels_df, on=['Year','Week_Number'], how='inner').sort_values(['Year','Week_Number']).reset_index(drop=True)
    
    portfolio = buy_and_hold(weekly_prices, initial=initial)
    trad_str = trading(weekly_prices, label_col=label_col, green_value=green_value, initial=initial)
    trad_str_df = pd.DataFrame({'Year': weekly_prices['Year'].unique(), strategy_name: trad_str})

    portfolio = portfolio.merge(trad_str_df, on='Year', how='left')
    return portfolio

In [238]:
test_df = test_df.copy()
test_df['predicted_label'] = le.inverse_transform(prediction)
test_df.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label,predicted_label
157,2023,1,2.84125,1.618816,green,green
158,2023,2,0.1204,1.91299,green,green
159,2023,3,-0.87075,0.492591,green,green
160,2023,4,0.4262,0.875527,green,green
161,2023,5,-0.482,1.689419,green,green


In [239]:
nb_portfolio = compare_strategies(ba_detailed, test_df, strategy_name="GaussianNB")
print(nb_portfolio)

   Year     Close_y  BuyHold  GaussianNB
0  2023  260.660004   135.09      135.09
1  2024  176.550003    91.50      118.84


The trading strategy based on the Naive Bayes predictions outperformed the buy-and-hold strategy over the testing period. In 2023 the two strategies ended with the same value ($135.09), while in 2024 the Gaussian Naive Bayes strategy ended with $118.84 compared to $91.50 for buy-and-hold.

However, the final portfolio value is below those of the kNN ($252.42) and Logistic Regression ($224.78) strategies, which aligns with the model’s lower predictive accuracy.

## Naive Bayesian with custom density

In [240]:
from scipy import stats

In [241]:
train_df = ba_volatility[ba_volatility['Year'] < 2023]
test_df = ba_volatility[ba_volatility['Year'] >= 2023]

features = ['mean_return', 'volatility']
X_train = train_df[features].values
le = LabelEncoder()
Y_train = le.fit_transform(train_df['label'].values)
X_test = test_df[features].values
Y_test = le.transform(test_df['label'].values)

### Student-t Naive Bayesian Classifier
We model the likelihood of each feature using a Student-t distribution, which is defined by 3 parameters:
- degrees of freedom df
- location parameter μ
- scale parameter s

When the degrees of freedom are large, the Student-t distribution approaches the normal distribution. For small df, it keeps a similar bell shape but has much fatter tails (df = 1 gives the Cauchy distribution).
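
The effect of the degrees of freedom on the tails can be seen by evaluating the log-density at a point far from the location. The sketch below uses SciPy's standard `stats.t` and `stats.norm` with location 0 and scale 1; the evaluation point `x = 4.0` is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

x = 4.0  # a point four scale units away from the location

# Log-density of the Student-t distribution at x for several degrees of freedom.
log_t = {nu: stats.t.logpdf(x, df=nu) for nu in (0.5, 1, 5, 1e6)}
log_norm = stats.norm.logpdf(x)

for nu, value in log_t.items():
    print(f"df={nu:g}: t log-density at x={x} is {value:.3f}")
print(f"normal log-density at x={x} is {log_norm:.3f}")

# For df > 2 the variance is scale**2 * df / (df - 2), so the scale parameter
# is smaller than the standard deviation of the distribution.
t_std = stats.t.std(df=5, scale=1.0)
print(t_std, np.sqrt(5 / 3))
```

Smaller df assigns a higher density to the far-away point, and a very large df gives essentially the normal value. The last lines are a reminder that the scale parameter and the standard deviation coincide only in the large-df limit, so using the sample standard deviation as the scale is an approximation.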

In [242]:
class StudentTNaiveBayes:
    def fit(self, X, y, df):
        self.classes = np.unique(y)
        self.mu = {}
        self.scale = {}
        self.prior_prob = {}
        self.df = df
        for c in self.classes:
            X_c = X[y == c]
            self.mu[c] = np.mean(X_c, axis=0)
            self.scale[c] = np.std(X_c, axis=0, ddof=1)
            self.prior_prob[c] = X_c.shape[0] / X.shape[0]
        return self
    
    def predict(self, X):
        log_probs = []
        for cls in self.classes:
            lp = np.sum(stats.t.logpdf(X, df=self.df, loc=self.mu[cls], scale=self.scale[cls]), axis=1) + np.log(self.prior_prob[cls])
            log_probs.append(lp)
        log_probs = np.array(log_probs).T
        predictions = np.argmax(log_probs, axis=1)
        return predictions

In [243]:
df_values = [0.5, 1, 5]  # degrees of freedom to compare
for df_value in df_values:
    Stud_t_classifier = StudentTNaiveBayes().fit(X_train, Y_train, df=df_value)
    prediction = Stud_t_classifier.predict(X_test)
    acc = accuracy_score(Y_test, prediction)
    print(f"Testing accuracy (df={df_value}): {acc:.3f}")
    
    cm = confusion_matrix(Y_test, prediction, labels=[0,1])
    print(cm)
    tn, fp, fn, tp = cm.ravel()
    tpr = tp/(tp+fn)
    tnr = tn/(fp+tn)
    print(f"TPR: {tpr:.3f}")
    print(f"TNR: {tnr:.3f}")

Testing accuracy (df=0.5): 0.848
[[89  0]
 [16  0]]
TPR: 0.000
TNR: 1.000
Testing accuracy (df=1): 0.848
[[89  0]
 [16  0]]
TPR: 0.000
TNR: 1.000
Testing accuracy (df=5): 0.848
[[89  0]
 [16  0]]
TPR: 0.000
TNR: 1.000


We see that changing the degrees of freedom does not impact the accuracy of the model on our dataset: for all three values the model predicts green for every test week, so the accuracy equals the share of green weeks (89 of 105, or 0.848).

From the confusion matrix, we see that the model fails to identify any red (cash) weeks, which is worse than Gaussian Naive Bayes (3 of 16). Every week is classified as green (buy), the majority class in our imbalanced dataset.

In [244]:
# df=5
Stud_t_classifier = StudentTNaiveBayes().fit(X_train, Y_train, df=5)
prediction = Stud_t_classifier.predict(X_test)

test_df = test_df.copy()
test_df['predicted_label'] = le.inverse_transform(prediction)
test_df.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label,predicted_label
157,2023,1,2.84125,1.618816,green,green
158,2023,2,0.1204,1.91299,green,green
159,2023,3,-0.87075,0.492591,green,green
160,2023,4,0.4262,0.875527,green,green
161,2023,5,-0.482,1.689419,green,green


In [245]:
stud_t_nb_portfolio = compare_strategies(ba_detailed, test_df, strategy_name="Stud-t.NB")
print(stud_t_nb_portfolio)

   Year     Close_y  BuyHold  Stud-t.NB
0  2023  260.660004   135.09     135.09
1  2024  176.550003    91.50      91.50


While the model's accuracy was consistent across df values, the trading strategy based on its predictions did not outperform buy-and-hold. Since every week is predicted green, the strategy stays invested throughout, so both methods end with the same portfolio values in 2023 and 2024.

### Exponential Naive Bayesian Classifier
The Exponential Naive Bayes model assumes that each feature follows an exponential distribution, which is defined by a single rate parameter λ and supports only non-negative values.

In [246]:
train_df = ba_volatility[ba_volatility['Year'] < 2023]
test_df = ba_volatility[ba_volatility['Year'] >= 2023]

In [247]:
features = ['mean_return', 'volatility']
X_train = train_df[features].values
le = LabelEncoder()
Y_train = le.fit_transform(train_df['label'].values)
X_test = test_df[features].values
Y_test = le.transform(test_df['label'].values)

# need to make data positive for exponential distribution
shift = np.maximum(0.0, -X_train.min(axis=0))
X_train_pos = X_train + shift
X_test_pos  = X_test  + shift

In [248]:
class ExponentialNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.lambdas = {}
        self.prior_prob = {}
        for cls in self.classes:
            X_c = X[y == cls]
            self.lambdas[cls] = 1 / np.mean(X_c, axis=0)
            self.prior_prob[cls] = len(X_c) / len(X)
        return self
    
    def predict(self, X):
        log_probs = []    
        for cls in self.classes:
            lp = np.sum(stats.expon.logpdf(X, scale=1/self.lambdas[cls]), axis=1) + np.log(self.prior_prob[cls])
            log_probs.append(lp)
        log_probs = np.array(log_probs).T
        predictions = np.argmax(log_probs, axis=1)
        return predictions

In [249]:
ENB_classifier = ExponentialNaiveBayes().fit(X_train, Y_train)
prediction = ENB_classifier.predict(X_test)
acc = accuracy_score(Y_test, prediction)
print(f"Testing accuracy: {acc:.3f}")
cm = confusion_matrix(Y_test, prediction)
tn, fp, fn, tp = cm.ravel()
print(cm)
tpr = tp / (tp + fn)
tnr = tn / (fp + tn)
print(f"TPR: {tpr:.3f}")
print(f"TNR: {tnr:.3f}")

Testing accuracy: 0.152
[[ 0 89]
 [ 0 16]]
TPR: 1.000
TNR: 0.000


The model classified all weeks as class 1 (red), resulting in very low accuracy (15%).

In [250]:
test_df = test_df.copy()
test_df['predicted_label'] = le.inverse_transform(prediction)
test_df.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label,predicted_label
157,2023,1,2.84125,1.618816,green,red
158,2023,2,0.1204,1.91299,green,red
159,2023,3,-0.87075,0.492591,green,red
160,2023,4,0.4262,0.875527,green,red
161,2023,5,-0.482,1.689419,green,red


In [251]:
exp_nb_portfolio = compare_strategies(ba_detailed, test_df, strategy_name="Exp.NB")
print(exp_nb_portfolio)

   Year     Close_y  BuyHold  Exp.NB
0  2023  260.660004   135.09   100.0
1  2024  176.550003    91.50   100.0


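Since the model never predicted "green", the strategy never entered the market, leaving the initial $100 cash balance in both years.

Thus Exponential Naive Bayes, as applied here, is not suitable for our dataset: the features are not exponentially distributed, and `mean_return` takes negative values, which the exponential distribution cannot represent. Note also that the classifier above was fitted on `X_train` and `X_test` rather than the shifted `X_train_pos` and `X_test_pos`, so this result likely reflects that mismatch rather than the behavior of the model on shifted data.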
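
The exponential distribution is only defined for non-negative values, which is why the features need to be shifted before fitting. The sketch below illustrates what SciPy and NumPy return outside that domain and how the shift behaves; the small array is synthetic, not the Boeing data.

```python
import numpy as np
from scipy import stats

# The exponential density is zero for negative inputs, so its log-density is -inf.
logpdf_negative_x = stats.expon.logpdf(-0.5, scale=1.0)
# SciPy returns nan when the scale (1 / lambda) is not positive, which happens
# if lambda is estimated as 1 / mean from a sample whose mean is negative.
logpdf_negative_scale = stats.expon.logpdf(0.5, scale=-2.0)
# np.argmax treats nan as the maximum, so a nan score wins the comparison.
winner = np.argmax([-3.0, np.nan])
print(logpdf_negative_x, logpdf_negative_scale, winner)

# Shifting by the training minimum makes the training features non-negative.
# Synthetic values for illustration only.
X_train_toy = np.array([[-0.3, 1.2], [0.4, 0.8], [-0.1, 2.0]])
shift = np.maximum(0.0, -X_train_toy.min(axis=0))
X_train_toy_pos = X_train_toy + shift
print(X_train_toy_pos.min(axis=0))
```

Test values can still fall below the training minimum after the shift, so the shifted test set is not guaranteed to be non-negative.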

## Linear and Quadratic Discriminant Analysis

In [252]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

### Linear Discriminant Analysis (LDA)

In [253]:
train_df = ba_volatility[ba_volatility['Year'] < 2023]
test_df = ba_volatility[ba_volatility['Year'] >= 2023]

features = ['mean_return', 'volatility']
X_train = train_df[features].values
le = LabelEncoder()
Y_train = le.fit_transform(train_df['label'].values)

scaler = StandardScaler().fit(X_train)
X_train= scaler.transform(X_train)
X_test = test_df[features].values
X_test = scaler.transform(X_test)
Y_test = le.transform(test_df['label'].values)

lda_classifier = LinearDiscriminantAnalysis().fit(X_train, Y_train)
print(f"Equation coefficients: {lda_classifier.coef_}")

prediction = lda_classifier.predict(X_test)
acc = np.mean(prediction==Y_test)
print(f"Testing accuracy: {acc:.3f}")

cm = confusion_matrix(Y_test, prediction)
tn, fp, fn, tp = cm.ravel()
print(cm)
tpr = tp / (tp + fn)
tnr = tn / (fp + tn)
print(f"TPR: {tpr:.3f}")
print(f"TNR: {tnr:.3f}")

Equation coefficients: [[-2.48647367  0.85740764]]
Testing accuracy: 0.876
[[89  0]
 [13  3]]
TPR: 0.188
TNR: 1.000


The coefficients belong to the score for class 1 (red). Because the features are standardized, their magnitudes are comparable: mean return has a large negative coefficient (-2.49), so weeks with higher mean returns are more likely to be classified as “green” (buy) weeks, while volatility has a smaller positive coefficient (0.86), so higher volatility pushes the classification slightly toward “red”.
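
For two classes, the LDA decision reduces to the sign of a linear score built from `coef_` and `intercept_`. The sketch below checks this on synthetic data standing in for the standardized (mean return, volatility) features; the class means and sample sizes are made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic two-class data (not the Boeing data).
X0 = rng.normal(loc=[0.5, 1.0], scale=1.0, size=(150, 2))   # class 0
X1 = rng.normal(loc=[-1.5, 1.5], scale=1.0, size=(50, 2))   # class 1
X = StandardScaler().fit_transform(np.vstack([X0, X1]))
y = np.array([0] * 150 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)

# The score is w . x + b; a positive score means class 1, otherwise class 0.
manual_score = X @ lda.coef_.ravel() + lda.intercept_[0]
manual_pred = (manual_score > 0).astype(int)

scores_match = np.allclose(manual_score, lda.decision_function(X))
preds_match = np.array_equal(manual_pred, lda.predict(X))
print(scores_match, preds_match)
```

A negative coefficient therefore lowers the class 1 score as the feature grows, which is how the sign of each entry in `coef_` can be read.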

The model achieved 88% accuracy, the same as Gaussian Naive Bayes. It correctly identifies only 3 of the 16 “red” weeks and predicts “green” for every other week.

In [254]:
test_df = test_df.copy()
test_df['predicted_label'] = le.inverse_transform(prediction)
test_df.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label,predicted_label
157,2023,1,2.84125,1.618816,green,green
158,2023,2,0.1204,1.91299,green,green
159,2023,3,-0.87075,0.492591,green,green
160,2023,4,0.4262,0.875527,green,green
161,2023,5,-0.482,1.689419,green,green


In [255]:
lda_portfolio = compare_strategies(ba_detailed, test_df, strategy_name="LDA")
print(lda_portfolio)

   Year     Close_y  BuyHold     LDA
0  2023  260.660004   135.09  135.09
1  2024  176.550003    91.50  133.51


The LDA-based trading strategy outperformed the buy-and-hold strategy. Although its accuracy (88%) and confusion matrix were the same as for the Gaussian Naive Bayes model, the final portfolio value was higher: $133.51 compared to $118.84. This implies that the two models flag different weeks as red, and the LDA choices were more profitable in this test period.
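
A confusion matrix only fixes how many weeks fall into each outcome, not which weeks they are, and the trading rule depends on timing. The toy example below, with made-up labels rather than the actual model outputs, shows two prediction vectors with identical confusion matrices that disagree on individual weeks.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (0 = green, 1 = red); not the actual model outputs.
y_true = np.array([0, 0, 1, 1, 0, 1])
pred_a = np.array([0, 0, 1, 0, 0, 0])   # detects the red week at position 2
pred_b = np.array([0, 0, 0, 1, 0, 0])   # detects the red week at position 3

same_matrix = np.array_equal(confusion_matrix(y_true, pred_a),
                             confusion_matrix(y_true, pred_b))
# Positions where the two prediction vectors disagree.
differing_weeks = np.flatnonzero(pred_a != pred_b)

print(same_matrix)       # True
print(differing_weeks)   # [2 3]
```

Applying the same comparison to the stored predictions of two classifiers would list the weeks on which their trading decisions can diverge.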

### Quadratic Discriminant Analysis (QDA)

In [256]:
train_df = ba_volatility[ba_volatility['Year'] < 2023]
test_df = ba_volatility[ba_volatility['Year'] >= 2023]

features = ['mean_return', 'volatility']
X_train = train_df[features].values
le = LabelEncoder()
Y_train = le.fit_transform(train_df['label'].values)

scaler = StandardScaler().fit(X_train)
X_train= scaler.transform(X_train)
X_test = test_df[features].values
X_test = scaler.transform(X_test)
Y_test = le.transform(test_df['label'].values)

qda_classifier = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_train, Y_train)
# print(qda_classifier.coef_)
print(f"Equation coefficients:\nMean: {qda_classifier.means_}, \nCovariance:{qda_classifier.covariance_}, \nPriors:{qda_classifier.priors_}")

prediction = qda_classifier.predict(X_test)
acc = np.mean(prediction==Y_test)
print(f"Testing accuracy: {acc:.3f}")

cm = confusion_matrix(Y_test, prediction)
tn, fp, fn, tp = cm.ravel()
print(cm)
tpr = tp / (tp + fn)
tnr = tn / (fp + tn)
print(f"TPR: {tpr:.3f}")
print(f"TNR: {tnr:.3f}")

Equation coefficients:
Mean: [[ 0.30658354 -0.07481381]
 [-1.15201089  0.28111857]], 
Covariance:[array([[0.57874767, 0.44500739],
       [0.44500739, 0.79273396]]), array([[ 0.94886208, -0.74763306],
       [-0.74763306,  1.75599278]])], 
Priors:[0.78980892 0.21019108]
Testing accuracy: 0.876
[[89  0]
 [13  3]]
TPR: 0.188
TNR: 1.000


The QDA model produced results identical to the LDA classifier: both achieved 88% accuracy with the same confusion matrix and therefore the same true positive and true negative rates. The additional flexibility of QDA did not provide any advantage on our dataset, which suggests that a linear decision boundary is already adequate for these features.



In [257]:
test_df = test_df.copy()
test_df['predicted_label'] = le.inverse_transform(prediction)
test_df.head()

Unnamed: 0,Year,Week_Number,mean_return,volatility,label,predicted_label
157,2023,1,2.84125,1.618816,green,green
158,2023,2,0.1204,1.91299,green,green
159,2023,3,-0.87075,0.492591,green,green
160,2023,4,0.4262,0.875527,green,green
161,2023,5,-0.482,1.689419,green,green


In [258]:
qda_portfolio = compare_strategies(ba_detailed, test_df, strategy_name="QDA")
print(qda_portfolio)

   Year     Close_y  BuyHold     QDA
0  2023  260.660004   135.09  135.09
1  2024  176.550003    91.50  133.51


In the trading simulation, the QDA-based strategy achieved the same final portfolio value ($133.51) as LDA, outperforming both the Buy-and-Hold ($91.50) and Gaussian Naive Bayes ($118.84) strategies.