# **NAIVE BAYES**

The Naive Bayes method applies Bayes' theorem under the assumption that the predictors are conditionally independent given the class. It is a simple yet effective probabilistic classifier that is cheap to train and scales well to large datasets. Its main weakness is that strongly interdependent features violate the independence assumption and can bias the estimated probabilities. Despite this shortcoming, it remains a staple benchmark in our analytical toolbox.
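To make the independence assumption concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier written with the standard library only. The helper names (`fit_gaussian_nb`, `predict_proba`) and the toy data are invented for illustration and are not part of the notebook, which relies on scikit-learn's `GaussianNB` instead.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature means and variances."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        # A small constant keeps each variance strictly positive (a choice of this sketch).
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-9
            for j in range(n_features)
        ]
        params[label] = (len(rows) / len(X), means, variances)
    return params

def predict_proba(params, x):
    """Posterior P(class | x): log prior plus a sum of per-feature log likelihoods."""
    log_joint = {
        label: math.log(prior)
        + sum(gaussian_log_pdf(xj, m, v) for xj, m, v in zip(x, means, variances))
        for label, (prior, means, variances) in params.items()
    }
    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / total for label, v in log_joint.items()}

# Toy data (invented): two features, two classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
posterior = predict_proba(params, [1.1, 2.0])
print(posterior)
```

The sum over features inside `predict_proba` is where the "naive" assumption enters: each feature contributes its own likelihood term, and correlations between features are ignored.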

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report


# load the third sheet of FinancialMarketData.xlsx
df = pd.read_excel('FinancialMarketData.xlsx', sheet_name=2)

# save in y the output (market crashes equal to 1)
y = df.iloc[:, 0]

# drop the first column (output), drop the second column (dates)
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[0], axis=1, inplace=True)

df.head()

Unnamed: 0,XAU BGNL,ECSURPUS,BDIY,CRY,DXY,JPY,GBP,Cl1,VIX,USGG30YR,...,LP01TREU,EMUSTRUU,LF94TRUU,MXUS,MXEU,MXJP,MXBR,MXRU,MXIN,MXCN
0,283.25,0.077,1388,157.26,100.56,105.86,1.646,25.77,22.5,6.671,...,116.4635,230.5267,123.7616,1416.12,127.75,990.59,856.76,224.33,217.34,34.3
1,287.65,0.043,1405,165.01,101.86,105.47,1.6383,28.85,21.5,6.747,...,117.2674,231.377,123.7616,1428.79,129.5,993.98,925.22,234.37,227.08,32.74
2,287.15,0.135,1368,167.24,102.41,106.04,1.6496,28.28,23.02,6.634,...,117.9946,232.3895,123.7616,1385.93,126.48,974.83,886.93,216.82,233.0,32.46
3,282.75,0.191,1311,166.85,104.92,107.85,1.6106,28.22,23.45,6.423,...,120.51,231.9417,122.3281,1385.31,129.19,1007.12,842.6,201.89,237.48,31.29
4,298.4,0.312,1277,165.43,104.22,109.3,1.6108,28.02,21.25,6.231,...,118.7914,237.8117,122.3281,1411.95,134.67,1034.58,945.15,218.0,258.02,31.32


In [None]:
# check for nan values
nan_column = df.isna().sum()
print("Number of NaN in each column:")
print(nan_column)

# drop nan values
df = df.dropna()
print("\nDataFrame without rows containing NaN:")
print(df)

Number of NaN in each column:
XAU BGNL     0
ECSURPUS     0
BDIY         0
CRY          0
DXY          0
JPY          0
GBP          0
Cl1          0
VIX          0
USGG30YR     0
GT10         0
USGG2YR      0
USGG3M       0
US0001M      0
GTDEM30Y     0
GTDEM10Y     0
GTDEM2Y      0
EONIA        0
GTITL30YR    0
GTITL10YR    0
GTITL2YR     0
GTJPY30YR    0
GTJPY10YR    0
GTJPY2YR     0
GTGBP30Y     0
GTGBP20Y     0
GTGBP2Y      0
LUMSTRUU     0
LMBITR       0
LUACTRUU     0
LF98TRUU     0
LG30TRUU     0
LP01TREU     0
EMUSTRUU     0
LF94TRUU     0
MXUS         0
MXEU         0
MXJP         0
MXBR         0
MXRU         0
MXIN         0
MXCN         0
dtype: int64

DataFrame without rows containing NaN:
      XAU BGNL  ECSURPUS  BDIY       CRY      DXY     JPY     GBP    Cl1  \
0       283.25     0.077  1388  157.2600  100.560  105.86  1.6460  25.77   
1       287.65     0.043  1405  165.0100  101.860  105.47  1.6383  28.85   
2       287.15     0.135  1368  167.2400  102.410  106.04  1
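No rows are dropped here because every column reports zero NaN values. If rows were removed, the labels in `y` (extracted before the NaN check) would need to be filtered in the same way. The sketch below shows one way to keep features and labels aligned through the index. The frame and its column names are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value (not the real data).
features = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [0.1, 0.2, 0.3, 0.4]})
target = pd.Series([0, 1, 0, 1], name="crash")

# Drop incomplete rows, then keep only the matching labels via the index.
features_clean = features.dropna()
target_clean = target.loc[features_clean.index]

print(features_clean.shape, target_clean.tolist())
```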

In [None]:
# standard scaler
scaler = StandardScaler()
df_standardized = scaler.fit_transform(df)
column_names = df.columns
column_names = column_names.tolist()

# create the dataframe for the standardized values
df_standardized_pd = pd.DataFrame(df_standardized, columns=column_names)
df_standardized_pd

Unnamed: 0,XAU BGNL,ECSURPUS,BDIY,CRY,DXY,JPY,GBP,Cl1,VIX,USGG30YR,...,LP01TREU,EMUSTRUU,LF94TRUU,MXUS,MXEU,MXJP,MXBR,MXRU,MXIN,MXCN
0,-1.424377,0.116941,-0.432277,-1.289246,0.846232,-0.058102,0.359879,-1.349527,0.290316,2.345104,...,-1.052593,-1.529484,-1.782485,-0.273397,0.956041,1.028789,-1.061289,-1.419419,-1.173857,-0.793990
1,-1.415478,0.019602,-0.423848,-1.175689,0.960528,-0.088487,0.323600,-1.230558,0.174552,2.409617,...,-1.045276,-1.526739,-1.782485,-0.255250,1.040230,1.045740,-0.991858,-1.385836,-1.149828,-0.857078
2,-1.416489,0.282989,-0.442193,-1.143014,1.008884,-0.044078,0.376841,-1.252575,0.350514,2.313697,...,-1.038657,-1.523471,-1.782485,-0.316639,0.894944,0.949983,-1.030691,-1.444539,-1.135224,-0.868402
3,-1.425389,0.443312,-0.470454,-1.148728,1.229563,0.096943,0.193090,-1.254892,0.400292,2.134589,...,-1.015762,-1.524916,-1.801371,-0.317527,1.025317,1.111444,-1.075649,-1.494479,-1.124171,-0.915718
4,-1.393734,0.789723,-0.487312,-1.169535,1.168019,0.209916,0.194033,-1.262618,0.145611,1.971609,...,-1.031405,-1.505970,-1.801371,-0.279370,1.288948,1.248754,-0.971646,-1.440592,-1.073498,-0.914505
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1106,1.497764,0.867021,0.005523,-0.890078,0.123178,0.164727,-0.912241,-0.113874,0.035636,-1.342820,...,1.781110,1.759206,1.723279,3.142187,1.560758,2.073760,-0.234767,0.108805,2.473954,2.300980
1107,1.412004,0.763956,-0.077773,-0.884629,0.207669,0.287049,-0.931558,-0.006107,-0.044241,-1.306913,...,1.788943,1.736163,1.670111,3.192590,1.678141,2.107362,-0.271440,0.149513,2.461125,2.218884
1108,1.528751,1.156174,-0.083227,-0.870486,0.123090,0.255105,-0.883029,-0.053231,-0.216730,-1.345706,...,1.807212,1.761959,1.734240,3.363323,1.744049,2.049008,-0.216624,0.082013,2.457967,2.345061
1109,1.527558,1.233472,-0.059428,-0.836822,0.080624,0.201346,-0.923077,-0.020399,-0.386903,-1.370407,...,1.804880,1.769406,1.730325,3.466292,1.748859,2.061209,-0.225549,0.105527,2.415978,2.202708
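The scaler above is fitted on all rows before the data is split, so the test rows contribute to the means and standard deviations. A common alternative is to fit the scaler on the training rows only. The sketch below shows that pattern with a scikit-learn pipeline. The synthetic data is a hypothetical stand-in for the market data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the market data (hypothetical values, not the real file).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=41
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and standard deviations to transform new rows.
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["standardscaler"]
print("Training-set feature means:", fitted_scaler.mean_)
print("Held-out accuracy:", model.score(X_te, y_te))
```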


In [None]:
# split the dataset into a temporary set (80%) and a test set (20%)
X_train_temp, X_test, y_train_temp, y_test = train_test_split(df_standardized_pd, y, test_size=0.2, random_state=41)

# split the temporary set into training (75%) and validation (25%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train_temp, y_train_temp, test_size=0.25, random_state=41)

print("Number of instances in full dataset:", df_standardized_pd.shape[0])
print("Number of instances in training set:", X_train.shape[0])
print("Number of instances in validation set:", X_val.shape[0])
print("Number of instances in test set:", X_test.shape[0])

Number of instances in full dataset: 1111
Number of instances in training set: 666
Number of instances in validation set: 222
Number of instances in test set: 223
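The two nested splits give roughly a 60/20/20 partition, since 75% of 80% is 60%. The arithmetic below reproduces the printed counts, assuming a fractional held-out size is rounded up (which is consistent with the output above).

```python
import math

n_samples = 1111  # rows in the standardized DataFrame, as printed above

# First split: 20% held out as the test set (fractional sizes rounded up).
n_test = math.ceil(0.2 * n_samples)
n_temp = n_samples - n_test

# Second split: 25% of the remainder becomes the validation set.
n_val = math.ceil(0.25 * n_temp)
n_train = n_temp - n_val

print(n_train, n_val, n_test)  # 666 222 223
print(round(n_train / n_samples, 3), round(n_val / n_samples, 3), round(n_test / n_samples, 3))
```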


## **Buckets**

In [None]:
bond_train = X_train.iloc[:, 9:35]
equity_train = X_train.iloc[:, 35:42]
commodities_train = X_train.iloc[:, [0,3,7]]
indexes_train = X_train.iloc[:, [1,2,4,5,6,8]]

# print the column in each bucket to check them
print("Bonds: ", bond_train.columns.tolist())
print("Equity: ", equity_train.columns.tolist())
print("Commodities: ", commodities_train.columns.tolist())
print("Indexes: ", indexes_train.columns.tolist())

Bonds:  ['USGG30YR', 'GT10', 'USGG2YR', 'USGG3M', 'US0001M', 'GTDEM30Y', 'GTDEM10Y', 'GTDEM2Y', 'EONIA', 'GTITL30YR', 'GTITL10YR', 'GTITL2YR', 'GTJPY30YR', 'GTJPY10YR', 'GTJPY2YR', 'GTGBP30Y', 'GTGBP20Y', 'GTGBP2Y', 'LUMSTRUU', 'LMBITR', 'LUACTRUU', 'LF98TRUU', 'LG30TRUU', 'LP01TREU', 'EMUSTRUU', 'LF94TRUU']
Equity:  ['MXUS', 'MXEU', 'MXJP', 'MXBR', 'MXRU', 'MXIN', 'MXCN']
Commodities:  ['XAU BGNL', 'CRY', 'Cl1']
Indexes:  ['ECSURPUS', 'BDIY', 'DXY', 'JPY', 'GBP', 'VIX']


In [None]:
bond_val = X_val.iloc[:, 9:35]
equity_val = X_val.iloc[:, 35:42]
commodities_val = X_val.iloc[:, [0,3,7]]
indexes_val = X_val.iloc[:, [1,2,4,5,6,8]]

In [None]:
bond_test = X_test.iloc[:, 9:35]
equity_test = X_test.iloc[:, 35:42]
commodities_test = X_test.iloc[:, [0,3,7]]
indexes_test = X_test.iloc[:, [1,2,4,5,6,8]]
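The same positional slices are repeated for the training, validation and test sets. One way to reduce the risk of a mismatch is to define the buckets once by column name. The sketch below does this with the column order printed by the NaN check. `BUCKET_POSITIONS`, `BUCKET_COLUMNS` and `split_into_buckets` are hypothetical names introduced here, not part of the notebook.

```python
# Column order as printed by the NaN check earlier in the notebook.
columns = [
    "XAU BGNL", "ECSURPUS", "BDIY", "CRY", "DXY", "JPY", "GBP", "Cl1", "VIX",
    "USGG30YR", "GT10", "USGG2YR", "USGG3M", "US0001M", "GTDEM30Y", "GTDEM10Y",
    "GTDEM2Y", "EONIA", "GTITL30YR", "GTITL10YR", "GTITL2YR", "GTJPY30YR",
    "GTJPY10YR", "GTJPY2YR", "GTGBP30Y", "GTGBP20Y", "GTGBP2Y", "LUMSTRUU",
    "LMBITR", "LUACTRUU", "LF98TRUU", "LG30TRUU", "LP01TREU", "EMUSTRUU",
    "LF94TRUU", "MXUS", "MXEU", "MXJP", "MXBR", "MXRU", "MXIN", "MXCN",
]

# Positional selections used in the notebook, written down once.
BUCKET_POSITIONS = {
    "bond": list(range(9, 35)),
    "equity": list(range(35, 42)),
    "commodities": [0, 3, 7],
    "indexes": [1, 2, 4, 5, 6, 8],
}

# Translate positions into column names so the buckets survive reordering.
BUCKET_COLUMNS = {
    name: [columns[i] for i in positions]
    for name, positions in BUCKET_POSITIONS.items()
}

def split_into_buckets(frame):
    """Return one sub-frame per bucket; `frame` is any pandas DataFrame
    holding the 42 feature columns (hypothetical helper)."""
    return {name: frame[cols] for name, cols in BUCKET_COLUMNS.items()}

# Every column should land in exactly one bucket.
assigned = sorted(c for cols in BUCKET_COLUMNS.values() for c in cols)
print(len(assigned), assigned == sorted(columns))  # 42 True
```

With this helper, a call such as `split_into_buckets(X_val)["bond"]` would select the same columns as `X_val.iloc[:, 9:35]`.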

## **Naive Bayes**

In [None]:
def NaiveBayes(x_train, y_train, x_val, y_val):
  """Fit a Gaussian Naive Bayes model and report its scores on a held-out set."""

  # Initialize the Naive Bayes classifier
  gnb = GaussianNB()

  # Train the model
  gnb.fit(x_train, y_train)

  # Predict on the held-out set
  y_pred = gnb.predict(x_val)

  # Evaluate on the held-out set
  val_accuracy = accuracy_score(y_val, y_pred)
  print("Validation Accuracy:", val_accuracy)
  print("Classification Report (Validation Set):")
  print(classification_report(y_val, y_pred))
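A single validation split gives one estimate that depends on how the rows happened to be divided. A k-fold estimate is one way to check how stable the score is. The sketch below uses stratified 5-fold cross-validation with balanced accuracy on synthetic, imbalanced data. The data and the choice of metric are assumptions of this example, not results from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (hypothetical, roughly 20% positives).
rng = np.random.default_rng(41)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.7, size=300) > 1.0).astype(int)

# Stratified folds keep the class ratio similar in every fold.
num_folds = 5
cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=41)

scores = cross_val_score(
    GaussianNB(), X_demo, y_demo, cv=cv, scoring="balanced_accuracy"
)
print("Per-fold balanced accuracy:", np.round(scores, 3))
print("Mean:", scores.mean())
```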


In [None]:
NaiveBayes(bond_train,y_train,bond_val,y_val)

Validation Accuracy: 0.5765765765765766
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.83      0.58      0.68       175
           1       0.26      0.55      0.36        47

    accuracy                           0.58       222
   macro avg       0.55      0.57      0.52       222
weighted avg       0.71      0.58      0.62       222



In [None]:
NaiveBayes(equity_train,y_train,equity_val,y_val)

Validation Accuracy: 0.7207207207207207
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.86      0.77      0.81       175
           1       0.39      0.55      0.46        47

    accuracy                           0.72       222
   macro avg       0.63      0.66      0.63       222
weighted avg       0.76      0.72      0.74       222



In [None]:
NaiveBayes(commodities_train,y_train,commodities_val,y_val)

Validation Accuracy: 0.7882882882882883
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.79      0.99      0.88       175
           1       0.50      0.02      0.04        47

    accuracy                           0.79       222
   macro avg       0.65      0.51      0.46       222
weighted avg       0.73      0.79      0.70       222
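The commodities accuracy needs to be read against the class imbalance. With 175 negatives and 47 positives in the validation set, a classifier that always predicts class 0 would already reach 175/222. The sketch below compares that baseline with the reported accuracy. The confusion counts are a reconstruction consistent with the rounded report, not values printed by the notebook.

```python
# Validation support taken from the classification reports above.
n_negative, n_positive = 175, 47

# Accuracy of a classifier that always predicts "no crash" (class 0).
majority_baseline = n_negative / (n_negative + n_positive)

# Confusion counts consistent with the rounded commodities report
# (a reconstruction, not values printed by the notebook).
tn, fp, fn, tp = 174, 1, 46, 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall_pos = tp / (tp + fn)
precision_pos = tp / (tp + fp)
balanced_accuracy = 0.5 * (tn / (tn + fp) + recall_pos)

print(f"majority baseline  : {majority_baseline:.4f}")
print(f"commodities acc.   : {accuracy:.4f}")
print(f"recall (class 1)   : {recall_pos:.2f}")
print(f"precision (class 1): {precision_pos:.2f}")
print(f"balanced accuracy  : {balanced_accuracy:.3f}")
```

Under these counts the commodities model matches the majority baseline exactly while detecting only one crash, which is why recall on class 1 is a more informative figure here than accuracy.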



In [None]:
NaiveBayes(indexes_train,y_train,indexes_val,y_val)

Validation Accuracy: 0.8513513513513513
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.90      0.91      0.91       175
           1       0.65      0.64      0.65        47

    accuracy                           0.85       222
   macro avg       0.78      0.77      0.78       222
weighted avg       0.85      0.85      0.85       222



## Naive Bayes on the validation set considering all the variables

In [None]:
NaiveBayes(X_train,y_train,X_val,y_val)

Validation Accuracy: 0.6396396396396397
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.88      0.63      0.73       175
           1       0.33      0.68      0.44        47

    accuracy                           0.64       222
   macro avg       0.60      0.65      0.59       222
weighted avg       0.76      0.64      0.67       222



## Naive Bayes on the test set considering all the variables

In [None]:
NaiveBayes(X_train,y_train,X_test,y_test)

Validation Accuracy: 0.6502242152466368
Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.84      0.68      0.75       175
           1       0.32      0.54      0.40        48

    accuracy                           0.65       223
   macro avg       0.58      0.61      0.58       223
weighted avg       0.73      0.65      0.68       223
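The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which helps when reading reports on imbalanced classes. The sketch below does so for the test-set report. Because its inputs are the rounded figures shown above, the results may differ from the printed averages by about 0.01.

```python
# Per-class rows from the test-set report above: (precision, recall, f1, support).
report = {
    0: (0.84, 0.68, 0.75, 175),
    1: (0.32, 0.54, 0.40, 48),
}

total = sum(row[3] for row in report.values())

def macro(idx):
    """Unweighted mean over classes."""
    return sum(row[idx] for row in report.values()) / len(report)

def weighted(idx):
    """Mean over classes weighted by support."""
    return sum(row[idx] * row[3] for row in report.values()) / total

for name, idx in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:9s} macro={macro(idx):.2f} weighted={weighted(idx):.2f}")
```

The support-weighted recall equals the overall accuracy by construction, so the macro average is the row that shows how the model treats the minority class.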



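Used as a benchmark in our study, the Naive Bayes classifier reaches an accuracy of 0.64 on the validation set and 0.65 on the test set when trained on all variables. This is below the validation accuracy of the indexes bucket (0.85), the commodities bucket (0.79) and the equity bucket (0.72), so pooling every feature does not improve on the best segmented buckets, plausibly because the many correlated series violate the independence assumption. Its computational efficiency still makes it a convenient first model and a useful baseline for comparing more complex models.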