# NASDAQ 100 Trend Classification: Machine Learning's Approach to Identifying Major Turning Points

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV



In [2]:
df_interpolated = pd.read_csv('df_interpolated.csv', index_col='Date', parse_dates=True)

In [3]:
print(df_interpolated.head())

            Class        GC=F         ^DJT   ^FVX   ^IRX         ^NDX  \
Date                                                                    
2007-01-03      2  627.099976  4650.660156  4.657  4.915  1759.369995   
2007-01-04      2  623.900024  4673.069824  4.605  4.900  1792.910034   
2007-01-05      2  604.900024  4612.350098  4.644  4.910  1785.300049   
2007-01-08      2  607.500000  4624.180176  4.658  4.910  1787.140015   
2007-01-09      2  613.099976  4632.660156  4.655  4.945  1795.630005   

                  ^RUT   ^TNX   ^VIX        ^W5000  ...  Core Inflation  \
Date                                                ...                   
2007-01-03  787.419983  4.664  12.04  14246.709961  ...           208.6   
2007-01-04  789.950012  4.618  11.51  14269.900391  ...           208.6   
2007-01-05  775.869995  4.646  12.14  14164.799805  ...           208.6   
2007-01-08  776.989990  4.660  12.00  14197.150391  ...           208.6   
2007-01-09  778.330017  4.656  11.91  

### Splitting the Data

Given that the data is time series, we'll split it into training, validation, and test sets chronologically:

70% of the data for training
15% for validation
15% for testing

In [6]:
# Calculate the splitting indices
train_size = int(0.7 * len(df_interpolated))
val_size = int(0.15 * len(df_interpolated))

# Split the data
train = df_interpolated.iloc[:train_size]
val = df_interpolated.iloc[train_size:train_size + val_size]
test = df_interpolated.iloc[train_size + val_size:]

train.shape, val.shape, test.shape


((2925, 26), (626, 26), (628, 26))

### Standardizing the Data

We'll standardize using the mean and standard deviation of the training set only, and then apply those same statistics to the validation and test sets. This ensures that the validation and test sets do not leak any information into the training process.

In [8]:
# Separate the target and features
X_train = train.drop("Class", axis=1)
y_train = train["Class"]

X_val = val.drop("Class", axis=1)
y_val = val["Class"]

X_test = test.drop("Class", axis=1)
y_test = test["Class"]

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:5, :] 


array([[-2.10853346e+00, -8.10127021e-01,  2.71329137e+00,
         3.32633398e+00, -9.54600519e-01, -5.65129307e-01,
         2.18202966e+00, -7.96966302e-01, -5.63428044e-01,
         3.11776120e-01,  1.08543507e+00, -4.66958184e-01,
        -3.96099807e-01, -1.30615745e+00, -2.08589594e+00,
        -1.67748531e+00,  3.04246495e+00,  1.32262896e+00,
        -7.33670221e-01, -1.02192918e+00,  1.67994485e+00,
        -2.21598399e+00, -5.00880904e-02, -2.06406999e+00,
        -2.57692879e+00],
       [-2.11995076e+00, -7.99732719e-01,  2.66094339e+00,
         3.31441758e+00, -9.33383778e-01, -5.56960314e-01,
         2.12900359e+00, -8.52644824e-01, -5.59164790e-01,
         1.43370542e-01,  2.77605842e-01, -8.37904844e-01,
        -3.96099807e-01, -1.30615745e+00, -2.08589594e+00,
        -1.67748531e+00,  3.04246495e+00,  1.32262896e+00,
        -8.47017078e-01, -1.02192918e+00,  1.67994485e+00,
        -2.21598399e+00, -4.23350318e-02, -2.04863472e+00,
        -2.61175014e+00],
    

### Building Gradient Boosted Trees Model

Next, we'll build a Gradient Boosted Trees (GBT) model using the training data and validate its performance using the validation set.

In [16]:
# Initialize the GradientBoostingClassifier
gbt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=4, random_state=42)

# Train the model
gbt.fit(X_train_scaled, y_train)

# Predict on validation set
y_val_pred = gbt.predict(X_val_scaled)

# Evaluate the model on validation set
accuracy = accuracy_score(y_val, y_val_pred)
classification_rep = classification_report(y_val, y_val_pred)

accuracy, classification_rep

(0.39456869009584666,
 '              precision    recall  f1-score   support\n\n           1       0.10      0.57      0.16        56\n           2       0.81      0.40      0.54       514\n           3       0.21      0.12      0.16        56\n\n    accuracy                           0.39       626\n   macro avg       0.37      0.37      0.29       626\nweighted avg       0.69      0.39      0.47       626\n')

The GBT model achieved an accuracy of approximately 39.5% on the validation set. The classification report breaks down as follows:

Class 1:
Precision: 10%
Recall: 57%
F1-score: 16%

Class 2:
Precision: 81%
Recall: 40%
F1-score: 54%

Class 3:
Precision: 21%
Recall: 12%
F1-score: 16%


The accuracy is relatively low, suggesting that the model may not be capturing the underlying patterns of the data well or that the classes are challenging to distinguish based on the provided features.

With an imbalanced dataset, as appears to be the case here (class 2 accounts for 514 of the 626 validation samples), accuracy alone can be misleading: a model can reach high accuracy simply by predicting the majority class, which isn't helpful.
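To make this concrete, the sketch below scores a hypothetical classifier that always predicts class 2, using the validation-set class counts from the report above (56, 514, 56). The label arrays are constructed for illustration only and are not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Validation-set support per class, taken from the classification report above.
support = {1: 56, 2: 514, 3: 56}

# Illustrative ground truth with those class counts (order does not matter here).
y_true = np.concatenate([np.full(n, label) for label, n in support.items()])

# Hypothetical "classifier" that always predicts the majority class.
y_majority = np.full_like(y_true, 2)

acc = accuracy_score(y_true, y_majority)
bal_acc = balanced_accuracy_score(y_true, y_majority)
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)

print(f"accuracy:          {acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
print(f"macro F1:          {macro_f1:.3f}")
```

Accuracy comes out at about 0.82 even though classes 1 and 3 are never identified, while balanced accuracy (the mean per-class recall) and macro F1 stay near one third. Metrics that average over classes are therefore more informative here than raw accuracy.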

To improve the model's performance, we will oversample the minority classes: increase their sample counts either by duplicating existing samples or by generating synthetic ones with SMOTE (Synthetic Minority Over-sampling Technique).


### Addressing Imbalance with SMOTE

First, we use SMOTE to oversample the minority classes in the training data. The purpose of this step is to balance the number of instances across classes, which can potentially help improve classifier performance on imbalanced datasets.

The cell below produces X_train_resampled and y_train_resampled, which contain the augmented training data where all classes have an equal number of instances.

In [19]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
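
For intuition, the core idea behind SMOTE is to create each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step on toy data with NumPy. The function `synthesize` and the toy array are hypothetical, and imblearn's actual implementation differs in detail.

```python
import numpy as np

def synthesize(X_minority, n_new, k=2, seed=0):
    """Toy SMOTE-style oversampling: interpolate towards a random near neighbour."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest neighbours of every minority point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)  # base point per new sample
    pick = rng.integers(0, k, size=n_new)                # which neighbour to move towards
    gap = rng.random((n_new, 1))                         # position along the segment
    x0 = X_minority[base]
    x1 = X_minority[neighbours[base, pick]]
    return x0 + gap * (x1 - x0)

# Four toy minority samples at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = synthesize(X_min, n_new=6)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, this kind of oversampling adds no examples outside the region the minority class already occupies in the training set.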


### Hyperparameter Tuning with GridSearchCV

After handling the class imbalance, we want to find the best hyperparameters for our classifier. GridSearchCV performs a systematic exploration of multiple combinations of hyperparameters, and it identifies the combination that gives the best performance (based on a chosen scoring metric, in this case, f1_macro).

In [20]:

param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.001, 0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5, 6]
}

gbt_grid = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=5, scoring='f1_macro', verbose=3)
gbt_grid.fit(X_train_resampled, y_train_resampled)

# Getting the best parameters and estimator
best_params = gbt_grid.best_params_
best_gbt = gbt_grid.best_estimator_


Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.412 total time=  15.3s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.441 total time=  14.6s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.475 total time=  15.4s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.376 total time=  16.8s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.419 total time=  15.8s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.419 total time=  32.0s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.459 total time=  30.5s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.475 total time=  34.4s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.397 total time=  30.9s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=200;,

[CV 1/5] END learning_rate=0.01, max_depth=4, n_estimators=500;, score=0.541 total time= 2.0min
[CV 2/5] END learning_rate=0.01, max_depth=4, n_estimators=500;, score=0.656 total time= 2.0min
[CV 3/5] END learning_rate=0.01, max_depth=4, n_estimators=500;, score=0.593 total time= 2.1min
[CV 4/5] END learning_rate=0.01, max_depth=4, n_estimators=500;, score=0.613 total time= 2.1min
[CV 5/5] END learning_rate=0.01, max_depth=4, n_estimators=500;, score=0.531 total time= 2.0min
[CV 1/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.529 total time=  29.7s
[CV 2/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.591 total time=  30.1s
[CV 3/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.598 total time=  28.9s
[CV 4/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.577 total time=  29.2s
[CV 5/5] END learning_rate=0.01, max_depth=5, n_estimators=100;, score=0.514 total time=  28.4s
[CV 1/5] END learning_rate=0.01, max_dep

[CV 2/5] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=0.619 total time= 1.1min
[CV 3/5] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=0.618 total time= 1.0min
[CV 4/5] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=0.646 total time= 1.0min
[CV 5/5] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=0.536 total time= 1.0min
[CV 1/5] END learning_rate=0.05, max_depth=6, n_estimators=500;, score=0.554 total time= 2.6min
[CV 2/5] END learning_rate=0.05, max_depth=6, n_estimators=500;, score=0.624 total time= 2.6min
[CV 3/5] END learning_rate=0.05, max_depth=6, n_estimators=500;, score=0.624 total time= 2.5min
[CV 4/5] END learning_rate=0.05, max_depth=6, n_estimators=500;, score=0.666 total time= 2.6min
[CV 5/5] END learning_rate=0.05, max_depth=6, n_estimators=500;, score=0.539 total time= 3.3min
[CV 1/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.538 total time=  27.9s
[CV 2/5] END learning_rate=0.1, max_depth
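
The cell above stores the selected parameters in best_params but does not display them. As a minimal sketch of how a fitted search might be inspected, the example below runs a deliberately tiny grid on toy data. The dataset, the grid values and the variable names are assumptions made for illustration and do not reproduce the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy 3-class data standing in for the resampled training set (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# A deliberately tiny grid so the example runs in seconds.
toy_grid = {"n_estimators": [10, 30], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    toy_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_toy, y_toy)

# Winning combination and its mean cross-validated macro F1.
print(search.best_params_)
print(round(search.best_score_, 3))

# Full table of candidates, best first.
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_estimators", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Comparing `best_score_` with the macro F1 obtained on a held-out set is a quick way to check whether the cross-validated score carries over to data the search never saw.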

### Evaluating the Best Model on the Validation Set

Now that we have the best model (best_gbt), we can evaluate its performance on the validation set to see how well it generalizes:

In [24]:
y_val_pred = best_gbt.predict(X_val_scaled)
accuracy = accuracy_score(y_val, y_val_pred)
classification_rep = classification_report(y_val, y_val_pred, zero_division=0)

print(accuracy)
print(classification_rep)


0.8210862619808307
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        56
           2       0.82      1.00      0.90       514
           3       0.00      0.00      0.00        56

    accuracy                           0.82       626
   macro avg       0.27      0.33      0.30       626
weighted avg       0.67      0.82      0.74       626



### Predict on the Test Set

We'll use the best model from the grid search to predict the outcomes on the test set.

In [22]:
y_test_pred = best_gbt.predict(X_test_scaled)


### Evaluate

We'll measure the accuracy and generate a classification report to assess precision, recall, f1-score, and support for each class.

In [25]:
# Calculate accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

# Generate and print the classification report
test_classification_rep = classification_report(y_test, y_test_pred, zero_division=0)
print("\nClassification Report for Test Set:\n", test_classification_rep)


Test Accuracy: 0.7261146496815286

Classification Report for Test Set:
               precision    recall  f1-score   support

           1       0.00      0.00      0.00        88
           2       0.73      1.00      0.84       456
           3       0.00      0.00      0.00        84

    accuracy                           0.73       628
   macro avg       0.24      0.33      0.28       628
weighted avg       0.53      0.73      0.61       628



### Result Analysis

Initial Model Performance:

Accuracy: ~39.45%
The model does reasonably well on class 2 in terms of precision (0.81) but recovers only 40% of its samples, and it struggles with classes 1 and 3 (F1-score of 0.16 each). This pattern is consistent with a class imbalance problem.


Model Performance after Hyperparameter Tuning on Validation Set:

Accuracy: ~82.11%

There's a significant improvement in accuracy, but a closer look reveals that the model is now predicting almost everything as class 2. The accuracy of 82.11% is exactly the share of class 2 in the validation set (514 of 626 samples). The model fails to recognize classes 1 and 3 entirely, which is evident from the precision, recall, and F1-score of 0 for those classes.
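
One way to confirm this kind of collapse is to count the predicted labels and inspect the confusion matrix. The sketch below uses tiny made-up arrays for illustration; in the notebook, y_val and y_val_pred would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny illustrative arrays; in the notebook, y_val and y_val_pred would be used.
y_true = np.array([1, 2, 2, 2, 3, 2, 2, 1, 2, 3])
y_pred = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# How often each label is predicted: a single entry signals a collapsed model.
labels, counts = np.unique(y_pred, return_counts=True)
pred_counts = dict(zip(labels.tolist(), counts.tolist()))
print(pred_counts)

# Rows are true classes, columns are predicted classes (order: 1, 2, 3).
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)
```

When a model has collapsed onto one class, all non-zero entries of the confusion matrix fall in that class's column, which is the pattern the classification report implies here.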


Model Performance on Test Set:

Accuracy: ~72.61%

Similar to the validation results, the model primarily predicts class 2, ignoring classes 1 and 3.

Despite the high accuracy, the model is not effectively distinguishing between all the classes. Relying on accuracy alone can be misleading, especially with imbalanced datasets.

The resampling with SMOTE appears to have had unintended consequences. It was meant to improve the model's ability to recognize the underrepresented classes, yet the tuned model relies on the majority class (class 2) even more than the initial model did: on the validation set, recall fell from 0.57 to 0.00 for class 1 and from 0.12 to 0.00 for class 3.

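Hyperparameter tuning optimized the cross-validated macro F1 on the resampled training data (fold scores in the log above reach roughly 0.67), but that did not carry over to unseen data: on the validation and test sets the tuned model still fails to recognize classes 1 and 3.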