### A few interesting things that came out of the fitting exercise...

#### Why do the XGB models with and without z-scaling give identical results?

* Tree-based models like XGBoost are invariant to strictly monotonic transformations of individual features, including z-scoring.
* StandardScaler just subtracts the mean and divides by the standard deviation. That preserves the ordering of the values, so it doesn’t change splitting behavior: splits are based on thresholds (e.g., "is feature > 6.5?"), and every threshold on the raw feature has an equivalent threshold on the scaled one that partitions the rows the same way.
* Scaling matters when the model is sensitive to feature scale (like logistic regression, SVM, or neural networks), or when tree-based methods are combined with regularization strategies that interact with scale (rare in XGBoost). Otherwise you’ll see identical or nearly identical performance with or without scaling.
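
The argument in the bullets above can be checked with a small, dependency-free sketch. It fits a single decision stump (one Gini-minimizing threshold, standing in for one tree node; this is not XGBoost's actual split finder) on a made-up feature before and after z-scoring. The helper names `gini` and `best_split` and the toy data are assumptions for illustration only.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return (threshold, left_mask) for the split with the lowest weighted Gini."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent distinct values
        mask = [x <= t for x in xs]
        left = [y for y, m in zip(ys, mask) if m]
        right = [y for y, m in zip(ys, mask) if not m]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t, mask)
    return best[1], best[2]

# Toy feature and labels (made up for illustration).
xs = [2.0, 3.5, 5.0, 6.0, 7.0, 8.5, 9.0, 12.0]
ys = [0, 0, 1, 0, 1, 1, 1, 1]

# Per-feature z-score: subtract the mean, divide by the population std.
mu, sigma = mean(xs), pstdev(xs)
zs = [(x - mu) / sigma for x in xs]

t_raw, mask_raw = best_split(xs, ys)
t_z, mask_z = best_split(zs, ys)

print("raw threshold:", t_raw)
print("scaled threshold:", t_z)
print("same partition of rows:", mask_raw == mask_z)
```

The threshold value changes with the scale, but the rows sent left and right do not, which is why the fitted trees and their predictions end up the same.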

#### What is the best possible performance the model could hope to achieve, given the noise added?

```python
import pandas as pd
import numpy as np

def load_data(path='../output/emotion_data.csv'):
    return pd.read_csv(path)

def calculate_stochastic_bayes_accuracy(df, n_trials=10000, seed=42):
    """Estimate how often a label sampled from a row's class
    probabilities matches that row's recorded label."""
    np.random.seed(seed)
    prob_cols = ['happy_prob', 'energetic_prob', 'engaged_prob']
    class_labels = ['Happy', 'Energetic', 'Engaged']  # same order as prob_cols

    probs = df[prob_cols].values
    true_labels = df['predicted_emotion'].values

    correct = 0
    for _ in range(n_trials):
        i = np.random.randint(len(df))  # pick a random row
        p = probs[i]  # that row's class probabilities
        sampled_label = np.random.choice(class_labels, p=p)  # draw a label from p
        if sampled_label == true_labels[i]:
            correct += 1

    return correct / n_trials

df = load_data()
print(calculate_stochastic_bayes_accuracy(df))
```

```
0.6818
```


```
Classification report for xgb_nozscore:

              precision    recall  f1-score   support

           0       0.67      0.72      0.69       660
           1       0.63      0.62      0.62       624
           2       0.62      0.58      0.60       716

    accuracy                           0.64      2000
   macro avg       0.64      0.64      0.64      2000
weighted avg       0.64      0.64      0.64      2000
```

The accuracy of 0.64 in the report above is the number to compare directly against the simulated Bayes-optimal accuracy of 0.68: the model falls about four percentage points short of that ceiling.
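
To get a rough sense of how much of that gap could be sampling noise, here is a back-of-the-envelope sketch. It treats both numbers as binomial proportions and assumes the two estimates are independent, which is a simplification. The helper `binomial_se` is a hypothetical name, and the model accuracy is only known to two decimals from the report.

```python
from math import sqrt

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return sqrt(p * (1 - p) / n)

sim_acc, n_trials = 0.6818, 10_000   # simulated accuracy and its trial count
model_acc, n_test = 0.64, 2_000      # reported accuracy (rounded) and test support

se_sim = binomial_se(sim_acc, n_trials)
se_model = binomial_se(model_acc, n_test)
gap = sim_acc - model_acc
se_gap = sqrt(se_sim**2 + se_model**2)  # assumes the two estimates are independent

print(f"simulation SE: {se_sim:.4f}, model SE: {se_model:.4f}")
print(f"gap: {gap:.4f} (about {gap / se_gap:.1f} standard errors)")
```

Under these assumptions the gap is a few standard errors wide, so it is unlikely to be explained by sampling noise alone.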


Understanding how the Bayes-optimal simulation works:

---

Let’s take a sample row:

| happy_prob | energetic_prob | engaged_prob | predicted_emotion |
| --- | --- | --- | --- |
| 0.45 | 0.40 | 0.15 | Happy |

We run this trial:

1. Pick this row: index `i = 5`.
2. Extract `p = [0.45, 0.40, 0.15]`.
3. Run `sampled_label = np.random.choice(class_labels, p=p)`:
   * With 45% chance, it gives `'Happy'`.
   * With 40% chance, it gives `'Energetic'`.
   * With 15% chance, it gives `'Engaged'`.
4. Compare `sampled_label` to `true_labels[5]` (which is `'Happy'`).
5. If they're the same, count it as correct.

You repeat this process 10,000 times across randomly sampled rows from the dataset.
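
The trial described above can be replayed for this one row as a minimal sketch. It uses the standard library's `random.choices` in place of `np.random.choice`, so it runs without NumPy, and it keeps the row fixed instead of sampling rows, to isolate what a single row contributes.

```python
import random

class_labels = ['Happy', 'Energetic', 'Engaged']
p = [0.45, 0.40, 0.15]   # class probabilities from the sample row
true_label = 'Happy'     # the row's predicted_emotion

rng = random.Random(42)  # local generator so the run is reproducible
n_trials = 10_000
correct = 0
for _ in range(n_trials):
    # Draw one label according to the row's probabilities.
    sampled_label = rng.choices(class_labels, weights=p, k=1)[0]
    if sampled_label == true_label:
        correct += 1

match_rate = correct / n_trials
print(match_rate)  # should land near 0.45, the probability of 'Happy'
```

Each row therefore contributes, on average, the probability it assigns to its own recorded label.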

---

If 'Happy' were overwhelmingly likely for every row (like p = [0.95, 0.03, 0.02]), the sampled label would be correct most of the time, and so would your model.

If the probabilities were more ambiguous (e.g., [0.36, 0.34, 0.30]), then even the best model would often guess wrong.

The final accuracy (here about 68%) reflects the level of noise or uncertainty in your data-generating process.
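
The simulation also has a closed-form expectation, sketched below. Each trial matches with the probability that the drawn row assigns to its recorded label, so the simulated accuracy converges to the mean of that probability over rows. The three rows are hypothetical, taken from the examples in the text, and the helper names are assumptions. Whenever the recorded label is the highest-probability class (true for these rows, but an assumption about the real dataset), this equals the mean of the largest probability per row.

```python
class_labels = ['Happy', 'Energetic', 'Engaged']

# Hypothetical rows: (class probabilities, recorded label).
rows = [
    ([0.95, 0.03, 0.02], 'Happy'),  # confident row
    ([0.45, 0.40, 0.15], 'Happy'),  # the sample row from the walkthrough
    ([0.36, 0.34, 0.30], 'Happy'),  # ambiguous row
]

def expected_match_rate(rows):
    """Mean probability that each row assigns to its own recorded label."""
    per_row = [p[class_labels.index(label)] for p, label in rows]
    return sum(per_row) / len(per_row)

def mean_max_probability(rows):
    """Mean of the largest class probability in each row."""
    return sum(max(p) for p, _ in rows) / len(rows)

print(round(expected_match_rate(rows), 4))
print(round(mean_max_probability(rows), 4))
```

The confident row pulls the average up and the ambiguous row pulls it down, which is the sense in which the final number summarizes the noise in the data.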