
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
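
The `alpha` parameter tuned in this lab is the additive (Laplace/Lidstone) smoothing constant. As a rough illustration, the sketch below applies the smoothing formula from the scikit-learn documentation, `(count + alpha) / (total + alpha * n_features)`, to a made-up count vector. The helper name `smoothed_probs` and the counts are hypothetical. In the lab itself the inputs are TF-IDF weights rather than integer counts.

```python
def smoothed_probs(counts, alpha):
    """Additively smoothed feature probabilities for a single class."""
    total = sum(counts)
    n_features = len(counts)
    return [(c + alpha) / (total + alpha * n_features) for c in counts]


# Hypothetical per-class feature counts; the second feature never occurs.
counts = [3, 0, 1]

for alpha in (0.001, 1.0, 10.0):
    probs = smoothed_probs(counts, alpha)
    print(alpha, [round(p, 4) for p in probs])
```

A small `alpha` keeps the estimates close to the raw frequencies, while a large `alpha` pulls every feature toward the same probability, which is why the best value is usually found by cross-validation.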

### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
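
To see what a random search would try before running any cross-validation, one option is `ParameterSampler` from `sklearn.model_selection`, which draws candidate settings from a dictionary of distributions. The sketch below is only an illustration; the parameter name `nb__alpha` assumes a pipeline step called `nb`.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Distribution over the smoothing parameter of a pipeline step named "nb"
# (the step name is an assumption for this sketch).
param_dist = {"nb__alpha": uniform(loc=0, scale=1)}

# Draw 10 candidate settings; random_state makes the draws reproducible.
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

alphas = sorted(c["nb__alpha"] for c in candidates)
print(alphas)
```

Unlike a grid, the candidates are not evenly spaced, so a small `n_iter` can leave parts of the range unexplored.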

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
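
In `scipy.stats.uniform`, `loc` is the lower bound and `scale` is the width of the interval, so the distribution covers `[loc, loc + scale]` rather than `[loc, scale]`. The bounds below are chosen arbitrarily to make this visible.

```python
from scipy.stats import uniform

# Uniform on [0.5, 2.5]: loc is the lower bound, scale is the interval width.
dist = uniform(loc=0.5, scale=2)

samples = dist.rvs(size=1000, random_state=0)
print("min:", samples.min())
print("max:", samples.max())
```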

#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over several values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?

In [3]:
# Exercise 1 & 2: Grid Search Cross-Validation
print("Performing Grid Search...")
pipeline_gs = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10]}

grid_search = GridSearchCV(pipeline_gs, param_grid, cv=5, scoring='f1_macro')
grid_search.fit(Xtrain, ytrain)

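# Note: best_score_ is the mean cross-validated f1_macro of the best candidate,
# averaged over the 5 validation folds (not a score on the full training set).
print("Best alpha (Grid Search):", grid_search.best_params_['nb__alpha'])
print("Best f1_macro score (Grid Search on training data):", grid_search.best_score_)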

# Evaluate on test set
y_pred_gs = grid_search.predict(Xtest)
report_gs = classification_report(ytest, y_pred_gs, output_dict=True)
f1_macro_gs = report_gs['macro avg']['f1-score']
print("f1_macro score on test set (Grid Search):", f1_macro_gs)

print("\n" + "="*50 + "\n")

# Exercise 3: Random Search Cross-Validation
print("Performing Random Search...")
pipeline_rs = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

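# uniform(loc=0, scale=1) samples alpha from the interval [0, 1]
param_dist = {'nb__alpha': uniform(loc=0, scale=1)}

# n_iter=10 draws 10 candidate values of alpha; random_state fixes the draws
random_search = RandomizedSearchCV(
    pipeline_rs,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='f1_macro',
    random_state=42,
)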
random_search.fit(Xtrain, ytrain)

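# Note: best_score_ is the mean cross-validated f1_macro of the best candidate,
# averaged over the 5 validation folds (not a score on the full training set).
print("Best alpha (Random Search):", random_search.best_params_['nb__alpha'])
print("Best f1_macro score (Random Search on training data):", random_search.best_score_)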

# Evaluate on test set
y_pred_rs = random_search.predict(Xtest)
report_rs = classification_report(ytest, y_pred_rs, output_dict=True)
f1_macro_rs = report_rs['macro avg']['f1-score']
print("f1_macro score on test set (Random Search):", f1_macro_rs)

# Compare results
print("\n" + "="*50 + "\n")
print("Comparison:")
print(f"Grid Search f1_macro: {f1_macro_gs:.4f}")
print(f"Random Search f1_macro: {f1_macro_rs:.4f}")

if f1_macro_rs > f1_macro_gs:
    print("Random Search obtained a better f1_macro score on the test set.")
elif f1_macro_rs < f1_macro_gs:
    print("Grid Search obtained a better f1_macro score on the test set.")
else:
    print("Both Grid Search and Random Search obtained the same f1_macro score on the test set.")

Performing Grid Search...
Best alpha (Grid Search): 0.001
Best f1_macro score (Grid Search on training data): 0.8316349455351428
f1_macro score on test set (Grid Search): 0.7442025675112527


Performing Random Search...
Best alpha (Random Search): 0.05808361216819946
Best f1_macro score (Random Search on training data): 0.8082867260714419
f1_macro score on test set (Random Search): 0.7309479579937921


Comparison:
Grid Search f1_macro: 0.7442
Random Search f1_macro: 0.7309
Grid Search obtained a better f1_macro score on the test set.
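
The grid above spans several orders of magnitude (0.001 to 10), while the random search samples `alpha` uniformly from [0, 1], so very small values are rarely drawn. One possible variation, not part of the original exercise, is to sample on a log scale with `scipy.stats.loguniform`. The sketch below uses a tiny made-up corpus so that it runs without downloading the dataset; whether this would change the scores on 20 Newsgroups is not something the results above can tell us.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus (two classes, six documents each), used only
# to keep this sketch self-contained.
docs = [
    "the rocket reached orbit after launch",
    "nasa plans a new moon mission",
    "the satellite orbit was adjusted",
    "astronauts returned from the space station",
    "the telescope observed a distant galaxy",
    "a new launch window opens for the rocket",
    "the team won the hockey game",
    "the goalie made a great save",
    "playoff hockey starts next week",
    "the coach praised the team defense",
    "the player scored two goals in the game",
    "fans cheered as the team scored",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Sample alpha log-uniformly between 1e-3 and 10 (the range of the grid).
param_dist = {"nb__alpha": loguniform(1e-3, 1e1)}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=3,  # fewer folds because the toy corpus is tiny
    scoring="f1_macro",
    random_state=42,
)
search.fit(docs, labels)

best_alpha = search.best_params_["nb__alpha"]
print("Best alpha (log-uniform random search):", best_alpha)
print("Mean cross-validated f1_macro:", search.best_score_)
```

To try this on the lab data, `docs` and `labels` would be replaced by `Xtrain` and `ytrain`, and `cv=3` by `cv=5`.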
