### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [3]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

param_grid = {
    'nb__alpha': [0.01, 0.1, 0.5, 1.0, 10.0]
}

# scoring is not set, so candidates are ranked by the pipeline's default score (mean accuracy).
grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(Xtrain, ytrain)

print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best cross-validation score (Grid Search):", grid_search.best_score_)


Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best parameters (Grid Search): {'nb__alpha': 0.01}
Best cross-validation score (Grid Search): 0.8366666666666666
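
The `alpha` being tuned above is the additive (Lidstone) smoothing parameter of `MultinomialNB`: each per-class feature probability is estimated as `(count + alpha) / (total + alpha * n_features)`. The sketch below applies that formula to made-up word counts to show how a larger `alpha` pulls the estimates toward uniform. The helper `smoothed_word_probs` and the toy counts are hypothetical. In the lab the features are TF-IDF weights rather than raw counts, so this only illustrates the idea.

```python
def smoothed_word_probs(counts, alpha):
    """Return (count + alpha) / (total + alpha * vocab_size) for each word."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

# Hypothetical word counts for one class; "puck" never appears in this class.
toy_counts = {"engine": 6, "car": 4, "puck": 0}

for alpha in [0.01, 1.0, 10.0]:
    probs = smoothed_word_probs(toy_counts, alpha)
    print(alpha, {w: round(p, 4) for w, p in probs.items()})
```

With a small `alpha` the unseen word keeps a probability close to zero, while a large `alpha` flattens the differences between frequent and rare words.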


### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in SciPy [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)

In [7]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])
param_distributions = {
    # uniform(loc, scale) is uniform on [loc, loc + scale], here [0.01, 10.01]
    'nb__alpha': uniform(loc=0.01, scale=10.0)
}


#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over several values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
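
For reference, `f1_macro` is the unweighted mean of the per-class F1 scores, so every newsgroup counts equally regardless of how many test documents it has. A minimal pure-Python sketch with made-up labels follows. The helper `macro_f1` and the toy labels are hypothetical and not part of the lab.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (a class scores 0.0 if its denominator is zero)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print("macro F1:", macro_f1(y_true, y_pred))
```

On real data, `sklearn.metrics.f1_score(ytest, ypred, average='macro')` should give the same number as the `'macro avg'` entry of `classification_report`.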

In [4]:
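# Evaluate the refitted best pipeline on the held-out test set.
best_grid_model = grid_search.best_estimator_
ypred_grid = best_grid_model.predict(Xtest)
# output_dict=True returns nested dicts; 'macro avg' is the unweighted mean over classes.
report_grid = classification_report(ytest, ypred_grid, output_dict=True)
f1_macro_grid = report_grid['macro avg']['f1-score']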

print("\n--- Results for Grid Search ---")
print("Best alpha (Grid Search):", grid_search.best_params_['nb__alpha'])
print("f1_macro score (Grid Search) on test set:", f1_macro_grid)

param_distributions = {
    # uniform(loc, scale) is uniform on [loc, loc + scale], here [0.01, 10.01]
    'nb__alpha': uniform(loc=0.01, scale=10.0)
}

random_search = RandomizedSearchCV(pipeline, param_distributions, n_iter=100, cv=5, verbose=1, n_jobs=-1, random_state=42)
random_search.fit(Xtrain, ytrain)

print("\nBest parameters (Random Search):", random_search.best_params_)
print("Best cross-validation score (Random Search):", random_search.best_score_)

best_random_model = random_search.best_estimator_
ypred_random = best_random_model.predict(Xtest)
report_random = classification_report(ytest, ypred_random, output_dict=True)
f1_macro_random = report_random['macro avg']['f1-score']

print("\n--- Results for Random Search ---")
print("Best alpha (Random Search):", random_search.best_params_['nb__alpha'])
print("f1_macro score (Random Search) on test set:", f1_macro_random)

print("\n--- Comparison ---")
if f1_macro_random > f1_macro_grid:
    print("Random Search yielded a better f1_macro score.")
elif f1_macro_random < f1_macro_grid:
    print("Grid Search yielded a better f1_macro score.")
else:
    print("Both Grid Search and Random Search yielded the same f1_macro score.")



--- Results for Grid Search ---
Best alpha (Grid Search): 0.01
f1_macro score (Grid Search) on test set: 0.7725871193182322
Fitting 5 folds for each of 100 candidates, totalling 500 fits

Best parameters (Random Search): {'nb__alpha': np.float64(0.06522117123602399)}
Best cross-validation score (Random Search): 0.8223333333333335

--- Results for Random Search ---
Best alpha (Random Search): 0.06522117123602399
f1_macro score (Random Search) on test set: 0.7316444156628901

--- Comparison ---
Grid Search yielded a better f1_macro score.
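
One plausible reason random search did not beat the grid here is the sampling distribution. The grid's best value was `alpha = 0.01`, but a uniform distribution on [0.01, 10.01] puts very little probability mass on small values. The sketch below computes how likely a single draw is to land at or below 0.1 under the lab's uniform distribution and under a log-uniform distribution on the same range. The log-uniform alternative is an assumption for comparison only and is not used in the lab. `uniform_cdf` and `loguniform_cdf` are hypothetical helpers.

```python
import math

# Range used in the lab: uniform(loc=0.01, scale=10.0) covers [loc, loc + scale].
LOW, HIGH = 0.01, 0.01 + 10.0

def uniform_cdf(x, low=LOW, high=HIGH):
    """P(alpha <= x) when alpha is uniform on [low, high]."""
    return min(max((x - low) / (high - low), 0.0), 1.0)

def loguniform_cdf(x, low=LOW, high=HIGH):
    """P(alpha <= x) when log(alpha) is uniform on [log(low), log(high)]."""
    if x <= low:
        return 0.0
    return min(math.log(x / low) / math.log(high / low), 1.0)

n_iter = 100  # same budget as the random search in this lab
for name, cdf in [("uniform", uniform_cdf), ("log-uniform", loguniform_cdf)]:
    p = cdf(0.1)
    print(f"{name:12s} P(alpha <= 0.1) = {p:.4f}, "
          f"expected draws out of {n_iter}: {n_iter * p:.1f}")
```

Under the uniform distribution, fewer than one in a hundred draws is expected to fall at or below 0.1, so 100 iterations barely explore the region where the grid found its best value. If a log-scale search is wanted, `scipy.stats.loguniform(a, b)` can be passed to `RandomizedSearchCV` in place of `uniform`. Whether that improves the test score on this data is not something the results above can confirm.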
