**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [1]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### 1. Load the data

We can load this data directly from the [Hugging Face Hub](https://huggingface.co/docs/datasets/) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.
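To avoid the repeated download mentioned in the note, one option is to cache each split on disk after the first fetch. The following is a minimal sketch and not part of the assignment guide: the helper name `load_cached`, the `data_cache` directory, and the use of pickle files are all assumptions made for illustration (pickle keeps the sketch free of extra dependencies; a parquet cache would work the same way).

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(
    name: str,
    fetch: Callable[[], pd.DataFrame],
    cache_dir: str = "data_cache",  # hypothetical local directory
) -> pd.DataFrame:
    """Return the cached DataFrame for `name`, calling `fetch` only on a cache miss."""
    cache_path = Path(cache_dir) / f"{name}.pkl"
    if cache_path.exists():
        # Cache hit: read the local copy instead of downloading again
        return pd.read_pickle(cache_path)

    # Cache miss: fetch once, then store the result for later runs
    df = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df
```

In this notebook, `fetch` could be a lambda wrapping the `pd.read_parquet("hf://...")` call, for example `load_cached("train", lambda: pd.read_parquet(url))` with `url` built from `splits["train"]`.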

In [4]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)



(120000, 2) (7600, 2)


In [None]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
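To see what the stratified sampling step in `preprocess` does, here is a small sketch on a synthetic DataFrame. It uses `DataFrameGroupBy.sample`, a shorter way to draw the same fraction from every label than `groupby(...).apply(...)`. The toy data and the helper name `stratified_sample` are illustrative assumptions, not part of the assignment code.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows from every label group."""
    return (
        df.groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


# Synthetic, perfectly balanced data: 100 rows for each of the four labels
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(400)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 100,
})

sample = stratified_sample(toy, frac=0.1)
counts = sample["label"].value_counts()
print(counts)  # every label keeps 10 of its 100 rows
```

Because the fraction is applied per label, a balanced input stays balanced after sampling, which is why 1% of the training data yields the same number of articles in each category.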

### Splitting the data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(960,) (240,) (960,) (240,)
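The split above is purely random, so the class counts in the held-out part can drift slightly from the balanced input (the supports in the classification reports range from 58 to 62). If exactly proportional classes are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic data, where `texts` and `labels` are placeholders rather than the AG News articles:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 50 items for each of the four categories
labels = ["World", "Sports", "Business", "Sci/Tech"] * 50
texts = [f"article {i}" for i in range(len(labels))]

X_tr, X_val, y_tr, y_val = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # keep the label proportions identical in both parts
)

val_counts = Counter(y_val)
print(val_counts)  # 10 items per category in the held-out part
```

In the notebook this would amount to passing `stratify=train_df["label"]` to the existing call.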


### Fitting the models and fine-tuning hyperparameters
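As a reminder of what `CountVectorizer` produces, the sketch below builds a bag-of-words matrix by hand for two toy sentences. It is a deliberate simplification: the whitespace tokenizer and the helper name `bag_of_words` are assumptions for illustration, and scikit-learn's own tokenization, stop-word removal, and sparse output are not reproduced.

```python
from collections import Counter


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct token, in alphabetical order
    vocab = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary word; words absent from the document count as 0
        rows.append([counts[token] for token in vocab])
    return vocab, rows


vocab, rows = bag_of_words([
    "stocks fall as oil prices rise",
    "oil prices fall again",
])
print(vocab)
for row in rows:
    print(row)
```

Word order is discarded: each document is reduced to how often each vocabulary word occurs, and the classifiers below only ever see those counts.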

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV


# Initialize the CountVectorizer (Bag of Words model)
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Logistic Regression Model
model_lr = LogisticRegression(max_iter=200)
model_lr.fit(X_train_vec, y_train)
y_pred_lr = model_lr.predict(X_test_vec)

# Naive Bayes Model
model_nb = MultinomialNB()
model_nb.fit(X_train_vec, y_train)
y_pred_nb = model_nb.predict(X_test_vec)

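# Hyperparameter Tuning for Logistic Regression
# The grid is split in two because 'lbfgs' does not support the 'l1' penalty;
# combining them makes those fits fail during the search.
param_grid_lr = [
    {'C': [0.1, 1, 10], 'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
    {'C': [0.1, 1, 10], 'solver': ['lbfgs'], 'penalty': ['l2']},
]
grid_search_lr = GridSearchCV(LogisticRegression(max_iter=200), param_grid_lr, cv=3, verbose=2)
grid_search_lr.fit(X_train_vec, y_train)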

# Hyperparameter Tuning for Naive Bayes
# 'force_alpha' is left out of the grid: it only matters for alpha values
# below 1e-10, so it cannot change any result for the alphas searched here.
param_grid_nb = {
    'alpha': [0.01, 0.1, 1, 2, 3, 10],
    'fit_prior': [True, False],
}
grid_search_nb = GridSearchCV(MultinomialNB(), param_grid_nb, cv=3, verbose=2)
grid_search_nb.fit(X_train_vec, y_train)


# Print the best estimators for each model
print("Best Logistic Regression Estimator:", grid_search_lr.best_estimator_)
print("Best Naive Bayes Estimator:", grid_search_nb.best_estimator_)

# Evaluate models before GridSearch
print("Classification Report for Logistic Regression (Before GridSearch):")
print(classification_report(y_test, y_pred_lr))

print("Classification Report for Naive Bayes (Before GridSearch):")
print(classification_report(y_test, y_pred_nb))

# Evaluate models after GridSearch
best_model_lr = grid_search_lr.best_estimator_
best_model_nb = grid_search_nb.best_estimator_

# Predict using the best models
y_pred_lr_best = best_model_lr.predict(X_test_vec)
y_pred_nb_best = best_model_nb.predict(X_test_vec)

print("Classification Report for Logistic Regression (After GridSearch):")
print(classification_report(y_test, y_pred_lr_best))

print("Classification Report for Naive Bayes (After GridSearch):")
print(classification_report(y_test, y_pred_nb_best))
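One caveat with the cell above is that the vectorizer is fitted on all of `X_train` before cross-validation, so every validation fold has already contributed to the vocabulary, and the vectorizer's own settings are never tuned. A possible alternative, sketched here under assumptions, wraps both steps in a `Pipeline` so that `GridSearchCV` refits the vectorizer inside each fold. The function name `build_search`, the grid values, and the `f1_macro` scoring choice are illustrative and not taken from the guide.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_search(cv: int = 3) -> GridSearchCV:
    """Grid search over vectorizer and classifier settings in one pipeline."""
    pipeline = Pipeline([
        ("bow", CountVectorizer(stop_words="english")),
        ("clf", MultinomialNB()),
    ])
    # Keys use the "<step name>__<parameter>" convention of scikit-learn pipelines
    param_grid = {
        "bow__ngram_range": [(1, 1), (1, 2)],  # unigrams only, or unigrams plus bigrams
        "bow__min_df": [1, 2],                 # drop words seen in fewer than min_df documents
        "clf__alpha": [0.1, 1.0, 10.0],        # additive smoothing strength
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
```

The search is fitted on raw text, for example `search = build_search().fit(X_train, y_train)`, after which `search.predict(X_test)` can be passed to `classification_report` as before.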

### Reflections
The cell above produces the following output:
Best Logistic Regression Estimator: LogisticRegression(C=0.1, max_iter=200)
Best Naive Bayes Estimator: MultinomialNB(alpha=1, fit_prior=False)

Classification Report for Logistic Regression (Before GridSearch):
              precision    recall  f1-score   support

    Business       0.78      0.68      0.72        62
    Sci/Tech       0.77      0.67      0.71        60
      Sports       0.80      0.88      0.84        60
       World       0.74      0.86      0.79        58

    accuracy                           0.77       240
   macro avg       0.77      0.77      0.77       240
weighted avg       0.77      0.77      0.77       240


Classification Report for Logistic Regression (After GridSearch):
              precision    recall  f1-score   support

    Business       0.78      0.65      0.71        62
    Sci/Tech       0.72      0.65      0.68        60
      Sports       0.76      0.87      0.81        60
       World       0.76      0.88      0.82        58

    accuracy                           0.76       240
   macro avg       0.76      0.76      0.76       240
weighted avg       0.76      0.76      0.75       240

Classification Report for Naive Bayes (Before GridSearch):
              precision    recall  f1-score   support

    Business       0.79      0.84      0.81        62
    Sci/Tech       0.88      0.63      0.74        60
      Sports       0.87      0.88      0.88        60
       World       0.74      0.90      0.81        58

    accuracy                           0.81       240
   macro avg       0.82      0.81      0.81       240
weighted avg       0.82      0.81      0.81       240

Classification Report for Naive Bayes (After GridSearch):
              precision    recall  f1-score   support

    Business       0.79      0.84      0.81        62
    Sci/Tech       0.88      0.63      0.74        60
      Sports       0.87      0.88      0.88        60
       World       0.74      0.90      0.81        58

    accuracy                           0.81       240
   macro avg       0.82      0.81      0.81       240
weighted avg       0.82      0.81      0.81       240

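On the 240-article held-out split, the Naive Bayes classifier outperforms Logistic Regression in both accuracy and macro-averaged f1-score (0.81 vs. 0.77 before hyperparameter tuning and 0.81 vs. 0.76 after). Tuning did not help Logistic Regression: the grid search selected `C=0.1`, which is stronger regularization than the default `C=1`, and held-out accuracy moved from 0.77 to 0.76, a difference of only a few articles at this sample size. This suggests that the default parameters were already close to optimal for this dataset. For Naive Bayes, the selected parameters (`alpha=1`, `fit_prior=False`) left every reported metric unchanged. That is plausible, since `alpha=1` is the default value and the classes are almost perfectly balanced, which makes the learned class priors nearly uniform. Of the hyperparameters searched, only the regularization strength `C` had a visible effect on the held-out metrics, and that effect was small.

Overall, the Bag-of-Words model captures enough information to achieve reasonable classification performance, but more sophisticated approaches such as word embeddings or an LLM could potentially yield better results, which will be explored in the next tasks.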
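How little `fit_prior` can matter here can be checked with simple arithmetic: with nearly balanced classes, the priors learned from the data are almost identical to the uniform prior used when `fit_prior=False`. The sketch below compares the two. The training counts are derived, not measured: they assume each class contributes 300 sampled articles (1,200 rows over four balanced classes) minus the supports shown in the classification reports.

```python
import math

# Assumption: 300 sampled articles per class before the train/validation split
sampled_per_class = 300
# Supports taken from the classification reports above
validation_support = {"Business": 62, "Sci/Tech": 60, "Sports": 60, "World": 58}

# Whatever did not land in the held-out split was used for training
train_counts = {label: sampled_per_class - n for label, n in validation_support.items()}
total = sum(train_counts.values())  # 960, matching the training split size

uniform_log_prior = math.log(1 / len(train_counts))
gaps = {
    label: math.log(count / total) - uniform_log_prior
    for label, count in train_counts.items()
}
max_gap = max(abs(gap) for gap in gaps.values())

for label, gap in gaps.items():
    print(f"{label:<10} log-prior difference vs. uniform: {gap:+.4f}")
```

Under these assumptions the largest difference stays below 0.01 in log-probability, which is tiny next to the word-likelihood terms that Naive Bayes sums over every token in an article.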
