**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [1]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)
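
Since re-running the loading cell downloads the data again, one option is to cache each split on disk. The sketch below is only an illustration: `load_cached`, the pickle cache file, and the stand-in `fake_loader` are hypothetical names that are not part of the assignment. In the notebook, the loader would be the `pd.read_parquet(...)` call from the cell above.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(cache_path: Path, loader: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached DataFrame if it exists, otherwise call `loader` and cache the result."""
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = loader()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Stand-in loader so the sketch runs offline (hypothetical). In the notebook this
# would wrap: pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
calls = []


def fake_loader() -> pd.DataFrame:
    calls.append(1)  # count how often the "download" actually happens
    return pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "train.pkl"
    first = load_cached(path, fake_loader)   # calls the loader and writes the cache
    second = load_cached(path, fake_loader)  # served from the cache file

print("loader calls:", len(calls))
```

The second call reads the pickle instead of invoking the loader, so the expensive step runs only once per cache file.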


In [3]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda x: x['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
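
To see what the stratified sampling in `preprocess` does, here is a small sketch on made-up data. It uses `DataFrameGroupBy.sample`, which is an alternative to the `groupby(...).apply(...)` pattern above and not the code used in this notebook. The toy DataFrame is invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 10 documents per category
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World"] * 10 + ["Sports"] * 10 + ["Business"] * 10 + ["Sci/Tech"] * 10,
})

# Draw the same fraction from every label group, so the class
# proportions of the sample match those of the full dataset.
sample = (
    toy.groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

counts = sample["label"].value_counts()
print(counts)
```

Each category contributes the same share of rows, which is why the sampled training and test sets stay balanced.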

### Split the Data

In [4]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(960,) (240,) (960,) (240,)
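
The split above is purely random, so the class counts in the two parts can differ slightly. A possible variation, sketched here on made-up data, is to pass `stratify=` so that both parts keep the same label proportions. This is an optional alternative, not what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 documents per category
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 10,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],  # keep the label proportions identical in both parts
)

print(y_tr.value_counts())
print(y_va.value_counts())
```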


### Build the Model

In [5]:
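# Create the TF-IDF vectorizer
tv = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training data only, then transform it
X_train_vectorized = tv.fit_transform(X_train)
# Show the dense matrix (for inspection only; the model is trained on the sparse matrix)
X_train_vectorized.todense()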

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

### Create Classifier

In [6]:
# Insert parameters (via GridSearch) and reflect on them
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10], # Inverse regularization strength: lower values = stronger regularization to prevent overfitting
    'penalty': ['l1', 'l2'], # Test both L1 (sparse features) and L2 (smooth weights) regularization
    'solver': ['liblinear'], # 'liblinear' supports both 'l1' and 'l2' and works well for small to medium-sized datasets
    'class_weight': [None, 'balanced'] # Try both: 'balanced' adjusts for class imbalance; None leaves classes unweighted
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000), # Raise the iteration cap to help convergence, especially with large C values (weak regularization)
    param_grid,
    cv=5, # 5-fold cross-validation for more stable and reliable performance estimates
    scoring='f1_macro', # Use macro-averaged F1 to treat all classes equally, regardless of support
    n_jobs=-1 # Use all available CPU cores for faster grid search
)

grid.fit(X_train_vectorized, y_train)

print("Best parameters:", grid.best_params_)
print("Best score:", grid.best_score_)

Best parameters: {'C': 10, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Best score: 0.823820572389945
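
One caveat with searching on an already vectorized matrix is that the vocabulary and IDF weights were learned on all training documents, including those that serve as the held-out fold in each cross-validation round. A possible alternative, sketched below on a made-up toy corpus, is to wrap the vectorizer and classifier in a `Pipeline` so the vectorizer is refit inside every fold. The step names `tfidf` and `clf` and the reduced grid are choices made for this illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 12 short documents per category
topics = {
    "World": ["election", "government", "treaty", "minister"],
    "Sports": ["match", "goal", "league", "coach"],
    "Business": ["stocks", "profit", "merger", "market"],
    "Sci/Tech": ["software", "chip", "satellite", "browser"],
}
texts, labels = [], []
for label, words in topics.items():
    for i in range(12):
        texts.append(f"{words[i % 4]} {words[(i + 1) % 4]} report")
        labels.append(label)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
grid.fit(texts, labels)  # raw text in; vectorizer is fit per fold

print(grid.best_params_)
print(grid.predict(["stocks profit report"]))
```

A side benefit is that the fitted search object accepts raw text directly, so no separate `transform` step is needed at prediction time.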


### Get predictions

In [7]:
best_model = grid.best_estimator_

X_val_vectorized = tv.transform(X_val)  # transform, not fit_transform (for the validation data)
y_pred = best_model.predict(X_val_vectorized)

### Evaluate Trained Model

In [8]:
print("Performance on the training set:")
# The labels are strings, so classification_report lists them in sorted order.
# Passing target_names in label_map order would attach the wrong name to each row,
# so we let the report use the actual label values.
print(classification_report(y_train, best_model.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.82      0.73      0.77        62
    Sci/Tech       0.77      0.67      0.71        60
      Sports       0.81      0.93      0.87        60
       World       0.80      0.88      0.84        58

    accuracy                           0.80       240
   macro avg       0.80      0.80      0.80       240
weighted avg       0.80      0.80      0.80       240



In [9]:
test_df_vectorized = tv.transform(test_df["text"])

print("Performance on the test set:")
# No target_names: the string labels are reported in sorted order under their own names
print(classification_report(test_df["label"], best_model.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.74      0.79      0.77       190
    Sci/Tech       0.83      0.76      0.79       190
      Sports       0.88      0.91      0.89       190
       World       0.86      0.84      0.85       190

    accuracy                           0.82       760
   macro avg       0.83      0.83      0.82       760
weighted avg       0.83      0.82      0.82       760
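
When reading these reports, it helps to know how `classification_report` orders its rows: with string labels it sorts the label values, and `target_names` is applied to that sorted order position by position. The sketch below uses made-up predictions to show the default order and how the `labels` argument can be used to request a specific row order.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions (hypothetical)
y_true = ["World", "Sports", "Business", "Sci/Tech", "World", "Sports"]
y_pred = ["World", "Sports", "Business", "Sci/Tech", "Sports", "Sports"]

# Default: rows follow the sorted label values
default_report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
default_order = list(default_report)[:4]

# Explicit: `labels` fixes the row order, and each row keeps its own name
wanted = ["World", "Sports", "Business", "Sci/Tech"]
ordered_report = classification_report(
    y_true, y_pred, labels=wanted, output_dict=True, zero_division=0
)
explicit_order = list(ordered_report)[:4]

print(default_order)
print(explicit_order)
```

If a custom row order is wanted, passing `labels` is safer than passing only `target_names`, because the names then cannot drift away from the rows they describe.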



### Reflection on Hyperparameters and Performance

The final model demonstrates solid performance on both the validation and test sets. After tuning the hyperparameters with `GridSearchCV` and switching from `CountVectorizer` to `TfidfVectorizer`, the results improved.

#### Performance summary
- **Training accuracy:** 100%
- **Validation accuracy:** 80%
- **Test accuracy:** 82%
- **Best macro F1 (cross-validation):** 0.82

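On the test set (`classification_report` lists string labels in alphabetical order, so the rows read Business, Sci/Tech, Sports, World from top to bottom):
- **Sports** and **World** perform best (F1: 0.89 and 0.85 respectively).
- **Business** and **Sci/Tech** score lower but still reasonably well (F1: 0.77 and 0.79).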

The grid search selected the following configuration for `LogisticRegression`:
- `C=10`: weak regularization (`C` is the inverse of the regularization strength), which allows the model to learn more nuanced weights. This is the largest value in the grid, so the optimum may lie beyond the range we tested.
- `penalty='l2'`: standard L2 regularization helped maintain generalization without over-sparsifying.
- `solver='liblinear'`: suitable for small datasets and compatible with L1/L2 penalties. It was the only solver in the grid, so it was fixed rather than tuned.
- `class_weight=None`: the model performed best without adjusting class weights, which is expected because the stratified sampling leaves the classes almost perfectly balanced.
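
To judge which hyperparameters matter most, one option is to look at all grid results rather than only the best one. The sketch below does this on a made-up toy corpus with a reduced grid. The averaging per parameter value is a rough heuristic chosen for illustration, not a formal importance measure.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy corpus (hypothetical): 12 short documents per category
topics = {
    "World": ["election", "government", "treaty", "minister"],
    "Sports": ["match", "goal", "league", "coach"],
    "Business": ["stocks", "profit", "merger", "market"],
    "Sci/Tech": ["software", "chip", "satellite", "browser"],
}
texts, labels = [], []
for label, words in topics.items():
    for i in range(12):
        texts.append(f"{words[i % 4]} {words[(i + 1) % 4]} report")
        labels.append(label)

X = TfidfVectorizer().fit_transform(texts)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    {"C": [0.01, 1.0, 10.0], "penalty": ["l1", "l2"]},
    cv=3,
    scoring="f1_macro",
)
grid.fit(X, labels)

# One row per parameter combination
results = pd.DataFrame(grid.cv_results_)

# Mean CV score per value of each hyperparameter: a large spread across the
# values of one parameter suggests that this parameter has a strong effect.
by_C = results.groupby("param_C")["mean_test_score"].mean()
by_penalty = results.groupby("param_penalty")["mean_test_score"].mean()

print(by_C)
print(by_penalty)
```

Applied to the real `grid` object from this notebook, the same two `groupby` calls would show how much the score moves with `C`, `penalty`, and `class_weight`.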


#### Final thoughts
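The model achieved perfect scores on the training set: 100% precision, recall, and F1-score across all categories. While this might initially seem good, it is typically a **strong indicator of overfitting**. With only 960 training documents, a high-dimensional sparse vocabulary, and weak regularization (`C=10`), a linear model can separate the training data perfectly, and the gap to the validation (80%) and test (82%) accuracy confirms that it fits the training data more closely than unseen data.
Even so, the **validation and test performance remained solid** and close to the cross-validated macro F1 (0.82), which shows that the model **generalizes reasonably well despite overfitting the training data**. Using `TfidfVectorizer` together with `GridSearchCV` improved the performance compared to the initial setup.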
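
A way to make the overfitting visible is to compare training and validation scores across several values of `C`. The sketch below uses `validation_curve` on a made-up toy corpus. The range of `C` values is an assumption for illustration, and the toy data is too easy to show a realistic gap.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 12 short documents per category
topics = {
    "World": ["election", "government", "treaty", "minister"],
    "Sports": ["match", "goal", "league", "coach"],
    "Business": ["stocks", "profit", "merger", "market"],
    "Sci/Tech": ["software", "chip", "satellite", "browser"],
}
texts, labels = [], []
for label, words in topics.items():
    for i in range(12):
        texts.append(f"{words[i % 4]} {words[(i + 1) % 4]} report")
        labels.append(label)
texts, labels = np.array(texts), np.array(labels)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="liblinear", max_iter=1000),
)

C_values = [0.01, 0.1, 1.0, 10.0]
train_scores, val_scores = validation_curve(
    pipe,
    texts,
    labels,
    param_name="logisticregression__C",
    param_range=C_values,
    cv=3,
    scoring="f1_macro",
)

# Average over the folds; a large positive gap points to overfitting
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for C, tr, va, g in zip(C_values, train_scores.mean(axis=1), val_scores.mean(axis=1), gap):
    print(f"C={C:<5} train={tr:.2f} val={va:.2f} gap={g:.2f}")
```

On the real data, a gap that grows as `C` increases would suggest that stronger regularization trades some training accuracy for better generalization.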

