**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook.

<br>

***

In [3]:
# imports for the project

import pandas as pd

### 1. Load the data

We can load this data directly from the Hugging Face Hub into a pandas DataFrame (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)). Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter of `preprocess` to keep a larger sample of the data.

In [6]:
# Define file paths for the training and test datasets.
splits = {
    'train': 'data/train-00000-of-00001.parquet',
    'test': 'data/test-00000-of-00001.parquet'
}

# Load the AG News dataset using pandas.read_parquet.
# This downloads the data from the Hugging Face Hub
# (reading hf:// paths requires the huggingface_hub package).
train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print("Original shapes:", train.shape, test.shape)

Original shapes: (120000, 2) (7600, 2)


In [7]:
# Define a mapping from numerical labels to their string categories.
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """
    Preprocess the dataset by:
    - Mapping numeric labels to category names.
    - Filtering to only include rows with valid labels.
    - Sampling a fraction of the data for quick experimentation (stratified by label).
    """
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

# Apply preprocessing to both training and test sets.
train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# Free memory by deleting the original dataframes.
del train
del test

print("Preprocessed shapes:", train_df.shape, test_df.shape)

Preprocessed shapes: (1200, 2) (760, 2)
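
The stratified sampling inside `preprocess` can also be written with `DataFrameGroupBy.sample`, which draws the same fraction from every label group. The sketch below uses a small synthetic frame (the `toy` data is made up for illustration, it is not AG News) to show that each class keeps an equal share after sampling:

```python
import pandas as pd

# Synthetic stand-in for the real data: 4 classes with 100 rows each.
toy = pd.DataFrame({
    "text": [f"document {i}" for i in range(400)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 100,
})

# Draw 10% of the rows from each label group (stratified by label).
sampled = (
    toy.groupby("label")
    .sample(frac=0.1, random_state=42)
    .reset_index(drop=True)
)

# Count how many rows each class kept.
class_counts = sampled["label"].value_counts().to_dict()
print(class_counts)
```

Because the fraction is applied per group, a balanced dataset stays balanced after sampling, which is why the 1,200-row training sample contains 300 rows per class.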


In [8]:
from sklearn.model_selection import train_test_split

# Split the training data into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["label"],
    test_size=0.2, random_state=42
)

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)


Training set shape: (960,)
Validation set shape: (240,)
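
The split above is purely random, so the class counts in the validation set can drift slightly away from an even 60 per class. If identical class proportions in both splits are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up data (the `texts` and `labels` lists are placeholders, not the notebook's variables):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up balanced data: 300 texts per class, mirroring the 1,200-row sample.
labels = ["World", "Sports", "Business", "Sci/Tech"] * 300
texts = [f"text {i}" for i in range(len(labels))]

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels,
    test_size=0.2, random_state=42,
    stratify=labels,  # keep the class proportions identical in both splits
)

# Each class should now appear equally often in the validation split.
val_counts = Counter(y_va)
print(val_counts)
```

In the notebook this would amount to passing `stratify=train_df["label"]` to the existing call.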


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer to transform text data into a Bag-of-Words representation.
cv = CountVectorizer()

# Fit the vectorizer on the training text and transform it.
X_train_vectorized = cv.fit_transform(X_train)
print("BoW matrix shape (training):", X_train_vectorized.shape)

# Transform the validation and test text using the fitted vectorizer.
X_val_vectorized = cv.transform(X_val)
X_test_vectorized = cv.transform(test_df["text"])


BoW matrix shape (training): (960, 7634)
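
To make the shape concrete: each row of the matrix is a document and each column is a vocabulary token, with the cells holding counts. The sketch below rebuilds that idea by hand on two made-up sentences. The regular expression imitates the default tokenization of `CountVectorizer` (lowercasing, tokens of two or more word characters); it is an illustration only, not the library's implementation.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())


docs = [
    "Stocks rally as markets open",
    "Markets fall; stocks slide as a result",
]

# Vocabulary: every distinct token seen in the "training" documents, sorted.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})


def vectorize(text: str) -> list[int]:
    """Count vocabulary tokens in the text; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


# Document-term matrix: one row per document, one column per vocabulary token.
matrix = [vectorize(doc) for doc in docs]
print(vocab)
print(matrix)
```

The same logic explains why the validation and test texts are only transformed, never fitted: words that did not occur in the training data have no column and are dropped.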


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train a Logistic Regression classifier on the BoW representation.
# Raise max_iter above the default of 100 to give the solver room to converge.
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train_vectorized, y_train)

# Predict on training and validation sets.
y_train_pred = lr_clf.predict(X_train_vectorized)
y_val_pred = lr_clf.predict(X_val_vectorized)

# Evaluate classifier performance.
# The labels are strings, so classification_report lists them in sorted
# (alphabetical) order. target_names is matched to the rows by position,
# so passing label_map.values() would attach the wrong class name to each row.
print("Performance on the Training Set:")
print(classification_report(y_train, y_train_pred))

print("Performance on the Validation Set:")
print(classification_report(y_val, y_val_pred))

# Evaluate performance on the test set.
y_test_pred = lr_clf.predict(X_test_vectorized)
print("Performance on the Test Set:")
print(classification_report(test_df["label"], y_test_pred))


Performance on the Training Set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the Validation Set:
              precision    recall  f1-score   support

    Business       0.76      0.68      0.72        62
    Sci/Tech       0.69      0.60      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.78      0.90      0.83        58

    accuracy                           0.76       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.76      0.75       240

Performance on the Test Set:
              precision    recall
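
For reading these reports, it may help to see how the per-class numbers come about. The following sketch computes precision, recall, and F1 by hand for made-up labels. It walks the classes in sorted order, which is also the row order `classification_report` uses when the labels are strings and no `labels` argument is given.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} with labels in sorted order."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up labels and predictions, for illustration only.
y_true = ["World", "World", "Sports", "Sports", "Business", "Business"]
y_pred = ["World", "Business", "Sports", "Sports", "Business", "World"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:>10}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

Precision answers "of the documents predicted as this class, how many were right?", while recall answers "of the documents that truly belong to this class, how many were found?".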

In [11]:
# Experiment with a different hyperparameter: regularization strength (C).
# A lower C value means stronger regularization.
lr_clf_exp = LogisticRegression(max_iter=1000, C=0.5)
lr_clf_exp.fit(X_train_vectorized, y_train)

# Predict on the validation set with the new model.
y_val_pred_exp = lr_clf_exp.predict(X_val_vectorized)

print("Validation Performance with Logistic Regression (C=0.5):")
# No target_names here: the string labels are reported in sorted order, and
# target_names would be matched to them by position, not by name.
print(classification_report(y_val, y_val_pred_exp))


Validation Performance with Logistic Regression (C=0.5):
              precision    recall  f1-score   support

    Business       0.76      0.68      0.72        62
    Sci/Tech       0.69      0.60      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.78      0.90      0.83        58

    accuracy                           0.76       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.76      0.75       240



# Observations and Analysis

- Dataset Size and Preprocessing:
  - Original Data: The initial training set had 120,000 rows and the test set 7,600 rows.
  - After Preprocessing: Sampling reduced the datasets to 1,200 rows for training and 760 rows for testing. Further splitting the training set yielded 960 samples for training and 240 for validation.

- Feature Representation:
  - The BoW vectorization produced a document-term matrix of shape (960, 7634), meaning that 7,634 unique tokens were extracted from the training data.

- Training Performance:
  - The classifier achieved perfect scores (1.00 precision, recall, and F1-score) on the training set. With 7,634 features and only 960 training documents, the model has enough capacity to memorize the training data, so these scores say little about generalization. The gap to the validation scores points to overfitting.

- Validation and Test Performance:
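  - On the validation set, overall accuracy dropped to 76%, with performance varying by category. `classification_report` lists string labels in alphabetical order, so the four rows of the report correspond, in order, to:
    - Business: F1-score of 0.72
    - Sci/Tech: F1-score of 0.64
    - Sports: F1-score of 0.83
    - World: F1-score of 0.83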
  - The test set results are consistent with these findings, achieving an overall accuracy of 78%.

- Hyperparameter Experimentation:
  - Adjusting the regularization strength (Logistic Regression with C=0.5 instead of the default C=1.0) yielded the same validation scores as the baseline model, at least to two decimal places. Since only one value close to the default was tried, this shows that a small change in C makes no measurable difference on this sample. It does not rule out an effect from much stronger or much weaker regularization.
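
Since C=0.5 sits close to the default of C=1.0, a wider sweep on a logarithmic grid would give a clearer picture of how much regularization matters. A minimal sketch, using a tiny made-up corpus in place of the AG News split (the grid values and the toy texts are assumptions chosen for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tiny made-up corpus standing in for the notebook's training/validation data.
train_texts = [
    "team wins the match", "striker scores a goal", "coach praises the team",
    "stocks rise on earnings", "bank cuts interest rates", "shares fall on profit warning",
]
train_labels = ["Sports", "Sports", "Sports", "Business", "Business", "Business"]
val_texts = ["team scores late goal", "bank shares rise"]
val_labels = ["Sports", "Business"]

# Fit the vocabulary on the training texts only, then reuse it for validation.
vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(train_texts)
X_va = vectorizer.transform(val_texts)

# Sweep C over several orders of magnitude; smaller C means stronger regularization.
results = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(max_iter=1000, C=C)
    clf.fit(X_tr, train_labels)
    results[C] = accuracy_score(val_labels, clf.predict(X_va))

for C, acc in results.items():
    print(f"C={C:<7} validation accuracy={acc:.2f}")
```

In the notebook itself, the same loop would run over `X_train_vectorized` and `X_val_vectorized`, and the validation scores per value of C could be compared with `classification_report` as before.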

Considering the results of both iterations, the model fits the training data perfectly (all scores 1.00), while the drop in performance on the validation and test sets suggests that further work is needed to improve generalization, for example additional hyperparameter tuning, alternative classifiers, or enhanced feature engineering.
