**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [88]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [89]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [90]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.02)
test_df = preprocess(test, frac=0.2)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((2400, 2), (1520, 2))
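To illustrate what "stratified by label" means in `preprocess`, here is a minimal sketch on a hypothetical, deliberately imbalanced toy DataFrame (the column values and the names `toy` and `sampled` are made up for illustration). It uses `DataFrameGroupBy.sample`, a more compact alternative to the `groupby(...).apply(...)` form above, assuming pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical toy data: 80 rows of class "a" and 20 rows of class "b"
toy = pd.DataFrame({
    "text": [f"document {i}" for i in range(100)],
    "label": ["a"] * 80 + ["b"] * 20,
})

# Sample 10% of the rows *within each label group*,
# so the class proportions of the original data are preserved
sampled = (
    toy
    .groupby("label")
    .sample(frac=0.1, random_state=42)
    .reset_index(drop=True)
)

counts = sampled["label"].value_counts().to_dict()
print(counts)
```

Each class contributes the same fraction of its rows, so the 80/20 ratio of the toy data carries over to the sample.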

In [91]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(1920,) (480,) (1920,) (480,)
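The split above is random but not stratified, so the class counts in the training and validation parts can drift slightly apart. A possible variation, shown here on a hypothetical toy DataFrame rather than on `train_df`, passes the labels to the `stratify` argument of `train_test_split` to keep the class proportions identical in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 classes with 10 documents each
toy_df = pd.DataFrame({
    "text": [f"document {i}" for i in range(40)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 10,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_df["text"],
    toy_df["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy_df["label"],  # keep the class proportions in both splits
)

val_counts = y_va.value_counts().to_dict()
print(val_counts)
```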


In [92]:
# Enhanced CountVectorizer with better parameters
cv = CountVectorizer(
    min_df=3,           # Ignore terms that appear in fewer than 3 documents
    max_df=0.9,         # Ignore terms that appear in more than 90% of documents
    ngram_range=(1, 2)  # Include both unigrams and bigrams
)
X_train_vectorized = cv.fit_transform(X_train)

In [93]:
X_train_vectorized.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [94]:
# Define the parameter grid
param_grid = {
    'C': [0.05, 0.1, 0.5, 1.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'class_weight': [None, 'balanced']
}

# Create a grid search object
grid_search = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=1000),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the grid search to the data
grid_search.fit(X_train_vectorized, y_train)

# Get the best parameters
print("Best parameters:", grid_search.best_params_)

# Use the best model
best_lr_clf = grid_search.best_estimator_

Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'saga'}
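The grid above tunes only the classifier, on features from a vectorizer that was fitted once on all of `X_train`. One possible alternative, sketched here on a hypothetical toy corpus, wraps the vectorizer and the classifier in a `Pipeline`, so that the vectorizer settings are searched as well and the vocabulary is refit inside each cross-validation fold. The toy texts, the step names (`vec`, `clf`) and the grid values are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 distinct sentences repeated 5 times (30 documents)
toy_texts = [
    "oil prices rise",
    "stocks fall sharply",
    "markets rally on earnings",
    "team wins final",
    "coach praises players",
    "striker scores twice",
] * 5
toy_labels = (["Business"] * 3 + ["Sports"] * 3) * 5

# Vectorizer and classifier in one estimator, so each CV fold
# builds its vocabulary from that fold's training part only
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
toy_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__min_df": [1, 2],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, toy_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(len(search.cv_results_["params"]), "candidates evaluated")
```

With this setup, the effect of bigrams or `min_df` on cross-validated accuracy can be read from `search.cv_results_` instead of being inferred from a single hand-picked configuration.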


lr_clf = LogisticRegression() # Model with default hyperparameters (not used below; the evaluation uses best_lr_clf)

lr_clf.fit(X_train_vectorized, y_train)

In [95]:
X_val_vectorized = cv.transform(X_val) 

y_pred = best_lr_clf.predict(X_val_vectorized)

In [96]:

# The labels are strings, so classification_report orders its rows alphabetically
# (Business, Sci/Tech, Sports, World). Passing target_names=label_map.values()
# would rename those rows in label_map order and attach the wrong class name to each row.
print("Performance on the training set:")
print(classification_report(y_train, best_lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))



Performance on the training set:
              precision    recall  f1-score   support

    Business       0.99      0.99      0.99       474
    Sci/Tech       0.99      0.99      0.99       478
      Sports       0.99      1.00      0.99       485
       World       1.00      0.99      0.99       483

    accuracy                           0.99      1920
   macro avg       0.99      0.99      0.99      1920
weighted avg       0.99      0.99      0.99      1920

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.78      0.71      0.75       126
    Sci/Tech       0.76      0.75      0.75       122
      Sports       0.78      0.92      0.84       115
       World       0.84      0.79      0.81       117

    accuracy                           0.79       480
   macro avg       0.79      0.79      0.79       480
weighted avg       0.79      0.79      0.79       480
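When the labels are strings, `classification_report` orders its rows by the sorted label values, and `target_names` only renames those rows position by position. The sketch below, on hypothetical toy predictions, shows how a `target_names` list in a different order attaches the wrong class name to each row. The variable names and the toy labels are made up for illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical toy predictions: 1 Business, 1 Sci/Tech, 2 Sports, 2 World
y_true_demo = ["World", "Sports", "Business", "Sci/Tech", "Sports", "World"]
y_pred_demo = ["World", "Sports", "Business", "Sci/Tech", "World", "World"]

summary_rows = ("accuracy", "macro avg", "weighted avg")

# Without target_names, the rows are named after the sorted label values
report = classification_report(
    y_true_demo, y_pred_demo, output_dict=True, zero_division=0
)
row_order = [name for name in report if name not in summary_rows]
print(row_order)

# With target_names in a different order, the first row (really Business)
# is displayed as "World", so its support no longer matches the true count
misnamed = classification_report(
    y_true_demo,
    y_pred_demo,
    target_names=["World", "Sports", "Business", "Sci/Tech"],
    output_dict=True,
    zero_division=0,
)
print("true World count:", y_true_demo.count("World"))
print("support shown for 'World':", misnamed["World"]["support"])
```

Leaving out `target_names`, or passing `sorted(label_map.values())`, keeps the row names aligned with the classes.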



In [97]:
test_df_vectorized = cv.transform(test_df["text"])

# No target_names here: the rows are already named after the (alphabetically sorted) string labels
print("Performance on the test set:")
print(classification_report(test_df["label"], best_lr_clf.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.78      0.80      0.79       380
    Sci/Tech       0.81      0.79      0.80       380
      Sports       0.87      0.91      0.89       380
       World       0.86      0.81      0.83       380

    accuracy                           0.83      1520
   macro avg       0.83      0.83      0.83      1520
weighted avg       0.83      0.83      0.83      1520



## Reflections on Results and Hyperparameter Choices


My implementation of the Bag-of-Words model with tuned hyperparameters achieved 79% accuracy on the validation set and 83% on the test set, an improvement of 3 and 5 percentage points over the logistic regression model in the guide notebook (76% validation, 78% test).

### Key Hyperparameter Optimizations

1. **CountVectorizer Enhancements:**
   - Setting min_df=3 removed rare terms that could introduce noise
   - Using max_df=0.9 filtered out extremely common words that provide little discriminative value
   - Including bigrams (ngram_range=(1,2)) captured important word combinations and context

2. **Logistic Regression Tuning:**
   - Grid search identified optimal hyperparameters: C=0.1, penalty='l2', solver='saga', class_weight=None
   - The lower `C` value (0.1, versus the default of 1.0) means stronger regularization, which should limit overfitting compared to the default setting
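In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` shrinks the coefficients more. A minimal sketch of this effect, on synthetic data from `make_classification` rather than on the AG News features, compares the L2 norm of the fitted coefficient vector for several values of `C`. The dataset size and the list of `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (a stand-in, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

coef_norms = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    # Default penalty is L2; smaller C means a stronger penalty on the weights
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<5} coefficient norm = {coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, which is why a small `C` such as 0.1 constrains the model more than the default.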

### Performance Analysis

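The optimized model shows:
- A remaining gap between training and validation accuracy (99% vs. 79%), which indicates that the model still overfits even with the stronger regularization
- Fairly even performance across categories on the test set (per-class F1 between 0.79 and 0.89)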

### Conclusion

The hyperparameter tuning process suggests that feature representation (through the `CountVectorizer` settings) and regularization strength (the `C` value) were the most impactful factors for model performance. Only the logistic regression hyperparameters were grid-searched, however; the vectorizer settings were chosen by hand, so their individual contribution was not measured separately. The inclusion of bigrams and the filtering of the vocabulary likely helped the model capture relevant patterns in the text data.

This exercise highlights the importance of systematic hyperparameter optimization in NLP tasks, even with relatively simple models like Bag-of-Words with logistic regression.
