**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [19]:
# imports for the project

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

### 1. Load the data

We can load this data directly from the Hugging Face Hub, via [Hugging Face Datasets](https://huggingface.co/docs/datasets/), into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [5]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)
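Since re-running the cell above downloads the data again, one option is to cache the DataFrame on disk after the first download. The following is only a sketch under stated assumptions: `load_cached`, the `loader` callable and the pickle cache file are hypothetical helpers introduced here and are not part of the assignment or the guide notebook.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(cache_path: Path, loader: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached DataFrame if it exists; otherwise call loader() and cache the result."""
    if cache_path.exists():
        # Cache hit: read from disk, no download needed
        return pd.read_pickle(cache_path)
    df = loader()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Hypothetical usage with the download call from the cell above:
# train = load_cached(
#     Path("cache/ag_news_train.pkl"),
#     lambda: pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"]),
# )
```

With this pattern the download only happens when the cache file is missing; deleting the file forces a fresh download.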


In [6]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """Preprocess the dataset.

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
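The stratified sampling step in `preprocess` draws the same fraction from every category, which is why the class balance of the full dataset carries over to the sample. A minimal sketch on a made-up toy DataFrame (the `toy` data is an assumption for illustration, and `groupby(...).sample(...)` is used here as a shorter alternative to the `apply` call above, not as the notebook's own method):

```python
import pandas as pd

# Toy data, assumed for illustration: 10 documents per category
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World"] * 10 + ["Sports"] * 10 + ["Business"] * 10 + ["Sci/Tech"] * 10,
})

# Sample 50% of the rows within each label group
sample = (
    toy.groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

counts = sample["label"].value_counts()
print(counts)
```

Each category contributes the same number of rows, so the sample stays balanced as long as the input is balanced.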

### 1.2 Split the data
We split the sampled training data into a training set (80%) and a validation set (20%).

In [7]:
X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(960,) (240,) (960,) (240,)
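The split above is purely random, so the class counts in the two parts can drift slightly apart. If exact class balance in the validation set matters, `train_test_split` also accepts a `stratify` argument. A small sketch on made-up toy data (the `texts` and `labels` lists are assumptions for illustration; the notebook itself does not stratify):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration: 10 documents per category
texts = [f"doc {i}" for i in range(40)]
labels = ["World"] * 10 + ["Sports"] * 10 + ["Business"] * 10 + ["Sci/Tech"] * 10

# stratify=labels keeps the class proportions identical in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

val_counts = Counter(y_va)
print(val_counts)
```

With 20% of 40 balanced documents, every category ends up with the same number of validation examples.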


### 2. Create BOW model

In [22]:
cv = TfidfVectorizer()
X_train_vectorized = cv.fit_transform(X_train)

In [23]:
lr_clf = LogisticRegression() 

lr_clf.fit(X_train_vectorized, y_train)

In [24]:
X_val_vectorized = cv.transform(X_val)

y_pred = lr_clf.predict(X_val_vectorized)

In [25]:

# The labels are strings, so classification_report lists the classes in sorted
# (alphabetical) order. target_names only renames rows by position, so passing
# label_map.values() here would attach the wrong class name to each row.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))


Performance on the training set:
              precision    recall  f1-score   support

    Business       0.97      0.97      0.97       238
    Sci/Tech       0.98      0.98      0.98       240
      Sports       0.99      1.00      1.00       240
       World       0.99      0.98      0.99       242

    accuracy                           0.98       960
   macro avg       0.98      0.98      0.98       960
weighted avg       0.98      0.98      0.98       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.87      0.73      0.79        62
    Sci/Tech       0.79      0.73      0.76        60
      Sports       0.78      0.90      0.84        60
       World       0.79      0.86      0.83        58

    accuracy                           0.80       240
   macro avg       0.81      0.81      0.80       240
weighted avg       0.81      0.80      0.80       240
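
One detail worth checking when reading per-class rows: with string labels, `classification_report` lists the classes in sorted order, and `target_names` only renames the rows by position. The sketch below uses made-up toy labels (not the notebook's data) to show how a `target_names` list in a different order attaches the wrong name to a row.

```python
from sklearn.metrics import classification_report

# Toy labels, assumed for illustration: three Sports items and one World item,
# all predicted correctly.
y_true = ["Sports", "Sports", "Sports", "World"]
y_pred = list(y_true)

# Without target_names, the rows are keyed by the actual (sorted) labels.
correct = classification_report(y_true, y_pred, output_dict=True)

# target_names is matched by position to the sorted labels ["Sports", "World"],
# so the name "World" ends up on the Sports row.
swapped = classification_report(
    y_true, y_pred, target_names=["World", "Sports"], output_dict=True
)

print(correct["Sports"]["support"], swapped["World"]["support"])
```

Dropping `target_names`, or passing `labels=` in the same order as `target_names`, keeps names and rows aligned.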



### Tuning

In [45]:
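pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(C=0.1, max_iter=4000))
])

# Note: every combination with penalty='elasticnet' fails in this grid
# (liblinear does not support it, and saga needs an l1_ratio as well),
# which is why half of the fits below are reported as failed.
param_grid = {
    'tfidf__max_features': [None, 5000, 10000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'tfidf__norm': ['l1', 'l2'],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__penalty': ['l2', 'elasticnet'],
    'clf__solver': ['liblinear','saga']
}

# 288 parameter combinations x 5 folds = 1440 fits
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)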

720 fits failed out of a total of 1440.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
360 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/pipeline.py", line 662, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/opt/ana
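
The 720 failed fits match exactly the combinations that use `penalty='elasticnet'`: `liblinear` does not support that penalty, and `saga` only accepts it together with an `l1_ratio`. One way to avoid the failures is to pass a list of sub-grids so that only valid solver/penalty pairs are tried. This is a sketch, not the grid that was actually run; the `l1_ratio` value of 0.5 is an arbitrary illustrative choice, and it assumes a scikit-learn version like the one in the traceback.

```python
from sklearn.model_selection import ParameterGrid

# Vectorizer settings shared by every sub-grid (same values as the grid above)
tfidf_grid = {
    'tfidf__max_features': [None, 5000, 10000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'tfidf__norm': ['l1', 'l2'],
}
C_values = [0.1, 1.0, 10.0]

param_grid = [
    # liblinear is only paired with the l2 penalty here
    {**tfidf_grid, 'clf__C': C_values, 'clf__solver': ['liblinear'], 'clf__penalty': ['l2']},
    # saga with the l2 penalty
    {**tfidf_grid, 'clf__C': C_values, 'clf__solver': ['saga'], 'clf__penalty': ['l2']},
    # saga with elasticnet needs an l1_ratio; 0.5 is an illustrative assumption
    {**tfidf_grid, 'clf__C': C_values, 'clf__solver': ['saga'],
     'clf__penalty': ['elasticnet'], 'clf__l1_ratio': [0.5]},
]

# Count the candidate settings without fitting anything
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)
```

`GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')` accepts such a list of dictionaries in place of a single dictionary, and `grid_search.best_params_` then reports the best valid combination.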

### Optimized BOW and classifier
We hoped tuning would help more, but the result is nearly the same as with the default hyperparameters: validation accuracy goes from 0.80 to 0.81, while training accuracy rises from 0.98 to 1.00.

In [70]:
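# We use TfidfVectorizer instead of CountVectorizer to improve the preprocessing.
# Note that the baseline above already uses TfidfVectorizer, so the only changes here are
# max_features=10000 for the vectorizer and C=10.0 with the saga solver for the classifier.
# That may be why the results barely change.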
cv = TfidfVectorizer(max_features=10000)
X_train_vectorized = cv.fit_transform(X_train)

In [71]:
lr_clf = LogisticRegression(C=10.0, max_iter=4000, solver='saga') 

lr_clf.fit(X_train_vectorized, y_train)

In [72]:
X_val_vectorized = cv.transform(X_val)

y_pred = lr_clf.predict(X_val_vectorized)

In [73]:

# The labels are strings, so classification_report lists the classes in sorted
# (alphabetical) order. target_names only renames rows by position, so passing
# label_map.values() here would attach the wrong class name to each row.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.82      0.73      0.77        62
    Sci/Tech       0.77      0.68      0.73        60
      Sports       0.82      0.93      0.88        60
       World       0.81      0.90      0.85        58

    accuracy                           0.81       240
   macro avg       0.81      0.81      0.81       240
weighted avg       0.81      0.81      0.80       240
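
To make the comparison between the two models concrete, the gap between training and validation accuracy can be computed from the reported numbers. This is a small arithmetic sketch; the accuracies are copied from the reports above, while the `reports` dictionary and the `generalization_gap` helper are names introduced here for illustration.

```python
# Accuracies copied from the classification reports above
reports = {
    "baseline": {"train": 0.98, "val": 0.80},
    "tuned": {"train": 1.00, "val": 0.81},
}


def generalization_gap(scores):
    """Training accuracy minus validation accuracy, rounded to two decimals."""
    return round(scores["train"] - scores["val"], 2)


gaps = {name: generalization_gap(scores) for name, scores in reports.items()}
for name, gap in gaps.items():
    print(f"{name}: train-val gap = {gap:.2f}")
```

The gap grows slightly for the tuned model even though its validation accuracy is marginally higher, which is consistent with a model that fits the training data more closely without generalizing much better.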



In [None]:
# All in all, we successfully implemented a BoW model to classify news articles.
# We trained a Logistic Regression classifier on the BoW representation
# and experimented with different hyperparameters.
# We used TfidfVectorizer instead of CountVectorizer to improve the preprocessing,
# but this did not improve the result in the end.
# TfidfVectorizer should suit this kind of data since it down-weights words that are
# common across documents and gives more weight to distinctive words.
# We faced some issues in the tuning: half of the grid search fits failed.
# The "optimized" model ended up more overfitted rather than actually improved:
# with C=10.0 (weaker regularization than the default C=1.0), training accuracy
# reached 1.00 while validation accuracy only moved from 0.80 to 0.81.
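
The point about TF-IDF giving more weight to distinctive words can be illustrated with a hand-rolled calculation. This is a minimal pure-Python sketch on three made-up sentences, using the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`; it skips the final normalization step that `TfidfVectorizer` applies, so the numbers are not identical to the library output.

```python
import math
from collections import Counter

# Toy corpus, assumed for illustration
docs = [
    "the team won the match",
    "the market fell on the news",
    "the new chip beats the old chip",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(tokens):
    """Raw term count times IDF for every word in one document (no normalization)."""
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}


weights = tfidf(tokenized[2])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:>6}: {weight:.3f}")
```

In the third sentence, "the" and "chip" both occur twice, but "chip" receives the larger weight because it appears in only one document, whereas "the" appears in all of them.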