# INTRODUCTION

## TASK
Text Classification

## DATASET SOURCING 
*Ref*
1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Download the data from https://www.kaggle.com/datasets/rmisra/news-category-dataset

For our **text classification** exercise, we need a dataset with a wide variety of categories. The dataset used contains 209527 news headlines and abstracts from HuffPost, published between 2012 and 2022 and each classified into one of **42 categories**, so it can serve as a benchmark for our exercise (explained below).

The dataset consists of the following columns:
* **category**: category in which the article was published.
* **headline**: the headline of the news article.
* **authors**: list of authors who contributed to the article.
* **link**: link to the original news article.
* **short_description**: Abstract of the news article.
* **date**: publication date of the article.

## PROBLEM DEFINITION

How often have we faced the problem of not having enough data to solve a text classification task? In fact, sometimes we have enough data for some categories but not for others.

Some teams choose a complex model without considering whether simpler methods might perform better; other teams abandon the problem altogether.

In this experiment, I explore that issue in depth and try to answer the question above. To do that, I compare two approaches: the **BERT model**, which is one of the best at classifying texts, and **cosine similarity**, a simpler method that is very useful in many scenarios (indeed, it is the core component of most RAG systems).

The main idea of this work is to extract sub-samples of the dataset of increasing size, train both models on each of them, and compare the results on a common validation dataset. Is BERT the better choice regardless of the training set size? If not, why? Does cosine similarity work better in some scenarios?

## APPROACH

To address the problem I used the following tools:

* **For DATA PROCESSING**:
    - `pandas` and `numpy` libraries, used to manipulate the data.
    - `sklearn` library, used to split the data, encode the categorical target, and compute the cosine similarity.
    - `evaluate` library from Hugging Face, used for performance evaluation.

* **For BERT MODEL**:
    - `transformers` library from Hugging Face, used for tokenization and fine-tuning.
    - `datasets` library from Hugging Face, used to format the data correctly.
    - `torch` library, used to ensure the model is training on the GPU.
    - `GPU RTX 4090` (on runpod.io), used to train (fine-tune) the model.

* **For COSINE SIMILARITY**:
    - `openai` and `langchain`, used to call the API of the embedding model (text-embedding-3-small).

* **For VISUALIZATIONS**:
    - `plotly`

## STEPS

To complete the task, I followed these steps:

* **LOAD DATA**: Load the data and analyze it to understand the basic statistics. After inspecting the text lengths, I decided to join `headline` and `short_description` into a column named `news`, which became our input feature.
* **DATA PROCESSING**: Since we are using the BERT model and the OpenAI embedding model, we do not need to clean the data deeply. The BERT tokenizer performs the necessary pre-processing of the text, and the OpenAI embeddings are designed to capture the context of real-world text. Cleaning the text could remove valuable context.

I encoded the `target` and built 11 subsets of the data as follows:

- First, the `validation` set: a 30% stratified sample of the full dataset, preserving the class distribution.

- Then, the `training samples`: stratified samples of increasing size, drawn from the remaining 70%, on which the models are trained to track how their performance evolves. To build them, I loop over the proportions and, in each iteration, draw a subsample of proportion `p`, where `p` is in [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]. Finally, each `train dataset` is partitioned again: one part to fine-tune BERT and the other to test the performance and watch for overfitting.

* **`BERT` PIPELINE**:
    1. I had to implement 3 functions:

        * `fine_tune`, which returns the trained model and the tokenizer.
        * `tokenize_function`, which returns the tokenized data.
        * `performance`, which computes the performance of the model on the `validation` dataset.

    2. The loop:

        * Iterate through proportions, `p` = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1].
        * Using the 3 previous functions, for each `p`, fine-tune the model on the corresponding training dataset, evaluate its performance on the `validation` dataset, and save the result to a csv file.

* **`Cosine Similarity` PIPELINE**:
    1. I had to implement 1 function:

        * `similarity_performance` which has 2 main functionalities:
            + Computes the `embeddings` of the news in the training dataset and, with them, the similarity matrix against the validation embeddings. To prevent memory overflow, I had to iterate over the validation `embeddings` matrix in batches of rows.
            + Computes the `performance` of the `Cosine Similarity` approach on the `validation` dataset and saves it to a csv file.

    2. The loop:

        * For each `p` = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1], load the corresponding training dataset, perform the prediction, and save the performance.

* **PERFORMANCE ANALYSIS**:
    I chose the **f1-score** to measure model performance. Since the target has many imbalanced categories, the f1-score is a good choice: it balances precision and recall, and computing it per category and then averaging gives every category the same weight regardless of its frequency.

    To make the comparison fair, both models were evaluated on the same `validation` set.

    These are the main things to know about this analysis:

    1. I had to implement 3 functions:

        * `general`, function to plot performance and training set size for each `p`.
        * `top_categories`, same as above but focused on the 3 most frequent and 3 least frequent classes.
        * `f1_data`, function to build a convenient dataset for the visualization, with `Category`, `F1-Score`, `prop` and `n_rows_label`.

    2. Sections **Performance similarity process** and **Performance BERT process**:

        In both sections I show: 

        * The evolution of the performance (on the `validation` dataset) of the models as the training size increases.
        * The same as above but focused on the top and bottom 3 categories by frequency.

    3. Section **BERT VS Similarity based on training data size**:

        In this section I show:

        * Comparison of the performance trends of both models, overall and for the 'extreme' categories.
        * Analysis of how many categories improve as more training data is available.
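
As a small illustration of the metric described under PERFORMANCE ANALYSIS, the sketch below computes one F1 value per class and then takes their unweighted mean (the macro average). It uses `sklearn.metrics.f1_score` instead of the `evaluate` library used in this notebook, and the labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (made up): class 0 is frequent, class 2 is rare
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0])

# One F1 value per class, in label order
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2], zero_division=0)

# Macro F1: unweighted mean of the per-class values,
# so the rare class counts as much as the frequent one
macro_f1 = per_class_f1.mean()

print(per_class_f1, macro_f1)
```

Because the rare class is never predicted correctly here, its F1 of 0 pulls the average down as much as a frequent class would, which is the behaviour wanted when categories are imbalanced.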

--------------------------------------------------------------------------------------------------------------------

In [None]:
import os
import pandas as pd
import numpy as np
import time
import gc
from typing import Literal

# PROCESS DATA
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder
import evaluate
from sklearn.metrics.pairwise import cosine_similarity

# BERT
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import torch

# SIMILARITY
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ["OPENAI_API_KEY"]
from langchain_openai import OpenAIEmbeddings

# CHARTS
import plotly.express as px
import plotly.graph_objects as go


# LOAD DATA

**Set directories**

In [2]:
path = ""
workdata_path = "../working_data/"

**Import data**

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rmisra/news-category-dataset")

print("Path to dataset files:", path)

Path to dataset files: /home/y41000/.cache/kagglehub/datasets/rmisra/news-category-dataset/versions/3


In [5]:
data = pd.read_json(f'{path}/News_Category_Dataset_v3.json', lines = True)

In [6]:
data.shape

(209527, 6)

**General information of the features**

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   link               209527 non-null  object        
 1   headline           209527 non-null  object        
 2   category           209527 non-null  object        
 3   short_description  209527 non-null  object        
 4   authors            209527 non-null  object        
 5   date               209527 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.6+ MB


**Distributions of the classes**

In [8]:
pd.DataFrame(data['category'].value_counts()).join(
    pd.DataFrame(data['category'].value_counts(normalize=True))
)

category,count,proportion
POLITICS,35602,0.169916
WELLNESS,17945,0.085645
ENTERTAINMENT,17362,0.082863
TRAVEL,9900,0.047249
STYLE & BEAUTY,9814,0.046839
PARENTING,8791,0.041956
HEALTHY LIVING,6694,0.031948
QUEER VOICES,6347,0.030292
FOOD & DRINK,6340,0.030259
BUSINESS,5992,0.028598


The 42 categories are imbalanced.

**Distribution of the classes by author**

In [9]:
# Restrict to the 20 most frequent values of `authors`, then cross-tabulate category by author
top_authors = data['authors'].value_counts().index[:20]
data_top = data[data['authors'].isin(top_authors)]
pd.crosstab(data_top['category'], data_top['authors'])

category,"",Andy McDonald,Bill Bradley,Carly Ledbetter,Caroline Bologna,Cole Delbyck,Curtis M. Wong,Dana Oliver,David Moye,Dominique Mosbergen,Ed Mazza,Igor Bobic,Julia Brucculieri,Lee Moran,Mary Papenfuss,Michelle Manetti,Nina Golgowski,"Reuters, Reuters",Ron Dicker,Sam Levine
ARTS,19,0,0,0,0,0,7,0,0,4,0,0,0,0,0,0,0,0,1,0
ARTS & CULTURE,32,4,0,1,1,3,6,0,6,3,6,0,5,41,6,0,7,0,3,0
BLACK VOICES,1008,14,0,1,0,0,4,2,53,6,12,0,0,31,7,0,35,23,40,1
BUSINESS,991,0,0,0,1,0,0,0,2,21,21,1,0,2,25,0,13,294,13,2
COLLEGE,111,0,0,0,0,0,0,0,0,3,2,0,0,0,0,0,2,1,1,0
COMEDY,729,857,167,4,0,9,0,0,59,25,343,0,15,841,129,0,8,0,414,0
CRIME,977,0,0,0,0,0,0,0,97,31,58,1,0,92,76,0,251,45,45,0
CULTURE & ARTS,370,0,0,0,0,0,1,0,0,1,0,0,0,2,0,0,0,1,0,0
DIVORCE,1731,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,0,0
EDUCATION,9,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,0


All of these authors wrote in more than 2 categories.

**Text length distribution**

In [10]:

perc = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99]
data['news'] = data['headline'] +' -- '+ data['short_description']
pd.concat([
    pd.DataFrame(data['headline'].str.len().describe(percentiles=perc)),
    pd.DataFrame(data['short_description'].str.len().describe(percentiles=perc)),
    pd.DataFrame(data['news'].str.len().describe(percentiles=perc))
], axis=1)

,headline,short_description,news
count,209527.0,209527.0,209527.0
mean,58.415355,114.20867,176.624025
std,18.808506,80.840575,78.55297
min,0.0,0.0,4.0
10%,33.0,9.0,79.0
20%,42.0,46.0,112.0
30%,49.0,70.0,137.8
40%,55.0,94.0,158.0
50%,60.0,120.0,174.0
60%,64.0,122.0,187.0


The texts are short: 99% of them have fewer than 400 characters.

# DATA PROCESSING

**Encode the target**

We have to work with a numerical target, not strings.

In [11]:
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['category'])
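
A minimal sketch of what the encoder does, using a few category names from this dataset as toy input: `LabelEncoder` sorts the distinct classes and maps each to its position, and `inverse_transform` maps the integers back to names (this is what the later cells rely on through `label_encoder.classes_`).

```python
from sklearn.preprocessing import LabelEncoder

# Toy input: a few category names from the dataset
categories = ["POLITICS", "WELLNESS", "TRAVEL", "POLITICS"]

encoder = LabelEncoder()
labels = encoder.fit_transform(categories)

# Classes are stored in sorted order; each label is the index of its class
print(list(encoder.classes_))
print(list(labels))

# inverse_transform maps integer labels back to category names
restored = list(encoder.inverse_transform(labels))
print(restored)
```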

In [15]:
data_work = data[['news', 'label']]


**Build validation data set**

In [16]:
rest, validation = train_test_split(data_work,test_size=0.3 ,random_state=2025, stratify =data['label'])

In [17]:
validation.to_excel(f'{workdata_path}validation.xlsx', index=False)
validation.shape


(62859, 2)
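
The row count above is consistent with a 30% split: 0.3 × 209527 ≈ 62858.1, rounded up to 62859. The sketch below, on made-up toy data, checks the other property the split relies on: with `stratify`, the class proportions of the validation part match those of the full data. The names `rest_toy` and `validation_toy` are only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (made up): 70% class "a", 20% "b", 10% "c"
df = pd.DataFrame({
    "news": [f"text {i}" for i in range(1000)],
    "label": ["a"] * 700 + ["b"] * 200 + ["c"] * 100,
})

rest_toy, validation_toy = train_test_split(
    df, test_size=0.3, random_state=2025, stratify=df["label"]
)

# Class proportions in the full data and in the validation split
full_props = df["label"].value_counts(normalize=True).sort_index()
val_props = validation_toy["label"].value_counts(normalize=True).sort_index()
print(pd.concat([full_props, val_props], axis=1, keys=["full", "validation"]))
```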

In [None]:
# proportions = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

**Build training data sets with increasing sizes**

In [None]:
proportions = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
for p in proportions:
    # Draw a stratified sample of proportion p from the non-validation data
    _, v = train_test_split(rest, test_size=p, random_state=2025, stratify=rest['label'])
    # Split the sample into train/test parts (80/20).
    # In the smallest sample some classes may have fewer than two rows,
    # which a stratified split does not allow, so that one is split at random.
    if p > 0.001:
        v_tr, v_ts = train_test_split(v, test_size=0.2, random_state=2025, stratify=v['label'])
    else:
        v_tr, v_ts = train_test_split(v, test_size=0.2, random_state=2025)

    # assign() returns new frames, which avoids pandas' SettingWithCopyWarning
    v = pd.concat([v_tr.assign(part='train'), v_ts.assign(part='test')]).sort_index()
    v.to_excel(f'{workdata_path}prop_{p}.xlsx', index=False)
    del v, v_tr, v_ts

# p = 1: use all the non-validation data
rest_tr, rest_ts = train_test_split(rest, test_size=0.2, random_state=2025, stratify=rest['label'])
rest = pd.concat([rest_tr.assign(part='train'), rest_ts.assign(part='test')]).sort_index()
rest.to_excel(f'{workdata_path}prop_1.xlsx', index=False)
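
As a sanity check on the sizes these splits produce, the arithmetic can be sketched as below. It assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row and gives the remainder to the other part; under that assumption the results agree with the `Map` counters shown in the BERT training logs further down (for example 117 and 30 rows for `p` = 0.001).

```python
import math

n_total = 209527                               # rows in the full dataset
n_validation = math.ceil(0.3 * n_total)        # validation set (30%)
n_rest = n_total - n_validation                # rows left for the training samples

proportions = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
sizes = {}
for p in proportions:
    n_sample = math.ceil(p * n_rest)           # sample of proportion p
    n_test = math.ceil(0.2 * n_sample)         # 'test' part (20% of the sample)
    sizes[p] = (n_sample - n_test, n_test)     # ('train' rows, 'test' rows)
    print(p, sizes[p])
```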

# BERT PIPELINE
Executed on Runpod with an RTX 4090.

### AUXILIARY FUNCTIONS

In [33]:
torch.cuda.is_available()

True

In [None]:
def tokenize_function(df,tokenizer,padding,truncation):
    """
    Function to tokenize data and give it the structure needed
    """
    tokens = tokenizer(df["news"], padding=padding, truncation=truncation)
    tokens["labels"] = df["label"]
    return tokens

def performance(trainer,p,tokenizer):
    """
    Function to compute the performance of the model on the validation data
    """
    padding="max_length"
    truncation=True
    metric = evaluate.load("f1")

    # Tokenize validation data
    val_dataset = Dataset.from_pandas(validation)
    val_tok = val_dataset.map(lambda batch: tokenize_function(batch, tokenizer, padding, truncation), batched=True)

    # Predictions
    predictions = trainer.predict(val_tok)
    pred_labels = np.argmax(predictions.predictions, axis=1)

    # Performance
    report = metric.compute(references = predictions.label_ids, predictions = pred_labels, average=None)

    df_report = pd.DataFrame({
        "Category": label_encoder.classes_,
        "F1-Score": report["f1"]
    })    
    # report = classification_report(predictions.label_ids, pred_labels, target_names=label_encoder.classes_, output_dict=True)
    # df_report = pd.DataFrame(report).transpose()

    # Saving into a csv file
    df_report.to_csv(f"../performance/bert_{p}.csv")

### TRAINING FUNCTION

In [None]:
def fine_tune(p):
    """
    Function to fine-tune BERT.
    Input: Proportion p, used to load the corresponding training data.
    Output: Tuple consisting of the Trainer (which holds the fine-tuned model) and the tokenizer.
    """
    mode_checkpoint="bert-base-cased"
    padding="max_length"
    truncation=True
    save_checkpoints = "../checkpoints"
    data_tr = pd.read_excel(f'{workdata_path}prop_{p}.xlsx')
    # Size the classification head from the full label set, so that a small sample
    # that happens to miss a category still gets one output per category
    num_labels = len(label_encoder.classes_)

    # Define the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(mode_checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(mode_checkpoint, num_labels=num_labels)

    # The Trainer places the model on the GPU when one is available, so no manual move is needed
    # device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    # model.to(device)

    # Train and validation set
    train_df = data_tr[data_tr['part']=='train'][['news','label']]
    test_df = data_tr[data_tr['part']=='test'][['news','label']]


    # Tokenize the data
    train_dataset = Dataset.from_pandas(train_df)
    test_dataset = Dataset.from_pandas(test_df)
    train_tok = train_dataset.map(lambda batch: tokenize_function(batch, tokenizer, padding, truncation), batched=True)
    test_tok = test_dataset.map(lambda batch: tokenize_function(batch, tokenizer, padding, truncation), batched=True)

    # Set the training arguments
    training_args = TrainingArguments(
        output_dir=save_checkpoints,
        eval_strategy="epoch",
        per_device_train_batch_size=8,
        fp16=torch.cuda.is_available(),
        logging_steps=10,
        save_strategy="no",
        report_to="none",
    )

    # Instantiate the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tok,
        eval_dataset=test_tok
    )

    # Train the model
    trainer.train()
    return trainer, tokenizer
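
The `Trainer` above reports only the loss after each epoch. If the F1-score on the test part were also wanted during training, one option is a metric function passed through the `compute_metrics` argument of `Trainer`. The sketch below is not part of the original notebook; `compute_metrics` is a hypothetical helper, it uses `sklearn` for the score, and the logits are made up.

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Hypothetical metric function: macro F1 from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"f1_macro": f1_score(labels, preds, average="macro", zero_division=0)}

# Quick check with made-up logits for 3 classes
logits = np.array([
    [2.0, 0.1, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.2, 3.0],
    [1.0, 0.9, 0.1],
])
labels = np.array([0, 1, 2, 1])
metrics = compute_metrics((logits, labels))
print(metrics)
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which the per-epoch table would include the extra column alongside the validation loss.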

In [17]:
proportions.append(1)

### BERT LOOP

In [18]:
proportions

[0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

In [None]:

# Train on each training set and evaluate on the validation set

for prop in proportions:
    t_inicial = time.time()
    print(f"Processing proportion: {prop}")
    trained, tokenizer = fine_tune(prop)
    print(f"    Training finished in {round((time.time()-t_inicial)/60)} minutes")
    performance(trained,prop,tokenizer)
    print(f"    Evaluation finished in {round((time.time()-t_inicial)/60)} minutes")

    del(trained)
    gc.collect()
    torch.cuda.empty_cache()


Processing proportion: 0.001


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/117 [00:00<?, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,3.5918,3.636848
2,3.3512,3.588998
3,3.097,3.566536


    Training finished in 0 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 3 minutes
Processing proportion: 0.005


Map:   0%|          | 0/587 [00:00<?, ? examples/s]

Map:   0%|          | 0/147 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,3.2671,3.144505
2,2.7034,2.892455
3,2.3844,2.683795


    Training finished in 0 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 3 minutes
Processing proportion: 0.01


Map:   0%|          | 0/1173 [00:00<?, ? examples/s]

Map:   0%|          | 0/294 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,2.8512,2.64715
2,1.969,2.19909
3,1.7772,2.073778


    Training finished in 1 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 4 minutes
Processing proportion: 0.05


Map:   0%|          | 0/5867 [00:00<?, ? examples/s]

Map:   0%|          | 0/1467 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.917,1.850401
2,1.55,1.539533
3,0.7913,1.513788


    Training finished in 3 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 6 minutes
Processing proportion: 0.1


Map:   0%|          | 0/11733 [00:00<?, ? examples/s]

Map:   0%|          | 0/2934 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.5398,1.642799
2,1.0121,1.511623
3,0.7274,1.537976


    Training finished in 6 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 9 minutes
Processing proportion: 0.2


Map:   0%|          | 0/23467 [00:00<?, ? examples/s]

Map:   0%|          | 0/5867 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.5482,1.490093
2,1.1102,1.361755
3,0.7063,1.376636


    Training finished in 12 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 15 minutes
Processing proportion: 0.3


Map:   0%|          | 0/35200 [00:00<?, ? examples/s]

Map:   0%|          | 0/8801 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.2914,1.366652
2,1.1846,1.278015
3,0.5633,1.377542


    Training finished in 17 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 20 minutes
Processing proportion: 0.4


Map:   0%|          | 0/46934 [00:00<?, ? examples/s]

Map:   0%|          | 0/11734 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.2456,1.281389
2,1.0261,1.185213
3,0.74,1.313957


    Training finished in 23 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 26 minutes
Processing proportion: 0.5


Map:   0%|          | 0/58667 [00:00<?, ? examples/s]

Map:   0%|          | 0/14667 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.4351,1.234507
2,0.9982,1.165917
3,0.448,1.28523


    Training finished in 29 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 32 minutes
Processing proportion: 1


KeyError: 'part'

In [39]:

for prop in proportions[-1:]:
    t_inicial = time.time()
    print(f"Processing proportion: {prop}")
    trained, tokenizer = fine_tune(prop)
    print(f"    Training finished in {round((time.time()-t_inicial)/60)} minutes")
    performance(trained,prop,tokenizer)
    print(f"    Evaluation finished in {round((time.time()-t_inicial)/60)} minutes")

    del(trained)
    gc.collect()
    torch.cuda.empty_cache()

Processing proportion: 1


Map:   0%|          | 0/117334 [00:00<?, ? examples/s]

Map:   0%|          | 0/29334 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,1.0636,1.153072
2,0.8484,1.072012
3,0.3905,1.191224


    Training finished in 59 minutes


Map:   0%|          | 0/62859 [00:00<?, ? examples/s]

    Evaluation finished in 62 minutes


# COSINE SIMILARITY PIPELINE

We do not split the documents into chunks because they are short (99% of the texts have fewer than 400 characters).

### AUXILIARY FUNCTION

In [None]:
def similarity_performance(p,embeddings,validation, validation_emb,batch_size=1000):
    """
    Function to predict the label of each validation text as the label of its most
    similar training text (cosine similarity) and save the per-category F1-score
    """
    data_tr = pd.read_excel(f'{workdata_path}prop_{p}.xlsx')
    data_tr = data_tr[data_tr['part']=='train']

    # Training embeddings (embed_documents expects a list of strings)
    data_tr_emb = embeddings.embed_documents(data_tr["news"].tolist())
    pred_labels=[]
    # To avoid a memory overflow, go through the validation data in batches of rows.
    for i in range(0, len(validation_emb), batch_size):
        print(f"   Processing batch {i} - {i+batch_size}...")
        val_batch = validation_emb[i:i+batch_size]
        sim = cosine_similarity(val_batch, data_tr_emb)
        # Index of the most similar training text for each validation row
        top_idx = np.argmax(sim, axis=1)
        batch_preds = [data_tr['label'].values[j] for j in top_idx]
        pred_labels.extend(batch_preds)
        del sim, val_batch , top_idx, batch_preds
        

    # similarities = cosine_similarity(validation_emb, data_tr_emb)
    # indexs = np.argmax(similarities, axis=1)
    # pred_labels = [data_tr['label'].values[i] for i in indexs]

    # Performance
    metric = evaluate.load("f1")
    report = metric.compute(references = np.array(validation['label']), predictions = np.array(pred_labels), average=None)

    df_report = pd.DataFrame({
        "Category": label_encoder.classes_,
        "F1-Score": report["f1"]
    })
    df_report.to_csv(f"../performance/sim_{p}.csv")
    del data_tr_emb, pred_labels, df_report
    gc.collect()
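
The prediction rule inside `similarity_performance` is a 1-nearest-neighbour classifier under cosine similarity, evaluated in batches. A self-contained sketch of just that step, with made-up 2-D vectors standing in for the OpenAI embeddings (`nearest_label` is a hypothetical helper, not a function from the notebook):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def nearest_label(val_emb, train_emb, train_labels, batch_size=2):
    """Give each validation row the label of its most similar training row."""
    preds = []
    for start in range(0, len(val_emb), batch_size):
        batch = val_emb[start:start + batch_size]
        # Similarity matrix of shape (rows in batch, rows in training set)
        sim = cosine_similarity(batch, train_emb)
        preds.extend(train_labels[np.argmax(sim, axis=1)])
    return np.array(preds)

# Made-up 2-D "embeddings" and labels
train_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
train_labels = np.array([0, 1, 2])
val_emb = np.array([[0.9, 0.1], [0.1, 0.8], [-2.0, 0.3], [0.2, 5.0], [3.0, -0.1]])

preds = nearest_label(val_emb, train_emb, train_labels)
print(preds)
```

Batching only bounds the size of the similarity matrix held in memory at once; the predictions are the same as those from a single call over all validation rows.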

### EMBEDDINGS

In [20]:
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)


In [21]:
validation=pd.read_excel(f'{workdata_path}validation.xlsx')
validation_emb = embeddings.embed_documents(validation["news"]) 

### SIMILARITY LOOP

In [None]:


for prop in proportions:
    t_inicial = time.time()
    print(f"Processing proportion: {prop}")
    similarity_performance(prop,embeddings,validation, validation_emb,batch_size=1000)
    print(f"Finished in {round((time.time()-t_inicial)/60)} minutes\n----------------------------------------")
 

Processing proportion: 0.001
   Processing batch 0 - 1000...
   Processing batch 1000 - 2000...
   Processing batch 2000 - 3000...
   Processing batch 3000 - 4000...
   Processing batch 4000 - 5000...
   Processing batch 5000 - 6000...
   Processing batch 6000 - 7000...
   Processing batch 7000 - 8000...
   Processing batch 8000 - 9000...
   Processing batch 9000 - 10000...
   Processing batch 10000 - 11000...
   Processing batch 11000 - 12000...
   Processing batch 12000 - 13000...
   Processing batch 13000 - 14000...
   Processing batch 14000 - 15000...
   Processing batch 15000 - 16000...
   Processing batch 16000 - 17000...
   Processing batch 17000 - 18000...
   Processing batch 18000 - 19000...
   Processing batch 19000 - 20000...
   Processing batch 20000 - 21000...
   Processing batch 21000 - 22000...
   Processing batch 22000 - 23000...
   Processing batch 23000 - 24000...
   Processing batch 24000 - 25000...
   Processing batch 25000 - 26000...
   Processing batch 26000 - 27000.

# PERFORMANCE ANALYSIS

We are going to study 3 main things related to the impact of the training size on performance:

1. Evaluation of the performance of the similarity process.

2. Evaluation of the performance of the BERT fine-tuning process.

3. Comparison of the performance of both processes based on the number of samples.

### 0. AUXILIARY FUNCTIONS

In [None]:
def general(data):

    """
    Function to plot performance and training set size for each `p`.
    """

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=data["prop"], y=data["f1_score_avg"],name="f1-score", yaxis="y1"))
    fig.add_trace(go.Scatter(x=data["prop"], y=data["n_rows"],name="Nº rows", yaxis="y2"))


    fig.update_layout(
        title="Average f1-score vs size of training data",
        xaxis=dict(title="Proportion of data for training"),
        yaxis=dict(title="Performance (f1-score avg)", side="left"),
        yaxis2=dict(
            title="Nº rows of training data",
            overlaying="y",
            side="right"
        )
    )
    fig.show()

def top_categories(data):

    """
    Function to plot performance and training set size for each `p` of 3 most frequent and 3 least frequent classes.
    """ 

    # Create the figure
    fig = go.Figure()
    colors = ['red','blue','green','violet','turquoise','yellowgreen']
    # Add one F1-score trace per category
    j=0
    for category in data['Category'].unique():
        subset = data[data['Category'] == category]
        fig.add_trace(go.Scatter(
            x=subset['prop'],
            y=subset['F1-Score'],
            name=f'F1 - {category}',
            mode='lines+markers',
            yaxis='y1',
            marker=dict(symbol='circle', size=5, color=colors[j]),
        ))
        j +=1
    # Add one training-row-count trace per category (secondary axis)
    j=0
    for category in data['Category'].unique():
        subset = data[data['Category'] == category]
        fig.add_trace(go.Scatter(
            x=subset['prop'],
            y=subset['n_rows_label'],
            name=f'nrows - {category}',
            mode='lines+markers',
            yaxis='y2',
            marker=dict(symbol='diamond', size=5, color=colors[j]),
            line=dict(dash='dot')

        ))
        j +=1

    # Configure the layout with a secondary y-axis
    fig.update_layout(
        title="F1-score / training data size of extreme categories",
        xaxis=dict(title="Proportion of data for training"),
        yaxis=dict(title="Performance (f1-score)"),
        yaxis2=dict(
            title="Nº rows training",
            overlaying='y',
            side='right',
            showgrid=False
        ),
        legend=dict(x=1.1, y=1, bordercolor="Black", borderwidth=1)
    )

    fig.show()

def f1_data(tipo: Literal['sim', 'bert']):
    """
    function to build a convenient dataset for the visualization.
    """
    data = pd.DataFrame()
    work_l = pd.DataFrame()

    for d in proportions:
        data = pd.concat([data,pd.read_csv(f"../performance/{tipo}_{d}.csv",usecols=['Category','F1-Score']).assign(prop=d)])
        work_aux = pd.read_excel(f'{workdata_path}prop_{d}.xlsx').assign(prop=d)
        work_aux = work_aux[work_aux['part']=='train']
        work_l = pd.concat([work_l,work_aux.groupby(['label','prop']).agg(n_rows_label = pd.NamedAgg(column= 'news',aggfunc='count')).reset_index()])
        del work_aux
    work_l['Category'] = label_encoder.inverse_transform(work_l['label'])
    work_l.drop(columns='label',inplace=True)
    work_l.reset_index(drop=True,inplace=True)
    data.reset_index(drop=True,inplace=True)
    print(data.shape)
    data = data.merge(work_l,how='left',on=['Category','prop'])
    print(data.shape)
    return data

In [None]:
# proportions = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

### 1. PERFORMANCE SIMILARITY PROCESS

In [51]:
sim = f1_data('sim')

(420, 3)
(420, 4)
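
The two shapes printed above are consistent with 42 categories × 10 proportions = 420 rows, with the left merge adding the `n_rows_label` column without changing the row count. A toy sketch of that kind of merge, with made-up numbers; the `validate="many_to_one"` argument is an optional pandas check that is not used in the notebook, added here to make the merge fail loudly if the counts table had duplicate keys.

```python
import pandas as pd

# Made-up per-category scores for two proportions
scores = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "F1-Score": [0.2, 0.5, 0.4, 0.6],
    "prop": [0.1, 0.1, 0.5, 0.5],
})

# Made-up training row counts; category B has no training rows at prop 0.1
counts = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "prop": [0.1, 0.5, 0.5],
    "n_rows_label": [10, 50, 30],
})

merged = scores.merge(counts, how="left", on=["Category", "prop"], validate="many_to_one")
print(merged)
```

A key with no match on the right side gets `NaN`, which also turns the whole column into floats; that may be why `n_rows_label` is displayed with a decimal point in the tables below.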


#### GENERAL

In [None]:
f1_size = sim.groupby(['prop']).agg(
    f1_score_avg = pd.NamedAgg(column='F1-Score', aggfunc='mean'),
    n_rows = pd.NamedAgg(column='n_rows_label', aggfunc='sum')
).reset_index()

In [53]:
general(f1_size)

This chart shows how the average F1-score (blue line) evolves as the training data size increases. The red line represents the corresponding number of training examples used at each proportion.

The model achieves its largest performance gain with the first training examples. Beyond roughly 20-30k training rows, each additional batch of data improves the F1-score less and less.
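
These diminishing returns can be quantified from the averaged results. The sketch below uses the cosine similarity averages and row counts reported in the comparison table of section 3 (rounded to three decimals) and computes the F1 gain per 1,000 additional training rows. The helper name `marginal_gains` is an assumption for illustration and is not part of the notebook.

```python
# Average F1 of the cosine similarity process and total training rows per
# proportion, copied (rounded) from the comparison table in section 3.
n_rows = [117, 587, 1173, 5867, 11733, 23467, 35200, 46934, 58667, 117334]
f1_sim = [0.171, 0.274, 0.303, 0.362, 0.388, 0.409, 0.424, 0.436, 0.439, 0.468]


def marginal_gains(rows, scores):
    """Return the F1 gain per 1,000 extra training rows between consecutive steps."""
    gains = []
    for i in range(1, len(rows)):
        extra_rows = rows[i] - rows[i - 1]
        gains.append(1000 * (scores[i] - scores[i - 1]) / extra_rows)
    return gains


gains = marginal_gains(n_rows, f1_sim)
for rows, gain in zip(n_rows[1:], gains):
    print(f"up to {rows:>6} rows: {gain:+.4f} F1 per 1,000 extra rows")
```

With these values, the first step yields roughly 0.22 F1 points per 1,000 extra rows, while the last step yields less than 0.001.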

#### MOST AND LEAST POPULATED CATEGORIES

In [26]:
less_3 = list(sim.groupby(['Category'])['n_rows_label'].mean().sort_values().index[:3])
top_3 = list(sim.groupby(['Category'])['n_rows_label'].mean().sort_values().index[-3:])
top_less_3 =less_3 +top_3

In [29]:
sim_tops = sim[sim['Category'].isin(top_less_3)]
sim_tops

Unnamed: 0,Category,F1-Score,prop,n_rows_label
4,COLLEGE,0.112948,0.001,1.0
9,EDUCATION,0.18617,0.001,1.0
10,ENTERTAINMENT,0.369369,0.001,8.0
19,LATINO VOICES,0.021739,0.001,1.0
24,POLITICS,0.544674,0.001,21.0
38,WELLNESS,0.422986,0.001,11.0
46,COLLEGE,0.208333,0.005,3.0
51,EDUCATION,0.301493,0.005,3.0
52,ENTERTAINMENT,0.465661,0.005,49.0
61,LATINO VOICES,0.045889,0.005,3.0


In [47]:
top_categories(sim_tops)

This chart shows how the f1-score (solid lines) evolves for those categories with the most and the least training samples, as the proportion of training data increases. The dotted lines represent the number of training samples for each category at each data proportion.

Categories with higher frequency tend to achieve higher F1-scores, reaching between 0.4 and 0.7 even at the smallest training sizes.

For low-frequency categories, the F1-score curves also grow quickly, but with much greater instability.

### 2. PERFORMANCE BERT PROCESS

In [54]:
bertdat = f1_data('bert')

(420, 3)
(420, 4)


#### GENERAL

In [56]:
f1_size_b = bertdat.groupby(['prop']).agg(
    f1_score_avg = pd.NamedAgg(column='F1-Score', aggfunc='mean'),
    n_rows = pd.NamedAgg(column='n_rows_label', aggfunc='sum')
).reset_index()



In [57]:
general(f1_size_b)

Just like in the cosine case, each further gain in F1-score becomes increasingly expensive in terms of data. One clear difference from the cosine case is that here the F1-score with the smallest training set is close to 0.

#### MOST AND LEAST POPULATED CATEGORIES

In [58]:

less_3_b = list(bertdat.groupby(['Category'])['n_rows_label'].mean().sort_values().index[:3])
top_3_b = list(bertdat.groupby(['Category'])['n_rows_label'].mean().sort_values().index[-3:])
top_less_3_b =less_3_b +top_3_b

sim_tops_b = bertdat[bertdat['Category'].isin(top_less_3_b)]
sim_tops_b


Unnamed: 0,Category,F1-Score,prop,n_rows_label
4,COLLEGE,0.0,0.001,1.0
9,EDUCATION,0.0,0.001,1.0
10,ENTERTAINMENT,0.004198,0.001,8.0
19,LATINO VOICES,0.0,0.001,1.0
24,POLITICS,0.296516,0.001,21.0
38,WELLNESS,0.0,0.001,11.0
46,COLLEGE,0.0,0.005,3.0
51,EDUCATION,0.0,0.005,3.0
52,ENTERTAINMENT,0.541329,0.005,49.0
61,LATINO VOICES,0.0,0.005,3.0


In [60]:

top_categories(sim_tops_b)

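We observe the same pattern as in the cosine similarity case, although here the least populated categories start with an F1-score of 0 at the smallest proportions.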

### 3. BERT VS SIMILARITY BASED ON TRAINING DATA SIZE

#### GENERAL

In [63]:
f1_bert_sim =f1_size.rename(columns={'f1_score_avg':'f1_avg_sim'}).merge(
    f1_size_b[['prop','f1_score_avg']].rename(columns={'f1_score_avg':'f1_avg_bert'}),
    how='left',
    on = 'prop'
)
f1_bert_sim

Unnamed: 0,prop,f1_avg_sim,n_rows,f1_avg_bert
0,0.001,0.17122,117.0,0.010252
1,0.005,0.273838,587.0,0.11209
2,0.01,0.302651,1173.0,0.216159
3,0.05,0.361606,5867.0,0.417953
4,0.1,0.388165,11733.0,0.486586
5,0.2,0.409168,23467.0,0.517698
6,0.3,0.424269,35200.0,0.554554
7,0.4,0.435602,46934.0,0.566658
8,0.5,0.439252,58667.0,0.577528
9,1.0,0.467711,117334.0,0.611635


In [64]:

fig = go.Figure()
fig.add_trace(go.Scatter(x=f1_bert_sim["prop"], y=f1_bert_sim["f1_avg_sim"],name="f1-similarity", yaxis="y1"))
fig.add_trace(go.Scatter(x=f1_bert_sim["prop"], y=f1_bert_sim["f1_avg_bert"],name="f1-BERT", yaxis="y1"))
fig.add_trace(go.Scatter(x=f1_bert_sim["prop"], y=f1_bert_sim["n_rows"],name="Nº rows", yaxis="y2"))


fig.update_layout(
    title="Average f1-score similarity vs BERT",
    xaxis=dict(title="Proportion of data for training"),
    yaxis=dict(title="Performance (f1-score avg)", side="left"),
    yaxis2=dict(
        title="Nº rows of training data",
        overlaying="y",
        side="right"
    )
)
fig.show()


This chart compares the performance of the two models, Cosine Similarity (blue) and fine-tuned BERT (red), as the training data proportion increases. The green line shows the number of training samples.

At very small training sizes (proportions up to 0.01, about 1,200 rows), Cosine Similarity performs better and could be a viable option when resources are scarce. From a proportion of 0.05 onward, BERT consistently outperforms it, making it the better choice when resources allow for fine-tuning on larger datasets. The performance gap between the models widens with more training data, reflecting BERT's capacity to learn from larger datasets.
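
The crossover point can also be read programmatically. The following sketch copies the averages from the table above (rounded to three decimals) and looks for the first proportion at which BERT beats cosine similarity. The function name `first_crossover` is hypothetical and only serves this illustration.

```python
# Average F1 per training proportion, copied (rounded) from the table above.
props = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0]
f1_sim = [0.171, 0.274, 0.303, 0.362, 0.388, 0.409, 0.424, 0.436, 0.439, 0.468]
f1_bert = [0.010, 0.112, 0.216, 0.418, 0.487, 0.518, 0.555, 0.567, 0.578, 0.612]


def first_crossover(xs, baseline, challenger):
    """Return the first x at which challenger beats baseline, or None."""
    for x, b, c in zip(xs, baseline, challenger):
        if c > b:
            return x
    return None


crossover = first_crossover(props, f1_sim, f1_bert)
# Positive gap means BERT is ahead; negative means cosine similarity is ahead.
gaps = [round(c - b, 3) for b, c in zip(f1_sim, f1_bert)]
print(f"BERT first outperforms cosine similarity at proportion {crossover}")
print("BERT minus similarity gap per proportion:", gaps)
```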

#### MOST AND LEAST POPULATED CATEGORIES

In [75]:
top_sim =sim[sim['Category'].isin(top_3_b)]
less_sim =sim[sim['Category'].isin(less_3_b)]
top_bert =bertdat[bertdat['Category'].isin(top_3_b)]
less_bert =bertdat[bertdat['Category'].isin(less_3_b)]
# sim_tops_b = bertdat[bertdat['Category'].isin(top_less_3_b)]
# sim_tops_b
tops = top_sim.rename(columns={'F1-Score':'F1-sim'}).merge(
    top_bert[['Category','prop','F1-Score']].rename(columns={'F1-Score':'F1-BERT'}),
    how='left',
    on=['Category','prop']
)
less = less_sim.rename(columns={'F1-Score':'F1-sim'}).merge(
    less_bert[['Category','prop','F1-Score']].rename(columns={'F1-Score':'F1-BERT'}),
    how='left',
    on=['Category','prop']
)

In [76]:
tops

Unnamed: 0,Category,F1-sim,prop,n_rows_label,F1-BERT
0,ENTERTAINMENT,0.369369,0.001,8.0,0.004198
1,POLITICS,0.544674,0.001,21.0,0.296516
2,WELLNESS,0.422986,0.001,11.0,0.0
3,ENTERTAINMENT,0.465661,0.005,49.0,0.541329
4,POLITICS,0.656941,0.005,100.0,0.666942
5,WELLNESS,0.45093,0.005,50.0,0.461931
6,ENTERTAINMENT,0.49736,0.01,98.0,0.621011
7,POLITICS,0.669723,0.01,199.0,0.718699
8,WELLNESS,0.498126,0.01,101.0,0.622637
9,ENTERTAINMENT,0.560983,0.05,486.0,0.681268


In [80]:
def tops_bert_sim(data):
    """Plot F1-score per category for cosine similarity (dotted) and BERT (solid)."""
    fig = go.Figure()
    colors = ['red', 'blue', 'green']
    categories = data['Category'].unique()

    # Add one cosine similarity F1-score trace per category
    for j, category in enumerate(categories):
        subset = data[data['Category'] == category]
        fig.add_trace(go.Scatter(
            x=subset['prop'],
            y=subset['F1-sim'],
            name=f'F1-sim - {category}',
            mode='lines+markers',
            yaxis='y1',
            marker=dict(symbol='circle', size=5, color=colors[j % len(colors)]),
            line=dict(dash='dot')
        ))

    # Add one BERT F1-score trace per category, reusing the category color
    for j, category in enumerate(categories):
        subset = data[data['Category'] == category]
        fig.add_trace(go.Scatter(
            x=subset['prop'],
            y=subset['F1-BERT'],
            name=f'F1-BERT - {category}',
            mode='lines+markers',
            yaxis='y1',
            marker=dict(symbol='diamond', size=5, color=colors[j % len(colors)])
        ))

    # Configure the layout (the function is used for both top and least populated categories)
    fig.update_layout(
        title="F1-Score BERT vs Similarity for the selected categories",
        xaxis=dict(title="Proportion of data for training"),
        yaxis=dict(title="Performance (f1-score)"),
        legend=dict(x=1.1, y=1, bordercolor="Black", borderwidth=1)
    )

    fig.show()


In [81]:
tops_bert_sim(tops)

This chart compares the performance of BERT and Cosine Similarity across the three most populated categories in the dataset: ENTERTAINMENT, POLITICS, and WELLNESS. Solid lines represent BERT; dotted lines represent Cosine Similarity.

For the most populated categories, BERT performs better even with small training sets; the only exception is the smallest one (proportion 0.001), where Cosine Similarity wins.

In [82]:
tops_bert_sim(less)

This chart compares the performance of BERT (solid lines) and Cosine Similarity (dotted lines) for the least populated categories in the dataset: COLLEGE, EDUCATION, and LATINO VOICES, across increasing proportions of training data.

Cosine Similarity is better in early stages but lacks the capacity to close the performance gap as data increases.

#### CATEGORIES IMPROVE

We now check whether the performance of each category increases with respect to the previous (smaller) training set.
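
A minimal sketch of this check on toy data follows; the values are invented for illustration and are not results from the experiment. It assumes each row should only be compared with the previous proportion of the same category, which `groupby(...).diff()` guarantees. A plain `shift(1)` on the sorted frame would instead compare the first row of a category with the last row of the preceding one.

```python
import pandas as pd

# Toy data: two hypothetical categories, three training proportions each.
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "B"],
    "prop": [0.01, 0.1, 1.0, 0.01, 0.1, 1.0],
    "F1-Score": [0.20, 0.35, 0.30, 0.50, 0.45, 0.60],
})
toy = toy.sort_values(["Category", "prop"]).reset_index(drop=True)

# Difference with the previous proportion *within* each category.
# The first row of every category has no predecessor (NaN), so it is False.
toy["F1-up"] = toy.groupby("Category")["F1-Score"].diff() > 0

# Number of categories whose F1 improved at each proportion.
n_up = toy.groupby("prop")["F1-up"].sum()
print(toy)
print(n_up)
```

In this toy frame, the first row of category B is not counted as an improvement, even though its F1-score is higher than the last row of category A.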

In [91]:

sim.sort_values(['Category','prop'], inplace=True)
# Compare each row only with the previous proportion of the same category
sim['F1-up'] = sim.groupby('Category')['F1-Score'].diff() > 0

bertdat.sort_values(['Category','prop'], inplace=True)
bertdat['F1-up'] = bertdat.groupby('Category')['F1-Score'].diff() > 0

In [96]:
n_cat_increases_sim = sim.groupby('prop').agg(n_cat_increases_sim = pd.NamedAgg(column='F1-up', aggfunc='sum')).reset_index()
n_cat_increases_bert = bertdat.groupby('prop').agg(n_cat_increases_bert = pd.NamedAgg(column='F1-up', aggfunc='sum')).reset_index()

In [98]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=n_cat_increases_sim["prop"], y=n_cat_increases_sim["n_cat_increases_sim"],name="Similarity", yaxis="y1"))
fig.add_trace(go.Scatter(x=n_cat_increases_bert["prop"], y=n_cat_increases_bert["n_cat_increases_bert"],name="BERT", yaxis="y1"))


fig.update_layout(
    title="Nº categories with increasing performance",
    xaxis=dict(title="Proportion of data for training"),
    yaxis=dict(title="Nº of categories", side="left"),
   
)
fig.show()

This chart shows how many categories improve their f1-score performance at each step as the training data proportion increases, for both Cosine Similarity (blue) and BERT (red).

The result is intuitive and consistent with previous observations: as the training set grows, further improvements in f1-score become harder to achieve, so fewer categories continue to show performance gains.

# TAKEAWAYS

As expected, BERT clearly outperforms cosine similarity in most scenarios. However, when the training data is very limited, BERT may not achieve good results, and a simpler method like cosine similarity can actually be more effective, as shown in this experiment.

Why does cosine similarity perform better with small training sets but worse as the training size increases?

In my view, two main factors explain this behavior:

* Redundancy in added data: As more training data is introduced, it increasingly overlaps with the contexts already seen. Therefore, the marginal gain in similarity-based retrieval becomes smaller, since new examples add little novel information to improve the similarity match.

* BERT’s learning capacity: BERT is capable of capturing far more complex relationships in the data, going beyond simple linear similarity. However, it requires a larger number of training iterations (and examples) to properly adjust its weights and reach a meaningful minimum of the loss function.
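
To make the first point concrete, here is a sketch of a cosine similarity classifier on toy vectors. It illustrates the general technique under stated assumptions and is not necessarily how the similarity process in this notebook is implemented: texts are assumed to be already embedded, and a query takes the label of its most similar training vector. The function `cosine_predict` and all vectors are hypothetical.

```python
import numpy as np


def cosine_predict(train_vecs, train_labels, query_vecs):
    """Label each query with the label of its most cosine-similar training vector."""
    # L2-normalise rows so that a dot product equals the cosine similarity.
    train_norm = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    query_norm = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = query_norm @ train_norm.T  # shape: (n_queries, n_train)
    return [train_labels[i] for i in sims.argmax(axis=1)]


# Hypothetical 2-D "embeddings"; real sentence embeddings have many more dimensions.
train_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])
train_labels = ["POLITICS", "WELLNESS"]
query = np.array([[0.9, 0.3]])

pred_small = cosine_predict(train_vecs, train_labels, query)
print(pred_small)

# Adding a training example that is parallel to an existing one does not
# change the prediction: the new row is redundant for this query.
more_vecs = np.vstack([train_vecs, [[2.0, 0.2]]])
more_labels = train_labels + ["POLITICS"]
pred_large = cosine_predict(more_vecs, more_labels, query)
print(pred_large)
```

The method has no weights to fit, so it gives a usable prediction with a single example per class, but a new example that points in the same direction as an existing one leaves the result unchanged.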

# RECOMMENDATIONS

* Explore additional baseline models.

* Use other embedding models.

* Build a more parametric pipeline that allows us to run this experiment in a more automated way.

* Use cross validation for smaller data sets.

* Optimize hyperparameters.
