In [None]:
import logging

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS

from src.data_loader import LABEL_MAP
from src.data_loader import load_datasets
from src.model_pipeline import load_and_prep_datasets

# Sentiment Analysis Mini-Challenge

> Authors: Dominik Filliger, Nils Fahrni, Noah Leuenberger (2024)

## 1. Introduction
Sentiment analysis is a crucial task in Natural Language Processing (NLP) that involves determining the sentiment or tone of a given text. It has numerous applications, such as understanding customer feedback, monitoring social media sentiment, and analyzing product reviews. However, manually labeling large datasets for sentiment analysis can be time-consuming and costly. Semi-supervised learning techniques, such as weak supervision, can help alleviate this challenge by leveraging a small amount of labeled data along with a larger set of unlabeled data to improve model performance.

## 2. Dataset Selection and Exploratory Data Analysis
The dataset used for this mini-challenge is the Amazon Polarity dataset, which consists of product reviews from Amazon labeled as either positive or negative. The dataset is loaded using the Hugging Face Datasets library. Exploratory data analysis is performed to gain insights into the distribution of labels, length of reviews, and other relevant characteristics.

Because the full dataset contains 4 million reviews, we reduce it to a subset of 6666 reviews for this mini-challenge: 666 reviews are used for validation, and the remaining 6000 are split into 1000 labeled samples and 5000 artificially unlabeled samples.

Each subset has a 50/50 split of positive and negative reviews. 
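
A minimal sketch of how such balanced partitions could be drawn with pandas, assuming a dataframe with a binary `label` column. The helper `balanced_sample`, the toy data, and the seed are hypothetical and not taken from the project code.

```python
import pandas as pd


def balanced_sample(df, n_total, label_col="label", seed=42):
    """Draw n_total rows with an equal number of rows per class (hypothetical helper)."""
    classes = df[label_col].unique()
    per_class = n_total // len(classes)
    parts = [
        df[df[label_col] == cls].sample(n=per_class, random_state=seed)
        for cls in classes
    ]
    # Shuffle so the classes are interleaved rather than stacked
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Toy stand-in for the full dataset: 10,000 rows per class
full = pd.DataFrame({
    "content": [f"review {i}" for i in range(20_000)],
    "label": [0, 1] * 10_000,
})

val_part = balanced_sample(full, 666)
remaining = full.drop(val_part.index)
labeled_part = balanced_sample(remaining, 1000)
unlabeled_part = balanced_sample(remaining.drop(labeled_part.index), 5000)

print(len(val_part), len(labeled_part), len(unlabeled_part))
```

Dropping the already sampled indices before each further draw keeps the three partitions disjoint.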

In [None]:
train_df, unlabeled, validation = load_datasets("../data/partitions")

print(f"Labeled Dataset Length: {len(train_df)}")
print(f"Unlabeled Dataset Length: {len(unlabeled)}")
print(f"Validation Dataset Length: {len(validation)}")

To get a better idea of the whole dataset, we will merge the data again and perform some exploratory data analysis. For this purpose, we will merge the training, unlabeled, and validation datasets into one dataframe. We will also rename the 'ground_truth' column to 'label' in the unlabeled dataset to maintain consistency across the datasets.

In [None]:
unlabeled.rename(columns={'ground_truth': 'label'}, inplace=True)  # Rename column for consistency
eda_df = pd.concat([train_df, unlabeled, validation])
eda_df['label'] = eda_df['label'].map(LABEL_MAP)  # Map integer labels to names (0 -> negative, 1 -> positive)

print(f"Merged Dataset Length: {len(eda_df)}")
eda_df.head()

In [None]:
eda_df['review_length'] = eda_df['content'].apply(len)

plt.figure(figsize=(10, 6))
sns.histplot(eda_df['review_length'], bins=50, kde=True)
plt.title('Distribution of Review Lengths in the Merged Dataset')
plt.xlabel('Review Length')
plt.ylabel('Frequency')
plt.show()

The distribution of review lengths in the dataset is visualized using a histogram. The majority of reviews have a length between 0 and 1000 characters, with a peak around 500 characters. This information can be useful for preprocessing and feature engineering steps in the sentiment analysis task.
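
The reading of the histogram can be cross-checked numerically. The sketch below computes length quantiles and the share of reviews under a threshold on a small made-up dataframe; the toy reviews and the 1000-character threshold are illustrative assumptions, and the same calls would apply to `eda_df['content']`.

```python
import pandas as pd

# Made-up reviews, used only to demonstrate the calls
toy = pd.DataFrame({
    "content": [
        "Great product!",
        "Broke after a week, would not buy again.",
        "ok",
    ]
})

# Character length per review
lengths = toy["content"].str.len()

# Quartiles of the length distribution
summary = lengths.quantile([0.25, 0.5, 0.75])

# Fraction of reviews shorter than the threshold
share_below = (lengths < 1000).mean()

print(summary)
print(f"Share of reviews under 1000 characters: {share_below:.1%}")
```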

### Most Common Words

In [None]:
def plot_most_common_words(df, top_n=20):
    df = df.copy()
    # Invert LABEL_MAP to turn label names back into their integer codes
    df['label'] = df['label'].map({v: k for k, v in LABEL_MAP.items()})
    pos_reviews = df[df['label'] == 1]['content']
    neg_reviews = df[df['label'] == 0]['content']

    # One vocabulary per class, with English stop words removed
    vectorizer_pos = CountVectorizer(stop_words='english')
    vectorizer_neg = CountVectorizer(stop_words='english')

    pos_word_count = vectorizer_pos.fit_transform(pos_reviews)
    neg_word_count = vectorizer_neg.fit_transform(neg_reviews)

    # Sum over all documents to get one total count per vocabulary word
    pos_sum_words = pos_word_count.sum(axis=0)
    neg_sum_words = neg_word_count.sum(axis=0)

    pos_words_freq = [(word, pos_sum_words[0, idx]) for idx, word in
                      enumerate(vectorizer_pos.get_feature_names_out())]
    neg_words_freq = [(word, neg_sum_words[0, idx]) for idx, word in
                      enumerate(vectorizer_neg.get_feature_names_out())]

    # Most frequent words first
    pos_words_freq = sorted(pos_words_freq, key=lambda x: x[1], reverse=True)
    neg_words_freq = sorted(neg_words_freq, key=lambda x: x[1], reverse=True)

    words, freq = zip(*pos_words_freq[:top_n])
    plt.figure(figsize=(10, 5))
    plt.bar(words, freq)
    plt.title('Most common words in positive reviews')
    plt.xticks(rotation=90)
    plt.show()

    words, freq = zip(*neg_words_freq[:top_n])
    plt.figure(figsize=(10, 5))
    plt.bar(words, freq)
    plt.title('Most common words in negative reviews')
    plt.xticks(rotation=90)
    plt.show()


plot_most_common_words(eda_df, top_n=20)

The most common words in positive and negative reviews are visualized using bar plots. The top 20 most frequent words in each category are displayed, providing insights into the language used in positive and negative reviews.

### Word Clouds

In [None]:
def generate_word_cloud(text, title):
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=set(STOPWORDS),
                          min_font_size=10).generate(text)

    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title)
    plt.show()

In [None]:
pos_reviews_text = " ".join(eda_df[eda_df['label'] == "positive"]['content'].values)
generate_word_cloud(pos_reviews_text, "Word Cloud for Positive Reviews")

In [None]:
neg_reviews_text = " ".join(eda_df[eda_df['label'] == "negative"]['content'].values)
generate_word_cloud(neg_reviews_text, "Word Cloud for Negative Reviews")

## 3. Data Splitting Strategy
The dataset is split into development, validation, labeled, and unlabeled sets. The development set is a fraction of the full dataset, the validation set is a fraction of the test dataset, and the labeled set is a fraction of the development set. The remaining samples in the development set are treated as unlabeled. One sixth of the development set is labeled, resulting in 1000 labeled and 5000 weakly labeled samples in total. On top of this, a nested split grows the training data in 25% increments (25%, 50%, 75%, 100%).

All the pre-split datasets are stored in the `data/partitions` directory as `.parquet` files.

Given the MC's focus on the impact of weak labelling, we introduce a nested split that further divides our training data. Here is a brief overview of the nested split algorithm we use:

1. **Validate the Fractions**: We start by ensuring that the proportions we want to use for our subsets are reasonable—each should be a fraction of the whole dataset.
2. **Shuffle the Data**: To make sure our subsets are representative and unbiased, we randomly shuffle the entire dataset. This ensures that each subset is a good mix of the data.
3. **Forming the Subsets**: For each proportion we decided on, we calculate how much of the dataset it represents. Starting with the smallest subset, we keep adding more data until we reach the desired size for each proportion. Each new subset contains all the data from the previous subsets, plus some more.
4. **Collect the Subsets**: In the end, we have a series of nested subsets, each larger than the last.

The nested split is used by the `load_and_prep_datasets` function in the `src/model_pipeline.py` module and is implemented separately in the `src/prep_datasets.py` script.
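
The four steps of the nested split can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from `src/prep_datasets.py`; the function name `nested_split` and its parameters are assumptions.

```python
import random


def nested_split(items, fractions, seed=42):
    """Return {fraction: subset}, where each subset contains all smaller ones."""
    # 1. Validate the fractions
    if not all(0 < f <= 1 for f in fractions):
        raise ValueError("Each fraction must lie in (0, 1].")

    # 2. Shuffle a copy so the caller's data is left untouched
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # 3. and 4. Take growing prefixes of the shuffled data;
    # prefixes of the same list are nested by construction
    return {f: shuffled[: int(len(shuffled) * f)] for f in sorted(fractions)}


splits = nested_split(range(1000), [0.25, 0.5, 0.75, 1.0])
print({f: len(s) for f, s in splits.items()})
# → {0.25: 250, 0.5: 500, 0.75: 750, 1.0: 1000}
```

Note that a plain shuffle keeps the class balance only approximately in each subset; a stratified variant would be needed to guarantee an exact 50/50 split.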

The goal is to identify the optimal amount of additional data that can be used to improve model performance without manual annotation. For this reason, when training with weak labels we apply the nested split only to the weakly labeled data, not to the labeled data, and then concatenate each nested split with all of the labeled data.
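
As a rough illustration of this combination step, the sketch below builds one training set per fraction from toy dataframes. The frames `labeled_df` and `weak_df`, their contents, and the seed are made up for the example.

```python
import pandas as pd

# Toy stand-ins: 4 manually labeled rows and 8 weakly labeled rows
labeled_df = pd.DataFrame({
    "content": [f"labeled {i}" for i in range(4)],
    "label": [0, 1, 0, 1],
})
weak_df = pd.DataFrame({
    "content": [f"weak {i}" for i in range(8)],
    "label": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Shuffle the weakly labeled rows once, then take growing prefixes
weak_shuffled = weak_df.sample(frac=1, random_state=0).reset_index(drop=True)

train_sets = {}
for fraction in [0.25, 0.5, 0.75, 1.0]:
    weak_part = weak_shuffled.iloc[: int(len(weak_shuffled) * fraction)]
    # Every training set holds all labeled rows plus one nested slice of weak rows
    train_sets[fraction] = pd.concat([labeled_df, weak_part], ignore_index=True)

print({f: len(df) for f, df in train_sets.items()})
# → {0.25: 6, 0.5: 8, 0.75: 10, 1.0: 12}
```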

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(10, 5), sharey=True)


def add_percentages(ax, data):
    total = len(data)
    for p in ax.patches:
        height = p.get_height()
        percentage = f'{(height / total) * 100:.1f}%'
        ax.text(p.get_x() + p.get_width() / 2., height, percentage, ha='center', va='bottom')


sns.countplot(data=train_df, x='label', ax=ax[0])
add_percentages(ax[0], train_df)
ax[0].set_title('Labeled Dataset')
ax[0].set_xlabel('Label')
ax[0].set_ylabel('Count')

sns.countplot(data=unlabeled, x='label', ax=ax[1])
add_percentages(ax[1], unlabeled)
ax[1].set_title('Unlabeled Dataset')
ax[1].set_xlabel('Ground Truth')

sns.countplot(data=validation, x='label', ax=ax[2])
add_percentages(ax[2], validation)
ax[2].set_title('Validation Dataset')
ax[2].set_xlabel('Label')

plt.tight_layout()
plt.show()

The distribution of labels in the labeled, unlabeled, and validation datasets is visualized using count plots. The percentage of positive and negative reviews in each dataset is displayed, providing insights into the class distribution of the data. As we can see, the datasets are balanced with a 50/50 split between positive and negative reviews, so class imbalance is not a concern for our models.

In [None]:
preped_datasets = load_and_prep_datasets("../data", nested_splits=True)
# Disable logging for this cell, as logging is initialized in the load_and_prep_datasets function
logging.disable(logging.CRITICAL)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.barplot(x=list(preped_datasets['nested_splits'].keys()),
            y=[len(value) for value in preped_datasets['nested_splits'].values()], ax=ax)
ax.set_title('Nested Split Sizes')
ax.set_xlabel('Nested Split Fraction')
ax.set_ylabel('Count')

total = len(preped_datasets['train'])
for p in ax.patches:
    height = p.get_height()
    percentage = f'{(height / total) * 100:.1f}%'
    ax.text(p.get_x() + p.get_width() / 2., height, percentage, ha='center', va='bottom')

plt.show()

The sizes of the nested splits generated from the training data are visualized using a bar plot. The count of samples in each nested split is displayed, along with the percentage of the total training data that each split represents. The nested splits are created in 25% increments, starting from 25% of the training data and increasing to 100% of the training data. In this visualization the nested split was only applied to the labeled data but the picture would be the same if it was applied to the weak labels.

In [None]:
fig, ax = plt.subplots(1, 4, figsize=(10, 5), sharey=True)

for i, (key, value) in enumerate(preped_datasets['nested_splits'].items()):
    value = value.to_pandas()
    sns.countplot(data=value, x='label', ax=ax[i])
    add_percentages(ax[i], value)
    ax[i].set_title(f'Nested Split {key}')
    ax[i].set_xlabel('Label')
    ax[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

The distribution of labels in each nested split is visualized using count plots. The percentage of positive and negative reviews in each nested split is displayed, providing insights into the class distribution of the data. The nested splits maintain a balanced distribution of positive and negative reviews because the data they are sampled from is itself balanced.

## Helper Functions
In order to make the notebook more readable and to avoid code duplication, we will define some helper functions that will be used throughout the notebook. These functions will help us to perform common tasks such as running experiments, evaluating pipelines, and visualizing the results.


In [None]:
import json
import os

from dotenv import load_dotenv

load_dotenv()  # Load environment variables such as MODELS_DIR from a .env file

MODEL_DIR = os.getenv("MODELS_DIR")

def _plot_metric_by_fraction(results, model_names, metric_key, ylabel, title,
                             baseline_data=None, baseline_name='Baseline'):
    """Plot one metric per model across the nested split fractions."""
    # Fractions are string keys such as "0.25"; sort them numerically
    fractions = sorted(results[0].keys(), key=float)
    x = range(len(fractions))

    fig, ax = plt.subplots(figsize=(10, 6))

    # Optional horizontal reference line, e.g. the untrained baseline
    if baseline_data:
        ax.axhline(y=baseline_data[metric_key], color='r', linestyle='--', label=baseline_name)

    for i, model_results in enumerate(results):
        values = [model_results[fraction][metric_key] for fraction in fractions]
        ax.plot(x, values, marker='o', label=model_names[i])

    ax.set_xticks(x)
    ax.set_xticklabels(fractions)
    ax.set_xlabel('Fraction of Labeled Samples')
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_ylim(0, 1)
    ax.legend(loc='lower right')
    ax.grid(True)

    plt.tight_layout()
    plt.show()


def plot_model_performance(results_data, model_names, baseline_data=None, metrics=None, baseline_name='Baseline'):
    """Plot accuracy per model; `metrics` is accepted, but only accuracy is plotted."""
    _plot_metric_by_fraction(results_data, model_names, 'eval_accuracy', 'Accuracy',
                             'Model Accuracy Comparison', baseline_data, baseline_name)


def plot_model_auroc(results_data, model_names, baseline_data=None, baseline_name='Baseline'):
    """Plot AUROC per model."""
    _plot_metric_by_fraction(results_data, model_names, 'eval_auroc', 'AUROC',
                             'Model AUROC Comparison', baseline_data, baseline_name)


## 4. Baseline Model Performance
The selected model is a pretrained language model, `sentence-transformers/all-MiniLM-L6-v2`, which serves as the baseline for sentiment classification without any training.

Before we train any models, we evaluate the baseline model on the validation set. The model predicts the sentiment of the validation reviews, and the results are evaluated using accuracy, precision, recall, and F1-score; the plot below shows accuracy together with the macro and weighted F1-scores.

**DISCLAIMER**: No training or evaluation is done in this notebook. It only loads the results of the training and evaluation runs performed via the `src/model_pipeline.py` module. For more information, see the README file.

In [None]:
import json

relevant_metrics = ['eval_accuracy', 'eval_f1_macro', 'eval_f1_weighted']

with open(f'{MODEL_DIR}/eval/eval_results.json') as file:
    baseline_data = json.load(file)

metrics = [metric for metric in relevant_metrics if metric in baseline_data]
values = [baseline_data[metric] for metric in metrics]

fig, ax = plt.subplots(1, 1, figsize=(8, 5))
ax.bar(metrics, values)
ax.set_title('Baseline Model Performance')
ax.set_xlabel('Metric')
ax.set_ylabel('Value')
ax.set_ylim(0, 1)
ax.grid(True)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

The model without any training achieves very poor results. With around 50% accuracy, it is barely better than random guessing, which would also reach about 50% on this balanced dataset.

This gives us a clear reference point for our experiments.


## 5. Supervised Learning Performance

Before we dive deeper into the chosen weak labelling technique and its impact on the model performance, we will first decide whether we will train our model via transfer learning or fine-tuning.

For this we will train the model on the nested splits with both techniques and compare the results. The results are stored in the `data/eval` directory as `.json` files.
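
The plotting helpers read each nested results file as a mapping from split fraction (a string key such as `"1.0"`) to a dictionary of metrics. The sketch below shows how such a mapping could be turned into a table for side-by-side inspection. The numbers are placeholders chosen for illustration, not results from this project.

```python
import pandas as pd

# Placeholder values only, NOT results from this project
example_results = {
    "0.5": {"eval_accuracy": 0.2, "eval_f1_macro": 0.2},
    "1.0": {"eval_accuracy": 0.4, "eval_f1_macro": 0.4},
    "0.25": {"eval_accuracy": 0.1, "eval_f1_macro": 0.1},
    "0.75": {"eval_accuracy": 0.3, "eval_f1_macro": 0.3},
}

# One row per fraction, one column per metric
table = pd.DataFrame.from_dict(example_results, orient="index")

# Convert the string keys to floats so the rows sort numerically
table.index = table.index.astype(float)
table = table.sort_index()
table.index.name = "fraction"

print(table)
```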


### 5.1 Transfer Learning

In [None]:
with open(f'{MODEL_DIR}/supervised/transfer_nested/eval_results.json') as file:
    transfer_nested_data = json.load(file)

plot_model_performance([transfer_nested_data], ['Transfer Learning'], baseline_data, metrics=relevant_metrics)



The results show that the transfer learning model barely outperforms the baseline model. This indicates that the pretrained model's knowledge is not sufficient to achieve high performance on the sentiment analysis task. 

### 5.2 Fine-tuning
To identify if fine-tuning the model can improve the performance, we will train the model using the nested splits and compare the results. 

For the fine-tuning we are using the recommended hyperparameters from the Hugging Face documentation. 

In [None]:
with open(f'{MODEL_DIR}/supervised/finetune_nested/eval_results.json') as file:
    finetune_nested_data = json.load(file)

plot_model_performance([finetune_nested_data], ['Fine-tuning'], baseline_data, metrics=relevant_metrics)


The results show that fine-tuning outperforms both the baseline model and the transfer learning model. We can also see that after 75% of the labelled data (750 labelled samples) the model's performance stagnates. This suggests that, in this setup, adding more labeled data beyond that point does not substantially improve performance.


## 6. Semi-Supervised Learning Performance
After we established that fine-tuning is the best approach for training the model, we will now evaluate the performance of the semi-supervised learning techniques. We will compare the performance of the fine-tuned model with weak labels generated using different weak labelling strategies. 

The nested split logic above is used, with the small difference that each split contains the fully labeled data. This means that the nested split is applied to the weak labels and then concatenated with the fully labeled data. 

### 6.1 Logistic Regression (LogReg) Weak Labelling
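
One common way to obtain weak labels with logistic regression is to fit it on features of the labeled samples and let it predict labels for the unlabeled ones. The sketch below illustrates that general idea on synthetic features. It is an assumption about the technique, not the project's pipeline: the synthetic data stands in for text embeddings, and the 100/500 split merely mirrors the 1:5 ratio of labeled to unlabeled samples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for text features (assumption: real features would come from an encoder)
X, y = make_classification(n_samples=600, n_features=32, n_informative=8, random_state=0)

# 100 "manually labeled" samples and 500 "unlabeled" samples
X_labeled, y_labeled = X[:100], y[:100]
X_unlabeled, y_hidden = X[100:], y[100:]  # y_hidden plays the role of the hidden ground truth

# Fit on the labeled samples only, then predict weak labels for the rest
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
weak_labels = clf.predict(X_unlabeled)

# Because the ground truth is only artificially hidden, weak label quality can be checked
agreement = (weak_labels == y_hidden).mean()
print(f"Weak labels agree with the hidden ground truth on {agreement:.1%} of samples")
```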

In [None]:
with open(f'{MODEL_DIR}/semi-supervised/finetune_nested/eval_results.json') as file:
    logreg_nested_data = json.load(file)
    
plot_model_performance([logreg_nested_data], ['LogReg Weak-Labelling'], finetune_nested_data["1.0"], metrics=relevant_metrics, baseline_name='Fine-tuning 100% (Fully Labeled)')


Adding weak labels to the dataset has a significant impact on the model performance. With only an additional 25%

## 7. Learning Curve Analysis

In [None]:
# Plot all results
plot_model_performance([transfer_nested_data, finetune_nested_data, logreg_nested_data], ['Transfer Learning', 'Fine-tuning', 'LogReg Weak-Labelling'], baseline_data, metrics=relevant_metrics)

Using the 

## 8. Model Comparison and Analysis
A thorough analysis of the results is conducted, comparing the baseline model, supervised learning techniques, and semi-supervised learning techniques. The impact of different weak labeling strategies and training data sizes on model performance is evaluated. The best approach for the chosen dataset is determined, emphasizing the models that achieve acceptable performance with few manually annotated samples.

In [None]:
# Plot all results
plot_model_auroc([transfer_nested_data, finetune_nested_data, logreg_nested_data], ['Transfer Learning', 'Fine-tuning', 'LogReg Weak-Labelling'], baseline_data)

## 9. Time Savings Factor and Implications
The time savings factor, quantifying the reduction in manually labeled data required to achieve acceptable performance levels using weak labeling approaches, is calculated. The implications of the findings are discussed.

## 10. Conclusion and Future Directions
The key findings, insights, and potential implications of the sentiment analysis mini-challenge are summarized. The effectiveness of weak supervision techniques in reducing the need for manual annotation while maintaining acceptable model performance is discussed. Future directions for research and improvements are outlined.