In [None]:
import json
import logging
import os
import sys

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve, precision_recall_curve, auc
from wordcloud import WordCloud, STOPWORDS

load_dotenv()

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

from src.data_loader import LABEL_MAP
from src.data_loader import load_datasets
from src.model_pipeline import load_and_prep_datasets

# Disable logging output in this notebook, since logging is initialized inside load_and_prep_datasets
logging.disable(logging.CRITICAL)

plt.style.use('ggplot')

# Sentiment Analysis Mini-Challenge

> Authors: Dominik Filliger, Nils Fahrni, Noah Leuenberger (2024)

## Introduction
Sentiment analysis is a crucial task in Natural Language Processing that involves determining the sentiment or tone of a given text. It has numerous applications, such as understanding customer feedback, monitoring social media sentiment, and analyzing product reviews. However, manually labeling large datasets for sentiment analysis can be time-consuming and costly. Semi-supervised learning techniques, such as weak supervision, can help alleviate this challenge by leveraging a small amount of labeled data along with a larger set of unlabeled data to improve model performance.

## Dataset Selection & Exploratory Data Analysis
The dataset used for this mini-challenge is the Amazon Polarity dataset, which consists of product reviews from Amazon labeled as either positive or negative. The dataset is loaded using the Hugging Face Datasets library. Exploratory data analysis is performed to gain insights into the distribution of labels, length of reviews, and other relevant characteristics.

As the dataset contains 4 million reviews, we reduce it to a subset of 2777 reviews for the purpose of this mini-challenge. 277 of the reviews are used for validation; the remaining 2500 are split into 250 labeled samples and 2250 artificially unlabeled samples. Each subset has a 50/50 split of positive and negative reviews. More on the splitting strategy can be found in the `Data Splitting Strategy` section.

In [None]:
train_df, unlabeled, validation = load_datasets("../data/partitions")

print(f"Labeled Dataset Length: {len(train_df)}")
print(f"Unlabeled Dataset Length: {len(unlabeled)}")
print(f"Validation Dataset Length: {len(validation)}")

To get a better idea of the whole dataset, we will merge the data again and perform some exploratory data analysis. For this purpose, we will merge the training, unlabeled, and validation datasets into one dataframe. We will also rename the 'ground_truth' column to 'label' in the unlabeled dataset to maintain consistency across the datasets.

In [None]:
unlabeled.rename(columns={'ground_truth': 'label'}, inplace=True)  # Rename column for consistency with other datasets
eda_df = pd.concat([train_df, unlabeled, validation])
eda_df['label'] = eda_df['label'].map(LABEL_MAP)  # Map integer labels to their names (0: negative, 1: positive)

print(f"Merged Dataset Length: {len(eda_df)}")
eda_df.head()

In [None]:
eda_df['review_length'] = eda_df['content'].apply(len)

plt.figure(figsize=(10, 6))
sns.histplot(eda_df['review_length'], bins=50, kde=True)
plt.title('Distribution of Review Lengths in the Merged Dataset')
plt.xlabel('Review Length (characters)')
plt.ylabel('Frequency')
plt.show()

The distribution of review lengths in the dataset is visualized using a histogram. All reviews are shorter than 1000 characters, with a peak around 200 characters.

In [None]:
eda_df.describe()

Looking at the descriptive statistics of the dataset, we can see that the average review length is around 400 characters. The shortest review is 54 characters long and the longest review is 998 characters long. This indicates that the reviews are relatively short, which is common for product reviews.

The chosen model for this mini-challenge is the `sentence-transformers/all-MiniLM-L6-v2` model, which is a pretrained language model (more on that in the `Baseline Model Performance` section). [The Hugging Face model hub](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) provides the following information about the model:

> By default, input text longer than 256 word pieces is truncated.

This means that the model will truncate reviews longer than 256 word pieces (approximately 1250 characters according to [this source](https://charactercalculator.com/words-to-characters/)). Therefore we consider that the lengths of the reviews in the dataset are suitable for the chosen model.
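As a rough sanity check of the argument above, the following sketch compares character lengths against an approximate character budget. The 1250-character budget is only the estimate quoted above, not an exact tokenizer limit, and `fraction_over_budget` as well as the sample lengths are illustrative and not part of the project code. A precise check would count word pieces with the model's own tokenizer.

```python
# Rough check: how many reviews might exceed the model's truncation limit?
# ASSUMPTION: 256 word pieces correspond to roughly 1250 characters (the
# estimate quoted in the text); this is a heuristic, not a tokenizer count.
APPROX_CHAR_BUDGET = 1250


def fraction_over_budget(lengths, budget=APPROX_CHAR_BUDGET):
    """Return the share of review lengths (in characters) above the budget."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > budget) / len(lengths)


# Hypothetical lengths spanning the reported range (54 to 998 characters).
sample_lengths = [54, 200, 400, 998]
print(fraction_over_budget(sample_lengths))  # 0.0
```

In the notebook, the same function could be applied to `eda_df['review_length']` to confirm that no review exceeds the estimated budget.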


### Most Common Words

To gain insights into the most common words used in positive and negative reviews, we will visualize the top 20 most frequent words in each category. This will help us understand the language used in reviews and identify common themes or topics that are mentioned frequently. We will use scikit-learn's `CountVectorizer` to count the occurrences of each word while excluding English stop words.

In [None]:
def plot_most_common_words(df, top_n=20):
    df = df.copy()
    df['label'] = df['label'].map({v: k for k, v in LABEL_MAP.items()})
    pos_reviews = df[df['label'] == 1]['content']
    neg_reviews = df[df['label'] == 0]['content']

    vectorizer_pos = CountVectorizer(stop_words='english')
    vectorizer_neg = CountVectorizer(stop_words='english')

    pos_word_count = vectorizer_pos.fit_transform(pos_reviews)
    neg_word_count = vectorizer_neg.fit_transform(neg_reviews)

    pos_sum_words = pos_word_count.sum(axis=0)
    neg_sum_words = neg_word_count.sum(axis=0)

    pos_words_freq = [(word, pos_sum_words[0, idx]) for word, idx in
                      zip(vectorizer_pos.get_feature_names_out(), range(pos_sum_words.shape[1]))]
    neg_words_freq = [(word, neg_sum_words[0, idx]) for word, idx in
                      zip(vectorizer_neg.get_feature_names_out(), range(neg_sum_words.shape[1]))]

    pos_words_freq = sorted(pos_words_freq, key=lambda x: x[1], reverse=True)
    neg_words_freq = sorted(neg_words_freq, key=lambda x: x[1], reverse=True)

    words, freq = zip(*pos_words_freq[:top_n])
    plt.figure(figsize=(10, 5))
    plt.bar(words, freq)
    plt.title('Most common words in positive reviews')
    plt.xticks(rotation=90)
    plt.show()

    words, freq = zip(*neg_words_freq[:top_n])
    plt.figure(figsize=(10, 5))
    plt.bar(words, freq)
    plt.title('Most common words in negative reviews')
    plt.xticks(rotation=90)
    plt.show()


plot_most_common_words(eda_df, top_n=20)

The most common words in positive and negative reviews are visualized using bar plots, showing the top 20 most frequent words in each category. Negative and positive reviews have a lot of overlap, highlighting that even though the sentiment differs, the language used is similar. We also see that "book" and "movie" are common words in both categories, suggesting that our subset contains many reviews about books and movies, both positive and negative.
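The overlap described above can also be quantified. The sketch below is a minimal, standard-library illustration that computes the shared top words and their Jaccard similarity. The stop-word list, the helper names `top_words` and `jaccard_overlap`, and the mini-corpora are all assumptions made for this example; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stop-word list; the notebook itself relies on
# scikit-learn's built-in English stop words instead.
STOP_WORDS = {"the", "a", "is", "it", "this", "was", "and", "i", "not"}


def top_words(texts, top_n=20):
    """Return the top_n most frequent non-stop-word tokens as a set."""
    counts = Counter(
        token
        for text in texts
        for token in re.findall(r"[a-z']+", text.lower())
        if token not in STOP_WORDS
    )
    return {word for word, _ in counts.most_common(top_n)}


def jaccard_overlap(words_a, words_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Hypothetical mini-corpora, only to show the call pattern.
pos = ["This book is great", "Great movie, I love it"]
neg = ["This book was boring", "The movie is not good"]
shared = top_words(pos) & top_words(neg)
print(sorted(shared))  # ['book', 'movie']
```

A Jaccard value close to 1 would mean the two top-word lists are nearly identical, while a value close to 0 would mean they barely share any words.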

### Word Clouds

Word clouds are a popular visualization technique that displays the most frequent words in a text corpus. We will generate word clouds for positive and negative reviews to visualize the most common words in each category. This will provide a more visual representation of the language used in reviews and help identify the key themes or topics mentioned frequently.

The word clouds are generated using the WordCloud library, which creates word clouds based on the frequency of words in the text. We will display the word clouds for positive and negative reviews separately.

In [None]:
def generate_word_cloud(text, title):
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=set(STOPWORDS),
                          min_font_size=10).generate(text)

    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title)
    plt.show()

In [None]:
pos_reviews_text = " ".join(eda_df[eda_df['label'] == "positive"]['content'].values)
generate_word_cloud(pos_reviews_text, "Word Cloud for Positive Reviews")

The word cloud for positive reviews is dominated by topic-related words. In addition, words like "great", "good", "bought" and "love" are frequently used in positive reviews, indicating that customers are satisfied with the products or services they have received. This aligns with the general understanding that positive reviews are more likely to contain positive language and sentiments.

In [None]:
neg_reviews_text = " ".join(eda_df[eda_df['label'] == "negative"]['content'].values)
generate_word_cloud(neg_reviews_text, "Word Cloud for Negative Reviews")

Similarly to the positive reviews, the word cloud for negative reviews also shows topic-related words. Interestingly, the word "good" is also present in the negative reviews. This could be because "good" is used in a negative context, e.g. "not good". Other words like "time", "even" and "worst" are also present in the negative reviews, indicating that customers are dissatisfied with the products or services they have received. For example, "even" could appear in phrases such as "even worse than expected". This aligns with the general understanding that negative reviews are more likely to contain negative language and sentiments as well as complaints.

## Data Splitting Strategy
The dataset is split into development, validation, labeled, and unlabeled sets using a nested split approach. The development set is a fraction of the full dataset, the validation set is a fraction of the test dataset, and the labeled set is a fraction of the development set. The remaining samples in the development set are considered unlabeled. The nested split always grows in 25% increments (25%, 50%, 75%, 100%), and one tenth of the development set is kept labeled, resulting in 250 labeled and 2250 weakly labeled samples in total.

All the pre-split datasets are stored in the `data/partitions` directory as `.parquet` files.

Given the MC's focus on the impact of weak labeling, we introduce a nested split which further divides our training data into subsets. Here is a brief overview of the nested split algorithm we use:

1. **Validate the Fractions**: We start by ensuring that the proportions we want to use for our subsets are reasonable—each should be a fraction of the whole dataset.
2. **Shuffle the Data**: To make sure our subsets are representative and unbiased, we randomly shuffle the entire dataset. This ensures that each subset is a good mix of the data.
3. **Forming the Subsets**: For each proportion we decided on, we calculate how much of the dataset it represents. Starting with the smallest subset, we keep adding more data until we reach the desired size for each proportion. Each new subset contains all the data from the previous subsets, plus some more.
4. **Collect the Subsets**: In the end, we have a series of nested subsets, each larger than the last.
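
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `nested_split`, the fixed seed, and rounding down of subset sizes are choices made for this example, and the code in `src/` may differ (for instance in how it handles class balance).

```python
import random


def nested_split(items, fractions=(0.25, 0.5, 0.75, 1.0), seed=42):
    """Return {fraction: subset} where each subset contains the previous one."""
    # 1. Validate the fractions: each must lie in (0, 1].
    if not all(0 < f <= 1 for f in fractions):
        raise ValueError("fractions must lie in the interval (0, 1]")

    # 2. Shuffle a copy so the caller's data stays untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # 3./4. Take growing prefixes of the same shuffled list; prefixes of one
    # list are nested by construction.
    return {f: shuffled[: int(len(shuffled) * f)] for f in sorted(fractions)}


splits = nested_split(range(2250))
print({f: len(s) for f, s in splits.items()})
# {0.25: 562, 0.5: 1125, 0.75: 1687, 1.0: 2250}
```

Taking prefixes of a single shuffled list is what guarantees the nesting: every sample in the 25% subset also appears in the 50%, 75% and 100% subsets.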

The implementation of the nested split is used in the `load_and_prep_datasets` function in the `src/model_pipeline.py` module and separately implemented in the `src/prep_datasets.py` script.

The goal is to identify the optimal amount of additional data that can be used to improve model performance without the need for manual annotation. For this purpose, when training with weak labels, we apply the nested split only to the weakly labeled data and not to the labeled data, and then concatenate each nested split with all the labeled data.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(10, 5), sharey=True)


def add_percentages(ax, data, percentages=True):
    total = len(data)
    for p in ax.patches:
        height = int(p.get_height())
        if percentages:
            label = f'{height} ({(height / total) * 100:.1f}%)'
        else:
            label = f'{height}'
        ax.text(p.get_x() + p.get_width() / 2., height, label, ha='center', va='bottom')


sns.countplot(data=train_df, x='label', ax=ax[0])
add_percentages(ax[0], train_df)
ax[0].set_title('Labeled Dataset')
ax[0].set_xlabel('Label')
ax[0].set_ylabel('Count')

sns.countplot(data=unlabeled, x='label', ax=ax[1])
add_percentages(ax[1], unlabeled)
ax[1].set_title('Unlabeled Dataset')
ax[1].set_xlabel('Ground Truth')

sns.countplot(data=validation, x='label', ax=ax[2])
add_percentages(ax[2], validation)
ax[2].set_title('Validation Dataset')
ax[2].set_xlabel('Label')

plt.tight_layout()
plt.show()

The distribution of labels in the labeled, unlabeled, and validation datasets is visualized using count plots. The percentage of positive and negative reviews in each dataset is displayed, providing insights into the class distribution of the data. As we can see, the datasets are balanced with a 50/50 split between positive and negative reviews making imbalanced classes no issue for our models.

In [None]:
preped_datasets = load_and_prep_datasets("../data", nested_splits=True)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.barplot(x=list(preped_datasets['nested_splits'].keys()),
            y=[len(value) for value in preped_datasets['nested_splits'].values()], ax=ax)
ax.set_title('Nested Split Sizes')
ax.set_xlabel('Nested Split Fraction')
ax.set_ylabel('Count')

total = len(preped_datasets['train'])

add_percentages(ax, preped_datasets['train'])

plt.show()

The sizes of the nested splits generated from the training data are visualized using a bar plot. The count of samples in each nested split is displayed, along with the percentage of the total training data that each split represents. The nested splits are created in 25% increments, starting from 25% of the training data and increasing to 100% of the training data. In this visualization the nested split was only applied to the labeled data, but proportion-wise the picture would be the same if it were applied to the weak labels.

In [None]:
_, ax = plt.subplots(1, 4, figsize=(10, 5), sharey=True)

for i, (key, value) in enumerate(preped_datasets['nested_splits'].items()):
    value = value.to_pandas()
    sns.countplot(data=value, x='label', ax=ax[i])
    add_percentages(ax[i], value, percentages=False)
    ax[i].set_title(f'Nested Split {float(key) * 100}%')
    ax[i].set_xlabel('Label')
    ax[i].set_ylabel('Count')

The distribution of labels in each nested split is visualized using count plots. The count of positive and negative reviews in each nested split is displayed, providing insights into the class distribution of the data. The nested splits maintain an approximately balanced distribution of positive and negative reviews, as the set the data is sampled from is also balanced. This helps ensure that the models are trained on a representative subset of the data and are not biased towards one class.

## Helper Functions
In order to make the notebook more readable and to avoid code duplication, we will define some helper functions that will be used throughout the notebook. These functions will help us to perform common tasks such as running experiments, evaluating pipelines, and visualizing the results.


In [None]:
MODEL_DIR = os.getenv("MODELS_DIR")


def plot_additional_metrics(results_data, model_names):
    fig = plt.figure(figsize=(16, 13))
    gs = fig.add_gridspec(nrows=3, ncols=2, height_ratios=[3, 1.5, 1.5], wspace=0.3, hspace=0.6)

    # Precision-Recall curve
    ax_pr = fig.add_subplot(gs[0, 0])
    for i, model_results in enumerate(results_data):
        precision, recall, _ = precision_recall_curve(model_results['eval_true_labels'],
                                                      model_results['eval_pred_probs'])
        auprc = auc(recall, precision)
        ax_pr.plot(recall, precision, label=f'{model_names[i]} (AUPRC = {auprc:.2f})')
    ax_pr.set_title('Precision-Recall Curve', fontsize=14, pad=15)
    ax_pr.set_xlabel('Recall', fontsize=12, labelpad=10)
    ax_pr.set_ylabel('Precision', fontsize=12, labelpad=10)
    ax_pr.legend(loc='lower left', fontsize=11, bbox_to_anchor=(0.0, 0.0), borderaxespad=0.5)
    ax_pr.grid(True)

    # ROC curve and AUC
    ax_roc = fig.add_subplot(gs[0, 1])
    for i, model_results in enumerate(results_data):
        fpr, tpr, _ = roc_curve(model_results['eval_true_labels'], model_results['eval_pred_probs'])
        roc_auc = auc(fpr, tpr)
        ax_roc.plot(fpr, tpr, label=f'{model_names[i]} (AUC = {roc_auc:.2f})')
    ax_roc.plot([0, 1], [0, 1], 'k--')
    ax_roc.set_title('ROC Curve', fontsize=14, pad=15)
    ax_roc.set_xlabel('False Positive Rate', fontsize=12, labelpad=10)
    ax_roc.set_ylabel('True Positive Rate', fontsize=12, labelpad=10)
    ax_roc.legend(loc='lower right', fontsize=11, bbox_to_anchor=(1.0, 0.0), borderaxespad=0.5)
    ax_roc.grid(True)

    # Confusion matrices
    for i, model_results in enumerate(results_data):
        ax_cm = fig.add_subplot(gs[1 + i // 2, i % 2])
        cm = model_results['eval_confusion_matrix']
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax_cm, xticklabels=['Negative', 'Positive'],
                    yticklabels=['Negative', 'Positive'], cbar=False, annot_kws={"size": 14})
        ax_cm.set_title(f'Confusion Matrix - {model_names[i]}', fontsize=13, pad=15)
        ax_cm.set_xlabel('Predicted', fontsize=11, labelpad=10)
        ax_cm.set_ylabel('True', fontsize=11, labelpad=10)
        ax_cm.tick_params(axis='both', labelsize=11)

    plt.show()


def plot_model_performance(results_data, model_names, baseline_data=None, baseline_name='Baseline'):
    results = results_data

    fractions = sorted(results[0].keys(), key=float)
    num_fractions = len(fractions)

    fig, ax = plt.subplots(figsize=(10, 6))
    x = range(num_fractions)

    if baseline_data:
        baseline_accuracy = baseline_data['eval_accuracy']
        baseline_f1 = baseline_data['eval_f1_weighted']
        ax.axhline(y=baseline_accuracy, color='r', linestyle='--', label=baseline_name)
        ax.axhline(y=baseline_f1, color='g', linestyle='--', label=f'{baseline_name} (F1-Score Weighted)')

    for i, model_results in enumerate(results):
        values = [model_results[fraction]['eval_accuracy'] for fraction in fractions]
        values_f1 = [model_results[fraction]['eval_f1_weighted'] for fraction in fractions]
        ax.plot(x, values, marker='o', label=model_names[i])
        ax.plot(x, values_f1, marker='x', label=f'{model_names[i]} (F1-Score Weighted)')

    ax.set_xticks(x)
    ax.set_xticklabels(fractions)
    ax.set_xlabel('Fraction of Labeled Samples')
    ax.set_ylabel('Accuracy')
    ax.set_title('Model Accuracy Comparison')
    ax.set_ylim(0, 1)
    ax.legend(loc='lower right')
    ax.grid(True)

    plt.tight_layout()
    plt.show()

## Fine-Tuning & Transfer Learning Methods

This section defines the methods used to train the models. All the model training code is stored in the `src/model_pipeline.py` module. The module mainly utilizes the [`transformers` library](https://huggingface.co/docs/transformers/en/index) for training the models.

For both transfer learning and fine-tuning, the `all-MiniLM-L6-v2` model is used. The model is pretrained on a large text corpus that does not include the Amazon Polarity dataset; otherwise, evaluating on this dataset would cause data leakage.

The models are loaded in a sequence classification configuration, which adds a classification head on top of the pretrained model. Transfer learning and fine-tuning differ in which weights are updated during training:

- **Transfer Learning**: The model is used as a feature extractor, and only the classification head is trained on the labeled data. We therefore freeze the pretrained model's weights and only update the weights of the classification head during training.
- **Fine-Tuning**: The entire model is trained on the labeled data, including the pretrained model and the classification head. We will not freeze any weights and update all the weights during training.

The training process is the same for both methods, with the only difference being the weights that are updated during training.

For tokenization, the tokenizer matching the pretrained model is loaded via the `transformers` library. It converts the text data into input features that can be fed into the model.

We are using the hyperparameters [from the Hugging Face documentation](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for all training processes:

> We trained our model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core). We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with a 2e-5 learning rate [...].


## Baseline Model Performance
The selected model is a pretrained language model, specifically `sentence-transformers/all-MiniLM-L6-v2`, which is used as the baseline model for sentiment classification without any training. 
   
Before we train any models, we will evaluate the performance of the baseline model on the validation set. This will establish a benchmark for comparison with the models trained using transfer learning, fine-tuning, and weak labeling techniques. 

The results will be evaluated using accuracy, precision, recall, and F1-score metrics.
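
To make these metrics concrete, the sketch below derives them from the four cells of a binary confusion matrix, including a class-weighted F1-score like the `eval_f1_weighted` value loaded later. It is a plain-Python illustration; the function names `binary_metrics` and `weighted_f1` and the example counts are assumptions, and the pipeline's own evaluation code may compute these values differently.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def weighted_f1(tn, fp, fn, tp):
    """Average the per-class F1 scores, weighted by each class's true count."""
    f1_pos = binary_metrics(tn, fp, fn, tp)["f1"]
    # Swapping the roles of the classes gives the F1 of the negative class.
    f1_neg = binary_metrics(tp, fn, fp, tn)["f1"]
    n_pos, n_neg = tp + fn, tn + fp
    total = n_pos + n_neg
    return (f1_pos * n_pos + f1_neg * n_neg) / total if total else 0.0


# Hypothetical confusion-matrix counts, only to show the call pattern.
print(binary_metrics(tn=40, fp=10, fn=20, tp=30)["accuracy"])  # 0.7
```

On a balanced validation set such as ours, accuracy and weighted F1 tend to be close, which is why the later plots show them side by side.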

**DISCLAIMER**: No training or evaluation is done in this notebook. It only loads the results of the training and evaluation runs performed via the `src/model_pipeline.py` module. For more information, see the README file.

In [None]:
relevant_metrics = ['eval_accuracy', 'eval_precision', 'eval_recall', 'eval_f1_weighted']

with open(f'{MODEL_DIR}/eval/eval_results.json') as file:
    baseline_data = json.load(file)

metrics = [metric for metric in relevant_metrics if metric in baseline_data]
values = [baseline_data[metric] for metric in metrics]

fig, ax = plt.subplots(1, 1, figsize=(8, 5))
sns.barplot(x=metrics, y=values, ax=ax)

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height, f'{height * 100:.2f}%', ha='center', va='bottom')

ax.set_title('Baseline Model Performance')
ax.set_xlabel('Metric')
ax.set_ylabel('Value')
ax.set_ylim(0, 1)

ax.grid(True, zorder=-1)
plt.show()

The results speak for themselves. The model without any training achieves very poor results. With around 50% accuracy, it is as good as random guessing.

This sets the perfect foundation for our experiments. 


## Supervised Learning Performance

Before we dive deeper into the impact of weak labeling on the model performance, we will first decide whether we will train our model via transfer learning or fine-tuning.

For this we will train the model using the nested splits on both techniques and compare the results. The results are stored in the `data/eval` directory as `.json` files and are an aggregation of the results from the `model_pipeline.py` output.


### Transfer Learning

In [None]:
with open(f'{MODEL_DIR}/supervised/transfer_nested/eval_results.json') as file:
    transfer_nested_data = json.load(file)

plot_model_performance([transfer_nested_data], ['Transfer Learning'], baseline_data)

The results show that the transfer learning model barely outperforms the baseline model. This puts the model's performance into the same random guessing category as the baseline model.

In [None]:
plot_additional_metrics([results for k, results in transfer_nested_data.items()],
                        [f'Transfer Learning {percentage}' for percentage in transfer_nested_data.keys()])

Looking at the individual splits we can see a distinct difference between 0.25 split and the other ones. The 0.25 split has a very high true positive and false positive count, indicating that the model with little training data appears to show some kind of bias towards classifying reviews as positive.

As soon as we increase the training data size, this bias appears to flip. For 0.5, 0.75 and 1.0 we see a tendency towards classifying reviews as negative. This may indicate that the decision threshold of the model with little training data is not optimal and that the model is not able to generalize well.

This indicates that the pretrained model's knowledge is not sufficient to achieve high performance on the sentiment analysis task. It could also be that further tweaking of the model's classification head is necessary to increase model flexibility and thus possibly help to achieve better results. For the scope of this mini-challenge, we will not further investigate any tweaks to the model's classification head.
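
The decision-threshold hypothesis mentioned above can be illustrated with a small sketch. It shows how moving the threshold on the predicted positive-class probabilities shifts a classifier from over-predicting positives to over-predicting negatives. The labels and probabilities are made up for this example; in the notebook, inputs of this kind would come from the `eval_true_labels` and `eval_pred_probs` entries of the loaded results, assuming they hold binary labels and positive-class probabilities.

```python
def confusion_at_threshold(true_labels, pred_probs, threshold=0.5):
    """Return (tn, fp, fn, tp) when predicting positive for prob >= threshold."""
    tn = fp = fn = tp = 0
    for label, prob in zip(true_labels, pred_probs):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical labels and positive-class probabilities.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.45, 0.55, 0.60, 0.52, 0.58, 0.70]

print(confusion_at_threshold(y_true, y_prob, threshold=0.5))   # (1, 2, 0, 3)
print(confusion_at_threshold(y_true, y_prob, threshold=0.65))  # (3, 0, 2, 1)
```

With the lower threshold the toy classifier produces false positives but no false negatives; with the higher threshold the errors flip to false negatives, mirroring the kind of bias reversal seen between the splits.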


### Fine-tuning

In [None]:
with open(f'{MODEL_DIR}/supervised/finetune_nested/eval_results.json') as file:
    finetune_nested_data = json.load(file)

plot_model_performance([finetune_nested_data], ['Fine-tuning'], baseline_data)

The results show that fine-tuning outperforms both the baseline model and the transfer learning model. We can also see that the model's performance stagnates after 75% of the labeled data (roughly 188 of the 250 labeled samples). This suggests that the model has reached its capacity to learn from the data and that adding more data does not substantially improve the performance.


In [None]:
plot_additional_metrics([results for k, results in finetune_nested_data.items()],
                        [f'Fine-Tuning {percentage}' for percentage in finetune_nested_data.keys()])

Apart from the model performing better than the transfer learning model, we can see the exact same pattern as before. The 0.25 split has a very high true positive and false positive count, indicating that the model with little training data appears to show some kind of bias towards classifying reviews as positive.

Only with the 1.0 split do we see the first signs of the model being able to pick up the differences between the classes. This is also reflected in the confusion matrix, where the true positive and true negative counts are much higher than the false positive and false negative counts. This is an improvement compared to the transfer learning model, which performed poorly on all splits.

## Semi-Supervised Learning Performance
Having established that fine-tuning is the better approach for training the model, we will now evaluate the performance of the semi-supervised learning techniques. We will compare the fine-tuned model trained with weak labels to the model trained on labeled data only.

The same nested split logic as above is used, with the small difference that each split contains the fully labeled data. This means that the nested split is applied to the weak labels and then concatenated with the fully labeled data enabling us to speak of the proportion of weak labels in the whole dataset when talking about the nested splits.
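
The resulting training set sizes follow directly from this construction. The sketch below works through the arithmetic using the sample counts stated in this notebook; the function name `training_set_size` and the rounding-down behaviour are assumptions for illustration.

```python
N_LABELED = 250      # fully labeled samples (from the text)
N_UNLABELED = 2250   # pool that receives weak labels (from the text)


def training_set_size(weak_fraction, n_labeled=N_LABELED, n_weak_pool=N_UNLABELED):
    """Size of the training set: all labeled data plus a share of weak labels.

    ASSUMPTION: the weak-label count is rounded down, which matches the 562
    weakly labeled samples this notebook reports for the 25% split.
    """
    return n_labeled + int(n_weak_pool * weak_fraction)


for fraction in (0.25, 0.5, 0.75, 1.0):
    print(fraction, training_set_size(fraction))
# 0.25 812
# 0.5 1375
# 0.75 1937
# 1.0 2500
```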

### Logistic Regression (LogReg)

The weak labeling technique used is logistic regression. The logistic regression model is trained on the labeled data and then used to predict the labels for the unlabeled data. The predicted labels are then used as weak labels for the training data. More on the weak labeling technique and approach can be found in the `notebooks/weak_labeling.ipynb` notebook.
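
The idea can be sketched end to end with a toy example. The code below uses a hand-rolled logistic regression in pure Python (batch gradient descent) purely for illustration; it is not the implementation from `notebooks/weak_labeling.ipynb`, which may use a library model and different features. The 2-d feature vectors, learning rate, epoch count, and all function names are assumptions.

```python
import math


def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias of a logistic regression with batch gradient descent."""
    n_features = len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def weak_label(unlabeled, weights, bias, threshold=0.5):
    """Assign weak labels by thresholding the predicted probabilities."""
    return [1 if predict_proba(x, weights, bias) >= threshold else 0 for x in unlabeled]


# Hypothetical 2-d features standing in for text representations.
X_labeled = [[0.0, 0.2], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_labeled = [0, 0, 1, 1]
X_unlabeled = [[0.05, 0.1], [0.95, 0.9]]

w, b = train_logreg(X_labeled, y_labeled)
print(weak_label(X_unlabeled, w, b))  # [0, 1]
```

The weak labels produced this way are then treated like regular labels when fine-tuning, which is why their quality depends on how well the logistic regression separates the classes.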

In [None]:
with open(f'{MODEL_DIR}/semi-supervised/finetune_nested/eval_results.json') as file:
    logreg_nested_data = json.load(file)

plot_model_performance([logreg_nested_data], ['Weak-Labeling Fine-Tuning'], finetune_nested_data["1.0"],
                       baseline_name='Fine-Tuning 100% (Fully Labeled)')


Adding weak labels to the dataset has a significant impact on model performance. With only an additional 25% of the weakly labeled data (812 samples in total: 250 labeled and 562 weakly labeled), the model achieves an accuracy (and weighted F1-score) of around 88%. This is a substantial improvement over the fine-tuning model, which only used the fully labeled data.

However, we can also see an interesting pattern: increasing the share of weak labels from 25% to 100% has little to no impact on either weighted F1-score or accuracy. This suggests that the model has reached its capacity to learn from the data and that adding more data does not substantially improve the performance. We even spot a slight decrease in performance at 75% weak labels.


In [None]:
plot_additional_metrics([results for k, results in logreg_nested_data.items()],
                        [f'Weak-Labeling Fine-Tuning {percentage}' for percentage in logreg_nested_data.keys()])

Looking at the individual splits, we can see that they all perform similarly, as already shown in the nested performance plot. The confusion matrices show that all splits have about the same number of wrong predictions. However, it is interesting to note that the 0.25 split has more false positives than false negatives, whereas all other splits have more false negatives than false positives. The difference is minor, but it is still a pattern that can be observed.

## Model Comparison & Analysis
Now we will look at how even a small amount of weakly labeled data can improve model performance compared to training on the fully labeled data alone.

For this we will specifically look at the 0.25 weak labeling split as it has shown promising results with only a fraction of the data. This puts us at around 812 samples in total, 250 labeled and 562 weakly labeled.

In [None]:
with open(f'{MODEL_DIR}/semi-supervised/finetune_nested/eval_results.json') as file:
    logreg_nested_data = json.load(file)

best_transfer = transfer_nested_data["1.0"]
best_finetune = finetune_nested_data["1.0"]
best_weak_labeling = logreg_nested_data["0.25"]

plot_additional_metrics([best_transfer, best_finetune, best_weak_labeling],
                        ['Transfer Learning', 'Fine-Tuning', 'Weak-Labeling Fine-Tuning 25% split'])

The model trained with weak labels outperforms both the transfer learning and fine-tuning models by a large margin. This is especially interesting given that we are using only a fraction of the weak labeled data. 

### Critical Reflection
Even though the trend was already visible in the previous sections, the comparison shows how much improvement can be made with weak labeling: the weak labeling model outperforms both the transfer learning and fine-tuning models by a large margin while using only a fraction of the weakly labeled data.

However, it needs to be highlighted that even the 25% nested split introduces an additional 562 weakly labeled samples, which is more than twice the 250 fully labeled samples used for training. Even though the results are promising, this raises the question of whether even fewer weakly labeled samples could achieve similar results. In a future iteration of this experiment, it would be interesting to see how much performance the model retains with even fewer weakly labeled samples.

However, this could also be an indicator that the model has reached its capacity to learn from the data and adding more data does not substantially improve the performance. Therefore trying out different pretrained models or tweaking the classification head could be a next step to further improve the model performance and perhaps reach higher performance with more weakly labeled samples. For reference [other approaches](https://arxiv.org/abs/1904.12848v6) achieved up to 97% accuracy on the Amazon Polarity dataset. But this has to be taken with a grain of salt as the approach is not fully comparable to ours (Augmentation, more data, etc.).
