# Task
Train a RoBERTa-large model for binary sentence classification using the data in "PLUE/PLUE-main/privacyqa/policy_train_data.csv" and "PLUE/PLUE-main/privacyqa/policy_test_data.csv". Hold out 10% of the training data for validation.

## Load data

### Subtask:
Load the training and test data from the specified CSV files.


In [None]:
from datasets import load_dataset

train_dataset = load_dataset('csv', data_files='PLUE/PLUE-main/data/privacyqa/policy_train_data.csv', delimiter='\t')
test_dataset = load_dataset('csv', data_files='PLUE/PLUE-main/data/privacyqa/policy_test_data.csv', delimiter='\t')

# Hold out 10% of the training data for validation (fixed seed for reproducibility)
split = train_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
validation_dataset = split["test"]

print("Training dataset:", train_dataset)
print("Validation dataset:", validation_dataset)
print("Test dataset:", test_dataset)

Training dataset: Dataset({
    features: ['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query', 'Segment', 'Label'],
    num_rows: 166680
})
Validation dataset: Dataset({
    features: ['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query', 'Segment', 'Label'],
    num_rows: 18520
})
Test dataset: DatasetDict({
    train: Dataset({
        features: ['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query', 'Segment', 'Any_Relevant', 'Ann1', 'Ann2', 'Ann3', 'Ann4', 'Ann5', 'Ann6'],
        num_rows: 62150
    })
})
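
A plain random split does not guarantee that the validation set keeps the same share of each label as the training set, which matters if one label turns out to be rare. The sketch below shows one way to do a stratified 10% hold-out in pure Python. The helper `stratified_split` and the toy labels are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, val_idx) so each label keeps roughly its original share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each label
        n_val = round(len(indices) * test_size)   # per-label hold-out size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels (hypothetical): 960 "Irrelevant" rows and 40 "Relevant" rows.
toy_labels = ["Irrelevant"] * 960 + ["Relevant"] * 40
train_idx, val_idx = stratified_split(toy_labels)
val_labels = [toy_labels[i] for i in val_idx]
print(len(train_idx), len(val_idx), val_labels.count("Relevant"))
```

With the `datasets` library, `train_test_split` also accepts a `stratify_by_column` argument, but it expects that column to be a `ClassLabel` feature, so the string `Label` column would need to be converted first.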


## Preprocess data

### Subtask:
Tokenize the text data using the RoBERTa tokenizer.

In [None]:
from transformers import RobertaTokenizerFast
from datasets import Value

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")

def tokenize_function(examples):
    return tokenizer(examples["Segment"], padding="max_length", truncation=True)

def convert_labels_to_int(examples):
    label_map = {"Irrelevant": 0, "Relevant": 1}
    return {"labels": [label_map[label] for label in examples["Label"]]}


tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset["train"].map(tokenize_function, batched=True, remove_columns=['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query',  'Ann1', 'Ann2', 'Ann3', 'Ann4', 'Ann5', 'Ann6'])

tokenized_train_dataset = tokenized_train_dataset.map(convert_labels_to_int, batched=True)
tokenized_validation_dataset = tokenized_validation_dataset.map(convert_labels_to_int, batched=True)


print("Tokenized training dataset:", tokenized_train_dataset)
print("Tokenized validation dataset:", tokenized_validation_dataset)
print("Tokenized test dataset:", tokenized_test_dataset)

Tokenized training dataset: Dataset({
    features: ['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query', 'Segment', 'Label', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 166680
})
Tokenized validation dataset: Dataset({
    features: ['Folder', 'DocID', 'QueryID', 'SentID', 'Split', 'Query', 'Segment', 'Label', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 18520
})
Tokenized test dataset: Dataset({
    features: ['Segment', 'Any_Relevant', 'input_ids', 'attention_mask'],
    num_rows: 62150
})


## Define the model

### Subtask:
Load the pre-trained RoBERTa-large model for sequence classification.

In [None]:
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model

### Subtask:
Train the RoBERTa-large model for binary sentence classification.

In [None]:
print("Tokenized training dataset features:", tokenized_train_dataset.features)
print("Tokenized validation dataset features:", tokenized_validation_dataset.features)
print("First element of tokenized training dataset:", tokenized_train_dataset[0])
print("First element of tokenized validation dataset:", tokenized_validation_dataset[0])

Tokenized training dataset features: {'Folder': Value('string'), 'DocID': Value('string'), 'QueryID': Value('string'), 'SentID': Value('string'), 'Split': Value('string'), 'Query': Value('string'), 'Segment': Value('string'), 'Label': Value('string'), 'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8')), 'labels': Value('int64')}
Tokenized validation dataset features: {'Folder': Value('string'), 'DocID': Value('string'), 'QueryID': Value('string'), 'SentID': Value('string'), 'Split': Value('string'), 'Query': Value('string'), 'Segment': Value('string'), 'Label': Value('string'), 'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8')), 'labels': Value('int64')}
First element of tokenized training dataset: {'Folder': '../../Dataset/Train/com.intuit.quickbooks', 'DocID': 'QuickBooks Accounting: Invoicing & Expenses _9', 'QueryID': 'QuickBooks Accounting: Invoicing & Expenses _9_46', 'SentID': 'QuickBooks Accounting: Invoicing & Expenses _9_46_151', 'Sp

# Task
Analyze the class distribution of the "Relevant" and "Irrelevant" labels in the training dataset, explore and implement techniques to handle class imbalance, update the training code to use the balanced dataset or incorporate class weights, retrain and evaluate the model on the validation and test sets, and summarize the findings.

In [None]:
label_counts = train_dataset.to_pandas()['Label'].value_counts()
print("Class distribution in the training dataset:")
print(label_counts)

Class distribution in the training dataset:
Label
Irrelevant    160280
Relevant        6400
Name: count, dtype: int64
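
The counts above imply roughly 25 "Irrelevant" rows for every "Relevant" row. One common way to turn such counts into class weights is the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below applies it to the printed counts; the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Counts copied from the class distribution printed above.
observed_counts = {"Irrelevant": 160280, "Relevant": 6400}
class_weights = balanced_class_weights(observed_counts)
for label, weight in class_weights.items():
    print(f"{label}: {weight:.4f}")
```

The rare class receives the larger weight, so each "Relevant" example would count for more in a weighted loss than each "Irrelevant" example.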


In [None]:
import numpy as np
from datasets import Dataset, concatenate_datasets

# Separate the dataset by label
irrelevant_dataset = train_dataset.filter(lambda example: example['Label'] == 'Irrelevant')
relevant_dataset = train_dataset.filter(lambda example: example['Label'] == 'Relevant')

# Get the number of samples in the minority class
n_relevant = len(relevant_dataset)

# Randomly select a subset of the majority class (seeded for reproducibility)
rng = np.random.default_rng(42)
irrelevant_indices = rng.choice(len(irrelevant_dataset), size=n_relevant, replace=False)
downsampled_irrelevant_dataset = irrelevant_dataset.select(irrelevant_indices)

# Combine the downsampled majority class with the minority class
balanced_train_dataset = concatenate_datasets([downsampled_irrelevant_dataset, relevant_dataset])

# Shuffle the balanced dataset
balanced_train_dataset = balanced_train_dataset.shuffle(seed=42)

# Print the class distribution of the new balanced dataset
balanced_label_counts = balanced_train_dataset.to_pandas()['Label'].value_counts()
print("Class distribution in the balanced training dataset:")
print(balanced_label_counts)


Class distribution in the balanced training dataset:
Label
Irrelevant    6452
Relevant      6452
Name: count, dtype: int64


In [None]:
from transformers import Trainer, TrainingArguments
import numpy as np
import evaluate
from datasets import Features, Value, Sequence

# Load each metric once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
    recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    return {
        "accuracy": accuracy["accuracy"],
        "precision": precision["precision"],
        "recall": recall["recall"],
        "f1": f1["f1"],
    }

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    remove_unused_columns=False,
)

# Re-apply tokenization and label conversion to the balanced training dataset
tokenized_balanced_train_dataset = balanced_train_dataset.map(tokenize_function, batched=True)
tokenized_balanced_train_dataset = tokenized_balanced_train_dataset.map(convert_labels_to_int, batched=True)

# Re-apply tokenization and label conversion to the validation dataset
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = tokenized_validation_dataset.map(convert_labels_to_int, batched=True)

# Define the expected features for the trainer
trainer_features = Features({
    'input_ids': Sequence(Value('int32')),
    'attention_mask': Sequence(Value('int8')),
    'labels': Value('int64')
})

# Select and cast columns for the trainer datasets
train_dataset_for_trainer = tokenized_balanced_train_dataset.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)
eval_dataset_for_trainer = tokenized_validation_dataset.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_for_trainer,
    eval_dataset=eval_dataset_for_trainer,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.6927,0.703246,0.039795,0.001584,0.039795,0.003046


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=1600, training_loss=0.7052149820327759, metrics={'train_runtime': 2607.3412, 'train_samples_per_second': 4.909, 'train_steps_per_second': 0.614, 'total_flos': 1.19287214505984e+16, 'train_loss': 0.7052149820327759, 'epoch': 1.0})

## Evaluate on validation set (dev set)

### Subtask:
Evaluate the performance of the retrained model on the validation set.

**Reasoning**:
Use the `trainer.evaluate` method to compute evaluation metrics on the tokenized validation dataset.

In [None]:
evaluation_results = trainer.evaluate(eval_dataset_for_trainer)
print("Validation (dev set) results:", evaluation_results)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Validation (dev set) results: {'eval_loss': 0.7032455205917358, 'eval_accuracy': 0.03979481641468682, 'eval_precision': 0.0015836274134786279, 'eval_recall': 0.03979481641468682, 'eval_f1': 0.003046038292322188, 'eval_runtime': 482.4324, 'eval_samples_per_second': 38.389, 'eval_steps_per_second': 4.799, 'epoch': 1.0}


## Evaluate on test set

### Subtask:
Attempt to evaluate the performance of the retrained model on the test set.

**Reasoning**:
Use the `trainer.predict` method to get predictions on the tokenized test dataset. Acknowledge the potential lack of ground truth labels in the test set and the implications for evaluation metrics.

In [None]:
# Using trainer.predict to get predictions on the tokenized test dataset.
# Note: As observed before, the test dataset does not contain a 'labels' column
# suitable for direct evaluation with standard metrics (accuracy, precision, etc.).
# This step will primarily produce the model's output (logits).
predictions = trainer.predict(tokenized_test_dataset)
print("Predictions on the test set (logits and labels):", predictions)

# You can access the raw logits and potentially the predicted class IDs:
# logits = predictions.predictions
# print("\nRaw logits (first 10):", logits[:10])

# To get the predicted class IDs:
# predicted_class_ids = np.argmax(logits, axis=1)
# print("Predicted class IDs (first 10):", predicted_class_ids[:10])

# If you had a way to obtain ground truth labels for the test set separately,
# you could then use compute_metrics((logits, ground_truth_labels)) to get full evaluation metrics.
# Without ground truth labels in the test dataset provided, a complete evaluation is not possible.

# If 'Any_Relevant' is intended as the ground truth label, we can attempt to use it,
# assuming it can be cast to the correct type (int64).
# We attempted this before, and it might fail if the data is not in the expected format.

print("\nAttempting to use 'Any_Relevant' for evaluation if possible:")
try:
    # Ensure 'Any_Relevant' is in the correct format (int64) if possible
    # Note: This might raise an error if 'Any_Relevant' values are not convertible to int64
    test_labels_for_eval = tokenized_test_dataset['Any_Relevant']
    if test_labels_for_eval.features.dtype != Value('int64').dtype:
         # Attempt to cast if not int64
         test_labels_for_eval = tokenized_test_dataset.cast_column('Any_Relevant', Value('int64'))['Any_Relevant']


    # Compute metrics using predictions and the 'Any_Relevant' column
    test_metrics = compute_metrics((predictions.predictions, test_labels_for_eval))
    print("\nTest Metrics (using 'Any_Relevant' as labels):", test_metrics)

except Exception as e:
    print(f"\nCould not use 'Any_Relevant' as labels for metric computation. Error: {e}")
    print("A full evaluation on the test set is not possible without suitable ground truth labels.")

Predictions on the test set (logits and labels): PredictionOutput(predictions=array([[-0.04330444, -0.02172852],
       [-0.04333496, -0.021698  ],
       [-0.04336548, -0.02175903],
       ...,
       [-0.04333496, -0.02177429],
       [-0.04336548, -0.02178955],
       [-0.04333496, -0.02172852]], dtype=float32), label_ids=None, metrics={'test_runtime': 1411.4771, 'test_samples_per_second': 44.032, 'test_steps_per_second': 5.504})

Attempting to use 'Any_Relevant' for evaluation if possible:



Could not use 'Any_Relevant' as labels for metric computation. Error: Failed to parse string: 'Irrelevant' as a scalar of type int64
A full evaluation on the test set is not possible without suitable ground truth labels.
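
The logits in the `PredictionOutput` above are almost identical from row to row. Converting a few of them to probabilities makes this easier to read. The sketch below uses NumPy only; the three rows are copied from the printed output, and the `softmax` helper is written here for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# First three logit rows copied from the PredictionOutput above.
sample_logits = np.array([
    [-0.04330444, -0.02172852],
    [-0.04333496, -0.021698],
    [-0.04336548, -0.02175903],
])
probabilities = softmax(sample_logits)
predicted_class_ids = probabilities.argmax(axis=-1)
print(probabilities.round(4))
print(predicted_class_ids)
```

Each of these rows is assigned to class 1 ("Relevant" under `label_map`) with a probability of about 0.505, so the model's output is close to a coin flip and barely depends on the input sentence.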


In [None]:
# Convert 'Any_Relevant' column to numerical labels (0 and 1)
# We need to add a 'labels' column to the test dataset for evaluation.
# The convert_labels_to_int function expects a 'Label' column, so we'll
# temporarily rename 'Any_Relevant' to 'Label' for the mapping.

# First, let's inspect the unique values in 'Any_Relevant' to be sure they are consistent with the label_map
unique_any_relevant_values = set(tokenized_test_dataset['Any_Relevant'])
print("Unique values in 'Any_Relevant':", unique_any_relevant_values)

if unique_any_relevant_values.issubset({"Irrelevant", "Relevant"}):
    # Temporarily rename 'Any_Relevant' to 'Label' for the mapping function
    test_dataset_with_label = tokenized_test_dataset.rename_column("Any_Relevant", "Label")

    # Apply the conversion function
    tokenized_test_dataset_with_labels = test_dataset_with_label.map(convert_labels_to_int, batched=True)

    # Remove the temporary 'Label' column and the original 'Any_Relevant' if it still exists
    # Keep only 'input_ids', 'attention_mask', and the new 'labels' column for evaluation
    eval_test_dataset_for_trainer = tokenized_test_dataset_with_labels.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)


    print("\nEvaluating on the test set with converted labels:")
    try:
        test_results = trainer.evaluate(eval_test_dataset_for_trainer)
        print("Test set evaluation results:", test_results)
    except Exception as e:
        print(f"Error during test set evaluation: {e}")
        print("Could not complete evaluation on the test set even after label conversion.")

else:
    print("\n'Any_Relevant' column contains unexpected values. Cannot convert to numerical labels for evaluation.")
    print("A full evaluation on the test set is not possible without suitable ground truth labels.")

Unique values in 'Any_Relevant': {'Irrelevant', 'Relevant'}



Evaluating on the test set with converted labels:


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Test set evaluation results: {'eval_loss': 0.7019492387771606, 'eval_accuracy': 0.09998390989541432, 'eval_precision': 0.009996782237974329, 'eval_recall': 0.09998390989541432, 'eval_f1': 0.0181762335758617, 'eval_runtime': 1695.5512, 'eval_samples_per_second': 36.655, 'eval_steps_per_second': 4.582, 'epoch': 1.0}


In [None]:
from transformers import Trainer, TrainingArguments
import numpy as np
import evaluate
from datasets import Features, Value, Sequence

# Load evaluation metrics once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute overall metrics
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    precision_weighted = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
    recall_weighted = recall_metric.compute(predictions=predictions, references=labels, average="weighted")
    f1_weighted = f1_metric.compute(predictions=predictions, references=labels, average="weighted")

    # Compute per-class metrics
    # We need to handle cases where a class might not be present in predictions or labels
    # The labels are 0 for Irrelevant and 1 for Relevant
    try:
        precision_per_class = precision_metric.compute(predictions=predictions, references=labels, average=None, zero_division=0)
        recall_per_class = recall_metric.compute(predictions=predictions, references=labels, average=None, zero_division=0)
        f1_per_class = f1_metric.compute(predictions=predictions, references=labels, average=None)

        # Assuming labels are 0 for Irrelevant and 1 for Relevant based on convert_labels_to_int
        irrelevant_precision = precision_per_class['precision'][0] if len(precision_per_class['precision']) > 0 else 0
        relevant_precision = precision_per_class['precision'][1] if len(precision_per_class['precision']) > 1 else 0
        irrelevant_recall = recall_per_class['recall'][0] if len(recall_per_class['recall']) > 0 else 0
        relevant_recall = recall_per_class['recall'][1] if len(recall_per_class['recall']) > 1 else 0
        irrelevant_f1 = f1_per_class['f1'][0] if len(f1_per_class['f1']) > 0 else 0
        relevant_f1 = f1_per_class['f1'][1] if len(f1_per_class['f1']) > 1 else 0

    except ValueError:
        # Handle cases where one of the classes might not be in the labels or predictions
        irrelevant_precision = 0
        relevant_precision = 0
        irrelevant_recall = 0
        relevant_recall = 0
        irrelevant_f1 = 0
        relevant_f1 = 0
        print("Warning: Could not compute per-class metrics for all classes. This might happen if a class is missing in the predictions or references.")


    return {
        "accuracy": accuracy["accuracy"],
        "precision_weighted": precision_weighted["precision"],
        "recall_weighted": recall_weighted["recall"],
        "f1_weighted": f1_weighted["f1"],
        "precision_Irrelevant": irrelevant_precision,
        "precision_Relevant": relevant_precision,
        "recall_Irrelevant": irrelevant_recall,
        "recall_Relevant": relevant_recall,
        "f1_Irrelevant": irrelevant_f1,
        "f1_Relevant": relevant_f1,
    }

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    remove_unused_columns=False,
)

# Re-apply tokenization and label conversion to the balanced training dataset
tokenized_balanced_train_dataset = balanced_train_dataset.map(tokenize_function, batched=True)
tokenized_balanced_train_dataset = tokenized_balanced_train_dataset.map(convert_labels_to_int, batched=True)

# Re-apply tokenization and label conversion to the validation dataset
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = tokenized_validation_dataset.map(convert_labels_to_int, batched=True)

# Define the expected features for the trainer
trainer_features = Features({
    'input_ids': Sequence(Value('int32')),
    'attention_mask': Sequence(Value('int8')),
    'labels': Value('int64')
})

# Select and cast columns for the trainer datasets
train_dataset_for_trainer = tokenized_balanced_train_dataset.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)
eval_dataset_for_trainer = tokenized_validation_dataset.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)

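# Note: if the earlier training cell ran in this session, `model` already carries
# those updates; reload it with `from_pretrained` to start from the pre-trained weights.
trainer = Trainer(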
    model=model,
    args=training_args,
    train_dataset=train_dataset_for_trainer,
    eval_dataset=eval_dataset_for_trainer,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision Weighted,Recall Weighted,F1 Weighted,Precision Irrelevant,Precision Relevant,Recall Irrelevant,Recall Relevant,F1 Irrelevant,F1 Relevant
1,0.7008,0.688533,0.963013,0.927394,0.963013,0.944868,0.963013,0.0,1.0,0.0,0.981158,0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=1613, training_loss=0.6971375955533183, metrics={'train_runtime': 805.6759, 'train_samples_per_second': 16.016, 'train_steps_per_second': 2.002, 'total_flos': 1.2025642312384512e+16, 'train_loss': 0.6971375955533183, 'epoch': 1.0})

In [None]:
# Convert 'Any_Relevant' column to numerical labels (0 and 1)
# We need to add a 'labels' column to the test dataset for evaluation.
# The convert_labels_to_int function expects a 'Label' column, so we'll
# temporarily rename 'Any_Relevant' to 'Label' for the mapping.

# First, let's inspect the unique values in 'Any_Relevant' to be sure they are consistent with the label_map
unique_any_relevant_values = set(tokenized_test_dataset['Any_Relevant'])
print("Unique values in 'Any_Relevant':", unique_any_relevant_values)

if unique_any_relevant_values.issubset({"Irrelevant", "Relevant"}):
    # Temporarily rename 'Any_Relevant' to 'Label' for the mapping function
    test_dataset_with_label = tokenized_test_dataset.rename_column("Any_Relevant", "Label")

    # Apply the conversion function
    tokenized_test_dataset_with_labels = test_dataset_with_label.map(convert_labels_to_int, batched=True)

    # Remove the temporary 'Label' column and the original 'Any_Relevant' if it still exists
    # Keep only 'input_ids', 'attention_mask', and the new 'labels' column for evaluation
    eval_test_dataset_for_trainer = tokenized_test_dataset_with_labels.select_columns(['input_ids', 'attention_mask', 'labels']).cast(trainer_features)


    print("\nEvaluating on the test set with converted labels:")
    try:
        test_results = trainer.evaluate(eval_test_dataset_for_trainer)
        print("Test set evaluation results:", test_results)
    except Exception as e:
        print(f"Error during test set evaluation: {e}")
        print("Could not complete evaluation on the test set even after label conversion.")

else:
    print("\n'Any_Relevant' column contains unexpected values. Cannot convert to numerical labels for evaluation.")
    print("A full evaluation on the test set is not possible without suitable ground truth labels.")

Unique values in 'Any_Relevant': {'Relevant', 'Irrelevant'}

Evaluating on the test set with converted labels:


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Test set evaluation results: {'eval_loss': 0.6891624927520752, 'eval_accuracy': 0.9000160901045857, 'eval_precision_weighted': 0.8100289624471457, 'eval_recall_weighted': 0.9000160901045857, 'eval_f1_weighted': 0.8526548450466627, 'eval_precision_Irrelevant': 0.9000160901045857, 'eval_precision_Relevant': 0.0, 'eval_recall_Irrelevant': 1.0, 'eval_recall_Relevant': 0.0, 'eval_f1_Irrelevant': 0.9473773351625087, 'eval_f1_Relevant': 0.0, 'eval_runtime': 1088.2042, 'eval_samples_per_second': 57.112, 'eval_steps_per_second': 7.139, 'epoch': 1.0}


## Test Set Class-wise Metrics

The evaluation on the test set with the 'Any_Relevant' column converted to numerical labels yielded the following results, including class-wise metrics:

*   **eval_loss:** 0.6892
*   **eval_accuracy:** 0.9000
*   **eval_precision_weighted:** 0.8100
*   **eval_recall_weighted:** 0.9000
*   **eval_f1_weighted:** 0.8527

**Class-wise Metrics:**

*   **Precision (Irrelevant):** 0.9000
*   **Precision (Relevant):** 0.0
*   **Recall (Irrelevant):** 1.0
*   **Recall (Relevant):** 0.0
*   **F1 (Irrelevant):** 0.9474
*   **F1 (Relevant):** 0.0

**Analysis of Class-wise Metrics:**

These class-wise metrics provide a more detailed view of the model's performance.

*   **Irrelevant Class:** Recall is 1.0 and precision is 0.9000, which is the same value as the overall accuracy. Together with the zero scores for "Relevant", this shows that the model labels every test sentence "Irrelevant". The high scores therefore reflect the share of "Irrelevant" sentences in the test set, not an ability to separate the two classes.
*   **Relevant Class:** Precision, recall, and F1-score are all 0.0 for the "Relevant" class, meaning the model did not correctly classify a single "Relevant" sentence in the test set. The `UndefinedMetricWarning` (the `_warn_prf` lines in the output above) was likely raised because the model did not predict any instance as "Relevant".
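
The reported test metrics can be reproduced by a classifier that always predicts the majority class. The sketch below works this out in plain Python. The counts are an assumption: 55936 "Irrelevant" rows out of 62150 is inferred from the reported accuracy, not read from the data, and the helper name is hypothetical.

```python
def constant_prediction_metrics(n_majority, n_minority):
    """Metrics for a classifier that always predicts the majority class."""
    total = n_majority + n_minority
    support_major = n_majority / total
    precision_major = n_majority / total  # every row is predicted as majority
    recall_major = 1.0                    # no majority row is ever missed
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)
    # The minority class scores 0 on all three metrics, so it adds nothing
    # to the support-weighted averages.
    return {
        "accuracy": n_majority / total,
        "precision_weighted": support_major * precision_major,
        "recall_weighted": support_major * recall_major,
        "f1_weighted": support_major * f1_major,
        "f1_majority": f1_major,
    }

# Assumed counts, inferred from eval_accuracy and the 62150 test rows.
baseline = constant_prediction_metrics(n_majority=55936, n_minority=6214)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

Because these values match the evaluation output, the weighted scores above carry no information beyond the class prior.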

**Summary:**

The overall weighted metrics look high only because about 90% of the test sentences are "Irrelevant". The class-wise metrics show that the model, after training on the downsampled data, does not identify any "Relevant" sentence in the imbalanced test set. The training loss also stayed close to ln 2 ≈ 0.693 in both runs (0.7052 and 0.6971), which is the loss of a two-class model that outputs near-uniform probabilities. The first run ended up predicting "Relevant" for almost every sentence (validation accuracy 0.0398), the second "Irrelevant" (validation accuracy 0.9630). Downsampling alone was therefore not sufficient here, and the results suggest the model learned little beyond a constant prediction.

Further steps should explore alternative or complementary techniques for handling class imbalance, such as using class weights during training or employing different model architectures, to improve the model's ability to detect the "Relevant" sentences. Because the loss barely moved, the optimization settings are also worth checking: both runs trained for a single epoch and did not set a learning rate, so the `TrainingArguments` default was used.
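
To illustrate what class weights would change, the sketch below implements a class-weighted cross-entropy in NumPy. It normalises by the summed weights, which is the normalisation `torch.nn.CrossEntropyLoss(weight=...)` applies with its default mean reduction. The batch is hypothetical, and the weights are rounded values of `n_samples / (n_classes * class_count)` for the counts 160280 and 6400.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy, normalised by the summed weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]
    weights = class_weights[labels]  # one weight per example, chosen by its label
    return float((weights * per_example).sum() / weights.sum())

# Hypothetical batch: three "Irrelevant" (0) rows with uniform logits and
# one "Relevant" (1) row that the model confidently gets wrong.
batch_logits = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [2.0, -2.0]])
batch_labels = np.array([0, 0, 0, 1])

unweighted_loss = weighted_cross_entropy(batch_logits, batch_labels, np.array([1.0, 1.0]))
weighted_loss = weighted_cross_entropy(batch_logits, batch_labels, np.array([0.52, 13.02]))
print(f"unweighted loss: {unweighted_loss:.4f}, weighted loss: {weighted_loss:.4f}")
```

The misclassified "Relevant" row dominates the weighted loss, so gradient updates would push harder on minority-class errors. In the notebook, one way to apply this would be a `Trainer` subclass that overrides `compute_loss`; that wiring is not shown here.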