## Importing Required libraries:

In [1]:
import pandas as pd
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, balanced_accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from datasets import Dataset
import torch



Our dataset has already been preprocessed and split into training, development, and test sets, following the split provided with the original dataset. After loading these sets again, we need to convert them into the input format expected by the RoBERTa model, which is described on the corresponding Hugging Face page.

Link to RoBERTA model documentation is provided below:

https://huggingface.co/docs/transformers/en/model_doc/roberta

RoBERTa and similar pre-trained models are trained on general-purpose corpora and are not specialized for a specific task such as ours, the classification of text as sexist or not sexist. The pre-trained model can be loaded as it is, but its results and accuracy on our task are unlikely to be as good as they could be, so we fine-tune it on our data and labels. We are going to compare the results of the original and the fine-tuned models.

To describe in more detail why fine-tuning is needed:
Fine-tuning involves continuing to train the model on our labeled data, either only the last few layers or all of them (in this notebook no layers are frozen, so all weights are updated). The goal is to adapt the pre-trained weights to our task while retaining the knowledge learned during pre-training.

1. It allows the model to learn task-specific patterns and adapt to our specific domain.
2. The pre-trained model does not know about our labels; fine-tuning aligns the model's output with our specific purpose.
3. Fine-tuning can improve the performance of the model.

- Steps of fine-tuning are as follows:

1. Pre-processing the data (which we have already done), then tokenizing and preparing it for the RoBERTa model
2. Adapting the model for binary classification by adding a classification head on top of RoBERTa that outputs binary predictions
3. Training the model using DF_train, with DF_dev used to monitor performance during training and prevent overfitting
4. Evaluating the fine-tuned model on the validation set (DF_dev) with metrics such as accuracy, precision, recall, and F1 score
5. Using the fine-tuned model to predict labels for the test set (DF_test)

In [None]:
# Loading the dataset
DF_train=pd.read_csv('../data/preprocessed/DF_train.csv')
DF_dev=pd.read_csv('../data/preprocessed/DF_dev.csv')
DF_test=pd.read_csv('../data/preprocessed/DF_test.csv')
Actual_labels=pd.read_csv('../data/preprocessed/Actual_labels.csv')

In [34]:
# Converting the DataFrames into Hugging Face Dataset objects:
train_dataset = Dataset.from_pandas(DF_train[['text', 'label_sexist']])
dev_dataset = Dataset.from_pandas(DF_dev[['text', 'label_sexist']])
test_dataset = Dataset.from_pandas(DF_test[['text']])  # Test doesn't need labels for now

After loading the datasets and preparing them for the model, we need to tokenize the text, which is done with the tokenizer included in the transformers library:

### Loading the tokenizer and tokenizing the data:

In [26]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def tokenize_data(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

train_dataset = train_dataset.map(tokenize_data, batched=True)
dev_dataset = dev_dataset.map(tokenize_data, batched=True)
test_dataset = test_dataset.map(tokenize_data, batched=True)

Map: 100%|██████████| 14000/14000 [00:04<00:00, 3276.16 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 2267.73 examples/s]
Map: 100%|██████████| 4000/4000 [00:01<00:00, 2237.69 examples/s]
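To make the effect of `truncation=True`, `padding="max_length"` and `max_length` concrete, here is a minimal sketch in plain Python rather than a call into the real tokenizer. The token IDs are made up, `pad_and_truncate` is a hypothetical helper, and the padding ID of 1 is assumed to match `roberta-base`. The real tokenizer also takes care to keep the special start and end tokens when it truncates, which this toy version does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=1):
    """Toy version of truncation=True combined with padding="max_length".

    Returns (input_ids, attention_mask), both of length max_length.
    pad_id=1 is an assumption meant to mirror roberta-base's padding token ID.
    """
    # Truncation: drop everything beyond max_length
    kept = token_ids[:max_length]
    # Padding: fill the remaining positions with the padding ID
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # Attention mask: 1 for real tokens, 0 for padding positions
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token IDs for one short and one long text
short_ids, short_mask = pad_and_truncate([0, 9064, 2], max_length=6)
long_ids, long_mask = pad_and_truncate([0, 11, 12, 13, 14, 15, 16, 2], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is why the tokenized datasets can later be batched without any further padding logic.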


### Formatting data for training:

In [27]:
train_dataset = train_dataset.rename_column("label_sexist", "labels")
dev_dataset = dev_dataset.rename_column("label_sexist", "labels")

train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
dev_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

### Loading the pre-trained RoBERTa model:

In [28]:
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Setting up training parameters and arguments:

We use the training arguments given in the Hugging Face documentation mentioned above, without further tuning. They are a common starting point for fine-tuning RoBERTa, although they are not guaranteed to be optimal for this dataset.

In [30]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_total_limit=1
)

### Defining Metrics for fine-tuning the model:

In [31]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
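As a rough illustration of what `compute_metrics` calculates, the sketch below redoes the same steps in plain Python on made-up logits and labels: take the argmax of each row of logits, then count true positives, false positives and false negatives for the positive class (1), which is what `average="binary"` reports. The helper name `binary_metrics_from_logits` and the toy data are assumptions for illustration only.

```python
def binary_metrics_from_logits(logits, labels):
    """Plain-Python sketch of accuracy, precision, recall and F1 for class 1."""
    # Predicted class = index of the largest logit in each row
    predictions = [row.index(max(row)) for row in logits]

    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(labels)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Made-up logits for five examples: column 0 = not sexist, column 1 = sexist
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2], [0.3, 0.4]]
toy_labels = [0, 1, 0, 1, 1]

toy_metrics = binary_metrics_from_logits(toy_logits, toy_labels)
print(toy_metrics)
```

In this toy case the predictions are `[0, 1, 1, 0, 1]`, giving two true positives, one false positive and one false negative.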

### Fine-Tuning and training the model:

In [11]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3068 | 0.314841 | 0.8585 | 0.685206 | 0.745763 | 0.633745 |
| 2 | 0.2621 | 0.342734 | 0.8725 | 0.728435 | 0.754967 | 0.703704 |
| 3 | 0.1835 | 0.400996 | 0.8785 | 0.736728 | 0.778032 | 0.699588 |


TrainOutput(global_step=2625, training_loss=0.2861032879466102, metrics={'train_runtime': 38222.8101, 'train_samples_per_second': 1.099, 'train_steps_per_second': 0.069, 'total_flos': 2762666081280000.0, 'train_loss': 0.2861032879466102, 'epoch': 3.0})
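The `global_step=2625` in the output above can be checked by hand from the dataset size and the training arguments. The sketch below assumes training on a single device without gradient accumulation (neither is configured in `training_args`); `total_optimizer_steps` is a hypothetical helper name.

```python
import math


def total_optimizer_steps(n_samples, batch_size, n_epochs):
    """Number of optimizer steps, assuming one step per batch on one device."""
    # The last batch of an epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * n_epochs


# 14000 training examples, batch size 16, 3 epochs
steps = total_optimizer_steps(n_samples=14000, batch_size=16, n_epochs=3)
print(steps)  # 875 steps per epoch * 3 epochs = 2625
```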

### Saving the fine-tuned model for future usage:

In [12]:
output_dir = "./fine_tuned_roberta"  # same location that is printed below and loaded later as "fine_tuned_roberta"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Model and tokenizer saved to ./fine_tuned_roberta


### Evaluating the fine-tuned model:

In [13]:
results = trainer.evaluate()
print(results)

{'eval_loss': 0.40099582076072693, 'eval_accuracy': 0.8785, 'eval_f1': 0.7367280606717227, 'eval_precision': 0.7780320366132724, 'eval_recall': 0.6995884773662552, 'eval_runtime': 528.0448, 'eval_samples_per_second': 3.788, 'eval_steps_per_second': 0.237, 'epoch': 3.0}


### Testing the model on our test dataset:

In [14]:
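predictions = trainer.predict(test_dataset)
# From here on, test_dataset is a pandas DataFrame (a copy of DF_test), not the tokenized Dataset
test_dataset = DF_test.copy()
# The predicted class is the index of the largest logit in each row
test_dataset["predictions"] = torch.argmax(torch.tensor(predictions.predictions), dim=-1).numpy()
print(test_dataset.head())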

                  rewire_id  \
0    sexism2022_english-845   
1   sexism2022_english-6629   
2  sexism2022_english-17573   
3  sexism2022_english-10268   
4  sexism2022_english-10735   

                                                text  predictions  
0  fuck the niggers and the jews both have a hist...            0  
1  well then good because someone has to knock he...            1  
2  usa texas islam muslims islamization sharialaw...            0  
3  yes normal women want to be dominated social s...            1  
4  she didnt have to be a bitch about it she lite...            1  


In [15]:
test_dataset

Unnamed: 0,rewire_id,text,predictions
0,sexism2022_english-845,fuck the niggers and the jews both have a hist...,0
1,sexism2022_english-6629,well then good because someone has to knock he...,1
2,sexism2022_english-17573,usa texas islam muslims islamization sharialaw...,0
3,sexism2022_english-10268,yes normal women want to be dominated social s...,1
4,sexism2022_english-10735,she didnt have to be a bitch about it she lite...,1
...,...,...,...
3995,sexism2022_english-2356,define blatant if youre with a girl then be wi...,0
3996,sexism2022_english-17641,take a look at mgtow even chads know women are...,1
3997,sexism2022_english-6358,when youre known as the guy who argues that wo...,0
3998,sexism2022_english-8770,you shouldve asked if you could be her side piece,0


### Combining predictions and actual Labels for final Evaluation

In [41]:
# Adding actual labels to the test dataset
evaluation_data = DF_test.copy()
evaluation_data["predictions"] = torch.argmax(torch.tensor(predictions.predictions), dim=-1).numpy()
evaluation_data["actual_labels"] = Actual_labels["label_sexist"]

# Mapping numeric labels to text labels
label_map = {0: "not sexist", 1: "sexist"}
evaluation_data["predictions_text"] = evaluation_data["predictions"].map(label_map)
evaluation_data["actual_labels_text"] = evaluation_data["actual_labels"].map(label_map)

# Calculating evaluation metrics
accuracy = accuracy_score(evaluation_data["actual_labels"], evaluation_data["predictions"])
precision = precision_score(evaluation_data["actual_labels"], evaluation_data["predictions"], average="binary")
recall = recall_score(evaluation_data["actual_labels"], evaluation_data["predictions"], average="binary")
f1 = f1_score(evaluation_data["actual_labels"], evaluation_data["predictions"], average="binary")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Saving the final result to an Excel file:
output_path = "../results/predictions_with_labels_dl.xlsx"
columns_to_save = ["rewire_id", "text", "predictions_text", "actual_labels_text"]
evaluation_data[columns_to_save].to_excel(output_path, index=False)
print(f"Predictions saved to {output_path}")

# Classification report for test set
print("Test Set Classification Report:")
print(classification_report(evaluation_data["actual_labels"], evaluation_data["predictions"]))

# Calculate misclassification rate
accuracy=classification_report(evaluation_data["actual_labels"], evaluation_data["predictions"],output_dict=True)['accuracy']
misclassification_rate=1-accuracy
# Calculate balanced accuracy
balanced_accuracy=balanced_accuracy_score(evaluation_data["actual_labels"], evaluation_data["predictions"])

print(f"Misclassification Rate: {misclassification_rate:.4f}")
print(f"Balanced Accuracy: {balanced_accuracy:.4f}")

Accuracy: 0.8738
Precision: 0.7476
Recall: 0.7237
F1 Score: 0.7355
Predictions saved to ../results/predictions_with_labels_dl.xlsx
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.92      0.92      3030
           1       0.75      0.72      0.74       970

    accuracy                           0.87      4000
   macro avg       0.83      0.82      0.83      4000
weighted avg       0.87      0.87      0.87      4000

Misclassification Rate: 0.1262
Balanced Accuracy: 0.8227


The misclassification rate tells us how often the model's predictions are wrong overall, and the balanced accuracy provides a fairer measure of the model's ability to classify when the class distribution is unequal.
A misclassification rate of 0.1262 indicates that 12.62% of the predictions made by the model are incorrect.
A balanced accuracy of 0.8227 is the average of the per-class recalls: the model correctly identifies about 92% of the not-sexist samples and about 72% of the sexist samples, which averages to 82.27% regardless of how many samples each class has.
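The relationship between these two numbers and the per-class recalls can be sketched in a few lines. The confusion-matrix counts below are not printed by the notebook; they are inferred from the reported support, recall and accuracy, so they should be read as an assumption that is consistent with the rounded figures above.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output)
tn, fp = 2793, 237  # actual class 0 (not sexist), support 3030
fn, tp = 268, 702   # actual class 1 (sexist), support 970

# Recall of each class = correctly predicted samples / all samples of that class
recall_class_0 = tn / (tn + fp)
recall_class_1 = tp / (tp + fn)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = (recall_class_0 + recall_class_1) / 2

# Plain accuracy weights every sample equally, so the larger class dominates
accuracy = (tn + tp) / (tn + fp + fn + tp)
misclassification_rate = 1 - accuracy

print(f"Recall class 0: {recall_class_0:.4f}")
print(f"Recall class 1: {recall_class_1:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"Misclassification rate: {misclassification_rate:.4f}")
```

Because class 0 is about three times larger than class 1, plain accuracy is pulled toward the higher recall of class 0, while balanced accuracy gives both classes the same weight.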

In [22]:
evaluation_data

Unnamed: 0,rewire_id,text,predictions,actual_labels,predictions_text,actual_labels_text
0,sexism2022_english-845,fuck the niggers and the jews both have a hist...,0,0,not sexist,not sexist
1,sexism2022_english-6629,well then good because someone has to knock he...,1,1,sexist,sexist
2,sexism2022_english-17573,usa texas islam muslims islamization sharialaw...,0,0,not sexist,not sexist
3,sexism2022_english-10268,yes normal women want to be dominated social s...,1,1,sexist,sexist
4,sexism2022_english-10735,she didnt have to be a bitch about it she lite...,1,1,sexist,sexist
...,...,...,...,...,...,...
3995,sexism2022_english-2356,define blatant if youre with a girl then be wi...,0,0,not sexist,not sexist
3996,sexism2022_english-17641,take a look at mgtow even chads know women are...,1,1,sexist,sexist
3997,sexism2022_english-6358,when youre known as the guy who argues that wo...,0,0,not sexist,not sexist
3998,sexism2022_english-8770,you shouldve asked if you could be her side piece,0,0,not sexist,not sexist


### Analyzing the Results:

In [27]:
# Filtering rows where predictions and actual_labels are not equal
mismatched_rows = evaluation_data[evaluation_data['predictions'] != evaluation_data['actual_labels']]
mismatched_rows

Unnamed: 0,rewire_id,text,predictions,actual_labels,predictions_text,actual_labels_text
11,sexism2022_english-9087,i would pay to see her ass cracker nsfw,0,1,not sexist,sexist
23,sexism2022_english-295,most men know instinctually the game but their...,1,0,sexist,not sexist
29,sexism2022_english-13224,with interior decorating skills like yours gir...,0,1,not sexist,sexist
41,sexism2022_english-8690,isnt it convenient that she doesnt have the vi...,0,1,not sexist,sexist
48,sexism2022_english-14596,its not the same goys groping women is my come...,0,1,not sexist,sexist
...,...,...,...,...,...,...
3964,sexism2022_english-12476,awalt the guy had social status so she attache...,0,1,not sexist,sexist
3968,sexism2022_english-9520,so leading women on and using them for materia...,1,0,sexist,not sexist
3971,sexism2022_english-5262,a girl who comes up with terms like substantia...,1,0,sexist,not sexist
3976,sexism2022_english-9172,it has always seemed to me that feminists goal...,1,0,sexist,not sexist


In [28]:
output_path_mis = "../results/mismatch_dl.xlsx"
columns = ["rewire_id", "text", "predictions_text", "actual_labels_text"]
mismatched_rows[columns].to_excel(output_path_mis, index=False)

Misspellings in the text, a lack of context, and possible mislabeling of the data appear to be key factors contributing to these mismatches. For instance, the sentence *"with interior decorating skills like yours girls will be falling all over you im falling for you a little myself"* is classified by the model as not sexist, whereas the actual label is sexist. Without context or knowledge of the tone, it is hard to determine the correct classification, and the example may also point to potential biases in the actual labels.
Another example is the sentence *"this bith should be stoped shes the rapist"*, which contains misspellings and lacks context. Again, the model classifies the sentence as not sexist, although the actual label is sexist.

## Fine-tuning the fine-tuned model (fine_tuned_roberta)

In [2]:
model_fine_tuned_roberta=RobertaForSequenceClassification.from_pretrained("fine_tuned_roberta")
tokenizer_fine_tuned_roberta=RobertaTokenizer.from_pretrained("fine_tuned_roberta")

def tokenize_data(examples):
    return tokenizer_fine_tuned_roberta(examples["text"],truncation=True,padding="max_length",max_length=128)

In [43]:
# Loading the dataset
DF_test=pd.read_csv('../data/preprocessed/DF_test.csv')
DF_test=Dataset.from_pandas(DF_test[['text']])
DF_test=DF_test.map(tokenize_data,batched=True)
Actual_labels=pd.read_csv('../data/preprocessed/Actual_labels.csv')
GPT_Generated_test=pd.read_csv('../data/relabeled/generatedDataset.csv')

# LabelEncoding
label_mapping={" Sexist":1," Not sexist":0}
# Encode using the mapping
GPT_Actual_labels=GPT_Generated_test['label']
GPT_Actual_labels=[label_mapping[label] for label in GPT_Actual_labels]
GPT_Generated_test.drop('label',axis=1,inplace=True)
GPT_Generated_test.drop('id',axis=1,inplace=True)
GPT_Actual_labels=pd.DataFrame(GPT_Actual_labels)
GPT_Generated_test=Dataset.from_pandas(GPT_Generated_test[['text']])
GPT_Generated_test=GPT_Generated_test.map(tokenize_data,batched=True)

DF_train_subset=pd.read_csv('../data/relabeled/DF_train_subset_Juliane.csv')
Train,Dev=train_test_split(DF_train_subset,test_size=0.2,random_state=3)

Train=Dataset.from_pandas(Train[['text', 'label_sexist']])
Train=Train.map(tokenize_data,batched=True)
Train=Train.rename_column("label_sexist", "labels")
Train.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Dev=Dataset.from_pandas(Dev[['text', 'label_sexist']])
Dev=Dev.map(tokenize_data,batched=True)
Dev=Dev.rename_column("label_sexist", "labels")
Dev.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Map: 100%|██████████| 4000/4000 [00:01<00:00, 2519.89 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 1825.85 examples/s]
Map: 100%|██████████| 80/80 [00:00<00:00, 1407.30 examples/s]
Map: 100%|██████████| 20/20 [00:00<00:00, 870.72 examples/s]


In [5]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits = torch.tensor(logits)  # Convert logits to a PyTorch tensor
    predictions = torch.argmax(logits, dim=-1)  # Get predictions
    predictions = predictions.numpy()  # Convert predictions to a NumPy array
    # Labels are already NumPy arrays; no need for `.cpu()`
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_total_limit=1)

trainer_fine_tuned=Trainer(
    model=model_fine_tuned_roberta,
    args=training_args,
    train_dataset=Train,
    eval_dataset=Dev,
    tokenizer=tokenizer_fine_tuned_roberta,
    compute_metrics=compute_metrics)

### Fine-tuning

In [6]:
trainer_fine_tuned.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.408514 | 0.85 | 0.8 | 0.666667 | 1.0 |
| 2 | 0.931400 | 0.358112 | 0.85 | 0.8 | 0.666667 | 1.0 |
| 3 | 0.931400 | 0.315461 | 0.85 | 0.8 | 0.666667 | 1.0 |


TrainOutput(global_step=15, training_loss=0.82242325146993, metrics={'train_runtime': 117.6545, 'train_samples_per_second': 2.04, 'train_steps_per_second': 0.127, 'total_flos': 15786663321600.0, 'train_loss': 0.82242325146993, 'epoch': 3.0})

### Save the new model

In [9]:
output_dir = "../fine_tuned_roberta_v2"
trainer_fine_tuned.save_model(output_dir)
tokenizer_fine_tuned_roberta.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Model and tokenizer saved to ../fine_tuned_roberta_v2


### Evaluation & classification report

In [10]:
results=trainer_fine_tuned.evaluate()
print(results)

{'eval_loss': 0.4085139334201813, 'eval_accuracy': 0.85, 'eval_f1': 0.8, 'eval_precision': 0.6666666666666666, 'eval_recall': 1.0, 'eval_runtime': 2.3939, 'eval_samples_per_second': 8.355, 'eval_steps_per_second': 0.835, 'epoch': 3.0}


In [44]:
predictions_DF_test = trainer_fine_tuned.predict(DF_test)
predictions_GPT_Generated_test = trainer_fine_tuned.predict(GPT_Generated_test)

In [47]:
predictions_test=torch.argmax(torch.tensor(predictions_DF_test.predictions), dim=-1).numpy()
predictions_GPT_test=torch.argmax(torch.tensor(predictions_GPT_Generated_test.predictions), dim=-1).numpy()

In [50]:
# Calculating evaluation metrics
accuracy = accuracy_score(Actual_labels,predictions_test)
precision = precision_score(Actual_labels,predictions_test,average="binary")
recall = recall_score(Actual_labels,predictions_test,average="binary")
f1 = f1_score(Actual_labels,predictions_test,average="binary")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Classification report for test set
print("Test Set Classification Report:")
print(classification_report(Actual_labels,predictions_test))

# Calculate misclassification rate
accuracy=classification_report(Actual_labels,predictions_test,output_dict=True)['accuracy']
misclassification_rate=1-accuracy
# Calculate balanced accuracy
balanced_accuracy=balanced_accuracy_score(Actual_labels,predictions_test)

print(f"Misclassification Rate: {misclassification_rate:.4f}")
print(f"Balanced Accuracy: {balanced_accuracy:.4f}")

Accuracy: 0.8405
Precision: 0.6343
Recall: 0.8082
F1 Score: 0.7108
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.85      0.89      3030
           1       0.63      0.81      0.71       970

    accuracy                           0.84      4000
   macro avg       0.78      0.83      0.80      4000
weighted avg       0.86      0.84      0.85      4000

Misclassification Rate: 0.1595
Balanced Accuracy: 0.8295


These results indicate that the model has an overall accuracy of 84.05% on the original test set. It performs well on the majority class (0), with a precision of 0.93 and a recall of 0.85, but is less reliable on the minority class (1): its recall is fairly high (80.82%), but its precision is much lower (63.43%), so a substantial share of the samples it flags as sexist are actually not sexist. The balanced accuracy of 82.95% suggests fairly good performance across both classes, though the misclassification rate is 15.95%. The F1 score of 71.08% shows that there is room for improvement, especially in balancing precision and recall for the minority class.

The major changes between the two sets of results concern overall performance and the trade-off between precision and recall for the minority class (1). Before the second fine-tuning, the model had an accuracy of 87.38%, with a precision of 74.76% and a recall of 72.37% for the minority class. After the second fine-tuning, accuracy dropped to 84.05% and precision to 63.43%, while recall rose to 80.82%. The F1 score for the minority class decreased slightly (71.08% vs. 73.55%), and the balanced accuracy increased slightly (82.95% vs. 82.27%). In short, the second fine-tuning on the new relabeled data subset made the fine_tuned_roberta_v2 model detect more of the sexist samples, at the cost of more false positives and a lower overall accuracy.
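The F1 values quoted above can be reproduced from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses the rounded values printed by the notebook, so the results agree with the reported scores only up to rounding; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Minority class (1), first fine-tuned model
f1_before = f1_from(precision=0.7476, recall=0.7237)
# Minority class (1), after the second fine-tuning (fine_tuned_roberta_v2)
f1_after = f1_from(precision=0.6343, recall=0.8082)

print(f"F1 before second fine-tuning: {f1_before:.4f}")
print(f"F1 after second fine-tuning:  {f1_after:.4f}")
```

The harmonic mean is pulled toward the smaller of the two inputs, so the gain in recall does not make up for the larger loss in precision and the F1 score ends up slightly lower after the second fine-tuning.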

In [51]:
# Calculating evaluation metrics
accuracy = accuracy_score(GPT_Actual_labels,predictions_GPT_test)
precision = precision_score(GPT_Actual_labels,predictions_GPT_test,average="binary")
recall = recall_score(GPT_Actual_labels,predictions_GPT_test,average="binary")
f1 = f1_score(GPT_Actual_labels,predictions_GPT_test,average="binary")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Classification report for test set
print("Test Set Classification Report:")
print(classification_report(GPT_Actual_labels,predictions_GPT_test))

# Calculate misclassification rate
accuracy=classification_report(GPT_Actual_labels,predictions_GPT_test,output_dict=True)['accuracy']
misclassification_rate=1-accuracy
# Calculate balanced accuracy
balanced_accuracy=balanced_accuracy_score(GPT_Actual_labels,predictions_GPT_test)

print(f"Misclassification Rate: {misclassification_rate:.4f}")
print(f"Balanced Accuracy: {balanced_accuracy:.4f}")

Accuracy: 0.5500
Precision: 1.0000
Recall: 0.1000
F1 Score: 0.1818
Test Set Classification Report:
              precision    recall  f1-score   support

           0       0.53      1.00      0.69        50
           1       1.00      0.10      0.18        50

    accuracy                           0.55       100
   macro avg       0.76      0.55      0.44       100
weighted avg       0.76      0.55      0.44       100

Misclassification Rate: 0.4500
Balanced Accuracy: 0.5500


The results on the test set generated by ChatGPT show a drastic drop in overall model performance. The accuracy is only 55.00%. Precision for class 1 is a perfect 100.00%, but recall is extremely low at 10.00%, leading to a very low F1 score of 18.18%. This suggests that on this data the model is strongly biased toward class 0: it predicts class 1 only in a handful of cases where it is very confident, and misses most of the actual positives. For class 0, recall is perfect (1.00) but precision is mediocre (0.53), because many sexist samples are labeled as not sexist. The misclassification rate is high at 45.00%, and the balanced accuracy is only 55.00%, barely above the 50% that always predicting a single class would give. This shows that the fine-tuned model does not generalize well to the ChatGPT-generated test set.
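To see what these numbers mean in terms of individual predictions, the confusion matrix can be reconstructed from the classification report. This is a sketch that relies only on the reported support (50 samples per class), the recall of 0.10 and the precision of 1.00 for class 1; the variable names are chosen for illustration.

```python
# Values taken from the classification report above
support_0, support_1 = 50, 50
recall_1 = 0.10
precision_1 = 1.00

# recall = tp / (tp + fn), so tp = recall * number of actual positives
tp = round(recall_1 * support_1)
fn = support_1 - tp
# precision = tp / (tp + fp), so tp + fp = tp / precision
fp = round(tp / precision_1) - tp
tn = support_0 - fp

predicted_as_sexist = tp + fp
accuracy = (tp + tn) / (support_0 + support_1)
precision_0 = tn / (tn + fn)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Samples predicted as sexist: {predicted_as_sexist} of {support_0 + support_1}")
print(f"Accuracy: {accuracy:.4f}, precision for class 0: {precision_0:.4f}")
```

Under these figures the model labels only 5 of the 100 generated samples as sexist, which is consistent with the perfect precision and the very low recall for class 1.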