We compare our classification results against the ground_truth file, which we annotated manually and treat as the correct reference labels.

In [1]:
import pandas as pd
from sklearn.metrics import classification_report

We load the model output (test_file) and the ground_truth file.

In [2]:
test_file = pd.read_csv('pipeline_output.csv')
ground_truth = pd.read_csv('sample_gt.csv')

In [3]:
# Cast the per-modality flags and the remaining label columns to bool.
bool_cols = ["is_text_ad", "is_image_ad", "is_image_irrelevant", "is_text_irrelevant", "is_text_rant", "sensibility"]
for col in bool_cols:
    test_file[col] = test_file[col].astype(bool)
# A review counts as an ad (or as irrelevant) if either its text or its image is flagged.
test_file["is_review_ad"] = test_file["is_text_ad"] | test_file["is_image_ad"]
test_file["is_review_irrelevant"] = test_file["is_image_irrelevant"] | test_file["is_text_irrelevant"]
# print(test_file.columns)
# test_file.head()

In [4]:
bool_cols_gt = ["is_text_rant", "is_review_ad", "is_review_irrelevant", "sensibility"]
for col in bool_cols_gt:
    ground_truth[col] = ground_truth[col].astype(bool)
# print(ground_truth.columns)
# ground_truth.head()
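
One caveat about the casts above: `astype(bool)` follows Python truthiness, so on an object column every non-empty string, including "False", becomes True, and so does NaN. `read_csv` normally parses clean True/False columns as bool already, so this only matters if a column contains other encodings or missing values. The following is a minimal sketch of a stricter conversion; `to_bool` and its accepted spellings are our own illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

# Hypothetical helper (not in the original notebook): accepted spellings are an assumption.
_TRUE = {"true", "1", "yes"}
_FALSE = {"false", "0", "no"}

def to_bool(series: pd.Series) -> pd.Series:
    """Map common textual/numeric encodings to bool and fail on anything else."""
    if series.dtype == bool:
        return series
    normalized = series.astype(str).str.strip().str.lower()
    unknown = set(normalized) - _TRUE - _FALSE
    if unknown:
        raise ValueError(f"unparseable boolean values: {sorted(unknown)}")
    return normalized.isin(_TRUE)

# Toy column with string-encoded labels (made-up values).
raw = pd.Series(["True", "False", "false", "1"], dtype=object)
naive = raw.astype(bool)   # truthiness: every non-empty string is True
parsed = to_bool(raw)      # explicit mapping

print(naive.tolist())
print(parsed.tolist())
```

Applied to the notebook, this would replace `astype(bool)` inside the two loops. Missing values would raise rather than silently turn into True.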

We evaluate precision, recall, and F1-score for each class (True/False), along with the macro and weighted averages.

In [5]:
targets = ["is_review_irrelevant", "is_review_ad", "is_text_rant", "sensibility"]

# Predictions and ground truth use the same column names.
pred_cols = ["review_id"] + targets + ["helpfulness"]
gt_cols = ["review_id"] + targets + ["helpfulness"]

test_subset = test_file[pred_cols]
gt_subset = ground_truth[gt_cols]

# Inner join on review_id; shared column names get the _pred / _true suffixes.
df = test_subset.merge(gt_subset, on="review_id", suffixes=("_pred", "_true"))

for col in targets:
    y_true = df[f"{col}_true"]
    y_pred = df[f"{col}_pred"]
    print(f"=== {col} ===")
    print(classification_report(y_true, y_pred, digits=3, zero_division=0))

=== is_review_irrelevant ===
              precision    recall  f1-score   support

       False      0.961     0.813     0.881        91
        True      0.261     0.667     0.375         9

    accuracy                          0.800       100
   macro avg      0.611     0.740     0.628       100
weighted avg      0.898     0.800     0.835       100

=== is_review_ad ===
              precision    recall  f1-score   support

       False      1.000     0.990     0.995       100
        True      0.000     0.000     0.000         0

    accuracy                          0.990       100
   macro avg      0.500     0.495     0.497       100
weighted avg      1.000     0.990     0.995       100

=== is_text_rant ===
              precision    recall  f1-score   support

       False      0.984     0.733     0.840        86
        True      0.361     0.929     0.520        14

    accuracy                          0.760       100
   macro avg      0.673     0.831     0.680       100
wei
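
To make the two summary rows concrete, the sketch below recomputes the macro and weighted precision from the per-class values printed in the first two reports. The macro average is the plain mean over classes, while the weighted average weights each class by its support. The dictionary layout and the function names are ours, chosen for illustration.

```python
# Per-class (precision, support), copied from the printed reports.
reports = {
    "is_review_irrelevant": {"False": (0.961, 91), "True": (0.261, 9)},
    "is_review_ad": {"False": (1.000, 100), "True": (0.000, 0)},
}

def macro_avg(per_class):
    # Unweighted mean: every class counts equally, however rare it is.
    return sum(score for score, _ in per_class.values()) / len(per_class)

def weighted_avg(per_class):
    # Mean weighted by support: frequent classes dominate.
    total = sum(support for _, support in per_class.values())
    return sum(score * support for score, support in per_class.values()) / total

for name, per_class in reports.items():
    print(f"{name}: macro={macro_avg(per_class):.3f} weighted={weighted_avg(per_class):.3f}")
```

The results agree with the printed macro avg and weighted avg precision (0.611 / 0.898 and 0.500 / 1.000). For is_review_ad the sample contains no positive ground-truth example (support 0), so the True row and the macro average say little about how well ads are detected.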

We then repeated the evaluation with alternative prompts. The prompts listed below were tested but performed worse.

In [16]:
df[20:30]

,review_id,is_review_irrelevant_pred,is_review_ad_pred,is_text_rant_pred,sensibility_pred,helpfulness_pred,is_review_irrelevant_true,is_review_ad_true,is_text_rant_true,sensibility_true,helpfulness_true
20,102324,True,False,False,True,not_helpful,True,False,False,True,not_helpful
21,203732,True,False,True,True,not_helpful,True,False,False,True,not_helpful
22,203732,True,False,True,True,not_helpful,False,False,False,True,not_helpful
23,203732,True,False,True,True,not_helpful,True,False,False,True,not_helpful
24,203732,True,False,True,True,not_helpful,False,False,False,True,not_helpful
25,70746,True,False,True,True,not_helpful,False,False,False,False,not_helpful
26,71137,True,False,True,True,not_helpful,False,False,False,False,not_helpful
27,309898,False,False,True,True,helpful,False,False,False,False,not_helpful
28,65500,False,False,False,True,helpful,False,False,False,True,helpful
29,147276,False,False,False,True,helpful,False,False,False,True,very_helpful
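
In the slice above, review_id 203732 appears four times with identical predictions but alternating ground-truth labels. This pattern is consistent with the key occurring twice in both input files, in which case the merge pairs every prediction row with every ground-truth row. The sketch below reproduces the effect on toy data (made-up values) and shows one possible way to detect it, using the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy frames (made-up values): review_id 2 appears twice on each side.
pred = pd.DataFrame({"review_id": [1, 2, 2], "is_text_rant": [False, True, True]})
truth = pd.DataFrame({"review_id": [1, 2, 2], "is_text_rant": [False, False, True]})

# A plain merge pairs every prediction row with every ground-truth row for the
# same key: 1 row for id 1 plus 2 x 2 = 4 rows for id 2.
merged = pred.merge(truth, on="review_id", suffixes=("_pred", "_true"))
print(len(merged))

# List the keys that repeat, so they can be inspected or deduplicated.
dup_ids = pred.loc[pred["review_id"].duplicated(), "review_id"].unique().tolist()
print(dup_ids)

# validate="one_to_one" makes pandas raise instead of silently multiplying rows.
try:
    pred.merge(truth, on="review_id", suffixes=("_pred", "_true"), validate="one_to_one")
    strict_merge_ok = True
except pd.errors.MergeError:
    strict_merge_ok = False
print(strict_merge_ok)
```

If the duplicates in the real files are unintended, each such review is counted several times in the reports, so it may be worth checking both files for repeated review_id values before merging.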
