# Policy Evaluation of User Reviews

This notebook demonstrates the workflow for evaluating user-submitted reviews against a set of trustworthiness policies: spam, relevance, and credibility. We load annotated test data, evaluate the reviews in batches with a policy evaluation model, and save each prediction alongside its ground truth label. We then compute per-policy F1 scores and classification reports, and display the reviews flagged as policy violations in a structured table for easy inspection.

Because every prediction is written to disk next to its label, the metrics can be recomputed and individual errors inspected without rerunning the model.

In [None]:
import sys
sys.path.append('../src')  # make the project's src/ modules importable

import csv
import json

import pandas as pd

from policies.evaluators import PolicyEvaluator
from policies.review_selector import select_violated_reviews
from objects import Review, Business, OutputData
from llm import Model
from utils import convert_dict_to_review

### Loading Annotated Review Data

The first step in our pipeline is to load the annotated test dataset, which contains metadata and ground truth labels for each review. We read the CSV file and store all rows in a list for further processing.

In [None]:
import csv
INPUT_FILE = "../data/processed/test_set_annotated.csv"
OUTPUT_FILE = "../data/output/test_set_results.jsonl"

# Read every row into memory as a dict keyed by the CSV header
with open(INPUT_FILE, newline="") as f:
    reader = csv.DictReader(f)
    rows = list(reader)
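
Before evaluating, it can help to confirm that the file loaded with the columns the rest of the pipeline expects. The following is a minimal sketch, not part of the project code: `load_rows` is a hypothetical helper, and the column names `text` and `spam` are assumptions, since the actual header of the annotated CSV is not shown in this notebook.

```python
import csv
import io


def load_rows(handle, required_columns):
    """Read all CSV rows as dicts and fail early if expected columns are missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    missing = [name for name in required_columns if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical in-memory sample standing in for the annotated CSV
sample = io.StringIO("text,spam\nGreat food,FALSE\nBuy followers now,TRUE\n")
sample_rows = load_rows(sample, required_columns=["text", "spam"])
print(len(sample_rows), sample_rows[0]["spam"])
```

With the real file, the same helper could be called on the handle opened from `INPUT_FILE`, passing whichever columns `convert_dict_to_review` relies on.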


### Batch Evaluation of Reviews

After loading the annotated dataset, we process the reviews in batches to efficiently evaluate them with our PolicyEvaluator model. Each batch is converted into Review objects before being passed to the evaluator. The model returns predictions for each review, which are then paired with their ground truth labels and saved to a JSONL file for downstream analysis.

In [None]:
evaluator = PolicyEvaluator()
batch_size = 50
with open(OUTPUT_FILE, "w") as outfile:
    for batch_start in range(0, len(rows), batch_size):
        batch_end = min(batch_start + batch_size, len(rows))
        batch_rows = rows[batch_start:batch_end]
        
        # Convert batch to Review objects
        review_objects = [convert_dict_to_review(batch_start + i, review) 
                         for i, review in enumerate(batch_rows)]
        
        # Process entire batch
        results = evaluator.evaluate_batch(review_objects)
        print(results)

        # Write batch results
        for result in results:
            output = {
                "id": result.id,
                "prediction": result.evaluation,
                "truth": result.truth
            }
            print(output)  # (Optional) print for inspection
            outfile.write(json.dumps(output) + "\n")
        
        print(f"Processed batch {batch_start//batch_size + 1} "
              f"(reviews {batch_start+1}-{batch_end})")
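
The batching logic in the loop above can be isolated into a small generator, which makes it easy to check on its own. This is a sketch under stated assumptions: `iter_batches` is a hypothetical name, not a function from the project, and the integers stand in for review rows.

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs that cover items in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so the last batch is simply shorter
        yield start, items[start:start + batch_size]


demo_items = list(range(120))  # stand-in for 120 review rows
for start, batch in iter_batches(demo_items, batch_size=50):
    print(f"batch {start // 50 + 1}: items {start + 1}-{start + len(batch)}")
```

With 120 items and a batch size of 50, this yields batches of 50, 50, and 20 items. The start index is returned alongside each batch because the loop above uses it to give every review a dataset-wide position.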

### Evaluating Model Performance

To assess the policy evaluation model, we compute the F1 score and a full classification report for each of the three labels: credible, relevance, and spam. Predictions are read back from the JSONL output file and compared against the corresponding ground truth labels. Because values may be stored either as booleans or as "TRUE"/"FALSE" strings, both are normalized to 0/1 first. With scikit-learn's default settings, `f1_score` reports the binary F1 for the positive (true) class.

In [None]:
import json
from sklearn.metrics import f1_score, classification_report

parameters = ["credible", "relevance", "spam"]

def to_label(value):
    """Map True/False or "TRUE"/"FALSE" (any case) to 1/0."""
    if isinstance(value, bool):
        return int(value)
    return int(str(value).upper() == "TRUE")

with open(OUTPUT_FILE) as infile:
    # Parse each non-empty JSONL line once rather than once per parameter
    records = [json.loads(line) for line in infile if line.strip()]

for param in parameters:
    y_pred = [to_label(r["prediction"][param]) for r in records]
    y_true = [to_label(r["truth"][param]) for r in records]
    print(f"\n=== {param.upper()} ===")
    print(f"F1 score ({param}):", f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))
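
To make the reported number concrete, the binary F1 score can be computed by hand from true positives (TP), false positives (FP), and false negatives (FN) as F1 = 2·TP / (2·TP + FP + FN). The sketch below is illustrative only: `binary_f1` is a hypothetical helper, the labels are made up, and returning 0.0 when there are no positives at all is a choice made here for simplicity.

```python
def binary_f1(y_true, y_pred):
    """Binary F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # No true or predicted positives: define the score as 0.0 in this sketch
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels, already normalized to 0/1
example_truth = [1, 0, 1, 1, 0]
example_preds = [1, 1, 0, 1, 0]
print(round(binary_f1(example_truth, example_preds), 4))
```

In this example there are 2 true positives, 1 false positive, and 1 false negative, so the score is 4 / 6, about 0.6667. True negatives do not enter the formula, which is worth keeping in mind when a label such as spam is rare.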

### Visualizing Results

After processing and evaluating the reviews, we display the results in a structured table using a DataFrame. This allows for easy inspection of predictions, ground truth labels, and any detected policy violations directly within the notebook.

In [None]:
import pandas as pd

violated_reviews = select_violated_reviews(OUTPUT_FILE, INPUT_FILE)
df = pd.DataFrame(violated_reviews)
display(df)  # This will show a nicely formatted table in the notebook

# Optionally, save the table to CSV for use outside the notebook
df.to_csv("../data/output/violated_reviews.csv", index=False)
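
The internals of `select_violated_reviews` are not shown in this notebook. As a rough illustration of how the JSONL records could be turned into table rows for inspection, the sketch below flattens each record into one flat dict. `flatten_record` and the `pred_`/`true_` column prefixes are hypothetical, and the sample records only assume the `id`/`prediction`/`truth` shape written by the batch evaluation step.

```python
import io
import json


def flatten_record(record):
    """Turn one {"id", "prediction", "truth"} record into a flat row dict."""
    row = {"id": record["id"]}
    for name, value in record["prediction"].items():
        row[f"pred_{name}"] = value
    for name, value in record["truth"].items():
        row[f"true_{name}"] = value
    return row


# Hypothetical JSONL content in the same shape as the output file
sample_jsonl = io.StringIO(
    '{"id": 0, "prediction": {"spam": false}, "truth": {"spam": "FALSE"}}\n'
    '{"id": 1, "prediction": {"spam": true}, "truth": {"spam": "FALSE"}}\n'
)
flat_rows = [flatten_record(json.loads(line)) for line in sample_jsonl if line.strip()]
print(flat_rows[1])
```

A list of flat dicts like `flat_rows` can be passed directly to `pd.DataFrame`, giving one column per predicted and true label so that disagreements are visible side by side.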
