# Dataset Preparation

This notebook demonstrates the workflow for processing and evaluating user-submitted reviews against a set of trustworthiness policies, including spam detection, relevance, and credibility.

The data comes from the Google Local Reviews dataset provided by the [McAuley Lab at UCSD](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/). Because the full dataset contains millions of reviews, we randomly sample a manageable subset for processing.


In [1]:
import csv
import json
import sys

# Make the project's src/ package importable from the notebook directory.
sys.path.append('../src')

from objects import InputData
from data_processing.input_parser import parse_csv, parse_json_into_reviews, save_to_json

DATA_DIR = "../data"
PROCESSED_DIR = f"{DATA_DIR}/processed"
RAW_DIR = f"{DATA_DIR}/raw"


### Parsing and Standardizing Review Data

We begin by loading the Vermont portion of the dataset: the reviews and the accompanying business metadata. From these we randomly select a subset of 200 reviews to serve as the test set. The custom `parse_json_into_reviews` function transforms each selected review into a standardized format suitable for our policy evaluation pipeline, so that review text, ratings, and business information are represented consistently.

The processed reviews are then saved to a JSON file for downstream use, and a sample review is printed to verify the transformation.

In [None]:
review_json_path = f"{RAW_DIR}/vermont/reviews-Vermont.json"
business_info_path = f"{RAW_DIR}/vermont/meta-Vermont.json"

output_path = f"{PROCESSED_DIR}/vermont_test_set.json"

selected_data = parse_json_into_reviews(path=review_json_path,
                                        rows=200,  # test set size
                                        business_info_path=business_info_path)
save_to_json(selected_data, output_path)

print(selected_data[0])
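
The internals of `parse_json_into_reviews` are not shown in this notebook. As a rough illustration of how a fixed-size random subset can be drawn from a file too large to load at once, the sketch below uses reservoir sampling. It assumes the raw file stores one JSON object per line (JSON Lines); the function name `sample_json_lines` and its parameters are hypothetical and are not part of the project's code.

```python
import json
import random
from typing import Any, Dict, List, Optional


def sample_json_lines(path: str, k: int, seed: Optional[int] = None) -> List[Dict[str, Any]]:
    """Reservoir-sample up to k records from a JSON Lines file in a single pass.

    Hypothetical helper: assumes one JSON object per non-empty line.
    """
    rng = random.Random(seed)
    reservoir: List[Dict[str, Any]] = []
    seen = 0  # number of non-empty lines read so far
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k records.
                reservoir.append(json.loads(line))
            else:
                # Keep the new record with probability k / seen by
                # overwriting a uniformly chosen slot.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = json.loads(line)
    return reservoir
```

Under these assumptions, `sample_json_lines(review_json_path, 200, seed=0)` would return 200 raw review records without holding the whole file in memory, and fixing `seed` makes the sample reproducible.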

### Converting JSON to CSV for Annotation

After parsing and standardizing the review data, we convert the JSON output into a CSV file to facilitate manual annotation. Empty columns are added for the policy labels (`spam`, `relevance`, `credible`) and the review attributes (`_sentiment`, `_informative`) if they do not already exist.

This CSV format allows annotators to easily inspect, edit, and label the reviews, ensuring the dataset is ready for model evaluation and training.

In [None]:
csv_name = output_path.replace('.json', '_annotated.csv')

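with open(output_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

annotation_fields = ['spam', 'relevance', 'credible', '_sentiment', '_informative']
# Append only the annotation columns that the parsed reviews do not already have.
fieldnames = list(data[0].keys()) + [field for field in annotation_fields if field not in data[0]]
with open(csv_name, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # write the column names as the first row
    for item in data:
        # Leave missing annotation cells blank for annotators to fill in.
        for field in annotation_fields:
            if field not in item:
                item[field] = ''
        writer.writerow(item)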
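
Once annotators have worked through the file, it can help to check how many cells are still blank before using the labels for evaluation. The sketch below is one possible way to do that, assuming the CSV begins with a header row naming each column. The helper `count_missing_labels` is hypothetical and is not part of the project's `data_processing` package.

```python
import csv
from typing import Dict, List

ANNOTATION_FIELDS = ["spam", "relevance", "credible", "_sentiment", "_informative"]


def count_missing_labels(csv_path: str, fields: List[str] = ANNOTATION_FIELDS) -> Dict[str, int]:
    """Count, per annotation column, the rows whose cell is still empty.

    Hypothetical helper: assumes the first CSV row is a header.
    """
    missing = {field: 0 for field in fields}
    with open(csv_path, "r", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for field in fields:
                # A missing column and a blank cell both count as unlabeled.
                if not (row.get(field) or "").strip():
                    missing[field] += 1
    return missing
```

Calling `count_missing_labels(csv_name)` returns a dictionary such as `{'spam': n, ...}`, where each value is the number of rows still awaiting a label in that column.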
