# Dataset Verification and Overview

**Project:** Context-Aware Trust Scoring and Review-Based Product Recommendation  
**Dataset:** Amazon Reviews 2018 (Fashion Category)

**Objective:**  
Verify that the dataset loads cleanly, has a consistent schema, and contains the fields needed for trust scoring and recommendation tasks.


## Import Required Libraries

In [12]:
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_columns', None)


## Load Dataset

In [13]:
import gzip
import shutil
import requests
from pathlib import Path

# Define paths relative to the project root
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data" / "raw"
DATA_PATH = DATA_DIR / "AMAZON_FASHION.json"
DATA_URL = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/AMAZON_FASHION.json.gz"

# Ensure directory exists
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Download and unzip if file doesn't exist
if not DATA_PATH.exists():
    print(f"File not found at {DATA_PATH}. Downloading from {DATA_URL}...")
    try:
        # A timeout keeps the cell from hanging indefinitely if the server stalls
        response = requests.get(DATA_URL, stream=True, timeout=30)
        response.raise_for_status()
        
        # Download compressed file
        compressed_path = DATA_PATH.with_suffix(".json.gz")
        with open(compressed_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        
        print("Download complete. Extracting...")
        
        # Extract json.gz to json
        with gzip.open(compressed_path, 'rb') as f_in:
            with open(DATA_PATH, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        # Clean up compressed file
        compressed_path.unlink()
        print("Extraction complete.")
        
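    except Exception as e:
        print(f"Error downloading/extracting dataset: {e}")
        # Remove partial files so a rerun does not load a truncated dataset
        for partial in (DATA_PATH.with_suffix(".json.gz"), DATA_PATH):
            partial.unlink(missing_ok=True)
        raise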

print(f"Loading dataset from: {DATA_PATH}")
df = pd.read_json(DATA_PATH, lines=True)
print("Dataset loaded successfully!")


Loading dataset from: D:\Context_Aware_Trust_Scoring_Recommendation\data\raw\AMAZON_FASHION.json
Dataset loaded successfully!
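
When extracting the full file is inconvenient, pandas can also read gzip-compressed JSON Lines directly and in chunks. The sketch below is not part of the original notebook: `count_reviews` is a hypothetical helper, and a tiny temporary file with made-up records stands in for the real download.

```python
import gzip
import json
import tempfile
from pathlib import Path

import pandas as pd


def count_reviews(path, chunksize=2):
    """Count rows in a JSON Lines file (gzip is inferred from the suffix)."""
    total = 0
    # With lines=True and a chunksize, read_json yields DataFrames lazily
    reader = pd.read_json(path, lines=True, chunksize=chunksize, compression="infer")
    for chunk in reader:
        total += len(chunk)
    return total


# Toy stand-in for AMAZON_FASHION.json.gz (hypothetical records)
records = [
    {"overall": 5, "reviewerID": "U1", "asin": "P1"},
    {"overall": 1, "reviewerID": "U2", "asin": "P1"},
    {"overall": 4, "reviewerID": "U3", "asin": "P2"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    n_reviews = count_reviews(sample_path)

print(n_reviews)
```

Chunked reading trades speed for a lower peak memory footprint, which may matter on machines that struggle to hold every review in memory at once.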


In [14]:
print(f"Total number of reviews: {df.shape[0]}")


Total number of reviews: 883636
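
A row count alone does not show whether reviews are repeated or how many distinct users and products are present. The following is a minimal sketch on a toy frame with made-up rows; the choice of key columns is an assumption, not something the notebook defines.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows); the last row repeats the first
toy = pd.DataFrame({
    "reviewerID": ["U1", "U2", "U1", "U1"],
    "asin": ["P1", "P1", "P2", "P1"],
    "unixReviewTime": [100, 200, 300, 100],
    "reviewText": ["ok", "bad", "nice", "ok"],
})

# Restrict the check to hashable columns: object columns that hold dicts or
# lists cannot be hashed, so calling duplicated() on all columns may fail
key_cols = ["reviewerID", "asin", "unixReviewTime", "reviewText"]
n_duplicates = int(toy.duplicated(subset=key_cols).sum())
n_users = toy["reviewerID"].nunique()
n_products = toy["asin"].nunique()

print(n_duplicates, n_users, n_products)
```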


In [15]:
print("Column names:\n")
print(df.columns.tolist())


Column names:

['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote', 'style', 'image']


In [16]:
df.dtypes


overall             int64
verified             bool
reviewTime         object
reviewerID         object
asin               object
reviewerName       object
reviewText         object
summary            object
unixReviewTime      int64
vote              float64
style              object
image              object
dtype: object
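
Because `unixReviewTime` is stored as `int64` and `reviewTime` as a plain string, time-based analysis usually needs a proper datetime column first. A minimal sketch, assuming the integer values are seconds since the Unix epoch; the column names `review_date` and `review_year` and the sample values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df with made-up epoch values
toy = pd.DataFrame({"unixReviewTime": [1514764800, 1514678400]})

# Interpret the integers as seconds since 1970-01-01 (assumption)
toy["review_date"] = pd.to_datetime(toy["unixReviewTime"], unit="s")
toy["review_year"] = toy["review_date"].dt.year

print(toy)
```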

In [17]:
df.sample(5)[
    ["reviewerID", "asin", "overall", "reviewText", "summary", "verified"]
]


| | reviewerID | asin | overall | reviewText | summary | verified |
|---|---|---|---|---|---|---|
| 453424 | A2G81ATHIVXUXC | B00CMD45N6 | 5 | Good sneakers arrived on time and fit as expected. | Early | True |
| 203470 | A38XDLJOGTQUWF | B00LQO1XOG | 5 | Antique looking meaning it has silver and black in it. Looks nice and equals the cost. | Looks nice and equals the cost | True |
| 195390 | A1BC9QDY6Y55SM | B00KXQMKS6 | 1 | Terrible material and horrible fit. I would not recommend buying this product. | One Star | True |
| 410706 | A34EICRIXUO8L | B00389SZ1Q | 1 | What I received were cheaper, flimsier, off color, and not like the picture. Dissatisfied. | and not like the picture | True |
| 584804 | A3VGJLLYEV7TV5 | B00QIR3XHG | 5 | Super fast shipping I love it!! Got it the next day I ordered. Hat came in brand new, fits perfectly and looks nice as expected on the picture. Thank you very much for your awesome service and knit hat :) | Overall is awesome! | True |


In [18]:
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_percentage.sort_values(ascending=False)


image             96.739947
vote              90.957815
style             65.532301
reviewText         0.139537
summary            0.060319
reviewerName       0.010412
verified           0.000000
overall            0.000000
reviewerID         0.000000
reviewTime         0.000000
asin               0.000000
unixReviewTime     0.000000
dtype: float64
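
The missing-value profile above suggests some light cleaning before modelling. The sketch below shows one possible treatment on a toy frame with made-up rows. Both rules are assumptions rather than decisions taken in this notebook: rows without `reviewText` are dropped, and a missing `vote` is read as zero helpful votes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical rows)
toy = pd.DataFrame({
    "reviewText": ["Great fit", None, "Too small"],
    "summary": ["Five Stars", "One Star", None],
    "vote": [2.0, np.nan, np.nan],
})

# Assumption: a review without text cannot be scored from its text, so drop it
cleaned = toy.dropna(subset=["reviewText"]).copy()
# Assumption: a missing vote means no helpful votes were recorded
cleaned["vote"] = cleaned["vote"].fillna(0).astype(int)
# Keep the row but use an empty string where the summary is missing
cleaned["summary"] = cleaned["summary"].fillna("")

print(cleaned)
```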

## Dataset Summary

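- The Amazon Fashion dataset contains 883,636 reviews, each with a star rating and free-text feedback.
- Each review is linked to one user (`reviewerID`) and one product (`asin`); neither column has missing values.
- Ratings (`overall`) are stored as integers, and `reviewText` is present in more than 99.8% of rows.
- Verification status (`verified`) and timestamps (`reviewTime`, `unixReviewTime`) are fully populated, while `image`, `vote`, and `style` are missing in roughly 97%, 91%, and 66% of rows respectively.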
- The dataset is suitable for:
  - Review-level trust scoring
  - Product-level recommendation aggregation
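
To illustrate how the two tasks listed above could fit together, the sketch below computes a toy review-level trust score and then a trust-weighted average rating per product. The function `toy_trust`, its weights, and the sample rows are hypothetical and chosen only for illustration; they are not the project's actual scoring method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical rows)
toy = pd.DataFrame({
    "asin": ["P1", "P1", "P2"],
    "overall": [5, 1, 4],
    "verified": [True, False, True],
    "vote": [3.0, np.nan, np.nan],
})


def toy_trust(frame):
    """Hypothetical trust score in [0.25, 1); the weights are illustrative only."""
    votes = frame["vote"].fillna(0)            # assumption: missing vote = 0
    vote_term = votes / (votes + 1.0)          # saturates as votes grow
    verified_term = frame["verified"].astype(float)
    return 0.25 + 0.5 * verified_term + 0.25 * vote_term


toy["trust"] = toy_trust(toy)
toy["weighted"] = toy["overall"] * toy["trust"]

# Product-level aggregation: trust-weighted mean rating per asin
sums = toy.groupby("asin")[["weighted", "trust"]].sum()
product_scores = sums["weighted"] / sums["trust"]

print(product_scores)
```

The constant 0.25 floor keeps every weight positive, so the weighted mean is always defined even for products whose reviews are all unverified and have no votes.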
