# Amazon electronics dataset exploration

## 2018 Amazon Review Data

The Electronics category is a subset of the Amazon Review Data (2018) and contains roughly 20M ratings from Amazon users.

*Source*: Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019, https://nijianmo.github.io/amazon/index.html

In [None]:
!ls ../data/2018

In [None]:
import pandas as pd
df = pd.read_csv('../data/2018/Electronics.csv', nrows=10000, names=["item", "user", "rating", "timestamp"])

Ratings only: this file includes no metadata or review text, only (item, user, rating, timestamp) tuples, so it is suitable for use with MyMediaLite or similar packages.

In [None]:
df.head()

In [None]:
len(df)

Hmm... are the four columns sufficient for our system? Can we infer a purchase from the presence of a rating? Should we assume that a user with no rating for a product never bought it? That doesn't seem supportable. I guess the prediction here is not whether a user bought an item but whether they were motivated to leave a review, so the review becomes the reward, not the sale. For details, read the paper: https://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf
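
A minimal sketch of that framing, assuming the same (item, user, rating, timestamp) layout as `df`. The toy `ratings` frame and the `reviewed` column are made up for illustration. Every observed pair gets a reward of 1 whatever its star value, and unobserved pairs are simply absent rather than labeled as negatives.

```python
import pandas as pd

# Toy ratings in the same (item, user, rating, timestamp) layout; values are made up.
ratings = pd.DataFrame(
    [("i1", "u1", 5.0, 1), ("i1", "u2", 2.0, 2), ("i2", "u1", 4.0, 3)],
    columns=["item", "user", "rating", "timestamp"],
)

# Reward is 1 for every observed (user, item) pair, regardless of the star value.
interactions = ratings[["user", "item"]].drop_duplicates().assign(reviewed=1)

# Share of all possible (user, item) pairs that were actually observed.
n_users = interactions["user"].nunique()
n_items = interactions["item"].nunique()
density = len(interactions) / (n_users * n_items)
```

On the real data the density will be tiny, which is one reason a missing rating is better treated as unknown than as a negative.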

In [None]:
df.describe()

In [None]:
df.item.value_counts()

## 2023 Amazon Reviews Data

### Preprocessing

In [None]:
!ls -lh ../data/2023

This is the 2023 release of the dataset; see https://amazon-reviews-2023.github.io/

In [None]:
import json 
import pandas as pd

In [None]:
reviews = pd.read_json('../data/2023/Electronics.jsonl', lines=True, nrows=100)

In [None]:
reviews.head()

In [None]:
# Reduce the dataset to the fields our prediction task needs, or we risk blowing the memory budget
!cd ../data/2023 && jq -c '{rating, parent_asin, user_id, timestamp}' Electronics.jsonl > Electronics_min.jsonl
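
If `jq` is not available, the same projection can be done in plain Python. This is a sketch only: `project_jsonl` and `KEEP` are hypothetical names, not part of any library. It streams the file one line at a time, so memory use stays flat regardless of file size.

```python
import json

# Fields kept for the prediction task; mirrors the jq filter above.
KEEP = ("rating", "parent_asin", "user_id", "timestamp")

def project_jsonl(src_path, dst_path, fields=KEEP):
    """Stream src_path line by line, writing only `fields` to dst_path.

    Returns the number of records written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # .get() yields None for a missing key, which serializes to null
            slim = {name: record.get(name) for name in fields}
            dst.write(json.dumps(slim) + "\n")
            written += 1
    return written
```

For this notebook the call would be `project_jsonl('../data/2023/Electronics.jsonl', '../data/2023/Electronics_min.jsonl')`.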

In [None]:
reviews = pd.read_json('../data/2023/Electronics_min.jsonl', lines=True)

In [None]:
reviews.to_parquet("../data/2023/Electronics_min.parquet")

In [None]:
reviews.iloc[0]

In [None]:
len(reviews.user_id.unique())

In [None]:
reviews.hist()

In [None]:
items = pd.read_json("../data/2023/meta_Electronics.jsonl", lines=True, nrows=100) 

In [None]:
items.head()

In [None]:
# Filter down to essential fields
!cd ../data/2023 && jq -c '{title, average_rating, description, price, images, rating_number, parent_asin}' meta_Electronics.jsonl > meta_Electronics_min.jsonl

In [None]:
import pandas as pd

In [None]:
# Note: for whatever reason this load burns about 30G of RAM, even though the JSON is only 2.8G uncompressed. We should get this into a Parquet file stat.
items = pd.read_json("../data/2023/meta_Electronics.jsonl", lines=True)

In [None]:
items.drop(['main_category', 'features', 'videos', 'store', 'categories', 'details', 'bought_together', 'subtitle', 'author'], axis=1, inplace=True)

In [None]:
items.head()

In [None]:
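# Cast price to str so the column has one consistent type (presumably needed for the Parquet write below)
items.price = items.price.astype(str)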

In [None]:
items.to_parquet("../data/2023/meta_Electronics.parquet")

In [None]:
len(items)

In [None]:
items.hist()

In [None]:
items.iloc[0]

In [None]:
# Per the dataset documentation: Note: Products with different colors, styles, sizes usually belong to the same parent ID. 
# The “asin” in previous Amazon datasets is actually parent ID. Please use parent ID to find product meta.
item = reviews.iloc[5].parent_asin
items[items.parent_asin == item]

In [None]:
reviews[reviews.parent_asin == items.iloc[1].parent_asin]
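
The two lookups above match one key at a time. To attach item metadata to every review at once, a left join on `parent_asin` is one option. The sketch below uses toy stand-ins (`reviews_demo`, `items_demo`, with invented values) rather than the real frames, and uses `indicator=True` to flag reviews whose item is missing from the metadata.

```python
import pandas as pd

# Toy stand-ins for the `reviews` and `items` frames; values are made up.
reviews_demo = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "parent_asin": ["A1", "A1", "B2"],
    "rating": [5, 3, 4],
})
items_demo = pd.DataFrame({
    "parent_asin": ["A1", "C3"],
    "title": ["USB cable", "Headphones"],
})

# A left join keeps every review; the _merge column records whether an item matched.
joined = reviews_demo.merge(items_demo, on="parent_asin", how="left", indicator=True)

# Reviews whose parent_asin has no entry in the item metadata.
unmatched = joined[joined["_merge"] == "left_only"]
```

On the real data, a non-empty `unmatched` would suggest the reviews and metadata files are out of sync, or that the wrong ID column is being used as the key.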

I can't load the entirety of the reviews in one shot, but I can fit every item in memory. So every review will be grounded to an item, but many reviews will be hidden. I don't think this matters for this project. If I want to fit more reviews, I can preprocess the data to reject unneeded fields (notably text fields) and dramatically reduce memory requirements. Alternatively, I can load only the critical columns, yes?
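
On the last question: as far as I know, `pd.read_json` has no option to select columns at parse time, so one workaround is to read in chunks and trim each chunk before keeping it. A minimal sketch, where `load_columns` is a hypothetical helper and the in-memory sample stands in for the real JSONL file:

```python
import io
import pandas as pd

def load_columns(path_or_buf, columns, chunksize=100_000):
    """Read a JSON Lines source in chunks, keeping only `columns`."""
    reader = pd.read_json(path_or_buf, lines=True, chunksize=chunksize)
    # Each chunk is parsed in full and then trimmed, so only one wide chunk
    # is held at a time. reindex tolerates a column absent from a chunk.
    parts = [chunk.reindex(columns=columns) for chunk in reader]
    return pd.concat(parts, ignore_index=True)

# Tiny in-memory sample standing in for the real JSONL file (values are made up).
sample = io.StringIO(
    '{"rating": 5, "parent_asin": "A1", "user_id": "u1", "text": "great"}\n'
    '{"rating": 2, "parent_asin": "B2", "user_id": "u2", "text": "meh"}\n'
    '{"rating": 4, "parent_asin": "A1", "user_id": "u3", "text": "fine"}\n'
)
slim = load_columns(sample, ["rating", "parent_asin", "user_id"], chunksize=2)
```

Once the data is in Parquet, column selection is simpler, since `pd.read_parquet` accepts a `columns=` argument and reads only those columns from disk.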