# Generation of synthetic reviews dataset for the AWS Blog "Testing data quality at scale with PyDeequ"

This notebook outlines the steps to generate synthetic data for the AWS blog post ["Testing data quality at scale with PyDeequ"](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/), as part of the update of the blog published in June 2024. The new synthetic dataset replaces the original Amazon reviews dataset but retains the characteristics needed to demonstrate the features of the PyDeequ library.

This synthetic dataset resides in the public S3 bucket: `s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Electronics/` as parquet files.

The notebook was executed in Jupyter Lab of Amazon SageMaker Classic, ml.m5.4xlarge instance, DataScience 3.0/Python 3 kernel.

In [None]:
import pandas as pd

import numpy as np
rng = np.random.default_rng(seed = 42)

# awswrangler version: 3.7.2
import awswrangler as wr
import matplotlib.pyplot as plt
# essential_generators version: 1.0
from essential_generators import DocumentGenerator

import random
random.seed(a = 42, version=2)
import string

Supporting functions:

In [None]:
import review_generation_helpers as rgh

## Engineer data columns

### Generate marketplace, review_headline and review_body

To generate review titles/headlines and bodies, we use the [essential_generators](https://pypi.org/project/essential-generators/) module. It creates nonsensical sentences (for the titles) and paragraphs (for the review bodies).

In [None]:
gen = DocumentGenerator()

template = {'marketplace': ['US', 'UK', 'DE', 'JP', 'FR', None, ''], 
            'review_headline':'sentence', 
            'review_body': 'paragraph'}
gen.set_template(template)
documents = gen.documents(3010972)

In [None]:
len(documents)

Review a few generated examples:

In [None]:
documents[0]

In [None]:
documents[100]

In [None]:
documents[10000]

Convert the list of generated documents to a pandas DataFrame:

In [None]:
dat = pd.DataFrame(documents)
dat.head()

In [None]:
dat["marketplace"].unique()
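
The template lists `None` and `''` among the marketplace values, which presumably gives completeness checks something to detect. A minimal sketch of counting such entries, using a few hand-written records shaped like the generated documents (the records and the helper name `count_missing` are illustrative and not part of the notebook):

```python
from collections import Counter

# Hand-written toy records shaped like the generated documents (illustrative only)
toy_documents = [
    {"marketplace": "US", "review_headline": "a", "review_body": "b"},
    {"marketplace": None, "review_headline": "c", "review_body": "d"},
    {"marketplace": "", "review_headline": "e", "review_body": "f"},
    {"marketplace": "DE", "review_headline": "g", "review_body": "h"},
]

def count_missing(records, field):
    """Count records whose field is None or an empty string."""
    return sum(1 for r in records if r.get(field) in (None, ""))

# Frequency of every marketplace value, including None and ''
marketplace_counts = Counter(r["marketplace"] for r in toy_documents)
missing = count_missing(toy_documents, "marketplace")
print(marketplace_counts, missing)
```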

In [None]:
dat.shape

### Generate review years

We assume that a successful retailer receives more reviews each year. The year values are therefore drawn with unequal probabilities: the weights are sampled from an exponential distribution and sorted in ascending order, so later years receive larger weights.

Total number of reviews (and the rows in the dataset):

In [None]:
n = dat.shape[0]

In [None]:
# array of years
years_range = np.arange(1996, 2017, 1)

# generate the weights:
exp_weights = rng.exponential(1, size = len(years_range))
exp_weights.sort()

# select the year according to the weights:
years = rng.choice(years_range, size = n, p = exp_weights/exp_weights.sum())
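
Because the weights are random draws that are only sorted afterwards, the resulting probabilities increase from year to year but do not follow a smooth exponential curve. The sketch below illustrates this on a separate generator (`rng_demo`, `years_demo` and the sample size are assumptions made for the illustration, so the notebook's own `rng` state is not touched):

```python
import numpy as np

rng_demo = np.random.default_rng(0)   # separate generator, illustration only
years_demo = np.arange(1996, 2017)    # 1996..2016, as in years_range

# Random exponential draws, sorted ascending and normalized to sum to 1
weights = np.sort(rng_demo.exponential(1, size=len(years_demo)))
probs = weights / weights.sum()

# Draw a smaller sample and count the reviews per year
sample = rng_demo.choice(years_demo, size=100_000, p=probs)
values, counts = np.unique(sample, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```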

In [None]:
np.unique(years, return_counts = True)

The blog focuses on data quality, so we introduce out-of-range years to be detected by PyDeequ checks.

In [None]:
k = np.where(years == 2002)

In [None]:
years[k[0][0]] = 2202

In [None]:
years[years == 2202]

In [None]:
k = np.where(years == 1996)

In [None]:
years[[k[0][0], k[0][1]]] = 1696

In [None]:
years[years == 1696]

In [None]:
k = np.where(years == 2001)

In [None]:
years[[k[0][0], k[0][30]]] = 2101

In [None]:
years[years == 2101]

In [None]:
years.shape[0]

In [None]:
np.unique(years, return_counts = True)
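
The injected values can also be located with a plain NumPy mask, independently of PyDeequ. This is a minimal sketch on a small hand-written array (`years_demo` is illustrative); applied to the full `years` array, the same mask should flag the five injected rows (one 2202, two 1696, two 2101):

```python
import numpy as np

years_demo = np.array([1996, 2002, 2202, 1696, 2016, 2101, 1696])
valid_min, valid_max = 1996, 2016   # bounds of years_range in this notebook

# True wherever the year falls outside the valid range
out_of_range = (years_demo < valid_min) | (years_demo > valid_max)
bad_values, bad_counts = np.unique(years_demo[out_of_range], return_counts=True)
print(bad_values, bad_counts)
```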

### Generate review dates

In [None]:
review_year_date = rgh.generate_dates(rng, years)

In [None]:
review_year_date
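
The helper `rgh.generate_dates` is not shown in this notebook. As a rough idea of what such a step could look like, the hypothetical function below (`sketch_generate_dates` is our own name, not the helper's actual implementation) picks a random day within each year and returns year/date pairs as two columns:

```python
import numpy as np

def sketch_generate_dates(rng, years):
    """Hypothetical stand-in: one random date per year, returned as (year, date) rows."""
    # Offsets 0..364 keep every date inside its year, leap year or not
    day_offsets = rng.integers(0, 365, size=len(years))
    starts = np.array([f"{int(y):04d}-01-01" for y in years], dtype="datetime64[D]")
    dates = starts + day_offsets.astype("timedelta64[D]")
    # Both columns are stored as strings, e.g. ['1996', '1996-03-17']
    return np.column_stack([np.asarray(years).astype(str), dates.astype(str)])

demo_dates = sketch_generate_dates(np.random.default_rng(0), np.array([1996, 2016, 2202]))
print(demo_dates)
```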

### Generate user ratings (star_rating)

The original dataset used in the blog post had ratings between 1 and 5 stars, with an average rating of 4 stars, and 74.9% of reviews had a star rating of 4 or higher (the last condition was defined in the blog ["Test data quality at scale with Deequ"](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/)).

To satisfy the requirements for the star rating statistics, we found that the following distribution of the number of stars works: `array([1, 1, 3, 7, 8])` for 1, 2, 3, 4, and 5 stars respectively. This distribution was optimized on an array of length 20. We will use the `numpy.repeat()` method to generate an array of the final length.

In [None]:
repeats = int(np.round(n/20))  # np.repeat expects an integer repeat count
repeats

In [None]:
arr_sample = np.array([1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5])
ratings_all = np.repeat(arr_sample, repeats = repeats, axis = 0)
ratings_all.shape

Checking that it satisfies the data statistics of the original dataset:

In [None]:
ratings_all.mean()

In [None]:
ratings_all[ratings_all >= 4].shape

In [None]:
2258235*100/3010980
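
The same figures follow from the 20-element block alone, since repeating every element the same number of times changes neither the mean nor the share of high ratings. A small sketch (the variable names are ours):

```python
import numpy as np

star_values = np.array([1, 2, 3, 4, 5])
star_counts = np.array([1, 1, 3, 7, 8])   # counts per 20 reviews, as described above

# Rebuild the 20-element block from the counts
block = np.repeat(star_values, star_counts)
block_mean = block.mean()
share_4_plus = (block >= 4).mean()
print(len(block), block_mean, share_4_plus)
```

The block gives a mean of exactly 4 and a 75% share of ratings of 4 or higher; trimming a few 5-star ratings afterwards moves both values slightly below these figures.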

Remove 8 elements from the end of the array (all 5-star ratings, because `np.repeat` preserves the sorted order of `arr_sample`) to get the final count needed for the dataset:

In [None]:
ratings = ratings_all[0:(len(ratings_all)-8)]
ratings.shape[0] == n

Check that the new array still satisfies the requirements:

In [None]:
ratings.mean()

In [None]:
ratings[ratings >= 4].shape

In [None]:
2258227*100/n

Shuffle the ratings:

In [None]:
rng.shuffle(ratings)
ratings[1:40]

### Generate helpful and total votes

For these two columns we experimented with the means and the covariance matrix to approach the target correlation (0.99365), using the formula for correlation. We also ensured that there are no negative votes.

In [None]:
cov = [[15, 8.59668777], [8.59668777, 5]]
mean = [20, 15]
total_votes, helpful_votes = rng.multivariate_normal(mean, cov, n).T
total_votes.shape[0] == n

In [None]:
helpful_votes.shape[0] == n
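
The correlation implied by a covariance matrix is the covariance divided by the product of the two standard deviations. The sketch below applies that formula to the matrix used above (the implied value is roughly 0.9927) and checks a smaller sample for negative values; the `_demo` names, the separate generator and the sample size are assumptions made for the illustration:

```python
import numpy as np

cov_demo = np.array([[15, 8.59668777], [8.59668777, 5]])
mean_demo = [20, 15]

# Pearson correlation implied by the covariance matrix: cov_xy / sqrt(var_x * var_y)
implied_corr = cov_demo[0, 1] / np.sqrt(cov_demo[0, 0] * cov_demo[1, 1])

# Draw a smaller sample with a separate generator and compare
rng_demo = np.random.default_rng(0)
total_demo, helpful_demo = rng_demo.multivariate_normal(mean_demo, cov_demo, 10_000).T
sample_corr = np.corrcoef(total_demo, helpful_demo)[0, 1]

# Values far below the mean are rare but not impossible, so check explicitly
no_negatives = bool((total_demo >= 0).all() and (helpful_demo >= 0).all())
print(implied_corr, sample_corr, no_negatives)
```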

Visualize total vs helpful votes:

In [None]:
plt.scatter(total_votes, helpful_votes, alpha = 0.5)
plt.xlabel('total votes')
plt.ylabel('helpful votes')
plt.show()

Check the correlation numerically:

In [None]:
np.corrcoef(total_votes, helpful_votes)

This is a sufficiently close result.

Total votes and star rating shouldn't be correlated:

In [None]:
np.corrcoef(total_votes, ratings)

### Generate product ids and product titles

Create product titles, adding variability with prefixes (product descriptions) and suffixes (additional product features):

In [None]:
# pool of product names, prefixes and suffixes from which we will generate the product titles
product_pool= ["fax machine", "banknote counter", "electronic alarm clock","electric pencil sharpener", 
               "blu-ray", "floor lamp", "hair dryer", "paper copier", "electric drill", "video camera", 
               "radio", "air purifier", "floor heater", "cd player", "iron", "kettle",  "mp3 player",
               "video player", "electric stove", "electric razor", "dvd", "curling iron",  
               "office printer", "wireless speaker", "kitchen scale", "theater receiver", "electronic cigarettes", 
               "computer",  "television", "smartphone", "surge protector", "remote control", "headset", 
               "game controller", "cellular phone", "bluetooth speaker"]
product_prefix_pool = ["large", "red", "small", "orange", "green", "black", "silver", "yellow", "compact", 
                       "energy-efficient", "vintage", "pink", "portable", "white", "metal", "stainless-steel"]
product_suffix_pool = ["newest model", "refurbished", "renewed", "1996 model", "with the storage case", 
                       "charger included", "charger not included", "batteries not included", "waterproof", 
                       "with adapter", "with wooden inlays", "with silver details", "with black handle", 
                       "EU compatible", "Japan compatible", "US compatible"]

Generate products using combinations of product names, prefixes and suffixes. Each product will then get a unique product id.

In [None]:
products = rgh.generate_products(rng, [product_prefix_pool, product_pool, product_suffix_pool],  n)

In [None]:
products
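
The helper `rgh.generate_products` is not shown here either. One way such a step could work is sketched below: pick a random prefix, name and suffix for every row, then give each distinct title its own id. The function name `sketch_generate_products` and the `P000000` id format are hypothetical and are not taken from the helper module:

```python
import numpy as np

def sketch_generate_products(rng, pools, size):
    """Hypothetical stand-in: returns a 2 x size array of (title, product id)."""
    prefixes, names, suffixes = pools
    titles = np.array([
        f"{p} {m} {s}"
        for p, m, s in zip(rng.choice(prefixes, size),
                           rng.choice(names, size),
                           rng.choice(suffixes, size))
    ])
    # Each distinct title is mapped to one integer code, reused for repeated titles
    _, codes = np.unique(titles, return_inverse=True)
    ids = np.array([f"P{int(c):06d}" for c in codes])
    return np.vstack([titles, ids])

demo_products = sketch_generate_products(
    np.random.default_rng(0),
    [["red", "small"], ["radio", "kettle"], ["refurbished", "renewed"]],
    50,
)
print(demo_products[:, :3])
```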

### Generate customer ids and insight

Create a column 'insight' to indicate influential reviewers.

Generate the following distribution of the number of reviews per customer:

- 10% of the reviews come from customers who wrote a single review (1:1)
- 15% - 2:1 (each customer created 2 reviews)
- 10% - 3:1 (each customer created 3 reviews)
- 10% - 4:1 (each customer created 4 reviews)
- 20% - 7:1 (each customer created 7 reviews)
- 35% - 26:1 (each customer created 26 reviews)

Create more "insightful" reviews for customers who wrote more reviews. Insight is a Y/N field.

#### Generate customer_id

In [None]:
temp = np.round(np.array([0.1, 0.15, 0.2])*n).astype(int)
temp

Experiment with different numbers of customers to obtain the distribution outlined above.

In [None]:
# 10% of the reviews come from customers who wrote a single review each
r1 = 301097

15% of reviews are expected to come from customers who wrote 2 reviews each:

In [None]:
451646/2

In [None]:
r2 = 451646

10% of reviews came from customers who have written 3 reviews each: use 301101.

In [None]:
(r1+4)/3

In [None]:
r3 = r1 + 4
r3

10% of reviews came from customers who have written 4 reviews each: use 301100.

In [None]:
(r1 + 3)/4

In [None]:
r4 = r1 +3 
r4

20% of reviews came from customers who have written 7 reviews each: use 602196.

In [None]:
(602194+2)/7

In [None]:
r7 = 602194+2
r7

The rest of the reviews:

In [None]:
n - r1 - r2 - r3 - r4 - r7

26 reviews per customer gives us a round number of customers:

In [None]:
1053832/26

In [None]:
# these people have written 26 reviews
rmax = 1053832

Verify that the sum is still correct:

In [None]:
(r1 + r2 + r3 + r4 + r7 + rmax) == n
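
The manual checks above can be gathered in one place. This sketch re-enters the chosen review counts by hand (the dictionary layout and variable names are ours) and confirms that each count is divisible by its reviews-per-customer value, that the counts add up to the dataset size, and that the shares match the target distribution:

```python
n_reviews = 3010972   # number of generated documents

# reviews per customer -> number of reviews assigned to that cohort
cohort_reviews = {1: 301097, 2: 451646, 3: 301101, 4: 301100, 7: 602196, 26: 1053832}

# Every cohort must split evenly into customers
divisible = {k: total % k == 0 for k, total in cohort_reviews.items()}
shares = {k: round(total / n_reviews, 3) for k, total in cohort_reviews.items()}
customers_per_cohort = {k: total // k for k, total in cohort_reviews.items()}

print(divisible)
print(shares)
print(sum(customers_per_cohort.values()))
```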

How many customers do we actually need?

In [None]:
c1, c2, c3, c4, c7, c26 = r1, int(r2/2), int(r3/3), int(r4/4), int(r7/7), int(rmax/26)
total_customers_needed = c1 + c2 + c3 + c4 + c7 + c26
total_customers_needed

Generate random customer ids and shuffle.

In [None]:
customer_ids = np.arange(100000, 100000 + total_customers_needed)
rng.shuffle(customer_ids)

Next, construct the indices at which to split this array into cohorts:

In [None]:
split_indices = [c1, 
                 c1 + c2, 
                 c1 + c2 + c3,
                 c1 + c2 + c3 + c4, 
                 c1 + c2 + c3 + c4 + c7]
split_indices

In [None]:
customer_cohorts = np.split(customer_ids, split_indices)

In [None]:
len(customer_cohorts)

In [None]:
[len(x) for x in customer_cohorts]

The code below completes generation of customer ids relative to the ratio of reviews as defined in the beginning of the section. 

In [None]:
customers = np.hstack([customer_cohorts[0], 
                       np.repeat(customer_cohorts[1], 2),
                       np.repeat(customer_cohorts[2], 3),
                       np.repeat(customer_cohorts[3], 4),
                       np.repeat(customer_cohorts[4], 7),
                       np.repeat(customer_cohorts[5], 26)
                      ])
customers.shape[0] == n
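
The split-and-repeat pattern is easier to follow on a toy example. The sketch below uses six made-up customer ids in three cohorts of two customers, writing 1, 2 and 3 reviews each (all names and sizes are illustrative):

```python
import numpy as np

ids = np.arange(100, 106)    # six toy customer ids
cohort_sizes = [2, 2, 2]     # customers per cohort
reviews_each = [1, 2, 3]     # reviews written by each customer of the cohort

# Cumulative sizes without the last entry give the cut points for np.split
split_at = np.cumsum(cohort_sizes)[:-1]
cohorts = np.split(ids, split_at)

# Repeat every id once per review it is responsible for
per_review_ids = np.hstack([np.repeat(c, r) for c, r in zip(cohorts, reviews_each)])
print(per_review_ids)
```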

Next, we need to distribute 'insight' accordingly and shuffle.

#### Create vine = insight

Customers with more reviews should have more Y than N values in the insight column, which takes the place of the vine column of the original dataset.

In [None]:
insight = np.hstack([rng.choice(['N'], r1), 
                 rng.choice(['N'], r2),
                 rng.choice(['N'], r3),
                 rng.choice(['Y', 'N'], r4, p = [0.2, 0.8]),
                 rng.choice(['Y', 'N'], r7, p = [0.5, 0.5]),
                 rng.choice(['Y', 'N'], rmax, p = [0.9, 0.1])])

In [None]:
insight.shape[0] == n

### Associate customer ids with the indicator for insight and shuffle

Combine into a single array:

In [None]:
cust_insight = np.vstack([customers, insight])
cust_insight.shape

In [None]:
ind = np.arange(cust_insight.shape[1])
ind[0:100]

In [None]:
rng.shuffle(ind)
ind[0:100]

In [None]:
ind.max()

In [None]:
ind.min()

In [None]:
cust_insight_shuffled = cust_insight[:, ind]

In [None]:
cust_insight[:, 0:10]

In [None]:
cust_insight_shuffled[:, 0:10]
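
Indexing the columns of the stacked array with one shared permutation keeps every customer id attached to its own flag. A minimal sketch on toy values (names are illustrative); note that a single NumPy array holds one dtype, so ids stacked together with 'Y'/'N' flags end up stored as strings, which the sketch makes explicit with a cast:

```python
import numpy as np

ids = np.array([1, 2, 3, 4])
flags = np.array(["N", "N", "Y", "Y"])

# Cast ids to strings explicitly so both rows share one dtype
stacked = np.vstack([ids.astype(str), flags])

# One permutation of the column indices, applied to both rows at once
rng_demo = np.random.default_rng(0)
perm = rng_demo.permutation(stacked.shape[1])
shuffled = stacked[:, perm]

pairs_before = set(zip(stacked[0].tolist(), stacked[1].tolist()))
pairs_after = set(zip(shuffled[0].tolist(), shuffled[1].tolist()))
print(pairs_before == pairs_after)
```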

### Generate review ids

The review ids need to be mostly unique: the fraction of unique values should be 0.9926566948782706. Each review id starts with the letter 'R', followed by 14 random uppercase letters and digits (15 characters in total).

In [None]:
reviews_unique = ['R' + ''.join(random.choices(string.ascii_uppercase + string.digits, k=14)) for x in range(n)]

len(reviews_unique)

In [None]:
reviews_unique[0:100]

Verify uniqueness:

In [None]:
temp = set(reviews_unique)
len(temp)

Introduce duplicated ids:

In [None]:
count_dup = (n - np.round(n*0.9926566948782706)).astype(int)
count_dup

In [None]:
reviews_unique[count_dup:count_dup*2] = reviews_unique[0:count_dup]

In [None]:
len(set(reviews_unique))

In [None]:
2988862/n

In [None]:
reviews_non_unique = np.array(reviews_unique)

In [None]:
rng.shuffle(reviews_non_unique)
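
The duplication step overwrites one slice of ids with a copy of another slice of the same length, which lowers the number of distinct ids by exactly that length as long as all ids were distinct beforehand. A toy version with 1000 sequential ids and a 0.99 target (both values, and the helper `unique_fraction`, are assumptions for the illustration):

```python
def unique_fraction(ids):
    """Share of distinct values in a list of ids."""
    return len(set(ids)) / len(ids)

n_demo = 1000
target_unique = 0.99

# Sequential ids are distinct by construction: 'R' followed by 14 digits
toy_ids = [f"R{i:014d}" for i in range(n_demo)]

count_dup_demo = n_demo - round(n_demo * target_unique)
# Overwrite the second slice with a copy of the first one
toy_ids[count_dup_demo:2 * count_dup_demo] = toy_ids[:count_dup_demo]

fraction = unique_fraction(toy_ids)
print(count_dup_demo, fraction)
```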

## Assemble the dataset

In [None]:
dat.head()

In [None]:
dat["customer_id"] = cust_insight_shuffled[0, :]

dat["review_id"] = reviews_non_unique

dat["product_title"] = products[0, :]
dat["product_id"] = products[1, :]

dat["star_rating"] = ratings

dat["helpful_votes"] = helpful_votes.astype("int")
dat["total_votes"] = total_votes.astype("int")
dat["insight"] = cust_insight_shuffled[1, :]
dat[["review_year", "review_date"]] = review_year_date
dat["review_year"] = dat["review_year"].astype("int")
dat["review_date"] = pd.to_datetime(dat["review_date"])
dat["product_category"] = "Electronics"

dat.head()

In [None]:
s3_bucket_name = "s3://BUCKET-NAME"  # placeholder: replace with your own bucket path

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_name,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols=['product_category']
)