# Generation of synthetic reviews dataset for the AWS Blog "Monitor data quality in your data lake using PyDeequ and AWS Glue"


This notebook outlines the steps to generate synthetic data for the AWS blog post ["Monitor data quality in your data lake using PyDeequ and AWS Glue"](https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/), as updated in August 2024. The new synthetic dataset replaces the original Amazon reviews dataset but retains the characteristics needed to demonstrate the features of the PyDeequ library in AWS Glue.

The synthetic dataset is stored as Parquet files in the public S3 bucket: `s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Jewelry/`.

Install [awswrangler](https://aws-sdk-pandas.readthedocs.io/en/stable/) and [essential_generators](https://pypi.org/project/essential-generators/):

In [None]:
!python3 -m pip install awswrangler==3.7.2
!python3 -m pip install essential_generators==1.0

In [None]:
import matplotlib.pyplot as plt
from matplotlib import ticker

import numpy as np
import pandas as pd
from scipy.linalg import cholesky
import awswrangler as wr

from essential_generators import DocumentGenerator

Import the supporting functions:

In [None]:
import review_generation_helpers as rgh

In [None]:
rng = np.random.default_rng()

In [None]:
float_formatter = "{:.4f}".format
np.set_printoptions(formatter={'float_kind':float_formatter})

## Engineer data columns

### Generate customer ids and insight

Create a column 'insight' to indicate influential reviewers.

In [None]:
cust_ins = rgh.create_customers_insight(rng)
cust_ins_ready = rgh.shuffle_customer_insight(rng, cust_ins)

In [None]:
cust_ins

In [None]:
cust_ins_ready

In [None]:
cust_ins_ready.shape

The second dimension of this array is the number of reviews in the dataset:

In [None]:
n = cust_ins_ready.shape[1]
n

### Generate total votes

The code for generating seasonal semi-correlated total votes is based on the [PyMC notebook](https://www.pymc.io/projects/examples/en/latest/time_series/MvGaussianRandomWalk_demo.html).

In the blog, we investigate the seasonal correlation in the monthly count of total reviews across 3 years. Since the number of product reviews is not divisible by 12, we take the closest multiple of 12 and later trim the result to match the size of the dataset.

In [None]:
n

In [None]:
12*268734

In [None]:
3224808/3
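
The arithmetic in the cells above can be written as one short calculation. This is a sketch only: the value of `n` is an assumption, inferred from the later step that trims two rows from 3 × 1,074,936 rows, and the variable names are illustrative.

```python
import math

n = 3_224_806  # assumed dataset size: 3 * 1_074_936 minus the 2 rows trimmed later
months = 12
years = 3

# Closest multiple of 12 that is not below n
padded = math.ceil(n / months) * months
# Data points in each yearly series (N in the notebook)
points_per_year = padded // years
# Data points per month within one yearly series
points_per_month = points_per_year // months
# Rows to drop after the columns are assembled
surplus = padded - n

print(padded, points_per_year, points_per_month, surplus)
```

The per-year count has to be divisible by 12 as well, which holds here because the padded total is a multiple of 36.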

In [None]:
D = 3  # Dimension of random walks = time series for each year
N = 1074936  # Number of steps = data points
sections = 12  # Number of sections = months
period = N // sections  # Number of steps in each section (an integer, as np.repeat requires)

Sigma_alpha = rng.standard_normal((D, D))
Sigma_alpha = Sigma_alpha.T.dot(Sigma_alpha)  # Construct covariance matrix for alpha
L_alpha = cholesky(Sigma_alpha, lower=True)  # Obtain its Cholesky decomposition

Sigma_beta = rng.standard_normal((D, D))
Sigma_beta = Sigma_beta.T.dot(Sigma_beta)  # Construct covariance matrix for beta
L_beta = cholesky(Sigma_beta, lower=True)  # Obtain its Cholesky decomposition

# Gaussian random walks:
alpha = np.cumsum(L_alpha.dot(rng.standard_normal((D, sections))), axis=1).T
beta = np.cumsum(L_beta.dot(rng.standard_normal((D, sections))), axis=1).T
t = np.arange(N)[:, None] / N
alpha = np.repeat(alpha, period, axis=0)
beta = np.repeat(beta, period, axis=0)

# Correlated series
sigma = 0.1

# This is number of points (N) by 3 years array:
y = alpha + beta * t + sigma * rng.standard_normal((N, 1))
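
Multiplying independent standard normal draws by a Cholesky factor is what makes the three yearly series correlated. The sketch below checks this on a small example. It uses `numpy.linalg.cholesky` rather than the SciPy function, and the seed, sample size and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only to make the sketch repeatable
D = 3

# Random symmetric positive definite matrix, built the same way as Sigma_alpha
A = rng.standard_normal((D, D))
Sigma = A.T @ A
L = np.linalg.cholesky(Sigma)  # lower-triangular factor with Sigma = L @ L.T

# Independent draws become correlated after multiplication by L
z = rng.standard_normal((D, 200_000))
x = L @ z
empirical_cov = np.cov(x)

print(np.allclose(L @ L.T, Sigma))
print(np.round(empirical_cov - Sigma, 2))
```

The empirical covariance of `x` approaches `Sigma` as the number of draws grows, so the second printout should be close to zero.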

Since `y` represents votes, it can't be negative. We also scale the values up so that they resemble realistic vote counts.

In [None]:
y.min()

In [None]:
y.max()

In [None]:
total_votes = np.abs(np.round(y*10))
total_votes.max()

In [None]:
total_votes.min()

In [None]:
total_votes.shape

Plot the series:

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(t, total_votes, ".", markersize=2, label=("y_0 data", "y_1 data", "y_2 data"))
plt.title("Three Correlated Series")
plt.xlabel("Time")
plt.legend()
plt.show();

#### Plot the sum of total votes by month and year

We need a figure to demonstrate monthly and yearly variability and correlation in the sum of total votes. Calculate the sum and generate the figure.

Prepare a 1D array of the total votes:

In [None]:
total_votes_1D = np.reshape(total_votes, (total_votes.shape[0]*3,) , order='F')
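
The `order='F'` argument matters here: it flattens the array column by column, so all values for the first year come first, followed by the second and third years. A toy example (the array values are made up for illustration):

```python
import numpy as np

# Toy array: 4 time steps (rows) by 3 years (columns)
toy = np.array([[10, 20, 30],
                [11, 21, 31],
                [12, 22, 32],
                [13, 23, 33]])

flat_f = np.reshape(toy, (toy.shape[0] * 3,), order="F")  # column by column
flat_c = np.reshape(toy, (toy.shape[0] * 3,), order="C")  # row by row (the default)

print(flat_f)  # [10 11 12 13 20 21 22 23 30 31 32 33]
print(flat_c)  # [10 20 30 11 21 31 12 22 32 13 23 33]
```

The `review_year` column is flattened the same way, which keeps each vote count aligned with its year.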

Split the array by month to calculate and plot the monthly sums:

In [None]:
month_split = np.split(total_votes, list(range(0, N, int(N/12)))[1:], axis=0)

Calculate the sum for each month/year:

In [None]:
sum_month_year_list = [month_split[x].sum(axis = 0) for x in range(len(month_split))]
sum_month_year_list_array = np.array(sum_month_year_list)
sum_month_year_list_array.max()

In [None]:
sum_month_year_list_array.min()
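
The split-and-sum pattern can be checked on a small array. The sketch below uses made-up sizes (5 points per month) and random toy data. It also shows an equivalent reshape-based calculation, which is offered as an alternative and is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
months, per_month, n_years = 12, 5, 3
N_toy = months * per_month
votes = rng.integers(0, 50, size=(N_toy, n_years))  # toy stand-in for total_votes

# Same pattern as the notebook: split at every multiple of per_month
split_points = list(range(0, N_toy, per_month))[1:]
chunks = np.split(votes, split_points, axis=0)
sums_split = np.array([chunk.sum(axis=0) for chunk in chunks])

# Alternative: view the data as (months, per_month, years) and sum the middle axis
sums_reshape = votes.reshape(months, per_month, n_years).sum(axis=1)

print(sums_split.shape)
print(np.array_equal(sums_split, sums_reshape))
```

Both approaches give a 12 by 3 array with one row per month and one column per year.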

In [None]:
fig, ax = plt.subplots(layout='constrained')
ax.plot(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], 
        sum_month_year_list_array, 
        marker = 's',
        markersize=5, label=("2013", "2014", "2015"))
ax.set_title("Total votes in jewelry review")
ax.set_xlabel("Month")
ax.set_ylabel("Total votes")
ax.grid(True, which='major', color = 'lightgrey', alpha = 0.5)
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter("{x:,.0f}"))
ax.legend()
plt.savefig('total_votes_in_jewelry_review.jpg')
plt.show()

### Generate review years

In the blog the dataset has reviews from 3 years: 2013, 2014 and 2015:

In [None]:
years = np.repeat([[2013, 2014, 2015]], total_votes.shape[0], axis = 0)

In [None]:
years.shape

Reshape to a 1D array:

In [None]:
years_1D = np.reshape(years, (years.shape[0]*3,) , order='F')

### Generate review dates

Generate the review dates in accordance with the periods used to generate the total votes.

In [None]:
review_dates = rgh.generate_dates_even_per_month(rng, [2013, 2014, 2015], days_per_month = int(N/12))

In [None]:
len(review_dates)
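
The helper module is not shown in this notebook. As a rough idea of what a function like `generate_dates_even_per_month` might do, here is a hypothetical stand-in built on the standard library. The function name, signature and sampling approach below are assumptions, not the actual helper.

```python
import calendar
import datetime
import random

def dates_even_per_month(years, days_per_month, seed=None):
    """Hypothetical stand-in: draw the same number of random dates for every month."""
    rnd = random.Random(seed)
    dates = []
    for year in years:
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]  # number of days in this month
            for _ in range(days_per_month):
                dates.append(datetime.date(year, month, rnd.randint(1, last_day)))
    return dates

toy_dates = dates_even_per_month([2013, 2014, 2015], days_per_month=4, seed=0)
print(len(toy_dates))  # 3 years * 12 months * 4 dates = 144
```

Dates produced this way are ordered year by year and month by month, which matches the order of the flattened total votes.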

### Assemble years, review dates, total votes and trim to size

In [None]:
dat = pd.DataFrame({"review_year": years_1D, "review_date":  review_dates, "total_votes": total_votes_1D})
dat.shape

In [None]:
dat = dat.iloc[0:-2,]

In [None]:
dat.shape

In [None]:
dat.shape[0] == n

In [None]:
dat.head()

Add customer id and the insight to the DataFrame:

In [None]:
dat["customer_id"] = cust_ins_ready[0, :]
dat["insight"] = cust_ins_ready[1, :]

In [None]:
dat.head()

In [None]:
dat.dtypes

Correct the data types:

In [None]:
dat["review_year"] = dat["review_year"].astype("int")
dat["review_date"] = pd.to_datetime(dat["review_date"])
dat["total_votes"] = dat["total_votes"].astype("int")

In [None]:
dat.dtypes

### Generate product ids and product titles

Create product titles, add variability by adding prefixes (adjectives) and suffixes:

In [None]:
# pool of product names, prefixes and suffixes from which we will generate the product titles
product_pool= ["earrings", "crown", "headband","hairclip", 'armlet', 'bracelet', "cuff links", "ring", "pin", "brooch",
              "buckle", "toe ring", "anklet", "amulet", "beads", "jewelry", "necklace", "pendant", "tie clip"]
product_prefix_pool = ["wire wrapped", "charm", "Italian", "friendship", "silver", "gold", "ametist", "coral", "silver-plated", "gold-plated",
                      "vintage", "unique", "cute", "adorable", "elegant", "designer", "championship", "class", "engagement", 
                       "promise", "wedding", "art", "estate", "forever love heart", "jade", "pearl", "gold pearl", "silver pearl",
                      "imitation pearl", "black pearl", "white gold", "yellow gold", "white and yellow gold", "paw print",
                      "chunky", "gold dipped"]
product_suffix_pool = ["with natural stones", "with cubic zirconia", "with diamonds", "with princess cut-stones", "size adjustable",
                      "with custom engraving", "for women", "for men", "for men and women", "for couples", "unisex", "excellent gift",
                      "engagement gift", "best gift for friends", "best for Mother's day", "excellent gift for mother-in-law", 
                       "set of 3", "for every day of the week", "with interchangeable stones"]

In [None]:
products = rgh.generate_products(rng, [product_prefix_pool, product_pool, product_suffix_pool], n) 

In [None]:
products.shape

In [None]:
products
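
The title generation can be pictured as one random pick from each pool, joined into a single string. The sketch below is a hypothetical stand-in for that idea using small toy pools. It is not the actual `generate_products` helper and it does not attempt to reproduce how product ids are assigned.

```python
import random

def make_titles(pools, size, seed=None):
    """Hypothetical stand-in: one random pick from each pool, joined with spaces."""
    rnd = random.Random(seed)
    return [" ".join(rnd.choice(pool) for pool in pools) for _ in range(size)]

# Small toy pools, taken from the longer lists above
prefixes = ["silver", "vintage"]
names = ["ring", "pendant"]
suffixes = ["for women", "set of 3"]

titles = make_titles([prefixes, names, suffixes], size=5, seed=0)
print(titles)
```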

In [None]:
dat["product_title"] = products[0, :]
dat["product_id"] = products[1, :]
dat.head()

### Generate review titles and text

In [None]:
gen = DocumentGenerator()

template = {'review_headline':'sentence', 
            'review_body': 'paragraph'}
gen.set_template(template)
documents = gen.documents(n)

In [None]:
len(documents)

In [None]:
documents[0:3]

In [None]:
reviews = pd.DataFrame(documents)

In [None]:
dat["review_headline"] = reviews["review_headline"]
dat["review_body"] = reviews["review_body"]

### Generate marketplace codes

The majority of the reviews have code "US" and ~1000 have code "MX".

In [None]:
marketplace = np.repeat(["US"], n)
marketplace.shape

In [None]:
random_subs = rng.choice(np.arange(n), 1000)
marketplace[random_subs] = "MX"

In [None]:
np.unique(marketplace, return_counts = True)
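
The count of "MX" rows can come out slightly below 1000 because `rng.choice` samples with replacement by default, so the same position may be drawn more than once. If exactly 1000 distinct rows were wanted, `replace=False` would be one option. The sketch below uses a small made-up array size to show the difference.

```python
import numpy as np

rng = np.random.default_rng(2)  # seed fixed only to make the sketch repeatable
n_toy = 5_000                   # toy size, much smaller than the real dataset

# Default behavior: sampling with replacement, duplicates are possible
with_repl = rng.choice(np.arange(n_toy), 1000)
# replace=False guarantees 1000 distinct positions
without_repl = rng.choice(np.arange(n_toy), 1000, replace=False)

marketplace_toy = np.repeat(["US"], n_toy)
marketplace_toy[without_repl] = "MX"

print(len(np.unique(with_repl)), len(np.unique(without_repl)))
print(np.unique(marketplace_toy, return_counts=True))
```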

In [None]:
dat["marketplace"] = marketplace

### Generate review ids

In [None]:
dat["review_id"] = rgh.generate_random_review_id(n)

### Generate star ratings

Use an exponential distribution:

In [None]:
star_rating = np.array([1, 2, 3, 4, 5])

dat["star_rating"] = rgh.subset_array_exponential(rng, star_rating, n, scale = 0.9, sort = False)
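
The implementation of `subset_array_exponential` is not shown here. One way such a helper could work is to draw from an exponential distribution and use the truncated draws as indices into the array. The sketch below is a hypothetical stand-in under that assumption. In particular, which end of the array ends up most frequent is a guess and may differ from the real helper.

```python
import numpy as np

def pick_exponential(rng, values, size, scale=0.9):
    """Hypothetical stand-in: pick elements of `values` with exponentially decaying frequency."""
    draws = rng.exponential(scale=scale, size=size)       # non-negative floats
    idx = np.minimum(draws.astype(int), len(values) - 1)  # truncate, then cap at the last index
    return np.asarray(values)[idx]

rng = np.random.default_rng(3)  # seed fixed only to make the sketch repeatable
sample = pick_exponential(rng, np.array([1, 2, 3, 4, 5]), size=10_000)
values, counts = np.unique(sample, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

With this mapping, the first element is drawn most often and each later element is progressively rarer.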

### Generate helpful votes

Use an exponential distribution:

In [None]:
votes = np.arange(0, 55, 5)

dat["helpful_votes"] = rgh.subset_array_exponential(rng, votes, n, scale = 0.9, sort = False)
dat["helpful_votes"] = dat["helpful_votes"].astype("int")
dat["star_rating"] = dat["star_rating"].astype("int")

In [None]:
dat["product_category"] = "Jewelry"

## Write the data to S3 in parquet format

In [None]:
s3_bucket_name = "s3://BUCKET-NAME"  # placeholder: replace with your own S3 path

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_name,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)