# Generation of the synthetic reviews data for other product categories

**DISCLAIMER:** This notebook extends the synthetic product review generation for [Electronics](./01-synthetic-data-electronics.ipynb) and [Jewelry](./02-synthetic-data-jewelry.ipynb) to 18 other product categories. The data for these categories does not follow any temporal trends, and there are no relationships between the variables. It should not be used for data science projects because it contains no meaningful patterns (any patterns that do appear are random or accidental). However, the dataset is large enough to demonstrate the 'big data' capabilities of services, databases, and query engines.

We generate the synthetic data for the following product categories:

- Automotive
- Home_Kitchen
- Beauty_Personal_Care
- Apparel
- Video_Games
- Toys_Games
- Office_Products
- Pet_Supplies
- Sports_Outdoors
- Tools_Home_Improvement
- Garden_Outdoor
- Arts_Crafts_Sewing
- Health_Household
- Computers
- Books
- Music
- Movies_TV
- Grocery_Gourment_Food 

In most cases, product titles were generated by prompting Anthropic's Claude v3 Sonnet model through Amazon Bedrock. The prompts are recorded in each category section below.

Review titles and bodies were generated separately with the helper function `review_generation_helpers.generate_review_headline_body()`. They were saved to an Amazon S3 bucket and later read back to augment the data for each product category.

The final size of the synthetic data is approximately 130 million rows. 

The dataset resides in the public S3 bucket `s3://aws-bigdata-blog/generated_synthetic_reviews/data/` as Parquet files partitioned by `product_category`, using the categories listed above.
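
A minimal sketch of reading a single category back with `awswrangler`, assuming the bucket is reachable with your AWS credentials and that the partition column is named `product_category` as described above. The helper names `category_filter` and `read_category` are illustrative and not part of this notebook.

```python
DATA_ROOT = "s3://aws-bigdata-blog/generated_synthetic_reviews/data/"


def category_filter(category):
    """Return a predicate over partition values, as expected by partition_filter."""
    return lambda partition: partition["product_category"] == category


def read_category(category, root=DATA_ROOT):
    """Read a single product_category partition into a pandas DataFrame."""
    import awswrangler as wr  # lazy import: only needed when actually reading

    return wr.s3.read_parquet(
        path=root,
        dataset=True,  # treat the prefix as a partitioned dataset
        partition_filter=category_filter(category),
    )


# Example (requires AWS credentials and network access):
# automotive = read_category("Automotive")
```

A single partition can still be large, so it may be worth testing against one category before reading several.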

License: CC-BY-4.0

Install [awswrangler](https://aws-sdk-pandas.readthedocs.io/en/stable/) and [essential_generators](https://pypi.org/project/essential-generators/):

In [None]:
!python3 -m pip install awswrangler==3.7.2
!python3 -m pip install essential_generators==1.0

In [None]:
import pandas as pd
import numpy as np

# custom module: review_generation_helpers.py
import review_generation_helpers as rgh

import awswrangler as wr

In [None]:
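# private bucket with review titles and bodies (replace the placeholder with your bucket URI)
s3_bucket_text = "s3://PRIVATE-S3-BUCKET"
# output bucket for the generated dataset (replace the placeholder with your bucket URI)
s3_bucket_output = "s3://BUCKET-NAME"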

## Automotive

Prompt to Claude v3 Sonnet: "Generate 50 products related to automotive. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_pool = ['tire', 'battery', 'engine', 'headlight', 'wiper', 'brake', 'muffler', 'mirror', 'radiator', 
                       'alternator', 'spark', 'plug', 'filter', 'bearing', 'belt', 'fuse', 'sensor', 'gasket', 'hose', 
                       'pump', 'piston', 'caliper', 'rotor', 'suspension', 'strut', 'bushing', 'control', 'arm', 'axle', 
                       'driveshaft', 'differential', 'transmission', 'clutch', 'flywheel', 'starter', 'ignition', 'coil', 
                       'distributor', 'carburetor', 'injector', 'valve', 'cylinder', 'turbocharger', 'supercharger', 
                       'catalytic', 'converter', 'manifold', 'thermostat', 'compressor']

Prompt to Claude v3 Sonnet: "Generate 50 automotive product characteristics. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_prefix_pool = ['durable', 'efficient', 'powerful', 'sleek', 'sophisticated', 'innovative', 
                              'eco-friendly', 'luxurious', 'sporty', 'advanced', 'spacious', 'versatile', 'ergonomic', 
                              'aerodynamic', 'stylish', 'comfortable', 'responsive', 'intuitive', 'intelligent', 
                              'rugged', 'agile', 'dynamic', 'premium', 'reliable', 'sturdy', 'silent', 'smooth', 
                              'swift', 'elegant', 'secure', 'customizable', 'futuristic', 'energy-efficient', 
                              'performance-oriented', 'technologically-advanced', 'user-friendly', 
                              'environmentally-conscious', 'cutting-edge', 'safety-focused', 'connected', 
                              'automated', 'handy', 'compact', 'maneuverable', 'sustainable', 'adaptable', 'refined', 
                              'robust', 'smart']

In [None]:
# author generated
product_suffix_pool = ["foldable", "collapsible", "expandable", "with slogans", "new", "refurbished", "replacement", "12x",
                      "new in box", "with dividers", "large capacity", "rechargeable", "for the trunk", "extra-large", "compact", 
                      "4x", "for pets", "against pets", "for cars and homes", "for the roof", "for the interior", "with automatic sensor", 
                      "high performance", "with charger", "with mirror", "with aerodinamic features", "secure", "with app",
                      '2 years warranty']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(0, 65, 5),
                   helpful_votes = np.arange(0, 32), 
                   scale = 1, 
                   review_years = np.arange(1996, 2017), 
                   product_category = 'Automotive', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.8)
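
The internals of `rgh.create_dataset` are not shown in this notebook. As an illustration only, the sketch below shows one way three component pools could be combined into product titles: draw one entry from each pool at random and join them with spaces. The function `sample_titles` and its uniform sampling are assumptions, not the helper's actual logic.

```python
import numpy as np


def sample_titles(components, n, seed=0):
    """Draw n titles by picking one entry per pool and joining with spaces."""
    rng = np.random.default_rng(seed)
    # One array of random picks per pool (prefix, product, suffix).
    picks = [rng.choice(pool, size=n) for pool in components]
    # Join row-wise, skipping empty strings so no double spaces appear.
    return [" ".join(part for part in row if part) for row in zip(*picks)]


prefixes = ["durable", "efficient"]
products = ["tire", "battery"]
suffixes = ["new", "refurbished"]
titles = sample_titles([prefixes, products, suffixes], n=3)
print(titles)
```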

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/10d7317906cf4aa6845a8d7d66b0c651_0.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array
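
The `.array` accessor in the cell above matters because assigning a pandas Series to a column aligns on index labels, whereas assigning the underlying array fills rows by position. The toy example below shows the difference. The frames `dat_demo` and `reviews_demo` are made up for illustration, and whether `dat` actually has a non-default index depends on `rgh.create_dataset`.

```python
import pandas as pd

dat_demo = pd.DataFrame({"product_title": ["a", "b", "c"]}, index=[10, 11, 12])
reviews_demo = pd.DataFrame({"review_headline": ["h0", "h1", "h2", "h3"]})

# Assigning a Series aligns on index labels: 10, 11, 12 vs 0, 1, 2 -> all NaN.
aligned = dat_demo.copy()
aligned["review_headline"] = reviews_demo.iloc[0:3]["review_headline"]

# Assigning the underlying array ignores labels and fills rows by position.
positional = dat_demo.copy()
positional["review_headline"] = reviews_demo.iloc[0:3]["review_headline"].array

print(aligned["review_headline"].isna().all())   # True
print(positional["review_headline"].tolist())    # ['h0', 'h1', 'h2']
```

Positional assignment requires the array length to equal the number of rows in the target frame, so the reviews file must contain at least `dat.shape[0]` rows.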

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols=['product_category']
)
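
The same column list and write options recur in every category section below. One possible way to factor them out is sketched here. `select_output_columns` and `write_category` are hypothetical helper names, and the write call simply mirrors the arguments used in the cell above.

```python
import pandas as pd

OUTPUT_COLUMNS = [
    "product_category", "marketplace", "customer_id", "review_id", "product_id",
    "product_title", "star_rating", "helpful_votes", "total_votes", "insight",
    "review_headline", "review_body", "review_date", "review_year",
]


def select_output_columns(df, columns=OUTPUT_COLUMNS):
    """Return df restricted to the output columns, failing early if any is missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[list(columns)]


def write_category(df, path):
    """Write one category as a partitioned Parquet dataset (needs AWS credentials)."""
    import awswrangler as wr  # lazy import: only needed when actually writing

    return wr.s3.to_parquet(
        df=select_output_columns(df),
        path=path,
        dataset=True,
        max_rows_by_file=3000000,
        partition_cols=["product_category"],
    )
```

With such a helper, each section's final cell would reduce to `write_category(dat, s3_bucket_output)`.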

## Home_Kitchen

Prompt: "Generate 50 products related to home and kitchen. Provide the output as a comma separated list, 
surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_pool = ['blender', 'saucepan', 'oven', 'spatula', 'chopping board', 'kettle', 'toaster', 'microwave', 'refrigerator',
                       'dishwasher', 'coffee maker', 'food processor', 'mixer', 'frying pan', 'baking tray', 'pressure cooker', 
                       'slow cooker', 'knife set', 'cutting board', 'peeler', 'grater', 'strainer', 'whisk', 'ladle', 'tongs', 
                       'dish rack', 'oven mitt', 'kitchen towel', 'apron', 'trash can', 'storage container', 'jar', 'bottle', 
                       'bowl', 'plate', 'cup', 'glass', 'mug', 'fork', 'knife', 'spoon', 'napkin', 'coaster', 'placemat', 
                       'tablecloth', 'vase', 'candle holder', 'centerpiece']

Prompt: "Generate 50 home and kitchen product characteristics. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_prefix_pool = ['durable', 'efficient', 'sleek', 'versatile', 'ergonomic', 'eco-friendly', 'compact', 'multifunctional', 'innovative', 
                         'stylish', 'user-friendly', 'dishwasher-safe', 'energy-saving', 'non-stick', 'portable', 'stainless', 'cordless', 
                         'rust-resistant', 'powerful', 'quiet', 'programmable', 'decorative', 'adjustable', 'insulated', 'collapsible', 
                         'lightweight', 'lockable', 'spill-proof', 'automated', 'hygienic', 'microwave-safe', 'retractable', 'wireless', 
                         'leak-proof', 'robust', 'customizable', 'washable', 'scratch-resistant', 'biodegradable', 'tamper-proof', 
                         'heat-resistant', 'space-saving', 'child-proof', 'soundproof', 'breathable', 'waterproof', 'odor-free', 'shatterproof']

In [None]:
# author generated
product_suffix_pool = ["a perfect gift", "santa's little helper", "with cord storage", "with beautiful print",
                  "for a beautiful home", "for a well-appointed kitchen", "long-lasting", "with ergonimic handle", "latest model",
                  "quick ship", "set of 12", '2 years warranty', '1 year warranty']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(0, 100, 2),
                   helpful_votes = np.arange(0, 52), 
                   scale = 0.7, 
                   review_years = np.arange(1997, 2018), 
                   product_category = 'Home_Kitchen', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)
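
When reading the `np.arange` arguments in these calls, note that the stop value is excluded. The short check below uses the Home_Kitchen arguments from the cell above. How `rgh.create_dataset` samples from these arrays is not shown in this notebook.

```python
import numpy as np

total_votes = np.arange(0, 100, 2)     # even numbers 0, 2, ..., 98
review_years = np.arange(1997, 2018)   # 1997 through 2017 inclusive

print(total_votes.min(), total_votes.max(), total_votes.size)    # 0 98 50
print(review_years[0], review_years[-1], review_years.size)      # 1997 2017 21
```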

In [None]:
reviews = pd.read_parquet(s3_bucket_text+
                          '/review_body_headline/10d7317906cf4aa6845a8d7d66b0c651_1.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Beauty_Personal_Care

Prompt: "Generate 50 products related to beauty and personal care. Provide the output as a comma separated list, 
surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_pool = ['lipstick', 'mascara', 'eyeliner', 'foundation', 'concealer', 'blush', 'bronzer', 'highlighter', 
                       'eyeshadow', 'lip gloss', 'lip balm', 'face powder', 'face serum', 'face cream', 'face mask', 
                       'face scrub', 'face toner', 'face cleanser', 'body lotion', 'body butter', 'body scrub', 
                       'body wash', 'shampoo', 'conditioner', 'hair oil', 'hair serum', 'hair spray', 'hair gel', 
                       'hair mousse', 'hair dye', 'nail polish', 'nail polish remover', 'nail file', 'nail clipper', 
                       'tweezers', 'razor', 'shaving cream', 'aftershave', 'deodorant', 'perfume', 'cologne', 'bath bomb', 
                       'bath salt', 'loofah', 'toothbrush', 'toothpaste', 'mouthwash', 'floss']

Prompt: "Generate 50 beauty and personal care product characteristics. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['radiant', 'nourishing', 'rejuvenating', 'silky', 'luminous', 'hydrating', 'soothing', 
                         'revitalizing', 'purifying', 'luxurious', 'anti-aging', 'botanical', 'gentle', 'organic', 
                         'natural', 'enriching', 'firming', 'smoothing', 'antioxidant', 'protecting', 'repairing', 
                         'clarifying', 'refining', 'toning', 'balancing', 'reviving', 'energizing', 'brightening', 
                         'lifting', 'anti-wrinkle', 'moisturizing', 'detoxifying', 'invigorating', 'regenerating', 
                         'refreshing', 'softening', 'volumizing', 'conditioning', 'replenishing', 'soothing', 'illuminating', 
                         'mattifying', 'nurturing', 'therapeutic', 'pampering', 'calming', 'vitalizing', 'renewing', 'polishing']
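
Pools produced by a language model can contain repeats. In the list above, `'soothing'` appears twice. If entries are drawn uniformly (an assumption about `rgh.create_dataset`), a repeated word is proportionally more likely to be picked. The sketch below shows one way to detect and optionally remove repeats. `find_duplicates` and `dedupe` are illustrative helpers, and deduplicating would change the generated data relative to the published dataset.

```python
from collections import Counter


def find_duplicates(pool):
    """Return a dict of entries that occur more than once, with their counts."""
    return {item: n for item, n in Counter(pool).items() if n > 1}


def dedupe(pool):
    """Remove repeats while keeping the first occurrence and the original order."""
    return list(dict.fromkeys(pool))


sample = ["radiant", "soothing", "gentle", "soothing"]
print(find_duplicates(sample))   # {'soothing': 2}
print(dedupe(sample))            # ['radiant', 'soothing', 'gentle']
```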

In [None]:
# author generated
product_suffix_pool = ["in renewable packaging", "in recyclable packaging", "two-pack", "for the whole family", "good for sensitive skin",
                        "appropriate for all skin types", "16oz jar", "8oz jar", "200 ml", "appropriate for any skin tone", "appropriate for any hair type",
                        "gentle smell", "no aritificial components", "no artifical colors", "no artifical fragrances", "refreshing aroma", "refreshing smell", 
                        "long-lasting"]

In [None]:
dat = rgh.create_dataset(size_factor = 1, 
                   total_votes = np.arange(0, 150),
                   helpful_votes = np.arange(0, 60), 
                   scale = 0.5, 
                   review_years = np.arange(1998, 2015), 
                   product_category = 'Beauty_Personal_Care', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.5)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/10d7317906cf4aa6845a8d7d66b0c651_2.snappy.parquet')

In [None]:
row_marker = dat.shape[0]
row_marker

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Apparel

Prompt: "Generate 50 apparel items. Provide the output as a comma separated list, 
surround each word in single quotes, each word as lower case, each word is singular."

In [None]:
product_pool = ['shirt', 'blouse', 'dress', 'skirt', 'pants', 'jeans', 'shorts', 'jacket', 'coat', 'sweater', 
                       'cardigan', 'vest', 't-shirt', 'tank', 'camisole', 'hoodie', 'sweatshirt', 'blazer', 'suit', 
                       'robe', 'gown', 'jumpsuit', 'romper', 'leggings', 'tights', 'stockings', 'socks', 'shoes', 'boots', 
                       'sandals', 'slippers', 'flip-flops', 'sneakers', 'heels', 'flats', 'hat', 'cap', 'beanie', 'scarf', 
                       'gloves', 'mittens', 'belt', 'tie', 'bowtie', 'necklace', 'bracelet', 'ring', 'earrings', 'watch']

Prompt: "Generate 50 apparel characteristics. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular". The author then provided modifications.

In [None]:
product_prefix_pool = ['multi-colored', 'modern pattern', 'pleasing texture', 'made with natural materials', 
                           'modern fit for men and women', 'in the latest style', 
                           'floor length', 'short sleeve', 'long-sleeve', 'flat collar', 'no collar', 'round neckline', 
                           'with embellishmenst', 'slimming silhouette', 'in fun prints!', 'heavy weight fabric', 'long lasting material',
                           'drape', 'sheerness', 'stretch', 'opacity', 'high breathability',
                           'very washabile fabric', 'with wrinkle-resistantant properties', 'insulation for low temperatures', 
                           'moisture-wicking', 'odor-resistance in any circumstances', 'sun-protection', 'warm and cozy', 
                           'unbelievable softness', 'with high sheen', 'shimmering fabric', 'matte fabric', 'shiny fabric', 'rugged construction', 
                           'with delicate lace', 'casual and comfy style', 'business formal', 'vintage', 'contemporary', 
                           'classic style', 'trendy lines', 'feminine or masculine']

Prompt for the suffixes: "Generate 50 apparel characteristics. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular".

In [None]:
product_suffix_pool = ['slim','sleek','loose','baggy','tight','flowy','sheer','opaque','vibrant','muted','patterned','solid',
                         'striped','polka-dotted','floral','abstract','plain','faded','distressed','embroidered','sequined',
                         'beaded','fringed','pleated','ruffled','belted','layered','oversized','cropped','cinched','tapered',
                         'asymmetric','structured','draped','knitted','woven','lace','mesh','crocheted','denim','leather','suede',
                         'satin','silk','cotton','linen']

In [None]:
dat = rgh.create_dataset(size_factor = 2, 
                   total_votes = np.arange(0, 40),
                   helpful_votes = np.arange(0, 38), 
                   scale = 0.9, 
                   review_years = np.arange(2000, 2017), 
                   product_category = 'Apparel', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

Reuse the reviews file loaded for the previous category, starting at `row_marker` (the number of rows that category already consumed):

In [None]:
dat["review_headline"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_headline"].array
dat["review_body"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_body"].array
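
An `iloc` slice that runs past the end of the frame silently returns fewer rows, and the `.array` assignment then fails with a length mismatch. A small bounds-checked helper makes this explicit and also returns the next start position. `take_rows` is a hypothetical helper, shown here on a toy frame.

```python
import pandas as pd


def take_rows(reviews, start, n, columns=("review_headline", "review_body")):
    """Return n rows starting at position start, plus the next start position."""
    stop = start + n
    if stop > len(reviews):
        raise ValueError(
            f"need rows {start}:{stop} but the reviews frame has only {len(reviews)}"
        )
    return reviews.iloc[start:stop][list(columns)], stop


reviews_demo = pd.DataFrame(
    {"review_headline": list("abcde"), "review_body": list("vwxyz")}
)
first, marker = take_rows(reviews_demo, 0, 2)        # rows 0-1
second, marker = take_rows(reviews_demo, marker, 3)  # rows 2-4
print(second["review_headline"].tolist(), marker)    # ['c', 'd', 'e'] 5
```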

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Video_Games

Prompt: "Generate 50 products related to video games. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular". The author then provided modifications.

In [None]:
product_pool = ['video game', 'game controller', 'console', 'controller', 'joystick', 'headset', 'mouse', 'keyboard', 'monitor', 
                'speaker', 'microphone', 'webcam', 'chair', 'desk', 'laptop', 'graphics card', 'memory card', 'processor', 
                'motherboard', 'memory', 'storage', 'hard-drive', 'drive', 'power', 'supply', 'cooling fan', 'fan', 'case', 
                'accessory', 'cable', 'software', 'antivirus protection', 'antivirus software', 'firewall', 'peripheral', 'printer', 
                'scanner', 'platform', 'lighting', 'lamp', 'touch panel', 'headphones', 'docking station',
                'subscription service', 'merchandise', 'figurine', 'plushie', 'apparel', 'poster', 'book', 'guide', 'walkthrough sheet', 
                'strategy walkthrough', 'cheat-sheet', 'fidget spinner', 'sweatband']

In [None]:
product_prefix_pool = ['ergonomic', 'gaming', 'large', 'fastest', 'cooling', 'stress relief', '', 'wearable', 'physical', 'virtual',
                      'computer', 'single', 'streaming', 'board', 'play', 'flat' ]

In [None]:
# author generated
product_suffix_pool = ['alleviates carpal tunnel syndrome', 'with thumb support', 'with elbow support', 
                       'in all colors', 'with large capacity', 'newest version', 'set', 'mount', 'adapter', 
                       'can be rotated in any direction', 'peripheral set']

In [None]:
dat = rgh.create_dataset(size_factor = 1, 
                   total_votes = np.arange(0, 87, 2),
                   helpful_votes = np.arange(0, 54, 3), 
                   scale = 0.7, 
                   review_years = np.arange(1999, 2018), 
                   product_category = 'Video_Games', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/7b80671e92144a27bcc4ea59e656e543_0.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

In [None]:
row_marker = dat.shape[0]

## Toys_Games

Prompt: "Generate 50 toys and games products. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['doll', 'puzzle', 'action figure', 'board game', 'stuffed animal', 'video game', 'lego set', 'card game', 
                'toy car', 'art supply', 'ball', 'playset', 'toy robot', 'kite', 'jump rope', 'musical instrument', 
                'building block', 'science kit', 'puzzle cube', 'toy truck', 'coloring book', 'playing card', 'frisbee', 
                'yo-yo', 'bubble wand', 'remote control car', 'chalk', 'play dough', 'jigsaw puzzle', 'plush toy', 
                'educational game', 'toy train', 'sticker book', 'craft kit', 'memory game', 'model kit', 'playground ball', 
                'spinning top', 'marionette', 'kaleidoscope', 'dress-up costume', 'puppet', 'play tent', 'magic set', 
                'toy laptop', 'pretend play set', 'juggling ball', 'sports equipment', 'musical toy']

Prompt: "Generate 50 words related to toys and games products. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['educational','fun','interactive','colorful','wooden','plastic','electronic','durable','safe',
                       'entertaining','stimulating','creative','imaginative','developmental','engaging','challenging',
                       'collectible','lifelike','plush','cuddly','sturdy','eco-friendly','vibrant','musical','artistic',
                       'construction','coding','virtual','augmented','puzzling','competitive','strategic','collaborative',
                       'skill-building','themed','licensed','nostalgic','retro','innovative','compact','portable','versatile',
                       'customizable','programmable','robotic','battery-powered','motion-activated']

In [None]:
# author generated
product_suffix_pool = ['unisex', 'for boys and girls', 'for kids and adults', 'for ages 0-1', 'for ages 1 to 3', 
                      'for ages 3 to 7', 'for ages 8 to 13', 'for ages 13+', 'for adults', 'for stress relief', 
                      'battery operated', 'in box', 'with a storage box', 'perfect birthday gift', 'for little girls', '10 pieces',
                      'extra large', 'extra pieces included', 'for all ages', 'set of 2', 'pastel colors', 'full activity set']

In [None]:
dat = rgh.create_dataset(size_factor = 2, 
                   total_votes = np.arange(0, 20),
                   helpful_votes = np.arange(0, 18), 
                   scale = 0.6, 
                   review_years = np.arange(2000, 2015), 
                   product_category = 'Toys_Games', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.5)

Reuse the reviews file loaded for the previous category, starting at `row_marker` (the number of rows that category already consumed):

In [None]:
dat["review_headline"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_headline"].array
dat["review_body"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Office_Products

Prompt: "Generate 50 office products. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['pen', 'pencil', 'eraser', 'ruler', 'stapler', 'scissors', 'paper', 'folder', 'clipboard', 'binder', 'tape', 
                'glue', 'notebook', 'marker', 'highlighter', 'envelope', 'label', 'stamp', 'calendar', 'calculator', 'desk', 
                'chair', 'lamp', 'whiteboard', 'projector', 'printer', 'scanner', 'shredder', 'telephone', 'headset', 'monitor', 
                'keyboard', 'mouse', 'laptop', 'tablet', 'dock', 'cable', 'adapter', 'charger', 'battery', 'fan', 'humidifier', 
                'dehumidifier', 'clock', 'planner', 'organizer', 'cup', 'mug']

Prompt: "Generate 50 words related to office products. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['ergonomic','comfortable','multifunctional','compact','durable','affordable','versatile','efficient',
                       'eco-friendly','stylish','portable','adjustable','wireless','noise-canceling','premium','innovative',
                       'sturdy','sleek','secure','paperless','spacious','customizable','integrated','organized','lightweight',
                       'programmable','user-friendly','modern','modular','automated','energy-saving','hygienic','smart',
                       'flexible','intuitive','collaborative','productive','clutter-free','convenient','mobile','sophisticated',
                       'minimalist','elegant','digital','streamlined','reliable','intuitive']

In [None]:
# author generated
product_suffix_pool = ['makes a great gift for boss', 'best manager gift', 'excellent employee gift', 'can be modified with your logo',
                       '5 boxes', 'box of 3', 'box of 10', 'box of 5', 'set of 10', 'set of 2', 'set of 20', 'elegant gift', '50-count',
                       '100-count', '200-count', 
                       'with your custom message', 'recyclable', 'made in the USA', 'non-slip', 'large capacity', 'comfortable grip',
                       'soft grip', 'one touch', 'with remote', 'assorted colors', '96 color bulk pack', 'for adults and kids', 
                       'non-toxic', 'color-coordinated']
                       

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(2, 44, 2),
                   helpful_votes = np.arange(0, 32), 
                   scale = 0.4, 
                   review_years = np.arange(1995, 2014), 
                   product_category = 'Office_Products', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.4)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/7b80671e92144a27bcc4ea59e656e543_1.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Pet_Supplies

Prompt: "Generate 50 pet supplies related products. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular". The author then provided modifications.

In [None]:
product_pool = ['leash', 'collar', 'bed', 'toy', 'bowl', 'treat', 'brush', 'crate', 'shampoo', 'conditioner', 'harness',
                'sweater', 'carrier', 'litter', 'box', 'scratching pad', 'scratching post', 'aquarium', 'filter', 'food', 'dispenser',
                'water', 'fountain', 'grooming kit', 'nail clipper', 'vitamin', 'supplement', 'training pad', 'pad', 
                'odor', 'remover', 'waste bag', 'waste bag dispenser', 'travel', 'bowl', 'pet', 'gate', 'ramp', 'stairs', 'playpen', 'feeder',
                'house', 'costume', 'bandana', 'tag', 'charm', 'bed', 'cover', 'heating pad', 'wheel', 'tree']

Prompt: "Generate 50 words related to pet supplies. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular" 

In [None]:
product_prefix_pool = ['absorbent', 'chewy','squeaky','durable','cozy','scratchy','feathery', 'collared',
                       'groomed','nibbled','digestible','cuddly',
                       'indestructible','chewable','nutritious','hypoallergenic','ergonomic','eco-friendly','orthopedic',
                       'automated','heated','cooling','antibacterial','deodorizing','waterproof','plush','removable',
                       'adjustable','machine-washable','non-toxic','refillable','training','corrective','enzymatic',
                       'tasty','flavored','interactive','robotic','intelligent','monitored','wearable','portable','hands-free',
                       'remote-controlled']

In [None]:
# author generated
product_suffix_pool = ['for pets', 'for cats', 'for small and large pets', 'for dogs', 'for your furry baby', 
                      'for your furrbaby', 'for small dogs', 'for large dogs', 'for birds', 'for hamsters', 'for indoor and outdoor',
                      'with antimicrobial protection', 'no smell', 'perfect for Halloween', 'collapsible for easy storage',
                      'for cars and home', 'for travel']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(10, 30),
                   helpful_votes = np.arange(1, 25), 
                   scale = 0.4, 
                   review_years = np.arange(2003, 2015), 
                   product_category = 'Pet_Supplies', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.5)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/7b80671e92144a27bcc4ea59e656e543_2.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Sports_Outdoors

Prompt: "Generate 50 product names related to sports and outdoor activities. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['basketball', 'football', 'baseball', 'tennis', 'volleyball', 'frisbee', 'swimsuit', 'surfboard', 'kayak', 'canoe', 
                'tent', 'backpack', 'sleeping bag', 'compass', 'binoculars', 'hiking boots', 'fishing rod', 'fishing line', 'tackle box', 'life jacket', 
                'water bottle', 'sunscreen', 'hat', 'snorkel', 'diving mask', 'wetsuit', 'skateboard', 'rollerblades', 'bicycle', 
                'helmet', 'knee pad', 'elbow pad', 'climbing rope', 'carabiner', 'harness', 'golf club', 'golf ball', 'tee', 
                'racquet', 'shuttlecock', 'goalpost', 'whistle', 'stopwatch', 'scoreboard', 'trophy', 'medal', 'jersey', 'shorts',
                'meal replacement', 'napkin']

Prompt: "Generate 50 words that describe products for sports and outdoor activities. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['lightweight', 'durable', 'breathable', 'waterproof', 'moisture-wicking', 'insulated', 'quick-drying', 
                       'abrasion-resistant', 'shock-absorbing', 'ergonomic', 'buoyant', 'reflective', 'aerodynamic', 'non-slip', 
                       'sweat-resistant', 'adjustable', 'flexible', 'padded', 'compressible', 'weatherproof', 'ventilated', 
                       'UV-resistant', 'high-visibility', 'anti-microbial', 'tear-resistant', 'grip-enhancing', 'cushioned', 
                       'thermal', 'odor-resistant', 'windproof', 'impact-resistant', 'breathable', 'water-repellent', 'anti-chafe', 
                       'snug-fitting', 'quick-release', 'versatile', 'rugged', 'lightweight', 'compact', 'sturdy', 'reinforced', 
                       'hypoallergenic', 'anti-static', 'anti-glare', 'heat-retaining', 'sweat-wicking', 'disposable']
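
The model output is used as returned, so a pool can contain repeats (here 'lightweight' and 'breathable' each appear twice). If `create_dataset` samples a pool uniformly, which is an assumption because the helper's code is not shown in this notebook, repeated entries would be drawn proportionally more often. Below is a minimal sketch for spotting and optionally dropping repeats. `find_duplicates` and `dedupe_keep_order` are hypothetical helper names, and the short `sample_pool` stands in for the full list.

```python
from collections import Counter

def find_duplicates(pool):
    """Return the items that occur more than once, in first-seen order."""
    counts = Counter(pool)
    return [item for item in dict.fromkeys(pool) if counts[item] > 1]

def dedupe_keep_order(pool):
    """Drop repeated items while preserving the original order."""
    return list(dict.fromkeys(pool))

# Shortened stand-in for product_prefix_pool (illustration only)
sample_pool = ['lightweight', 'durable', 'breathable', 'rugged', 'lightweight', 'breathable']

print(find_duplicates(sample_pool))
print(dedupe_keep_order(sample_pool))
```

Whether to de-duplicate is a design choice: keeping the repeats reproduces the pools exactly as recorded, while removing them gives every adjective the same weight.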

In [None]:
# author generated
product_suffix_pool = ['with long lasting properties', 'set of 6', 'with heavy duty clips', 'D ring shape', 'mix in a mesh bag',
                       '36 pack', 'personalized with your name', 'personalized with your logo', 'for the whole team', 'with team logo',
                       'multi-colored', 'great for daily trips', 'organized', 'easy to store', 'with the instruction manual',
                       'for hunting', 'for fishing', 'for walking', 'for running and walking']
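
The prefix, product and suffix pools are passed to `create_dataset` together as `product_components`. How the helper combines them is not shown in this notebook. The sketch below illustrates one plausible approach, assuming a title is one random entry from each pool joined in prefix, product, suffix order. `compose_title` is a hypothetical function, not part of `review_generation_helpers`.

```python
import random

def compose_title(prefix_pool, product_pool, suffix_pool, rng):
    """Hypothetical: pick one entry from each pool and join them with spaces."""
    parts = [rng.choice(prefix_pool), rng.choice(product_pool), rng.choice(suffix_pool)]
    return " ".join(parts)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Shortened stand-ins for the three pools (illustration only)
prefixes = ['lightweight', 'durable']
products = ['tent', 'kayak']
suffixes = ['set of 6', 'for fishing']

title = compose_title(prefixes, products, suffixes, rng)
print(title)
```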

In [None]:
dat = rgh.create_dataset(size_factor = 1, 
                   total_votes = np.arange(10, 71),
                   helpful_votes = np.arange(3, 34, 3), 
                   scale = 0.3, 
                   review_years = np.arange(2001, 2013), 
                   product_category = 'Sports_Outdoors', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)
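
`np.arange` excludes its stop value, so the ranges passed above are half-open. How `create_dataset` uses these arrays is defined in `review_generation_helpers` and is not shown here. The sketch below only confirms which values each array holds.

```python
import numpy as np

total_votes = np.arange(10, 71)        # 10, 11, ..., 70 (71 is excluded)
helpful_votes = np.arange(3, 34, 3)    # 3, 6, ..., 33 (step of 3)
review_years = np.arange(2001, 2013)   # 2001, ..., 2012 (2013 is excluded)

print(total_votes.min(), total_votes.max())
print(helpful_votes.tolist())
print(review_years[0], review_years[-1])
```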

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/a15fc40fcfa545bb93730f74a462d1a3_0.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)
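
With `dataset = True` and `partition_cols`, awswrangler writes Hive-style prefixes such as `product_category=Sports_Outdoors/` under `path`, and `max_rows_by_file` caps the number of rows in each file. A partition should therefore end up in roughly `ceil(rows / max_rows_by_file)` files. The sketch below works through that arithmetic. `expected_file_count` is a hypothetical helper, and the row counts are made up for illustration.

```python
import math

def expected_file_count(n_rows, max_rows_by_file=3_000_000):
    """Number of files needed when each file holds at most max_rows_by_file rows."""
    return math.ceil(n_rows / max_rows_by_file)

# Hypothetical row counts, for illustration only
for n_rows in (2_500_000, 3_000_000, 7_200_000):
    print(n_rows, "->", expected_file_count(n_rows))
```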

In [None]:
row_marker = dat.shape[0]
row_marker

## Tools_Home_Improvement

Prompt: "Generate 50 product names related to tools and home improvement. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['hammer', 'screwdriver', 'wrench', 'pliers', 'drill', 'saw', 'chisel', 'level', 'tape measure', 
                'ladder', 'paintbrush', 'roller', 'sandpaper', 'putty knife', 'caulk gun', 'utility knife', 
                'stud finder', 'clamp', 'vise', 'hacksaw', 'crowbar', 'socket set', 'shovel', 'rake', 'wheelbarrow', 
                'trowel', 'sander', 'grinder', 'hedge trimmer', 'lawnmower', 'leaf blower', 'chainsaw', 'nail gun', 
                'heat gun', 'welding torch', 'soldering iron', 'multimeter', 'voltmeter', 'pipe wrench', 'wire stripper', 
                'bolt cutter', 'sledgehammer', 'pickaxe', 'pry bar', 'plumb bob', 'spirit level', 'chalk line']
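
The prompt asks for a comma separated list of single-quoted words, which is already valid Python list content. The lists in this notebook appear to have been pasted in by hand. As an alternative, a response in that format could be parsed programmatically, as in the minimal sketch below. `parse_quoted_list` is a hypothetical helper and `response_text` is a made-up response fragment.

```python
import ast

def parse_quoted_list(response_text):
    """Parse "'a', 'b', 'c'" into ['a', 'b', 'c'] without using eval()."""
    items = ast.literal_eval("[" + response_text.strip() + "]")
    if not all(isinstance(item, str) for item in items):
        raise ValueError("expected only quoted strings")
    return items

# Made-up response fragment in the format the prompt requests
response_text = "'hammer', 'screwdriver', 'tape measure'"
pool = parse_quoted_list(response_text)
print(len(pool), pool)
```

Comparing `len(pool)` with the number requested in the prompt is a cheap way to notice when the model returns fewer items than asked for.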

Prompt: "Generate 50 words that describe products such as tools and that are used for home improvement. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['sturdy', 'durable', 'ergonomic', 'versatile', 'efficient', 'powerful', 'precise', 'lightweight', 
                       'reliable', 'rust-resistant', 'corrosion-resistant', 'heavy-duty', 'high-performance', 'sleek', 
                       'compact', 'user-friendly', 'multipurpose', 'innovative', 'energy-saving', 'eco-friendly', 'robust', 
                       'long-lasting', 'portable', 'weatherproof', 'scratch-resistant', 'vibration-resistant', 'noise-reducing', 
                       'adjustable', 'flexible', 'lockable', 'swiveling', 'retractable', 'collapsible', 'foldable', 'seamless', 
                       'cordless', 'rechargeable', 'detachable', 'interchangeable', 'customizable', 'programmable', 'automated', 
                       'digital', 'intuitive', 'heated', 'insulated', 'magnetic', 'laser-guided']

In [None]:
# author generated
product_suffix_pool = ['with magnetic holders', 'rust resistant metal parts', 'double sided', 'with large easy to read letters',
                       'with fiberglass handles', 'general purpose', 'all purpose', 'for smaller hands', 'with slip cushion grip', 
                       'large size', 'small size', 'red', 'black', 'yellow', 'quality steel', 'quality materials', 'long lasting materials',
                       'exceptional workmanship', 'with 40V lithium-ion battery', 'orange', 'orange and black colors', '2 years warranty']

In [None]:
dat = rgh.create_dataset(size_factor = 2, 
                   total_votes = np.arange(10, 66, 2),
                   helpful_votes = np.arange(2, 48), 
                   scale = 1, 
                   review_years = np.arange(1996, 2010), 
                   product_category = 'Tools_Home_Improvement', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.6)

Use the remaining rows of the reviews file loaded for Sports_Outdoors, starting at `row_marker`:

In [None]:
dat["review_headline"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_headline"].array
dat["review_body"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_body"].array
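
Two details of this pattern are easy to miss. Assigning `.array` copies the values by position and ignores the index of `reviews`, and an `.iloc` slice that runs past the end of the frame returns fewer rows without complaint. The sketch below shows the same offset logic on toy data with an explicit length check. `take_review_rows` is a hypothetical helper, and the small frames stand in for the real reviews file and two consecutive categories.

```python
import pandas as pd

def take_review_rows(reviews, start, n_rows):
    """Return n_rows rows beginning at position start, or raise if too few remain."""
    chunk = reviews.iloc[start:start + n_rows]
    if len(chunk) != n_rows:
        raise ValueError(
            f"need {n_rows} rows from position {start}, only {len(chunk)} available"
        )
    return chunk

# Toy stand-ins (illustration only)
reviews = pd.DataFrame({"review_headline": [f"headline {i}" for i in range(10)]})
first = pd.DataFrame({"product_category": ["A"] * 4})
second = pd.DataFrame({"product_category": ["B"] * 5})

# First category takes rows 0..3, then the marker records where it stopped
first["review_headline"] = take_review_rows(reviews, 0, len(first))["review_headline"].array
row_marker = len(first)

# Second category continues from the marker, so no review text is reused
second["review_headline"] = take_review_rows(reviews, row_marker, len(second))["review_headline"].array

print(second["review_headline"].tolist())
```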

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Garden_Outdoor

Prompt: "Generate 50 product names related to garden and outdoor. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['rake', 'shovel', 'hose', 'sprinkler', 'lawnmower', 'wheelbarrow', 'pruner', 'trowel', 'watering can', 
                'hoe', 'hedge trimmer', 'leaf blower', 'patio set', 'hammock', 'bird feeder', 'bird bath', 'garden gnome', 
                'sundial', 'wind chime', 'garden bench', 'potting soil', 'fertilizer', 'seed packet', 'bulb', 'plant pot', 
                'garden sculpture', 'outdoor lighting', 'patio heater', 'grill', 'smoker', 'fire pit', 'umbrella', 'gazebo', 
                'trellis', 'pergola', 'garden hose', 'nozzle', 'garden fork', 'cultivator', 'edger', 'pruning saw', 
                'loppers', 'garden shears', 'gardening gloves', 'gardening hat', 'gardening apron', 'gardening knee pad', 
                'gardening tool set', 'garden cart', 'compost bin']

Prompt: "Generate 50 words that describe products for garden and outdoors. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular". The author then provided modifications.

In [None]:
product_prefix_pool = ['durable','weather-resistant','eco-friendly','sustainable','ergonomic','portable','lightweight','compact',
                       'multi-functional','versatile','decorative','ornamental','rust-proof','water-resistant','low-maintenance',
                       'vibrant','colorful','elegant','rustic','vintage','stylish','contemporary','sleek','modern','innovative',
                       'space-saving','efficient','user-friendly','sturdy','robust','reliable','easy-to-clean','easy-to-assemble',
                       'foldable','adjustable','lockable','breathable','ventilated','insulated','energy-saving','rechargeable',
                       'cordless','solar-powered','automated','programmable', 'uv-resistant', 'uv-protective', 'spacious']                   

In [None]:
# author generated
product_suffix_pool = ['from heavy duty material', 'for patio, backyard and garden', 'DIY kit', 'quality structure', 
                       'one size fits all', 'washable canvas', 'with non-slip grip', 'excellent gift for a new homeowner', 
                       'with storage basket', "ideal gift for women, birthdays","great gift for a gardener", 
                       "great gift for a beginner", "for raised beds", "deer resistant", "for pots", '2 years warranty']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(1, 40, 3),
                   helpful_votes = np.arange(1, 30, 2), 
                   scale = 0.6, 
                   review_years = np.arange(2002, 2018), 
                   product_category = 'Garden_Outdoor', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/a15fc40fcfa545bb93730f74a462d1a3_1.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Arts_Crafts_Sewing

Prompt: "Generate 100 product names related to arts, crafts and sewing. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['paintbrush', 'sketchbook', 'easel', 'canvas', 'palette', 'pastel', 'marker', 'charcoal', 'calligraphy', 'ink', 
                'watercolor', 'acrylic paints', 'oil brushes', 'oil for painting', 'pencil', 'eraser', 'sharpener', 'ruler', 'compass', 'protractor', 'scissor', 
                'glue', 'tape', 'stencil', 'stamp', 'embosser', 'embroidery', 'needle', 'thread', 'yarn', 'fabric', 'felt', 'ribbon', 
                'button', 'zipper', 'clasp', 'hook', 'bead', 'sequin', 'feather', 'leather', 'clay', 'polymer clay', 'resin', 'mold', 
                'air dry clay', 'carving tools', 'engraving tools', 'etching tools', 'printing supplies', 'calligraphy kit', 
                'lettering', 'font', 'typeface', 
                'quilling', 'origami', 'papercutting', 'decoupage', 'collage', 'mosaic', 'macrame', 'weaving', 'knitting', 
                'crocheting', 'felting', 'dyeing', 'painting', 'sketching', 'drawing', 'coloring', 'scrapbooking', 'journaling', 
                'cardmaking', 'stamping', 'stenciling', 'woodburning', 'woodcarving', 'metalsmithing', 'jewelry', 'beading', 
                'wirework', 'enameling', 'glassblowing', 'ceramics', 'pottery', 'sculpture', 'photography', 'framing', 'matting']

In [None]:
# author generated
product_prefix_pool = ['cotton', 'polyester', 'glass', 'wooden', 'full-color', 'multicolored', 'natural', 'premium', 'custom',
                      'rubber', 'personalized', 'durable', 'sturdy', 'easy to handle', 'beautiful', 'sleek']

In [None]:
# author generated
product_suffix_pool = ['for beginners', 'perfect for beginners', 'complete kit', '120 pieces kit', 'starter kit', 'for Christmas trees',
                      'pack of 1', 'pack of 3', 'pack of 5', 'pack of 10', 'for drawing', 'for illustration', 'for all media', 'for coloring',
                      'acid free', 'for wedding', 'for anniversary', 'for birthdays', 'for engagement', 'with floral pattern',
                      'mounted', 'for card making', 'large bundle', 'for crafting', 'for DIY projects', 'for toys, sporting goods and glass',
                      'DIY arts, crafts project', 'for beginners and professionals']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(10, 80),
                   helpful_votes = np.arange(9, 77), 
                   scale = 0.6, 
                   review_years = np.arange(2001, 2011), 
                   product_category = 'Arts_Crafts_Sewing', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.3)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/a15fc40fcfa545bb93730f74a462d1a3_2.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Health_Household

Prompt: "Generate 50 products related to health and household. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['vitamin', 'supplement', 'medicine', 'bandage', 'thermometer', 'disinfectant', 'soap', 'shampoo', 'conditioner', 
                'toothbrush', 'toothpaste', 'floss', 'mouthwash', 'deodorant', 'lotion', 'sunscreen', 'cleanser', 'towel', 
                'tissue', 'detergent', 'bleach', 'sponge', 'brush', 'mop', 'broom', 'vacuum', 'duster', 'trash', 'can', 'bag', 
                'container', 'box', 'basket', 'organizer', 'shelf', 'rack', 'closet', 'cabinet', 'drawer', 'light', 'bulb', 
                'battery', 'charger', 'cable', 'adapter', 'extension', 'cord', 'tape', 
                'glue', 'scissors']

Prompt: "Generate 50 words that describe products for health and household. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['natural','organic','eco-friendly','sustainable','biodegradable','non-toxic','hypoallergenic',
                       'cruelty-free','vegan','plant-based','gentle','fragrance-free','antibacterial','antimicrobial',
                       'multipurpose','versatile','durable','long-lasting','compact','lightweight','ergonomic',
                       'energy-efficient','water-saving','stain-resistant','scratch-proof','unscented','disinfecting',
                       'deodorizing','air-purifying','odor-eliminating','moisture-wicking','quick-drying','absorbent',
                       'leak-proof','spill-proof','dustproof','lint-free','pet-friendly','child-safe','dishwasher-safe',
                       'microwave-safe','oven-safe','freezer-safe','insulated','recyclable','compostable']

In [None]:
# author generated
product_suffix_pool = ['under sink storage', 'for office and living room', 'for baby room', 'with skin restorative complex',
                      'with hair restorative complex', 'navy blue', 'green', 'blue', 'yellow', 'white', 'grey', 'black',
                      'natural color', 'with woodgrain pattern', 'with hooks', 'smell preventing', 'smell trapping', 'for kitchen',
                      'for bathroom', 'for toilets', 'for closet', 'child-safe', 'kid-safe']

In [None]:
dat = rgh.create_dataset(size_factor = 2, 
                   total_votes = np.arange(7, 69),
                   helpful_votes = np.arange(4, 50), 
                   scale = 0.2, 
                   review_years = np.arange(1998, 2013), 
                   product_category = 'Health_Household', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/c74b5acddec04b65a16193140fd25fab_0.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
row_marker = dat.shape[0]
row_marker

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Computers

Prompt: "Generate 50 products related to computers. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['laptop','desktop','keyboard','mouse','monitor','printer','speaker','webcam','headset','microphone','router',
                'modem','cable','adapter','charger','battery','case','ram','processor','motherboard','graphics card',
                'power supply','hard drive','solid state drive','optical drive','projector','scanner','camera','camcorder',
                'tablet','stylus','dock','hub','antivirus','firewall','vpn','network switch','ethernet cable','wireless adapter',
                'server','software','operating system','application','external drive','usb flash drive','cooling fan']

Prompt: "Generate 50 words related to computer products. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['portable','lightweight','compact','ergonomic','sleek','durable','intuitive','innovative','versatile',
                       'responsive','efficient','secure','user-friendly','high-performance','multifunctional','customizable',
                       'advanced','intelligent','eco-friendly','wireless','interactive','multimedia','cutting-edge','integrated',
                       'futuristic','sophisticated','robust','powerful','flexible','dynamic','compatible','intuitive',
                       'energy-efficient','reliable','stylish','elegant','rugged','practical','affordable','accessible',
                       'high-quality','fast','smart','premium','versatile','modern','adaptable','convenient','streamlined']

In [None]:
# author generated
product_suffix_pool = ['with 6 outlets', 'quiet', 'very quiet', 'small size', 'under the desk storage', 'modern standby support',
                      'black', 'fully modular', '1 year warranty', '2 years warranty', '10 years warranty', 'for gaming', 
                      'with cloud backup and virus protection', 'with cloud backup and dark-web monitoring', 'upgraded version',
                      'newest version', 'latest model', 'latest version', 'with updated firmware', 'water and dust resistant',
                      '2nd generation', '1st generation', '3rd generation', 'with a sleeve', 'unlocked', 'refurbished', 'renewed']

In [None]:
dat = rgh.create_dataset(size_factor = 1, 
                   total_votes = np.arange(20, 120),
                   helpful_votes = np.arange(10, 60), 
                   scale = 0.8, 
                   review_years = np.arange(1996, 2017), 
                   product_category = 'Computers', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.4)

Use the remaining rows of the reviews file loaded for Health_Household, starting at `row_marker`:

In [None]:
dat["review_headline"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_headline"].array
dat["review_body"] = reviews.iloc[row_marker:(row_marker + dat.shape[0])]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Books

Prompt: "Generate 50 products related to books. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['book', 'novel', 'textbook', 'magazine', 'comic', 'journal', 'dictionary', 'thesaurus', 'encyclopedia', 
                'atlas', 'anthology', 'memoir', 'biography', 'autobiography', 'essay', 'poetry', 'play', 'script', 
                'manuscript', 'ebook', 'audiobook', 'paperback', 'hardcover', 'bookmark', 'bookcase', 'bookshelf', 
                'bookstand', 'bookholder', 'bookend', 'booklight', 'bookbag', 'bookcover', 'bookbinding', 'bookplate', 
                'bookmarker', 'booklet', 'bookmark']

Prompt: "Generate 50 products related to books. Each word must be an adjective. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_prefix_pool = ['literary', 'educational', 'fictional', 'non-fictional', 'classic', 'modern', 'best-selling', 'popular', 
                       'acclaimed', 'award-winning', 'illustrated', 'hardcover', 'paperback', 'digital', 'audio', 'signed', 
                       'collectible', 'rare', 'vintage', 'antique', 'scholarly', 'academic', 'reference', 'informative', 
                       'instructional', 'inspirational', 'motivational', 'self-help', 'biographical', 'autobiographical', 
                       'historical', 'scientific', 'technical', 'childrens', 'young adult', 'adventure', 'romantic', 'mysterious', 
                       'suspenseful', 'thrilling', 'horror', 'fantasy', 'sci-fi', 'comic', 'graphic', 'coffee table', 'travel', 
                       'cookbook']

In [None]:
# author generated
product_suffix_pool = ['large print', 'electronic version', 'with authors commentary', 'with excerpts from the upcoming book',
                      'latest edition', 'comprehensive volume', 'full volume', 'unabridged', 'English translation', 'French translation',
                      'Spanish translation', 'Russian translation', 'Chinese translation', 'unabridged version', '2-volume set',
                      'with updates', 'student text', 'self-teaching guide', 'selected works', 'trilogy', 'boxed set', 'leather bound set',
                      'cardstock', '1 pc', 'cute']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(40, 150, 3),
                   helpful_votes = np.arange(10, 30), 
                   scale = 0.4, 
                   review_years = np.arange(1996, 2018), 
                   product_category = 'Books', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/c74b5acddec04b65a16193140fd25fab_1.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Music

Prompt: "Generate 100 music products. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['guitar', 'piano', 'drum', 'violin', 'trumpet', 'saxophone', 'harmonica', 'flute', 'clarinet', 'trombone', 
                'cello', 'harp', 'ukulele', 'synthesizer', 'keyboard', 'microphone', 'amplifier', 'speaker', 'headphone', 
                'mixer', 'turntable', 'recorder', 'tuner', 'metronome', 'cajon', 'tambourine', 'maracas', 'cowbell', 
                'shaker', 'triangle', 'cymbal', 'gong', 'xylophone', 'vibraphone', 'melodica', 'kazoo', 'accordion', 
                'banjo', 'mandolin', 'sitar', 'kalimba', 'didgeridoo', 'djembe', 'bongo', 'congas', 'timbale', 'clave', 
                'guiro', 'rainstick', 'beatbox', 'sampler', 'vocoder', 'distortion', 'reverb', 'delay', 'compressor', 
                'equalizer', 'looper', 'sequencer', 'drumpad', 'groovebox', 'guitar', 
                'pedal board', 'cable manager', 'music stand', 'sustain', 'capo', 'slidebar', 'pick', 'drumstick', 'mallets', 
                'bagpipe', 'shehnai', 'erhu', 'koto', 'shamisen', 'saz', 'oud', 'qanun', 'daf', 'bodhrán', 'concertina', 
                'harmonica', 'melodion', 'steelpan', 'washboard', 'jawsharp', 'mridangam', 'ghatam', 'kanjira', 'veena', 'sarangi']

In [None]:
# author generated
product_prefix_pool = ['acoustic', 'electric', 'bass', 'mahogany', 'portable', 'full size', 'black', 'professional',
                      'semi-professional', 'smart', 'rosewood', 'miniature', 'wooden', 'all wood']

In [None]:
# author generated
product_suffix_pool = ['instrument kit', 'with tuner', 'with shoulder strap', 'with picks', 'with picks for beginners',
                      'with the guide book for beginners', 'with extra strings', 'table version, collectable', 'collectable',
                      'small size', 'table top set', 'with travel case', 'with power supply', 'for beginners', 'expandable',
                      'for electronic music making', 'black and silver', 'silver', 'black', 'handheld', 'with a switch', 
                      'concert grade', 'for kids', 'for adults', 'full size', 'natural wood gloss']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(11, 91),
                   helpful_votes = np.arange(3, 71), 
                   scale = 0.4, 
                   review_years = np.arange(1996, 2018), 
                   product_category = 'Music', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 1)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/c74b5acddec04b65a16193140fd25fab_2.snappy.parquet')

In [None]:
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Movies_TV

Here the authors sent multiple requests to the model to generate movie and TV titles for different genres, using the following prompts:

- Generate 50 titles for action movies. Provide the output as a comma separated list, surround each title in single quotes.

- Generate 50 titles for romantic movies. Provide the output as a comma separated list, surround each title in single quotes.

- Generate 50 titles for documentary movies. Provide the output as a comma separated list, surround each title in single quotes.

- Generate 50 titles for historic movies. Provide the output as a comma separated list, surround each title in single quotes.

- Generate 50 titles for children's movies. Provide the output as a comma separated list, surround each title in single quotes.

- Generate 50 titles for comedic movies. Provide the output as a comma separated list, surround each title in single quotes.
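
The six prompts above differ only in the genre word, so they can be produced from a single template. The sketch below builds the prompt strings only and does not call the model. `GENRES` and `PROMPT_TEMPLATE` are names introduced here for illustration.

```python
# Genre words exactly as they appear in the prompts listed above
GENRES = ["action", "romantic", "documentary", "historic", "children's", "comedic"]

PROMPT_TEMPLATE = (
    "Generate 50 titles for {genre} movies. Provide the output as a comma separated list, "
    "surround each title in single quotes."
)

# One prompt per genre, in the same order as the list above
prompts = [PROMPT_TEMPLATE.format(genre=genre) for genre in GENRES]

for prompt in prompts:
    print(prompt)
```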

In [None]:
product_pool = ['Rogue Assassin', 'Explosive Vengeance', 'Lethal Strike', 'Adrenaline Rush', 'Bullet Proof', 'Unstoppable Force',
                       'Maximum Impact', 'Relentless Pursuit', 'Crimson Requiem', 'Shattered Redemption', 'Fists of Fury', 
                       'Chaos Reigned', 'Collateral Damage', 'Eternal Reckoning', 'Unleashing the Beast', 'Scorched Earth', 
                       'Mercenary Uprising', 'Ultima Protocol', 'Deadly Infiltration', 'Expendable Retribution', 'Savage Fury', 
                       'Firestorm Rising', 'Armageddon Operative', 'Doomsday Directive', 'Renegade Enforcer', 'Detonation Point', 
                       'Apex Predator', 'Decimation Code', 'Crimson Retaliation', 'Havoc Unleashed', 'Cataclysmic Impact', 
                       'Hellfire Blitz', 'Obliteration Protocol', 'Annihilation Sequence', 'Infernal Supremacy', 'Archangels Wrath', 
                       'Epoch of Carnage', 'Extinction Event', 'Devastation Vortex', 'Apocalyptic Purge', 'Maelstrom Operative', 
                       'Cyber Insurrection', 'Terminal Velocity', 'Rampage Agenda', 'Vanguard of Chaos', 'Cataclysmic Vengeance', 
                       'Maelstrom Rising', 'Eternal Embrace', 'Loves Symphony', 'Whispers of the Heart', 'Destined Souls', 
                       'Midnight Serenade', 'Autumn Bliss', 'Unbreakable Bond', 'Crimson Desire', 'Serendipitous Encounter', 
                       'Forever and a Day', 'Twilight Rendezvous', 'Celestial Dance', 'Velvet Embrace', 'Moonlit Passion', 
                       'Everlasting Promise', 'Amorous Rhapsody', 'Rapturous Rhythm', 'Enchanted Affection', 'Soulful Symphony', 
                       'Ardent Devotion', 'Poetic Bliss', 'Amorous Overture', 'Intoxicating Melody', 'Cherished Rapture', 
                       'Infinite Adoration', 'Ethereal Caress', 'Spellbinding Aria', 'Transcendent Desire', 'Impassioned Embrace', 
                       'Entrancing Sonata', 'Tender Rhapsody', 'Fervent Allegro', 'Blissful Serenade', 'Enamored Concerto', 
                       'Exquisite Affinity', 'Mesmerizing Overture', 'Euphoric Cadence', 'Rapturous Reverie', 'Impassioned Crescendo', 
                       'Enthralling Opus', 'Amorous Elegy', 'Fervent Adagio', 'Soulful Interlude', 'Captivating Refrain', 
                       'Entrancing Idyll', 'Ethereal Rhapsody', 'Enraptured Fantasia', 'Unraveling the Universe', 
                       'The Secret Lives of Oceans', 'Resilience: Stories of Human Triumph', 'Endangered: A Cry for Conservation', 
                       'The Digital Revolution: Reshaping Our World', 'Culinary Explorers: A Gastronomic Journey', 
                       'Homeless Voices: Unheard Stories', 'Behind the Canvas: The Art of Masterpieces', 
                       'Untamed Wilderness: Exploring Earths Last Frontiers', 'The Rise of Artificial Intelligence', 
                       'Forgotten Histories: Untold Tales of the Past', 'The Mindfulness Movement: Finding Inner Peace', 
                       'Extreme Survival: Pushing Human Limits', 'Beneath the Surface: Uncovering Ancient Civilizations', 
                       'The Future of Energy: Sustainable Solutions', 'Monuments of Humankind: Architectural Wonders', 
                       'Unsung Heroes: Inspiring Acts of Courage', 'The Cosmic Odyssey: Exploring the Universe', 
                       'Music Across Cultures: A Melodic Journey', 'The Science of Happiness: Unlocking Lifes Secrets', 
                       'Pioneers of Progress: Visionaries Who Shaped Our World', 'The World of Sports: Passion and Perseverance', 
                       'Forgotten Wars: Untold Stories of Conflict', 'The Future of Medicine: Revolutionary Breakthroughs', 
                       'Extraordinary Minds: Exploring Genius', 'The Language Barrier: Bridging Cultural Divides', 
                       'Sustainable Living: A Path to a Greener Future', 'Unsung Innovators: Ideas That Changed the World', 
                       'The Power of Dreams: Inspiring Life Stories', 'Vanishing Traditions: Preserving Cultural Heritage', 
                       'The Invisible Struggle: Mental Health Unveiled', 'Frontiers of Exploration: Pushing the Boundaries', 
                       'The Science of Food: Unraveling Culinary Mysteries', 'The Art of Storytelling: Captivating Narratives', 
                       'The Human Condition: Exploring Our Existence', 'Saving Species: Conservation Efforts Worldwide', 
                       'The World of Dance: Movement and Expression', 'Underwater Wonders: Exploring Earths Oceans', 
                       'The Future of Transportation: Innovative Solutions', 'Citizen Scientists: Unlocking Knowledge for All', 
                       'The Hidden World of Insects: Tiny Marvels', 'Extraordinary Educators: Inspiring Young Minds', 
                       'The Science of Love: Understanding Human Connections', 'Unsung Heroes of History: Untold Stories of Bravery',
                       'The Music Revolution: Shaping Cultural Movements', 'Extreme Environments: Life on the Edge',
                       'The Conquerors Legacy', 'Echoes of Valor', 'Destinys Crucible', 'Whispers of Revolution', 
                       'The Monarchs Downfall', 'Forged in Fire', 'Clash of Empires', 'The Forgotten Warrior', 
                       'Shadows of the Past', 'A Nation Divided', 'The Price of Freedom', 'Legends of the Battleground', 
                       'Triumph and Tragedy', 'Relics of Antiquity', 'Echoes of Rebellion', 'The Sands of Time', 'Immortal Legacies', 
                       'The Rise of a Dynasty', 'Whispers of the Ancients', 'Eternal Conquest', 'The Unforgotten Heroes', 
                       'Remnants of Glory', 'Echoes of Rebellion', 'The Vanquished Empire', 'Remnants of an Era', 
                       'Relics of a Bygone Age', 'The Fallen Kingdoms', 'Whispers of the Ancestors', 'Echoes of Valor', 
                       'Immortal Legacies', 'The Rise of a Dynasty', 'Whispers of the Ancients', 'Eternal Conquest', 
                       'The Unforgotten Heroes', 'Remnants of Glory', 'Echoes of Rebellion', 'The Vanquished Empire', 
                       'Remnants of an Era', 'Relics of a Bygone Age', 'The Fallen Kingdoms', 'Whispers of the Ancestors', 
                       'Echoes of Valor', 'Immortal Legacies', 'The Rise of a Dynasty', 'Whispers of the Ancients', 
                       'Eternal Conquest', 'The Magical Treehouse Adventure', 'Dino Explorers', 'Starry Night Dreamers', 
                       'The Curious Robot', 'Mermaids Treasure Cove', 'Superhero Schoolyard', 'Friendship Forest Fables', 
                       'Circus of Wonders', 'Enchanted Woodlands', 'Intergalactic Explorers', 'Unicorn Meadows', 'Pirate Island Quest', 
                       'Fairytale Kingdom Chronicles', 'Wizards Apprentice', 'Dinosaur Rescue Rangers', 'Candy Land Shenanigans', 
                       'Mythical Creature Companions', 'Rainforest Rumble', 'Underwater Odyssey', 'Playtime Pandemonium', 
                       'Secret Clubhouse Capers', 'Puppet Theater Antics', 'Farmyard Frolic', 'Storybook Singalong', 
                       'Alien Adventurers', 'Magical Museum Mishaps', 'Superhero School', 'Enchanted Toy Emporium', 
                       'Outer Space Odyssey', 'Junkyard Robots', 'Mythical Creature Academy', 'Fairytale Forest Friends', 
                       'Circus of Dreams', 'Dinosaur Discoveries', 'Wonderland Wanders', 'Pirate Ship Escapades', 
                       'Magical Treehouse Chronicles', 'Superhero Sidekicks', 'Mermaid Melodies', 'Woodland Whimsies', 
                       'Alien Explorers', 'Candy Land Quests', 'Puppet Theater Plays', 'Farmyard Follies', 'Storybook Singalongs', 
                       'Rainforest Ramblings', 'The Laughing Llama', 'Giggles in the Grocery Aisle', 'Punchline Pandemonium', 
                       'Chuckle Chaos', 'Hysteria on the High Seas', 'Splitting Sides in Split', 'Knee-Slapping Knights', 
                       'Guffaw Galaxy', 'Chortles and Chaos', 'Snicker Snafus', 'Laughter Liftoff', 'Funny Farm Fiasco', 
                       'Giggle Galore', 'Mirth Mayhem', 'Hilarity Hijinks', 'Titter Town', 'Cackle Capers', 'Grin and Guffaw', 
                       'Rib-Tickling Romp', 'Laugh Riot Rampage', 'Yuks Unleashed', 'Snort-Worthy Shenanigans', 'Howl with Hilarity', 
                       'Chortle Circus', 'Guffaw Gala', 'Chuckle Champs', 'Funny Bone Bonanza', 'Laughter Lollapalooza', 
                       'Giggle Gang', 'Mirth Marathon', 'Grin Galore', 'Chuckle Chaos', 'Laugh-a-Palooza', 'Comedy Caper',
                       'Guffaw Gauntlet', 'Snicker Showdown', 'Hilarity Hell', 'Titter Tornado', 'Giggle Gauntlet', 'Mirth Madness',
                       'Chuckle Champs', 'Laugh Riot Rodeo', 'Guffaw Galore', 'Snort-Worthy Showdown', 'Chortle Chaos', 
                       'Grin and Giggle', 'Mirth Mania']
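
Pools assembled from several prompt outputs can contain repeated titles, which are then sampled more often, or adjacent literals fused by a misplaced comma. A minimal sanity check is sketched below; `sample_pool` is a hypothetical stand-in for the full list, and the comma heuristic assumes that genuine titles contain no commas.

```python
from collections import Counter

# Hypothetical sample standing in for the full product_pool list.
# The last two literals have the comma inside the quotes, so Python
# joins them into one string via implicit concatenation.
sample_pool = ['Chuckle Chaos', 'Mirth Marathon', 'Chuckle Chaos',
               'Laugh-a-Palooza,' 'Comedy Caper']

# Titles that occur more than once in the pool.
duplicates = {title: n for title, n in Counter(sample_pool).items() if n > 1}

# Titles containing a comma, which usually signals two fused literals.
suspicious = [title for title in sample_pool if ',' in title]

print(duplicates)
print(suspicious)
```

Running the same two checks on the real `product_pool` before calling `rgh.create_dataset` shows whether any title is over-represented or malformed.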

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(15, 2000, 5),
                   helpful_votes = np.arange(15, 1500), 
                   scale = 0.6, 
                   review_years = np.arange(2004, 2017), 
                   product_category = 'Movies_TV', 
                   product_components = [product_pool],
                   marketplace_factor = 0.5)

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/instance1_dat1.parquet')

In [None]:
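# Attach the first len(dat) pre-generated headlines and bodies.
# Using .array assigns values by position rather than aligning on the index.
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array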

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)

## Grocery_Gourmet_Food

Prompt: "Generate 100 products in groceries and gourmet foods. Provide the output as a comma separated list, surround each word in single quotes, each word as lower case, each word is singular"

In [None]:
product_pool = ['apple', 'orange', 'banana', 'grape', 'lemon', 'lime', 'pineapple', 'mango', 'strawberry',
                'blueberry', 'raspberry', 'blackberry', 'kiwi', 'peach', 'plum', 'apricot', 'nectarine', 'avocado',
                'tomato', 'onion', 'potato', 'carrot', 'broccoli', 'cauliflower', 'spinach', 'lettuce', 'cucumber',
                'bell pepper', 'zucchini', 'eggplant', 'mushroom', 'garlic', 'ginger', 'bread', 'bagel', 'croissant',
                'muffin', 'doughnut', 'cookie', 'cake', 'pie', 'tart', 'pastry', 'cheese', 'yogurt', 'milk', 'butter',
                'egg', 'beef', 'chicken', 'turkey', 'pork', 'fish', 'shrimp',
                'crab', 'lobster', 'oyster', 'pasta', 'rice', 'cereal', 'oatmeal', 'honey', 'jam', 'peanut butter', 'olive oil', 
                'vinegar', 'ketchup', 'mustard', 'mayonnaise', 'salsa', 'hummus', 'guacamole', 'salad dressing', 'chip', 'cracker', 
                'pretzel', 'popcorn', 'nut', 'seed', 'dried fruit', 'candy', 'chocolate', 'ice cream', 'sorbet', 'froyo', 'coffee', 
                'tea', 'juice', 'soda', 'water', 'beer', 'wine', 'liquor']

In [None]:
product_prefix_pool = ['fresh','organic','gourmet','savory','sweet','tangy','zesty','creamy','crunchy','juicy',
                       'succulent','tender','flavorful','aromatic','spicy','pungent','earthy','nutty','robust',
                       'velvety','buttery','crispy','cheesy','baked','sauteed','grilled','smoked','brined','pickled',
                       'fermented','dried','candied','glazed','frosted','toasted','infused','aged','artisanal',
                       'premium','imported','exotic','rare','authentic','decadent','indulgent','exquisite','wholesome',
                       'nutritious','delectable','mouthwatering']

In [None]:
# author generated
product_suffix_pool = ['box of 10', 'box of 5', 'box of 12', 'box of 24',
                       '6 cans', '1 can', '5 bars', '10 bars',
                       '1 bag', 'bag of 12', 'family pack', 'party pack', 'family sized bag',
                       '1 pound', '5 pounds', 
                       '1 bottle', '2 bottles', '3 bottles', '12 bottles',
                       'bundle of 3', '1 carton', '12 cartons',
                       'a gift basket', 'celebration basket', 
                       'equally portioned', 'low calories', 'low in sugar', 'low in fat', 'vegan', 'vegetarian',
                       'nuts and seeds free', 'low glycemic index', 'healthy food', 'great snack', 'great for kids',
                       'great for kids and adults', 'sugarless', 'only natural sugar', 'no artificial sweeteners',
                       'no high fructose syrup', 'no trans fats', 'perfect for low calorie diet']

In [None]:
dat = rgh.create_dataset(size_factor = 3, 
                   total_votes = np.arange(7, 50, 2),
                   helpful_votes = np.arange(2, 45, 3), 
                   scale = 0.3, 
                   review_years = np.arange(1997, 2007), 
                   product_category = 'Grocery_Gourmet_Food', 
                   product_components = [product_prefix_pool, product_pool, product_suffix_pool],
                   marketplace_factor = 0.9)
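
The internals of `rgh.create_dataset` are not shown in this notebook. As an illustration only, the sketch below assumes that a product title is built by drawing one entry from each list in `product_components` and joining the parts with spaces. The function `compose_title` and the three short lists are hypothetical stand-ins, not the helper's actual code.

```python
import random

# Hypothetical excerpts of the prefix, product and suffix pools.
prefixes = ['fresh', 'organic', 'gourmet']
products = ['apple', 'honey', 'coffee']
suffixes = ['box of 10', '1 bag', 'family pack']


def compose_title(components, rng):
    """Pick one entry from each component list and join them with spaces."""
    return ' '.join(rng.choice(pool) for pool in components)


rng = random.Random(42)  # seeded so repeated runs give the same titles
titles = [compose_title([prefixes, products, suffixes], rng) for _ in range(5)]

for title in titles:
    print(title)
```

Under this assumption, a category with a single component list (such as `[product_pool]` for Movies_TV) would use the pool entries directly as titles, while three lists yield prefix, product and suffix combinations.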

In [None]:
reviews = pd.read_parquet(s3_bucket_text + 
                          '/review_body_headline/instance1_dat2.parquet')

In [None]:
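# Attach the first len(dat) pre-generated headlines and bodies.
# Using .array assigns values by position rather than aligning on the index.
dat["review_headline"] = reviews.iloc[0:dat.shape[0]]["review_headline"].array
dat["review_body"] = reviews.iloc[0:dat.shape[0]]["review_body"].array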

In [None]:
wr.s3.to_parquet(
    df = dat[["product_category", "marketplace", "customer_id", "review_id", "product_id", "product_title", "star_rating",
            "helpful_votes", "total_votes", "insight", "review_headline", "review_body", "review_date", "review_year"]],
    path = s3_bucket_output,
    dataset = True,
    max_rows_by_file = 3000000,
    partition_cols = ['product_category']
)