# Amazon Reviews Data Processing

This notebook is responsible for **loading, cleaning, and processing raw Amazon reviews** and product metadata from the "Cell Phones and Accessories" category. The dataset is sourced from **Amazon Reviews 2023** and is used to generate structured prompt-response pairs for fine-tuning an LLM.

Key Steps:  
- Load the reviews (`Cell_Phones_and_Accessories.jsonl`) and product metadata (`meta_Cell_Phones_and_Accessories.jsonl`).
- Filter and clean the data to remove noise and incomplete entries.
- Structure the data into **instruction-based format** for LLM training.
- Split and save the final dataset into `train.jsonl` and `test.jsonl`, which are later used for fine-tuning and inference. Rows left over after both splits are filled are written to `validation.jsonl`.

This processed dataset serves as the foundation for synthetic data generation.
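
As a rough illustration of the instruction-based format, the sketch below builds one record with the same keys that this notebook writes to the JSONL files. The helper name `build_record` and the sample values are hypothetical; only the key names and the system prompt text come from the notebook.

```python
import json

# System prompt text as used by the JSONL writer later in this notebook
SYSTEM_PROMPT = "Given the Rating and Title, you are required to generate the review"


def build_record(rating, review_title, review_text, product_title, product_category):
    """Return one record as a plain dict (hypothetical helper, for illustration)."""
    return {
        "System prompt": SYSTEM_PROMPT,
        "Rating": int(rating),
        "Review Title": review_title,
        "Review": review_text,
        "Product Title": product_title,
        "Product Categories": product_category,
    }


# Made-up placeholder values, not rows from the dataset
example = build_record(
    5,
    "Placeholder review title",
    "Placeholder review text.",
    "Placeholder product title",
    "Cell Phones & Accessories",
)

# One JSONL line is the record serialized as JSON on a single line
line = json.dumps(example)
print(line)
```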


## Constants

In [None]:
TRAIN_COUNT = 100000
TEST_COUNT = 10000
# Buffer to account for rows with empty columns
BUFFER_COUNT = 50000

# Paths to the data files
REVIEWS_DATA_PATH = "raw-data/Cell_Phones_and_Accessories.jsonl"
PRODUCTS_DATA_PATH = "raw-data/meta_Cell_Phones_and_Accessories.jsonl"

TRAIN_DATA_PATH = "final-data/train.jsonl"
TEST_DATA_PATH = "final-data/test.jsonl"
VALIDATION_DATA_PATH = "final-data/validation.jsonl"



## Reading the data

In [2]:
import pandas as pd
import json

In [3]:
# Read more rows than needed so that enough remain after rows with empty columns are dropped
ROWS = TRAIN_COUNT + TEST_COUNT + BUFFER_COUNT
reviews = pd.read_json(REVIEWS_DATA_PATH, lines = True, nrows = ROWS)

# Read the full product metadata: any review may reference any product, and we need each product's title and main category
products = pd.read_json(PRODUCTS_DATA_PATH, lines = True)
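
Loading the full metadata file into a DataFrame keeps every column in memory, although only three fields are used later. If memory becomes a problem, one possible alternative is to stream the file line by line and keep just those fields. This is a sketch under that assumption, not what the notebook does; `load_product_lookup` is a hypothetical helper, and the field names `parent_asin`, `main_category`, and `title` are the ones visible in the metadata preview.

```python
import json


def load_product_lookup(path):
    """Map parent_asin -> (main_category, title), reading one JSON line at a time."""
    lookup = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines instead of failing on them
                continue
            item = json.loads(line)
            # .get() returns None when a field is absent from the record
            lookup[item["parent_asin"]] = (item.get("main_category"), item.get("title"))
    return lookup


# Usage (path constant as defined in the Constants cell):
# product_lookup = load_product_lookup(PRODUCTS_DATA_PATH)
```

The trade-off is that a dictionary lookup replaces the `pd.merge` step, so the later cells would need to be adapted accordingly.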

In [4]:
reviews.head(2)

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,4,No white background! It’s clear!,I bought this bc I thought it had the nice whi...,[{'small_image_url': 'https://images-na.ssl-im...,B08L6L3X1S,B08L6L3X1S,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,2021-01-30 22:07:31.196,0,True
1,5,Awesome! Great price! Works well!,Perfect. How pissed am I that I recently paid ...,[],B079BPGF6C,B079BPGF6C,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,2018-08-16 18:18:37.349,2,True


In [5]:
products.head(2)

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,Cell Phones & Accessories,ARAREE Slim Diary Cell Phone Case for Samsung ...,3.8,5,"[Genuine Cow leather with 6 different colors, ...","[JUST LOOK, You can tell the difference. Make ...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],araree,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Product Dimensions': '3.35 x 0.59 x 6.18 inc...,B013SK1JTY,,,
1,Cell Phones & Accessories,Bastmei for OnePlus 7T Case Extremely Light Ul...,4.4,177,[Ultra-thin & Ultra-light: The ultra slim fit ...,[],11.98,[{'thumb': 'https://m.media-amazon.com/images/...,[],Bastmei,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Package Dimensions': '7.6 x 4.29 x 0.75 inch...,B07ZPSG8P5,,,


In [6]:
# Rename columns in reviews DataFrame
reviews = reviews.rename(columns={
    'title': 'review_title',
    'text': 'review_text'
})

# Rename columns in products DataFrame
products = products.rename(columns={
    'title': 'product_title',
    'main_category': 'product_main_category'
})
print(reviews.columns)
print(products.columns)

Index(['rating', 'review_title', 'review_text', 'images', 'asin',
       'parent_asin', 'user_id', 'timestamp', 'helpful_vote',
       'verified_purchase'],
      dtype='object')
Index(['product_main_category', 'product_title', 'average_rating',
       'rating_number', 'features', 'description', 'price', 'images', 'videos',
       'store', 'categories', 'details', 'parent_asin', 'bought_together',
       'subtitle', 'author'],
      dtype='object')


In [7]:
# Drop rows where review_text is empty
print("REVIEWS DATA")
print(f"Before dropping empty rows: {len(reviews)}")
reviews = reviews[reviews['review_text'].str.strip() != ""]
print(f"After dropping empty rows: {len(reviews)}")



REVIEWS DATA
Before dropping empty rows: 160000
After dropping empty rows: 159949


In [8]:
# Merge reviews with products based on parent_asin to get main_category and title
reviews_with_product_info = pd.merge(
    reviews,
    products[['parent_asin', 'product_main_category', 'product_title']],
    on='parent_asin',
    how='left'
)
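
A left merge silently keeps reviews that have no matching product, and it duplicates reviews if a `parent_asin` appears more than once in the metadata. The sketch below shows one way to surface both cases on toy data, using the `indicator` and `validate` parameters of `pd.merge`. The frames `reviews_demo` and `products_demo` are made up, and whether `parent_asin` is actually unique in the real metadata is an assumption that `validate` would check.

```python
import pandas as pd

# Toy frames standing in for the real reviews and products data
reviews_demo = pd.DataFrame({"parent_asin": ["A1", "A2", "A3"], "rating": [5, 4, 1]})
products_demo = pd.DataFrame({"parent_asin": ["A1", "A2"], "product_title": ["Case", "Charger"]})

merged = pd.merge(
    reviews_demo,
    products_demo,
    on="parent_asin",
    how="left",
    validate="many_to_one",  # raises MergeError if parent_asin repeats on the right side
    indicator=True,          # adds a _merge column: "both" or "left_only" for a left merge
)

# Reviews without a matching product are marked "left_only"
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Reviews without product info: {unmatched}")
```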

In [9]:
reviews_with_product_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159949 entries, 0 to 159948
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   rating                 159949 non-null  int64         
 1   review_title           159949 non-null  object        
 2   review_text            159949 non-null  object        
 3   images                 159949 non-null  object        
 4   asin                   159949 non-null  object        
 5   parent_asin            159949 non-null  object        
 6   user_id                159949 non-null  object        
 7   timestamp              159949 non-null  datetime64[ns]
 8   helpful_vote           159949 non-null  int64         
 9   verified_purchase      159949 non-null  bool          
 10  product_main_category  157485 non-null  object        
 11  product_title          159949 non-null  object        
dtypes: bool(1), datetime64[ns](1), int64(2), obj

In [10]:
reviews_with_product_info = reviews_with_product_info.drop(columns=["images", "asin", "parent_asin", "user_id", "timestamp", "helpful_vote", "verified_purchase"])
# Drop any rows where the merge failed (no matching product)
reviews_with_product_info = reviews_with_product_info.dropna(subset=['product_main_category', 'product_title'])
len(reviews_with_product_info)

157485

In [11]:
# Flag cells that are missing (NaN), empty strings, or zero
invalid_values = (reviews_with_product_info.isna() | (reviews_with_product_info == "") | (reviews_with_product_info == 0))

# Check if any column contains invalid values
invalid_summary = invalid_values.any()
invalid_summary

rating                   False
review_title             False
review_text              False
product_main_category    False
product_title             True
dtype: bool
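
The summary above says which columns contain invalid values but not how many, and the `== ""` comparison does not catch whitespace-only strings. A possible refinement is sketched below: it counts invalid cells per column and strips text before comparing. The helper `invalid_counts` and the toy frame are hypothetical.

```python
import pandas as pd


def invalid_counts(df):
    """Count missing, blank (after stripping), or zero cells in each column."""
    counts = {}
    for col in df.columns:
        s = df[col]
        invalid = s.isna()
        if pd.api.types.is_numeric_dtype(s):
            # Numeric columns: a value of 0 is treated as invalid
            invalid |= s == 0
        else:
            # Text columns: blank or whitespace-only strings are treated as invalid
            invalid |= s.astype(str).str.strip() == ""
        counts[col] = int(invalid.sum())
    return pd.Series(counts)


# Made-up data for illustration
demo = pd.DataFrame({
    "rating": [5, 0, 3],
    "review_text": ["ok", "  ", None],
    "product_title": ["a", "b", ""],
})
summary = invalid_counts(demo)
print(summary)
```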

In [12]:
# NOTE: adjust the filters below to match the columns flagged as True in invalid_summary

# # Debugging
# product_title_rows = reviews_with_product_info[reviews_with_product_info["product_title"].str.strip() == ""]
# product_title_rows.head(2)

# Drop rows where review_text or product_title is empty
print(f"Before dropping empty rows: {len(reviews_with_product_info)}")
reviews_with_product_info = reviews_with_product_info[reviews_with_product_info["review_text"].str.strip() != ""]
reviews_with_product_info = reviews_with_product_info[reviews_with_product_info["product_title"].str.strip() != ""]
print(f"After dropping empty rows: {len(reviews_with_product_info)}")


Before dropping empty rows: 157485
After dropping empty rows: 157478


Sanity check: confirm that no column contains invalid values after the cleanup. Every column should report `False` here.

In [13]:
# Flag cells that are missing (NaN), empty strings, or zero
invalid_values = (reviews_with_product_info.isna() | (reviews_with_product_info == "") | (reviews_with_product_info == 0))

# Check if any column contains invalid values
invalid_summary = invalid_values.any()
invalid_summary

rating                   False
review_title             False
review_text              False
product_main_category    False
product_title            False
dtype: bool

In [14]:
# Check if the index is continuous. If not, print the first index label that breaks the sequence; the index then needs to be reset.
counter = 0
for index, row in reviews_with_product_info.iterrows():
    if index != counter:
        print(index)
        break
    counter += 1

146


In [15]:
# Reset the index: rows have been dropped, so the index has gaps. It must be continuous
# because the row position decides which JSONL file each row is written to
reviews_with_product_info.reset_index(drop=True, inplace=True)


In [16]:
counter = 0
for index, row in reviews_with_product_info.iterrows():
    if index != counter:
        print(index)
        break
    counter += 1
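
The loop above prints nothing when the index is continuous. The same check can be written without iterating over rows; the following is a sketch using NumPy, where `has_contiguous_index` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def has_contiguous_index(df):
    """Return True if the index is exactly 0, 1, ..., len(df) - 1."""
    return bool(np.array_equal(df.index.to_numpy(), np.arange(len(df))))


# Made-up frame: dropping a row leaves a gap in the index
demo = pd.DataFrame({"x": [1, 2, 3, 4]})
filtered = demo[demo["x"] != 2]

before_reset = has_contiguous_index(filtered)
after_reset = has_contiguous_index(filtered.reset_index(drop=True))
print(before_reset, after_reset)
```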

## Writing the JSONL file

In [18]:
# Each row is routed to the train, test, or validation file based on its index
for index, row in reviews_with_product_info.iterrows():
    jsonl_data_format_input = {
        'System prompt': 'Given the Rating and Title, you are required to generate the review',
        'Rating': row['rating'],
        'Review Title': row['review_title'],
        'Review': row['review_text'],
        'Product Title': row['product_title'],
        'Product Categories': row['product_main_category'],
    }
    
    if index < TRAIN_COUNT:
        PATH = TRAIN_DATA_PATH 
    elif index < TRAIN_COUNT + TEST_COUNT:
        PATH = TEST_DATA_PATH
    else:
        PATH = VALIDATION_DATA_PATH
    
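    # Open the file in append mode and write the dictionary as one JSON line.
    # Note: re-running this cell appends duplicate rows unless the output files are deleted first.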
    with open(PATH, "a") as f:
        json.dump(jsonl_data_format_input, f)
        f.write("\n")

print(f"Train rows: {TRAIN_COUNT}")
print(f"Test rows: {TEST_COUNT}")
print(f"Validation rows: {len(reviews_with_product_info) - TRAIN_COUNT - TEST_COUNT}")


Train rows: 100000
Test rows: 10000
Validation rows: 47478
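
The cell above reopens an output file for every row and relies on append mode. A possible alternative, sketched below, opens each file once in write mode so that a re-run overwrites rather than duplicates, and routes records by position in the same way. The function `write_splits` is hypothetical and takes any iterable of dictionaries.

```python
import json


def write_splits(records, train_count, test_count, train_path, test_path, validation_path):
    """Write records to three JSONL files by position; return the rows written per split."""
    counts = {"train": 0, "test": 0, "validation": 0}
    with open(train_path, "w", encoding="utf-8") as train_f, \
         open(test_path, "w", encoding="utf-8") as test_f, \
         open(validation_path, "w", encoding="utf-8") as validation_f:
        for i, record in enumerate(records):
            if i < train_count:
                name, f = "train", train_f
            elif i < train_count + test_count:
                name, f = "test", test_f
            else:
                name, f = "validation", validation_f
            # One JSON object per line
            f.write(json.dumps(record) + "\n")
            counts[name] += 1
    return counts
```

With the notebook's DataFrame, the records could be produced by building one dictionary per row and passing the resulting list together with `TRAIN_COUNT`, `TEST_COUNT`, and the three path constants.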


## Check the created files


In [19]:

files = [TRAIN_DATA_PATH, TEST_DATA_PATH, VALIDATION_DATA_PATH]

for file in files:
    # Count the lines (one JSON record per line) in each output file
    with open(file, "r") as f:
        line_count = sum(1 for _ in f)
    print(f"Length of {file} : {line_count}")

Length of final-data/train.jsonl : 100000
Length of final-data/test.jsonl : 10000
Length of final-data/validation.jsonl : 47478
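
Counting lines confirms the split sizes but not that each line is well-formed. As an optional extra check, the sketch below parses every line and verifies that the expected keys are present. `check_jsonl` is a hypothetical helper; the key names are the ones written by this notebook.

```python
import json

# Keys written by the JSONL writer in this notebook
EXPECTED_KEYS = {
    "System prompt",
    "Rating",
    "Review Title",
    "Review",
    "Product Title",
    "Product Categories",
}


def check_jsonl(path, expected_keys=EXPECTED_KEYS):
    """Parse every line of a JSONL file and return the number of valid records."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for count, line in enumerate(f, start=1):
            record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
            missing = expected_keys - record.keys()
            if missing:
                raise ValueError(f"{path}, line {count}: missing keys {sorted(missing)}")
    return count


# Usage (paths as defined in the Constants cell):
# for file in [TRAIN_DATA_PATH, TEST_DATA_PATH, VALIDATION_DATA_PATH]:
#     print(file, check_jsonl(file))
```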
