# Purpose

This notebook is a subset of the one provided by the evaluators. It downloads the data and stores it as Parquet files so that later work is easier.

## Load data

In [15]:
from datasets import load_dataset
from pathlib import Path

import pandas as pd

In [1]:
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Baby_Products", trust_remote_code=True, split="full[50%:52%]")

Generating full split: 6028884 examples [00:41, 145547.94 examples/s]


In [2]:
# Collect the unique product IDs that appear in the sampled reviews
parent_asin = {review_data['parent_asin'] for review_data in dataset}

In [3]:
meta_baby_products = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Baby_Products", split="full", trust_remote_code=True)

Generating full split: 217724 examples [00:22, 9535.23 examples/s] 


In [5]:
meta_baby_products_with_reviews = [product for product in meta_baby_products if product['parent_asin'] in parent_asin]

In [6]:
len(meta_baby_products_with_reviews)

36825

------

# Task and data

We are offered the opportunity to choose from the following tasks:

1. **Categorization of Unlabeled Products**: There are 1,257 products that lack assigned categories, corresponding to the `categories` field in the raw meta dataset.  
   - While this option holds potential business value, the provided model is a decoder-only transformer and therefore not an efficient choice for classification.

2. **Enhancement of Product Taxonomy**: We could utilize the LLM to generate alternative categorizations.  
   - From a business perspective, this option appears less appealing, even though the model is suitable for text generation tasks.

3. **Sentiment Analysis and Product Reporting**: Using all products with more than 50 reviews in the dataset, we can generate a report for each product that, beyond review scores, reflects sentiment by highlighting the following information: safety issues, most and least appreciated features, price sentiment, etc.  
   - I find this option particularly interesting for two main reasons:
     1. **Business Value**: Providing feedback to retailers could be extremely beneficial. If the service is effective, they would likely pay for it, similar to the "insights" services currently offered by Amazon for advertisements. Ultimately, retailers aim to increase sales by enhancing visibility on their pages and by improving their products to meet customer preferences.
     2. **Model Suitability**: For this specific task, we have access to the Mistral 7B model, a decoder-only transformer well suited to text generation and understanding. The first task, by contrast, would call for encoder-based models, such as DistilBERT for fast text classification or FLAVA, both of which we would need to fine-tune.
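
As a rough illustration of how the count mentioned in task 1 could be obtained, the sketch below counts records whose `categories` field is missing or empty. It assumes each product is a dict-like record and that `categories` holds a list of strings. The helper `lacks_categories` and the toy records are hypothetical and not part of the original notebook.

```python
def lacks_categories(product):
    """Return True when a product record has no usable `categories` value.

    Assumption: `categories` is a list of strings (possibly empty or missing).
    """
    return not product.get("categories")


# Toy records standing in for rows of the raw meta dataset (hypothetical data)
toy_products = [
    {"parent_asin": "A1", "categories": ["Baby Products", "Feeding"]},
    {"parent_asin": "A2", "categories": []},
    {"parent_asin": "A3", "categories": None},
    {"parent_asin": "A4"},
]

# Keep the IDs of the products that have no category assigned
unlabeled = [p["parent_asin"] for p in toy_products if lacks_categories(p)]
print(len(unlabeled), unlabeled)
```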

### Transform the datasets into Pandas dataframes

In [25]:
df_products = pd.DataFrame(meta_baby_products_with_reviews)
df_reviews = pd.DataFrame(dataset)

### Filter the relevant columns

I have selected the columns that seem most relevant to the task at hand. With more time, further research might lead us to use additional columns.



In [28]:
df_reviews_filtered = df_reviews[
    [
        "parent_asin", # ID
        "title",
        "text",
        "verified_purchase"
    ]
]

df_products_filtered = df_products[
    [
        "parent_asin", # ID
        "main_category",
        "title",
        "average_rating",
        "rating_number", # number of ratings the product has received
        "details", # product-specific details
        "features", # list of product features, used like a taxonomy
    ]
]

### Products with > 50 reviews

In [37]:
# Step 1: Count the number of reviews for each parent_asin
review_counts = df_reviews_filtered['parent_asin'].value_counts()

# Step 2: Filter parent_asin with more than 50 reviews
valid_asins = review_counts[review_counts > 50].index

# Step 3: Filter df_products_filtered to keep only those products
df_products_filtered_50_reviews = df_products_filtered[df_products_filtered['parent_asin'].isin(valid_asins)]

In [38]:
print(df_products_filtered.shape)
print(df_products_filtered_50_reviews.shape)

(36825, 7)
(192, 7)


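Filtering on review count reduces the product set from 36,825 to 192. Note that the counts come from the 2% slice of reviews loaded above, not from each product's full review history.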
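
One detail worth noting is that `review_counts > 50` is a strict comparison, so a product with exactly 50 reviews is dropped. A minimal sketch on made-up data (the ASINs `P1` to `P3` are hypothetical) illustrates the boundary:

```python
import pandas as pd

# Hypothetical review table: P1 has 51 reviews, P2 exactly 50, P3 only 3
toy_reviews = pd.DataFrame(
    {"parent_asin": ["P1"] * 51 + ["P2"] * 50 + ["P3"] * 3}
)

review_counts = toy_reviews["parent_asin"].value_counts()

# Strict threshold, as in the cell above: exactly 50 reviews is not enough
strict = sorted(review_counts[review_counts > 50].index)
# Inclusive variant, shown for comparison only
inclusive = sorted(review_counts[review_counts >= 50].index)

print(strict)
print(inclusive)
```

If the intent were "at least 50 reviews", the inclusive variant would be the one to use.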

### Get only reviews of the selected products

In [39]:
# Step 1: Get the parent_asin values from df_products_filtered_50_reviews
valid_asins = df_products_filtered_50_reviews['parent_asin'].unique()

# Step 2: Filter df_reviews_filtered to keep only reviews from those products
df_reviews_filtered_50_reviews = df_reviews_filtered[df_reviews_filtered['parent_asin'].isin(valid_asins)]

In [40]:
print(df_reviews_filtered.shape)
print(df_reviews_filtered_50_reviews.shape)

(120578, 4)
(20259, 4)


Keeping only the reviews of products with more than 50 reviews reduces the number of reviews to consider from 120,578 to 20,259.
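
A quick consistency check can catch mistakes in the two filtering steps. The sketch below uses a hypothetical helper, `check_filter_consistency`, and toy data. It verifies that every kept review belongs to a kept product and that each kept product has more than the threshold number of reviews. It assumes both frames share the `parent_asin` column, as they do in this notebook.

```python
import pandas as pd


def check_filter_consistency(df_products_sel, df_reviews_sel, min_reviews=50):
    """Raise AssertionError if the filtered frames disagree with each other."""
    product_ids = set(df_products_sel["parent_asin"])
    counts = df_reviews_sel["parent_asin"].value_counts()

    # Every kept review must point at a kept product
    assert set(counts.index) <= product_ids, "reviews reference unknown products"
    # Every kept product must exceed the review threshold
    for asin in product_ids:
        assert counts.get(asin, 0) > min_reviews, f"{asin} has too few reviews"
    return True


# Toy data (hypothetical): threshold lowered to 2 to keep the example small
toy_products = pd.DataFrame({"parent_asin": ["P1", "P2"]})
toy_reviews = pd.DataFrame({"parent_asin": ["P1"] * 3 + ["P2"] * 4})

print(check_filter_consistency(toy_products, toy_reviews, min_reviews=2))
```

On the real data, the call would be `check_filter_consistency(df_products_filtered_50_reviews, df_reviews_filtered_50_reviews)`.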

### Save data

In [42]:
def save_dataframe_to_parquet(df, file_name, directory):
    parquet_file_path = directory / f"{file_name}.parquet"
    df.to_parquet(parquet_file_path)
    print(f"DataFrame '{file_name}' saved to {parquet_file_path}")
 
# Create the directory if it does not exist   
data_dir = Path('data')
data_dir.mkdir(parents=True, exist_ok=True)

# Save each DataFrame individually
save_dataframe_to_parquet(df_products_filtered, 'df_products_filtered', data_dir)
save_dataframe_to_parquet(df_products_filtered_50_reviews, 'df_products_filtered_50_reviews', data_dir)
save_dataframe_to_parquet(df_reviews_filtered, 'df_reviews_filtered', data_dir)
save_dataframe_to_parquet(df_reviews_filtered_50_reviews, 'df_reviews_filtered_50_reviews', data_dir)

DataFrame 'df_products_filtered' saved to data/df_products_filtered.parquet
DataFrame 'df_products_filtered_50_reviews' saved to data/df_products_filtered_50_reviews.parquet
DataFrame 'df_reviews_filtered' saved to data/df_reviews_filtered.parquet
DataFrame 'df_reviews_filtered_50_reviews' saved to data/df_reviews_filtered_50_reviews.parquet
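
To pick the data up again in a later notebook, the saved files can be read back with `pd.read_parquet`. The sketch below is one possible way to do that. The helpers `parquet_path` and `load_saved_dataframes` are hypothetical names, and the sketch assumes the files were written with the naming scheme used in the save cell above.

```python
from pathlib import Path

import pandas as pd


def parquet_path(directory, file_name):
    """Build the path used by the save cell: <directory>/<file_name>.parquet."""
    return Path(directory) / f"{file_name}.parquet"


def load_saved_dataframes(directory, file_names):
    """Load each saved DataFrame into a dict keyed by file name."""
    frames = {}
    for name in file_names:
        path = parquet_path(directory, name)
        if not path.exists():
            raise FileNotFoundError(f"Expected Parquet file not found: {path}")
        # Requires a Parquet engine such as pyarrow or fastparquet
        frames[name] = pd.read_parquet(path)
    return frames
```

For example, `load_saved_dataframes('data', ['df_products_filtered_50_reviews', 'df_reviews_filtered_50_reviews'])` would return the two filtered frames, provided the save cell has been run first.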
