To choose the **right dataset** for your task, let’s **align the dataset options with your objectives**:

### Task Summary

You need to build a model that:

* **Predicts product category** given a product's **name and brand**.
* The dataset should ideally include:

  * `product_name` (or title)
  * `brand`
  * `category` (label)

We’re expected to focus on **data analysis, modeling, training & evaluation**, so having **clean and structured data with appropriate labels** is crucial.

## Dataset Analysis


Let's import HuggingFace, Kaggle and Pandas utilities:

In [None]:
! pip install datasets kagglehub pandas

In [1]:
from datasets import load_dataset
import kagglehub

import pandas as pd




### 1. [Amazon-Reviews-2023 (HuggingFace / McAuley Lab)](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023)

In [35]:
amazon_review_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_All_Beauty", split="full", trust_remote_code=True) # In this POC we analyze only "beauty" products. Every dataset contains the same features

In [16]:
amazon_review_df = amazon_review_dataset.to_pandas()
amazon_review_df.columns

Index(['main_category', 'title', 'average_rating', 'rating_number', 'features',
       'description', 'price', 'images', 'videos', 'store', 'categories',
       'details', 'parent_asin', 'bought_together', 'subtitle', 'author'],
      dtype='object')

In [17]:
amazon_review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112590 entries, 0 to 112589
Data columns (total 16 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   main_category    112590 non-null  object 
 1   title            112590 non-null  object 
 2   average_rating   112590 non-null  float64
 3   rating_number    112590 non-null  int64  
 4   features         112590 non-null  object 
 5   description      112590 non-null  object 
 6   price            112590 non-null  object 
 7   images           112590 non-null  object 
 8   videos           112590 non-null  object 
 9   store            101259 non-null  object 
 10  categories       112590 non-null  object 
 11  details          112590 non-null  object 
 12  parent_asin      112590 non-null  object 
 13  bought_together  0 non-null       object 
 14  subtitle         0 non-null       object 
 15  author           0 non-null       object 
dtypes: float64(1), int64(1), object(14)
me

The dataset contains 112,590 rows and 16 columns, providing a substantial amount of product data for analysis and modeling.

In [18]:
amazon_review_df['categories'].isna().mean().round(3)

np.float64(0.0)

What this tells us:
- No record has a null `categories` value. This check does not catch empty lists, so it is a necessary condition for a usable supervision signal rather than proof of one.
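A null check does not distinguish an empty list from a populated one. The sketch below runs on a toy frame that stands in for `amazon_review_df` (the rows are invented for illustration) and measures the share of rows whose `categories` value is empty, assuming the column holds list-like values.

```python
import pandas as pd

# Toy stand-in for amazon_review_df; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {"categories": [["All Beauty"], [], ["Hair Care", "Shampoo"], []]}
)

# len() works for lists and NumPy arrays alike, so this counts empty containers.
empty_share = toy_df["categories"].map(len).eq(0).mean()
print(f"Share of rows with an empty categories list: {empty_share:.1%}")
```

Running the same expression on the real frame would show whether the 0.0 null rate also means every row carries a usable label.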

In [19]:
amazon_review_df['title'].str.len().describe().astype(int)

count    112590
mean        113
std          53
min           0
25%          67
50%         114
75%         158
max        1455
Name: title, dtype: int64

Observations:

- Titles run from 0 chars up to 1,455 chars (some appear to be full descriptions).

- Median length ~114 chars is reasonable, but we’ll need a cleaning pipeline (e.g. truncation, deduping, stopword removal).
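A minimal sketch of such a cleaning pipeline, covering whitespace normalization, truncation, and deduplication (stopword removal is left out). The function names and the 200-character cutoff are assumptions chosen for illustration, not values taken from the analysis above.

```python
import re

MAX_TITLE_CHARS = 200  # assumed cutoff, a little above the 75th percentile (158)

def clean_title(title: str, max_chars: int = MAX_TITLE_CHARS) -> str:
    """Collapse whitespace, lowercase, and truncate at a word boundary."""
    text = re.sub(r"\s+", " ", title).strip().lower()
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so no word is split in half.
    cut = text.rfind(" ", 0, max_chars + 1)
    return text[:cut] if cut > 0 else text[:max_chars]

def clean_titles(titles):
    """Clean each title, then drop empties and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for raw in titles:
        cleaned = clean_title(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

print(clean_titles(["  Shea   Butter Soap ", "shea butter soap", "", "Argan Oil"]))
```

On a DataFrame, `clean_title` could be applied with `df['title'].map(clean_title)`, followed by `drop_duplicates` on the cleaned column.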

Problem: There is no dedicated brand column in the main schema.

Next step: let’s inspect the `details` field (a JSON-encoded dict of product attributes) to see if brand is buried there.

If one of its keys is “Brand” or “Manufacturer”, we could extract it, but that would require:

- Parsing dozens of different key-names across categories
- Normalizing brand strings (case, spelling variants, noise)
- Handling missing or malformed entries

In [20]:
sample_details = amazon_review_df['details'].dropna().iloc[0]
sample_details

'{"Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds", "UPC": "617390882781"}'

In this sample:
- No brand or manufacturer key is present, only logistics metadata. One row is not conclusive, but it suggests brand coverage in `details` is unreliable.
- We’d have to fall back on brittle heuristics (e.g. regexes over `title`) to guess brands, which would be noisy and time-consuming.
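A single row cannot tell us how often a brand key appears. The sketch below surveys key names across many `details` values and counts the rows that carry a brand-like key. The helper names and the `BRAND_KEYS` set are hypothetical, and the second sample string is invented for illustration.

```python
import json
from collections import Counter

BRAND_KEYS = {"brand", "manufacturer"}  # assumed key names, compared case-insensitively

def parse_details(raw):
    """Return the details dict, or an empty dict if the value is missing or malformed."""
    if isinstance(raw, dict):
        return raw
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {}
    return parsed if isinstance(parsed, dict) else {}

def survey_details(rows):
    """Count how often each key appears and how many rows carry a brand-like key."""
    key_counts = Counter()
    rows_with_brand = 0
    for raw in rows:
        details = parse_details(raw)
        key_counts.update(details.keys())
        if any(key.strip().lower() in BRAND_KEYS for key in details):
            rows_with_brand += 1
    return key_counts, rows_with_brand

samples = [
    '{"Package Dimensions": "7.1 x 5.5 x 3 inches; 2.38 Pounds", "UPC": "617390882781"}',
    '{"Brand": "ExampleCo", "UPC": "000000000000"}',  # invented row
    "not json",
    None,
]
key_counts, rows_with_brand = survey_details(samples)
print(key_counts.most_common(3), rows_with_brand)
```

Passing `amazon_review_df['details']` to `survey_details` would give the actual brand coverage for the split.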

Bottom line for “raw_meta_All_Beauty” (and its sister splits)

- Pros
    - 100% of items have a non-null `categories` value.
    - Product titles exist (albeit noisy/variable length).

- Cons
    - No explicit brand—the critical second input we need.
    - Extracting brands reliably would eat up most of our 4-day window and introduce noise.

Check null values:

In [37]:
amazon_review_df[['title','main_category']].isna().mean().round(3)

title            0.0
main_category    0.0
dtype: float64

Summary table:

| **Criterion**                  | **Status**            | **Comment**                                                                          |
| ------------------------------ | --------------------- | ------------------------------------------------------------------------------------ |
| Dataset Size                   | ✅🟢 Very Large        | 112,590 products in the `All_Beauty` split alone; the full dataset spans many more category splits |
| Key Fields Present             | ✅🟢 title, categories | 100 % of rows have `categories`; `title` always present                              |
| Explicit Brand Column          | ❌🔴 Absent            | No direct `brand` field in metadata — must extract from `details` or `title`         |
| Brand Extraction Effort        | ❌🔴 High              | `details` dict rarely contains brand; title parsing/NLP needed — very time-consuming |
| Category Depth                 | ✅🟢 Hierarchical      | Multi-level category paths — rich but requires flattening                            |
| Title Quality & Noise          | ⚠️ Variable           | Titles range 0–1 455 chars; includes long descriptions — cleaning required           |
| Additional Metadata            | ✅🟢 Present           | Ratings, price, images, etc. — but heavy review text payload                         |
| Suitability for Category Model | ✅🟢 Strong            | Clean category labels, large volume                                                  |
| Suitability for Brand Model    | ❌🔴 Very Weak         | Brand extraction from unstructured text is high-risk under a 4-day window            |



### 2. [Kaggle - Amazon Products Dataset](https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset/data?select=Amazon-Products.csv)

In [2]:
path_to_amazon_product = kagglehub.dataset_download("lokeshparab/amazon-products-dataset")
path_to_amazon_product

'C:\\Users\\franc\\.cache\\kagglehub\\datasets\\lokeshparab\\amazon-products-dataset\\versions\\2'

In [3]:
amazon_product_df = pd.read_csv(f"{path_to_amazon_product}/Amazon-Products.csv")

In [4]:
amazon_product_df.columns

Index(['Unnamed: 0', 'name', 'main_category', 'sub_category', 'image', 'link',
       'ratings', 'no_of_ratings', 'discount_price', 'actual_price'],
      dtype='object')

In [5]:
amazon_product_df[['name','main_category','sub_category']]

Unnamed: 0,name,main_category,sub_category
0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners
1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners
2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners
3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners
4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners
...,...,...,...
551580,Adidas Regular Fit Men's Track Tops,sports & fitness,Yoga
551581,Redwolf Noice Toit Smort - Hoodie (Black),sports & fitness,Yoga
551582,Redwolf Schrute Farms B&B - Hoodie (Navy Blue),sports & fitness,Yoga
551583,Puma Men Shorts,sports & fitness,Yoga


The `name` column contains structured titles like:
- "LG 1.5 Ton 5 Star AI DUAL Inverter Split AC ..."
- "Redwolf Schrute Farms B&B - Hoodie (Navy Blue)"
- "Mothercare Printed Cotton Elastane Girls Infant ..."

We can heuristically extract potential brands from the first word(s). Let’s test this:

In [26]:
amazon_product_df['brand_candidate'] = amazon_product_df['name'].str.extract(r'^([\w&-]+)', expand=False)
amazon_product_df['brand_candidate'].value_counts().head(30)

brand_candidate
PC         6406
Puma       4971
Shopnet    4259
Men        4131
U          3981
Amazon     3659
Nike       3216
The        3169
Avsar      2907
Van        2877
NEUTRON    2852
Red        2502
Campus     2450
Jockey     2381
Pepe       2139
Adidas     2118
Arrow      2107
Peter      2092
Jack       1947
Women      1925
Levi       1858
AONES      1749
Clovia     1723
Max        1663
BATA       1570
Baggit     1507
Lee        1499
Yellow     1484
Spykar     1475
Zeya       1473
Name: count, dtype: int64

- Pros:
    - This works for single-token brands (e.g. Puma, Nike, Adidas, Jockey in the output above).

- Cons:
    - Fails when product names start with a generic word or a spec instead of a brand (e.g., "Wireless Bluetooth..." or "32L Convection Oven..."); `Men`, `Women`, and `The` in the output above are examples.
    - Truncates multi-word brands to their first token (e.g. "Philips Avent" becomes "Philips"); `Van`, `Peter`, and `Jack` in the output above look like such cases.
    - Brand is a required input for our model, and this dataset lacks an explicit brand column, which could slow us down with manual or model-based extraction (NER, rules, or hybrid).
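One way to reduce these failure modes is to check a list of known multi-word brands first and to reject generic first tokens. The sketch below illustrates the idea; both lookup tables are hypothetical and would need to be curated from the data, and `extract_brand` is not part of the original notebook.

```python
import re

# Hypothetical lookup tables for illustration; real ones would be curated from the data.
KNOWN_MULTIWORD_BRANDS = ["Philips Avent", "Van Heusen", "Peter England"]
GENERIC_FIRST_TOKENS = {"men", "women", "the", "wireless"}

def extract_brand(name: str):
    """Return a brand guess from a product name, or None when the guess is unreliable."""
    name = name.strip()
    lowered = name.lower()
    # Prefer the longest known multi-word brand that prefixes the name.
    for brand in sorted(KNOWN_MULTIWORD_BRANDS, key=len, reverse=True):
        if lowered == brand.lower() or lowered.startswith(brand.lower() + " "):
            return brand
    match = re.match(r"[\w&-]+", name)
    if match is None:
        return None
    token = match.group(0)
    # Reject generic words and tokens that start with a digit (e.g. "32L").
    if token.lower() in GENERIC_FIRST_TOKENS or token[0].isdigit():
        return None
    return token

for example in ["Philips Avent Natural Baby Bottle", "Puma Men Shorts", "32L Convection Oven"]:
    print(example, "->", extract_brand(example))
```

Returning `None` instead of a bad guess keeps unreliable rows identifiable, so they can be dropped or routed to a stronger extractor later.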

Check null values:

In [27]:
amazon_product_df[['name','main_category','sub_category']].isna().mean().round(3)

name             0.0
main_category    0.0
sub_category     0.0
dtype: float64

Count the unique values in the main columns:

In [28]:
{
    'n_names':           amazon_product_df['name'].nunique(),
    'n_main_categories': amazon_product_df['main_category'].nunique(),
    'n_sub_categories':  amazon_product_df['sub_category'].nunique()
}

{'n_names': 396210, 'n_main_categories': 20, 'n_sub_categories': 112}

We have a strong category hierarchy:
- main_category: 20 values (e.g., "appliances", "sports & fitness", "electronics")
- sub_category: 112 values

This is very useful for our goal.
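Unique counts alone do not show whether the two levels form a strict tree. A quick check, sketched on a toy frame that stands in for `amazon_product_df` (the rows are invented for illustration), is to count how many distinct `main_category` values each `sub_category` maps to.

```python
import pandas as pd

# Toy stand-in for amazon_product_df; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "main_category": ["appliances", "appliances", "sports & fitness", "men's clothing"],
        "sub_category": ["Air Conditioners", "Refrigerators", "Yoga", "Yoga"],
    }
)

# Number of distinct parents per sub-category; 1 everywhere means a strict tree.
parents_per_sub = toy_df.groupby("sub_category")["main_category"].nunique()
ambiguous_subs = parents_per_sub[parents_per_sub > 1].index.tolist()
print(ambiguous_subs)
```

If the list is empty on the real data, `sub_category` alone can serve as the label and `main_category` can be derived from it.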

Summary table:

| **Criterion**                  | **Status**                              | **Comment**                                                              |
| ------------------------------ | --------------------------------------- | ------------------------------------------------------------------------ |
| Dataset Size                   | ✅🟢 Large                               | \~551 k rows — plenty of examples for modeling                           |
| Key Fields Present             | ✅🟢 name, main\_category, sub\_category | No missing values in these columns                                       |
| Explicit Brand Column          | ❌🔴 Missing                             | Must infer from `name` (e.g. first token) — error-prone                  |
| Brand Extraction Effort        | ⚠️ Heuristic/NLP                        | Will require regex or NER; introduces noise and eats into 4-day timeline |
| Category Hierarchy             | ✅🟢 Rich                                | 20 main + 112 sub categories — good granularity                          |
| Title Quality & Diversity      | ✅🟢 High                                | \~396 k unique titles; fairly structured (“Brand + Specs”)               |
| Additional Metadata            | ✅🟢 Present                             | Ratings, prices, links — potentially useful for future features          |
| Suitability for Category Model | ✅🟢 Strong                              | Excellent for category classification                                    |
| Suitability for Brand Model    | ❌🔴 Weak                                | No native `brand` field—preprocessing required                           |


### 3. [OpenFoodFacts – Product Database (HuggingFace)](https://huggingface.co/datasets/openfoodfacts/product-database)

Load the OpenFoodFacts product database from HuggingFace.

In [None]:
open_food_dataset = load_dataset("openfoodfacts/product-database")

Convert the first 500,000 food products to a pandas DataFrame for analysis.

In [4]:
open_food_df = open_food_dataset['food'].select(range(500_000)).to_pandas()

List all available columns in the DataFrame.

In [36]:
list(open_food_df.columns)

['additives_n',
 'additives_tags',
 'allergens_tags',
 'brands_tags',
 'brands',
 'categories',
 'categories_tags',
 'checkers_tags',
 'ciqual_food_name_tags',
 'cities_tags',
 'code',
 'compared_to_category',
 'complete',
 'completeness',
 'correctors_tags',
 'countries_tags',
 'created_t',
 'creator',
 'data_quality_errors_tags',
 'data_quality_info_tags',
 'data_sources_tags',
 'ecoscore_data',
 'ecoscore_grade',
 'ecoscore_score',
 'ecoscore_tags',
 'editors',
 'emb_codes_tags',
 'emb_codes',
 'entry_dates_tags',
 'food_groups_tags',
 'generic_name',
 'images',
 'informers_tags',
 'ingredients_analysis_tags',
 'ingredients_from_palm_oil_n',
 'ingredients_n',
 'ingredients_original_tags',
 'ingredients_percent_analysis',
 'ingredients_tags',
 'ingredients_text',
 'ingredients_with_specified_percent_n',
 'ingredients_with_unspecified_percent_n',
 'ingredients_without_ciqual_codes_n',
 'ingredients_without_ciqual_codes',
 'ingredients',
 'known_ingredients_n',
 'labels_tags',
 'labels

Preview key columns: product name, brands, categories, and category tags.

In [None]:
open_food_df[['product_name','brands', 'categories', 'categories_tags']].head()

Unnamed: 0,product_name,brands,categories,categories_tags
0,"[{'lang': 'main', 'text': 'Véritable pâte à ta...",Bovetti,"Petit-déjeuners,Produits à tartiner,Produits à...","[en:breakfasts, en:spreads, en:sweet-spreads, ..."
1,"[{'lang': 'main', 'text': 'Chamomile Herbal Te...",Lagg's,,[en:null]
2,"[{'lang': 'main', 'text': 'Lagg's, herbal tea,...",Lagg's,"Plant-based foods and beverages, Beverages, Ho...","[en:plant-based-foods-and-beverages, en:bevera..."
3,"[{'lang': 'main', 'text': 'Linden Flowers Tea'...",Lagg's,,[en:null]
4,"[{'lang': 'main', 'text': 'Herbal Tea, Hibiscu...",Lagg's,,


Let's look at a well-known product (Nutella) in the OpenFoodFacts dataset to better understand the available fields and data quality.

In [60]:
open_food_df['has_nutella'] = open_food_df['product_name'].astype(str).str.contains('Nutella', case=False, na=False)

In [62]:
open_food_df[open_food_df['has_nutella']]

Unnamed: 0,additives_n,additives_tags,allergens_tags,brands_tags,brands,categories,categories_tags,checkers_tags,ciqual_food_name_tags,cities_tags,...,unique_scans_n,unknown_ingredients_n,unknown_nutrients_tags,vitamins_tags,with_non_nutritive_sweeteners,with_sweeteners,product_name_flat,categories_norm,has_pate,has_nutella
942,0.0,[],"[en:gluten, en:milk, en:nuts, en:soybeans]",[ferrero-u-s-a-incorporated],Ferrero U.S.A. Incorporated,"Snacks, Sweet snacks, Biscuits and cakes, Bisc...","[en:snacks, en:sweet-snacks, en:biscuits-and-c...",[],[unknown],,...,1.0,2.0,[],[],,,,"Snacks, Sweet snacks, Biscuits and cakes, Bisc...",False,True
947,0.0,[],"[en:gluten, en:milk, en:nuts, en:soybeans]","[xx:ferrerro, xx:nutella]","Nutella,Ferrerro","Plant-based foods and beverages,Plant-based fo...","[en:plant-based-foods-and-beverages, en:plant-...",[],[unknown],[],...,1.0,3.0,[],[],,,,"Plant-based foods and beverages,Plant-based fo...",False,True
950,0.0,[],"[en:milk, en:nuts, en:soybeans]","[xx:ferrero, xx:ferrero-u-s-a-incorporated]","Ferrero,Ferrero U.S.A. Incorporated","Plant-based foods and beverages,Plant-based fo...","[en:plant-based-foods-and-beverages, en:plant-...",[],[unknown],[],...,4.0,1.0,[],[],,,,"Plant-based foods and beverages,Plant-based fo...",False,True
951,0.0,[],"[en:milk, en:nuts, en:soybeans]","[nutella, ferrero]","Nutella,Ferrero","Plant-based foods and beverages, Plant-based f...","[en:plant-based-foods-and-beverages, en:plant-...",[],[chocolate-spread-with-hazelnuts],[],...,1.0,1.0,[],[],,,,"Plant-based foods and beverages, Plant-based f...",False,True
952,0.0,[],"[en:milk, en:nuts, en:soybeans]","[ferrero, ferrero-u-s-a-incorporated]","Ferrero, Ferrero U.S.A. Incorporated","Plant-based foods and beverages, Plant-based f...","[en:plant-based-foods-and-beverages, en:plant-...",[],[unknown],,...,11.0,1.0,[],[],,,,"Plant-based foods and beverages, Plant-based f...",False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437030,1.0,[en:e322],"[en:milk, en:nuts, en:soybeans, de:fruits-à-co...",[xx:ferrero],Ferrero,"de:haselnusscremes, de:brotaufstriche, en:fruh...","[en:breakfasts, en:spreads, en:sweet-spreads, ...",[beniben],[chocolate-spread-with-hazelnuts],[],...,326.0,5.0,[],[],,,,"de:haselnusscremes, de:brotaufstriche, en:fruh...",False,True
437031,1.0,[en:e322],"[en:milk, en:nuts, en:soybeans]","[xx:ferrero, xx:nutella]","Ferrero, Nutella","de:nougatcremes, de:brotaufstriche, en:fruhstu...","[en:breakfasts, en:spreads, en:sweet-spreads, ...",[],[unknown],[],...,176.0,0.0,[],[],,,,"de:nougatcremes, de:brotaufstriche, en:fruhstu...",False,True
461544,,,[],,,"Snacks, Snacks sucrés, Biscuits et gâteaux, Pâ...","[en:snacks, en:sweet-snacks, en:biscuits-and-c...",[],[unknown],,...,1.0,,[],[],,,,"Snacks, Snacks sucrés, Biscuits et gâteaux, Pâ...",False,True
468220,1.0,[en:e322],"[en:milk, en:nuts, en:soybeans]","[ferrero, nutella]","Ferrero,Nutella","Breakfasts, Spreads, Sweet spreads, fr:Pâtes à...","[en:breakfasts, en:spreads, en:sweet-spreads, ...",[],[chocolate-spread-with-hazelnuts],[],...,91.0,0.0,[],[],,,,"Breakfasts, Spreads, Sweet spreads, fr:Pâtes à...",False,True


The dataset contains nested and complex structures that may require special handling.

Considerations:
- Brands live in a free-text `brands` column, often as comma-separated lists with inconsistent formatting (“Ferrero, Ferrero U.S.A. Incorporated”) and spelling variants (“Ferrerro”), requiring multi-step cleaning and normalization.
- `product_name` is a nested list of dicts (one per language), stored as strings, which must first be parsed (e.g. with `ast.literal_eval`) and flattened.
- The raw table has 109 columns. Even after flattening names, we still have to sift through lots of nutrition, packaging, and language-tag fields we won’t use.
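The parsing and normalization steps above can be sketched as follows. Both helper names are hypothetical, and the sketch assumes each `product_name` entry is a dict with `lang` and `text` keys, as in the preview. It does not attempt to fix spelling variants such as “Ferrerro”.

```python
import ast

def flatten_product_name(raw, preferred_lang="main"):
    """Return one product-name string from a list of {'lang', 'text'} dicts."""
    if isinstance(raw, str):
        try:
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return raw.strip()  # not a serialized list, treat as a plain name
        if not isinstance(parsed, (list, tuple)):
            return raw.strip()
        raw = parsed
    if raw is None or isinstance(raw, float):  # None or NaN
        return ""
    entries = [e for e in raw if isinstance(e, dict) and e.get("text")]
    for entry in entries:
        if entry.get("lang") == preferred_lang:
            return entry["text"].strip()
    # Fall back to the first entry with text when the preferred language is absent.
    return entries[0]["text"].strip() if entries else ""

def normalize_brands(raw):
    """Split a comma-separated brands string into lowercase, de-duplicated names."""
    if not isinstance(raw, str):
        return []
    brands = []
    for part in raw.split(","):
        brand = " ".join(part.split()).lower()
        if brand and brand not in brands:
            brands.append(brand)
    return brands

print(flatten_product_name("[{'lang': 'main', 'text': 'Linden Flowers Tea'}]"))
print(normalize_brands("Ferrero, Ferrero U.S.A. Incorporated"))
```

Keeping every brand in a list, rather than picking the first, defers the choice of a canonical brand to a later step where frequency counts can inform it.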


Summary table:

| **Criterion**                  | **Status**                            | **Comment**                                                                                                                                 |
| ------------------------------ | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Dataset Size                   | ✅🟢 Large                             | \~500 k sample rows (3.9 M total) — enough scale for robust modeling                                                                        |
| Key Fields Present             | ✅🟢 product\_name, brands, categories | \~98 % non-missing for each after normalization                                                                                             |
| Explicit Brand Column          | ✅🟢 Present                           | Dedicated `brands` field; no inference needed, but values need normalization                                                                |
| Brand Extraction Effort        | ⚠️ Moderate                           | Requires parsing comma-separated `brands` lists and normalizing variants (e.g. “Nutella,Ferrero” vs “Ferrero, Ferrero U.S.A. Incorporated”) |
| Category Hierarchy             | ❌🔴 Flat                              | \~1 200 comma-delimited categories with no inherent structure — difficult to model hierarchy                                                |
| Title Quality & Diversity      | ✅🟢 High                              | \~485 k unique flattened product names — rich and varied                                                                                    |
| Additional Metadata            | ⚠️ Extensive                          | 109 fields (nutriments, allergens, packaging, etc.) — valuable but beyond core four-day scope                                               |
| Suitability for Category Model | ✅🟢 Strong                            | Clean categories, high coverage, ample variety                                                                                              |
| Suitability for Brand Model    | ✅🟢 Strong                            | Direct `brands` column supports branding tasks; moderate cleaning needed                                                                    |


Based on this analysis, we choose the Kaggle dataset. It has no explicit brand column, so brands must be extracted from `name` with a regex or split operation, which introduces some noise. In exchange, it offers a compact structure (`name`, `main_category`, `sub_category`) with no missing values and a clear two-level category hierarchy, which streamlines our workflow. This lets us spend the four days on model development, training, and API deployment rather than on complex data engineering. We accept the trade-off: we give up a native brand column for a dataset that lets us focus on building and delivering a high-quality model.