In [2]:
import numpy as np
import pandas as pd

from pathlib import Path

# Dataset Exploration

Dataset Name: Comparing Cosmetics By Ingredients

Kaggle Link: https://www.kaggle.com/datasets/venessagreen/comparing-cosmetics-by-ingredients?resource=download

In [3]:
BASE_DIR = Path().resolve() 
DATASET_PATH = BASE_DIR / ".." / ".." / "MiguelLib" / "datasets" / "cosmetics_raw.csv"
DATASET_PATH = DATASET_PATH.resolve()


df = pd.read_csv(DATASET_PATH)
df.head()

| | Label | Brand | Name | Price | Rank | Ingredients | Combination | Dry | Normal | Oily | Sensitive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Moisturizer | LA MER | Crème de la Mer | 175 | 4.1 | Algae (Seaweed) Extract, Mineral Oil, Petrolat... | 1 | 1 | 1 | 1 | 1 |
| 1 | Moisturizer | SK-II | Facial Treatment Essence | 179 | 4.1 | Galactomyces Ferment Filtrate (Pitera), Butyle... | 1 | 1 | 1 | 1 | 1 |
| 2 | Moisturizer | DRUNK ELEPHANT | Protini™ Polypeptide Cream | 68 | 4.4 | Water, Dicaprylyl Carbonate, Glycerin, Ceteary... | 1 | 1 | 1 | 1 | 0 |
| 3 | Moisturizer | LA MER | The Moisturizing Soft Cream | 175 | 3.8 | Algae (Seaweed) Extract, Cyclopentasiloxane, P... | 1 | 1 | 1 | 1 | 1 |
| 4 | Moisturizer | IT COSMETICS | Your Skin But Better™ CC+™ Cream with SPF 50+ | 38 | 4.1 | Water, Snail Secretion Filtrate, Phenyl Trimet... | 1 | 1 | 1 | 1 | 1 |


In [4]:
print("Dataset shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum())

Dataset shape: (1472, 11)

Columns: ['Label', 'Brand', 'Name', 'Price', 'Rank', 'Ingredients', 'Combination', 'Dry', 'Normal', 'Oily', 'Sensitive']

Data types:
Label           object
Brand           object
Name            object
Price            int64
Rank           float64
Ingredients     object
Combination      int64
Dry              int64
Normal           int64
Oily             int64
Sensitive        int64
dtype: object

Missing values per column:
Label          0
Brand          0
Name           0
Price          0
Rank           0
Ingredients    0
Combination    0
Dry            0
Normal         0
Oily           0
Sensitive      0
dtype: int64


### Columns Overview

In [5]:
print("\n--- Product Categories (Label) ---")
category_counts = df["Label"].value_counts()
print(category_counts)
print("\nNumber of unique categories:", df["Label"].nunique())


--- Product Categories (Label) ---
Label
Moisturizer    298
Cleanser       281
Face Mask      266
Treatment      248
Eye cream      209
Sun protect    170
Name: count, dtype: int64

Number of unique categories: 6


In [6]:
print("\n--- Product Brands ---")
brand_counts = df["Brand"].value_counts()  # separate name so the Label counts are not overwritten
print(brand_counts)
print("\nNumber of unique brands:", df["Brand"].nunique())


--- Product Brands ---
Brand
CLINIQUE              79
SEPHORA COLLECTION    66
SHISEIDO              63
ORIGINS               54
MURAD                 47
                      ..
TOM FORD               1
KAPLAN MD              1
BLACK UP               1
URBAN DECAY            1
DERMAFLASH             1
Name: count, Length: 116, dtype: int64

Number of unique brands: 116


In [7]:
print("\n--- Price Overview ---")
print(df["Price"].describe())


--- Price Overview ---
count    1472.000000
mean       55.584239
std        45.014429
min         3.000000
25%        30.000000
50%        42.500000
75%        68.000000
max       370.000000
Name: Price, dtype: float64


In [8]:
print("\n--- Rank Overview ---")
print(df["Rank"].describe())


--- Rank Overview ---
count    1472.000000
mean        4.153261
std         0.633918
min         0.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: Rank, dtype: float64


In [9]:
print("\n--- Ingredients Overview ---")
print(df["Ingredients"].head(5))
ingredient_lengths = df["Ingredients"].str.len()
print("\nIngredient text lengths stats:")
print(ingredient_lengths.describe())
print("\nRandom sample of ingredients:")
print(df["Ingredients"].sample(10, random_state=42))


--- Ingredients Overview ---
0    Algae (Seaweed) Extract, Mineral Oil, Petrolat...
1    Galactomyces Ferment Filtrate (Pitera), Butyle...
2    Water, Dicaprylyl Carbonate, Glycerin, Ceteary...
3    Algae (Seaweed) Extract, Cyclopentasiloxane, P...
4    Water, Snail Secretion Filtrate, Phenyl Trimet...
Name: Ingredients, dtype: object

Ingredient text lengths stats:
count    1472.000000
mean      700.720788
std       531.932518
min         6.000000
25%       374.500000
50%       648.000000
75%       933.250000
max      5491.000000
Name: Ingredients, dtype: float64

Random sample of ingredients:
852     Galactomyces Ferment Filtrate*, Water, Glyceri...
184     Adv Timezone Night Age Rvr Lw Crm Division: El...
1261    -Nutrex EGF -Epidermosil™ -Eye Regener™ -Diamo...
67      Water, Dimethicone, Butylene Glycol, Phenyl Tr...
220     Water , Glycerin , Cetyl Alcohol , Dimethicone...
494     Water, Hamamelis Virginiana (Witch Hazel) Wate...
430     Water , Isopropyl Palmitate , Pentylene G

In [10]:
skin_columns = ["Combination", "Dry", "Normal", "Oily", "Sensitive"]
print("\n--- Skin Type Flags Summary ---")
print(df[skin_columns].sum())
print("\nDistribution (percent) of skin flags:")
print((df[skin_columns].sum() / df.shape[0] * 100).round(2))


--- Skin Type Flags Summary ---
Combination    966
Dry            904
Normal         960
Oily           894
Sensitive      756
dtype: int64

Distribution (percent) of skin flags:
Combination    65.62
Dry            61.41
Normal         65.22
Oily           60.73
Sensitive      51.36
dtype: float64


In [11]:
zero_ingredients = (df["Ingredients"].str.strip() == "").sum()
print(f"\nNumber of products with empty ingredient list: {zero_ingredients}")


Number of products with empty ingredient list: 0


In [12]:
duplicates = df.duplicated(subset=["Brand", "Name"]).sum()
print(f"\nNumber of duplicate products by Brand+Name: {duplicates}")


Number of duplicate products by Brand+Name: 0


### Summary

Dataset Overview:

- Rows × Columns: 1472 × 11

- Columns: Label, Brand, Name, Price, Rank, Ingredients, Combination, Dry, Normal, Oily, Sensitive

- Data types: Price and the five skin-type flags are int64, Rank is float64, and Label, Brand, Name, and Ingredients are object (text)

- Missing values: None

Product Categories (Label):

- 6 unique categories: Moisturizer (298), Cleanser (281), Face Mask (266), Treatment (248), Eye cream (209), Sun protect (170)

Brands:

- 116 unique brands, with top brands: CLINIQUE (79), SEPHORA COLLECTION (66), SHISEIDO (63), ORIGINS (54), MURAD (47)

Price Overview:

- Range: 3 to 370, with a mean of 55.58 and a median of 42.50, so prices are right-skewed

- Currency is not specified and must be confirmed

Rank Overview:

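- Range: 0 to 5, with a mean of 4.15 and a median of 4.3

- The minimum of 0 may indicate unrated products rather than a true score; this should be checked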

Ingredients:

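- Text field of widely varying length (6 to 5,491 characters, median 648), usually a comma-separated list of ingredients

- No products have empty ingredient lists, although the 6-character minimum suggests that some entries are placeholders rather than real lists

- Some entries contain detailed ingredient lists, while others include additional notes or boutique info (for example, rows 184 and 1261 in the random sample), so this field should be cleaned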

Skin Type Flags:

- Multiple flags per product are possible

- Share of products per flag, from most to least common: Combination (65.6%), Normal (65.2%), Dry (61.4%), Oily (60.7%), Sensitive (51.4%)

Duplicates & Null Values:

- No duplicate products by Brand+Name

- All key fields are populated
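
The null and duplicate checks only catch values that are missing outright. As a minimal sketch of complementary checks, the snippet below flags values that are present but suspicious: a `Rank` of 0 (assumed here to mean "unrated", which still needs to be verified) and very short ingredient text (the minimum length observed is 6 characters). The function name `flag_suspect_rows`, the threshold `min_ingredient_chars=20`, and the toy rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def flag_suspect_rows(frame: pd.DataFrame, min_ingredient_chars: int = 20) -> pd.DataFrame:
    """Return a boolean frame marking values that are present but suspicious."""
    flags = pd.DataFrame(index=frame.index)
    # Assumption: Rank == 0 marks an unrated product rather than a true score.
    flags["zero_rank"] = frame["Rank"].eq(0)
    # Very short text is unlikely to be a real ingredient list.
    text_length = frame["Ingredients"].str.strip().str.len()
    flags["short_ingredients"] = text_length.lt(min_ingredient_chars)
    return flags


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "Rank": [4.1, 0.0, 4.4],
        "Ingredients": [
            "Water, Glycerin, Dimethicone",
            "Water, Glycerin, Cetyl Alcohol",
            "No Info",
        ],
    }
)
suspect = flag_suspect_rows(toy)
print(suspect.sum())
```

On the real data, `flag_suspect_rows(df).sum()` would give a count per check, which could then be reviewed by hand before deciding whether to drop or repair those rows.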

# Why this dataset? 

This dataset is a strong fit for our project because it aligns directly with the structure of the system we want to build. It contains 1,472 products across 6 categories and 116 brands, which gives us enough diversity to avoid narrow or biased recommendations while still being manageable. All key fields are populated, and there are no duplicate products by Brand and Name, which reduces the need for heavy data correction and allows us to focus on building the recommendation logic.

The dataset also includes both structured and unstructured features that are essential for our design. Numerical variables such as Price and Rank support filtering by budget and product quality, while the skin-type flags (Combination, Dry, Normal, Oily, Sensitive) map directly to the user profile inputs in our app. The Ingredients field, although only loosely structured, provides detailed product-level information that enables ingredient exclusion logic and irritation-based feedback adjustments.
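
To make the mapping from columns to profile inputs concrete, here is a minimal sketch of profile-based filtering. It assumes the column names shown in the exploration above; the function `filter_products`, its parameters, and the toy rows are hypothetical and only illustrate the idea, not the final recommendation logic.

```python
import pandas as pd


def filter_products(frame, skin_type, max_price, min_rank, excluded=()):
    """Keep products matching a skin type, budget, rating floor, and ingredient exclusions."""
    mask = (
        frame[skin_type].eq(1)
        & frame["Price"].le(max_price)
        & frame["Rank"].ge(min_rank)
    )
    ingredients = frame["Ingredients"].str.lower()
    for term in excluded:
        # Plain substring match (regex=False); a tokenized match would be stricter.
        mask &= ~ingredients.str.contains(term.lower(), regex=False, na=False)
    return frame[mask]


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "Name": ["Cream A", "Cream B", "Cream C"],
        "Price": [30, 68, 175],
        "Rank": [4.5, 4.4, 4.1],
        "Ingredients": [
            "Water, Glycerin, Fragrance",
            "Water, Glycerin, Dimethicone",
            "Water, Mineral Oil, Petrolatum",
        ],
        "Sensitive": [1, 1, 1],
    }
)
result = filter_products(toy, "Sensitive", max_price=100, min_rank=4.0, excluded=["fragrance"])
print(result["Name"].tolist())
```

Substring matching is the simplest option but can over-match (excluding "alcohol" would also drop "Cetyl Alcohol"), so matching against parsed ingredient names is likely the safer choice once the field is cleaned.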

Overall, the dataset is sufficiently complete, diverse, and structurally compatible with our feedback-driven recommendation approach. While minor cleaning is required (such as standardizing ingredient formatting and confirming the currency of the price variable), the data provides a solid foundation for implementing both profile-based and interaction-based recommendation mechanisms.
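
As a rough sketch of what "standardizing ingredient formatting" could look like, the helper below splits a raw ingredient string on commas and normalizes each name. The function name `split_ingredients` and the exact normalization rules (lowercasing, trimming spaces and stray `*` or `.` markers) are assumptions for illustration; the final rules should be chosen after inspecting more rows.

```python
import re


def split_ingredients(text: str) -> list[str]:
    """Split a comma-separated ingredient string into normalized names."""
    cleaned = []
    for part in text.split(","):
        # Collapse whitespace runs, then trim spaces and stray markers such as '*' or '.'.
        name = re.sub(r"\s+", " ", part).strip(" *.").lower()
        if name:
            cleaned.append(name)
    return cleaned


print(split_ingredients("Water , Glycerin , Cetyl Alcohol"))
print(split_ingredients("Galactomyces Ferment Filtrate*, Water"))
```

On the full dataset this could be applied with `df["Ingredients"].apply(split_ingredients)`. Note that a plain comma split would break names that themselves contain commas (for example "1,2-Hexanediol") and does nothing about entries that are not ingredient lists at all, so both cases need separate handling.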