# t-SNE of Cosmetics Data

## Context

Choosing a new cosmetic item can be a daunting task, especially when faced with unfamiliar ingredient lists that are difficult to interpret without a background in chemistry. This project aims to address this challenge by creating a content-based recommendation system that leverages the chemical components of cosmetics. By processing the ingredient lists of 1,472 cosmetics from Sephora, we can visualize the similarity between products using a machine learning technique called t-SNE (t-distributed Stochastic Neighbor Embedding) and the interactive visualization library Bokeh.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets/data

## Data Dictionary

- **Label**: This is the type of product. This is categorical.
- **Brand**: This is the brand of product. This is categorical.
- **Name**: This is the unique name of cosmetics. This is categorical.
- **Price**: This is the price of the product in USD. This is numerical.
- **Rank**: This is the rank of the product on a 0 to 5 scale. This is numerical.
- **Ingredients**: The ingredients present in the cosmetic. This is categorical.
- **Combination**: This indicates whether the cosmetic is suitable for combination (dry and oily) skin. This is binary.
- **Dry**: This indicates whether the cosmetic is useful for Dry skin type. This is binary.
- **Normal**: This indicates whether the cosmetic is useful for Normal skin type. This is binary.
- **Oily**: This indicates whether the cosmetic is useful for Oily skin type. This is binary.
- **Sensitive**: This indicates whether the cosmetic is useful for Sensitive skin type. This is binary.

## Problem Statement

1. **Exploratory Data Analysis (EDA)**: The objective of EDA is to analyze the data and find patterns and relationships among the features.
2. **t-SNE**: The objective of t-SNE is to reduce the dimensionality of the features and visualize the result.

### Load Libraries

In [21]:
# General
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

# Other
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Bokeh Visualization
from bokeh.plotting import figure, output_file, show
from bokeh.models import HoverTool, ColumnDataSource

### Settings

In [2]:
# Warnings
warnings.filterwarnings("ignore")
# Plot Style
sns.set_style("darkgrid")
# Path
data_path = "../data"
vis_path = "../visualizations"
csv_path = os.path.join(data_path, "cosmetics.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

### General Information

In [4]:
# Get 1st 5 rows of dataset to get an idea what data are stored in each feature
df.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1


In [5]:
# Data Description
print("=" * 60)
print("DATA DESCRIPTION")
print("=" * 60)
print(f"Number of observations: {df.shape[0]}")
print(f"Number of features: {df.shape[1]}")

DATA DESCRIPTION
Number of observations: 1472
Number of features: 11


In [6]:
# Feature Description
print("=" * 60)
print("FEATURE DESCRIPTION")
print("=" * 60)
print(df.dtypes)
print("-" * 60)
# Separate categorical, numeric and binary features
cat_cols = [ col for col in df.columns if df[col].dtype == "object"]
num_cols = [col for col in df.columns if df[col].dtype != "object" and df[col].nunique() > 2]
bin_cols = [col for col in df.columns if df[col].dtype != "object" and df[col].nunique() == 2]

print(f"Number of Categorical Features: {len(cat_cols)}")
print(cat_cols)
print("-" * 60)
print(f"Number of Numerical Features: {len(num_cols)}")
print(num_cols)
print("-" * 60)
print(f"Number of Binary Features: {len(bin_cols)}")
print(bin_cols)
print("-" * 60)

FEATURE DESCRIPTION
Label           object
Brand           object
Name            object
Price            int64
Rank           float64
Ingredients     object
Combination      int64
Dry              int64
Normal           int64
Oily             int64
Sensitive        int64
dtype: object
------------------------------------------------------------
Number of Categorical Features: 4
['Label', 'Brand', 'Name', 'Ingredients']
------------------------------------------------------------
Number of Numerical Features: 2
['Price', 'Rank']
------------------------------------------------------------
Number of Binary Features: 5
['Combination', 'Dry', 'Normal', 'Oily', 'Sensitive']
------------------------------------------------------------


In [7]:
# Missing Value Detection
print("=" * 60)
print("MISSING VALUE DETECTION")
print("=" * 60)
if df.isnull().sum().sum() > 0:
    print(df.isnull().sum())
else:
    print("No missing value is present in any feature.")

MISSING VALUE DETECTION
No missing value is present in any feature.


In [8]:
# Duplicate Row Detection
print("=" * 60)
print("DUPLICATE ROW DETECTION")
print("=" * 60)
print(f"Number of duplicate rows: {df.duplicated().sum()}")

DUPLICATE ROW DETECTION
Number of duplicate rows: 0


In [9]:
# Check Category(Label)
df["Label"].value_counts()

Label
Moisturizer    298
Cleanser       281
Face Mask      266
Treatment      248
Eye cream      209
Sun protect    170
Name: count, dtype: int64

### Key Findings

There are **6** types of products for **5** skin types.
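The relationship between product types and skin types can be checked by summing the binary skin-type columns per `Label`. The following is a minimal sketch on a hypothetical toy frame (the rows and the reduced column set are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column layout as the cosmetics dataset
toy = pd.DataFrame({
    "Label": ["Moisturizer", "Moisturizer", "Cleanser"],
    "Dry": [1, 0, 1],
    "Oily": [1, 1, 0],
})

# Binary skin-type columns; the full dataset has five of them
skin_cols = ["Dry", "Oily"]

# Number of products per label that are marked suitable for each skin type
counts = toy.groupby("Label")[skin_cols].sum()
print(counts)
```

On the real data, the same call with `skin_cols = ["Combination", "Dry", "Normal", "Oily", "Sensitive"]` and `df` in place of `toy` should give one row per product type.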

### Data Preparation by Vectorizing Ingredients

This step is required because the ingredients are the most important feature when selecting a cosmetic.

In [10]:
# Use TfidfVectorizer to convert each ingredient list (a comma-separated string) into a TF-IDF vector.
# The default tokenizer splits the text into individual words, not whole ingredients.
tfidfv = TfidfVectorizer(stop_words="english", max_features= 5000)
tfidf_matrix = tfidfv.fit_transform(df["Ingredients"])

# Convert the matrix into dataframe
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index = df.index, columns = tfidfv.get_feature_names_out())

In [11]:
tfidf_df

Unnamed: 0,00,000,002,01,02,031,05,067,07,074,...,zerumbet,zeylanicum,zinc,zingiber,zingier,zizanoides,ziziphus,zizyphus,zolinone,zostera
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.077682,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058367,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1468,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1469,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1470,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Dimensionality Reduction By t-SNE

Reduce the dimensionality of the TF-IDF vectors to two components using t-SNE.

In [16]:
# Apply t-SNE to the TF-IDF vectors
# Note: scikit-learn 1.5 renamed n_iter to max_iter; use n_iter=3000 on older releases
tsne = TSNE(n_components=2, random_state=42, perplexity=40, max_iter=3000)
tsne_result = tsne.fit_transform(tfidf_df)
tsne_result

array([[ 40.08445  ,  -5.321477 ],
       [ 14.559531 , -36.45409  ],
       [  6.021212 ,  32.489952 ],
       ...,
       [  5.1926517,  16.146881 ],
       [-27.001022 ,  -5.449068 ],
       [-17.671087 ,  54.42585  ]], dtype=float32)

In [17]:
# Add t-SNE result back to the dataframe
df["TSNE-2D-ONE"] = tsne_result[:, 0]
df["TSNE-2D-TWO"] = tsne_result[:, 1]

In [19]:
# Sanity check
df.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,TSNE-2D-ONE,TSNE-2D-TWO
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1,40.08445,-5.321477
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1,14.559531,-36.45409
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0,6.021212,32.489952
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1,38.363201,-4.50486
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1,12.000286,-5.572656
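With the two t-SNE coordinates attached to each product, a simple content-based lookup is to return the products closest to a chosen one in the embedded plane. The sketch below assumes Euclidean distance and uses hypothetical toy coordinates. The function `nearest_products` is not part of the notebook and is shown only to illustrate the idea.

```python
import numpy as np


def nearest_products(coords, names, query_index, k=3):
    """Return the names of the k products closest to coords[query_index]."""
    diffs = coords - coords[query_index]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    dist[query_index] = np.inf  # exclude the query product itself
    order = np.argsort(dist)[:k]
    return [names[i] for i in order]


# Toy 2-D coordinates (hypothetical values, not t-SNE output)
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
names = ["A", "B", "C", "D"]

neighbours = nearest_products(coords, names, query_index=0, k=2)
print(neighbours)
```

On the real data this could be called as `nearest_products(df[["TSNE-2D-ONE", "TSNE-2D-TWO"]].to_numpy(), df["Name"].tolist(), 0)`. Because t-SNE mainly preserves local neighbourhoods, distances between far-apart points in the plot should not be over-interpreted.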


### Visualization with Bokeh

Visualize the similarity between products with Bokeh.

In [28]:
# Prepare data for bokeh

source = ColumnDataSource(data = dict(
    x = df["TSNE-2D-ONE"],
    y = df["TSNE-2D-TWO"],
    label = df["Label"],
    brand = df["Brand"],
    name = df["Name"],
    price = df["Price"],
    ingredients = df["Ingredients"]
))


In [29]:
# Create a figure (width/height replace plot_width/plot_height, which were removed in Bokeh 3.0)
plot = figure(title="t-SNE Plot of Cosmetic Ingredients", tools="pan,wheel_zoom,box_zoom,reset,hover", width=700, height=700)
# Plot the source
plot.scatter("x", "y", source=source, fill_alpha=0.6, size= 8)
# Configure the hover tool to display the brand, label, product name, price and ingredients
hover = plot.select(dict(type=HoverTool))
hover.tooltips = [
    ("Brand", "@brand"),
    ("Label", "@label"),
    ("Product", "@name"),
    ("Price", "@price"),
    ("Ingredients", "@ingredients")
]

# Show the plot
tsne_plot_path = os.path.join(vis_path, "tsne_cosmetic.html")
output_file(tsne_plot_path)
show(plot)
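The plot above draws every product in the same colour. One possible refinement, sketched here under the assumption that colouring by `Label` is useful, maps each product type to a colour with Bokeh's `factor_cmap`. The function `build_colored_plot` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.palettes import Category10
from bokeh.plotting import figure
from bokeh.transform import factor_cmap


def build_colored_plot(frame):
    """Build a scatter plot of the t-SNE coordinates coloured by product type."""
    labels = sorted(frame["Label"].unique())
    source = ColumnDataSource(data=dict(
        x=frame["TSNE-2D-ONE"],
        y=frame["TSNE-2D-TWO"],
        label=frame["Label"],
        name=frame["Name"],
    ))
    plot = figure(
        title="t-SNE Plot of Cosmetic Ingredients by Product Type",
        tools="pan,wheel_zoom,box_zoom,reset",
        tooltips=[("Product", "@name"), ("Label", "@label")],
        width=700,
        height=700,
    )
    # Category10 provides palettes for 3 to 10 categories
    palette = Category10[max(3, len(labels))]
    plot.scatter(
        "x", "y", source=source, size=8, fill_alpha=0.6,
        color=factor_cmap("label", palette=palette, factors=labels),
        legend_field="label",
    )
    return plot


# Hypothetical toy rows with the same column names as the notebook's dataframe
toy = pd.DataFrame({
    "TSNE-2D-ONE": [0.0, 1.0, 2.0],
    "TSNE-2D-TWO": [1.0, 0.5, 2.5],
    "Label": ["Moisturizer", "Cleanser", "Moisturizer"],
    "Name": ["Product A", "Product B", "Product C"],
})
plot = build_colored_plot(toy)
```

Calling `show(build_colored_plot(df))` on the full dataframe would render the coloured version, which makes it easier to see whether product types form separate clusters.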

### Visualization

Open the HTML file saved in the *visualizations* folder to view the scatter plot, in which products with similar ingredients appear close together. Hover over a point to inspect the product interactively.