<a href="https://colab.research.google.com/github/VienneseWaltz/Sephora/blob/main/CosmeticsIngredients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Importing libraries

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

# Loading the data
df = pd.read_csv("cosmetics.csv")

# Examine 5 randomly sampled rows
display(df.sample(5))

# Inspect the type of products
df.Label.value_counts()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
156,Moisturizer,BIOSSANCE,100% Squalane Oil,58,4.6,-100 Percent Sugarcane-Derived Squalane.,1,1,1,1,1
785,Treatment,ESTÉE LAUDER,Perfectionist Pro Rapid Firm + Lift Treatment,75,5.0,Perfectionist Pro Rpd Frm+Lift Trt Division: E...,1,1,1,1,0
587,Treatment,LA MER,The Concentrate,370,3.9,"Cyclopentasiloxane, Algae (Seaweed) Extract, G...",0,0,0,0,0
1459,Sun protect,MOROCCANOIL,After-Sun Milk Soothing Body Lotion,28,4.7,"Water, Caprylic/Caprlc Triglyceride, Glycerin,...",1,1,1,1,0
254,Moisturizer,J.ONE,Jelly Pack,42,4.3,"Water, Polysorbate 80, PEG-150 Disterate, Niac...",1,1,1,1,1


Label
Moisturizer    298
Cleanser       281
Face Mask      266
Treatment      248
Eye cream      209
Sun protect    170
Name: count, dtype: int64

**Focus on one product category and one skin type**
There are 6 categories of products in our dataset (**moisturizers, cleansers, face masks, treatments, eye creams,** and **sun protection**) and 5 different skin types (**combination, dry, normal, oily,** and **sensitive**).
Let's focus on moisturizers for those with dry skin by filtering the data accordingly. Different individuals have different skin types and different product needs, so let's set up a workflow so its outputs (a t-SNE model and a visualization of that model) can be customized.
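
The filtering step can be wrapped in a small helper so that the category and skin type become parameters. The sketch below is one way to do it; the function name `filter_products` and the toy rows are hypothetical, and it assumes the column layout shown in the sample output above (a `Label` column plus one 0/1 column per skin type).

```python
import pandas as pd


def filter_products(df, label, skin_type):
    """Return the rows of `df` with the given Label that suit `skin_type`.

    Hypothetical helper: assumes a 'Label' column and a 0/1 column
    named after each skin type, as in cosmetics.csv.
    """
    if skin_type not in df.columns:
        raise KeyError(f"Unknown skin type column: {skin_type!r}")
    mask = (df["Label"] == label) & (df[skin_type] == 1)
    return df[mask].reset_index(drop=True)


# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "Label": ["Moisturizer", "Moisturizer", "Cleanser"],
    "Name": ["Cream A", "Cream B", "Wash C"],
    "Dry": [1, 0, 1],
    "Oily": [0, 1, 1],
})

dry_moisturizers = filter_products(toy, "Moisturizer", "Dry")
print(dry_moisturizers)
```

With the real data, a call such as `filter_products(df, "Moisturizer", "Dry")` would play the role of the two filters and the index reset in the next cell.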


In [None]:
# Filter for moisturizers
moisturizers = df[df["Label"] == "Moisturizer"]

# Filter for dry skin
moisturizers_dry = moisturizers[moisturizers["Dry"] == 1]

# Reset the index so rows are numbered from 0, then display the result
moisturizers_dry = moisturizers_dry.reset_index(drop=True)
moisturizers_dry


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
185,Moisturizer,KIEHL'S SINCE 1851,Ultra Facial Deep Moisture Balm,29,4.7,"Water, Glycerin, Shea Butter, Glyceryl Stearat...",0,1,1,0,0
186,Moisturizer,SHISEIDO,White Lucent All Day Brightener Broad Spectrum...,62,4.6,"Water, Sd Alcohol 40-B, Dimethicone, Dipropyle...",1,1,1,0,0
187,Moisturizer,SATURDAY SKIN,Featherweight Daily Moisturizing Cream,49,4.6,"Water, Butylene Glycol, Ethylhexyl Palmitate, ...",1,1,1,1,1
188,Moisturizer,KATE SOMERVILLE,Goat Milk Moisturizing Cream,65,4.1,"Water, Ethylhexyl Palmitate, Myristyl Myristat...",1,1,1,1,1


**3. Tokenizing the ingredients**


To compare the ingredients in each product, we first need to do some preprocessing tasks and bookkeeping of the actual words in each product's ingredients list. The first step will be tokenizing the list of ingredients in the **Ingredients** column. After splitting them into tokens, we will make a binary bag of words. Then we will create a dictionary with the tokens, ingredient_idx, which will have the following format: {"ingredient": index value,...}
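
As a minimal sketch of the bookkeeping described above, the snippet below tokenizes two made-up ingredient strings and builds the dictionary. The helper name `tokenize` and the sample strings are assumptions for illustration. Unlike a plain `split(',')`, it strips surrounding whitespace so that `"water"` and `" water"` map to the same index.

```python
def tokenize(ingredients):
    """Lowercase a comma-separated ingredient string and return clean tokens."""
    return [tok.strip() for tok in ingredients.lower().split(",") if tok.strip()]


# Made-up ingredient lists, for illustration only
samples = [
    "Water, Glycerin, Shea Butter",
    "Glycerin, Water, Dimethicone",
]

toy_corpus = [tokenize(s) for s in samples]

# Assign each new token the next free integer index
toy_ingredient_idx = {}
for tokens in toy_corpus:
    for token in tokens:
        if token not in toy_ingredient_idx:
            toy_ingredient_idx[token] = len(toy_ingredient_idx)

print(toy_ingredient_idx)
# {'water': 0, 'glycerin': 1, 'shea butter': 2, 'dimethicone': 3}
```

Names that contain a comma themselves, such as 1,2-Hexanediol, would still be split in two by this approach; handling them is left out of the sketch.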

In [None]:
# Investigate the moisturizers_dry DataFrame
print(moisturizers_dry.head())

# Initialize the ingredient-to-index dictionary and the corpus of token lists
ingredient_idx = {}
corpus = []

# Tokenize ingredients and build corpus
for ingredients_list in moisturizers_dry['Ingredients']:
    # Convert to lowercase and split ingredients
    ingredients_lower = ingredients_list.lower()
    tokens = ingredients_lower.split(',')
    # Append tokenized ingredients to corpus
    corpus.append(tokens)

    # Update ingredient_idx with unique ingredients
    for ingredient in tokens:
        if ingredient not in ingredient_idx:
            ingredient_idx[ingredient] = len(ingredient_idx)

# Guard against an empty dictionary before looking anything up
if not ingredient_idx:
    raise ValueError("The ingredient_idx dictionary is empty.")

print("\n")
# Get the index of the first row in moisturizers_dry DataFrame containing the substring "cyclopentasiloxane"
cyclopentasiloxane_index = moisturizers_dry.loc[moisturizers_dry['Ingredients'].str.lower().str.contains('cyclopentasiloxane'), 'Ingredients'].index[0]
print(f"The index of the first row containing cyclopentasiloxane is {cyclopentasiloxane_index}")
print("\n")

# Split ingredients string into individual ingredients
print("\n")
cyclopentasiloxane_ingredients = moisturizers_dry.loc[cyclopentasiloxane_index, 'Ingredients'].lower().split(',')

# Retrieve the brand of the moisturizer
brand_of_moisturizer = moisturizers_dry.loc[cyclopentasiloxane_index, 'Brand']

print(f"The {brand_of_moisturizer} moisturizer has the following ingredients: {cyclopentasiloxane_ingredients}")
print("\n")
# Print the index of each ingredient in that moisturizer.
# The keys of ingredient_idx are the raw tokens produced by split(','), which keep
# their leading space, so look up the raw token and strip it only for display.
for ingredient in cyclopentasiloxane_ingredients:
    print(f"The index for {ingredient.strip()} is", ingredient_idx.get(ingredient, "Ingredient not found in ingredient_idx"))

print("\n")


         Label           Brand                                           Name  \
0  Moisturizer          LA MER                                Crème de la Mer   
1  Moisturizer           SK-II                       Facial Treatment Essence   
2  Moisturizer  DRUNK ELEPHANT                     Protini™ Polypeptide Cream   
3  Moisturizer          LA MER                    The Moisturizing Soft Cream   
4  Moisturizer    IT COSMETICS  Your Skin But Better™ CC+™ Cream with SPF 50+   

   Price  Rank                                        Ingredients  \
0    175   4.1  Algae (Seaweed) Extract, Mineral Oil, Petrolat...   
1    179   4.1  Galactomyces Ferment Filtrate (Pitera), Butyle...   
2     68   4.4  Water, Dicaprylyl Carbonate, Glycerin, Ceteary...   
3    175   3.8  Algae (Seaweed) Extract, Cyclopentasiloxane, P...   
4     38   4.1  Water, Snail Secretion Filtrate, Phenyl Trimet...   

   Combination  Dry  Normal  Oily  Sensitive  
0            1    1       1     1          1  
1   

**4. Initializing a document-term matrix (DTM)**

In [None]:
# Get the number of items and tokens
M = len(moisturizers_dry)
N = len(ingredient_idx)
print(f"The number of moisturizers for dry skin = {M}")
print(f"The number of ingredients = {N}")

The number of moisturizers for dry skin = 190
The number of ingredients = 2257


In [None]:
# Initialize a matrix of zeros
A = np.zeros([M, N])

**5. Creating a counter function**

In [None]:
# Define the oh_encoder function
def oh_encoder(tokens):
  x = np.zeros(N)
  for ingredient in tokens:
    idx = ingredient_idx[ingredient]
    # Put a 1 at the corresponding indices
    x[idx] = 1
  return x

**6. The Cosmetic Ingredient Matrix**

Now we will apply the oh_encoder function to the tokens in the corpus and set the values at each row of this matrix. We will do one-hot encoding for each ingredient in the items, so the cosmetic-ingredient matrix will be filled with binary values (1 if the item contains the ingredient, 0 otherwise).

In [None]:
# Make a document-term matrix: one row per item, one column per ingredient
for i, tokens in enumerate(corpus):
  A[i, :] = oh_encoder(tokens)
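
A quick sanity check on a toy example can confirm what the matrix should look like: each row should contain exactly as many ones as the item has distinct tokens. The sketch below rebuilds the same kind of matrix on made-up data; `toy_idx`, `toy_corpus`, and `encode` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Made-up vocabulary and tokenized items, for illustration only
toy_idx = {"water": 0, "glycerin": 1, "shea butter": 2, "dimethicone": 3}
toy_corpus = [
    ["water", "glycerin", "shea butter"],
    ["glycerin", "dimethicone", "glycerin"],  # duplicate token on purpose
]


def encode(tokens, vocab):
    """Return a binary vector with a 1 for every token present in `tokens`."""
    x = np.zeros(len(vocab))
    for token in tokens:
        x[vocab[token]] = 1
    return x


toy_A = np.vstack([encode(tokens, toy_idx) for tokens in toy_corpus])
print(toy_A)

# Row sums equal the number of distinct tokens in each item
row_sums = toy_A.sum(axis=1)
print(row_sums)  # [3. 2.]
```

The repeated token in the second item still yields a single 1, which is what makes this a binary bag of words rather than a count matrix.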

**7. Dimensionality Reduction with t-SNE**
The dimensions of the existing matrix are (190, 2257), which means there are 2257 features in our data. For visualization, we should reduce this to two dimensions. We'll use t-SNE to do so.

T-distributed Stochastic Neighbor Embedding (t-SNE) [https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding] is a dimensionality-reduction technique particularly well-suited for embedding high-dimensional data for visualization in a low-dimensional space of 2 or 3 dimensions. This lets us plot every item as a point on the coordinate plane. All of the cosmetic items in our data can be mapped to two-dimensional coordinates, and the distances between nearby points can be read as similarities between the items.

In [None]:
# Dimensional reduction with t-SNE
tsne = TSNE(n_components=2, perplexity=40, random_state=42)
tsne_features = tsne.fit_transform(A)

# Make X, Y columns
moisturizers_dry['X'] = tsne_features[:, 0]
moisturizers_dry['Y'] = tsne_features[:, 1]


**8. Let's map the items with Bokeh**

In [None]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool

# Set up bokeh for displaying plots in the notebook
output_notebook()

# Make a source and a scatter plot
source = ColumnDataSource(moisturizers_dry)
plot = figure(x_axis_label = "TSNE 1",
              y_axis_label = "TSNE 2",
              width = 500, height = 400)
plot.circle(x = "X",
            y = "Y",
            source = source,
            size = 13,
            color = '#008080',
            alpha = .8)

# Show the plot
show(plot)


**9. Adding a Hover Tool**

Adding a hover tool lets us inspect each item whenever the cursor is directly over a glyph. We'll add tooltips for each product's name, brand, price, and rank (i.e. rating).

In [None]:
# Create a Hover Tool object
hover = HoverTool(tooltips =[('Item','@Name'),
                              ('Brand','@Brand'),
                              ('Price','$@Price'),
                              ('Rank','@Rank')])
plot.add_tools(hover)

# Display the plot
show(plot)

**10. Mapping the Cosmetic Items**
Finally, it is show time! Let's see what the map we've made looks like. Each point on the plot corresponds to one cosmetic item. The axes of a t-SNE plot aren't easily interpretable in terms of the original data. As mentioned above, t-SNE is a visualization technique for plotting high-dimensional data in a low-dimensional space, so it's not advisable to interpret a t-SNE plot quantitatively.

Instead, what we can get from this map is the distance between the points (which items are close and which items are far apart). The closer two items are, the more similar their compositions tend to be. This lets us compare two items without having a chemistry background or a PhD.
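
To read neighbours off the map programmatically rather than by eye, one option is to rank items by Euclidean distance in the embedded plane. The sketch below does this with NumPy on made-up coordinates; `nearest_neighbors` is a hypothetical helper, and in the notebook the coordinates would come from the `X` and `Y` columns.

```python
import numpy as np


def nearest_neighbors(coords, i, k=3):
    """Return the indices of the k points closest to point i (excluding i)."""
    coords = np.asarray(coords, dtype=float)
    dists = np.linalg.norm(coords - coords[i], axis=1)
    dists[i] = np.inf  # never return the query point itself
    return np.argsort(dists)[:k]


# Made-up 2-D coordinates, for illustration only
toy_coords = [
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.0, 0.3],
]

neighbors = nearest_neighbors(toy_coords, 0, k=2)
print(neighbors)  # [1 3]
```

In the notebook this could be called as `nearest_neighbors(moisturizers_dry[['X', 'Y']].to_numpy(), i)`. Because t-SNE mainly preserves local neighbourhoods, only the first few neighbours are worth trusting.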

**11. Comparing Two Products**
There is a vast number of cosmetics, and the number of ingredients they contain is staggering. The plot doesn't exhibit the obvious patterns that simpler t-SNE plots offer. Our plot requires some digging, but that's okay.

Say we enjoyed Color Control Cushion Compact Broad Spectrum SPF 50+ by Amorepacific (at \$60). We find this item on the extreme left of the plot. In fact, another item overlaps it: Laneige's BB Cushion Hydra Radiance SPF 50 (at \$38). By looking at the ingredients, we can visually confirm that the compositions of the two products are similar (a little difficult, but that's why we initiated this analysis in the first place), and the second is \$22 cheaper.

In real life, having such an ingredient-based recommendation engine helps us make educated cosmetic purchase choices.
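
The visual comparison of two ingredient lists can also be backed by a number. One common choice is the Jaccard index: the size of the intersection divided by the size of the union of the two ingredient sets. The sketch below uses shortened, made-up lists; the helper names are hypothetical, and tokens are lowercased so that spellings such as `Peg-10` and `PEG-10` match.

```python
def ingredient_set(ingredients):
    """Split a comma-separated ingredient string into a set of clean tokens."""
    return {tok.strip().lower() for tok in ingredients.split(",") if tok.strip()}


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Shortened, made-up ingredient strings, for illustration only
item_1 = "Water, Cyclopentasiloxane, Peg-10 Dimethicone, Arbutin"
item_2 = "Water, Cyclopentasiloxane, PEG-10 Dimethicone, Niacinamide, Silica"

set_1, set_2 = ingredient_set(item_1), ingredient_set(item_2)
shared = sorted(set_1 & set_2)
score = jaccard(set_1, set_2)

print(shared)
print(round(score, 2))  # 0.5
```

A score near 1 means the two products share most of their ingredients, while a score near 0 means they share few. The index ignores ingredient order and concentration, so it is only a rough guide.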

In [None]:
# Print the ingredients of two similar cosmetics
cosmetic_1 = moisturizers_dry[moisturizers_dry['Name'] == 'Color Control Cushion Compact Broad Spectrum SPF 50+']
cosmetic_2 = moisturizers_dry[moisturizers_dry['Name'] == 'BB Cushion Hydra Radiance SPF 50']

# Display each item's data and ingredients
display(cosmetic_1)
print("\n")
print(cosmetic_1.Ingredients.values)

display(cosmetic_2)
print("\n")
print(cosmetic_2.Ingredients.values)


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y
45,Moisturizer,AMOREPACIFIC,Color Control Cushion Compact Broad Spectrum S...,60,4.0,"Phyllostachis Bambusoides Juice, Cyclopentasil...",1,1,1,1,1,-2.715333,1.412771




['Phyllostachis Bambusoides Juice, Cyclopentasiloxane, Cyclohexasiloxane, Peg-10 Dimethicone, Phenyl Trimethicone, Butylene Glycol, Butylene Glycol Dicaprylate/Dicaprate, Alcohol, Arbutin, Lauryl Peg-9 Polydimethylsiloxyethyl Dimethicone, Acrylates/Ethylhexyl Acrylate/Dimethicone Methacrylate Copolymer, Polyhydroxystearic Acid, Sodium Chloride, Polymethyl Methacrylate, Aluminium Hydroxide, Stearic Acid, Disteardimonium Hectorite, Triethoxycaprylylsilane, Ethylhexyl Palmitate, Lecithin, Isostearic Acid, Isopropyl Palmitate, Phenoxyethanol, Polyglyceryl-3 Polyricinoleate, Acrylates/Stearyl Acrylate/Dimethicone Methacrylate Copolymer, Dimethicone, Disodium Edta, Trimethylsiloxysilicate, Ethylhexyglycerin, Dimethicone/Vinyl Dimethicone Crosspolymer, Water, Silica, Camellia Japonica Seed Oil, Camillia Sinensis Leaf Extract, Caprylyl Glycol, 1,2-Hexanediol, Fragrance, Titanium Dioxide, Iron Oxides (Ci 77492, Ci 77491, Ci77499).']


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y
55,Moisturizer,LANEIGE,BB Cushion Hydra Radiance SPF 50,38,4.3,"Water, Cyclopentasiloxane, Zinc Oxide (CI 7794...",1,1,1,1,1,-2.78389,1.451848




['Water, Cyclopentasiloxane, Zinc Oxide (CI 77947), Ethylhexyl Methoxycinnamate, PEG-10 Dimethicone, Cyclohexasiloxane, Phenyl Trimethicone, Iron Oxides (CI 77492), Butylene Glycol Dicaprylate/Dicaprate, Niacinamide, Lauryl PEG-9 Polydimethylsiloxyethyl Dimethicone, Acrylates/Ethylhexyl Acrylate/Dimethicone Methacrylate Copolymer, Titanium Dioxide (CI 77891 , Iron Oxides (CI 77491), Butylene Glycol, Sodium Chloride, Iron Oxides (CI 77499), Aluminum Hydroxide, HDI/Trimethylol Hexyllactone Crosspolymer, Stearic Acid, Methyl Methacrylate Crosspolymer, Triethoxycaprylylsilane, Phenoxyethanol, Fragrance, Disteardimonium Hectorite, Caprylyl Glycol, Yeast Extract, Acrylates/Stearyl Acrylate/Dimethicone Methacrylate Copolymer, Dimethicone, Trimethylsiloxysilicate, Polysorbate 80, Disodium EDTA, Hydrogenated Lecithin, Dimethicone/Vinyl Dimethicone Crosspolymer, Mica (CI 77019), Silica, 1,2-Hexanediol, Polypropylsilsesquioxane, Chenopodium Quinoa Seed Extract, Magnesium Sulfate, Calcium Chlori