<h3>Word2vec model</h3>
Word2Vec is one of the most popular techniques for learning word embeddings. It is based on the assumption that words appearing in the same contexts usually have similar meanings. A word embedding is a vector representation of a word: the model takes a text corpus as input and outputs a set of vectors, one per word. Because the words are now represented as numbers, we can apply mathematical operations to them, for example to measure how similar two words are. A well-trained set of word vectors places similar words close together in the vector space.

In [7]:
import requests
import json
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
%matplotlib inline

from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
sns.set_style("darkgrid")

import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import gensim
from gensim.utils import simple_preprocess
from gensim.models import phrases, word2vec, Word2Vec
from gensim.models.phrases import Phrases, Phraser

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [9]:
data = requests.get("http://makeup-api.herokuapp.com/api/v1/products.json/").json()
products = pd.DataFrame(data)

products.head()

Unnamed: 0,id,brand,name,price,price_sign,currency,image_link,product_link,website_link,description,rating,category,product_type,tag_list,created_at,updated_at,product_api_url,api_featured_image,product_colors
0,1048,colourpop,Lippie Pencil,5.0,$,CAD,https://cdn.shopify.com/s/files/1/1338/0845/co...,https://colourpop.com/collections/lippie-pencil,https://colourpop.com,Lippie Pencil A long-wearing and high-intensit...,,pencil,lip_liner,"[cruelty free, Vegan]",2018-07-08T23:45:08.056Z,2018-07-09T00:53:23.301Z,http://makeup-api.herokuapp.com/api/v1/product...,//s3.amazonaws.com/donovanbailey/products/api_...,"[{'hex_value': '#B28378', 'colour_name': 'BFF ..."
1,1047,colourpop,Blotted Lip,5.5,$,CAD,https://cdn.shopify.com/s/files/1/1338/0845/pr...,https://colourpop.com/collections/lippie-stix?...,https://colourpop.com,Blotted Lip Sheer matte lipstick that creates ...,,lipstick,lipstick,"[cruelty free, Vegan]",2018-07-08T22:01:20.178Z,2018-07-09T00:53:23.287Z,http://makeup-api.herokuapp.com/api/v1/product...,//s3.amazonaws.com/donovanbailey/products/api_...,"[{'hex_value': '#b72227', 'colour_name': 'Bee'..."
2,1046,colourpop,Lippie Stix,5.5,$,CAD,https://cdn.shopify.com/s/files/1/1338/0845/co...,https://colourpop.com/collections/lippie-stix,https://colourpop.com,"Lippie Stix Formula contains Vitamin E, Mango,...",,lipstick,lipstick,"[cruelty free, Vegan]",2018-07-08T21:47:49.858Z,2018-07-09T00:53:23.274Z,http://makeup-api.herokuapp.com/api/v1/product...,//s3.amazonaws.com/donovanbailey/products/api_...,"[{'hex_value': '#F2DEC3', 'colour_name': 'Fair..."
3,1045,colourpop,No Filter Foundation,12.0,$,CAD,https://cdn.shopify.com/s/files/1/1338/0845/pr...,https://colourpop.com/products/no-filter-matte...,https://colourpop.com/products/no-filter-matte...,"Developed for the Selfie Age, our buildable fu...",,liquid,foundation,"[cruelty free, Vegan]",2018-07-08T18:22:25.273Z,2018-07-09T00:53:23.313Z,http://makeup-api.herokuapp.com/api/v1/product...,//s3.amazonaws.com/donovanbailey/products/api_...,"[{'hex_value': '#F2DEC3', 'colour_name': 'Fair..."
4,1044,boosh,Lipstick,26.0,$,CAD,https://cdn.shopify.com/s/files/1/1016/3243/pr...,https://www.boosh.ca/collections/all,https://www.boosh.ca/,All of our products are free from lead and hea...,,lipstick,lipstick,"[Chemical Free, Organic]",2018-07-08T17:32:28.088Z,2018-09-02T22:52:06.669Z,http://makeup-api.herokuapp.com/api/v1/product...,//s3.amazonaws.com/donovanbailey/products/api_...,"[{'hex_value': '#CB4975', 'colour_name': 'Babs..."


In [10]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 931 entries, 0 to 930
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  931 non-null    int64  
 1   brand               919 non-null    object 
 2   name                931 non-null    object 
 3   price               917 non-null    object 
 4   price_sign          368 non-null    object 
 5   currency            368 non-null    object 
 6   image_link          931 non-null    object 
 7   product_link        931 non-null    object 
 8   website_link        931 non-null    object 
 9   description         930 non-null    object 
 10  rating              340 non-null    float64
 11  category            517 non-null    object 
 12  product_type        931 non-null    object 
 13  tag_list            931 non-null    object 
 14  created_at          931 non-null    object 
 15  updated_at          931 non-null    object 
 16  product_

In [13]:
products.drop(['api_featured_image', 'created_at', 'image_link', 'product_api_url', 'product_colors', 'product_link',
              'updated_at', 'website_link', 'price_sign','currency'], axis = 1, inplace = True, errors = 'ignore')

In [14]:
products.head()

Unnamed: 0,id,brand,name,price,description,rating,category,product_type,tag_list
0,1048,colourpop,Lippie Pencil,5.0,Lippie Pencil A long-wearing and high-intensit...,,pencil,lip_liner,"[cruelty free, Vegan]"
1,1047,colourpop,Blotted Lip,5.5,Blotted Lip Sheer matte lipstick that creates ...,,lipstick,lipstick,"[cruelty free, Vegan]"
2,1046,colourpop,Lippie Stix,5.5,"Lippie Stix Formula contains Vitamin E, Mango,...",,lipstick,lipstick,"[cruelty free, Vegan]"
3,1045,colourpop,No Filter Foundation,12.0,"Developed for the Selfie Age, our buildable fu...",,liquid,foundation,"[cruelty free, Vegan]"
4,1044,boosh,Lipstick,26.0,All of our products are free from lead and hea...,,lipstick,lipstick,"[Chemical Free, Organic]"


In [15]:
products.to_csv('data/makeup-products.csv')

In [24]:
df = pd.read_csv('data/makeup-products.csv', header=0, usecols=['name','description','tag_list'])
df.head()

Unnamed: 0,name,description,tag_list
0,Lippie Pencil,Lippie Pencil A long-wearing and high-intensit...,"['cruelty free', 'Vegan']"
1,Blotted Lip,Blotted Lip Sheer matte lipstick that creates ...,"['cruelty free', 'Vegan']"
2,Lippie Stix,"Lippie Stix Formula contains Vitamin E, Mango,...","['cruelty free', 'Vegan']"
3,No Filter Foundation,"Developed for the Selfie Age, our buildable fu...","['cruelty free', 'Vegan']"
4,Lipstick,All of our products are free from lead and hea...,"['Chemical Free', 'Organic']"


In [20]:
df.shape

(931, 3)

In [25]:
import texthero as hero
from texthero import preprocessing as ppe

In [31]:
custom_pipeline = [ ppe.fillna, ppe.lowercase, ppe.remove_punctuation, ppe.remove_stopwords]

df['name'] = hero.clean(df['name'], custom_pipeline)

df['description'] = hero.clean(df['description'], custom_pipeline)
df['tag_list'] = hero.clean(df['tag_list'])
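
The cleaning steps in the custom pipeline above can be approximated in plain Python. This is only a sketch of what the four steps do to a single string, not texthero's implementation: `clean_text` is a hypothetical helper, and `STOPWORDS` is a deliberately tiny, made-up list, whereas texthero uses a much larger English stopword list.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "and", "no", "of", "the", "that", "our", "for"}

def clean_text(text):
    """Approximate fillna -> lowercase -> remove_punctuation -> remove_stopwords."""
    # fillna: treat missing values (None, NaN) as an empty string
    if not isinstance(text, str):
        text = ""
    # lowercase
    text = text.lower()
    # remove_punctuation: replace every punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # remove_stopwords: drop tokens that appear in the stopword list
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    # Joining on single spaces also collapses repeated whitespace
    return " ".join(tokens)

print(clean_text("No Filter Foundation"))
print(clean_text("Lippie Pencil A long-wearing and high-intensity lip pencil"))
```

With this toy stopword list, "No Filter Foundation" becomes "filter foundation", which is why the word "no" is missing from the cleaned product name.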


In [33]:
lem = WordNetLemmatizer()

def word_lem(text):
    lem_text = [lem.lemmatize(word) for word in text.split()]
    return " ".join(lem_text)

In [34]:
df['description']  = df['description'].apply(word_lem)
df.head()

Unnamed: 0,name,description,tag_list
0,lippie pencil,lippie pencil long wearing high intensity lip ...,cruelty free vegan
1,blotted lip,blotted lip sheer matte lipstick creates perfe...,cruelty free vegan
2,lippie stix,lippie stix formula contains vitamin e mango a...,cruelty free vegan
3,filter foundation,developed selfie age buildable full coverage n...,cruelty free vegan
4,lipstick,product free lead heavy metal parabens phthala...,chemical free organic


<h3>Model creation</h3>

In [35]:
corpus = hero.tokenize(df['name'])
corpus[0:5]

0        [lippie, pencil]
1          [blotted, lip]
2          [lippie, stix]
3    [filter, foundation]
4              [lipstick]
Name: name, dtype: object

In [37]:
name_model = Word2Vec(corpus, size=100, window=3, min_count=1)

In [38]:
name_model.wv.most_similar('lipstick')

[('edition', 0.4124469757080078),
 ('matte', 0.33455488085746765),
 ('eye', 0.32677584886550903),
 ('kit', 0.3261893391609192),
 ('cosmetics', 0.30601441860198975),
 ('uplighting', 0.30400019884109497),
 ('dr', 0.3032594919204712),
 ('striking', 0.28636133670806885),
 ('wild', 0.28074413537979126),
 ('gel', 0.2798846364021301)]

In [39]:
name_model.wv.most_similar('liquid')

[('broad', 0.35196954011917114),
 ('bronzer', 0.3466640114784241),
 ('brow', 0.34234821796417236),
 ('gel', 0.32695043087005615),
 ('lash', 0.31893813610076904),
 ('concealer', 0.31460824608802795),
 ('l', 0.31291234493255615),
 ('fail', 0.30139318108558655),
 ('paris', 0.29558536410331726),
 ('cream', 0.29341015219688416)]

<h3>Word2vec model for description</h3>

In [40]:
# gensim expects a list of token lists, so split each cleaned description into words:
sentences = [row.split() for row in df['description']]

In [41]:
sentences[0:1]

[['lippie',
  'pencil',
  'long',
  'wearing',
  'high',
  'intensity',
  'lip',
  'pencil',
  'glide',
  'easily',
  'prevents',
  'feathering',
  'many',
  'lippie',
  'stix',
  'coordinating',
  'lippie',
  'pencil',
  'designed',
  'compliment',
  'perfectly',
  'feel',
  'free',
  'mix',
  'match']]

<b>Creating the model and setting its parameters:</b><br>
Here are a few of the important hyperparameters of this model:<br>

size: the number of dimensions of the embeddings. Typical values range from 50 to 300; we use 100 because we don't have much text. (In gensim 4.0 and later this parameter is named vector_size.)<br>
window: the maximum distance between a target word and the surrounding words that count as its context.<br>
min_count: the minimum number of occurrences a word needs to be included in training. Words that occur fewer times are ignored. We use min_count=1, so every word is kept.<br>
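
To make window and min_count more concrete, here is a minimal sketch in plain Python. The helpers `context_pairs` and `filter_rare` and the `toy` corpus are hypothetical names introduced for illustration; gensim's real training differs in details (for example, it can shrink the window randomly for each target word).

```python
from collections import Counter

def context_pairs(tokens, window):
    """Return (target, context) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the whole corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]

# Made-up toy corpus in the same list-of-token-lists shape as `sentences`
toy = [["long", "wearing", "lip", "pencil"], ["matte", "lip", "pencil"]]

print(context_pairs(toy[1], window=1))
print(filter_rare(toy, min_count=2))
```

With window=1 each word is paired only with its direct neighbours, and with min_count=2 only "lip" and "pencil" survive in the toy corpus because every other word occurs once.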

In [42]:
model = Word2Vec(sentences, min_count=1, size=100, window=3)

In [43]:
# Show the ten words most similar to 'lipstick' according to the model:
model.wv.most_similar('lipstick')

[('texture', 0.999869704246521),
 ('pencil', 0.9998623132705688),
 ('stay', 0.9998431205749512),
 ('hour', 0.9998080134391785),
 ('dry', 0.9997955560684204),
 ('like', 0.9997859001159668),
 ('result', 0.999779224395752),
 ('creates', 0.999763548374176),
 ('moisturizing', 0.9997388124465942),
 ('beautiful', 0.9997288584709167)]

In [44]:
model.wv.most_similar('powder')

[('cheekbone', 0.9996285438537598),
 ('perfectly', 0.9995981454849243),
 ('work', 0.9995733499526978),
 ('effect', 0.9995520710945129),
 ('add', 0.9995331764221191),
 ('naturally', 0.99953293800354),
 ('moisture', 0.9995229840278625),
 ('lid', 0.9995077252388),
 ('shimmer', 0.9995046257972717),
 ('contour', 0.9994987845420837)]

In [45]:

model.wv.similarity('bronzer', 'blush')

0.9981829
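
The score returned by `similarity` is the cosine similarity of the two word vectors. The sketch below computes it by hand; `cosine_similarity` is a hypothetical helper, and the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" (the real model uses 100 dimensions)
bronzer = [0.9, 0.1, 0.3]
blush = [0.8, 0.2, 0.4]
mascara = [-0.2, 0.9, 0.1]

print(cosine_similarity(bronzer, blush))    # vectors point in a similar direction
print(cosine_similarity(bronzer, mascara))  # vectors point in different directions
```

The value ranges from -1 to 1, and vectors that point in nearly the same direction score close to 1 regardless of their length.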