# Similarity in product names

Given a list of product names, find a way to measure how similar they are to each other.

This notebook applies a simple Bag-of-Words approach: it counts the occurrences of the relevant unigrams and bigrams in each title, and then computes the cosine similarity between each pair of the resulting count feature vectors (CountVectorizer produces term counts by default, not binary indicators).
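
As a rough illustration of the idea, the sketch below builds unigram and bigram counts for two made-up titles and computes their cosine similarity using only the standard library. The helper names ngram_counts and cosine are hypothetical, and the whitespace tokenization is a simplifying assumption that differs from the tokenizer used later in the notebook.

```python
import math
from collections import Counter


def ngram_counts(title: str) -> Counter:
    """Count the unigrams and bigrams of a lower-cased, whitespace-split title."""
    tokens = title.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Define the similarity as 0 when one of the vectors is empty
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative titles, not taken from the dataset
title_a = "bicicleta aro 29 shimano"
title_b = "bicicleta aro 26 shimano"
score = cosine(ngram_counts(title_a), ngram_counts(title_b))
print(round(score, 3))
```

Each title yields 7 features (4 unigrams and 3 bigrams) and the two titles share 4 of them, so the score is 4/7, about 0.571.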

In [1]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.8.1


In [11]:
import pandas as pd
import os
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from typing import List
pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [81]:
pd.set_option('display.max_rows', None)

In [12]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

In [13]:
input_filename = os.getcwd().split('/examples/')[0]+'/data/initial_data/items_titles.csv'

df_titles = pd.read_csv(input_filename)

In [14]:
df_titles.tail()

Unnamed: 0,ITE_ITEM_TITLE
29995,Tênis Vans Old Skool I Love My Vans - Usado - Feminino
29996,Tênis Feminino Preto Moleca 5296155
29997,Tenis Botinha Com Pelo Via Marte Original Lançamento
29998,Tênis Slip On Feminino Masculino Original Sapato Xadrez Mule
29999,Bicicleta Nathor Rosa Infantil Sem Pedal Balance Aro 12


### 1. Remove stop-words

In [15]:
list_stopwords = stopwords.words('portuguese')


In [16]:
df_titles['ITE_ITEM_TITLE'] = df_titles['ITE_ITEM_TITLE'].str.lower()

In [17]:
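def remove_stopwords(xs: str, list_stopwords: List[str]) -> str:
    """Strip punctuation from a title and drop its stop-words."""
    # Remove every character that is neither a word character nor whitespace
    xs = re.sub(r'[^\w\s]', '', xs)
    # A set gives constant-time membership checks
    stopword_set = set(list_stopwords)
    # split() without arguments discards the empty strings left by repeated spaces
    return ' '.join(a_word for a_word in xs.split() if a_word not in stopword_set)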

In [18]:
df_titles['ITE_ITEM_TITLE_PREPROC'] = df_titles['ITE_ITEM_TITLE']\
                                            .apply(lambda xs: remove_stopwords(xs, list_stopwords))

In [19]:
df_titles.head()

Unnamed: 0,ITE_ITEM_TITLE,ITE_ITEM_TITLE_PREPROC
0,tênis ascension posh masculino - preto e vermelho,tênis ascension posh masculino preto vermelho
1,tenis para caminhada super levinho spider corrida,tenis caminhada super levinho spider corrida
2,tênis feminino le parc hocks black/ice original envio já,tênis feminino le parc hocks blackice original envio
3,tênis olympikus esportivo academia nova tendência triunfo,tênis olympikus esportivo academia nova tendência triunfo
4,inteligente led bicicleta tauda luz usb bicicleta carregáve,inteligente led bicicleta tauda luz usb bicicleta carregáve


### 2. Tokenize and get a BoW

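The length of the output feature vectors is controlled by the min_df parameter. A term is kept only if it appears in at least that fraction of the titles (here 1%, that is, 300 of the 30,000 titles), so a smaller value admits less frequent words and produces a larger vocabulary.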

In [22]:
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=0.01)
X_features = vectorizer.fit_transform(df_titles['ITE_ITEM_TITLE_PREPROC'])

In [23]:
# maximum vocabulary: 85828
len(vectorizer.vocabulary_)

110
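
To see how min_df prunes the vocabulary, here is a minimal sketch on four made-up titles (they are illustrative and not taken from the dataset). It compares an unpruned vectorizer with one that keeps only the terms found in at least half of the titles.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles (illustrative only, not taken from the dataset)
toy_titles = [
    "tenis nike preto",
    "tenis adidas preto",
    "bicicleta aro 29",
    "bicicleta aro 26",
]

# No pruning: every unigram and bigram is kept
full = CountVectorizer(ngram_range=(1, 2)).fit(toy_titles)

# min_df=0.5: keep only the terms present in at least half of the titles (2 of 4)
pruned = CountVectorizer(ngram_range=(1, 2), min_df=0.5).fit(toy_titles)

print(len(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

The unpruned vocabulary holds 8 unigrams and 7 bigrams, while the pruned one keeps only the five terms shared by two titles: aro, bicicleta, bicicleta aro, preto and tenis.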

In [24]:
vec_i = X_features[0,:].toarray()
vec_j = X_features[0+1:,:].toarray()

In [25]:
vec_j.shape

(29999, 110)

In [43]:
similarity_index.reshape(-1,1).shape

(29998, 1)

In [44]:
list_j.shape

(29997, 1)

In [45]:
list_i.shape

(29997, 1)

In [46]:
vec_j.shape

(29998, 110)

In [47]:
vec_i.shape

(1, 110)

### 3. Compute cosine similarity

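For each title, compute the cosine similarity against every later title, so that each pair is evaluated only once. The computation is slow and heavy (about 450 million pairs for 30,000 titles), so the intermediate results are saved to disk as checkpoints every 1,000 titles.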

In [None]:
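# A placeholder row of ones starts each chunk so that np.concatenate has something to extend.
# It is written to every CSV as well and is skipped later with all_similarities[1:,:]
all_similarities = np.ones((1,3))
k = 0
n_titles = X_features.shape[0]
for i in tqdm(range(n_titles)):
    if i == n_titles - 1:
        break  # the last title has no later title to pair with
    vec_i = X_features[i,:].toarray()
    vec_j = X_features[i+1:,:].toarray()
    similarity_index = cosine_similarity(vec_i, vec_j).reshape(-1,1)

    # Columns: index of title i, index of title j, similarity score
    list_j = np.arange(i+1, n_titles).reshape(-1,1)
    list_i = np.full_like(list_j, i)
    similarity_ij = np.concatenate([list_i, list_j, similarity_index], axis=1)
    all_similarities = np.concatenate([all_similarities, similarity_ij], axis=0)

    # Checkpoint: write the current chunk to disk and start a new one
    if i%1000==0:
        np.savetxt(f"all_similarities_naive_{k}.csv", all_similarities, delimiter=",")
        all_similarities = np.ones((1,3))
        k = k + 1
# The pairs computed after the last checkpoint remain in all_similarities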

  0%|          | 0/30000 [00:00<?, ?it/s]
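
As an alternative to the row-by-row loop above, the similarities can be computed one block of rows at a time, keeping only the pairs above a threshold. This is a sketch of a different strategy, not the method used in this notebook: the function name similar_pairs, the threshold and the block size are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def similar_pairs(X, threshold=0.9, block_size=1000):
    """Yield (i, j, score) for every pair i < j with cosine similarity >= threshold."""
    n = X.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Dense block of shape (stop - start, n): rows start..stop-1 against all rows
        block = cosine_similarity(X[start:stop], X)
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # keep each unordered pair once and skip self-pairs
                yield int(i), int(c), float(block[r, c])


# Tiny illustrative matrix; the same call also accepts a sparse matrix such as X_features
X_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
pairs = list(similar_pairs(X_toy, threshold=0.7, block_size=2))
for i, j, s in pairs:
    print(i, j, round(s, 4))
```

Filtering by threshold inside each block avoids storing all pairs, at the price of losing the low-similarity ones. Whether that trade-off is acceptable depends on how the scores are used afterwards.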

In [51]:
import pickle

with open('vectorizer.pkl', 'wb') as handle:
    pickle.dump(vectorizer, handle)

In [64]:
vectorizer.get_feature_names_out()

array(['20', '26', '29', 'academia', 'adidas', 'all', 'alto', 'aro',
       'aro 26', 'aro 29', 'asics', 'azul', 'barato', 'bicicleta',
       'bicicleta aro', 'bike', 'black', 'botinha', 'branco', 'brinde',
       'cadarço', 'calce', 'caminhada', 'cano', 'cano alto', 'casual',
       'casual feminino', 'chunky', 'cinza', 'conforto', 'confortável',
       'corrida', 'couro', 'envio', 'esportivo', 'feminina', 'feminino',
       'feminino casual', 'fila', 'flatform', 'freio', 'frete', 'fácil',
       'infantil', 'infantil feminino', 'infantil masculino', 'kit',
       'kit pares', 'kolosh', 'lançamento', 'led', 'leve', 'macio',
       'marinho', 'marte', 'masculino', 'meia', 'menina', 'menino',
       'mizuno', 'moleca', 'molekinha', 'mtb', 'new', 'nike', 'oferta',
       'olympikus', 'on', 'original', 'pares', 'plataforma', 'preto',
       'promoção', 'rio', 'rosa', 'sapatenis', 'sapatilha', 'sapato',
       'sapatênis', 'sapatênis masculino', 'shimano', 'shoes', 'skate',
       'slip',

In [82]:
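# all_similarities holds only the pairs accumulated since the last checkpoint
# (title ids from 29001 upward); the first row is the placeholder and is skipped
df_similarity = pd.DataFrame(all_similarities[1:,:],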
                             columns=['title_id_1', 'title_id_2', 'similarity_score']
                            )
df_similarity['title_id_1'] = df_similarity['title_id_1'].astype('int')
df_similarity['title_id_2'] = df_similarity['title_id_2'].astype('int')

df_similarity = df_similarity.merge(df_titles.reset_index()[['index', 'ITE_ITEM_TITLE']],
                                    how='left',
                                    left_on='title_id_1', right_on='index')\
                             .rename(columns={'ITE_ITEM_TITLE':'ITE_ITEM_TITLE_1'})\
                             .merge(df_titles.reset_index()[['index', 'ITE_ITEM_TITLE']],
                                    how='left',
                                    left_on='title_id_2', right_on='index')\
                             .rename(columns={'ITE_ITEM_TITLE':'ITE_ITEM_TITLE_2'})\
                            [['title_id_1', 'title_id_2', 'ITE_ITEM_TITLE_1',
                              'ITE_ITEM_TITLE_2', 'similarity_score']]\
                             .sort_values('similarity_score', ascending=False)

In [87]:
(df_similarity['similarity_score']>=1.0).sum()

1347

In [89]:
pd.set_option("display.precision", 5)

In [98]:
df_similarity[df_similarity['similarity_score']<0.9999][:40]\
                .style.bar(subset=['similarity_score'], color='yellow',
                          vmin=0.9, vmax=1)

Unnamed: 0,title_id_1,title_id_2,ITE_ITEM_TITLE_1,ITE_ITEM_TITLE_2,similarity_score
5007,29006,29034,tênis feminino via marte flatform slip on corrente,tênis feminino slip on via marte oncinha 213108,0.94868
48488,29050,29813,tenis feminino casual caminhada lançamento promoção original,tenis feminino casual caminhada plataform promoção original,0.94281
122160,29131,29937,bicicleta de passeio ultra bikes bike aro 26 18 marchas aro 26 18v freios v-brakes cor amarelo,bicicleta nautec alumínio aro 26,0.93541
121635,29131,29412,bicicleta de passeio ultra bikes bike aro 26 18 marchas aro 26 18v freios v-brakes cor amarelo,bicicleta vintage ultra bikes wave aro 26 com cestinha,0.93541
122043,29131,29820,bicicleta de passeio ultra bikes bike aro 26 18 marchas aro 26 18v freios v-brakes cor amarelo,bicicleta caloi xrt aro 26 full suspension,0.93541
122041,29131,29818,bicicleta de passeio ultra bikes bike aro 26 18 marchas aro 26 18v freios v-brakes cor amarelo,bicicleta com garupa reforçada ultra bikes stronger aro 26,0.93541
229499,29266,29276,bicicleta aro 29 colli bike 21 marchas freio a disco shimano,bicicleta aro 29 alumínio gta 24v freio hid shimano +brindes,0.93541
139486,29152,29266,bicicleta aro 29 alumínio 21v freio a disco câmbios shimano,bicicleta aro 29 colli bike 21 marchas freio a disco shimano,0.93541
169322,29188,29276,bicicleta aro 29 alum 24v câmbios shimano freio disco oferta,bicicleta aro 29 alumínio gta 24v freio hid shimano +brindes,0.93541
2649,29003,29658,bicicleta aro 29 21v câmbios shimano azul preto quadro 17,bicicleta aro 29 quadro 15 câmbio shimano 21v preto amarelo,0.93541


**The results are interesting, but the approach still misses a lot of the context. Only 110 terms were kept in the vocabulary to keep the computation cost low; increasing the vocabulary size should capture more context, at a higher computation cost.**
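
The effect of a small vocabulary can be sketched with a fixed, hypothetical four-term vocabulary that stands in for the pruned one. Two titles from the table above describe different bicycles, yet both collapse onto the same feature vector.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tiny vocabulary, standing in for the pruned 110-term one
small_vocab = ["bicicleta", "aro", "26", "aro 26"]
titles = [
    "bicicleta nautec alumínio aro 26",
    "bicicleta caloi xrt aro 26 full suspension",
]

# With an explicit vocabulary, fit_transform uses it as given instead of learning one
vec = CountVectorizer(ngram_range=(1, 2), vocabulary=small_vocab)
X_small = vec.fit_transform(titles)

score = cosine_similarity(X_small[0], X_small[1])[0, 0]
print(X_small.toarray())
print(score)
```

Every term that distinguishes the two products (brand, material, suspension) falls outside the vocabulary, so both rows contain the same counts and the similarity is 1 up to floating-point rounding.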

In [51]:
df_bow = pd.DataFrame(X_features.toarray(), columns=vectorizer.get_feature_names_out())

In [59]:
df_bow.sum(axis=1).value_counts().reset_index().sort_values('index')

Unnamed: 0,index,0
19,0,26
16,1,180
14,2,369
12,3,686
10,4,1118
8,5,1658
7,6,1940
5,7,2665
4,8,2785
1,9,3240


In [52]:
df_bow.head()

Unnamed: 0,0007,001,002,003,01,01 preto,01ac,02,02 pares,02 tênis,...,öus imigrante,öus naccarato,öus phibo,öus skate,últimas,últimas peças,últimas unidades,último,única,único
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
