<img src="https://s2.glbimg.com/Bu6upvmSg6SRv0za635uXphThKo=/620x430/e.glbimg.com/og/ed/f/original/2020/03/28/mercado-livre.jpg" width=30%/>

# Data Science Challenge - Data & Analytics Team
## 2. Product similarity

### Description

A constant challenge at MELI is grouping similar products using some of their attributes, such as the title, the description, or the image.

For this challenge we have a dataset, “items_titles.csv”, containing the titles of 30 thousand products from 3 different Mercado Libre Brasil categories.

The goal of the challenge is to produce a Jupyter notebook that determines how similar two titles from the dataset “items_titles_test.csv” are, generating as output a list of the form...

Sorting that list by similarity score should let us find the most similar pairs of products in our test dataset.

In [1]:
pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the CSV files into DataFrames
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/MELI/items_titles.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/MELI/items_titles_test.csv')

In [5]:
print(train.shape)
print(test.shape)

(30000, 1)
(10000, 1)


In [6]:
train.head()

Unnamed: 0,ITE_ITEM_TITLE
0,Tênis Ascension Posh Masculino - Preto E Verme...
1,Tenis Para Caminhada Super Levinho Spider Corr...
2,Tênis Feminino Le Parc Hocks Black/ice Origina...
3,Tênis Olympikus Esportivo Academia Nova Tendên...
4,Inteligente Led Bicicleta Tauda Luz Usb Bicicl...


In [7]:
test.head()

Unnamed: 0,ITE_ITEM_TITLE
0,Tênis Olympikus Esporte Valente - Masculino Kids
1,Bicicleta Barra Forte Samy C/ 6 Marchas Cubo C...
2,Tênis Usthemp Slip-on Temático - Labrador 2
3,Tênis Casual Feminino Moleca Tecido Tie Dye
4,Tênis Star Baby Sapatinho Conforto + Brinde


In [8]:
# Import SentenceTransformer
from sentence_transformers import SentenceTransformer

# Instantiate the pre-trained all-MiniLM-L6-v2 model from Hugging Face (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [9]:
# Import cosine similarity from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
# Encode the train titles with the model (all-MiniLM-L6-v2)
train_vecs = model.encode(train['ITE_ITEM_TITLE'])

In [11]:
train_vecs.shape

(30000, 384)

We have an array with 30,000 rows and 384 columns: one 384-dimensional embedding per title, where 384 is the hidden size of the BERT-style all-MiniLM-L6-v2 model.

In [12]:
train_vecs

array([[-1.0410467e-01,  8.3035901e-02, -1.2613893e-04, ...,
         9.0357788e-02, -2.2680234e-02, -5.2821346e-02],
       [-4.4692412e-02,  4.1617818e-02, -1.4046744e-01, ...,
         5.1465411e-02, -3.9132010e-02,  8.1661744e-03],
       [-8.9592695e-02,  6.8836719e-02, -3.5719515e-03, ...,
         7.8193679e-02, -3.7878539e-02,  4.1248425e-04],
       ...,
       [-6.9650076e-02,  1.8741334e-02,  7.2399718e-03, ...,
         7.2441429e-02, -8.5498290e-03, -3.1157778e-04],
       [-9.9700317e-02,  1.5545686e-02,  4.1094311e-02, ...,
         2.7461603e-02,  6.2089324e-02, -9.3999721e-02],
       [-7.9272017e-02,  1.8710706e-02, -7.3789090e-02, ...,
         1.8070578e-03, -3.4365531e-03, -7.8087091e-02]], dtype=float32)

In [13]:
# Encode the test titles with the model (all-MiniLM-L6-v2)
test_vecs = model.encode(test['ITE_ITEM_TITLE'])

In [14]:
test_vecs.shape

(10000, 384)

We have an array with 10,000 rows and 384 columns: one 384-dimensional embedding per test title.
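With both sets of titles embedded, a semantic score between two titles can be read off as the cosine similarity of their vectors. The following is a minimal sketch on toy vectors, assuming only NumPy; the helper name `cosine_similarity_matrix` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the matrix S with S[i, j] = cos(a[i], b[j])."""
    # Scale every row to unit length, so the dot product equals the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy stand-ins for train_vecs and test_vecs (2-dimensional instead of 384)
toy_train = np.array([[1.0, 0.0], [1.0, 1.0]], dtype=np.float32)
toy_test = np.array([[2.0, 0.0], [0.0, 3.0]], dtype=np.float32)

scores = cosine_similarity_matrix(toy_train, toy_test)
print(scores)
```

On the real arrays, `cosine_similarity(train_vecs, test_vecs)` from scikit-learn computes the same quantity and would return a 30000 x 10000 matrix. In float32 that is roughly 1.2 GB, so processing the rows in batches may be preferable.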

In [15]:
test_vecs

array([[-0.01816418,  0.12168916, -0.00326645, ...,  0.08843251,
         0.08487038, -0.03037256],
       [-0.04601841,  0.07405472, -0.1056786 , ..., -0.04466264,
        -0.00369643, -0.01232435],
       [-0.08496193, -0.06072852,  0.05279116, ..., -0.00343313,
        -0.00308601, -0.06607358],
       ...,
       [-0.03341956,  0.0230679 , -0.02346489, ..., -0.06607258,
         0.1327506 , -0.01943892],
       [-0.00553372,  0.07026482, -0.01061806, ...,  0.08500759,
         0.01786495, -0.05440037],
       [-0.02435026,  0.03572162,  0.0352535 , ..., -0.06202542,
        -0.06639925, -0.04696418]], dtype=float32)

In [20]:
pip install levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting levenshtein
  Downloading Levenshtein-0.20.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.7-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, levenshtein
Successfully installed levenshtein-0.20.9 rapidfuzz-2.13.7


In [35]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from Levenshtein import distance as levenshtein_distance

# Sample every 10th row of each dataset to keep the pairwise loop tractable
step = 10
n_train, n_test = len(train) // step, len(test) // step

# Titles and embeddings of the sampled rows. The embeddings are truncated to
# their first 300 dimensions and kept, but they do not enter the score below.
original_train = train['ITE_ITEM_TITLE'].to_numpy()[:n_train * step:step]
original_test = test['ITE_ITEM_TITLE'].to_numpy()[:n_test * step:step]
word_train = train_vecs[:n_train * step:step, :300]
word_test = test_vecs[:n_test * step:step, :300]

# Normalized Levenshtein similarity for every sampled (train, test) pair
word_similarity = np.empty((n_train, n_test), dtype=np.float32)
for i, title_a in enumerate(tqdm(original_train)):
    for j, title_b in enumerate(original_test):
        distance = levenshtein_distance(title_a, title_b)
        word_similarity[i, j] = 1 - distance / max(len(title_a), len(title_b))

# Long-format table: each train title paired with every test title
df = pd.DataFrame({
    'ITE_ITEM_TITLE': original_train.repeat(n_test),
    'ITE_ITEM_TITLE2': np.tile(original_test, n_train),
    'Score Similitud (0,1)': word_similarity.flatten()
})


100%|██████████| 3000/3000 [02:29<00:00, 20.06it/s]
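The DataFrame construction relies on `repeat`, `tile`, and `flatten` all following the same row-major order, so that each score lands next to the pair of titles it was computed from. A small sketch with made-up titles and scores (all values here are hypothetical) shows the alignment:

```python
import numpy as np

# Hypothetical stand-ins: 2 train titles, 3 test titles
train_titles = np.array(["a", "b"], dtype=object)
test_titles = np.array(["x", "y", "z"], dtype=object)

# scores[i, j] belongs to the pair (train_titles[i], test_titles[j])
scores = np.arange(6, dtype=np.float32).reshape(2, 3)

left = train_titles.repeat(len(test_titles))     # a a a b b b
right = np.tile(test_titles, len(train_titles))  # x y z x y z
flat = scores.flatten()                          # row-major: row 0 first, then row 1

pairs = list(zip(left, right, flat))
for pair in pairs:
    print(pair)
```

Swapping `repeat` and `tile` between the two columns would silently attach scores to the wrong pairs, which is why the order matters here.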


In [37]:
# Display the DataFrame sorted by similarity in descending order
df = df.sort_values(by='Score Similitud (0,1)', ascending=False)
display(df)

Unnamed: 0,ITE_ITEM_TITLE,ITE_ITEM_TITLE2,"Score Similitud (0,1)"
676456,Tênis Feminino Esportivo Wave Prophecy 9 Pro 9...,Tênis Feminino Esportivo Wave Prophecy 9 Pro 9...,1.0
2923745,Tênis Air Jordan Mid,Tênis Air Jordan Mid,1.0
2849678,Sapatenis Polo Bra Sapatilha Polo Leve 2 Pague...,Sapatenis Polo Bra Sapatilha Polo Leve 2 Pague...,1.0
2279985,Bicicleta Tsw Evo Quest Gx 12v 29x19 Az Metalico,Bicicleta Tsw Evo Quest Gx 12v 29x19 Az Metalico,1.0
626878,Tenis Feminino Tie Dye Estilo Babuche Slip Con...,Tenis Feminino Tie Dye Estilo Babuche Slip Con...,1.0
...,...,...,...
2830374,Gtsm1,J1 Seafoam,0.0
1567211,Tv 24 Polegadas,Bike Bmw Aro 26,0.0
2307483,Tênis Promoção Branco Número 41,Burberry - Tênis Xadrez Vintage,0.0
617397,Tv 42p LG 42ld460,Piso,0.0


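DataFrame created and sorted by similarity. Note that the score computed in the loop is the normalized Levenshtein similarity between the raw titles, not the cosine similarity of the embeddings.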
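The score computed in the loop is `1 - distance / max(len(a), len(b))`, where `distance` is the Levenshtein edit distance between the two titles. The sketch below reproduces that formula with a plain dynamic-programming edit distance, so it runs without the `Levenshtein` package; the function names are illustrative and this is not the package's implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(a, b) / longest


print(normalized_similarity("Tênis Feminino Lynd Promoção 578",
                            "Tênis Feminino Lynd Promoção 599"))
```

These two titles differ by two substitutions over 32 characters, which is consistent with the 0.9375 score the notebook reports for that pair. Because the measure is purely character-based, titles that describe the same product in different words score low, which is where an embedding-based score would behave differently.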

In [38]:
# Save the result as a CSV file
df.to_csv('meli2_results.csv')

## Thank you.

In [43]:
df[(df['Score Similitud (0,1)'] > 0.2) & (df['Score Similitud (0,1)'] < 1.0)]

Unnamed: 0,ITE_ITEM_TITLE,ITE_ITEM_TITLE2,"Score Similitud (0,1)"
1217912,Tênis Feminino Lynd Promoção 578,Tênis Feminino Lynd Promoção 599,0.937500
383357,Tenis Feminino Casual Fletform,Tênis Feminino Casual Flatform,0.933333
1588520,Sapatenis Masculino Couro Legitimo Deneb Gshoe...,Sapatenis Masculino Couro Legitimo Deneb Gshoe...,0.932203
2059865,Tênis adidas Original,Tênis adidas Original!,0.913043
2211162,Tênis Cano Alto Infantil Plis Calçs A410 22 A...,Tênis Cano Alto Infantil Plis Calçados A442 2...,0.905660
...,...,...,...
2139529,Tênis Kolosh Feminino Conforto Macio Lançament...,Mountain Bike Spaceline Vega 2021 Aro 29 17 2...,0.200000
1334882,Tênis Feminino De Meia Supple,Tenis adidas Extaball W B35351,0.200000
1345753,Tênis Feminino Kolosh Casual Anabela C0531 - P...,Bicicleta Canondale Bad Boy,0.200000
181304,Sapatênis Pipper Slip On Casual Couro Masculin...,Tênis Nike Court Vision Low Feminino Pronta En...,0.200000
