This notebook builds the data structure needed to evaluate the recommendations. The embeddings are generated from the text in 'reviews' using BERT.

The approach takes the 5 best reviews of each business_id and the 5 best reviews of each user, then compares the similarity between the averages of their embeddings. We therefore expect the places that best match a user's taste to be the most similar to that user's own reviews.
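
As a minimal sketch of the idea above, the snippet below averages a few review embeddings per entity and compares the averages. It assumes cosine similarity as the similarity measure (the notebook does not say which one the evaluation uses), and the 3-dimensional vectors and helper names are hypothetical stand-ins for the real BERT embeddings.

```python
import numpy as np


def mean_embedding(review_embeddings):
    """Average a list of review embeddings into a single vector."""
    return np.mean(np.stack(review_embeddings), axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (assumed metric)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for BERT review embeddings (hypothetical data).
user_reviews = [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])]
business_a = [np.array([0.9, 0.1, 0.0]), np.array([1.0, 0.0, 0.1])]
business_b = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.9, 0.3])]

user_vec = mean_embedding(user_reviews)
sim_a = cosine_similarity(user_vec, mean_embedding(business_a))
sim_b = cosine_similarity(user_vec, mean_embedding(business_b))

# business_a points in nearly the same direction as the user, so it ranks first.
print(sim_a > sim_b)
```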

# Loading the evaluation data

In [1]:
import pandas as pd
import numpy as np

# loading dataset
eval_set = pd.read_csv('../data/evaluation/eval_users.csv')

In [2]:
eval_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      1000 non-null   object
 1   user_perfil  1000 non-null   object
 2   gt_reclist   1000 non-null   object
 3   reclist      1000 non-null   object
dtypes: object(4)
memory usage: 31.4+ KB


# Loading the embedding data

In [3]:
embs_business1 = pd.read_parquet('../data/embeddingsBusiness1.parquet')
embs_business2 = pd.read_parquet('../data/embeddingsBusiness2.parquet')
embs_users = pd.read_parquet('../data/embeddingsUsers.parquet')

In [4]:
embs_users.shape  # all the user_id values that appear in the eval dataset

(1000, 2)

In [5]:
embs_business1.shape  # unique businesses that appear in the eval dataset (first file)

(7862, 2)

In [6]:
embs_business2.shape  # unique businesses that appear in the eval dataset (second file)

(8882, 2)

In [7]:
# concatenating the two embs_business frames
embs_business = pd.concat([embs_business1, embs_business2])

In [8]:
embs_business.shape

(16744, 2)
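
The row count above equals 7862 + 8882, but that alone does not show that the two files contain disjoint businesses. A small sketch of how this could be checked, using toy frames indexed by business_id (the ids and values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the two embedding files, indexed by business_id.
part1 = pd.DataFrame(
    {"embs_best": [[0.1, 0.2], [0.3, 0.4]]},
    index=pd.Index(["b1", "b2"], name="business_id"),
)
part2 = pd.DataFrame(
    {"embs_best": [[0.5, 0.6]]},
    index=pd.Index(["b3"], name="business_id"),
)

combined = pd.concat([part1, part2])
rows_add_up = len(combined) == len(part1) + len(part2)
no_duplicate_ids = combined.index.is_unique

# verify_integrity=True makes concat raise if an id appears in both parts.
try:
    pd.concat([part1, part1], verify_integrity=True)
    duplicates_rejected = False
except ValueError:
    duplicates_rejected = True
```

A duplicated business_id would matter later, since a left join against a non-unique index repeats the matching rows.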

# Loading the dataframe that maps business_id to embeddings + metadata

In [9]:
# creating the dataframe that maps each business_id to its embeddings:
df_final = pd.read_parquet('../data/yelp_dataset/yelp_academic_dataset_business.parquet')

In [10]:
df_final = df_final[['business_id', 'name', 'categories']]

In [11]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  150346 non-null  object
 1   name         150346 non-null  object
 2   categories   150243 non-null  object
dtypes: object(3)
memory usage: 3.4+ MB


### Filtering this dataset to keep only the business_id values we created embeddings for + the business_id values in the users' preference (profile)

In [12]:
# selecting the businesses that have embeddings
business_ = embs_business.reset_index().business_id

In [13]:
# selecting the user profiles
users_ = eval_set.user_perfil

In [14]:
filtro = pd.concat([business_, users_])

In [15]:
df_final = df_final[df_final['business_id'].isin(filtro)]

In [16]:
df_final.shape

(16744, 3)

> The business dataset had already taken the users' preferred businesses into account O.o
> In that case, we only need to replace the embeddings of the profile businesses with the users' embeddings!
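
The note above infers coverage from the row count staying at 16744. A more direct check is sketched below. It assumes each user_perfil value is a single business_id string, which the notebook does not confirm; if the column stores something else (for example a stringified list), `isin` would silently match nothing. The ids are hypothetical.

```python
import pandas as pd

# Toy stand-ins: ids that have embeddings, and one profile id per user.
embedded_ids = pd.Series(["b1", "b2", "b3"], name="business_id")
profiles = pd.Series(["b2", "b3", "b9"], name="user_perfil")

covered = profiles.isin(embedded_ids)
coverage = covered.mean()              # fraction of profile ids with an embedding
missing = profiles[~covered].tolist()  # profile ids without an embedding

print(f"coverage: {coverage:.2f}, missing: {missing}")
```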

# Adding the embeddings

In [17]:
# selecting only embs_best:
embs_business = embs_business['embs_best']

In [18]:
# joining the final dataframe with the embeddings
df_final = df_final.join(embs_business, on='business_id', how='left')

In [19]:
df_final.to_parquet('../data/Dataframes_finais/EmbsBusinessMeta.parquet')

In [20]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16744 entries, 9 to 150326
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  16744 non-null  object
 1   name         16744 non-null  object
 2   categories   16735 non-null  object
 3   embs_best    16744 non-null  object
dtypes: object(4)
memory usage: 654.1+ KB
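
For reference, a small sketch of how `DataFrame.join` with `on=` behaves here: the values of the `business_id` column on the left are looked up in the index of the Series on the right, and rows without a match get a null. The toy ids and values are hypothetical.

```python
import pandas as pd

meta = pd.DataFrame({"business_id": ["b1", "b2", "b3"], "name": ["A", "B", "C"]})
embs = pd.Series(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index(["b1", "b2"], name="business_id"),
    name="embs_best",  # join requires a named Series
)

joined = meta.join(embs, on="business_id", how="left")

# Rows of the left frame that found no embedding in the right index.
unmatched = joined.loc[joined["embs_best"].isna(), "business_id"].tolist()
```

In the output above, `embs_best` has 16744 non-null values, which indicates that every remaining business found its embedding.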


# First attempt: using only the embeddings built from the business reviews (excluding the profile computed from each user's individual reviews)

In [21]:
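# generating the embeddings in the desired format
import os
def export_dataset(df: pd.DataFrame, emb_column: str, output_file: str):
    """
    Export the embeddings (embeddings.txt) and the remaining columns
    (metadados.csv), both tab-separated, to the directory `output_file`.
    """
    # create the output directory if it does not exist yet
    os.makedirs(output_file, exist_ok=True)

    # one embedding per line, in the same order as the metadata rows
    np.savetxt(os.path.join(output_file, 'embeddings.txt'), np.stack(df[emb_column]), delimiter='\t')
    df.drop(emb_column, axis=1).to_csv(os.path.join(output_file, 'metadados.csv'), sep='\t', index=False)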

In [22]:
export_dataset(df_final, 'embs_best', '../data/Embeddings/FirstAttempt')
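
The evaluation script receives the two exported files separately, so row i of the embeddings file has to describe row i of the metadata file. A minimal round-trip sketch, writing toy data (hypothetical ids and vectors) to a temporary directory in the same tab-separated layout and reading it back:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["A", "B"],
    "embs_best": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

with tempfile.TemporaryDirectory() as out_dir:
    emb_path = os.path.join(out_dir, "embeddings.txt")
    meta_path = os.path.join(out_dir, "metadados.csv")

    # Same layout as the export above: tab-separated vectors and metadata.
    np.savetxt(emb_path, np.stack(df["embs_best"]), delimiter="\t")
    df.drop("embs_best", axis=1).to_csv(meta_path, sep="\t", index=False)

    vectors = np.loadtxt(emb_path, delimiter="\t")
    metadata = pd.read_csv(meta_path, sep="\t")

# Both files must have the same number of rows to stay aligned.
rows_aligned = vectors.shape[0] == len(metadata)
```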

## Computing the results

In [26]:
!python ../evaluation/evaluation.py ../data/Embeddings/FirstAttempt/embeddings.txt ../data/Embeddings/FirstAttempt/metadados.csv

              business_id  ...                                         categories
0  bBDDEgkFA1Otx9Lfe7BZUQ  ...  Ice Cream & Frozen Yogurt, Fast Food, Burgers,...
1  MUTTqe8uqyMdBl186RmNeA  ...                  Sushi Bars, Restaurants, Japanese
2  8wGISYjYkE2tSqn3cDMu8A  ...  Automotive, Car Rental, Hotels & Travel, Truck...
3  ROeacJQwBeh05Rqg7F6TCg  ...                                Korean, Restaurants
4  qhDdDeI3K4jy2KyzwFN53w  ...   Shopping, Books, Mags, Music & Video, Bookstores

[5 rows x 3 columns]


Avaliação de Embeddings
Embeddings:  ../data/Embeddings/FirstAttempt/embeddings.txt
Total Users:  1000
NDCG@5:  0.5517928975002069
NDCG@10:  0.5958024267360613


Your CPU supports instructions that this binary was not compiled to use: AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib
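
The source of `evaluation.py` is not shown in this notebook, so the exact metric implementation is unknown. As a rough reference for reading the NDCG@5 and NDCG@10 values, the sketch below implements the standard NDCG@k definition under the assumption of binary relevance (an item counts as relevant if it is in the ground-truth list). The function names and toy lists are hypothetical.

```python
import numpy as np


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance scores."""
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))


def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (assumed; evaluation.py may differ)."""
    relevant = set(relevant)
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k(["a", "b", "c"], ["a", "b"], k=5)  # relevant items ranked first
partial = ndcg_at_k(["x", "a", "b"], ["a", "b"], k=5)  # relevant items pushed down
nothing = ndcg_at_k(["x", "y"], ["a", "b"], k=5)       # no relevant item retrieved
```

A score of 1.0 means the relevant items occupy the top positions; lower ranks are discounted logarithmically.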



# Second attempt: using the embeddings built from the business reviews + the embeddings built from the user reviews (profile)

In [27]:
embs_users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, -1BSu2dt_rOAqllw9ZDXtA to zx2NkJtfSvJhid6rxvYMlg
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   embs_best   1000 non-null   object
 1   embs_worst  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB


In [29]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16744 entries, 9 to 150326
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  16744 non-null  object
 1   name         16744 non-null  object
 2   categories   16735 non-null  object
 3   embs_best    16744 non-null  object
dtypes: object(4)
memory usage: 654.1+ KB


In [31]:
# joining the final dataframe with the user embeddings, merging df_final's embs_best column with embs_users' embs_best column where embs_users.embs_best.notnull()
df_final = df_final.join(embs_users['embs_best'], on='business_id', how='left', rsuffix='_user')

In [33]:
# for the rows where embs_best_user is not null, replace embs_best with embs_best_user
df_final['embs_best'] = np.where(df_final['embs_best_user'].notnull(), df_final['embs_best_user'], df_final['embs_best'])


In [39]:
df_final.drop('embs_best_user', axis=1, inplace=True)

## Exporting the embeddings (second attempt)

In [41]:
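# generating the embeddings in the desired format
import os


def export_dataset(df: pd.DataFrame, emb_column: str, output_file: str):
    """
    Export the embeddings (embeddings.txt) and the remaining columns
    (metadados.csv), both tab-separated, to the directory `output_file`.
    """
    # create the output directory if it does not exist yet
    os.makedirs(output_file, exist_ok=True)

    # one embedding per line, in the same order as the metadata rows
    np.savetxt(os.path.join(output_file, 'embeddings.txt'), np.stack(df[emb_column]), delimiter='\t')
    df.drop(emb_column, axis=1).to_csv(os.path.join(output_file, 'metadados.csv'), sep='\t', index=False)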


export_dataset(df_final, 'embs_best', '../data/Embeddings/SecondAttempt')

In [42]:
!python ../evaluation/evaluation.py ../data/Embeddings/SecondAttempt/embeddings.txt ../data/Embeddings/SecondAttempt/metadados.csv

              business_id  ...                                         categories
0  bBDDEgkFA1Otx9Lfe7BZUQ  ...  Ice Cream & Frozen Yogurt, Fast Food, Burgers,...
1  MUTTqe8uqyMdBl186RmNeA  ...                  Sushi Bars, Restaurants, Japanese
2  8wGISYjYkE2tSqn3cDMu8A  ...  Automotive, Car Rental, Hotels & Travel, Truck...
3  ROeacJQwBeh05Rqg7F6TCg  ...                                Korean, Restaurants
4  qhDdDeI3K4jy2KyzwFN53w  ...   Shopping, Books, Mags, Music & Video, Bookstores

[5 rows x 3 columns]


Avaliação de Embeddings
Embeddings:  ../data/Embeddings/SecondAttempt/embeddings.txt
Total Users:  1000
NDCG@5:  0.5517928975002069
NDCG@10:  0.5958024267360613


Your CPU supports instructions that this binary was not compiled to use: AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib
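
The NDCG@5 and NDCG@10 values above match the first attempt to every printed digit. One possible explanation is that the left join in In [31] looks up `business_id` values in the index of `embs_users`, which holds user ids, so no row receives a user embedding and the `np.where` replacement changes nothing. The sketch below reproduces that situation on toy data; the ids, vectors, and the index name `user_id` are hypothetical, and whether this is what happened here would need to be confirmed on the real frames.

```python
import numpy as np
import pandas as pd

businesses = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "embs_best": [np.array([0.1, 0.2]), np.array([0.3, 0.4])],
})
# User embeddings indexed by user id (index name is an assumption).
users = pd.DataFrame(
    {"embs_best": [np.array([0.9, 0.9])]},
    index=pd.Index(["u1"], name="user_id"),
)

# Same join as in In [31]: business ids are looked up in an index of user ids.
joined = businesses.join(users["embs_best"], on="business_id", how="left", rsuffix="_user")

# Number of rows that would actually be overwritten by the np.where step.
n_replaced = int(joined["embs_best_user"].notna().sum())
print(n_replaced)
```

Counting the non-null values of `embs_best_user` before dropping the column would show whether any business embedding was actually replaced.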

