> The strategy is to take the businesses that appear in eval_set, select each one's 5 best reviews and 5 worst reviews (those with the lowest ratings), generate embeddings for those reviews, and average them.

> We then do the same for the users that appear in eval_set: take each user's 5 best and 5 worst reviews, average their embeddings, and compare the result with the business averages.

> The logic is: if the value obtained with the best-review embeddings is larger, we recommend the business to that user; if the value obtained with the worst-review embeddings is larger, we do not recommend it.

> To order the businesses, we check which of the two values is larger:
- if it is the best-review embedding, we sort so that the larger its value, the closer to the front the business goes.
- if it is the worst-review embedding, we sort so that the larger its value, the further back the business goes.
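
A minimal sketch of this decision and ordering rule on toy vectors. The text does not say which measure is compared, so cosine similarity is an assumption here, as is pairing the user's best average with the business's best average and the user's worst average with the business's worst average. The names `cosine` and `rank_businesses` are hypothetical and do not appear in the notebook.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors (assumed comparison measure).
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_businesses(user_best, user_worst, businesses):
    # businesses: dict mapping business_id -> (embs_best, embs_worst) averages.
    keyed = []
    for business_id, (biz_best, biz_worst) in businesses.items():
        sim_best = cosine(user_best, biz_best)
        sim_worst = cosine(user_worst, biz_worst)
        if sim_best >= sim_worst:
            # "Best" wins: recommended group, larger value goes further to the front.
            keyed.append((0, -sim_best, business_id))
        else:
            # "Worst" wins: not-recommended group, larger value goes further back.
            keyed.append((1, sim_worst, business_id))
    return [business_id for _, _, business_id in sorted(keyed)]

toy_businesses = {
    "A": ([1, 0], [1, 0]),
    "B": ([1, 1], [1, 0]),
    "C": ([0, 1], [0, 1]),
    "D": ([0, 1], [1, 1]),
}
ranking = rank_businesses([1, 0], [0, 1], toy_businesses)
print(ranking)
```

With these toy vectors, A and B land in the recommended group and C and D in the other, with C last because its worst-review value is the largest.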

In [1]:
import pandas as pd

# Loading the data

In [2]:
# loading the eval_set data
eval_set = pd.read_csv('../data/evaluation/eval_users.csv')

In [3]:
# loading the reviews dataset
reviews = pd.read_parquet('../data/DatasetsLimpos/yelp_academic_dataset_review.parquet')

# Filtering

### Keeping only the businesses that appear in eval_set

In [4]:
# collecting the business IDs that appear in the eval_set recommendation lists
business_ = eval_set['reclist'].apply(lambda x: x.replace("[", "").replace("]", "").split(',')).explode().unique()

In [5]:
business_ = pd.Series(business_).apply(lambda x: x.replace("'", "").replace(" ", "")).unique()
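
As an alternative to the chained string replacements above, the same IDs could be parsed with `ast.literal_eval`. This is only a sketch, and it assumes each `reclist` cell holds the string form of a Python list of quoted IDs; the toy frame and its IDs are made up for illustration.

```python
import ast
import pandas as pd

# Toy stand-in for eval_set; the real file is not reproduced here.
toy_eval = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reclist": ["['b1', 'b2']", "['b2', 'b3']"],
})

# Parse each cell into a real list, flatten, and keep the distinct IDs.
business_ids = toy_eval["reclist"].apply(ast.literal_eval).explode().unique()
print(list(business_ids))
```

Parsing the literal avoids stripping characters by hand, but it raises an error on cells that are not valid Python literals, so the string-replace version may be the more tolerant choice on messy data.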

In [6]:
# keeping only the businesses that appear in eval
#reviews = reviews[reviews['business_id'].isin(business_)]

# keeping only the businesses that appear in eval and that have no saved embeddings yet
reviews = reviews[reviews['business_id'].isin(business_) & ~reviews['business_id'].isin(pd.read_parquet('../data/embeddingsBusiness.parquet').index)]
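
The filter above reads `embeddingsBusiness.parquet`, which is only written later in this notebook, so it fails on a first run when the file does not exist yet. A hedged sketch of a guard follows; the helper name `pending_reviews` and the toy data are hypothetical.

```python
import os
import pandas as pd

def pending_reviews(reviews, wanted_ids, done_path):
    # IDs that already have saved embeddings; empty if nothing was saved yet.
    if os.path.exists(done_path):
        done_ids = pd.read_parquet(done_path).index
    else:
        done_ids = pd.Index([])
    mask = reviews["business_id"].isin(wanted_ids) & ~reviews["business_id"].isin(done_ids)
    return reviews[mask]

toy_reviews = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "text": ["good", "bad", "ok"],
})
pending = pending_reviews(toy_reviews, ["b1", "b3"], "no_such_file.parquet")
print(pending["business_id"].tolist())
```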

In [7]:
reviews.shape

(1676719, 7)

In [25]:
# keeping only the columns that will be used
reviews = reviews[['business_id', 'user_id','text', 'stars']]

In [26]:
# grouping by business: one list of texts and one list of ratings per business
reviews = reviews.groupby('business_id').agg({'text': list, 'stars': list})

In [27]:
import numpy as np

In [28]:
# function that takes a list of ratings and returns the indices of the 5 highest and 5 lowest values
# (or the 2 highest and 2 lowest when the list has fewer than 10 items)
def get_best_worst_indices(lista):
    # Convert the list to a NumPy array
    arr = np.array(lista)

    if arr.shape[0] < 10:
        # Fewer than 10 ratings: indices of the 2 highest and 2 lowest values via argsort
        indices_n_maiores = np.argsort(-arr)[:2]
        indices_n_menores = np.argsort(arr)[:2]
    else:
        # 10 or more ratings: indices of the 5 highest and 5 lowest values via argsort
        indices_n_maiores = np.argsort(-arr)[:5]
        indices_n_menores = np.argsort(arr)[:5]
    return indices_n_maiores.tolist(), indices_n_menores.tolist()
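
Star ratings contain many ties, and NumPy's default sort is not guaranteed to be stable, so which tied review gets picked can vary. The sketch below is a variant with `kind="stable"`; the name `best_worst_indices_stable` and its parameters are hypothetical, not part of the notebook.

```python
import numpy as np

def best_worst_indices_stable(ratings, threshold=10, n_small=2, n_large=5):
    arr = np.asarray(ratings)
    # Take 5 of each when there are at least `threshold` ratings, otherwise 2.
    n = n_large if arr.shape[0] >= threshold else n_small
    # A stable sort keeps tied ratings in their original order.
    best = np.argsort(-arr, kind="stable")[:n]
    worst = np.argsort(arr, kind="stable")[:n]
    return best.tolist(), worst.tolist()

print(best_worst_indices_stable([5, 1, 3, 4, 2]))  # ([0, 3], [1, 4])
print(best_worst_indices_stable([4]))              # ([0], [0])
```

The second call shows that with very few reviews the same index can appear in both lists, so the "best" and "worst" groups may overlap.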

In [29]:
# applying it to the stars column
reviews['stars'] = reviews.stars.apply(get_best_worst_indices)

In [30]:
# filtering the text column to keep only the reviews at the indices stored in stars (best first, then worst)
reviews['text'] = reviews.apply(lambda x: [x['text'][i] for i in x['stars'][0] + x['stars'][1]], axis=1)

### Generating the business embeddings

In [31]:
from BertEmbedding import get_bert_embedding

In [None]:
# applying the function to the text column
reviews['embs'] = reviews['text'].apply(lambda x: get_bert_embedding(x)[0])

In [34]:
# splitting the embeddings into columns: first half = best-rated, second half = worst-rated
# (the rows are always the best picks followed by the same number of worst picks,
# so halving also covers groups with fewer than 2 reviews)
reviews['embs_best'] = reviews['embs'].apply(lambda x: x[:x.shape[0] // 2])
reviews['embs_worst'] = reviews['embs'].apply(lambda x: x[x.shape[0] // 2:])
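
A small check of the split and of the per-group averaging that the notebook applies next, on a toy array instead of real BERT output. It assumes the embedding matrix has one row per selected review, best picks first; `split_and_average` is a hypothetical helper name.

```python
import numpy as np

def split_and_average(embs):
    embs = np.asarray(embs, dtype=float)
    # First half of the rows are the best-rated reviews, second half the worst-rated.
    half = embs.shape[0] // 2
    return embs[:half].mean(axis=0), embs[half:].mean(axis=0)

# 4 toy "embeddings" of dimension 3 (2 best followed by 2 worst).
toy = np.arange(12, dtype=float).reshape(4, 3)
best_mean, worst_mean = split_and_average(toy)
print(best_mean)
print(worst_mean)
```

Each mean has the same dimension as a single embedding, which is what lets a business average and a user average be compared directly.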

In [35]:
reviews.head()

business_id,text,stars,embs,embs_best,embs_worst
-0EdehHjIQc0DtYU8QcAig,[I had the best Beef and Broccoli that I've ev...,"([17, 21, 18, 13, 12], [0, 23, 19, 28, 4])","[[0.76516813, -0.9572798, 2.6221435, -1.212702...","[[0.76516813, -0.9572798, 2.6221435, -1.212702...","[[0.8259199, -0.9743092, 2.640242, -1.0966994,..."
-0fOUV_llBAPMo7exZFHPA,[My latest hair style has either been a top kn...,"([0, 1], [3, 0])","[[0.60077584, -0.87541145, 2.7138147, -1.42937...","[[0.60077584, -0.87541145, 2.7138147, -1.42937...","[[0.7065828, -0.82314223, 2.373746, -1.3431708..."
-0gWtMKg8_iV6vC5wRFDiA,[Great experience. Family and adult friendly....,"([7, 10, 1, 4, 5], [8, 11, 12, 0, 2])","[[1.853262, -1.0348336, 3.1113107, -1.3340988,...","[[1.853262, -1.0348336, 3.1113107, -1.3340988,...","[[0.5425563, -0.71506804, 2.4221518, -1.121309..."
-1EGqUQFBmGEp76CE-Zk4Q,[I love their shrimp quesadilla! They are real...,"([16, 1, 30, 27, 25], [0, 34, 33, 21, 12])","[[1.4248459, -0.8644156, 2.872499, -1.4126887,...","[[1.4248459, -0.8644156, 2.872499, -1.4126887,...","[[0.31495255, -0.652539, 2.4236493, -0.9839752..."
-2CPhK6ik9ZBgFX_F-dkxQ,[Just your typical subway. This one is particu...,"([0, 4], [1, 2])","[[0.9070763, -1.0537813, 2.781538, -1.2485958,...","[[0.9070763, -1.0537813, 2.781538, -1.2485958,...","[[0.3103292, -0.9453506, 2.5486505, -1.3950517..."


In [36]:
# averaging the embeddings
reviews['embs_best'] = reviews.embs_best.apply(lambda x: np.mean(x, axis=0))
reviews['embs_worst'] = reviews.embs_worst.apply(lambda x: np.mean(x, axis=0))

Unnamed: 0,user_id,business_id,stars,useful,funny,cool,text
0,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,1,1,1,"If you decide to eat here, just be aware it is..."
1,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,1,1,I've taken a lot of spin classes over the year...
2,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,1,1,1,Family diner. Had the buffet. Eclectic assortm...
3,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,1,1,"Wow! Yummy, different, delicious. Our favo..."
4,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,1,1,Cute interior and owner (?) gave us tour of up...
...,...,...,...,...,...,...,...
6990275,qskILQ3k0I_qcCMI-k6_QQ,jals67o91gcrD4DC81Vk6w,5,1,1,1,Latest addition to services from ICCU is Apple...
6990276,Zo0th2m8Ez4gLSbHftiQvg,2vLksaMmSEcGbjI5gywpZA,5,1,1,1,"This spot offers a great, affordable east week..."
6990277,mm6E4FbCMwJmb7kPDZ5v2Q,R1khUUxidqfaJmcpmGd4aw,4,1,1,1,This Home Depot won me over when I needed to g...
6990278,YwAMC-jvZ1fvEUum6QkEkw,Rr9kKArrMhSLVE9a53q-aA,5,1,1,1,For when I'm feeling like ignoring my calorie-...


In [37]:
reviews[['embs_best', 'embs_worst']].to_parquet('../data/embeddingsBusiness.parquet')

### Filtering the users that appear in eval_set

In [38]:
users = eval_set['user_id'].unique()

In [39]:
# loading the reviews dataset
reviews = pd.read_parquet('../data/DatasetsLimpos/yelp_academic_dataset_review.parquet')

In [88]:
reviews = reviews[reviews['user_id'].isin(users)]

In [90]:
## Everything else is the same as for the businesses...

# keeping only the columns that will be used
reviews = reviews[['user_id', 'text', 'stars']]

# grouping by user: one list of texts and one list of ratings per user
reviews = reviews.groupby('user_id').agg({'text': list, 'stars': list})

# function that takes a list of ratings and returns the indices of the 5 highest and 5 lowest values
# (or the 2 highest and 2 lowest when the list has fewer than 10 items)
def get_best_worst_indices(lista):
    # Convert the list to a NumPy array
    arr = np.array(lista)

    if arr.shape[0] < 10:
        # Fewer than 10 ratings: indices of the 2 highest and 2 lowest values via argsort
        indices_n_maiores = np.argsort(-arr)[:2]
        indices_n_menores = np.argsort(arr)[:2]
    else:
        # 10 or more ratings: indices of the 5 highest and 5 lowest values via argsort
        indices_n_maiores = np.argsort(-arr)[:5]
        indices_n_menores = np.argsort(arr)[:5]
    return indices_n_maiores.tolist(), indices_n_menores.tolist()


# applying it to the stars column
reviews['stars'] = reviews.stars.apply(get_best_worst_indices)
# filtering the text column to keep only the reviews at the indices stored in stars (best first, then worst)
reviews['text'] = reviews.apply(lambda x: [x['text'][i] for i in x['stars'][0] + x['stars'][1]], axis=1)
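
Since the user-side steps repeat the business-side ones with only the grouping column changed, they could be folded into one helper. This is a sketch, not the notebook's code: `best_worst_texts` and its parameters are hypothetical, and it uses a stable sort, which the original does not.

```python
import numpy as np
import pandas as pd

def best_worst_texts(reviews, key, threshold=10, n_small=2, n_large=5):
    # For each group (business or user), return the best-rated texts followed by the worst-rated ones.
    rows = {}
    for group_id, group in reviews.groupby(key):
        stars = group["stars"].to_numpy()
        texts = group["text"].tolist()
        n = n_large if len(stars) >= threshold else n_small
        best = np.argsort(-stars, kind="stable")[:n]
        worst = np.argsort(stars, kind="stable")[:n]
        rows[group_id] = [texts[i] for i in list(best) + list(worst)]
    return pd.Series(rows, name="text")

toy_reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "text": ["a", "b", "c", "d"],
    "stars": [5, 1, 3, 4],
})
selected = best_worst_texts(toy_reviews, key="user_id")
print(selected.to_dict())
```

Calling it with `key="business_id"` would cover the business side as well, assuming the frame has that column.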

### Generating the user_id embeddings

In [None]:
# applying the function to the text column
reviews['embs'] = reviews['text'].apply(lambda x: get_bert_embedding(x)[0])

In [92]:
# splitting the embeddings into columns: first half = best-rated, second half = worst-rated
# (the rows are always the best picks followed by the same number of worst picks,
# so halving also covers groups with fewer than 2 reviews)
reviews['embs_best'] = reviews['embs'].apply(lambda x: x[:x.shape[0] // 2])
reviews['embs_worst'] = reviews['embs'].apply(lambda x: x[x.shape[0] // 2:])

In [93]:
reviews.head()

user_id,text,stars,embs,embs_best,embs_worst
-1BSu2dt_rOAqllw9ZDXtA,[Hank and I love Brocatos!The freshest ingredi...,"([0, 1, 2, 3, 4], [6, 9, 11, 0, 1])","[[0.4294569, -0.74091864, 2.5334587, -1.151932...","[[0.4294569, -0.74091864, 2.5334587, -1.151932...","[[0.7082855, -0.7646998, 2.8138785, -1.3410811..."
-6DoXmdXEy_P5N-QZzntgA,"[With a home airport like ORD, SBA is an incre...","([5, 8, 12, 0, 1], [7, 10, 11, 0, 1])","[[0.78651774, -1.0710886, 2.6585214, -1.109733...","[[0.78651774, -1.0710886, 2.6585214, -1.109733...","[[0.4615741, -0.6224606, 2.4306178, -1.099641,..."
-8NOuak4Sipn7-zy7Nk5hg,[One of Philadelphia's best restaurants in my ...,"([0, 1, 3, 4, 5], [11, 8, 6, 2, 12])","[[1.1762301, -1.0045295, 2.9417152, -1.2765405...","[[1.1762301, -1.0045295, 2.9417152, -1.2765405...","[[1.0583673, -0.7620707, 2.6820798, -1.0745525..."
-8rSnT5ztVk6vmTDkxTqsQ,[Great space! Loved the accessibility to outdo...,"([10, 0, 1, 2, 4], [3, 9, 11, 14, 5])","[[1.127403, -0.9143411, 2.9474373, -1.2328093,...","[[1.127403, -0.9143411, 2.9474373, -1.2328093,...","[[0.57977533, -0.7533637, 2.3341944, -1.041487..."
-C7xxeVQI5qEZGAzFdx-cg,[This place is the best! Their food isn't spic...,"([0, 2, 3, 5, 7], [4, 9, 1, 6, 0])","[[0.9144162, -0.76526994, 2.5454423, -1.262568...","[[0.9144162, -0.76526994, 2.5454423, -1.262568...","[[1.2083615, -0.67561406, 2.804841, -1.2537695..."


In [94]:
# averaging the embeddings
reviews['embs_best'] = reviews.embs_best.apply(lambda x: np.mean(x, axis=0))
reviews['embs_worst'] = reviews.embs_worst.apply(lambda x: np.mean(x, axis=0))

In [97]:
reviews.embs_best.apply(len).describe()

count    1000.0
mean      768.0
std         0.0
min       768.0
25%       768.0
50%       768.0
75%       768.0
max       768.0
Name: embs_best, dtype: float64

In [98]:
# saving the results to parquet
reviews[['embs_best', 'embs_worst']].to_parquet('../data/embeddingsUsers.parquet')