> Chalkiopoulos Georgios, Electrical and Computer Engineer NTUA <br />
> Data Science postgraduate Student <br />
> gchalkiopoulos@aueb.gr

# Introduction

## Goal

* Find a rating-based or matching-based (binary) dataset that can be used to inform a recsys based on collaborative filtering.

* Email me your dataset of choice to confirm before working on your recsys.

* Build a Python Notebook that:

(1) Loads the dataset

(2) Tries at least 2 different recommendation methods based on collaborative filtering (Tensorflow, Matrix factorization, Count-based)

(3) Uses quantitative metrics to evaluate the recommendations of each of the two methods that you selected.

## Dataset

The dataset was [downloaded from Kaggle](https://www.kaggle.com/datasets/thedevastator/1-5-million-beer-reviews-from-beer-advocate) and contains about 1.5 million beer reviews from BeerAdvocate.

## Useful info

The notebook uses three recommendation methods:
- Count-based: split into user-based and beer-based (item-based) variants
- LDA: topic modelling is performed, and beers from the best-fitting group are recommended
- SVD: the recommendation is made based on the predicted rating

## Evaluation

A group of users (`test_users`) was created, each with between 500 and 600 ratings. The first five of these (randomly chosen) users are used to evaluate the results. For these users, we get recommendations and calculate the ratio of positively rated beers. The ratio is computed over the beers that are candidate recommendations but that the user has already rated, i.e. it is the share of those beers that the user rated positively.

Moreover, another user (user 5) will be used to print the actual recommendations, along with the previous metric.
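
The positive ratio described above can be sketched as follows. This is a minimal illustration, assuming the recommender returns a dict that maps beer ids to the user's existing discrete rating, with `"P"` marking a positive rating (this mirrors how the cells below count `"P"`). The function name `positive_ratio` and the toy data are hypothetical and not part of `utils`.

```python
from typing import Dict, Optional


def positive_ratio(already_rated: Dict[int, str]) -> Optional[float]:
    """Share (in %) of already-rated candidate beers that the user rated "P"."""
    if not already_rated:
        # No candidate was already rated, so the ratio is undefined.
        return None
    labels = list(already_rated.values())
    return labels.count("P") / len(labels) * 100


# Toy data (hypothetical ids and labels): 3 of the 4 candidates were rated "P".
example = {101: "P", 102: "P", 103: "A", 104: "P"}
print(f"Positive ratio: {positive_ratio(example):.2f}%")
```

Returning `None` for an empty dict corresponds to the "No beers already rated" branch in the evaluation cells.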

## Code setup

This notebook uses modules placed under the `utils` folder to help with readability. A new environment can be created using the `requirements.txt` file. The logging level can be set in the corresponding cell; to see minimal output, set it to WARNING.

## pkl files

Some parts of the code use saved `.pkl` files to reduce computation time. If a file does not exist, the corresponding calculation takes place and the result is then saved. You may contact the author if you wish to receive the pre-calculated `.pkl` files.
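
The load-or-compute behaviour described above could look roughly like this. It is only a sketch: the helper name `load_or_compute` is hypothetical, and the real logic lives in the notebook cells and the `utils` modules.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at `path`; compute and save it if it is missing."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except (FileNotFoundError, EOFError):
        result = compute()
        with open(path, "wb") as handle:
            pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
        return result


# Usage: the second call reads the cached file instead of recomputing.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.pkl"
    first = load_or_compute(cache, lambda: {"k": 3})
    second = load_or_compute(cache, lambda: {"k": 99})
    print(first, second)
```

Because the cache is keyed only by file name, a stale file has to be deleted by hand whenever the inputs change.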

# 1. Import Libraries

In [1]:
from logging import INFO, WARNING, DEBUG, ERROR
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds
from pathlib import Path
import pickle

# utils
from utils import helper_functions
from utils import lda
from utils import reader
from utils.Loggers import BaseLogger
from utils import svd

# 2. Set Logging level

In [2]:
BaseLogger.level = INFO
logger = BaseLogger().logger

# 3. Read data

In [3]:
file_path: str = "beer_reviews.csv"
reader = reader.ReviewReader(file_path=file_path)
reviews: pd.DataFrame = reader.read_reviews()

[2023-03-05 23:19:52,442] INFO [ReviewReader] - Loading beer_reviews.csv
[2023-03-05 23:19:59,054] INFO [ReviewReader] - 300000 reviews loaded.
[2023-03-05 23:20:05,179] INFO [ReviewReader] - 600000 reviews loaded.
[2023-03-05 23:20:11,568] INFO [ReviewReader] - 900000 reviews loaded.
[2023-03-05 23:20:17,746] INFO [ReviewReader] - 1200000 reviews loaded.
[2023-03-05 23:20:24,025] INFO [ReviewReader] - 1500000 reviews loaded.
[2023-03-05 23:20:25,702] INFO [ReviewReader] - All reviews loaded. Total reviews: 1586614
[2023-03-05 23:20:53,312] INFO [ReviewReader] - Processing completed.


In [4]:
print("Shape: ", reviews.shape)
reviews.head()

Shape:  (1399623, 12)


| | user_id | user_name | beer_id | beer_name | beer_style | review_overall | review_aroma | review_appearance | review_taste | review_palate | brewery_id | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 7 | fodeeoz | 436 | Amstel Light | Light Lager | 3.0 | 2.0 | 3.0 | 2.5 | 2.5 | 163 | A |
| 18 | 15 | jdhilt | 436 | Amstel Light | Light Lager | 2.5 | 3.0 | 3.0 | 2.0 | 2.0 | 163 | A |
| 19 | 16 | UCLABrewN84 | 58046 | Rauch Ür Bock | Rauchbier | 4.5 | 4.5 | 3.0 | 4.5 | 4.0 | 1075 | P |
| 20 | 17 | zaphodchak | 58046 | Rauch Ür Bock | Rauchbier | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 1075 | P |
| 21 | 18 | Tilley4 | 58046 | Rauch Ür Bock | Rauchbier | 4.0 | 4.5 | 4.0 | 4.0 | 3.5 | 1075 | P |


# 4. Item-Based and User-Based Recommendations with Hashing

## 4.1 Index data & compute similarity scores

* Due to the size of the data, we use hashing so that the calculations remain feasible
* Two recommendation systems (user-based and beer-based) are created by calculating the corresponding similarity scores
* For faster development, pkl files are available.
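
The implementation behind `helper_functions.load_indexer` is not shown in this notebook. As a rough illustration of a hash-based, count-based item recommender, the sketch below indexes toy reviews in dictionaries and scores beers by how often they are liked together. All names and data here are hypothetical, and the scoring rule is an assumption rather than the notebook's actual similarity measure.

```python
from collections import defaultdict
from itertools import combinations

# Toy (user_id, beer_id, rating) triples; "P" is assumed to mean a positive rating.
toy_reviews = [
    (1, "a", "P"), (1, "b", "P"), (1, "c", "A"),
    (2, "a", "P"), (2, "b", "P"),
    (3, "b", "P"), (3, "c", "P"),
]

# Hash-based index: user -> set of positively rated beers.
liked_by_user = defaultdict(set)
for user_id, beer_id, rating in toy_reviews:
    if rating == "P":
        liked_by_user[user_id].add(beer_id)

# Co-occurrence counts: how many users liked both beers of a pair.
co_counts = defaultdict(lambda: defaultdict(int))
for beers in liked_by_user.values():
    for x, y in combinations(sorted(beers), 2):
        co_counts[x][y] += 1
        co_counts[y][x] += 1


def recommend_item_based(user_id, rec_num=10):
    """Score beers the user has not liked yet by co-occurrence with liked beers."""
    liked = liked_by_user[user_id]
    scores = defaultdict(int)
    for beer in liked:
        for other, count in co_counts[beer].items():
            if other not in liked:
                scores[other] += count
    # Highest score first; ties are broken by beer id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:rec_num]


print(recommend_item_based(2))
```

Dictionary lookups keep each step proportional to the number of ratings a user actually has, which is the practical reason for indexing before computing similarities on a dataset of this size.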

In [5]:
indexer_user = helper_functions.load_indexer(focus="usr",
                             reviews=reviews, beer_id_col='beer_id', user_id_col='user_id',
                             beer_mapping=reader.beer_mapping, user_mapping=reader.user_mapping,
                             indexer_path="indexer_usr.pkl"
                            )

indexer_beer = helper_functions.load_indexer(focus="beer",
                             reviews=reviews, beer_id_col='beer_id', user_id_col='user_id',
                             beer_mapping=reader.beer_mapping, user_mapping=reader.user_mapping,
                             indexer_path="indexer_beer.pkl"
                            )

[2023-03-05 21:18:30,817] INFO [BaseLogger] - Trying to load indexer_usr.pkl file.
[2023-03-05 21:18:39,472] INFO [BaseLogger] - indexer_usr.pkl file loaded.
[2023-03-05 21:18:39,473] INFO [BaseLogger] - Trying to load indexer_beer.pkl file.
[2023-03-05 21:18:49,784] INFO [BaseLogger] - indexer_beer.pkl file loaded.


## 4.1 Test users evaluation

* In this part we will check the recommendations for some users. The recommendations provided will be compared to the user's ratings.

### 4.1.1 Test users evaluation - User Based

In [6]:
for user in reader.test_users[:5]:

    try:

        # Get already rated beers
        already_rated = helper_functions.recommend_ub(indexer_user, user=user, rec_num=10, verbose=0)

        # print the ratio of positively rated beers vs recommendations
        print(f"\nUser: {user}")
        if already_rated:
            print("Positive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
            print("Already Rated number of beers:", f'{len(already_rated.values())}')
        else:
            print("No beers already rated")

    except KeyError:
        print(f"\nUser {user} not found")


User: 2
Positive ratio: 100.00%
Already Rated number of beers: 31

User: 63
Positive ratio: 97.55%
Already Rated number of beers: 286

User: 76
Positive ratio: 84.00%
Already Rated number of beers: 25

User: 80
No beers already rated

User: 151
Positive ratio: 94.44%
Already Rated number of beers: 54


### 4.1.2 Test users evaluation - Beer Based

In [7]:
for user in reader.test_users[:5]:
    try:
        # Get already rated beers
        already_rated = helper_functions.recommend_mb(indexer_beer, user=user, rec_num=10, verbose=0)

        # print the ratio of positively rated beers vs recommendations
        print(f"\nUser: {user}")
        if already_rated:
            print("Positive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
            print("Already Rated number of beers:", f'{len(already_rated.values())}')
        else:
            print("No beers already rated")

    except KeyError:
        print(f"\nUser {user} not found")


User: 2
Positive ratio: 94.74%
Already Rated number of beers: 19

User: 63
Positive ratio: 98.01%
Already Rated number of beers: 403

User: 76
Positive ratio: 73.91%
Already Rated number of beers: 23

User: 80
No beers already rated

User: 151
Positive ratio: 93.22%
Already Rated number of beers: 59


## 4.2.1 User 5 recommendations - User Based

In [8]:
already_rated = helper_functions.recommend_ub(indexer_user, user=5, rec_num=10, verbose=1)

print("\n\nPositive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
print("Already Rated number of beers:", f'{len(already_rated.values())}')


I suggest the following beers because they have received positive ratings
from users who tend to like what you like:

1904 Sierra Nevada Celebration Ale 89.48
2751 Racer 5 India Pale Ale 82.04
35738 Hop Stoopid 70.62
6549 Northern Hemisphere Harvest Wet Hop Ale 65.05
16403 Smuttynose IPA "Finest Kind" 61.27
646 Westmalle Trappist Tripel 60.11
33644 B.O.R.I.S. The Crusher Oatmeal-Imperial Stout 59.49
7348 Founders Porter 59.41
22505 Green Flash West Coast I.P.A. 56.96
8919 G'Knight Imperial Red Ale 56.42


Positive ratio: 93.06%
Already Rated number of beers: 72


## 4.2.2 User 5 recommendations - Beer Based

In [9]:
already_rated = helper_functions.recommend_mb(indexer_beer, user=5, rec_num=10, verbose=1)

print("\n\nPositive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
print("Already Rated number of beers:", f'{len(already_rated.values())}')


I suggest the following beers because they are similar to the beers you already like:

3916 AleSmith IPA 30.02
2751 Racer 5 India Pale Ale 29.00
33644 B.O.R.I.S. The Crusher Oatmeal-Imperial Stout 29.00
8919 G'Knight Imperial Red Ale 28.02
6549 Northern Hemisphere Harvest Wet Hop Ale 27.75
1658 Big Bear Black Stout 26.32
35738 Hop Stoopid 26.24
7348 Founders Porter 25.53
22505 Green Flash West Coast I.P.A. 24.99
1117 Bell's Kalamazoo Stout 24.07


Positive ratio: 95.08%
Already Rated number of beers: 61


# 5. LDA

We will use Latent Dirichlet Allocation (LDA) for topic modeling, to create groups of beers. In order to create the groups, we will create a "doc" for each user, containing a list of tokens, each formed by concatenating a beer id with the rating the user gave it.
* This will work as the corpus, from which the groups (topics) will be generated
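
To make the "doc" idea concrete, here is a small pure-Python sketch of the token format. The rows are toy data with hypothetical ids; the notebook itself builds the docs with pandas in the next cell.

```python
from collections import defaultdict

# Toy rows (user_id, beer_id, rating), where rating is a discrete label such as "P" or "A".
rows = [(1, 10, "A"), (1, 20, "P"), (2, 20, "P")]

# One "doc" per user: a list of tokens of the form <beer_id><rating>.
docs = defaultdict(list)
for user_id, beer_id, rating in rows:
    docs[user_id].append(f"{beer_id}{rating}")

print(dict(docs))
```

Because the rating is part of the token, the same beer rated differently by two users yields two distinct "words", so a topic can separate beers that a group likes from beers that the group merely tried.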

## 5.1 Create Corpus

In [10]:
# list of reviews (beer_id+rating) per user
logger.info("Creating list of reviews per user.")
reviews_user = reviews[["user_id", "beer_id", "rating"]].groupby('user_id').apply(lambda x: list(x['beer_id'].astype(str).str.cat(x['rating'])))

print("Number of users:", reviews_user.shape[0])
reviews_user.head()

logger.info("Creating User docs.")
docs: dict = dict(zip(reviews_user.index, reviews_user.values))
logger.info("Docs created.")

[2023-03-05 21:29:06,264] INFO [BaseLogger] - Creating list of reviews per user.
[2023-03-05 21:29:14,171] INFO [BaseLogger] - Creating User docs.
[2023-03-05 21:29:14,175] INFO [BaseLogger] - Docs created.


Number of users: 10707


## 5.2 Create LDA topics (Groups)

To speed up computing time, we have pre-calculated the groups. If the pkl file doesn't exist, the LDA calculation will take place, searching for the best combination of k and number of iterations.

In [11]:
lda_groups_path: Path = Path("lda_groups.pkl")

try:

    logger.info(f"Trying to load {lda_groups_path} file.")
    with open(lda_groups_path, 'rb') as handle:
        groups = pickle.load(handle)
        logger.info(f"{lda_groups_path} file loaded.")

except (FileNotFoundError, EOFError):

    logger.warning(f"{lda_groups_path} file doesn't exist. Proceeding to calculate.")
    groups = lda.setup_LDA(docs=docs, k_range=(3, 7), k_step=2,
                   iteration_range=(0, 500), iterations_step=20,
                   top_n=100, reviews=reviews, verbose=0)

    logger.info(f"Saving {lda_groups_path} file.")
    with open(lda_groups_path, 'wb') as handle:
        pickle.dump(groups, handle, protocol=pickle.HIGHEST_PROTOCOL)
    logger.info(f"{lda_groups_path} file saved.")



[2023-03-05 21:30:03,021] INFO [BaseLogger] - Trying to load lda_groups.pkl file.
[2023-03-05 21:30:03,033] INFO [BaseLogger] - lda_groups.pkl file loaded.


## 5.3 LDA test users' recommendations

In [12]:
for user in reader.test_users[:5]:
    print(f"\nUser: {user}")
    try:
        best_recommendations = lda.recommend_lda(groups=groups, user=user, rec_num=10, _user_ratings=indexer_user.user_ratings, _beer_mapping=reader.beer_mapping)

    except KeyError:
        print(f"\nUser {user} not found")


User: 2
Group: 1
Positive ratio: 75.00%
Already Rated number of beers: 4

Group: 2
Positive ratio: 100.00%
Already Rated number of beers: 24

Group: 3
Positive ratio: 100.00%
Already Rated number of beers: 4


User: 63
Group: 1
Positive ratio: 89.69%
Already Rated number of beers: 97

Group: 2
Positive ratio: 99.00%
Already Rated number of beers: 100

Group: 3
Positive ratio: 94.44%
Already Rated number of beers: 36


User: 76
Group: 1
Positive ratio: 71.43%
Already Rated number of beers: 14

Group: 2
Positive ratio: 87.50%
Already Rated number of beers: 16

Group: 3
Positive ratio: 87.50%
Already Rated number of beers: 8


User: 80
Group: 1
No beers already rated in the group
Group: 2
No beers already rated in the group
Group: 3
Positive ratio: 100.00%
Already Rated number of beers: 1


User: 151
Group: 1
Positive ratio: 78.12%
Already Rated number of beers: 32

Group: 2
Positive ratio: 91.67%
Already Rated number of beers: 48

Group: 3
Positive ratio: 85.71%
Already Rated number of 

## 5.4 LDA user 5 recommendations

In [13]:
best_recommendations = lda.recommend_lda(groups=groups, user=5, rec_num=10, _user_ratings=indexer_user.user_ratings, _beer_mapping=reader.beer_mapping)


print("Best Recommendations\n")
for beer_id, beer in best_recommendations:
    print(f"{beer_id:<6}: {beer}")

Group: 1
Positive ratio: 82.22%
Already Rated number of beers: 45

Group: 2
Positive ratio: 92.00%
Already Rated number of beers: 50

Group: 3
Positive ratio: 100.00%
Already Rated number of beers: 16

Best Recommendations

646   : Westmalle Trappist Tripel
674   : Westmalle Trappist Dubbel
221   : Fuller's London Porter
1346  : Chimay Tripel (White)
219   : Fuller's ESB
222   : Fuller's London Pride
1836  : La Chouffe
672   : Chimay Première (Red)
652   : Traquair Jacobite
63    : Anchor Steam Beer


# 6. Model Based

Using SVD, we will try to build a recommender system that predicts the user's score for each beer.
We will use the test users to find the optimal value of k, based on recall.
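
As a rough sketch of how an SVD-based predictor can work, the example below mean-centers a small dense ratings matrix per user, keeps the top k singular values, and adds the user means back to obtain predicted ratings. It uses `numpy.linalg.svd` on toy data instead of the `scipy.sparse.linalg.svds` call used later in this notebook, and it assumes the reconstruction step; the internals of `svd.initialize_svd` and `svd.recommend` are not shown here.

```python
import numpy as np

# Toy ratings matrix (rows: users, columns: beers); fully filled for simplicity.
ratings = np.array([
    [4.0, 4.5, 2.0, 2.5],
    [4.5, 4.0, 2.5, 2.0],
    [2.0, 2.5, 4.5, 4.0],
])

# Center each user's ratings around that user's mean.
user_means = ratings.mean(axis=1, keepdims=True)
centered = ratings - user_means

# Full SVD, then keep only the k largest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 1
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means

print(np.round(predicted, 2))
```

A smaller k gives a smoother, more generic prediction, while a larger k follows the observed ratings more closely; the search in the next section is about picking that trade-off.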

## 6.1 Find optimal k value

* In this step we will find the optimal value of k, using recall as the evaluation metric

In [4]:
ratings_np_centered, user_means, ratings_matrix = svd.initialize_svd(reviews=reviews)

# This part was used to find the optimal value of k. It is commented out because of its long processing time.

In [5]:
# for k in range(1, 302, 50):
#     logger.info(f"Started SVD calculation for k={k}")
#     #perform svd with k singular values (features)
#     U, sigma, Vt = svds(ratings_np_centered, k = k)
#     sigma = np.diag(sigma)
#     logger.info(f"Completed SVD calculation for k={k}")
#
#     count = []
#     acc = []
#
#     # calculate average score for test users
#     for user in helper_functions.progressbar(reader.test_users,
#                                              f"Validating recall for k={k}: ",
#                                              60,
#                                              print_every=1):
#         try:
#             _acc, _count = svd.validate(user, reviews, U, sigma, Vt, user_means, ratings_matrix)
#             acc.append(_acc)
#             count.append(_count)
#         except ValueError:
#             print(f"User {user} not found")
#
#     print(f"Weighted recall for k={k}: {np.dot(np.array(acc)/np.array(count), np.array(count)/np.array(count).sum(axis=0,keepdims=1))*100:.2f}%")
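
The weighted recall printed by the commented-out loop above averages the per-user recall with weights proportional to each user's count. Assuming `_acc` is a number of hits and `_count` the number of evaluated beers for a user (the return values of `svd.validate` are not documented here), the weights cancel and the expression reduces to total hits divided by total count. The numbers below are hypothetical and only illustrate the arithmetic.

```python
import math

# Hypothetical per-user results: number of hits and number of evaluated beers.
acc = [8, 45, 3]
count = [10, 50, 6]

total = sum(count)
# Weighted mean of per-user recall, each user weighted by their share of evaluated beers.
weighted = sum((a / c) * (c / total) for a, c in zip(acc, count))
# The per-user denominators cancel, so this equals the pooled recall.
pooled = sum(acc) / total

print(f"{weighted * 100:.2f}% vs {pooled * 100:.2f}%")
assert math.isclose(weighted, pooled)
```

In practice this means users with many evaluated beers dominate the metric, which is worth keeping in mind when comparing values of k.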


## 6.2 Evaluate results

In [6]:
U, sigma, Vt = svds(ratings_np_centered, k = 1)
sigma = np.diag(sigma)

### 6.2.1 Test users evaluation - SVD

In [9]:
for user in reader.test_users[:5]:

    rec_df, already_rated = svd.recommend(uid=user, reviews=reviews,U=U,sigma=sigma,Vt=Vt,user_means=user_means,rec_num=10, ratings_matrix=ratings_matrix, beer_mapping=reader.beer_mapping)

    if already_rated:
        print(f"\n\nUser: {user}")
        print("Positive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
        print("Already Rated number of beers:", f'{len(already_rated.values())}')



User: 2
Positive ratio: 93.75%
Already Rated number of beers: 16


User: 63
Positive ratio: 98.67%
Already Rated number of beers: 226


User: 76
Positive ratio: 81.82%
Already Rated number of beers: 22


User: 151
Positive ratio: 95.00%
Already Rated number of beers: 60


### 6.2.2 User 5 evaluation - SVD

In [16]:
rec_df, already_rated = svd.recommend(uid=5, reviews=reviews,U=U,sigma=sigma,Vt=Vt,user_means=user_means,rec_num=10, ratings_matrix=ratings_matrix, beer_mapping=reader.beer_mapping)


print("Positive ratio:", f'{list(already_rated.values()).count("P")/len(already_rated.values())*100:.2f}%')
print("Already Rated number of beers:", f'{len(already_rated.values())}')
rec_df

Positive ratio: 93.33%
Already Rated number of beers: 60


| | beer_id | beer_name | predicted_rating | pred_rating_discr |
|---|---|---|---|---|
| 76393 | 1904 | Sierra Nevada Celebration Ale | 3.943417 | P |
| 154175 | 2751 | Racer 5 India Pale Ale | 3.927446 | P |
| 155929 | 3916 | AleSmith IPA | 3.904008 | P |
| 1131796 | 221 | Fuller's London Porter | 3.902999 | P |
| 82961 | 6549 | Northern Hemisphere Harvest Wet Hop Ale | 3.902391 | P |
| 971432 | 646 | Westmalle Trappist Tripel | 3.900599 | P |
| 1362025 | 35738 | Hop Stoopid | 3.892939 | P |
| 394065 | 228 | Great Lakes Dortmunder Gold | 3.890578 | P |
| 154433 | 1658 | Big Bear Black Stout | 3.890358 | P |
| 45773 | 48434 | Sierra Nevada Kellerweis Hefeweizen | 3.89024 | P |
