# Beer Advocate Case

## About the dataset

This dataset consists of beer reviews from BeerAdvocate. The data span more than 10 years and include all ~1.5 million reviews up to November 2011. Each review rates five "aspects": appearance, aroma, palate, taste, and overall impression. Each review includes product and user information, followed by these five ratings and a plaintext review. We also have reviews from RateBeer.

https://data.world/socialmediadata/beeradvocate

## About the task

Could you answer the following questions?

- Which brewery produces the strongest beers by ABV?
- If you had to pick 3 beers to recommend to someone, how would you approach the problem?
- Which factors impact the quality of a beer the most?
- I enjoy beers whose aroma and appearance match the beer style. What beer should I buy?

## Imports

In [1]:
import pandas as pd

## Data exploration

In [2]:
# Load data into pandas DataFrame
df = pd.read_csv('data/beer_reviews.csv')

In [3]:
# Reorder columns
df = df[[
    'brewery_id',
    'brewery_name',
    'beer_beerid',
    'beer_name',
    'beer_style',
    'beer_abv',
    'review_time',
    'review_profilename',
    'review_appearance',
    'review_aroma',
    'review_palate',
    'review_taste',
    'review_overall'
]]
df

| | brewery_id | brewery_name | beer_beerid | beer_name | beer_style | beer_abv | review_time | review_profilename | review_appearance | review_aroma | review_palate | review_taste | review_overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10325 | Vecchio Birraio | 47986 | Sausa Weizen | Hefeweizen | 5.0 | 1234817823 | stcules | 2.5 | 2.0 | 1.5 | 1.5 | 1.5 |
| 1 | 10325 | Vecchio Birraio | 48213 | Red Moon | English Strong Ale | 6.2 | 1235915097 | stcules | 3.0 | 2.5 | 3.0 | 3.0 | 3.0 |
| 2 | 10325 | Vecchio Birraio | 48215 | Black Horse Black Beer | Foreign / Export Stout | 6.5 | 1235916604 | stcules | 3.0 | 2.5 | 3.0 | 3.0 | 3.0 |
| 3 | 10325 | Vecchio Birraio | 47969 | Sausa Pils | German Pilsener | 5.0 | 1234725145 | stcules | 3.5 | 3.0 | 2.5 | 3.0 | 3.0 |
| 4 | 1075 | Caldera Brewing Company | 64883 | Cauldron DIPA | American Double / Imperial IPA | 7.7 | 1293735206 | johnmichaelsen | 4.0 | 4.5 | 4.0 | 4.5 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1586609 | 14359 | The Defiant Brewing Company | 33061 | The Horseman's Ale | Pumpkin Ale | 5.2 | 1162684892 | maddogruss | 3.5 | 4.0 | 4.0 | 4.0 | 5.0 |
| 1586610 | 14359 | The Defiant Brewing Company | 33061 | The Horseman's Ale | Pumpkin Ale | 5.2 | 1161048566 | yelterdow | 2.5 | 5.0 | 2.0 | 4.0 | 4.0 |
| 1586611 | 14359 | The Defiant Brewing Company | 33061 | The Horseman's Ale | Pumpkin Ale | 5.2 | 1160702513 | TongoRad | 3.0 | 3.5 | 3.5 | 4.0 | 4.5 |
| 1586612 | 14359 | The Defiant Brewing Company | 33061 | The Horseman's Ale | Pumpkin Ale | 5.2 | 1160023044 | dherling | 4.5 | 4.5 | 4.5 | 4.5 | 4.0 |


In [4]:
# Convert str columns to numeric
numeric_columns = [
    "beer_abv",
    "review_appearance",
    "review_aroma",
    "review_palate",
    "review_taste",
    "review_overall"
]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)
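
By default `pd.to_numeric` raises on values it cannot parse. As a minimal sketch of a more forgiving variant, the example below passes `errors="coerce"` so that unparseable strings become `NaN` and show up in the null check instead of stopping the cell. The small `raw` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative frame with one unparseable value ("n/a"); not real data
raw = pd.DataFrame({
    "beer_abv": ["5.0", "6.2", "n/a"],
    "review_overall": ["3.0", "4.5", "2"],
})

# Extra keyword arguments of DataFrame.apply are forwarded to pd.to_numeric
clean = raw.apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean["beer_abv"].isna().sum())  # number of values that could not be parsed
```

Whether silently coercing is acceptable depends on how many values are affected, so it is worth comparing the null counts before and after the conversion.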

In [78]:
unique_breweries = len(df['brewery_id'].unique())
unique_beers = len(df['beer_beerid'].unique())
unique_beer_names = len(df['beer_name'].unique())
unique_beer_styles = len(df['beer_style'].unique())
unique_reviews = len(df)
unique_reviewers = len(df['review_profilename'].unique())  # unique() keeps NaN, so missing names count as one reviewer

print(f"Unique breweries: {unique_breweries}")
print(f"Unique beers: {unique_beers}")
print(f"Unique beer names: {unique_beer_names}")
print(f"Unique beer styles: {unique_beer_styles}")
print(f"Unique reviews: {unique_reviews}")
print(f"Unique reviewers: {unique_reviewers}")

Unique breweries: 5840
Unique beers: 66055
Unique beer names: 56857
Unique beer styles: 104
Unique reviews: 1586614
Unique reviewers: 33388
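
There are fewer unique beer names than unique beer ids, which suggests that some names are shared by several beers, so `beer_beerid` is the safer key. A minimal sketch of how this could be checked is shown below; the `sample` frame is invented for illustration.

```python
import pandas as pd

# Illustrative data: the name "Pale Ale" is used by two different beer ids
sample = pd.DataFrame({
    "beer_beerid": [1, 1, 2, 3, 4],
    "beer_name": ["Pale Ale", "Pale Ale", "Pale Ale", "Stout", "Porter"],
})

# Number of distinct beer ids behind each name
ids_per_name = sample.groupby("beer_name")["beer_beerid"].nunique()

# Names that refer to more than one beer
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)
```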


### Nulls

In [6]:
def check_for_nulls(df):
    for column in list(df):
        null_count = df[column].isnull().sum()
        null_ratio = null_count / len(df)
        print(f"{column.ljust(18)} \t {null_count} \t ({null_ratio*100:.2f}%)")

In [7]:
check_for_nulls(df)

brewery_id         	 0 	 (0.00%)
brewery_name       	 15 	 (0.00%)
beer_beerid        	 0 	 (0.00%)
beer_name          	 0 	 (0.00%)
beer_style         	 0 	 (0.00%)
beer_abv           	 67785 	 (4.27%)
review_time        	 0 	 (0.00%)
review_profilename 	 348 	 (0.02%)
review_appearance  	 0 	 (0.00%)
review_aroma       	 0 	 (0.00%)
review_palate      	 0 	 (0.00%)
review_taste       	 0 	 (0.00%)
review_overall     	 0 	 (0.00%)
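
The same summary can be produced without an explicit loop. The sketch below is one possible vectorized variant that returns a DataFrame instead of printing; the `sample` frame and the column names `null_count` and `null_pct` are illustrative choices.

```python
import pandas as pd

# Illustrative frame with a few missing values
sample = pd.DataFrame({
    "brewery_name": ["A", None, "C", "D"],
    "beer_abv": [5.0, None, None, 6.5],
})

is_null = sample.isnull()
summary = pd.DataFrame({
    "null_count": is_null.sum(),        # missing values per column
    "null_pct": is_null.mean() * 100,   # share of missing values in percent
})
print(summary)
```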


## Q2

### Check if the same user could rate the same beer twice
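
One way to run this check, sketched on made-up data, is to look for repeated `(review_profilename, beer_beerid)` pairs. Keeping only the most recent review per pair is an assumption about how duplicates could be handled, not something the notebook does.

```python
import pandas as pd

# Illustrative reviews: "ann" rated beer 10 twice
sample = pd.DataFrame({
    "review_profilename": ["ann", "ann", "bob", "ann"],
    "beer_beerid": [10, 10, 10, 11],
    "review_overall": [4.0, 3.5, 5.0, 2.0],
    "review_time": [100, 200, 150, 300],
})

pair = ["review_profilename", "beer_beerid"]

# Rows that repeat an earlier (user, beer) pair
n_repeat = int(sample.duplicated(subset=pair).sum())
print(f"Repeated user/beer pairs: {n_repeat}")

# Optional: keep only the latest review for each (user, beer) pair
latest = sample.sort_values("review_time").drop_duplicates(subset=pair, keep="last")
print(latest)
```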

### Data cleaning

- Exclude data with nulls in review_profilename

In [73]:
def prepare_data_for_q2(df):
    # Exclude data with nulls in review_profilename
    df = df.loc[df['review_profilename'].notnull()]
    
    # Leave only columns that we need
    columns = [
        "review_profilename",  # reviewer
        "beer_beerid",  # beer
        "beer_style",   # beer style
        "beer_name",    # beer name for decoding predictions
        "review_overall",  # rating
        "review_time",  # timestamp
    ]
    
    return df[columns]

assert len(df) - len(prepare_data_for_q2(df)) == 348  # There are 348 nulls in this column in our dataframe

In [74]:
prepare_data_for_q2(df)

| | review_profilename | beer_beerid | beer_style | beer_name | review_overall | review_time |
|---|---|---|---|---|---|---|
| 0 | stcules | 47986 | Hefeweizen | Sausa Weizen | 1.5 | 1234817823 |
| 1 | stcules | 48213 | English Strong Ale | Red Moon | 3.0 | 1235915097 |
| 2 | stcules | 48215 | Foreign / Export Stout | Black Horse Black Beer | 3.0 | 1235916604 |
| 3 | stcules | 47969 | German Pilsener | Sausa Pils | 3.0 | 1234725145 |
| 4 | johnmichaelsen | 64883 | American Double / Imperial IPA | Cauldron DIPA | 4.0 | 1293735206 |
| ... | ... | ... | ... | ... | ... | ... |
| 1586609 | maddogruss | 33061 | Pumpkin Ale | The Horseman's Ale | 5.0 | 1162684892 |
| 1586610 | yelterdow | 33061 | Pumpkin Ale | The Horseman's Ale | 4.0 | 1161048566 |
| 1586611 | TongoRad | 33061 | Pumpkin Ale | The Horseman's Ale | 4.5 | 1160702513 |
| 1586612 | dherling | 33061 | Pumpkin Ale | The Horseman's Ale | 4.0 | 1160023044 |


In [10]:
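# Not used: sketch of a sparse user x beer rating matrix.
# coo_matrix needs integer row/column indices, so the profile names and
# beer ids would first have to be mapped to integer positions.
# from scipy import sparse

# df_q2 = prepare_data_for_q2(df)
# R = sparse.coo_matrix((df_q2['review_overall'], 
#                        (df_q2['review_profilename'], df_q2['beer_beerid'])))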


In [68]:
import math
import os
import pickle
import random
import sys
from collections import defaultdict
from datetime import datetime

import numpy as np
import pandas as pd

def ensure_dir(file_path):
    # Create the parent directory of file_path if it does not exist yet
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)


class MatrixFactorization(object):

    Regularization = 0.002
    BiasLearnRate = 0.005
    BiasReg = 0.002

    LearnRate = 0.002
    all_beers_mean = 0
    number_of_ratings = 0

    item_bias = None
    user_bias = None
    beta = 0.02

    iterations = 0

    def __init__(self, save_path, max_iterations=10):
        self.save_path = save_path
        self.user_factors = None
        self.item_factors = None
        self.item_counts = None
        self.item_sum = None
        self.u_inx = None
        self.i_inx = None
        self.user_ids = None
        self.beer_ids = None

        self.all_beers_mean = 0.0
        self.number_of_ratings = 0
        self.MAX_ITERATIONS = max_iterations
        random.seed(42)

        ensure_dir(save_path)

    def initialize_factors(self, ratings, k=25):
        self.user_ids = set(ratings['review_profilename'].values)
        self.beer_ids = set(ratings['beer_beerid'].values)
        self.item_counts = ratings[['beer_beerid', 'review_overall']].groupby('beer_beerid').count()
        self.item_counts = self.item_counts.reset_index()

        self.item_sum = ratings[['beer_beerid', 'review_overall']].groupby('beer_beerid').sum()
        self.item_sum = self.item_sum.reset_index()

        self.u_inx = {r: i for i, r in enumerate(self.user_ids)}
        self.i_inx = {r: i for i, r in enumerate(self.beer_ids)}

        self.item_factors = np.full((len(self.i_inx), k), 0.1)
        self.user_factors = np.full((len(self.u_inx), k), 0.1)

        self.all_beers_mean = calculate_all_beers_mean(ratings)
        print("user_factors are {}".format(self.user_factors.shape))
        self.user_bias = defaultdict(lambda: 0)
        self.item_bias = defaultdict(lambda: 0)

    def predict(self, user, item):

        pq = np.dot(self.item_factors[item], self.user_factors[user].T)
        b_ui = self.all_beers_mean + self.user_bias[user] + self.item_bias[item]
        prediction = b_ui + pq

        if prediction > 5:
            prediction = 5
        elif prediction < 1:
            prediction = 1
        return prediction

    def build(self, ratings, params=None):

        # Default number of factors, used when no params are given
        k = 40
        if params:
            k = params.get('k', k)
            self.save_path = params.get('save_path', self.save_path)

        self.train(ratings, k)

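    def split_data(self, min_rank, ratings):
        # Hold out the `min_rank` most recent reviews of half of the users as test data;
        # all remaining reviews are used for training.

        users = self.user_ids

        train_data_len = int((len(users) * 50 / 100))
        # random.sample() no longer accepts sets (Python 3.11+), so sample from a sorted list
        test_users = set(random.sample(sorted(users), (len(users) - train_data_len)))
        train_users = users - test_users

        train = ratings[ratings['review_profilename'].isin(train_users)]
        test_temp = ratings[ratings['review_profilename'].isin(test_users)].sort_values('review_time', ascending=False)
        test = test_temp.groupby('review_profilename').head(min_rank)
        additional_training_data = test_temp[~test_temp.index.isin(test.index)]

        # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
        train = pd.concat([train, additional_training_data])

        return test, train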

    def meta_parameter_train(self, ratings_df):

        for k in [15, 20, 30, 40, 50, 75, 100]:
            self.initialize_factors(ratings_df, k)
            print("Training model with {} factors".format(k))
            print(str(k), "factor, iterations, train_rmse, test_rmse, time")

            test_data, train_data = self.split_data(10,
                                                    ratings_df)
            columns = ['review_profilename', 'beer_beerid', 'review_overall']
            ratings = train_data[columns].to_numpy()
            test = test_data[columns].to_numpy()

            self.MAX_ITERATIONS = 10
            iterations = 0
            # Visit every rating once per epoch, in shuffled order
            index_randomized = random.sample(range(len(ratings)), len(ratings))

            for factor in range(k):
                factor_iteration = 0
                factor_time = datetime.now()

                last_err = sys.maxsize
                last_test_mse = sys.maxsize
                finished = False

                while not finished:
                    train_mse = self.stocastic_gradient_descent(factor,
                                                                index_randomized,
                                                                ratings)

                    iterations += 1
                    test_mse = self.calculate_rmse(test, factor)

                    finished = self.finished(factor_iteration,
                                             last_err,
                                             train_mse,
                                             last_test_mse,
                                             test_mse)

                    last_err = train_mse
                    last_test_mse = test_mse
                    factor_iteration += 1

                    print(str(k), f"{factor}, {iterations}, {train_mse}, {test_mse}, {datetime.now() - factor_time}")

            self.save(k, False)

    def calculate_rmse(self, ratings, factor):
        # RMSE over `ratings` (rows of user, beer, rating), using only the
        # latent factors trained so far (0..factor); predictions are not clipped.

        def squared_error(row):
            user = self.u_inx[row[0]]
            item = self.i_inx[row[1]]

            pq = np.dot(self.item_factors[item][:factor + 1], self.user_factors[user][:factor + 1].T)
            b_ui = self.all_beers_mean + self.user_bias[user] + self.item_bias[item]
            prediction = b_ui + pq
            return (prediction - row[2]) ** 2

        squared = np.apply_along_axis(squared_error, 1, ratings).sum()
        return math.sqrt(squared / ratings.shape[0])

    def train(self, ratings_df, k=40):

        self.initialize_factors(ratings_df, k)
        print("training matrix factorization at {}".format(datetime.now()))
        
        valid_data, train_data = self.split_data(30, ratings_df) # new
        print(len(valid_data))
        print(len(train_data))
        
        columns = ['review_profilename', 'beer_beerid', 'review_overall']
        ratings = train_data[columns].to_numpy()
        valid = valid_data[columns].to_numpy()

        #ratings = ratings_df[['review_profilename', 'beer_beerid', 'review_overall']].to_numpy()

        # Visit every rating once per epoch, in shuffled order
        index_randomized = random.sample(range(len(ratings)), len(ratings))

        for factor in range(k):
            factor_time = datetime.now()
            iterations = 0
            last_err = sys.maxsize
            last_valid_mse = sys.maxsize
            
            iteration_err = sys.maxsize
            finished = False

            while not finished:
                start_time = datetime.now()
                iteration_err = self.stocastic_gradient_descent(factor,
                                                              index_randomized,
                                                              ratings)

                valid_mse = self.calculate_rmse(valid, factor)  # new
                
                iterations += 1
                print("epoch in {}, f={}, i={} err={} valid_err={}".format(datetime.now() - start_time,
                                                                       factor,
                                                                       iterations,
                                                                       iteration_err,
                                                                       valid_mse))  # new
                finished = self.finished(iterations,
                                         last_err,
                                         iteration_err,
                                         last_valid_mse,  # new
                                         valid_mse)  # new
                last_err = iteration_err
                last_valid_mse = valid_mse  # new
            self.save(factor, finished)
            print("finished factor {} in {} i={} err={} valid_err={}".format(factor,
                                                                  datetime.now() - factor_time,
                                                                  iterations,
                                                                  iteration_err,
                                                                  valid_mse))  # new

    def stocastic_gradient_descent(self, factor, index_randomized, ratings):

        lr = self.LearnRate
        b_lr = self.BiasLearnRate
        r = self.Regularization
        bias_r = self.BiasReg

        for inx in index_randomized:
            rating_row = ratings[inx]

            u = self.u_inx[rating_row[0]]
            i = self.i_inx[rating_row[1]]
            rating = rating_row[2]

            err = (rating - self.predict(u, i))

            self.user_bias[u] += b_lr * (err - bias_r * self.user_bias[u])
            self.item_bias[i] += b_lr * (err - bias_r * self.item_bias[i])

            user_fac = self.user_factors[u][factor]
            item_fac = self.item_factors[i][factor]

            self.user_factors[u][factor] += lr * (err * item_fac
                                                  - r * user_fac)
            self.item_factors[i][factor] += lr * (err * user_fac
                                                  - r * item_fac)
        return self.calculate_rmse(ratings, factor)

    def finished(self, iterations, last_err, current_err,
                 last_valid_mse=0.0, valid_mse=0.0):

        if iterations >= self.MAX_ITERATIONS or last_err - current_err < 0.001:
            print('Finished after {} iterations: last_err {}, current_err {}, last_valid_err {}, valid_err {}'
                             .format(iterations, last_err, current_err, last_valid_mse, valid_mse))
            return True
        else:
            self.iterations += 1
            return False

    def save(self, factor, finished):

        save_path = self.save_path + '/model/'
        if not finished:
            save_path += str(factor) + '/'

        ensure_dir(save_path)

        print("saving factors in {}".format(save_path))
        user_bias = {uid: self.user_bias[self.u_inx[uid]] for uid in self.u_inx.keys()}
        item_bias = {iid: self.item_bias[self.i_inx[iid]] for iid in self.i_inx.keys()}

        uf = pd.DataFrame(self.user_factors,
                          index=self.user_ids)
        it_f = pd.DataFrame(self.item_factors,
                            index=self.beer_ids)

        with open(save_path + 'user_factors.json', 'w') as outfile:
            outfile.write(uf.to_json())
        with open(save_path + 'item_factors.json', 'w') as outfile:
            outfile.write(it_f.to_json())
        with open(save_path + 'user_bias.data', 'wb') as ub_file:
            pickle.dump(user_bias, ub_file)
        with open(save_path + 'item_bias.data', 'wb') as ub_file:
            pickle.dump(item_bias, ub_file)

    def recommend_items(self, user_id, ratings_df, num=3):

        active_user_items = ratings_df[ratings_df['review_profilename'] == user_id]  # new
        active_user_items = active_user_items[['beer_beerid', 'review_overall']]  # new

        # Forward `num`; otherwise the callee always falls back to its default of 3
        return self.recommend_items_by_ratings(user_id, ratings_df, active_user_items, num)

    def recommend_items_by_ratings(self, user_id, ratings_df, active_user_items, num=3):
        
        # Use the ratings passed in rather than the notebook-global `loaded_ratings`
        avg = calculate_all_beers_mean(ratings_df)

        rated_beers = {beer['beer_beerid']: beer['review_overall'] \
                       for _, beer in active_user_items.iterrows()}  # new
        recs = {}
        
        # new
        uf = pd.DataFrame(self.user_factors,
                          index=self.user_ids).T
        
        # new
        it_f = pd.DataFrame(self.item_factors,
                            index=self.beer_ids).T
        
        # new
        user_bias_all = {uid: self.user_bias[self.u_inx[uid]] for uid in self.u_inx.keys()}
        item_bias_all = {iid: self.item_bias[self.i_inx[iid]] for iid in self.i_inx.keys()}
        
        if str(user_id) in uf.columns:

            user = uf[str(user_id)]

            scores = it_f.T.dot(user)

            sorted_scores = scores.sort_values(ascending=False)
            result = sorted_scores[:num + len(rated_beers)].astype(float)
            user_bias = 0

            if user_id in user_bias_all.keys():
                user_bias = user_bias_all[user_id]
            elif int(user_id) in user_bias_all.keys():
                user_bias = user_bias_all[int(user_id)]
                print(f'it was an int {user_bias}')

            rating = float(user_bias + avg)
            result += rating

            recs = {r[0]: {'prediction': r[1] + float(item_bias_all[r[0]])}
                    for r in zip(result.index, result) if r[0] not in rated_beers}

        sorted_items = sorted(recs.items(), key=lambda item: -float(item[1]['prediction']))[:num]

        return sorted_items


def load_all_ratings(ratings, min_ratings=1):
    columns = ['review_profilename', 'beer_beerid', 'review_overall', 'beer_style', 'review_time']
    ratings = ratings[columns]
    
    #print(len(ratings))

    user_count = ratings[['review_profilename', 'beer_beerid']].groupby('review_profilename').count()
    user_count = user_count.reset_index()
    user_ids = user_count[user_count['beer_beerid'] >= min_ratings]['review_profilename']
    # .copy() avoids pandas' SettingWithCopyWarning on the column assignment below
    ratings = ratings[ratings['review_profilename'].isin(user_ids)].copy()

    ratings['review_overall'] = ratings['review_overall'].astype(float)
    return ratings


def calculate_all_beers_mean(ratings):
    avg = ratings['review_overall'].sum() / ratings.shape[0]
    return avg
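
To make the update rule in `stocastic_gradient_descent` easier to follow, here is a minimal sketch on a toy dataset. It assumes integer user and item indices instead of the `u_inx`/`i_inx` lookups, and the names `sgd_epoch`, `rmse`, and `toy` are illustrative rather than part of the class above. The prediction is the global mean plus the user bias, the item bias, and the dot product of the two factor vectors; each pass nudges both biases and a single latent factor in the direction that reduces the error.

```python
import numpy as np

def sgd_epoch(ratings, user_f, item_f, user_b, item_b, mean, factor,
              lr=0.002, bias_lr=0.005, reg=0.002, bias_reg=0.002):
    """One pass over (user, item, rating) triples, training one latent factor."""
    for u, i, r in ratings:
        pred = mean + user_b[u] + item_b[i] + item_f[i] @ user_f[u]
        pred = min(5.0, max(1.0, pred))  # clip to the rating scale, as predict() does
        err = r - pred

        # Bias updates with L2 regularization
        user_b[u] += bias_lr * (err - bias_reg * user_b[u])
        item_b[i] += bias_lr * (err - bias_reg * item_b[i])

        # Update only the factor currently being trained, using the old values
        p, q = user_f[u, factor], item_f[i, factor]
        user_f[u, factor] += lr * (err * q - reg * p)
        item_f[i, factor] += lr * (err * p - reg * q)

def rmse(ratings, user_f, item_f, user_b, item_b, mean):
    """Root mean squared error of the unclipped predictions."""
    sq = [(mean + user_b[u] + item_b[i] + item_f[i] @ user_f[u] - r) ** 2
          for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))

# Toy data: 3 users, 3 items, ratings on a 1-5 scale (made up)
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.5)]
k = 2
user_f = np.full((3, k), 0.1)
item_f = np.full((3, k), 0.1)
user_b = np.zeros(3)
item_b = np.zeros(3)
mean = float(np.mean([r for _, _, r in toy]))

before = rmse(toy, user_f, item_f, user_b, item_b, mean)
for _ in range(200):
    sgd_epoch(toy, user_f, item_f, user_b, item_b, mean, factor=0)
after = rmse(toy, user_f, item_f, user_b, item_b, mean)

print(f"RMSE before: {before:.3f}, after: {after:.3f}")
```

Only column `factor` of the factor matrices changes during a pass; the remaining columns stay at their initial value of 0.1 until their turn comes.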


In [75]:
loaded_ratings = load_all_ratings(prepare_data_for_q2(df), min_ratings=1)
loaded_ratings

| | review_profilename | beer_beerid | review_overall | beer_style | review_time |
|---|---|---|---|---|---|
| 0 | stcules | 47986 | 1.5 | Hefeweizen | 1234817823 |
| 1 | stcules | 48213 | 3.0 | English Strong Ale | 1235915097 |
| 2 | stcules | 48215 | 3.0 | Foreign / Export Stout | 1235916604 |
| 3 | stcules | 47969 | 3.0 | German Pilsener | 1234725145 |
| 4 | johnmichaelsen | 64883 | 4.0 | American Double / Imperial IPA | 1293735206 |
| ... | ... | ... | ... | ... | ... |
| 1586609 | maddogruss | 33061 | 5.0 | Pumpkin Ale | 1162684892 |
| 1586610 | yelterdow | 33061 | 4.0 | Pumpkin Ale | 1161048566 |
| 1586611 | TongoRad | 33061 | 4.5 | Pumpkin Ale | 1160702513 |
| 1586612 | dherling | 33061 | 4.0 | Pumpkin Ale | 1160023044 |
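
The `review_time` column is kept because `split_data` holds out each sampled user's most recent reviews for validation. The sketch below illustrates that idea as a standalone function on made-up data. The name `temporal_holdout`, its parameters, and the use of a local `random.Random` instance are assumptions for this example and differ from the method in the class.

```python
import random

import pandas as pd

def temporal_holdout(ratings, n_latest, test_fraction=0.5, seed=42):
    """Hold out the n_latest most recent reviews of a random subset of users."""
    rng = random.Random(seed)
    users = sorted(ratings["review_profilename"].unique())
    test_users = rng.sample(users, int(len(users) * test_fraction))

    # Newest reviews first, then take the first n_latest rows per test user
    candidates = ratings[ratings["review_profilename"].isin(test_users)]
    newest_first = candidates.sort_values("review_time", ascending=False)
    test = newest_first.groupby("review_profilename").head(n_latest)

    # Everything that was not held out is training data
    train = ratings.drop(index=test.index)
    return train, test

# Illustrative reviews for two users
toy = pd.DataFrame({
    "review_profilename": ["ann", "ann", "ann", "bob", "bob", "bob"],
    "beer_beerid": [1, 2, 3, 1, 2, 4],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
    "review_time": [100, 200, 300, 110, 210, 310],
})

train, test = temporal_holdout(toy, n_latest=1)
print(test)
```

Because the held-out users still contribute their older reviews to training, the model has factors for every user it is evaluated on.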


In [70]:
MF = MatrixFactorization(save_path='./models/funkSVD/{}/'.format(datetime.now()), max_iterations=40)

In [71]:
MF.train(loaded_ratings, k=20)

user_factors are (33387, 20)
training matrix factorization at 2021-05-27 13:28:24.128178
160394
1425872
epoch in 0:00:27.967054, f=0, i=1 err=0.6447916959656839 valid_err=0.7021857554347545
epoch in 0:00:27.714722, f=0, i=2 err=0.635466635893825 valid_err=0.6952797983900791


KeyboardInterrupt: 

In [87]:
MF.recommend_items('Grandwazoo', loaded_ratings)

[(16814, {'prediction': 4.7787058848893675}),
 (1545, {'prediction': 4.728252205829542}),
 (21690, {'prediction': 4.723718898987746})]
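
The recommendations come back as beer ids. A minimal sketch of how they could be decoded into names is shown below, assuming a lookup built from the `beer_beerid` and `beer_name` columns. The beer names and predictions in this example are placeholders, not the real names or scores behind these ids.

```python
import pandas as pd

# Placeholder lookup data; in the notebook this would come from the review frame
beers = pd.DataFrame({
    "beer_beerid": [16814, 16814, 1545],
    "beer_name": ["Placeholder Beer A", "Placeholder Beer A", "Placeholder Beer B"],
})

# Same shape as the output of recommend_items: (beer id, {'prediction': score})
recs = [(16814, {"prediction": 4.78}), (1545, {"prediction": 4.73})]

# One name per beer id
names = beers.drop_duplicates("beer_beerid").set_index("beer_beerid")["beer_name"]

decoded = [(names[beer_id], round(info["prediction"], 2)) for beer_id, info in recs]
print(decoded)
```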

In [35]:
uf = pd.DataFrame(MF.user_factors,
                          index=MF.user_ids)

In [36]:
uf

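| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LyriCa1z | 0.100135 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Grandwazoo | 0.099878 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| nikoelnutto | 0.100000 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| JP024 | 0.100000 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Davme76 | 0.099156 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PPIBrian | 0.100000 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| jostnix | 0.099714 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| nothingxs | 0.100277 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| AgentDark | 0.099344 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |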
