## Capstone Project: Collaborative Filtering and Content-Based Recommender System

Author: Uldis Knox

Date: December 11th, 2022

## Table of Contents

1. [Data Cleaning](#http://localhost:8889/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Data-Cleaning) <br>
2. [Data Matrix](#http://localhost:8888/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Create-a-Data-Matrix) <br>
3. [Feature Matrix](#http://localhost:8888/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Feature-Matrix) <br>
4. [Sample Means & Populate Matrices](#http://localhost:8888/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Recommender-System) <br>
5. [Define Data Matrix & compute User Preferences](#http://localhost:8888/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Define-Data-Matrix-and-compute-User-Preferences)
6. [Recommender System](#http://localhost:8888/notebooks/Desktop/Capstone/Capstone%20Recommender%20System.ipynb#Recommender-System) <br>

**Recommender system and collaborative filtering - User Recommendations**

In a collaborative filtering based recommendation system, we will compute preferences for each user (reviewer == user) and then predict the rating for each beer (that the particular user has not rated before). Based on the predictions, we will recommend the beers that the user hasn't tried before and they may like based on their preferences.

---

---

**Import Packages**

In [1]:
#load in the packages we will need for this recommender system
import numpy as np
import sklearn
import pandas as pd
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
import timeit
import warnings
warnings.filterwarnings("ignore")

In [2]:
#load the dataframe
df_beer = pd.read_csv('data/capstone.csv')

In [3]:
#Check the dataframe
df_beer.head()

Unnamed: 0,brewery_name,city,country,beer_beerid,beer_name,beer_style,taste,beer_abv,review_overall,review_aroma,review_appearance,review_palate,review_taste,review_profilename
0,Vecchio Birraio,Campo San Martino PD,Italy,47986,Sausa Weizen,Hefeweizen,sweet and fruity with notes of banana and clov...,5.0,1.5,2.0,2.5,1.5,1.5,stcules
1,Vecchio Birraio,Campo San Martino PD,Italy,48213,Red Moon,English Strong Ale,"Malty-sweet with fruity esters, often with a c...",6.2,3.0,2.5,3.0,3.0,3.0,stcules
2,Vecchio Birraio,Campo San Martino PD,Italy,48215,Black Horse Black Beer,Foreign / Export Stout,"roasted grain and malt flavor with a coffee, c...",6.5,3.0,2.5,3.0,3.0,3.0,stcules
3,Vecchio Birraio,Campo San Martino PD,Italy,47969,Sausa Pils,German Pilsener,"bitter taste, malty sweetness, floral hop aroma",5.0,3.0,3.0,3.5,2.5,3.0,stcules
4,Caldera Brewing Company,"Ashland, OR",USA,64883,Cauldron DIPA,American Double / Imperial IPA,"floral, citrus, stone fruit, spicy, pine/resin...",7.7,4.0,4.5,4.0,4.0,4.5,johnmichaelsen


In [4]:
# Split data frame into individual frames
reviews = dict()
for col in df_beer.columns:
    reviews.update({
            col:df_beer[col]
        })

## Data Cleaning

In [9]:
# Define Margin of Error and Z-score for 95% confidence interval
mError = 0.1
zScore = 1.96

def prep_data_frame(_list):
    
    _dict = {header: reviews[header] for header in _list}
    return pd.DataFrame.from_dict(_dict)
    
def calculate_stats(key, data_frame):
    _df = data_frame.groupby(level=0)
    samples = _df.count().rename(columns={key: 'count'})
    means = _df.mean().rename(columns={key: 'mean'})
    std = _df.std().rename(columns={key:'std'})
    return pd.concat([samples, means, std], axis=1)
    

beer_identifiers = df_beer[['beer_beerid','beer_name', 'beer_style', 'review_profilename']]
reviews_means = dict()
for key in ['review_overall', 'review_aroma', 'review_taste', 'review_appearance', 'review_palate']:
    
    # Prepare Data Frame for each review
    ids = prep_data_frame(beer_identifiers)
    review = prep_data_frame([key])
    data_frame = pd.concat([ids, review], axis=1).drop_duplicates(['beer_beerid','review_profilename'])
    
    # Filter rows if number of reviews meet certain criteria
    stats = calculate_stats(key, data_frame.set_index(["beer_beerid","beer_name"]))
    stats = stats[stats['std'] != 0] # Remove rows with zero std dev
    stats['required'] = stats['std'].map(lambda x:(x *zScore/mError)**2) # Add a new row with required num samples
    beer_ids = [idx for idx in stats.index if stats.loc[idx, 'count'] > stats.loc[idx, 'required']]
    mean_values = [stats.loc[idx, 'mean'] for idx in beer_ids]
    
    #(line 28-32) similar code found here and appropriated for this project: https://github.com/IamMrandrew/recommender-system-collaborative-filtering

    # Drop duplicate beerids and reviewer profilenames 
    data_frame = data_frame.drop_duplicates(['beer_beerid']).drop('review_profilename', axis=1)

    # Keep the beers that have minimum number of reviews to predict ratings with 95% confidence interval
    review_data_frame = data_frame.set_index(['beer_beerid'])

    review_data_frame = review_data_frame.drop([Id for Id in review_data_frame.index if Id not in beer_ids])

    
    #Add DataFrames for each attribute reviews
    reviews_means.update({
        key : review_data_frame.reset_index()
    })

The margin of error is defined a the range of values below and above the sample statistic in a confidence interval. The confidence interval is a way to show what the uncertainty is with a certain statistic (i.e. from a poll or survey).

Z-score indicates how much a given value differs from the standard deviation. The Z-score, or standard score, is the number of standard deviations a given data point lies above or below mean.

Our above code cell is essentially additional data cleaning to get our data packaged in the most appropriate way for our recommender system to process it and return recommendations based on the user/reviewers existing content (reviews).

## Data Matrix

In [10]:
samplesDF = df_beer[["beer_beerid","beer_name","review_overall", "review_profilename"]]
samplesDF = samplesDF.drop_duplicates(["beer_beerid","review_profilename"])
samplesDF = samplesDF.set_index(["beer_beerid","beer_name"])
nSamples = samplesDF.groupby(level=0).count().to_dict()
sampleMeans = samplesDF.groupby(level=0).mean().to_dict()
sampleStdDev = samplesDF.groupby(level=0).std()
mError = 0.1
zScore = 1.96

sampleMeansTemp = {}
for key in nSamples.keys(): 
    if key == "review_overall": # we are only interested in overall_review
        for beerID in nSamples[key].keys(): # get the values - beer_beerid and overall review
            if sampleStdDev[key][beerID] > 0:
                nSamplesRequired = (sampleStdDev[key][beerID] * zScore/mError)**2
            if nSamples[key][beerID] > nSamplesRequired:
                sampleMeansTemp[beerID] =  sampleMeans[key][beerID]

# redefine sampleMeans by sorted overall_reviews 
sampleMeans = sorted(sampleMeansTemp.items(), key=lambda x: x[1] , reverse=True)

reviewBeerIDs = [beerKey[0] for beerKey in sampleMeans]
# drop the duplicate beerIDs 
newBeerDF = df_beer.drop_duplicates(["beer_beerid"])
beerIDsAll = newBeerDF.beer_beerid.tolist()

# list the iDs that we need to discard
discardBeerIDs = [beerID for beerID in beerIDsAll if beerID not in reviewBeerIDs]


the above code cell for data matriRows are the beer_beerids and the columns correspond to user ratings. Each cell in the matrix correspond to a beer and the overall rating for that beer by the user in that column.

## Feature Matrix

In [7]:
# Feature Matrix
featureDF = df_beer[["beer_beerid", "review_profilename",'review_appearance','review_aroma', 
                      'review_palate','review_taste','review_overall']]
featureDF = featureDF.drop_duplicates(["beer_beerid","review_profilename"])
featureDF = featureDF.set_index("beer_beerid")

# discard the beers that didn't meet our screening criterion of 95% confidence level
featureDF = featureDF.drop(discardBeerIDs)
featureDF = featureDF.reset_index()


# Make lists that match data matrix indices
beerIDList = sorted(featureDF.beer_beerid.unique())
profileList = featureDF.review_profilename.unique()

# Reindex the dataframe for extracting features.
featureDF = featureDF.set_index(["beer_beerid","review_profilename"])

#Debug Info:
print(len(beerIDList),len(profileList))

2528 28856


Feature matrix rows are beer_beerids and columns are features. It contains four features - appearance,aroma, palate, taste. The data in the columns corresponding to a particular beerID is the samplemeans of each feature from all the ratings given for that beer_beerid. We assume that the mean rating is proportional to the attributes of a given beer. (We will not use beer_abv here as one of the features as this is not part of a review score, but a value set by the brewery.)

## Sample Means & Populate Matrices

In [8]:
# features sampleMeans
featuresDict = featureDF.groupby(level=0).mean().to_dict()
appearanceSampleMeans = featuresDict['review_appearance']
aromaSampleMeans = featuresDict['review_aroma']
palateSampleMeans = featuresDict['review_palate']
tasteSampleMeans = featuresDict['review_taste']

# Construct a numpy matrix with features Sample Means
featureMatrix = np.zeros(len(beerIDList*5)).reshape(len(beerIDList),5)
featuresMeansDicts = [appearanceSampleMeans,aromaSampleMeans,
                      palateSampleMeans,tasteSampleMeans]

# Populate the first element of the feature matrix with beerID
for beerIndex in range(len(beerIDList)):
    featureMatrix[beerIndex][0] = beerIDList[beerIndex]
    
featureIndex = 1 # feature index in feature Matrix
for featureDict in featuresMeansDicts:
    for beerIndex in range(len(beerIDList)):
        for key in featureDict.keys():
            if key == beerIDList[beerIndex]:
                featureMatrix[beerIndex][featureIndex] = featureDict[key]
    featureIndex += 1
#learned about feature indices and converting dataframes to matrices here: https://gist.github.com/danieljfarrell

# Add bias column in the featureMatrix
featureMatrix = np.insert(featureMatrix,1,1, axis=1)
#adding bias was found here: https://github.com/eliben/deep-learning-samples

Calculate sample means of overall ratings for a given beer.

Include only the beers where we can calculate the sample mean within a certain margin of error.

Chose the beers with reviews greater than a certain threshold number of reviews (min number of samples) that are required to predict the sample mean with 95% confidence interval.

Reduce the sample set by assigning the mean value as the overall rating for that beer.

## Define Data Matrix and compute User Preferences

Constructing a data matrix with all users and all beers takes a lot of time and memory. We will take a sample from our users and evaluate the paramater matrix that corresponds to the user's preferences.

For the total number of users in the list, ~30000, it takes lot of time (> 5 hrs). This is why in line 13 below, we are using '101' which is the first ~100 users in our list.

In [14]:
#Create a dictionary object that holds users (review_profileNames) and corresponding preference matrices
userPreferences = {}

# we will use this sampleMeans for mean normalization in Data Matrix
sampleMeans = featuresDict['review_overall'] 

# Generating user preferences for first ~100 users
dataDF = featureDF.drop(['review_appearance','review_aroma', 
                      'review_palate','review_taste'], axis=1)

#Dict object to hold a subset of profileNames with R^2 value greater than a certain threshold
bestScores = {}
for profile in profileList[:101]:
    dataMatrix = np.zeros(len(beerIDList*2)).reshape(len(beerIDList),2)
    for beerIndex in range(len(beerIDList)):
        dataMatrix[beerIndex][0] = beerIDList[beerIndex]
        try:
            dataMatrix[beerIndex][1] = dataDF.loc[beerIDList[beerIndex],profile].tolist()[0]
        except KeyError:
            dataMatrix[beerIndex][1] = 0.0
            
#lines 21-29: similar code found here and was used for this project: https://github.com/proback/BeyondMLR
            
    # X and y matrices for linear regression
    # Including all the rows results in poor R^2 values
    # Only the rows with reviews are included in the fit
    # Bias term not included
    y = np.array([dataMatrix[i][1] for i in range(dataMatrix.shape[0]) if dataMatrix[i][1] > 0])
    X = np.array([featureMatrix[i][1:] for i in range(featureMatrix.shape[0]) if dataMatrix[i][1] > 0])

    
    # linear regression to compute parameter matrix
    regressor = LinearRegression()
    regressor.fit(X,y)
    score = regressor.score(X,y)
    userPreferences[profile] = [regressor.coef_,score]
    
    # we will populate the dict with profile names whose scores are above a certain threshold
    if score > 0.5:
        bestScores[profile] = [profile,score]


Predicting the review based on featureMatrix and userPreferences.                                           

After populating the userPreferences dictionary with user preferences, for a given user, we can predict what their rating could be based on the feature matrix and their preferences.

For example, the 10th user in profileList has rated  ~40 beers out of 3000 in the list. Based on the predictions, we will recommend other beers that they may like.

## Recommender System

In [17]:
# Randomly select a user in the list of profiles and get recommendations

import random

if len(list(bestScores.keys())) > 0:
    profileName  = list(bestScores.keys())[random.randrange(0,len(list(bestScores.keys())))]
    r2score = bestScores[profileName][1]
else:
    profileName = profile
    r2score = score

# List of beers that weren't rated by the user
beersNotRated = np.array([featureMatrix[i][0] for i in range(featureMatrix.shape[0])if dataMatrix[i][1] == 0])

# Feature list of beers
X_notRated = np.array([featureMatrix[i][1:] for i in range(featureMatrix.shape[0]) if dataMatrix[i][1] == 0])

# computed userPreferences from our regression analysis
userPref = userPreferences[profileName][0]
userPref = userPref[:,np.newaxis]

# Compute predicted ratings
y_predRatings = np.dot(X_notRated, userPref)
# We will cap the rating at 5.0. 
#Our regression analysis has predicted values over 5
for i in range(y_predRatings.shape[0]):
    if y_predRatings[i] > 5.0:
        y_predRatings[i] = 5.0

# prepare a new dataFrame to display the top recommendations
predDF = pd.DataFrame()
# Add two columns
predDF['beerID'] = [int(beersNotRated[i]) for i in range(beersNotRated.shape[0])]
predDF['predRating'] = [y_predRatings[i][0] for i in range(beersNotRated.shape[0])]
# Sort in descending order. Main column is the predicted overall rating
predDF.sort_values(by='predRating', ascending=False, inplace=True)

#Extract beerIDs to collect beerName and beerStyle information from original dataFrame 
predBeerIDs = predDF.beerID.tolist() 

# Add other columns to the recommendation dataframe
avgAppearance = [appearanceSampleMeans[i] for i in predBeerIDs]
avgAroma = [aromaSampleMeans[i] for i in predBeerIDs]
avgPalate = [palateSampleMeans[i] for i in predBeerIDs]
avgTaste = [tasteSampleMeans[i] for i in predBeerIDs]
beerNames = [df_beer[df_beer.beer_beerid == i].beer_name.tolist()[0] for i in predBeerIDs]
breweryNames = [df_beer[df_beer.beer_beerid == i].brewery_name.tolist()[0] for i in predBeerIDs]
beerStyles = [df_beer[df_beer.beer_beerid == i].beer_style.tolist()[0] for i in predBeerIDs]
breweryCity = [df_beer[df_beer.beer_beerid == i].city.tolist()[0] for i in predBeerIDs]
breweryCountry = [df_beer[df_beer.beer_beerid == i].country.tolist()[0] for i in predBeerIDs]
predDF['BreweryName'] = breweryNames
predDF['country'] = breweryCountry
predDF['city'] = breweryCity
predDF['BeerStyle'] = beerStyles
predDF['BeerName'] = beerNames
predDF['Aroma'] = avgAroma
predDF['Appearance'] = avgAppearance
predDF['Palate'] = avgPalate
predDF['Taste'] = avgTaste



In [18]:
print ("ProfileName: ", profileName)
print ("R^2 value: ",r2score)
topRecommendations = predDF.head(10).set_index("BeerName")
topRecommendations.drop("beerID", axis=1, inplace=True)
topRecommendations


ProfileName:  ReelBigwigFish
R^2 value:  0.6916225272992417


Unnamed: 0_level_0,predRating,BreweryName,country,city,BeerStyle,Aroma,Appearance,Palate,Taste
BeerName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Stone 8th Anniversary Ale,5.0,Stone Brewing Co.,USA,"Escondido, CA",American Brown Ale,3.997917,3.985417,3.979167,4.064583
Brooklyn Local 1,5.0,Brooklyn Brewery,USA,"Brooklyn, NY",Belgian Strong Pale Ale,4.036168,4.182107,4.050761,4.107868
Smuttynose Homunculus (Big Beer Series),5.0,Smuttynose Brewing Company,USA,"Hampton, NH",Belgian IPA,3.873913,3.834783,3.778261,3.847826
Nor' Easter,5.0,Captain Lawrence Brewing Co.,USA,"Elmsford, NY",Belgian Strong Dark Ale,4.165803,4.07513,4.059585,4.19171
Mojo Risin' Double IPA,5.0,Boulder Beer / Wilderness Pub,USA,"Boulder, CO",American Double / Imperial IPA,3.892954,3.926829,3.864499,3.837398
Stone Soup,5.0,New Glarus Brewing Company,USA,"New Glarus, WI",Belgian Pale Ale,3.773897,3.693015,3.727941,3.735294
Sexual Chocolate,5.0,Foothills Brewing Company,USA,"Winston-Salem, NC",Russian Imperial Stout,4.067143,4.254286,4.09,4.144286
Unibroue 16,5.0,Unibroue,Canada,"Chambly, QC",Belgian Strong Pale Ale,4.089069,4.109312,3.997976,4.09919
Doppelbock Dunkel,5.0,Brauerei Schloss Eggenberg,Austria,Vorchdorf,Doppelbock,3.632479,3.803419,3.666667,3.709402
Schlafly Irish-Style Extra Stout,5.0,Saint Louis Brewery / Schlafly Tap Room,USA,"Saint Louis, MO",Foreign / Export Stout,3.81982,3.905405,4.013514,4.04955


Our recommender system is doing as it is supposed to, finding beers in the dataframe the user hasn't reviewed, but based on their existing reviews the system believes the user will like. This is shown through the predicted predicted ratings next to each of the recommended beers.

What our recommender system doesn't do, because it cannot, is identify beers that would be available in the same region/location as the user. We were not supplied with the User geo-location information in the dataset, so perhaps our recommender system is also a potential travel guide, identifying locations with breweries that serve beers the user will like.

Our recommender system also outputs the ProfileName for the user, because we need to know who we are recommending beers to, otherwise we are just making recommendations based on reviews, not for specific Users.

We are also outputting an R^2. This number is to show us how well the model is explaining the observed data. From the output above, we see an R^2 value of 0.6916 or 69.16%. This means that 69.16% of the variability observed in the target variable is explained by the regression model.

**Conclusion:** In this notebook we can constructed and successfully run a Collaborative Filtering and Content-Based Recommender System for a dataset of beer reviews and returns recommendations for similar beers in the data that a User has not yet reviewed.

We cleaned our dataframe further, established a Data Matrix and Feature Matrix, populated our Matrices, defined our Data Matrix and calculated our User Preferences, finally we executed our recommender system, with an R^2 value that is slightly below 0.7.

In the finance world, an R^2 above 0.7 is generally seen as showing a high correlation, so I am happy to take an R^2 of 0.6916 as a result for my beer recommender system.

This concludes my beer recommender capstone project.  


Thank you.

---