![header](./images/beertaps.png)

# Beer-Recommendation System: Collaborative Filtering Recommendations
Author: Ashli Dougherty 

# Overview

This project's goal is to build a recommendation system for the beer enthusiast. I am interested in creating both a content-based and a collaborative filtering recommendation system. 
- A content-based system makes recommendations based on a beer's features. It will allow any user to enter a beer or characteristic and get back the names of other beers they will (hopefully) enjoy drinking.  
- A collaborative system recommends items based on the ratings of other users. It compares beer drinker/reviewer profiles and then recommends items based on the similarity between these users. 


# Business Understanding

As of December 2021, there are more than [9,000 breweries](https://vinepair.com/booze-news/us-record-number-breweries-2021/#:~:text=Even%20after%20the%20setbacks%20of,beer%20producers%20in%20the%20U.S.) in the US alone. Even though some taprooms were forced to shut their doors during the pandemic, the craft beer business is still going strong. The [Brewers Association](https://www.brewersassociation.org/statistics-and-data/national-beer-stats/) expects craft breweries' share of volume to increase in the post-pandemic market, and reported that craft beer retail sales were over $26 billion in 2021.    
  
Currently, there are mobile apps (like [Untappd](https://untappd.com/)) and websites (like [Beer Advocate](https://www.beeradvocate.com/)) that allow you to personally track and rate the beer you try, but consumers should also be able to choose their next sip (or pint) with confidence. There are so many options on the market that choosing which beverage to buy next, which brewery to visit in person, or which booth to stand in line for at a festival can seem overwhelming. My goal is to provide a system that helps beer enthusiasts find new beers they are likely to love. Cheers!

# Collaborative Filtering

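Collaborative filtering predicts how a user will rate a beer from the ratings left by other users, rather than from the beer's own features. This notebook loads the cleaned reviews, scores two baseline models, and then compares Surprise's prediction algorithms by RMSE, tuning them with grid searches.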
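
The idea of comparing reviewer profiles can be illustrated with a small sketch. The users, beers, and ratings below are hypothetical, and cosine similarity over co-rated beers is only one of several similarity measures; it is an assumption here, not necessarily the measure used later in this notebook.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {beer: rating}. Not taken from the dataset.
ratings = {
    "ana": {"stout": 4.5, "ipa": 2.0, "pils": 4.0},
    "ben": {"stout": 4.0, "ipa": 2.5, "pils": 4.5, "porter": 5.0},
    "cam": {"stout": 1.5, "ipa": 5.0, "porter": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the beers both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

# How closely does each other reviewer's profile match ana's?
sim_ben = cosine_similarity(ratings["ana"], ratings["ben"])
sim_cam = cosine_similarity(ratings["ana"], ratings["cam"])
```

In this toy data ben's profile is much closer to ana's than cam's is, so ben's high rating for the porter would carry more weight when recommending a beer to ana.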

# Imports & Functions

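Standard data and plotting imports, plus Surprise, a Python library for building and evaluating recommender systems from explicit ratings. It supplies the data loaders, prediction algorithms, and model selection tools used below.
[Documentation here.](https://surprise.readthedocs.io/en/stable/index.html)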

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import BaselineOnly, NormalPredictor

In [2]:
from surprise.prediction_algorithms import KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore

In [None]:
# matrix factorization algorithms used in the SVD and SVD++ sections below
from surprise.prediction_algorithms import SVD, SVDpp, NMF

# Load Data
Loading 'Reviews' CSV created from the [DataPrep Notebook](./DataPrep.ipynb). For use within the recommendation model the only columns that will be kept are **'review_overall', 'beer_id', and 'user_id'.**

In [3]:
df = pd.read_csv('../BeerData/reviews_cleaned.csv')
df.drop(columns='Unnamed: 0', inplace=True)
df.head()

|   | brewery_id | brewery_name | review_overall | beer_style | beer_name | beer_abv | beer_id | user_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 10325 | Vecchio Birraio | 1.5 | Wheat Beer | Sausa Weizen | 5.0 | 47986 | 13329 |
| 1 | 10325 | Vecchio Birraio | 3.0 | Strong Ale | Red Moon | 6.2 | 48213 | 13329 |
| 2 | 10325 | Vecchio Birraio | 3.0 | Stout | Black Horse Black Beer | 6.5 | 48215 | 13329 |
| 3 | 10325 | Vecchio Birraio | 3.0 | Pale Lager | Sausa Pils | 5.0 | 47969 | 13329 |
| 4 | 1075 | Caldera Brewing Company | 4.0 | India Pale Ale | Cauldron DIPA | 7.7 | 64883 | 10108 |


In [6]:
#checking shape
df.shape

(1475492, 8)

In [7]:
#checking info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1475492 entries, 0 to 1475491
Data columns (total 8 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   brewery_id      1475492 non-null  int64  
 1   brewery_name    1475492 non-null  object 
 2   review_overall  1475492 non-null  float64
 3   beer_style      1475492 non-null  object 
 4   beer_name       1475492 non-null  object 
 5   beer_abv        1475492 non-null  float64
 6   beer_id         1475492 non-null  int64  
 7   user_id         1475492 non-null  int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 90.1+ MB


# Train Test Split
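Surprise provides its own `train_test_split`, which must be used in place of scikit-learn's. The dataframe is first loaded into a Surprise `Dataset` with the columns in the order user, item, rating, and a `Reader` sets the rating scale from 1 to 5.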

In [8]:
# instantiate a 'Reader' to read in the data so Surprise can use it
# rating_scale will be bounded by 1 and 5
reader = Reader(rating_scale=(1,5))

In [9]:
#loading relevant columns from dataframe
data = Dataset.load_from_df(df[['user_id', 'beer_id', 'review_overall']], reader)

In [10]:
# use surprise train test split 
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
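
As a rough illustration of what a seeded 80/20 split does to a list of (user, item, rating) rows, here is a minimal standard-library sketch. The `split_ratings` helper and the toy rows are hypothetical; Surprise's own `train_test_split` returns a `Trainset` object and a list of test tuples, and its internals are not reproduced here.

```python
import random

def split_ratings(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed and hold out a fraction for testing."""
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]

# Hypothetical toy data: 10 users x 5 beers, ratings between 3.0 and 4.0.
rows = [(u, b, 3.0 + (u + b) % 3 * 0.5) for u in range(10) for b in range(5)]
train_rows, test_rows = split_ratings(rows)
```

Every row ends up in exactly one of the two lists, and calling the function again with the same seed returns the same split.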

In [11]:
# examining number of users and items in train set
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  14548 

Number of items:  45408
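
These counts give a sense of how sparse the user-item matrix is. The arithmetic below uses the figures printed above and assumes the train set holds roughly 80% of the 1,475,492 reviews; the exact train set size is not shown in this notebook.

```python
# Figures taken from the outputs above.
n_users, n_items = 14548, 45408
n_reviews = 1475492

# Assumption: test_size=0.2 leaves about 80% of the reviews for training.
approx_train = int(n_reviews * 0.8)

# Every possible (user, beer) pair versus the pairs that actually have a rating.
cells = n_users * n_items
density = approx_train / cells

print(f"possible user-beer pairs: {cells:,}")
print(f"approximate density: {density:.4%}")
```

Under that assumption only about 0.18% of the possible user-beer pairs have a rating, which is typical of the sparse data collaborative filtering has to work with.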


# Baseline Models


### Baseline Only

In [12]:
#instantiate model 
baseline = BaselineOnly()

#fit and get predictions 
predictions = baseline.fit(trainset).test(testset)

#score baseline model
baseline_score = accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.5962
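
A baseline estimate predicts a rating as the global mean plus a user bias and an item bias. The sketch below shows that structure on hypothetical toy ratings using simple unregularized averages; it is a simplification and not the alternating least squares procedure that Surprise reports using above.

```python
from collections import defaultdict

# Hypothetical toy training ratings: (user, beer, rating).
train = [
    ("u1", "stout", 4.5), ("u1", "ipa", 4.0),
    ("u2", "stout", 3.5), ("u2", "ipa", 3.0), ("u2", "pils", 3.5),
    ("u3", "pils", 4.5),
]

# Global mean rating.
mu = sum(r for _, _, r in train) / len(train)

# Item bias: average deviation of a beer's ratings from the global mean.
item_dev = defaultdict(list)
for _, i, r in train:
    item_dev[i].append(r - mu)
b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}

# User bias: average of what remains after removing the mean and item bias.
user_dev = defaultdict(list)
for u, i, r in train:
    user_dev[u].append(r - mu - b_i[i])
b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}

def baseline_estimate(user, item, low=1.0, high=5.0):
    """mu + b_u + b_i, clipped to the rating scale; unknown ids get zero bias."""
    est = mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)
    return min(high, max(low, est))

estimate = baseline_estimate("u1", "pils")
```

Here `u1` tends to rate above average and the pils is rated slightly above average, so the estimate for that unseen pair lands above the global mean.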


### Normal Predictor

In [13]:
#instantiate model 
normal = NormalPredictor()

#fit and get predictions 
predictions2 = normal.fit(trainset).test(testset)

#score normal predictor model
normal_score = accuracy.rmse(predictions2)

RMSE: 0.9813


> **Baseline evaluation**: The BaselineOnly model predicted a beer's rating with an RMSE of approximately 0.6 points, about 0.4 lower (better) than the NormalPredictor model's RMSE of 0.98.
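
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it on hypothetical (true, estimated) pairs and then the gap between the two baseline scores reported above.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

# Hypothetical predictions, not taken from the models above.
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
score = rmse(pairs)

# Gap between the NormalPredictor and BaselineOnly scores reported above.
gap = 0.9813 - 0.5962
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.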

# Determining the Best Model
I will start with the k-nearest neighbors algorithms in Surprise's library, since they implement the idea of collaborative filtering directly: a rating is predicted from the ratings of the most similar users.

After this, a grid search will be run to determine the best hyperparameters, and the RMSE of each model will be compared. 
Documentation for Surprise k-NN inspired algorithms can be found [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html). 

The documentation was used to determine which parameters to use during the grid searches. Grid searches are set up the same way as in sklearn: a dictionary of parameters to tune is created and passed to GridSearchCV, from which the best parameters and scores can be retrieved and evaluated.
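
A basic k-NN prediction is a similarity-weighted average of the ratings that a user's nearest neighbors gave the target beer. The sketch below assumes the similarities have already been computed; the `knn_basic_estimate` helper and the neighbor values are hypothetical, while `k=40` and `min_k=1` mirror the defaults documented for Surprise's `KNNBasic`.

```python
def knn_basic_estimate(neighbors, k=40, min_k=1, fallback=None):
    """Similarity-weighted average over the k most similar raters.

    neighbors: list of (similarity, rating) pairs for users who rated the beer.
    Returns `fallback` when fewer than min_k usable neighbors exist.
    """
    # Keep the k most similar neighbors, then drop non-positive similarities.
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    top = [(sim, rating) for sim, rating in top if sim > 0]
    if not top or len(top) < min_k:
        return fallback
    weighted = sum(sim * rating for sim, rating in top)
    return weighted / sum(sim for sim, _ in top)

# Hypothetical neighbors of one user for one beer.
neighbors = [(0.9, 4.5), (0.6, 4.0), (0.1, 2.0), (-0.2, 1.0)]
estimate = knn_basic_estimate(neighbors, k=2)
```

With `k=2` only the two most similar reviewers contribute, so the estimate sits between their ratings and closer to the more similar one. In practice the fallback would be something like the global mean rating.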

## SVD & Gridsearching

In [15]:
#instantiate & cross validate
#KNNBasic is deterministic, so no random_state is passed
knnbasic = KNNBasic()


In [16]:
cv_results = cross_validate(knnbasic, data, measures=['rmse'], cv=3, n_jobs=-1, verbose=True)

KeyboardInterrupt: 

In [None]:
# instantiate untuned SVD, fit to train set, and get predictions with test set
svd = SVD(random_state=42)
predictions_svd = svd.fit(trainset).test(testset)

# score svd model
svd_results = accuracy.rmse(predictions_svd)

> RMSE for untuned SVD is slightly worse than baseline model. 

In [None]:
# creating svd gridsearch parameters
svd_params = {
    'n_factors': [25, 50, 100],
    'n_epochs': [10, 20, 40],
    'lr_all': [.005, .05],
    'reg_all':  [.02, .05], 
    'biased': [True, False]
}

Parameters being tested in first grid search and their defaults: 
> n_factors: the number of factors, default is 100
>
> n_epochs: the number of iterations, default is 20
>
> lr_all: the learning rate for all parameters, default is 0.005
>
> reg_all: the regularization rate for all parameters, default is 0.02
>
> biased: whether to use baselines (biases), default is True
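
To show where these parameters enter the model, here is a minimal sketch of one stochastic gradient descent step for a biased matrix factorization model, where the prediction is the global mean plus user bias, item bias, and the dot product of the user and item factor vectors. The function names and starting values are hypothetical; the update rule follows the one described in the Surprise documentation for SVD, with `lr_all` and `reg_all` applied to every parameter.

```python
def predict(mu, b_u, b_i, p_u, q_i, biased=True):
    """Biased: mu + b_u + b_i + q_i . p_u; unbiased: q_i . p_u only."""
    dot = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + dot if biased else dot

def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr_all=0.005, reg_all=0.02):
    """One SGD update of the biases and factor vectors for a single rating r."""
    err = r - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    # Both factor updates use the values from before the step.
    new_p = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p, new_q

# Hypothetical starting point with n_factors=2 and a true rating of 4.5.
mu, b_u, b_i = 3.8, 0.0, 0.0
p_u, q_i = [0.1, -0.1], [0.1, 0.1]
old_err = abs(4.5 - predict(mu, b_u, b_i, p_u, q_i))

b_u, b_i, p_u, q_i = sgd_step(4.5, mu, b_u, b_i, p_u, q_i)
new_err = abs(4.5 - predict(mu, b_u, b_i, p_u, q_i))
```

`n_factors` sets the length of `p_u` and `q_i`, and `n_epochs` is the number of passes in which this step is repeated over every training rating.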

In [None]:
# instantiate and fit grid search on data
svd_gs = GridSearchCV(SVD, param_grid=svd_params,
                      cv=3, joblib_verbose=10, n_jobs=-1)
svd_gs.fit(data)
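
A grid search fits every combination in the parameter dictionary once per fold, so it helps to estimate the cost before running it. The sketch below copies the `svd_params` grid from above and counts combinations with `itertools.product`; the fit count assumes one fit per combination per fold with `cv=3`.

```python
from itertools import product

# Copied from the svd_params cell above.
svd_params = {
    'n_factors': [25, 50, 100],
    'n_epochs': [10, 20, 40],
    'lr_all': [.005, .05],
    'reg_all': [.02, .05],
    'biased': [True, False],
}

# Expand the grid into one dict per parameter combination.
combos = [dict(zip(svd_params, values))
          for values in product(*svd_params.values())]

# Assumption: one model fit per combination per cross-validation fold.
n_folds = 3
n_fits = len(combos) * n_folds
print(len(combos), "combinations,", n_fits, "fits")
```

This grid has 72 combinations, or 216 model fits across three folds, which explains why the search is slow on roughly 1.5 million reviews.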

In [None]:
svd_gs.best_params['rmse']

In [None]:
svd_gs.best_score['rmse']

### SVD Model with Best Parameters

In [None]:
# instantiating model using best parameters
svd_gs_model = SVD(n_factors=100, n_epochs=40, lr_all=0.005, 
                   reg_all=0.05, biased=True, random_state=42)

# fitting on train set and getting predictions from test 
svd_gs_predictions = svd_gs_model.fit(trainset).test(testset)

# scoring tuned model
svd_gs_model_score = accuracy.rmse(svd_gs_predictions)


> Surprisingly, the tuned SVD model scored only slightly better than the baseline model (RMSE: 0.5962). To see whether another model can reach a lower RMSE, I will move on to the SVD++ model.

## SVD++ & Gridsearching

In [None]:
#instantiate & cross validate
svdpp = SVDpp(random_state=42)
cv_results = cross_validate(svdpp, data, measures=['rmse'], cv=3, n_jobs=-1, verbose=True)

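In [None]:
# fit SVD++ to train set and get predictions with test set
predictions_svdpp = svdpp.fit(trainset).test(testset)

# score svd++ model
svdpp_results = accuracy.rmse(predictions_svdpp)

> The RMSE recorded for this cell was 0.5966, exactly the score of the untuned SVD model. The match most likely came from fitting `svd` rather than `svdpp` in that run, so the cell needs to be rerun with `svdpp` before the two models are compared.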

In [None]:
# creating svd++ gridsearch parameters
svdpp_params = {
    'n_factors': [20, 50, 100],
    'n_epochs': [10, 20, 40],
    'lr_all': [0.025, .005, .007],
    'reg_all': [.01, .02, .05]
}

Parameters being tested in first grid search and their defaults: 
> n_factors: the number of factors, default is 20
>
> n_epochs: the number of iterations, default is 20
>
> lr_all: the learning rate for all parameters, default is 0.007
>
> reg_all: the regularization rate for all parameters, default is 0.02
>

In [None]:
# instantiate and fit grid search on data
svdpp_gs = GridSearchCV(SVDpp, param_grid=svdpp_params,
                        cv=3, joblib_verbose=10, n_jobs=-1)
svdpp_gs.fit(data)

In [None]:
svdpp_gs.best_params['rmse']

In [None]:
svdpp_gs.best_score['rmse']

### SVD++ Model with Best Parameters