Building a Recommendation System in Python
============================
> In this tutorial we'll show you how to build a recommendation system using pandas, scikit-learn, and numpy. We've provided a dataset of beer reviews which we'll use for building our product recommender, but this use case could be easily substituted with a different product.

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# needed later by calculate_similarity
from sklearn.metrics.pairwise import euclidean_distances

<h2><a href="https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz">Download the data</a></h2>
<p>Grab the dataset from our data demos bucket on S3, then decompress it. It will create a directory called ~/Downloads/beer_reviews.</p>

In [25]:
cd /Users/mylesgartland/OneDrive - Rockhurst University/Courses/Predictive Models

/Users/mylesgartland/OneDrive - Rockhurst University/Courses/Predictive Models


In [26]:
# Assumes beer_reviews.csv is in the current working directory; adjust the path for your machine (Windows paths look different)
df = pd.read_csv("beer_reviews.csv", encoding='iso-8859-1')
df.head(150)

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883
5,1075,Caldera Brewing Company,1325524659,3.0,3.5,3.5,oline73,Herbed / Spiced Beer,3.0,3.5,Caldera Ginger Beer,4.7,52159
6,1075,Caldera Brewing Company,1318991115,3.5,3.5,3.5,Reidrover,Herbed / Spiced Beer,4.0,4.0,Caldera Ginger Beer,4.7,52159
7,1075,Caldera Brewing Company,1306276018,3.0,2.5,3.5,alpinebryant,Herbed / Spiced Beer,2.0,3.5,Caldera Ginger Beer,4.7,52159
8,1075,Caldera Brewing Company,1290454503,4.0,3.0,3.5,LordAdmNelson,Herbed / Spiced Beer,3.5,4.0,Caldera Ginger Beer,4.7,52159
9,1075,Caldera Brewing Company,1285632924,4.5,3.5,5.0,augustgarage,Herbed / Spiced Beer,4.0,4.0,Caldera Ginger Beer,4.7,52159


In [27]:
df.shape

(1586614, 13)

In [82]:
import warnings
warnings.filterwarnings("ignore")

# Check for and remove nulls

In [83]:
df.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64

In [84]:
df.dropna(inplace=True)
df.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64

In [85]:
# count how many columns there are of each dtype
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()

Unnamed: 0,Column Type,Count
0,int64,3
1,float64,6
2,object,4


In [86]:
# fill any remaining missing numeric values with the column mean
# (a no-op here, since dropna() above already removed rows with nulls)
for attName in df.columns:
    if pd.api.types.is_numeric_dtype(df[attName]):
        df[attName] = df[attName].fillna(df[attName].mean())

## Finding People Who Have Reviewed 2 Beers

In [87]:
beer_1, beer_2 = "Dale's Pale Ale", "Fat Tire Amber Ale"

beer_1_reviewers = df[df.beer_name==beer_1].review_profilename.unique()
beer_2_reviewers = df[df.beer_name==beer_2].review_profilename.unique()
common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)
print("Users in the same set: %d" % len(common_reviewers))
list(common_reviewers)[:10]


['Samp01',
 'kawilliams81',
 'beveritt',
 'LiquidBread219',
 'Beefeater57',
 'BerkeleyBeerMan',
 'BWMKappaSig',
 'CraftBeerTastic',
 'prototypic',
 'Proteus93']
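
The cell above boils down to a set intersection over reviewer names. As a minimal sketch on made-up data (the `toy` frame and the `reviewers_of` helper are hypothetical and not part of the beer dataset), the same idea looks like this:

```python
import pandas as pd

# Made-up reviews, for illustration only
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "B"],
    "review_profilename": ["ann", "bob", "cat", "bob", "cat", "dan"],
})

def reviewers_of(frame, beer):
    """Return the set of profile names that reviewed the given beer."""
    return set(frame.loc[frame.beer_name == beer, "review_profilename"])

# reviewers present in both sets
common = reviewers_of(toy, "A") & reviewers_of(toy, "B")
print("Users in the same set: %d" % len(common))
print(sorted(common))
```

Here only `bob` and `cat` reviewed both beers, so the intersection has two members.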

## Extracting Reviews

In [88]:
beer_1, beer_2 = "Samuel Adams Octoberfest", "Caldera Ginger Beer"

beer_1_reviewers = df[df.beer_name==beer_1].review_profilename.unique()
beer_2_reviewers = df[df.beer_name==beer_2].review_profilename.unique()
common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)
print("Users in the same set: %d" % len(common_reviewers))
list(common_reviewers)[:10]

Users in the same set: 4


['LordAdmNelson', 'Reidrover', 'augustgarage', 'Halcyondays']

In [89]:
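def get_beer_reviews(beer, common_users):
    # keep only reviews of this beer written by the common reviewers
    mask = (df.review_profilename.isin(common_users)) & (df.beer_name==beer)
    # sort by reviewer so both beers' rows line up in the same order
    reviews = df[mask].sort_values('review_profilename')
    # keep one review per reviewer (the first one after sorting)
    reviews = reviews[~reviews.review_profilename.duplicated()]
    return reviews

beer_1_reviews = get_beer_reviews(beer_1, common_reviewers)
beer_2_reviews = get_beer_reviews(beer_2, common_reviewers)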

cols = ['beer_name', 'review_profilename', 'review_overall', 'review_aroma', 'review_palate', 'review_taste']
beer_2_reviews[cols].head()

Unnamed: 0,beer_name,review_profilename,review_overall,review_aroma,review_palate,review_taste
13,Caldera Ginger Beer,Halcyondays,4.0,4.5,2.5,3.0
8,Caldera Ginger Beer,LordAdmNelson,4.0,3.0,3.5,4.0
6,Caldera Ginger Beer,Reidrover,3.5,3.5,4.0,4.0
9,Caldera Ginger Beer,augustgarage,4.5,3.5,4.0,4.0


## Calculating Distance

In [95]:
ALL_FEATURES = ['review_overall', 'review_aroma', 'review_palate', 'review_taste', 'review_appearance']

In [96]:
warnings.filterwarnings("ignore")

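# overrides the list defined in the previous cell: review_appearance is left out,
# so calculate_similarity returns four distances
ALL_FEATURES = ['review_overall', 'review_aroma', 'review_palate', 'review_taste']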
def calculate_similarity(beer1, beer2):
    # find common reviewers
    beer_1_reviewers = df[df.beer_name==beer1].review_profilename.unique()
    beer_2_reviewers = df[df.beer_name==beer2].review_profilename.unique()
    common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)

    # get reviews
    beer_1_reviews = get_beer_reviews(beer1, common_reviewers)
    beer_2_reviews = get_beer_reviews(beer2, common_reviewers)
    # one Euclidean distance per rating feature, taken across the common reviewers
    dists = []
    for f in ALL_FEATURES:
        dists.append(euclidean_distances([beer_1_reviews[f]], [beer_2_reviews[f]])[0][0])
    return dists

calculate_similarity(beer_1, beer_2)

[0.8660254037844386, 1.118033988749895, 1.5, 1.3228756555322954]
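
Each number above is the Euclidean distance between two rating vectors, one per beer, with one entry per shared reviewer. As a minimal sketch of that calculation (the ratings are made up, and plain numpy stands in here for scikit-learn's `euclidean_distances`):

```python
import numpy as np

# Made-up overall ratings from three shared reviewers, in the same reviewer order
beer_a = np.array([4.0, 3.5, 4.5])
beer_b = np.array([4.0, 3.0, 4.0])

# Euclidean distance: square root of the sum of squared differences
manual = np.sqrt(np.sum((beer_a - beer_b) ** 2))

# the norm of the difference vector gives the same number
via_norm = np.linalg.norm(beer_a - beer_b)

print(manual, via_norm)
```

Because the sum runs over every shared reviewer, pairs of beers with many shared reviewers tend to produce larger values than pairs with only a few.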

## Calculate the Similarity for a Set of Beers

In [97]:
# calculate only a subset for the demo
warnings.filterwarnings("ignore")
beers = ["Dale's Pale Ale", "Sierra Nevada Pale Ale", "Michelob Ultra",
         "Natural Light", "Bud Light", "Fat Tire Amber Ale", "Coors Light",
         "Blue Moon Belgian White", "60 Minute IPA", "Guinness Draught",
         "Old Rasputin Russian Imperial Stout", "90 Minute IPA",
         "Sierra Nevada Celebration Ale", "Two Hearted Ale",
         "Arrogant Bastard Ale", "Pliny The Elder",
         "Sierra Nevada Bigfoot Barleywine Style Ale", "La Fin Du Monde",
         "Trappistes Rochefort 10", "Ayinger Celebrator Doppelbock",
         "St. Bernardus Abt 12", "Imperial Stout", "Samuel Adams Boston Lager",
         "Duvel", "Dead Guy Ale", "Orval Trappist Ale",
         "Weihenstephaner Hefeweissbier", "Budweiser",
         "Samuel Smith's Oatmeal Stout", "Samuel Adams Octoberfest"]

# calculate everything for real production
# beers = df.beer_name.unique()

simple_distances = []
for beer1 in beers:
    print("starting") 
    print(beer1)
    for beer2 in beers:
        if beer1 != beer2:
            row = [beer1, beer2] + calculate_similarity(beer1, beer2)
            simple_distances.append(row)

starting
Dale's Pale Ale
starting
Sierra Nevada Pale Ale
starting
Michelob Ultra
starting
Natural Light
starting
Bud Light
starting
Fat Tire Amber Ale
starting
Coors Light
starting
Blue Moon Belgian White
starting
60 Minute IPA
starting
Guinness Draught
starting
Old Rasputin Russian Imperial Stout
starting
90 Minute IPA
starting
Sierra Nevada Celebration Ale
starting
Two Hearted Ale
starting
Arrogant Bastard Ale
starting
Pliny The Elder
starting
Sierra Nevada Bigfoot Barleywine Style Ale
starting
La Fin Du Monde
starting
Trappistes Rochefort 10
starting
Ayinger Celebrator Doppelbock
starting
St. Bernardus Abt 12
starting
Imperial Stout
starting
Samuel Adams Boston Lager
starting
Duvel
starting
Dead Guy Ale
starting
Orval Trappist Ale
starting
Weihenstephaner Hefeweissbier
starting
Budweiser
starting
Samuel Smith's Oatmeal Stout
starting
Samuel Adams Octoberfest


## Inspect the Results

In [98]:
cols = ["beer1", "beer2", "overall_dist", "aroma_dist", "palate_dist", "taste_dist"]
simple_distances = pd.DataFrame(simple_distances, columns=cols)
simple_distances.tail(15)

Unnamed: 0,beer1,beer2,overall_dist,aroma_dist,palate_dist,taste_dist
855,Samuel Adams Octoberfest,Arrogant Bastard Ale,25.238859,28.631277,26.706741,29.189039
856,Samuel Adams Octoberfest,Pliny The Elder,24.259019,30.88689,23.911294,26.893308
857,Samuel Adams Octoberfest,Sierra Nevada Bigfoot Barleywine Style Ale,25.514702,29.236108,26.589472,28.879058
858,Samuel Adams Octoberfest,La Fin Du Monde,24.652586,29.415132,27.376998,29.449109
859,Samuel Adams Octoberfest,Trappistes Rochefort 10,23.811762,30.757113,28.253318,30.020826
860,Samuel Adams Octoberfest,Ayinger Celebrator Doppelbock,25.278449,28.535066,26.636441,29.736341
861,Samuel Adams Octoberfest,St. Bernardus Abt 12,24.361855,29.828677,28.831406,30.095681
862,Samuel Adams Octoberfest,Imperial Stout,19.912308,22.005681,21.569655,22.242976
863,Samuel Adams Octoberfest,Samuel Adams Boston Lager,23.021729,21.05944,20.916501,21.430119
864,Samuel Adams Octoberfest,Duvel,26.781523,27.363297,27.522718,28.827071


## Allow the User to Customize the Weights

In [99]:
def calc_distance(dists, beer1, beer2, weights):
    mask = (dists.beer1==beer1) & (dists.beer2==beer2)
    row = dists[mask]
    row = row[['overall_dist', 'aroma_dist', 'palate_dist', 'taste_dist']]
    dist = weights * row
    return dist.sum(axis=1).tolist()[0]

# weights line up with overall_dist, aroma_dist, palate_dist, taste_dist
weights = [2, 1, 2, 1]
# print() calls use Python 3 syntax
print(calc_distance(simple_distances, 'Fat Tire Amber Ale', "Dale's Pale Ale", weights))
print(calc_distance(simple_distances, "Fat Tire Amber Ale", "Michelob Ultra", weights))

102.95349368782318
182.5057726767514
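
To see what `calc_distance` does for one pair, the weighted sum can be worked by hand. This is a small sketch that uses the Samuel Adams Octoberfest / Imperial Stout row from the results table above and the same weights. Using `np.dot` here is an illustrative choice, not how the function itself is written.

```python
import numpy as np

# Distances for Samuel Adams Octoberfest vs. Imperial Stout, copied from the table above
# order: overall, aroma, palate, taste
dists = np.array([19.912308, 22.005681, 21.569655, 22.242976])
weights = np.array([2, 1, 2, 1])

# 2*overall + 1*aroma + 2*palate + 1*taste
weighted = float(np.dot(weights, dists))
print(weighted)
```

With these weights, the overall and palate distances count twice as much as the aroma and taste distances, and the result comes to roughly 127.21.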


## Find Similar Beers 


In [100]:
my_beer = "Samuel Smith's Oatmeal Stout"
results = []
for b in beers:
    if my_beer!=b:
        results.append((my_beer, b, calc_distance(simple_distances, my_beer, b, weights)))
sorted(results, key=lambda x: x[2])[0:4]

[("Samuel Smith's Oatmeal Stout", "Dale's Pale Ale", 100.03287900519081),
 ("Samuel Smith's Oatmeal Stout", 'Pliny The Elder', 110.85382231220953),
 ("Samuel Smith's Oatmeal Stout",
  'Weihenstephaner Hefeweissbier',
  116.48301840501134),
 ("Samuel Smith's Oatmeal Stout", 'Two Hearted Ale', 120.7930670918206)]
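
The loop above can also be written as a vectorized ranking. The sketch below is one possible way to do it, using a few rows copied from the `simple_distances` output shown earlier. `most_similar` and `sample` are hypothetical names introduced for illustration.

```python
import pandas as pd

# A few rows copied from the simple_distances output shown earlier
rows = [
    ["Samuel Adams Octoberfest", "Arrogant Bastard Ale", 25.238859, 28.631277, 26.706741, 29.189039],
    ["Samuel Adams Octoberfest", "Imperial Stout", 19.912308, 22.005681, 21.569655, 22.242976],
    ["Samuel Adams Octoberfest", "Samuel Adams Boston Lager", 23.021729, 21.05944, 20.916501, 21.430119],
    ["Samuel Adams Octoberfest", "Duvel", 26.781523, 27.363297, 27.522718, 28.827071],
]
dist_cols = ["overall_dist", "aroma_dist", "palate_dist", "taste_dist"]
sample = pd.DataFrame(rows, columns=["beer1", "beer2"] + dist_cols)

def most_similar(dists, beer, weights, n=2):
    """Rank the other beers by weighted distance to `beer`, smallest first."""
    subset = dists[dists.beer1 == beer].copy()
    # multiply each distance column by its weight, then add across columns
    subset["weighted_dist"] = subset[dist_cols].mul(weights, axis="columns").sum(axis=1)
    return subset.nsmallest(n, "weighted_dist")[["beer2", "weighted_dist"]]

top = most_similar(sample, "Samuel Adams Octoberfest", [2, 1, 2, 1])
print(top)
```

Among these four rows, Imperial Stout and Samuel Adams Boston Lager have the smallest weighted distances, so they are returned first.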

# See in Production
http://beers.yhathq.com/

# Similar program in R

http://blog.yhat.com/posts/recommender-system-in-r.html