# Modules

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datetime import datetime, date, time
import seaborn as sns
from scipy import stats
import scipy.stats as st
import glob

# Dataset

The dataset used in this analysis consists of beer reviews from two beer rating websites, **BeerAdvocate** and **RateBeer**, covering the period from 2001 to 2017. For each website, we have 5 files:
- users.csv: metadata about reviewers
- beers.csv : metadata about reviewed beers
- breweries.csv : metadata about breweries
- ratings.txt : all reviews given by users, including numerical ratings and sometimes textual reviews
- reviews.txt : only reviews given by users that include both numerical ratings and textual reviews

Our analysis does not use textual reviews. We therefore work only with the ratings.txt files, which contain every review whether or not it includes text, and ignore the reviews.txt files.

### Load data into Dataframes

The .csv files are not too large and can efficiently be loaded into DataFrames.

In [2]:
BA_DATA_FOLDER = 'data/BeerAdvocate/'
RB_DATA_FOLDER = 'data/RateBeer/'

BA_USERS = BA_DATA_FOLDER+"users.csv"
BA_BEERS = BA_DATA_FOLDER+"beers.csv"
BA_BREWERIES = BA_DATA_FOLDER+"breweries.csv"

RB_USERS = RB_DATA_FOLDER+"users.csv"
RB_BEERS = RB_DATA_FOLDER+"beers.csv"
RB_BREWERIES = RB_DATA_FOLDER+"breweries.csv"

In [3]:
ba_users = pd.read_csv(BA_USERS)
ba_beers = pd.read_csv(BA_BEERS)
ba_breweries = pd.read_csv(BA_BREWERIES)

rb_users = pd.read_csv(RB_USERS)
rb_beers = pd.read_csv(RB_BEERS)
rb_breweries = pd.read_csv(RB_BREWERIES)

On the other hand, the ratings.txt files are extremely large, and loading them directly into DataFrames freezes the kernel. To circumvent this problem, we wrote a script (review_parser.py, located in src/scripts) that splits each ratings file into parts, parses each part, and saves it as JSON. In the notebook, we then load the JSON files into DataFrames and concatenate them. Processing the .txt files in smaller chunks avoids holding an entire raw file in memory while parsing, which is what overloaded the memory and froze the kernel, and pandas can read the resulting JSON files directly.
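The chunked parsing described above can be sketched as follows. This is not the content of review_parser.py: it is a minimal illustration that assumes each rating is stored as consecutive `key: value` lines and that ratings are separated by blank lines. The function names `iter_records` and `iter_chunks` and the sample values are hypothetical.

```python
import json
from itertools import islice


def iter_records(lines):
    """Yield one dict per rating from 'key: value' lines separated by blank lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(':')
        record[key.strip()] = value.strip()
    if record:
        # Last record when the input does not end with a blank line
        yield record


def iter_chunks(records, chunk_size):
    """Group an iterator of records into lists of at most chunk_size items."""
    records = iter(records)
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory example (made-up values) standing in for a ratings.txt file
sample_lines = [
    'beer_name: Beer A', 'rating: 3.5', '',
    'beer_name: Beer B', 'rating: 4.0', '',
    'beer_name: Beer C', 'rating: 2.5',
]
chunks = list(iter_chunks(iter_records(sample_lines), chunk_size=2))
json_parts = [json.dumps(chunk) for chunk in chunks]  # one JSON document per chunk
```

Because both functions are generators, only one chunk of records is held in memory at a time when the input is a file object read line by line.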

In [4]:
# Load BeerAdvocate ratings stored in json files into a single DataFrame
ba_json_files = glob.glob(BA_DATA_FOLDER+'*.json')
ba_df_list = [pd.read_json(file) for file in ba_json_files]
ba_ratings = pd.concat(ba_df_list, ignore_index=True)
ba_ratings.head()



Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating
0,Régab,142544.0,Societe des Brasseries du Gabon (SOBRAGA),37262.0,Euro Pale Lager,4.5,2015-08-20 09:59:28,nmann08,nmann08.184925,3.25,2.75,3.25,2.75,3.0,2.88
1,Barelegs Brew,19590.0,Strangford Lough Brewing Company Ltd,10093.0,English Pale Ale,4.5,2009-02-20 10:59:12,StJamesGate,stjamesgate.163714,3.0,3.5,3.5,4.0,3.5,3.67
2,Barelegs Brew,19590.0,Strangford Lough Brewing Company Ltd,10093.0,English Pale Ale,4.5,2006-03-13 10:59:12,mdagnew,mdagnew.19527,4.0,3.5,3.5,4.0,3.5,3.73
3,Barelegs Brew,19590.0,Strangford Lough Brewing Company Ltd,10093.0,English Pale Ale,4.5,2004-12-01 10:59:12,helloloser12345,helloloser12345.10867,4.0,3.5,4.0,4.0,4.5,3.98
4,Barelegs Brew,19590.0,Strangford Lough Brewing Company Ltd,10093.0,English Pale Ale,4.5,2004-08-30 09:59:28,cypressbob,cypressbob.3708,4.0,4.0,4.0,4.0,4.0,4.0


In [5]:
# Load RateBeer ratings stored in json files into a single DataFrame
rb_json_files = glob.glob(RB_DATA_FOLDER+'*.json')
rb_df_list = [pd.read_json(file) for file in rb_json_files]
rb_ratings = pd.concat(rb_df_list, ignore_index=True)
rb_ratings.head()



Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating
0,33 Export (Gabon),410549.0,Sobraga,3198.0,Pale Lager,5.0,2016-04-26 10:00:00,Manslow,175852.0,2.0,4.0,2.0,4.0,8.0,2.0
1,Castel Beer (Gabon),105273.0,Sobraga,3198.0,Pale Lager,5.2,2017-02-17 11:00:00,MAGICuenca91,442761.0,2.0,3.0,2.0,4.0,8.0,1.9
2,Castel Beer (Gabon),105273.0,Sobraga,3198.0,Pale Lager,5.2,2016-06-24 10:00:00,Sibarh,288889.0,3.0,3.0,2.0,3.0,5.0,1.6
3,Castel Beer (Gabon),105273.0,Sobraga,3198.0,Pale Lager,5.2,2016-01-01 11:00:00,fombe89,250510.0,4.0,3.0,1.0,2.0,5.0,1.5
4,Castel Beer (Gabon),105273.0,Sobraga,3198.0,Pale Lager,5.2,2015-10-23 10:00:00,kevnic2008,122778.0,2.0,4.0,2.0,4.0,7.0,1.9


### First look at the data

We will now examine the different DataFrames in more detail.

In [6]:
# explain the columns of users, beers, breweries and ratings DataFrames

**BeerAdvocate beer Dataframe**

In [7]:
ba_beers.sample(4)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
81079,81490,805 Blonde Ale,2210,Firestone Walker Brewing Co.,American Blonde Ale,881,108,3.51,80.0,,4.7,3.475778,,0,
166944,199708,Ameliorator,32679,Wiseacre Brewing,Doppelbock,15,4,3.99,85.0,,8.5,3.932667,,0,
191126,219597,Bitter Enemy,43129,Bond Brothers Beer Company,American Double / Imperial IPA,9,0,3.97,,,8.2,3.971111,-0.01946,0,
211736,229193,Simply Simcoe,38647,Pareidolia Brewing Company,American IPA,0,0,,,,5.4,,,0,


Let us explain the different columns of the BeerAdvocate beer Dataframe, in which each row is a beer:
- beer_id, beer_name, brewery_id, brewery_name, style are explicit
- nbr_ratings: total number of reviews for that beer, whether or not they include textual reviews
- nbr_reviews: number of reviews for that beer that include textual reviews
- avg: average rating (out of 5) given to the beer based on user ratings
- ba_score: the BeerAdvocate score assigned to the beer, which corresponds to the beer's overall rating within its style category, calculated using a trimmed mean and a custom Bayesian formula that adjusts for the beer's style, balancing the score based on the number of ratings and the style's average
- bros_score: beer rating given by the site’s founders
- abv: 'Alcohol by volume', which indicates the percentage of alcohol content in the beer
- avg_computed: average rating (out of 5) recalculated using a weighted sum of the different aspect ratings
- zscore: z-score of the beer's average rating, which is a statistical measure that indicates how many standard deviations the average rating is from the mean of all ratings from the BeerAdvocate dataset
- nbr_matched_valid_ratings: number of valid ratings for beers that were successfully matched between BeerAdvocate and RateBeer
- avg_matched_valid_ratings: average rating of those matched and valid ratings across the sites

The last two columns are related to the analysis performed by Robert West and Gael Lederrey in the following paper: https://dlab.epfl.ch/people/west/pub/Lederrey-West_WWW-18.pdf.
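To make the zscore definition above concrete, here is a minimal sketch of a z-score computed on a made-up table. Only the column name `avg` comes from the dataset. The toy values, the `zscore_sketch` column and the choice of the population standard deviation are assumptions, and the dataset authors may have computed their zscore column differently.

```python
import pandas as pd

# Made-up beers; only the column name 'avg' comes from the dataset
toy_beers = pd.DataFrame({'beer_name': ['A', 'B', 'C'], 'avg': [3.0, 3.5, 4.0]})

# z-score: distance from the mean, expressed in standard deviations
mean_avg = toy_beers['avg'].mean()
std_avg = toy_beers['avg'].std(ddof=0)  # population standard deviation (a choice)
toy_beers['zscore_sketch'] = (toy_beers['avg'] - mean_avg) / std_avg
```

A beer whose average equals the overall mean gets a z-score of 0, and beers above the mean get positive values.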

**RateBeer beer Dataframe**

In [8]:
rb_beers.sample(4)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
359183,345410,Whole Foods Market (Houston) Radio Gold,21848,Whole Foods Market Brewing Company &#40;Housto...,Saison,1,,,3.06,,3.6,,0,
230669,140544,Rock Bottom Colorado Springs Birthday Barleywine,6551,Rock Bottom Colorado Springs,Barley Wine,1,,,2.88,9.0,3.7,,0,
130890,259942,CAP / Dugges BelgoStout,5996,Dugges Bryggeri,Imperial Stout,10,96.0,50.0,3.63,12.0,3.89,,0,
117358,154746,Berghoeve Windbuul,11771,Berghoeve Brouwerij,Dunkler Bock,48,50.0,78.0,3.2,6.0,3.23125,,0,


Let us explain the different columns of the RateBeer beer Dataframe, in which each row is a beer:

The columns beer_id, beer_name, brewery_id, brewery_name, style, nbr_ratings, avg, abv, avg_computed, zscore, nbr_matched_valid_ratings and avg_matched_valid_ratings have the same meaning as in the BeerAdvocate beer Dataframe.

Some columns are missing compared to the BeerAdvocate beer Dataframe: ba_score and bros_score (which makes sense as these are BeerAdvocate-specific scores), and nbr_reviews.

New columns are present compared to the BeerAdvocate beer Dataframe:
- overall_score: score (out of 100) which "reflects the rating given by RateBeer users and how this beer compares to all other beers on RateBeer", calculated by considering the ratings given by each user and the total number of ratings for the beer
- style_score: score given to the beer (out of 100) specifically within its style category

**BeerAdvocate user Dataframe**

In [9]:
ba_users.sample(4)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location
10636,20,15,belgianbeerbunny.710294,Belgianbeerbunny,1356779000.0,"United States, Illinois"
105514,2,2,gobieesq.82376,GobieESQ,1149502000.0,"United States, New York"
35512,22,1,lakeeffectsteam.283336,LakeEffectSteam,1230894000.0,"United States, New York"
55340,3,0,kjebar.908836,kjebar,1418641000.0,Norway


Let us explain the different columns of the BeerAdvocate user Dataframe, in which each row is a reviewer:
- nbr_ratings, nbr_reviews, user_id, user_name, and location are explicit
- joined: timestamp indicating when the user joined BeerAdvocate in Unix timestamp format (the number of seconds since January 1, 1970, 00:00:00 UTC)
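Should the joined column be needed later, Unix timestamps can be converted to readable dates with `pd.to_datetime`. The sketch below uses two values copied from the sample above; the variable names are illustrative.

```python
import pandas as pd

# Two 'joined' values taken from the sample above
joined = pd.Series([1356779000.0, 1149502000.0])

# unit='s' tells pandas that the numbers are seconds since the Unix epoch
joined_dt = pd.to_datetime(joined, unit='s')
```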

**RateBeer user Dataframe**

In [10]:
rb_users.sample(4)

Unnamed: 0,nbr_ratings,user_id,user_name,joined,location
1289,632,31185,morrdt,1134644000.0,"United States, Florida"
3569,1,109158,paleobones,1277978000.0,
17740,25,274275,fischersfriend,1376302000.0,"United States, North Carolina"
18120,1,292721,VcomeVibratore,1387019000.0,"United States, Minnesota"


Let us explain the different columns of the RateBeer user Dataframe, in which each row is a reviewer:

The columns are the same as in the BeerAdvocate user Dataframe (joined is obviously the timestamp indicating when the user joined RateBeer and not BeerAdvocate), except that nbr_reviews is missing.

**Brewery Dataframes**

In [11]:
ba_breweries.sample(4)

Unnamed: 0,id,location,name,nbr_beers
16320,7198,"United States, Georgia",Max Lager's American Grill & Brewery,0
5566,23352,Australia,John Boston Premium Beverages,4
15498,2113,"United States, Iowa",Raccoon River Brewing Company,26
12425,23590,"United States, Florida",Brooksville Brewing Company,12


In [12]:
rb_breweries.sample(4)

Unnamed: 0,id,location,name,nbr_beers
1555,14749,Canada,Cassel Brewery Co.,37
20410,15104,England,Hop Kettle,51
2779,25623,Italy,Birra del Moro,3
14828,27061,"United States, Massachusetts",Gentile Brewing Company,9


The columns are explicit and are the same for the 2 websites. Each row is a brewery.

**Rating Dataframes**

In [13]:
ba_ratings.sample(4)

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating
5011217,Champion Of The Sun,220257.0,Aslin Beer Company,42560.0,American IPA,6.2,2017-03-05 11:00:00,bhdc,bhdc.722234,4.25,4.25,4.25,4.25,4.25,4.25
553980,Koppi Coffee IPA,65674.0,Mikkeller ApS,13307.0,American IPA,6.9,2014-05-08 09:59:28,Stum-pub,stum-pub.751927,,,,,,4.25
7822734,Paradise,235533.0,Prairie Artisan Ales,30356.0,American Double / Imperial Stout,13.0,2017-07-18 09:59:28,emerge077,emerge077.17949,4.0,4.25,4.5,4.25,4.25,4.26
2019125,Anchor Steam Beer,63.0,Anchor Brewing Company,28.0,California Common / Steam Beer,4.9,2013-09-10 09:59:28,buffcityx,buffcityx.382882,,,,,,4.0


In [14]:
rb_ratings.sample(4)

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating
1358201,Schnitzer Bräu German Hirse Lemon,93172.0,Schnitzer Bräu,8005.0,Specialty Grain,2.6,2010-03-16 11:00:00,Skinnyviking,29274.0,3.0,4.0,2.0,4.0,8.0,2.1
3055144,Green Flash East Village Pilsner,109207.0,Green Flash Brewing Company,3111.0,Pilsener,5.3,2014-09-21 09:59:28,fiulijn,1786.0,3.0,6.0,2.0,5.0,10.0,2.6
4742182,Alchemist Donovans Red,29914.0,The Alchemist,4275.0,Irish Ale,5.16,2005-05-09 09:59:28,GonZoBeeR,9334.0,3.0,7.0,4.0,7.0,14.0,3.5
5012505,Thirsty Dog Whippet Wheat - Cherries,212719.0,Thirsty Dog Brewing Company,2514.0,Fruit Beer,5.1,2013-05-10 09:59:28,Dogbrick,2714.0,3.0,7.0,3.0,7.0,14.0,3.4


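The columns are the same for the 2 Dataframes. Each row corresponds to an individual review. Most column names are explicit.
- 'appearance', 'aroma', 'palate', 'taste' and 'overall' are aspect ratings. In the BeerAdvocate samples they are all out of 5. In the RateBeer samples they are on different scales: 'aroma' and 'taste' reach 7 and 'overall' reaches 14, which is consistent with RateBeer's scales of 5 for appearance and palate, 10 for aroma and taste, and 20 for overall
- 'overall' is a separate score rather than the mean of the other 4 aspect ratings: in the BeerAdvocate head shown earlier, row 3 has aspect ratings 4.0, 3.5, 4.0 and 4.0 but an overall of 4.5
- 'rating' is the final rating of the beer for that review (at most 5 in all samples shown for both sites)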

# 0) Data cleaning

In [15]:
# remove useless columns (done)
# make sure each column has the right type (done)
# deal with missing or Nan values (partially done)
# check the correspondence between brewery_id in the beers DataFrames and id in the breweries DataFrames
# set all US locations to 'United States' (remove state information) (done)
# remove any embedded HTML links in the location strings (done)
# remove countries with too few reviewers (done)

## Filtering Dataframes

Let us start by removing columns in the different Dataframes that we will not use in our analysis.

The following columns of the beer Dataframes will not be used in our analysis:
nbr_reviews, ba_score, bros_score, abv, avg_computed, zscore, nbr_matched_valid_ratings, avg_matched_valid_ratings, overall_score and style_score.

Let us remove them.

In [16]:
useless_columns_ba = ['nbr_reviews', 'ba_score', 'bros_score', 'abv', 'avg_computed', 'zscore', 'nbr_matched_valid_ratings', 'avg_matched_valid_ratings']
ba_beers = ba_beers.drop(columns=useless_columns_ba)
print(ba_beers.columns)

Index(['beer_id', 'beer_name', 'brewery_id', 'brewery_name', 'style',
       'nbr_ratings', 'avg'],
      dtype='object')


In [17]:
useless_columns_rb = [col for col in useless_columns_ba if col not in ['nbr_reviews','ba_score', 'bros_score']] + ['overall_score', 'style_score']
rb_beers = rb_beers.drop(columns=useless_columns_rb)
print(rb_beers.columns)

Index(['beer_id', 'beer_name', 'brewery_id', 'brewery_name', 'style',
       'nbr_ratings', 'avg'],
      dtype='object')


We will also not use the timestamps indicating the time when users joined the platforms, so let us remove this as well.

In [18]:
ba_users = ba_users.drop(columns='joined')
rb_users = rb_users.drop(columns='joined')
print(ba_users.columns)

Index(['nbr_ratings', 'nbr_reviews', 'user_id', 'user_name', 'location'], dtype='object')


## Verifying value types

Let us verify that the values in the different columns of the different Dataframes have the appropriate type.

In [19]:
print(ba_beers.dtypes,'\n','\n',rb_beers.dtypes)

beer_id           int64
beer_name        object
brewery_id        int64
brewery_name     object
style            object
nbr_ratings       int64
avg             float64
dtype: object 
 
 beer_id           int64
beer_name        object
brewery_id        int64
brewery_name     object
style            object
nbr_ratings       int64
avg             float64
dtype: object


In [20]:
print(ba_users.dtypes,'\n','\n',rb_users.dtypes)

nbr_ratings     int64
nbr_reviews     int64
user_id        object
user_name      object
location       object
dtype: object 
 
 nbr_ratings     int64
user_id         int64
user_name      object
location       object
dtype: object


In [21]:
print(ba_breweries.dtypes,'\n','\n',rb_breweries.dtypes)

id            int64
location     object
name         object
nbr_beers     int64
dtype: object 
 
 id            int64
location     object
name         object
nbr_beers     int64
dtype: object


In [22]:
columns_to_convert = ['beer_name', 'brewery_name', 'style']

ba_beers[columns_to_convert] = ba_beers[columns_to_convert].apply(lambda col: col.astype(str))
rb_beers[columns_to_convert] = rb_beers[columns_to_convert].apply(lambda col: col.astype(str))
print(ba_beers.dtypes,'\n','\n',rb_beers.dtypes)

beer_id           int64
beer_name        object
brewery_id        int64
brewery_name     object
style            object
nbr_ratings       int64
avg             float64
dtype: object 
 
 beer_id           int64
beer_name        object
brewery_id        int64
brewery_name     object
style            object
nbr_ratings       int64
avg             float64
dtype: object


In [23]:
print(ba_ratings.dtypes,'\n','\n',rb_ratings.dtypes)

beer_name               object
beer_id                float64
brewery_name            object
brewery_id             float64
style                   object
abv                    float64
date            datetime64[ns]
user_name               object
user_id                 object
appearance             float64
aroma                  float64
palate                 float64
taste                  float64
overall                float64
rating                 float64
dtype: object 
 
 beer_name               object
beer_id                float64
brewery_name            object
brewery_id             float64
style                   object
abv                    float64
date            datetime64[ns]
user_name               object
user_id                float64
appearance             float64
aroma                  float64
palate                 float64
taste                  float64
overall                float64
rating                 float64
dtype: object


The types of the values in the different columns of the different Dataframes seem appropriate.
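One detail in the output above may matter later: beer_id and brewery_id are float64 in the ratings DataFrames but int64 in the beers DataFrames. If these columns are used as join keys, one option is to convert them to pandas' nullable integer dtype. The sketch below runs on a made-up table and is only an illustration of that conversion.

```python
import numpy as np
import pandas as pd

# Made-up ratings rows; float IDs as in the ratings dtypes above
toy_ratings = pd.DataFrame({'beer_id': [142544.0, 19590.0, np.nan],
                            'rating': [2.88, 3.67, 4.0]})

# 'Int64' (capital I) is pandas' nullable integer dtype, so NaN becomes <NA>
toy_ratings['beer_id'] = toy_ratings['beer_id'].astype('Int64')
```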

## Dealing with missing values

In [24]:
ba_beers['avg'].value_counts()

avg
4.00    7783
3.75    7059
3.50    5946
3.88    3307
4.25    2871
        ... 
1.14       1
1.19       1
1.04       1
1.05       1
1.27       1
Name: count, Length: 401, dtype: int64

In [25]:
ba_beers

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,avg
0,166064,Nashe Moskovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,
1,166065,Nashe Pivovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,
2,166066,Nashe Shakhterskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,
3,166067,Nashe Zhigulevskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,
4,166063,Zhivoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,
...,...,...,...,...,...,...,...
280818,19139,Kölsch Ale,885,Summit Station Restaurant & Brewery,Kölsch,3,2.71
280819,19140,Nut Brown Ale,885,Summit Station Restaurant & Brewery,English Brown Ale,2,3.10
280820,19146,Octoberfest,885,Summit Station Restaurant & Brewery,Märzen / Oktoberfest,0,
280821,2805,Scotch Ale,885,Summit Station Restaurant & Brewery,Scotch Ale / Wee Heavy,0,


In [57]:
ba_beers_ = ba_beers[~pd.isna(ba_beers['avg'])].reset_index() # beers with avg = NaN are removed since they have no ratings
ba_beers_

Unnamed: 0,index,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,avg
0,23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,1,2.88
1,24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,4,3.85
2,25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,75,3.45
3,26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,8,3.86
4,27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,64,3.56
...,...,...,...,...,...,...,...,...
247989,280814,19149,Diamond Stout,885,Summit Station Restaurant & Brewery,Irish Dry Stout,3,3.83
247990,280816,19142,IPA,885,Summit Station Restaurant & Brewery,English India Pale Ale (IPA),2,3.24
247991,280817,19141,Irvington Pale Ale,885,Summit Station Restaurant & Brewery,American Pale Ale (APA),3,3.60
247992,280818,19139,Kölsch Ale,885,Summit Station Restaurant & Brewery,Kölsch,3,2.71


In [58]:
rb_beers = rb_beers[~pd.isna(rb_beers['avg'])].reset_index() # beers with avg = NaN are removed since they have no ratings
rb_beers

Unnamed: 0,index,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,avg
0,0,410549,33 Export (Gabon),3198,Sobraga,Pale Lager,1,2.72
1,1,105273,Castel Beer (Gabon),3198,Sobraga,Pale Lager,10,2.18
2,2,19445,Régab,3198,Sobraga,Pale Lager,27,1.83
3,3,155699,Ards Bally Black Stout,13538,Ards Brewing Co.,Stout,6,3.18
4,4,239097,Ards Belfast 366,13538,Ards Brewing Co.,Golden Ale/Blond Ale,1,2.79
...,...,...,...,...,...,...,...,...
395652,442076,189684,Stela Selekt,1107,Stefani & Co,Pilsener,5,2.19
395653,442077,84884,Hotel Martini Donauer,9355,Hotel Martini,Pale Lager,1,2.77
395654,442078,93783,Birra Rozafa,9928,Rozafa Brewery,Pale Lager,1,2.64
395655,442079,220897,Svejk Blonde,17155,Svejk Beer Garden,Pale Lager,4,2.70


## Checking the correspondence between brewery_id in the beers DataFrames and id in the breweries DataFrames
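A possible way to perform this check is sketched below on made-up tables. The column names `brewery_id` (beers) and `id` (breweries) come from the DataFrames shown earlier; the toy values and variable names are hypothetical.

```python
import pandas as pd

toy_beers = pd.DataFrame({'beer_id': [1, 2, 3], 'brewery_id': [10, 10, 99]})
toy_breweries = pd.DataFrame({'id': [10, 20], 'name': ['Brewery X', 'Brewery Y']})

# Beers whose brewery_id has no matching id in the breweries table
unmatched_beers = toy_beers[~toy_beers['brewery_id'].isin(toy_breweries['id'])]

# Breweries that no beer refers to
unused_breweries = toy_breweries[~toy_breweries['id'].isin(toy_beers['brewery_id'])]
```

If both results are empty on the real data, every beer points to a known brewery and every brewery has at least one beer.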

## Removing state information

In [28]:
import warnings
warnings.filterwarnings('ignore')

In [47]:
def edit_location(data_name):
    """Return a copy of data_name with simplified 'location' values."""
    data_name_c = data_name.copy()
    loc = data_name_c['location'].astype(str)
    is_us = loc.str.contains('United States', regex=False)
    has_comma = loc.str.contains(',', regex=False) & ~is_us
    has_link = loc.str.contains('href', regex=False) & ~is_us & ~has_comma
    # Remove state names: 'United States, <state>' becomes 'United States'
    data_name_c.loc[is_us, 'location'] = 'United States'
    # Keep only the part before the first comma for the other double names (such as 'United Kingdom,England')
    data_name_c.loc[has_comma, 'location'] = loc[has_comma].str.split(',').str[0]
    # Drop the remaining rows whose location still contains an HTML link
    return data_name_c[~has_link].reset_index(drop=True)

In [30]:
ba_breweries = edit_location(ba_breweries)

In [31]:
ba_breweries['location']

0           Kyrgyzstan
1           Kyrgyzstan
2           Kyrgyzstan
3           Kyrgyzstan
4           Kyrgyzstan
             ...      
16753          Germany
16754            Aruba
16755    United States
16756    United States
16757    United States
Name: location, Length: 16758, dtype: object

In [32]:
rb_breweries = edit_location(rb_breweries)

In [33]:
rb_breweries['location']

0                   Gabon
1        Northern Ireland
2        Northern Ireland
3        Northern Ireland
4        Northern Ireland
               ...       
24184             Albania
24185             Albania
24186             Albania
24187             Albania
24188             Albania
Name: location, Length: 24189, dtype: object

## Removing HTML links

In [None]:
# Done above
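As an alternative to handling links row by row, HTML markup could be stripped from the location strings with a regular expression. The example strings below are hypothetical, since the exact markup present in the dataset is not shown here, and the rule "cut everything from the first `<` onward" is an assumption about how such strings are structured.

```python
import pandas as pd

# Hypothetical location strings; the exact markup in the dataset may differ
toy_locations = pd.Series([
    'United States, Utah',
    'Canada</a> | <a href="http://example.com">map</a>',
])

# Cut everything from the first '<' onward, then trim surrounding whitespace
cleaned = toy_locations.str.replace(r'<.*$', '', regex=True).str.strip()
```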

## Removing the countries with too few reviewers

In [61]:
ba_users = ba_users[~pd.isna(ba_users['location'])].reset_index() # users with a missing location (NaN) are removed
rb_users = rb_users[~pd.isna(rb_users['location'])].reset_index() # users with a missing location (NaN) are removed
ba_users

Unnamed: 0,index,nbr_ratings,nbr_reviews,user_id,user_name,location
0,0,7820,465,nmann08.184925,nmann08,"United States, Washington"
1,1,2521,2504,stjamesgate.163714,StJamesGate,"United States, New York"
2,2,1797,1143,mdagnew.19527,mdagnew,Northern Ireland
3,3,31,31,helloloser12345.10867,helloloser12345,Northern Ireland
4,4,604,604,cypressbob.3708,cypressbob,Northern Ireland
...,...,...,...,...,...,...
122420,153698,1,0,eturchick.374415,ETurchick,"United States, California"
122421,153699,1,1,everman.532342,Everman,"United States, California"
122422,153700,1,1,justin0001.352175,Justin0001,"United States, California"
122423,153702,1,1,joetex.800347,JoeTex,"United States, California"


In [62]:
rb_users

Unnamed: 0,index,nbr_ratings,user_id,user_name,location
0,0,1890,175852,Manslow,Poland
1,1,89,442761,MAGICuenca91,Spain
2,2,169,288889,Sibarh,Poland
3,3,3371,250510,fombe89,Spain
4,4,13043,122778,kevnic2008,Germany
...,...,...,...,...,...
50587,70167,1,181614,HaraldS,Norway
50588,70169,1,134893,stamfordbus,England
50589,70170,1,327816,fobia405,Belgium
50590,70172,3,82020,klesidra,Slovenia


In [64]:
ba_users_ = edit_location(ba_users)
rb_users_ = edit_location(rb_users)

In [71]:
count = rb_users_['location'].value_counts() < 20 # True for countries with fewer than 20 reviewers
count[count]

location
Panama                   True
Moldova                  True
Costa Rica               True
Ecuador                  True
Virgin Islands (U.S.)    True
                         ... 
Honduras                 True
Falkland Islands         True
Kyrgyzstan               True
Papua New Guinea         True
Tibet                    True
Name: count, Length: 112, dtype: bool

In [72]:
# Keep only users from countries with at least 20 reviewers
rare_countries = count[count].index
rb_users_c = rb_users_[~rb_users_['location'].isin(rare_countries)].reset_index(drop=True)

# 1) Link between culture and taste

## a) Beer style preferences

In [None]:
# use clustering techniques to determine which beer style is most popular in each country / geographic area
# use time information to determine whether regional beer style preferences are stable (which would suggest that they are
# strongly affected by culture) or whether they vary over time
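Before any clustering, a simple baseline is to count ratings per country and style and keep the most rated style for each country. The sketch below is that counting baseline, not a clustering method, and it runs on made-up tables. It assumes that the reviewer's country is obtained by merging ratings with users on `user_id`.

```python
import pandas as pd

toy_users = pd.DataFrame({'user_id': [1, 2, 3],
                          'location': ['Belgium', 'Belgium', 'Germany']})
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2, 3, 3, 3],
                            'style': ['Saison', 'Tripel', 'Tripel',
                                      'Pilsener', 'Pilsener', 'Saison']})

# Attach each reviewer's country to their ratings
merged = toy_ratings.merge(toy_users, on='user_id', how='inner')

# Number of ratings per (country, style), then the most rated style per country
style_counts = (merged.groupby(['location', 'style']).size()
                .rename('n_ratings').reset_index())
top_rows = style_counts.loc[style_counts.groupby('location')['n_ratings'].idxmax()]
top_style = top_rows.set_index('location')['style']
```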

## b) Importance of specific beer attributes

In [None]:
# perform linear regression between the attribute ratings and the final rating for all countries together and compare coefficients for each attribute
# perform linear regression between the attribute ratings and the final rating for the different countries separately and observe the distribution of the coefficients for the different attributes across countries
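The regression step could look like the following sketch, which fits an ordinary least squares model with NumPy on synthetic data. The aspect values and the coefficients in `true_coefs` are made up so that the fit can be checked; they say nothing about the real dataset.

```python
import numpy as np

# Made-up aspect ratings: columns are appearance, aroma, palate, taste
X = np.array([[3.0, 3.5, 3.0, 4.0],
              [4.0, 4.0, 4.5, 4.5],
              [2.5, 3.0, 3.0, 2.5],
              [4.5, 4.0, 4.0, 5.0],
              [3.5, 2.5, 3.5, 3.0],
              [5.0, 4.5, 4.0, 4.5]])
true_coefs = np.array([0.1, 0.3, 0.1, 0.5])
y = X @ true_coefs  # synthetic final rating with known coefficients, no noise

# Add a column of ones so that the model also estimates an intercept
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, aspect_coefs = coefs[0], coefs[1:]
```

Running the same fit separately on each country's ratings would give one coefficient vector per country, whose distribution can then be compared across attributes.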

# 2) Location-related biases in ratings

## a) Cultural biases

In [None]:
# determine the final rating for each country / geographic area
# determine if the final rating for each country / geographic area is the same using statistical tests

## b) Beer origin bias

In [None]:
# compare the final rating of domestic vs foreign beers and determine if there is a significant difference using statistical tests
# determine if the final rating of a given beer is correlated with the number of reviewers from the country where the beer comes from who reviewed that beer (scatter plot + Pearson’s correlation coefficient + regression)
# isolate beer enthusiasts (who wrote a very large number of reviews) and compare the final rating of domestic vs foreign beers and determine if there is a significant difference using statistical tests
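One candidate for the domestic vs foreign comparison is Welch's t-test, available as `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below uses synthetic ratings generated for illustration only. The choice of test is an assumption, and a non-parametric test may be preferable if the ratings are far from normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic ratings, generated only for illustration
domestic = rng.normal(loc=3.9, scale=0.5, size=500)
foreign = rng.normal(loc=3.7, scale=0.5, size=500)

# Welch's t-test does not assume equal variances in the two groups
result = stats.ttest_ind(domestic, foreign, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
```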

# 3) Other biases

## a) Seasonal biases

In [None]:
# use the time information to determine the season during which each rating was posted (only consider countries with 4 seasons)
# group ratings by season
# within each group, determine the average final rating of each beer style
# compare the results for the different seasons
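The season assignment can be sketched as below. The mapping uses northern-hemisphere meteorological seasons, which is an assumption: countries in the southern hemisphere would need the mapping shifted by six months. The dates and ratings are toy values.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'date': pd.to_datetime(['2015-08-20', '2009-02-20', '2006-03-13', '2004-12-01']),
    'rating': [2.88, 3.67, 3.73, 3.98],
})

# Northern-hemisphere meteorological seasons (an assumption)
month_to_season = {12: 'winter', 1: 'winter', 2: 'winter',
                   3: 'spring', 4: 'spring', 5: 'spring',
                   6: 'summer', 7: 'summer', 8: 'summer',
                   9: 'autumn', 10: 'autumn', 11: 'autumn'}
toy_ratings['season'] = toy_ratings['date'].dt.month.map(month_to_season)

# Average final rating per season (add 'style' to the groupby for per-style results)
season_means = toy_ratings.groupby('season')['rating'].mean()
```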

## b) Experience bias

In [None]:
# isolate users who gave a lot of ratings (based on a chosen threshold)
# for each user, sort their reviews chronologically and assign an "experience level" (predefined values that will be the same for all users: n<o<p) to each rating based on the count of reviews posted by that user up to that rating: new reviewer (for the first n reviews), amateur (from the (n+1)th review up to the o-th review), expert (from the (o+1)th review until the last review)
# calculate the average final rating for each experience level across all users
# represent results as a bar plot
# if a particular trend is visible, perform a paired t-test (for early vs. late reviews by the same user) to test if the rating decrease or increase is statistically significant
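The experience-level assignment described above can be sketched with `groupby(...).cumcount()` and `pd.cut`. The thresholds `n` and `o`, the toy ratings and the column names `review_rank` and `experience` are hypothetical.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'user_id': ['u1'] * 5 + ['u2'] * 3,
    'date': pd.to_datetime(['2010-01-05', '2010-01-01', '2010-01-03', '2010-01-02',
                            '2010-01-04', '2011-05-01', '2011-05-03', '2011-05-02']),
    'rating': [3.0, 4.5, 4.0, 4.0, 3.5, 3.0, 3.5, 4.0],
})

n, o = 2, 4  # hypothetical thresholds with n < o

toy_ratings = toy_ratings.sort_values(['user_id', 'date'])
# 1-based rank of each rating within its user's chronological history
toy_ratings['review_rank'] = toy_ratings.groupby('user_id').cumcount() + 1
# Ranks 1..n -> new reviewer, n+1..o -> amateur, above o -> expert
toy_ratings['experience'] = pd.cut(
    toy_ratings['review_rank'],
    bins=[0, n, o, float('inf')],
    labels=['new reviewer', 'amateur', 'expert'],
)
mean_by_level = toy_ratings.groupby('experience', observed=True)['rating'].mean()
```

The resulting `mean_by_level` Series can be passed directly to a bar plot.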