# Project Milestone 2 Notebook
## Do Americans prefer beers with a higher alcohol content (ABV) than Europeans?
- Is it linked to the beer style? Do they generally prefer beer styles that have a higher ABV? (Grouping + micro/macro averages)
- Has it evolved between 2000 and 2017? (Time series analysis + maybe regression)
- Can we map American states to European countries? (Graph/network algorithms)
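
To make the "micro/macro averages" idea from the first sub-question concrete, here is a minimal sketch on made-up data. The `style` column is an assumption: the ratings loaded later in this notebook only keep `user_id`, `date`, `abv` and `rating`, so a style field would have to be loaded as well.

```python
import pandas as pd

# Toy ratings (made-up values, only to illustrate the two averages).
# The "style" column is hypothetical: it is not loaded in this notebook.
ratings = pd.DataFrame({
    "region": ["NA", "NA", "NA", "EU", "EU", "EU"],
    "style": ["IPA", "IPA", "Stout", "Pilsner", "Pilsner", "Tripel"],
    "abv": [6.5, 7.0, 9.0, 4.8, 5.0, 9.0],
})

# Micro average: every rating counts equally.
micro = ratings.groupby("region")["abv"].mean()

# Macro average: average within each style first, then across styles,
# so every style counts equally however often it is rated.
macro = (
    ratings.groupby(["region", "style"])["abv"].mean()
    .groupby(level="region").mean()
)

print(micro)
print(macro)
```

If the two averages differ a lot for a region, a few heavily rated styles dominate that region's micro average.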

## 0. Imports and global variables

In [127]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from load_data import load_gzip_txt_data

## 1. Preprocessing of the data:

#### 1.0 Split users between North America and Europe:

Define European countries:

In [128]:
european_countries = [
    "Albania", "Andorra", "Armenia", "Austria", "Azerbaijan", "Belarus",
    "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus",
    "Czech Republic", "Denmark", "England", "Estonia", "Finland", "France", "Georgia",
    "Germany", "Gibraltar", "Greece", "Hungary", "Iceland", "Ireland", "Italy", "Jersey", "Kazakhstan",
    "Kosovo", "Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta",
    "Moldova", "Monaco", "Montenegro", "Netherlands", "Northern Ireland", "Macedonia",
    "Norway", "Poland", "Portugal", "Romania", "Russia", "San Marino", "Scotland", "Serbia",
    "Slovakia", "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland", "Turkey",
    "Ukraine", "Vatican City", "Wales"
]

# The same country is named "Slovakia" in RateBeer and "Slovak Republic" in BeerAdvocate

Load the user files and classify each user as European, North American, or neither:

In [129]:
def get_na_or_eu(row):
    """
    Classify a user as European, North American or other, based on the
    "location" field of the row.

    :param row: row of a pandas.DataFrame with a "location" field.
    :return: str, one of "NA", "EU" or "Other".
    """
    location = row['location']
    if location in european_countries:
        return 'EU'
    elif (location == 'Canada') or ("United States" in str(location)):
        return 'NA'
    else:
        return 'Other'


# Keep only the user id and location, and drop users without a location.
ba_users_df = (
    pd.read_csv("./data/BeerAdvocate/users.csv")
    .drop(columns=['nbr_ratings', 'nbr_reviews', 'user_name', 'joined'])
    .dropna()
)
ba_users_df['eu_na'] = ba_users_df.apply(get_na_or_eu, axis=1)

rb_users_df = (
    pd.read_csv("./data/RateBeer/users.csv")
    .drop(columns=['nbr_ratings', 'user_name', 'joined'])
    .dropna()
)
rb_users_df['eu_na'] = rb_users_df.apply(get_na_or_eu, axis=1)

print("BeerAdvocate")
display(ba_users_df)
print("RateBeer")
display(rb_users_df)

BeerAdvocate


| | user_id | location | eu_na |
|---|---|---|---|
| 0 | nmann08.184925 | United States, Washington | NA |
| 1 | stjamesgate.163714 | United States, New York | NA |
| 2 | mdagnew.19527 | Northern Ireland | EU |
| 3 | helloloser12345.10867 | Northern Ireland | EU |
| 4 | cypressbob.3708 | Northern Ireland | EU |
| ... | ... | ... | ... |
| 153698 | eturchick.374415 | United States, California | NA |
| 153699 | everman.532342 | United States, California | NA |
| 153700 | justin0001.352175 | United States, California | NA |
| 153702 | joetex.800347 | United States, California | NA |


RateBeer


| | user_id | location | eu_na |
|---|---|---|---|
| 0 | 175852 | Poland | EU |
| 1 | 442761 | Spain | EU |
| 2 | 288889 | Poland | EU |
| 3 | 250510 | Spain | EU |
| 4 | 122778 | Germany | EU |
| ... | ... | ... | ... |
| 70167 | 181614 | Norway | EU |
| 70169 | 134893 | England | EU |
| 70170 | 327816 | Belgium | EU |
| 70172 | 82020 | Slovenia | EU |
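
The row-wise `apply` used above works, but the same rule can also be written with vectorized operations, which is usually faster on large user tables. This is only a sketch: the `european` set is a short stand-in for `european_countries`, and the `locations` values are made up.

```python
import numpy as np
import pandas as pd

# Short stand-in for the full european_countries list (assumption).
european = {"Poland", "Northern Ireland", "Georgia"}

# Made-up locations covering each branch of the rule.
locations = pd.Series(
    ["United States, Georgia", "Georgia", "Canada", "Japan", "Poland"]
)

# Same rule as get_na_or_eu: exact match for Europe first,
# then Canada or any "United States, <state>" location.
is_eu = locations.isin(european)
is_na = locations.eq("Canada") | locations.str.contains("United States", regex=False)

region = pd.Series(
    np.select([is_eu, is_na], ["EU", "NA"], default="Other"),
    index=locations.index,
)
print(region.tolist())
```

Because the European test is an exact match, the US state "United States, Georgia" and the country "Georgia" end up in different groups, as with the original function.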


Let's look at the distribution of NA/EU/Other users:

In [130]:
print("BeerAdvocate:")
display(ba_users_df['eu_na'].value_counts())
print("\nRateBeer:")
display(rb_users_df['eu_na'].value_counts())

BeerAdvocate:


eu_na
NA       116547
EU         3944
Other      1934
Name: count, dtype: int64


RateBeer:


eu_na
NA       30110
EU       16156
Other     4326
Name: count, dtype: int64

In both datasets, North American users form the largest group, followed by European users. The imbalance is much stronger in BeerAdvocate (about 95% NA users) than in RateBeer (about 60%). In the following analysis, we only consider NA and EU users.
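
As a quick check of these proportions, the counts printed above can be turned into percentages. The sketch below copies the counts by hand into a small frame; the name `shares` is only illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.DataFrame(
    {
        "BeerAdvocate": [116547, 3944, 1934],
        "RateBeer": [30110, 16156, 4326],
    },
    index=["NA", "EU", "Other"],
)

# Share of each group within its dataset, in percent.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the notebook's own frames, `ba_users_df['eu_na'].value_counts(normalize=True)` should give the same shares as fractions.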


In [131]:
ba_users_df = ba_users_df.query("eu_na != 'Other'")
rb_users_df = rb_users_df.query("eu_na != 'Other'")

#### 1.1 Load the ratings:

Load the ratings files:

In [132]:
ba_ratings_df = load_gzip_txt_data("./data/BeerAdvocate/ratings.txt.gz", ["user_id", "date", "abv", "rating"],
                                   max_entries=100000)
rb_ratings_df = load_gzip_txt_data("./data/RateBeer/ratings.txt.gz", ["user_id", "date", "abv", "rating"],
                                   max_entries=100000)

Loading data from:  ./data/BeerAdvocate/ratings.txt.gz


1799996it [00:01, 901324.32it/s]


Loading data from:  ./data/RateBeer/ratings.txt.gz


1699997it [00:01, 903458.63it/s]


Cast the columns to meaningful types:

In [133]:
ba_ratings_df = ba_ratings_df.astype({
    'user_id': 'str',
    'date': 'int64',
    'abv': 'float32',
    'rating': 'float32'})

rb_ratings_df = rb_ratings_df.astype({
    'user_id': 'int64',
    'date': 'int64',
    'abv': 'float32',
    'rating': 'float32'})

# Use this to convert the Unix timestamps (in seconds) to the first day of their month:
# df['date'] = pd.to_datetime(df['date'], unit='s').dt.to_period('M').dt.to_timestamp()
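
A small standalone sketch of the monthly conversion hinted at in the comment above. It assumes the `date` column holds Unix timestamps in seconds, as the values in this dataset suggest; `df` and `month` are illustrative names.

```python
import pandas as pd

# Timestamps of the kind stored in the "date" column (Unix seconds).
df = pd.DataFrame({"date": [1440064800, 1447498800, 1321614000]})

# Seconds since the epoch -> datetime -> monthly period -> first day of the month.
df["month"] = (
    pd.to_datetime(df["date"], unit="s")
    .dt.to_period("M")
    .dt.to_timestamp()
)
print(df)
```

Going through `to_period("M")` and back with `to_timestamp()` keeps a regular datetime column, which is convenient for plotting and grouping.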

#### 1.2 Merge the users with the ratings

In [135]:
ba_df = ba_ratings_df.merge(ba_users_df, on='user_id').drop(columns=['user_id'])
rb_df = rb_ratings_df.merge(rb_users_df, on='user_id').drop(columns=['user_id'])
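
Since `merge` defaults to an inner join, ratings whose user is missing from the filtered user tables (for example "Other" users, or users without a location) are dropped silently. The following sketch, on made-up frames, shows one way to count them with the `indicator` option; it is a possible sanity check, not part of the pipeline above.

```python
import pandas as pd

# Made-up ratings and users; user "c" has no entry in the users table.
ratings = pd.DataFrame(
    {"user_id": ["a", "a", "b", "c"], "abv": [5.0, 6.5, 8.0, 4.7]}
)
users = pd.DataFrame({"user_id": ["a", "b"], "eu_na": ["NA", "EU"]})

# Left join with an indicator column; validate checks that each user_id
# appears at most once in the users table.
checked = ratings.merge(
    users, on="user_id", how="left", indicator=True, validate="many_to_one"
)

# Ratings that an inner join would have dropped.
unmatched = int((checked["_merge"] == "left_only").sum())

# Equivalent of the inner join used in the notebook.
merged = checked[checked["_merge"] == "both"].drop(columns=["_merge", "user_id"])

print(unmatched, len(merged))
```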

In [136]:
print("BeerAdvocate:")
display(ba_df)
print("RateBeer:")
display(rb_df)

BeerAdvocate:


| | date | abv | rating | location | eu_na |
|---|---|---|---|---|---|
| 0 | 1440064800 | 4.5 | 2.88 | United States, Washington | NA |
| 1 | 1447498800 | 5.0 | 3.56 | United States, Washington | NA |
| 2 | 1321614000 | 8.0 | 4.50 | United States, Washington | NA |
| 3 | 1367575200 | 10.5 | 3.75 | United States, Washington | NA |
| 4 | 1422097200 | 6.0 | 3.81 | United States, Washington | NA |
| ... | ... | ... | ... | ... | ... |
| 91537 | 1340791200 | 4.7 | 4.27 | United States, Georgia | NA |
| 91538 | 1339754400 | 4.7 | 3.25 | United States, Illinois | NA |
| 91539 | 1338804000 | 4.7 | 3.00 | United States, Pennsylvania | NA |
| 91540 | 1336989600 | 4.7 | 3.50 | United States, Florida | NA |


RateBeer:


| | date | abv | rating | location | eu_na |
|---|---|---|---|---|---|
| 0 | 1461664800 | 5.0 | 2.0 | Poland | EU |
| 1 | 1430820000 | 5.2 | 1.7 | Poland | EU |
| 2 | 1429437600 | 4.5 | 1.9 | Poland | EU |
| 3 | 1429610400 | 5.3 | 3.7 | Poland | EU |
| 4 | 1447326000 | 6.1 | 2.5 | Poland | EU |
| ... | ... | ... | ... | ... | ... |
| 89706 | 1391598000 | 6.0 | 4.5 | Canada | NA |
| 89707 | 1440064800 | 6.0 | 3.7 | Spain | EU |
| 89708 | 1434016800 | 6.0 | 4.4 | France | EU |
| 89709 | 1483009200 | 8.5 | 4.0 | Canada | NA |


## 2. General Analysis 

In [None]:
# TODO: use abv (maybe rounded, to plot histograms) and date to look at differences and trends. Try to address the research (sub)questions.
# TODO: Do we want to use a bit of NLP? Not sure how to do it in a meaningful way...
# TODO: Please plot with seaborn when it is simpler (mainly for the automatic confidence intervals).
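
As a starting point for the first TODO, here is a minimal sketch of the two aggregations it mentions: ABV rounded into bins for a histogram, and mean ABV per region over time. The four rows are copied from the merged tables above; the 0.5 bin width and the names `abv_bin`, `hist` and `yearly` are assumptions.

```python
import pandas as pd

# A few rows copied from the merged tables above (date, abv, eu_na).
df = pd.DataFrame({
    "date": [1440064800, 1447498800, 1321614000, 1461664800],
    "abv": [4.5, 5.0, 8.0, 5.0],
    "eu_na": ["NA", "NA", "NA", "EU"],
})

# Round ABV to the nearest 0.5 so it can serve as a histogram bin (assumed width).
df["abv_bin"] = (df["abv"] * 2).round() / 2

# Number of ratings per region and ABV bin.
hist = df.groupby(["eu_na", "abv_bin"]).size().unstack(fill_value=0)

# Mean ABV of rated beers per region and year.
df["year"] = pd.to_datetime(df["date"], unit="s").dt.year
yearly = df.groupby(["eu_na", "year"])["abv"].mean()

print(hist)
print(yearly)
```

For plotting, something like `sns.lineplot(data=df, x="year", y="abv", hue="eu_na")` on the unaggregated frame should draw one line per region, aggregating the ratings of each year itself.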