# Milestone 1 - CasierVert952

This data analysis is conducted on the datasets of two beer rating websites:

- BeerAdvocate
- RateBeer

> *In the following, BeerAdvocate is abbreviated as BA and RateBeer as RB.*

In [2]:
# Import the required libraries
import os
import csv
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Preprocessing

### Loading CSV data
Let us first import the CSV data of both datasets: the users, the beers and the breweries.

In [3]:
# Create Dataframes for the BA dataset
BA_data_path = "data/BeerAdvocate/"

BA_beers = pd.read_csv(BA_data_path + 'beers.csv')
BA_breweries = pd.read_csv(BA_data_path + 'breweries.csv')
BA_users = pd.read_csv(BA_data_path + 'users.csv')

In [4]:
# Create Dataframes for the RB dataset
RB_data_path = "data/RateBeer/"

RB_beers = pd.read_csv(RB_data_path + 'beers.csv')
RB_breweries = pd.read_csv(RB_data_path + 'breweries.csv')
RB_users = pd.read_csv(RB_data_path + 'users.csv')

### Transforming the ratings files from TXT to CSV

You can download the ```ratings.csv``` files for both datasets with the following links (~2GB each):

- For BA : [here](https://coursedingler.ch/data/BA/ratings.csv)
- For RB : [here](https://coursedingler.ch/data/RB/ratings.csv)

The following cell should **NOT** be executed; it only shows how the ```ratings.csv``` file of each dataset was generated.

It takes around 19 minutes to generate the BA ratings file and 14 minutes for the RB one.

```python
from helpers import txt_to_csv

file_txt = 'ratings.txt'
file_csv = 'ratings.csv'

txt_to_csv(BA_data_path + file_txt, BA_data_path + file_csv, "BA")
txt_to_csv(RB_data_path + file_txt, RB_data_path + file_csv, "RB")
```
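The `txt_to_csv` helper itself is not shown in this notebook. As a rough illustration only, here is a minimal sketch of how such a conversion could work, assuming each rating in `ratings.txt` is stored as `key: value` lines and records are separated by a blank line (the layout visible in the raw excerpt at the end of this notebook). The names `parse_records` and `records_to_csv` are hypothetical, and the third argument of the real helper (`"BA"` / `"RB"`) is left out because its role is not documented here.

```python
import csv
import io


def parse_records(lines):
    """Yield one dict per record from 'key: value' lines separated by blank lines."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    # Emit the last record if the input does not end with a blank line
    if record:
        yield record


def records_to_csv(lines, out_file, fieldnames):
    """Stream records to a CSV writer, one row at a time; return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in parse_records(lines):
        writer.writerow(record)
        count += 1
    return count


# Two records abbreviated from the raw excerpt (only three fields kept)
sample_lines = [
    "beer_name: Barelegs Brew",
    "beer_id: 19590",
    "rating: 3.67",
    "",
    "beer_name: Régab",
    "beer_id: 142544",
    "rating: 2.88",
]
buffer = io.StringIO()
n_written = records_to_csv(sample_lines, buffer, ["beer_name", "beer_id", "rating"])
print(buffer.getvalue())
```

Processing the file line by line keeps memory usage flat, which matters for files of several gigabytes.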

We can now load the ratings data into Dataframes.

**Make sure you have placed or generated the ```ratings.csv``` files in the ```BeerAdvocate``` and ```RateBeer``` folders, along with all the other data files, before executing the following cells!**
```
data/
├── BeerAdvocate
│   ├── beers.csv
│   ├── breweries.csv
│   ├── users.csv
│   └── ratings.csv
│
└── RateBeer
    ├── beers.csv
    ├── breweries.csv
    ├── users.csv
    └── ratings.csv
```

In [6]:
# Read the BA ratings file
s_time = time.time()
BA_rating = pd.read_csv(BA_data_path + 'ratings.csv')
e_time = time.time()
print("Reading of BA ratings ended in " + str(e_time - s_time) + " seconds.")

Reading of BA ratings ended in 55.41016411781311 seconds.


In [7]:
# Read the RB ratings file
s_time = time.time()
RB_rating = pd.read_csv(RB_data_path + 'ratings.csv')
e_time = time.time()
print("Reading of RB ratings ended in " + str(e_time - s_time) + " seconds.")

Reading of RB ratings ended in 60.91123008728027 seconds.


In [8]:
print("Size of BA ratings dataset : " + str(BA_rating.shape))
print("Size of RB ratings dataset : " + str(RB_rating.shape))

Size of BA ratings dataset : (8392192, 17)
Size of RB ratings dataset : (7121361, 16)


Both ratings datasets contain more than 7 million user ratings, with 17 features for BA and 16 for RB.

### Data cleaning

#### Changing date format

The date field is initially stored as a Unix timestamp (in seconds); here we convert it to a human-readable format.

In [9]:
# Convert the dates of both datasets to the "Day-Month-Year" format (the format can easily be changed).
BA_rating["date"] = pd.to_datetime(BA_rating["date"], unit='s')
BA_rating["date"] = BA_rating["date"].dt.strftime("%d-%m-%Y")

RB_rating["date"] = pd.to_datetime(RB_rating["date"], unit='s')
RB_rating["date"] = RB_rating["date"].dt.strftime("%d-%m-%Y")
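Note that `strftime` turns the column into plain strings. The sketch below, which uses the two timestamps from the raw BA excerpt at the end of this notebook, shows the conversion on a tiny frame and one possible variant: keeping the datetime column and storing the formatted label separately. The column name `date_label` is an assumption made for the illustration, not something used elsewhere in this notebook.

```python
import pandas as pd

# Two Unix timestamps (seconds) taken from the raw BA excerpt
ratings = pd.DataFrame({"date": [1440064800, 1235127600]})

# Timestamps -> datetime64 values
ratings["date"] = pd.to_datetime(ratings["date"], unit="s")

# Human-readable label, stored next to the datetime column (hypothetical name)
ratings["date_label"] = ratings["date"].dt.strftime("%d-%m-%Y")

# The datetime column still supports date arithmetic and accessors
years = ratings["date"].dt.year.tolist()

print(ratings)
print(years)
```

Keeping a real datetime column can make later steps such as sorting chronologically or grouping by year simpler, since "Day-Month-Year" strings do not sort in date order.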

#### Dropping and renaming columns

We drop the columns that are not needed; some of them will be recovered during the merging phase. The remaining columns are renamed to avoid collisions during the merges.

In [10]:
from helpers import ratings_dict

# Remove the columns that are not needed
BA_rating.drop(columns=["text", "review"], inplace=True)
RB_rating.drop(columns=["text"], inplace=True)

# Remove the columns that will be recovered when merging
BA_rating.drop(columns=["brewery_name", "style", "beer_name", "user_name", "abv"], inplace=True)
RB_rating.drop(columns=["brewery_name", "style", "beer_name", "user_name", "abv"], inplace=True)

# Rename the columns as defined by "ratings_dict"
BA_rating.rename(columns=ratings_dict, inplace=True)
RB_rating.rename(columns=ratings_dict, inplace=True)
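To illustrate why the renaming matters, here is a small sketch on toy frames (the values and the new column names `user_rating` and `beer_avg` are made up for the example, and are not the ones defined in `helpers`). When two frames share a non-key column name, `pd.merge` keeps both and appends its default `_x` / `_y` suffixes, which quickly becomes hard to read.

```python
import pandas as pd

# Toy frames sharing the non-key column "rating"
left = pd.DataFrame({"beer_id": [19590], "rating": [3.67]})
right = pd.DataFrame({"beer_id": [19590], "rating": [3.5]})

# Without renaming: pandas disambiguates with default suffixes
collided = pd.merge(left, right, on="beer_id")
print(list(collided.columns))

# With explicit renaming before the merge: column names stay meaningful
renamed = pd.merge(
    left.rename(columns={"rating": "user_rating"}),
    right.rename(columns={"rating": "beer_avg"}),
    on="beer_id",
)
print(list(renamed.columns))
```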

#### Merging with beers data

In [11]:
from helpers import beers_dict

# Merge with the beers data
BA_merged = pd.merge(BA_rating, BA_beers, on=["beer_id", "brewery_id"], how="inner")
RB_merged = pd.merge(RB_rating, RB_beers, on=["beer_id", "brewery_id"], how="inner")

# Rename the columns as defined by "beers_dict"
BA_merged.rename(columns=beers_dict, inplace=True)
RB_merged.rename(columns=beers_dict, inplace=True)

#### Merging with breweries data

In [12]:
from helpers import breweries_dict

# Merge with the breweries data
BA_merged = pd.merge(BA_merged, BA_breweries, left_on="brewery_id", right_on="id", how="inner")
RB_merged = pd.merge(RB_merged, RB_breweries, left_on="brewery_id", right_on="id", how="inner")

# Drop the duplicate columns
BA_merged.drop(columns=["id", "name"], inplace=True)
RB_merged.drop(columns=["id", "name"], inplace=True)

# Rename the columns as defined by "breweries_dict"
BA_merged.rename(columns=breweries_dict, inplace=True)
RB_merged.rename(columns=breweries_dict, inplace=True)

#### Merging with users data

In [13]:
from helpers import users_dict

# Merge with the users data
BA_merged = pd.merge(BA_merged, BA_users, on=["user_id"], how="inner")
RB_merged = pd.merge(RB_merged, RB_users, on=["user_id"], how="inner")

# Rename the columns as defined by "users_dict"
BA_merged.rename(columns=users_dict, inplace=True)
RB_merged.rename(columns=users_dict, inplace=True)
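Since all the merges above use `how="inner"`, any rating without a matching beer, brewery or user is silently dropped. The following sketch, on toy data, shows one way this could be checked: a left merge with `indicator=True` counts the unmatched rows, and `validate="many_to_one"` raises an error if the right-hand keys are not unique. The frames and values are invented for the illustration.

```python
import pandas as pd

# Toy data: the third rating has no matching user
ratings = pd.DataFrame({"user_id": ["a.1", "b.2", "c.3"], "rating": [3.0, 4.0, 2.5]})
users = pd.DataFrame({"user_id": ["a.1", "b.2"], "user_location": ["Denmark", "Israel"]})

# Left merge keeps every rating and flags where each row came from
checked = pd.merge(
    ratings,
    users,
    on="user_id",
    how="left",
    indicator=True,          # adds a "_merge" column
    validate="many_to_one",  # fails if user_id is duplicated in `users`
)
n_unmatched = int((checked["_merge"] == "left_only").sum())

# The inner merge drops the unmatched rating without any warning
inner = pd.merge(ratings, users, on="user_id", how="inner")

print(f"{n_unmatched} rating(s) without a matching user")
print(f"{len(ratings) - len(inner)} row(s) lost by the inner merge")
```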

#### First data visualization
The sample below shows what our data looks like once merged. In the following analysis parts, we will derive dataframes from ```BA_merged``` and ```RB_merged``` by copying them, then removing, modifying and adding features on these copies.

In [20]:
RB_merged.sample(10)

|  | beer_id | brewery_id | rating_date | user_id | rating_appearance | rating_aroma | rating_palate | rating_taste | rating_overall | rating | ... | beer_avg_computed | beer_zscore | beer_nbr_matched_valid_ratings | beer_avg_matched_valid_ratings | breweries_location | breweries_nbr_beers | user_nbr_ratings | user_name | user_join_date | user_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3145529 | 105010 | 661 | 13-01-2010 | 63197 | 3 | 6 | 3 | 7 | 11 | 3.0 | ... | 3.061538 |  | 0 |  | United States, Wisconsin | 199 | 2022 | phishpond417 | 1193656000.0 | United States, Wisconsin |
| 3393247 | 238060 | 6642 | 18-05-2014 | 1001 | 3 | 8 | 4 | 7 | 16 | 3.8 | ... | 2.876 |  | 0 |  | Netherlands | 136 | 6899 | caesar | 992599200.0 | Netherlands |
| 2316216 | 122103 | 6766 | 03-04-2012 | 8814 | 3 | 6 | 3 | 6 | 11 | 2.9 | ... | 3.380952 |  | 0 |  | United States, Missouri | 122 | 11458 | bu11zeye | 1062410000.0 | United States, Texas |
| 417334 | 5924 | 1063 | 31-10-2005 | 5011 | 3 | 7 | 2 | 6 | 12 | 3.0 | ... | 3.319915 | -0.077418 | 236 | 3.319915 | Switzerland | 111 | 10752 | madsberg | 1026554000.0 | Denmark |
| 387662 | 6468 | 896 | 04-09-2011 | 130058 | 3 | 7 | 4 | 7 | 15 | 3.6 | ... | 3.501217 |  | 0 |  | Belgium | 55 | 15563 | Benzai | 1307959000.0 | Netherlands |
| 289607 | 233856 | 3642 | 20-01-2016 | 128086 | 3 | 7 | 3 | 6 | 13 | 3.2 | ... | 3.440909 |  | 0 |  | Norway | 186 | 4842 | Beersiveknown | 1304590000.0 | Northern Ireland |
| 5149637 | 6945 | 1173 | 22-08-2003 | 6688 | 4 | 8 | 3 | 6 | 14 | 3.5 | ... | 3.715993 |  | 0 |  | Belgium | 32 | 1148 | redlem | 1043060000.0 | United States, Ohio |
| 834991 | 428217 | 22393 | 18-07-2017 | 79214 | 3 | 7 | 3 | 7 | 13 | 3.3 | ... | 3.6 |  | 0 |  | England | 213 | 5809 | zvikar | 1217239000.0 | Israel |
| 3128581 | 2533 | 435 | 12-09-2006 | 31748 | 2 | 4 | 2 | 3 | 9 | 2.0 | ... | 2.756316 | -1.044322 | 1417 | 2.756316 | United States, Hawaii | 67 | 6483 | BuckeyeBoy | 1136027000.0 | United States, Idaho |
| 2649128 | 30297 | 261 | 19-08-2007 | 50835 | 1 | 10 | 4 | 8 | 16 | 3.9 | ... | 3.909294 |  | 0 |  | United States, Delaware | 99 | 5643 | obguthr | 1173006000.0 | United States, Virginia |
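The copy-then-modify workflow described above can be sketched as follows on a toy frame (two rows whose `beer_id` and `rating` values come from the sample; the `rating_centered` feature is a made-up example). The point is that `.copy()` gives an independent dataframe, so adding or dropping columns on the copy leaves the merged data untouched.

```python
import pandas as pd

# Toy stand-in for a merged dataframe
merged = pd.DataFrame({"beer_id": [5924, 2533], "rating": [3.0, 2.0]})

# Derive an independent copy, then add and remove features on it
derived = merged.copy()
derived["rating_centered"] = derived["rating"] - derived["rating"].mean()
derived.drop(columns=["beer_id"], inplace=True)

print(list(merged.columns))   # the original is unchanged
print(list(derived.columns))
```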


## Examination of both datasets

Indeed, we are given two different but very similar datasets to perform our analysis. ...

In [37]:
# Create dataframes with only the ratings from USA users (rows with a missing location are excluded)
RB_usa = RB_merged[RB_merged['user_location'].str.contains('United States', case=False, na=False)].copy()
BA_usa = BA_merged[BA_merged['user_location'].str.contains('United States', case=False, na=False)].copy()

In [40]:
# Create dataframes with the ratings from all other users (na=True, once negated, also excludes missing locations)
RB_other = RB_merged[~RB_merged['user_location'].str.contains('United States', case=False, na=True)].copy()
BA_other = BA_merged[~BA_merged['user_location'].str.contains('United States', case=False, na=True)].copy()
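The two cells above use different values for the `na` argument of `str.contains`. The small sketch below, on a toy frame with one missing location, illustrates the effect: users without a location end up in neither the USA group nor the "other" group. The `unknown` frame is an addition for the illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy users: one from the USA, one from elsewhere, one with no location
users = pd.DataFrame({"user_location": ["United States, Ohio", "Denmark", None]})

# na=False: a missing location counts as "not USA" in the mask
usa = users[users["user_location"].str.contains("United States", case=False, na=False)]

# na=True then negation: a missing location is excluded from "other" as well
other = users[~users["user_location"].str.contains("United States", case=False, na=True)]

# Rows that fall in neither group
unknown = users[users["user_location"].isna()]

print(len(usa), len(other), len(unknown))
```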

In [None]:
# Top ten beers (by mean rating) for users from different states
# Democratic states
RB_california = RB_usa[RB_usa['user_location']=='United States, California'].copy()
RB_california[['beer_name','breweries_location','rating']].groupby(['beer_name']).agg({'rating': 'mean','breweries_location': 'first'}).sort_values('rating',ascending=False)[:10]

RB_massachusetts = RB_usa[RB_usa['user_location']=='United States, Massachusetts'].copy()
RB_massachusetts[['beer_name','breweries_location','rating']].groupby(['beer_name']).agg({'rating': 'mean','breweries_location': 'first'}).sort_values('rating',ascending=False)[:10]

# Republican states
RB_alabama = RB_usa[RB_usa['user_location']=='United States, Alabama'].copy()
RB_alabama[['beer_name','breweries_location','rating']].groupby(['beer_name']).agg({'rating': 'mean','breweries_location': 'first'}).sort_values('rating',ascending=False)[:10]

RB_indiana = RB_usa[RB_usa['user_location']=='United States, Indiana'].copy()
RB_indiana[['beer_name','breweries_location','rating']].groupby(['beer_name']).agg({'rating': 'mean','breweries_location': 'first'}).sort_values('rating',ascending=False)[:10]

| beer_name | rating | breweries_location |
|---|---|---|
| Temescal Push Pop | 5.0 | United States, California |
| Jacob Best Ice | 5.0 | United States, Wisconsin |
| Cellarmaker Hoppiness is Fleeting | 5.0 | United States, California |
| Bloc 70 Super Light Lager | 5.0 | United States, California |
| Ks Cabos and Honey | 5.0 | Japan |
| Stuttgarter Hofbräu Herrenpils | 5.0 | Germany |
| Cavalier Imperial Stout | 5.0 | Australia |
| Casey Family Preserves: Triple Crown Blackberry | 5.0 | United States, Colorado |
| Casa Agria Heritage Gold | 5.0 | United States, California |
| Societe The Malingerer | 5.0 | United States, California |
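Every beer in this top ten has a mean rating of exactly 5.0, which would be expected if these beers received very few ratings each (this is an assumption; the number of ratings per beer is not shown above). A possible refinement, sketched here on toy data, is to compute the number of ratings alongside the mean and to require a minimum count. The function `top_beers`, its `min_ratings` parameter and the toy values are hypothetical.

```python
import pandas as pd


def top_beers(df, n=10, min_ratings=1):
    """Rank beers by mean rating, keeping only beers with enough ratings."""
    stats = df.groupby("beer_name").agg(
        rating_mean=("rating", "mean"),
        n_ratings=("rating", "count"),
        breweries_location=("breweries_location", "first"),
    )
    stats = stats[stats["n_ratings"] >= min_ratings]
    return stats.sort_values("rating_mean", ascending=False).head(n)


# Toy data: "Beer A" has a single perfect rating, "Beer B" has three ratings
toy = pd.DataFrame({
    "beer_name": ["Beer A", "Beer B", "Beer B", "Beer B"],
    "rating": [5.0, 4.0, 4.5, 4.1],
    "breweries_location": ["Japan", "Germany", "Germany", "Germany"],
})

unfiltered = top_beers(toy)
filtered = top_beers(toy, min_ratings=3)
print(unfiltered)
print(filtered)
```

Without the threshold the single-rating beer ranks first; with `min_ratings=3` only the beer with three ratings remains.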


## Country Representation in Data

# BELOW: test cells, to be removed!

In [14]:
with open('data/BeerAdvocate/ratings.txt', 'r') as fichier:
    for i, ligne in enumerate(fichier):
        if i >= 100000:  # Stop the loop after 100,000 lines
            break
        print(ligne.strip())

beer_name: RÃ©gab
beer_id: 142544
brewery_name: Societe des Brasseries du Gabon (SOBRAGA)
brewery_id: 37262
style: Euro Pale Lager
abv: 4.5
date: 1440064800
user_name: nmann08
user_id: nmann08.184925
appearance: 3.25
aroma: 2.75
palate: 3.25
taste: 2.75
overall: 3.0
rating: 2.88
text: From a bottle, pours a piss yellow color with a fizzy white head.  This is carbonated similar to soda.The nose is basic.. malt, corn, a little floral, some earthy straw.  The flavor is boring, not offensive, just boring.  Tastes a little like corn and grain.  Hard to write a review on something so simple.Its ok, could be way worse.
review: True

beer_name: Barelegs Brew
beer_id: 19590
brewery_name: Strangford Lough Brewing Company Ltd
brewery_id: 10093
style: English Pale Ale
abv: 4.5
date: 1235127600
user_name: StJamesGate
user_id: stjamesgate.163714
appearance: 3.0
aroma: 3.5
palate: 3.5
taste: 4.0
overall: 3.5
rating: 3.67
text: Pours pale copper with a thin head that quickly goes. Caramel, golden syru

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2599: character maps to <undefined>
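The `'charmap'` error and the garbled `RÃ©gab` in the output above are both consistent with a UTF-8 file being decoded with the platform default codec (cp1252 on Windows), since `open` was called without an `encoding` argument. That the file is UTF-8 is an assumption, not something confirmed here. The sketch below reproduces both symptoms on in-memory bytes and shows the explicit `encoding="utf-8"` variant on a temporary file.

```python
import os
import tempfile

# "é" is two bytes in UTF-8; cp1252 decodes them as two separate characters
raw = "beer_name: Régab".encode("utf-8")
wrong = raw.decode("cp1252")
right = raw.decode("utf-8")

# A right double quotation mark ends with byte 0x9d, which cp1252 leaves undefined
quote_bytes = "\u201d".encode("utf-8")
try:
    quote_bytes.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Reading with an explicit encoding avoids depending on the platform default
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings.txt")
    with open(path, "w", encoding="utf-8") as fichier:
        fichier.write("beer_name: Régab\n")
    with open(path, "r", encoding="utf-8") as fichier:
        first_line = fichier.readline().strip()

print(wrong)
print(right)
print(decode_failed)
print(first_line)
```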