In [1]:
from utils import *

# Dataset exploration : 


+ **Goal** : The goal of this notebook is to help us understand the dataset


+ **Source** : [Original dataset](https://drive.google.com/drive/folders/1Wz6D2FM25ydFw_-41I9uTwG9uNsN4TCF?usp=sharing) 

+ **Links** : [Here](../docs/datasets/links.md) is a document with more links about the datasets (parser/crawler, papers, ...)

Now let's look at all the data available.

## File conversion

The txt files from the original dataset have been converted to csv files with the function *ratings_text_to_csv()* in [utils.py](utils.py).

The data loaded in this notebook is the converted version. Here is a link to the [converted dataset](https://drive.google.com/drive/folders/1sbUpQaA4lJ_vyq-aX0h_sRIaG0B6oq_B?usp=drive_link)
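To illustrate the kind of conversion involved, here is a minimal sketch. It assumes each record in the txt file is a block of `key: value` lines separated by a blank line; that layout is an assumption, since the txt format is not described in this notebook. `text_records_to_csv` is a hypothetical helper, not the actual *ratings_text_to_csv()* from utils.py.

```python
import csv
import io

def text_records_to_csv(text, out):
    """Write blank-line-separated `key: value` blocks as CSV rows.

    Hypothetical helper; the real txt layout is an assumption.
    Returns the number of records written.
    """
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)

    # Column order: first appearance of each key across all records
    fields = list(dict.fromkeys(k for r in records for k in r))
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return len(records)

sample = "beer_name: Dark Island\nabv: 4.6\n\nbeer_name: Best Bitter\nabv: 5.0\n"
buf = io.StringIO()
n_records = text_records_to_csv(sample, buf)
csv_lines = buf.getvalue().splitlines()
```

This version keeps every record in memory, which is fine for a sketch; a file with millions of ratings would more likely be processed record by record.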

In [2]:
# You need to extract the txt.zip files, and you will need to compress the csv files to csv.zip afterward
#ratings_text_to_csv()

## RAM limitation

If there is too much data to load, you can limit the number of ratings to 'reduced' for each loading of a ratings/reviews dataframe. Only the first 'reduced' ratings/reviews are loaded, so the result is not a random sample.

Avoid loading data you won't work with

In [3]:
#REDUCED = None # No reduction
REDUCED = 1e5  # Approx 0.1% of the reviews
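A row limit of this kind could be implemented with the `nrows` parameter of `pd.read_csv`. The sketch below is only an illustration: `load_ratings_head` is a hypothetical helper, and how `load_data` in utils.py actually applies `reduced` is not shown in this notebook.

```python
import io
import pandas as pd

def load_ratings_head(source, reduced=None):
    """Read a ratings CSV, keeping only the first `reduced` rows (all rows if None).

    Hypothetical helper for illustration; not the notebook's load_data().
    """
    # 1e5 is a float, so cast it to an int before handing it to pandas
    nrows = None if reduced is None else int(reduced)
    return pd.read_csv(source, header=0, nrows=nrows)

# Tiny in-memory CSV standing in for a real ratings file
csv_text = "beer_id,rating\n1,3.5\n2,4.0\n3,2.75\n"
head = load_ratings_head(io.StringIO(csv_text), reduced=2)
full = load_ratings_head(io.StringIO(csv_text))
```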

## BeerAdvocate

In [4]:
# Load BeerAdvocate dataset
beers_ba, breweries_ba, users_ba, ratings_ba, reviews_ba = load_data('ba', reduced=REDUCED)

### Beers

In [5]:
# A beer example
beers_ba.sample().T

Unnamed: 0,44585
beer_id,119268
beer_name,Spoh Russian Imperial Stout
brewery_id,34423
brewery_name,Cerveza Spoh
style,Russian Imperial Stout
nbr_ratings,4
nbr_reviews,1
avg,3.99
ba_score,
bros_score,


**Fields description**

***s.e.** means self-explanatory*

+ From BeerAdvocate website :
    * beer_id : s.e.
    * beer_name : s.e.
    * brewery_id : s.e.
    * brewery_name : s.e.
    * style : type of beer such as 'English India Pale Ale (IPA)'
    * nbr_ratings : number of ratings given by the BeerAdvocate community
    * nbr_reviews : same as nbr_ratings, but only counts the ratings that come with a text review
    * abv : Alcohol by volume
    * avg : Description on website : "Average across all ratings for this beer", some ratings may be dropped for numerous reasons
    * [ba_score](https://www.beeradvocate.com/community/threads/beeradvocate-ratings-explained.184726/) :

        The BA Score is the beer's overall score based on its ranking within its style category. It's based on the beer's truncated (trimmed) mean and a custom Bayesian (weighted rank) formula that takes the beer's style into consideration. Its purpose is to provide consumers with a quick reference using a format that's familiar to the wine and liquor worlds.

        + 95-100 = world-class
        + 90-94 = outstanding
        + 85-89 = very good
        + 80-84 = good
        + 70-79 = okay
        + 60-69 = poor
        + < 60 = awful
    * bros_score (no longer used as of 2023) : a separate rating provided by the Alström brothers (founders of BA)

+ Computed
    + avg_computed : average of all the ratings about this beer
    + zscore : 
        - Only computed for matched beers --> see matched beer data
        - z-score of the beer's rating relative to the beer ratings of the same year

+ See matched beer data
    * nbr_matched_valid_ratings 
    * avg_matched_valid_ratings 
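As one possible reading of the zscore field described above, each rating can be standardized against all ratings given in the same year. The sketch below uses toy data; the `year` column and the choice of the population standard deviation (`ddof=0`) are assumptions, not necessarily what the dataset authors did.

```python
import pandas as pd

# Toy ratings; in the real data the year would be derived from the `date` timestamp
ratings = pd.DataFrame({
    "beer_id": [1, 2, 3, 1, 2],
    "year":    [2015, 2015, 2015, 2016, 2016],
    "rating":  [3.0, 4.0, 5.0, 2.0, 4.0],
})

by_year = ratings.groupby("year")["rating"]
year_mean = by_year.transform("mean")
# ddof=0: population standard deviation (an arbitrary choice for this sketch)
year_std = by_year.transform(lambda s: s.std(ddof=0))

ratings["zscore"] = (ratings["rating"] - year_mean) / year_std

# avg_computed: plain average of all ratings of a beer
avg_computed = ratings.groupby("beer_id")["rating"].mean()
```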

### Breweries

In [6]:
breweries_ba.sample().T

Unnamed: 0,3028
id,9053
location,Italy
name,Birrificio Udinese - BIRE
nbr_beers,0


**Fields description**

***s.e.** means self-explanatory*

+ From BeerAdvocate website :
    * id : s.e.
    * location : (state,) country of the brewery
    * name : s.e.

+ Computed : 
    * nbr_beers : number of different beers produced by this brewery. Only the beers included in the database are included in the count
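A count like nbr_beers can be derived from the beers table, which also explains why a brewery with no beer in the database ends up with 0. This is a sketch on toy data under that assumption, not the code used to build the dataset.

```python
import pandas as pd

# Toy tables using the column names shown in this notebook
beers = pd.DataFrame({"beer_id": [1, 2, 3], "brewery_id": [10, 10, 20]})
breweries = pd.DataFrame({"id": [10, 20, 30]})

# Number of distinct beers per brewery, among the beers present in the table
counts = beers.groupby("brewery_id")["beer_id"].nunique()

# Breweries without any listed beer are absent from `counts`, hence the fillna(0)
breweries["nbr_beers"] = breweries["id"].map(counts).fillna(0).astype(int)
```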

### Users

In [7]:
users_ba.sample().T

Unnamed: 0,76085
nbr_ratings,3
nbr_reviews,0
user_id,ivtds1game.944949
user_name,Ivtds1game
joined,1423998000.0
location,


**Fields description**

***s.e.** means self-explanatory*

+ From BeerAdvocate website :
    * nbr_ratings : s.e.
    * nbr_reviews : s.e.
    * user_id : s.e.
    * user_name : s.e.
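    * joined : Unix timestamp (in seconds) of the day the user joined the community (day precision; in the samples shown here the time of day is always 12:00 noon Central European time, i.e. 10:00 or 11:00 UTC)
    * location : (state,) country of the user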
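The timestamps can be turned into dates with `pd.to_datetime`. The sketch below uses two `joined` values that appear in the samples of this notebook and keeps everything in UTC; dropping the time of day is a choice made here because only the day is meaningful.

```python
import pandas as pd

# Two `joined` values taken from the samples shown in this notebook
joined = pd.Series([1423998000.0, 1220868000.0])

# Interpret the numbers as seconds since the Unix epoch, in UTC
joined_utc = pd.to_datetime(joined, unit="s", utc=True)

# Only the calendar day carries information
joined_day = joined_utc.dt.date
```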


### Ratings and reviews

In [8]:
ratings_ba.sample().T

Unnamed: 0,51849
beer_name,Dark Island
beer_id,401
brewery_name,Orkney Brewery
brewery_id,118
style,Scottish Ale
abv,4.6
date,1410170400
user_name,Ozzylizard
user_id,ozzylizard.757498
appearance,


**Fields description**

***s.e.** means self-explanatory*

+ From BeerAdvocate website

    - beer_name : s.e.
    - beer_id : s.e.
    - brewery_name : s.e.
    - brewery_id : s.e.
    - style : type of beer such as 'English India Pale Ale (IPA)'
    - abv : Alcohol by Volume
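    - date : Unix timestamp (in seconds) of the rating (day precision; in the samples shown here the time of day is always 12:00 noon Central European time, i.e. 10:00 or 11:00 UTC)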
    - user_name : s.e.
    - user_id : s.e.
    * appearance : score on appearance given **by user**, 1 to 5 with resolution at 0.25
    * aroma : score on aroma given **by user**, 1 to 5 with resolution at 0.25
    * palate : score on palate given **by user**, 1 to 5 with resolution at 0.25
    * taste : score on taste given **by user**, 1 to 5 with resolution at 0.25
    * overall : score on overall given **by user**, 1 to 5 with resolution at 0.25
    + rating : 0.06 * appearance + 0.24 * aroma + 0.10 * palate + 0.40 * taste + 0.20 * overall 
        - We found and checked the formula
    * text : the review from the user
    * review : True if the text has at least 150 characters
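The weighted formula for `rating` given above can be written as follows. The weights come from the field description; the sample row is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Weights of the BeerAdvocate rating formula as stated in the field description
BA_WEIGHTS = {
    "appearance": 0.06,
    "aroma": 0.24,
    "palate": 0.10,
    "taste": 0.40,
    "overall": 0.20,
}

def ba_rating(df):
    """Weighted sum of the five aspect scores, one value per row."""
    return sum(weight * df[col] for col, weight in BA_WEIGHTS.items())

# Made-up aspect scores (not a row from the dataset)
sample = pd.DataFrame({
    "appearance": [4.5], "aroma": [4.5], "palate": [4.5],
    "taste": [4.5], "overall": [5.0],
})
computed = ba_rating(sample).round(2)
```

Comparing `computed` with the stored `rating` column, within a small tolerance, is one way to check the formula on the real data.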


## RateBeer

In [9]:
beers_rb, breweries_rb, users_rb, ratings_rb, reviews_rb = load_data('rb', reduced=REDUCED)

### Beers

In [10]:
beers_rb.sample().T

Unnamed: 0,286123
beer_id,416299
beer_name,Lupine Dopplebock
brewery_id,22002
brewery_name,Lupine Brewing Company
style,Doppelbock
nbr_ratings,0
overall_score,
style_score,
avg,
abv,7.2


**Fields description**

***s.e.** means self-explanatory*

+ From RateBeer website :
    * beer_id : s.e.
    * beer_name : s.e.
    * brewery_id : s.e.
    * brewery_name : s.e.
    * style : type of beer such as 'India Pale Ale (IPA)'
    * nbr_ratings : number of ratings given by the RateBeer community

    + overall_score : 
        - A score that ranks this beer against all other beers on RateBeer.
        - score up to 100, NaN if the beer has no more than 10 ratings
        - RateBeer uses an algorithm when calculating the overall score, considering both the ratings given by each user and the total number of ratings for the beer.
    + style_score : rank among other beers of the same style, NaN if the beer has no more than 10 ratings. Higher is better
    
    + avg : average rating (some ratings may be dropped for numerous reasons), 1 to 5
    
    * abv : Alcohol by volume
    
+ Computed :
    + avg_computed : average of all the ratings about this beer
    + zscore : 
        - Only computed for matched beers --> see matched beer data
        - zscore of the beer rating with the beer ratings of the same year
    
+ See matched beer data
    * nbr_matched_valid_ratings 
    * avg_matched_valid_ratings 



- About the scores :
    - [Source for the scores explanation](https://www.ratebeer.com/our-scores)
    - **DISCLAIMER** : The link above describes the scores as of 2023 and may not correspond to the scores in the dataset, which stops in 2017
    - Some ratings may not be counted in some scores, for a number of reasons --> see link above

### Breweries

In [11]:
breweries_rb.sample().T

Unnamed: 0,9965
id,22909
location,Vietnam
name,Hadubeco
nbr_beers,2


**Fields description**

***s.e.** means self-explanatory*

+ From RateBeer website :
    * id : s.e.
    * location : (state,) country of the brewery
    * name : s.e.

+ Computed : 
    * nbr_beers : number of different beers produced by this brewery. Only the beers included in the database are included in the count

### Users

In [12]:
users_rb.sample().T

Unnamed: 0,43270
nbr_ratings,8
user_id,94792
user_name,ScottButler
joined,1251108000.0
location,"United States, Alabama"


**Fields description**

***s.e.** means self-explanatory*

+ From RateBeer website :
    * nbr_ratings : s.e.
    * user_id : s.e.
    * user_name : s.e.
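    * joined : Unix timestamp (in seconds) of the day the user joined the community (day precision; in the samples shown here the time of day is always 12:00 noon Central European time, i.e. 10:00 or 11:00 UTC)
    * location : (state,) country of the user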

### Ratings and reviews

In [13]:
ratings_rb.sample().T

Unnamed: 0,75162
beer_name,Gahan Premier George Coles Cream Ale
beer_id,19392
brewery_name,Prince Edward Island Brewing Company &#40;The ...
brewery_id,3091
style,Cream Ale
abv,5.0
date,1408096800
user_name,jksipa
user_id,243817
appearance,4


**Fields description**

***s.e.** means self-explanatory*

+ From RateBeer website

    - beer_name : s.e.
    - beer_id : s.e.
    - brewery_name : s.e.
    - brewery_id : s.e.
    - style : type of beer such as 'English India Pale Ale (IPA)'
    - abv : Alcohol by Volume
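    - date : Unix timestamp (in seconds) of the rating (day precision; in the samples shown here the time of day is always 12:00 noon Central European time, i.e. 10:00 or 11:00 UTC)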
    - user_name : s.e.
    - user_id : s.e.


    * appearance : score on appearance given **by user**, 1 to 5 with resolution at 1
    * aroma : score on aroma given **by user**, 1 to 10 with resolution at 1
    * palate : score on palate given **by user**, 1 to 5 with resolution at 1
    * taste : score on taste given **by user**, 1 to 10 with resolution at 1
    * overall : score on overall given **by user**, 1 to 20 with resolution at 1
    + rating :  sum of the 5 previous scores divided by 10
        - We found and checked the formula. It holds for almost every rating (99.9996%)
    * text : the review from the user
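A check of the RateBeer formula (sum of the five aspect scores divided by 10) could look like the sketch below. The rows are made up, and the second one is deliberately inconsistent so that the check has something to flag; the tolerance is an arbitrary choice.

```python
import pandas as pd

RB_ASPECTS = ["appearance", "aroma", "palate", "taste", "overall"]

# Made-up rows; the second stored rating deliberately breaks the formula
ratings = pd.DataFrame({
    "appearance": [4, 3],
    "aroma": [7, 6],
    "palate": [4, 3],
    "taste": [8, 6],
    "overall": [16, 12],
    "rating": [3.9, 3.1],
})

# Formula from the field description: sum of the five scores divided by 10
expected = ratings[RB_ASPECTS].sum(axis=1) / 10

# Compare with a small tolerance to avoid floating-point noise
matches = (expected - ratings["rating"]).abs() < 1e-6
share_matching = matches.mean()
```

On the real data, `share_matching` is the kind of figure that the percentage quoted above refers to.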


## Matched beer data

[matching method](https://github.com/epfl-dlab/when_sheep_shop/blob/master/code/notebooks/1-matching.ipynb)

In [14]:
beers_matched, breweries_matched, users_matched, ratings_matched, reviews_matched = load_data('matched', reduced=REDUCED)
(users_matched, users_approx_matched) = users_matched
(ratings_matched, ratings_ba_matched, ratings_rb_matched) = ratings_matched
(ratings_with_text_ba_matched, ratings_with_text_rb_matched) = reviews_matched



### Breweries

**How the matching is done**:

Breweries are matched on the name (cosine **sim** of at least 0.8 and **diff**erence between first and second match of at least 0.3) and on the location (exact match)

*Explanation of sim and diff fields above*
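The name-matching rule described above (a minimum similarity plus a minimum gap to the runner-up) can be sketched in plain Python. This is a simplified illustration: it uses word counts as vectors, whereas the vectorization actually used by the linked matching notebook is not stated here, and `best_match` is a hypothetical helper. The exact location match would be an additional filter on the candidates.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(name, candidates, min_sim=0.8, min_diff=0.3):
    """Return the accepted candidate, or None if the match is too weak or ambiguous."""
    query = Counter(name.lower().split())
    scored = sorted(
        ((cosine(query, Counter(c.lower().split())), c) for c in candidates),
        reverse=True,
    )
    if not scored:
        return None
    best_sim, best = scored[0]
    # `diff` is the gap between the best and the second-best similarity
    second_sim = scored[1][0] if len(scored) > 1 else 0.0
    if best_sim >= min_sim and best_sim - second_sim >= min_diff:
        return best
    return None

clear = best_match("Orkney Brewery", ["Orkney Brewery", "Orkney Brewing Company", "Lao Brewery"])
ambiguous = best_match("Orkney Brewery", ["Orkney Brewery", "Orkney Brewery Ltd"])
```

In the second call the best candidate is a perfect match, but the runner-up is too close, so the pair is rejected. This is the role of the `diff` threshold.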

In [15]:
breweries_matched.head(2).T

Unnamed: 0,0,1
ba,id,10093
ba.1,location,Northern Ireland
ba.2,name,Strangford Lough Brewing Company Ltd
ba.3,nbr_beers,5
rb,id,4959
rb.1,location,Northern Ireland
rb.2,name,Strangford Lough
rb.3,nbr_beers,5
scores,diff,0.43127548708778424
scores.1,sim,0.8890620997705575


**Fields** : All the fields are either explained above or self-explanatory
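The table above suggests that the matched csv files carry two header rows (site, then field name) and that the second one ended up as data row 0, with the duplicated site names renamed to `ba.1`, `ba.2`, and so on. Assuming that layout, reading the file with a two-level header gives cleaner column access. The inline CSV below is a toy stand-in with rounded values from the sample above.

```python
import io
import pandas as pd

# Toy CSV with two header rows: site on the first, field name on the second
csv_text = (
    "ba,ba,rb,rb,scores,scores\n"
    "id,name,id,name,diff,sim\n"
    "10093,Strangford Lough Brewing Company Ltd,4959,Strangford Lough,0.43,0.89\n"
)

# header=[0, 1] builds a two-level (MultiIndex) column index
breweries = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

ba_part = breweries["ba"]            # all BeerAdvocate columns: id, name
sim = breweries[("scores", "sim")]   # a single column
```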

### Beers


**How the matching is done**:

Beers are matched on the name without the brewery name (cosine **sim** of at least 0.8 and **diff**erence between first and second match of at least 0.4) from the matched breweries and on their ABV (exact match)

*Explanation of sim and diff fields above*

+ nbr_matched_valid_ratings : number of matched ratings with the condition stated above
+ avg_matched_valid_ratings : average rating of the matched ratings

In [16]:
beers_matched.head(2).T

Unnamed: 0,0,1
ba,abv,4.8
ba.1,avg,3.45
ba.2,avg_computed,3.439866666666666
ba.3,avg_matched_valid_ratings,3.504067796610169
ba.4,ba_score,80.0
ba.5,beer_id,19827
ba.6,beer_name,Legbiter
ba.7,beer_wout_brewery_name,Legbiter
ba.8,brewery_id,10093
ba.9,brewery_name,Strangford Lough Brewing Company Ltd


**Fields** : All the fields are either explained above or self-explanatory

### Users and users_approx


**How the matching is done**:

- users : The users are matched on their username and their location (exact matches). Usernames are lowercased before matching.
- users_approx : The users are matched on their username and their location (cosine **sim** of at least 0.9). Usernames are lowercased before matching.

*Explanation of sim and diff fields above*
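The exact matching of users described above amounts to an inner join on the lowercased username and the location. The sketch below uses made-up users and only the column names shown in this notebook; it is not the code from the linked matching notebook.

```python
import pandas as pd

# Made-up users from the two sites
users_ba = pd.DataFrame({"user_name": ["Alice", "BOB"], "location": ["Germany", "Spain"]})
users_rb = pd.DataFrame({"user_name": ["alice", "bob"], "location": ["Germany", "Italy"]})

# Lowercase the usernames so that the comparison ignores case
for df in (users_ba, users_rb):
    df["user_name_lower"] = df["user_name"].str.lower()

# Inner join: keep only users whose lowercased name AND location agree
matched = users_ba.merge(
    users_rb,
    on=["user_name_lower", "location"],
    suffixes=("_ba", "_rb"),
)
```

Here only the first user is matched: the second one has the same lowercased name on both sites but a different location.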

In [17]:
users_matched.head(2).T

Unnamed: 0,0,1
ba,joined,1220868000.0
ba.1,location,Germany
ba.2,nbr_ratings,6
ba.3,nbr_reviews,6
ba.4,user_id,erzengel.248045
ba.5,user_name,Erzengel
ba.6,user_name_lower,erzengel
rb,joined,1224324000.0
rb.1,location,Germany
rb.2,nbr_ratings,8781


In [18]:
users_approx_matched.head(2).T

Unnamed: 0,0,1
ba,joined,1483009200.0
ba.1,location,Spain
ba.2,nbr_ratings,3
ba.3,nbr_reviews,0
ba.4,user_id,magicuenca.1185749
ba.5,user_name,MAGICuenca
ba.6,user_name_lower,magicuenca
rb,joined,1484046000.0
rb.1,location,Spain
rb.2,nbr_ratings,89


**Fields** : All the fields are either explained above or self-explanatory

### All ratings

A rating is matched when both its user and its beer are matched across the two sites.

* ratings_matched : All matched ratings
* ratings_ba_matched : All matched ratings from BeerAdvocate
* ratings_rb_matched : All matched ratings from RateBeer
* ratings_with_text_ba_matched : All matched ratings from BeerAdvocate, with review == True
* ratings_with_text_rb_matched : All matched ratings from RateBeer, with a text review

All fields have been explained above.

In [19]:
ratings_matched.head(2).T

Unnamed: 0,0,1
ba,abv,11.3
ba.1,appearance,4.5
ba.2,aroma,4.5
ba.3,beer_id,645
ba.4,beer_name,Trappistes Rochefort 10
ba.5,brewery_id,207
ba.6,brewery_name,Brasserie de Rochefort
ba.7,date,1324810800
ba.8,overall,5.0
ba.9,palate,4.5


In [20]:
ratings_ba_matched.sample().T

Unnamed: 0,11287
beer_name,Beerlao Gold
beer_id,89973
brewery_name,Lao Brewery Co.
brewery_id,2970
style,American Adjunct Lager
abv,5.0
date,1480071600
user_name,BucBasil
user_id,bucbasil.134775
appearance,3.0


In [21]:
ratings_rb_matched.sample().T

Unnamed: 0,56539
beer_name,Sawdust City Winding Road For 7km
beer_id,300243
brewery_name,Sawdust City Brewing Company
brewery_id,13784
style,Saison
abv,7.0
date,1420628400
user_name,Bendrixian
user_id,104485
appearance,4


In [22]:
ratings_with_text_ba_matched.sample().T

Unnamed: 0,11672
beer_name,Best Bitter
beer_id,115850
brewery_name,Persephone Brewing
brewery_id,32889
style,English Bitter
abv,5.0
date,1491732000
user_name,biboergosum
user_id,biboergosum.168458
appearance,3.75


In [23]:
ratings_with_text_rb_matched.sample().T

Unnamed: 0,22481
beer_name,La Succursale Bünnweizen
beer_id,150710
brewery_name,La Succursale
brewery_id,12864
style,German Hefeweizen
abv,5.2
date,1373018400
user_name,lesifflebiere
user_id,237596
appearance,2


---