In [1]:
import pandas as pd
import numpy as np

from data_loader import (
    get_users_df,
    get_reviews_df,
    get_beers_df,
    get_breweries_df,
    join_breweries_on_beers,
    merge_reviews,
)

In [2]:
reviews_path_ba = "data/matched_beer_data/ratings_ba.txt"
reviews_path_rb = "data/matched_beer_data/ratings_rb.txt"
users_path_ba = "data/users_ba.csv"
users_path_rb = "data/users_rb.csv"
breweries_path = "data/matched_beer_data/breweries.csv"
beers_path = "data/matched_beer_data/beers.csv"

# TITLE

A short text to explain the main idea.

## 1) Loading data and basic analysis

This part could report the number of rows and columns, the missing values per column, etc.

### Data cleaning

Drop or fill missing values, etc.

In [3]:
users_df_ba = get_users_df(users_path_ba)
users_df_rb = get_users_df(users_path_rb)
ba_df = get_reviews_df(reviews_path_ba)
rb_df = get_reviews_df(reviews_path_rb)
breweries_df = get_breweries_df(breweries_path)
beers_df = get_beers_df(beers_path)
beers_df = join_breweries_on_beers(beers_df, breweries_df)
reviews_df = merge_reviews(ba_df, rb_df, beers_df, users_df_ba, users_df_rb)
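
A minimal sketch of the basic checks mentioned above (row and column counts, missing values, dropping incomplete rows). It runs on a small toy frame, `toy_reviews`, which is a hypothetical stand-in for `reviews_df`; the column subset and the choice to drop rows without a `rating` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for reviews_df; the real frame comes from merge_reviews.
toy_reviews = pd.DataFrame(
    {
        "beer_id": [19827, 19827, 20841, 20842],
        "rating": [3.5, np.nan, 4.0, 2.75],
        "user_location": [
            "United States, Kansas",
            None,
            "Northern Ireland",
            "Northern Ireland",
        ],
    }
)

# Number of rows and columns.
n_rows, n_cols = toy_reviews.shape

# Number of missing values in each column.
missing_per_column = toy_reviews.isna().sum()

# Assumption: rows without a rating are not useful for the analysis, so drop them.
cleaned = toy_reviews.dropna(subset=["rating"])

print(n_rows, n_cols)
print(missing_per_column)
print(len(cleaned))
```

The same three calls (`shape`, `isna().sum()`, `dropna(subset=...)`) should apply unchanged to the real dataframes.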

## User dataframe (the user dataframes are similar for both platforms)

In [9]:
users_df_ba.head()

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location
0,7820,465,nmann08.184925,nmann08,1199704000.0,"United States, Washington"
1,2521,2504,stjamesgate.163714,StJamesGate,1191838000.0,"United States, New York"
2,1797,1143,mdagnew.19527,mdagnew,1116410000.0,Northern Ireland
3,31,31,helloloser12345.10867,helloloser12345,1101380000.0,Northern Ireland
4,604,604,cypressbob.3708,cypressbob,1069326000.0,Northern Ireland




1. **nbr_ratings**: Number of beer ratings by users.
2. **nbr_reviews**: Count of beer reviews submitted.
3. **user_id**: Unique identifier for users.
4. **user_name**: Usernames associated with users.
5. **joined**: Date when users joined the platform (the sample values look like Unix timestamps in seconds).
6. **location**: Geographical location of users.
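
The `joined` values shown above look like Unix timestamps in seconds. Assuming that reading is correct, they can be converted to readable dates as in the sketch below. The frame `toy_users` and the new column name `joined_date` are hypothetical, and the two values are copied from the displayed (possibly rounded) output.

```python
import pandas as pd

# Hypothetical stand-in for users_df_ba, using two values from the output above.
toy_users = pd.DataFrame(
    {
        "user_name": ["nmann08", "StJamesGate"],
        "joined": [1199704000.0, 1191838000.0],
    }
)

# Assumption: `joined` counts seconds since 1970-01-01 (Unix epoch).
toy_users["joined_date"] = pd.to_datetime(toy_users["joined"], unit="s")

print(toy_users[["user_name", "joined_date"]])
```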

## Breweries dataframe

In [10]:
breweries_df.head()

Unnamed: 0,brewery_id_ba,brewery_location_ba,brewery_name_ba,brewery_nbr_beers_ba,brewery_id_rb,brewery_nbr_beers_rb
0,10093,Northern Ireland,Strangford Lough Brewing Company Ltd,5,4959,5
1,32848,Northern Ireland,The Sheelin Brewery,4,17616,2
2,40360,Northern Ireland,Walled City Brewing Company,6,24866,3
3,40309,Northern Ireland,Ards Brewing Company,7,13538,13
4,41205,Northern Ireland,Barrahooley Brewery,3,22304,4


1. **brewery_id_ba**: Unique identifier for breweries on BeerAdvocate.
2. **brewery_location_ba**: Geographical location of breweries on BeerAdvocate.
3. **brewery_name_ba**: Names of breweries on BeerAdvocate.
4. **brewery_nbr_beers_ba**: Number of beers associated with each brewery on BeerAdvocate.
5. **brewery_id_rb**: Unique identifier for breweries on RateBeer.
6. **brewery_nbr_beers_rb**: Number of beers associated with each brewery on RateBeer.

## Beers dataframe (enriched with data from breweries dataframe)

- The rows contain the different beers rated on the two platforms

In [14]:
beers_df.head()

Unnamed: 0,abv_ba,beer_avg_rating_ba,beer_id_ba,beer_name_ba,brewery_id_ba,nbr_ratings_ba,style_ba,beer_avg_rating_rb,beer_id_rb,brewery_id_rb,nbr_ratings_rb,beer_avg_rating_ba_rb,brewery_location_ba,brewery_name_ba,brewery_nbr_beers_ba,brewery_nbr_beers_rb
0,4.8,3.439867,19827,Legbiter,10093,75,English Pale Ale,2.923596,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5
1,6.0,3.88875,20841,St. Patrick's Ale,10093,8,English Pale Ale,3.290909,41286,4959,11,3.542632,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5
2,4.2,3.556094,20842,St. Patrick's Best,10093,64,English Bitter,2.831081,41287,4959,74,3.167319,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5
3,4.8,3.96,22659,St. Patrick's Gold,10093,1,American Pale Wheat Ale,2.775,41285,4959,4,3.012,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5
4,4.5,,178681,Sheelin Stout,32848,0,Irish Dry Stout,3.2,230283,17616,2,,Northern Ireland,The Sheelin Brewery,4,2


1. **abv_ba**: Alcohol by volume for beers on BeerAdvocate.
2. **beer_avg_rating_ba**: Average rating for beers on BeerAdvocate.
3. **beer_id_ba**: Unique identifier for beers on BeerAdvocate.
4. **beer_name_ba**: Names of beers on BeerAdvocate.
5. **brewery_id_ba**: Unique identifier for breweries associated with BeerAdvocate.
6. **nbr_ratings_ba**: Number of ratings for beers on BeerAdvocate.
7. **style_ba**: Beer style on BeerAdvocate.
8. **beer_avg_rating_rb**: Average rating for beers on RateBeer.
9. **beer_id_rb**: Unique identifier for beers on RateBeer.
10. **brewery_id_rb**: Unique identifier for breweries associated with RateBeer.
11. **nbr_ratings_rb**: Number of ratings for beers on RateBeer.
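12. **beer_avg_rating_ba_rb**: Average rating for beers across BeerAdvocate and RateBeer combined (in the sample rows it matches the mean of the two platform averages weighted by their number of ratings).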
13. **brewery_location_ba**: Geographical location of breweries on BeerAdvocate.
14. **brewery_name_ba**: Names of breweries on BeerAdvocate.
15. **brewery_nbr_beers_ba**: Number of beers associated with each brewery on BeerAdvocate.
16. **brewery_nbr_beers_rb**: Number of beers associated with each brewery on RateBeer.

## Complete reviews dataframe

In [17]:
reviews_df.head()

Unnamed: 0,beer_name,beer_id,style,date,user_name,user_id,appearance,aroma,palate,taste,...,beer_id_rb,brewery_id_rb,nbr_ratings_rb,beer_avg_rating_ba_rb,brewery_location,brewery_name,brewery_nbr_beers_ba,brewery_nbr_beers_rb,user_location,user_nbr_ratings
0,Legbiter,19827,English Pale Ale,1417431600,Hellpop65,hellpop65.48993,,,,,...,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5,"United States, Kansas",2326.0
1,Legbiter,19827,English Pale Ale,1401357600,Latarnik,latarnik.52897,,,,,...,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5,"United States, New Jersey",3098.0
2,Legbiter,19827,English Pale Ale,1393412400,RochefortChris,rochefortchris.697017,,,,,...,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5,"United States, North Carolina",1866.0
3,Legbiter,19827,English Pale Ale,1392030000,OKCNittany,okcnittany.144868,,,,,...,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5,"United States, Oklahoma",1131.0
4,Legbiter,19827,English Pale Ale,1390647600,jaydoc,jaydoc.265507,,,,,...,37923,4959,89,3.159695,Northern Ireland,Strangford Lough Brewing Company Ltd,5,5,"United States, Kansas",9987.0


1. **beer_name**: Name of the rated beer.
2. **beer_id**: Unique identifier for the rated beer.
3. **style**: Style of the rated beer.
4. **date**: Date of the review (the sample values look like Unix timestamps in seconds).
5. **user_name**: Name of the user providing the rated review.
6. **user_id**: Unique identifier for the user.
7. **appearance**: Rating for the appearance of the rated beer.
8. **aroma**: Rating for the aroma of the rated beer.
9. **palate**: Rating for the palate of the rated beer.
10. **taste**: Rating for the taste of the rated beer.
11. **overall**: Overall rating for the rated beer.
12. **rating**: Another rating associated with the rated beer.
13. **text**: Textual content of the rated review.
14. **review**: Another aspect of the rated review.
15. **abv**: Alcohol by volume for the rated beer.
16. **beer_avg_rating_ba**: Average rating for the rated beer on BeerAdvocate.
17. **beer_id_ba**: Unique identifier for the rated beer on BeerAdvocate.
18. **brewery_id_ba**: Unique identifier for the brewery associated with the rated beer on BeerAdvocate.
19. **nbr_ratings_ba**: Number of ratings for the rated beer on BeerAdvocate.
20. **beer_avg_rating_rb**: Average rating for the rated beer on RateBeer.
21. **beer_id_rb**: Unique identifier for the rated beer on RateBeer.
22. **brewery_id_rb**: Unique identifier for the brewery associated with the rated beer on RateBeer.
23. **nbr_ratings_rb**: Number of ratings for the rated beer on RateBeer.
24. **beer_avg_rating_ba_rb**: Average rating for the rated beer across BeerAdvocate and RateBeer combined (weighted by the number of ratings on each platform, judging from the sample rows).
25. **brewery_location**: Geographical location of the brewery associated with the rated beer.
26. **brewery_name**: Name of the brewery associated with the rated beer.
27. **brewery_nbr_beers_ba**: Number of beers associated with the brewery of the rated beer on BeerAdvocate.
28. **brewery_nbr_beers_rb**: Number of beers associated with the brewery of the rated beer on RateBeer.
29. **user_location**: Geographical location of the user.
30. **user_nbr_ratings**: Number of ratings provided by the user.

## 2) First analysis: how beers from one country are rated (reviewed)

### Data cleaning

Remove any beers with fewer than 10 reviews, then aggregate by country, then remove any countries with fewer than 10 beers.
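
The two-step filter could be sketched as follows. This is only an illustration on toy data: `toy_reviews` is hypothetical, the thresholds are lowered from 10 to 2 so that some rows survive, and `brewery_location` is assumed to hold a country name directly (the real column may need cleaning first, since some locations include a state).

```python
import pandas as pd

# The text uses 10 for both thresholds; 2 is used here so the toy data is not emptied.
MIN_REVIEWS_PER_BEER = 2
MIN_BEERS_PER_COUNTRY = 2

# Hypothetical stand-in for reviews_df.
toy_reviews = pd.DataFrame(
    {
        "beer_id": [1, 1, 2, 2, 3, 4, 4],
        "brewery_location": [
            "Belgium", "Belgium", "Belgium", "Belgium",
            "Belgium", "Norway", "Norway",
        ],
        "rating": [4.0, 3.0, 5.0, 4.0, 1.0, 2.0, 3.0],
    }
)

# Step 1: one row per beer, with its review count and mean rating.
per_beer = toy_reviews.groupby(
    ["beer_id", "brewery_location"], as_index=False
).agg(n_reviews=("rating", "size"), beer_mean=("rating", "mean"))
per_beer = per_beer[per_beer["n_reviews"] >= MIN_REVIEWS_PER_BEER]

# Step 2: one row per country, with its number of remaining beers
# and the mean of the per-beer means.
per_country = per_beer.groupby("brewery_location", as_index=False).agg(
    n_beers=("beer_id", "nunique"), country_mean=("beer_mean", "mean")
)
per_country = per_country[per_country["n_beers"] >= MIN_BEERS_PER_COUNTRY]

print(per_country)
```

Averaging the per-beer means gives every beer the same weight regardless of how many reviews it has; averaging raw reviews instead would be an equally defensible choice.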

### Analysis

Compute average ratings per country and plot them; note that there are differences between countries, which motivates further investigation.

Also build word clouds for each country using the adjective datasets.
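
A word cloud is essentially a rendering of term frequencies, so a first step could be counting adjectives per country, as in the sketch below. The adjective set `ADJECTIVES` and the toy reviews are hypothetical placeholders; the actual adjective datasets are not shown in this notebook.

```python
import re
from collections import Counter, defaultdict

# Hypothetical adjective list; the real one would come from the adjective datasets.
ADJECTIVES = {"hoppy", "malty", "bitter", "sweet", "fruity"}

# Hypothetical (country, review text) pairs standing in for reviews_df rows.
toy_reviews = [
    ("Belgium", "Fruity and sweet, with a fruity finish."),
    ("Belgium", "Sweet, slightly malty."),
    ("Norway", "Very bitter and hoppy. Bitter aftertaste."),
]

# Count how often each known adjective appears in the reviews of each country.
counts_by_country = defaultdict(Counter)
for country, text in toy_reviews:
    tokens = re.findall(r"[a-z]+", text.lower())
    counts_by_country[country].update(t for t in tokens if t in ADJECTIVES)

for country, counts in counts_by_country.items():
    print(country, counts.most_common(3))
```

Each per-country `Counter` is a plain word-to-frequency mapping, which is the kind of input a word cloud library would typically take.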

Discuss the climate map and how it could be used, but only if time allows.

## 3) Second analysis: how users from one country rate beers

### Data cleaning

Remove any users with fewer than 10 reviews, then aggregate by country, then remove any countries with fewer than 10 users.

The threshold of 10 is somewhat arbitrary, but it avoids keeping countries with only 1 or 2 users who gave a beer 5 stars and would therefore skew the averages.
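
A minimal sketch of the user-side filter, showing how a country with a single enthusiastic user drops out. The frame `toy_reviews` is hypothetical, the thresholds are lowered from 10 to 2 for the toy data, and `user_location` is assumed to hold a country name directly.

```python
import pandas as pd

# The text uses 10 for both thresholds; 2 is used here for the toy data.
MIN_REVIEWS_PER_USER = 2
MIN_USERS_PER_COUNTRY = 2

# Hypothetical stand-in for reviews_df.
toy_reviews = pd.DataFrame(
    {
        "user_id": ["a", "a", "b", "b", "c", "d", "d"],
        "user_location": [
            "Canada", "Canada", "Canada", "Canada",
            "Canada", "Iceland", "Iceland",
        ],
        "rating": [4.0, 3.0, 5.0, 4.0, 5.0, 5.0, 5.0],
    }
)

# Keep only reviews written by users with enough reviews.
reviews_per_user = toy_reviews.groupby("user_id")["rating"].transform("count")
active = toy_reviews[reviews_per_user >= MIN_REVIEWS_PER_USER]

# Keep only countries that still have enough distinct users.
users_per_country = active.groupby("user_location")["user_id"].transform("nunique")
kept = active[users_per_country >= MIN_USERS_PER_COUNTRY]

# Iceland (one user, all 5-star ratings) no longer appears in the averages.
country_means = kept.groupby("user_location")["rating"].mean()
print(country_means)
```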

### Analysis

Compute average ratings per country and plot them; note that there are differences between countries, which motivates further investigation.
Potentially also build word clouds; since the dataset is already explained in part 2, simply state that the same approach is applied here.