## Data from Beer Reviews

*Data*: BeerAdvocate 

# BeerAdvocate

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* nbr_reviews
* avg
* ba_score
* bros_score
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* joined
* location

**ratings.txt** (line-based format, i.e. Header=None): incomplete data
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance
* aroma
* palate
* taste
* overall
* rating
* text
* review: *True or False*

**reviews.txt** (line-based format, i.e. Header=None; subset of **ratings.txt** containing the entries for which review=True in **ratings.txt**)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 5*
* palate : *up to 5*
* taste : *up to 5*
* overall : *up to 5*
* rating : *up to 5, unknown formula, but each parameter has a different weight*
* text

----------------------------------------------------------------------------------------------------

# Loading the data

In [1]:
import pandas as pd
%matplotlib inline

DATA_FOLDER = 'data/'
BEER_ADVOCATE_FOLDER = DATA_FOLDER + 'BeerAdvocate/' #BA

BA_BEERS_DATASET = BEER_ADVOCATE_FOLDER + "beers.csv"
BA_BREWERIES_DATASET = BEER_ADVOCATE_FOLDER + "breweries.csv"
BA_USERS_DATASET = BEER_ADVOCATE_FOLDER + "users.csv"
BA_RATINGS_DATASET = BEER_ADVOCATE_FOLDER + "ratings.txt"
BA_REVIEWS_DATASET = BEER_ADVOCATE_FOLDER + "reviews.txt"

### Beer Advocate : ratings.txt to dataframe: ba_ratings

**ba_ratings:**

size = 8393032 x 17 columns

In [2]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id','appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating', 'text', 'review']

data = []
current_entry = {}

max_entries = 1000  # maximum number of entries (blocks) read from the txt file
entry_count = 0

with open(BA_RATINGS_DATASET, 'r', encoding='utf-8') as file:
    #preview = file.read(100)
    #print(preview)

    for line in file:
        line = line.strip()  # Remove leading/trailing whitespace
        if line:  # The line is not empty
            if ':' in line:
                key, value = line.split(':', 1)  # Split key and value at the first colon
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:  # A blank line ends a block: add the entry to the dataset
                data.append(current_entry)
                current_entry = {}  # Reset for the next block
                entry_count += 1
                if entry_count >= max_entries:  # Stop after max_entries entries
                    break
# Add the last entry if needed, i.e. if the file does not end with a blank line
if current_entry and entry_count < max_entries:
    data.append(current_entry)

ba_ratings = pd.DataFrame(data, columns=columns)
ba_ratings["date"] = pd.to_numeric(ba_ratings["date"])
ba_ratings["date"] = pd.to_datetime(ba_ratings["date"], unit='s').dt.strftime('%d/%m/%Y')
cols = ['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
ba_ratings[cols] = ba_ratings[cols].apply(pd.to_numeric, errors = 'coerce')
#display(ba_ratings.head(1))
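
The parsing loop above is repeated almost verbatim for reviews.txt. A minimal sketch of how it could be factored into a reusable helper is shown below. The names `iter_entries` and `load_entries` are hypothetical (they are not part of the notebook), and the sketch assumes the same file layout as above: blocks of `key: value` lines separated by blank lines.

```python
import pandas as pd


def iter_entries(path, max_entries=None):
    """Yield one dict per block of 'key: value' lines (blocks end at a blank line)."""
    entry = {}
    count = 0
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:
                if ":" in line:
                    # Split at the first colon only, so values may contain colons
                    key, value = line.split(":", 1)
                    entry[key.strip()] = value.strip()
            elif entry:
                yield entry
                entry = {}
                count += 1
                if max_entries is not None and count >= max_entries:
                    return
    if entry:  # the file did not end with a blank line
        yield entry


def load_entries(path, columns, numeric_cols=(), max_entries=None):
    """Build a DataFrame from the parsed entries (hypothetical helper)."""
    df = pd.DataFrame(list(iter_entries(path, max_entries)), columns=columns)
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    if "date" in df.columns:
        # Assumption: 'date' holds Unix timestamps in seconds, as in the cell above
        df["date"] = pd.to_datetime(pd.to_numeric(df["date"], errors="coerce"), unit="s")
    return df
```

With such a helper, a call like `load_entries(BA_RATINGS_DATASET, columns, numeric_cols=cols, max_entries=1000)` would play the role of the cell above. Unlike that cell, this sketch keeps `date` as a datetime column instead of formatting it as a `%d/%m/%Y` string, which makes later sorting and resampling by date easier.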

### Beer Advocate : reviews.txt to dataframe: ba_reviews

**ba_reviews:**

size = 2589586 x 16 columns

In [3]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating', 'text']
data = []
current_entry = {}

max_entries = 1000  # maximum number of entries (blocks) read from the txt file
entry_count = 0

with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    #preview = file.read(700)
    #print(preview)

    for line in file:
        line = line.strip()  # Remove leading/trailing whitespace
        if line:  # The line is not empty
            if ':' in line:
                key, value = line.split(':', 1)  # Split key and value at the first colon
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:  # A blank line ends a block: add the entry to the dataset
                data.append(current_entry)
                current_entry = {}  # Reset for the next block
                entry_count += 1
                if entry_count >= max_entries:  # Stop after max_entries entries
                    break

# Add the last entry if needed, i.e. if the file does not end with a blank line
if current_entry and entry_count < max_entries:
    data.append(current_entry)

ba_reviews = pd.DataFrame(data, columns=columns)
ba_reviews["date"] = pd.to_numeric(ba_reviews["date"])
ba_reviews["date"] = pd.to_datetime(ba_reviews["date"], unit='s').dt.strftime('%d/%m/%Y')
cols = ['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
ba_reviews[cols] = ba_reviews[cols].apply(pd.to_numeric, errors = 'coerce')
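
The data description notes that `rating` combines the five sub-scores with unknown, unequal weights. One way to get a rough estimate of these weights is an ordinary least-squares fit. The sketch below assumes that the rating is a linear combination of the sub-scores with no intercept, which is an assumption and not something the dataset documents. The function name `estimate_rating_weights` is hypothetical.

```python
import numpy as np
import pandas as pd

ASPECTS = ["appearance", "aroma", "palate", "taste", "overall"]


def estimate_rating_weights(df):
    """Least-squares estimate of w in rating ≈ sum_k w_k * aspect_k.

    Assumption: the rating is linear in the sub-scores, without intercept.
    """
    # Keep only rows where all sub-scores and the rating are present
    complete = df.dropna(subset=ASPECTS + ["rating"])
    X = complete[ASPECTS].to_numpy(dtype=float)
    y = complete["rating"].to_numpy(dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(weights, index=ASPECTS)
```

Calling `estimate_rating_weights(ba_reviews)` would return one estimated weight per sub-score. If the fitted weights do not sum to roughly 1, or the residuals are large, the linear assumption probably does not hold.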

### Access to each file :

In [4]:
ba_beers = pd.read_csv(BA_BEERS_DATASET)
ba_breweries = pd.read_csv(BA_BREWERIES_DATASET)
ba_users = pd.read_csv(BA_USERS_DATASET)
# ba_ratings
# ba_reviews

### Exploration ba_reviews :

Processing the whole file at once is difficult. Should we work on samples and then run the statistical analysis, or process the file piece by piece and then concatenate the DataFrames? To be decided.
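
One possible middle ground for the question above is to stream the entries in chunks and keep only small partial aggregates, instead of sampling or concatenating every row. The sketch below is illustrative only: `iter_chunks` and `mean_rating_per_beer` are hypothetical names, and `entries` stands for any iterable of dicts with `beer_name` and `rating` keys, such as the blocks parsed from reviews.txt.

```python
from itertools import islice

import pandas as pd


def iter_chunks(entries, chunk_size):
    """Yield DataFrames of at most chunk_size rows from an iterable of dicts."""
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield pd.DataFrame(chunk)


def mean_rating_per_beer(entries, chunk_size=100_000):
    """Mean rating per beer, computed from per-chunk sums and counts."""
    partials = []
    for chunk in iter_chunks(entries, chunk_size):
        chunk["rating"] = pd.to_numeric(chunk["rating"], errors="coerce")
        # Only the per-beer sum and count of each chunk are kept in memory
        partials.append(chunk.groupby("beer_name")["rating"].agg(["sum", "count"]))
    if not partials:
        return pd.Series(dtype=float)
    totals = pd.concat(partials).groupby(level=0).sum()
    return totals["sum"] / totals["count"]
```

This only works for statistics that can be combined across chunks (sums, counts, minima, maxima). Quantities such as medians would still need either the full column or a sample.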

In [5]:
unique_count = ba_reviews['beer_name'].nunique()
print(f"Number of unique values in 'beer_name': {unique_count}")

Number of unique values in 'beer_name': 282


### Exploration ba_breweries :

size: 16758 rows × 4 columns <br>
size: 16723 rows × 4 columns (ba_breweries_c, without the href entries)

In [6]:
display(ba_breweries.sample(3))

| | id | location | name | nbr_beers |
|---|---|---|---|---|
| 9455 | 22202 | United States, Massachusetts | Element Brewing Company | 32 |
| 16150 | 395 | United States, Kentucky | Bluegrass Brewing Co. - East St. Matthew's | 121 |
| 14967 | 12041 | France | Les Bieres Des Hauts | 2 |


In [7]:
ba_breweries_c = ba_breweries[~ba_breweries['location'].str.contains('href', na=False)]

# display(sorted(ba_breweries_c['location'].unique()))

ba_breweries_c.sample(2)

| | id | location | name | nbr_beers |
|---|---|---|---|---|
| 6249 | 33333 | Spain | La Buena Pinta | 1 |
| 9939 | 16858 | United States, New York | Rohrbach Brewing Company (Brewery) | 1 |


In [8]:
ba_breweries_c['location'].describe()

count       16723
unique        262
top       Germany
freq         1431
Name: location, dtype: object

In [9]:
ba_breweries_c['name'].describe()

count                           16723
unique                          16222
top       Granite City Food & Brewery
freq                               33
Name: name, dtype: object

In [10]:
ba_breweries_c['nbr_beers'].describe()

count    16723.000000
mean        20.937810
std         68.651483
min          0.000000
25%          2.000000
50%          6.000000
75%         18.000000
max       1196.000000
Name: nbr_beers, dtype: float64

In [11]:
breweries_per_country = ba_breweries_c.groupby('location').size()
avg_beer_nbr_per_country = ba_breweries_c.groupby('location')['nbr_beers'].mean()

df = pd.DataFrame({
    'nbr_breweries': breweries_per_country,
    'avg_beer_nbr': avg_beer_nbr_per_country
})
df_sorted = df.sort_values(by='nbr_breweries', ascending=False)

display(df_sorted.head(3))

| location | nbr_breweries | avg_beer_nbr |
|---|---|---|
| Germany | 1431 | 4.318658 |
| England | 997 | 9.17653 |
| United States, California | 929 | 38.306781 |

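
The two `groupby` calls above can also be written as a single pass using pandas named aggregation. This is a sketch of an equivalent formulation, not a change to the results; the function name `summarize_by_location` is hypothetical.

```python
import pandas as pd


def summarize_by_location(breweries):
    """Number of breweries and mean number of beers per location, in one groupby."""
    return (
        breweries.groupby("location")["nbr_beers"]
        .agg(nbr_breweries="size", avg_beer_nbr="mean")
        .sort_values("nbr_breweries", ascending=False)
    )
```

For instance, `summarize_by_location(ba_breweries_c).head(3)` should give the same three rows as `df_sorted.head(3)`.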

In [12]:
# Countries with many regions/states:

us = ba_breweries_c[ba_breweries_c['location'].str.contains("United States")]
canada = ba_breweries_c[ba_breweries_c['location'].str.contains("Canada")]

df_us = pd.DataFrame({
    'nbr_breweries': us.groupby('location').size(),
    'avg_beer_nbr': us.groupby('location')['nbr_beers'].mean()
})

df_canada = pd.DataFrame({
    'nbr_breweries': canada.groupby('location').size(),
    'avg_beer_nbr': canada.groupby('location')['nbr_beers'].mean()
})

df_us_sorted=df_us.sort_values(by='nbr_breweries', ascending=False)
df_canada_sorted=df_canada.sort_values(by='nbr_breweries', ascending=False)

#display(df_us_sorted)
#display(df_canada_sorted)


# WARNING: the "US, US" entry ends up in last position!
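
Rather than filtering each multi-region country with `str.contains`, the `location` column could be split once into a country and an optional region. The sketch below assumes that locations follow the `"Country, Region"` pattern seen in the samples above. The function name `split_location` and the new column names `country` and `region` are hypothetical.

```python
import pandas as pd


def split_location(breweries):
    """Return a copy with 'country' and 'region' columns derived from 'location'."""
    # Split at the first ", " only: "United States, California" -> two parts
    parts = breweries["location"].str.split(", ", n=1, expand=True)
    out = breweries.copy()
    out["country"] = parts[0]
    # Locations without a region (e.g. "Germany") get a missing region
    out["region"] = parts[1] if parts.shape[1] > 1 else pd.NA
    return out
```

A country-level count would then be `split_location(ba_breweries_c).groupby("country").size()`, which puts the United States and Canada on the same footing as countries that have a single location value.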

In [13]:
us_calif_breweries = ba_breweries_c[ba_breweries_c['location'] == 'United States, California']
display(us_calif_breweries)
us_calif_breweries['nbr_beers'].describe()

| | id | location | name | nbr_beers |
|---|---|---|---|---|
| 8319 | 30407 | United States, California | Arcana Brewing Company | 18 |
| 8320 | 45671 | United States, California | Archaic Craft Brewery / Centro | 0 |
| 8321 | 31686 | United States, California | Area 51 Craft Brewery | 8 |
| 8322 | 30816 | United States, California | Armstrong Brewing Co. | 10 |
| 8323 | 43153 | United States, California | Arrogant Brewing | 41 |
| ... | ... | ... | ... | ... |
| 16684 | 2180 | United States, California | Berkeley Alehouse (Pyramid Breweries) | 1 |
| 16710 | 4309 | United States, California | Truckee Brewing Company | 2 |
| 16727 | 3114 | United States, California | California Brewing Company | 7 |
| 16739 | 13767 | United States, California | Rafters Grille And Brewery | 0 |


count     929.000000
mean       38.306781
std       103.823891
min         0.000000
25%         4.000000
50%        13.000000
75%        34.000000
max      1196.000000
Name: nbr_beers, dtype: float64

### Exploration ba_users :


In [28]:
ba_users.sample(10)

| | nbr_ratings | nbr_reviews | user_id | user_name | joined | location |
|---|---|---|---|---|---|---|
| 67343 | 2 | 0 | clinton87.832106 | Clinton87 | 1406196000.0 | |
| 81027 | 2 | 0 | dwerner.731951 | dwerner | 1368180000.0 | United States, Florida |
| 116099 | 1 | 0 | nyyfan24.806142 | NYYFAN24 | 1402740000.0 | |
| 39471 | 2 | 0 | gcappe.730053 | Gcappe | 1366884000.0 | Canada |
| 47960 | 4 | 0 | ciscoed.864466 | Ciscoed | 1410862000.0 | United States, Illinois |
| 147042 | 2 | 0 | bryonsh.618110 | BryonSh | 1314526000.0 | United States, Texas |
| 65763 | 1 | 0 | jucacobe.905455 | Jucacobe | 1418123000.0 | |
| 113168 | 2 | 0 | dludwig.913701 | dludwig | 1419419000.0 | United States, Ohio |
| 153524 | 1 | 1 | sukaluski.157032 | sukaluski | 1189073000.0 | United States, Maryland |
| 63240 | 15 | 15 | boyfromthenorth.434573 | boyfromthenorth | 1267700000.0 | United States, Michigan |


In [24]:
# Convert the joined column to a readable date
ba_users_c = ba_users.copy()
ba_users_c['joined'] = pd.to_datetime(ba_users_c['joined'], unit='s')

display(ba_users_c.sort_values(by='nbr_ratings', ascending=False))

| | nbr_ratings | nbr_reviews | user_id | user_name | joined | location |
|---|---|---|---|---|---|---|
| 228 | 12046 | 7593 | sammy.3853 | Sammy | 2003-12-01 11:00:00 | Canada |
| 1352 | 10360 | 66 | acurtis.508168 | acurtis | 2010-09-27 10:00:00 | United States, New Jersey |
| 1583 | 10302 | 34 | texasfan549.572853 | Texasfan549 | 2011-02-26 11:00:00 | United States, Texas |
| 969 | 10180 | 2091 | kylehay2004.571365 | kylehay2004 | 2011-02-23 11:00:00 | United States, Illinois |
| 994 | 9991 | 1122 | grg1313.288024 | GRG1313 | 2009-01-15 11:00:00 | United States, California |
| ... | ... | ... | ... | ... | ... | ... |
| 106949 | 1 | 0 | natsteff.416507 | natsteff | 2010-01-17 11:00:00 | United States, Wisconsin |
| 106951 | 1 | 1 | tiller06.613183 | tiller06 | 2011-07-31 10:00:00 | United States, Maryland |
| 106957 | 1 | 0 | visser5.806930 | visser5 | 2014-06-16 10:00:00 | |
| 106960 | 1 | 1 | dosgoat.689285 | Dosgoat | 2012-08-14 10:00:00 | |


In [25]:
ba_users_c['nbr_ratings'].describe()

count    153704.000000
mean         54.605163
std         252.388790
min           1.000000
25%           1.000000
50%           3.000000
75%          16.000000
max       12046.000000
Name: nbr_ratings, dtype: float64

In [26]:
ba_users_c['nbr_reviews'].describe()

count    153704.000000
mean         16.847876
std         139.846706
min           0.000000
25%           0.000000
50%           0.000000
75%           2.000000
max        8970.000000
Name: nbr_reviews, dtype: float64

In [27]:
ba_users_c['joined'].describe()

count                           151052
mean     2013-01-03 18:40:31.618250752
min                1996-08-23 10:00:00
25%                2011-04-18 10:00:00
50%                2014-02-09 11:00:00
75%                2014-12-04 11:00:00
max                2017-07-31 10:00:00
Name: joined, dtype: object