# Breweries, Beer, and Breviews: Unraveling the Global Beer Preferences
## Introduction

## Repository structure




## Include Libraries and initial settings
### Import all the libraries

In [2]:
# Import all the libraries
import pandas as pd
import plotly.express as px
from src.utils.plots import *
import numpy as np
import bar_chart_race as bcr
import networkx as nx
from pyvis.network import Network
import plotly.graph_objects as go

# Silence FutureWarning and UserWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

# Set some internal settings for plotly
px.defaults.width = 800
px.defaults.height = 600
px.defaults.template = 'plotly_white'

# Define the folder
FOLDER = 'data/processed/'
SAVING_FOLDER = 'docs/plots/'

### Load and filter the data
#### Load the data

In [3]:
# Load the data
df_beers = pd.read_parquet(FOLDER + 'beers.pq')
df_breweries = pd.read_parquet(FOLDER + 'breweries.pq')
df_users = pd.read_parquet(FOLDER + 'users.pq')
df_ratings_no_text = pd.read_parquet(FOLDER + 'ratings_no_text.pq')

## Data presentation
This section provides an overview of the dataset, which has been carefully cleaned and is nearly ready for analysis. We will summarize the data, display the first few rows, and describe the data types of the columns. For columns that are not self-explanatory, a brief description is included. <br><br>
Let's begin our analysis with the beers.

In [4]:
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,abv
0,410549,33 Export (Gabon),3198,Sobraga,Pale Lager,5.0
1,105273,Castel Beer (Gabon),3198,Sobraga,Pale Lager,5.2
2,19445,Régab,3198,Sobraga,Pale Lager,4.5
3,155699,Ards Bally Black Stout,13538,Ards Brewing Co.,Stout,4.6
4,239097,Ards Belfast 366,13538,Ards Brewing Co.,Golden Ale/Blond Ale,4.2


In [5]:
print(f"In the platform there are {df_beers.shape[0]} different beers")

In the platform there are 399987 different beers


Now let's take a look at the breweries.

In [6]:
df_breweries.head(5)

Unnamed: 0,brewery_id,brewery_name,country_brewery,state_brewery
0,3198,Sobraga,Gabon,
1,13538,Ards Brewing Co.,United Kingdom,
2,22304,Barrahooley Craft Brewery,United Kingdom,
3,22818,Boundary,United Kingdom,
4,24297,Brewbot Belfast,United Kingdom,


To ease the analysis, we have split the location into two columns:
- One for the country.
- A separate column for US states.

This split was applied only to the US because it has a large number of users and breweries, and combining all states into a single "US" category would have obscured important insights. Additionally, the size and population of many US states are comparable to those of entire countries. This approach was not extended to other countries due to the lack of state-level data for them. <br><br>
The same choice has been made for all the location columns.
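
As a rough illustration of this split, the sketch below separates a raw location string into a country and a US state. The raw format `"United States, <state>"` and the helper name `split_location` are assumptions made for this example; the actual preprocessing code is not part of this notebook.

```python
import pandas as pd

def split_location(location):
    """Split a raw location string into (country, state).

    Assumes (hypothetically) that US locations are written as
    "United States, <state>" and every other location is a bare country name.
    """
    prefix = "United States, "
    if isinstance(location, str) and location.startswith(prefix):
        return "United States", location[len(prefix):]
    return location, None

raw = pd.Series(["Gabon", "United States, Oregon", "United Kingdom"])
split = pd.DataFrame(raw.map(split_location).tolist(), columns=["country", "state"])
print(split)
```

Non-US rows keep an empty state, which matches the empty `state_brewery` values in the table above.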

In [7]:
print(f"In the platform there are {df_breweries.shape[0]} different breweries")

In the platform there are 24189 different breweries


Let's now add the location columns to the beers dataset.

In [8]:
df_beers = df_beers.join(df_breweries[['brewery_id', 'country_brewery', 'state_brewery']].set_index('brewery_id'), on='brewery_id').rename(columns={'country_brewery': 'country_beer', 'state_brewery': 'state_beer'})
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,abv,country_beer,state_beer
0,410549,33 Export (Gabon),3198,Sobraga,Pale Lager,5.0,Gabon,
1,105273,Castel Beer (Gabon),3198,Sobraga,Pale Lager,5.2,Gabon,
2,19445,Régab,3198,Sobraga,Pale Lager,4.5,Gabon,
3,155699,Ards Bally Black Stout,13538,Ards Brewing Co.,Stout,4.6,United Kingdom,
4,239097,Ards Belfast 366,13538,Ards Brewing Co.,Golden Ale/Blond Ale,4.2,United Kingdom,


Let's now return to the overview of the data and look at the users dataset.

In [9]:
df_users.head(5)

Unnamed: 0,user_id,user_name,joined,country_user,state_user
0,175852,Manslow,2012-05-20 10:00:00,Poland,
1,442761,MAGICuenca91,2017-01-10 11:00:00,Spain,
2,288889,Sibarh,2013-11-16 11:00:00,Poland,
3,250510,fombe89,2013-03-22 11:00:00,Spain,
4,122778,kevnic2008,2011-02-02 11:00:00,Germany,


In [10]:
print(f"In the platform there are {df_users.shape[0]} different users")

In the platform there are 50592 different users


And finally let's have a look at the ratings dataset.

In [11]:
df_ratings_no_text.head(5)

Unnamed: 0,date,beer_id,user_id,brewery_id,abv,style,rating,palate,taste,appearance,aroma,overall,year,brewery_name,country_brewery,state_brewery,country_user,state_user
0,2016-04-26 12:00:00,410549,175852,3198,5.0,Pale Lager,2.0,2.0,4.0,2.0,4.0,8.0,2016,Sobraga,Gabon,,Poland,
1,2017-02-17 12:00:00,105273,442761,3198,5.2,Pale Lager,1.9,2.0,4.0,2.0,3.0,8.0,2017,Sobraga,Gabon,,Spain,
2,2016-06-24 12:00:00,105273,288889,3198,5.2,Pale Lager,1.6,2.0,3.0,3.0,3.0,5.0,2016,Sobraga,Gabon,,Poland,
3,2016-01-01 12:00:00,105273,250510,3198,5.2,Pale Lager,1.5,1.0,2.0,4.0,3.0,5.0,2016,Sobraga,Gabon,,Spain,
4,2015-10-23 12:00:00,105273,122778,3198,5.2,Pale Lager,1.9,2.0,4.0,2.0,4.0,7.0,2015,Sobraga,Gabon,,Germany,


The ratings come from the [RateBeer](https://www.ratebeer.com) website, which uses the following rating system (source: [RateBeer Scores](https://www.ratebeer.com/our-scores)):
- <b>Aroma</b>: The smell of the beer.
- <b>Appearance</b>: The color, clarity, head and visual carbonation of this beer.
- <b>Taste</b>: The flavors in this beer, thinking about the palate, bitterness and finish.
- <b>Mouthfeel</b>: The body of the beer, carbonation and astringency (stored in the <code>palate</code> column of our dataset).
- <b>Overall</b>: The overall characteristics and your personal experience of the beer.

Appearance and Mouthfeel are each scored out of 5, Aroma and Taste out of 10, and Overall out of 20. Together they give the beer a total score out of 50, which is then divided by 10 and displayed as a score out of 5 (the <b>Rating</b> column). <br>
Because the <b>Rating</b> column is derived from the sum of the other columns, it is correlated with them by construction. This is important to keep in mind when analyzing the data, in particular when doing correlation analysis. No normalization has been performed at this stage.<br><br>
The country of the brewery is computed from the <code>brewery</code> dataset while the country of the user is computed from the <code>user</code> dataset.
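
The relation between the sub-scores and the rating can be checked on the first rows of the ratings table shown above. This is a minimal sketch that copies those five rows by hand and assumes that the `palate` column holds the Mouthfeel score.

```python
import numpy as np
import pandas as pd

# Sub-scores and ratings copied from the first five rows of the ratings table
sample = pd.DataFrame({
    "palate":     [2.0, 2.0, 2.0, 1.0, 2.0],
    "taste":      [4.0, 4.0, 3.0, 2.0, 4.0],
    "appearance": [2.0, 2.0, 3.0, 4.0, 2.0],
    "aroma":      [4.0, 3.0, 3.0, 3.0, 4.0],
    "overall":    [8.0, 8.0, 5.0, 5.0, 7.0],
    "rating":     [2.0, 1.9, 1.6, 1.5, 1.9],
})

sub_scores = ["palate", "taste", "appearance", "aroma", "overall"]

# Total out of 50, rescaled to a score out of 5
recomputed = sample[sub_scores].sum(axis=1) / 10

print(np.allclose(recomputed, sample["rating"]))
```

For these rows the recomputed value matches the stored rating; the same check could be run on the full dataset to look for exceptions.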

In [12]:
print(f"In the platform there are {df_ratings_no_text.shape[0]} different ratings")
print(f"The first rating was made on {df_ratings_no_text['date'].min()}")
print(f"The last rating was made on {df_ratings_no_text['date'].max()}")

In the platform there are 7123786 different ratings
The first rating was made on 2000-04-12 12:00:00
The last rating was made on 2017-07-31 12:00:00


Now that we have an overview of the data, let's look at its geographical distribution across the world.

In [13]:
# Create the DataPresentation object
from src.processing import presentation as pr
presentation = pr.DataPresentation(df_beers, df_breweries, df_users, df_ratings_no_text)

In [14]:
# Get the aggregated spatial data
df_no_US, df_US = presentation.get_spatial_aggregated()

# Define some options for the plot
options = {
    "title": "Beer Statistics by Country and US State",
    "plots": [
        { 'label': 'Beers per country', 'location_label': 'location', 'z_label': 'count', 'colorscale': 'Blues' },
        { 'label': 'Users per country', 'location_label': 'location', 'z_label': 'count', 'colorscale': 'Blues' },
        { 'label': 'Breweries per country', 'location_label': 'location', 'z_label': 'count', 'colorscale': 'Blues' },
        { 'label': 'Number of ratings based on the brewery country', 'location_label': 'location', 'z_label': 'count', 'colorscale': 'Blues' },
        { 'label': 'Number of ratings based on the reviewer country', 'location_label': 'location', 'z_label': 'count', 'colorscale': 'Blues'}
    ]
}

# Plot the map
plot_map(df_no_US, df_US, options)

And finally let's review the temporal distribution of the data.

In [15]:
ratings_temporal_grouping, users_temporal_grouping = presentation.get_temporal_aggregated()

In [16]:
px.bar(ratings_temporal_grouping, x='Year', y='Number of ratings', title='Number of ratings per year').show()

In [17]:
px.bar(users_temporal_grouping, x='Year', y='Number of users', title='Number of users that have joined each year').show()

These plots show that both 2000 and 2017 are only partially covered (the first rating dates from April 2000 and the last from July 2017), which could lead to inconsistent results. For this reason we remove both years from our analysis.
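
Partial years can also be detected programmatically. The sketch below is one possible check, not the method used in this notebook: the helper `partial_years` is hypothetical and flags the years whose ratings cover fewer than twelve distinct months. It is shown on toy dates.

```python
import pandas as pd

def partial_years(dates, min_months=12):
    """Return the years whose dates cover fewer than `min_months` distinct months."""
    months_per_year = dates.dt.month.groupby(dates.dt.year).nunique()
    return sorted(months_per_year[months_per_year < min_months].index)

# Toy data: 2000 starts in April, 2001 is complete, 2002 stops in July
toy_dates = pd.Series(pd.to_datetime(
    [f"2000-{m:02d}-15" for m in range(4, 13)]
    + [f"2001-{m:02d}-15" for m in range(1, 13)]
    + [f"2002-{m:02d}-15" for m in range(1, 8)]
))

print(partial_years(toy_dates))
```

On the real data the call would be `partial_years(df_ratings_no_text['date'])`.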

#### Filter the data

In [18]:
df_ratings_no_text = df_ratings_no_text[(df_ratings_no_text['date'].dt.year>=2001) & (df_ratings_no_text['date'].dt.year<=2016)]

In [19]:
# Define the countries whose breweries have at least 250 reviews
countries_min_number_reviews_breweries = df_ratings_no_text.groupby('country_brewery').size().sort_values(ascending=False).reset_index().rename(columns={0:'number_reviews'})
countries_min_number_reviews_breweries = list(countries_min_number_reviews_breweries[countries_min_number_reviews_breweries['number_reviews']>=250]['country_brewery'])

# Define the countries whose users have written at least 250 reviews
countries_min_number_reviews_users = df_ratings_no_text.groupby('country_user').size().sort_values(ascending=False).reset_index().rename(columns={0:'number_reviews'})
countries_min_number_reviews_users = list(countries_min_number_reviews_users[countries_min_number_reviews_users['number_reviews']>=250]['country_user'])
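
The two filters above follow the same pattern, so they could be factored into a small helper. This is a minimal sketch; the function name `countries_with_min_reviews` is hypothetical and the example runs on toy data.

```python
import pandas as pd

def countries_with_min_reviews(df, column, min_reviews=250):
    """Return the values of `column` that appear in at least `min_reviews` rows."""
    counts = df[column].value_counts()
    return list(counts[counts >= min_reviews].index)

toy = pd.DataFrame({"country_user": ["Spain"] * 3 + ["Poland"] * 2 + ["Gabon"]})
print(countries_with_min_reviews(toy, "country_user", min_reviews=2))
```

Like the `groupby(...).size()` version, `value_counts` ignores missing values and returns the countries ordered by decreasing number of reviews.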

## Global ratings analysis
In this section we analyze the general preferences of the users of the platform.
### General study of the distribution of the ratings
First, we study how the grades are distributed across the different categories. This will help us understand whether there are biases towards specific categories and whether users are more or less demanding in some of them.

In [20]:
# Define a df with only the numerical values of the ratings
df_ratings_values_only = df_ratings_no_text[['palate','taste','appearance','aroma','overall']].copy()

# Create a subplot with the different distributions next to each other
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=5, subplot_titles=('Palate', 'Taste', 'Appearance', 'Aroma', 'Overall'))

for i, column in enumerate(df_ratings_values_only.columns):
    value_counts = df_ratings_values_only[column].value_counts().sort_index()
    fig.add_trace(go.Bar(x=value_counts.index, y=value_counts.values, name=column), row=1, col=i+1)

fig.update_layout(title_text='Distribution of the ratings', showlegend=False)
fig.show()

Oh no! The categories are scored on different ranges. Let's rescale all of them to the range 0-5 so that they can be compared (we also quantize the data into five bins).

In [21]:
fig = make_subplots(rows=1, cols=5, subplot_titles=('Palate', 'Taste', 'Appearance', 'Aroma', 'Overall'), shared_yaxes=True)

ranges = [0,0.2,0.4,0.6,0.8,1]
for i, column in enumerate(df_ratings_values_only.columns):
    # Compute the aggregated value counts
    value_counts = df_ratings_values_only[column].value_counts().sort_index()
    value_counts.index = value_counts.index / value_counts.index.max()
    value_counts_aggregated = value_counts.groupby(pd.cut(value_counts.index, ranges)).sum()
    # Change the index into the elements in ranges[1:]
    value_counts_aggregated.index = [upper * 5 for upper in ranges[1:]]
    fig.add_trace(go.Bar(x=value_counts_aggregated.index, y=value_counts_aggregated.values, name=column), row=1, col=i+1)
fig.update_layout(title_text='Distribution of the ratings', showlegend=False)
fig.show()

Now it's better. Before drawing any conclusions, let's also compute some key metrics for each category. To make them comparable we normalize them too (but without discretizing them, to avoid losing information).

In [22]:
# Compute the Skewness and perform the D'Agostino's K^2 Test
from scipy.stats import skew, kurtosis, normaltest
from great_tables import GT

# Normalize
df_ratings_values_only = df_ratings_values_only / df_ratings_values_only.max()

# Get the analysis
mean = df_ratings_values_only.mean().round(2)
std = df_ratings_values_only.std().round(2)
median = df_ratings_values_only.median().round(2)
skewness = df_ratings_values_only.apply(skew).round(2)
kurtosis_result = df_ratings_values_only.apply(kurtosis).round(2)
normaltest_results = df_ratings_values_only.apply(lambda x: normaltest(x).pvalue)

# Wrap the results into a dataframe
df_results = pd.DataFrame({
    'Mean': mean,
    'Std': std,
    'Median': median,
    'Skewness': skewness,
    'Kurtosis': kurtosis_result,
    'Can reject H0 (95%)': normaltest_results < 0.05
}, index=df_ratings_values_only.columns).reset_index().rename(columns={'index': 'Rating'})

# Show the results
(
    GT(df_results)
    .tab_header(title='Statistics of the ratings')
)

Statistics of the ratings
Rating,Mean,Std,Median,Skewness,Kurtosis,Can reject H0 (95%)
palate,0.66,0.16,0.6,-0.2,0.24,True
taste,0.65,0.15,0.7,-0.78,1.0,True
appearance,0.69,0.16,0.6,-0.16,0.26,True
aroma,0.64,0.15,0.7,-0.77,0.98,True
overall,0.66,0.16,0.7,-0.99,1.39,True


We observe the following:
- <b>Normality Testing</b>: None of the distributions is normal. D'Agostino's K^2 test (<code>scipy.stats.normaltest</code>), conducted at a significance level of 0.05, rejects the normality hypothesis for all rating categories.
- <b>Left Skewness</b>: All distributions have negative skewness, i.e. they are skewed to the left: most ratings sit at the high end of the scale, with a longer tail towards low scores. The skew is strongest in the taste, aroma, and overall categories. This indicates a general tendency for users to give higher scores, deviating from normality. The data does not explain this behavior, but it may stem from human tendencies, such as avoiding very low scores or using 60% as a minimum, reflecting common grading practices.
- <b>Median Values and User Preferences</b>: Higher median values for taste, aroma, and overall experience suggest users value these aspects more than palate and appearance. This could reflect genuine preferences, or it could be a consequence of appearance and palate being more subjective and harder to evaluate, potentially leading to more random ratings. Note also that palate and appearance are scored on a coarser 5-point scale, which may account for part of this gap; the means are much closer to each other than the medians.
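
The sign convention of the skewness statistic and the behavior of the normality test can be sanity-checked on synthetic data. The sketch below uses made-up scores out of 10, not the RateBeer ratings: most of the mass sits at 7 and 8, with a thin tail of low scores.

```python
import numpy as np
from scipy.stats import skew, normaltest

# Synthetic scores out of 10: concentrated at 7-8 with a few low scores
scores = np.array([2] * 50 + [5] * 150 + [7] * 400 + [8] * 400, dtype=float)

# Negative skewness: the long tail points towards the low scores
print(f"skewness = {skew(scores):.2f}")

# D'Agostino's K^2 test: a small p-value rejects the normality hypothesis
print(f"p-value  = {normaltest(scores).pvalue:.3g}")
```

A distribution in which users favor high scores therefore has a negative skewness, as in the table above.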

### Correlation between the ratings attributes
Here we analyze the correlation between the rating attributes given by the users.

In [20]:
corr = df_ratings_no_text[['palate','taste','appearance','aroma','overall','abv']].corr()
plot_correlation_matrix(df_ratings_no_text[['palate','taste','appearance','aroma','overall','abv']], filename=SAVING_FOLDER + 'correlation_matrix.html', title='Correlation matrix between the ratings provided by the users')

All attributes correlate positively with the overall score, but taste (correlation: 0.86) and aroma (correlation: 0.77) stand out as the strongest. Palate (correlation: 0.66) and appearance (correlation: 0.50) show lower correlations with the overall score, though their contribution remains notable.
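
To read these values off the correlation matrix directly, the `overall` column can be extracted and sorted. This is a minimal sketch on five toy rows, so the numbers it prints are not the ones quoted above; on the real data the same two lines would be applied to the `corr` dataframe computed in the previous cell.

```python
import pandas as pd

# Toy ratings (five rows only); the real analysis uses the full ratings dataset
toy = pd.DataFrame({
    "palate":     [2.0, 2.0, 2.0, 1.0, 2.0],
    "taste":      [4.0, 4.0, 3.0, 2.0, 4.0],
    "appearance": [2.0, 2.0, 3.0, 4.0, 2.0],
    "aroma":      [4.0, 3.0, 3.0, 3.0, 4.0],
    "overall":    [8.0, 8.0, 5.0, 5.0, 7.0],
})

# Pearson correlation (the pandas default) of every attribute with `overall`
toy_corr = toy.corr()
ranking = toy_corr["overall"].drop("overall").sort_values(ascending=False)
print(ranking)
```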

### Rating in the different countries

In [19]:
# Keep only the countries with more than 250 ratings
number_of_ratings_per_country = df_ratings_no_text.groupby('country_brewery').size().reset_index().rename(columns={0:'count'})
number_of_ratings_per_country = number_of_ratings_per_country[number_of_ratings_per_country['count'] > 250]
df_ratings_filtered = df_ratings_no_text[df_ratings_no_text['country_brewery'].isin(number_of_ratings_per_country['country_brewery'])]

# Compute the average rating per country
average_rating_no_US = df_ratings_filtered[df_ratings_filtered['country_brewery'] != 'United States'].groupby('country_brewery')['rating'].mean().reset_index().rename(columns={'country_brewery':'location', 'rating':'count'})
average_rating_US = df_ratings_filtered[df_ratings_filtered['country_brewery'] == 'United States'].groupby('state_brewery')['rating'].mean().reset_index().rename(columns={'state_brewery':'location', 'rating':'count'})

# Plot everything
options = {
    "title": "Average Ratings by Country and US State",
    "plots": [{
        'label': 'Average rating per country',
        'location_label': 'location',
        'z_label': 'count',
        'colorscale': 'Blues'
    }]
}
plot_map(average_rating_no_US, average_rating_US, options)

In [20]:
average_rating = pd.concat([average_rating_no_US, average_rating_US]).sort_values('count', ascending=False)
average_rating.head(10)

Unnamed: 0,location,count
35,Oklahoma,3.647186
1,Alaska,3.628913
4,California,3.607066
34,Ohio,3.558747
12,Illinois,3.550528
36,Oregon,3.535853
47,Washington DC,3.516449
10,Belgium,3.504538
44,Vermont,3.502204
9,Georgia,3.494033


In [21]:
average_rating = pd.concat([average_rating_no_US, average_rating_US]).sort_values('count', ascending=True)
average_rating.head(10)

Unnamed: 0,location,count
52,Iran,1.499602
31,El Salvador,1.847122
85,Nicaragua,1.847436
44,Guatemala,1.858252
24,Cuba,1.929189
113,Uganda,1.933131
30,Egypt,1.937105
117,Venezuela,1.939168
28,Dominican Republic,1.98248
108,Tanzania,2.00383


### Rating evolution over time in the different countries

In [22]:
# Create the variables
row_US = []
row_no_US = []

# Prepare the data
for year in df_ratings_no_text['date'].dt.year.unique():
    df_state_no_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_brewery'] != 'United States')].groupby('country_brewery').agg({'rating': 'mean'}).reset_index()
    row_no_US += [{'year': year, 'location': country, 'count': rating} for country, rating in zip(df_state_no_US['country_brewery'], df_state_no_US['rating'])]

    df_state_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_brewery'] == 'United States')].groupby('state_brewery').agg({'rating': 'mean'}).reset_index()
    row_US += [{'year': year, 'location': state, 'count': rating} for state, rating in zip(df_state_US['state_brewery'], df_state_US['rating'])]

# Create the dataframes
df_states_no_US = pd.DataFrame(row_no_US)
df_states_US = pd.DataFrame(row_US)

# Keep only the countries with ratings in every year, for a cleaner plot (No US)
number_of_years_per_state_no_US = df_states_no_US.groupby('location').size().reset_index().rename(columns={0:'count'})
number_of_years_per_state_no_US = number_of_years_per_state_no_US[number_of_years_per_state_no_US['count'] == len(df_ratings_no_text['date'].dt.year.unique())]
df_states_no_US = df_states_no_US[df_states_no_US['location'].isin(number_of_years_per_state_no_US['location'])]

# Keep only the states with ratings in every year, for a cleaner plot (US)
number_of_years_per_state_US = df_states_US.groupby('location').size().reset_index().rename(columns={0:'count'})
number_of_years_per_state_US = number_of_years_per_state_US[number_of_years_per_state_US['count'] == len(df_ratings_no_text['date'].dt.year.unique())]
df_states_US = df_states_US[df_states_US['location'].isin(number_of_years_per_state_US['location'])]

# Define the options for the plot
options = {
    'title': '',
    'time_range': range(df_ratings_no_text['date'].dt.year.min(), df_ratings_no_text['date'].dt.year.max() + 1),
    'time_label': 'year',
    'location_label': 'location',
    'value_label': 'count',
    'range_color': [1, 4],
    'color_scale': 'Viridis'
}

# Display the plot
plot_map_time(df_states_no_US, df_states_US, options)
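
The per-year loop above can also be written as a single `groupby`. The sketch below is an alternative formulation under two assumptions: it uses the precomputed `year` column of the ratings dataset, and it counts the years present in the dataframe it receives rather than in the whole dataset. The helper name `yearly_mean_by_location` is hypothetical.

```python
import pandas as pd

def yearly_mean_by_location(df, location_col, value_col):
    """Mean of `value_col` per (year, location), keeping only locations present in every year."""
    yearly = (
        df.groupby(["year", location_col])[value_col]
        .mean()
        .reset_index()
        .rename(columns={location_col: "location", value_col: "count"})
    )
    n_years = yearly["year"].nunique()
    years_per_location = yearly.groupby("location")["year"].transform("nunique")
    return yearly[years_per_location == n_years]

toy = pd.DataFrame({
    "year":            [2015, 2015, 2016, 2016, 2016],
    "country_brewery": ["Belgium", "Gabon", "Belgium", "Belgium", "Spain"],
    "rating":          [3.5, 2.0, 3.0, 4.0, 2.5],
})
result = yearly_mean_by_location(toy, "country_brewery", "rating")
print(result)
```

Only Belgium has ratings in both toy years, so it is the only location kept.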

#### Bias in the evaluation in the different countries

## ABV and style analysis
In this section we'll continue by analyzing the popularity of ABV and style across the world, and we'll also study how these have evolved over time.
#### ABV analysis

In [23]:
# Define some useful variables
MAX_ABV = 20
NUMBER_OF_SAMPLES_ABV = 201
MIN_NUMBER_OF_RATING = 250

# Process the data
beer_ratings = []
linspace = np.linspace(0, MAX_ABV, NUMBER_OF_SAMPLES_ABV)
for year in sorted(df_ratings_no_text['date'].dt.year.unique()):
    # Filter the data by the year
    df_year = df_ratings_no_text[df_ratings_no_text['year'] == year]
    
    # Iterate within the ABV range
    for i in range(len(linspace) - 1):
        # Filter the data
        min_abv = round(linspace[i], 2)
        max_abv = round(linspace[i + 1], 2)

        # Compute the metrics
        filtered = df_year[(df_year['abv'] >= min_abv) & (df_year['abv'] < max_abv)]
        ratings = filtered['rating'].mean()
        nbr_ratings = filtered['rating'].count()

        # Append the data
        if nbr_ratings > MIN_NUMBER_OF_RATING:
            beer_ratings.append({'year': year, 'abv': (min_abv+max_abv)/2, 'rating': ratings, 'nbr_ratings': nbr_ratings})

# Convert to DataFrame
beer_ratings = pd.DataFrame(beer_ratings)

# Do the plot
fig = px.scatter(beer_ratings, x='abv', y='rating', size='nbr_ratings', hover_name='abv',animation_frame='year', labels={'abv': 'ABV:', 'rating': 'Rating:', 'nbr_ratings': 'Number of ratings:'},range_x=[0, 20], range_y=[2.25, 4.75])
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
fig.update_layout(showlegend=False)
fig.update_xaxes(title_text='ABV')
fig.update_yaxes(title_text='Rating')
fig.show()
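
The nested loop over years and ABV intervals can be expressed with `pd.cut` and a single `groupby`. This is a sketch of an equivalent formulation, not the code used to produce the plot: the helper name `ratings_by_abv_bin` is hypothetical, and the bin edges are not rounded to two decimals as they are in the loop, so ratings lying exactly on an edge may fall in a different bin.

```python
import numpy as np
import pandas as pd

def ratings_by_abv_bin(df, max_abv=20, n_edges=201, min_ratings=250):
    """Average rating and number of ratings per (year, ABV bin)."""
    edges = np.linspace(0, max_abv, n_edges)
    midpoints = (edges[:-1] + edges[1:]) / 2
    # right=False gives half-open bins [low, high); each bin is labelled by its midpoint
    abv_bin = pd.cut(df["abv"], edges, right=False, labels=midpoints)
    grouped = (
        df.groupby(["year", abv_bin], observed=True)["rating"]
        .agg(rating="mean", nbr_ratings="count")
        .reset_index()
    )
    grouped["abv"] = grouped["abv"].astype(float)
    # Keep only the bins with enough ratings
    return grouped[grouped["nbr_ratings"] > min_ratings]

toy = pd.DataFrame({
    "year":   [2016, 2016, 2016],
    "abv":    [5.02, 5.07, 5.25],
    "rating": [2.0, 3.0, 4.0],
})
binned = ratings_by_abv_bin(toy, min_ratings=1)
print(binned)
```

With the toy threshold of one rating, only the bin between 5.0 and 5.1 (two ratings) is kept.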

In [24]:
# Compute the metrics
avg_abv = df_ratings_no_text.groupby(df_ratings_no_text['date'].dt.year).agg({'abv': 'mean'}).reset_index().rename(columns={'date': 'year'})
corr = avg_abv['year'].corr(avg_abv['abv'], method='spearman')

# Plot the data
plots_values_over_time({'average_abv': avg_abv}, 'year', 'abv', 'Average ABV Over Years', 'Year', 'Average ABV')
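
The cell above computes the Spearman correlation between the year and the average ABV but does not display it. The sketch below shows the same call on made-up yearly averages (not the notebook's results) and prints the coefficient; a value close to 1 indicates a monotonic increase over the years.

```python
import pandas as pd

# Toy yearly averages (made-up numbers, not the notebook's results)
toy_avg_abv = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "abv":  [5.8, 6.0, 6.1, 6.5, 6.4],
})

# Rank-based correlation: only the ordering of the values matters
rho = toy_avg_abv["year"].corr(toy_avg_abv["abv"], method="spearman")
print(f"Spearman correlation between year and average ABV: {rho:.2f}")
```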

In [25]:
# Create the variables
row_no_US = []
row_US = []

# Iterate over the years
for year in df_ratings_no_text['date'].dt.year.unique():
    df_state_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_brewery'] == 'United States')].groupby('state_brewery').agg({'abv': 'mean'}).reset_index()
    row_US += [{'year': year, 'location': state, 'avg_abv': abv} for state, abv in zip(df_state_US['state_brewery'], df_state_US['abv'])]

    df_state_no_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_brewery'] != 'United States')].groupby('country_brewery').agg({'abv': 'mean'}).reset_index()
    row_no_US += [{'year': year, 'location': state, 'avg_abv': abv} for state, abv in zip(df_state_no_US['country_brewery'], df_state_no_US['abv'])]
df_states_no_US = pd.DataFrame(row_no_US)
df_states_US = pd.DataFrame(row_US)

# Keep only the countries with data in the maximum number of years (No US)
nbr_years_considered = df_states_no_US.groupby('location').agg({'year': 'count'}).reset_index()
nbr_years_considered = nbr_years_considered[nbr_years_considered['year'] == nbr_years_considered['year'].max()]
df_states_no_US = df_states_no_US[df_states_no_US['location'].isin(nbr_years_considered['location'])]

# Keep only the states with data in the maximum number of years (US)
nbr_years_considered = df_states_US.groupby('location').agg({'year': 'count'}).reset_index()
nbr_years_considered = nbr_years_considered[nbr_years_considered['year'] == nbr_years_considered['year'].max()]
df_states_US = df_states_US[df_states_US['location'].isin(nbr_years_considered['location'])]

# Define the options
options = {
    'title': 'Average ABV by State Over the Years',
    'time_range': range(df_ratings_no_text['date'].dt.year.min(), df_ratings_no_text['date'].dt.year.max() + 1),
    'time_label': 'year',
    'location_label': 'location',
    'value_label': 'avg_abv',
    'range_color': [4, 8],
    'color_scale': 'Viridis'
}
# Display the plot
plot_map_time(df_states_no_US, df_states_US, options)

#### Style analysis
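Here we analyze the beer styles: which styles receive the most ratings overall, which style is rated most often in each country, and how the most rated styles change over the years.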

In [26]:
# Choose the number of elements to display
number_of_elements_displayed = 19

# Do the styling computations
styles_counted = df_ratings_no_text['style'].value_counts().reset_index()
top_styles = styles_counted.head(number_of_elements_displayed).copy()
top_styles.loc[len(top_styles)] = {'style': 'Other', 'count': styles_counted['count'][number_of_elements_displayed:].sum()}

# Display a pie chart with gradient colors
fig = px.pie(top_styles, values='count', names='style', title='Distribution of the styles of beers', color_discrete_sequence=px.colors.sequential.Viridis, hole=0.6)
fig.update_traces(textinfo='label+percent', textfont_size=14)
fig.show()
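
The "top N plus Other" grouping used for the pie chart can be isolated in a small function. This is a minimal sketch on toy data; the helper name `top_n_with_other` is hypothetical.

```python
import pandas as pd

def top_n_with_other(values, n):
    """Count occurrences, keep the `n` most frequent and sum the rest into 'Other'."""
    counts = values.value_counts()
    top = counts.head(n)
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top

toy_styles = pd.Series(["IPA"] * 4 + ["Stout"] * 3 + ["Pilsener"] * 2 + ["Porter"])
grouped_styles = top_n_with_other(toy_styles, 2)
print(grouped_styles)
```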

In [28]:
# Find the countries style preferences
countries_style_preferences = {}
unique_styles = set()
for country in countries_min_number_reviews_users:
    # Filter the data
    df_country = df_ratings_no_text[df_ratings_no_text['country_user'] == country]
    
    # Compute the style preferences
    style_preferences = df_country['style'].value_counts().reset_index().head(1).iloc[0]['style']

    # Add into countries_style_preferences the data
    countries_style_preferences[country] = style_preferences
    unique_styles.add(style_preferences)
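
Note that "preference" here means the style that users of a country rate most often, not the style they rate highest. The same quantity can be obtained with a single `groupby`, as in this sketch on toy data (ties are not handled explicitly and are broken by the order returned by `value_counts`).

```python
import pandas as pd

toy = pd.DataFrame({
    "country_user": ["Spain", "Spain", "Spain", "Poland", "Poland", "Poland"],
    "style":        ["Pale Lager", "Pale Lager", "Stout", "Porter", "Porter", "Stout"],
})

# Most frequently rated style within each country
most_rated_style = (
    toy.groupby("country_user")["style"]
    .agg(lambda styles: styles.value_counts().idxmax())
)
print(most_rated_style.to_dict())
```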

In [84]:
# Create a network graph
G = nx.Graph()

# Add a node for each style
for style in unique_styles:
    G.add_node(style)

# Add a node for each country and connect it to the style
for country, style in countries_style_preferences.items():
    G.add_node(country)
    G.add_edge(country, style, weight=1)

# Plot the network with pyvis
net = Network(height='750px', width='100%', notebook=True)
net.from_nx(G)
net.repulsion(node_distance=150)  # Adjust the node distance for closer visualization
net.show('docs/plots/beer_styles_networl.html');

docs/plots/beer_styles_networl.html


In [None]:
import plotly.graph_objects as go
import pandas as pd

# Prepare data for the diagram (a Sankey diagram is used to link countries to styles)
countries = list(countries_style_preferences.keys())
styles = list(unique_styles)

# Combine countries and styles for all_nodes
all_nodes = countries + styles
node_indices = {node: idx for idx, node in enumerate(all_nodes)}

# Create the source and target lists for the Sankey diagram
sources = []
targets = []
values = []

for country, style in countries_style_preferences.items():
    sources.append(node_indices[country])
    targets.append(node_indices[style])
    values.append(1)  # Set weight for each connection

# Create a color palette for nodes
num_countries = len(countries)
num_styles = len(styles)
colors = ["rgba(0, 128, 255, 0.8)"] * num_countries + ["rgba(255, 128, 0, 0.8)"] * num_styles

# Assign unique colors to each cluster of connections
link_colors = []
style_color_map = {style: f"rgba({max(0, 255 - i * 30)}, {min(255, 100 + i * 30)}, {min(255, 150 + i * 20)}, 0.6)" for i, style in enumerate(styles)}
for source, target in zip(sources, targets):
    link_colors.append(style_color_map[all_nodes[target]])

# Create the Sankey diagram using Plotly
fig = go.Figure(
    data=[
        go.Sankey(
            node=dict(
                pad=20,
                thickness=30,
                line=dict(color="black", width=1),
                label=all_nodes,
                color=colors,
            ),
            link=dict(
                source=sources,  # Indices of source nodes
                target=targets,  # Indices of target nodes
                value=values,    # Values for connections
                color=link_colors,
                hovertemplate='%{source.label} → %{target.label}<extra></extra>',
            ),
        )
    ]
)

# Update layout for better visualization
fig.update_layout(
    title_text="Countries and Their Preferred Beer Styles",
    font_size=12,
    title_font_size=18,
    title_font_color="darkblue",
    title_x=0.5,
    height=900,
    width=1000,
    plot_bgcolor="rgba(240, 240, 240, 0.9)",
)

# Save the plot to an HTML file
fig.write_html("docs/plots/beer_styles_chord_diagram.html")

# Show the plot
fig.show()


In [None]:
# Could be interesting to show the evolution of the style preferences over the years with the network graph

In [75]:
# Prepare the data
all_styles = df_ratings_no_text['style'].unique()

# Build one row per (year, style) for the 10 most rated styles of each year
row = []
for year in sorted(df_ratings_filtered['date'].dt.year.unique()):
    # Get the data for the year
    df_year = df_ratings_filtered[df_ratings_filtered['date'].dt.year == year]

    # Compute the style preferences
    style_preferences = df_year['style'].value_counts().reset_index().head(10)

    # Compute the ABV for the top 10 styles
    for style in style_preferences['style'].values:
        # Get the data
        df_style = df_year[df_year['style'] == style]

        # Compute the average ABV
        avg_abv = df_style['abv'].mean()

        # Append the data
        row.append({'year': year, 'style': style, 'avg_abv': avg_abv, 'count': style_preferences[style_preferences['style'] == style]['count'].values[0]})

# Create the dataframe
df_style_avg_abv = pd.DataFrame(row)

In [83]:
# Create the plot with the slider for the years
fig = px.bar(df_style_avg_abv, x='style', y='count', animation_frame='year', title='Top 10 styles over the years', range_y=[0, 100000], color='avg_abv', range_color=[0, 10], color_continuous_scale='Blues')
# Label the axes
fig.update_xaxes(title_text='Style')
fig.update_yaxes(title_text='Number of ratings')
fig.show()

In [38]:
# Prepare the data
all_styles = df_ratings_no_text['style'].unique()

# Create an empty dataframe with one column per style, plus the year
# (all_styles is a NumPy array, so it is converted to a list before concatenating)
nbr_ratings_per_style = pd.DataFrame(columns=['year'] + list(all_styles))
for year in sorted(df_ratings_filtered['date'].dt.year.unique()):
    # Get all the ratings up to and including this year (cumulative count)
    df_year = df_ratings_filtered[df_ratings_filtered['date'].dt.year <= year]

    # Compute the number of ratings per style
    styles_counted = df_year['style'].value_counts().reset_index()

    # Create the dictionary
    row = {style: count for style, count in zip(styles_counted['style'], styles_counted['count'])}
    row['year'] = year

    # Add the row to the dataframe
    nbr_ratings_per_style = pd.concat([nbr_ratings_per_style, pd.DataFrame([row])])

# Fill the missing values and set the index
nbr_ratings_per_style = nbr_ratings_per_style.fillna(0).set_index('year')

# Set to integer the types
nbr_ratings_per_style = nbr_ratings_per_style.astype(int)
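
The cumulative table built above can also be obtained without a loop, by cross-tabulating years against styles and taking a running sum. This is a sketch of an alternative on toy data; unlike the loop, it only creates columns for the styles present in the dataframe it receives.

```python
import pandas as pd

toy = pd.DataFrame({
    "year":  [2015, 2015, 2016, 2016, 2016],
    "style": ["IPA", "Stout", "IPA", "IPA", "Porter"],
})

# Number of ratings per (year, style), then a running total down the years
cumulative = pd.crosstab(toy["year"], toy["style"]).cumsum()
print(cumulative)
```

The result is indexed by year with one integer column per style, which is the wide layout passed to `bar_chart_race` in the next cell.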

In [33]:
# Display race bar chart
bcr.bar_chart_race(nbr_ratings_per_style, period_length=1000, title='Total number of ratings per beer style', n_bars=10, steps_per_period=50, figsize=(8, 6), cmap='tab20', period_fmt='Year: {x}')

## Brewery popularity analysis
In this section we examine whether users particularly like beers from specific breweries or from specific countries. More generally, we focus on understanding the impact of breweries on the popularity of beers.

## NLP analysis
In this final section we use the textual reviews to see whether specific words or emotions are associated with high or low ratings.