# General analysis
### Data importing

In [11]:
# Import all libraries
import pandas as pd
import os
import plotly.express as px
from src.utils.plots import *

# Set the data folder
DATA_FOLDER = os.path.join(os.getcwd(), '../data/processed/')

# Save the images folder
IMAGES_FOLDER = os.path.join(os.getcwd(), '../docs/plots/')

# Load the data
df_beers = pd.read_parquet(DATA_FOLDER + 'beers.pq')
df_breweries = pd.read_parquet(DATA_FOLDER + 'breweries.pq')
df_users = pd.read_parquet(DATA_FOLDER + 'users.pq')
df_ratings_no_text = pd.read_parquet(DATA_FOLDER + 'ratings_no_text.pq')

### General analysis with correlation study

In [12]:
# Columns used for the correlation study
rating_columns = ['palate','taste','appearance','aroma','overall','abv']
corr = df_ratings_no_text[rating_columns].corr()
plot_correlation_matrix(df_ratings_no_text[rating_columns], filename=IMAGES_FOLDER + 'correlation_matrix.html')

<b>Rewrite them so that they fit the "popularity" narrative</b>

Let's review for a moment the definitions of aroma, appearance, taste, mouthfeel (palate) and overall, as given by RateBeer:
- <b>Aroma</b>: the smell of the beer.
- <b>Appearance</b>: the color, clarity, head and visual carbonation of the beer.
- <b>Taste</b>: the flavor of the beer, thinking about the palate, bitterness and finish.
- <b>Mouthfeel</b>: the body of the beer, the carbonation and the astringency.
- <b>Overall</b>: the overall characteristics and the personal experience of the beer.

In this case we don't plot the rating, since it is computed from all the other features. We will still use it (rather than the overall score) to quantify how well a beer is rated, because it takes all the other features into account. <br><br>
Appearance and Mouthfeel are each scored out of 5, Aroma and Taste out of 10, and Overall out of 20. Together they give the beer a total score out of 50, which is then divided by 10 and displayed as a score out of 5 for each rating.
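
As a quick check of the arithmetic above, here is a minimal sketch of how the five sub-scores could combine into the displayed rating. The function name `combined_rating` and the example scores are made up for illustration, the divisor of 10 is inferred from "out of 50" becoming "out of 5", and mapping Mouthfeel to the `palate` key is an assumption based on the column names used in this notebook.

```python
# Maximum points per sub-score, following the description in the text.
# Assumption: "Mouthfeel" corresponds to the `palate` column.
MAX_POINTS = {"appearance": 5, "palate": 5, "aroma": 10, "taste": 10, "overall": 20}


def combined_rating(scores: dict) -> float:
    """Sum the five sub-scores (total out of 50) and rescale to a 0-5 range."""
    for name, max_value in MAX_POINTS.items():
        if not 0 <= scores[name] <= max_value:
            raise ValueError(f"{name} must be between 0 and {max_value}")
    total = sum(scores[name] for name in MAX_POINTS)
    # Inferred step: 50 points map onto a 5-point scale, hence the division by 10.
    return total / 10


# Hypothetical single rating, not taken from the dataset.
example = {"appearance": 4, "palate": 3, "aroma": 7, "taste": 8, "overall": 15}
print(combined_rating(example))  # (4 + 3 + 7 + 8 + 15) / 10 = 3.7
```

Because Overall alone contributes up to 20 of the 50 points, it carries the largest weight in the final rating.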

Overall, taste and aroma are strongly correlated. This is reasonable: the taste and the smell of a beer weigh heavily when expressing a preference, and users likely give these two factors a high weight when evaluating the overall experience (so if they are low, the overall score will be low too). Appearance and palate also correlate with the overall score to a non-negligible degree (0.5 and 0.66), but less strongly, since it is reasonable to think that the appearance and the body of a beer affect the overall rating less.

The correlation between the overall rating and ABV (0.37) is weaker still, which makes sense. There is some relationship between alcohol content and the overall rating, but the lower correlation suggests that ABV is not a dominant factor in how users rate a beer. This is intuitive: many beers fall in similar ABV ranges, yet the experience varies significantly with other factors (such as taste, aroma or body). Beers with the same ABV can therefore lead to quite different user experiences, which weakens the correlation between ABV and the overall rating.
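
To make the comparison of correlation strengths concrete, the sketch below ranks features by their Pearson correlation with `overall`. It runs on synthetic data generated here (not on `df_ratings_no_text`), and the coefficients 0.5 and 0.4 are arbitrary choices that only mimic a strong and a weak link.

```python
import numpy as np
import pandas as pd

# Synthetic data: illustrative only, not the RateBeer ratings.
rng = np.random.default_rng(0)
n = 1_000
taste = rng.normal(size=n)
toy = pd.DataFrame({
    "taste": taste,
    # `overall` follows `taste` closely (small added noise) -> strong correlation
    "overall": taste + rng.normal(scale=0.5, size=n),
    # `abv` is only loosely tied to `taste` (large added noise) -> weak correlation
    "abv": 0.4 * taste + rng.normal(size=n),
})

# Pearson correlation of every other column with `overall`, strongest first.
corr_with_overall = (
    toy.corr()["overall"]
    .drop("overall")
    .sort_values(ascending=False)
)
print(corr_with_overall)
```

The same three lines applied to the real rating columns would give a ranked view of the matrix plotted above, which can be easier to read than the full heatmap.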

<b>Note</b> Here I'd add the correlation matrix (maybe change the plot if you don't like it) and describe how the rating / popularity is linked with the different aspects of the ratings. I'd highlight that the correlation between ABV and overall is still interesting because it confirms what we observed in the ABV plots, so I'd mention that we'll observe this phenomenon in the next sections. Let me know if you think we need some kind of temporal analysis (I think spatial would be too complex, but let me know).

### Average rating in the different countries

In [13]:
# Create a filtered dataframe
number_of_ratings_per_country = df_ratings_no_text.groupby('country_user').size().reset_index().rename(columns={0:'count'})
number_of_ratings_per_country = number_of_ratings_per_country[number_of_ratings_per_country['count'] > 250]
df_ratings_filtered = df_ratings_no_text[df_ratings_no_text['country_user'].isin(number_of_ratings_per_country['country_user'])]

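# Compute the average rating per country and per US state
# (the 'count' column holds the mean rating and matches 'z_label' in the plot options)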
average_rating_no_US = df_ratings_filtered[df_ratings_filtered['country_user'] != 'United States'].groupby('country_user')['rating'].mean().reset_index().rename(columns={'country_user':'location', 'rating':'count'})
average_rating_US = df_ratings_filtered[df_ratings_filtered['country_user'] == 'United States'].groupby('state_user')['rating'].mean().reset_index().rename(columns={'state_user':'location', 'rating':'count'})

# Plot everything
options = {
    "title": "Average Ratings by Country and US State",
    "plots": [{
        'label': 'Average rating per location',
        'location_label': 'location',
        'z_label': 'count',
        'colorscale': 'Blues'
    }]
}
plot_map(average_rating_no_US, average_rating_US, options)

In [14]:
average_rating = pd.concat([average_rating_no_US, average_rating_US]).sort_values('count', ascending=False)
average_rating.head(10)

| | location | count |
|---|---|---|
| 56 | Puerto Rico | 3.510774 |
| 38 | Rhode Island | 3.496934 |
| 2 | Arizona | 3.494112 |
| 44 | Vermont | 3.490121 |
| 57 | Romania | 3.481652 |
| 43 | Utah | 3.467568 |
| 6 | Connecticut | 3.463515 |
| 61 | Singapore | 3.459852 |
| 18 | Maine | 3.455931 |
| 27 | Greece | 3.451455 |


In [15]:
average_rating = pd.concat([average_rating_no_US, average_rating_US]).sort_values('count', ascending=True)
average_rating.head(10)

| | location | count |
|---|---|---|
| 34 | Indonesia | 2.681097 |
| 21 | El Salvador | 2.724705 |
| 43 | Marshall Islands | 2.749757 |
| 69 | Taiwan | 2.825769 |
| 14 | Colombia | 2.893103 |
| 15 | Croatia | 2.903519 |
| 23 | Faroe Islands | 2.918591 |
| 7 | Bolivia | 2.959013 |
| 19 | Dominican Republic | 2.961093 |
| 46 | Mozambique | 2.964452 |


### Rating evolution over time in the different countries

In [16]:
unique_states_breweries = df_ratings_no_text['country_user'].unique()
row_US = []
row_no_US = []
# For each year, compute the mean rating per non-US country and per US state
for year in range(2002, 2018):
    df_state_no_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_user'] != 'United States')].groupby('country_user').agg({'rating': 'mean'}).reset_index()
    df_state_US = df_ratings_no_text[(df_ratings_no_text['date'].dt.year == year) & (df_ratings_no_text['country_user'] == 'United States')].groupby('state_user').agg({'rating': 'mean'}).reset_index()
    # 'count' holds the mean rating for that location and year
    row_no_US += [{'year': year, 'location': country, 'count': rating} for country, rating in zip(df_state_no_US['country_user'], df_state_no_US['rating'])]
    row_US += [{'year': year, 'location': state, 'count': rating} for state, rating in zip(df_state_US['state_user'], df_state_US['rating'])]
df_states_no_US = pd.DataFrame(row_no_US)
df_states_US = pd.DataFrame(row_US)

In [17]:
options = {
    'title': '',
    'time_range': range(2002, 2018),
    'time_label': 'year',
    'location_label': 'location',
    'value_label': 'count',
    'range_color': [1, 4],
    'color_scale': 'Viridis'
}
plot_map_time(df_states_no_US, df_states_US, options)

In [18]:
obj = {}
for state in average_rating_US.sort_values('count', ascending=False).head(5)['location']:
    filtered = df_ratings_no_text[df_ratings_no_text['state_user'] == state].groupby('year').agg({'rating': 'mean'}).reset_index()
    obj[state] = filtered
obj_combined = obj
plots_values_over_time(obj, 'year', 'rating', 'Year', 'Average rating', 'Average rating per year for the top 5 US states')

In [19]:
obj = {}
for state in average_rating_no_US.sort_values('count', ascending=False).head(5)['location']:
    filtered = df_ratings_no_text[df_ratings_no_text['country_user'] == state].groupby('year').agg({'rating': 'mean'}).reset_index()
    obj[state] = filtered
obj_combined = obj_combined | obj
plots_values_over_time(obj, 'year', 'rating', 'Year', 'Average rating', 'Average rating per year for the top 5 non-US countries')

In [20]:
plots_values_over_time(obj_combined, 'year', 'rating', 'Year', 'Average rating', 'Average rating per year for top-10 countries or US states')

It makes sense to split US vs. non-US because, in general, US-based breweries significantly outperform the others, beating even countries that are well known for their beers (such as England, Belgium or Germany).
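
One way to check a US vs. non-US gap directly is to compare the mean rating of the two groups. The sketch below does this on a tiny made-up table: the rating values are invented for illustration, and only the `country_user` and `rating` column names are borrowed from the cells above.

```python
import pandas as pd

# Made-up ratings for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "country_user": ["United States", "United States", "Belgium", "Germany", "England"],
    "rating": [3.8, 3.6, 3.4, 3.2, 3.3],
})

# Label each row as US or non-US, then aggregate the ratings per group.
group = (
    toy["country_user"]
    .eq("United States")
    .map({True: "US", False: "non-US"})
    .rename("group")
)
summary = toy.groupby(group)["rating"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the mean helps to judge whether a gap between the two groups rests on enough ratings to be meaningful.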