# ADA Project - DataBrewers

The aim of this project is to analyze how beer preferences shift across seasons and in response to weather changes, events, and cultural festivities. These insights into customer preferences could significantly help professionals such as brewers and marketers.
By understanding seasonal trends, brewers can adjust their product offerings to align more closely with consumer demand.
As a conclusion of our research, we could also suggest the beer that best matches each season or festivity, presented in the form of a time fresco.
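To make the notion of a "season" concrete, the following is a minimal sketch of how a rating date could be mapped to a season. It assumes meteorological seasons in the Northern Hemisphere; the toy DataFrame and its `date` and `style` column names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings: the rows and the 'date'/'style' column names
# are illustrative assumptions, not actual dataset content.
toy_ratings = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-15", "2015-04-02", "2015-07-04", "2015-10-31", "2015-12-25"]
    ),
    "style": ["Stout", "Saison", "Pilsner", "Pumpkin Ale", "Winter Warmer"],
})

# Assumption: meteorological seasons for the Northern Hemisphere, keyed by month.
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Derive the season of each rating from the month of its date.
toy_ratings["season"] = toy_ratings["date"].dt.month.map(MONTH_TO_SEASON)

# Count how many ratings fall into each season.
season_counts = toy_ratings.groupby("season")["style"].count()
print(season_counts)
```

Other definitions (astronomical seasons, or windows around specific festivities) would only require changing the mapping step.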

Our analysis will primarily focus on reviews published by users based in the United States of America. We chose this country because it accounts for the majority of reviews and therefore provides enough data to conduct our analysis.


For this project, we chose to use only the BeerAdvocate dataset and to discard the RateBeer dataset for several analytical and practical reasons:
1. **The Herding Effect:** As explained in the paper "When Sheep Shop: Measuring Herding Effects in Product Ratings with Natural Experiments" by Gael Lederrey and Robert West, initial ratings can influence the ones that follow. By focusing on a single dataset, we can better control for this effect within a single user community. Including RateBeer might introduce inconsistent herding effects that could skew comparative analyses.
2. **Inconsistent Rating Standards:** The BeerAdvocate and RateBeer communities each likely develop their own informal standards for beer ratings. Focusing on a single platform like BeerAdvocate allows for a more cohesive dataset, with users who rate within the same context, minimizing cross-platform variance.
3. **Data Sufficiency:** The BeerAdvocate dataset provides us with enough reviews and data to perform meaningful analysis and draw reliable insights.

In [1]:
import pandas as pd
import tarfile
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Extract the matched BeerAdvocate/RateBeer archive and list its contents
file_path = 'data/matched_beer_data.tar'
with tarfile.open(file_path) as tar:
    tar.extractall(path='../data')
    tar.list()


?rwxrwxrwx gayouf/gayouf   26775015 2017-08-10 17:15:37 ratings.csv 
?rw-rw-r-- gayouf/gayouf   99396732 2018-03-19 14:13:29 ratings_ba.txt.gz 
?rw-rw-r-- gayouf/gayouf  133634318 2018-03-19 14:22:54 ratings_rb.txt.gz 
?rwxrwxrwx gayouf/gayouf     533538 2017-08-08 15:35:54 users_approx.csv 
?rwxrwxrwx gayouf/gayouf     429785 2017-08-07 14:51:08 users.csv 
?rwxrwxrwx gayouf/gayouf   14246582 2018-03-19 14:26:49 beers.csv 
?rwxrwxrwx gayouf/gayouf    1045044 2017-08-02 18:10:05 breweries.csv 
?rw-rw-r-- gayouf/gayouf   77201217 2018-03-20 21:16:35 ratings_with_text_ba.txt.gz 
?rw-rw-r-- gayouf/gayouf  133632940 2018-03-20 21:25:56 ratings_with_text_rb.txt.gz 
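Python 3.12 and 3.13 emit a `DeprecationWarning` when `extractall` is called without a `filter` argument; whether that applies here depends on the Python version in use. The following is a minimal sketch of an extraction helper that uses the `"data"` filter when it is available. The helper name `extract_archive` and the throwaway demo archive are hypothetical and only serve to make the example self-contained.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return its sorted member names."""
    with tarfile.open(archive_path) as tar:
        names = sorted(tar.getnames())
        if hasattr(tarfile, "data_filter"):
            # Extraction filters are available: use the conservative 'data' filter.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older Python without extraction filters: fall back to the plain call.
            tar.extractall(path=dest_dir)
    return names


# Demo on a throwaway archive (hypothetical file names, not the project data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.csv")
    with open(src, "w") as f:
        f.write("a,b\n1,2\n")

    archive = os.path.join(tmp, "demo.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname="demo.csv")

    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extracted = extract_archive(archive, out_dir)
    extracted_ok = os.path.exists(os.path.join(out_dir, "demo.csv"))

print(extracted, extracted_ok)
```

The `hasattr` check keeps the helper usable on interpreters that predate extraction filters.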


In [3]:
ratings_merged = pd.read_csv('../data/ratings.csv')
ratings_ba = pd.read_csv('../data/ratings_ba.txt.gz', compression='gzip', delimiter='\t')
ratings_rb = pd.read_csv('../data/ratings_rb.txt.gz', compression='gzip', delimiter='\t')
users_approx = pd.read_csv('../data/users_approx.csv')
users_merged = pd.read_csv('../data/users.csv')
beers_merged = pd.read_csv('../data/beers.csv')
breweries_merged = pd.read_csv('../data/breweries.csv')
ratings_with_text_ba = pd.read_csv('../data/ratings_with_text_ba.txt.gz', compression='gzip', delimiter='\t')
ratings_with_text_rb = pd.read_csv('../data/ratings_with_text_rb.txt.gz', compression='gzip', delimiter='\t')


In [4]:
file_path = 'data/BeerAdvocate'
beers = pd.read_csv(file_path + '/beers.csv')
breweries = pd.read_csv(file_path + '/breweries.csv')
users = pd.read_csv(file_path + '/users.csv')

In [5]:
# Print the number of beers in the dataset
print("Number of beers in the dataset: ", len(beers))

# Print the total number of ratings by summing the nbr_ratings column of the beers dataset
print("Number of ratings in the dataset: ", beers['nbr_ratings'].sum())

# Print the total number of reviews by summing the nbr_reviews column of the beers dataset
print("Number of reviews in the dataset: ", beers['nbr_reviews'].sum())

# Discard users whose location is NaN so that we can filter on location
users = users.dropna(subset=['location'])

# Select the users whose location contains 'United States'
us_users = users[users['location'].str.contains('United States')]

# Print the number of ratings made by US users by summing their nbr_ratings column
print("Number of ratings made by US users: ", us_users['nbr_ratings'].sum())

# Print the number of reviews made by US users by summing their nbr_reviews column
print("Number of reviews made by US users: ", us_users['nbr_reviews'].sum())

Number of beers in the dataset:  280823
Number of ratings in the dataset:  8393032
Number of reviews in the dataset:  2589586
Number of ratings made by US users:  7303870
Number of reviews made by US users:  2241334
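
The counts above can be turned into the share of activity coming from US users, which supports the choice to focus on the United States. This is a minimal sketch that copies the printed totals by hand; the variable names are illustrative. Note that the overall totals come from the beers table and the US totals from the users table, so the ratio is only an approximation.

```python
# Totals copied from the printed output above.
total_ratings = 8393032
total_reviews = 2589586
us_ratings = 7303870
us_reviews = 2241334

# Fraction of all ratings and reviews that were made by US users.
us_ratings_share = us_ratings / total_ratings
us_reviews_share = us_reviews / total_reviews

print(f"Share of ratings made by US users: {us_ratings_share:.1%}")
print(f"Share of reviews made by US users: {us_reviews_share:.1%}")
```

Under these figures, US users account for roughly 87% of both ratings and reviews.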
