When describing the relevant aspects of the data, and any other datasets you may intend to use, you should in particular show (non-exhaustive list):

1) That you can handle the data in its size.
2) That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
3) That you considered ways to enrich, filter, transform the data according to your needs.
4) That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
5) That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

We will evaluate this milestone according to how well these steps have been done and documented, the quality of the code and its documentation, the feasibility and critical awareness of the project. We will also evaluate this milestone according to how clear, reasonable, and well thought-through the project idea is. Please use the second milestone to really check with us that everything is in order with your project (idea, feasibility, etc.) before you advance too much with the final Milestone P3! There will be project office hours dedicated to helping you.

# Modules and tools

In [3]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'; silences the warning raised when assigning to a copy of a DataFrame slice.
import numpy as np
import requests
from matplotlib import pyplot as plt
from datetime import datetime, date, time

# Beer dataset:

The dataset comes from two different websites: https://www.ratebeer.com/ and https://www.beeradvocate.com/. Both are platforms where users can review and discuss beers. Technically, each website provides three csv files and two txt files:
1) beers.csv : information about the beers
2) breweries.csv : information about the breweries
3) users.csv : information about the users
4) rating.txt : all the ratings for each beer
5) reviews.txt : all the reviews for each beer

There is also a matched dataset that links the two websites.

### Let's take a look at the data provided by BeerAdvocate:

In [14]:
beers_ba = pd.read_csv('DATA/BeerAdvocate/beers.csv')
beers_ba.head(10)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
0,166064,Nashe Moskovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,,4.7,,,0,
1,166065,Nashe Pivovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,,3.8,,,0,
2,166066,Nashe Shakhterskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,,4.8,,,0,
3,166067,Nashe Zhigulevskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,,4.0,,,0,
4,166063,Zhivoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,,4.5,,,0,
5,166068,Arpa,39913,Arpa (АРПА),Euro Pale Lager,0,0,,,,4.0,,,0,
6,166071,Eles,39914,Bear Beer,Euro Pale Lager,0,0,,,,4.0,,,0,
7,166072,Eles Light,39914,Bear Beer,Euro Pale Lager,0,0,,,,3.2,,,0,
8,166074,Toroz Svetloye,39914,Bear Beer,American Pale Lager,0,0,,,,4.5,,,0,
9,166076,Toroz Temnoye,39914,Bear Beer,Euro Dark Lager,0,0,,,,4.1,,,0,


In [11]:
breweries_ba = pd.read_csv('DATA/BeerAdvocate/breweries.csv')
breweries_ba.head(10)

Unnamed: 0,id,location,name,nbr_beers
0,39912,Kyrgyzstan,Abdysh-Ata (Абдыш Ата),5
1,39913,Kyrgyzstan,Arpa (АРПА),1
2,39914,Kyrgyzstan,Bear Beer,4
3,39915,Kyrgyzstan,Blonder Pub,4
4,39916,Kyrgyzstan,Kellers Bier,2
5,16051,Kyrgyzstan,Pivzavod Uzgen,0
6,16052,Kyrgyzstan,Steinbrau Pub,4
7,39917,Kyrgyzstan,Usu-Salkin Pivo,3
8,37262,Gabon,Societe des Brasseries du Gabon (SOBRAGA),1
9,10093,Northern Ireland,Strangford Lough Brewing Company Ltd,5


As we can see, the data appear to be stored in a relational model: the column ```brewery_id``` in beers.csv is a foreign key that points to the primary key ```id``` in breweries.csv.

We can confirm this observation with a preliminary check on the brewery name for the first row of ```beers_ba```:

In [27]:
beers_ba.iloc[0].brewery_name == breweries_ba[breweries_ba.id == beers_ba.iloc[0].brewery_id].name

0    True
Name: name, dtype: bool


This gives us a first confirmation. To make sure the keys are assigned correctly, the same test should be performed for every row, but for now let's assume there are no issues.
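The row-by-row test mentioned above could be sketched as follows. This is only a minimal illustration, assuming the column names shown in the previews (```brewery_id``` and ```brewery_name``` in the beers table, ```id``` and ```name``` in the breweries table). The helper name `check_brewery_keys` and the two toy DataFrames are hypothetical and only serve to show the expected behaviour; in the notebook, the function would be called on `beers_ba` and `breweries_ba`.

```python
import pandas as pd


def check_brewery_keys(beers: pd.DataFrame, breweries: pd.DataFrame) -> pd.DataFrame:
    """Return the beers rows whose brewery_id is unknown or whose
    brewery_name disagrees with the name stored in the breweries table."""
    merged = beers.merge(
        breweries[["id", "name"]],
        how="left",               # keep every beer, even without a matching brewery
        left_on="brewery_id",
        right_on="id",
        indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
        validate="many_to_one",   # raises if brewery ids are not unique
    )
    missing_key = merged["_merge"] == "left_only"
    name_mismatch = ~missing_key & (merged["brewery_name"] != merged["name"])
    return merged[missing_key | name_mismatch]


# Hypothetical toy data, not taken from the real files.
beers_toy = pd.DataFrame({
    "beer_id": [1, 2, 3],
    "brewery_id": [10, 10, 99],
    "brewery_name": ["A", "B", "C"],
})
breweries_toy = pd.DataFrame({"id": [10, 11], "name": ["A", "Z"]})

problems = check_brewery_keys(beers_toy, breweries_toy)
print(problems[["beer_id", "brewery_id", "brewery_name", "name"]])
```

An empty result on the real tables would indicate that every beer points to an existing brewery with a consistent name.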

Let's first discuss the data size.