## Data Acquisition

This notebook describes the process for downloading the data from RateBeer.com.

## RateBeer Website

RateBeer.com is widely recognized as the most in-depth and accurate, and one of the most-visited, sources of beer information. RateBeer is a worldwide site for craft beer enthusiasts and is dedicated to serving the entire craft beer community through beer education, promotion and outreach. Established and maintained by dedicated volunteers, RateBeer has become the premier resource for consumer-driven beer ratings, features on beer culture and industry events, weekly beer-related editorials, and an internationally recognized, annual RateBeer Best competition. A vibrant community of hundreds of thousands of members from more than 100 countries has rated hundreds of thousands of different beers around the world.

## API

The RateBeer website provides API keys to beer enthusiasts who are interested in developing beer apps in partnership with the website, and to students who are interested in general data analysis research. For this project, an API key was obtained under the academia agreement with RateBeer.

The API documentation is provided at this link: https://www.ratebeer.com/api-documentation.asp

## Download Data

The data can be downloaded as JSON files using the API key. Beers are retrieved by rating rank, 100 beers per call. A total of 5,000 calls are allowed per month.
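
As a rough check of that budget, the arithmetic can be written out. This is only a sketch: the page size and quota are the figures stated above, the 220,000-beer target is the size of the final data set reported in this notebook, and it assumes every call returns a full page of 100 beers.

```python
import math

BEERS_PER_CALL = 100       # page size stated above
CALLS_PER_MONTH = 5_000    # monthly quota stated above
TARGET_BEERS = 220_000     # size of the final data set in this notebook

# Assumes every call returns a full page of beers.
calls_needed = math.ceil(TARGET_BEERS / BEERS_PER_CALL)
max_beers_per_month = BEERS_PER_CALL * CALLS_PER_MONTH

print(calls_needed)          # 2200
print(max_beers_per_month)   # 500000
```

Under these assumptions the full download fits within a single month's quota.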

A for loop was written in Python to make 100 calls per run so as not to overload the server. The 'id' of the top-ranked beer was obtained using the GraphiQL interface provided by RateBeer. The API query has to be stitched together as a string that includes the beer id to start after, as shown in the code below. Once the for loop is done, the last downloaded beer id is obtained and the for loop is initialized again from that id. This was repeated multiple times due to server limitations. Eventually, data was downloaded for the top 220,000 beers as 13 different .json files.
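
The string stitching described above could also be done by letting `json.dumps` handle the quoting. The following is a minimal sketch, not the code that was actually run: `build_request_body` and `BEER_FIELDS` are hypothetical names introduced here, and the field selection is copied from the query used in this notebook.

```python
import json

# Field selection copied from the query used in this notebook.
BEER_FIELDS = (
    "items{id,name,description,abv,styleScore,overallScore,averageRating,"
    "ratingCount,style{name,description},brewer{name, score, description,"
    "type,twitter,facebook,streetAddress,city,"
    "state{name,country{name,continent}}}},last"
)


def build_request_body(after_id, page_size=100):
    """Return the JSON request body for one page of topBeers (hypothetical helper)."""
    query = "query{topBeers(first: %s, after: %s){%s}}" % (
        page_size, after_id, BEER_FIELDS)
    # json.dumps adds the surrounding braces and quotes that would
    # otherwise have to be concatenated by hand.
    return json.dumps({"query": query})


body = build_request_body(31518)
print(body[:60])
```

The resulting string can be passed as the `data` argument of `requests.post` in the same way as the hand-built `request_string`.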

In [None]:
#code for downloading data from RateBeer. The main loop has been commented out to prevent accidentally running it
#import modules
import requests
import json
import time

#initialize the request headers as required by the API
headers = {
    'content-type': 'application/json',
    'accept': 'application/json',
    'x-api-key': '******************',
}
l = []  #initialize an empty list
j = 31518   #beer id to initialize the for loop

#for loop to acquire 10,000 beers (100 calls of 100 beers each) starting from beer id 31518. The responses are collected in the list l and written to one file below.
#for i in range(1,101) :
#    request_string  =  '{' + '"'+'query'+'"'+':'+'"'+"query{topBeers(first: 100" +"," + 'after:'+' '+str(j) + ')'+'{items{id,name,description,abv,styleScore,overallScore,averageRating,ratingCount,style{name,description},brewer{name, score, description,type,twitter,facebook,streetAddress,city,state{name,country{name,continent}}}},last}}' +'"'+'}'
#    r = requests.post('https://api.r8.beer/v1/api/graphql/', headers=headers, data=request_string)
#    json_data = r.json()
#    l.append(json_data)
#    j = (json_data['data']['topBeers']['last'])
#    time.sleep(2)    #wait for 2 seconds before starting the next iteration

#write the json file (the with block closes the file once writing is done)
with open("beers_first200.json", "w") as output_file:
    json.dump(l, output_file, indent=4)

#get the last beer id to reset the for loop (j holds the 'last' value of the final response)
print(j)
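
The loop above follows a cursor: each response reports the `last` beer id, which becomes the starting point of the next call. A minimal sketch of that pattern is given below, under stated assumptions. `download_pages` and `fetch_page` are hypothetical names, and `fake_fetch` is a stand-in for the real API call so that the sketch runs without an API key or network access.

```python
import time


def download_pages(fetch_page, start_id, n_pages=100, pause=2.0):
    """Collect up to n_pages responses, following the 'last' cursor.

    fetch_page is a hypothetical callable that takes a beer id and returns
    the decoded JSON response for the page that starts after that id.
    """
    responses = []
    cursor = start_id
    for _ in range(n_pages):
        page = fetch_page(cursor)
        responses.append(page)
        cursor = page["data"]["topBeers"]["last"]
        time.sleep(pause)   # pause between calls to go easy on the server
    return responses, cursor


# Stand-in for the real API call. Real beer ids are not sequential;
# adding 100 here only makes the cursor movement visible.
def fake_fetch(after_id):
    return {"data": {"topBeers": {"items": [], "last": after_id + 100}}}


responses, next_id = download_pages(fake_fetch, start_id=31518, n_pages=3, pause=0)
print(len(responses), next_id)   # 3 31818
```

In real use, `fetch_page` would wrap the `requests.post` call with the headers defined above and return `r.json()`. The returned cursor is the value to start the next run from.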

## Data Merging

A short R script was written to merge the JSON files into one CSV file. It will be converted to Python code.
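
One possible shape for the Python version is sketched below. It is not a translation of the R script, which is not shown here. It assumes that each .json file holds a list of API responses as written by the download code, and the column names are taken from the merged CSV shown further down. `flatten_beer` and `merge_to_csv` are hypothetical names.

```python
import csv
import json

COLUMNS = [
    "beer_id", "beer_name", "beer_description", "beer_abv", "beer_styleScore",
    "beer_overallScore", "beer_averageRating", "beer_ratingCount",
    "beer_styleName", "brewery_name", "brewery_type", "brewery_street",
    "brewery_city", "brewery_state", "brewery_country", "brewery_continent",
    "brewery_twitter", "brewery_facebook",
]


def flatten_beer(item):
    """Flatten one nested beer record into a single-level dict."""
    # "or {}" guards against nested objects that are missing or null.
    style = item.get("style") or {}
    brewer = item.get("brewer") or {}
    state = brewer.get("state") or {}
    country = state.get("country") or {}
    return {
        "beer_id": item.get("id"),
        "beer_name": item.get("name"),
        "beer_description": item.get("description"),
        "beer_abv": item.get("abv"),
        "beer_styleScore": item.get("styleScore"),
        "beer_overallScore": item.get("overallScore"),
        "beer_averageRating": item.get("averageRating"),
        "beer_ratingCount": item.get("ratingCount"),
        "beer_styleName": style.get("name"),
        "brewery_name": brewer.get("name"),
        "brewery_type": brewer.get("type"),
        "brewery_street": brewer.get("streetAddress"),
        "brewery_city": brewer.get("city"),
        "brewery_state": state.get("name"),
        "brewery_country": country.get("name"),
        "brewery_continent": country.get("continent"),
        "brewery_twitter": brewer.get("twitter"),
        "brewery_facebook": brewer.get("facebook"),
    }


def merge_to_csv(json_paths, csv_path):
    """Merge the downloaded JSON files into one CSV; return the row count."""
    rows = []
    for path in json_paths:
        with open(path, encoding="utf-8") as f:
            responses = json.load(f)   # assumed: a list of API responses
        for response in responses:
            for item in response["data"]["topBeers"]["items"]:
                rows.append(flatten_beer(item))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Missing or null nested objects end up as empty cells in the CSV, which pandas reads back as `NaN`.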

Let's take a quick look at the merged file.

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('RateBeer.csv')

In [4]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | beer_id | beer_name | beer_description | beer_abv | beer_styleScore | beer_overallScore | beer_averageRating | beer_ratingCount | beer_styleName | brewery_name | brewery_type | brewery_street | brewery_city | brewery_state | brewery_country | brewery_continent | brewery_twitter | brewery_facebook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58057 | Närke Kaggen Stormaktsporter | Imperial Stout brewed with heather honey and a... | 9.5 | 100.0 | 100.0 | 4.489771 | 557 | Imperial Stout | Närke Kulturbryggeri | Microbrewery | Beväringsgatan 2 | Örebro | | Sweden | | | |
| 1 | 2 | 4934 | Westvleteren 12 (XII) | Westvleteren has the smallest output of the Tr... | 10.2 | 100.0 | 100.0 | 4.426578 | 3332 | Abt/Quadrupel | Westvleteren Abdij St. Sixtus | Microbrewery | Donkerstraat 12 | Westvleteren | | Belgium | | | |
| 2 | 3 | 231441 | Schramm’s The Heart of Darkness | The Heart of Darkness is our capstone mead. It... | 14.0 | 100.0 | 100.0 | 4.423655 | 77 | Mead | Schramm’s Mead | Meadery | 327 West 9 Mile Road | Ferndale | Michigan | United States | North America | schrammsmead | https://facebook.com/SchrammsMeadery |
| 3 | 4 | 106749 | B. Nektar Ken Schramm Signature Series - The H... | Meadmaker Ken Schramm crafted the Heart of Dar... | 14.0 | 100.0 | 100.0 | 4.421873 | 50 | Mead | B. Nektar Meadery | Meadery | 1511 Jarvis | Ferndale | Michigan | United States | North America | bnektar | https://facebook.com/b.nektar |
| 4 | 5 | 140581 | Cigar City Pilot Series Dragonfruit Passion Fr... | Editor’s Note: This is an archived entry for t... | 0.0 | 100.0 | 100.0 | 4.420719 | 46 | Berliner Weisse | Cigar City Brewing | Microbrewery | 3924 W Spruce Street | Tampa | Florida | United States | North America | cigarcitybeer | https://facebook.com/cigarcitybeer |


In [6]:
df.shape

(220000, 19)

The data set contains 220,000 beers with 19 variables. The data cleaning and further analysis will be described in separate notebooks.