# Collect Data for Beers and Breweries
This notebook collects the data for a tool that predicts the rating of a beer based on its brewery, style, and location.

Our first task is to find the information we want for each beer in the HTML. The URL https://untappd.com/b/one-barrel-brewing-company-fanny-pack/2389998 gives us the page for one beer. The top of the page shows the name of the beer, the style, and the brewery. There is also a table that gives the number of ratings, the date added, the ABV, the IBU, and, most importantly, the rating. Everything we want is in that first block. The rest of the page could support other interesting analyses, but that is beyond what we need here.
Let's load the first page of HTML and see what we can extract.
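
Before working against the live site, it can help to try the extraction on a small static snippet. The sketch below assumes each field sits in an element whose class matches the names this notebook searches for (`style`, `num`, `ibu`). The HTML is a made-up stand-in, not Untappd's actual page source, and `text_or_none` is a hypothetical helper that returns `None` instead of raising when an element is missing.

```python
from bs4 import BeautifulSoup

# Made-up stand-in markup; the real page is larger and may be structured differently.
SAMPLE_HTML = """
<html><head><title>Example Beer - Example Brewery - Untappd</title></head>
<body>
  <p class="style">Example Style</p>
  <span class="num">(3.75)</span>
</body></html>
"""

def text_or_none(soup, css_class):
    """Return the stripped text of the first element with this class, or None."""
    element = soup.find(attrs={'class': css_class})
    return element.get_text().strip() if element is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The title is split on ' - ' into the beer name and the brewery name.
title_parts = soup.find('title').get_text().split(' - ')
name, brewery = title_parts[0], title_parts[1]

style = text_or_none(soup, 'style')
rating = text_or_none(soup, 'num')
ibu = text_or_none(soup, 'ibu')  # not present in the sample, so this is None

print(name, brewery, style, rating, ibu)
```

Calling `.get_text()` directly on the result of `soup.find(...)` raises an `AttributeError` when nothing matches, so a helper like this is one way to let a long scraping run skip a malformed page instead of stopping.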

## Login

In [None]:
#import packages
import requests
from bs4 import BeautifulSoup

#Enter the account info to login to the Untappd site
loginInfo = {
    'username': 'YourUsernameHere',
    'password': 'YourPasswordHere'
}

#URL of the login page
LoginURL = 'https://untappd.com/login'

#Establish a session and post the login info
#(no with-block here: the session must stay open because later cells reuse it)
session = requests.Session()
post = session.post(LoginURL, data=loginInfo)

## Collect Beer Data

In [None]:
import time
import csv

#Loop through the URLs of beers within a range of IDs
for m in range(1200450, 1200600):
    print("Getting data for " + str(m))
    url = 'https://untappd.com/b/a/' + str(m)
    page = session.get(url, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'})
    soup = BeautifulSoup(page.text, 'html.parser')

    #Get the title info containing the name of the beer and the brewery
    title = soup.find('title').get_text()
    title = title.split(' - ')
    
    if len(title) < 2: #skip to the next page if both the name and brewery are not found
        print('invalid page')
        time.sleep(20)
        continue
        
    name = title[0]
    brewery = title[1]
    
    #Get the style of the beer (i.e. stout, IPA)
    style = soup.find(attrs={'class':'style'}).get_text()
    
    #Get the rating for the beer
    rating = soup.find(attrs={'class':'num'}).get_text()
    
    if rating == '(N/A)': #skip to the next page if the beer has no rating (occurs if # of reviews < 10)
        print('no rating')
        time.sleep(20)
        continue
    
    #Get ABV, alcohol by volume
    abv = soup.find(attrs={'class':'abv'}).get_text()
    
    #Get IBU, international bitterness units
    ibu = soup.find(attrs={'class':'ibu'}).get_text()
    
    #Get the number of people who have reviewed the beer
    raters = soup.find(attrs={'class':'raters'}).get_text()
    
    #Get the date the beer was added to the website
    date = soup.find(attrs={'class':'date'}).get_text()
    
    #Get the untappd url suffix for the brewery
    brewerylinkline = soup.find(attrs={'class':'brewery'})
    brewerylinkhtml = brewerylinkline.find('a')
    brewerylink = brewerylinkhtml.get('href')
    
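    #Append these data to a csv
    #(newline='' lets the csv module control line endings and avoids blank rows on Windows)
    with open('untappd2.csv', 'a', newline='') as csv_file: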
        writer = csv.writer(csv_file)
        writer.writerow([str(m), name, brewery, style, rating, abv, ibu, raters, date, brewerylink])
    
    #Pause 20 seconds to avoid overloading the server
    time.sleep(20)
    
print('Finished collecting data')
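
The loop above stores every value as the raw text from the page, so the numbers still need to be pulled out before any modeling. Here is a minimal sketch of one way to do that with a regular expression. The helper name `first_number` is hypothetical, and the example strings are assumptions about how the scraped text might look; only the `'(N/A)'` rating comes from the code above.

```python
import re

def first_number(text):
    """Return the first number in text as a float, or None if there is none.

    Thousands separators are removed first, so '1,234 Ratings' gives 1234.0.
    """
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

# Example strings are assumed formats, not verified page content.
print(first_number('(3.75)'))         # 3.75
print(first_number('5.5% ABV'))       # 5.5
print(first_number('1,234 Ratings'))  # 1234.0
print(first_number('(N/A)'))          # None
```

Returning `None` for text without a number keeps missing values distinguishable from a genuine zero.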

## Collect Brewery Data

In [2]:
import pandas as pd
beerdf = pd.read_csv("untappd2.csv", header=None)
#make a list of brewery url suffixes
brewerylist = beerdf.loc[:, 9].tolist()

#remove duplicate items (this may re-order the list, but this is fine)
breweryshort = list(set(brewerylist))
print('list of unique breweries is ' + str(len(breweryshort)) + ' items long')

#print the first 10 items
print(brewerylist[0:10])

#check if we already have info for some of these breweries
brewdf = pd.read_csv("breweries.csv", header=None)
brewinfo = brewdf.loc[:, 0].tolist()
newbrew = list(set(breweryshort) - set(brewinfo))
print('list of breweries missing data is ' + str(len(newbrew)) + ' items long')

list of unique breweries is 287 items long
['/FactionBrewing', '/BugnuttyBrew', '/goldencoastmead', '/tailgatebeer', '/Rhinegeist', '/WasserhundBrewingCompany', '/KannahCreekBrewingCompany', '/902Brewing', '/DandyAlesyyc', '/AlpineBeerCo']
list of breweries missing data is 15 items long
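
Deduplicating through `set` works, but the order of the result can change between runs. If a stable order is useful, for example to resume an interrupted scrape from a known position, one option is sketched below. The function names are hypothetical, and the `already_have` list is illustrative rather than taken from `breweries.csv`.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

def missing_items(candidates, known):
    """Return the unique candidates that are not in known, in first-seen order."""
    known_set = set(known)  # set membership tests are fast
    return [item for item in unique_in_order(candidates) if item not in known_set]

links = ['/FactionBrewing', '/Rhinegeist', '/FactionBrewing', '/AlpineBeerCo']
already_have = ['/Rhinegeist']  # illustrative only

print(unique_in_order(links))              # ['/FactionBrewing', '/Rhinegeist', '/AlpineBeerCo']
print(missing_items(links, already_have))  # ['/FactionBrewing', '/AlpineBeerCo']
```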


In [None]:
print(newbrew[0:10])

In [None]:
import time
import csv

#Loop through the URL suffixes of the breweries that are missing data
for brewerylink in newbrew[0:94]:
    print("Getting data for " + brewerylink)
    url = 'https://untappd.com' + brewerylink
    page = session.get(url, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'})
    soup = BeautifulSoup(page.text, 'html.parser')

    #get brewery location, split into city and state/country
    location = soup.find(attrs={'class':'brewery'}).get_text()
    location = location.split(', ')
    
    if len(location) < 2: #skip to the next page if both the city and state/country are not found
        print('invalid page')
        time.sleep(25)
        continue
        
    city = location[0]
    statecountry = location[1]
    
    #get the type of the brewery (i.e. microbrewery)
    b_type = soup.find(attrs={'class':'style'}).get_text()

    #get the overall brewery rating
    b_rating = soup.find(attrs={'class':'num'}).get_text()

    if b_rating == '(N/A)': #skip to the next page if the brewery has no rating
        print('no rating')
        time.sleep(22)
        continue

    #get the number of beers the brewery has in the system
    b_count = soup.find(attrs={'class':'count'}).get_text()

    #get the number of raters
    b_raters = soup.find(attrs={'class':'raters'}).get_text()

    #get the date the brewery was added
    b_date = soup.find(attrs={'class':'date'}).get_text()
    
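    #Append these data to a csv
    #(newline='' lets the csv module control line endings and avoids blank rows on Windows)
    with open('breweries.csv', 'a', newline='') as csv_file: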
        writer = csv.writer(csv_file)
        writer.writerow([brewerylink, city, statecountry, b_type, b_rating, b_count, b_raters, b_date])
    
    #Pause 21 seconds to avoid overloading the server
    time.sleep(21)
    
print('Finished collecting data')
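
With both files written, each beer row can be matched to its brewery row through the brewery URL suffix (column 9 of the beer file, column 0 of the brewery file). The following is a rough sketch using only the standard library. The function name is hypothetical and the sample rows are made up; only the column order follows the `writerow` calls above.

```python
import csv
import io

# Made-up sample rows in the same column order the loops above write.
BEER_CSV = (
    "1,Example Beer,Example Brewery,Example Style,r,abv,ibu,raters,date,/ExampleBrewery\n"
    "2,Other Beer,Unknown Brewery,Example Style,r,abv,ibu,raters,date,/NotCollected\n"
)
BREWERY_CSV = "/ExampleBrewery,Example City,EX,b_type,b_rating,b_count,b_raters,b_date\n"

def join_beers_to_breweries(beer_file, brewery_file):
    """Append brewery columns to each beer row whose brewery link is known."""
    # Map brewery link -> remaining brewery columns.
    breweries = {row[0]: row[1:] for row in csv.reader(brewery_file) if row}
    joined = []
    for row in csv.reader(beer_file):
        if not row:
            continue  # ignore blank lines
        info = breweries.get(row[9])
        if info is not None:  # beers without brewery data are skipped
            joined.append(row + info)
    return joined

rows = join_beers_to_breweries(io.StringIO(BEER_CSV), io.StringIO(BREWERY_CSV))
print(len(rows), rows[0][1], rows[0][10])
```

To run this on the real data, pass open file handles for `untappd2.csv` and `breweries.csv` in place of the `io.StringIO` objects.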