## Load Config Data

Load the configuration data that controls the scraper's behavior. The login information is used to log into Untappd. The search_terms entry is required only if the file named by url_file does not exist; if that file exists, the search scraping step is skipped.

**Sample Config File**

Open "untappd_sample.cfg" for a sample configuration file. Copy it to untappd.cfg and add your username and password. Git ignores untappd.cfg, so your password will not be checked in.
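
As a rough illustration of what such a file might hold, here is a minimal sketch. Only config['scraping']['url_file'] and config['scraping']['search_terms'] are read in this notebook; the "login" section, its key names, the example values, and the check_config helper are assumptions made for illustration, so check untappd_sample.cfg for the real layout.

```python
import json

# Hypothetical layout. Only the two "scraping" keys appear in this notebook;
# the "login" section and its key names are assumptions.
sample_config = {
    "login": {"username": "your-username", "password": "your-password"},
    "scraping": {
        "url_file": "../data/urls.pkl",
        "search_terms": ["ipa", "stout"],
    },
}


def check_config(config):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    scraping = config.get("scraping", {})
    if "url_file" not in scraping:
        problems.append("scraping.url_file is missing")
    if not scraping.get("search_terms"):
        problems.append("scraping.search_terms is empty")
    return problems


# The config file is JSON, so this is what the file contents would look like.
print(json.dumps(sample_config, indent=2))
print(check_config(sample_config))
```

Running a check like this before starting the browser can catch a missing key early, instead of partway through a scrape.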


In [1]:
import os
import json

config_path = 'untappd.cfg'

with open(config_path) as rdr:
    config = json.load(rdr)

### Create web driver using the scraper

In [2]:
import untappd_scraper
from untappd_scraper import ScraperType

browser = untappd_scraper.create_driver(config, headless=True)

### Identify Beer URLs to Scrape

If the file named by url_file in the config exists, we load the beer URLs from it. Otherwise we scrape the search results for each search term to collect them, then save them to that file for later runs.

In [3]:
url_file = config['scraping']['url_file']

if not os.path.exists(url_file):
    urls = []
    
    ## Create search term scraper
    search_term_scraper = untappd_scraper.create_scraper(ScraperType.SEARCH, browser)
    
    for search_term in config['scraping']['search_terms']:
        urls.extend(search_term_scraper.scrape_search_term(search_term))
    
    urls = list(set(urls))
    untappd_scraper.write_pkl(url_file, urls)
    
else:
    urls = untappd_scraper.read_pkl(url_file)
    
print('URLs Found:', len(urls))

URLs Found: 3733
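
The cell above follows a load-or-build caching pattern: reuse the saved URL list if it exists, otherwise build it and save it. The sketch below shows that pattern in isolation. It assumes the cache is a plain pickle file; the actual implementation of untappd_scraper.read_pkl and write_pkl is not shown here, and load_or_build and fake_search are hypothetical names.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`, building and saving it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as rdr:
            return pickle.load(rdr)
    result = build()
    with open(path, "wb") as wtr:
        pickle.dump(result, wtr)
    return result


def fake_search():
    # Stand-in for the search scraper; the duplicate mimics overlapping search terms.
    found = [
        "https://untappd.com/beer/1",
        "https://untappd.com/beer/2",
        "https://untappd.com/beer/1",
    ]
    return sorted(set(found))


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "urls.pkl")
    first = load_or_build(cache, fake_search)   # builds and writes the cache
    second = load_or_build(cache, lambda: [])   # reads the cache; builder is not called
    print(first == second)
```

Sorting the de-duplicated set is a small design choice that keeps the saved order stable between runs; a plain list(set(...)) does not guarantee any particular order.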


In [None]:
import uuid
import feather

import pandas as pd
from glob import glob

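## Identify all existing beer ids, and remove their urls from our url list
files = glob('../data/beer-info*.feather')

# pd.concat raises a ValueError on an empty list, so handle a first run with no files
existing_ids = set()
if files:
    df = pd.concat([feather.read_dataframe(file) for file in files])
    existing_ids = set(df['id'])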

print("Number of URLS before filter:", len(urls))
urls = [url for url in urls if int(url.split('/')[-1]) not in existing_ids]
print("Number of URLS after filter: ", len(urls))

beer_scraper = untappd_scraper.create_scraper(ScraperType.BEER, browser)

beer_results = []
review_results = []

for url in urls:
    
    beers, reviews = beer_scraper.scrape_beer(url)
    beer_results.append(beers)
    review_results.extend(reviews)
    
    print(f"{len(beer_results)}) {url} found {len(reviews)} reviews")
    ## Every 50 beers write out the beer info and reviews
    if len(beer_results) >= 50:
        print("Clearing")
        file_id = str(uuid.uuid4())
        
        # Write beer info
        feather.write_dataframe(untappd_scraper.create_beer_df(beer_results), f'../data/beer-info_{file_id}.feather')
        pd.DataFrame(beer_results).to_json(f'../data/beer-info_{file_id}.json', orient='records')

        # Write user reviews
        feather.write_dataframe(untappd_scraper.create_reviews_df(review_results), f'../data/reviews_{file_id}.feather')
        pd.DataFrame(review_results).to_json(f'../data/reviews_{file_id}.json', orient='records')
        
        beer_results = []
        review_results = []

Number of URLS before filter: 3733
Number of URLS after filter:  2259
1) https://untappd.com/beer/566994 found 290 reviews
2) https://untappd.com/beer/2717978 found 32 reviews
3) https://untappd.com/beer/68933 found 290 reviews
4) https://untappd.com/beer/2216258 found 290 reviews
5) https://untappd.com/beer/1991200 found 14 reviews
6) https://untappd.com/beer/2561113 found 20 reviews
7) https://untappd.com/beer/2800334 found 190 reviews
8) https://untappd.com/beer/2943303 found 66 reviews
9) https://untappd.com/beer/397375 found 290 reviews
10) https://untappd.com/beer/2694348 found 77 reviews
11) https://untappd.com/beer/581938 found 265 reviews
12) https://untappd.com/beer/1796100 found 290 reviews
13) https://untappd.com/beer/1400728 found 290 reviews
14) https://untappd.com/beer/2108251 found 103 reviews
15) https://untappd.com/beer/2738595 found 290 reviews
16) https://untappd.com/beer/802219 found 265 reviews
17) https://untappd.com/beer/2978738 found 290 reviews
18) https://unt
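
The loop above writes results only when a full batch has accumulated, so anything collected after the last full batch is still in memory when the loop ends. One way to cover that case is to split the pending URLs into batches up front, with the final partial batch included. This is a minimal sketch, not the scraper's own code: beer_id and batches are hypothetical helpers, and the URLs and ids are made-up sample values.

```python
def beer_id(url):
    """Extract the trailing numeric id from a beer URL, tolerating a trailing slash."""
    return int(url.rstrip("/").split("/")[-1])


def batches(items, size):
    """Yield consecutive lists of at most `size` items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        # Leftover items that did not fill a whole batch
        yield batch


# Made-up sample data standing in for the scraped URL list and stored ids
urls = [f"https://untappd.com/beer/{n}" for n in range(1, 8)]
existing_ids = {2, 5}

# Keep only URLs whose beer id has not been scraped yet
pending = [url for url in urls if beer_id(url) not in existing_ids]
chunks = list(batches(pending, 2))
print([len(chunk) for chunk in chunks])
```

With this shape, each batch can be scraped and written to its own uuid-named files, and the last write happens even when the batch is short.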

In [None]:
browser.quit()

In [None]:
## TODO: 

 - Done? Clean the data, write out JSON and feather with same uuid
 - Add scraping a user
 - Add optional flag to NOT scrape reviews from a beer