# Reviews-Based Yelp Award Project
# *Scrape California Michelin Guide*

**Alison Glazer**

This project is a conceptual idea for a new Yelp award that is driven by the content of a business's Yelp reviews. The initial prototype focuses on high-end restaurants in California, and the California Michelin Guide is used as a proxy for a judging criterion. 

This notebook contains the work done to scrape all of the restaurants included in the California Michelin Guide.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Libraries</a></span></li><li><span><a href="#Scrape-the-Michelin-Guide-website-for-all-restaurant-names-in-the-California-2019-guide" data-toc-modified-id="Scrape-the-Michelin-Guide-website-for-all-restaurant-names-in-the-California-2019-guide-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Scrape the Michelin Guide website for all restaurant names in the California 2019 guide</a></span></li></ul></div>

## Import Libraries

In [16]:
# Web Scraping
import requests
from bs4 import BeautifulSoup
import random
import time

# Numerical / Data
import pandas as pd
import numpy as np
import datetime as dt

# Saving
import pickle

## Scrape the Michelin Guide website for all restaurant names in the California 2019 guide

Define a function to retrieve the HTML code from a URL. Randomize the user-agent to minimize the likelihood of the request being denied. 

In [2]:
def get_soup(url):
    """
    Fetch a URL and return its parsed HTML.
    Returns None if the response status code is not 2xx.
    ---
    Input: URL (string)
    Output: HTML code (bs4.BeautifulSoup object) or None
    """
    # Adjacent string literals are concatenated, so no stray indentation
    # ends up inside the user-agent strings.
    UAS = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
        "Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) "
        "Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) "
        "Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 "
        "Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
    )

    ua = UAS[np.random.randint(0, len(UAS))]

    headers = {'user-agent': ua}
    response = requests.get(url, headers=headers)
    print(response.status_code)
    if str(response.status_code)[0] != '2':
        print('Check status code = {}'.format(response.status_code))
        return
    soup = BeautifulSoup(response.text, 'lxml')
    return soup
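
Randomizing the user-agent is one way to reduce the chance of a denied request; pausing between requests is another common courtesy. The notebook imports `random` and `time` but never calls them, so the sketch below is an assumption about how they could be used, not the method used here. `fetch_with_delay`, `min_delay`, and `max_delay` are hypothetical names, and `fetch` stands in for any page-fetching function such as `get_soup`.

```python
import random
import time


def fetch_with_delay(fetch, urls, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    Hypothetical helper: the delay bounds are illustrative assumptions.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a random number of seconds before every request after the first.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the example runs without network access.
    pages = fetch_with_delay(lambda url: url.upper(), ["a", "b", "c"],
                             min_delay=0.0, max_delay=0.01)
    print(pages)
```

The `sleep` parameter is injectable only so the helper can be exercised without actually waiting; in normal use the default `time.sleep` applies.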

There are 649 restaurants in the 2019 California Michelin Guide. Generate a list of URLs for all 33 results pages that contain the restaurant names.

In [3]:
# Make a list of URLs to scrape for restaurant names (results pages 1 through 33)
michelin_urls = []
for i in range(1, 34):
    michelin_urls.append('https://guide.michelin.com/us/en/california/restaurants/page/{}'.format(i))

In [4]:
def get_restaurants_from_page(michelin_url):
    """
    Return the names of the restaurants on this results page of the Michelin Guide website.
    Returns an empty list if the page could not be retrieved.
    """
    print(michelin_url)
    soup = get_soup(michelin_url)
    if soup is None:
        # get_soup returns None on a non-2xx response
        return []
    restaurant_tags = soup.find_all(class_='card__menu-content--title last')
    # Collapse runs of whitespace in each name to single spaces
    names = [' '.join(tag.text.split()) for tag in restaurant_tags]
    return names

This is an iterative process. The Michelin Guide website randomizes the order of the restaurants, so a single pass through consecutive pages can repeat some names and miss others. Run the scrape several times and maintain a set of restaurant names until all of them have been logged.

In [24]:
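# Fourth pass over all pages; earlier passes are stored in
# michelin_restaurants, michelin_restaurants2 and michelin_restaurants3
michelin_restaurants4 = []
for url in michelin_urls:
    michelin_restaurants4 += get_restaurants_from_page(url)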

https://guide.michelin.com/us/en/california/restaurants/page/1
200
https://guide.michelin.com/us/en/california/restaurants/page/2
200
https://guide.michelin.com/us/en/california/restaurants/page/3
200
https://guide.michelin.com/us/en/california/restaurants/page/4
200
https://guide.michelin.com/us/en/california/restaurants/page/5
200
https://guide.michelin.com/us/en/california/restaurants/page/6
200
https://guide.michelin.com/us/en/california/restaurants/page/7
200
https://guide.michelin.com/us/en/california/restaurants/page/8
200
https://guide.michelin.com/us/en/california/restaurants/page/9
200
https://guide.michelin.com/us/en/california/restaurants/page/10
200
https://guide.michelin.com/us/en/california/restaurants/page/11
200
https://guide.michelin.com/us/en/california/restaurants/page/12
200
https://guide.michelin.com/us/en/california/restaurants/page/13
200
https://guide.michelin.com/us/en/california/restaurants/page/14
200
https://guide.michelin.com/us/en/california/restaurants/p

In [22]:
# Combine the names from all passes and count the unique restaurants
mich = michelin_restaurants + michelin_restaurants2 + michelin_restaurants3 + michelin_restaurants4
len(set(mich))

604
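
The repeated passes described earlier can be sketched as a loop that keeps scraping until the set reaches a target count or a pass limit is hit. This is a minimal sketch under assumptions: `collect_unique`, `scrape_page`, `target`, and `max_passes` are hypothetical names, and `scrape_page` stands in for `get_restaurants_from_page`. The notebook itself combines four separately scraped lists by hand.

```python
def collect_unique(scrape_page, urls, target, max_passes=10):
    """Scrape all URLs repeatedly, accumulating unique names in a set.

    Stops once `target` unique names are found or after `max_passes` passes.
    Hypothetical helper; `scrape_page(url)` must return a list of names.
    """
    names = set()
    passes = 0
    while len(names) < target and passes < max_passes:
        for url in urls:
            names.update(scrape_page(url))
        passes += 1
    return names, passes


if __name__ == "__main__":
    import random

    # Simulated site: each request returns 3 of 12 names at random,
    # mimicking pages whose contents change between visits.
    all_names = ["restaurant_{}".format(i) for i in range(12)]
    rng = random.Random(0)

    def fake_scrape_page(url):
        return rng.sample(all_names, 3)

    found, passes = collect_unique(fake_scrape_page, ["p1", "p2", "p3", "p4"],
                                   target=len(all_names), max_passes=50)
    print(len(found), "unique names after", passes, "passes")
```

The pass limit matters because the unique count can plateau below the target, as with the 604 unique names found here against the 649 restaurants stated for the guide.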

In [23]:
# Save data for analysis
pd.Series(list(set(mich))).to_pickle('data/michelin_scrape_restaurants.pkl')