# Yelp Search
(Fetching and displaying Yelp data)

#### Web Scraping
* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
  * Authentication (OAuth)
  * Pagination
  * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)
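
To make the JSON vs. HTML distinction concrete, here is a minimal sketch that uses only the standard library. The sample payloads and the `TitleParser` class are made up for illustration and are not real Yelp responses. The notebook itself uses Beautiful Soup for HTML, which adds CSS selectors on top of this kind of parsing.

```python
import json
from html.parser import HTMLParser

# hypothetical payloads, not real Yelp responses
json_text = '{"businesses": [{"name": "Example Cafe", "rating": 4.5}]}'
html_text = '<html><head><title>Example Cafe</title></head><body></body></html>'

# JSON maps directly onto dicts and lists
parsed = json.loads(json_text)
first_name = parsed["businesses"][0]["name"]


class TitleParser(HTMLParser):
  """Collect the text inside the <title> element."""

  def __init__(self):
    super().__init__()
    self.in_title = False
    self.title = ""

  def handle_starttag(self, tag, attrs):
    if tag == "title":
      self.in_title = True

  def handle_endtag(self, tag):
    if tag == "title":
      self.in_title = False

  def handle_data(self, data):
    if self.in_title:
      self.title += data


# HTML is a stream of tags that has to be walked before values can be pulled out
parser = TitleParser()
parser.feed(html_text)
page_title = parser.title

print(first_name)
print(page_title)
```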

#### Data Collection & Data Processing
* Slicing data frames
* Filtering data
* Grouped counts
* Joining two tables
* NA/Null values
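
As a rough illustration of these operations, the sketch below applies them to a plain list of dicts instead of a data frame. The records, field values, and the `review_counts` lookup are invented for the example. A data frame library would express the same steps more compactly.

```python
from collections import Counter

# hypothetical records shaped like simplified search results
businesses = [
  {"name": "A", "price": "$$", "rating": 4.5},
  {"name": "B", "price": None, "rating": 4.0},
  {"name": "C", "price": "$$", "rating": 3.5},
  {"name": "D", "price": "$", "rating": 4.5},
]

# slicing: the first two records
first_two = businesses[:2]

# filtering: keep businesses rated 4.0 or higher
well_rated = [b for b in businesses if b["rating"] >= 4.0]

# NA/null handling: drop records with no price information
with_price = [b for b in businesses if b["price"] is not None]

# grouped counts: number of businesses per price level
price_counts = Counter(b["price"] for b in with_price)

# joining: attach a value from a second table keyed by name (None when missing)
review_counts = {"A": 120, "C": 8}
joined = [{**b, "review_count": review_counts.get(b["name"])} for b in businesses]

print(len(well_rated))
print(dict(price_counts))
```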

#### Library Documentation
Documentation for the Python packages used:
* Standard Library:
  * [io](https://docs.python.org/3/library/io.html)
  * [time](https://docs.python.org/3/library/time.html)
  * [json](https://docs.python.org/3/library/json.html)
* Third Party:
  * [requests](http://docs.python-requests.org/en/master/)
  * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
  * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

### 1. Import needed libraries

In [1]:
import time, json, requests
from bs4 import BeautifulSoup

### 2. Download and return the raw HTML content of a URL
`retrieve_html`: download and return the raw HTML content of the URL passed as an argument
* args:
  * url (string)
* returns:
  * status_code (integer): HTTP status code of the response
  * raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers

In [2]:
def retrieve_html(url):
  """
  Return the raw HTML at the specified URL.

  Args:
    url (string): URL to get HTML code from

  Returns:
    status_code (integer): status of the request (200: successful, 404: not found)
    raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
  """
  # send the GET request; response.text is decoded using the encoding from the HTTP headers
  response = requests.get(url)
  status_code = response.status_code
  raw_html = response.text
  return status_code, raw_html

Test `retrieve_html`:

In [3]:
youtube_article = retrieve_html('https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html')
print(youtube_article[0])
print(youtube_article[1][:500])
# (200, '<!DOCTYPE html>\n<html lang="en" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">YouTube’s ...)

200
<!DOCTYPE html>
<html lang="en" class="story"  xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <title data-rh="true">YouTube’s Product Chief on Online Radicalization and Algorithmic Rabbit Holes - The New York Times</title>
    <meta data-rh="true" property="article:published_time" content="2019-03-29T15:40:56.000Z"/><meta data-rh="true" property="article:modified_time" content="2019-04-01T02:48:25.343Z"/><meta data-rh="true" http-equiv="Content-Language" content="en"/><meta data-r


### 3. Store the Yelp API key locally and read it in for authentication

In [3]:
# read the API key from a local file and drop surrounding whitespace
with open('yelp_api_key.txt', 'r') as f:
  api_key = f.read().strip()
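
One possible variation is to look for the key in an environment variable first and fall back to the file. This is only a sketch: the function name `load_api_key` and the variable name `YELP_API_KEY` are assumptions made for the example, not something the notebook or Yelp defines.

```python
import os

def load_api_key(path='yelp_api_key.txt', env_var='YELP_API_KEY'):
  """Return the key from the environment if set, otherwise from the file at path."""
  # env_var is a hypothetical name chosen for this sketch
  key = os.environ.get(env_var)
  if key:
    return key.strip()
  # fall back to the local key file used in this notebook
  with open(path, 'r') as f:
    return f.read().strip()
```

Calling `load_api_key()` with no arguments then behaves like the cell above whenever the environment variable is not set.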

### 4. Make an authenticated request to the search endpoint
`api_get_request`: send an HTTP GET request and return a JSON response

In [6]:
def api_get_request(url, headers, url_params):
  """
  Send an HTTP GET request and return a JSON response.

  Args:
    url (string): API endpoint URL
    headers (dict): A Python dictionary containing the HTTP headers to be sent, including authentication
    url_params (dict): The parameters (required and optional) supported by the endpoint

  Returns:
    content_json (dictionary): response from the request as a dictionary
  """
  # send request and store results (Response object) in response
  response = requests.get(url, params = url_params, headers = headers)
  # parse the JSON body of the response into a dictionary
  content_json = response.json()
  return content_json
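
Since rate limiting is one of the topics above, here is a hedged sketch of retrying with a growing pause. It assumes the API signals rate limiting with HTTP status 429 (the standard "Too Many Requests" code). `get_with_retry` and its `send` argument are hypothetical names for this example. `send` is any zero-argument callable that returns `(status_code, payload)`, which keeps the sketch independent of the network.

```python
import time

def get_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
  """
  Call send() until it returns a status other than 429 or attempts run out.

  send is any zero-argument callable returning (status_code, payload).
  """
  for attempt in range(max_attempts):
    status_code, payload = send()
    if status_code != 429:
      return status_code, payload
    if attempt < max_attempts - 1:
      # wait longer after each rate-limited attempt: 0.5s, 1s, 2s, ...
      sleep(base_delay * (2 ** attempt))
  # every attempt was rate limited: return the last result as-is
  return status_code, payload
```

To use it with `requests`, `send` could be a small function that calls `requests.get(url, params = url_params, headers = headers)` and returns `response.status_code, response.json()`.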

### 5. Get parameters needed for sending an HTTP GET request
`get_url_params`: construct the URL, headers, and URL parameters

In [7]:
def get_url_params(url_addition, api_key, **kwargs):
  """
  Get and return the parameters needed for sending an HTTP GET request.

  Args:
    url_addition (string): ending of the Yelp API endpoint URL
    api_key (string): Yelp API key for authentication
    **kwargs: additional query parameters needed for pulling data from the API

  Returns:
    (tuple): tuple of three items
      url (string): API endpoint URL
      headers (dict): headers needed for accessing the Yelp API
      url_params (dict): other parameters needed for getting data (such as location, restaurant name, etc.)
  """
  # URL endpoint for businesses + URL addition (search all businesses, reviews of a certain business, etc.)
  url = 'https://api.yelp.com/v3/businesses' + url_addition

  # authentication
  headers = {'Authorization': f"Bearer {api_key}"}

  # copy the keyword arguments into url_params (kwargs is always a dict, possibly empty)
  url_params = dict(kwargs)

  return url, headers, url_params
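
To see what the `url_params` dictionary turns into, the sketch below builds a query string with the standard library. `requests` does this step internally when given `params`; the exact escaping it applies may differ in small details, so treat the output as an approximation of the final request URL.

```python
from urllib.parse import urlencode

url = 'https://api.yelp.com/v3/businesses/search'
url_params = {'location': 'Greektown, Chicago, IL', 'offset': 20}

# percent-encode the parameters and append them as a query string
full_url = url + '?' + urlencode(url_params)
print(full_url)
# https://api.yelp.com/v3/businesses/search?location=Greektown%2C+Chicago%2C+IL&offset=20
```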

### 6. Search for businesses in a location
`yelp_search`: get at most 20 businesses in the given location

In [8]:
def yelp_search(api_key, location, offset = 0):
  """
  Get at most 20 businesses based on location.

  Args:
    api_key (string): Yelp API key for authentication
    location (string): business location
    offset (int): number of results to skip (used for pagination)

  Returns:
    total (integer): total number of businesses on Yelp corresponding to the location
    businesses (list): list of dicts, one per business
  """
  # get URL, headers, and URL parameters for the request
  url, headers, url_params = get_url_params('/search', api_key, location = location, offset = offset)

  # send request and get results as a dictionary
  response_json = api_get_request(url, headers, url_params)
  return response_json["total"], list(response_json["businesses"])


Test `yelp_search`:

In [9]:

# api_key was already read from yelp_api_key.txt in step 3
num_records, data = yelp_search(api_key, 'Chicago')
print(num_records)
#8500
print(len(data))
#20
print(list(map(lambda x: x['name'], data)))
#['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', 'Smoque BBQ', "Lou Malnati's Pizzeria", 'Alinea', "Kuma's Corner - Belmont", 'Little Goat Diner', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', "Portillo's Hot Dogs", 'Quartino Ristorante', "Pequod's Pizzeria", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Millennium Park']

8500
20
['Girl & The Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', "Lou Malnati's Pizzeria", 'Art Institute of Chicago', 'Smoque BBQ', 'Cafe Ba-Ba-Reeba!', "Bavette's Bar & Boeuf", 'Little Goat', 'Alinea', "Kuma's Corner - Belmont", 'Quartino Ristorante', "Pequod's Pizzeria", "Portillo's Hot Dogs", "Joe's Seafood, Prime Steak & Stone Crab", 'Crisp', 'Xoco', 'Sapori Trattoria', "Molly's Cupcakes"]


### 7. Build the paginated search requests
`paginated_restaurant_search_requests`: return a list of tuples for a paginated search of all restaurants


In [10]:
def paginated_restaurant_search_requests(api_key, location, total):
  """
  Return a list of tuples (url, headers, url_params) for a paginated search of all restaurants.
  The search is split into pages of 20 results, with one request per page.

  Args:
    api_key (string): Yelp API key for authentication
    location (string): Business Location
    total (int): Total number of items to be fetched

  Returns:
    results (list): list of tuples (url, headers, url_params)
  """
  pages = [] # one (url, headers, url_params) tuple per page

  current_offset = 0
  while current_offset < total:
    # location must be passed as a keyword argument so that it ends up in url_params
    url, headers, url_params = get_url_params('/search', api_key, location = location, offset = current_offset, limit = 20, categories = "restaurants")
    pages.append((url, headers, url_params))
    current_offset = current_offset + 20

  # no request is sent here, so no pause is needed; all_restaurants pauses between requests
  return pages
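
The offset arithmetic behind the loop can be sketched on its own. `page_offsets` is a hypothetical helper written for this example. The optional `max_results` cap is an assumption: an API may limit how deep a search can be paginated, and the value to use would come from its documentation.

```python
def page_offsets(total, page_size=20, max_results=None):
  """Return the offsets needed to cover total items in pages of page_size."""
  if max_results is not None:
    # hypothetical cap on how many results can be paged through
    total = min(total, max_results)
  return list(range(0, total, page_size))

offsets = page_offsets(104)
print(offsets)
# [0, 20, 40, 60, 80, 100]
```

For 104 restaurants this gives six pages, the last of which holds only four results.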


### 8. Retrieve all restaurants for a location
`all_restaurants`: retrieve all the restaurants on Yelp for a location, one page at a time

In [11]:
def all_restaurants(api_key, loc):
  """
  Retrieve ALL the restaurants on Yelp for a given location by sending one request per page.

  Args:
    api_key (string): Yelp API key for authentication
    loc (string): Business Location

  Returns:
    results (list): list of dicts representing each restaurant in the passed location
  """
  # request the first page of restaurants to learn the total number of matches
  url, headers, url_params = get_url_params('/search', api_key, location = loc, offset = 0, categories = "restaurants")

  # get results from request
  content_json = api_get_request(url, headers, url_params)
  total_items = content_json["total"]
  # get parameters needed to retrieve all restaurants in the given location
  restaurant_pages = paginated_restaurant_search_requests(api_key, loc, total_items)

  # use the returned list of (url, headers, url_params) and api_get_request to retrieve all restaurants
  result = []
  for url, headers, url_params in restaurant_pages:
    # send request and collect the businesses on this page
    content_json = api_get_request(url, headers, url_params)
    result.extend(content_json['businesses'])

    # pause slightly after each request to avoid sending too many requests at once
    time.sleep(.3)

  return result

In [12]:
data = all_restaurants(api_key, 'Greektown, Chicago, IL')
print(len(data))
# 102
print(list(map(lambda x:x['name'], data)))
# ['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'WJ Noodles', 'Athena Greek Restaurant', ...]

104
['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'Athena Greek Restaurant', 'Zeus Restaurant', 'Green Street Smoked Meats', 'Santorini', 'Mr Greek Gyros', "Philly's Best", 'Primos Chicago Pizza Pasta', 'Monteverde', 'J.P. Graziano Grocery', '9 Muses', 'Sepia', 'Sizzling Pot King', 'High Five Ramen', 'Dawali Jerusalem Kitchen', 'Green Street Local', 'Spectrum Bar and Grill', 'The Allis', "Lou Mitchell's", "Nando's Peri-Peri", 'Jubilee Juice & Grill', "Formento's", 'Taco Burrito King', 'Dine', 'Chicken & Farm Shop', 'H Mart - Chicago', 'The Madison Bar and Kitchen', 'Parlor Pizza Bar', 'Loop Juice', 'Blaze Pizza', 'M2 Cafe', 'Booze Box', 'El Che Steakhouse & Bar', 'Omakase Yume', 'Blackwood BBQ', 'Bandit', 'Morgan Street Cafe', "Giordano's", "Nonna's Pizza & Sandwiches", "Vero's Caffe & Gelato", 'Ciao! Cafe & Wine Lounge', 'Umami Burger - West Loop', 'Slightly Toasted', 'Sushi Pink', 'Rye Deli & Drink', 'Epic Burger', 'My Thai', 'Naansense', "Hannah's Bretzel", "Nan

### 9. Extract restaurant URLs from an API response
`parse_api_response`: parse Yelp API results to extract restaurant URLs

In [None]:
def parse_api_response(data):
  """
  Parse Yelp API results to extract restaurant URLs.

  Args:
    data (string): string of properly formatted JSON.

  Returns:
    (list): list of URLs as strings from the input JSON.
  """
  # decode the JSON string and keep only the list of businesses
  businesses = json.loads(data)['businesses']

  # collect the Yelp page URL of each business
  return [business['url'] for business in businesses]
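
The URLs returned by the API carry tracking parameters (`adjust_creative`, `utm_campaign`, and so on), as the test output below shows. If only the page address is wanted, a small helper can drop the query string. `strip_query` is a hypothetical name for this sketch, and the example URL is shortened from that output.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
  """Return url without its query string and fragment."""
  parts = urlsplit(url)
  return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

example = 'https://www.yelp.com/biz/au-cheval-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3'
print(strip_query(example))
# https://www.yelp.com/biz/au-cheval-chicago
```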

Test `parse_api_response`:

In [14]:
url, headers, url_params = get_url_params('/search', api_key, location = "Chicago", offset = 0)
response = requests.get(url, params = url_params, headers = headers)
response_json = response.json()

response_str = json.dumps(response_json)
parsed = parse_api_response(response_str)
print(parsed[1:5])
# ['https://www.yelp.com/biz/girl-and-the-goat-chicago?adjust_creative=ioqEYAcUhZO272qCIvxcVA....',
#  'https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=ioqEYAcUhZO272qCIvxcVA...',..]

['https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/au-cheval-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/the-purple-pig-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/lou-malnatis-pizzeria-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg']


### 10. Parse a Yelp restaurant page using `BeautifulSoup`

`parse_page`: parse the reviews on a single page of a restaurant
* args:
  * html (string): string of HTML corresponding to a Yelp restaurant page
* returns:
  * reviews_list (list): list of dicts corresponding to the extracted review info; review structure:
    ```python
    {
      'author': str,
      'rating': float,
      'date': str, # 'yyyy-mm-dd'
      'description': str
    }
    ```
  * url_next (string): URL for the next page of reviews (or None if it is the last page)
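
The markup of a Yelp page is not shown here, so the sketch below covers only the last step: turning raw strings pulled out of a page into the review structure described above. The input formats (`'4.0 star rating'`, `'3/29/2019'`) and the helper name `normalize_review` are assumptions for illustration; the real page may present these values differently.

```python
from datetime import datetime

def normalize_review(author, rating_text, date_text, description):
  """Build a review dict in the structure described above from raw strings."""
  return {
    'author': author.strip(),
    # take the leading number, e.g. '4.0 star rating' -> 4.0 (assumed format)
    'rating': float(rating_text.split()[0]),
    # convert 'm/d/yyyy' (assumed format) to 'yyyy-mm-dd'
    'date': datetime.strptime(date_text.strip(), '%m/%d/%Y').strftime('%Y-%m-%d'),
    'description': description.strip(),
  }

review = normalize_review(' Jane D. ', '4.0 star rating', '3/29/2019', 'Great gyros.\n')
print(review)
```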

### 11. Get the ID of a business

In [None]:
def get_business_id(api_key, name, address, city, state, country):
  """
  Get the ID of a business from the Yelp API (https://www.yelp.com/developers/documentation/v3/business_match)

  Args:
    api_key (string): Yelp API key for authentication
    name (string): name of the business to access
    address (string): address of the business to access (can be empty if the business has no address)
    city (string): city of the business to access
    state (string): ISO 3166-2 state code of the wanted business
    country (string): ISO 3166-1 alpha-2 country code of the wanted business

  Returns:
    (string): business ID,
    or None if there is a validation error or no business matches
  """
  url, headers, url_params = get_url_params('/matches', api_key, name = name, address1 = address, city = city, state = state, country = country)

  response_json = api_get_request(url, headers, url_params)

  if 'error' in response_json:
    print('error code:', response_json['error']['code'])
    print('description:', response_json['error']['description'])
    return None

  businesses = response_json['businesses']
  # no match found: avoid indexing into an empty list
  if not businesses:
    return None
  return businesses[0]['id']

### 12. Extract Yelp reviews for a single restaurant

In [None]:
def extract_reviews(api_key, name, address, city, state, country):
  """
  Get up to three reviews of a business from the Yelp API.

  Args:
    api_key (string): Yelp API key for authentication
    name (string): name of the business to access
    address (string): address of the business to access (can be empty if the business has no address)
    city (string): city of the business to access
    state (string): ISO 3166-2 state code of the wanted business
    country (string): ISO 3166-1 alpha-2 country code of the wanted business

  Returns:
    reviews (list): list of dicts, one per review
    total (int): total number of reviews

    or None if the business is not found
  """
  business_id = get_business_id(api_key, name, address, city, state, country)

  if business_id is None:
    print('business not found')
    return None

  # add business_id to endpoint
  url_addition = '/' + business_id + '/reviews'
  url, headers, url_params = get_url_params(url_addition, api_key)

  response_json = api_get_request(url, headers, url_params)

  reviews = response_json['reviews']
  total = response_json['total']
  return reviews, total

Test `extract_reviews`:

In [None]:
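result = extract_reviews(api_key, 'jibarito shop', '1646 W 18th St', 'Chicago', 'IL', 'US')
# extract_reviews returns None when the business is not found, so check before unpacking
if result is not None:
  reviews, total = result
  print(len(reviews))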