# Yelp Search
(Fetching and displaying Yelp data)

#### Web Scraping
* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
  * Authentication (OAuth)
  * Pagination
  * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)
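
As a rough illustration of the JSON vs. HTML point, the sketch below parses one of each using only the standard library. The sample payloads and the `TitleParser` class are made up for this example; the rest of this notebook uses Beautiful Soup for HTML.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample payloads, not real Yelp responses.
json_text = '{"total": 2, "businesses": [{"name": "A"}, {"name": "B"}]}'
html_text = "<html><head><title>Sample page</title></head><body><p>Hi</p></body></html>"

# JSON maps directly onto Python dicts and lists.
payload = json.loads(json_text)
names = [b["name"] for b in payload["businesses"]]


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# HTML has to be walked tag by tag (or queried with selectors).
parser = TitleParser()
parser.feed(html_text)

print(names)
print(parser.title)
```

JSON needs one call to become usable data, while HTML needs some traversal logic, which is why a selector-based library is convenient for the scraping steps.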

#### Data Collection & Data Processing
* Slicing data frames
* Filtering data
* Grouped counts
* Joining two tables
* NA/Null values
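
The operations listed above can be sketched without any data frame library. The example below is a plain-Python stand-in (lists of dicts instead of data frames), and every record in it is invented for illustration.

```python
from collections import Counter

# Hypothetical records shaped loosely like Yelp business dicts.
businesses = [
    {"id": "a", "name": "Cafe One", "city": "Chicago", "rating": 4.5},
    {"id": "b", "name": "Grill Two", "city": "Chicago", "rating": None},
    {"id": "c", "name": "Deli Three", "city": "Evanston", "rating": 3.0},
]
review_counts = {"a": 120, "c": 15}  # second "table", keyed by business id

# Filtering: keep rows with a non-null rating of at least 4.
well_rated = [b for b in businesses if b["rating"] is not None and b["rating"] >= 4]

# Grouped counts: number of businesses per city.
per_city = Counter(b["city"] for b in businesses)

# Left join on id: businesses without a match get None for "reviews".
joined = [{**b, "reviews": review_counts.get(b["id"])} for b in businesses]

print([b["name"] for b in well_rated])
print(dict(per_city))
print(joined[1]["reviews"])
```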

#### Library Documentation
To complete this part, consult the online documentation for the Python packages used:
* Standard Library:
  * [io](https://docs.python.org/3/library/io.html)
  * [time](https://docs.python.org/3/library/time.html)
  * [json](https://docs.python.org/3/library/json.html)
* Third Party
  * [requests](http://docs.python-requests.org/en/master/)
  * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
  * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

### 1. Import needed libraries

In [1]:
import io, time, json, requests
from bs4 import BeautifulSoup
from io import StringIO

### 2. Download and return the raw HTML content of a URL
`retrieve_html`: download and return the raw HTML content of the URL passed as an argument
* args:
  * url (string): URL of the page to download
* returns:
  * status_code (integer): HTTP status code of the response
  * raw_html (string): the raw HTML content of the response, decoded according to the HTTP headers

In [2]:
def retrieve_html(url):
  response = requests.get(url)
  status_code = response.status_code
  raw_html = response.text 
  return status_code, raw_html
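
Because `retrieve_html` returns the status code alongside the content, a caller can decide what to do when a request fails. The sketch below is one possible wrapper, not part of the assignment: `fetch_with_retry` and `fake_fetcher` are hypothetical names, and the fake fetcher stands in for `retrieve_html` so the example runs without network access.

```python
import time


def fetch_with_retry(url, fetcher, retries=3, delay=0.0):
    """Call fetcher(url) until it returns status 200 or retries run out.

    fetcher must return (status_code, raw_html), like retrieve_html.
    """
    status_code, raw_html = fetcher(url)
    attempts = 1
    while status_code != 200 and attempts < retries:
        time.sleep(delay)  # pause before trying again
        status_code, raw_html = fetcher(url)
        attempts += 1
    return status_code, raw_html


# Hypothetical stand-in for retrieve_html: fails once, then succeeds.
calls = []

def fake_fetcher(url):
    calls.append(url)
    if len(calls) == 1:
        return 503, ""
    return 200, "<html>ok</html>"


status, html = fetch_with_retry("https://example.com", fake_fetcher)
print(status, len(calls))
```

Passing the fetcher in as an argument mirrors the `html_fetcher` parameter used later in `extract_reviews`, and makes the logic easy to test offline.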

Test `retrieve_html`:

In [3]:
youtube_article = retrieve_html('https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html')
print(youtube_article[0])
print(youtube_article[1][:500])
# (200, '<!DOCTYPE html>\n<html lang="en" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">YouTube’s ...)

200
<!DOCTYPE html>
<html lang="en" class="story"  xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <title data-rh="true">YouTube’s Product Chief on Online Radicalization and Algorithmic Rabbit Holes - The New York Times</title>
    <meta data-rh="true" property="article:published_time" content="2019-03-29T15:40:56.000Z"/><meta data-rh="true" property="article:modified_time" content="2019-04-01T02:48:25.343Z"/><meta data-rh="true" http-equiv="Content-Language" content="en"/><meta data-r


### 3. Store Yelp API key locally and read in API key to authenticate

In [None]:
with open('yelp_api_key.txt', 'r') as f:
  api_key = f.read().replace('\n','')

`read_api_key`: read the Yelp API key from a file
* args:
  * filepath (string): file containing API key
* returns:
  * api_key (string): api key

In [5]:
def read_api_key(filepath):    
  with open(filepath, 'r') as f:
    return f.read().replace('\n','')
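
A slightly more defensive variant is sketched below, assuming it is preferable to fail early when the key file is empty. `read_api_key_checked` is a hypothetical name, and the key written to the temporary file is a placeholder, not a real credential.

```python
import os
import tempfile


def read_api_key_checked(filepath):
    """Read an API key from a file and fail loudly if the file is empty."""
    with open(filepath, "r") as f:
        api_key = f.read().strip()  # drops the trailing newline and stray spaces
    if not api_key:
        raise ValueError(f"No API key found in {filepath}")
    return api_key


# Demonstration with a throwaway file and a placeholder key.
with tempfile.TemporaryDirectory() as tmp_dir:
    key_path = os.path.join(tmp_dir, "yelp_api_key.txt")
    with open(key_path, "w") as f:
        f.write("PLACEHOLDER_KEY\n")
    demo_key = read_api_key_checked(key_path)

print(demo_key)
```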

### 4. Make an authenticated request to the search endpoint
`api_get_request`: send an HTTP GET request and return a json response
* args:
  * url (string): API endpoint url
  * headers (dict): A python dictionary containing HTTP headers including Authentication to be sent
  * url_params (dict): The parameters (required and optional) supported by endpoint
* returns:
  * results (json): response as json

In [6]:
def api_get_request(url, headers, url_params):
  # requests encodes url_params into the query string of the GET request
  response = requests.get(url, params=url_params, headers=headers)
  # Decode the JSON body of the response into Python objects
  return response.json()

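### 5. Get the search request parameters
`location_search_params`: construct the url, headers, and url parameters for a location search
* args:
  * api_key (string): Yelp API key for authentication
  * location (string): business location
  * kwargs (dict): additional search parameters (e.g. offset, limit, categories)
* returns:
  * url (string): url of the search endpoint
  * headers (dict): HTTP headers, including the Authorization header
  * url_params (dict): query parameters for the request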

In [7]:
def location_search_params(api_key, location, **kwargs):

  # URL endpoint for business search
  url = 'https://api.yelp.com/v3/businesses/search'

  # Authentication is performed with a Bearer token in the Authorization header
  headers = {'Authorization': f"Bearer {api_key}"}

  # Spaces in a URL are problematic, so replace them in the location
  url_params = {'location': location.replace(" ", "-")}

  # Include keyword arguments (offset, limit, categories, ...) in url_params
  url_params.update(kwargs)

  return url, headers, url_params
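
On the question of spaces: `requests` builds the query string from `params` and percent-encodes the values, so it can help to see what an encoded query looks like. The sketch below uses the standard library's `urlencode` to show that kind of encoding on a sample location; it makes no request, and the parameter values are only examples.

```python
from urllib.parse import urlencode

url = "https://api.yelp.com/v3/businesses/search"
url_params = {"location": "Greektown, Chicago, IL", "offset": 20, "limit": 20}

# urlencode percent-encodes reserved characters and turns spaces into "+".
query_string = urlencode(url_params)
full_url = f"{url}?{query_string}"

print(query_string)
# location=Greektown%2C+Chicago%2C+IL&offset=20&limit=20
```

This suggests that replacing spaces by hand may not be strictly necessary when the parameters are passed through `params`, although the two approaches send different location strings to the API.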

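### 6. Search for businesses at a location
`yelp_search`: make an authenticated request to the Yelp API
* args:
  * api_key (string): Yelp API key for authentication
  * location (string): business location
  * offset (int): offset of the first result, used for pagination
* returns:
  * total (integer): total number of businesses on Yelp corresponding to the location
  * businesses (list): list of dicts representing each business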

In [8]:
def yelp_search(api_key, location, offset=0):
  """
  Make an authenticated request to the Yelp API.

  Args:
    api_key (string): Your Yelp API Key for Authentication
    location (string): Business Location
    offset (int): param for pagination

  Returns:
    total (integer): total number of businesses on Yelp corresponding to the location
    businesses (list): list of dicts representing each business
  """
  url, headers, url_params = location_search_params(api_key, location, offset=offset)

  response_json = api_get_request(url, headers, url_params)
  return response_json["total"], list(response_json["businesses"])

Test `yelp_search`:

In [9]:

api_key = read_api_key('yelp_api_key.txt')
num_records, data = yelp_search(api_key, 'Chicago')
print(num_records)
#8500
print(len(data))
#20
print(list(map(lambda x: x['name'], data)))
#['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', 'Smoque BBQ', "Lou Malnati's Pizzeria", 'Alinea', "Kuma's Corner - Belmont", 'Little Goat Diner', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', "Portillo's Hot Dogs", 'Quartino Ristorante', "Pequod's Pizzeria", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Millennium Park']

8500
20
['Girl & The Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', "Lou Malnati's Pizzeria", 'Art Institute of Chicago', 'Smoque BBQ', 'Cafe Ba-Ba-Reeba!', "Bavette's Bar & Boeuf", 'Little Goat', 'Alinea', "Kuma's Corner - Belmont", 'Quartino Ristorante', "Pequod's Pizzeria", "Portillo's Hot Dogs", "Joe's Seafood, Prime Steak & Stone Crab", 'Crisp', 'Xoco', 'Sapori Trattoria', "Molly's Cupcakes"]


### 7. Build the paginated search requests
`paginated_restaurant_search_requests`: returns a list of tuples for paginated search of all restaurants
* args:
  * api_key (string): yelp api key for Authentication
  * location (string): business location
  * total (int): total number of items to be fetched
* returns:
  * results (list): list of tuple (url, headers, url_params)

In [10]:
def paginated_restaurant_search_requests(api_key, location, total):
  """
  Returns a list of tuples (url, headers, url_params) for paginated search of all restaurants
  Args:
    api_key (string): Your Yelp API Key for Authentication
    location (string): Business Location
    total (int): Total number of items to be fetched
  Returns:
    results (list): list of tuple (url, headers, url_params)
  """
  # HINT: Use total, offset and limit for pagination
  # You can reuse function location_search_params(...)

  pages = []  # one (url, headers, url_params) tuple per page

  # No request is sent here, so no pause is needed between iterations
  limit = 20
  for offset in range(0, total, limit):
    url, headers, url_params = location_search_params(api_key, location, offset=offset, limit=limit, categories="restaurants")
    pages.append((url, headers, url_params))

  return pages
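
The offset arithmetic behind pagination can be checked in isolation. In the sketch below, `page_offsets` is a hypothetical helper, and `max_results` is an assumed optional cap for APIs that only expose the first N results of a search (consult the Yelp documentation for the actual limit, if any).

```python
def page_offsets(total, limit=20, max_results=None):
    """Return the offset of every page needed to cover `total` items."""
    if max_results is not None:
        total = min(total, max_results)
    return list(range(0, total, limit))


print(page_offsets(45))                      # [0, 20, 40]
print(page_offsets(104, limit=50))           # [0, 50, 100]
print(page_offsets(8500, max_results=100))   # [0, 20, 40, 60, 80]
```

The last page is usually partial: with 45 items and a limit of 20, the request at offset 40 returns only 5 items.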

### 8. Retrieve all restaurants for a location
`all_restaurants`: retrieve all the restaurants on Yelp for a location, one page of results at a time
* args:
  * api_key (string): yelp api key for authentication
  * location (string): business location
* returns:
  * results (list): list of dicts representing each restaurant

In [11]:
def all_restaurants(api_key, location):

  # Request the first page of restaurants to learn the total number of results
  url, headers, url_params = location_search_params(api_key, location, offset=0, limit=20, categories="restaurants")
  response_json = api_get_request(url, headers, url_params)
  total_items = response_json["total"]

  restaurant_pages = paginated_restaurant_search_requests(api_key, location, total_items)

  # Fetch every page with api_get_request, pausing slightly after each request
  result = []
  for url, headers, url_params in restaurant_pages:
    page_json = api_get_request(url, headers, url_params)
    # A page with no 'businesses' key contributes nothing
    result.extend(page_json.get('businesses', []))
    time.sleep(.3)

  return result

In [12]:
data = all_restaurants(api_key, 'Greektown, Chicago, IL')
print(len(data))
# 102
print(list(map(lambda x:x['name'], data)))
# ['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'WJ Noodles', 'Athena Greek Restaurant', ...]

104
['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'Athena Greek Restaurant', 'Zeus Restaurant', 'Green Street Smoked Meats', 'Santorini', 'Mr Greek Gyros', "Philly's Best", 'Primos Chicago Pizza Pasta', 'Monteverde', 'J.P. Graziano Grocery', '9 Muses', 'Sepia', 'Sizzling Pot King', 'High Five Ramen', 'Dawali Jerusalem Kitchen', 'Green Street Local', 'Spectrum Bar and Grill', 'The Allis', "Lou Mitchell's", "Nando's Peri-Peri", 'Jubilee Juice & Grill', "Formento's", 'Taco Burrito King', 'Dine', 'Chicken & Farm Shop', 'H Mart - Chicago', 'The Madison Bar and Kitchen', 'Parlor Pizza Bar', 'Loop Juice', 'Blaze Pizza', 'M2 Cafe', 'Booze Box', 'El Che Steakhouse & Bar', 'Omakase Yume', 'Blackwood BBQ', 'Bandit', 'Morgan Street Cafe', "Giordano's", "Nonna's Pizza & Sandwiches", "Vero's Caffe & Gelato", 'Ciao! Cafe & Wine Lounge', 'Umami Burger - West Loop', 'Slightly Toasted', 'Sushi Pink', 'Rye Deli & Drink', 'Epic Burger', 'My Thai', 'Naansense', "Hannah's Bretzel", "Nan
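
The collection loop can also be exercised without calling the API. The sketch below assumes a page-fetching function that returns a dict with `total` and `businesses` keys, as the search responses above do; `collect_all`, `fake_fetch_page`, and `FAKE_DATA` are hypothetical names introduced for this example.

```python
def collect_all(fetch_page, limit=20):
    """Fetch pages until `total` items are covered or a page comes back empty.

    fetch_page(offset, limit) must return a dict with "total" and "businesses".
    """
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items = page.get("businesses", [])
        results.extend(items)
        offset += limit
        # Stop on an empty page so a stale "total" cannot cause an endless loop.
        if not items or offset >= page.get("total", 0):
            break
    return results


# Hypothetical in-memory "API" holding 45 businesses.
FAKE_DATA = [{"name": f"Restaurant {i}"} for i in range(45)]

def fake_fetch_page(offset, limit):
    return {"total": len(FAKE_DATA), "businesses": FAKE_DATA[offset:offset + limit]}


restaurants = collect_all(fake_fetch_page)
print(len(restaurants))
```

Against a live API, a short `time.sleep` between calls to `fetch_page` would still be advisable to respect rate limits.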

### 9. Extract restaurant URLs from the API response
`parse_api_response`: parse yelp api results to extract restaurant URLs
* args:
  * data (string): string of properly formatted JSON
* returns:
  * urls (list): list of URLs as strings from the input JSON

In [13]:
def parse_api_response(data):

  # Decode the JSON string and collect the URL of each business
  businesses = json.loads(data)['businesses']
  return [business['url'] for business in businesses]

Test `parse_api_response`:

In [14]:
url, headers, url_params = location_search_params(api_key, "Chicago", offset=0)
response = requests.get(url, params = url_params, headers = headers)
response_json = response.json()

response_str = json.dumps(response_json)
parsed = parse_api_response(response_str)
print(parsed[1:5])
# ['https://www.yelp.com/biz/girl-and-the-goat-chicago?adjust_creative=ioqEYAcUhZO272qCIvxcVA....',
#  'https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=ioqEYAcUhZO272qCIvxcVA...',..]

['https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/au-cheval-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/the-purple-pig-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg', 'https://www.yelp.com/biz/lou-malnatis-pizzeria-chicago?adjust_creative=680Jc2E_i5DwGD1xYXozpg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=680Jc2E_i5DwGD1xYXozpg']
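
If a response might lack the `businesses` key or a business might lack a `url`, a more tolerant parser could look like the sketch below. `parse_urls_tolerant` is a hypothetical variant, and the sample JSON is invented for illustration.

```python
import json

# Hypothetical response: the second business has no "url" field.
sample = json.dumps({
    "total": 2,
    "businesses": [
        {"name": "A", "url": "https://www.yelp.com/biz/a"},
        {"name": "B"},
    ],
})


def parse_urls_tolerant(data):
    """Return the URLs in a JSON search response, skipping entries without one."""
    businesses = json.loads(data).get("businesses", [])
    return [b["url"] for b in businesses if "url" in b]


urls = parse_urls_tolerant(sample)
print(urls)
```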


### 10. Parse a Yelp restaurant Page using `BeautifulSoup`

`parse_page`: parse the reviews on a single page of a restaurant
* args:
  * html (string): string of HTML corresponding to a yelp restaurant
* returns:
  * reviews_list (list): list of dicts corresponding to the extracted review info; review structure:
    ```python
    {
      'author': str,
      'rating': float,
      'date': str,  # 'yyyy-mm-dd'
      'description': str
    }
    ```
  * url_next (string): URL for the next page of reviews (or None if it is the last page)

In [15]:
def parse_page(html):

  soup = BeautifulSoup(html, 'html.parser')

  # The <link rel="next"> element, when present, points to the next page of reviews
  link_next = soup.find('link', rel='next')
  url_next = link_next.get('href') if link_next else None

  reviews_list = []
  for r in soup.find_all('div', itemprop="review"):
    # Author, rating, and date are stored in the content attribute of itemprop tags
    author = r.find(itemprop='author').attrs['content']
    rating = float(r.find(itemprop='ratingValue').attrs['content'])
    date = r.find(itemprop='datePublished').attrs['content']
    # The review text is the first <p> inside the review
    description = r.find('p').text
    reviews_list.append({'author': author, 'rating': rating, 'date': date, 'description': description})

  return reviews_list, url_next
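
To make the expected page structure concrete, the sketch below extracts the same fields from a small hand-written HTML sample. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup so it runs without extra installs. The markup in `SAMPLE_HTML` is an assumption inferred from the lookups in `parse_page`, not a copy of a real Yelp page, and `ReviewParser` is a hypothetical class.

```python
from html.parser import HTMLParser

# Maps itemprop names in the markup to keys of the review dict.
FIELD_NAMES = {"author": "author", "ratingValue": "rating", "datePublished": "date"}


class ReviewParser(HTMLParser):
    """Collect reviews and the next-page link from review-style markup."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self.url_next = None
        self._current = None          # review dict being filled in
        self._in_description = False  # True while inside a review's <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        itemprop = attrs.get("itemprop")
        if tag == "link" and attrs.get("rel") == "next":
            self.url_next = attrs.get("href")
        elif tag == "div" and itemprop == "review":
            self._current = {}
            self.reviews.append(self._current)
        elif self._current is not None and itemprop in FIELD_NAMES:
            value = attrs.get("content")
            if itemprop == "ratingValue":
                value = float(value)
            self._current[FIELD_NAMES[itemprop]] = value
        elif self._current is not None and tag == "p":
            self._current["description"] = ""
            self._in_description = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_description = False

    def handle_data(self, data):
        if self._in_description:
            self._current["description"] += data


SAMPLE_HTML = """
<html><head><link rel="next" href="https://example.com/biz/demo?start=20"></head>
<body>
<div itemprop="review">
  <meta itemprop="author" content="Sam P.">
  <meta itemprop="ratingValue" content="4.0">
  <meta itemprop="datePublished" content="2019-01-15">
  <p itemprop="description">Great food.</p>
</div>
</body></html>
"""

review_parser = ReviewParser()
review_parser.feed(SAMPLE_HTML)
print(review_parser.reviews)
print(review_parser.url_next)
```

If a live page uses different tags or attributes, both this sketch and `parse_page` would return an empty list, so inspecting the downloaded HTML is a useful first debugging step.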

### 11. Extract all Yelp reviews for a Single Restaurant
`extract_reviews`: retrieve all of the reviews for a single restaurant on yelp
* args:
  * url (string): yelp URL corresponding to the restaurant of interest
  * html_fetcher (function): function that takes url and returns html status code and content
* returns:
  * reviews (list): list of dictionaries containing extracted review info

In [16]:
def extract_reviews(url, html_fetcher):

  code, html = html_fetcher(url)
  print(html[0:100])  # show the start of the first page as a sanity check
  reviews, next_url = parse_page(html)

  # Follow the next-page links until the last page is reached
  while next_url is not None:
    code, html = html_fetcher(next_url)
    page_reviews, next_url = parse_page(html)
    reviews.extend(page_reviews)

  return reviews

Test `extract_reviews`:

In [17]:
data = extract_reviews('https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=220', html_fetcher=retrieve_html)
print(len(data))
# 40
print(data[0])
# {'author': 'Betsy F.', 'rating': '5.0', 'date': '2016-10-01', 'description': "Authentic, incredible ... " }

<!DOCTYPE HTML>

<!--[if lt IE 7 ]> <html xmlns:fb="http://www.facebook.com/2008/fbml" class="ie6 ie
0


IndexError: list index out of range
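
The run above returned zero reviews, and indexing the empty list raised the `IndexError`. The output alone does not say why no reviews were found; the page markup may no longer match the lookups in `parse_page`, or the request may not have been served the expected page. The sketch below shows one way to make the loop more defensive. `extract_reviews_guarded`, its `page_parser` and `max_pages` parameters, and the fake two-page site are all hypothetical and exist only so the example runs offline.

```python
def extract_reviews_guarded(url, html_fetcher, page_parser, max_pages=50):
    """Follow next-page links, stopping on a failed request or after max_pages."""
    reviews = []
    seen = set()
    next_url = url
    while next_url is not None and next_url not in seen and len(seen) < max_pages:
        seen.add(next_url)
        status_code, html = html_fetcher(next_url)
        if status_code != 200:
            break  # blocked or missing page: return what was collected so far
        page_reviews, next_url = page_parser(html)
        reviews.extend(page_reviews)
    return reviews


# Hypothetical two-page site used in place of real Yelp pages.
PAGES = {
    "page1": ([{"author": "A"}], "page2"),
    "page2": ([{"author": "B"}], None),
}

def fake_fetcher(url):
    return (200, url) if url in PAGES else (404, "")

def fake_parser(html):
    return PAGES[html]


data = extract_reviews_guarded("page1", fake_fetcher, fake_parser)
print(len(data))
if data:  # guard against an empty result before indexing
    print(data[0])
```

With the functions from this notebook, the call would take `retrieve_html` as `html_fetcher` and `parse_page` as `page_parser`.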