### Unit 1 Homework:  Scraping the Yelp Website

Welcome!  For this homework assignment you'll be tasked with building a web scraper in a manner that builds on what was covered in our web scraping class.

The assignment will extend the lab work done during that time, where we built a dataset that listed the name, number of reviews and price range for each restaurant on the following web page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1

**What You'll Turn In:**

A finished Jupyter notebook that walks us through the steps you took to get your results.  Provide notes where appropriate to explain what you are doing.

The notebook should produce a finished dataset at the end.  

If for some reason you're experiencing problems with the final result, please let someone know when turning it in.
 
Your homework will be divided into five tiers, each more difficult than the last:

##### Tier 1: Five Columns From the First Page

At the most basic level for this assignment, you will need to extend what we did in class and create a dataset with five columns and 30 rows.  This means you will not need to go beyond the first page in order to complete this section.

##### Tier 2:  100 Row Dataset With At Least 3 Columns

For this portion of the assignment, take 3 of your columns from Tier 1, and extend them out to multiple pages on the Yelp website.  You should appropriately account for the presence of missing values.

##### Tier 3:  100 Row Dataset With At Least 5 Columns

Very similar to Tier 2, but with this many columns you will inevitably run into some that frequently have missing values, whereas in Tier 2 you could likely skip these if you wanted to.  

##### Tier 4:  100 Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Restaurants often have different categories associated with them, so grabbing them individually as separate values is often challenging.  To complete this tier, you'll have to find a way to 'pick out' each of the individual categories as their own separate column value.  

##### Tier 5:  Unlimited Row Dataset With At Least 5 Columns + Individual Restaurant Categories

Take what you did in Tier 4, and extend it so that the code will work with an arbitrary number of pages.  I.e., regardless of how many pages there are listing the best restaurants in London, your scraper will find them and parse their information into clean datasets.

### Hints

Here are a few tips that will save you time when completing this assignment:

 - The name, average rating, total ratings and neighborhood of a restaurant tend to be the 'easy' ones, because they rarely have missing values, so whatever logic you use on the first page will typically apply to all pages.  They are a good place to start.
 - Phone numbers, price ranges and reviews are more commonly missing, so if you are trying to get a larger number of items from them across multiple pages you should expect to do some error handling.
 - You can specify any sort of selector when using the `find_all()` method, not just `class`.  For example, imagine you have the following `<div>` tag:
    `<div class='main-container red-blue-green' role='front-unit' aria-select='left-below'>Some content here</div>`
    
   This means that when you use `scraper.find_all('div')`, you can pass in arguments like `scraper.find_all('div', {'role': 'front-unit'})` or anything else that allows you to isolate that particular tag.
 - When specifying selectors like `{'class': 'dkght__384Ko'}`, sometimes less is more.  If you pass a list of several class names (for example `{'class': ['first-class', 'second-class']}`), you are saying return a tag with **any one of these** classes, not all of them.  So if your results are large, try different combinations of selectors to get the smallest results possible.
 - If you begin dealing with values that are unreliably entered, you should use the 'outside in' technique where you grab a parent container that holds the element and find a way to check to see if a particular value is there by scraping it further.  The best way to do this is to try and find a unique container for every single restaurant.  This means that you will have a reliable parent element for every single restaurant, and within *each of these* you can search for `<p>`, `<a>`, `<div>`, and `<span>` tags and apply further logic.
 - When you get results from `BeautifulSoup`, you will be given data that's denoted as either `bs4.element.Tag` or `bs4.element.ResultSet`.  They are **not the same**.  Critically, you can search a `bs4.element.Tag` for further items, but you cannot do this with a `bs4.element.ResultSet`.  
 
   For example, let's say you grab all of the divs from a page with `scraper.find_all('div')` and save it as the variable `total_divs`.  This means `total_divs` will look something like this:  
   
   `[<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>,`
      `<div><p>Div content</p><p>Second paragraph</p></div>]`
      
   In this case the variable `total_divs` is a result set and there's nothing else you can do to it directly.  However, every item within `total_divs` is a tag, which means you can scrape it further.  
   
   So if you wanted you could write a line like:  `total_paragraphs = [div.find_all('p') for div in total_divs]`, and get the collection of paragraphs within each div.  
   
   If you confuse the two you'll get the following error message:  
   
   `AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?`
 - The values of the different selectors change periodically on Yelp, so if your scraper suddenly stops working that's probably why.  I.e., if you have a command like `scraper.find_all('div', {'class': '485dk0W__container09'})` that no longer returns results, the class `485dk0W__container09` may now be `r56kW__container14` or something similar.
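
Several of the hints above (selecting by any attribute, `Tag` versus `ResultSet`, and the 'outside in' technique for missing values) can be sketched on a small hand-written snippet. The markup, class names, and attribute values below are made up for illustration and are not Yelp's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page (not Yelp's real HTML).
html = """
<ul>
  <li class="listing" role="front-unit"><a class="name">Cafe One</a><span class="price">££</span></li>
  <li class="listing" role="front-unit"><a class="name">Cafe Two</a></li>
</ul>
"""
scraper = BeautifulSoup(html, "html.parser")

# Any attribute can be used as a selector, not just class.
listings = scraper.find_all("li", {"role": "front-unit"})

# `listings` is a ResultSet; each item inside it is a Tag that can be searched further.
rows = []
for listing in listings:
    name_tag = listing.find("a", {"class": "name"})
    price_tag = listing.find("span", {"class": "price"})  # None when the price is missing
    rows.append({
        "name": name_tag.text if name_tag else None,
        "price": price_tag.text if price_tag else None,
    })

print(rows)
```

Because every restaurant gets its own parent `<li>`, a missing price shows up as `None` in that restaurant's row instead of shifting the values of the rows after it.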

# Yelp Reviews Code

First, I start by importing various libraries to help with my web scraping functions. I also define the London Yelp url here so that it can be reused throughout as a variable. 

**Note**: In a future enhancement of this code, I would like to add a feature that lets users specify the city whose restaurants they want to search.

In [8]:
# your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import math
import numpy as np

yelp_url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1'




To find useful data to collect from the Yelp page, I first scanned the webpage and noticed that each restaurant had a name, category, and rating. Most restaurants also had reviews and a cost rating, too. To systematically select those values for each record, I inspected the HTML to see the classes of the HTML tags the data was contained in (e.g. div or a). 

The most difficult value to get was the star rating, because each rating tier (4 star, 4.5 star, etc.) had its own unique class. I used a regular expression module to do a partial match for the term "i-stars" which was contained in each class name. I then used the text in the aria-label to determine what the rating was and converted that value to a float.

To get rid of the pound symbol representing cost, I simply counted the length of the string and used an integer to represent it instead. 
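
As a minimal sketch of those two conversions in isolation, the helpers below work on plain strings. The names `parse_rating` and `parse_cost` and the sample class name are hypothetical, and the sketch assumes the `aria-label` text begins with the numeric rating (for example `4.5 star rating`). Unlike the function further down, it returns `None` for missing values.

```python
import re

# Partial match on the class name, as used for the star-rating <div>.
STAR_CLASS = re.compile("i-stars")

def parse_rating(aria_label):
    """Return the leading number of an aria-label as a float, or None if absent."""
    try:
        return float(aria_label.split()[0])
    except (AttributeError, IndexError, ValueError):
        return None

def parse_cost(price_text):
    """Return the number of currency symbols, or None if no price is listed."""
    return len(price_text.strip()) if price_text else None

# Hypothetical class name; only the "i-stars" fragment is assumed to be stable.
print(bool(STAR_CLASS.search("i-stars--regular-4-half__09f24__abc12")))
print(parse_rating("4.5 star rating"))
print(parse_cost("££"))
print(parse_rating(None), parse_cost(None))
```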

In [23]:
def find_yelp_reviews(restaurant_data):
    """Helper function that returns a list of dictionaries, one per restaurant, with the restaurant's
    name, cost, reviews, rating, and categories for a given Yelp page. """
    restaurants = []
    for restaurant in restaurant_data:
        d = dict()
        categories = []
        name = restaurant.find('a', class_='link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--inherit__09f24__2Uj95')
        try:
            name = name.text
        except AttributeError:
            name = False
        if name:
            #only do further data work if the restaurant has a valid name; 
            #otherwise skip to the next restaurant in the data set
            
            #checking for null values of category so the code doesn't blow up; skips if no valid category
            categories_data = restaurant.find_all('a', class_='link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--default__09f24__3xWLF')
            for data in categories_data:
                try:
                    categories.append(data.text)
                except AttributeError:
                    pass

            # trying to generate a cost rating
            cost = restaurant.find('span', class_="text__09f24__2tZKC priceRange__09f24__2O6le text-color--black-extra-light__09f24__38DtK text-align--left__09f24__3Drs0 text-bullet--after__09f24__1MWoX")
            try:
                cost = str(len(cost.text))
            except AttributeError:
                cost = ''
                
            # trying to generate number of reviews
            num_reviews = restaurant.find('span', class_="text__09f24__2tZKC reviewCount__09f24__EUXPN text-color--black-extra-light__09f24__38DtK text-align--left__09f24__3Drs0")
            try:
                num_reviews = int(num_reviews.text)
            except (AttributeError, ValueError):
                # no review tag found, or its text is not a plain integer
                num_reviews = 0
                
            # trying to generate a rating
            rating = restaurant.find('div', class_=re.compile("i-stars"))
            try:
                rating = float(rating['aria-label'].split()[0])
            except (AttributeError, TypeError, KeyError, IndexError, ValueError):
                # no rating tag found, or its aria-label is missing or malformed
                rating = 0
            #building out a dictionary with the restaurant's attributes
            d['Name'] = name
            d['Rating'] = rating
            d['Cost'] = cost
            d['Reviews'] = num_reviews
            
            # building new columns for each category:
            for i, category in enumerate(categories, start=1):
                d[f"Category {i}"] = category
            
            
            # adding completed dictionary for the current page
            restaurants.append(d)
    
    return restaurants
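
To illustrate the category step on its own: each restaurant contributes as many `Category N` keys as it has categories, so different rows can end up with different keys. The pure-Python sketch below uses a hypothetical helper name, `category_columns`, and two restaurants taken from the results shown later.

```python
def category_columns(categories):
    """Map a list of category names to numbered column names."""
    return {f"Category {i}": category for i, category in enumerate(categories, start=1)}

rows = [
    {"Name": "Pidgin", **category_columns(["Modern European", "British"])},
    {"Name": "Tanakatsu", **category_columns(["Japanese"])},
]

# Collect every column seen in any row, preserving first-seen order.
columns = list(dict.fromkeys(key for row in rows for key in row))

# Fill the gaps so every row has the same columns.
table = [{column: row.get(column) for column in columns} for row in rows]
print(columns)
print(table[1])
```

`pd.DataFrame(restaurants)` performs the same alignment when given a list of dictionaries, leaving NaN wherever a row lacks a key.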
    




Once I could reliably capture all restaurant data on a single page, the next step was to extend that functionality to find all restaurants across all search results pages. Each results page states how many total result pages there are; in the case of the London restaurants, there are 24. I also noticed that on subsequent results pages, there's a "&start=" parameter appended to the url, indicating which result to display first. Because there are 10 restaurants listed per results page, the start value for the final page can be found by subtracting one from the total number of pages and multiplying by 10. So in the case of the London data, the last page's url ends with "&start=230". I tested this by trying numbers larger than that, and the website returned an error.

To get data on all restaurants, then, I ran the above function on each page in a loop and combined all of the results into a final data frame. I paginated by increasing the "&start=" value in steps of 10, from 0 up to the start value of the last page. Note that these calculations are performed in the function itself, so the function can be used on any Yelp restaurant results page without the user having prior knowledge about the number of results.

I additionally tested that this function can be generally used on other cities by finding all results for Gaithersburg, Maryland, where I live.
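
The pagination arithmetic described above can be sketched without any network access. The helper name `page_urls` and its `per_page` parameter are hypothetical; the sketch assumes 10 results per page and the "&start=" parameter as described.

```python
def page_urls(base_url, total_pages, per_page=10):
    """Build one URL per results page using the &start= offset."""
    max_start = (total_pages - 1) * per_page
    return [f"{base_url}&start={start}" for start in range(0, max_start + 1, per_page)]

base = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1"
urls = page_urls(base, total_pages=24)

print(len(urls))                    # 24
print(urls[-1].rsplit("&", 1)[1])   # start=230
```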

In [24]:
def display_yelp_reviews(url):
    """Takes a Yelp url and returns a Data Frame with restaurant data for all restaurants in that location."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    #Find the total number of pages to determine how many times to loop to get data. Loop should end at (n-1)*10
    num_pages = soup.find('span', class_="text__09f24__2tZKC text-color--black-extra-light__09f24__38DtK text-align--left__09f24__3Drs0")
    max_start = ((int(num_pages.text.split()[2]))-1)*10
        
    restaurants = []
    
    for index in range(0,max_start+1,10):
        current_page = requests.get(url+'&start='+str(index))
        current_soup = BeautifulSoup(current_page.content, 'html.parser')
        restaurant_data = current_soup.find_all('li', class_='border-color--default__09f24__R1nRO')
        
        restaurants += find_yelp_reviews(restaurant_data)
    
    
    return pd.DataFrame(restaurants).replace({np.nan: None})
    

In [25]:
london_data = display_yelp_reviews(yelp_url)


In [26]:
pd.set_option('display.max_rows', None)
# sorting based on rating

london_data.sort_values(by=['Rating'], ascending=False)

| | Name | Rating | Cost | Reviews | Category 1 | Category 2 | Category 3 |
|---|---|---|---|---|---|---|---|
| 120 | Pidgin | 5.0 | 3.0 | 2 | Modern European | British | |
| 81 | Perilla | 5.0 | | 2 | Modern European | Cocktail Bars | British |
| 197 | Il Portico | 5.0 | 3.0 | 48 | Italian | | |
| 195 | Tanakatsu | 5.0 | | 4 | Japanese | | |
| 189 | Andy’s Greek Taverna | 5.0 | 1.0 | 24 | Greek | | |
| 114 | Roast Hog | 5.0 | 1.0 | 10 | Portuguese | Food Stands | |
| 56 | New London Cafe | 5.0 | 1.0 | 24 | British | Cafes | |
| 77 | Kennington Lane Cafe | 5.0 | 1.0 | 98 | Cafes | | |
| 116 | Madera at Treehouse London | 5.0 | | 1 | Mexican | | |
| 182 | Chicken Shop | 5.0 | 2.0 | 24 | Chicken Shop | | |


In [None]:
# gaithersburg_data = display_yelp_reviews('https://www.yelp.com/search?find_desc=Restaurants&find_loc=Gaithersburg%2C%20Maryland&ns=1')
# gaithersburg_data = gaithersburg_data.replace({np.nan: None})