# Scraping Data from Yelp on the Bay Area Tri-Cities

## Imports

In [22]:
# necessary imports
import sys
import re

# install requests into the kernel's environment before importing it
!{sys.executable} -m pip install requests
import requests



## Inspection

First, I begin by taking an initial look at the contents of the `html` file of my Yelp search:

In [23]:
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Newark%2C+CA&start=0"
html = requests.get(url)

print(html.text[:100])

<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border


From using Google Chrome developer tools, it appears that the following URL returns a JSON response:

https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Newark%2C+CA&start=0

From experimenting with the URL, it also appears that I can get the next page of search results by incrementing the `start` parameter in the URL by $10$; for the second page of results, I would change it to `start=10`.
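The paging scheme described above can be sketched with the standard library. This is only an illustration and not part of the original notebook: the helper name `build_snippet_url` and its `page_size` argument are hypothetical, and it assumes the `find_desc`, `find_loc`, and `start` parameters behave the way they appeared to in the browser.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/search/snippet"  # endpoint observed in developer tools


def build_snippet_url(description, location, page_number, page_size=10):
    """Build the snippet URL for a 1-based page number (hypothetical helper)."""
    params = {
        "find_desc": description,
        "find_loc": location,
        # page 1 starts at offset 0, page 2 at offset 10, and so on
        "start": (page_number - 1) * page_size,
    }
    # urlencode escapes the comma and turns the space into '+'
    return BASE_URL + "?" + urlencode(params)


print(build_snippet_url("Restaurants", "Newark, CA", 2))
```

Letting `urlencode` do the escaping avoids hand-writing `%2C` and `+` in the query string.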

In [24]:
# The API endpoint
url = "https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Newark,+CA&start=0"

# A GET request to the API
response = requests.get(url)

# Print the response
response_json = response.json()
print(str(response_json)[:50]) # converted json dict to String for smaller print statements

{'pageTitle': 'Top 10 Best Restaurants near Newark


From the looks of the printed result, the content matches that of the JSON response link. To understand the structure better, I pasted the contents into an [online JSON formatter tool](https://jsonformatter.org/). After searching for `bizId` (most likely the primary key of a restaurant), I determined that I need to index into the response with the keys `searchPageProps` and `mainContentComponentsListProps` to reach my desired level of granularity: the information on individual businesses. The following two pictures show how I came to that conclusion:

!['searchPageProps_img'](img/online_json_img_1.png)
!['mainContentComponentsListProps_img'](img/online_json_img_2.png)

And upon using those keys, I observe the following:

In [25]:
results = response.json()['searchPageProps']['mainContentComponentsListProps']

for result in results:
    print(str(result)[:40]) # converted json dict to String for smaller print statements

{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'bizId': 'CcbFduunKlsrnW71-PH0ZA', 'sea
{'bizId': 'purj1aiUzDi0I__qLOaNRg', 'sea
{'bizId': '1y4juqtkSJ9DPPZyRMi-xA', 'sea
{'bizId': '9Shb0yRis5NEQ5xIGG0FcA', 'sea
{'bizId': 'mog4wAikb1EnzwWSnKt-BQ', 'sea
{'bizId': 'i_GuYJ_1hmW-lHeQ4GmOiw', 'sea
{'bizId': 'zyDGcWX4CNlX9Jm6SEs37Q', 'sea
{'bizId': 'UhMdgcDu6zRADR43uhyfYg', 'sea
{'bizId': 'FQnIs9ojngandeHW4UrasA', 'sea
{'bizId': 'kl8JGeBsEYdiFHi1RGSt3Q', 'sea
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 


This output means that some filtering is still needed. Luckily, there are **10** `bizId` fields, which more than likely correspond to the ten businesses from the search result. Investigating the contents further in the online JSON formatter tool shows that the field `searchResultLayoutType` is shared by all of the elements, but the ten businesses in question can be singled out with the filter `searchResultLayoutType == "iaResult"`.

!['searchResultLayoutType_separator_img'](img/online_json_img_3.png)
!['searchResultLayoutType_iaresult_img'](img/online_json_img_4.png)
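As a small sketch of that filter, the snippet below runs it on made-up data. The sample dictionaries are illustrative stand-ins rather than real Yelp output, and `keep_businesses` is a hypothetical helper; it uses `dict.get` so that an element without the field is skipped instead of raising a `KeyError`.

```python
def keep_businesses(components):
    """Return only the elements whose layout type marks them as business results."""
    return [c for c in components if c.get("searchResultLayoutType") == "iaResult"]


# Made-up stand-in for the mainContentComponentsListProps list
sample = [
    {"searchResultLayoutType": "separator"},
    {"bizId": "example-id-1", "searchResultLayoutType": "iaResult"},
    {"bizId": "example-id-2", "searchResultLayoutType": "iaResult"},
    {"searchResultLayoutType": "separator"},
]

businesses = keep_businesses(sample)
print([b["bizId"] for b in businesses])
```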

In addition, I came to realize that even if a city is specified in a Yelp query, a returned restaurant is not necessarily located in that city. To account for that, the `alias` key can be used; it holds a string containing the restaurant name and the city it is located in (e.g. `mcdonalds-newark`).

With these two filters in mind, I proceed to perform the search on the city of Newark with the following code:

In [26]:
for result in results:
    if result['searchResultLayoutType'] == "iaResult":
        # Filters need to be on separate lines because if the first condition is not satisfied, the `alias` key won't exist
        # resulting in a KeyError upon code execution.
        # if 'newark' in result['searchResultBusiness']['alias']: # SECOND FILTER TO BE USED
            # Extra spaces added for cleaner print statement
            print('name:        ', result['searchResultBusiness']['name'])

            # to check if second filter works
            print('alias:       ', result['searchResultBusiness']['alias']) 
            print('correct_city:', int('newark' in result['searchResultBusiness']['alias']))


            print('review_count:', result['searchResultBusiness']['reviewCount'])
            print('rating:      ', result['searchResultBusiness']['rating'], '\n')

name:         Pocha K
alias:        pocha-k-newark
correct_city: 1
review_count: 55
rating:       4.5 

name:         四姐 Pan Fried Dumplings
alias:        四姐-pan-fried-dumplings-newark-2
correct_city: 1
review_count: 613
rating:       4.5 

name:         Mingkee Deli
alias:        mingkee-deli-newark
correct_city: 1
review_count: 17
rating:       3.5 

name:         Duobao BBQ & Dumplings
alias:        duobao-bbq-and-dumplings-newark
correct_city: 1
review_count: 43
rating:       4.5 

name:         Tamper Room
alias:        tamper-room-fremont
correct_city: 0
review_count: 38
rating:       5.0 

name:         Lazy Dog Restaurant & Bar
alias:        lazy-dog-restaurant-and-bar-newark-2
correct_city: 1
review_count: 1617
rating:       4.0 

name:         Sizzling Lunch
alias:        sizzling-lunch-fremont
correct_city: 0
review_count: 411
rating:       4.5 

name:         Billy Roys Burger
alias:        billy-roys-burger-fremont-4
correct_city: 0
review_count: 285
rating:       4.5 

na

It appears that the filter is working as intended.
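One caveat: a plain substring test would also accept a business whose *name* happens to contain the city. A stricter check can be sketched as follows, assuming aliases end with the city slug plus an optional numeric suffix, as they do in the output above. The helper `alias_in_city` is hypothetical and is not what the rest of this notebook uses.

```python
import re


def alias_in_city(alias, city_slug):
    """True if the alias ends with the city slug, optionally followed by '-<number>'."""
    pattern = r"(?:^|-)" + re.escape(city_slug) + r"(?:-\d+)?$"
    return re.search(pattern, alias) is not None


for alias in ["pocha-k-newark", "lazy-dog-restaurant-and-bar-newark-2", "tamper-room-fremont"]:
    print(alias, alias_in_city(alias, "newark"))
```

Under this pattern a made-up alias such as `newark-pizza-fremont` would be rejected for `newark`, whereas `'newark' in alias` would accept it.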

Next, I want to write the results of my search inquiry into a `.csv` file so that I can read it into a Pandas DataFrame to manipulate it for analysis. I will be using the following format for the `.csv` file:

`business_id`, `name`, `city`, `review_count`, `rating`
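One thing to watch with this format is that a restaurant name may itself contain a comma. The sketch below, using Python's `csv` module and an in-memory buffer in place of the real file, shows how such a name is quoted so that each row still has five fields. The data row is made up for illustration.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for the .csv file
writer = csv.writer(buffer)
writer.writerow(["business_id", "name", "city", "review_count", "rating"])
writer.writerow(["example-id", "Burgers, Fries & More", "newark", 12, 4.0])  # made-up row

print(buffer.getvalue())

# Reading the text back recovers five fields per row, comma in the name included
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```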

Since my current `requests.get()` only gets the results of **one** page, I need to scrape the results of all the pages. Returning to Yelp, it appears that there is a search cap of $24$ pages. With $10$ restaurants on each page, a single search returns at most $240$ restaurants for the city of Newark. The Bay Area Tri-Cities consist of Newark, Fremont, and Union City, so there are up to $3 \times 240 = 720$ search results.


While investigating my search results, I came to notice that within those $720$ search results, not a single commonly franchised fast food restaurant was captured. Apparently these chain restaurants **do not satisfy** Yelp's condition of what qualifies as a "Restaurant." This means that a separate search needs to be done for "fast food" restaurants, resulting in $720 \times 2 = 1440$ hits.
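The counts above can be checked with a few lines of arithmetic. The variable names are only for illustration; the numbers are the ones observed on Yelp ($24$ pages per search, $10$ results per page).

```python
from itertools import product

cities = ["newark", "fremont", "union+city"]
conditions = ["Restaurants", "fast+food"]
pages_per_search = 24   # observed cap on result pages
results_per_page = 10

searches = list(product(cities, conditions))         # every (city, condition) pair
total_requests = len(searches) * pages_per_search     # one request per page
max_results = total_requests * results_per_page       # upper bound before filtering

print(len(searches), total_requests, max_results)  # 6 144 1440
```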

From personal knowledge, I know that these cities are not nearly big enough to have a combined total of $1440$ restaurants. That is consistent with my earlier observation that not all results returned by Yelp are guaranteed to be in the city specified (hence the second filter). Nevertheless, I will perform searches on all $48$ pages per city and will rely on the filters and later data cleaning (simply dropping duplicates) to address data integrity problems.
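The duplicate-dropping step can be sketched in plain Python as below. This assumes `bizId` uniquely identifies a business; the rows and the helper name `drop_duplicate_businesses` are made up for illustration.

```python
def drop_duplicate_businesses(rows):
    """Keep the first row seen for each business_id, preserving order."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["business_id"], row)
    return list(first_seen.values())


# Made-up rows: the same business returned by two different searches
rows = [
    {"business_id": "example-id-1", "name": "Place A", "city": "newark"},
    {"business_id": "example-id-2", "name": "Place B", "city": "newark"},
    {"business_id": "example-id-1", "name": "Place A", "city": "newark"},
]

unique_rows = drop_duplicate_businesses(rows)
print(len(rows), "->", len(unique_rows))
```

Once the data is in a Pandas DataFrame, `DataFrame.drop_duplicates(subset="business_id")` does the equivalent.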

With all that in mind, I finally move on to writing a Python function to perform the scraping properly on my Yelp inquiries.

## Scraping

The function `tri_city_extrapolate` scrapes data from Yelp on the Bay Area Tri-Cities (Newark, Fremont, and Union City) while satisfying all the conditions determined during inspection.

In [27]:
import csv

# A function that scrapes the Yelp search results of the Bay Area Tri-Cities: Newark, Fremont, and Union City.
# The function gets the following information for each restaurant: 1) business_id, 2) name, 3) city,
# 4) review_count, and 5) rating. Together these fields represent a restaurant, and each restaurant is
# written as one row of a .csv file named "tri_city_data.csv".
def tri_city_extrapolate():
    # Helper function to tri_city_extrapolate. Takes in a city name and a search specification.
    # city is one of the tri-cities: "newark", "fremont", or "union+city"
    # condition is either "Restaurants" or "fast+food"
    def city_extrapolate(city, condition):
        # Discovered from debugging. What gets passed in to city is "union+city", which is in the necessary form
        # for the Yelp URL search. However, in the JSON file, the city alias is "union-city". This regular
        # expression reformats the city for the JSON filter condition. It only depends on city, so it is
        # computed once here rather than once per search result.
        city_cond = re.sub("[+]", "-", city)

        for page_index in range(0, 231, 10): # 240 total, 0 is first ten, 230 is last ten
            # String constructed by following the same URL format as a Yelp search
            search_url = "https://www.yelp.com/search/snippet?find_desc=" + condition + "&find_loc=" + city + ",+CA&start=" + str(page_index) # API Endpoint
            search_response = requests.get(search_url) # GET request to the API
            search_results = search_response.json()['searchPageProps']['mainContentComponentsListProps'] # Index into the right granularity level

            # Loop through each result to write into file
            for search_result in search_results:
                # apply filters
                if search_result['searchResultLayoutType'] == "iaResult":
                    if city_cond in search_result['searchResultBusiness']['alias']:
                        # Create variables for each feature
                        feature_business_id = search_result['bizId']
                        feature_name = search_result['searchResultBusiness']['name']
                        feature_review_count = search_result['searchResultBusiness']['reviewCount']
                        feature_rating = search_result['searchResultBusiness']['rating']
                        # The hyphen in the city is replaced with a space so the city attribute looks cleaner
                        feature_city = re.sub("[-]", " ", city_cond)
                        # Write the row into the destination file; csv.writer quotes names that contain commas
                        writer.writerow([feature_business_id, feature_name, feature_city, feature_review_count, feature_rating])

    # Open the destination file; the with block closes it even if a request fails midway.
    # utf-8 is set explicitly because some restaurant names contain non-ASCII characters.
    with open("tri_city_data.csv", "w", newline="", encoding="utf-8") as tri_city_data:
        writer = csv.writer(tri_city_data, lineterminator="\n")
        # First write the headers in
        writer.writerow(["business_id", "name", "city", "review_count", "rating"])
        # Extrapolate results for tri-cities
        for city in ["newark", "fremont", "union+city"]: # Union City is formatted as "union+city" in the Yelp URL
            for condition in ["Restaurants", "fast+food"]:
                city_extrapolate(city, condition)

With `tri_city_extrapolate` implemented, I call the function in the following cell:

In [28]:
# Perform data scraping on Yelp search inquiries
tri_city_extrapolate()

`tri_city_extrapolate` took about $3$ minutes to run; for $144$ page reads ($3$ cities $\times$ $2$ searches $\times$ $24$ pages), I think that's not bad. With the data scraping done, I will move on to EDA and analysis.

## Issues & Problems

During this project, I ran into a few roadblocks that almost made me resort to manual data input. 

Before resorting to scraping Yelp ([which, as I only learned midway through this project, is against Yelp's Terms of Service](https://www.yelp-support.com/article/Can-I-copy-or-scrape-data-from-the-Yelp-site?l=en_US)), I first looked into the [Yelp Open Dataset](https://www.yelp.com/dataset). Since I seek to answer questions about the Bay Area Tri-Cities, I quickly noticed that the dataset **did not** contain information on Newark, Fremont, or Union City. The dataset was therefore not applicable, so I went back to scraping.

I initially worked on this project in a Jupyter Notebook; for some reason, when performing my API GET call with `requests.get()`, I could not get the JSON response's content in its entirety. I knew the content was incomplete because, upon printing the last few characters of the response, I could not see a closing brace, and a JSON object always ends with one. Google searches led me to a few Stack Overflow posts ([one](https://stackoverflow.com/questions/8049520/how-can-i-scrape-a-page-with-dynamic-content-created-by-javascript-in-python) and [two](https://stackoverflow.com/questions/56686756/python-requests-not-returning-fully-loaded-content)), which made me conclude that I was running into a JavaScript rendering issue, where some contents of a page are not rendered unless the page is loaded in a browser. With only a little bit of experience in web design, I just assumed that I was out of luck.
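In hindsight, a slightly more systematic version of the closing-brace check might look like the sketch below. It is only a diagnostic aid under my own assumptions and does not explain the notebook behaviour described above; `describe_response` is a hypothetical helper that takes plain values, so it can be tried without making a request.

```python
import json


def describe_response(status_code, content_type, body):
    """Summarise whether a response body is complete, parseable JSON."""
    if status_code != 200:
        return f"unexpected status {status_code}"
    if "json" not in content_type.lower():
        return f"not JSON (Content-Type: {content_type})"
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        return f"invalid or truncated JSON at character {err.pos}"
    return "complete JSON"


print(describe_response(200, "application/json", '{"pageTitle": "example"}'))
print(describe_response(200, "application/json", '{"pageTitle": "exa'))
```

With a real `requests` response, the call would be `describe_response(response.status_code, response.headers.get("Content-Type", ""), response.text)`.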

Finally, I decided to resort to a web scraping tool called [ParseHub](https://www.parsehub.com/). I was a bit disappointed that I had to use this tool, as my initial motivation for starting this project was to learn more about web scraping. Nevertheless, I resigned myself to turning this project into a data exploration and analysis task. But upon finishing my download, I was confronted with this [page](https://help.parsehub.com/hc/en-us/articles/10607213112855-Using-ParseHub-in-macOS-2-20-Ventura); ParseHub is not supported on macOS Ventura, which is the OS I'm currently on. No, I am not going to set up a virtual machine just to use a tool that I didn't want to use in the first place.

With all these shenanigans, I basically gave up on scraping and decided to manually input my data. To be fair, the Bay Area Tri-Cities are small, and I think I can handle manually inputting two hundred or so tuples.

So I got ready to begin my manual input, but before doing so, I shared my predicament with my dad. To my pleasant surprise, my dad discovered that my code **worked**: it ran perfectly fine in VS Code, just **not in my Jupyter Notebook**. This made absolutely no sense to me. But hey, if it works, it works, so now here I am working on the rest of this project in VS Code.