## Scraping Data on the Bay Area Tri-Cities from Yelp

In [3]:
import sys

!{sys.executable} -m pip install requests

Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.1.0-cp311-cp311-macosx_11_0_arm64.whl (121 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2022.12.7-py3-none-any.whl (155 kB)
Installing collected packages: urllib3, idna, 

In [5]:
# necessary imports
import requests

In [13]:
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Newark%2C+CA&start=0"
html = requests.get(url)

print(html.text)

<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(no-j/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/dcfe403147fc/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>
            window.yelp = window.yelp || {};
            

From using developer tools, it appears that the following URL returns a JSON response:

https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Fremont%2C+CA&start=0

From experimenting with the URL, it also appears that I can get the next page of search results by incrementing the `start` parameter in the URL by $10$; for the second page of results, I would set `start=10`.
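To make the pagination rule concrete, here is a minimal sketch that builds the snippet URL for a given page. The helper name `snippet_url` and the constant `RESULTS_PER_PAGE` are my own for illustration; the page size of 10 is only what I observed, not a documented value.

```python
from urllib.parse import urlencode

SNIPPET_ENDPOINT = "https://www.yelp.com/search/snippet"
RESULTS_PER_PAGE = 10  # observed page size; an assumption, not a documented constant


def snippet_url(city: str, page: int, term: str = "Restaurants") -> str:
    """Build the search URL for a 1-indexed results page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "find_desc": term,
        "find_loc": city,
        # page 1 -> start=0, page 2 -> start=10, and so on
        "start": (page - 1) * RESULTS_PER_PAGE,
    }
    # urlencode percent-encodes the comma and turns the space into '+'
    return f"{SNIPPET_ENDPOINT}?{urlencode(params)}"


print(snippet_url("Newark, CA", page=2))
# https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Newark%2C+CA&start=10
```

Building the query string with `urlencode` avoids hand-escaping the city name.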

In [39]:
# The API endpoint
#url = "https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Newark,+CA&start=10&parent_request_id=272882490592ff57&request_origin=user"
url = "https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Newark,+CA&start=0"

# A GET request to the API
response = requests.get(url)

# Parse the JSON body and print only its first 50 characters to keep the output short
response_json = response.json()
print(str(response_json)[:50])

{'pageTitle': 'Top 10 Best Restaurants near Newark


Judging from the printed result, the content matches that of the link. To understand the structure better, I pasted the contents into an [online JSON formatter tool](https://jsonformatter.org/). After further exploration, I found that indexing into the dictionary with the keys `searchPageProps` and `mainContentComponentsListProps` gives my desired level of granularity: the information on individual businesses. Doing so led to the following discovery:

In [35]:
results = response.json()['searchPageProps']['mainContentComponentsListProps']

for result in results:
    print(str(result)[:40]) # converted json dict to String for smaller print statements

{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'bizId': 'CcbFduunKlsrnW71-PH0ZA', 'sea
{'bizId': '1y4juqtkSJ9DPPZyRMi-xA', 'sea
{'bizId': 'purj1aiUzDi0I__qLOaNRg', 'sea
{'bizId': '9Shb0yRis5NEQ5xIGG0FcA', 'sea
{'bizId': 'Ch5JHbQ9KKDeRfZXRJVo2g', 'sea
{'bizId': 'mog4wAikb1EnzwWSnKt-BQ', 'sea
{'bizId': 'Hf3h4YNR8eL0Xdf5yygmIA', 'sea
{'bizId': '_k7U8HEAsBWScymSKWFAfQ', 'sea
{'bizId': 'JvSvKpVFQ13VM28n7y-BOw', 'sea
{'bizId': 'kl8JGeBsEYdiFHi1RGSt3Q', 'sea
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 
{'searchResultLayoutType': 'separator', 


This output meant that some filtering was still needed. Luckily, there are **10** `bizId` fields, which more than likely correspond to the ten businesses from the search result. Investigating the contents further in the online JSON formatter, I found that the field `searchResultLayoutType` is shared by all of the elements, but the ten businesses in question can be singled out with the filter `searchResultLayoutType == "iaResult"`. In addition, I came to realize that even when a city is specified in a Yelp query, the restaurants returned are not necessarily located in that city. To account for that, the `alias` key can be used: it holds a string containing the restaurant name and the city it is located in (e.g. `mcdonalds-newark`).

With these two filters in mind, I perform the search on the city of Newark with the following code:

In [71]:
for result in results:
    if result['searchResultLayoutType'] == "iaResult":
        # The second filter must only be evaluated after the first one passes: entries that fail
        # the first filter have no `alias` key, so looking it up on them raises a KeyError.
        # if 'newark' in result['searchResultBusiness']['alias']: # SECOND FILTER TO BE USED
            # Extra spaces added for cleaner print statement
            print('name:        ', result['searchResultBusiness']['name'])

            # to check if second filter works
            print('alias:       ', result['searchResultBusiness']['alias']) 
            print('correct_city:', int('newark' in result['searchResultBusiness']['alias']))


            print('review_count:', result['searchResultBusiness']['reviewCount'])
            print('rating:      ', result['searchResultBusiness']['rating'], '\n')

name:         Pocha K
alias:        pocha-k-newark
correct_city: 1
review_count: 54
rating:       4.5 

name:         Mingkee Deli
alias:        mingkee-deli-newark
correct_city: 1
review_count: 16
rating:       3.5 

name:         四姐 Pan Fried Dumplings
alias:        四姐-pan-fried-dumplings-newark-2
correct_city: 1
review_count: 612
rating:       4.5 

name:         Duobao BBQ & Dumplings
alias:        duobao-bbq-and-dumplings-newark
correct_city: 1
review_count: 42
rating:       4.5 

name:         Always Cool BBQ
alias:        always-cool-bbq-fremont-2
correct_city: 0
review_count: 153
rating:       4.5 

name:         Tamper Room
alias:        tamper-room-fremont
correct_city: 0
review_count: 37
rating:       5.0 

name:         Sia Fusion Eatery
alias:        sia-fusion-eatery-newark
correct_city: 1
review_count: 952
rating:       4.0 

name:         Mingala Restaurant
alias:        mingala-restaurant-newark
correct_city: 1
review_count: 83
rating:       4.5 

name:         Oyama B

It appears that the filter is working as intended.
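A plain substring test could still misfire on a business whose *name* contains the city, so here is a slightly stricter sketch of the same check. It assumes the alias always has the form `<name>-<city>` with an optional trailing `-<number>`, which is only what the output above suggests; the helper names and the alias `newark-pizza-fremont` are hypothetical.

```python
import re


def city_slug(city: str) -> str:
    """Lower-case a city name and join its words with hyphens ('Union City' -> 'union-city')."""
    return "-".join(city.lower().split())


def alias_in_city(alias: str, city: str) -> bool:
    """True if the alias ends with the city slug, ignoring a trailing '-<number>' suffix."""
    base = re.sub(r"-\d+$", "", alias)  # 'always-cool-bbq-fremont-2' -> 'always-cool-bbq-fremont'
    slug = city_slug(city)
    return base == slug or base.endswith("-" + slug)


for alias in [
    "pocha-k-newark",
    "四姐-pan-fried-dumplings-newark-2",
    "always-cool-bbq-fremont-2",
    "newark-pizza-fremont",  # hypothetical alias: city name appears in the business name
]:
    print(alias, alias_in_city(alias, "Newark"))
```

If the alias format turns out to vary, the original substring check remains the safer fallback.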

Next, I want to write the results of my search query to a `.csv` text file so that I can read it into a Pandas DataFrame and manipulate it for analysis. I will be using the following format for the `.csv` file:

`name`, `city`, `review_count`, `rating`
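As a rough sketch of that format, the rows can be written with the standard library's `csv` module. The helper `rows_to_csv` is my own name, and the two sample rows reuse values printed earlier for Newark; an in-memory buffer stands in for the real file.

```python
import csv
import io

FIELDNAMES = ["name", "city", "review_count", "rating"]


def rows_to_csv(rows, out):
    """Write a header line followed by one CSV line per row dict."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"name": "Pocha K", "city": "Newark", "review_count": 54, "rating": 4.5},
    {"name": "Duobao BBQ & Dumplings", "city": "Newark", "review_count": 42, "rating": 4.5},
]

buffer = io.StringIO()  # stand-in for a file opened with open(path, "w", newline="")
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Using `csv` rather than manual string joins means a name containing a comma gets quoted automatically, and the resulting file can be loaded with `pandas.read_csv`.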

Since my current `requests.get()` call only gets the results of *one* page, I need to scrape the results of all the pages. Returning to Yelp, it appears that there is a search cap of $24$ pages. With $10$ restaurants on each page, this implies at most $240$ results for the city of Newark. The Bay Area Tri-Cities consist of Newark, Fremont, and Union City, so there are at most $3 \times 240 = 720$ search results in total. From my own personal knowledge, these cities are not nearly big enough to have a cumulative total of $720$ restaurants; this is a reasonable assumption, since I observed previously that not all results returned by Yelp are guaranteed to be in the city specified (hence the second filter). Nevertheless, I will perform searches on all $24$ pages per city and rely on the filters and later data cleaning (simply dropping duplicates) to address data integrity problems.
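The arithmetic and the planned clean-up can be sketched as follows. The constants restate the numbers above; `drop_duplicate_rows` is a plain-Python stand-in I wrote for the "drop duplicates" step, treating two rows as duplicates only when every field matches, and the sample rows reuse values printed earlier.

```python
from itertools import product

CITIES = ["Newark", "Fremont", "Union City"]
MAX_PAGES = 24          # search cap observed on Yelp
RESULTS_PER_PAGE = 10   # restaurants per page

# One (city, start offset) pair per request: 3 cities x 24 pages
request_plan = [
    (city, page * RESULTS_PER_PAGE)
    for city, page in product(CITIES, range(MAX_PAGES))
]
print(len(request_plan))                     # 72 requests
print(len(request_plan) * RESULTS_PER_PAGE)  # at most 720 results


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


sample = [
    ("Tamper Room", "Fremont", 37, 5.0),
    ("Pocha K", "Newark", 54, 4.5),
    ("Tamper Room", "Fremont", 37, 5.0),  # same business returned a second time
]
print(drop_duplicate_rows(sample))
```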

With all that in mind, the following is the outline of a Python function for the scraping described above:

In [None]:
def tri_city_extrapolate():

    # helper function
    def city_extrapolate():
        return

    return
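One way the outline might be filled in is sketched below; it is an illustration under my own assumptions, not a tested scraper. The `fetch_json` parameter (a callable that takes a URL and returns the decoded JSON dict), the `pages`, `page_size`, and `delay` options, and the early stop on an empty page are all choices I made for the sketch. `fake_fetch` returns canned data built from values printed earlier, so the example runs without touching the network.

```python
import time
from urllib.parse import urlencode

SNIPPET_ENDPOINT = "https://www.yelp.com/search/snippet"
TRI_CITIES = ["Newark", "Fremont", "Union City"]


def city_extrapolate(city, fetch_json, pages=24, page_size=10, delay=0.0):
    """Collect (name, city, review_count, rating) tuples for one city."""
    slug = "-".join(city.lower().split())  # "Union City" -> "union-city"
    rows = []
    for page in range(pages):
        query = urlencode({
            "find_desc": "Restaurants",
            "find_loc": f"{city}, CA",
            "start": page * page_size,
        })
        payload = fetch_json(f"{SNIPPET_ENDPOINT}?{query}")
        components = payload.get("searchPageProps", {}).get(
            "mainContentComponentsListProps", []
        )
        # First filter: keep only the business entries
        businesses = [
            c["searchResultBusiness"]
            for c in components
            if c.get("searchResultLayoutType") == "iaResult"
        ]
        if not businesses:
            break  # an empty page is treated as the end of the results
        for biz in businesses:
            if slug in biz["alias"]:  # second filter: keep in-city results only
                rows.append((biz["name"], city, biz["reviewCount"], biz["rating"]))
        time.sleep(delay)  # optional pause between requests
    return rows


def tri_city_extrapolate(fetch_json, cities=TRI_CITIES, **options):
    """Run city_extrapolate for every city and concatenate the rows."""
    rows = []
    for city in cities:
        rows.extend(city_extrapolate(city, fetch_json, **options))
    return rows


def fake_fetch(url):
    """Stand-in for a real HTTP call: one canned page at start=0, nothing after."""
    if not url.endswith("start=0"):
        return {}
    return {"searchPageProps": {"mainContentComponentsListProps": [
        {"searchResultLayoutType": "separator"},
        {"searchResultLayoutType": "iaResult", "searchResultBusiness": {
            "name": "Pocha K", "alias": "pocha-k-newark",
            "reviewCount": 54, "rating": 4.5}},
        {"searchResultLayoutType": "iaResult", "searchResultBusiness": {
            "name": "Tamper Room", "alias": "tamper-room-fremont",
            "reviewCount": 37, "rating": 5.0}},
    ]}}


rows = tri_city_extrapolate(fake_fetch)
print(rows)
```

For a live run, `fetch_json` could be something like `lambda url: requests.get(url).json()`, with a non-zero `delay` to space out the requests. Passing the fetcher in as an argument keeps the parsing logic testable offline.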

Issue: https://stackoverflow.com/questions/8049520/how-can-i-scrape-a-page-with-dynamic-content-created-by-javascript-in-python

https://stackoverflow.com/questions/56686756/python-requests-not-returning-fully-loaded-content

The issue persists, so I am going to resort to using a web scraping tool.
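The exact symptom is not recorded in these notes. Assuming it was a response body that did not contain the expected JSON (for example, an HTML page instead), a small check like the one below could help tell the two cases apart before indexing into the result. The helper name `parse_snippet_body` is hypothetical.

```python
import json


def parse_snippet_body(body: str):
    """Return the decoded JSON object, or None if the body is not a JSON object."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # e.g. an HTML page was returned instead of JSON
    return payload if isinstance(payload, dict) else None


print(parse_snippet_body('{"pageTitle": "example"}'))
print(parse_snippet_body("<!DOCTYPE html><html></html>"))
```

With `requests`, the string to pass in would be `response.text`.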

In [None]:
import requests

search_url = "https://www.yelp.com/search/snippet?find_desc=Restaurants&find_loc=Seattle%2C+WA%2C+United+States&request_origin=user"
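search_response = requests.get(search_url)
search_response.raise_for_status()  # stop early if the request was rejected
# Drill down to the list of search result components
search_results = search_response.json()['searchPageProps']['mainContentComponentsListProps']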

for result in search_results:
    if result['searchResultLayoutType'] == "iaResult":
        print(result['searchResultBusiness']['name'])
        print(result['searchResultBusiness']['neighborhoods'])
        print(result['searchResultBusiness']['reviewCount'])
        print(result['searchResultBusiness']['rating'])
        print("https://www.yelp.com" + result['searchResultBusiness']['businessUrl'])
        print("--------")