# Scraper
In this notebook, I scrape the Yelp website for restaurant reviews in San Jose. I use both the Yelp Fusion API and the **`requests`** library to scrape the reviews.

In [0]:
from bs4 import BeautifulSoup
import json
import numpy as np
import os
import pandas as pd
import re
import requests
import sys
import time

from google.colab import drive
from importlib.machinery import SourceFileLoader

## Setup
Mount Google Drive and load two Python scripts with custom functions and constants.

In [0]:
ROOT = '/content/drive'
PROJECT = 'My Drive/Thinkful/Final_Capstone_Project/'
PROJECT_PATH = os.path.join(ROOT, PROJECT)

In [0]:
# drive.mount(ROOT)

In [0]:
con = SourceFileLoader('constants', os.path.join(PROJECT_PATH, 'utilities/constants.py')).load_module()
met = SourceFileLoader('methods', os.path.join(PROJECT_PATH, 'utilities/methods.py')).load_module()
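
The loading step above can also be written with `importlib.util`, since `load_module()` is deprecated in recent Python versions. A minimal sketch, assuming a hypothetical helper `load_script` and a throwaway file standing in for the real utility scripts:

```python
import importlib.util
import os
import tempfile


def load_script(module_name, path):
    """Load a Python file from an arbitrary path as a module (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a temporary file; in the notebook the path would point at
# utilities/constants.py or utilities/methods.py under PROJECT_PATH.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_constants.py')
    with open(demo_path, 'w') as f:
        f.write("SEARCH_PARAMS = {'offset': 0}\n")
    demo = load_script('demo_constants', demo_path)

print(demo.SEARCH_PARAMS)
```

Giving each script its own module name keeps the two loaded modules distinguishable.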

## Scrape Businesses
First, I will scrape the business data for 1000 restaurants in the San Jose, CA area. The API response includes the business IDs, which I will use to scrape the restaurant reviews.

In [0]:
con.SEARCH_PARAMS['offset'] = 0
json_list = []
while con.SEARCH_PARAMS['offset'] < 1000:
  response = requests.get(con.SEARCH_URL, params=con.SEARCH_PARAMS, headers=con.HEADERS)
  json_dict = json.loads(response.text)
  json_list.append(json_dict)
  con.SEARCH_PARAMS['offset'] += 50
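
The loop above assumes every request succeeds. A minimal sketch of the same offset pagination with a basic check on each page, where `fetch_page` is a hypothetical stand-in for the `requests.get` call and the page size of 50 mirrors the offset step used above:

```python
def collect_pages(fetch_page, total=1000, page_size=50):
    """Request pages by offset and keep only those that contain businesses.

    fetch_page is a hypothetical callable that takes an offset and returns
    the decoded JSON response as a dict.
    """
    pages = []
    failed_offsets = []
    for offset in range(0, total, page_size):
        page = fetch_page(offset)
        if 'businesses' in page:
            pages.append(page)
        else:
            # Record the offset so the request can be retried later
            failed_offsets.append(offset)
    return pages, failed_offsets


# Fake fetcher for illustration: the page at offset 100 returns an error payload
def fake_fetch(offset):
    if offset == 100:
        return {'error': {'code': 'EXAMPLE_ERROR'}}
    return {'businesses': [{'id': f'biz-{offset + i}'} for i in range(50)]}


pages, failed_offsets = collect_pages(fake_fetch, total=200, page_size=50)
print(len(pages), failed_offsets)
```

Keeping the failed offsets separate means a later cell that reads `item['businesses']` never hits a page without that key.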

Save the data to a JSON file.

In [0]:
with open(os.path.join(PROJECT_PATH, 'data/businesses.json'), 'w') as f:
    json.dump(json_list, f)

Load the JSON file.

In [0]:
with open(os.path.join(PROJECT_PATH, 'data/businesses.json')) as json_file:
  data = json.load(json_file)

Collect the relevant fields in a **`dict`** and then load them into a Pandas **`DataFrame`**.

In [0]:
print(f'The data of interest are: {con.BUSINESSES_DICT.keys()}')

The data of interest are: dict_keys(['id', 'name', 'is_closed', 'review_count', 'rating', 'distance'])


In [0]:
for item in data:
  for business in item['businesses']:
    # Append each field of interest to its column list
    for key in con.BUSINESSES_DICT:
      con.BUSINESSES_DICT[key].append(business[key])

In [0]:
df_businesses = pd.DataFrame(con.BUSINESSES_DICT)
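
Because `con.BUSINESSES_DICT` is a module-level object that is never cleared, rerunning the collection cell appends every business a second time. A minimal sketch of the same collection step that builds a fresh dictionary on each run; `FIELDS` and `collect_fields` are hypothetical names, and the sample record is made up for illustration:

```python
FIELDS = ['id', 'name', 'is_closed', 'review_count', 'rating', 'distance']


def collect_fields(pages, fields):
    """Return a dict of column lists built from a list of search responses."""
    columns = {field: [] for field in fields}
    for page in pages:
        for business in page['businesses']:
            for field in fields:
                columns[field].append(business[field])
    return columns


# Made-up record for illustration only
sample_pages = [{'businesses': [{
    'id': 'abc123', 'name': 'Example Cafe', 'is_closed': False,
    'review_count': 10, 'rating': 4.5, 'distance': 1200.0,
    'phone': '',  # Keys outside FIELDS are ignored
}]}]

columns = collect_fields(sample_pages, FIELDS)
print(columns)
```

The returned dictionary can be passed to `pd.DataFrame` in the same way as `con.BUSINESSES_DICT`.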

Save the **`DataFrame`** to a CSV file.

In [0]:
df_businesses.to_csv(os.path.join(PROJECT_PATH, 'data/businesses.csv'), index=False)

## Scrape Reviews
Using the business IDs from the previous API call, request the review data for each business.

In [0]:
reviews_list = []
for business_id in df_businesses['id']:
  review_url = f'{con.BUSINESSES_URL}/{business_id}/reviews'
  response = requests.get(review_url, headers=con.HEADERS)
  json_dict = json.loads(response.text)
  reviews_list.append(json_dict)
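
With 1000 sequential requests, a transient failure can interrupt the loop or leave an error payload in `reviews_list`. A minimal sketch of retrying a request a few times with a short pause, which is not what the loop above does; `fetch_with_retry`, `max_attempts`, and `pause` are hypothetical names, and the flaky fetcher below only simulates failures:

```python
import time


def fetch_with_retry(fetch, max_attempts=3, pause=0.0):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is a hypothetical zero-argument callable that returns decoded
    JSON or raises an exception (for example a requests exception).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            # Wait a little longer after each failed attempt
            time.sleep(pause * attempt)
    raise last_error


# Simulated fetcher that fails twice, then returns a review payload
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return {'reviews': [{'id': 'r1', 'rating': 5}]}


result = fetch_with_retry(flaky_fetch, max_attempts=3, pause=0.0)
print(result, calls['count'])
```

In the notebook, `fetch` could wrap the `requests.get` call for a single business, with a non-zero `pause` to space out the attempts.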

In [0]:
print(f'Found review data for {len(reviews_list)} businesses.')

Found review data for 1000 businesses.


Save the review data to a JSON file.

In [0]:
with open(os.path.join(PROJECT_PATH, 'data/reviews.json'), 'w') as f:
    json.dump(reviews_list, f)

Load the JSON file.

In [0]:
with open(os.path.join(PROJECT_PATH, 'data/reviews.json')) as json_file:
  reviews_data = json.load(json_file)

Collect the review data of interest into a **`dict`** and load into a Pandas **`DataFrame`**.

In [0]:
print(f'The data of interest are: {con.REVIEWS_DICT.keys()}')

The data of interest are: dict_keys(['id', 'rating', 'text', 'time_created', 'url'])


In [0]:
# Clear the dictionary in case it is not empty
for key in con.REVIEWS_DICT:
  con.REVIEWS_DICT[key] = []

for item in reviews_data:
  # Skip responses without a 'reviews' key (for example, error responses)
  if 'reviews' not in item:
    continue
  # Only take the first review
  for review in item['reviews'][:1]:
    for key in con.REVIEWS_DICT:
      con.REVIEWS_DICT[key].append(review[key])

In [0]:
df_reviews = pd.DataFrame(con.REVIEWS_DICT)

Save the review data to a CSV file.

In [0]:
df_reviews.to_csv(os.path.join(PROJECT_PATH, 'data/reviews.csv'), index=False)

## Scrape Full Reviews
Using the review URLs acquired from the Yelp API call, I will now use the **`requests`** library to scrape the full reviews for each of the 1000 businesses.

In [0]:
print(f'The data of interest are: {con.FULL_REVIEWS_DICT.keys()}')

The data of interest are: dict_keys(['rating', 'text'])


In [0]:
# Clear the dictionary in case it is not empty
for key in con.FULL_REVIEWS_DICT.keys():
  con.FULL_REVIEWS_DICT[key] = []

skipped_url_count = 0
start_time = time.strftime('%H:%M:%S', time.localtime())
print(f'Scraper start time: {start_time}')
url_count = 0
for url in df_reviews['url'].values:
  try:
    data = requests.get(url)
  except requests.exceptions.RequestException:
    print(skipped_url_count)
    skipped_url_count += 1
    continue

  soup = BeautifulSoup(data.text, 'html.parser')
  div_list = []
  for div_tags in soup.select('div[class*="sidebarActionsHoverTarget__373c0__2kfhE arrange__373c0__UHqhV"]'):
    div_list.append(str(div_tags))

  for item in div_list:
    con.FULL_REVIEWS_DICT['text'].append(re.findall('<span class="lemon--span__373c0__3997G" lang="en">(.*?)</span>', item)[0])
    con.FULL_REVIEWS_DICT['rating'].append(int(re.findall(r'i-stars--regular-(\d+)', item)[0]))

  url_count += 1

end_time = time.strftime('%H:%M:%S', time.localtime())
print(f'Scraper end time: {end_time}')

print(f'{skipped_url_count} URLs could not be reached and were skipped.')
print(f'{url_count} URL count.')

Scraper start time: 04:29:05
0
1
2
3
Scraper end time: 05:38:00
4 URLs could not be reached and were skipped.
996 URL count.
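
The two regular expressions in the scraper pull the review text and the star rating out of each matched `div`. A minimal sketch on a made-up HTML fragment that reuses the class names from the scraper; the fragment itself is an assumption and Yelp's real markup may differ, and `parse_review` is a hypothetical helper:

```python
import re

TEXT_PATTERN = r'<span class="lemon--span__373c0__3997G" lang="en">(.*?)</span>'
RATING_PATTERN = r'i-stars--regular-(\d+)'

# Made-up fragment for illustration only
sample_div = (
    '<div class="sidebarActionsHoverTarget__373c0__2kfhE arrange__373c0__UHqhV">'
    '<div class="i-stars i-stars--regular-4__373c0__x"></div>'
    '<span class="lemon--span__373c0__3997G" lang="en">Great tacos.</span>'
    '</div>'
)


def parse_review(div_html):
    """Return (rating, text) for one review div, or None if either is missing."""
    text_match = re.search(TEXT_PATTERN, div_html)
    rating_match = re.search(RATING_PATTERN, div_html)
    if text_match is None or rating_match is None:
        return None
    return int(rating_match.group(1)), text_match.group(1)


print(parse_review(sample_div))
print(parse_review('<div>no review here</div>'))
```

Returning `None` instead of indexing `[0]` avoids an `IndexError` when a `div` lacks either element, and it keeps the `rating` and `text` lists the same length.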


## Save Scraped Data
Load the scraped data into a Pandas **`DataFrame`**.

In [0]:
df_full_reviews = pd.DataFrame(con.FULL_REVIEWS_DICT)

In [0]:
df_full_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19300 entries, 0 to 19299
Data columns (total 2 columns):
rating    19300 non-null int64
text      19300 non-null object
dtypes: int64(1), object(1)
memory usage: 301.7+ KB


Save the **`DataFrame`** to a CSV file.

In [0]:
df_full_reviews.to_csv(os.path.join(PROJECT_PATH, 'data/full_reviews.csv'), index=False)