<a href="https://colab.research.google.com/github/eastmountaincode/DSC/blob/main/howToGetReviewDataFromYelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook documents how to get review data from Yelp using a combination of the Yelp Fusion API and APIFY, a web scraping service (https://apify.com/). We use the Yelp Fusion API to get a list of businesses, and then we use APIFY to scrape the reviews for those businesses.

## Initialize

In [None]:
#Install required package
!pip install yelpapi

#Import required libraries
import pandas as pd
from yelpapi import YelpAPI

In [None]:
#Mount google drive if needed (this code was created with Google Colab)
from google.colab import drive
drive.mount('/content/drive')

You will need an API key for the Yelp Fusion API. You can learn how to obtain an API key at this link: https://www.yelp.com/developers/documentation/v3/authentication

In [None]:
yelp_api = YelpAPI("your api key goes here!")
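If you share this notebook, it is easy to leave a real key pasted into the cell above. A minimal sketch of one alternative is to read the key from an environment variable. The variable name `YELP_API_KEY` and the helper `load_api_key` are my own choices for illustration, not something Yelp or `yelpapi` requires.

```python
import os

def load_api_key(env_var="YELP_API_KEY"):
    """Return the API key stored in the given environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return key

# Usage (left commented out because it needs a real key):
# yelp_api = YelpAPI(load_api_key())
```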

## Option 1/2: If getting data from just one location...

In [None]:
#Create a blank dataframe
businessdf = pd.DataFrame(columns=["BusinessName", "ID", "City", "State", "TotalRating", "URL", "bizType" ])

Yelp Fusion's list of categories is pleasantly comprehensive. Here's the link: https://www.yelp.com/developers/documentation/v3/all_category_list. Find the correct category name for the type of business you're looking for. In this case, let's search for ice cream shops, which has a category name of "icecream".

In [None]:
#Category name and location get inserted as parameter in this step
#Note that 50 is the maximum number of businesses we can get back from a single search
#A workaround for this could be to search in several nearby cities and then remove the duplicates
search_results = yelp_api.search_query(categories='icecream', location="Cincinnati", limit=50, radius=40000)
for businessi in range(len(search_results['businesses'])):
  #Just to be safe CHECK IF CATEGORY ALIAS MATCHES THE CATEGORY WE WANT
  if search_results['businesses'][businessi]['categories'][0]['alias'] == "icecream":
    if search_results['businesses'][businessi]['location']['state'] == "OH":
      #GET THE INFO
      businessname = search_results['businesses'][businessi]['alias']
      ID = search_results['businesses'][businessi]['id']
      actualCity = search_results['businesses'][businessi]['location']['city']
      state = search_results['businesses'][businessi]['location']['state']
      rating = search_results['businesses'][businessi]['rating']
      url = search_results['businesses'][businessi]['url']
      bizType = "iceCream"
      
      new_row = {'BusinessName': businessname, 'ID': ID, 'City': actualCity, 'State': state, "TotalRating": rating, "URL": url, "bizType": bizType}
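      #DataFrame.append was removed in pandas 2.0, so add the row with pd.concat instead
      businessdf = pd.concat([businessdf, pd.DataFrame([new_row])], ignore_index=True)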
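Besides searching several nearby cities, another possible workaround for the 50-result limit is paging. The Yelp Fusion search endpoint accepts an `offset` parameter, and `search_query` forwards keyword arguments to it. The sketch below is an assumption-laden illustration, not a tested recipe against the live API: `search_all_pages` is a hypothetical helper, and the `max_results` default of 200 is an arbitrary placeholder because Yelp caps how far the offset can go (check the current documentation for the exact limit).

```python
def search_all_pages(search_fn, page_size=50, max_results=200, **params):
    """Call search_fn with an increasing offset and collect the businesses."""
    businesses = []
    offset = 0
    while offset < max_results:
        # Never ask for more than what is left of the max_results budget
        limit = min(page_size, max_results - offset)
        page = search_fn(limit=limit, offset=offset, **params).get("businesses", [])
        businesses.extend(page)
        if len(page) < limit:
            break  # a short page means there are no more results
        offset += limit
    return businesses

# Usage (commented out because it needs a real key and network access):
# all_businesses = search_all_pages(yelp_api.search_query, categories="icecream",
#                                   location="Cincinnati", radius=40000)
```

The function takes the search function as an argument so that it can be tried out with a stand-in function before spending real API calls.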

In [None]:
display(businessdf)

The following code prints the Yelp URLs in a form that APIFY can read. Copy ALL of the output (if you're on Google Colab, right click and go to "view output fullscreen").

In [None]:
for i in range(len(businessdf)):
  print('"{}",'.format(businessdf.loc[i, "URL"]))

Now go to APIFY and set up the scraper:

1. Create a new task and search for Yelp Scraper.
2. Go to the Input tab. Remove the search terms and the locations.
3. Set the search results limit to 99999999.
4. In direct Yelp page URLs, type in some random characters (it doesn't matter what) and ignore the error message.
5. Set maximum reviews to 99999999 and max images to 0.
6. Switch from editor view to JSON view at the top left.
7. In the list called direct URLs, paste in the list of URLs that you got from the code above, and remove the very last comma.
8. Switch back over to editor view; this should work if everything was pasted correctly.
9. Run the scraper, then download the JSON dataset from APIFY.
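If removing the last comma by hand feels error-prone, the URLs can also be printed as a complete JSON array. This is a small sketch using Python's standard `json` module; `urls_as_json` and the example URLs are hypothetical. Note that the output includes the surrounding square brackets, so it would replace the whole list in the JSON view rather than being pasted inside the existing brackets.

```python
import json

def urls_as_json(urls):
    """Format URLs as a JSON array, with no trailing comma to clean up."""
    return json.dumps(list(urls), indent=2)

# Hypothetical example URLs; in the notebook you would pass businessdf["URL"]
example_urls = [
    "https://www.yelp.com/biz/example-one",
    "https://www.yelp.com/biz/example-two",
]
print(urls_as_json(example_urls))
```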

## Option 2/2: If getting data from multiple locations...

Import a list of locations you want to search. In this case, I used a CSV file containing every city in Ohio ordered by population. I got this list from this link: https://www.ohio-demographics.com/cities_by_population (it required a bit of cleaning).

In [None]:
citydf = pd.read_csv("/content/drive/My Drive/Colab Notebooks/Yelp/OhioPopulationByCity.csv", index_col= 0)

In [None]:
#Append ", OH" to every city name so that each search location includes the state
citylist = []

#Loop over the City column directly, so this works whatever the dataframe's index looks like
for city in citydf["City"]:
  fullcity = "{}, OH".format(city)
  citylist.append(fullcity)

In [None]:
#Let's just do the first 10 most populous cities for this example...
citylist = citylist[:10]

In [None]:
citylist

['Columbus, OH',
 'Cleveland, OH',
 'Cincinnati, OH',
 'Toledo, OH',
 'Akron, OH',
 'Dayton, OH',
 'Parma, OH',
 'Canton, OH',
 'Youngstown, OH',
 'Lorain, OH']

In [None]:
counter = 0
#FOR EVERY CITY, SEARCH FOR ICE CREAM SHOPS
for cityname in citylist:
  search_results = yelp_api.search_query(categories='icecream', location=cityname, limit=50, radius=40000)
  #CHECK IF CATEGORY ALIAS MATCHES THE CATEGORY WE WANT
  for businessi in range(len(search_results['businesses'])):
    if search_results['businesses'][businessi]['categories'][0]['alias'] == "icecream":
      #This line causes the code to only include businesses located in Ohio
      if search_results['businesses'][businessi]['location']['state'] == "OH":
        #GET THE INFO
        businessname = search_results['businesses'][businessi]['alias']
        ID = search_results['businesses'][businessi]['id']
        actualCity = search_results['businesses'][businessi]['location']['city']
        state = search_results['businesses'][businessi]['location']['state']
        rating = search_results['businesses'][businessi]['rating']
        url = search_results['businesses'][businessi]['url']
        bizType = "iceCream"
        
        new_row = {'BusinessName': businessname, 'ID': ID, 'City': actualCity, 'State': state, "TotalRating": rating, "URL": url, "bizType": bizType}
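        #DataFrame.append was removed in pandas 2.0, so add the row with pd.concat instead
        businessdf = pd.concat([businessdf, pd.DataFrame([new_row])], ignore_index=True)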
  counter += 1
  print(round(((counter / len(citylist)) * 100), 1))



In [None]:
#Because we did multiple searches, we need to make sure we don't have any duplicates...
#The Yelp business ID identifies a business, so use it to decide what counts as a duplicate
businessdf = businessdf.drop_duplicates(subset="ID")
businessdf.reset_index(drop=True, inplace=True)
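To see why both steps matter, here is a small sketch on a made-up dataframe (the shop names and IDs are invented). It assumes that the `ID` column is enough to identify a business. Resetting the index afterwards matters because later cells look up rows with `businessdf.loc[i, ...]` for `i` in `range(len(businessdf))`, which needs the index to run from 0 upward without gaps.

```python
import pandas as pd

# Invented rows: "shop-a" was returned by two different city searches
toy = pd.DataFrame({
    "BusinessName": ["shop-a", "shop-b", "shop-a", "shop-c"],
    "ID": ["id1", "id2", "id1", "id3"],
    "City": ["Columbus", "Columbus", "Columbus", "Dayton"],
})

# Keep the first occurrence of each ID, then renumber the rows from 0
deduped = toy.drop_duplicates(subset="ID").reset_index(drop=True)
print(len(toy), "rows before,", len(deduped), "rows after")
```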

In [None]:
display(businessdf)

The following code prints the Yelp URLs in a form that APIFY can read. Copy ALL of the output (if you're on Google Colab, right click and go to "view output fullscreen").

In [None]:
for i in range(len(businessdf)):
  print('"{}",'.format(businessdf.loc[i, "URL"]))

Now go to APIFY and set up the scraper:

1. Create a new task and search for Yelp Scraper.
2. Go to the Input tab. Remove the search terms and the locations.
3. Set the search results limit to 99999999.
4. In direct Yelp page URLs, type in some random characters (it doesn't matter what) and ignore the error message.
5. Set maximum reviews to 99999999 and max images to 0.
6. Switch from editor view to JSON view at the top left.
7. In the list called direct URLs, paste in the list of URLs that you got from the code above, and remove the very last comma.
8. Switch back over to editor view; this should work if everything was pasted correctly.
9. Run the scraper, then download the JSON dataset from APIFY.

## Working with the data from APIFY

For the next step, I'll be using the data I got from doing the steps under "If getting data from just one location...". I have a JSON file with 46 businesses and lots of metadata.

In [None]:
#Import the JSON dataset from APIFY

APIFYdf = pd.read_json("/content/drive/My Drive/Colab Notebooks/Yelp/YelpTutorialAPIFY.json")

In [None]:
display(APIFYdf)

## Option 1/2: If you want to make a dataframe where each row is a separate review...

In [None]:
#REVIEW SCALE

reviewScaleColumns = ["businessTitle", "directURL", "businessID", "phoneNum", "address", "city", "state", "businessRating", "reviewDate", "reviewText", "reviewRating", "bizType", "dataSource"]
#Collect one dict per review, then build the dataframe once at the end
#(DataFrame.append was removed in pandas 2.0, and growing a dataframe row by row is slow)
reviewRows = []
for business in range(len(APIFYdf)):
  #Let's say we only want businesses that are explicitly ice cream stores... we can add this line:
  if APIFYdf.loc[business, "cuisine"] == "Ice Cream & Frozen Yogurt":
    #These fields are the same for every review of this business, so look them up once
    address = APIFYdf.loc[business, 'address']
    addressaggregate = "{0}, {1}, {2}, {3}".format(address['addressLine1'], address['city'], address['regionCode'], address['postalCode'])

    for review in APIFYdf.loc[business, 'reviews']:
      reviewRows.append({'businessTitle': APIFYdf.loc[business, 'name'],
                         'directURL': APIFYdf.loc[business, 'directUrl'],
                         'businessID': APIFYdf.loc[business, 'bizId'],
                         'phoneNum': APIFYdf.loc[business, 'phone'],
                         'address': addressaggregate,
                         'city': address['city'],
                         'state': address['regionCode'],
                         'businessRating': APIFYdf.loc[business, 'aggregatedRating'],
                         #Keep only the first 10 characters of the date string
                         'reviewDate': review['date'][0:10],
                         'reviewText': review['text'],
                         'reviewRating': review['rating'],
                         'bizType': "iceCream",
                         'dataSource': "Yelp"})

reviewScaleDF = pd.DataFrame(reviewRows, columns=reviewScaleColumns)



In [None]:
display(reviewScaleDF)
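As an alternative to looping, pandas can flatten nested records like these with `pd.json_normalize`. The sketch below runs on invented records that mimic only the fields used in this notebook; the values and the date format are assumptions, and a real APIFY export has many more fields. With the real data, the records could come from `APIFYdf.to_dict("records")`.

```python
import pandas as pd

# Invented records that mimic only the fields used in this notebook
records = [
    {
        "name": "Example Creamery",
        "bizId": "abc",
        "aggregatedRating": 4.5,
        "address": {"city": "Cincinnati", "regionCode": "OH"},
        "reviews": [
            {"date": "2021-06-01T12:00:00", "text": "Great cones.", "rating": 5},
            {"date": "2021-07-04T09:30:00", "text": "Long line.", "rating": 3},
        ],
    },
]

# One output row per review; the meta fields are repeated on every row
flat = pd.json_normalize(
    records,
    record_path="reviews",
    meta=["name", "bizId", "aggregatedRating",
          ["address", "city"], ["address", "regionCode"]],
)
flat["date"] = flat["date"].str[0:10]  # same truncation as in the loop version
print(flat)
```

Nested meta fields come out with dotted column names such as `address.city`, so the columns would still need renaming to match `reviewScaleDF`.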

In [None]:
#Download the file as a CSV
from google.colab import files
reviewScaleDF.to_csv("reviewScaleIceCreamYelp.csv")
files.download("reviewScaleIceCreamYelp.csv")

## Option 2/2: If you want to make a dataframe where each row is one business, and the review text is stored in one column as a single big review, a concatenation of ALL the reviews for that business...

In [None]:
institutionalScaleColumns = ["businessTitle", "directURL", "businessID", "phoneNum", "address", "city", "state", "businessRating", "bigReview", "numberOfReviews", "bizType", "dataSource"]
#Collect one dict per business, then build the dataframe once at the end
#(DataFrame.append was removed in pandas 2.0)
institutionRows = []

for business in range(len(APIFYdf)):
  #Let's say we only want businesses that are explicitly ice cream stores... we can add this line:
  if APIFYdf.loc[business, "cuisine"] == "Ice Cream & Frozen Yogurt":
    address = APIFYdf.loc[business, 'address']
    addressaggregate = "{0}, {1}, {2}, {3}".format(address['addressLine1'], address['city'], address['regionCode'], address['postalCode'])

    bigText = ""
    numOfReviews = 0
    #For each review...
    for review in APIFYdf.loc[business, 'reviews']:
      #If the review has text, meaning it's not just a rating based on number of stars...
      if review['text']:
        #...append the review to this string called big text, and separate individual reviews with a newline, 10 tildes, and another newline.
        bigText = bigText + review['text'] + "\n" + "~~~~~~~~~~" + "\n"
        numOfReviews += 1

    #Skip businesses that have no review text at all
    if bigText:
      institutionRows.append({'businessTitle': APIFYdf.loc[business, 'name'],
                              'directURL': APIFYdf.loc[business, 'directUrl'],
                              'businessID': APIFYdf.loc[business, 'bizId'],
                              'phoneNum': APIFYdf.loc[business, 'phone'],
                              'address': addressaggregate,
                              'city': address['city'],
                              'state': address['regionCode'],
                              'businessRating': APIFYdf.loc[business, 'aggregatedRating'],
                              'bigReview': bigText,
                              'numberOfReviews': numOfReviews,
                              #These are ice cream shops, so label them to match the review-scale dataframe
                              'bizType': "iceCream",
                              'dataSource': "Yelp"})

institutionalScaleDF = pd.DataFrame(institutionRows, columns=institutionalScaleColumns)

In [None]:
display(institutionalScaleDF)

In [None]:
#An example of what one big concatenated review looks like
display(institutionalScaleDF.loc[3, "bigReview"])
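Because every review is followed by the same separator (a newline, 10 tildes, and another newline), the big review can be split back into individual reviews later. Here is a minimal sketch with invented review text; `join_reviews` and `split_reviews` are hypothetical helper names. It assumes no review happens to contain the separator itself, in which case the split would come out wrong.

```python
SEPARATOR = "\n" + "~~~~~~~~~~" + "\n"

def join_reviews(texts):
    """Concatenate non-empty review texts, each followed by the separator."""
    return "".join(text + SEPARATOR for text in texts if text)

def split_reviews(big_review):
    """Recover the individual reviews from a concatenated string."""
    # The trailing separator leaves one empty piece at the end, so drop empties
    return [part for part in big_review.split(SEPARATOR) if part]

# Invented reviews; the empty string stands for a star-only rating with no text
big = join_reviews(["Great cones.", "", "Long line,\nbut worth it."])
print(split_reviews(big))
```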

In [None]:
#Download the file as a CSV
from google.colab import files
institutionalScaleDF.to_csv("institutionalScaleIceCreamYelp.csv")
files.download("institutionalScaleIceCreamYelp.csv")