# Collecting Review Data from Yelp with APIFY
This notebook shows how to collect review data from Yelp by combining Yelp's Fusion API with APIFY, a web scraping service (https://apify.com/). The process has two steps:
1. First we'll access Yelp's Fusion API to obtain a list of relevant businesses.
2. Then we'll use APIFY to scrape the text of the reviews (and any useful metadata) for the businesses on that list.


## Initialize: Setting up your Program

### Code Requirements: Python/Google
To run this program, you'll need to import a few different Python packages and libraries. Think of these like tools in your toolbox.

NOTE: If you're attempting this code from your own environment (outside Colab or Binder) remember - you'll have to run the ***pip install*** command from your terminal (Mac) or Command Line (Windows).

In [None]:
#Install required package
!pip install yelpapi

Collecting yelpapi
  Downloading yelpapi-2.4.0-py3-none-any.whl (7.1 kB)
Installing collected packages: yelpapi
Successfully installed yelpapi-2.4.0


In [None]:
#Import required libraries
import pandas as pd
from yelpapi import YelpAPI

This next piece of code mounts your Google Drive so the notebook can read data files stored there. It only applies if you're running the code in Google Colab.

In [None]:
#Mount google drive if needed (this code was created with Google Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Obtaining an API Key 
You will need an API key to use the Yelp Fusion API (API stands for Application Programming Interface).

An API key is a unique string that identifies an application. Yelp uses it to associate API requests with your specific account/project.
You can learn how to obtain an API key at this link: https://www.yelp.com/developers/documentation/v3/authentication

Once you've set up your account and obtained your API key, enter it below where it says: YOUR API KEY

In [None]:
yelp_api = YelpAPI("YOUR API KEY")
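Hard-coding the key is fine for a quick test, but it is easy to leak if the notebook is shared. A minimal sketch of one alternative, reading the key from an environment variable, is shown below. The variable name `YELP_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something Yelp or `yelpapi` requires.

```python
import os

def load_api_key(env_var="YELP_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the {} environment variable to your Yelp API key".format(env_var))
    return key

# Usage (assumes the variable was set before the notebook started):
# yelp_api = YelpAPI(load_api_key())
```

Failing loudly when the variable is missing makes a forgotten key easier to diagnose than a rejected request later on.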


## Option 1: Getting a List of Businesses from only ONE Location (City)

### Collecting Yelp Business Metadata & URLs
First, we create an empty dataframe to hold the search results for Yelp businesses. This dataframe will contain the following metadata: <br>BusinessName, ID, City, State, TotalRating, URL, bizType.

In [None]:
#Create a blank dataframe
businessdf = pd.DataFrame(columns=["BusinessName", "ID", "City", "State", "TotalRating", "URL", "bizType" ])

Yelp's Fusion API offers a comprehensive list of business type categories you can explore here: https://www.yelp.com/developers/documentation/v3/all_category_list 
<br>In the following example, we'll search for ice cream shops - with the category "icecream".

The search takes a category from Yelp's list and a corresponding location (city). However, each search, or "API call", returns at most 50 results.

One way to work around this limit is to search several nearby cities and remove the duplicates from the combined results.

For each business returned, the code checks that its category alias matches the category we searched for and that it is located in Ohio. It then reads the business's name (alias), ID, city, state, rating, and URL, and adds them to the dataframe as a new row.

In [None]:
#Category name and location are passed as parameters in this step
#Note that 50 is the maximum number of businesses we can get back from a single search
#A workaround is to search in several nearby cities and then remove the duplicates
search_results = yelp_api.search_query(categories='icecream', location="Cincinnati", limit=50, radius=40000)
new_rows = []
for business in search_results['businesses']:
  #Just to be safe, check that THIS business's first category alias matches the category we want
  if business['categories'][0]['alias'] == "icecream":
    #Only keep businesses located in Ohio
    if business['location']['state'] == "OH":
      #Get the info
      new_rows.append({'BusinessName': business['alias'],
                       'ID': business['id'],
                       'City': business['location']['city'],
                       'State': business['location']['state'],
                       'TotalRating': business['rating'],
                       'URL': business['url'],
                       'bizType': "iceCream"})
#DataFrame.append was removed in pandas 2.0, so add all the new rows at once with pd.concat
businessdf = pd.concat([businessdf, pd.DataFrame(new_rows, columns=businessdf.columns)], ignore_index=True)
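The 50-result cap applies to a single call. Besides searching nearby cities, the Fusion business search also accepts an `offset` parameter, so several calls can page through one search. The sketch below assumes that `yelpapi`'s `search_query` forwards `offset` to the API like its other keyword arguments. `search_all_pages` and `max_results` are hypothetical names for this example. Yelp also caps the total number of results one search can return, so `max_results` should stay within whatever limit the current API documentation states.

```python
def search_all_pages(search_fn, page_size=50, max_results=200, **search_kwargs):
    """Call search_fn repeatedly with an increasing offset and collect businesses.

    search_fn is any callable with the same keyword interface as
    yelp_api.search_query; it must return a dict with a 'businesses' list.
    """
    businesses = []
    offset = 0
    while offset < max_results:
        limit = min(page_size, max_results - offset)
        page = search_fn(limit=limit, offset=offset, **search_kwargs)
        batch = page.get('businesses', [])
        businesses.extend(batch)
        if len(batch) < limit:
            break  # a short page means there is nothing left to fetch
        offset += limit
    return businesses

# Usage with the real client (hypothetical values):
# results = search_all_pages(yelp_api.search_query, categories='icecream',
#                            location="Cincinnati", radius=40000)
```

Passing the search function in as an argument keeps the paging logic separate from the client, so it can be tried out with a stand-in function before spending real API calls.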

Next - let's look at the first three rows of our dataframe to see how it worked.

In [None]:
businessdf.head(3)

| | BusinessName | ID | City | State | TotalRating | URL | bizType |
|---|---|---|---|---|---|---|---|
| 0 | hello-honey-cincinnati | 1uANFv_eHLQlFjfbYlsoiA | Cincinnati | OH | 4.5 | https://www.yelp.com/biz/hello-honey-cincinnat... | iceCream |
| 1 | graeters-ice-cream-cincinnati-12 | 9xWc_rdTVT89Jm4ayb4DaQ | Cincinnati | OH | 4.5 | https://www.yelp.com/biz/graeters-ice-cream-ci... | iceCream |
| 2 | milk-jar-cafe-cincinnati-3 | Q71dBguMiZlL4jkANPqrBA | Cincinnati | OH | 4.5 | https://www.yelp.com/biz/milk-jar-cafe-cincinn... | iceCream |


Now that we've verified our dataframe contains what we wanted, we can print just the Yelp URLs. These are what we will paste into APIFY to pull the individual reviews of each business. Copy and paste ALL of the output (if you're on Google Colab, right click and go to "view output fullscreen").

In [None]:
for i in range(len(businessdf)):
  print('"{}",'.format(businessdf.loc[i, "URL"]))

"https://www.yelp.com/biz/hello-honey-cincinnati?adjust_creative=Ot21-soADPPXnbgEZOoEgQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ot21-soADPPXnbgEZOoEgQ",
"https://www.yelp.com/biz/graeters-ice-cream-cincinnati-12?adjust_creative=Ot21-soADPPXnbgEZOoEgQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ot21-soADPPXnbgEZOoEgQ",
"https://www.yelp.com/biz/milk-jar-cafe-cincinnati-3?adjust_creative=Ot21-soADPPXnbgEZOoEgQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ot21-soADPPXnbgEZOoEgQ",
"https://www.yelp.com/biz/aglamesis-bros-cincinnati-4?adjust_creative=Ot21-soADPPXnbgEZOoEgQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ot21-soADPPXnbgEZOoEgQ",
"https://www.yelp.com/biz/dojo-gelato-cincinnati?adjust_creative=Ot21-soADPPXnbgEZOoEgQ&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ot21-soADPPXnbgEZOoEgQ",
"https://www.yelp.com/biz/graeters-ice-cream-cincinnati-14?adjust_

### Collecting Reviews with APIFY
Now navigate back to your APIFY console (https://console.apify.com/) and select the **Tasks** tab. 

Click the "Create a new task" button then "Browse entire Apify Store" for **Yelp Scraper**. Once you're redirected to the Yelp Scraper page, click "Try Me".

ON THE INPUT TAB: 
1. **Search Terms** - set to blank
2. **Locations** - set to blank
3. **Search Results Limit** - set to 99999999
4. **Direct Yelp Page URLs:** lorem ipsum (this is a temporary placeholder we will fix in step 8, so you can ignore the error message) 
5. **Maximum Reviews** - set to 99999999
6. **Max Images** - set to 0
7. **Toggle View** -  At the top left of this input section, toggle from editor view to JSON view. In JSON View, find the Direct URLS list that looks like this: <br>
  
  "directUrls": [<br>
      "lorem ipsum"<br>
      ],<br>

8. In that list, replace *lorem ipsum* by pasting in the list of URLs printed by the previous code cell. Be sure to remove the very last comma, so that your list looks like this:<br>

    "directUrls": [<br>
    "https://www.yelp.com..." , <br>
    "https://www.yelp.com..." , <br>
    "https://www.yelp.com..."<br>
      ],

9. Toggle back over to Editor view and RUN the scraper. Once the task has run, you will see it marked:<br>
 "✅ Succeeded"
10. Switch tabs from "Log" to "dataset" and download the dataset in JSON format.
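The trailing comma in step 8 is easy to miss. As a sketch of an alternative, Python's `json` module can produce the `directUrls` list already correctly formatted. `build_direct_urls_json` is a hypothetical helper written for this example. Its output is a complete JSON object, so only the `"directUrls": [...]` part should be pasted over the placeholder list.

```python
import json

def build_direct_urls_json(urls):
    """Return a JSON object with a directUrls list, without duplicate URLs."""
    unique_urls = list(dict.fromkeys(urls))  # drops repeats, keeps first-seen order
    return json.dumps({"directUrls": unique_urls}, indent=2)

# Usage with the dataframe built earlier:
# print(build_direct_urls_json(businessdf["URL"].tolist()))
```

Because `json.dumps` handles the quoting and commas, the pasted list is valid JSON by construction.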

## Option 2: Getting Data from Multiple Locations

Import a list of the locations you want to search, supplied as a .CSV file. In this case, I used a CSV file containing every city in Ohio ordered by population, taken from https://www.ohio-demographics.com/cities_by_population (it required a bit of cleaning).

In [None]:
citydf = pd.read_csv("/content/drive/My Drive/Colab Notebooks/OhioPopulationByCity.csv", index_col= 0)

This code reads the city names from the .CSV file into a Python list. With this particular dataset, ", OH" is also appended to the end of every city name in the list.

In [None]:
citylist = []

#Build the list of cities, appending ", OH" to the end of every city name
for i in range(len(citydf)):
  city = citydf.loc[i, "City"]
  fullcity = "{}, OH".format(city)
  citylist.append(fullcity)
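The same list can also be built without an explicit loop. The sketch below uses a tiny made-up dataframe in place of the CSV and assumes only that the file has a "City" column. Unlike `.loc[i, "City"]`, it does not depend on the index labels running from 0 upward.

```python
import pandas as pd

# Tiny stand-in for the CSV; the real file is assumed to have a "City" column
citydf_example = pd.DataFrame({"City": ["Columbus", "Cleveland ", "Cincinnati"]})

# Strip stray whitespace, then add the state suffix to every name at once
citylist_example = (citydf_example["City"].str.strip() + ", OH").tolist()
print(citylist_example)
```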

This code keeps only the ten most populous cities (the first ten entries) and drops the rest of the list.

In [None]:
#Let's just do the first 10 most populous cities for this example...
citylist = citylist[:10]

This code shows the new list.

In [None]:
citylist

This code searches each city in the list for businesses in the desired category. Once again, each search returns at most 50 results. For every business returned, the program checks that the category alias matches the category we searched for and that the business is located in Ohio. It then stores the business name, ID, city, state, rating, URL, and business type as a new row in the dataframe. After each city, it prints the percentage of cities searched so far.

In [None]:
counter = 0
new_rows = []
#FOR EVERY CITY, SEARCH FOR ICE CREAM SHOPS
for cityname in citylist:
  search_results = yelp_api.search_query(categories='icecream', location=cityname, limit=50, radius=40000)
  for business in search_results['businesses']:
    #Check that THIS business's first category alias matches the category we want
    if business['categories'][0]['alias'] == "icecream":
      #This line causes the code to only include businesses located in Ohio
      if business['location']['state'] == "OH":
        #Get the info
        new_rows.append({'BusinessName': business['alias'],
                         'ID': business['id'],
                         'City': business['location']['city'],
                         'State': business['location']['state'],
                         'TotalRating': business['rating'],
                         'URL': business['url'],
                         'bizType': "iceCream"})
  #Print progress as a percentage of cities searched
  counter += 1
  print(round(((counter / len(citylist)) * 100), 1))
#DataFrame.append was removed in pandas 2.0, so add all the new rows at once with pd.concat
businessdf = pd.concat([businessdf, pd.DataFrame(new_rows, columns=businessdf.columns)], ignore_index=True)
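With many cities, one failed request (a misspelled city, a network hiccup) would stop the whole loop. The sketch below wraps each search in a `try`/`except` and records failures instead of stopping. `search_cities`, `pause_seconds`, and the generic `search_fn` argument are hypothetical names for this example. Catching the broad `Exception` is a simplification, since the specific exception types raised by `yelpapi` are not covered here.

```python
import time

def search_cities(search_fn, cities, pause_seconds=0.0, **search_kwargs):
    """Run one search per city; return (businesses, failed_cities)."""
    businesses, failed = [], []
    for city in cities:
        try:
            result = search_fn(location=city, **search_kwargs)
            businesses.extend(result.get('businesses', []))
        except Exception as error:  # broad on purpose for this sketch
            failed.append((city, str(error)))
        time.sleep(pause_seconds)  # optional pause between API calls
    return businesses, failed

# Usage with the real client (hypothetical values):
# found, failed = search_cities(yelp_api.search_query, citylist,
#                               categories='icecream', limit=50, radius=40000)
```

The list of failed cities can be inspected afterwards and searched again once the problem is fixed.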



This code deletes any duplicates that might have arisen from doing multiple searches.

In [None]:
#Because we did multiple searches, we need to make sure we don't have any duplicates...
businessdf = businessdf.drop_duplicates()
businessdf.reset_index(drop=True, inplace=True)
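Called with no arguments, `drop_duplicates` only removes rows that are identical in every column. If the same business could come back with a slightly different value in some column, comparing on the `ID` column alone is a stricter check. A small sketch with made-up data:

```python
import pandas as pd

example = pd.DataFrame({
    "ID": ["a1", "b2", "a1"],
    "TotalRating": [4.5, 4.0, 4.0],  # same business, rating differs between searches
})

# Keep the first row seen for each business ID
deduped = example.drop_duplicates(subset="ID").reset_index(drop=True)
print(deduped)
```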

This displays the results contained in the dataframe.

In [None]:
display(businessdf)

The following code gives us a bunch of Yelp URLs that APIFY can read. Copy and paste ALL of this (If you're on Google Colab, right click and go to "view output fullscreen").

In [None]:
for i in range(len(businessdf)):
  print('"{}",'.format(businessdf.loc[i, "URL"]))

Now go to APIFY and set up the scraper:
1. Create a new task and search for **Yelp Scraper**.
2. Go to the Input tab.
3. Remove the search terms and the locations.
4. Set the search results limit to 99999999.
5. In direct Yelp page URLs, type in some random characters (it doesn't matter what) and ignore the error message.
6. Set maximum reviews to 99999999 and max images to 0.
7. Switch from editor view to JSON view at the top left.
8. In the list called direct URLs, paste in the list of URLs that you got from this code, and remove the very last comma.
9. Switch back over to editor view, which should work if everything is formatted correctly.
10. Run the scraper, then download the JSON dataset from APIFY.

## Working with the data from APIFY

For the next step, I'll be using the data I got from following the steps under "Option 1: Getting a List of Businesses from only ONE Location (City)". The result is a JSON file with 46 businesses and lots of metadata.

This imports the JSON dataset that was downloaded from APIFY.

In [None]:
#Import the JSON dataset from APIFY

APIFYdf = pd.read_json("/content/drive/My Drive/Colab Notebooks/Yelp/YelpTutorialAPIFY.json")

This displays the imported dataset.

In [None]:
display(APIFYdf)

## Option 1/2: If you want to make a dataframe where each row is a separate review...

This code reshapes the data so that each row is a single review. First, a review dataframe is created with the columns businessTitle, directURL, businessID, phoneNum, address, city, state, businessRating, reviewDate, reviewText, reviewRating, bizType, and dataSource. A condition can be set to only include certain types of businesses; here, we keep only businesses whose cuisine is exactly "Ice Cream & Frozen Yogurt", so that the results are limited to businesses that are explicitly ice cream stores. Then, for each review of each remaining business, the code pulls the business details and the review's date, text, and rating from the APIFYdf dataframe, combines them into a new row, and adds that row to the review dataframe. This repeats until there are no more reviews.

In [None]:
#REVIEW SCALE

reviewScaleColumns = ["businessTitle", "directURL", "businessID", "phoneNum", "address", "city", "state", "businessRating", "reviewDate", "reviewText", "reviewRating", "bizType", "dataSource"]
review_rows = []
for business in range(len(APIFYdf)):
  #Let's say we only want businesses that are explicitly ice cream stores... we can add this line:
  if APIFYdf.loc[business, "cuisine"] == "Ice Cream & Frozen Yogurt":
    #Business-level fields are the same for every review, so read them once per business
    address = APIFYdf.loc[business, 'address']
    addressaggregate = "{0}, {1}, {2}, {3}".format(address['addressLine1'], address['city'], address['regionCode'], address['postalCode'])

    #One new row per review
    for review in APIFYdf.loc[business, 'reviews']:
      review_rows.append({'businessTitle': APIFYdf.loc[business, 'name'],
                'directURL': APIFYdf.loc[business, 'directUrl'],
                'businessID': APIFYdf.loc[business, 'bizId'],
                'phoneNum': APIFYdf.loc[business, 'phone'],
                'address': addressaggregate,
                'city': address['city'],
                'state': address['regionCode'],
                'businessRating': APIFYdf.loc[business, 'aggregatedRating'],
                #Keep only the first 10 characters of the date string
                'reviewDate': review['date'][0:10],
                "reviewText": review['text'],
                "reviewRating": review['rating'],
                "bizType": "iceCream",
                "dataSource": "Yelp"})

#DataFrame.append was removed in pandas 2.0, so build the dataframe from the list of rows
reviewScaleDF = pd.DataFrame(review_rows, columns=reviewScaleColumns)
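To make the reshaping easier to follow, here is the same idea on a single made-up record, using plain Python dictionaries instead of a dataframe. The field names mirror the ones used in the code above. The values, the date format, and the helper name `reviews_to_rows` are invented for illustration.

```python
def reviews_to_rows(record):
    """Return one flat dict per review in a business record."""
    rows = []
    for review in record["reviews"]:
        rows.append({
            "businessTitle": record["name"],
            "city": record["address"]["city"],
            "reviewDate": review["date"][0:10],  # first 10 characters of the date string
            "reviewText": review["text"],
            "reviewRating": review["rating"],
        })
    return rows

# Made-up record shaped like one entry of the scraped dataset
sample_record = {
    "name": "Example Creamery",
    "address": {"city": "Cincinnati"},
    "reviews": [
        {"date": "2021-06-01T12:00:00", "text": "Great cones.", "rating": 5},
        {"date": "2021-07-15T09:30:00", "text": "Long line.", "rating": 3},
    ],
}

for row in reviews_to_rows(sample_record):
    print(row)
```

Each review becomes its own row, and the business-level fields are repeated on every row.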



This code shows all of the information now contained in the review dataframe. 

In [None]:
display(reviewScaleDF)

This code saves the dataframe as a .CSV file and downloads it. The download step only works in Google Colab.

In [None]:
#Download the file as a CSV
from google.colab import files
reviewScaleDF.to_csv("reviewScaleIceCreamYelp.csv")
files.download("reviewScaleIceCreamYelp.csv")

## Option 2/2: If you want to make a dataframe where each row is one business, and the review text is stored in a column as one big review, a concatenation of ALL the reviews for that business...

This code builds a dataframe in which each row is a single business and all of that business's reviews are combined into one column. First, a result dataframe is created with the columns businessTitle, directURL, businessID, phoneNum, address, city, state, businessRating, bigReview, numberOfReviews, bizType, and dataSource. As before, a condition can be set to limit which businesses go into the result dataframe; here, we keep only businesses whose cuisine is exactly "Ice Cream & Frozen Yogurt". The code then pulls the business details from the APIFYdf dataframe. Each review that has text is added onto the end of the business's combined review, followed by a new line, 10 tildes, and another new line, and numOfReviews increases by one. Once all the reviews for a business have been combined, the stored values are used to create a new row that is added to the result dataframe; businesses with no review text are skipped. This repeats until there are no more businesses.

In [None]:
institutionalScaleColumns = ["businessTitle", "directURL", "businessID", "phoneNum", "address", "city", "state", "businessRating", "bigReview", "numberOfReviews", "bizType", "dataSource"]
business_rows = []

for business in range(len(APIFYdf)):
  #Let's say we only want businesses that are explicitly ice cream stores... we can add this line:
  if APIFYdf.loc[business, "cuisine"] == "Ice Cream & Frozen Yogurt":
    address = APIFYdf.loc[business, 'address']
    addressaggregate = "{0}, {1}, {2}, {3}".format(address['addressLine1'], address['city'], address['regionCode'], address['postalCode'])

    bigText = ""
    numOfReviews = 0
    #For each review...
    for review in APIFYdf.loc[business, 'reviews']:
      #If the review has text, meaning it's not just a rating based on number of stars...
      if review['text']:
        #...append the review to this string called big text, and separate individual reviews with a newline, 10 tildes, and another newline.
        bigText = bigText + review['text'] + "\n" + "~~~~~~~~~~" + "\n"
        numOfReviews += 1

    #Only keep businesses that have at least one review with text
    if bigText:
      business_rows.append({'businessTitle': APIFYdf.loc[business, 'name'],
                'directURL': APIFYdf.loc[business, 'directUrl'],
                'businessID': APIFYdf.loc[business, 'bizId'],
                'phoneNum': APIFYdf.loc[business, 'phone'],
                'address': addressaggregate,
                'city': address['city'],
                'state': address['regionCode'],
                'businessRating': APIFYdf.loc[business, 'aggregatedRating'],
                'bigReview': bigText,
                'numberOfReviews': numOfReviews,
                'bizType': "iceCream",
                "dataSource": "Yelp"})

#DataFrame.append was removed in pandas 2.0, so build the dataframe from the list of rows
institutionalScaleDF = pd.DataFrame(business_rows, columns=institutionalScaleColumns)
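The concatenation step can be tried out on its own with a few made-up reviews. The sketch below uses the same separator as the code above (a newline, 10 tildes, and another newline). The helper names `concatenate_reviews` and `split_reviews` are hypothetical, and splitting back only works if no review text contains the separator itself.

```python
SEPARATOR = "\n" + "~" * 10 + "\n"

def concatenate_reviews(reviews):
    """Join non-empty review texts; return (big_text, number_of_reviews)."""
    texts = [review["text"] for review in reviews if review.get("text")]
    big_text = "".join(text + SEPARATOR for text in texts)
    return big_text, len(texts)

def split_reviews(big_text):
    """Recover the individual review texts from a concatenated string."""
    return [part for part in big_text.split(SEPARATOR) if part]

# Made-up reviews: one has empty text and one has no text field at all
big_text, count = concatenate_reviews(
    [{"text": "Great cones."}, {"text": ""}, {"text": "Long line."}, {"rating": 4}]
)
print(count)
print(split_reviews(big_text))
```

Being able to split the combined text again is useful if the individual reviews are needed later without re-running the scraper.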

This code displays the result dataframe with all of the businesses.

In [None]:
display(institutionalScaleDF)

This code shows an example of a business with multiple reviews and what that would look like as an output.

In [None]:
#An example of what one big concatenated review looks like
display(institutionalScaleDF.loc[3, "bigReview"])

This converts the dataframe to a .csv file and downloads the .csv file.

In [None]:
#Download the file as a CSV
from google.colab import files
institutionalScaleDF.to_csv("institutionalScaleIceCreamYelp.csv")
files.download("institutionalScaleIceCreamYelp.csv")