### Yelp Restaurant Data through Yelp Fusion API

Following this [Medium article](https://medium.com/web-mining-is688-spring-2021/yelp-restaurants-web-scraping-analysis-fcf54cf519f1), we extract restaurants from the various businesses on Yelp using Yelp's Fusion API.

The dataset contains the following six JSON files:

- business.json
- review.json 
- user.json
- checkin.json
- tip.json
- photo.json

We will use the Yelp Fusion API, making requests through Postman, to get business IDs from business.json and map them to the text reviews and star ratings in review.json. This gives us the data for our project.

The goal is to collect 15k reviews, with 5k reviews for each class (positive, negative, neutral). Each review will be assigned to one of the three classes by its star rating (1-2: negative, 3: neutral, 4-5: positive).
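
The star-to-class rule can be sketched as a small function. This is only an illustration: the name `label_from_stars` is hypothetical, and the mapping assumes 1-2 stars mean negative, 3 stars mean neutral, and 4-5 stars mean positive.

```python
def label_from_stars(stars):
    """Map a 1-5 star rating to a sentiment class (assumed mapping)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return "negative"   # 1-2 stars
    if stars == 3:
        return "neutral"    # 3 stars
    return "positive"       # 4-5 stars


print([label_from_stars(s) for s in (1, 2, 3, 4, 5)])
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```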

In [1]:
# import requests

# url = "https://api.yelp.com/v3/businesses/search?term=restaurants&location=Boston"
# url2 = "https://api.yelp.com/v3/businesses/search?term=restaurants&location=New York"

# payload={}
# headers = {
#   'Authorization': 'Bearer [auth_key]'
# }

# res_BOS = requests.request("GET", url, headers=headers, data=payload)
# res_NY = requests.request("GET", url2, headers=headers, data=payload)

In [2]:
# print(res_BOS.content)
# print(res_NY.content)
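
Since the search URL differs only in its `location` value, it could be built per city rather than written out by hand. The sketch below is one possible way, using only the standard library; the helper name `search_url` is hypothetical, and the endpoint and the `term`/`location` parameters are the ones used in the cell above. `urlencode` also takes care of the space in "New York".

```python
from urllib.parse import urlencode

# Endpoint taken from the request cell above.
SEARCH_ENDPOINT = "https://api.yelp.com/v3/businesses/search"


def search_url(location, term="restaurants"):
    """Build a business-search URL with properly encoded query parameters."""
    query = urlencode({"term": term, "location": location})
    return SEARCH_ENDPOINT + "?" + query


print(search_url("New York"))
# https://api.yelp.com/v3/businesses/search?term=restaurants&location=New+York
```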

### Business ID extraction from the response JSONs

Now that we have JSON responses containing the business details for restaurants in Boston and New York, the next step is to extract the fields we need to reach the text reviews.

We will extract the following fields from the response JSON in the form: 

{"businesses": [
    {"id": "gfjeru7_dhw",
     "name": "Paradise of India",
     "review_count": 1300,
     "rating": 4.5},
    {...}
    ]
}
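
As a minimal sketch of this extraction step, a dict comprehension can keep just the four fields from one business entry. The names `WANTED_FIELDS` and `extract_fields` are hypothetical, and the sample entry reuses the illustrative values from the form above plus one extra field to show that it is dropped.

```python
# The four fields kept for each business.
WANTED_FIELDS = ("id", "name", "review_count", "rating")


def extract_fields(business):
    """Keep only the wanted fields; a field missing from the entry becomes None."""
    return {key: business.get(key) for key in WANTED_FIELDS}


sample = {
    "id": "gfjeru7_dhw",
    "name": "Paradise of India",
    "review_count": 1300,
    "rating": 4.5,
    "price": "$$",  # extra field, not kept
}
print(extract_fields(sample))
# {'id': 'gfjeru7_dhw', 'name': 'Paradise of India', 'review_count': 1300, 'rating': 4.5}
```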

In [3]:
import json

bos_res = open('boston_restaurants.json', 'r')
values = json.load(bos_res)
bos_res.close()

In [4]:
data = {
    "businesses": [],
    "total_reviews": 0  # running sum of review_count over all businesses
}

In [5]:
# for criteria in values['businesses']:
#     for key, value in criteria.items():
#         print(key, 'is:', value)
#     print('')

In [6]:
for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

{'businesses': [{'id': 'sA2gVTJOBH7Qk32p48ENdQ',
   'name': 'Saltie Girl',
   'review_count': 1368,
   'rating': 4.5},
  {'id': 'kP1b-7BO_VhWk_0tvuA_tw',
   'name': "Carmelina's",
   'review_count': 2491,
   'rating': 4.5},
  {'id': 'xlOMKjE4omTgkI1eduWj8A',
   'name': 'The Friendly Toast',
   'review_count': 2038,
   'rating': 4.0},
  {'id': 'tKRg7pxR2Sw7aSrHTQS5mQ',
   'name': 'Buttermilk & Bourbon',
   'review_count': 729,
   'rating': 4.0},
  {'id': 'y2w6rFaO0XEiG5mFfOsiFA',
   'name': 'Neptune Oyster',
   'review_count': 5386,
   'rating': 4.5},
  {'id': 't_FFcwUutj9mIYKGw_gHsQ',
   'name': 'The Salty Pig',
   'review_count': 1711,
   'rating': 4.0},
  {'id': 'lCMN0raxlHnBFDoLgkz4sw',
   'name': 'Milkweed',
   'review_count': 793,
   'rating': 4.0},
  {'id': 'kb0VfKZKHHhvDze3JZqi0Q',
   'name': 'Q Restaurant',
   'review_count': 1152,
   'rating': 4.0},
  {'id': 'EbUZhM4fLpsWQ8fpBhhgEQ',
   'name': "Mike & Patty's - Boston",
   'review_count': 1867,
   'rating': 4.5},
  {'id': 'NY

In [7]:
NY_res = open('NY_restaurants.json', 'r')
values = json.load(NY_res)
NY_res.close()

# for criteria in values['businesses']:
#     for key, value in criteria.items():
#         print(key, 'is:', value)
#     print('')

id is: fVbUVAiLiGgLA_nxBFxyww
alias is: thursday-kitchen-new-york
name is: Thursday Kitchen
image_url is: https://s3-media4.fl.yelpcdn.com/bphoto/nXO8M-d-XTamNmc0BpXWtA/o.jpg
is_closed is: False
url is: https://www.yelp.com/biz/thursday-kitchen-new-york?adjust_creative=Ia1y9jsFUbkc4cDEThEkMg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=Ia1y9jsFUbkc4cDEThEkMg
review_count is: 1478
categories is: [{'alias': 'korean', 'title': 'Korean'}, {'alias': 'newamerican', 'title': 'American (New)'}, {'alias': 'tapasmallplates', 'title': 'Tapas/Small Plates'}]
rating is: 4.5
coordinates is: {'latitude': 40.7275, 'longitude': -73.9838}
transactions is: ['pickup', 'delivery']
price is: $$
location is: {'address1': '424 E 9th St', 'address2': None, 'address3': '', 'city': 'New York', 'zip_code': '10009', 'country': 'US', 'state': 'NY', 'display_address': ['424 E 9th St', 'New York, NY 10009']}
phone is: 
display_phone is: 
distance is: 2613.195773628755

id is: J3NT61-AH5d5Gu5t

In [8]:
for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in info:
            info[key] = value
    data["businesses"].append(info)

data

In [9]:
LA_res = open('LA.json', 'r')
values = json.load(LA_res)
LA_res.close()

for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

In [10]:
miami_res = open('miami.json', 'r')
values = json.load(miami_res)
miami_res.close()

for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

In [11]:
SF_res = open('SF.json', 'r')
values = json.load(SF_res)
SF_res.close()

for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

In [12]:
amherst_res = open('amherst.json', 'r')
values = json.load(amherst_res)
amherst_res.close()

for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

In [13]:
chicago_res = open('chicago.json', 'r')
values = json.load(chicago_res)
chicago_res.close()

for field in values['businesses']:
    data["total_reviews"] += field["review_count"]
    res_info = {"id": None, "name": None, "review_count": None, "rating": None}
    for key,value in field.items():
        if key in res_info.keys():
            res_info[key] = value
    data["businesses"].append(res_info)
data

In [14]:
with open('more_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
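
The per-city cells above all repeat the same load-and-extract steps, so they could be folded into one loop. The following is a sketch under a few assumptions rather than the notebook's actual code: the function name `collect_businesses` is hypothetical, the running total starts at 0, and each file is assumed to hold a search response with a top-level `businesses` list.

```python
import json

FIELDS = ("id", "name", "review_count", "rating")


def collect_businesses(paths):
    """Merge several saved search responses into one summary dict."""
    data = {"businesses": [], "total_reviews": 0}
    for path in paths:
        # Context manager closes the file even if parsing fails.
        with open(path, "r", encoding="utf-8") as f:
            values = json.load(f)
        for business in values["businesses"]:
            data["total_reviews"] += business["review_count"]
            # Keep only the four fields; absent ones become None.
            data["businesses"].append({k: business.get(k) for k in FIELDS})
    return data


# Possible usage with the file names from the cells above:
# data = collect_businesses(['boston_restaurants.json', 'NY_restaurants.json', 'LA.json'])
```

Adding a city then only means adding a file name to the list, which avoids copy-paste slips between cells.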