# How New York City Eats: Mapping the City's Landscape of Restaurants, Bars, & Cafes

## Code Outline

1) **Yelp API**
    - Using the **Search** endpoint: Retrieve all businesses categorized as "restaurants" or "nightlife". This returns operating businesses only (not ones marked as permanently closed). Searches are centered on lat/long coordinates, which are generated by creating a tessellation of hexagons with a 500m radius and taking their center points. 
    - Using the **Business Match** endpoint: Generate a list of all addresses at which currently operating restaurants are located (based on the DOHMH and Yelp Search data). For each address, retrieve the businesses that have existed there. This surfaces businesses that have been marked as closed. 
    - For all businesses that were found using the **match** endpoint but were not collected in the first search, retrieve full business details from the **Details** endpoint. 
    - Prepare data using Pandas and Geopandas for visualization in ArcGIS Online. 
2) **Eater NY** archived restaurant news articles
    - Scrape published web articles from Eater NY that report on restaurant openings and closures between 2012 and 2023. 
    - Some addresses for restaurants can be found in a FourSquare widget at the bottom of the article. Geocode addresses that are available. 
3) **Gayot** 
    - Scrape restaurant news and reviews from blog directory
    - Geocode available addresses

## Set Working Environment

In [None]:
import cloudscraper
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd
import geopandas as gpd
import requests 
import json
import os
import re
# import arcpy (import this if running ArcGIS scripts, otherwise can leave commented)

# this makes dataframes viewable as interactive tables with search and sort
# see documentation: https://mwouts.github.io/itables/quick_start.html
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)

import itables.options as opt
opt.maxBytes = 0
### IMPORTANT: this removes size limits for interactive table, otherwise it will only display some of the rows

os.chdir("/Volumes/Alyse2023/GIS/How_NYC_Eats")

pd.set_option("display.max_columns", None)

## 1. Yelp 

### API Setup

In [None]:
#Note: API has a maximum of 500 requests per day. Alternate between API keys as needed.
#The key is read from an environment variable instead of being hardcoded in the notebook.
#(The variable name YELP_API_KEY is an assumption; export it before starting Jupyter.)

MY_API_KEY = os.environ["YELP_API_KEY"]

headers = {
    "Authorization": f"Bearer {MY_API_KEY}",
    "Content-Type": "application/json"
}

### 1.1 Business Search

Canvass the entire city to retrieve all operating businesses classified as "restaurants" and "nightlife".
The search runs over a generated grid of hexagons with 500m radii. Because of the API's daily limit, it is run manually in increments. 

#### 1.1.1 Create Search Grid

Generate a tessellated hexagonal grid with 500m radii using ArcGIS, calculate the center point of each hexagon, and export a table of coordinates to be sent to the Yelp API search endpoint. 
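
As a quick check on the grid geometry, the sketch below computes the area of a regular hexagon whose center-to-vertex distance is 500 m. The helper names are hypothetical, and reading "radius" as the center-to-vertex distance is an assumption, although the result agrees with the cell size of roughly 649,519 square meters passed to `GenerateTessellation` in the cell that follows.

```python
import math

def hexagon_area(circumradius_m):
    """Area of a regular hexagon from its center-to-vertex distance."""
    return 3 * math.sqrt(3) / 2 * circumradius_m ** 2

def hexagon_apothem(circumradius_m):
    """Center-to-edge distance of the same hexagon."""
    return circumradius_m * math.sqrt(3) / 2

area_m2 = hexagon_area(500)
apothem_m = hexagon_apothem(500)
print(f"area: {area_m2:.3f} square meters, apothem: {apothem_m:.1f} m")
```

Under this reading, no point of a hexagon lies more than 500 m from its center, so a 500 m search radius around each centroid covers the whole cell. Neighboring search circles overlap, which is one reason the results need de-duplicating later.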

In [None]:
#Create tessellation
arcpy.management.GenerateTessellation("nyc_hex_grid", '913175.109008789 120128.369995117 1067382.50860596 272844.293823242 PROJCS["NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",984250.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-74.0],PARAMETER["Standard_Parallel_1",40.66666666666666],PARAMETER["Standard_Parallel_2",41.03333333333333],PARAMETER["Latitude_Of_Origin",40.16666666666666],UNIT["Foot_US",0.3048006096012192]]', "HEXAGON", "649519.052 SquareMeters", 'PROJCS["NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",984250.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-74.0],PARAMETER["Standard_Parallel_1",40.66666666666666],PARAMETER["Standard_Parallel_2",41.03333333333333],PARAMETER["Latitude_Of_Origin",40.16666666666666],UNIT["Foot_US",0.3048006096012192]];-120039300 -96540300 3048.00609601219;-100000 10000;-100000 10000;3.28083333333333E-03;0.001;0.001;IsHighPrecision')

#select only the bins that intersect with the nyc boundary
arcpy.management.SelectLayerByLocation("nyc_hex_grid", "INTERSECT", "nybb", None, "NEW_SELECTION", "NOT_INVERT")

#export those as new feature
arcpy.conversion.ExportFeatures("nyc_hex_grid", "nyc_grid", '', "NOT_USE_ALIAS", 'Shape_Length "Shape_Length" false true true 8 Double 0 0,First,#,GenerateTessellation2,Shape_Length,-1,-1;Shape_Area "Shape_Area" false true true 8 Double 0 0,First,#,GenerateTessellation2,Shape_Area,-1,-1;GRID_ID "GRID_ID" true true false 12 Text 0 0,First,#,GenerateTessellation2,GRID_ID,0,12', None)

In [None]:
#Create new columns for latitude and longitude
#Calculate Geometry: Longitude = centroid x-coordinate, Latitude = centroid y-coordinate, Coordinate format = decimal degrees, Coordinate system = 2263

arcpy.management.CalculateGeometryAttributes("nyc_grid", "Longitude CENTROID_X;Latitude CENTROID_Y", '', '', 'PROJCS["NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",984250.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-74.0],PARAMETER["Standard_Parallel_1",40.66666666666666],PARAMETER["Standard_Parallel_2",41.03333333333333],PARAMETER["Latitude_Of_Origin",40.16666666666666],UNIT["Foot_US",0.3048006096012192]]', "DD")

In [None]:
#Due to the API daily limit, not all bins can be searched at once. Manually select the bins, and export the table with only fields longitude, and latitude

arcpy.conversion.ExportTable("nyc_grid", "/Volumes/Alyse2023/GIS/How_NYC_Eats/data/yelp/search_coordinates/500m_hex_grid_south_brooklyn.csv", '', "NOT_USE_ALIAS", 'Longitude "Longitude" true true false 8 Double 0 0,First,#,nyc_grid,Longitude,-1,-1;Latitude "Latitude" true true false 8 Double 0 0,First,#,nyc_grid,Latitude,-1,-1', None)

#Once bins have been searched, copy those into a new feature, color them green and mark "DONE"

#### 1.1.2 Search by Lat/Long Points

In [None]:
#Set the file path and output JSON here 
coordinate_csv_path = "data/yelp/search_coordinates/500m_bronx_2.csv"
output_json = "data/yelp/search_results/bronx_2.json"

In [None]:
#API Request
search_url = "https://api.yelp.com/v3/businesses/search"

#maximum items per results page
limit = 50

all_results = []
running_total = 0

with open(coordinate_csv_path) as coord_file:
    coords = csv.reader(coord_file)
    next(coords)

    counter = 1

    for row in coords: 

        latitude = row[2]
        longitude = row[1]

        print(f"{counter}: {latitude}, {longitude} ------------------------------")

        params = {
            "latitude": latitude,
            "longitude": longitude,
            "categories" : "restaurants,nightlife",
            "radius": 500,
            "limit" : limit,
        }

        r = requests.get(search_url, headers=headers, params=params)
        data = json.loads(r.text)
        all_results.extend(data['businesses'])

        total_results = data['total']
        running_total = running_total + total_results
        print(f"       {total_results} found.")

        #If more than one page of results was found, request the remaining pages by offset until all records are retrieved. 
        if total_results > limit:

            for offset in range(limit, total_results, limit):
                params = {
                    "latitude": latitude,
                    "longitude": longitude,
                    "categories" : "restaurants,nightlife",
                    "radius": 500,
                    "limit" : limit,
                    "offset" : offset
                }
                
                r = requests.get(search_url, headers=headers, params=params)
                data = json.loads(r.text)
                all_results.extend(data['businesses'])

        counter += 1 

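#Write the results with a context manager so the file is flushed and closed
with open(output_json, 'w') as out_file:
    json.dump(all_results, out_file, indent=2)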
print(f"TOTAL RESTAURANTS FOUND: {running_total}")
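
The paging logic in the cell above can be isolated into two small helpers, sketched here. `page_offsets` and `extract_businesses` are hypothetical names, and the shape of the error payload (an `"error"` object with `code` and `description`) is an assumption about how the API reports problems such as an exhausted daily quota; check it against a real response before relying on it.

```python
def page_offsets(total_results, limit=50):
    """Offsets for every results page after the first one."""
    return list(range(limit, total_results, limit))

def extract_businesses(payload):
    """Return the business list, or raise if the payload reports an error.

    Assumes error responses look like {"error": {"code": ..., "description": ...}}.
    """
    if "error" in payload:
        err = payload["error"]
        raise RuntimeError(f"{err.get('code')}: {err.get('description')}")
    return payload.get("businesses", [])

# A search that reports 120 matches needs two more requests after the first page.
print(page_offsets(120))
print(extract_businesses({"businesses": [{"id": "abc"}], "total": 1}))
```

Raising on an error payload stops the loop at the first failed request, instead of failing later with a `KeyError` on `data['businesses']`.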

#### 1.1.3 Compile all unique operating businesses into one master JSON file

In [None]:
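#A set gives fast membership checks as the number of businesses grows
unique_yelp_ids = set()
compiled_results = []

for search_results in os.listdir('data/yelp/search_results'):
    #Skip anything that is not a JSON results file (e.g. hidden system files)
    if not search_results.endswith('.json'):
        continue

    with open(f"data/yelp/search_results/{search_results}") as results_file:
        results_json = json.load(results_file)

        for entry in results_json:
            if entry['id'] not in unique_yelp_ids:
                unique_yelp_ids.add(entry['id'])
                compiled_results.append(entry)

with open('data/yelp/operating_restaurants.json', 'w') as compiled_file:
    json.dump(compiled_results, compiled_file, indent=2)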
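
Because neighboring search areas overlap, the same business can appear in several result files. A minimal sketch of the de-duplication step, keyed on the Yelp `id` field, is shown below. The function name and the sample records are made up for illustration.

```python
def dedupe_by_id(entries):
    """Keep the first occurrence of each business id, preserving order."""
    unique = {}
    for entry in entries:
        # setdefault only stores the entry if the id has not been seen yet
        unique.setdefault(entry["id"], entry)
    return list(unique.values())

sample = [
    {"id": "a1", "name": "First Cafe"},
    {"id": "b2", "name": "Second Bar"},
    {"id": "a1", "name": "First Cafe (seen again from a neighboring hexagon)"},
]
print([entry["name"] for entry in dedupe_by_id(sample)])
```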

### 1.2 Retrieve information for closed businesses

Use the Yelp API Match and Details endpoints to retrieve businesses that previously operated at the addresses of currently operating businesses. 

#### 1.2.1 Compile all unique addresses from operating_restaurants.json

In [None]:
unique_addresses = []
unique_address_csv = open('data/yelp/addresses/all_compiled_addresses.csv', 'w')
csv_writer = csv.writer(unique_address_csv)

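#Named address_headers so the API "headers" dict from the API Setup cell is not overwritten
address_headers = ['address1', 'city', 'state', 'country', 'postal_code']
csv_writer.writerow(address_headers)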

with open('data/yelp/operating_restaurants.json') as compiled_file:
    compiled_json = json.load(compiled_file)

    for entry in compiled_json:
        if entry['location']['display_address'] not in unique_addresses:
            unique_addresses.append(entry['location']['display_address'])

            address1 = entry['location']['address1']
            city = entry['location']['city']
            state = entry['location']['state']
            country = entry['location']['country']
            postal_code = entry['location']['zip_code']

            write_list = [address1, city, state, country, postal_code]

            csv_writer.writerow(write_list)

unique_address_csv.close()

#### 1.2.2 Get Data for Closed Businesses based on addresses of operating restaurants

First, businesses are identified via the Yelp API Match endpoint. Then, complete details are retrieved via the Details endpoint. 

Due to the API limit, the following code blocks need to be run in groups, based on the available calls left per day. Adjust API key as needed in the API Setup code above. 
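
One way to plan these groups is to count how many addresses fall in each ZIP code before spending any requests. The sketch below assumes the column layout written in 1.2.1 and one Match request per address; the function names, the sample rows, and the greedy selection are illustrative assumptions, not part of the original workflow.

```python
import csv
import io
from collections import Counter

def addresses_per_zip(csv_file):
    """Count address rows per postal code in an open CSV file."""
    return Counter(row["postal_code"] for row in csv.DictReader(csv_file))

def zips_within_budget(counts, budget):
    """Greedily pick ZIP codes (in sorted order) whose total fits the budget."""
    chosen, used = [], 0
    for zip_code, n in sorted(counts.items()):
        if used + n <= budget:
            chosen.append(zip_code)
            used += n
    return chosen, used

# Made-up rows in the same layout as all_compiled_addresses.csv
sample_csv = io.StringIO(
    "address1,city,state,country,postal_code\n"
    "1 Example St,Brooklyn,NY,US,11239\n"
    "2 Example St,Brooklyn,NY,US,11239\n"
    "3 Sample Ave,New York,NY,US,10001\n"
)
counts = addresses_per_zip(sample_csv)
print(zips_within_budget(counts, budget=2))
```

The Details step costs additional requests, so the budget passed in should leave room for those calls.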

In [None]:
search_zip = "11239"

In [None]:
#Match endpoint
matches = []
match_url = "https://api.yelp.com/v3/businesses/matches"

with open('data/yelp/addresses/all_compiled_addresses.csv') as compiled_addresses:
    add_csv = csv.reader(compiled_addresses)

    counter = 0

    for row in add_csv:
        if row[4] == search_zip:

            #Column order in the CSV: address1, city, state, country, postal_code
            match_params = {
                "name" : "",
                "address1" : row[0],
                "city" : row[1],
                "state" : row[2],
                "postal_code" : search_zip,
                "country" : "US",
                'match_threshold' : 'none'
            }

            match_r = requests.get(match_url, headers=headers, params=match_params)
            match_data = json.loads(match_r.text)
            matches.extend(match_data['businesses'])
            counter += 1

            if counter % 5 == 0:
                print(counter)


with open(f"data/yelp/matches/{search_zip}_business_matches.json", 'w') as matches_file:
    json.dump(matches, matches_file, indent=2)
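
Reading the address file by column name instead of by position makes it harder to send the wrong field to the Match endpoint. The sketch below builds the same request parameters with `csv.DictReader`. The generator name and the sample rows are hypothetical; the parameter names mirror the cell above.

```python
import csv
import io

def match_params_for_zip(csv_file, search_zip):
    """Yield Match-endpoint parameters for every address in one ZIP code."""
    for row in csv.DictReader(csv_file):
        if row["postal_code"] == search_zip:
            yield {
                "name": "",
                "address1": row["address1"],
                "city": row["city"],
                "state": row["state"],
                "postal_code": row["postal_code"],
                "country": row["country"],
                "match_threshold": "none",
            }

# Made-up rows in the same layout as all_compiled_addresses.csv
sample_csv = io.StringIO(
    "address1,city,state,country,postal_code\n"
    "1 Example St,Brooklyn,NY,US,11239\n"
    "3 Sample Ave,New York,NY,US,10001\n"
)
params_list = list(match_params_for_zip(sample_csv, "11239"))
print(params_list)
```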

In [None]:
#Print number of restaurants to find (to double check API limit)
pulled_ids = []

with open('data/yelp/operating_restaurants.json') as all_restaurants:
    rest_json = json.load(all_restaurants)

    for entry in rest_json:
        pulled_ids.append(entry['id'])

#List to get details only for businesses not found in the initial city-wide search
ids_to_get_details = []

with open(f'data/yelp/matches/{search_zip}_business_matches.json') as match_file:
    match_json = json.load(match_file)

    for match in match_json:
        if match['id'] not in pulled_ids and match['id'] not in ids_to_get_details:
            ids_to_get_details.append(match['id'])

search_number = len(ids_to_get_details)
print(f"{search_number} restaurants")

In [None]:
details = []
counter = 1
for yelp_id in ids_to_get_details:
    details_url = f"https://api.yelp.com/v3/businesses/{yelp_id}"
    details_r = requests.get(details_url, headers=headers)
    details_data = json.loads(details_r.text)
    details.append(details_data)
    keys = list(details_data.keys())
    if 'name' in keys:
        print(details_data['name'])
    else:
        print('! NO BUSINESS NAME!')
    print(f"{counter}/{search_number}")
    counter += 1
    time.sleep(1.5)

with open(f'data/yelp/details/{search_zip}_business_details.json', 'w') as details_file:
    json.dump(details, details_file, indent=2)

### 1.3 Prepare data for visualization

#### 1.3.1 Combine operating and closed businesses into one JSON file

In [None]:
all_restaurants = []
unique_yelp_ids = []

folders = ['search_results', 'details']

for folder in folders:

    for f in os.listdir(f'data/yelp/{folder}'):

        print(f)

        with open(f"data/yelp/{folder}/{f}") as results_file:
            results_json = json.load(results_file)

            for entry in results_json:
                try:
                    if entry['id'] not in unique_yelp_ids:
                        all_restaurants.append(entry)
                        unique_yelp_ids.append(entry['id'])
                except (KeyError, TypeError):
                    print("No business ID.")

print(f"\nTotal Restaurants: {len(unique_yelp_ids)}\n")
with open('data/yelp/operating_and_closed_restaurants.json', 'w') as combined_file:
    json.dump(all_restaurants, combined_file, indent=2)

#### 1.3.2 Write CSV for all unique restaurants

In [None]:
unique_resaurants_file = open('data/yelp/restaurants_unique.csv', 'w')
unique_writer = csv.writer(unique_resaurants_file)

unique_headers = ['yelp_id', 'yelp_url', 'name', 'image_url', 'is_closed', 'latitude', 'longitude', 'review_count', 'rating', 'full_address', 'price', 'primary_category', 'all_categories', 'source']

unique_writer.writerow(unique_headers)

with open('data/yelp/operating_and_closed_restaurants.json') as all_file:
    all_json = json.load(all_file)

    for entry in all_json:

        try:
            keys = list(entry.keys())

            yelp_id = entry['id']
            yelp_url = entry['url']
            name = entry['name']
            image_url = entry['image_url']
            is_closed = entry['is_closed']
            latitude = entry['coordinates']['latitude']
            longitude = entry['coordinates']['longitude']
            review_count = entry['review_count']
            rating = entry['rating']
            full_address = ", ".join(entry['location']['display_address'])

            price = None
            if "price" in keys:
                price = entry['price']
            
            if len(entry['categories']) == 0:
                primary_category = None
                all_categories = None
            else:
                primary_category = entry['categories'][0]['title']
                cats = []
                for cat in entry['categories']:
                    cats.append(cat['title'])
                all_categories = ", ".join(cats)

            write_list = [yelp_id, yelp_url, name, image_url, is_closed, latitude, longitude, review_count, rating, full_address, price, primary_category, all_categories, "yelp"]

            unique_writer.writerow(write_list)


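        except (KeyError, TypeError, AttributeError) as err:
            #Report which problem caused the row to be skipped
            print(f'Error: {err!r}')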

unique_resaurants_file.close()
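
The cell above wraps each row in a single `try`, so one missing key drops the whole business. As an alternative sketch, the helper below flattens one record with `dict.get`, leaving optional fields as `None`. The function name and the sample record are hypothetical; the field names are the ones used in the cell above.

```python
def flatten_business(entry, source="yelp"):
    """Turn one business record into a CSV row; missing fields become None."""
    coords = entry.get("coordinates") or {}
    location = entry.get("location") or {}
    categories = [cat["title"] for cat in entry.get("categories") or []]
    return [
        entry["id"],
        entry.get("url"),
        entry.get("name"),
        entry.get("image_url"),
        entry.get("is_closed"),
        coords.get("latitude"),
        coords.get("longitude"),
        entry.get("review_count"),
        entry.get("rating"),
        ", ".join(location.get("display_address") or []),
        entry.get("price"),
        categories[0] if categories else None,
        ", ".join(categories) if categories else None,
        source,
    ]

# Made-up record with no "price" key, as happens for some businesses
sample_entry = {
    "id": "a1",
    "name": "First Cafe",
    "is_closed": False,
    "coordinates": {"latitude": 40.7, "longitude": -73.9},
    "location": {"display_address": ["1 Example St", "Brooklyn, NY 11239"]},
    "categories": [{"title": "Cafes"}, {"title": "Bakeries"}],
}
row = flatten_business(sample_entry)
print(row)
```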

#### 1.3.3 Write CSV for all listed categories

In [None]:
categories_csv = open('data/yelp/restaurants_by_category.csv', 'w')
csv_writer = csv.writer(categories_csv)

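#Named category_headers so the API "headers" dict from the API Setup cell is not overwritten
category_headers = ['category', 'yelp_id', 'name', 'is_closed', 'latitude', 'longitude']
csv_writer.writerow(category_headers)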

with open('data/yelp/operating_and_closed_restaurants.json') as all_rest_file:
    all_rest_json = json.load(all_rest_file)

    for entry in all_rest_json:
        
        yelp_id = entry['id']
        name = entry['name']
        is_closed = entry['is_closed']
        latitude = entry['coordinates']['latitude']
        longitude = entry['coordinates']['longitude']
        
        if len(entry['categories']) == 0:
            category = None
            write_list = [category, yelp_id, name, is_closed, latitude, longitude]
            csv_writer.writerow(write_list)
        else:
            for cat in entry['categories']:
                category = cat['title']
                write_list = [category, yelp_id, name, is_closed, latitude, longitude]
                csv_writer.writerow(write_list)

categories_csv.close()

#### 1.3.4 Classification of categories for analysis/visualization

At this point, the unique categories pulled above are grouped into broader classes, and the categories to weed out are identified (the business search may have retrieved more than just restaurants, bars, and cafes). 

The code below retrieves category information from the Yelp API. This is used to create a mapping that is developed externally in Google Sheets. The completed categorization is exported as "category_mapping.csv".
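
Since the mapping is maintained by hand outside the notebook, it can drift from the categories actually retrieved. The sketch below lists retrieved categories that have no row in the mapping file. The function name and sample rows are hypothetical; it assumes both files have a `category` column, as the later merge on `category` suggests.

```python
import csv
import io

def unmapped_categories(retrieved_file, mapping_file):
    """Return retrieved categories that are missing from the mapping."""
    retrieved = {
        row["category"] for row in csv.DictReader(retrieved_file) if row["category"]
    }
    mapped = {row["category"] for row in csv.DictReader(mapping_file)}
    return sorted(retrieved - mapped)

# Made-up stand-ins for retrieved_categories.csv and category_mapping.csv
retrieved_csv = io.StringIO("category\nItalian\nBowling\nThai\n")
mapping_csv = io.StringIO(
    "category,by_ethnicity,by_type,by_dish\n"
    "Italian,True,False,False\n"
)
missing = unmapped_categories(retrieved_csv, mapping_csv)
print(missing)
```

Categories reported here either need a row in the mapping or belong in the exclusion list.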

In [None]:
#Create CSV file with unique categories (to weed/classify in Google Sheets) 
all_restaurants_df = pd.read_csv('data/yelp/restaurants_by_category.csv')
unique_categories = all_restaurants_df['category'].drop_duplicates()
unique_categories.to_csv('data/yelp/retrieved_categories.csv', index=False)

In [None]:
#Retrieve parent category information from the Yelp API
categories_url = "https://api.yelp.com/v3/categories"

params = {
    "locale" : "en_US"
}

r = requests.get(categories_url, headers=headers, params=params)
data = json.loads(r.text)
with open('data/yelp/yelp_categories.json', 'w') as categories_file:
    json.dump(data, categories_file, indent=2)

In [None]:
#Write category data to csv
cat_csv = open('data/yelp/yelp_categories.csv', 'w')
cat_writer = csv.writer(cat_csv)

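#Named cat_headers so the API "headers" dict from the API Setup cell is not overwritten
cat_headers = ['title', 'parent_alias']
cat_writer.writerow(cat_headers)

with open('data/yelp/yelp_categories.json') as cat_file:
    cat_json = json.load(cat_file)
    for entry in cat_json['categories']:
        #parent_aliases is a list; join it so the CSV cell holds plain text
        cat_list = [entry['title'], ", ".join(entry['parent_aliases'])]
        cat_writer.writerow(cat_list)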

cat_csv.close()

### 1.4 Mapping!

#### 1.4.1 Create point layers from "restaurants_unique.csv" and "restaurants_by_category.csv" files

In [None]:
#Restaurants by category
arcpy.management.XYTableToPoint("restaurants_by_category.csv", r"X:\GIS\How_NYC_Eats\data\shape_files\yelp_points\restaurants_by_category.shp", "longitude", "latitude", None, 'GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119521E-09;0.001;0.001;IsHighPrecision')

In [None]:
#Unique restaurants
arcpy.management.XYTableToPoint("restaurants_unique.csv", r"X:\GIS\How_NYC_Eats\data\shape_files\yelp_points\unique_restaurants.shp", "longitude", "latitude", None, 'GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]];-400 -400 1000000000;-100000 10000;-100000 10000;8.98315284119521E-09;0.001;0.001;IsHighPrecision')

In [None]:
#For the categories points, select all points that fall within the NYC boundary and join to NTA
arcpy.management.SelectLayerByLocation("restaurants_by_category", "INTERSECT", "nybb", None, "NEW_SELECTION", "NOT_INVERT")

arcpy.analysis.SpatialJoin("restaurants_by_category", "nta", r"X:\GIS\How_NYC_Eats\data\shape_files\yelp_points\restaurants_by_category_nta.shp", "JOIN_ONE_TO_ONE", "KEEP_ALL", 'category "category" true true false 254 Text 0 0,First,#,restaurants_by_category,category,0,254;yelp_id "yelp_id" true true false 254 Text 0 0,First,#,restaurants_by_category,yelp_id,0,254;name "name" true true false 254 Text 0 0,First,#,restaurants_by_category,name,0,254;is_closed "is_closed" true true false 254 Text 0 0,First,#,restaurants_by_category,is_closed,0,254;latitude "latitude" true true false 19 Double 0 0,First,#,restaurants_by_category,latitude,-1,-1;longitude "longitude" true true false 19 Double 0 0,First,#,restaurants_by_category,longitude,-1,-1;borocode "borocode" true true false 33 Double 31 32,First,#,nta,borocode,-1,-1;boroname "boroname" true true false 254 Text 0 0,First,#,nta,boroname,0,254;countyfips "countyfips" true true false 254 Text 0 0,First,#,nta,countyfips,0,254;ntacode "ntacode" true true false 254 Text 0 0,First,#,nta,ntacode,0,254;ntaname "ntaname" true true false 254 Text 0 0,First,#,nta,ntaname,0,254', "INTERSECT", None, '')

In [None]:
#For the unique restaurants point layer, select all points that fall within the NYC boundary and join to the nearest BBL. 
arcpy.management.SelectLayerByLocation("unique_restaurants", "INTERSECT", "nybb", None, "NEW_SELECTION", "NOT_INVERT")

arcpy.analysis.SpatialJoin("unique_restaurants", "MapPLUTO", r"X:\GIS\How_NYC_Eats\data\shape_files\yelp_points\unique_restaurants_bbl.shp", "JOIN_ONE_TO_ONE", "KEEP_ALL", 'yelp_id "yelp_id" true true false 254 Text 0 0,First,#,unique_restaurants,yelp_id,0,254;yelp_url "yelp_url" true true false 254 Text 0 0,First,#,unique_restaurants,yelp_url,0,254;name "name" true true false 254 Text 0 0,First,#,unique_restaurants,name,0,254;image_url "image_url" true true false 254 Text 0 0,First,#,unique_restaurants,image_url,0,254;is_closed "is_closed" true true false 254 Text 0 0,First,#,unique_restaurants,is_closed,0,254;latitude "latitude" true true false 19 Double 0 0,First,#,unique_restaurants,latitude,-1,-1;longitude "longitude" true true false 19 Double 0 0,First,#,unique_restaurants,longitude,-1,-1;review_cou "review_cou" true true false 10 Long 0 10,First,#,unique_restaurants,review_cou,-1,-1;rating "rating" true true false 19 Double 0 0,First,#,unique_restaurants,rating,-1,-1;full_addre "full_addre" true true false 254 Text 0 0,First,#,unique_restaurants,full_addre,0,254;price "price" true true false 254 Text 0 0,First,#,unique_restaurants,price,0,254;primary_ca "primary_ca" true true false 254 Text 0 0,First,#,unique_restaurants,primary_ca,0,254;all_catego "all_catego" true true false 254 Text 0 0,First,#,unique_restaurants,all_catego,0,254;source "source" true true false 254 Text 0 0,First,#,unique_restaurants,source,0,254;BBL "BBL" true true false 19 Double 0 0,First,#,MapPLUTO,BBL,-1,-1', "CLOSEST", "100 Feet", "distance_from_bbl")

#### 1.4.2 Count Categories per Neighborhood Tabulation Area (NTA)

In [None]:
raw_categories_df = gpd.read_file("data/shape_files/yelp_points/restaurants_by_category_nta.shp")

In [None]:
raw_categories_df.columns.to_list()

In [None]:
#Filter out businesses based on the following categories
exclude = ['Bowling', 'Kids Activities', 'Golf', 'Axe Throwing', 'Parks', 'Sports Clubs', 'Indoor Playcentre', 'Mini Golf', 'Amateur Sports Teams', 'Boating', 'Bocce Ball', 'Playgrounds', 'Skating Rinks', 'Summer Camps', 'Swimming Pools', 'Art Galleries', 'Music Venues', 'Performing Arts', 'Jazz & Blues', 'Arcades', 'Cabaret', 'Festivals', 'Social Clubs', 'Cinema', 'Stadiums & Arenas', 'Ticket Sales', 'Virtual Reality Centers', 'Wineries', 'Cultural Center', 'LAN Centers', 'Museums', 'Opera & Ballet', 'Casinos', 'Eatertainment', 'Paint & Sip', 'Cooking Classes', 'Aircraft Repairs', 'Auto Parts & Supplies', 'Gas Stations', 'Parking', 'Car Window Tinting', 'Airport Lounges', 'Hair Salons', 'Cosmetics & Beauty Supply', 'Day Spas', 'Nail Salons', 'Permanent Makeup', 'Piercing', 'Tattoo', 'Specialty Schools', 'Art Classes', 'Preschools', 'Venues & Event Spaces', 'Caterers', 'Party & Event Planning', 'Musicians', 'DJs', 'Hotels', 'Personal Chefs', 'Photo Booth Rentals', 'Bartenders', 'Boat Charters', 'Magicians', 'Wedding Planning', 'Party Bus Rentals', 'Party Equipment Rentals', 'Team Building Activities', 'Trivia Hosts', 'Accessories', 'Surf Shop', 'Used, Vintage & Consignment', 'Women\'s Clothing', 'Banks & Credit Unions', 'Yoga', 'Dance Studios', 'Golf Lessons', 'Gyms', 'Meditation Centers', 'Trainers', 'Florists', 'Gift Shops', 'Grocery', 'Food Delivery Services', 'Convenience Stores', 'Beer, Wine & Spirits', 'Coffee Roasteries', 'International Grocery', 'Imported Food', 'Custom Cakes', 'Do-It-Yourself Food', 'CSA', 'Specialty Food', 'Organic Stores', 'Butcher', 'Patisserie/Cake Shop', 'Beverage Store', 'Fruits & Veggies', 'Pasta Shops', 'Meat Shops', 'Seafood Markets', 'Health Markets', 'Cheese Shops', 'Candy Stores', 'Chocolatiers & Shops', 'Macarons', 'Olive Oil', 'Waxing', 'Cannabis Collective', 'Pharmacy', 'Traditional Chinese Medicine', 'Nurseries & Gardening', 'Home Decor', 'Tours', 'Bed & Breakfast', 'Laundromat', 'Public Art',
'Community Service/Non-Profit', 'Recording & Rehearsal Studios', 'Television Stations', 'Bookstores', 'Vinyl Records', 'Newspapers & Magazines', 'Video Game Stores', 'Outdoor Movies', 'Art Museums', 'Vocal Coach', 'Dog Parks', 'Skate Parks', 'Pet Adoption', 'Music Production Services', 'Graphic Design', 'Video/Film Production', 'Wholesalers', 'Community Centers', 'Commercial Real Estate', 'Shared Office Spaces', 'Churches', 'Delis', 'Live/Raw Food', 'Game Meat', 'Tobacco Shops', 'Head Shops', 'Musical Instruments & Teachers', 'Pop-up Shops', 'Public Markets', 'Vape Shops', 'Shopping Centers', 'Souvenir Shops', 'Tabletop Games', 'Pool & Billiards', 'Wholesale Stores', 'Books, Mags, Music & Video', 'Cannabis Dispensaries', 'Drugstores', 'Fashion', 'Toy Stores', 'Vitamins & Supplements', 'Art Schools', 'Dance Schools', 'Drama Schools', 'Language Schools', 'Bikes', 'Sports Wear', 'Wine Tasting Classes', 'Food Tours', 'Boat Tours', 'Walking Tours', 'Bus Tours', 'Historical Tours', 'Private Jet Charter', 'Wine Tasting Room', 'Shopping', 'Arts & Entertainment', 'Event Planning & Services', 'Local Services', 'Professional Services', 'Hotels & Travel', 'Farmers Market', 'Food', 'Restaurant Supplies', 'Paint Stores', 'Office Cleaning', 'Home Cleaning', 'Bike Repair/Maintenance', 'Tui Na', 'Reflexology', 'Massage Therapy', 'Makeup Artists', 'Laundry Services', 'Plumbing', 'Heating & Air Conditioning/HVAC', 'Air Duct Cleaning', 'Comic Books', 'Christmas Trees', 'Hardware Stores', 'Furniture Stores', 'Jewelry', 'Eyewear & Opticians', 'Optometrists', 'Hair Stylists', 'Real Estate Agents', 'Real Estate Services', 'Massage', 'Personal Injury Law', 'Workers Compensation Law', 'Social Security Law', 'Keys & Locksmiths', 'Water Heater Installation/Repair', 'Counseling & Mental Health', 'Mediators', 'Apartments', 'Cards & Stationery', 'Bike Rentals', 'Mobile Phones', 'Real Estate Law', 'Acupuncture', 'Children\'s Clothing', 'Men\'s Clothing', 'Trophy Shops', 'Engraving', 'Signmaking',
'Shoe Stores', 'Art Supplies', 'Hair Extensions', '3D Printing', 'Elementary Schools', 'Middle Schools & High Schools', 'Landmarks & Historical Buildings', 'Eyebrow Services', 'Eyelash Service', 'General Dentistry', 'Education', 'Public Services & Government', 'Skin Care', 'Physical Therapy', 'Sports Medicine', 'Rehabilitation Center', 'Medical Centers', 'Airport Shuttles', 'Barre Classes', 'Pilates', 'Framing', 'Child Care & Day Care', 'Montessori Schools', 'Floral Designers', 'Bankruptcy Law', 'Colleges & Universities', 'Threading Services', 'Metro Stations', 'Internal Medicine', 'Children\'s Museums', 'Jewelry Repair', 'Antiques', 'Taxis', 'Limos', 'Pet Stores', 'Pet Groomers', 'Pet Sitting', 'Music & DVDs', 'Boxing', 'Sporting Goods', 'Screen Printing/T-Shirt Printing', 'Dog Walkers', 'Pet Boarding', 'Shades & Blinds', 'Shutters', 'Nutritionists', 'Baby Gear & Furniture', 'Health & Medical', 'Hospitals', 'Sailing', 'Veterinarians', 'Kitchen & Bath', 'Appliances', 'Lighting Fixtures & Equipment', 'Marinas', 'Electronics', 'Hobby Shops', 'Dry Cleaning', 'Discount Store', 'Uniforms', 'Embroidery & Crochet', 'Accountants', 'Tax Services', 'Train Stations', 'Pest Control', 'Sewing & Alterations', 'Auto Repair', 'Oil Change Stations', 'Painters', 'Electricians', 'Home & Garden', 'Diagnostic Services', 'Property Management', 'Art Space Rentals', 'Taekwondo', 'Kickboxing', 'Carpet Cleaning', 'Window Washing', 'Barbers', 'Thrift Stores', 'Professional Sports Teams', 'Rugs', 'Movers', 'Lawyers', 'Printing Services', 'General Contractors', 'Lingerie', 'Furniture Reupholstery', 'Electronics Repair', 'Laser Hair Removal', 'Medical Spas', 'Immigration Law', 'Knitting Supplies', 'Used Bookstore', 'Home Health Care', 'Swimming Lessons/Schools', 'Cardio Classes', 'Tires', 'Hats', 'Car Rental', 'Midwives', 'Videos & Video Game Rental', 'Session Photography', 'Mailbox Centers', 'Post Offices', 'Notaries', 'Telecommunications', 'Internet Service Providers', 'Truck Rental',
'Packing Supplies', 'Transmission Repair', 'Endodontists', 'Cosmetic Dentists', 'Martial Arts', 'Shoe Repair', 'Car Wash', 'Auto Detailing', 'Event Photography', 'Libraries', 'Cooking Schools', 'Pediatricians', 'Herbal Shops', 'Alternative Medicine', 'Photographers', 'Brewing Supplies', 'Mobile Phone Accessories', 'Mobile Phone Repair', 'Pet Training', 'Appliances & Repair', 'Medical Transportation', 'Medical Supplies', 'Machine & Tool Rental', 'Crane Services', 'Shipping Centers', 'Packing Services', 'Travel Services', 'Tennis', 'Tutoring Centers', 'Test Preparation', 'Hostels', 'Car Share Services', 'Perfume', 'Candle Stores', 'Officiants', 'Luggage', 'Lactation Services', 'Urgent Care', 'Psychiatrists', 'Bespoke Clothing', 'Formal Wear', 'Psychologists', 'Addiction Medicine', 'Saunas', 'Men\'s Hair Salons', 'Doulas', 'Dentists', 'Dental Hygienists', 'Soccer', 'Climbing', 'Water Parks', 'Arts & Crafts', 'Recreation Centers', 'Active Life', 'Hair Removal', 'Junk Removal & Hauling', 'Doctors', 'Tree Services', 'Gardeners', 'Insurance', 'IT Services & Computer Repair', 'Web Design', 'Advertising', 'Check Cashing/Pay-day Loans', 'Driving Schools', 'Handyman', 'Flooring', 'Diagnostic Imaging', 'Pain Management', 'Reiki', 'Security Services', 'Fabric Stores', 'Self Storage', 'Buses', 'Religious Organizations', 'Vacation Rentals', 'Security Systems', 'Botanical Gardens', 'Print Media', 'Bounce House Rentals', 'Watch Repair', 'Balloon Services', 'Metal Fabricators', 'Fences & Gates', 'Obstetricians & Gynecologists', 'Photography Stores & Services', 'Building Supplies', 'Rock Climbing', 'Windows Installation', 'Glass & Mirrors', 'Motorcycle Repair', 'Body Shops', 'Life Coach', 'Carpeting', 'Carpet Installation', 'Supernatural Readings', 'Tanning', 'Walk-in Clinics', 'Chiropractors', 'Naturopathic/Holistic', 'Videographers', 'Calligraphy', 'Private Tutors', 'Day Camps', 'Pediatric Dentists', 'Home & Rental Insurance', 'Auto Insurance', 'Life Insurance', 'Laboratory Testing', 
'Funeral Services & Cemeteries', 'Product Design', 'Door Sales/Installation', 'Car Buyers', 'Junkyards', 'Home Services', 'Pawn Shops', 'Mattresses', 'Auto Glass Services', 'Outdoor Gear', 'Allergists', 'Ear Nose & Throat', 'Cannabis Clinics', 'Aerial Fitness', 'Gymnastics', 'Cryotherapy', 'Float Spa', 'Challenge Courses', 'Escape Games', 'Garage Door Services', 'Home Network Installation', 'Bridal', 'Family Practice', 'Prenatal/Perinatal Care', 'Interior Design', 'Adult Shops', 'Muay Thai', 'Personal Shopping', 'Skate Shops', 'Psychics', 'Criminal Defense Law', 'Adult Entertainment', 'Blow Dry/Out Services', 'Motorcycle Dealers', 'Radio Stations', 'Screen Printing', 'Computers', 'Kitchen Incubators', 'Public Transportation', 'Flowers & Gifts', 'Spiritual Shop', 'Flea Markets', 'Ophthalmologists', 'Laser Eye Surgery/Lasik', 'Apartment Agents', 'Mortgage Brokers', 'Sugaring', 'Payroll Services', 'Brazilian Jiu-jitsu', 'Vascular Medicine', 'Radiologists', 'Department Stores', 'Wigs', 'Transportation']

In [None]:
#Filter out excluded categories
filtered_cat_df = raw_categories_df[~raw_categories_df['category'].isin(exclude)]

In [None]:
category_mapping = pd.read_csv('data/yelp/category_mapping.csv')
filtered_with_mapping = pd.merge(filtered_cat_df, category_mapping, on='category', how='left')
filtered_with_mapping.head(50)
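
A left merge keeps every business row even when its category has no entry in the mapping file, so gaps in the mapping show up only as NaN. The sketch below is one way to list those unmapped categories, using pandas' `indicator=True` option. The two small dataframes are invented stand-ins for `filtered_cat_df` and `category_mapping`, not real project data.

```python
import pandas as pd

# Toy stand-ins for filtered_cat_df and category_mapping (values are invented).
businesses = pd.DataFrame({
    "name": ["A", "B", "C"],
    "category": ["Pizza", "Ramen", "Tiki Bars"],
})
mapping = pd.DataFrame({
    "category": ["Pizza", "Ramen"],
    "by_dish": [True, True],
})

# indicator=True adds a "_merge" column saying where each row was found.
merged = businesses.merge(mapping, on="category", how="left", indicator=True)

# Rows marked "left_only" had no match in the mapping table.
unmapped = sorted(merged.loc[merged["_merge"] == "left_only", "category"].unique())
print(unmapped)
```

Any category printed here would need a row added to the mapping CSV before the counts are run.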

In [None]:
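# is_closed may be stored as a boolean or as the string 'False'; casting to str matches either form
open_with_mapping = filtered_with_mapping.loc[filtered_with_mapping['is_closed'].astype(str) == 'False']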
open_with_mapping.head(200)

In [None]:
#Create dataframes for counting 
ethnicity_df = open_with_mapping.loc[open_with_mapping['by_ethnicity'] == True]
ethnicity_df = ethnicity_df.drop_duplicates()

type_df = open_with_mapping.loc[open_with_mapping['by_type'] == True]
type_df = type_df.drop_duplicates()

dish_df = open_with_mapping.loc[open_with_mapping['by_dish'] == True]
dish_df = dish_df.drop_duplicates()

In [None]:
ethnicity_df.head(200)

In [None]:
type_df.head(200)

In [None]:
dish_df.head(200)

In [None]:
type_df_no_rest = type_df.loc[type_df['type_count'] != 'Restaurants']

In [None]:
ethnicity_count = ethnicity_df.groupby('ntaname')['ethnicity_count'].value_counts().unstack(fill_value=0)
dish_count = dish_df.groupby('ntaname')['category'].value_counts().unstack(fill_value=0)
type_count = type_df.groupby('ntaname')['type_count'].value_counts().unstack(fill_value=0)

In [None]:
ethnicity_range = ethnicity_count.columns[0:]
ethnicity_count['highest_category'] = ethnicity_count[ethnicity_range].idxmax(axis=1)

type_range = type_count.columns[0:]
type_count['highest_category'] = type_count[type_range].idxmax(axis=1)

dish_range = dish_count.columns[0:]
dish_count['highest_category'] = dish_count[dish_range].idxmax(axis=1)
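
The counting steps chain `groupby`, `value_counts`, `unstack`, and `idxmax`. The toy example below walks through that chain on a few invented rows. The neighborhood and category values are made up for illustration; only the column name `ntaname` comes from the notebook.

```python
import pandas as pd

# Invented rows: one row per (business, category) pair, tagged with a neighborhood.
rows = pd.DataFrame({
    "ntaname": ["Astoria", "Astoria", "Astoria", "Chinatown", "Chinatown", "Chinatown"],
    "category": ["Greek", "Greek", "Pizza", "Dim Sum", "Dim Sum", "Pizza"],
})

# Count categories within each neighborhood, then pivot categories into columns.
counts = rows.groupby("ntaname")["category"].value_counts().unstack(fill_value=0)

# For each neighborhood (row), pick the column with the largest count.
top = counts.idxmax(axis=1)

print(counts)
print(top)
```

When two categories tie for the maximum, `idxmax` returns the first matching column, so ties are resolved by column order rather than flagged.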

In [None]:
ethnicity_count['highest_category']

In [None]:
type_count['highest_category']

In [None]:
dish_count['highest_category']

In [None]:
ethnicity_count.reset_index(inplace=True)
type_count.reset_index(inplace=True)
dish_count.reset_index(inplace=True)

ethnicity_count.to_csv('data/yelp/nta_counts/ethnicity_count_nta.csv', index=False)
type_count.to_csv('data/yelp/nta_counts/type_count_nta.csv', index=False)
dish_count.to_csv('data/yelp/nta_counts/dish_count_nta.csv', index=False)

In [None]:
ethnicity_df.to_csv('data/yelp/visualizations/ethnicty_breakdown_for_vis.csv', index=False)
type_df_no_rest.to_csv('data/yelp/visualizations/type_for_vis.csv', index=False)
dish_df.to_csv('data/yelp/visualizations/dish_for_vis.csv', index=False)
open_with_mapping.to_csv('data/yelp/visualizations/all_open_for_vis.csv')

In [None]:
all_restaurants_df = pd.read_csv('data/yelp/restaurants_by_category.csv')

## 2. Eater

### 2.1 Write txt files with article links from archive pages.

#### a. Closings

In [None]:
closings_txt = open('data/eater-closings-links.txt', 'w')
closings_url = "https://ny.eater.com/archives/restaurant-closings/"

closings_archive_pages = ['2023/1', '2023/2', '2023/3', '2023/4', '2023/5', '2023/6', '2023/7', '2023/8', '2023/9', '2023/10', '2022/1', '2022/2', '2022/3', '2022/4', '2022/5', '2022/6', '2022/7', '2022/8', '2022/9', '2022/10', '2022/11', '2022/12', '2021/1', '2021/2', '2021/3', '2021/4', '2021/5', '2021/6', '2021/7', '2021/8', '2021/9', '2021/10', '2021/11', '2021/12', '2020/1', '2020/2', '2020/3', '2020/4', '2020/5', '2020/6', '2020/7', '2020/8', '2020/9', '2020/10', '2020/11', '2020/12', '2019/1', '2019/2', '2019/3', '2019/4', '2019/5', '2019/6', '2019/7', '2019/8', '2019/9', '2019/10', '2019/11', '2019/12', '2018/1', '2018/2', '2018/3', '2018/4', '2018/5', '2018/6', '2018/7', '2018/8', '2018/9', '2018/10', '2018/11', '2018/12', '2017/1', '2017/2', '2017/3', '2017/4', '2017/5', '2017/6', '2017/7', '2017/8', '2017/9', '2017/10', '2017/11', '2017/12', '2016/1', '2016/2', '2016/3', '2016/4', '2016/5', '2016/6', '2016/7', '2016/8', '2016/9', '2016/10', '2016/11', '2016/12', '2015/1', '2015/2', '2015/3', '2015/4', '2015/5', '2015/6', '2015/7', '2015/8', '2015/9', '2015/10', '2015/11', '2015/12', '2014/12', '2014/11', '2014/10', '2014/7', '2012/5', '2011/12', '2010/5']

for pg in closings_archive_pages:
      
    scraper = cloudscraper.create_scraper()
    page = scraper.get(f"{closings_url}{pg}")
    soup = BeautifulSoup(page.text, "html.parser")

    article_anchors = soup.find_all('a', {'data-chorus-optimize-field':'hed'})

    for a in article_anchors:
        print(a.text)
        closings_txt.write(f"{a['href']}\n")

closings_txt.close()

#### b. Openings

In [None]:
#openings articles
openings_txt = open('data/eater-openings-links.txt', 'w')
openings_url = "https://ny.eater.com/archives/restaurant-openings/"

openings_archive_pages = ['2023/1', '2023/2', '2023/3', '2023/4', '2023/5', '2023/6', '2023/7', '2023/8', '2023/9', '2023/10', '2023/11', '2022/1', '2022/2', '2022/3', '2022/4', '2022/5', '2022/6', '2022/7', '2022/8', '2022/9', '2022/10', '2022/11', '2022/12', '2021/1', '2021/2', '2021/3', '2021/4', '2021/5', '2021/6', '2021/7', '2021/8', '2021/9', '2021/10', '2021/11', '2021/12', '2020/1', '2020/2', '2020/3', '2020/4', '2020/5', '2020/6', '2020/7', '2020/8', '2020/9', '2020/10', '2020/11', '2020/12', '2019/1', '2019/2', '2019/3', '2019/4', '2019/5', '2019/6', '2019/7', '2019/8', '2019/9', '2019/10', '2019/11', '2019/12', '2018/1', '2018/2', '2018/3', '2018/4', '2018/5', '2018/6', '2018/7', '2018/8', '2018/9', '2018/10', '2018/11', '2018/12', '2017/1', '2017/2', '2017/3', '2017/4', '2017/5', '2017/6', '2017/7', '2017/8', '2017/9', '2017/10', '2017/11', '2017/12', '2016/1', '2016/2', '2016/3', '2016/4', '2016/5', '2016/6', '2016/7', '2016/8', '2016/9', '2016/10', '2016/11', '2016/12', '2015/1', '2015/2', '2015/3', '2015/4', '2015/5', '2015/6', '2015/7', '2015/8', '2015/9', '2015/10', '2015/11', '2015/12', '2014/1', '2014/2', '2014/3', '2014/4', '2014/5', '2014/6', '2014/7', '2014/8', '2014/9', '2014/10', '2014/11', '2014/12', '2013/1', '2013/2', '2013/3', '2013/4', '2013/5', '2013/6', '2013/7', '2013/8', '2013/9', '2013/10', '2013/11', '2013/12', '2012/1', '2012/2', '2012/3', '2012/4', '2012/5', '2012/6', '2012/7', '2012/8', '2012/9', '2012/10', '2012/11', '2012/12', '2011/1', '2011/2', '2011/3', '2011/4', '2011/5', '2011/6', '2011/7', '2011/8', '2011/9', '2011/10', '2011/11', '2011/12', '2010/1', '2010/2', '2010/3', '2010/4', '2010/5', '2010/6', '2010/7', '2010/8', '2010/9', '2010/10', '2010/11', '2010/12', '2009/9', '2009/10', '2009/11', '2009/12']

for pg in openings_archive_pages:

    scraper = cloudscraper.create_scraper()
    page = scraper.get(f"{openings_url}{pg}")
    soup = BeautifulSoup(page.text, "html.parser")

    article_anchors = soup.find_all('a', {'data-chorus-optimize-field':'hed'})

    for a in article_anchors:
        openings_txt.write(f"{a['href']}\n")

openings_txt.close()
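
The openings archive list covers every month from 2009/9 through 2023/11, so it could also be generated rather than typed out. The helper below is a minimal sketch of that idea; `archive_pages` is a hypothetical name, not part of the original notebook. The closings list skips many months, so generating it this way would request archive pages that may not exist.

```python
def archive_pages(first, last):
    """Return 'YYYY/M' strings for every month from first to last, inclusive.

    first and last are (year, month) tuples.
    """
    pages = []
    year, month = first
    while (year, month) <= last:
        pages.append(f"{year}/{month}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pages


openings_pages = archive_pages((2009, 9), (2023, 11))
print(len(openings_pages), openings_pages[0], openings_pages[-1])
```

The generated list is in chronological order rather than newest-year-first, which does not matter when the goal is only to collect article links.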

### 2.2 Scrape content from links via txt files

In [None]:
# Pull URLs from txt files
closing_urls = []
with open('data/eater-closings-links.txt') as closing_file:
    for line in closing_file:
        closing_urls.append(line.strip())

total_closings = len(closing_urls)

opening_urls = []
with open('data/eater-openings-links.txt') as opening_file:
    for line in opening_file:
        opening_urls.append(line.strip())

total_openings = len(opening_urls)

#Create CSV file
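# newline='' is what the csv module expects; utf-8 keeps accented restaurant names intact
eater_csv = open('data/eater_data.csv', 'w', newline='', encoding='utf-8')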
csv_writer = csv.writer(eater_csv)
headers = ['type', 'title', 'article_date', 'summary', 'place_name', 'place_address', 'bolded', 'body_text', 'link']
csv_writer.writerow(headers)

#CLOSINGS ------------------------------------------------------------

counter = 1
for url in closing_urls:
    scraper = cloudscraper.create_scraper()
    page = scraper.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    #get title
    title = soup.find('h1', {'class' : 'c-page-title'}).text.strip()
    print(f"\"{title}\"")

    article_date = soup.find('time', {'class': 'c-byline__item'})
    if article_date is not None:
        article_date = article_date.text.strip()
        print(article_date)
    
    #get article summary
    summary = soup.find('p', {'class' : 'c-entry-summary p-dek'})
    if summary is not None:
        summary = summary.text.strip()
        print(summary)

    #get body paragraphs (find() returns None if the main column is missing)
    body_text = []
    body = soup.find('div', {'class' : "l-col__main"})
    if body is not None:
        for p in body.find_all('p'):
            body_text.append(p.text.strip())

    place_embed = soup.find('div', {'class':"c-place-embed__body"})
    if place_embed is not None:
        place_name = place_embed.find('h2').text.strip()
        place_address = place_embed.find('span', {'class' : 'c-place-embed__address'}).text.strip()
        print(place_name)
        print(place_address)
    else:
        place_name = None
        place_address = None
    
    bolded_phrases = []
    strong = soup.find_all('strong')
    for name in strong:
        bolded_phrases.append(name.text)
    bolded = "|".join(bolded_phrases)

    #writes each article data to the CSV file
    write_list = ['closing', title, article_date, summary, place_name, place_address, bolded, body_text, url]
    csv_writer.writerow(write_list)

    print(f"{counter}/{total_closings}\n")
    print("----------------------------------------------------------------------------------------\n")
    time.sleep(3)
    counter += 1

#OPENINGS ------------------------------------------------------------

counter = 1
for url in opening_urls:
    scraper = cloudscraper.create_scraper()
    page = scraper.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    #get title
    title = soup.find('h1', {'class' : 'c-page-title'}).text.strip()
    print(f"\"{title}\"")

    article_date = soup.find('time', {'class': 'c-byline__item'})
    if article_date is not None:
        article_date = article_date.text.strip()
        print(article_date)
    
    #get article summary
    summary = soup.find('p', {'class' : 'c-entry-summary p-dek'})
    if summary is not None:
        summary = summary.text.strip()
        print(summary)

    place_embed = soup.find('div', {'class':"c-place-embed__body"})
    if place_embed is not None:
        place_name = place_embed.find('h2').text.strip()
        place_address = place_embed.find('span', {'class' : 'c-place-embed__address'}).text.strip()
        print(place_name)
        print(place_address)
    else:
        place_name = None
        place_address = None
    
    bolded_phrases = []
    strong = soup.find_all('strong')
    for name in strong:
        bolded_phrases.append(name.text)
    bolded = "|".join(bolded_phrases)

    #try both page containers; find() returns None when a container is absent
    legacy_text = []
    legacy_body = soup.find('div', {'class' : "l-col__main"})
    if legacy_body is not None:
        for p in legacy_body.find_all('p'):
            legacy_text.append(p.text.strip())

    current_text = []
    current_body = soup.find('div', {'class':'l-wrapper'})
    if current_body is not None:
        for ps in current_body.find_all('p'):
            current_text.append(ps.text.strip())

    if len(legacy_text) > len(current_text):
        article_text = legacy_text
    else:
        article_text = current_text

    #writes each article data to the CSV file
    write_list = ['opening', title, article_date, summary, place_name, place_address, bolded, article_text, url]
    csv_writer.writerow(write_list)

    print(f"{counter}/{total_openings}\n")
    print("----------------------------------------------------------------------------------------\n")

    time.sleep(3)
    counter += 1

eater_csv.close()
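
Both loops repeat the same pattern: call `find()`, check for `None`, then strip the text. A small helper can hold that check in one place. The sketch below assumes `beautifulsoup4` is installed; `text_or_none` is a hypothetical helper name, and the HTML snippet is invented to mimic the class names used above.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, attrs=None):
    """Return the stripped text of the first matching tag, or None if absent."""
    if parent is None:
        return None
    tag = parent.find(name, attrs or {})
    return tag.get_text(strip=True) if tag is not None else None


# Invented snippet: it has a title and body, but no summary paragraph.
html = """
<h1 class="c-page-title"> Example Diner Closes </h1>
<div class="l-col__main"><p>First paragraph.</p><p>Second.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

title = text_or_none(soup, "h1", {"class": "c-page-title"})
summary = text_or_none(soup, "p", {"class": "c-entry-summary"})  # missing -> None

body = soup.find("div", {"class": "l-col__main"})
paragraphs = [p.get_text(strip=True) for p in body.find_all("p")] if body is not None else []

print(title, summary, paragraphs)
```

With a helper like this, an article that lacks a title or body yields `None` or an empty list instead of stopping the scrape with an `AttributeError`.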

## 3. Gayot

### 3.1 Scrape restaurant news

In [None]:

#create new CSV file and initiate a CSV writer
article_csv = open('data/gayot_scrape.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(article_csv)

#write headers to the new CSV file
headers = ['month', 'article_link', 'names', 'hrefs', 'text']
csv_writer.writerow(headers)

months = [
    'august-2023','july-2023','february-2023',
    'october-2022','september-2022','july-2022','june-2022','march-2022','january-2022',
    'december-2021','november-2021','october-2021','june-2021','may-2021','february-2021',
    'september-2020','august-2020','june-2020','april-2020','march-2020','february-2020','january-2020',
    'december-2019','november-2019','october-2019','september-2019','july-2019','june-2019','may-2019','april-2019','march-2019','february-2019','january-2019',
    'december-2018','november-2018','october-2018','september-2018','august-2018','july-2018','june-2018','may-2018','april-2018','march-2018','february-2018','january-2018',
    'december-2017','november-2017','october-2017','september-2017','august-2017','july-2017','june-2017','may-2017','april-2017','march-2017','february-2017','january-2017',
    'december-2016','november-2016','october-2016','september-2016','august-2016','july-2016','june-2016','may-2016','april-2016','march-2016','february-2016','january-2016',
    'december-2015','november-2015','october-2015','september-2015','august-2015','july-2015','june-2015','may-2015','april-2015','march-2015','february-2015','january-2015',
    'december-2014','november-2014','october-2014','september-2014','august-2014','july-2014','june-2014','may-2014','april-2014','march-2014','february-2014','january-2014',
    'december-2013','november-2013','october-2013','september-2013','august-2013','july-2013','june-2013','may-2013','april-2013','march-2013','february-2013','january-2013',
    'december-2012','november-2012','october-2012','september-2012','august-2012','july-2012','june-2012','may-2012','april-2012','march-2012','february-2012','january-2012',
    'december-2011','november-2011','october-2011','september-2011','august-2011','july-2011','june-2011','may-2011','april-2011','march-2011','february-2011','january-2011',
    'december-2010','november-2010','october-2010','september-2010','august-2010','july-2010','june-2010','may-2010','april-2010','march-2010','february-2010','january-2010',
    'december-2009','november-2009','october-2009','september-2009','august-2009','july-2009','june-2009','may-2009','april-2009','march-2009','february-2009','january-2009',
    'december-2008','november-2008','october-2008','september-2008','august-2008','july-2008','june-2008','may-2008','april-2008','march-2008',
]

for month in months:
    scraper = cloudscraper.create_scraper()
    page = scraper.get(f"https://www.gayot.com/restaurants/newyorknews/archive-{month}.html")
    soup = BeautifulSoup(page.text, "html.parser")

    body = soup.find('table')
    divs = soup.find_all('div', {'style' : 'width: 100%; padding-top: 10px;'})

    print(f"\n{month}-------------------------------------------------------------------------------------\n")
    for div in divs: 

        strong = div.find_all('strong')
        names = []
        for a in strong: 
            names.append(a.text)

        links = div.find_all('a')
        hrefs = []
        for link in links:
            if 'href' in link.attrs:
                hrefs.append(link['href'])

        text = div.text.strip()

        article_link = f"https://www.gayot.com/restaurants/newyorknews/archive-{month}.html"

        write_list = [month, article_link, names, hrefs, text]
        csv_writer.writerow(write_list)

        print(names)
        print(text)
    
    print(f"\n{month} completed. \n")
    time.sleep(2)
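
Long scrapes like this one can hit a transient error or a missing archive page partway through. The sketch below shows one possible retry wrapper. `fetch_html` is a hypothetical helper, and it assumes the callable passed as `get` returns an object with `.status_code` and `.text`, as `requests`-style sessions do.

```python
import time


def fetch_html(get, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call get(url) up to `retries` times; return the page text or None.

    `get` is any callable returning an object with .status_code and .text.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
        except Exception as exc:  # network error types vary by library
            print(f"attempt {attempt} failed: {exc}")
        else:
            if response.status_code == 200:
                return response.text
            if response.status_code == 404:
                return None  # page does not exist; retrying will not help
        if attempt < retries:
            sleep(delay * attempt)  # wait a little longer after each failure
    return None
```

In the loop above this could be called as `html = fetch_html(scraper.get, url)`, skipping the month when the result is `None`.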

### 3.2 Scrape restaurant pages collected in initial scrape

In [None]:
restaurant_pages = []

with open('/Users/Alyse/Desktop/PRATT/612 Adv GIS/Final Project /data/gayot-clean.csv') as gayot_file: 
    gayot_csv = csv.reader(gayot_file)
    for row in gayot_csv:
        article_link = row[3]

        rest_links = row[5].replace("[", "").replace("]","").replace('\'', "").split(', ')
        
        for link in rest_links:
            if link.startswith('http://www.gayot.com/restaurants/'):
                restaurant_pages.append(link)

gayot_restaurants_csv = open('../data/gayot_restaurants.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(gayot_restaurants_csv)
headers = ['restaurant_name', 'blurb', 'price', 'description', 'address', 'cuisine', 'features', 'link']
csv_writer.writerow(headers)

print(f"{len(restaurant_pages)} found. Getting data...\n")

counter = 1
for pg in restaurant_pages:
    print(pg)
    scraper = cloudscraper.create_scraper()
    page = scraper.get(pg)
    soup = BeautifulSoup(page.text, "html.parser")

    rest_name = soup.find('h1')
    if rest_name is not None:
        rest_name = rest_name.text.strip()

    blurb = soup.find('span', {'id': 'blurb'})
    if blurb is not None:
        blurb = blurb.text.strip()

    price = soup.find('span', {'class':'pricerange'})
    if price is not None:
        price = len(price.text.strip())

    description = soup.find('div', {'class' : 'description'})
    if description is not None: 
        description = description.text.strip()

    address = soup.find('div', {'class':'info-contact_address'})
    if address is not None:
        address = address.text.replace('USA', "").strip()
        address = re.sub("[\n\r\t]", " ", address)

    cuisine = soup.find('div', {'id':'cuisine'})
    if cuisine is not None:
        cuisine = cuisine.text.replace('Cuisine:\n',"").replace(' / ',"|").strip()

    #join feature list items with "|" (stays None when the page has no features block)
    features = None
    features_div = soup.find('div', {'id':'features'})
    if features_div is not None:
        features_lis = features_div.find_all('li')
        features = "|".join(li.text.strip() for li in features_lis)

    print(rest_name)
    print(blurb)
    print(price)
    print(cuisine)
    print(features)
    print(f"({counter}/{len(restaurant_pages)})")
    print("========================================")

    write_list = [rest_name, blurb, price, description, address, cuisine, features, pg]
    csv_writer.writerow(write_list)
    counter += 1
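
The `hrefs` column was written to CSV as the string form of a Python list, which the cell above takes apart with chained `replace` calls. As an alternative sketch, `ast.literal_eval` from the standard library can parse that string back into a list, which also copes with links that contain commas or brackets. `parse_list_cell` is a hypothetical helper and the example URLs are invented.

```python
import ast


def parse_list_cell(cell):
    """Turn a CSV cell like "['a', 'b']" back into a Python list.

    Returns [] for anything that is not a list literal (e.g. a header cell).
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


# Invented cell contents in the same shape the scraper writes.
cell = "['http://www.gayot.com/restaurants/example-one', 'mailto:info@example.com']"
links = [
    link for link in parse_list_cell(cell)
    if link.startswith("http://www.gayot.com/restaurants/")
]
print(links)
```

Because non-list cells come back as an empty list, the header row is skipped without a special case.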