# Web Scraping Restaurant Location Data

Use the "Run" button to execute the code. Choose Colab, as the notebook only runs on this platform; adjustments will be needed for other data science platforms.

### Problem Statement

1. Go to https://food.grab.com/ph/en/ and enter any place in Manila. The website will display a number of restaurants around the given location.
2. Scroll down and click “Load More” when you see one. You will need to click many such “Load More” buttons if you want to see all the restaurants.
3. For this assignment, you have to fetch and give the latitudes and longitudes of all the restaurants on this page.

Steps 1 and 2 can be done manually (if you can automate step 2, that’s great). We are looking for a script only for step 3. The task shouldn’t take more than 2-3 hours.

Deliverables: Link to Python Script / Jupyter Notebook that can be downloaded & executed to perform the above task.

### Additional Considerations

For ease of use, the solution would:

*   Use only commonly used Python libraries
*   Not use APIs or similar services that require a developer account or API key
*   Make only normal search requests to the food.grab site and any other sites used, to avoid triggering any security concerns
*   Output the restaurants and their latitudes and longitudes to a standard file format such as CSV
*   Present a visualisation in addition to displaying the data
*   Capture the website address for each restaurant, for ease of following up on restaurants of interest

### Solution Design



1. Load the Python libraries needed
2. Load the food.grab.com page and automatically activate the "Load More" button until the page contains all the restaurants in the Manila area
3. Scrape the name and website URL of each restaurant from the page and capture them in a dataframe
4. Load the food.grab.com page for each restaurant in turn, scrape the geo-location data and add the latitude and longitude to the dataframe
5. Save the dataframe to a CSV file
6. Visualise the dataframe on a map using the folium library, including the URL for each restaurant



### Installing the Python Libraries
The following are required by the project:
* Jovian, which allows the notebook to be stored and submitted on the Jovian platform.
* Requests, which allows the notebook to load the web pages.
* Beautiful Soup 4, which provides functionality to scrape specific fields from the web pages.
* Pandas, which provides functions to manage the dataset and output it to a CSV file.
* The module re, which is used to work with regular expressions.

Selenium and Folium are also used; they are installed or imported in the later cells that need them.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

### Selecting the Page of Restaurants from "food.grab.com"

This is the URL of the website's main landing page.

In [3]:
# Setting the url as a variable
manila_restaurants_url = 'https://food.grab.com/ph/en/'

### Using Selenium to Load Up Restaurants Around Manila

From the site's landing page, a location needs to be entered. An address at the centre of Manila is entered, which loads a page listing the restaurants in the immediate vicinity, with a "Load More" button for further restaurants. Selenium is used to click the "Load More" button so that the page lists more of the restaurants in Manila.

**WARNING** Selenium running on Colab with the Chromium driver is quite unstable with this site, so if this cell does not run the first time, please try again several times.

In [4]:
!pip install selenium --upgrade -q
!apt-get update --quiet
!apt install chromium-chromedriver --upgrade -q

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# A single headless Chrome session is enough; `options=` is the keyword used by current Selenium releases
driver = webdriver.Chrome(options=chrome_options)


### Loading up the page of restaurants

Using the Selenium library and the Chrome driver, the page of restaurants around central Manila is built up by repeatedly clicking the "Load More" button.

A 10-second pause is added before each click and only 3 "Load More" clicks are made, as the site is sensitive to web-scraping activity. Try not to run this cell repeatedly, as the site's sentinel processes will kick in and block you.

This notebook is for research, but if all the restaurant data for Manila were needed, the count could be set higher.

**WARNING** Selenium running on Colab with the Chromium driver is quite unstable with this site, so if this cell does not run the first time, please try again several times.

In [5]:
import time
from selenium.webdriver.common.by import By

manila_restaurants_url = 'https://food.grab.com/ph/en/'
url = manila_restaurants_url
driver.get(url)

# Type a central Manila address into the location box, then submit the search
location_input = driver.find_element(By.CLASS_NAME, "ant-input")
location_input.send_keys("Manila City Hall - 369 Antonio Villegas St., Ermita, Manila, Metro Manila, NCR, 1000, Philippines")
driver.find_element(By.CLASS_NAME, "ant-btn").click()

count = 0
while count < 3:
    try:
        # Pause so the page can render and to keep the request rate low
        time.sleep(10)
        driver.find_element(By.CLASS_NAME, "ant-btn").click()
        count = count + 1
    except Exception:
        # No clickable button was found, so stop trying
        break

print("Number of 'Load More' clicks: ", count)
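
The loop above can be generalised into a small helper that stops either when a click limit is reached or when no button can be found. This is only a sketch: `click_until_gone` and its `find_button` argument are hypothetical names and are not part of Selenium. `find_button` is assumed to be any callable that returns an object with a `click()` method, or raises when the button is missing.

```python
import time


def click_until_gone(find_button, max_clicks=3, delay_seconds=10, sleep=time.sleep):
    """Click a button repeatedly, pausing before each click.

    Stops after max_clicks clicks, or as soon as find_button() or click()
    raises. Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        # Give the page time to load and keep the request rate low
        sleep(delay_seconds)
        try:
            find_button().click()
        except Exception:
            # Button missing or not clickable: assume there is nothing more to load
            break
        clicks += 1
    return clicks
```

With a live driver this might be called as `click_until_gone(lambda: driver.find_element(By.CLASS_NAME, "ant-btn"))`. The `sleep` parameter exists only so that the pause can be replaced during testing.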

Just a check here as to how far we got.

In [6]:
driver.save_screenshot('screenshot.png')

True

Next we need to find the restaurant names on the page. The page is still held by the Selenium driver, so an XPath query is used to select the heading element that holds each name.

In [7]:
from selenium.webdriver.common.by import By

# Each restaurant name sits in an <h6> with this class. Chaining several XPath strings
# with Python's `and` evaluates to the last string only, so that expression is used directly.
name_elements = driver.find_elements(By.XPATH, "//h6[@class = 'name___2epcT']")

The page begins with a section of promoted restaurants that we don't need, so the first 10 elements are dropped.

In [8]:
name_elements = name_elements[10:]

Another check to see how we are doing. Again, the interactions between Selenium, Colab and the Chrome driver are proving unstable, so if the name_elements list is empty, go back and re-run the web-scraping cells in order, starting from the cell that begins with `import time`.

The number of elements is the number of restaurants we now have information on, albeit in a raw format.

In [9]:
name_elements

[<selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="7e21ace8-80bc-417f-bd04-f36279bc34b7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="85491bea-364e-4321-9c2d-a0bff10d801e")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="2ee9956f-28df-44b5-aba5-9adceba04da9")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="f9c9cd94-4b4a-4788-bc56-ccd4037bcf63")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="d575d219-8003-41ac-8bea-f64d708a96cf")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="f2e33997-03e3-48fb-b516-3d04df9bb4ed")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e029d3eb48d8c5e6f549afcddc8337d9", element="3374cda2-fa0c-444a-ae82-20

We now need to extract the restaurant names.

In [10]:
name_list = []
for name_element in name_elements:
    name_list.append(name_element.text)

Again checking how we are doing.

In [11]:
name_list

['KFC - Sta Cruz',
 'CoCo Fresh Tea & Juice - Lucky Chinatown Mall',
 "Angel's Pizza - Legarda [Available for LONG-DISTANCE DELIVERY]",
 'Popeyes - SM Manila [Available for LONG-DISTANCE DELIVERY]',
 'Tokyo Tokyo - SM Manila [Available for LONG-DISTANCE DELIVERY]',
 'Starbucks - Pacific Center Binondo [Available for LONG-DISTANCE DELIVERY]',
 'J.Co Donuts & Coffee - Lucky Chinatown Mall [Available for LONG-DISTANCE DELIVERY]',
 "Papa John's Pizza - Tri Loyola Building [Available for LONG-DISTANCE DELIVERY]",
 'Army Navy Burger + Burrito - UST Dapitan [Available for LONG-DISTANCE DELIVERY]',
 'Jollibee - Raon',
 'BonChon - Legarda [Available for LONG-DISTANCE DELIVERY]',
 'Al Jograts - V. Concepcion [Available for LONG-DISTANCE DELIVERY]',
 'Happilee Korean Kitchen - Grabkitchen (Sampaloc) [Available for LONG-DISTANCE DELIVERY]',
 'Yellow Cab Pizza - Espana [Available for LONG-DISTANCE DELIVERY]',
 'Pepper Lunch Express - Lucky Chinatown [Available for LONG-DISTANCE DELIVERY]',
 'Tim Ho
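
Many of the names above carry a bracketed delivery note such as `[Available for LONG-DISTANCE DELIVERY]`. If a cleaner label is wanted for the CSV file or the map, a regular expression can strip it. The helper name `strip_bracketed_note` is hypothetical, and the sketch assumes the note always appears in square brackets at the very end of the name.

```python
import re

# Matches optional whitespace followed by a final "[...]" group at the end of the string
_TRAILING_NOTE = re.compile(r"\s*\[[^\]]*\]\s*$")


def strip_bracketed_note(name):
    """Remove a trailing square-bracketed note from a restaurant name."""
    return _TRAILING_NOTE.sub("", name)


names = [
    "KFC - Sta Cruz",
    "Popeyes - SM Manila [Available for LONG-DISTANCE DELIVERY]",
]
cleaned = [strip_bracketed_note(n) for n in names]
print(cleaned)
```

Names without a bracketed note pass through unchanged, and parentheses inside the name are left alone.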

We can now start to build the dataframe ....

In [12]:
manila_restaurants_dataset = pd.DataFrame(name_list, columns = ['Restaurant'])

In [13]:
manila_restaurants_dataset

| | Restaurant |
|---|---|
| 0 | KFC - Sta Cruz |
| 1 | CoCo Fresh Tea & Juice - Lucky Chinatown Mall |
| 2 | Angel's Pizza - Legarda [Available for LONG-DI... |
| 3 | Popeyes - SM Manila [Available for LONG-DISTAN... |
| 4 | Tokyo Tokyo - SM Manila [Available for LONG-DI... |
| ... | ... |
| 91 | Cafe Mezzanine - Binondo |
| 92 | Grub King Enterprise - Ycaza Street [Available... |
| 93 | Kung Pow Express - UST [Available for LONG-DIS... |
| 94 | Tio Paengs - Loyola [Available for LONG-DISTAN... |


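We can now go back to the page and extract the URL of each restaurant's individual page, then add the URLs to the dataframe. The names and URLs come from two separate queries and are joined by row position, so the result should be checked for rows where the two have drifted apart. Note in the outputs below that more links (127) than names (95) are found, and that the inner join keeps only the first 95 rows.

In [14]:
from selenium.webdriver.common.by import By

url_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/ph/en/restaurant')]")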


In [15]:
url_elements_list = []
for url_element in url_elements:
    url_elements_list.append(url_element.get_attribute("href"))

In [16]:
url_elements_list = url_elements_list[10:]

In [17]:
manila_restaurants_dataset_temp = pd.DataFrame(url_elements_list, columns = ['url'])

In [18]:
manila_restaurants_dataset_temp


| | url |
|---|---|
| 0 | https://food.grab.com/ph/en/restaurant/kfc-sta... |
| 1 | https://food.grab.com/ph/en/restaurant/coco-fr... |
| 2 | https://food.grab.com/ph/en/restaurant/angel-s... |
| 3 | https://food.grab.com/ph/en/restaurant/popeyes... |
| 4 | https://food.grab.com/ph/en/restaurant/tokyo-t... |
| ... | ... |
| 123 | https://food.grab.com/ph/en/restaurant/turks-r... |
| 124 | https://food.grab.com/ph/en/restaurant/black-s... |
| 125 | https://food.grab.com/ph/en/restaurant/r-lapid... |
| 126 | https://food.grab.com/ph/en/restaurant/chachag... |


In [19]:
manila_restaurants_dataset = pd.concat([manila_restaurants_dataset, manila_restaurants_dataset_temp], axis=1, join='inner')

In [20]:
manila_restaurants_dataset

| | Restaurant | url |
|---|---|---|
| 0 | KFC - Sta Cruz | https://food.grab.com/ph/en/restaurant/kfc-sta... |
| 1 | CoCo Fresh Tea & Juice - Lucky Chinatown Mall | https://food.grab.com/ph/en/restaurant/coco-fr... |
| 2 | Angel's Pizza - Legarda [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/angel-s... |
| 3 | Popeyes - SM Manila [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/popeyes... |
| 4 | Tokyo Tokyo - SM Manila [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/tokyo-t... |
| ... | ... | ... |
| 91 | Cafe Mezzanine - Binondo | https://food.grab.com/ph/en/restaurant/cafe-me... |
| 92 | Grub King Enterprise - Ycaza Street [Available... | https://food.grab.com/ph/en/restaurant/grub-ki... |
| 93 | Kung Pow Express - UST [Available for LONG-DIS... | https://food.grab.com/ph/en/restaurant/kung-po... |
| 94 | Tio Paengs - Loyola [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/tio-pae... |
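
Because the names and URLs are joined purely by row position, it is worth checking that each row's URL really belongs to its restaurant. The sketch below assumes, from the sample rows above, that the URL path contains the name in lower case with runs of non-alphanumeric characters replaced by hyphens. `slugify` and `url_matches_name` are hypothetical helpers, and the slug rule is an inference from the output shown here rather than documented site behaviour.

```python
import re


def slugify(name):
    """Lower-case the name and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def url_matches_name(name, url):
    """Return True if the slugified name appears somewhere in the URL."""
    return slugify(name) in url


kfc_url = "https://food.grab.com/ph/en/restaurant/kfc-sta-cruz-delivery/2-CYUZC8BTGJ51GJ"
pairs = [
    ("KFC - Sta Cruz", kfc_url),   # correct pairing
    ("Jollibee - Raon", kfc_url),  # deliberately mismatched pairing
]
for name, url in pairs:
    print(name, "->", "ok" if url_matches_name(name, url) else "MISMATCH")
```

On the dataframe, the same check could be applied row by row, for example with `manila_restaurants_dataset.apply(lambda r: url_matches_name(r['Restaurant'], r['url']), axis=1)`. Names containing accented characters may slugify differently on the site, so a mismatch is a prompt to inspect the row, not proof of an error.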


### Loading Individual Pages for Each Restaurant

Next we use the URL to load each restaurant's page in turn, find the geo-location for the restaurant and add it to the dataframe.

**NOTE** The loading is deliberately slowed by a sleep after each page load (5 seconds in the cell below). The loop as written visits every restaurant in the dataframe; while testing, it can be restricted to a subset, for example the first 25 rows, to reduce the load further. Both measures are intended to reduce the load on the website and avoid triggering any security sentinel functions, whilst proving that the web scraping goals can be achieved.

In [21]:
manila_restaurants_dataset['Latitude'] = ''
manila_restaurants_dataset['Longitude'] = ''

In [22]:
manila_restaurants_dataset

| | Restaurant | url | Latitude | Longitude |
|---|---|---|---|---|
| 0 | KFC - Sta Cruz | https://food.grab.com/ph/en/restaurant/kfc-sta... | | |
| 1 | CoCo Fresh Tea & Juice - Lucky Chinatown Mall | https://food.grab.com/ph/en/restaurant/coco-fr... | | |
| 2 | Angel's Pizza - Legarda [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/angel-s... | | |
| 3 | Popeyes - SM Manila [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/popeyes... | | |
| 4 | Tokyo Tokyo - SM Manila [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/tokyo-t... | | |
| ... | ... | ... | ... | ... |
| 91 | Cafe Mezzanine - Binondo | https://food.grab.com/ph/en/restaurant/cafe-me... | | |
| 92 | Grub King Enterprise - Ycaza Street [Available... | https://food.grab.com/ph/en/restaurant/grub-ki... | | |
| 93 | Kung Pow Express - UST [Available for LONG-DIS... | https://food.grab.com/ph/en/restaurant/kung-po... | | |
| 94 | Tio Paengs - Loyola [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/tio-pae... | | |


In [23]:
!pip install requests --quiet


In [26]:
import requests

# Looping through each restaurant page in the dataframe
for index, row in manila_restaurants_dataset.iterrows():

    restaurant_url = row['url']
    print(restaurant_url)
    response = requests.get(restaurant_url)

    # Parsing the page with Beautiful Soup 4 and taking its text content
    page_text = BeautifulSoup(response.text, 'html.parser').text

    # The coordinates follow the 'latlng":{' marker in the page text
    latlng_start = page_text.find('latlng":{')

    # Latitudes around Manila start with '14'; take 10 characters, e.g. 14.5999682
    lat_start = page_text[latlng_start:].find('14') + latlng_start
    restaurant_latitude = page_text[lat_start:lat_start + 10]
    # Shorter values pick up trailing non-digits, so the last 3 characters are replaced with zeros
    if not restaurant_latitude[3:].isdecimal():
        restaurant_latitude = restaurant_latitude[:-3] + '000'

    # Longitudes around Manila start with '120'; take 11 characters, e.g. 120.9800219
    long_start = page_text[lat_start + 12:].find('120') + lat_start + 12
    restaurant_longitude = page_text[long_start:long_start + 11]
    if not restaurant_longitude[4:].isdecimal():
        restaurant_longitude = restaurant_longitude[:-3] + '000'

    # Final check to leave out any restaurants missing the lat and long
    if restaurant_latitude[3:].isdecimal() and restaurant_longitude[4:].isdecimal():
        manila_restaurants_dataset.at[index, 'Latitude'] = restaurant_latitude
        manila_restaurants_dataset.at[index, 'Longitude'] = restaurant_longitude

    print(restaurant_latitude, restaurant_longitude)
    # Introducing a time delay so that the website doesn't get concerned by hundreds of rapid requests from one IP address
    time.sleep(5)


https://food.grab.com/ph/en/restaurant/kfc-sta-cruz-delivery/2-CYUZC8BTGJ51GJ
14.5999682 120.9800219
https://food.grab.com/ph/en/restaurant/coco-fresh-tea-juice-lucky-chinatown-mall-delivery/PHGFSTI0000019i
14.6034369 120.9735280
https://food.grab.com/ph/en/restaurant/angel-s-pizza-legarda-available-for-long-distance-delivery-delivery/PHGFSTI000000zw
14.5992255 120.9900034
https://food.grab.com/ph/en/restaurant/popeyes-sm-manila-available-for-long-distance-delivery-delivery/2-CZMHAYAGMEC1SE
14.5900564 120.9831886
https://food.grab.com/ph/en/restaurant/tokyo-tokyo-sm-manila-available-for-long-distance-delivery-delivery/PHGFSTI0000018o
14.5903358 120.9829344
https://food.grab.com/ph/en/restaurant/starbucks-pacific-center-binondo-available-for-long-distance-delivery-delivery/2-CY42TKKBTU4DA2
14.5985393 120.9756623
https://food.grab.com/ph/en/restaurant/j-co-donuts-coffee-lucky-chinatown-mall-available-for-long-distance-delivery-delivery/PHGFSTI000001ow
14.6034786 120.9741344
https://food.
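
The string slicing above depends on every latitude starting with `14` and every longitude starting with `120`. A regular expression is one alternative that reads the numbers whatever their length. This sketch assumes the page embeds the coordinates as JSON of the form `"latlng":{"latitude":14.5999682,"longitude":120.9800219}`. The key names `latitude` and `longitude` are a guess based on the `latlng":{` marker used above and have not been confirmed against the live site. `extract_lat_lng` is a hypothetical helper.

```python
import re

# Assumed layout: "latlng":{"latitude":14.5999682,"longitude":120.9800219}
_LATLNG = re.compile(
    r'"latlng"\s*:\s*\{\s*"latitude"\s*:\s*(-?\d+(?:\.\d+)?)\s*,'
    r'\s*"longitude"\s*:\s*(-?\d+(?:\.\d+)?)'
)


def extract_lat_lng(page_source):
    """Return (latitude, longitude) as floats, or None if the pattern is absent."""
    match = _LATLNG.search(page_source)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Made-up sample in the assumed layout, using coordinates printed above
sample = '{"name":"KFC - Sta Cruz","latlng":{"latitude":14.5999682,"longitude":120.9800219}}'
print(extract_lat_lng(sample))
print(extract_lat_lng("<html>no coordinates here</html>"))
```

In the loop, this could be applied to the raw response, for example `extract_lat_lng(requests.get(restaurant_url).text)`, leaving the dataframe cells empty when `None` is returned.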

In [27]:
manila_restaurants_dataset

| | Restaurant | url | Latitude | Longitude |
|---|---|---|---|---|
| 0 | KFC - Sta Cruz | https://food.grab.com/ph/en/restaurant/kfc-sta... | 14.5999682 | 120.9800219 |
| 1 | CoCo Fresh Tea & Juice - Lucky Chinatown Mall | https://food.grab.com/ph/en/restaurant/coco-fr... | 14.6034369 | 120.9735280 |
| 2 | Angel's Pizza - Legarda [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/angel-s... | 14.5992255 | 120.9900034 |
| 3 | Popeyes - SM Manila [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/popeyes... | 14.5900564 | 120.9831886 |
| 4 | Tokyo Tokyo - SM Manila [Available for LONG-DI... | https://food.grab.com/ph/en/restaurant/tokyo-t... | 14.5903358 | 120.9829344 |
| ... | ... | ... | ... | ... |
| 91 | Cafe Mezzanine - Binondo | https://food.grab.com/ph/en/restaurant/cafe-me... | 14.6004265 | 120.9755354 |
| 92 | Grub King Enterprise - Ycaza Street [Available... | https://food.grab.com/ph/en/restaurant/grub-ki... | 14.5972830 | 120.9954107 |
| 93 | Kung Pow Express - UST [Available for LONG-DIS... | https://food.grab.com/ph/en/restaurant/kung-po... | 14.6116183 | 120.9877000 |
| 94 | Tio Paengs - Loyola [Available for LONG-DISTAN... | https://food.grab.com/ph/en/restaurant/tio-pae... | 14.6058478 | 120.9909107 |


### Writing information to a csv file

In [28]:
# file is in the files directory on Colab

manila_restaurants_dataset.to_csv("manila_restaurants_dataset.csv")

### Displaying restaurants on a map using folium



In [32]:
import folium
from folium import plugins
from folium.plugins import MarkerCluster

In [33]:
# Create the map, centred on central Manila
map_manila_restaurants = folium.Map(
    location=[14.599512, 120.984222],
    zoom_start=16
)

# Add a marker for every restaurant that has coordinates
for index, row in manila_restaurants_dataset.iterrows():
    if row['Latitude'] == "":
        print("Restaurant missing lat & long is: ", row['Restaurant'])
    else:
        folium.Marker(
            location=[float(row["Latitude"]), float(row["Longitude"])],
            popup=row['Restaurant'] + '<br>' + row['url'] + '<br>' + "Lat: " + row['Latitude'] + '<br>' + "Long: " + row['Longitude'],
            tooltip='Click for restaurant name, url, lat & long',
            icon=folium.Icon(color="green")
        ).add_to(map_manila_restaurants)

# Display the map as the cell output
map_manila_restaurants
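
The `<br>` tags in the popup string indicate that it is rendered as HTML, so characters such as `&` in a restaurant name are best escaped, and the URL can be turned into a clickable link. The sketch below uses only the standard library. `build_popup_html` is a hypothetical helper, not a folium function.

```python
import html


def build_popup_html(name, url, latitude, longitude):
    """Build popup markup with an escaped name and a clickable restaurant link."""
    safe_name = html.escape(name)
    safe_url = html.escape(url, quote=True)
    return (
        f"{safe_name}<br>"
        f'<a href="{safe_url}" target="_blank">{safe_url}</a><br>'
        f"Lat: {latitude}<br>Long: {longitude}"
    )


print(build_popup_html(
    "CoCo Fresh Tea & Juice - Lucky Chinatown Mall",
    "https://food.grab.com/ph/en/restaurant/coco-fresh-tea-juice-lucky-chinatown-mall-delivery/PHGFSTI0000019i",
    "14.6034369",
    "120.9735280",
))
```

In the marker loop, the result could be passed as `popup=build_popup_html(row['Restaurant'], row['url'], row['Latitude'], row['Longitude'])`.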

### Issues and Further Work

1. Selenium running in Colab can only be run once before needing a "factory restart". Investigations could be undertaken with a view to making the notebook more robust.

2. One or two of the restaurants are missing, as there is no latitude and longitude for them on the site. These could be sourced by web scraping Google Maps instead.

### Summary and Conclusion

This exercise has delivered on the goals set. Selenium was used to click the "Load More" button repeatedly, then Requests and BS4 were used to load the site's pages for individual restaurants and find their latitudes and longitudes. Pandas was used to create a dataframe and output the restaurant names, URLs, latitudes and longitudes to a .csv file. Finally, the restaurants were visualised on a map of Manila using Folium.

### Saving the workbook at Jovian

In [None]:
# Execute this to save new versions of the notebook
import jovian

jovian.commit(project="web-scraping-restaurant-location-data")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/metanoialondon/web-scraping-restaurant-location-data
