# Scraping Tool

The code below is commented throughout. For reproducibility, the scraped item data (a list of dictionaries) is saved to `top3restaurants.pkl`, and an existing file is never overwritten. To capture a different selection of restaurants, delete `top3restaurants.pkl` from the notebook's directory and run the notebook again.

### 1. Imports and Setting up Webdriver

In [1]:
import time # for timekeeping
toolStart = time.time()
from selenium import webdriver # used here to operate the website automatically and to scrape content element by element
from bs4 import BeautifulSoup # used here to scrape entire divs/elements, and extract information from their sub-elements
import re # Regular expression matching
import pandas as pd # Dataframe/.csv manipulation
import pickle # saving objects into files for later use and reproducibility
import os # checking whether the .pkl file already exists before saving


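# browser-style headers intended to prevent restrictions from Swiggy's end
# note: this dict is never passed to the Selenium webdriver below, and the Access-Control-* entries are CORS response headers, not request headers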
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    
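# initialising Chrome webdriver (Selenium 3 API; Selenium 4 replaces executable_path with a Service object and find_element_by_* with find_element(By...))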
browser = webdriver.Chrome(executable_path='./chromedriver_win32/chromedriver.exe')
browser.maximize_window()

# page containing all restaurant listings in Bangalore
listingPage = 'https://www.swiggy.com/city/bangalore'




### 2. Selecting Restaurants from Swiggy's Bangalore page

The cell below picks the three highest-rated restaurants (by star rating) listed on the front page of Swiggy's Bangalore page. Because the selection is computed at run time, it always reflects the current ratings, even when the front page changes. Restaurants without a rating (shown as `--`) are treated as 0 stars, and ties on rating are broken by URL in reverse alphabetical order.

*For context, the top restaurants on the front page changed **five times** over the course of testing this scraper.*

In [2]:
browser.get(listingPage)
links = browser.find_elements_by_class_name('_1j_Yo')
ratings = []
for i in browser.find_elements_by_class_name('_3Mn31'):
    stars = i.text.split()[0]
    if stars == '--':
        stars = 0 # star rating unavailable
    else:
        stars = float(stars)
    ratings.append(stars)
ratingDict = {}
for link, rating in zip(links, ratings):
    ratingDict[link.get_attribute('href')] = rating

# Sorting ratingDict by (rating, link) in descending order, so that ties on rating are broken by link
top3Links = sorted(ratingDict.items(), key=lambda di: (di[1], di[0]), reverse=True)[:3]
print('Scraped %d frontpage restaurant links, out of which the top 3 are:'%(len(ratingDict)), '\n', top3Links)

Scraped 16 frontpage restaurant links, out of which the top 3 are: 
 [('https://www.swiggy.com/restaurants/kwality-walls-frozen-dessert-and-ice-cream-shop-btm-btm-2nd-stage-bangalore-298068', 4.3), ('https://www.swiggy.com/restaurants/craving-o-clock-btm-btm-2nd-stage-bangalore-362852', 4.3), ('https://www.swiggy.com/restaurants/chai-point-doddakannelli-villaymma-layouts-bangalore-286575', 4.3)]
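The selection logic can be tried without a browser. The following is a minimal sketch, assuming the rating labels have already been read from the page; `parse_rating`, `top_n` and the sample URLs are hypothetical names and data used only for illustration.

```python
def parse_rating(text):
    """Convert a rating label such as '4.3' or '--' into a float (0.0 when unavailable)."""
    parts = text.split()
    token = parts[0] if parts else '--'
    return 0.0 if token == '--' else float(token)


def top_n(rating_by_link, n=3):
    """Return the n (link, rating) pairs with the highest rating.

    Sorting on the tuple (rating, link) in reverse order means that ties on
    rating are broken by link, in reverse alphabetical order.
    """
    return sorted(rating_by_link.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[:n]


# Hypothetical sample data standing in for the scraped front page
labels = {
    'https://example.com/restaurants/a': '4.3',
    'https://example.com/restaurants/b': '--',
    'https://example.com/restaurants/c': '4.5',
    'https://example.com/restaurants/d': '4.3',
}
ratings = {link: parse_rating(label) for link, label in labels.items()}
best = top_n(ratings)
print(best)
```

The same tie-breaking rule accounts for the order of the three 4.3-rated links in the output above: they appear in reverse alphabetical order of their URLs.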


### 3. Scraping and Collating Menu Info for Top 3 Restaurants

On Swiggy, a restaurant page loads all of its menu items as soon as it opens (i.e. there is no lazy loading), so there is no need to simulate scrolling to the bottom of the page. The full HTML can therefore be read once from `browser.page_source` and parsed with `BeautifulSoup`.

In [3]:
restLinks = [i[0] for i in top3Links]

def menuScrape(browser, link):
    restData = [] # creating a list of dicts (menu items), so that DataFrame organisation is easier
    browser.get(link)
    name = browser.find_element_by_class_name('_3aqeL').text # restaurant name
    area = browser.find_elements_by_class_name('_3duMr')[2].text # restaurant area
    rating = browser.find_element_by_class_name('_2l3H5').text # restaurant star rating
    numRev = browser.find_element_by_class_name('_1iYuU').text.split()[0] # number of restaurant reviews

    pageSource = browser.page_source
    pageSoup = BeautifulSoup(pageSource, 'html.parser') # naming the parser explicitly avoids bs4's parser-guessing warning
    catDivs = pageSoup.find_all('div', class_='_2dS-v')

    for catDiv in catDivs:
        itemCat = catDiv.find('h2', class_=['M_o7R _27PKo', 'M_o7R']).get_text() # item category
        if itemCat == "Recommended":
            continue # ignoring 'Recommended' category wherever detected to reduce data redundancy
        itemDivs = catDiv.find_all('div', class_='_2wg_t')

        for itemDiv in itemDivs:
            itemName = itemDiv.find('div', class_='styles_itemName__hLfgz').get_text()
            itemPrice = itemDiv.find('span', class_='rupee').get_text()
            desc = itemDiv.find('div', class_='styles_itemDesc__3vhM0')
            itemDesc = desc.get_text() if desc else 'N.A.' # since some items don't have a description
            tag = itemDiv.find('span', class_='styles_ribbon__3tZ21 styles_itemRibbon__353Fy')
            itemTag = tag.get_text() if tag else 'NoTag'
            itemBest = 1 if re.search('Bestseller', itemTag) else 0 # flag is 1 if the item is a bestseller, otherwise 0
            
            # this dict will become a single row of the DataFrame we create in the next step. It contains all the required information
            itemDict = {'restName': name, 'restArea': area, 'restRating': rating, 'restNumRev': numRev, 
                        'itemCat': itemCat, 'itemName': itemName, 'itemPrice': itemPrice, 
                        'itemDesc': itemDesc, 'itemBest': itemBest}
            restData.append(itemDict)

    return restData

scrapeStart = time.time()
# scraping the required information for all 3 restaurants
itemData = []
restaurants = []
for restaurant in restLinks:
    cache = menuScrape(browser, restaurant)
    restId = cache[0]['restName'] + ', ' + cache[0]['restArea']
    itemData.extend(cache)
    restaurants.append(restId)
cache = 0 # resetting cache in memory after execution
scrapeEnd = time.time()

print('Menu organisation concluded with a total of %d items scraped, taking only %.2f seconds!'%(len(itemData), scrapeEnd-scrapeStart))
print('Restaurants scraped: ', ' | '.join(restaurants))

Menu organisation concluded with a total of 257 items scraped, taking only 7.28 seconds!
Restaurants scraped:  Kwality Walls Frozen Dessert and Ice Cream Shop, Btm 2nd Stage | Craving O Clock, Btm 2nd Stage | Chai Point, Villaymma Layouts
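The per-item decisions made inside `menuScrape` (skipping the 'Recommended' category and flagging bestsellers) can be checked offline. Below is a minimal sketch in which plain Python data stands in for the parsed HTML; `build_rows` and the sample menu are hypothetical and not part of the scraper.

```python
import re


def build_rows(menu):
    """Flatten {category: [(item_name, ribbon_text_or_None), ...]} into row dicts."""
    rows = []
    for category, items in menu.items():
        if category == 'Recommended':
            continue  # skipped to reduce redundant rows, as in menuScrape
        for item_name, ribbon in items:
            tag = ribbon if ribbon else 'NoTag'  # same fallback as in menuScrape
            rows.append({
                'itemCat': category,
                'itemName': item_name,
                'itemBest': 1 if re.search('Bestseller', tag) else 0,
            })
    return rows


# Hypothetical menu standing in for the category and item divs
sample_menu = {
    'Recommended': [('Masala Chai', 'Bestseller')],
    'Hot Beverages': [('Masala Chai', 'Bestseller'), ('Ginger Chai', None)],
    'Snacks': [('Samosa', None)],
}
rows = build_rows(sample_menu)
for row in rows:
    print(row)
```

In this sample, 'Masala Chai' would be recorded twice if the 'Recommended' category were kept, which is the redundancy the `continue` statement avoids.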


### 4. Exporting Data as .CSV

This .csv is used for analysis in Tableau and Python later on.

In [4]:
pd.DataFrame(itemData).to_csv('top3_menudata.csv', index = False)
browser.quit() # closing the automated browser and ending the chromedriver session
toolEnd = time.time()
print('The scraping process took %.2f seconds in total.'%(toolEnd-toolStart))

# saving the item data (a list of dicts) into a .pkl file for later retrieval; an existing file is never overwritten
if not os.path.exists('top3restaurants.pkl'):
    with open('top3restaurants.pkl', 'wb') as pklFile:
        pickle.dump(itemData, pklFile)

The scraping process took 15.06 seconds in total.
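The write-once guard around `top3restaurants.pkl` could be extended so that a later run reuses the saved data instead of scraping again. The notebook itself only saves the file; the loading step, the `load_or_create` helper, the `fake_scrape` stand-in and the file name `demo.pkl` below are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def load_or_create(path, create):
    """Return the object pickled at path, or call create(), pickle its result, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    data = create()
    with open(path, 'wb') as fh:
        pickle.dump(data, fh)
    return data


calls = []


def fake_scrape():
    """Stand-in for the real scraping step; records each invocation."""
    calls.append(1)
    return [{'restName': 'Demo', 'itemName': 'Tea', 'itemPrice': '20'}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pkl')
    first = load_or_create(path, fake_scrape)   # file missing: create and save
    second = load_or_create(path, fake_scrape)  # file present: load from disk

print(first == second, len(calls))
```

Deleting the pickle file forces `create` to run again, which matches the instruction at the top of this notebook. Pickle files should only be loaded when they come from a trusted source.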
