## ZOMATO PARSING PROJECT

### Some Important Highlights about the Project

The primary features available on Zomato's platform can be broadly divided into four categories.
1. Summary highlights of restaurants and cuisines w.r.t. demographics.
2. Restaurant attributes, i.e. rating, review counts, types of cuisines offered and discounts.
3. Menu details, price points and dish types.
4. Reviews and sentiments of individuals about food.

For this phase of the project, I will only parse and analyze data for the first two categories. The third category (menu details, price points and dish types) will be parsed and analyzed in the second phase to study the distribution of menu items, major restaurant offerings and price points.

The fourth category, reviews, is also valuable: it offers a way to analyze the sentiments of individuals and their remarks about various foods, cuisines and restaurants. Combined with the cuisine offering and segmentation, an NLP model could reveal insights that help steer brand and marketing activities toward more traction and higher sales.

#### PROJECT DESIGN

The first part of the project parses the key features, structures them, and inserts them into a schema of four tables, each designed around one type of data point.
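
As a rough illustration of the "parse, then structure" step: the summary filters on a listing page are read as text of the form `<label> <count>`, and the parser in this notebook splits off the last token as the count. Below is a minimal sketch of that split. The helper name `split_label_count`, the area name and the sample strings are hypothetical and are not taken from Zomato's markup.

```python
def split_label_count(text):
    """Split a string such as 'North Indian 120' into (label, count)."""
    parts = text.split()  # split on any whitespace
    if len(parts) < 2:
        # No separate count token: keep the text as the label.
        return text.strip(), ''
    return ' '.join(parts[:-1]), parts[-1]


# Hypothetical sample strings standing in for scraped filter text.
samples = ['North Indian 120', 'Pizza 45', 'Desserts']

# Structure each parsed value as a row: (area id, area name, label, count).
rows = [(1, 'Sample Area') + split_label_count(s) for s in samples]
for row in rows:
    print(row)
```

Each tuple has the same shape as a row of the `cuisine_summary` table defined later in this notebook.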

Because the analysis is granular down to the area level, every table carries an `id` column identifying the area a row belongs to. This column acts as a many-to-many link: many rows in one table can be related to many rows in another through the shared area id. I store the data in a database as an intermediate step to limit data loss from server errors or other complications that might arise while the parser runs. A database also gives the data a more structured, tabular form and allows some hands-on ETL.
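
The area-keyed design can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, keeps only a subset of the columns, and inserts made-up rows. Note that `sqlite3` uses `?` placeholders, whereas `mysql.connector` uses `%s`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL instance (assumption).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Reduced versions of two of the project's tables, both keyed by area id.
cur.execute('''create table cuisine_summary(
    id int not null, area_name text, cuisine_type text, cuisine_count text)''')
cur.execute('''create table restaurants(
    id int not null, area_name text, restaurant_name text, offered_cuisines text)''')

# Hypothetical rows, for illustration only.
cur.executemany('insert into cuisine_summary values (?, ?, ?, ?)', [
    (1, 'Area A', 'Pizza', '45'),
    (2, 'Area B', 'Pizza', '12'),
])
cur.executemany('insert into restaurants values (?, ?, ?, ?)', [
    (1, 'Area A', 'Sample Pizzeria', 'Pizza, Italian'),
    (1, 'Area A', 'Sample Grill', 'BBQ'),
    (2, 'Area B', 'Sample Cafe', 'Cafe'),
])
conn.commit()  # committed rows survive a later failure of the parser

# Relate the two tables through the shared area id.
cur.execute('''select c.area_name, c.cuisine_type, count(r.restaurant_name)
               from cuisine_summary c
               join restaurants r on r.id = c.id
               group by c.id, c.area_name, c.cuisine_type
               order by c.id''')
result = cur.fetchall()
print(result)
```

Committing after each batch of inserts is what limits the loss if a later request fails: everything committed up to that point stays in the database.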

In [1]:
#Importing Libraries
import re
import mysql.connector
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import datetime
import requests
import random
import numpy as np
import pandas as pd
from time import sleep
import codecs
import time

In [2]:
#connecting and defining cursors for connection with database 
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"}
mydb = mysql.connector.connect(host = 'localhost', user='root', password='*****')
cur = mydb.cursor(buffered=True)

In [4]:
cur.execute('use zomato')

### PART 1

In [5]:
cur.execute('drop table if exists cuisine_summary')
cur.execute('drop table if exists time_summary')
cur.execute('drop table if exists minimum_requirement')
cur.execute('''drop table if exists restaurants''')


In [5]:
# defining schema of the project database 
cur.execute('''Create table if not exists cuisine_summary(
id int not null, 
area_name varchar(128),
cuisine_type varchar(128),
cuisine_count varchar(128)
)''')

cur.execute('''create table if not exists time_summary(
id int not null,
area_name varchar(128),
expected_delivery_time varchar(128),
count_of_restaurants varchar(128)
)''')

cur.execute('''create table if not exists minimum_requirement(
id int not null, 
area_name varchar(128),
minimum_order_type varchar(128),
minimum_order_count varchar(128)
)''')

cur.execute('''Create table if not exists restaurants(
id int not null,
area_name varchar(128),
restaurant_name varchar(128),
restaurant_rating varchar(128),
review_count varchar(128),
offered_cuisines varchar(128),
average_cost_per_person varchar(128),
expected_delivery_time varchar(128),
minimum_order_requirement varchar(128)
)''')

In [6]:
# Zomato has no clear pattern in its zone numbering, so I fetched the area codes manually from their website.
# While choosing the codes I aimed for a fair and even spread across all the areas in Dubai.

delivery_zones = [3831, 3832, 3835, 42656, 7980, 3836, 3850, 3790, 4716, 3849, 3882, 3847, 3878, 3879, 8458,42643, 3880, 3786, 3787, 3788, 8465, 8466, 
                 7721, 7722, 7723, 3839, 3840, 3844, 3801, 3802, 4721, 4720, 4723, 3797, 4724, 3794, 7990, 3816, 3821, 3822, 7986, 3820]
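
To make the mapping from zone codes to requests concrete, here is a small sketch of how each code and page number turn into a listing URL, and how `enumerate(..., 1)` yields the area id stored in the tables. The URL pattern and the zone codes come from this notebook; the helper `build_url` and the names `BASE_URL`, `sample_zones` and `targets` are introduced here purely for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.zomato.com/dubai/order-food-online'


def build_url(zone, page):
    """Build the listing URL for one delivery subzone and page number."""
    query = urlencode({'delivery_subzone': zone, 'page': page})
    return '{}?{}'.format(BASE_URL, query)


# First three codes from delivery_zones, four pages per zone.
sample_zones = [3831, 3832, 3835]
targets = [(area_id, page, build_url(zone, page))
           for area_id, zone in enumerate(sample_zones, 1)
           for page in range(1, 5)]

print(len(targets))
print(targets[0])
```

With the full list of 42 zone codes and four pages each, the same construction gives 168 requests.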

In [7]:
counter = 1 #to count iterations
for id1,areas in enumerate(delivery_zones,1):
    for page in range(1,5): # for the purpose of this project I only parsed 4 pages from a single zone
        try:
            new_list = []
            url = 'https://www.zomato.com/dubai/order-food-online?delivery_subzone={}&page={}'.format(areas,page)
            http = requests.get(url, headers = headers, timeout=30) # timeout so a stalled request cannot hang the parser
            bs = BeautifulSoup(http.text, 'html.parser')
            area = bs.h1.text
            #classes for the related data points
            cuisine = bs.find_all('div',{'class':'link_hover w100 search_filter cuisine'})
            deliver_time = bs.find_all('div',{'class':'link_hover w100 search_filter cft cursor-pointer'})
            res_name = bs.find_all(attrs={'data-result-type':'ResCard_Name'})
            rating = bs.find_all(attrs={'class':'rating-value'})
            rating_class = bs.find_all(attrs={'class':'review-count medium'})
            desc = bs.find_all(attrs = {'class':'description'})
            print('Parsing Area {}, Page Number {}'.format(id1, page)) #Printing to see how many iterations have been run and to debug any errors
            counter +=1
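        except (requests.exceptions.RequestException, AttributeError) as err:
            # AttributeError covers pages without an <h1>, where bs.h1 is None
            print('Encountered an Error: {}'.format(err))
            continue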
        for x in deliver_time:
            new_list.append(x)
        try: # Fetching the cuisine summary-level view and storing it in its respective table
            for x in cuisine:
                split1 = x.text.split(' ')
                count = ''.join(split1[-1])
                food_type = ' '.join(split1[:-1])
                cur.execute('''insert into cuisine_summary (id, area_name, cuisine_type, cuisine_count) values (%s, %s, %s, %s)''',
                            (id1,area, food_type,count ))
                mydb.commit()
        except Exception as err: # a bare except would also swallow KeyboardInterrupt
            print('Pathway 1 Error Encountered: {}'.format(err))
            continue

        try: #fetching delivery time summary level view and storing in database
            for x in new_list[0:3]:
                x = x.text.split()
                delivery_time = ' '.join(x[:4])
                count_delivery = ''.join(x[4:])
                cur.execute('''insert into time_summary (id, area_name, expected_delivery_time, count_of_restaurants) values (%s, %s, %s, %s)''',
                            (id1,area, delivery_time,count_delivery ))
                mydb.commit()  
        except Exception as err: # a bare except would also swallow KeyboardInterrupt
            print('Pathway 2 Error Encountered: {}'.format(err))
            continue
        try:    # Fetching the minimum-order summary data points for table 3 and storing them in the database
            for x in new_list[9:]:
                x = x.text.split()
                min_order = ' '.join(x[:-1])
                min_order_count = ''.join(x[-1])
                cur.execute('''insert into minimum_requirement (id, area_name, minimum_order_type, minimum_order_count) values (%s, %s, %s, %s)''',
                                (id1,area, min_order,min_order_count))
                mydb.commit() 
        except Exception as err: # a bare except would also swallow KeyboardInterrupt
            print('Pathway 3 error encountered: {}'.format(err))
            continue

        try: # Main table for the analysis: parsing the restaurant features and storing them in their table
            for name,x, y, z in zip(res_name, rating, rating_class, desc):
                restaurant_name = name.text.strip()
                rate = x.text
                reviews = y.text
                # The description block holds one line each for cuisines, cost and order details
                details = z.text.strip().split('\n')
                type_of_food = details[0]
                cost_one = details[1]
                order_values = details[2]
                # Fixed-width slices of the order line; variables are named after the columns they fill
                delivery_time_value = order_values[-8:-1]
                min_order_value = order_values[:-17].strip()
                cur.execute('''insert into restaurants (id, area_name,restaurant_name, restaurant_rating, 
                review_count, offered_cuisines, average_cost_per_person,expected_delivery_time,
                minimum_order_requirement) values (%s, %s, %s, %s, %s,%s,%s,%s,%s)''',
                                        (id1,area, restaurant_name, rate, reviews, type_of_food, cost_one, delivery_time_value, min_order_value))
            mydb.commit()

        except Exception as err: # a bare except would also swallow KeyboardInterrupt
            print('Pathway 4 error encountered: {}'.format(err))
            continue
        
        if counter%10 ==0: #making the parser sleep randomly every 10 iterations so server does not find it suspicious
            sleeper = np.random.randint(4,9)
            print('Now I will rest for {} Seconds and will then start working'.format(sleeper))
            time.sleep(sleeper)
            

Parsing Area 1, Page Number 1
Parsing Area 1, Page Number 2
Parsing Area 1, Page Number 3
Parsing Area 1, Page Number 4
Parsing Area 2, Page Number 1
Parsing Area 2, Page Number 2
Parsing Area 2, Page Number 3
Parsing Area 2, Page Number 4
Parsing Area 3, Page Number 1
Now I will rest for 6 Seconds and will then start working
Parsing Area 3, Page Number 2
Parsing Area 3, Page Number 3
Parsing Area 3, Page Number 4
Parsing Area 4, Page Number 1
Parsing Area 4, Page Number 2
Parsing Area 4, Page Number 3
Parsing Area 4, Page Number 4
Parsing Area 5, Page Number 1
Parsing Area 5, Page Number 2
Parsing Area 5, Page Number 3
Now I will rest for 7 Seconds and will then start working
Parsing Area 5, Page Number 4
Parsing Area 6, Page Number 1
Parsing Area 6, Page Number 2
Parsing Area 6, Page Number 3
Parsing Area 6, Page Number 4
Parsing Area 7, Page Number 1
Parsing Area 7, Page Number 2
Parsing Area 7, Page Number 3
Parsing Area 7, Page Number 4
Parsing Area 8, Page Number 1
Now I will res
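
The placement of the "Now I will rest" messages in the output follows from the counter logic: `counter` starts at 1, is incremented once per parsed page, and a pause is taken whenever it is divisible by 10. The sketch below reproduces that schedule without making any requests. The function names are hypothetical, and it assumes every page passes through all four pathways without an error, since a `continue` skips the pause check.

```python
def pages_followed_by_pause(total_pages, every=10, start=1):
    """Return the 1-based page numbers after which the parser would pause."""
    counter = start
    paused_after = []
    for page_number in range(1, total_pages + 1):
        counter += 1  # mirrors `counter += 1` in the parsing loop
        if counter % every == 0:
            paused_after.append(page_number)
    return paused_after


def to_area_and_page(page_number, pages_per_area=4):
    """Translate an overall page number into (area id, page within area)."""
    area, page = divmod(page_number - 1, pages_per_area)
    return area + 1, page + 1


pauses = pages_followed_by_pause(29)
print(pauses)                                 # [9, 19, 29]
print([to_area_and_page(p) for p in pauses])  # [(3, 1), (5, 3), (8, 1)]
```

These positions match the output above: the parser rests after Area 3 Page 1, after Area 5 Page 3, and after Area 8 Page 1.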

### END OF PART 1

This concludes the first part. The next notebook fetches the data from the SQL database, loads it into a dataframe, preprocesses it, and prepares it for the final representations. Let's move on to the other notebook of this project.