* The goal of analyzing the Zomato dataset is to understand the factors that affect each restaurant's aggregate rating and how different types of restaurants are established in different places. Bengaluru is one such city, with more than 12,000 restaurants serving dishes from all over the world.
* New restaurants open every day, yet the industry hasn't been saturated and demand keeps increasing.
* In spite of the increasing demand, it has become difficult for new restaurants to compete with established ones, since most of them serve the same food.
* Bengaluru is the IT capital of India, and most people here depend mainly on restaurant food because they don't have time to cook for themselves.
* With such overwhelming demand for restaurants, it has become important to study the demography of a location.
* What kind of food is more popular in a locality?
* Does the entire locality love vegetarian food? If yes, is that locality populated by a particular community, e.g. Jains, Marwaris, or Gujaratis, who are mostly vegetarian?
* This kind of analysis can be done with the data by studying different factors.
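
As a rough illustration of the "which food is popular in a locality" question, the sketch below counts cuisines per location. It runs on a small hand-made frame (`toy`, with made-up rows), not on the real file, and only assumes that `cuisines` is a comma-separated string, as it appears in the dataset preview.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'location': ['BTM', 'BTM', 'HSR'],
    'cuisines': ['North Indian, Chinese', 'North Indian', 'South Indian, Chinese'],
})

# Split the comma-separated cuisines into one row per (location, cuisine) pair
per_cuisine = toy.assign(cuisine=toy['cuisines'].str.split(', ')).explode('cuisine')

# Count how often each cuisine appears in each location
cuisine_counts = (
    per_cuisine.groupby(['location', 'cuisine'])
    .size()
    .sort_values(ascending=False)
)
print(cuisine_counts)
```

The same grouping could be applied to the cleaned `df` once the columns used here are confirmed to be present.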


In [91]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Locate the dataset on the local disk and read the file using pd.read_csv()

In [92]:
path= r'/home/aman/Documents/files/zomato.csv'
df=pd.read_csv(path)
df.head()

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


#### Get the shape and columns of the dataset, to identify and remove columns that are not required

In [93]:
print(f"The shape of the data set is {df.shape}")
print(f"The columns present in the Dataset are \n{df.columns}")

The shape of the data set is (51717, 17)
The columns present in the Dataset are 
Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
       'approx_cost(for two people)', 'reviews_list', 'menu_item',
       'listed_in(type)', 'listed_in(city)'],
      dtype='object')


#### Drop the columns that are not required
* url
* address
* phone
* dish_liked
* menu_item
* reviews_list
* name

In [94]:
df= df.drop(['url','address','phone','dish_liked','menu_item','reviews_list','name'], axis=1)

In [95]:
print(f"The shape of the data set after dropping some columns is {df.shape}")
print(f"The columns present in the Dataset are \n{df.columns}")

The shape of the data set after dropping some columns is (51717, 10)
The columns present in the Dataset are 
Index(['online_order', 'book_table', 'rate', 'votes', 'location', 'rest_type',
       'cuisines', 'approx_cost(for two people)', 'listed_in(type)',
       'listed_in(city)'],
      dtype='object')


#### Get the information about number of null values in the dataset

In [96]:
df.isnull().sum()

online_order                      0
book_table                        0
rate                           7775
votes                             0
location                         21
rest_type                       227
cuisines                         45
approx_cost(for two people)     346
listed_in(type)                   0
listed_in(city)                   0
dtype: int64
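
Raw counts are easier to judge as a share of all rows: in the output above, `rate` is missing in 7775 of 51717 rows, roughly 15%. A minimal sketch of that calculation, shown on a toy frame (`toy` is a made-up stand-in for `df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    'rate': ['4.1/5', np.nan, '3.8/5', np.nan],
    'votes': [775, 0, 918, 12],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```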

#### Dropping the Duplicate rows in the dataset


In [97]:
df.drop_duplicates(inplace=True)

In [98]:
print(f"The shape of the data set after dropping rows with duplicates is {df.shape}")
print(f"The columns present in the Dataset are \n{df.columns}")

The shape of the data set after dropping rows with duplicates is (51345, 10)
The columns present in the Dataset are 
Index(['online_order', 'book_table', 'rate', 'votes', 'location', 'rest_type',
       'cuisines', 'approx_cost(for two people)', 'listed_in(type)',
       'listed_in(city)'],
      dtype='object')
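
The row count fell from 51717 to 51345, so 372 rows were removed. Because `name` and `address` were dropped earlier, two different restaurants with identical remaining fields are also treated as duplicates here. A small sketch, on a toy frame with made-up values, of counting duplicates before removing them:

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    'location': ['BTM', 'BTM', 'HSR'],
    'votes': [10, 10, 25],
})

# duplicated() marks every row that repeats an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates()
print(n_duplicates, deduplicated.shape)
```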


#### Edit the rate column


In [99]:
print(df['rate'].unique())

['4.1/5' '3.8/5' '3.7/5' '3.6/5' '4.6/5' '4.0/5' '4.2/5' '3.9/5' '3.1/5'
 '3.0/5' '3.2/5' '3.3/5' '2.8/5' '4.4/5' '4.3/5' 'NEW' '2.9/5' '3.5/5' nan
 '2.6/5' '3.8 /5' '3.4/5' '4.5/5' '2.5/5' '2.7/5' '4.7/5' '2.4/5' '2.2/5'
 '2.3/5' '3.4 /5' '-' '3.6 /5' '4.8/5' '3.9 /5' '4.2 /5' '4.0 /5' '4.1 /5'
 '3.7 /5' '3.1 /5' '2.9 /5' '3.3 /5' '2.8 /5' '3.5 /5' '2.7 /5' '2.5 /5'
 '3.2 /5' '2.6 /5' '4.5 /5' '4.3 /5' '4.4 /5' '4.9/5' '2.1/5' '2.0/5'
 '1.8/5' '4.6 /5' '4.9 /5' '3.0 /5' '4.8 /5' '2.3 /5' '4.7 /5' '2.4 /5'
 '2.1 /5' '2.2 /5' '2.0 /5' '1.8 /5']


In [100]:
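# 'NEW' and '-' carry no rating, so they are converted to NaN;
# otherwise keep the number before '/', e.g. '4.1/5' -> 4.1
# (missing values pass through as NaN, because float('nan') is NaN)


def editratecolumn(value):
    if value == 'NEW' or value == '-':
        return np.nan
    else:
        value = str(value).split('/')
        value = value[0]
        return float(value)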


In [101]:
df['rate']= df['rate'].apply(editratecolumn)
df['rate'].head()

0    4.1
1    4.1
2    3.8
3    3.7
4    3.8
Name: rate, dtype: float64
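
The same cleaning can also be written without a row-wise `apply`. The sketch below is an alternative, not the method used in this notebook: it relies on `pd.to_numeric(..., errors='coerce')` to turn anything non-numeric (`'NEW'`, `'-'`, missing values) into NaN. The `raw` series is a made-up sample of the formats seen in `df['rate'].unique()`.

```python
import numpy as np
import pandas as pd

# Sample of the raw formats seen in the rate column
raw = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', np.nan])

# Keep the text before '/', trim spaces, and coerce non-numbers to NaN
cleaned = pd.to_numeric(raw.str.split('/').str[0].str.strip(), errors='coerce')
print(cleaned)
```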

In [102]:
# null values in the rate column are replaced with the mean of that column
df['rate'] = df['rate'].fillna(df['rate'].mean())
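
Mean imputation gives every missing rating the same value, so unrated restaurants become indistinguishable from restaurants that really scored the average. One hedged option is to record which rows were filled before filling them; the `was_missing` flag below is an illustrative addition, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing value (illustrative only)
s = pd.Series([4.0, np.nan, 3.0])

# Remember which entries were missing before imputing
was_missing = s.isna()

# mean() skips NaN, so the fill value here is (4.0 + 3.0) / 2
filled = s.fillna(s.mean())
print(filled.tolist(), was_missing.tolist())
```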

In [103]:
df.isnull().sum()

online_order                     0
book_table                       0
rate                             0
votes                            0
location                        19
rest_type                      225
cuisines                        43
approx_cost(for two people)    341
listed_in(type)                  0
listed_in(city)                  0
dtype: int64

#### Dropping Null Values

In [104]:
df.dropna(inplace=True)

In [105]:
df.shape

(50781, 10)

#### Renaming the columns

In [106]:
df.rename(columns = {'approx_cost(for two people)':'Cost2plates', 'listed_in(type)':'Type'}, inplace = True)

In [107]:
df.columns

Index(['online_order', 'book_table', 'rate', 'votes', 'location', 'rest_type',
       'cuisines', 'Cost2plates', 'Type', 'listed_in(city)'],
      dtype='object')

In [108]:
df

Unnamed: 0,online_order,book_table,rate,votes,location,rest_type,cuisines,Cost2plates,Type,listed_in(city)
0,Yes,Yes,4.100000,775,Banashankari,Casual Dining,"North Indian, Mughlai, Chinese",800,Buffet,Banashankari
1,Yes,No,4.100000,787,Banashankari,Casual Dining,"Chinese, North Indian, Thai",800,Buffet,Banashankari
2,Yes,No,3.800000,918,Banashankari,"Cafe, Casual Dining","Cafe, Mexican, Italian",800,Buffet,Banashankari
3,No,No,3.700000,88,Banashankari,Quick Bites,"South Indian, North Indian",300,Buffet,Banashankari
4,No,No,3.800000,166,Basavanagudi,Casual Dining,"North Indian, Rajasthani",600,Buffet,Banashankari
...,...,...,...,...,...,...,...,...,...,...
51712,No,No,3.600000,27,Whitefield,Bar,Continental,1500,Pubs and bars,Whitefield
51713,No,No,3.700164,0,Whitefield,Bar,Finger Food,600,Pubs and bars,Whitefield
51714,No,No,3.700164,0,Whitefield,Bar,Finger Food,2000,Pubs and bars,Whitefield
51715,No,Yes,4.300000,236,"ITPL Main Road, Whitefield",Bar,Finger Food,2500,Pubs and bars,Whitefield


In [109]:
# remove ',' from the numbers Example: 1,200 to 1200
df['Cost2plates']= df['Cost2plates'].str.replace(',','')

In [110]:
## change datatypes of Cost2plates and votes
df= df.astype({'votes':'int'})
df= df.astype({'Cost2plates':'int'})
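
To see why the comma has to go before the cast, here is the same two-step conversion on a toy series (made-up values). `regex=False` is passed so the comma is treated as a literal character; without the replacement, casting `'1,200'` to `int` would raise a `ValueError`.

```python
import pandas as pd

# Costs arrive as strings, some with a thousands separator
cost = pd.Series(['800', '1,200', '300'])

# Remove the separator, then cast to integers
cost_int = cost.str.replace(',', '', regex=False).astype(int)
print(cost_int.tolist())
```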

In [111]:
df.dtypes

online_order        object
book_table          object
rate               float64
votes                int64
location            object
rest_type           object
cuisines            object
Cost2plates          int64
Type                object
listed_in(city)     object
dtype: object

#### Two columns in df hold the location; drop one of them

In [112]:
df=df.drop(['listed_in(city)'],axis=1)
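
The two columns are not identical: in the preview above, row 4 has `location` Basavanagudi but `listed_in(city)` Banashankari. Before dropping one of them, it may be worth measuring how often they disagree. A minimal sketch on a toy frame (made-up rows using the same column names):

```python
import pandas as pd

# Toy frame with the two location columns (values are illustrative only)
toy = pd.DataFrame({
    'location': ['Banashankari', 'Basavanagudi', 'Whitefield'],
    'listed_in(city)': ['Banashankari', 'Banashankari', 'Whitefield'],
})

# Fraction of rows where the two columns name different places
mismatch_share = (toy['location'] != toy['listed_in(city)']).mean()
print(f"{mismatch_share:.2%} of rows differ")
```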

#### Cleaning the rest_type column

In [113]:
df['rest_type'].value_counts()

Quick Bites                   18796
Casual Dining                 10252
Cafe                           3681
Delivery                       2554
Dessert Parlor                 2238
                              ...  
Dessert Parlor, Kiosk             2
Food Court, Beverage Shop         2
Dessert Parlor, Food Court        2
Quick Bites, Kiosk                1
Sweet Shop, Dessert Parlor        1
Name: rest_type, Length: 93, dtype: int64

#### Grouping rest_types with a frequency below 1000 as 'others'

In [114]:
rest_types= df['rest_type'].value_counts(ascending=False)
rest_types

Quick Bites                   18796
Casual Dining                 10252
Cafe                           3681
Delivery                       2554
Dessert Parlor                 2238
                              ...  
Dessert Parlor, Kiosk             2
Food Court, Beverage Shop         2
Dessert Parlor, Food Court        2
Quick Bites, Kiosk                1
Sweet Shop, Dessert Parlor        1
Name: rest_type, Length: 93, dtype: int64

In [115]:
rest_types_less_than1000 = rest_types[rest_types<1000]
rest_types_less_than1000

Beverage Shop                 863
Bar                           686
Food Court                    616
Sweet Shop                    468
Bar, Casual Dining            411
                             ... 
Dessert Parlor, Kiosk           2
Food Court, Beverage Shop       2
Dessert Parlor, Food Court      2
Quick Bites, Kiosk              1
Sweet Shop, Dessert Parlor      1
Name: rest_type, Length: 85, dtype: int64

In [116]:
def handle_rest_types(value):
    # `in` on a Series checks its index labels, which here are the rest_type names
    if value in rest_types_less_than1000:
        return 'others'
    else:
        return value


df['rest_type'] = df['rest_type'].apply(handle_rest_types)
df['rest_type'].value_counts()

Quick Bites           18796
Casual Dining         10252
others                 9003
Cafe                   3681
Delivery               2554
Dessert Parlor         2238
Takeaway, Delivery     2004
Casual Dining, Bar     1130
Bakery                 1123
Name: rest_type, dtype: int64
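
The same grouping can be expressed as one reusable, vectorized helper. The function name `lump_rare` and its parameters are hypothetical, introduced only for this sketch; it uses `Series.isin` and `Series.where` instead of a row-wise `apply`.

```python
import pandas as pd


def lump_rare(series, min_count, other_label='others'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # where() keeps values for which the condition holds and replaces the rest
    return series.where(series.isin(frequent), other_label)


# Toy categories (illustrative only)
toy = pd.Series(['Quick Bites', 'Quick Bites', 'Cafe', 'Kiosk'])
lumped = lump_rare(toy, min_count=2)
print(lumped.tolist())
```

With the real data this would correspond to `lump_rare(df['rest_type'], 1000)`, and the same helper would apply to any other high-cardinality column with a threshold suited to that column.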

#### Cleaning the Location Column

In [117]:
df['location'].value_counts()

BTM                      5012
HSR                      2484
Koramangala 5th Block    2479
JP Nagar                 2209
Whitefield               2066
                         ... 
West Bangalore              6
Yelahanka                   5
Jakkur                      3
Rajarajeshwari Nagar        2
Peenya                      1
Name: location, Length: 93, dtype: int64

In [120]:
location = df['location'].value_counts(ascending=False)

location_less_than300 = location[location<300]
def handle_location(value):
    if(value in location_less_than300):
        return 'others'
    else:
        return value

In [121]:
df['location']=df['location'].apply(handle_location)
df['location'].value_counts()

BTM                      5012
others                   4934
HSR                      2484
Koramangala 5th Block    2479
JP Nagar                 2209
Whitefield               2066
Indiranagar              2019
Jayanagar                1909
Marathahalli             1779
Bannerghatta Road        1604
Bellandur                1258
Koramangala 1st Block    1227
Electronic City          1212
Brigade Road             1210
Koramangala 7th Block    1174
Koramangala 6th Block    1127
Sarjapur Road            1046
Koramangala 4th Block    1016
Ulsoor                   1011
Banashankari              897
MG Road                   893
Kalyan Nagar              841
Richmond Road             795
Malleshwaram              720
Frazer Town               711
Basavanagudi              683
Residency Road            671
Brookefield               656
New BEL Road              642
Kammanahalli              636
Banaswadi                 634
Rajajinagar               582
Church Street             566
Lavelle Ro

In [None]:
df.head()

Unnamed: 0,online_order,book_table,rate,votes,location,rest_type,cuisines,Cost2plates,Type
0,Yes,Yes,4.1,775,Banashankari,Casual Dining,"North Indian, Mughlai, Chinese",800,Buffet
1,Yes,No,4.1,787,Banashankari,Casual Dining,"Chinese, North Indian, Thai",800,Buffet
2,Yes,No,3.8,918,Banashankari,others,"Cafe, Mexican, Italian",800,Buffet
3,No,No,3.7,88,Banashankari,Quick Bites,"South Indian, North Indian",300,Buffet
4,No,No,3.8,166,Basavanagudi,Casual Dining,"North Indian, Rajasthani",600,Buffet
