# Exploratory Data Analysis of Zomato Data

# Data Context
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the establishment of different types of restaurants at different places in Bengaluru, and of the aggregate rating of each restaurant. Bengaluru has more than 12,000 restaurants, serving dishes from all over the world. With new restaurants opening every day, the industry hasn't been saturated yet and demand is increasing day by day.

This Zomato data aims at analysing the demography of each location. Most importantly, it will help new restaurants decide their theme, menu, cuisine, cost, etc. for a particular location. It also aims at finding similarities between neighborhoods of Bengaluru on the basis of food. The dataset also contains reviews for each restaurant, which will help in finding the overall rating for the place. The data is accurate to what was available on the Zomato website as of 15 March 2019. The data was scraped from Zomato in two phases. After going through the structure of the website, I found that for each neighborhood there are 6-7 categories of restaurants, viz. Buffet, Cafes, Delivery, Desserts, Dine-out, Drinks & nightlife, Pubs and bars.

# Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

#set the style 
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = 14,7
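#the "seaborn" style was renamed to "seaborn-v0_8" in newer matplotlib releases; use whichever is available
plt.style.use("seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn")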

# Load Data

In [None]:
#load the data
zomato_data = pd.read_csv("../input/zomato.csv")

In [None]:
zomato_data.head()

### Data Dictionary:

* url - contains the url of the restaurant in the zomato website
* address - contains the address of the restaurant in Bengaluru
* name - contains the name of the restaurant
* online_order - whether online ordering is available in the restaurant or not
* book_table - table book option available or not
* rate - contains the overall rating of the restaurant out of 5
* votes - contains total number of rating for the restaurant as of the above mentioned date
* phone - contains the phone number of the restaurant
* location - contains the neighborhood in which the restaurant is located
* rest_type - restaurant type
* dish_liked - dishes people liked in the restaurant
* cuisines - food styles, separated by comma
* approx_cost(for two people) - contains the approximate cost for meal for two people
* reviews_list - list of tuples containing reviews for the restaurant, each tuple consists of two values, rating and review by the customer
* menu_item - contains list of menus available in the restaurant
* listed_in(type) - type of meal
* listed_in(city) - contains the neighborhood in which the restaurant is listed
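Two of the columns above pack several values into one field: `cuisines` is a comma-separated string, and `reviews_list` is described as a list of (rating, review) tuples. A minimal sketch of how such fields could be unpacked is shown here. The rows are hypothetical, and it is an assumption that `reviews_list` is stored as the text of a Python literal.

```python
import ast
import pandas as pd

# Hypothetical rows that mimic the column formats described above (not real data).
sample = pd.DataFrame({
    "name": ["Cafe A", "Diner B"],
    "cuisines": ["North Indian, Chinese", "Cafe, Desserts, Italian"],
    "reviews_list": [
        "[('Rated 4.0', 'Nice place'), ('Rated 3.0', 'Average food')]",
        "[]",
    ],
})

# One row per (restaurant, cuisine) pair.
cuisines_long = (
    sample.assign(cuisine=sample["cuisines"].str.split(","))
    .explode("cuisine")
)
cuisines_long["cuisine"] = cuisines_long["cuisine"].str.strip()

# Parse the review text into real Python lists of (rating, review) tuples.
sample["reviews_parsed"] = sample["reviews_list"].apply(ast.literal_eval)
sample["n_reviews"] = sample["reviews_parsed"].apply(len)

print(cuisines_long[["name", "cuisine"]])
print(sample[["name", "n_reviews"]])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for scraped text.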

In [None]:
#shape of the dataset
zomato_data.shape

# Basic Data Understanding

In [None]:
zomato_data.columns

In [None]:
zomato_data.info()

In [None]:
#get the datatypes of the columns
zomato_data.dtypes

In [None]:
#count of data types
zomato_data.dtypes.value_counts() #get_dtype_counts() was removed in pandas 1.0

* Only the variable `votes` is read as an integer; the remaining 16 columns are read as objects.

In [None]:
#basic stats
zomato_data.describe() #only for votes

In [None]:
#check for missing values

pd.DataFrame(round(zomato_data.isnull().sum()/zomato_data.shape[0] * 100,3), columns = ["Missing"])

* The variable `dish_liked` has more than 54% missing data. If we dropped the rows with missing values, we would lose more than half of the data.
* The `rate` variable has more than 15% missing data.
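The percentages above come from dividing the null count by the number of rows. An equivalent and slightly shorter formulation is sketched here on a hypothetical toy frame: the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real dataset.
toy = pd.DataFrame({
    "rate": ["4.1/5", np.nan, "3.8/5", "NEW"],
    "dish_liked": [np.nan, np.nan, "Pasta", np.nan],
    "votes": [120, 0, 45, 3],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Note that placeholder strings such as `NEW` are not counted as missing at this stage, so the true share of unusable `rate` values can be higher than this check reports.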

# Data Cleaning/Manipulation

In [None]:
#check for any duplicate values
zomato_data.duplicated().sum()

In [None]:
#cleaning the column names
zomato_data.columns

In [None]:
zomato_data.rename(columns={"approx_cost(for two people)": "cost_two", "listed_in(type)":"service_type", "listed_in(city)":"serve_to"},
                   inplace = True)

In [None]:
#dropping the url, address and phone columns - because they are not very useful in data analysis
zomato_data.drop(["url", "address",  "phone"], axis = 1, inplace = True)
zomato_data.head()

In [None]:
#Manipulating the rate column - rate is read as object, but for analysis we need that to be present in numerical format.

zomato_data.rate.unique()

In [None]:
#removing the "/5" in the rate column; the .str accessor keeps missing values as NaN instead of turning them into the string 'nan'
zomato_data["rate"] = zomato_data["rate"].str.replace("/5", "", regex=False).str.strip()

In [None]:
#rate column contains 'NEW' and '-'; replace those (and any literal 'nan' strings) with np.nan so that unrated rows can be dropped
zomato_data["rate"] = zomato_data["rate"].replace(["NEW", "-", "nan"], np.nan)

In [None]:
#dropping the observations where rate and cost_two is null
zomato_data.dropna(subset = ["rate", "cost_two"], inplace = True)
#Converting the rate column data type to float
zomato_data.rate = zomato_data.rate.astype('float')
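The rating clean-up above can also be written as a single vectorised pass. This is a minimal sketch on a hypothetical Series, using `pd.to_numeric` with `errors="coerce"` so that every non-numeric placeholder becomes `NaN` without being listed explicitly; it is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values that mimic the formats seen in the rate column.
raw_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", np.nan])

# Strip the "/5" suffix, then coerce anything non-numeric to NaN.
cleaned = raw_rate.str.replace("/5", "", regex=False).str.strip()
rate = pd.to_numeric(cleaned, errors="coerce")

print(rate.tolist())
```

After this step, `dropna` removes both the originally missing ratings and the `NEW` / `-` placeholders.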

In [None]:
#online_order and book_table are given as 'Yes' and 'No'. Converting these to True and False for easier manipulation.
zomato_data["online_order"] = zomato_data["online_order"].map({"Yes": True, "No": False})
zomato_data["book_table"] = zomato_data["book_table"].map({"Yes": True, "No": False})

In [None]:
#converting the cost_two variable to int after removing the thousands separator
zomato_data.cost_two = zomato_data.cost_two.apply(lambda x: int(x.replace(',','')))

In [None]:
#converting to int
zomato_data.cost_two = zomato_data.cost_two.astype('int')

In [None]:
zomato_data.head()

# Exploratory Data Analysis

In [None]:
#lets plot the distribution of votes
plt.rcParams['figure.figsize'] = 14,7
sns.histplot(zomato_data["votes"], bins=5, color="y") #distplot is deprecated; histplot needs seaborn 0.11 or later
plt.title("Distribution of votes")
plt.ylabel("Count")
plt.show()

> **The histogram shows the distribution of a univariate set of observations. In this plot, we can see that the majority of vote counts lie in the 500-2500 bucket. Only a few restaurants polled more than 2500 votes.**

In [None]:
#plot the count of rating.
plt.rcParams['figure.figsize'] = 14,7
sns.countplot(x=zomato_data["rate"], palette="Set1")
plt.title("Count plot of rate variable")
plt.show()

* The rate variable follows a near-**normal distribution with a mean of 3.7**. The ratings of the majority of restaurants lie within the range 3.5-4.2.
* Very few restaurants (~350) have a rating above 4.8.
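A visual impression of near-normality can be backed up with a few summary statistics: for a roughly symmetric distribution the mean and median are close and the skewness is near zero. The sketch below uses hypothetical ratings in place of the cleaned `rate` column.

```python
import pandas as pd

# Hypothetical ratings standing in for the cleaned rate column.
rate = pd.Series([3.1, 3.5, 3.6, 3.7, 3.7, 3.8, 3.9, 4.0, 4.2, 4.6])

summary = {
    "mean": round(rate.mean(), 2),
    "median": rate.median(),
    "skew": round(rate.skew(), 2),  # sample skewness; 0 for a symmetric distribution
}
print(summary)
```

On the real data, the same three numbers can be read from `zomato_data["rate"]`.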

In [None]:
#lets check if there is any relationship between rate and votes

plt.scatter(zomato_data["rate"], zomato_data["votes"], marker='+', color="purple") #cmap is ignored when a single color is given
plt.xlabel("rating")
plt.ylabel("votes")
plt.title("Scatter plot between rate and votes")
plt.show()

* From the plot, we can infer that restaurants with higher ratings tend to get more votes. No surprises here.
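The strength of this relationship can be quantified rather than read off the scatter plot. Because vote counts are heavily right-skewed, a rank-based (Spearman) correlation is often more informative than Pearson's. The sketch below uses hypothetical values and computes Spearman's rho as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

# Hypothetical (rate, votes) pairs with a few very popular restaurants.
toy = pd.DataFrame({
    "rate": [3.2, 3.6, 3.9, 4.1, 4.4, 4.7],
    "votes": [15, 40, 35, 300, 900, 5000],
})

# Pearson measures linear association and is sensitive to the extreme vote counts.
pearson = toy["rate"].corr(toy["votes"])

# Spearman's rho is the Pearson correlation computed on the ranks.
spearman = toy["rate"].rank().corr(toy["votes"].rank())

print(round(pearson, 3), round(spearman, 3))
```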

In [None]:
sns.jointplot(x = "rate", y = "votes", data = zomato_data, height=8, ratio=4, color="g")
plt.show()

In [None]:
#similarly lets plot the relationship between rate and cost_two

sns.jointplot(x = "rate", y = "cost_two", data = zomato_data, height=8, ratio=4, kind = "kde", space=0, color="g")
plt.show()

## Correlation

In [None]:
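#correlate only the numeric and boolean columns; recent pandas versions raise an error on text columns
sns.heatmap(zomato_data.select_dtypes(include=["number", "bool"]).corr(), annot = True, cmap = "viridis",linecolor='white',linewidths=1)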
plt.show()

* Having an online order facility is inversely related to the average cost for two.
* Restaurants that offer advance table booking have a higher average cost.
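The correlations involving `online_order` and `book_table` are Pearson correlations between a True/False flag and a numeric column, which amounts to comparing group means. A minimal sketch on a hypothetical frame shows how the sign arises once the booleans are cast to 0/1.

```python
import pandas as pd

# Hypothetical frame: cheaper places take online orders, pricier ones take bookings.
toy = pd.DataFrame({
    "online_order": [True, True, True, False, False, False],
    "book_table":   [False, False, True, False, True, True],
    "cost_two":     [300, 400, 800, 500, 1200, 1500],
})

# Cast booleans to 0/1 so that every column is numeric before correlating.
corr = toy.astype(int).corr()
print(corr["cost_two"].round(2))
```

A negative coefficient for `online_order` means the group with the flag set has the lower mean cost, and a positive one for `book_table` means the opposite.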

## Restaurants Location

In [None]:
plt.rcParams['figure.figsize'] = 14,7
zomato_data.location.value_counts().nlargest(10).plot(kind = "barh")
plt.title("Number of restaurants by location")
plt.xlabel("Count")
plt.show()

* Most of the restaurants are located in **BTM**, followed by **Koramangala 5th Block**.
* **Bellandur** has the fewest restaurants among the top 10.

## Restaurant Listed in
- Let's see in which areas most of the restaurants are listed (deliver to)

In [None]:
plt.rcParams['figure.figsize'] = 14,7
zomato_data.serve_to.value_counts().nlargest(10).plot(kind = "barh")
plt.title("Number of restaurants listed in a particular location")
plt.xlabel("Count")
plt.show()

* As expected, most of the restaurants are listed in (deliver to) **BTM Layout**, because this area is home to over 4750 restaurants.
* Even though **Koramangala 7th Block** doesn't have many restaurants, it ranks second in the number of restaurants that deliver to this location.

## Online Order
- Analysing based on availability of online order 

In [None]:
plt.rcParams['figure.figsize'] = 14,7
sns.countplot(x=zomato_data["online_order"], palette = "Set2")
plt.show()

In [None]:
#let's check whether restaurants that accept online orders also offer delivery
plt.rcParams['figure.figsize'] = 14,7
sns.countplot(x=zomato_data["online_order"], palette = "Set2", hue = zomato_data["service_type"])
plt.show()

* As expected, most of the restaurants that provide an online order option also deliver food.
* Many of the Buffet-type restaurants don't provide an online order option.
* Very few Pubs and bars offer online ordering, which makes sense.
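Raw counts in a grouped count plot are dominated by the largest service type, so it can help to look at the share of restaurants offering online orders within each type. The sketch below uses hypothetical rows and relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical rows, not the real dataset.
toy = pd.DataFrame({
    "service_type": ["Delivery", "Delivery", "Delivery",
                     "Buffet", "Buffet", "Pubs and bars"],
    "online_order": [True, True, False, False, False, False],
})

# Fraction of restaurants with online ordering, per service type.
online_share = (
    toy.groupby("service_type")["online_order"]
    .mean()
    .sort_values(ascending=False)
)
print(online_share)
```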

In [None]:
#checking whether online_order impacts rating of the restaurant
sns.countplot(hue = zomato_data["online_order"], palette = "Set1", x = zomato_data["rate"])
plt.title("Distribution of restaurant rating over online order facility")
plt.show()

* **Restaurants that provide an online order facility tend to have higher ratings than traditional restaurants**
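Since the two groups differ in size, comparing bar heights alone can mislead. A per-group summary makes the comparison explicit. This is a sketch on hypothetical values, not a result from the dataset.

```python
import pandas as pd

# Hypothetical ratings for restaurants with and without online ordering.
toy = pd.DataFrame({
    "online_order": [True, True, True, True, False, False, False],
    "rate": [3.8, 3.9, 4.0, 4.1, 3.4, 3.6, 4.3],
})

# Group size, mean and median rating for each group.
by_online = toy.groupby("online_order")["rate"].agg(["count", "mean", "median"])
print(by_online)
```

If the medians are close, the difference seen in the count plot may mostly reflect how many restaurants fall in each group.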

## Booking Table

In [None]:
#rating vs booking table
sns.countplot(hue = zomato_data["book_table"], palette = "Set2", x = zomato_data["rate"])
plt.title("Distribution of restaurant rating over booking table facility")
plt.show()

In [None]:
#Use catplot() to combine a countplot() and a FacetGrid. This allows grouping within additional categorical variables
g = sns.catplot(x="book_table", hue="service_type", col="online_order", data=zomato_data, kind="count")

* Most of the highly rated restaurants (rating above 4.0) offer table booking.
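The statement above is a conditional share: among restaurants rated above 4.0, what fraction offers table booking? A minimal sketch on hypothetical rows shows one way to compute it and compare it with the remaining restaurants.

```python
import pandas as pd

# Hypothetical rows, not the real dataset.
toy = pd.DataFrame({
    "rate": [3.4, 3.7, 3.9, 4.1, 4.3, 4.5, 4.6, 3.2],
    "book_table": [False, False, False, True, False, True, True, False],
})

high = toy["rate"] > 4.0

# Mean of a boolean column = share of restaurants with table booking.
share_high = toy.loc[high, "book_table"].mean()
share_rest = toy.loc[~high, "book_table"].mean()
print(share_high, share_rest)
```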

## Restaurant Service Type

In [None]:
#check the restaurant service type

zomato_data.service_type.value_counts().plot(kind = "pie", autopct='%.1f%%')
plt.show()

* The majority of restaurants (more than 50%) provide a home delivery option.
* 35% of the restaurants listed on Zomato provide a Dine-out option.

**Does the service type affect the ratings given to a restaurant?**

In [None]:
#ratings vs service type
sns.boxplot(x="service_type", y="rate", data = zomato_data)
plt.show()

In [None]:
#let's plot a violin plot for a better understanding of rating vs service type

sns.violinplot(x = "service_type", y = "rate",data = zomato_data,palette="rainbow")
plt.show()

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

* **Restaurants that serve drinks (pubs and bars) have a median rating above 4.5, but from the violin plot we can see that these restaurants receive very few ratings compared to other types of restaurants**
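A violin plot does not show group sizes directly, so a small table of counts and medians per service type is a useful companion. The sketch below uses hypothetical values.

```python
import pandas as pd

# Hypothetical rows: one small group with a high rating.
toy = pd.DataFrame({
    "service_type": ["Delivery"] * 4 + ["Dine-out"] * 3 + ["Pubs and bars"],
    "rate": [3.5, 3.7, 3.9, 4.0, 3.6, 3.8, 4.1, 4.6],
})

# Number of rated restaurants and median rating per service type.
by_type = toy.groupby("service_type")["rate"].agg(["count", "median"])
print(by_type.sort_values("median", ascending=False))
```

A high median based on a small count deserves less weight than the same median in a large group.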

## Biggest Restaurant Chain and Best Restaurant Chain

In [None]:
plt.rcParams['figure.figsize'] = 14,7
plt.subplot(1,2,1)
zomato_data.name.value_counts().head().plot(kind = "barh", color = sns.color_palette("hls", 5))
plt.xlabel("Number of restaurants")
plt.title("Biggest Restaurant Chain (Top 5)")

plt.subplot(1,2,2)
zomato_data[zomato_data['rate']>=4.5]['name'].value_counts().nlargest(5).plot(kind = "barh", color = sns.color_palette("Paired"))
plt.xlabel("Number of restaurants")
plt.title("Best Restaurant Chain (Top 5) - Rating More than 4.5")
plt.tight_layout()

* Cafe Coffee Day & Onesta have the most restaurants across the city.
* Truffles, on the other hand, has the most highly rated restaurants (rating of 4.5 or more).
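Counting rows per name measures how often a name appears, and counting rows with a rating of 4.5 or more favours chains with many listings. An alternative view, sketched here on hypothetical data, is to compute the mean rating per name and keep only names with at least two outlets. The threshold of two is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical listings: two chains and one single-outlet restaurant.
toy = pd.DataFrame({
    "name": ["Chain A"] * 4 + ["Chain B"] * 2 + ["Solo C"],
    "rate": [3.6, 3.8, 3.7, 3.9, 4.6, 4.7, 4.9],
})

# Outlet count and mean rating per restaurant name.
chains = (
    toy.groupby("name")["rate"]
    .agg(["count", "mean"])
    .rename(columns={"count": "outlets", "mean": "mean_rate"})
)

# Keep names with at least two outlets, best rated first.
multi = chains[chains["outlets"] >= 2].sort_values("mean_rate", ascending=False)
print(multi)
```

Because the data was scraped per neighborhood and category, the same outlet may appear in more than one row, so the outlet counts are better read as listing counts.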

## Top Restaurant Type

In [None]:
plt.rcParams['figure.figsize'] = 14,7
zomato_data.rest_type.value_counts().nlargest(10).plot(kind = "barh", color = sns.color_palette("hls", 10))
plt.xlabel("Count")
plt.title("Top Restaurant Type (Top 10)")
plt.show()