# Geospatial and Text Data Analysis in Python - Zomato Case Study

__Build models for segmenting the neighborhoods to find the most conducive locations for starting a cafe business in Toronto City.__

Personal Data Analytics Project <br>
Author: [Diardano Raihan](https://www.linkedin.com/in/diardanoraihan)
<hr>

## Table of Contents
* [Introduction: Business Problem](#Introduction)
* [Data](#Data)
* [Methodology: Analytic Approach](#methodology)
* [Methodology: Exploratory Data Analysis](#analysis)
* [Methodology: Cluster the Neighborhoods](#cluster)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

# Introduction
<hr>

## Background

Zomato, an Indian foodtech unicorn, is a multinational restaurant aggregator and online food delivery company available in 20+ countries. According to [Firstpost](https://www.firstpost.com/india/swiggy-clocks-over-9000-orders-zomato-crosses-7000-orders-per-minute-on-new-years-eve-10250391.html), on 31 December 2021 Zomato reached a peak of 7,100 orders per minute (about 118 orders per second). Just imagine how much data its customers generate.

## Business Problem
We will analyze the data and extract as many insights as possible so that they can support our future decision-making. What makes this project interesting is that the type of data we have collected is . . . . . So, what type of analysis can we perform on this Zomato data?
- Graphical analysis: which outlets are present in your region, and where exactly are they located?
- What are the most famous dishes of a particular restaurant?
- What are the most popular cuisines in your region?
- What is the relationship between rating and price for Zomato's restaurant (online order available) and non-restaurant (online order not available)?
- What is the highest-rated restaurant in your region?

## Target Audience
__Entrepreneurs__ who are passionate about opening a restaurant in a city should find this project interesting. It is also aimed at __business owners__ and __stakeholders__ who want to expand their businesses by collaborating with a foodtech company and wonder how data science could be applied to the questions at hand.

# Data
<hr>
    
## Data Requirement and Collection

Data Source: https://www.kaggle.com/datasets/himanshupoddar/zomato-bangalore-restaurants

## Data Preprocessing for Analysis

### Import the Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False

### Read the Data
- Read the first 5 rows of the dataset

In [134]:
df = pd.read_csv('Data/zomato.csv')
print(df.shape)
df.head()

(51717, 17)


Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


- Check the number of non-null entries in each column along with its data type

In [135]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

- Check the shape of the Data

In [136]:
df.shape

(51717, 17)

### Missing Values Observation

- Check the number of missing values

In [137]:
df.isnull().sum()

url                                0
address                            0
name                               0
online_order                       0
book_table                         0
rate                            7775
votes                              0
phone                           1208
location                          21
rest_type                        227
dish_liked                     28078
cuisines                          45
approx_cost(for two people)      346
reviews_list                       0
menu_item                          0
listed_in(type)                    0
listed_in(city)                    0
dtype: int64

In [138]:
# define a blank dictionary
nan_pct = {}
# obtain the number of rows of our dataset
rows = df.shape[0]
for col in df.columns:
    # sum the number of missing values
    nan = df[col].isnull().sum()
    # calculate the percentage of missing values
    pct = 100*nan/rows
    if pct != 0:
        # update the dictionary for any missing value exists
        nan_pct[col] = pct
        
# Sort the dictionary by its value in descending order
nan_pct = dict(sorted(nan_pct.items(), key = lambda x: x[1], reverse=True))

# print the result
for item in nan_pct.items():
    print("Field '{}' has {}% missing values.".format(item[0], np.round(item[1],1)))

Field 'dish_liked' has 54.3% missing values.
Field 'rate' has 15.0% missing values.
Field 'phone' has 2.3% missing values.
Field 'approx_cost(for two people)' has 0.7% missing values.
Field 'rest_type' has 0.4% missing values.
Field 'cuisines' has 0.1% missing values.
Field 'location' has 0.0% missing values.


From this, we can see that 7 features contain missing values. `dish_liked` has the most, with more than 50% missing, followed by `rate` and `phone`.
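
The same summary can also be computed without an explicit loop. The following is a minimal sketch, assuming a small made-up frame `toy` in place of the Zomato data; the helper name `missing_pct` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_pct(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first.

    Columns without any missing value are left out.
    """
    # The mean of a boolean mask is the fraction of True values.
    pct = frame.isnull().mean() * 100
    pct = pct[pct > 0]
    return pct.sort_values(ascending=False).round(1)


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rate": ["4.1/5", np.nan, "NEW", np.nan],
    "phone": [np.nan, "1", "2", "3"],
})

result = missing_pct(toy)
for col, pct in result.items():
    print(f"Field '{col}' has {pct}% missing values.")
```

Returning a sorted `Series` keeps the result available for further filtering, for example to select only the columns above a chosen threshold.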

### Missing Values Treatment

Depending on the problem at hand, you may or may not want to handle the missing values in every affected field. For this project, we handle them one step at a time.

We start with the `rate` field, since we want to know which restaurant is the highest rated in the region.

In [139]:
df.rate.unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', 'NEW', '2.9/5', '3.5/5', nan, '2.6/5', '3.8 /5', '3.4/5',
       '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
       '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
       '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
       '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
       '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
       '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

As you can see, the values are messy, and `NaN` also appears, representing the missing values. Since it would not make sense to replace those missing values with a number (e.g. the mean), we will drop the rows that contain them. We will replace `NEW` and `-` with 0.

We can also strip the `/5` suffix from every value in the `rate` field.

In [140]:
# Drop the row containing missing values in rate field
df.dropna(axis = 0, subset = ['rate'], inplace = True)
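# Replace the non-numerical values with '0', if such a representation is acceptable.
# Assigning the result back avoids an in-place replace on a column accessed via df.rate,
# which newer pandas versions may not propagate to df.
df['rate'] = df['rate'].replace(to_replace=['NEW', '-'], value='0')
# Drop the '/5' suffix, if any, and convert the values to float
df['rate'] = df['rate'].apply(lambda x: float(x.split('/')[0]))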
print(df.shape)
df.head()

(43942, 17)


Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari
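
Replacing `NEW` and `-` with 0 keeps those rows, but the zeros will pull down any average rating computed later. An alternative is to keep such placeholders as missing values. The following is a minimal sketch of that alternative, not the approach used in the cell above; it assumes a toy Series instead of the real column, and the helper name `parse_rate` is hypothetical.

```python
import numpy as np
import pandas as pd


def parse_rate(rate: pd.Series) -> pd.Series:
    """Convert strings such as '4.1/5' or '3.8 /5' to floats.

    Anything that is not a number ('NEW', '-', NaN) becomes NaN.
    """
    # Keep the part before the slash and remove surrounding spaces.
    number = rate.str.split("/").str[0].str.strip()
    # errors="coerce" turns unparseable strings into NaN instead of raising.
    return pd.to_numeric(number, errors="coerce")


# Toy values mimicking the formats seen in the rate field.
toy_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", np.nan])

parsed = parse_rate(toy_rate)
print(parsed)
print("Unrated entries:", parsed.isna().sum())
```

With this variant, aggregations such as `mean()` skip the unrated restaurants by default rather than counting them as zero.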


We can treat the other features that have missing values in a similar way, depending on the type of data that is missing.
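
As an illustration, here is one possible treatment for three of the remaining fields. It is a minimal sketch on made-up rows, not a prescription: the choice of dropping versus filling, the `"Unknown"` placeholder, and the assumption that cost strings may contain a thousands separator (such as `1,200`) are all illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "location": ["Banashankari", np.nan, "Basavanagudi"],
    "rest_type": ["Casual Dining", "Quick Bites", np.nan],
    "approx_cost(for two people)": ["1,200", "800", np.nan],
})

# Very few rows lack a location, so dropping them loses little data.
cleaned = toy.dropna(subset=["location"]).copy()

# For a categorical field, an explicit placeholder keeps the row.
cleaned["rest_type"] = cleaned["rest_type"].fillna("Unknown")

# Assumption: costs are strings that may contain a thousands separator.
cost = cleaned["approx_cost(for two people)"].str.replace(",", "", regex=False)
cleaned["approx_cost(for two people)"] = pd.to_numeric(cost, errors="coerce")

print(cleaned)
```

Here the missing cost is left as `NaN`, so it can be imputed or dropped later once the analysis that needs it is known.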

# Methodology

## Analytic Approach

## Exploratory Data Analysis