# THE BATTLE OF NEIGHBORHOODS

### Applied Data Science Capstone by IBM/Coursera

## INTRODUCTION

The purpose of this project is to study the neighborhoods in Mumbai to determine possible locations for opening a restaurant. This project can be useful for business owners and entrepreneurs who are looking to invest in a restaurant in Mumbai. The main objective of this project is to carefully analyze appropriate data and find recommendations for the stakeholders.

Mumbai's multicultural character has brought cuisines from all over the world to the city. People in India generally love food and enjoy trying different cuisines and flavors.

This project aims to give business owners and entrepreneurs who want to invest in a restaurant in Mumbai a comparative analysis of neighborhoods, so that they can identify the most suitable one.

### Problem to Solve

The main purpose of this project is to suggest a suitable neighborhood in Mumbai for entrepreneurs who are looking to open a restaurant. It will help determine the best neighborhood for a particular type of cuisine.

### The Location

Mumbai is the financial capital of India and one of the most densely populated cities in the world. It lies on the west coast of India and attracts heavy tourism from all over the globe every year. It is one of the major hubs of the world and is extremely diverse, with people of various ethnicities residing here.

### Foursquare API

This project uses the Foursquare API as its primary data source, as it has a database of millions of places. In particular, its Places API supports location search, location sharing, and details about businesses.

### Clustering Approach

We will use the K-Means clustering algorithm to cluster the neighborhood data. K-Means is an unsupervised machine learning method.

### Libraries used

1. Pandas
2. Numpy
3. Matplotlib
4. Seaborn
5. Scikit Learn
6. Geopy
7. Geocoder
8. Folium
9. Json
10. Requests

## Dataset Description

The data has been collected from various sources.

### Neighborhood Data

The data of the neighborhoods in Mumbai was scraped from Wikipedia. The data is read into a pandas data frame using the read_html() method. The main reason for doing so is that the Wikipedia page provides a comprehensive and detailed table of the data which can easily be scraped using the read_html() method of pandas.

Data Link: https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai

### Geographical Coordinates

The geographical coordinates of Mumbai were obtained with the GeoPy library in Python and are used to center the map of Mumbai drawn with the Folium library. The geocoder library was used to obtain the latitude and longitude of each neighborhood in Mumbai. These coordinates are used to check the accuracy of the coordinates given on Wikipedia, and they replace the Wikipedia values in our data frame wherever the absolute difference is more than 0.001. The resulting coordinates are then used for plotting with Folium.
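To get a feel for the 0.001 threshold, it can be converted into a distance. The sketch below assumes a spherical Earth with a mean radius of 6,371 km and an approximate latitude of 19.08° for Mumbai. The variable names are illustrative and not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000   # assumed mean Earth radius (spherical approximation)
THRESHOLD_DEG = 0.001        # replacement threshold used in this project
MUMBAI_LAT_DEG = 19.08       # approximate latitude of Mumbai

# Length of one degree of latitude on a sphere, in metres
metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360

# A degree of longitude shrinks with the cosine of the latitude
lat_threshold_m = THRESHOLD_DEG * metres_per_degree
lon_threshold_m = lat_threshold_m * math.cos(math.radians(MUMBAI_LAT_DEG))

print(round(lat_threshold_m), round(lon_threshold_m))
```

Under these assumptions, the threshold corresponds to a little over 100 m in either direction.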

### Venue Data

The venue data has been extracted using the Foursquare API. This data contains venue recommendations for all neighborhoods in Mumbai and is used to study the popular venues of different neighborhoods.

### Importing Required Libraries

In [2]:
!pip install geopy
!pip install geocoder
!pip install folium

import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim
import geocoder
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from sklearn.metrics import silhouette_score

%matplotlib notebook
print("All Libraries imported")

All Libraries imported


### Data Retrieval

Scraping data from https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai and reading it into a dataframe.

In [4]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai')[-1]
df.rename(columns={'Area': 'Neighborhood'}, inplace=True)
df.head(10)

Unnamed: 0,Neighborhood,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927
5,Marol,"Andheri,Western Suburbs",19.119219,72.882743
6,Sahar,"Andheri,Western Suburbs",19.098889,72.867222
7,Seven Bungalows,"Andheri,Western Suburbs",19.129052,72.817018
8,Versova,"Andheri,Western Suburbs",19.12,72.82
9,Mira Road,"Mira-Bhayandar,Western Suburbs",19.284167,72.871111


### Data Wrangling

Let's look at the different values present in the Location column.

In [5]:
df['Location'].value_counts()

South Mumbai                       30
Andheri,Western Suburbs             8
Western Suburbs                     6
Eastern Suburbs                     4
Bandra,Western Suburbs              3
Mira-Bhayandar,Western Suburbs      3
Powai,Eastern Suburbs               3
Kandivali West,Western Suburbs      3
Ghatkopar,Eastern Suburbs           3
Malad,Western Suburbs               2
Mumbai                              2
Kalbadevi,South Mumbai              2
Khar,Western Suburbs                2
Goregaon,Western Suburbs            2
Harbour Suburbs                     2
Borivali (West),Western Suburbs     2
Vasai,Western Suburbs               2
Fort,South Mumbai                   1
Colaba,South Mumbai                 1
Vile Parle,Western Suburbs          1
Govandi,Harbour Suburbs             1
Antop Hill,South Mumbai             1
Byculla,South Mumbai                1
Sanctacruz,Western Suburbs          1
Tardeo,South Mumbai                 1
Mulund,Eastern Suburbs              1
Kurla,Easter

We can see that many locations appear only once or twice. This is because main locations such as "Western Suburbs" or "South Mumbai" are further subdivided by the area within them. Let's clean the Location column to make it easier to understand.

In [6]:
df['Location'] = df['Location'].apply(lambda x: x.split(',')[-1])
df.head(10)

Unnamed: 0,Neighborhood,Location,Latitude,Longitude
0,Amboli,Western Suburbs,19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,Western Suburbs,19.124085,72.831373
3,Four Bungalows,Western Suburbs,19.124714,72.82721
4,Lokhandwala,Western Suburbs,19.130815,72.82927
5,Marol,Western Suburbs,19.119219,72.882743
6,Sahar,Western Suburbs,19.098889,72.867222
7,Seven Bungalows,Western Suburbs,19.129052,72.817018
8,Versova,Western Suburbs,19.12,72.82
9,Mira Road,Western Suburbs,19.284167,72.871111
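The same clean-up can also be written with pandas' vectorized string methods. The sketch below runs on a small made-up Series rather than the scraped table. The extra `strip()` is an addition that guards against a space after the comma, a case the lambda above does not handle.

```python
import pandas as pd

locations = pd.Series([
    "Andheri,Western Suburbs",
    "Western Suburbs",
    "Kalbadevi, South Mumbai",   # hypothetical value with a space after the comma
])

# Keep only the text after the last comma, then trim surrounding whitespace
cleaned = locations.str.split(",").str[-1].str.strip()
print(cleaned.tolist())
```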


Now let's look again at the values in the Location column.

In [7]:
df['Location'].value_counts()

South Mumbai       39
Western Suburbs    36
Eastern Suburbs    12
Harbour Suburbs     4
Mumbai              2
Name: Location, dtype: int64

Now that the data is much easier to interpret, let's display the dataframe.

In [8]:
df

Unnamed: 0,Neighborhood,Location,Latitude,Longitude
0,Amboli,Western Suburbs,19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,Western Suburbs,19.124085,72.831373
3,Four Bungalows,Western Suburbs,19.124714,72.827210
4,Lokhandwala,Western Suburbs,19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,South Mumbai,18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850


Although the data we gathered contained latitude and longitude information, we can reconfirm these coordinates using Geocoder.

In [9]:
# Columns that will hold the coordinates returned by the geocoder
df['Latitude1'] = None
df['Longitude1'] = None

for i, neigh in enumerate(df['Neighborhood']):
    lat_lng_coords = None

    # Retry until ArcGIS returns coordinates for this neighborhood
    # (note: this loops indefinitely if the service never answers)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Mumbai, India'.format(neigh))
        lat_lng_coords = g.latlng

    latitude, longitude = lat_lng_coords

    df.loc[i, 'Latitude1'] = latitude
    df.loc[i, 'Longitude1'] = longitude

df.head(10)


Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Latitude1,Longitude1
0,Amboli,Western Suburbs,19.1293,72.8434,19.1291,72.8464
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833,19.1084,72.8623
2,D.N. Nagar,Western Suburbs,19.124085,72.831373,19.1251,72.8325
3,Four Bungalows,Western Suburbs,19.124714,72.82721,19.1264,72.8242
4,Lokhandwala,Western Suburbs,19.130815,72.82927,19.1432,72.8249
5,Marol,Western Suburbs,19.119219,72.882743,19.1191,72.8828
6,Sahar,Western Suburbs,19.098889,72.867222,19.1027,72.8626
7,Seven Bungalows,Western Suburbs,19.129052,72.817018,19.1286,72.8212
8,Versova,Western Suburbs,19.12,72.82,19.1377,72.8135
9,Mira Road,Western Suburbs,19.284167,72.871111,19.2656,72.8706


We can create new columns to see the difference between the coordinates obtained from Wikipedia and those obtained from geocoder. We take the absolute difference between these values and store it in our dataframe.

In [10]:
df['Latdiff'] = abs(df['Latitude'] - df['Latitude1'])
df['Longdiff'] = abs(df['Longitude'] - df['Longitude1'])
df.head(10)

Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Latitude1,Longitude1,Latdiff,Longdiff
0,Amboli,Western Suburbs,19.1293,72.8434,19.1291,72.8464,0.00024,0.00304
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833,19.1084,72.8623,0.003028,0.001497
2,D.N. Nagar,Western Suburbs,19.124085,72.831373,19.1251,72.8325,0.000965,0.001107
3,Four Bungalows,Western Suburbs,19.124714,72.82721,19.1264,72.8242,0.001666,0.00301
4,Lokhandwala,Western Suburbs,19.130815,72.82927,19.1432,72.8249,0.012345,0.0044
5,Marol,Western Suburbs,19.119219,72.882743,19.1191,72.8828,0.000169,6.7e-05
6,Sahar,Western Suburbs,19.098889,72.867222,19.1027,72.8626,0.00377822,0.00462255
7,Seven Bungalows,Western Suburbs,19.129052,72.817018,19.1286,72.8212,0.000492,0.004162
8,Versova,Western Suburbs,19.12,72.82,19.1377,72.8135,0.01769,0.00652
9,Mira Road,Western Suburbs,19.284167,72.871111,19.2656,72.8706,0.0185438,0.000467611


We can see that the latitudes and longitudes from Wikipedia and geocoder are very similar, yet there are some differences. We will replace a Wikipedia value with the one obtained from geocoder if the absolute difference is more than 0.001.

In [11]:
df.loc[df.Latdiff>0.001, 'Latitude'] = df.loc[df.Latdiff>0.001, 'Latitude1']
df.loc[df.Longdiff>0.001, 'Longitude'] = df.loc[df.Longdiff>0.001, 'Longitude1']
df.head(10)


Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Latitude1,Longitude1,Latdiff,Longdiff
0,Amboli,Western Suburbs,19.1293,72.8464,19.1291,72.8464,0.00024,0.00304
1,"Chakala, Andheri",Western Suburbs,19.1084,72.8623,19.1084,72.8623,0.003028,0.001497
2,D.N. Nagar,Western Suburbs,19.1241,72.8325,19.1251,72.8325,0.000965,0.001107
3,Four Bungalows,Western Suburbs,19.1264,72.8242,19.1264,72.8242,0.001666,0.00301
4,Lokhandwala,Western Suburbs,19.1432,72.8249,19.1432,72.8249,0.012345,0.0044
5,Marol,Western Suburbs,19.1192,72.8827,19.1191,72.8828,0.000169,6.7e-05
6,Sahar,Western Suburbs,19.1027,72.8626,19.1027,72.8626,0.00377822,0.00462255
7,Seven Bungalows,Western Suburbs,19.1291,72.8212,19.1286,72.8212,0.000492,0.004162
8,Versova,Western Suburbs,19.1377,72.8135,19.1377,72.8135,0.01769,0.00652
9,Mira Road,Western Suburbs,19.2656,72.8711,19.2656,72.8706,0.0185438,0.000467611
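The replacement rule can be checked on a two-row example. The sketch below uses the Wikipedia and geocoder latitudes shown earlier for Amboli and Lokhandwala, and it swaps in `Series.mask` as an alternative to the `df.loc` assignment. The `toy` frame and `diff` variable are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Latitude":  [19.1293, 19.130815],   # Wikipedia values (Amboli, Lokhandwala)
    "Latitude1": [19.1291, 19.1432],     # geocoder values
})

# Absolute difference between the two sources
diff = (toy["Latitude"] - toy["Latitude1"]).abs()

# Replace the Wikipedia value only where the difference exceeds the threshold
toy["Latitude"] = toy["Latitude"].mask(diff > 0.001, toy["Latitude1"])
print(toy["Latitude"].tolist())
```

Amboli differs by about 0.0002 and keeps its Wikipedia latitude, while Lokhandwala differs by about 0.012 and takes the geocoder value.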


To confirm that the values have actually been replaced, we can use the where method. Rows shown as NaN are those whose latitude was not replaced.

In [12]:
df.where(df['Latitude']==df['Latitude1'])

Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Latitude1,Longitude1,Latdiff,Longdiff
0,,,,,,,,
1,"Chakala, Andheri",Western Suburbs,19.1084,72.8623,19.1084,72.8623,0.003028,0.001497
2,,,,,,,,
3,Four Bungalows,Western Suburbs,19.1264,72.8242,19.1264,72.8242,0.001666,0.00301
4,Lokhandwala,Western Suburbs,19.1432,72.8249,19.1432,72.8249,0.012345,0.0044
...,...,...,...,...,...,...,...,...
88,Parel,South Mumbai,18.9957,72.84,18.9957,72.8391,0.0057,0.00087
89,Gowalia Tank,South Mumbai,18.9645,72.8112,18.9645,72.8112,0.00201,0.001467
90,Dava Bazaar,South Mumbai,19.1314,72.927,19.1314,72.927,0.184518,0.095598
91,Dharavi,Mumbai,19.0467,72.8546,19.0467,72.8546,0.006532,0.00376


We can do the same for the Longitude column.

In [13]:
df.where(df['Longitude']==df['Longitude1'])

Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Latitude1,Longitude1,Latdiff,Longdiff
0,Amboli,Western Suburbs,19.1293,72.8464,19.1291,72.8464,0.00024,0.00304
1,"Chakala, Andheri",Western Suburbs,19.1084,72.8623,19.1084,72.8623,0.003028,0.001497
2,D.N. Nagar,Western Suburbs,19.1241,72.8325,19.1251,72.8325,0.000965,0.001107
3,Four Bungalows,Western Suburbs,19.1264,72.8242,19.1264,72.8242,0.001666,0.00301
4,Lokhandwala,Western Suburbs,19.1432,72.8249,19.1432,72.8249,0.012345,0.0044
...,...,...,...,...,...,...,...,...
88,,,,,,,,
89,Gowalia Tank,South Mumbai,18.9645,72.8112,18.9645,72.8112,0.00201,0.001467
90,Dava Bazaar,South Mumbai,19.1314,72.927,19.1314,72.927,0.184518,0.095598
91,Dharavi,Mumbai,19.0467,72.8546,19.0467,72.8546,0.006532,0.00376


Now that we have the data, we can drop the columns that are no longer useful.

In [14]:
df.drop(['Latitude1', 'Longitude1', 'Latdiff', 'Longdiff'], axis=1, inplace=True)
df.head(10)

Unnamed: 0,Neighborhood,Location,Latitude,Longitude
0,Amboli,Western Suburbs,19.1293,72.8464
1,"Chakala, Andheri",Western Suburbs,19.1084,72.8623
2,D.N. Nagar,Western Suburbs,19.1241,72.8325
3,Four Bungalows,Western Suburbs,19.1264,72.8242
4,Lokhandwala,Western Suburbs,19.1432,72.8249
5,Marol,Western Suburbs,19.1192,72.8827
6,Sahar,Western Suburbs,19.1027,72.8626
7,Seven Bungalows,Western Suburbs,19.1291,72.8212
8,Versova,Western Suburbs,19.1377,72.8135
9,Mira Road,Western Suburbs,19.2656,72.8711


### Data Visualization

To understand our data better, we can see how many neighborhoods are in each location.

In [15]:
neighborhoods_mumbai = df.groupby('Location')['Neighborhood'].nunique()
neighborhoods_mumbai

Location
Eastern Suburbs    12
Harbour Suburbs     4
Mumbai              2
South Mumbai       39
Western Suburbs    36
Name: Neighborhood, dtype: int64

In [16]:
fig = plt.figure(figsize=(12,8))

ax = neighborhoods_mumbai.plot(kind='barh', color='lightcoral')
ax.set_title('Number of Neighborhoods Grouped by Location in Mumbai', fontsize=20)
ax.set_xlabel('Number of Neighborhoods', fontsize=15)
ax.set_ylabel('Location', fontsize=15)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(which='major', left=False)

fig.tight_layout()

[Figure: horizontal bar chart "Number of Neighborhoods Grouped by Location in Mumbai"; x-axis: Number of Neighborhoods, y-axis: Location]

Clearly, South Mumbai and the Western Suburbs have the most neighborhoods. Notice that one of the locations is Mumbai itself. This is because the neighborhoods in this group are located on the outskirts of Mumbai and have therefore been grouped simply as Mumbai.


Now let's visualize the neighborhoods on a map using Folium. First we obtain the geographical coordinates of Mumbai using GeoPy.

In [19]:
address = 'Mumbai, IN'
geolocator = Nominatim(user_agent="http")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai are 19.0759899, 72.8773928.


Now we can plot the map.

In [20]:

map_mum = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, location, neighborhood in zip(df['Latitude'], df['Longitude'], df['Location'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, location)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mum)  
    
map_mum

### Using Foursquare API

Now we can start working with the Foursquare API to obtain venue recommendations.

Let's create the Foursquare credentials first.

In [28]:
# The code was removed by Watson Studio for sharing.
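The hidden cell presumably defines the three names used in the request URL further down: `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`. A minimal sketch of one way to define them without hard-coding secrets follows. The environment variable names, the placeholder fallbacks and the version date are assumptions, not values from the original notebook.

```python
import os

# Read credentials from environment variables (hypothetical variable names);
# the fallbacks are placeholders and will not authenticate against the API.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "your-client-id")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "your-client-secret")

# The v parameter of the v2 API is a date in YYYYMMDD form; this date is a placeholder.
VERSION = "20180605"
```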

In [29]:
neighborhood_name = df.loc[0, 'Neighborhood']
neighborhood_lat = df.loc[0, 'Latitude']
neighborhood_long = df.loc[0, 'Longitude']

print("The neighborhood is {} and its geographical coordinates are {} latitude and {} longitude".format(
    neighborhood_name, neighborhood_lat, neighborhood_long))

The neighborhood is Amboli and its geographical coordinates are 19.1293 latitude and 72.84644000000003 longitude


We will now request up to 200 venues within a radius of 1,000 m (1 km) of Amboli. To do this, we start by creating a URL.

In [30]:
LIMIT = 200
radius = 1000

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_lat, 
    neighborhood_long, 
    radius, 
    LIMIT)
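As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is only a sketch: the credential and version values are placeholders, and the coordinates are the Amboli values printed above.

```python
from urllib.parse import urlencode

params = {
    "client_id": "your-client-id",          # placeholder
    "client_secret": "your-client-secret",  # placeholder
    "v": "20180605",                        # placeholder version date
    "ll": "19.1293,72.84644",               # latitude,longitude of Amboli
    "radius": 1000,
    "limit": 200,
}

# safe="," keeps the comma inside the ll parameter unescaped
url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")
print(url)
```

Keeping the parameters in a dictionary makes it harder to mix up their order, which is easy to do with a long positional `format` call.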

We can now use the GET method to get our results.

In [31]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60c38279234a3337add9da08'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Jogeshwari West',
  'headerFullLocation': 'Jogeshwari West, Mumbai',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 23,
  'suggestedBounds': {'ne': {'lat': 19.13830000900001,
    'lng': 72.85594823590122},
   'sw': {'lat': 19.120299990999992, 'lng': 72.83693176409884}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4d10d39b7177b1f7d2c75322',
       'name': 'Cafe Arfa',
       'location': {'address': 'S V Road',
        'crossStreet': 'Andheri West',
        'lat': 19.12893009094341,
        'lng': 72.84714004510111,
        'labeledLatLngs': [{'label'

We will now create a function get_category_type to extract the categories of venues.

In [32]:
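def get_category_type(row):
    # The categories may be stored under 'categories' or 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # Return the name of the first category, or None if the venue has none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']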
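To see what the function returns, it can be called on two hand-made rows. In this sketch plain dictionaries stand in for DataFrame rows, and the function is repeated (catching `KeyError` specifically) so that the block runs on its own. The sample rows are made up.

```python
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Made-up rows: one venue with two categories, one with none
with_category = {'venue.categories': [{'name': 'Indian Restaurant'}, {'name': 'Cafe'}]}
without_category = {'venue.categories': []}

print(get_category_type(with_category))
print(get_category_type(without_category))
```

Only the first listed category is kept, so a venue with several categories is counted under one of them.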

Now we can clean the JSON obtained using the GET method and store our results in a dataframe.

In [33]:

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Cafe Arfa,Indian Restaurant,19.12893,72.84714
1,"5 Spice , Bandra",Chinese Restaurant,19.130421,72.847206
2,Domino's Pizza,Pizza Place,19.131,72.848
3,Jaffer Bhai's Delhi Darbar,Mughlai Restaurant,19.137714,72.845909
4,Narayan Sandwich,Sandwich Place,19.121398,72.85027
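The dotted column names such as `venue.location.lat` come from the way `json_normalize` flattens nested dictionaries. A small sketch with a single item, cut down from the response shown earlier, illustrates this; the `items` list here is a simplified stand-in for the real API payload.

```python
import pandas as pd

# One simplified item in the shape of the explore response
items = [
    {"venue": {"name": "Cafe Arfa",
               "location": {"lat": 19.12893, "lng": 72.84714}}},
]

# Nested keys are joined with '.' to form column names
flat = pd.json_normalize(items)
print(sorted(flat.columns))

# Keeping only the last part of each name gives the short column labels
short_names = [col.split(".")[-1] for col in flat.columns]
print(sorted(short_names))
```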



We can check how many venues were returned by Foursquare.

In [34]:
print("{} venues were returned for {} by Foursquare".format(len(nearby_venues), neighborhood_name))

23 venues were returned for Amboli by Foursquare


### Generalizing Foursquare API

Now that we have seen how the API call works and how we can clean our data to get relevant information, we can generalize this procedure to get nearby venues for all neighborhoods by creating the function getNearbyVenues.

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
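The parsing step inside the function can be tried without calling the API. The sketch below pulls it into a hypothetical helper, `rows_from_items`, and also handles a venue with an empty category list, which the list comprehension above would reject with an `IndexError`. The second sample venue is made up.

```python
def rows_from_items(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when the venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

sample_items = [
    {'venue': {'name': 'Cafe Arfa', 'location': {'lat': 19.12893, 'lng': 72.84714},
               'categories': [{'name': 'Indian Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot', 'location': {'lat': 19.13, 'lng': 72.85},
               'categories': []}},   # made-up venue with no category
]

rows = rows_from_items('Amboli', 19.1293, 72.84644, sample_items)
print(rows)
```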

We can now apply this function to get nearby venues for all neighborhoods in Mumbai. We request up to 200 venues within a 1 km radius of each neighborhood, the same settings as before.

In [36]:
mum_venues = getNearbyVenues(names=df['Neighborhood'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=radius)

Amboli
Chakala, Andheri
D.N. Nagar
Four Bungalows
Lokhandwala
Marol
Sahar
Seven Bungalows
Versova
Mira Road
Bhayandar
Uttan
Bandstand Promenade
Kherwadi
Pali Hill
I.C. Colony
Gorai
Dahisar
Aarey Milk Colony
Bangur Nagar
Jogeshwari West
Juhu
Charkop
Poisar
Mahavir Nagar
Thakur village
Pali Naka
Khar Danda
Dindoshi
Sunder Nagar
Kalina
Naigaon
Nalasopara
Virar
Irla
Vile Parle
Bhandup
Amrut Nagar
Asalfa
Pant Nagar
Kanjurmarg
Nehru Nagar
Nahur
Chandivali
Hiranandani Gardens
Indian Institute of Technology Bombay campus
Vidyavihar
Vikhroli
Chembur
Deonar
Mankhurd
Mahul
Agripada
Altamount Road
Bhuleshwar
Breach Candy
Carmichael Road
Cavel
Churchgate
Cotton Green
Cuffe Parade
Cumbala Hill
Currey Road
Dhobitalao
Dongri
Kala Ghoda
Kemps Corner
Lower Parel
Mahalaxmi
Mahim
Malabar Hill
Marine Drive
Marine Lines
Mumbai Central
Nariman Point
Prabhadevi
Sion
Walkeshwar
Worli
C.G.S. colony
Dagdi Chawl
Navy Nagar
Hindu colony
Ballard Estate
Chira Bazaar
Fanas Wadi
Chor Bazaar
Matunga
Parel
Gowalia Tank


Let's see what our dataframe looks like.

In [37]:
print(mum_venues.shape)
mum_venues.head(10)

(3228, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Amboli,19.1293,72.84644,Cafe Arfa,19.12893,72.84714,Indian Restaurant
1,Amboli,19.1293,72.84644,"5 Spice , Bandra",19.130421,72.847206,Chinese Restaurant
2,Amboli,19.1293,72.84644,Domino's Pizza,19.131,72.848,Pizza Place
3,Amboli,19.1293,72.84644,Jaffer Bhai's Delhi Darbar,19.137714,72.845909,Mughlai Restaurant
4,Amboli,19.1293,72.84644,Narayan Sandwich,19.121398,72.85027,Sandwich Place
5,Amboli,19.1293,72.84644,Persia Darbar,19.136952,72.846822,Indian Restaurant
6,Amboli,19.1293,72.84644,Garden Court,19.127188,72.837478,Indian Restaurant
7,Amboli,19.1293,72.84644,Subway,19.12786,72.844461,Sandwich Place
8,Amboli,19.1293,72.84644,Kamal Chhaya Bar,19.128245,72.83761,Bar
9,Amboli,19.1293,72.84644,Domino's Pizza,19.13,72.837,Pizza Place


Let's see how many venues were returned for each neighborhood.

In [38]:
mum_venues.groupby('Neighborhood', as_index=False).count()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Aarey Milk Colony,12,12,12,12,12,12
1,Agripada,24,24,24,24,24,24
2,Altamount Road,65,65,65,65,65,65
3,Amboli,23,23,23,23,23,23
4,Amrut Nagar,14,14,14,14,14,14
...,...,...,...,...,...,...,...
88,Vikhroli,5,5,5,5,5,5
89,Vile Parle,57,57,57,57,57,57
90,Virar,13,13,13,13,13,13
91,Walkeshwar,9,9,9,9,9,9


We can now check how many unique categories there are in our data.

In [39]:
print("There are {} unique categories".format(mum_venues['Venue Category'].nunique()))

There are 213 unique categories


### Analyzing each neighborhood

We can start analyzing each neighborhood by one-hot encoding the venue categories to see which categories occur in which neighborhoods.

In [40]:
mum_onehot = pd.get_dummies(mum_venues[['Venue Category']], prefix="", prefix_sep="")
mum_onehot.head()

Unnamed: 0,Accessories Store,Airport Lounge,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Waterfront,Whisky Bar,Wine Bar,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighborhood column to the one-hot encoded dataframe.

In [41]:
mum_onehot['Neighborhood'] = mum_venues['Neighborhood']
mum_onehot.head()

Unnamed: 0,Accessories Store,Airport Lounge,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Waterfront,Whisky Bar,Wine Bar,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Moving the Neighborhood column to the first column.

In [42]:
temp = list(mum_onehot.columns)

if 'Neighborhood' in temp:
    temp.remove('Neighborhood')
    
fixed_columns = ['Neighborhood'] + temp
mum_onehot = mum_onehot[fixed_columns]

mum_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport Lounge,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Waterfront,Whisky Bar,Wine Bar,Women's Store,Yoga Studio,Zoo
0,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we can group by neighborhood and take the mean of each category.

In [43]:
mum_grouped = mum_onehot.groupby('Neighborhood', sort=False).mean().reset_index()
print(mum_grouped.shape)
mum_grouped.head(10)

(93, 213)


Unnamed: 0,Neighborhood,Accessories Store,Airport Lounge,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Waterfront,Whisky Bar,Wine Bar,Women's Store,Yoga Studio,Zoo
0,Amboli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Chakala, Andheri",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0
2,D.N. Nagar,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.046512,0.0,0.0,0.0,0.023256,0.0,0.0
3,Four Bungalows,0.019608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.039216,0.0,0.0,0.0,0.019608,0.0,0.0
4,Lokhandwala,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.011628,0.0,0.0,0.0,0.0,0.0,0.0
5,Marol,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Sahar,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Seven Bungalows,0.019231,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.019231,0.0,0.0
8,Versova,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Mira Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
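Taking the mean of one-hot columns gives the share of a neighborhood's venues that fall into each category. The toy example below, with two made-up neighborhoods and two categories, shows this end to end.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood":   ["A", "A", "A", "B"],
    "Venue Category": ["Bar", "Bar", "Cafe", "Cafe"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of the 0/1 indicators = fraction of venues in each category
grouped = onehot.groupby("Neighborhood").mean()
print(grouped)
```

Neighborhood A has two bars out of three venues, so its Bar value is about 0.67; neighborhood B has only a cafe, so its Cafe value is 1.0.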


In order to further understand the data, we can display the top 5 venues of all neighborhoods.

In [44]:

num_top_venues = 5

for hood in mum_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = mum_grouped[mum_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Amboli----
               venue  freq
0  Indian Restaurant  0.13
1        Pizza Place  0.09
2                Bar  0.09
3        Coffee Shop  0.09
4   Asian Restaurant  0.09


----Chakala, Andheri----
                           venue  freq
0                          Hotel  0.24
1              Indian Restaurant  0.18
2                           Café  0.09
3                    Pizza Place  0.06
4  Vegetarian / Vegan Restaurant  0.06


----D.N. Nagar----
               venue  freq
0  Indian Restaurant  0.12
1                Pub  0.09
2        Pizza Place  0.09
3                Bar  0.09
4        Coffee Shop  0.07


----Four Bungalows----
                venue  freq
0                 Pub  0.08
1  Seafood Restaurant  0.06
2   Indian Restaurant  0.06
3  Chinese Restaurant  0.06
4         Pizza Place  0.04


----Lokhandwala----
                  venue  freq
0     Indian Restaurant  0.14
1                  Café  0.09
2           Coffee Shop  0.07
3    Chinese Restaurant  0.06
4  Fast Food R

Let's now create a dataframe with the 10 most common venues for each neighborhood.

In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
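As a quick check of this helper, it can be applied to a single hand-made row. The function is repeated so the sketch is self-contained, and the frequencies are invented for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]   # skip the Neighborhood entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up row: neighborhood name followed by category frequencies
row = pd.Series({"Neighborhood": "A", "Bar": 0.2, "Cafe": 0.5, "Hotel": 0.3})

top_two = return_most_common_venues(row, 2)
print(list(top_two))
```

The function returns category names, not frequencies, ordered from most to least common.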

In [46]:

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Ranks beyond 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = mum_grouped['Neighborhood']

for ind in np.arange(mum_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mum_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amboli,Indian Restaurant,Asian Restaurant,Bar,Sandwich Place,Coffee Shop,Pizza Place,Athletics & Sports,Metro Station,Chinese Restaurant,Mughlai Restaurant
1,"Chakala, Andheri",Hotel,Indian Restaurant,Café,Restaurant,Vegetarian / Vegan Restaurant,Pizza Place,Asian Restaurant,Multiplex,Burger Joint,Spa
2,D.N. Nagar,Indian Restaurant,Pizza Place,Bar,Pub,Coffee Shop,Vegetarian / Vegan Restaurant,Snack Place,Falafel Restaurant,Sports Club,Department Store
3,Four Bungalows,Pub,Chinese Restaurant,Indian Restaurant,Seafood Restaurant,Ice Cream Shop,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bar,Lounge,Pizza Place
4,Lokhandwala,Indian Restaurant,Café,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Bar,Gym / Fitness Center,Pub,Lounge,Italian Restaurant
...,...,...,...,...,...,...,...,...,...,...,...
88,Parel,Indian Restaurant,Coffee Shop,Maharashtrian Restaurant,Cafeteria,Sporting Goods Shop,Restaurant,Plaza,Playground,Chinese Restaurant,Bar
89,Gowalia Tank,Indian Restaurant,Fast Food Restaurant,Bakery,Café,Coffee Shop,Sandwich Place,Snack Place,Dessert Shop,Vegetarian / Vegan Restaurant,Pizza Place
90,Dava Bazaar,Train Station,Indian Restaurant,Café,Fish Market,Asian Restaurant,Beer Garden,Multiplex,Coffee Shop,Hotel,Falafel Restaurant
91,Dharavi,Indian Restaurant,Café,Coffee Shop,Music Venue,Sandwich Place,Seafood Restaurant,Shoe Store,Fast Food Restaurant,Lake,Juice Bar


### Clustering Neighborhoods

Now we can use the K-Means clustering method to cluster the neighborhoods.

First we need to determine how many clusters to use. This will be done using the Silhouette Score.

We will define a function to plot the silhouette scores calculated for different numbers of clusters.
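
For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The score reported for a clustering is the mean over all points. Below is a minimal pure-Python sketch of that definition on hypothetical toy points; the notebook itself relies on scikit-learn's `silhouette_score`, and the helper name here is an illustration only.

```python
from math import dist

def silhouette_values(points, labels):
    """Silhouette coefficient of each point, using Euclidean distance."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:
            values.append(0.0)  # usual convention for a single-point cluster
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

# Hypothetical toy data: two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_values(points, [0, 1, 0, 1])   # pairs split apart

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(round(mean_good, 3), round(mean_bad, 3))
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters; negative values suggest points assigned to the wrong cluster.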

In [47]:
def plot(x, y):
    fig = plt.figure(figsize=(12,6))
    plt.plot(x, y, 'o-')
    plt.xlabel('Number of clusters')
    plt.ylabel('Silhouette Scores')
    plt.title('Checking Optimum Number of Clusters')
    ax = plt.gca()
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

In [48]:
maxk = 15
scores = []
kval = []

cl_df = mum_grouped.drop('Neighborhood', axis=1)  # feature matrix, identical for every k

for k in range(2, maxk+1):
    # fit_predict returns the cluster label assigned to each neighborhood
    labels = KMeans(n_clusters=k, init="k-means++", random_state=40).fit_predict(cl_df)  # any fixed random_state works
    
    score = silhouette_score(cl_df, labels, metric='euclidean', random_state=0)
    kval.append(k)
    scores.append(score)

We can now display the scores for the different numbers of clusters and plot them.

In [49]:
print(scores)
print(kval)
plot(kval, scores)

[0.07088153622692314, 0.06823348459316712, 0.07648232156835003, 0.07580910881641537, 0.07648455014123968, 0.07889866614163227, 0.04209972092658602, 0.04376625876233071, 0.06882604202780437, 0.073351501029334, 0.0627733488799858, 0.06544755609164578, 0.051975140078618816, 0.061573000034444755]
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]



The silhouette scores are low (all below 0.08) across the whole range of k-values, which means that the clusters overlap considerably and are not well separated. However, we will try to cluster our data as best as we can. For this, we will use 5 clusters for our clustering model: its silhouette score (about 0.076) is close to the highest value observed (about 0.079, at k = 7).
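
The printed scores can also be ranked programmatically. This is a minimal sketch that reuses the values printed above, ranks the candidate k-values by score, and measures how far k = 5 sits from the top score. The variable names `ranked` and `gap_to_k5` are illustrative.

```python
# Silhouette scores copied from the output above, for k = 2 .. 15.
scores = [0.07088153622692314, 0.06823348459316712, 0.07648232156835003,
          0.07580910881641537, 0.07648455014123968, 0.07889866614163227,
          0.04209972092658602, 0.04376625876233071, 0.06882604202780437,
          0.073351501029334, 0.0627733488799858, 0.06544755609164578,
          0.051975140078618816, 0.061573000034444755]
kval = list(range(2, 16))

# Rank (k, score) pairs from best to worst score.
ranked = sorted(zip(kval, scores), key=lambda pair: pair[1], reverse=True)
best_k, best_score = ranked[0]

# Difference between the best score and the score at k = 5.
gap_to_k5 = best_score - scores[kval.index(5)]

for k, s in ranked[:4]:
    print(f"k={k}: {s:.4f}")
print(f"gap between best k ({best_k}) and k=5: {gap_to_k5:.4f}")
```

With scores this close together, the ranking among the top few k-values is sensitive to the `random_state` used for K-Means, so small differences should not be over-interpreted.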

In [50]:
k = 5

mum_clustering = mum_grouped.drop('Neighborhood', axis=1)
kmeans = KMeans(n_clusters=k, init="k-means++", random_state=40).fit(mum_clustering) #Can choose any random_state

kmeans.labels_

array([2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,
       2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 0, 1, 2, 1, 0, 2, 0, 1,
       2, 1, 1, 2, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 1, 0, 2, 2, 1, 1, 1, 2,
       2, 2, 1, 1, 3, 1, 1, 2, 2, 2, 1, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
       1, 2, 0, 2, 2], dtype=int32)
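
Counting how many neighborhoods fall under each label gives a quick sense of the cluster sizes. Below is a minimal sketch using only the standard library, with the labels copied from the output above. Label n corresponds to "Cluster n+1" in the section headings further down.

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [
    2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,
    2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 0, 1, 2, 1, 0, 2, 0, 1,
    2, 1, 1, 2, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 1, 0, 2, 2, 1, 1, 1, 2,
    2, 2, 1, 1, 3, 1, 1, 2, 2, 2, 1, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
    1, 2, 0, 2, 2,
]

sizes = Counter(labels)
for label in sorted(sizes):
    print(f"label {label} (Cluster {label + 1}): {sizes[label]} neighborhoods")
```

Two of the labels cover most of the neighborhoods, while the remaining three are small groups, which is worth keeping in mind when comparing clusters.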

Now we can create a new dataframe that includes cluster labels and the top 10 venues.

In [51]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
mum_merged = df
mum_merged = mum_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
print(mum_merged.shape)
mum_merged

(93, 15)


Unnamed: 0,Neighborhood,Location,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amboli,Western Suburbs,19.1293,72.8464,2,Indian Restaurant,Asian Restaurant,Bar,Sandwich Place,Coffee Shop,Pizza Place,Athletics & Sports,Metro Station,Chinese Restaurant,Mughlai Restaurant
1,"Chakala, Andheri",Western Suburbs,19.1084,72.8623,1,Hotel,Indian Restaurant,Café,Restaurant,Vegetarian / Vegan Restaurant,Pizza Place,Asian Restaurant,Multiplex,Burger Joint,Spa
2,D.N. Nagar,Western Suburbs,19.1241,72.8325,2,Indian Restaurant,Pizza Place,Bar,Pub,Coffee Shop,Vegetarian / Vegan Restaurant,Snack Place,Falafel Restaurant,Sports Club,Department Store
3,Four Bungalows,Western Suburbs,19.1264,72.8242,2,Pub,Chinese Restaurant,Indian Restaurant,Seafood Restaurant,Ice Cream Shop,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bar,Lounge,Pizza Place
4,Lokhandwala,Western Suburbs,19.1432,72.8249,2,Indian Restaurant,Café,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Bar,Gym / Fitness Center,Pub,Lounge,Italian Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,Parel,South Mumbai,18.9957,72.84,1,Indian Restaurant,Coffee Shop,Maharashtrian Restaurant,Cafeteria,Sporting Goods Shop,Restaurant,Plaza,Playground,Chinese Restaurant,Bar
89,Gowalia Tank,South Mumbai,18.9645,72.8112,2,Indian Restaurant,Fast Food Restaurant,Bakery,Café,Coffee Shop,Sandwich Place,Snack Place,Dessert Shop,Vegetarian / Vegan Restaurant,Pizza Place
90,Dava Bazaar,South Mumbai,19.1314,72.927,0,Train Station,Indian Restaurant,Café,Fish Market,Asian Restaurant,Beer Garden,Multiplex,Coffee Shop,Hotel,Falafel Restaurant
91,Dharavi,Mumbai,19.0467,72.8546,2,Indian Restaurant,Café,Coffee Shop,Music Venue,Sandwich Place,Seafood Restaurant,Shoe Store,Fast Food Restaurant,Lake,Juice Bar
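
The join used above matches rows by neighborhood name. The sketch below uses hypothetical toy frames (the names and values are made up) and shows one thing to watch for: a neighborhood with no row on the right-hand side ends up with a missing cluster label, which also turns the label column into floats.

```python
import pandas as pd

# Hypothetical coordinates table: three neighborhoods.
coords = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [19.10, 19.12, 19.14],
})

# Hypothetical clustered table: only two of them returned venues.
venues = pd.DataFrame({
    "Cluster Labels": [1, 0],
    "Neighborhood": ["A", "B"],
    "1st Most Common Venue": ["Café", "Bar"],
})

# Same pattern as above: left join on the neighborhood name.
merged = coords.join(venues.set_index("Neighborhood"), on="Neighborhood")
unmatched = merged["Cluster Labels"].isna()
print(merged[unmatched]["Neighborhood"].tolist())  # ['C']

# Drop unmatched rows and restore integer labels before plotting.
merged_clean = (
    merged.dropna(subset=["Cluster Labels"])
          .astype({"Cluster Labels": int})
)
print(merged_clean.shape)  # (2, 4)
```

If every neighborhood has venues, as the output above suggests, the cleanup step is a no-op.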


In [52]:

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(mum_merged['Latitude'], mum_merged['Longitude'], mum_merged['Neighborhood'], mum_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..k-1, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We can now view the neighborhoods in each cluster and their top 10 most common venues.

### Cluster 1

In [53]:
mum_merged.loc[mum_merged['Cluster Labels'] == 0, mum_merged.columns[[0] + [1] + list(range(5, mum_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,Bhandup,Eastern Suburbs,Train Station,Fast Food Restaurant,Indian Restaurant,Asian Restaurant,Bakery,Zoo,Dhaba,Fish & Chips Shop,Field,Farmers Market
40,Kanjurmarg,Eastern Suburbs,Train Station,Multiplex,Hotel,Gym,Chinese Restaurant,Asian Restaurant,Gift Shop,Zoo,Donut Shop,Dim Sum Restaurant
42,Nahur,Eastern Suburbs,Coffee Shop,Train Station,Bus Station,Restaurant,Pub,Indian Restaurant,Fast Food Restaurant,Pizza Place,Convention Center,Community Center
50,Mankhurd,Harbour Suburbs,Coffee Shop,Train Station,Sports Bar,Bus Station,Zoo,Dhaba,Field,Fast Food Restaurant,Farmers Market,Falafel Restaurant
59,Cotton Green,South Mumbai,Plaza,Fast Food Restaurant,Pizza Place,Snack Place,Vegetarian / Vegan Restaurant,Train Station,Zoo,Donut Shop,Dhaba,Dim Sum Restaurant
90,Dava Bazaar,South Mumbai,Train Station,Indian Restaurant,Café,Fish Market,Asian Restaurant,Beer Garden,Multiplex,Coffee Shop,Hotel,Falafel Restaurant


### Cluster 2

In [54]:

mum_merged.loc[mum_merged['Cluster Labels'] == 1, mum_merged.columns[[0] + [1] + list(range(5, mum_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Chakala, Andheri",Western Suburbs,Hotel,Indian Restaurant,Café,Restaurant,Vegetarian / Vegan Restaurant,Pizza Place,Asian Restaurant,Multiplex,Burger Joint,Spa
5,Marol,Western Suburbs,Indian Restaurant,Chinese Restaurant,Restaurant,Diner,Hotel,Coffee Shop,Farmers Market,Ice Cream Shop,Dance Studio,Electronics Store
6,Sahar,Western Suburbs,Hotel,Indian Restaurant,Café,Bakery,Asian Restaurant,Airport Terminal,Hotel Pool,Pizza Place,Italian Restaurant,Coffee Shop
9,Mira Road,Western Suburbs,Indian Restaurant,Mexican Restaurant,Fast Food Restaurant,Garden,Movie Theater,General College & University,Restaurant,Bar,Shopping Mall,Coffee Shop
20,Jogeshwari West,Western Suburbs,Indian Restaurant,Dessert Shop,Chinese Restaurant,Gym,Café,Men's Store,Mughlai Restaurant,Ice Cream Shop,Seafood Restaurant,Clothing Store
21,Juhu,Western Suburbs,Indian Restaurant,Movie Theater,Coffee Shop,Fast Food Restaurant,Vegetarian / Vegan Restaurant,Café,Convention Center,Plaza,Breakfast Spot,Restaurant
23,Poisar,Western Suburbs,Indian Restaurant,Gym / Fitness Center,Train Station,Men's Store,Mexican Restaurant,Fast Food Restaurant,Mobile Phone Shop,Food,Electronics Store,Snack Place
29,Sunder Nagar,Western Suburbs,Indian Restaurant,Coffee Shop,Movie Theater,Fast Food Restaurant,Café,Pizza Place,Breakfast Spot,Bakery,Restaurant,Train Station
31,Naigaon,Western Suburbs,Coffee Shop,Indian Restaurant,Movie Theater,Fast Food Restaurant,Café,Bar,Juice Bar,Breakfast Spot,Flower Shop,Plaza
34,Irla,Western Suburbs,Indian Restaurant,Coffee Shop,Bakery,Pizza Place,Chinese Restaurant,Café,Convenience Store,Playground,Clothing Store,Dessert Shop


### Cluster 3

In [55]:
mum_merged.loc[mum_merged['Cluster Labels'] == 2, mum_merged.columns[[0] + [1] + list(range(5, mum_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amboli,Western Suburbs,Indian Restaurant,Asian Restaurant,Bar,Sandwich Place,Coffee Shop,Pizza Place,Athletics & Sports,Metro Station,Chinese Restaurant,Mughlai Restaurant
2,D.N. Nagar,Western Suburbs,Indian Restaurant,Pizza Place,Bar,Pub,Coffee Shop,Vegetarian / Vegan Restaurant,Snack Place,Falafel Restaurant,Sports Club,Department Store
3,Four Bungalows,Western Suburbs,Pub,Chinese Restaurant,Indian Restaurant,Seafood Restaurant,Ice Cream Shop,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bar,Lounge,Pizza Place
4,Lokhandwala,Western Suburbs,Indian Restaurant,Café,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Bar,Gym / Fitness Center,Pub,Lounge,Italian Restaurant
7,Seven Bungalows,Western Suburbs,Bar,Pub,Café,Seafood Restaurant,Pizza Place,Chinese Restaurant,Vegetarian / Vegan Restaurant,Indian Restaurant,Ice Cream Shop,Coffee Shop
8,Versova,Western Suburbs,Ice Cream Shop,Café,Beach,Restaurant,Coffee Shop,Chinese Restaurant,Sandwich Place,Clothing Store,Dim Sum Restaurant,Pub
10,Bhayandar,Western Suburbs,Ice Cream Shop,Soccer Field,Diner,Lake,Restaurant,Pizza Place,Food Truck,Train Station,Bakery,Indian Restaurant
12,Bandstand Promenade,Western Suburbs,Coffee Shop,Performing Arts Venue,Scenic Lookout,Café,Tea Room,Deli / Bodega,Indian Restaurant,Food Truck,Lounge,Fast Food Restaurant
13,Kherwadi,Western Suburbs,Café,Indian Restaurant,Hookah Bar,Fast Food Restaurant,Seafood Restaurant,Chinese Restaurant,Multiplex,Bar,Italian Restaurant,Diner
14,Pali Hill,Western Suburbs,Indian Restaurant,Bakery,Café,Fast Food Restaurant,Dessert Shop,Bar,Cupcake Shop,Seafood Restaurant,Juice Bar,Asian Restaurant


### Cluster 4

In [56]:
mum_merged.loc[mum_merged['Cluster Labels'] == 3, mum_merged.columns[[0] + [1] + list(range(5, mum_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
70,Malabar Hill,South Mumbai,Coffee Shop,Park,Indian Restaurant,Lighthouse,Dessert Shop,Zoo,Dhaba,Field,Fast Food Restaurant,Farmers Market
77,Walkeshwar,South Mumbai,Park,Indian Restaurant,Food & Drink Shop,Restaurant,Lighthouse,Food Truck,Dessert Shop,Coffee Shop,Event Space,Dhaba


### Cluster 5

In [57]:
mum_merged.loc[mum_merged['Cluster Labels'] == 4, mum_merged.columns[[0] + [1] + list(range(5, mum_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Uttan,Western Suburbs,Skate Park,Whisky Bar,Convenience Store,Indian Restaurant,Beach,Restaurant,Zoo,Dhaba,Fast Food Restaurant,Farmers Market


### Results and Discussion

By analyzing the five clusters obtained, we can see that some of the clusters are better suited for restaurants and hotels than others. Neighborhoods in clusters 1, 4, and 5 contain a small percentage of restaurants, hotels, cafes and pubs in their top 10 most common venues. These clusters contain a higher share of other venues such as train stations, bus stations, fish markets, gyms, parks and lighthouses, to name a few. Thus, they are well suited for opening a new restaurant, as there are fewer competitors. On the other hand, neighborhoods in clusters 2 and 3 contain a much higher share of restaurants, hotels, multiplexes, cafes, bars and other food joints. Thus, the neighborhoods in these clusters would not be as well suited for opening a new restaurant.
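
One way to make the "percentage of food venues" comparison concrete is to count food-related categories in each cluster's top-10 lists. Below is a rough sketch on hypothetical rows. The keyword list that decides what counts as a food venue is an assumption, not part of the original analysis, and the helper names are illustrative.

```python
# Assumed keywords marking a venue category as food-related.
# Substring matching is crude (e.g. "Bar" would also match "Barbershop").
FOOD_KEYWORDS = ("Restaurant", "Café", "Coffee", "Bar", "Pub", "Bakery", "Dhaba")

def is_food_venue(category):
    return any(keyword in category for keyword in FOOD_KEYWORDS)

def food_share_by_cluster(rows):
    """rows: iterable of (cluster_label, list_of_top_venue_categories)."""
    totals, food = {}, {}
    for cluster, venues in rows:
        totals[cluster] = totals.get(cluster, 0) + len(venues)
        food[cluster] = food.get(cluster, 0) + sum(is_food_venue(v) for v in venues)
    return {cluster: food[cluster] / totals[cluster] for cluster in totals}

# Hypothetical top-venue lists (shortened to four entries each).
toy_rows = [
    (1, ["Train Station", "Gym", "Indian Restaurant", "Zoo"]),
    (1, ["Bus Station", "Coffee Shop", "Field", "Plaza"]),
    (2, ["Indian Restaurant", "Café", "Bar", "Hotel"]),
]

shares = food_share_by_cluster(toy_rows)
print(shares)  # {1: 0.25, 2: 0.75}
```

Applied to the real `mum_merged` table, the same idea would give one number per cluster to support the qualitative comparison, subject to how the keyword list is chosen.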

Comparing clusters 2 and 3, neighborhoods in cluster 2 seem to be better suited for starting a restaurant, since they contain a smaller percentage of food joints in their top 10 most common venues than those in cluster 3. The neighborhoods in cluster 3 contain a variety of food joints such as restaurants, tea rooms, bakeries, cafes, steakhouses and pubs, and also very diverse cuisines such as Japanese, Indian, Chinese, Italian and seafood. Most neighborhoods in cluster 2 have Indian Restaurant as their most common venue; however, on careful analysis we can see that they also contain other venues such as soccer fields, flea markets, smoke shops, gyms, train stations, dance studios, music stores, cosmetics shops and so on. Thus, of these two denser clusters, cluster 2 is the better choice for a new restaurant. These neighborhoods can be further plotted on a map as shown below.

### Conclusion

We have analyzed the neighborhoods of Mumbai, India to determine which would be the best for opening a new restaurant. Based on our analysis, neighborhoods in cluster 1 are recommended as locations for the new restaurant. This has also been plotted in the map above. Stakeholders and investors can refine this further by considering other factors such as transport, legal requirements and associated costs. These were outside the scope of this project and thus were not considered.

#### Note: If the maps are not visible, please use the link below

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/58a6a4d5-9a01-4008-800e-c859957b0a40/view?access_token=8674d055beb4d2d77a51a4ba5a9937bbb58e984128ae6e9a6464133b7e70a413

### THANK YOU