<h1><center>Battle of the Neighborhoods</center></h1>
<h2><center>Best Place to Start a Restaurent in Mumbai</center><h2>

<h3><center> Introduction</center></h3>
<p>This project deals with the exploring the neighborhoods of Mumbai. This project is designed for those who are looking to start a restaurant business in Mumbai. Mumbai is a financial capital of India and houses all types of cultures. One can find a restaurant of any cuisine in Mumbai. This project is for those who are looking to open a new hotel or restaurant. 

The Foursquare API is used to get the details of venues in Mumbai. These venues are then clustered based on data received from the API and then analyzed using the K-Mean Clustering and Silhouette Score. 

The target audience of this project are the small-scale hotel owners and restaurant owners who are planning to open branches in Mumbai and its neighborhoods. The project aims to answer the following questions: <p>
<ol>
    <li>What is the best location to open a new Hotel in Mumbai? </li>
    <li>Which place is most suitable for starting a Mall in Mumbai?</li>
</ol>

<p>Install Folium library for plotting maps of geographical locations</p>

In [1]:
!pip install folium

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
[K     |████████████████████████████████| 94 kB 6.3 MB/s  eta 0:00:01
[?25hCollecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


<p>Import Libraries required for the project</p>
<ul>
    <li>Regex</li>
    <li>Json</li>
    <li>Numpy</li>
    <li>pandas</li>
    <li>Beautiful Soup</li>
    <li>Requests</li>
    <li>Scikit Learn</li>
    <li>GeoPy</li>
    <li>Folium</li>
    <li>Matplotlib</li>
</ul>

In [2]:
import re
import json
import requests
import numpy as np
from bs4 import BeautifulSoup

import pandas as pd
#display all rows
pd.set_option('display.max_rows', None)
#display all columns
pd.set_option('display.max_columns', None)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

from geopy.geocoders import Nominatim

import folium

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

print('Libraries imported.')

Libraries imported.


<p>URL of the page to be read</p>
<p>Wikipedia page for List of neighborhoods in Mumbai. Click the link get redirected to the page</p><a href="https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai">List of Neighborhoods in Mumbai</a>

In [3]:
URL = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai"

<p>Send a get request to get the entire HTML page.</p>

In [4]:
data = requests.get(URL).text

<p>Using Beautiful Soup to Extract Data from the page</p>
<p>Create a Beautiful Soup Object</p>

In [5]:
soup = BeautifulSoup(data, "xml")

<p>Create Lists from the Extracted Data</p>
<p>Four List are created</p>
<ul>
    <li>Area</li>
    <li>Location</li>
    <li>Latitude</li>
    <li>Longitude</li>
    
</ul>

In [6]:
Area = []
Location = []
Latitude = []
Longitude = []

td = soup.find_all("td")
for i in range(0, len(td),4):
    Area.append((td[i].text)[0:len((td[i].text))-2])
    Location.append((td[i+1].text)[0:len((td[i+1].text))-2])
    Latitude.append((td[i+2].text)[0:len((td[i+2].text))-2])
    Longitude.append((td[i+3].text)[0:len((td[i+3].text))-2])
# print(len(Area))
# print(len(Location))
# print(len(Latitude))
# print(len(Longitude))


<p>Create a pandas dataframe with the Area, Location, Latitude, Longitude columns</p>

In [7]:
df = pd.DataFrame({
    "Area" : Area,
    "Location" : Location,
    "Latitude" : Latitude,
    "Longitude" : Longitude,
})

<p>Show the first 5 elements of the dataframe to see whether the dataframe was created or not</p>

In [8]:
df.head()

Unnamed: 0,Area,Location,Latitude,Longitude
0,Ambol,"Andheri,Western Suburb",19.129,72.843
1,"Chakala, Andher",Western Suburb,19.11138,72.86083
2,D.N. Naga,"Andheri,Western Suburb",19.12408,72.83137
3,Four Bungalow,"Andheri,Western Suburb",19.12471,72.8272
4,Lokhandwal,"Andheri,Western Suburb",19.13081,72.8292


In [9]:
df.shape

(93, 4)

<p>Get the geographical coordinated of Mumbai Maharashtra</p>
<p>Using the Nominatim class in the geopy library</p>

In [10]:
address = 'Mumbai, Maharashtra'

geolocator = Nominatim(user_agent="mumbai_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Mumbai are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Mumbai are 19.0759899, 72.8773928.


<p>Plot a map of Mumbai using the folium library and the coordinates recieved from above</p>

In [11]:
mumbai_map = folium.Map(location=[latitude, longitude], zoom_start=11)    
mumbai_map

In [12]:
# add neighborhood markers to map
for lat, lng, location in zip(df['Latitude'], df['Longitude'], df['Area']):
    label = '{}'.format(location)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mumbai_map)  

mumbai_map

Get you CLIENT_ID, CLIENT_SECRET and VERSION from the FourSquare API Website and add them to variables below

In [13]:
# secret
CLIENT_ID = '3ZOIN1HQSS02FL3JXJ0V5RNMEC1ON5VEEK0OSBITXNRLGCLX' # your Foursquare ID
CLIENT_SECRET = '14EDTT1YA4AZ5W1QUGIT3Q4WFF2JPE2UCPWYQBTGGTFKYWKK' # your Foursquare Secret
VERSION = '20210613' # Foursquare API version

<p>Create a function to get the nearby venues from the latitude and longitude values in the dataframe.</p>
<p>This function sends request to the foursquare api to get the venues</p>

In [14]:
def getNearbyVenues(names, longitudes, latitudes, radius=500, limit=100):
    
    venues_list = []  # create a empty list to hold venues
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # API call request
        url = f'https://api.foursquare.com/v2/venues/explore?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&radius={radius}&limit={limit}'
        
        # GET Request
        count = 1
        while count != 5:
            try:
                results = requests.get(url).json()["response"]["groups"][0]["items"]
                count = 5
            except:
                count += 1
                
        # Get relevent data
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue',
                            'Venue Latitude', 'Venue Longitude', 'Venue Category']
    return nearby_venues

<p>Make call to the above function and pass the area latitude and longitude values form the original dataframe</p>
<p>Show the shape and first 5 rows of the new dataframe</p>
<p>Columns of this dataframe are</p>
<ul>
    <li>Neighborood</li>
    <li>Neighborood Latitude</li>
    <li>Neighborood Longitude</li>
    <li>Venue</li>
    <li>Venue Latitude</li>
    <li>Venue Longitude</li>
    <li>Venue Category</li>
</ul>

In [None]:
mumbai_venues = getNearbyVenues(df['Area'], df['Longitude'], df['Latitude'])
print(mumbai_venues.shape)
mumbai_venues.head()

<h4>Methodology</h4>
<p>We have the data that has latitudes and longitudes of the neighborhoods in Mumbai and the venues that are nearby. We will only consider neighborhoods for which we have considerable number of venues available. We will have a look at the neighborhood that has the highest number of venues. We have 93 neighborhoods in Mumbai. From the foursquare Api we have received 858 venues. We will take a look at the venue categories and then look at how many unique types of venues were received. 
We use the KMean Clustering, One Head Encoding. These cluster are used to find the best place to open a restaurant. </p>

<h4>Analysis<h4>

<p>Group the data by neighborhoods and show the neighborhood with maximum number of venues</p>

In [None]:
v_ = mumbai_venues.groupby('Neighborhood').count()
v_[v_['Venue'] == max(v_['Venue'])] # Maximum venues

<p>Get the number of unique venues that are available</p>

In [None]:
print(f'There are {len(mumbai_venues["Venue Category"].unique())} unique categories.')

In [None]:
df1 = mumbai_venues.groupby(['Neighborhood'], sort=False)['Venue'].count()
df1.plot.bar(figsize=(18,6))

In [None]:
df1=df1[df1 >= 10]
df1.plot.bar(figsize=(18,6))

In [None]:
mumbai_venues_top = mumbai_venues[mumbai_venues['Neighborhood'].isin(df1.index.tolist())]
mumbai_venues_top.head()

In [None]:
mumbai_onehot = pd.get_dummies(mumbai_venues_top['Venue Category'], prefix = "", prefix_sep="")

#
mumbai_onehot['Neighborhood'] = mumbai_venues_top['Neighborhood']

fixed_columns = mumbai_onehot.columns.tolist()
fixed_columns.insert(0,fixed_columns.pop(fixed_columns.index('Neighborhood')))
mumbai_onehot = mumbai_onehot.reindex(columns = fixed_columns)

print(mumbai_onehot.shape)
mumbai_onehot.head()

In [None]:
mumbai_grouped = mumbai_onehot.groupby('Neighborhood').mean().reset_index()
print(mumbai_grouped.shape)
mumbai_grouped.head()

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create column names according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = mumbai_grouped['Neighborhood']

for ind in np.arange(mumbai_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
mumbai_grouped_clustering = mumbai_grouped.drop('Neighborhood', 1)

max_score = 10
scores = []

for kclusters in range(2, max_score):
    # Run k-means clustering
    kmeans = KMeans(n_clusters = kclusters, init = 'k-means++', random_state = 0).fit_predict(mumbai_grouped_clustering)
    
    # Gets the silhouette score
    score = silhouette_score(mumbai_grouped_clustering, kmeans)
    scores.append(score)

plt.figure(figsize=(20,10))
plt.plot(np.arange(2, max_score), scores, 'ro-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.xticks(np.arange(2, max_score))
plt.show()

In [None]:
# select best number of clusters
kclusters = 9

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mumbai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

mumbai_merged = mumbai_venues_top[mumbai_venues_top.columns[0:3]].drop_duplicates()
mumbai_merged.reset_index(drop = True, inplace = True)

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
mumbai_merged = mumbai_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

mumbai_merged.head()

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 0, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster1 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 0, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues1 = (cluster1['1st Most Common Venue'].append(
    cluster1['2nd Most Common Venue']).append(
    cluster1['3rd Most Common Venue']).append(
    cluster1['4th Most Common Venue']).append(
    cluster1['5th Most Common Venue']).append(
    cluster1['6th Most Common Venue']).append(
    cluster1['7th Most Common Venue']).append(
    cluster1['8th Most Common Venue']).append(
    cluster1['9th Most Common Venue']).append(
    cluster1['10th Most Common Venue']))

print(venues1.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 1, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster2 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 1, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues2 = (cluster2['1st Most Common Venue'].append(
    cluster2['2nd Most Common Venue']).append(
    cluster2['3rd Most Common Venue']).append(
    cluster2['4th Most Common Venue']).append(
    cluster2['5th Most Common Venue']).append(
    cluster2['6th Most Common Venue']).append(
    cluster2['7th Most Common Venue']).append(
    cluster2['8th Most Common Venue']).append(
    cluster2['9th Most Common Venue']).append(
    cluster2['10th Most Common Venue']))

print(venues2.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 2, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster3 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 2, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues3 = (cluster3['1st Most Common Venue'].append(
    cluster3['2nd Most Common Venue']).append(
    cluster3['3rd Most Common Venue']).append(
    cluster3['4th Most Common Venue']).append(
    cluster3['5th Most Common Venue']).append(
    cluster3['6th Most Common Venue']).append(
    cluster3['7th Most Common Venue']).append(
    cluster3['8th Most Common Venue']).append(
    cluster3['9th Most Common Venue']).append(
    cluster3['10th Most Common Venue']))

print(venues3.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 3, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster4 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 3, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues4 = (cluster4['1st Most Common Venue'].append(
    cluster4['2nd Most Common Venue']).append(
    cluster4['3rd Most Common Venue']).append(
    cluster4['4th Most Common Venue']).append(
    cluster4['5th Most Common Venue']).append(
    cluster4['6th Most Common Venue']).append(
    cluster4['7th Most Common Venue']).append(
    cluster4['8th Most Common Venue']).append(
    cluster4['9th Most Common Venue']).append(
    cluster4['10th Most Common Venue']))

print(venues4.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 4, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster5 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 4, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues5 = (cluster5['1st Most Common Venue'].append(
    cluster5['2nd Most Common Venue']).append(
    cluster5['3rd Most Common Venue']).append(
    cluster5['4th Most Common Venue']).append(
    cluster5['5th Most Common Venue']).append(
    cluster5['6th Most Common Venue']).append(
    cluster5['7th Most Common Venue']).append(
    cluster5['8th Most Common Venue']).append(
    cluster5['9th Most Common Venue']).append(
    cluster5['10th Most Common Venue']))

print(venues5.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 5, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster6 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 5, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues6 = (cluster6['1st Most Common Venue'].append(
    cluster6['2nd Most Common Venue']).append(
    cluster6['3rd Most Common Venue']).append(
    cluster6['4th Most Common Venue']).append(
    cluster6['5th Most Common Venue']).append(
    cluster6['6th Most Common Venue']).append(
    cluster6['7th Most Common Venue']).append(
    cluster6['8th Most Common Venue']).append(
    cluster6['9th Most Common Venue']).append(
    cluster6['10th Most Common Venue']))

print(venues6.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 6, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster7 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 6, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues7 = (cluster7['1st Most Common Venue'].append(
    cluster7['2nd Most Common Venue']).append(
    cluster7['3rd Most Common Venue']).append(
    cluster7['4th Most Common Venue']).append(
    cluster7['5th Most Common Venue']).append(
    cluster7['6th Most Common Venue']).append(
    cluster7['7th Most Common Venue']).append(
    cluster7['8th Most Common Venue']).append(
    cluster7['9th Most Common Venue']).append(
    cluster7['10th Most Common Venue']))

print(venues7.value_counts().head(10))

In [None]:
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 7, mumbai_merged.columns[[0] + list(range(4, mumbai_merged.shape[1]))]]

In [None]:
cluster8 = mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 7, mumbai_merged.columns[[0] + 
                                                                                    list(range(4, mumbai_merged.shape[1]))]]
venues8 = (cluster8['1st Most Common Venue'].append(
    cluster8['2nd Most Common Venue']).append(
    cluster8['3rd Most Common Venue']).append(
    cluster8['4th Most Common Venue']).append(
    cluster8['5th Most Common Venue']).append(
    cluster8['6th Most Common Venue']).append(
    cluster8['7th Most Common Venue']).append(
    cluster8['8th Most Common Venue']).append(
    cluster8['9th Most Common Venue']).append(
    cluster8['10th Most Common Venue']))

print(venues8.value_counts().head(10))

<h4>Discussion</h4>

In [None]:
df_list = [venues1 ,venues2, venues3, venues4, venues5, venues6, venues7, venues8]
fig, axes = plt.subplots(4, 2)

count = 0
for r in range(4):
    for c in range(2):
        df_list[count].value_counts().head().plot.barh(ax = axes[r,c], width=0.5, figsize=(15,10))
        axes[r,c].set_title('Cluster {}'.format(count+1))
        plt.sca(axes[r, c])
        plt.xticks(np.arange(0, 15, 5))
        plt.xlabel('No. of venues')
        count += 1

fig.tight_layout()

In [None]:
# create map
mumbai_clusters_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_merged['Neighborhood Latitude'], mumbai_merged['Neighborhood Longitude'], mumbai_merged['Neighborhood'], mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(mumbai_clusters_map)
       
mumbai_clusters_map