<h1 align="center">A Tale of Two Cities</h1>
<h2 align="center">Clustering the Neighbourhoods of Mumbai and London</h2>

<p align = "center">Amanul Rahiman Shamshuddin Attar
<br>
<br>
30th January 2021
</p>

# Introduction

Mumbai and London are among the most popular cities in the world, and both have a long history. A lot has changed over the years, and we now take a look at how the cities have grown.

Mumbai and London are popular tourist and vacation destinations for people around the world. They are diverse and multicultural and offer a wide variety of sought-after experiences. We group the neighbourhoods of Mumbai and London respectively and draw insights into what they look like now.

# Business Problem

The aim is to help tourists choose their destinations depending on the experiences that the neighbourhoods have to offer and what they would want to have. This also helps people make decisions if they are thinking about migrating to Mumbai or London or even if they want to relocate neighbourhoods within the city. Our findings will help stakeholders make informed decisions and address any concerns they have including the different kinds of cuisines, provision stores and what the city has to offer. 


# Data Description

We require geolocation data for both Mumbai and London. Postal codes in each city serve as a starting point. Using postal codes, we can find the neighbourhoods, boroughs, venues and their most popular venue categories.

## Mumbai

To derive our solution, we scrape our data from 
https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai

This Wikipedia page has information about all the neighbourhoods; we limit it to South Mumbai.

1. *borough* : Name of Neighbourhood
2. *town* : Name of borough
3. *latitude* : Latitude for Neighbourhood
4. *longitude* : Longitude for Neighbourhood

## London

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_areas_of_London

This Wikipedia page has information about all the neighbourhoods; we limit it to London.

1. *borough* : Name of Neighbourhood
2. *town* : Name of borough
3. *post_code* : Postal codes for London.

This Wikipedia page lacks information about the geographical locations. To solve this problem, we use the ArcGIS API.

### ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups. 

More specifically, we use ArcGIS to get the geo locations of the neighbourhoods of London. The following columns are added to our initial dataset which prepares our data. 

4. *latitude* : Latitude for Neighbourhood
5. *longitude* : Longitude for Neighbourhood
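
As a rough sketch of how the two coordinate columns could be added, the helper below looks up each postal code and copies the first match into `latitude` and `longitude`. The function `add_coordinates`, the address template and the stand-in `fake_geocode` (with made-up coordinates) are assumptions for illustration; in practice `geocode_fn` would be the ArcGIS `geocode` function, whose results expose `['location']['x']` and `['location']['y']`.

```python
import pandas as pd

def add_coordinates(df, geocode_fn, address_template="{}, London, United Kingdom"):
    """Return a copy of df with latitude/longitude looked up per post_code."""
    latitudes, longitudes = [], []
    for post_code in df["post_code"]:
        matches = geocode_fn(address_template.format(post_code))
        if matches:
            location = matches[0]["location"]   # first (best) match
            latitudes.append(location["y"])     # y is the latitude
            longitudes.append(location["x"])    # x is the longitude
        else:
            latitudes.append(float("nan"))      # keep the row, mark as missing
            longitudes.append(float("nan"))
    return df.assign(latitude=latitudes, longitude=longitudes)

def fake_geocode(address):
    """Stand-in geocoder with made-up coordinates, so the sketch runs offline."""
    lookup = {"SE2": {"x": 0.12, "y": 51.49}}
    code = address.split(",")[0]
    return [{"location": lookup[code]}] if code in lookup else []

london = pd.DataFrame({
    "borough": ["Abbey Wood", "Nowhere"],
    "town": ["LONDON", "LONDON"],
    "post_code": ["SE2", "ZZ9"],
})
london_geo = add_coordinates(london, fake_geocode)
print(london_geo)
```

Rows that cannot be geocoded keep a missing value instead of stopping the loop, so they can be inspected or dropped later.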

## Foursquare API Data

We need data about the venues in each neighbourhood. To get that information we use Foursquare location data. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus and even photos. The Foursquare location platform is used as the sole source of venue data, since all the required information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 1000 meters.

The data retrieved from Foursquare contains information on venues within a specified distance of the longitude and latitude of the postcodes. The information obtained per venue is as follows:

1. *Neighbourhood* : Name of the Neighbourhood
2. *Neighbourhood Latitude* : Latitude of the Neighbourhood
3. *Neighbourhood Longitude* : Longitude of the Neighbourhood
4. *Venue* : Name of the Venue
5. *Venue Latitude* : Latitude of Venue
6. *Venue Longitude* : Longitude of Venue
7. *Venue Category* : Category of Venue
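
To make the seven fields concrete, here is a minimal sketch that flattens one response into rows. The payload below is a hand-written mock with made-up coordinates; its shape (`response`, `groups`, `items`, `venue`, `categories`) follows the way the response is accessed later in this notebook, while the `location` keys `lat` and `lng` and the helper name `flatten_venues` are assumptions.

```python
def flatten_venues(neighbourhood, lat, lng, payload):
    """Turn one explore-style payload into a list of flat venue records."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        # Some venues may carry no category; fall back to None instead of failing.
        categories = venue.get("categories") or [{}]
        rows.append({
            "Neighbourhood": neighbourhood,
            "Neighbourhood Latitude": lat,
            "Neighbourhood Longitude": lng,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),
        })
    return rows

# Mock payload for illustration only (coordinates are invented).
mock_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Celejor",
               "location": {"lat": 18.98, "lng": 72.82},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Unnamed Stall",
               "location": {"lat": 18.97, "lng": 72.83},
               "categories": []}},
]}]}}

rows = flatten_venues("Agripada", 18.9777, 72.8273, mock_payload)
for row in rows:
    print(row)
```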


Based on all the information collected for both Mumbai and London, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data, our stakeholders can take the necessary decision.

# Methodology

We will be creating our model in Python, so we start by importing all the required packages.

In [7]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium 

#import k-means for the clustering stage
from sklearn.cluster import KMeans

The approach taken here is to explore each of the cities individually, plot the map to show the neighbourhoods being considered and then build our model by clustering all of the similar neighbourhoods together and finally plot the new map with the clustered neighbourhoods. We draw insights and then compare and discuss our findings.

# Exploring Mumbai

### Neighbourhoods of Mumbai

We begin by collecting and refining the data needed for our business solution to work.

### Data Collection

To get the neighbourhoods in Mumbai, we start by scraping the Wikipedia page listing the neighbourhoods of Mumbai.

In [8]:
url_mumbai = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai"
wiki_mumbai_url = requests.get(url_mumbai)
wiki_mumbai_url

<Response [200]>

Response 200 means that the request succeeded and we were able to make the connection.
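
For readers unfamiliar with status codes, the small sketch below uses only the standard library to translate a code into its standard phrase and to classify it as success (the 2xx range) or failure. The helper name `describe_status` is an assumption for illustration.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return the code, its standard phrase, and whether it is in the 2xx range."""
    status = HTTPStatus(status_code)
    outcome = "success" if 200 <= status_code < 300 else "failure"
    return f"{status_code} {status.phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With `requests`, the same check is usually done by calling `raise_for_status()` on the response before parsing it.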

In [9]:
wiki_mumbai_data = pd.read_html(wiki_mumbai_url.text)
wiki_mumbai_data

[                Area                 Location   Latitude  Longitude
 0             Amboli  Andheri,Western Suburbs  19.129300  72.843400
 1   Chakala, Andheri          Western Suburbs  19.111388  72.860833
 2         D.N. Nagar  Andheri,Western Suburbs  19.124085  72.831373
 3     Four Bungalows  Andheri,Western Suburbs  19.124714  72.827210
 4        Lokhandwala  Andheri,Western Suburbs  19.130815  72.829270
 ..               ...                      ...        ...        ...
 88             Parel             South Mumbai  18.990000  72.840000
 89      Gowalia Tank      Tardeo,South Mumbai  18.962450  72.809703
 90       Dava Bazaar             South Mumbai  18.946882  72.831362
 91           Dharavi                   Mumbai  19.040208  72.850850
 92             Thane                   Mumbai  19.200000  72.970000
 
 [93 rows x 4 columns]]

Scraping the webpage gives us all the tables present on the page. We need the first table, so we select it.

In [10]:
wiki_mumbai_data = wiki_mumbai_data[0]
wiki_mumbai_data

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850


### Data Preprocessing

We strip leading and trailing spaces from the column titles and replace the spaces between words with '_'.

In [11]:
wiki_mumbai_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
wiki_mumbai_data

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850


Here every column name is a single word, so the renaming leaves them unchanged. Where a name still has no '_' between its words after this step, the gap is likely a special character rather than a plain space.
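
A small sketch of what the renaming step does when column names do contain spaces. The column names below are hypothetical (they are not from the Mumbai table); the last one uses a non-breaking space (`\u00a0`), which is a different character from a plain space and is therefore left untouched.

```python
import pandas as pd

# Hypothetical column names: leading space, plain space, non-breaking space.
sample = pd.DataFrame(columns=[" Post town", "London borough", "Dial\u00a0code"])

cleaned = sample.rename(columns=lambda name: name.strip().replace(" ", "_"))

print(list(cleaned.columns))
```

If such names need to be normalised as well, one option is to replace `\u00a0` with a plain space before applying the same rule.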

### Feature Selection

In [12]:
df1 = wiki_mumbai_data
df1.head()

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927


Let's rename the Area and Location columns, along with the coordinate columns, to something simpler.

In [36]:
df1.columns = ["borough", "town","latitude","longitude"]
df1

Unnamed: 0,borough,town,latitude,longitude
52,Agripada,South Mumbai,18.9777,72.8273
53,Altamount Road,South Mumbai,18.9681,72.8095
54,Bhuleshwar,South Mumbai,18.95,72.83
55,Breach Candy,South Mumbai,18.967,72.805
56,Carmichael Road,South Mumbai,18.9722,72.8113
57,Cavel,South Mumbai,18.9474,72.8272
58,Churchgate,South Mumbai,18.93,72.82
59,Cotton Green,South Mumbai,18.986209,72.844076
60,Cuffe Parade,South Mumbai,18.91,72.81
61,Cumbala Hill,South Mumbai,18.965833,72.805833


Let's remove the Square brackets [ ] and numbers from the borough column

In [37]:
df1['borough'] = df1['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df1

Unnamed: 0,borough,town,latitude,longitude
52,Agripada,South Mumbai,18.9777,72.8273
53,Altamount Road,South Mumbai,18.9681,72.8095
54,Bhuleshwar,South Mumbai,18.95,72.83
55,Breach Candy,South Mumbai,18.967,72.805
56,Carmichael Road,South Mumbai,18.9722,72.8113
57,Cavel,South Mumbai,18.9474,72.8272
58,Churchgate,South Mumbai,18.93,72.82
59,Cotton Green,South Mumbai,18.986209,72.844076
60,Cuffe Parade,South Mumbai,18.91,72.81
61,Cumbala Hill,South Mumbai,18.965833,72.805833
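
The chained `rstrip` calls work for footnote markers such as `[12]`, but they also strip trailing digits from names that have no brackets at all. The sketch below compares that approach with a regular expression that only removes a bracketed number at the end of the string; the names `strip_with_rstrip` and `strip_footnote`, and the sample name "Sector 17", are assumptions for illustration.

```python
import re

def strip_with_rstrip(name):
    """The notebook's approach: peel off ']', then digits, then '['."""
    return name.rstrip(']').rstrip('0123456789').rstrip('[')

def strip_footnote(name):
    """Remove only a trailing bracketed number such as '[12]'."""
    return re.sub(r'\s*\[\d+\]$', '', name)

for raw in ["Agripada[12]", "Sector 17"]:
    print(repr(raw), repr(strip_with_rstrip(raw)), repr(strip_footnote(raw)))
```

Both versions agree on names with footnotes; they differ only when a name legitimately ends in digits.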


Check the dimensions of the data frame.

In [38]:
df1.shape

(39, 4)

We currently have 39 records and 4 columns of data. It's time to perform feature engineering.

### Feature Engineering

We focus only on the neighbourhoods of South Mumbai, so we filter the data frame accordingly.

In [39]:
df1 = df1[df1['town'].str.contains('South Mumbai')]
df1

Unnamed: 0,borough,town,latitude,longitude
52,Agripada,South Mumbai,18.9777,72.8273
53,Altamount Road,South Mumbai,18.9681,72.8095
54,Bhuleshwar,South Mumbai,18.95,72.83
55,Breach Candy,South Mumbai,18.967,72.805
56,Carmichael Road,South Mumbai,18.9722,72.8113
57,Cavel,South Mumbai,18.9474,72.8272
58,Churchgate,South Mumbai,18.93,72.82
59,Cotton Green,South Mumbai,18.986209,72.844076
60,Cuffe Parade,South Mumbai,18.91,72.81
61,Cumbala Hill,South Mumbai,18.965833,72.805833


In [40]:
df1.shape

(39, 4)
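
Note that `str.contains` keeps every row whose town mentions South Mumbai, including compound values such as "Tardeo,South Mumbai". The sketch below, built from a few rows of the scraped table, contrasts it with an exact comparison.

```python
import pandas as pd

towns = pd.DataFrame({
    "borough": ["Parel", "Gowalia Tank", "Dharavi", "Amboli"],
    "town": ["South Mumbai", "Tardeo,South Mumbai", "Mumbai", "Andheri,Western Suburbs"],
})

# Substring match (regex=False treats the pattern as plain text).
contains_mask = towns["town"].str.contains("South Mumbai", regex=False)
# Exact match keeps only rows whose town is precisely "South Mumbai".
exact_mask = towns["town"] == "South Mumbai"

print(towns[contains_mask]["borough"].tolist())
print(towns[exact_mask]["borough"].tolist())
```

The substring match is the right choice here because it retains neighbourhoods listed under a sub-area of South Mumbai.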

## Geolocations of the Mumbai Neighbourhoods

### ArcGIS API

The neighbourhood co-ordinates come from the Wikipedia table, but we still need the geographical co-ordinates of the city itself to centre our map. We will use the arcgis package to do so. 

ArcGIS doesn't have a limitation on the number of API calls made so it fits our use case perfectly.

In [41]:
pip install arcgis

Note: you may need to restart the kernel to use updated packages.




Importing the geocoder and connecting to ArcGIS Online so that we can look up the geocode for Mumbai

In [42]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

### Co-ordinates for Mumbai

Getting the geocode for Mumbai to help visualize it on the map

In [43]:
mumbai = geocode(address='Mumbai,India')[0]
mumbai_lng_coords = mumbai['location']['x']
mumbai_lat_coords = mumbai['location']['y']
mumbai_lng_coords

72.83483000000007

In [44]:
mumbai_lat_coords

18.940170000000023

# Visualize the Map of Mumbai

To help visualize the Map of Mumbai and the neighbourhoods in Mumbai, we make use of the folium package.

In [45]:
# Creating the map of Mumbai
map_Mumbai = folium.Map(location=[mumbai_lat_coords, mumbai_lng_coords], zoom_start=12)
map_Mumbai

# adding markers to map
for latitude, longitude, borough, town in zip(df1['latitude'], df1['longitude'], df1['borough'], df1['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_Mumbai)  
    
map_Mumbai

### Venues in Mumbai

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in Mumbai.

In [55]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # placeholder: supply your own credentials
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # placeholder: never publish real secrets
VERSION = '20210101'

Defining a function to get the nearby venues in each neighbourhood. This gives us the venue categories, which are important for our analysis.

In [56]:
LIMIT=400

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
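
As an alternative to formatting the query string by hand, the request URL can be assembled from a dictionary so that each value is encoded properly. This is only a sketch using the standard library: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the endpoint and parameter names are copied from the function above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=1000, limit=400):
    """Assemble the explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude of the neighbourhood
        "radius": radius,       # search radius in metres
        "limit": limit,         # maximum number of venues requested
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20210101",
                        18.9777, 72.8273)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

With `requests`, the same dictionary can be passed as `params=` to `requests.get`, which performs the encoding itself.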

Getting the venues in Mumbai

In [57]:
venues_in_Mumbai = getNearbyVenues(df1['borough'], df1['latitude'], df1['longitude'])

Agripada
Altamount Road
Bhuleshwar
Breach Candy
Carmichael Road
Cavel
Churchgate
Cotton Green
Cuffe Parade
Cumbala Hill
Currey Road
Dhobitalao
Dongri
Kala Ghoda
Kemps Corner
Lower Parel
Mahalaxmi
Mahim
Malabar Hill
Marine Drive
Marine Lines
Mumbai Central
Nariman Point
Prabhadevi
Sion
Walkeshwar
Worli
C.G.S. colony
Dagdi Chawl
Navy Nagar
Hindu colony
Ballard Estate
Chira Bazaar
Fanas Wadi
Chor Bazaar
Matunga
Parel
Gowalia Tank
Dava Bazaar


Sampling our data

In [58]:
venues_in_Mumbai.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Agripada,18.9777,72.8273,Celejor,Bakery
1,Agripada,18.9777,72.8273,Tote On The Turf,Nightclub
2,Agripada,18.9777,72.8273,Mahalaxmi Race Course (Royal Western India Tur...,Club House
3,Agripada,18.9777,72.8273,Bhau Daji Lad Museum,History Museum
4,Agripada,18.9777,72.8273,Persian Darbar,Indian Restaurant


In [59]:
venues_in_Mumbai.shape

(1707, 5)

Wow, we have scraped together 1707 records for venues. This will definitely make the clustering interesting.


### Grouping by Venue Categories
We now need to see how many venue categories there are for further processing.

In [60]:
venues_in_Mumbai.groupby("Venue Category").max()

Venue Category,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
American Restaurant,Dhobitalao,18.950000,72.840000,Subway
Antique Shop,Chor Bazaar,18.960321,72.827176,Chor Bazaar (Thieves' Market)
Aquarium,Marine Lines,18.951811,72.827200,Taraporewala Aquarium
Arcade,Worli,19.017980,72.844763,Worli Sea Face Kebab rolls
Art Gallery,Worli,19.000000,72.833100,SaffronArt
...,...,...,...,...
Wine Bar,Worli,19.000000,72.823000,Vinoteca by Sula
Wine Shop,Dhobitalao,18.950000,72.840000,Peekay Wines
Women's Store,Nariman Point,19.017980,72.844763,Westside
Yoga Studio,Gowalia Tank,19.016378,72.856629,Moksh - The Wellness Place


We can see 156 distinct venue categories, which goes to show how diverse and interesting the place is.

### One Hot Encoding 
K-Means works on numeric features, so we need to one-hot encode our venue categories before clustering.

In [62]:
Mumbai_venue_cat = pd.get_dummies(venues_in_Mumbai[['Venue Category']], prefix="", prefix_sep="")
Mumbai_venue_cat

Unnamed: 0,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1702,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1703,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1705,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighbourhood into the mix.

In [64]:
Mumbai_venue_cat['Neighbourhood'] = venues_in_Mumbai['Neighbourhood']

# moving the Neighbourhood column to the first position
fixed_columns = [Mumbai_venue_cat.columns[-1]] + list(Mumbai_venue_cat.columns[:-1])
Mumbai_venue_cat = Mumbai_venue_cat[fixed_columns]

Mumbai_venue_cat.head()

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Agripada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Agripada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Agripada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Agripada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Agripada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean value
We will group the Neighbourhoods and calculate the mean venue categories value in each Neighbourhood

In [65]:
Mumbai_grouped = Mumbai_venue_cat.groupby('Neighbourhood').mean().reset_index()
Mumbai_grouped.head()

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Trail,Train,Train Station,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Agripada,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037
1,Altamount Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.012195,0.012195,0.0
2,Ballard Estate,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,...,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0
3,Bhuleshwar,0.019608,0.0,0.0,0.019608,0.0,0.0,0.0,0.0,0.019608,...,0.0,0.0,0.019608,0.0,0.0,0.0,0.019608,0.0,0.0,0.0
4,Breach Candy,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.018519,0.0,0.0
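
To see why the mean of one-hot columns is useful, the toy sketch below (made-up neighbourhoods "A" and "B", not the Mumbai data) shows that the mean per neighbourhood equals the share of that neighbourhood's venues falling into each category, so each row sums to 1. `dtype=int` is passed only to keep the encoded columns numeric across pandas versions.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Cafe", "Cafe", "Park", "Cafe", "Cafe"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of the 0/1 indicators = relative frequency of each category.
frequencies = onehot.groupby("Neighbourhood").mean()
print(frequencies)
```

Neighbourhood A has four venues, two of them cafes, so its Cafe frequency is 0.5; B has only cafes, so its Cafe frequency is 1.0.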


Let's make a function to get the most common venue categories for a neighbourhood

In [66]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
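
A quick check of the function on a single toy row (made-up frequencies): the first entry is the neighbourhood name, which `row.iloc[1:]` skips, and the remaining entries are sorted from most to least frequent.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]                       # drop the Neighbourhood entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row with invented frequencies.
row = pd.Series({"Neighbourhood": "Toy Town", "Bakery": 0.3, "Cafe": 0.5, "Park": 0.2})
top_two = list(return_most_common_venues(row, 2))
print(top_two)
```

When a neighbourhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is zero, so the lower ranks for such neighbourhoods should be read with care.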

There are far too many venue categories to inspect one by one, so we take the top 10 for each neighbourhood.

Creating the column labels for the top venues

In [67]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
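
The loop above relies on an out-of-range index to fall back to "th", which is fine for ranks 1 to 10. A more general sketch, should more ranks ever be needed, is a small helper that also handles cases such as 11th and 21st; the name `ordinal` is an assumption for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"                                   # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighbourhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns)
```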


### Top venue categories

Getting the top venue categories in Mumbai

In [69]:
# create a new dataframe for Mumbai
neighborhoods_venues_sorted_mumbai = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_mumbai['Neighbourhood'] = Mumbai_grouped['Neighbourhood']

for ind in np.arange(Mumbai_grouped.shape[0]):
    neighborhoods_venues_sorted_mumbai.iloc[ind, 1:] = return_most_common_venues(Mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_mumbai.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agripada,Indian Restaurant,Bakery,Gym,Restaurant,Coffee Shop,Zoo,History Museum,Pizza Place,Cupcake Shop,Fast Food Restaurant
1,Altamount Road,Chinese Restaurant,Café,Bar,Indian Restaurant,Bakery,Fast Food Restaurant,Pizza Place,Sandwich Place,Coffee Shop,Salon / Barbershop
2,Ballard Estate,Indian Restaurant,Harbor / Marina,Café,Smoke Shop,Ice Cream Shop,Furniture / Home Store,Rest Area,Middle Eastern Restaurant,Market,Hotel
3,Bhuleshwar,Indian Restaurant,Bakery,Café,Chinese Restaurant,Fast Food Restaurant,Dessert Shop,Jewelry Store,Music Store,Multiplex,Middle Eastern Restaurant
4,Breach Candy,Café,Bakery,Coffee Shop,Indian Restaurant,Sandwich Place,Dessert Shop,Pizza Place,Chinese Restaurant,Park,Bar


## Model Building

### K-Means
Let's group the neighbourhoods of Mumbai into 5 clusters to make them easier to analyze.

We use the K-Means clustering technique to do so.


In [78]:
# set number of clusters
k_num_clusters = 5

# keep only the numeric category frequencies for clustering
Mumbai_grouped_clustering = Mumbai_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans_mumbai = KMeans(n_clusters=k_num_clusters, random_state=0).fit(Mumbai_grouped_clustering)
kmeans_mumbai

KMeans(n_clusters=5, random_state=0)
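
The number of clusters is a judgment call. One common way to sanity-check it is to compare the inertia (within-cluster sum of squared distances) for several candidate values of k. The sketch below does this on six made-up points forming two obvious groups, not on the Mumbai features; `n_init=10` is set explicitly as an assumption so that the behaviour does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two tight, well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])

# Inertia drops sharply once k matches the real number of groups.
inertias = {}
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
print(inertias)
print(labels)
```

On real data the drop is rarely this clear, so the resulting curve is a guide rather than a rule.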

### Labelling Clustered Data

In [79]:
kmeans_mumbai.labels_

array([3, 0, 3, 3, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 0, 3,
       0, 0, 1, 3, 3, 3, 3, 3, 0, 0, 4, 0, 3, 3, 3, 0])

Our model has now assigned a cluster label to each neighbourhood.

In [80]:
neighborhoods_venues_sorted_mumbai.insert(0,'Cluster Labels',kmeans_mumbai.labels_ +1)

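Note: this cell was re-run after the column had already been added, which raises "ValueError: cannot insert Cluster Labels, already exists". The labels from the first run are still in place.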
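
One way to make the labelling step safe to re-run is to drop any existing label column before inserting the new one. This is a sketch on a toy frame; the helper name `set_cluster_labels` is an assumption, and the `+ 1` mirrors the 1-based labels used in this notebook.

```python
import numpy as np
import pandas as pd

def set_cluster_labels(df, labels):
    """Return a copy of df with 1-based 'Cluster Labels' as the first column."""
    labelled = df.drop(columns="Cluster Labels", errors="ignore")  # no error if absent
    labelled.insert(0, "Cluster Labels", np.asarray(labels) + 1)
    return labelled

toy = pd.DataFrame({"Neighbourhood": ["A", "B"]})
once = set_cluster_labels(toy, [0, 3])
twice = set_cluster_labels(once, [0, 3])   # second call does not raise
print(twice)
```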

Join df1 with our sorted neighbourhood venues to add the latitude and longitude of each neighbourhood and prepare the data for plotting

In [81]:
mumbai_data = df1

mumbai_data = mumbai_data.join(neighborhoods_venues_sorted_mumbai.set_index('Neighbourhood'), on = 'borough')

mumbai_data.head()

Unnamed: 0,borough,town,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,Agripada,South Mumbai,18.9777,72.8273,4.0,Indian Restaurant,Bakery,Gym,Restaurant,Coffee Shop,Zoo,History Museum,Pizza Place,Cupcake Shop,Fast Food Restaurant
53,Altamount Road,South Mumbai,18.9681,72.8095,1.0,Chinese Restaurant,Café,Bar,Indian Restaurant,Bakery,Fast Food Restaurant,Pizza Place,Sandwich Place,Coffee Shop,Salon / Barbershop
54,Bhuleshwar,South Mumbai,18.95,72.83,4.0,Indian Restaurant,Bakery,Café,Chinese Restaurant,Fast Food Restaurant,Dessert Shop,Jewelry Store,Music Store,Multiplex,Middle Eastern Restaurant
55,Breach Candy,South Mumbai,18.967,72.805,1.0,Café,Bakery,Coffee Shop,Indian Restaurant,Sandwich Place,Dessert Shop,Pizza Place,Chinese Restaurant,Park,Bar
56,Carmichael Road,South Mumbai,18.9722,72.8113,1.0,Fast Food Restaurant,Chinese Restaurant,Indian Restaurant,Bar,Bakery,Ice Cream Shop,Sandwich Place,Pizza Place,Shopping Mall,Park


Drop the rows without a cluster label (neighbourhoods missing from the clustering result), since they cannot be plotted as part of a cluster

In [82]:
mumbai_data_nonan = mumbai_data.dropna(subset=['Cluster Labels'])
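
The toy sketch below (made-up rows) shows why missing labels appear and why the labels print as 4.0 rather than 4: a neighbourhood absent from the clustered table gets NaN in the join, and a column containing NaN is stored as floating point.

```python
import pandas as pd

coords = pd.DataFrame({
    "borough": ["A", "B", "C"],
    "latitude": [18.97, 18.96, 18.95],
    "longitude": [72.82, 72.80, 72.83],
})
# "B" is missing here, e.g. because no venues were returned for it.
clustered = pd.DataFrame({"Neighbourhood": ["A", "C"], "Cluster Labels": [4, 1]})

joined = coords.join(clustered.set_index("Neighbourhood"), on="borough")
plotted = joined.dropna(subset=["Cluster Labels"])

print(joined)
print(plotted["Cluster Labels"].astype(int).tolist())
```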

### Visualizing the clustered neighbourhood
Let's plot the clusters

In [84]:
map_clusters_mumbai = folium.Map(location=[mumbai_lat_coords,mumbai_lng_coords], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_data_nonan['latitude'], mumbai_data_nonan['longitude'], mumbai_data_nonan['borough'], mumbai_data_nonan['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_mumbai)
        
map_clusters_mumbai
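
The colour set-up can be reduced to its essentials: sample one colour per cluster from the rainbow colormap and index it with the 1-based label. This is a simplified sketch of the same idea, not the exact cell above; the helper name `colour_for` is an assumption.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_num_clusters = 5

# One evenly spaced colour per cluster, converted to hex strings for folium.
rainbow = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k_num_clusters))]

def colour_for(cluster_label):
    """Map a 1-based cluster label (possibly a float such as 4.0) to its colour."""
    return rainbow[int(cluster_label) - 1]

print(rainbow)
print(colour_for(4.0))
```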

### Examining our Clusters

Cluster 1

In [85]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 1, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
53,South Mumbai,Chinese Restaurant,Café,Bar,Indian Restaurant,Bakery,Fast Food Restaurant,Pizza Place,Sandwich Place,Coffee Shop,Salon / Barbershop
55,South Mumbai,Café,Bakery,Coffee Shop,Indian Restaurant,Sandwich Place,Dessert Shop,Pizza Place,Chinese Restaurant,Park,Bar
56,South Mumbai,Fast Food Restaurant,Chinese Restaurant,Indian Restaurant,Bar,Bakery,Ice Cream Shop,Sandwich Place,Pizza Place,Shopping Mall,Park
58,South Mumbai,Fast Food Restaurant,Italian Restaurant,Hotel,Indian Restaurant,Ice Cream Shop,Coffee Shop,Restaurant,Theater,Bar,Cricket Ground
59,South Mumbai,Train Station,Ice Cream Shop,Fast Food Restaurant,Multiplex,Plaza,Snack Place,Hotel,Vegetarian / Vegan Restaurant,Food & Drink Shop,Flower Shop
60,South Mumbai,Park,Indian Restaurant,Department Store,Italian Restaurant,Asian Restaurant,Grocery Store,Basketball Court,Beach,Zoo,Farmers Market
61,South Mumbai,Café,Bakery,Coffee Shop,Indian Restaurant,Sandwich Place,Pizza Place,Salon / Barbershop,Fast Food Restaurant,Park,Soccer Field
62,South Mumbai,Indian Restaurant,Café,Coffee Shop,Shopping Mall,Pizza Place,Fast Food Restaurant,Restaurant,Clothing Store,Chinese Restaurant,Lounge
66,South Mumbai,Indian Restaurant,Café,Bakery,Park,Restaurant,Fast Food Restaurant,Salon / Barbershop,Sandwich Place,Coffee Shop,Theater
67,South Mumbai,Indian Restaurant,Café,Pub,Restaurant,Clothing Store,Coffee Shop,Italian Restaurant,Hotel,Chinese Restaurant,Molecular Gastronomy Restaurant


Cluster 2

In [86]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 2, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,South Mumbai,Arcade,Zoo,Farmers Market,Food & Drink Shop,Food,Flower Shop,Flea Market,Fish Market,Field,Fast Food Restaurant


Cluster 3

In [87]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 3, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
64,South Mumbai,Beach,Playground,Resort,Bus Station,Indian Restaurant,Dance Studio,Farmers Market,Flower Shop,Flea Market,Fish Market


Cluster 4

In [88]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 4, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,South Mumbai,Indian Restaurant,Bakery,Gym,Restaurant,Coffee Shop,Zoo,History Museum,Pizza Place,Cupcake Shop,Fast Food Restaurant
54,South Mumbai,Indian Restaurant,Bakery,Café,Chinese Restaurant,Fast Food Restaurant,Dessert Shop,Jewelry Store,Music Store,Multiplex,Middle Eastern Restaurant
57,South Mumbai,Indian Restaurant,Bakery,Food,Chinese Restaurant,Restaurant,Fast Food Restaurant,Cricket Ground,Jewelry Store,Café,Boutique
63,South Mumbai,Indian Restaurant,Train Station,Café,Bakery,Bar,Jewelry Store,Multiplex,Boutique,Chinese Restaurant,Coffee Shop
65,South Mumbai,Indian Restaurant,Café,Bar,Seafood Restaurant,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Japanese Restaurant,Bakery,Hotel
69,South Mumbai,Indian Restaurant,Chinese Restaurant,Café,Coffee Shop,Beach,Fast Food Restaurant,Dessert Shop,Movie Theater,Bakery,Snack Place
70,South Mumbai,Indian Restaurant,Ice Cream Shop,Park,Gym,Dessert Shop,Coffee Shop,Lighthouse,Falafel Restaurant,Flea Market,Fish Market
71,South Mumbai,Indian Restaurant,Café,Chinese Restaurant,Bridal Shop,Cricket Ground,Train Station,Gastropub,Bakery,Jewelry Store,Cheese Shop
72,South Mumbai,Indian Restaurant,Café,Jewelry Store,Bar,Chinese Restaurant,Bridal Shop,Food Truck,Cricket Ground,Train Station,Fast Food Restaurant
75,South Mumbai,Indian Restaurant,Coffee Shop,Café,Fast Food Restaurant,Chinese Restaurant,Bakery,Electronics Store,Theater,Seafood Restaurant,Scenic Lookout


Cluster 5

In [89]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 5, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
81,"Colaba,South Mumbai",Golf Course,Asian Restaurant,General Entertainment,Beach,Farmers Market,Food,Flower Shop,Flea Market,Fish Market,Field
