<h1 align="center">A Tale of Two Cities</h1>
<h2 align="center">Clustering the Neighbourhoods of Mumbai and London</h2>

<p align = "center">Amanul Rahiman Shamshuddin Attar
<br>
<br>
30th January 2021
</p>

# Introduction

Mumbai and London are among the most popular cities in the world, and both have a long history. A lot has changed over the years, and we now take a look at how the two cities have grown.

Mumbai and London are popular tourist and vacation destinations for people around the world. They are diverse and multicultural, and they offer a wide variety of experiences that are widely sought after. We group the neighbourhoods of Mumbai and London respectively and draw insights into what they look like now.

# Business Problem

The aim is to help tourists choose their destinations depending on the experiences that the neighbourhoods have to offer and what they would want to have. This also helps people make decisions if they are thinking about migrating to Mumbai or London or even if they want to relocate neighbourhoods within the city. Our findings will help stakeholders make informed decisions and address any concerns they have including the different kinds of cuisines, provision stores and what the city has to offer. 


# Data Description

We require geolocation data for both Mumbai and London. Postal codes in each city serve as a starting point. Using postal codes, we can find the neighbourhoods, boroughs, venues and their most popular venue categories.

## Mumbai

To derive our solution, we scrape our data from
https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai

This Wikipedia page has information about all the neighbourhoods; we limit it to South Mumbai.

1. *borough* : Name of Neighbourhood
2. *town* : Name of borough
3. *latitude* : Latitude for Neighbourhood
4. *longitude* : Longitude for Neighbourhood

## London

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_areas_of_London

This Wikipedia page has information about all the neighbourhoods; we limit it to London.

1. *borough* : Name of Neighbourhood
2. *town* : Name of borough
3. *post_code* : Postal codes for London.

This Wikipedia page lacks information about geographical locations. To solve this problem, we use the ArcGIS API.

### ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups. 

More specifically, we use ArcGIS to get the geo locations of the neighbourhoods of London. The following columns are added to our initial dataset which prepares our data. 

4. *latitude* : Latitude for Neighbourhood
5. *longitude* : Longitude for Neighbourhood
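
The step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `first_lat_lng`, `add_coordinates` and `fake_geocode` are hypothetical helpers, and the address format is an assumption. The stub stands in for ArcGIS so the example runs offline; its result mimics the `location` dictionary with `x` (longitude) and `y` (latitude) keys that `geocode` returns later in this notebook.

```python
import pandas as pd


def first_lat_lng(results):
    """Return (latitude, longitude) of the first candidate, or (None, None)."""
    if not results:
        return (None, None)
    location = results[0]["location"]
    # ArcGIS-style results store longitude in 'x' and latitude in 'y'.
    return (location["y"], location["x"])


def add_coordinates(df, geocode_fn, column="post_code"):
    """Return a copy of df with latitude/longitude columns appended."""
    coords = [
        first_lat_lng(geocode_fn(f"{value}, London, England"))
        for value in df[column]
    ]
    out = df.copy()
    out["latitude"] = [lat for lat, _ in coords]
    out["longitude"] = [lng for _, lng in coords]
    return out


def fake_geocode(address):
    """Offline stand-in for a geocoder (hypothetical, made-up coordinates)."""
    known = {"SE2, London, England": [{"location": {"x": 0.12, "y": 51.49}}]}
    return known.get(address, [])


areas = pd.DataFrame({"borough": ["Abbey Wood", "Nowhere"],
                      "post_code": ["SE2", "ZZ9"]})
located = add_coordinates(areas, fake_geocode)
print(located)
```

With the real service, `geocode_fn` could be something like `lambda address: geocode(address=address)` once a `GIS()` connection has been created. Addresses that return no candidates end up with missing coordinates instead of raising an error.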

## Foursquare API Data

We will need data about the different venues in the neighbourhoods of each borough. To obtain that information we use Foursquare location data. Foursquare is a location data provider with information about all manner of venues and events within an area of interest. Such information includes venue names, locations, menus and even photos. As such, the Foursquare location platform will be used as the sole source of venue data, since all the required information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 1000 meters.

The data retrieved from Foursquare contains information on venues within a specified distance of the longitude and latitude of the postcodes. The information obtained per venue is as follows:

1. *Neighbourhood* : Name of the Neighbourhood
2. *Neighbourhood Latitude* : Latitude of the Neighbourhood
3. *Neighbourhood Longitude* : Longitude of the Neighbourhood
4. *Venue* : Name of the Venue
5. *Venue Latitude* : Latitude of Venue
6. *Venue Longitude* : Longitude of Venue
7. *Venue Category* : Category of Venue


Based on all the information collected for both Mumbai and London, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data, our stakeholders can take the necessary decision.

# Methodology

We will be creating a model with the help of Python, so we start off by importing all the required packages.

In [2]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium 

#import k-means for the clustering stage
from sklearn.cluster import KMeans

The approach taken here is to explore each of the cities individually, plot the map to show the neighbourhoods being considered and then build our model by clustering all of the similar neighbourhoods together and finally plot the new map with the clustered neighbourhoods. We draw insights and then compare and discuss our findings.

# Exploring Mumbai

### Neighbourhoods of Mumbai

We begin by collecting and refining the data needed for our business solution to work.

### Data Collection

To get the neighbourhoods in Mumbai, we start by scraping the Wikipedia page that lists the neighbourhoods of Mumbai.

In [3]:
url_mumbai = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai"
wiki_mumbai_url = requests.get(url_mumbai)
wiki_mumbai_url

<Response [200]>

Response 200 means that the request succeeded and we are able to make the connection.

In [4]:
wiki_mumbai_data = pd.read_html(wiki_mumbai_url.text)
wiki_mumbai_data

[                Area                 Location   Latitude  Longitude
 0             Amboli  Andheri,Western Suburbs  19.129300  72.843400
 1   Chakala, Andheri          Western Suburbs  19.111388  72.860833
 2         D.N. Nagar  Andheri,Western Suburbs  19.124085  72.831373
 3     Four Bungalows  Andheri,Western Suburbs  19.124714  72.827210
 4        Lokhandwala  Andheri,Western Suburbs  19.130815  72.829270
 ..               ...                      ...        ...        ...
 88             Parel             South Mumbai  18.990000  72.840000
 89      Gowalia Tank      Tardeo,South Mumbai  18.962450  72.809703
 90       Dava Bazaar             South Mumbai  18.946882  72.831362
 91           Dharavi                   Mumbai  19.040208  72.850850
 92             Thane                   Mumbai  19.200000  72.970000
 
 [93 rows x 4 columns]]

Scraping the webpage gives us all the tables present on the page. We need the first table, so we select it.

In [5]:
wiki_mumbai_data = wiki_mumbai_data[0]
wiki_mumbai_data

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850
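
Selecting a table by position can break if the page layout changes. As a hedged alternative, `pd.read_html` accepts a `match` argument that keeps only tables containing the given text. The sketch below uses a small made-up HTML snippet, not the real Wikipedia page, so that it runs offline.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one has coordinates.
html = """
<table><tr><th>Note</th></tr><tr><td>Navigation box</td></tr></table>
<table>
  <tr><th>Area</th><th>Latitude</th><th>Longitude</th></tr>
  <tr><td>Amboli</td><td>19.1293</td><td>72.8434</td></tr>
</table>
"""

# `match` keeps only tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Latitude")
neighbourhoods = tables[0]
print(neighbourhoods)
```

On the real page, the same call with the downloaded HTML would presumably return just the neighbourhood table, provided "Latitude" appears only there.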


### Data Preprocessing

We remove the spaces in the column titles and add _ between words.

In [6]:
wiki_mumbai_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
wiki_mumbai_data

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850


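The Mumbai column titles are single words, so they are unchanged here. When a multi-word title still has no '_' between its words after applying our function, it means that the separator is a special character and not a plain space.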
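
One way to handle such special characters is to split on any Unicode whitespace instead of replacing only the plain space. The sketch below is an illustration under assumptions: `normalise_column` is a hypothetical helper, and the column names are invented, with a non-breaking space (`\xa0`) standing in for the special character.

```python
import pandas as pd


def normalise_column(name):
    """Collapse any run of whitespace (including non-breaking spaces) to '_'."""
    # str.split() with no argument splits on all Unicode whitespace.
    return "_".join(str(name).split())


# Hypothetical frame: the second header contains a non-breaking space (\xa0).
frame = pd.DataFrame(columns=["Post town", "London\xa0borough", " Dial  code "])
frame = frame.rename(columns=normalise_column)
print(list(frame.columns))
```

Because leading, trailing and repeated whitespace is dropped as well, this also covers the `strip()` step of the original lambda.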

### Feature Selection

In [7]:
df1 = wiki_mumbai_data
df1.head()

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927


Let's rename the Area and Location columns to something simpler, and lowercase the coordinate columns.

In [8]:
df1.columns = ["borough", "town","latitude","longitude"]
df1

Unnamed: 0,borough,town,latitude,longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850


Let's remove the square brackets [ ] and the numbers inside them from the borough column.

In [9]:
df1['borough'] = df1['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df1

Unnamed: 0,borough,town,latitude,longitude
0,Amboli,"Andheri,Western Suburbs",19.129300,72.843400
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.827210
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.829270
...,...,...,...,...
88,Parel,South Mumbai,18.990000,72.840000
89,Gowalia Tank,"Tardeo,South Mumbai",18.962450,72.809703
90,Dava Bazaar,South Mumbai,18.946882,72.831362
91,Dharavi,Mumbai,19.040208,72.850850
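
The chained `rstrip` calls above also strip digits that legitimately end a name, because `rstrip('0123456789')` runs even when no bracket is present. A stricter option, sketched here with a hypothetical `strip_citations` helper and made-up sample names, is a regular expression that only removes bracketed markers at the end of the string.

```python
import re

import pandas as pd

# Matches one or more trailing markers such as "[7]" or " [8][9]".
CITATION = re.compile(r"(\s*\[\d+\])+\s*$")


def strip_citations(text):
    """Remove trailing Wikipedia-style citation markers from a string."""
    return CITATION.sub("", text)


names = pd.Series(["Bexley, Greenwich [7]", "Croydon[8]", "Sector 17", "Amboli"])
cleaned = names.map(strip_citations)
print(cleaned.tolist())
```

Names without a marker, including ones that end in a number, pass through unchanged.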


Take the dimensions of the data frame.

In [10]:
df1.shape

(93, 4)

We currently have 93 records and 4 columns in our data. It's time to perform feature engineering.


## Geolocations of the Mumbai Neighbourhoods

### ArcGIS API

We need to get the geographical co-ordinates for the neighbourhoods to plot our map. We will use the arcgis package to do so.

ArcGIS doesn't have a limitation on the number of API calls made, so it fits our use case perfectly.

In [12]:
pip install arcgis








Importing the ArcGIS geocoder and opening a connection so that we can get the geocodes for Mumbai

In [13]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

### Co-ordinates for Mumbai

Getting the geocode for Mumbai to help visualize it on the map

In [14]:
mumbai = geocode(address='Mumbai,India')[0]
mumbai_lng_coords = mumbai['location']['x']
mumbai_lat_coords = mumbai['location']['y']
mumbai_lng_coords

72.83483000000007

In [15]:
mumbai_lat_coords

18.940170000000023

# Visualize the Map of Mumbai

To help visualize the Map of Mumbai and the neighbourhoods in Mumbai, we make use of the folium package.

In [16]:
# Creating the map of Mumbai
map_Mumbai = folium.Map(location=[mumbai_lat_coords, mumbai_lng_coords], zoom_start=12)
map_Mumbai

# adding markers to map
for latitude, longitude, borough, town in zip(df1['latitude'], df1['longitude'], df1['borough'], df1['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_Mumbai)  
    
map_Mumbai

### Venues in Mumbai

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in Mumbai.

In [17]:
CLIENT_ID = 'XOCJVA41PPQHX5ZVPCWPWESBQEJT5J05QUJVEQXPPFROB3YJ'
CLIENT_SECRET = 'KHQOY5O3TIKTEOGRFGQBJ5JUMBLBCLZJMIWGJPCA5NTXABIF'
VERSION = '20180605' # Foursquare API version
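
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A common alternative, sketched below, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions for illustration, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id",
            "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

In the notebook, calling `load_foursquare_credentials()` without arguments would read from the real environment and fail early with a clear message if a value is missing.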

Defining a function to get the nearby venues in each neighbourhood. This will help us get the venue categories, which are important for our analysis.

In [18]:
LIMIT=400

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
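
The function above assumes that every venue has at least one category, and it does not collect the venue coordinates listed in the Data Description. The sketch below shows one possible, more defensive way to flatten such a response. `parse_explore_items` is a hypothetical helper, the sample payload is invented, and the `location.lat` / `location.lng` keys are an assumption about the response shape.

```python
def parse_explore_items(neighbourhood, lat, lng, payload):
    """Flatten a venues/explore style JSON payload into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories", [])
        # A venue may come back without a category; label it instead of failing.
        category = categories[0]["name"] if categories else "Uncategorised"
        location = venue.get("location", {})
        rows.append((neighbourhood, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payload shaped like the response used in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe Arfa",
               "location": {"lat": 19.13, "lng": 72.84},
               "categories": [{"name": "Indian Restaurant"}]}},
    {"venue": {"name": "Unnamed Stall", "categories": []}},
]}]}}

rows = parse_explore_items("Amboli", 19.1293, 72.8434, sample)
print(rows)
```

An empty or malformed payload yields an empty list, so one failed neighbourhood would not stop the whole loop.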

Getting the venues in Mumbai

In [19]:
venues_in_Mumbai = getNearbyVenues(df1['borough'], df1['latitude'], df1['longitude'])

Amboli
Chakala, Andheri
D.N. Nagar
Four Bungalows
Lokhandwala
Marol
Sahar
Seven Bungalows
Versova
Mira Road
Bhayandar
Uttan
Bandstand Promenade
Kherwadi
Pali Hill
I.C. Colony
Gorai
Dahisa
Aarey Milk Colony
Bangur Nagar
Jogeshwari West
Juhu
Charkop
Poisar
Mahavir Nagar
Thakur village
Pali Naka
Khar Danda
Dindoshi
Sunder Nagar
Kalina
Naigaon
Nalasopara
Virar
Irla
Vile Parle
Bhandup
Amrut Nagar
Asalfa
Pant Nagar
Kanjurmarg
Nehru Nagar
Nahur
Chandivali
Hiranandani Gardens
Indian Institute of Technology Bombay campus
Vidyavihar
Vikhroli
Chembur
Deonar
Mankhurd
Mahul
Agripada
Altamount Road
Bhuleshwar
Breach Candy
Carmichael Road
Cavel
Churchgate
Cotton Green
Cuffe Parade
Cumbala Hill
Currey Road
Dhobitalao
Dongri
Kala Ghoda
Kemps Corner
Lower Parel
Mahalaxmi
Mahim
Malabar Hill
Marine Drive
Marine Lines
Mumbai Central
Nariman Point
Prabhadevi
Sion
Walkeshwar
Worli
C.G.S. colony
Dagdi Chawl
Navy Nagar
Hindu colony
Ballard Estate
Chira Bazaar
Fanas Wadi
Chor Bazaar
Matunga
Parel
Gowalia Tank
D

Sampling our data

In [20]:
venues_in_Mumbai.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Amboli,19.1293,72.8434,Shawarma Factory,Falafel Restaurant
1,Amboli,19.1293,72.8434,Pizza Express,Pizza Place
2,Amboli,19.1293,72.8434,Jaffer Bhai's Delhi Darbar,Mughlai Restaurant
3,Amboli,19.1293,72.8434,Cafe Arfa,Indian Restaurant
4,Amboli,19.1293,72.8434,"5 Spice , Bandra",Chinese Restaurant


In [21]:
venues_in_Mumbai.shape

(3715, 5)

Wow, we have scraped together 3715 records for venues. This will definitely make the clustering interesting.


### Grouping by Venue Categories
We now need to see how many venue categories there are for further processing.

In [22]:
venues_in_Mumbai.groupby("Venue Category").max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,Kalina,19.091000,72.901000,World of Titan
Afghan Restaurant,Amrut Nagar,19.102077,72.912835,Zaffran
Airport,Sahar,19.108056,72.867222,Pawan Hans
Airport Lounge,Sahar,19.098889,72.867222,GVK First and Business Lounge
Airport Service,Sahar,19.098889,72.867222,Security
...,...,...,...,...
Wine Bar,Worli,19.000000,72.823000,Vinoteca by Sula
Wine Shop,Dhobitalao,18.950000,72.840000,Peekay Wines
Women's Store,Vile Parle,19.140000,72.930000,misakee
Yoga Studio,Kalina,19.124714,72.856629,True Fitness


We can see 225 venue categories, which goes to show how diverse and interesting the place is.
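
The number of categories can also be read off directly, without inspecting the grouped table. This is a small sketch on toy data that reuses the column names of `venues_in_Mumbai`; the rows are invented.

```python
import pandas as pd

# Toy stand-in for venues_in_Mumbai with the same column names.
venues = pd.DataFrame({
    "Neighbourhood": ["Amboli", "Amboli", "Juhu", "Juhu", "Juhu"],
    "Venue Category": ["Café", "Bar", "Beach", "Café", "Café"],
})

# Number of distinct categories across all venues.
n_categories = venues["Venue Category"].nunique()

# Venues per category, most common first.
category_counts = venues["Venue Category"].value_counts()

print(n_categories)
print(category_counts)
```

`value_counts` additionally shows which categories dominate, which helps when interpreting the clusters later.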

### One Hot Encoding 
We need to encode our venue categories as numeric indicator columns to get a better result for our clustering.

In [23]:
Mumbai_venue_cat = pd.get_dummies(venues_in_Mumbai[['Venue Category']], prefix="", prefix_sep="")
Mumbai_venue_cat

Unnamed: 0,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,Aquarium,...,Train,Train Station,Udupi Restaurant,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3710,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighbourhood into the mix.

In [24]:
Mumbai_venue_cat['Neighbourhood'] = venues_in_Mumbai['Neighbourhood']

# moving the Neighbourhood column to the first position
fixed_columns = [Mumbai_venue_cat.columns[-1]] + list(Mumbai_venue_cat.columns[:-1])
Mumbai_venue_cat = Mumbai_venue_cat[fixed_columns]

Mumbai_venue_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,...,Train,Train Station,Udupi Restaurant,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Amboli,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean value
We will group the Neighbourhoods and calculate the mean venue categories value in each Neighbourhood

In [25]:
Mumbai_grouped = Mumbai_venue_cat.groupby('Neighbourhood').mean().reset_index()
Mumbai_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,...,Train,Train Station,Udupi Restaurant,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Aarey Milk Colony,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agripada,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037
2,Altamount Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.012195,0.012195,0.0
3,Amboli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0
4,Amrut Nagar,0.0,0.019231,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,...,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.019231,0.0,0.0
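
To see why the mean of one-hot columns is useful, consider a toy example with invented data. Averaging the 0/1 indicators within a neighbourhood gives the share of that neighbourhood's venues that fall in each category, so every row sums to 1.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Bar", "Bar"],
})

# One indicator column per category; dtype=float keeps the means numeric.
one_hot = pd.get_dummies(venues["Venue Category"], dtype=float)
one_hot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Averaging indicators per neighbourhood gives each category's share of venues.
shares = one_hot.groupby("Neighbourhood").mean().reset_index()
print(shares)
```

Here neighbourhood A has two cafés out of three venues, so its "Café" share is about 0.67, while B consists entirely of bars.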


Let's make a function to get the top most common venue categories

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

There are far too many venue categories to inspect one by one, so we take the top 10 for each neighbourhood to describe the clusters.

Creating the column labels for the top venues

In [27]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))


### Top venue categories

Getting the top venue categories in Mumbai

In [28]:
# create a new dataframe for Mumbai
neighborhoods_venues_sorted_mumbai = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_mumbai['Neighbourhood'] = Mumbai_grouped['Neighbourhood']

for ind in np.arange(Mumbai_grouped.shape[0]):
    neighborhoods_venues_sorted_mumbai.iloc[ind, 1:] = return_most_common_venues(Mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_mumbai.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Aarey Milk Colony,Café,Hotel,Resort,Farm,Indian Restaurant,Restaurant,Golf Course,Gym / Fitness Center,Electronics Store,Duty-free Shop
1,Agripada,Indian Restaurant,Gym,Restaurant,Coffee Shop,Bakery,Cupcake Shop,Chinese Restaurant,Club House,Pizza Place,Nightclub
2,Altamount Road,Chinese Restaurant,Bakery,Fast Food Restaurant,Bar,Indian Restaurant,Café,Pizza Place,Coffee Shop,Sandwich Place,Park
3,Amboli,Indian Restaurant,Bar,Pizza Place,Asian Restaurant,Coffee Shop,Chinese Restaurant,Pub,Burger Joint,Smoke Shop,Snack Place
4,Amrut Nagar,Café,Indian Restaurant,Lounge,Clothing Store,Fast Food Restaurant,Pizza Place,Electronics Store,Vegetarian / Vegan Restaurant,Asian Restaurant,Diner
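
The same ranking can be expressed with `Series.nlargest` once the neighbourhood name is the index. The following is a sketch on invented shares; `top_categories` and the `rank_…` column names are hypothetical.

```python
import pandas as pd

shares = pd.DataFrame(
    {"Bar": [0.2, 0.7], "Café": [0.5, 0.1], "Beach": [0.3, 0.2]},
    index=pd.Index(["A", "B"], name="Neighbourhood"),
)


def top_categories(frame, n):
    """Return the n highest-share category names for each row, best first."""
    rows = [row.nlargest(n).index.tolist() for _, row in frame.iterrows()]
    columns = [f"rank_{i + 1}" for i in range(n)]
    return pd.DataFrame(rows, index=frame.index, columns=columns)


top2 = top_categories(shares, 2)
print(top2)
```

Keeping the name in the index avoids having to skip the first column with `iloc[1:]`.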


## Model Building

### K-Means
Let's group the neighbourhoods of Mumbai into 5 clusters to make them easier to analyze.

We use the K-Means clustering technique to do so.


In [30]:
# set number of clusters
k_num_clusters= 5

Mumbai_grouped_clustering = Mumbai_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans_mumbai = KMeans(n_clusters=k_num_clusters, random_state=0).fit(Mumbai_grouped_clustering)
kmeans_mumbai

KMeans(n_clusters=5, random_state=0)
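
The choice of 5 clusters is a judgment call. One optional sanity check, which this notebook does not perform, is to compare silhouette scores for several values of k. The sketch below runs on synthetic, well-separated blobs instead of the Mumbai data, so it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs (not the Mumbai data).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Applied to `Mumbai_grouped_clustering`, the scores would likely be much lower and closer together, so they should be read as a rough guide and not a definitive answer.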

### Labelling Clustered Data

In [31]:
kmeans_mumbai.labels_

array([0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 2, 2,
       0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 4, 2, 0, 4, 2, 2, 2, 2, 2,
       2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2,
       0, 3, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 4, 0, 2, 0, 0, 0,
       2, 0])

So our model has labeled the city

In [32]:
neighborhoods_venues_sorted_mumbai.insert(0,'Cluster Labels',kmeans_mumbai.labels_ +1)
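
Because the labels are shifted by one before being stored, the clusters run from 1 to 5 in everything that follows. A quick size check, sketched here on a toy label array and not on the real `kmeans_mumbai.labels_`, shows how evenly the neighbourhoods are spread across clusters.

```python
import numpy as np

# Toy zero-based labels, in the form returned by KMeans.labels_.
labels = np.array([0, 2, 0, 0, 1, 2, 4, 0, 3])

# Shift to 1..k for display, then count the members of each cluster.
clusters, sizes = np.unique(labels + 1, return_counts=True)
cluster_sizes = dict(zip(clusters.tolist(), sizes.tolist()))
print(cluster_sizes)
```

Very small clusters, such as those with a single member, often correspond to neighbourhoods with few or unusual venues.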

Join df1 with our sorted neighbourhood venues to add the latitude and longitude of each neighbourhood and prepare it for plotting

In [33]:
mumbai_data = df1

mumbai_data = mumbai_data.join(neighborhoods_venues_sorted_mumbai.set_index('Neighbourhood'), on = 'borough')

mumbai_data.head()

Unnamed: 0,borough,town,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,1.0,Indian Restaurant,Bar,Pizza Place,Asian Restaurant,Coffee Shop,Chinese Restaurant,Pub,Burger Joint,Smoke Shop,Snack Place
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833,1.0,Hotel,Indian Restaurant,Café,Seafood Restaurant,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Multiplex,Vegetarian / Vegan Restaurant,Restaurant
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373,1.0,Bar,Pizza Place,Vegetarian / Vegan Restaurant,Indian Restaurant,Gym / Fitness Center,Pub,Women's Store,Snack Place,Lounge,Pharmacy
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721,1.0,Indian Restaurant,Pub,Coffee Shop,Vegetarian / Vegan Restaurant,Bar,Café,Gym / Fitness Center,Lounge,Smoke Shop,Pizza Place
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927,1.0,Bar,Indian Restaurant,Pub,Pizza Place,Coffee Shop,Lounge,Italian Restaurant,Asian Restaurant,Electronics Store,Multiplex


Drop all the NaN values to prevent data skew

In [34]:
mumbai_data_nonan = mumbai_data.dropna(subset=['Cluster Labels'])
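
The NaN values presumably come from neighbourhoods for which no venues were returned: they never appear in the grouped table, so the join leaves their cluster label empty. The toy example below, with invented rows, shows this effect and also why the labels appear as floats such as 1.0 after the join.

```python
import pandas as pd

areas = pd.DataFrame({"borough": ["Amboli", "Juhu", "Uttan"],
                      "latitude": [19.13, 19.10, 19.28]})
clustered = pd.DataFrame({"Neighbourhood": ["Amboli", "Juhu"],
                          "Cluster Labels": [1, 3]})

# Left join: every area is kept, unmatched ones get NaN labels.
joined = areas.join(clustered.set_index("Neighbourhood"), on="borough")

# Drop unmatched rows, then restore the integer dtype lost to NaN.
labelled = joined.dropna(subset=["Cluster Labels"]).copy()
labelled["Cluster Labels"] = labelled["Cluster Labels"].astype(int)
print(labelled)
```

Casting back to integers after `dropna` would remove the need for the `int(cluster)` conversions when plotting.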

### Visualizing the clustered neighbourhood
Let's plot the clusters

In [35]:
map_clusters_mumbai = folium.Map(location=[mumbai_lat_coords,mumbai_lng_coords], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_data_nonan['latitude'], mumbai_data_nonan['longitude'], mumbai_data_nonan['borough'], mumbai_data_nonan['Cluster Labels']):
    # Cluster Labels already run from 1 to k, so no further offset is needed here
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_mumbai)
        
map_clusters_mumbai
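
The colour scheme only needs one colour per cluster. A shorter way to build such a palette, shown as a sketch and not as a replacement for the cell above, samples the colormap directly at evenly spaced points.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_num_clusters = 5

# Sample k evenly spaced colours from the rainbow colormap.
palette = [colors.to_hex(rgba)
           for rgba in cm.rainbow(np.linspace(0, 1, k_num_clusters))]

# Cluster labels run from 1..k, so label c uses palette[c - 1].
colour_for_cluster = {label: palette[label - 1]
                      for label in range(1, k_num_clusters + 1)}
print(colour_for_cluster)
```

A dictionary keyed by cluster label makes the mapping between label and colour explicit when building the map legend or popups.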

### Examining our Clusters

Cluster 1

In [189]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 1, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Andheri,Western Suburbs",Indian Restaurant,Bar,Pizza Place,Asian Restaurant,Coffee Shop,Chinese Restaurant,Pub,Burger Joint,Smoke Shop,Snack Place
1,Western Suburbs,Hotel,Indian Restaurant,Café,Seafood Restaurant,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Multiplex,Vegetarian / Vegan Restaurant,Restaurant
2,"Andheri,Western Suburbs",Bar,Pizza Place,Vegetarian / Vegan Restaurant,Indian Restaurant,Gym / Fitness Center,Pub,Women's Store,Snack Place,Lounge,Pharmacy
3,"Andheri,Western Suburbs",Indian Restaurant,Pub,Coffee Shop,Vegetarian / Vegan Restaurant,Bar,Café,Gym / Fitness Center,Lounge,Smoke Shop,Pizza Place
4,"Andheri,Western Suburbs",Bar,Indian Restaurant,Pub,Pizza Place,Coffee Shop,Lounge,Italian Restaurant,Asian Restaurant,Electronics Store,Multiplex
6,"Andheri,Western Suburbs",Coffee Shop,Airport Service,Indian Restaurant,Fast Food Restaurant,Hotel,Airport Lounge,Café,Lounge,Spa,Airport Terminal
7,"Andheri,Western Suburbs",Café,Pub,Chinese Restaurant,Ice Cream Shop,Coffee Shop,Bar,Pizza Place,Indian Restaurant,Vegetarian / Vegan Restaurant,Beach
8,"Andheri,Western Suburbs",Indian Restaurant,Café,Smoke Shop,Chinese Restaurant,Multiplex,Pub,Diner,Coffee Shop,South Indian Restaurant,Donut Shop
9,"Mira-Bhayandar,Western Suburbs",Indian Restaurant,Pizza Place,Chinese Restaurant,Gym / Fitness Center,Café,Multiplex,Ice Cream Shop,Dessert Shop,Recreation Center,Bank
12,"Bandra,Western Suburbs",Coffee Shop,Scenic Lookout,Gym,Indian Restaurant,Boat or Ferry,Fast Food Restaurant,Lounge,Beach,Food Truck,Café


Cluster 2

In [190]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 2, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,South Mumbai,Arcade,Zoo,Dhaba,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,Electronics Store,Duty-free Shop


Cluster 3

In [191]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 3, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,"Andheri,Western Suburbs",Indian Restaurant,Bakery,Hotel,Fast Food Restaurant,Ice Cream Shop,Diner,Boat or Ferry,Chinese Restaurant,Restaurant,Lounge
10,"Mira-Bhayandar,Western Suburbs",Indian Restaurant,Fast Food Restaurant,Mexican Restaurant,Restaurant,Multiplex,Burger Joint,Light Rail Station,Ice Cream Shop,Food Truck,Vegetarian / Vegan Restaurant
15,"Borivali (West),Western Suburbs",Indian Restaurant,Chinese Restaurant,Bar,Fast Food Restaurant,Dessert Shop,Ice Cream Shop,Italian Restaurant,Soccer Field,Intersection,Snack Place
20,Western Suburbs,Indian Restaurant,Fast Food Restaurant,Shopping Mall,Pizza Place,Salon / Barbershop,Shoe Store,Light Rail Station,Restaurant,Mobile Phone Shop,Department Store
24,"Kandivali West,Western Suburbs",Indian Restaurant,Fast Food Restaurant,Coffee Shop,Pizza Place,Mexican Restaurant,Juice Bar,Food,Sporting Goods Shop,Bike Rental / Bike Share,Donut Shop
34,"Vile Parle,Western Suburbs",Indian Restaurant,Café,Sandwich Place,Snack Place,Ice Cream Shop,Fast Food Restaurant,Seafood Restaurant,Bar,Coffee Shop,Pub
38,"Ghatkopar,Eastern Suburbs",Indian Restaurant,Coffee Shop,Juice Bar,Accessories Store,Donut Shop,Factory,Metro Station,Men's Store,Gym / Fitness Center,Department Store
39,"Ghatkopar,Eastern Suburbs",Indian Restaurant,Fast Food Restaurant,Gym / Fitness Center,Ice Cream Shop,Bakery,Pizza Place,Dessert Shop,Restaurant,Vegetarian / Vegan Restaurant,Coffee Shop
42,"Mulund,Eastern Suburbs",Indian Restaurant,Restaurant,Gym / Fitness Center,Department Store,Park,Café,Coffee Shop,Bar,Train Station,Clothing Store
44,"Powai,Eastern Suburbs",Indian Restaurant,Bakery,Pizza Place,Ice Cream Shop,Bar,Lounge,Chinese Restaurant,Restaurant,Arcade,Café


Cluster 4

In [192]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 4, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,"Vasai,Western Suburbs",Bus Station,Bar,Zoo,Farmers Market,Farm,Falafel Restaurant,Factory,Event Space,Electronics Store,Duty-free Shop


Cluster 5

In [193]:
mumbai_data_nonan.loc[mumbai_data_nonan['Cluster Labels'] == 5, mumbai_data_nonan.columns[[1] + list(range(5, mumbai_data_nonan.shape[1]))]]

Unnamed: 0,town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,"Mira-Bhayandar,Western Suburbs",Beach,Playground,Indian Restaurant,Resort,Bus Station,Zoo,Dhaba,Falafel Restaurant,Factory,Event Space
16,"Borivali (West),Western Suburbs",Beach,Food,Seafood Restaurant,Indian Restaurant,Zoo,Dim Sum Restaurant,Farm,Falafel Restaurant,Factory,Event Space
64,South Mumbai,Beach,Playground,Indian Restaurant,Resort,Bus Station,Zoo,Dhaba,Falafel Restaurant,Factory,Event Space
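
Instead of reading each cluster table separately, a compact summary can be built with `groupby`. The sketch below uses a small invented table with the same column names; on the real data it would be applied to `mumbai_data_nonan`.

```python
import pandas as pd

table = pd.DataFrame({
    "Cluster Labels": [1, 1, 1, 3, 3, 5],
    "1st Most Common Venue": ["Bar", "Indian Restaurant", "Bar",
                              "Indian Restaurant", "Indian Restaurant", "Beach"],
})

grouped = table.groupby("Cluster Labels")["1st Most Common Venue"]
summary = pd.DataFrame({
    # Number of neighbourhoods assigned to each cluster.
    "neighbourhoods": grouped.size(),
    # Category that most often ranks first within the cluster.
    "top_venue": grouped.agg(lambda s: s.value_counts().idxmax()),
})
print(summary)
```

Such a table gives a one-line description of each cluster, for example "mostly bars" or "mostly beaches", which is easier to compare across cities.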




---



# Exploring London

### Neighbourhoods of London

We begin by collecting and refining the data needed for our business solution to work.

### Data Collection

To get the neighbourhoods in London, we start by scraping the Wikipedia page that lists the areas of London.

In [36]:
url_london = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
wiki_london_url = requests.get(url_london)
wiki_london_url

<Response [200]>

Response 200 means that the request succeeded and we are able to make the connection.

In [37]:
wiki_london_data = pd.read_html(wiki_london_url.text)
wiki_london_data

[                                                   0
 0  Map all coordinates in "Category:Areas of Lond...
 1                 Download coordinates as: KML · GPX,
             Location                     London borough       Post town  \
 0         Abbey Wood              Bexley, Greenwich [7]          LONDON   
 1              Acton  Ealing, Hammersmith and Fulham[8]          LONDON   
 2          Addington                         Croydon[8]         CROYDON   
 3         Addiscombe                         Croydon[8]         CROYDON   
 4        Albany Park                             Bexley  BEXLEY, SIDCUP   
 ..               ...                                ...             ...   
 526         Woolwich                          Greenwich          LONDON   
 527   Worcester Park       Sutton, Kingston upon Thames  WORCESTER PARK   
 528  Wormwood Scrubs             Hammersmith and Fulham          LONDON   
 529          Yeading                         Hillingdon           HAYES   
 

Scraping the webpage returns every table on the page. We need the second table, so we select it by index.

In [38]:
wiki_london_data = wiki_london_data[1]
wiki_london_data

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


### Data Preprocessing

We strip the surrounding spaces from the column titles and replace the spaces between words with underscores.

In [39]:
wiki_london_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
wiki_london_data

Unnamed: 0,Location,London borough,Post_town,Postcode district,Dial code,OS_grid_ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


We see that a few columns still have no '_' between the words despite applying our function, which suggests that those titles contain special characters rather than ordinary spaces.
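
One plausible cause, which the output above does not confirm, is that the untouched titles contain non-breaking spaces (U+00A0) instead of ordinary spaces. Under that assumption, a minimal sketch that replaces any run of Unicode whitespace could look like this. The helper name `normalise_header` and the sample titles are hypothetical.

```python
import re

def normalise_header(name):
    # In Python 3 the \s class matches Unicode whitespace, including the
    # non-breaking space U+00A0 that a plain replace(" ", "_") misses.
    return re.sub(r"\s+", "_", name.strip())

# Assumed example titles: the second one uses a non-breaking space.
headers = ["Post town", "London\u00a0borough", " OS grid ref "]
cleaned = [normalise_header(h) for h in headers]
print(cleaned)
```

If the assumption holds, the same helper could be passed to `wiki_london_data.rename(columns=normalise_header)`.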

### Feature Selection

We only need the borough, post town and postcode district columns for the next steps, so we drop the location, dial code and OS grid reference columns.

In [40]:
df2 = wiki_london_data.drop( [ wiki_london_data.columns[0], wiki_london_data.columns[4], wiki_london_data.columns[5] ], axis=1)

In [41]:
df2.head()

Unnamed: 0,London borough,Post_town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


Let's rename the columns to something simpler: borough, town and post_code.

In [42]:
df2.columns = ['borough','town','post_code']
df2

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4


Let's remove the square brackets [ ] and the footnote numbers from the borough column.

In [43]:
df2['borough'] = df2['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df2

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4
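
The chained `rstrip` calls handle a value such as `Croydon[8]`, but for `Bexley, Greenwich [7]` they leave a trailing space behind, which can make the same borough appear under two slightly different names later on. As a hedged alternative, a regular expression can remove the marker together with any whitespace in front of it. The helper name `strip_citation` is hypothetical.

```python
import re

def strip_citation(borough):
    """Remove footnote markers such as '[7]' and trim surrounding whitespace."""
    return re.sub(r"\s*\[\d+\]", "", borough).strip()

samples = ["Bexley, Greenwich [7]", "Ealing, Hammersmith and Fulham[8]", "Bexley"]
print([strip_citation(s) for s in samples])
```

It could be applied with `df2['borough'] = df2['borough'].map(strip_citation)`.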


Let's check the dimensions of the dataframe.

In [45]:
df2.shape

(531, 3)

We currently have 531 records and 3 columns in our data. It's time to perform feature engineering.

### Feature Engineering

We focus only on the neighbourhoods of London, so we keep the rows whose post town contains 'LONDON'.

In [46]:
df2 = df2[df2['town'].str.contains('LONDON')]
df2

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
...,...,...,...
521,Redbridge,LONDON,"IG8, E18"
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8
525,Barnet,LONDON,N12
526,Greenwich,LONDON,SE18


In [47]:
df2.shape

(308, 3)
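
Note that `str.contains('LONDON')` is a substring test, so rows whose post town lists several towns, such as `LONDON, WOODFORD GREEN`, are kept as well. The sketch below contrasts it with an exact comparison on a few post-town values copied from the tables above; which behaviour is preferable depends on the analysis.

```python
import pandas as pd

# Illustrative post-town values copied from the tables above.
town = pd.Series(["LONDON", "CROYDON", "BEXLEY, SIDCUP", "LONDON, WOODFORD GREEN"])

substring_mask = town.str.contains("LONDON")  # the filter used above
exact_mask = town == "LONDON"                 # stricter alternative

print(town[substring_mask].tolist())
print(town[exact_mask].tolist())
```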

We now have only 308 rows, so we can proceed with the next steps. First, we look at some descriptive information.

In [48]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 308 entries, 0 to 528
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   borough    308 non-null    object
 1   town       308 non-null    object
 2   post_code  308 non-null    object
dtypes: object(3)
memory usage: 9.6+ KB


## Geolocations of the London Neighbourhoods

### ArcGis API

We need the geographical co-ordinates of the neighbourhoods to plot our map. We will use the arcgis package to do so.

ArcGIS does not limit the number of API calls we make, so it fits our use case well.

In [49]:
pip install arcgis





In [50]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

Defining an ArcGIS geocoding function for London that returns the latitude and longitude

In [51]:
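def get_x_y_uk(address1):
    # Geocode the postcode within London and keep the first match
    g = geocode(address='{}, London, England, GBR'.format(address1))[0]
    lng_coords = g['location']['x']  # x holds the longitude
    lat_coords = g['location']['y']  # y holds the latitude
    # Return "latitude,longitude" as a single string
    return str(lat_coords) + "," + str(lng_coords)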

Checking sample data

In [52]:
c = get_x_y_uk('SE2')

In [53]:
c

'51.492450000000076,0.12127000000003818'

That looks good. We copy the postal codes of London so that we can pass them to the geocoding function we just defined.

In [54]:
geo_coordinates_uk = df2['post_code']    
geo_coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
521    IG8, E18
522         IG8
525         N12
526        SE18
528         W12
Name: post_code, Length: 308, dtype: object

Passing postal codes of london to get the geographical co-ordinates

In [55]:
coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
coordinates_latlng_uk

0       51.492450000000076,0.12127000000003818
1        51.51324000000005,-0.2674599999999714
6       51.51200000000006,-0.08057999999994081
7       51.51651000000004,-0.11967999999995982
9       51.41009000000008,-0.05682999999993399
                        ...                   
521    51.589770000000044,0.030520000000024083
522      51.50642000000005,-0.1272099999999341
525     51.615920000000074,-0.1767399999999384
526      51.48207000000008,0.07143000000002075
528      51.50645000000003,-0.2369099999999662
Name: post_code, Length: 308, dtype: object

### Latitude

Extracting the latitude from our previously collected coordinates

In [56]:
lat_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[0])
lat_uk

0      51.492450000000076
1       51.51324000000005
6       51.51200000000006
7       51.51651000000004
9       51.41009000000008
              ...        
521    51.589770000000044
522     51.50642000000005
525    51.615920000000074
526     51.48207000000008
528     51.50645000000003
Name: post_code, Length: 308, dtype: object

### Longitude

Extracting the Longitude from our previously collected coordinates

In [57]:
lng_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[1])
lng_uk

0       0.12127000000003818
1       -0.2674599999999714
6      -0.08057999999994081
7      -0.11967999999995982
9      -0.05682999999993399
               ...         
521    0.030520000000024083
522     -0.1272099999999341
525     -0.1767399999999384
526     0.07143000000002075
528     -0.2369099999999662
Name: post_code, Length: 308, dtype: object

We now have the geographical co-ordinates of the London Neighbourhoods.
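
The two extraction steps above could also be done in a single pass. A minimal sketch, assuming the same `"latitude,longitude"` string format returned by `get_x_y_uk`; the variable names `coords` and `latlng` are illustrative.

```python
import pandas as pd

# Two "latitude,longitude" strings in the format returned by get_x_y_uk above.
coords = pd.Series(
    ["51.492450000000076,0.12127000000003818",
     "51.51324000000005,-0.2674599999999714"],
    index=[0, 1],
)

# expand=True returns one column per piece; astype(float) converts the text.
latlng = coords.str.split(",", expand=True).astype(float)
latlng.columns = ["latitude", "longitude"]
print(latlng)
```

Because the result keeps the original index, it can be concatenated with the source dataframe along `axis=1` in the same way as the separate latitude and longitude series.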

We now merge our source data with the geographical co-ordinates to prepare the dataset for the next stage.

In [58]:
# df2 is the London dataframe built above (borough, town, post_code)
london_merged = pd.concat([df2, lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['borough','town','post_code','latitude','longitude']
london_merged

In [None]:
london_merged.dtypes

### Co-ordinates for London

Getting the geocode for London to help visualize it on the map

In [None]:
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']
london_lng_coords

In [None]:
london_lat_coords

## Visualize the Map of London

To help visualize the Map of London and the neighbourhoods in London, we make use of the folium package.

In [None]:
# Creating the map of London
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)
map_London

# adding markers to map
for latitude, longitude, borough, town in zip(london_merged['latitude'], london_merged['longitude'], london_merged['borough'], london_merged['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London)  
    
map_London

### Venues in London

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in London.

In [None]:
# Replace the placeholders with your own Foursquare credentials; never publish real keys
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605' # Foursquare API version
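
Credentials are usually better kept out of the notebook itself. One common approach is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are chosen for illustration and are not prescribed by Foursquare; the helper name is hypothetical too.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With the variables set in the shell, the cell above would reduce to `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.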

Defining a function to get the nearby venues in each neighbourhood. This gives us the venue categories, which are central to our analysis.

In [None]:
LIMIT=100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
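
The list comprehension in `getNearbyVenues` reads `categories[0]` for every venue, which would raise an `IndexError` if a venue came back without a category. Whether that happens in practice is an assumption here. A defensive parsing step, sketched against the same keys that the function above accesses, might look like this. The helper name `extract_venues` and the sample payload are hypothetical.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an explore-style payload into (name, lat, lng, venue, category) tuples.

    The layout mirrors the keys accessed in getNearbyVenues above.
    Venues without any category are labelled 'Unknown' instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"], category))
    return rows

# Hand-written payload for illustration only; not real API output.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Lesnes Abbey", "categories": [{"name": "Historic Site"}]}},
    {"venue": {"name": "Unnamed Kiosk", "categories": []}},
]}]}}
rows = extract_venues(sample, "Bexley, Greenwich", 51.49245, 0.12127)
print(rows)
```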

Getting the venues in London

In [131]:
venues_in_London = getNearbyVenues(london_merged['borough'], london_merged['latitude'], london_merged['longitude'])

Bexley, Greenwich 
Ealing, Hammersmith and Fulham
City
Westminster
Bromley
Islington
Islington
Barnet
Enfield
Wandsworth
Southwark
City
Richmond upon Thames
Barnet
Islington
Wandsworth
Westminster
Bromley
Newham
Ealing
Westminster
Lewisham
Camden
Southwark
Tower Hamlets
Bexley
City
Lewisham
Greenwich
Tower Hamlets
Camden
Haringey
Tower Hamlets
Haringey
Barnet
Brent
Lambeth
Lewisham
Tower Hamlets
Kensington and Chelsea, Hammersmith and Fulham
Brent
Barnet
Barnet
Southwark
Tower Hamlets
Camden
Tower Hamlets
Waltham Forest
Newham
Islington
Richmond upon Thames
Lewisham
Camden
Westminster
Greenwich
Kensington and Chelsea
Barnet
Westminster
Lewisham
Waltham Forest
Hounslow, Ealing, Hammersmith and Fulham
Brent
Barnet
Lambeth, Wandsworth
Islington
Barnet
Merton
Barnet
Westminster
Barnet, Brent, Camden
Lewisham
Bexley
Haringey
Bromley
Tower Hamlets
Newham
Hackney
Islington
Southwark
Lewisham
Brent
Southwark
Ealing
Kensington and Chelsea
Wandsworth
Southwark
Barnet
Newham
Richmond upon Thames


Sampling our data

In [132]:
venues_in_London.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.49245,0.12127,Lesnes Abbey,Historic Site
1,"Bexley, Greenwich",51.49245,0.12127,Sainsbury's,Supermarket
2,"Bexley, Greenwich",51.49245,0.12127,Lidl,Supermarket
3,"Bexley, Greenwich",51.49245,0.12127,Abbey Wood Railway Station (ABW),Train Station
4,"Bexley, Greenwich",51.49245,0.12127,Costcutter,Convenience Store


In [133]:
venues_in_London.shape

(10351, 5)


Wow, we have scraped together 10351 records for venues. This will definitely make the clustering interesting.



### Grouping by Venue Categories
We now need to see how many venue categories there are before further processing.

In [134]:
venues_in_London.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ATM,Kensington and Chelsea,51.49807,-0.17404,NatWest ATM
Accessories Store,Westminster,51.51656,-0.11968,James Smith & Sons
Adult Boutique,Islington,51.52969,-0.08697,Sh! Women's Erotic Emporium
African Restaurant,Westminster,51.52587,-0.08808,Red Sea Restaurant
American Restaurant,Waltham Forest,51.61780,0.02795,Spielburger
...,...,...,...,...
Wings Joint,Hammersmith and Fulham,51.54187,-0.19795,Wingmans
Women's Store,Westminster,51.55457,-0.11478,Vivien of Holloway
Xinjiang Restaurant,Southwark,51.47480,-0.09313,Silk Road
Yoga Studio,Westminster,51.55457,-0.03558,yogahaven


The grouped table has 304 rows, one per venue category, which shows how diverse and interesting the place is.
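
Grouping with `max()` shows one row per category, but the count itself can be read more directly. A small sketch using the first five venue rows from the sample shown earlier; `nunique` and `value_counts` are standard pandas methods, while the variable names are illustrative.

```python
import pandas as pd

# First five venue rows from the sample shown earlier.
venues = pd.DataFrame({
    "Venue": ["Lesnes Abbey", "Sainsbury's", "Lidl",
              "Abbey Wood Railway Station (ABW)", "Costcutter"],
    "Venue Category": ["Historic Site", "Supermarket", "Supermarket",
                       "Train Station", "Convenience Store"],
})

n_categories = venues["Venue Category"].nunique()       # distinct categories
per_category = venues["Venue Category"].value_counts()  # venues per category
print(n_categories)
print(per_category)
```

On the full dataset the equivalent call would be `venues_in_London['Venue Category'].nunique()`.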

### One Hot Encoding 
We one-hot encode the venue categories so that they can be used as numeric features for clustering.

In [135]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")
London_venue_cat

Unnamed: 0,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10349,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding Neighbourhood into the mix.

In [136]:
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

# moving neighborhood column to the first column
fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

London_venue_cat.head()

Unnamed: 0,Neighbourhood,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean value
We will group the Neighbourhoods and calculate the mean venue categories value in each Neighbourhood

In [137]:
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Barnet,0.0,0.0,0.0,0.0,0.001761,0.0,0.0,0.0,0.007042,...,0.001761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
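
Taking the mean of one-hot columns within a group gives the share of that neighbourhood's venues that fall into each category. A toy illustration with made-up neighbourhood names and venues:

```python
import pandas as pd

# Toy data for illustration; the neighbourhood names and venues are made up.
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Pub", "Pub", "Café", "Park"],
})

# dtype=float keeps the indicator columns numeric across pandas versions.
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy["Neighbourhood"])

shares = onehot.groupby("Neighbourhood").mean()
print(shares)
```

In this toy case two of the three venues in neighbourhood A are pubs, so its `Pub` value is two thirds, and each row sums to 1.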



Let's write a function that returns the most common venue categories for a neighbourhood.

In [138]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


There are far too many venue categories to inspect individually, so we keep the 10 most common per neighbourhood when describing the clusters.

Building the column labels for the ranked venues

In [139]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))


### Top venue categories

Getting the top venue categories in London
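
The cell that builds `neighborhoods_venues_sorted_london`, which later cells rely on, is not shown here. The following is only a sketch of how such a table could be assembled from `London_grouped`, `columns` and `return_most_common_venues`. It is a reconstruction under that assumption, not the original cell, and it uses a toy stand-in for `London_grouped` with three ranks instead of ten.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function defined above: skip the name, sort by frequency.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy stand-in for London_grouped; the frequencies are made up.
London_grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Café": [0.2, 0.1],
    "Park": [0.3, 0.6],
    "Pub": [0.5, 0.3],
})

num_top_venues = 3
columns = ["Neighbourhood", "1st Most Common Venue",
           "2nd Most Common Venue", "3rd Most Common Venue"]

# One row per neighbourhood: its name followed by its ranked categories.
rows = []
for _, row in London_grouped.iterrows():
    top = return_most_common_venues(row, num_top_venues)
    rows.append([row["Neighbourhood"], *top])

neighborhoods_venues_sorted_london = pd.DataFrame(rows, columns=columns)
print(neighborhoods_venues_sorted_london)
```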

## Model Building

### K Means
Let's group the neighbourhoods of London into 5 clusters to make them easier to analyse.

We use the K Means clustering technique to do so.

In [141]:
# set number of clusters
k_num_clusters = 5

London_grouped_clustering = London_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)
kmeans_london

KMeans(n_clusters=5, random_state=0)

### Labelling Clustered Data

In [142]:
kmeans_london.labels_

array([0, 2, 3, 0, 3, 4, 4, 4, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
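
Before interpreting the clusters it helps to know how large each one is. The sketch below counts the labels printed above using only the standard library, and shifts them by one so that they match the 1-based cluster numbers used in the rest of this section.

```python
from collections import Counter

# Labels copied from the kmeans_london.labels_ output above.
labels = [0, 2, 3, 0, 3, 4, 4, 4, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0]

# Shift by one so the keys match the 1-based cluster numbers.
cluster_sizes = Counter(label + 1 for label in labels)
for cluster in sorted(cluster_sizes):
    print(cluster, cluster_sizes[cluster])
```

For these labels most of the grouped neighbourhoods fall into a single cluster, which is worth keeping in mind when reading the per-cluster tables.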

Our model has now assigned a cluster label to each grouped neighbourhood.

In [143]:
neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)

We join london_merged with the sorted neighbourhood venues so that each neighbourhood has its latitude and longitude alongside its cluster label, ready for plotting.

In [144]:
london_data = london_merged

london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='borough')

london_data.head()

Unnamed: 0,borough,town,post_code,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127,4,Supermarket,Platform,Convenience Store,Train Station,Historic Site,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746,5,Grocery Store,Indian Restaurant,Breakfast Spot,Park,Train Station,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market
6,City,LONDON,EC3,51.512,-0.08058,1,Hotel,Italian Restaurant,Coffee Shop,Gym / Fitness Center,Pub,Restaurant,Wine Bar,French Restaurant,Sandwich Place,Cocktail Bar
7,Westminster,LONDON,WC2,51.51651,-0.11968,1,Hotel,Coffee Shop,Café,Pub,Sandwich Place,Italian Restaurant,Theater,Restaurant,Hotel Bar,Sushi Restaurant
9,Bromley,LONDON,SE20,51.41009,-0.05683,5,Supermarket,Convenience Store,Grocery Store,Fast Food Restaurant,Hotel,Park,Indian Restaurant,Gastropub,Bistro,Sandwich Place



Drop all the NaN values to prevent data skew

In [145]:
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])

### Visualizing the clustered neighbourhood
Let's plot the clusters

In [147]:
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_data_nonan['latitude'], london_data_nonan['longitude'], london_data_nonan['borough'], london_data_nonan['Cluster Labels']):
    # Cluster Labels are already 1-based, so no further offset is needed in the popup
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
        
map_clusters_london

## Examining our Clusters

Cluster 1

In [148]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 1, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,LONDON,1,Hotel,Italian Restaurant,Coffee Shop,Gym / Fitness Center,Pub,Restaurant,Wine Bar,French Restaurant,Sandwich Place,Cocktail Bar
7,LONDON,1,Hotel,Coffee Shop,Café,Pub,Sandwich Place,Italian Restaurant,Theater,Restaurant,Hotel Bar,Sushi Restaurant
10,LONDON,1,Coffee Shop,Pub,Food Truck,Café,Italian Restaurant,Vietnamese Restaurant,Breakfast Spot,Park,Hotel,Gym / Fitness Center
12,LONDON,1,Coffee Shop,Pub,Food Truck,Café,Italian Restaurant,Vietnamese Restaurant,Breakfast Spot,Park,Hotel,Gym / Fitness Center
14,"BARNET, LONDON",1,Coffee Shop,Café,Grocery Store,Bus Stop,Pub,Italian Restaurant,Supermarket,Pharmacy,Turkish Restaurant,Sushi Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
521,LONDON,1,Café,Pub,Grocery Store,Seafood Restaurant,Liquor Store,Bakery,Coffee Shop,BBQ Joint,Park,Bar
522,"LONDON, WOODFORD GREEN",1,Hotel,Café,Theater,Plaza,Garden,Pub,Bakery,Steakhouse,Ramen Restaurant,Pharmacy
525,LONDON,1,Coffee Shop,Café,Grocery Store,Bus Stop,Pub,Italian Restaurant,Supermarket,Pharmacy,Turkish Restaurant,Sushi Restaurant
526,LONDON,1,Pub,Grocery Store,Bus Stop,Coffee Shop,Indian Restaurant,Construction & Landscaping,Turkish Restaurant,Golf Course,Middle Eastern Restaurant,Fish & Chips Shop


Cluster 2

In [149]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 2, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
377,"HARROW, STANMOREEDGWARE, LONDON",2,Bakery,Gym,Fish Market,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Film Studio,Fish & Chips Shop


Cluster 3

In [150]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 3, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
121,LONDON,3,Gym / Fitness Center,Clothing Store,Hardware Store,Supermarket,Zoo Exhibit,Film Studio,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant


Cluster 4

In [151]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 4, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,LONDON,4,Supermarket,Platform,Convenience Store,Train Station,Historic Site,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market
45,"BEXLEYHEATH, LONDON",4,Supermarket,Historic Site,Platform,Convenience Store,Train Station,Bus Stop,Park,Construction & Landscaping,Golf Course,Zoo Exhibit
124,LONDON,4,Supermarket,Historic Site,Platform,Convenience Store,Train Station,Bus Stop,Park,Construction & Landscaping,Golf Course,Zoo Exhibit
291,"LONDON, SIDCUP",4,Supermarket,Historic Site,Platform,Convenience Store,Train Station,Bus Stop,Park,Construction & Landscaping,Golf Course,Zoo Exhibit
505,LONDON,4,Supermarket,Historic Site,Platform,Convenience Store,Train Station,Bus Stop,Park,Construction & Landscaping,Golf Course,Zoo Exhibit


Cluster 5

In [152]:
london_data_nonan.loc[london_data_nonan['Cluster Labels'] == 5, london_data_nonan.columns[[1] + list(range(5, london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,LONDON,5,Grocery Store,Indian Restaurant,Breakfast Spot,Park,Train Station,Film Studio,Event Space,Exhibit,Falafel Restaurant,Farmers Market
9,LONDON,5,Supermarket,Convenience Store,Grocery Store,Fast Food Restaurant,Hotel,Park,Indian Restaurant,Gastropub,Bistro,Sandwich Place
29,"BECKENHAM, LONDON",5,Supermarket,Convenience Store,Grocery Store,Fast Food Restaurant,Hotel,Park,Indian Restaurant,Gastropub,Bistro,Sandwich Place
61,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
69,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
100,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
127,LONDON,5,Supermarket,Convenience Store,Grocery Store,Fast Food Restaurant,Hotel,Park,Indian Restaurant,Gastropub,Bistro,Sandwich Place
137,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
217,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
260,LONDON,5,Indian Restaurant,Sandwich Place,Fast Food Restaurant,Pharmacy,Warehouse Store,Chinese Restaurant,Discount Store,Bed & Breakfast,Convenience Store,Pub
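
The five cells above differ only in the cluster number, so they could be folded into one helper. A minimal sketch, assuming the same column order as `london_data_nonan`; the function name `cluster_view` and the toy frame are hypothetical.

```python
import pandas as pd

def cluster_view(data, label):
    """Return the town column plus the cluster and venue columns for one cluster."""
    keep = data.columns[[1] + list(range(5, data.shape[1]))]
    return data.loc[data["Cluster Labels"] == label, keep]

# Toy frame with the same column order as london_data_nonan; values are made up.
toy = pd.DataFrame({
    "borough": ["X", "Y", "Z"],
    "town": ["LONDON", "LONDON", "BARNET, LONDON"],
    "post_code": ["E1", "N1", "N2"],
    "latitude": [51.5, 51.6, 51.7],
    "longitude": [-0.1, -0.2, -0.3],
    "Cluster Labels": [1, 2, 1],
    "1st Most Common Venue": ["Pub", "Bakery", "Café"],
})
print(cluster_view(toy, 1))
```

On the real data, `cluster_view(london_data_nonan, 4)` would select the same rows and columns as the Cluster 4 cell.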




---




# Results and Discussion


Mumbai covers a relatively large geographical area. It offers a wide variety of cuisines and eateries, including French, Thai, Cambodian, Asian and Chinese. There are many hangout spots, including restaurants, bars and clubs. Public transport in Mumbai includes buses, taxis, trains and rickshaws. For leisure and sightseeing, there are plenty of plazas, trails, parks, historic sites, clothing shops and art galleries.

Overall, Mumbai seems like a relaxing vacation spot with a mix of lakes, historic spots and a wide variety of cuisines to try out.

The neighbourhoods of London are very multicultural. There are many different cuisines, including Indian, Italian, Turkish and Chinese. London goes a step further in this direction, with many restaurants, bars, juice bars, coffee shops, fish and chip shops and breakfast spots. It also has plenty of shopping options, such as flea markets, flower shops, fish markets, fishing stores and clothing stores. The main modes of transport seem to be buses and trains. For leisure, the neighbourhoods have many parks, golf courses, zoos, gyms and historic sites.

Overall, the city of London offers a multicultural, diverse and certainly an entertaining experience.




# Conclusion

The purpose of this project was to explore the cities of Mumbai and London and see how attractive they are to potential tourists and migrants. We explored both cities by extracting the most common venues in each neighbourhood, and concluded by clustering similar neighbourhoods together.

We could see that the neighbourhoods in both cities offer a wide variety of experiences, each unique in its own way. The cultural diversity is quite evident, which also gives a sense of inclusion.

Both London and Mumbai seem to offer a vacation stay or a romantic getaway with many places to explore, beautiful landscapes and a wide variety of cultures. Overall, it is up to the stakeholders to decide which experience they would prefer and which would be more to their liking.