# Introduction: business problem

In this project I want to analyze a dataset covering the city of London to find where a person could open a business.

In particular, using the Foursquare location data, I will try to answer the main question: where is the best place to open a business?

# Dataset

The data that I'm going to use for the Capstone project is available on this [site](https://www.doogal.co.uk/london_postcodes.php).

The dataset is quite large (about 135 MB, 320426 rows and 44 attributes), so I will only use basic information such as:
1. District 
2. District Code
3. Ward (Neighborhood)

Before starting the analysis I decided to filter the rows with the "In Use?" attribute, which states whether a postcode is in use or not.

After this filtering I'm going to use an unsupervised learning algorithm: K-Means.

Unfortunately the latitude and longitude in the dataframe are given for every postcode, so I'm going to use an external service to find the latitude and longitude of each district code. Finally, for each district I will try to find, with the Foursquare data, the best place to open a business.
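
Since the coordinates are stored per postcode, one alternative to an external lookup would be to average the postcode coordinates inside each district. The sketch below only illustrates that idea: the dataframe is made up (it just reuses the column names 'District Code', 'Latitude' and 'Longitude'), and the mean of postcode points is a rough centre, not an official district centroid.

```python
import pandas as pd

# Toy rows that mimic the postcode-level layout of the dataset (values are made up)
toy_postcodes = pd.DataFrame({
    'District Code': ['E09000006', 'E09000006', 'E09000002', 'E09000002'],
    'Latitude': [51.40, 51.42, 51.53, 51.55],
    'Longitude': [0.01, 0.03, 0.10, 0.12],
})

# Average the postcode coordinates of each district to get one point per district
district_centres = (
    toy_postcodes
    .groupby('District Code')[['Latitude', 'Longitude']]
    .mean()
    .reset_index()
)

print(district_centres)
```

On the real data the same `groupby` would have to run before the coordinate columns are dropped.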

The cells below show some information about the dataset: the number of rows, the first lines of the dataframe, etc.

### Import libraries and creation of dataframe 

In [1]:
import random 
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs
import itertools

print('Libraries imported!')

Libraries imported!


In [2]:
# Create the dataframe from the link
link = "https://www.doogal.co.uk/UKPostcodesCSV.ashx?area=London"
df_london = pd.read_csv(link)

In [3]:
df_london.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,...,Constituency Code,Index of Multiple Deprivation,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,0,2019-05-29,Bromley South,0.218257,BR,BR1,Metropolitan Police
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,...,E14000604,10169,1,0,2019-05-29,Bromley North,0.253666,BR,BR1,Metropolitan Police
2,BR1 1AD,No,51.400057,0.016715,540386,168710,TQ403687,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,1,2019-05-29,Bromley South,0.044559,BR,BR1,Metropolitan Police
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,...,E14000604,19350,1,0,2019-05-29,Bromley North,0.462939,BR,BR1,Metropolitan Police
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,0,2019-05-29,Bromley South,0.227664,BR,BR1,Metropolitan Police


In [4]:
df_london.columns

Index(['Postcode', 'In Use?', 'Latitude', 'Longitude', 'Easting', 'Northing',
       'Grid Ref', 'County', 'District', 'Ward', 'District Code', 'Ward Code',
       'Country', 'County Code', 'Constituency', 'Introduced', 'Terminated',
       'Parish', 'National Park', 'Population', 'Households', 'Built up area',
       'Built up sub-division', 'Lower layer super output area', 'Rural/urban',
       'Region', 'Altitude', 'London zone', 'LSOA Code', 'Local authority',
       'MSOA Code', 'Middle layer super output area', 'Parish Code',
       'Census output area', 'Constituency Code',
       'Index of Multiple Deprivation', 'Quality', 'User Type', 'Last updated',
       'Nearest station', 'Distance to station', 'Postcode area',
       'Postcode district', 'Police force'],
      dtype='object')

Filtering out the postcodes that are not in use

In [5]:
print("The number of rows of the df is: ", df_london.shape[0])

The number of rows of the df is:  320426


In [6]:
print("After filtering, the number of rows to analyze is: ", df_london[df_london['In Use?'] == 'Yes'].shape[0])

After filtering, the number of rows to analyze is:  177967


List of districts

In [7]:
print(df_london['District'].unique())

['Bromley' 'Lewisham' 'Lambeth' 'Croydon' 'Greenwich' 'Havering' 'Camden'
 'Sutton' 'Merton' 'Bexley' 'Tower Hamlets' 'City of London' 'Hackney'
 'Waltham Forest' 'Redbridge' 'Newham' 'Enfield' 'Islington' 'Westminster'
 'Barnet' 'Brent' 'Ealing' 'Harrow' 'Hillingdon' 'Barking and Dagenham'
 'Kingston upon Thames' 'Richmond upon Thames' 'Haringey'
 'Hammersmith and Fulham' 'Southwark' 'Kensington and Chelsea'
 'Wandsworth' 'Hounslow']


# Map and clustering data

Remove the postcodes that are not in use and drop the attributes that are not useful for the analysis

In [8]:
df_london.columns

Index(['Postcode', 'In Use?', 'Latitude', 'Longitude', 'Easting', 'Northing',
       'Grid Ref', 'County', 'District', 'Ward', 'District Code', 'Ward Code',
       'Country', 'County Code', 'Constituency', 'Introduced', 'Terminated',
       'Parish', 'National Park', 'Population', 'Households', 'Built up area',
       'Built up sub-division', 'Lower layer super output area', 'Rural/urban',
       'Region', 'Altitude', 'London zone', 'LSOA Code', 'Local authority',
       'MSOA Code', 'Middle layer super output area', 'Parish Code',
       'Census output area', 'Constituency Code',
       'Index of Multiple Deprivation', 'Quality', 'User Type', 'Last updated',
       'Nearest station', 'Distance to station', 'Postcode area',
       'Postcode district', 'Police force'],
      dtype='object')

In [9]:
# Remove the postcodes that are not in use
df_london = df_london[df_london['In Use?'] == 'Yes']

In [10]:
# Before grouping the data, remove all the attributes that I will not use in my analysis
columns = {'District', 'District Code', 'Ward', 'Ward Code'}

# Keep only the selected columns, preserving their original order
df_london = df_london[[col for col in df_london.columns if col in columns]]


In [11]:
# Number of rows
df_london.shape

(177967, 4)

In [12]:
df_london.head(5)

Unnamed: 0,District,Ward,District Code,Ward Code
0,Bromley,Bromley Town,E09000006,E05000109
1,Bromley,Bromley Town,E09000006,E05000109
3,Bromley,Bromley Town,E09000006,E05000109
4,Bromley,Bromley Town,E09000006,E05000109
5,Bromley,Bromley Town,E09000006,E05000109


In [13]:
# Group the rows by district and district code, join the ward names into one string, then convert to a dataframe and reset the index
df_london = df_london.groupby(['District', 'District Code'])['Ward'].apply(lambda tags: ', '.join(tags)).to_frame().reset_index()

In [14]:
# Remove duplicates
# Note: the wards were joined with ', ' but are split on ',', so the names keep a leading space
df_london['Ward'] = df_london['Ward'].apply(lambda wards: tuple(np.unique(wards.split(','))))
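
The recorded output further down lists both ' Thames' and 'Thames' for the first district, which suggests the deduplication is affected by the leading space left after splitting on ','. The sketch below reproduces the effect on a made-up string and shows one possible fix (stripping each name before deduplicating); it is an illustration, not what the notebook actually ran.

```python
# Wards joined with ', ' (as in the groupby step) for a made-up district
joined = 'Thames, Abbey, Thames, Village'

# Splitting on ',' keeps the leading space, so 'Thames' and ' Thames' look different
raw_parts = sorted(set(joined.split(',')))

# Stripping each name before deduplicating gives one entry per ward
clean_parts = tuple(sorted({part.strip() for part in joined.split(',')}))

print(raw_parts)
print(clean_parts)
```

Applying the stripped version to the real dataframe would change the ward tuples shown in the outputs of this notebook.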

In [15]:
df_london

Unnamed: 0,District,District Code,Ward
0,Barking and Dagenham,E09000002,"( Abbey, Alibon, Becontree, Chadwell Heath,..."
1,Barnet,E09000003,"( Brunswick Park, Burnt Oak, Childs Hill, C..."
2,Bexley,E09000004,"( Barnehurst, Belvedere, Bexleyheath, Black..."
3,Brent,E09000005,"( Alperton, Barnhill, Brondesbury Park, Dol..."
4,Bromley,E09000006,"( Bickley, Biggin Hill, Bromley Common and K..."
5,Camden,E09000007,"( Belsize, Bloomsbury, Camden Town with Prim..."
6,City of London,E09000001,"( Aldersgate, Aldgate, Bassishaw, Billingsg..."
7,Croydon,E09000008,"( Addiscombe East, Addiscombe West, Bensham ..."
8,Ealing,E09000009,"( Acton Central, Cleveland, Dormers Wells, ..."
9,Enfield,E09000010,"( Bowes, Bush Hill Park, Chase, Cockfosters..."


Adding geospatial coordinates to every district

In [16]:
# Latitude and longitude retrieved through this website: https://www.latlong.net/
coordinates = { 'E09000002' : (51.532822, 0.108530),
'E09000003' : (51.625149, -0.152936),
'E09000004' : (51.441349, 0.148610),
'E09000005' : (51.575169, -0.234730),
'E09000006' : (51.524979, -0.028180),
'E09000007' : (51.539188, -0.142500),
'E09000001' : (51.513329, -0.088950),
'E09000008' : (51.376163, -0.098234),
'E09000009' : (51.525024, -0.341500),
'E09000010' : (51.652100, -0.081530),
'E09000011' : (51.478790, -0.010680),
'E09000012' : (51.545792, -0.055420),
'E09000013' : (51.477951, -0.199030),
'E09000014' : (51.585870, -0.104330),
'E09000015' : (51.588142, -0.342274),
'E09000016' : (51.615829, 0.183440),
'E09000017' : (51.533581, -0.452580),
'E09000018' : (51.482838, -0.388206),
'E09000019' : (51.534969, -0.103750),
'E09000020' : (51.488460, -0.173500),
'E09000021' :  (51.412320, -0.300440),
'E09000022' : (51.457150, -0.123068),
'E09000023' : (51.441460, -0.011701),
'E09000024' : (51.415670, -0.191810),
'E09000025' : (51.517540, 0.023530),
'E09000026' : (51.476010, -0.080770),
'E09000027' : (51.431122, -0.307070),
'E09000028' : (51.501720, -0.097960),
'E09000029' : (51.357372, -0.175281),
'E09000030' :  (51.550419, 0.016950),
'E09000031' : (51.590176, -0.017344),
'E09000032' : (51.457073, -0.181782),
'E09000033' : (51.510357, -0.116773)}

In [17]:
# Adding two new columns
df_london['Latitude'] = None
df_london['Longitude'] = None

In [18]:
# For each district code of the dataframe df_london add Latitude and Longitude (from the dictionary coordinates)
for i in range(len(df_london)):
    district_code = str(df_london.loc[i, 'District Code'])
    df_london.loc[i, 'Latitude'] = coordinates[district_code][0]
    df_london.loc[i, 'Longitude'] = coordinates[district_code][1]
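
The row-by-row loop can also be expressed as a single join. The sketch below is one way to do it, assuming a toy dataframe with the same 'District Code' column and only two entries copied from the coordinates dictionary; the names `coords`, `districts` and `coords_df` are introduced here for illustration.

```python
import pandas as pd

# Two entries copied from the coordinates dictionary above
coords = {
    'E09000002': (51.532822, 0.108530),
    'E09000003': (51.625149, -0.152936),
}

# Toy frame with the same 'District Code' column as df_london
districts = pd.DataFrame({
    'District': ['Barking and Dagenham', 'Barnet'],
    'District Code': ['E09000002', 'E09000003'],
})

# Turn the dictionary into a lookup table indexed by district code
coords_df = pd.DataFrame.from_dict(coords, orient='index', columns=['Latitude', 'Longitude'])

# Attach the coordinates by matching 'District Code' against the lookup index
districts = districts.join(coords_df, on='District Code')

print(districts)
```

A join like this yields float columns, and a district code missing from the dictionary would show up as NaN rather than raising a KeyError.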


In [19]:
# The dataframe with the respective latitudes and longitudes
df_london.head()

Unnamed: 0,District,District Code,Ward,Latitude,Longitude
0,Barking and Dagenham,E09000002,"( Abbey, Alibon, Becontree, Chadwell Heath,...",51.5328,0.10853
1,Barnet,E09000003,"( Brunswick Park, Burnt Oak, Childs Hill, C...",51.6251,-0.152936
2,Bexley,E09000004,"( Barnehurst, Belvedere, Bexleyheath, Black...",51.4413,0.14861
3,Brent,E09000005,"( Alperton, Barnhill, Brondesbury Park, Dol...",51.5752,-0.23473
4,Bromley,E09000006,"( Bickley, Biggin Hill, Bromley Common and K...",51.525,-0.02818


# Exploring the neighbourhoods of London

In [20]:
# Installing and importing folium library
!pip install folium==0.5.0
import folium

print("Library correctly imported!")

Library correctly imported!


In [21]:
# Lat and Long of London
latitude = 51.509865
longitude = -0.118092

In [22]:
# Create a map of London using the latitude and longitude values
map_london = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_london['Latitude'], df_london['Longitude'], df_london['District'], 
                                           df_london['Ward']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)  
    
map_london


# Clustering (k = 5)

In [23]:
# Inserting my Foursquare credentials
CLIENT_ID = 'NZMJXNBNOLC2M5GCPLUZXFVDM14HFSE1ZJUCLFKC0UBB4ZDH' # your Foursquare ID
CLIENT_SECRET = '1RZD3FIPVIFA5Y5A2G5HIFLG34FYHWZFMRCXZHTGCHRNYVJH' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)



Your credentials:
CLIENT_ID: NZMJXNBNOLC2M5GCPLUZXFVDM14HFSE1ZJUCLFKC0UBB4ZDH
CLIENT_SECRET:1RZD3FIPVIFA5Y5A2G5HIFLG34FYHWZFMRCXZHTGCHRNYVJH
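
Hard-coding and printing the secret makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the credentials are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    # FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are hypothetical variable names
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

# Example with a plain dictionary standing in for the real environment
fake_env = {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
client_id, client_secret = load_foursquare_credentials(fake_env)
print('CLIENT_ID loaded:', bool(client_id))
```

Only a confirmation is printed, so the secret never ends up in the saved notebook output.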


Let's explore the first neighborhood in our dataframe

In [24]:
neighborhood_latitude = df_london.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_london.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_london.loc[0, 'Ward'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(np.unique(neighborhood_name), 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of [' Abbey' ' Alibon' ' Becontree' ' Chadwell Heath' ' Eastbrook'
 ' Eastbury' ' Gascoigne' ' Goresbrook' ' Heath' ' Longbridge'
 ' Mayesbrook' ' Parsloes' ' River' ' Thames' ' Valence' ' Village'
 ' Whalebone' 'Thames'] are 51.532822, 0.10853.


## Explore neighbourhoods

In [25]:
import requests

LIMIT = 100
radius = 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
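
The function above indexes straight into the JSON response, so a missing 'groups' entry or a venue without categories would raise an exception. The sketch below shows a more defensive way to read the same fields; `extract_venues` is a hypothetical helper, and the sample payload is hand-written to match only the keys used above, not real API output.

```python
def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        venues.append((venue.get('name'), location.get('lat'), location.get('lng'), category))
    return venues

# Hand-written payload shaped like the fields the function above reads (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 51.5, 'lng': 0.1},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Shop B', 'location': {'lat': 51.6, 'lng': 0.2},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'response': {}}))
```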



In [26]:
london_venues = getNearbyVenues(names=df_london['Ward'],
                                   latitudes=df_london['Latitude'],
                                   longitudes=df_london['Longitude']
                                  )

(' Abbey', ' Alibon', ' Becontree', ' Chadwell Heath', ' Eastbrook', ' Eastbury', ' Gascoigne', ' Goresbrook', ' Heath', ' Longbridge', ' Mayesbrook', ' Parsloes', ' River', ' Thames', ' Valence', ' Village', ' Whalebone', 'Thames')
(' Brunswick Park', ' Burnt Oak', ' Childs Hill', ' Colindale', ' Coppetts', ' East Barnet', ' East Finchley', ' Edgware', ' Finchley Church End', ' Garden Suburb', ' Golders Green', ' Hale', ' Hendon', ' High Barnet', ' Mill Hill', ' Oakleigh', ' Totteridge', ' Underhill', ' West Finchley', ' West Hendon', ' Woodhouse', 'High Barnet')
(' Barnehurst', ' Belvedere', ' Bexleyheath', ' Blackfen & Lamorbey', ' Blendon & Penhill', ' Crayford', ' Crook Log', ' East Wickham', ' Erith', ' Falconwood & Welling', ' Longlands', ' Northumberland Heath', ' Sidcup', ' Slade Green & Northend', " St Mary's & St James", ' Thamesmead East', ' West Heath', 'Crayford')
(' Alperton', ' Barnhill', ' Brondesbury Park', ' Dollis Hill', ' Dudden Hill', ' Fryent', ' Harlesden', ' Ke

In [27]:
london_venues.columns

Index(['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
       'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'],
      dtype='object')

In [28]:
#Size of DF
print(london_venues.shape)
london_venues.head()

# Let's check how many venues were returned for each neighborhood
london_venues.groupby('Neighborhood').count()

# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(london_venues['Venue Category'].unique())))

(1147, 7)
There are 207 unique categories.


## Analyze neighbourhood

In [29]:
# one hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
london_onehot['Neighborhood'] = london_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()



Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Windmill,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
# Group rows by neighbourhood
london_grouped = london_onehot.groupby('Neighborhood').mean().reset_index()
london_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Windmill,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"( Abbey, Alibon, Becontree, Chadwell Heath,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"( Abbey, Cannon Hill, Colliers Wood, Cricke...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,...,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0
2,"( Abbey Road, Bayswater, Bryanston and Dorse...",0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0
3,"( Abbey Wood, Blackheath Westcombe, Charlton...",0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
4,"( Abingdon, Brompton & Hans Town, Campden, ...",0.0,0.0,0.034483,0.0,0.0,0.017241,0.0,0.0,0.017241,...,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.017241,0.0
5,"( Acton Central, Cleveland, Dormers Wells, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"( Addiscombe East, Addiscombe West, Bensham ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"( Addison, Askew, Avonmore and Brook Green, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018868,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018868,0.037736,0.0,0.0
8,"( Aldborough, Barkingside, Bridge, Chadwell...",0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"( Aldersgate, Aldgate, Bassishaw, Billingsg...",0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.04,...,0.01,0.0,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.01
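
To make the two steps above easier to follow, here is a small sketch on made-up venues: one-hot encoding turns each category into a 0/1 column, and the per-neighbourhood mean of those columns is the share of venues in each category. The toy names and categories are assumptions for illustration only.

```python
import pandas as pd

# Made-up venues for two neighbourhoods
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Café', 'Park', 'Café', 'Café'],
})

# One indicator column per category, then the neighbourhood label back in front
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()

print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, which is why the values in the table above are fractions.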


Sorting venues in descending order

In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [32]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = london_grouped['Neighborhood']

for ind in np.arange(london_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"( Abbey, Alibon, Becontree, Chadwell Heath,...",Pharmacy,Convenience Store,Pub,Bus Line,Grocery Store,Auto Garage,Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop
1,"( Abbey, Cannon Hill, Colliers Wood, Cricke...",Café,Coffee Shop,Bus Stop,Grocery Store,Lebanese Restaurant,Bar,Sushi Restaurant,Pizza Place,Theater,Noodle House
2,"( Abbey Road, Bayswater, Bryanston and Dorse...",Theater,Coffee Shop,Dessert Shop,Hotel,Italian Restaurant,Bookstore,Pub,Burger Joint,Restaurant,History Museum
3,"( Abbey Wood, Blackheath Westcombe, Charlton...",Pub,Café,History Museum,Bar,Hotel,Gym / Fitness Center,Bakery,Garden,Gastropub,Market
4,"( Abingdon, Brompton & Hans Town, Campden, ...",Italian Restaurant,Café,Bakery,Burger Joint,Plaza,American Restaurant,Grocery Store,Coffee Shop,Pizza Place,Juice Bar
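
The column names in the cell above rely on a three-element suffix list plus an exception for everything else, which works for the first ten positions. If more positions were ever needed, a small helper like the hypothetical `ordinal` below would also handle cases such as 11th or 21st.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Column names for the top venues, built the same way as in the cell above
labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(labels[:3], labels[-1])
```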


### Which are the 3 best businesses to open in London?

In [33]:
from collections import Counter

c = Counter(neighborhoods_venues_sorted['1st Most Common Venue'])
c.most_common()

[('Coffee Shop', 7), ('Café', 5), ('Pub', 5), ('Bus Stop', 2), ('Pizza Place', 2), ('Pharmacy', 1), ('Theater', 1), ('Italian Restaurant', 1), ('Home Service', 1), ('Kebab Restaurant', 1), ('Fast Food Restaurant', 1), ('Bakery', 1), ('Breakfast Spot', 1), ('African Restaurant', 1), ('Park', 1), ('Convenience Store', 1), ('Moving Target', 1)]

## Cluster neighbourhood

In [34]:
# set number of clusters
kclusters = 5

london_grouped_clustering = london_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 3, 3, 0, 3, 3, 0, 3], dtype=int32)
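
The number of clusters is fixed by hand in the cell above. A common way to sanity-check such a choice is to compare the inertia (within-cluster sum of squares) for several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic points generated with `make_blobs`, not on the London data, so the numbers say nothing about the right k for this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points in three well-separated groups (not the London data)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Fit K-Means for several values of k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On this toy data the inertia falls steeply up to k = 3 and flattens afterwards; the same loop could be run on `london_grouped_clustering` to compare candidate values.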

In [35]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = df_london

# join the sorted venues (with their cluster labels) to df_london on the ward tuple, adding the cluster and top venues to each district
london_merged = london_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Ward')

london_merged.head() # check the last columns!


Unnamed: 0,District,District Code,Ward,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barking and Dagenham,E09000002,"( Abbey, Alibon, Becontree, Chadwell Heath,...",51.5328,0.10853,3,Pharmacy,Convenience Store,Pub,Bus Line,Grocery Store,Auto Garage,Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop
1,Barnet,E09000003,"( Brunswick Park, Burnt Oak, Childs Hill, C...",51.6251,-0.152936,4,Café,Bus Stop,Yoga Studio,Convenience Store,Food,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Falafel Restaurant
2,Bexley,E09000004,"( Barnehurst, Belvedere, Bexleyheath, Black...",51.4413,0.14861,3,Fast Food Restaurant,Pub,Breakfast Spot,Train Station,Greek Restaurant,Toy / Game Store,Indian Restaurant,Italian Restaurant,Chinese Restaurant,English Restaurant
3,Brent,E09000005,"( Alperton, Barnhill, Brondesbury Park, Dol...",51.5752,-0.23473,3,Kebab Restaurant,Fried Chicken Joint,Pub,Bus Stop,Distillery,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Falafel Restaurant,Event Space
4,Bromley,E09000006,"( Bickley, Biggin Hill, Bromley Common and K...",51.525,-0.02818,3,Bus Stop,Coffee Shop,Pub,Burger Joint,Bar,Pizza Place,Metro Station,Middle Eastern Restaurant,Breakfast Spot,Grocery Store


# Visualize resulting clusters

In [36]:


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'],
                                  london_merged['Ward'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(np.nan_to_num(cluster))-1],
        fill=True,
        fill_color=rainbow[int(np.nan_to_num(cluster))-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

