# Capstone Project - Battle of the Neighbourhoods (week 2)

### Applied Data Science Capstone by IBM/Coursera

## Table of Contents

1. Introduction: Business Problem
2. Data Acquisition and Cleaning 
3. Methodology
4. Analysis
5. Results 
6. Discussion
7. Conclusion

## Introduction: Business Problem

London is the capital of England and the United Kingdom.  It is not only one of the leading tourism destinations, but it also attracts many foreigners looking to work and start a new life there.  In 2019 there were approximately 6.2 million people with non-British nationality living in the United Kingdom and 9.5 million people who were born abroad. The UK's migrant population is concentrated in London.  Even given these statistics, moving to a new city that one knows nothing about is always daunting.

This assignment uses data science tools and methodologies to determine which neighbourhood in London would be the best place to open a new cafe.  The aim is to analyse the features of the different neighbourhoods and choose the best one among them.  Factors such as number of households, population and average income per neighbourhood will be reviewed.  A final factor is the distance to the nearest station, which matters to a newly immigrated person who does not yet have a method of transportation.

This information would be beneficial to any entrepreneur looking to open a new cafe in London. It would also be beneficial to anyone looking to immigrate to London, as the analysis covers which venues are in each neighbourhood and where the closest public transportation is.

## Data acquisition and Cleaning 

### Data acquisition

The data for this assignment comes from two sources.

#### 1. London data that contains the borough, neighbourhood, latitude and longitude for each postal code
- Data source: https://www.doogal.co.uk/london_postcodes.php 
- Description: This dataset contains all the information required to analyse the different neighbourhoods in London.  This data will be used in conjunction with the Foursquare API to decide on the best neighbourhood based on preferences and cafe options
- Comments: This dataset has to be cleaned before use, as it has 323306 rows and 49 columns and not all of the information is relevant.  The following columns will remain in the data:
    - _Postcode_
    - _Latitude_
    - _Longitude_
    - _County_
    - _District_ (the dataset's name for the borough; this column will be renamed)
    - _Ward_ (the dataset's name for the neighbourhood; this column will be renamed)
    - _Population_: This field is useful because the restaurateur wants a neighbourhood with enough people to support the cafe
    - _Households_: This field is at least as useful as the population field, since households indicate that there are adults in the population who can support the cafe
    - _Nearest station_
    - _Distance to station_: This field could be an incentive for a person who has just moved to a new city, as they are unlikely to have transport of their own
    - _Average Income_: This field helps to understand the economic demographic of the neighbourhood in which the new business would open
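
The column selection listed above can also be applied at the moment the file is read, so that the unused columns are never loaded. A minimal sketch, assuming the CSV headers are spelled as in the list; the inline two-row sample stands in for the real download, and `In Use?` and `Altitude` represent the columns that are not needed:

```python
import io
import pandas as pd

# Columns kept for the analysis, spelled as in the CSV header.
KEEP = [
    "Postcode", "Latitude", "Longitude", "County", "District", "Ward",
    "Population", "Households", "Nearest station", "Distance to station",
    "Average Income",
]

# Tiny stand-in for the real file (an assumption for illustration only).
sample_csv = io.StringIO(
    "Postcode,In Use?,Latitude,Longitude,County,District,Ward,Population,"
    "Households,Altitude,Nearest station,Distance to station,Average Income\n"
    "BR1 1AE,Yes,51.404543,0.014195,Greater London,Bromley,Bromley Town,"
    "34,21,71,Bromley North,0.462939,63100\n"
    "BR1 1BQ,Yes,51.408058,0.015874,Greater London,Bromley,Bromley Town,"
    "38,37,71,Bromley North,0.083058,63100\n"
)

# usecols loads only the listed columns; rename gives them the names used later.
df_small = pd.read_csv(sample_csv, usecols=KEEP).rename(
    columns={"District": "Borough", "Ward": "Neighbourhood"}
)
print(df_small.columns.tolist())
```

Reading only the needed columns keeps memory use down on a file of this size, at the cost of having to list the columns up front.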


#### 2. Venue Data using Foursquare 
- For each neighbourhood, the Foursquare API will be used to retrieve the venues, including cafes, available in that neighbourhood
- Using the explore function, the most common venue categories for each neighbourhood will be revealed  
- The k-means clustering algorithm will be used to group the neighbourhoods into clusters based on the venue categories returned by the explore function
- Finally, the Python Folium library will be used to visualise the neighbourhoods in London and their clusters

### Data cleaning

#### Before getting the data, the relevant libraries must be imported

In [1]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import zipfile

import json 

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim 

import requests 
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!pip install folium
import folium 

print('Libraries imported.')

Collecting folium
  Downloading folium-0.12.0-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 6.1 MB/s
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.0
Libraries imported.


#### The data was obtained from a website that lists all the postal codes in London together with their GPS coordinates, neighbourhood names and many other attributes.  It comes as a CSV file, which now has to be loaded:

In [3]:
url = "https://www.doogal.co.uk/UKPostcodesCSV.ashx?area=London"
df = pd.read_csv(url)

#### Looking quickly at the first five rows of the data to see that it has been imported correctly

In [5]:
df.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,District Code,Ward Code,Country,County Code,Constituency,Introduced,Terminated,Parish,National Park,Population,Households,Built up area,Built up sub-division,Lower layer super output area,Rural/urban,Region,Altitude,London zone,LSOA Code,Local authority,MSOA Code,Middle layer super output area,Parish Code,Census output area,Constituency Code,Index of Multiple Deprivation,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force,Water company,Plus Code,Average Income,Sewage Company,Travel To Work Area
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2016-05-01,,"Bromley, unparished area",,,,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,71,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.218257,BR,BR1,Metropolitan Police,Thames Water,9F32C228+J5,63100,,London
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2012-03-01,,"Bromley, unparished area",,,,Greater London,Bromley,Bromley 008B,Urban major conurbation,London,71,4,E01000676,,E02000134,Bromley North & Sundridge,E43000196,E00003255,E14000604,13716,1,0,2020-11-21,Bromley North,0.253666,BR,BR1,Metropolitan Police,Thames Water,9F32C248+G3,56100,,London
2,BR1 1AD,No,51.400057,0.016715,540386,168710,TQ403687,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2014-09-01,2017-09-01,"Bromley, unparished area",,,,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,53,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,1,2020-11-21,Bromley South,0.044559,BR,BR1,Metropolitan Police,,9F32C228+2M,63100,,London
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2008-08-01,,"Bromley, unparished area",,34.0,21.0,Greater London,Bromley,Bromley 018C,Urban major conurbation,London,71,4,E01000677,,E02000144,Bromley South,E43000196,E00003266,E14000604,20694,1,0,2020-11-21,Bromley North,0.462939,BR,BR1,Metropolitan Police,Thames Water,9F32C237+RM,63100,,London
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2015-05-01,,"Bromley, unparished area",,,,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,58,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.227664,BR,BR1,Metropolitan Police,Thames Water,9F32C227+HX,63100,,London


#### Looking at the shape of the dataset:

In [6]:
df.shape

(323306, 49)

#### The dataset is massive at over 300,000 rows of data.  Looking at the first five rows of the dataset, there are already some rows that can be excluded from the dataset.
##### 1. If there are no households recorded for a given postal code, the row is dropped from the dataset:

In [7]:
df.dropna(subset=["Households"], axis = 0, inplace = True)
df.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,District Code,Ward Code,Country,County Code,Constituency,Introduced,Terminated,Parish,National Park,Population,Households,Built up area,Built up sub-division,Lower layer super output area,Rural/urban,Region,Altitude,London zone,LSOA Code,Local authority,MSOA Code,Middle layer super output area,Parish Code,Census output area,Constituency Code,Index of Multiple Deprivation,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force,Water company,Plus Code,Average Income,Sewage Company,Travel To Work Area
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2008-08-01,,"Bromley, unparished area",,34.0,21.0,Greater London,Bromley,Bromley 018C,Urban major conurbation,London,71,4,E01000677,,E02000144,Bromley South,E43000196,E00003266,E14000604,20694,1,0,2020-11-21,Bromley North,0.462939,BR,BR1,Metropolitan Police,Thames Water,9F32C237+RM,63100,,London
13,BR1 1BQ,Yes,51.408058,0.015874,540303,169598,TQ403695,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2002-07-01,,"Bromley, unparished area",,38.0,37.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,71,4,E01000675,,E02000144,Bromley South,E43000196,E00003254,E14000604,24305,1,0,2020-11-21,Bromley North,0.083058,BR,BR1,Metropolitan Police,Thames Water,9F32C258+68,63100,,London
26,BR1 1DG,Yes,51.409191,0.010068,539896,169713,TQ398697,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,1.0,1.0,Greater London,Bromley,Bromley 008A,Urban major conurbation,London,72,4,E01000674,,E02000134,Bromley North & Sundridge,E43000196,E00003251,E14000604,21905,1,0,2020-11-21,Bromley North,0.489492,BR,BR1,Metropolitan Police,Thames Water,9F32C256+M2,56100,,London
39,BR1 1EA,Yes,51.400462,0.016716,540385,168755,TQ403687,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,4.0,4.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,56,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.067908,BR,BR1,Metropolitan Police,Thames Water,9F32C228+5M,63100,,London
41,BR1 1EG,Yes,51.401684,0.015705,540311,168889,TQ403688,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,14.0,6.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,63,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.219361,BR,BR1,Metropolitan Police,Thames Water,9F32C228+M7,63100,,London


#### Looking again at the shape of the dataset to see if it has been narrowed down:

In [8]:
df.shape

(140850, 49)

##### 2. If there is a blank value in Built up area, the row is dropped from the dataset:

In [9]:
df.dropna(subset=["Built up area"], axis = 0, inplace = True)
df.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,District Code,Ward Code,Country,County Code,Constituency,Introduced,Terminated,Parish,National Park,Population,Households,Built up area,Built up sub-division,Lower layer super output area,Rural/urban,Region,Altitude,London zone,LSOA Code,Local authority,MSOA Code,Middle layer super output area,Parish Code,Census output area,Constituency Code,Index of Multiple Deprivation,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force,Water company,Plus Code,Average Income,Sewage Company,Travel To Work Area
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2008-08-01,,"Bromley, unparished area",,34.0,21.0,Greater London,Bromley,Bromley 018C,Urban major conurbation,London,71,4,E01000677,,E02000144,Bromley South,E43000196,E00003266,E14000604,20694,1,0,2020-11-21,Bromley North,0.462939,BR,BR1,Metropolitan Police,Thames Water,9F32C237+RM,63100,,London
13,BR1 1BQ,Yes,51.408058,0.015874,540303,169598,TQ403695,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,2002-07-01,,"Bromley, unparished area",,38.0,37.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,71,4,E01000675,,E02000144,Bromley South,E43000196,E00003254,E14000604,24305,1,0,2020-11-21,Bromley North,0.083058,BR,BR1,Metropolitan Police,Thames Water,9F32C258+68,63100,,London
26,BR1 1DG,Yes,51.409191,0.010068,539896,169713,TQ398697,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,1.0,1.0,Greater London,Bromley,Bromley 008A,Urban major conurbation,London,72,4,E01000674,,E02000134,Bromley North & Sundridge,E43000196,E00003251,E14000604,21905,1,0,2020-11-21,Bromley North,0.489492,BR,BR1,Metropolitan Police,Thames Water,9F32C256+M2,56100,,London
39,BR1 1EA,Yes,51.400462,0.016716,540385,168755,TQ403687,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,4.0,4.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,56,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.067908,BR,BR1,Metropolitan Police,Thames Water,9F32C228+5M,63100,,London
41,BR1 1EG,Yes,51.401684,0.015705,540311,168889,TQ403688,Greater London,Bromley,Bromley Town,E09000006,E05000109,England,E11000009,Bromley and Chislehurst,1980-01-01,,"Bromley, unparished area",,14.0,6.0,Greater London,Bromley,Bromley 018B,Urban major conurbation,London,63,5,E01000675,,E02000144,Bromley South,E43000196,E00003264,E14000604,24305,1,0,2020-11-21,Bromley South,0.219361,BR,BR1,Metropolitan Police,Thames Water,9F32C228+M7,63100,,London


In [10]:
df.shape

(140361, 49)
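
The two filtering steps above can also be written as a single call that returns a new frame instead of modifying `df` in place. A sketch on a toy frame (the rows are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["A1", "A2", "A3", "A4"],
    "Households": [21.0, np.nan, 4.0, 6.0],
    "Built up area": ["Greater London", "Greater London", np.nan, "Greater London"],
})

# Drop a row if either column is missing; the original frame is left untouched.
cleaned = toy.dropna(subset=["Households", "Built up area"])
print(toy.shape, "->", cleaned.shape)
```

Avoiding `inplace=True` keeps the unfiltered frame available, which makes it easier to re-run a single cell without reloading the file.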

#### Many columns are not useful for this analysis, so they are dropped to make the dataset easier to work with.

In [11]:
df.drop(["In Use?","Easting", "Northing", "Grid Ref" , "District Code", "Ward Code", "Country", "County Code","Constituency" ,"Introduced", "Terminated", "Parish", "National Park", "Built up area", "Built up sub-division", "Lower layer super output area", "Rural/urban","Region", "Altitude", "London zone", "LSOA Code", "Local authority", "MSOA Code", "Middle layer super output area", "Parish Code", "Census output area", "Constituency Code", "Index of Multiple Deprivation","Quality", "User Type", "Last updated", "Postcode area", "Postcode district","Police force", "Water company", "Plus Code", "Sewage Company", "Travel To Work Area"], axis = 1, inplace = True)

In [12]:
df.head()

Unnamed: 0,Postcode,Latitude,Longitude,County,District,Ward,Population,Households,Nearest station,Distance to station,Average Income
3,BR1 1AE,51.404543,0.014195,Greater London,Bromley,Bromley Town,34.0,21.0,Bromley North,0.462939,63100
13,BR1 1BQ,51.408058,0.015874,Greater London,Bromley,Bromley Town,38.0,37.0,Bromley North,0.083058,63100
26,BR1 1DG,51.409191,0.010068,Greater London,Bromley,Bromley Town,1.0,1.0,Bromley North,0.489492,56100
39,BR1 1EA,51.400462,0.016716,Greater London,Bromley,Bromley Town,4.0,4.0,Bromley South,0.067908,63100
41,BR1 1EG,51.401684,0.015705,Greater London,Bromley,Bromley Town,14.0,6.0,Bromley South,0.219361,63100


In [13]:
df.shape

(140361, 11)

#### In this dataset, District corresponds to borough and Ward corresponds to neighbourhood, so the headings need to be changed accordingly.  Although Brent appears twice among the top five postal codes selected further down, the two entries have different nearest stations, distances to station and average incomes, so they have not been grouped; it is noted, however, that they are in close proximity to one another.

In [14]:
df.rename(columns={"District": "Borough", "Ward": "Neighbourhood"}, inplace= True)

In [15]:
df.head()

Unnamed: 0,Postcode,Latitude,Longitude,County,Borough,Neighbourhood,Population,Households,Nearest station,Distance to station,Average Income
3,BR1 1AE,51.404543,0.014195,Greater London,Bromley,Bromley Town,34.0,21.0,Bromley North,0.462939,63100
13,BR1 1BQ,51.408058,0.015874,Greater London,Bromley,Bromley Town,38.0,37.0,Bromley North,0.083058,63100
26,BR1 1DG,51.409191,0.010068,Greater London,Bromley,Bromley Town,1.0,1.0,Bromley North,0.489492,56100
39,BR1 1EA,51.400462,0.016716,Greater London,Bromley,Bromley Town,4.0,4.0,Bromley South,0.067908,63100
41,BR1 1EG,51.401684,0.015705,Greater London,Bromley,Bromley Town,14.0,6.0,Bromley South,0.219361,63100


#### To ensure that only postal codes with a high number of households are reviewed, the data is sorted from largest to smallest by number of households.  The top 5 postal codes are then placed in a new dataframe.

In [16]:
df.sort_values(["Households"], ascending= False, axis = 0, inplace = True)
df_top5 = df.head()
df_top5

Unnamed: 0,Postcode,Latitude,Longitude,County,Borough,Neighbourhood,Population,Households,Nearest station,Distance to station,Average Income
246925,SW8 2AU,51.485134,-0.127283,Greater London,Lambeth,Oval,613.0,402.0,Vauxhall,0.248118,57200
78705,HA1 3GX,51.574325,-0.315569,Greater London,Brent,Northwick Park,529.0,281.0,Northwick Park,0.49912,61900
91620,HA9 0AB,51.558298,-0.284928,Greater London,Brent,Tokyngton,560.0,258.0,Wembley Stadium,0.412166,52500
233039,SW1V 3RA,51.486132,-0.136287,Greater London,Westminster,Tachbrook,343.0,246.0,Pimlico,0.439171,51700
132977,N7 0EG,51.558271,-0.135016,Greater London,Islington,Junction,620.0,233.0,Tufnell Park,0.277188,60300
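
`DataFrame.nlargest` gives the same selection without re-sorting the whole frame. A sketch on six rows: the five postcodes from the table above plus one low-count postcode from the earlier preview, listed in shuffled order:

```python
import pandas as pd

rows = pd.DataFrame({
    "Postcode": ["BR1 1DG", "SW1V 3RA", "HA9 0AB", "SW8 2AU", "N7 0EG", "HA1 3GX"],
    "Households": [1.0, 246.0, 258.0, 402.0, 233.0, 281.0],
})

# Keep the five rows with the most households, largest first.
top5 = rows.nlargest(5, "Households")
print(top5)
```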


#### With the London data now in a more workable format, the exploratory data analysis can begin.

In [7]:
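# Run this cell after the rename in cell 14: before the rename the column is
# still called "District", which produces the KeyError recorded below.
df_test = df[["Borough", "Households", "Average Income"]]
df_group = df_test.groupby(["Borough", "Average Income"], as_index=False).mean()
df_group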

KeyError: "['Borough'] not in index"
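
Once the `Borough` column exists (that is, after the rename), a per-borough summary can be computed with a named aggregation. A sketch on the five postcodes from the top-5 table, so the numbers only illustrate the call and say nothing about London as a whole; the output column names `mean_households` and `mean_income` are chosen here for illustration:

```python
import pandas as pd

top5 = pd.DataFrame({
    "Borough": ["Lambeth", "Brent", "Brent", "Westminster", "Islington"],
    "Households": [402.0, 281.0, 258.0, 246.0, 233.0],
    "Average Income": [57200, 61900, 52500, 51700, 60300],
})

# One row per borough, averaging both numeric columns.
summary = (
    top5.groupby("Borough", as_index=False)
        .agg(mean_households=("Households", "mean"),
             mean_income=("Average Income", "mean"))
)
print(summary)
```

Grouping by `Borough` alone, rather than by `Borough` and `Average Income` together, gives one row per borough, which is usually what a borough-level comparison needs.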

## Methodology

#### Using the geopy library to get the Latitude and Longitude values of London

In [17]:
address = 'London, UK'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


#### Because there are too many data points to display clearly on a map, only the top 5 neighbourhoods are going to be used.

#### Now, using the Foursquare API, the neighbourhoods are going to be explored.

In [19]:
# The code was removed by Watson Studio for sharing.

Foursquare API successfully imported
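
The hidden cell is not reproduced here. The function defined next relies on four names, `CLIENT_ID`, `CLIENT_SECRET`, `VERSION` and `LIMIT`, so the cell presumably defines them. A hypothetical sketch that reads the secrets from environment variables rather than hard-coding them; the environment variable names and the `VERSION` and `LIMIT` values are assumptions, not the author's settings:

```python
import os

# Hypothetical environment variable names; fall back to empty strings if unset.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

VERSION = "20180605"  # assumed API version date in YYYYMMDD form
LIMIT = 100           # assumed maximum number of venues per request

if not (CLIENT_ID and CLIENT_SECRET):
    print("Foursquare credentials are not set")
else:
    print("Foursquare credentials loaded")
```

Keeping credentials outside the notebook means the cell can be shared without being stripped.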


In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
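
The list comprehension above assumes that the response always contains a `groups` entry and that every venue has at least one category. A more defensive parsing step can be sketched as follows; `parse_explore_items` is a hypothetical helper, and the sample payload is hand-written to mimic the nesting the function reads (`response`, then `groups`, then `items`, then `venue`), not a real API response:

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one explore-style payload into rows, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if not categories or "lat" not in location or "lng" not in location:
            continue  # skip venues that lack a category or coordinates
        rows.append((name, lat, lng, venue.get("name"),
                     location["lat"], location["lng"], categories[0]["name"]))
    return rows


# Hand-written stand-in payload (an assumption for illustration).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brunswick House Cafe",
               "location": {"lat": 51.484866, "lng": -0.126224},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 51.48, "lng": -0.12},
               "categories": []}},
]}]}}

rows = parse_explore_items("Oval", 51.485134, -0.127283, sample)
print(rows)
```

Separating the parsing from the HTTP request also makes it possible to check the parsing logic without calling the API.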

In [22]:
top5_venues = getNearbyVenues(names=df_top5['Neighbourhood'],
                                   latitudes=df_top5['Latitude'],
                                   longitudes=df_top5['Longitude']
                                  )

Oval
Northwick Park
Tokyngton
Tachbrook
Junction


In [23]:
print(top5_venues.shape)
top5_venues.head()

(187, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Oval,51.485134,-0.127283,Embody Wellness,51.485345,-0.125442,Yoga Studio
1,Oval,51.485134,-0.127283,LASSCO,51.484649,-0.126197,Antique Shop
2,Oval,51.485134,-0.127283,Brunswick House Cafe,51.484866,-0.126224,Café
3,Oval,51.485134,-0.127283,Whistle Punks Axe Throwing,51.483272,-0.124627,Athletics & Sports
4,Oval,51.485134,-0.127283,Pret A Manger,51.486244,-0.124837,Sandwich Place


In [24]:
top5_venues

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Oval,51.485134,-0.127283,Embody Wellness,51.485345,-0.125442,Yoga Studio
1,Oval,51.485134,-0.127283,LASSCO,51.484649,-0.126197,Antique Shop
2,Oval,51.485134,-0.127283,Brunswick House Cafe,51.484866,-0.126224,Café
3,Oval,51.485134,-0.127283,Whistle Punks Axe Throwing,51.483272,-0.124627,Athletics & Sports
4,Oval,51.485134,-0.127283,Pret A Manger,51.486244,-0.124837,Sandwich Place
5,Oval,51.485134,-0.127283,Riverwalk,51.483713,-0.131693,Trail
6,Oval,51.485134,-0.127283,New Covent Garden Market,51.483197,-0.129346,Flower Shop
7,Oval,51.485134,-0.127283,The Royal Vauxhall Tavern,51.485853,-0.121729,Gay Bar
8,Oval,51.485134,-0.127283,The Gym,51.485305,-0.126244,Gym
9,Oval,51.485134,-0.127283,Vauxwall Climbing Centre,51.484924,-0.12268,Climbing Gym


In [26]:
top5_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Junction,33,33,33,33,33,33
Northwick Park,6,6,6,6,6,6
Oval,43,43,43,43,43,43
Tachbrook,31,31,31,31,31,31
Tokyngton,74,74,74,74,74,74


In [27]:
print('There are {} unique categories.'.format(len(top5_venues['Venue Category'].unique())))

There are 82 unique categories.


In [28]:
# one hot encoding
top5_onehot = pd.get_dummies(top5_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
top5_onehot['Neighbourhood'] = top5_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [top5_onehot.columns[-1]] + list(top5_onehot.columns[:-1])
top5_onehot = top5_onehot[fixed_columns]

top5_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Asian Restaurant,Athletics & Sports,Bakery,Bar,Beer Bar,Bike Rental / Bike Share,Bistro,Bookstore,Bubble Tea Shop,Burger Joint,Bus Stop,Café,Canal Lock,Chocolate Shop,Climbing Gym,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Deli / Bodega,Electronics Store,English Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Food Court,Garden,Gastropub,Gay Bar,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Health & Beauty Service,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Kebab Restaurant,Latin American Restaurant,Lounge,Metro Station,Modern European Restaurant,Moroccan Restaurant,Multiplex,Music Venue,Outdoor Sculpture,Outlet Mall,Outlet Store,Park,Performing Arts Venue,Pier,Pizza Place,Plaza,Pub,Restaurant,Roof Deck,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Stadium,Supermarket,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Trail,Train Station,Turkish Restaurant,Warehouse Store,Wine Bar,Wine Shop,Yoga Studio
0,Oval,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,Oval,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Oval,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Oval,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Oval,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [29]:
top5_onehot.shape

(187, 83)

In [30]:
top5_grouped = top5_onehot.groupby('Neighbourhood').mean().reset_index()
top5_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Asian Restaurant,Athletics & Sports,Bakery,Bar,Beer Bar,Bike Rental / Bike Share,Bistro,Bookstore,Bubble Tea Shop,Burger Joint,Bus Stop,Café,Canal Lock,Chocolate Shop,Climbing Gym,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Deli / Bodega,Electronics Store,English Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Food Court,Garden,Gastropub,Gay Bar,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Health & Beauty Service,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Kebab Restaurant,Latin American Restaurant,Lounge,Metro Station,Modern European Restaurant,Moroccan Restaurant,Multiplex,Music Venue,Outdoor Sculpture,Outlet Mall,Outlet Store,Park,Performing Arts Venue,Pier,Pizza Place,Plaza,Pub,Restaurant,Roof Deck,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Stadium,Supermarket,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Trail,Train Station,Turkish Restaurant,Warehouse Store,Wine Bar,Wine Shop,Yoga Studio
0,Junction,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.090909,0.0,0.0,0.0,0.0,0.060606,0.030303,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.060606,0.0,0.030303,0.030303,0.090909,0.0,0.0,0.030303,0.0,0.030303,0.0,0.030303,0.0,0.0,0.060606,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.030303,0.0,0.0,0.0,0.030303,0.0,0.030303,0.030303,0.0
1,Northwick Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Oval,0.0,0.023256,0.0,0.023256,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.023256,0.0,0.093023,0.023256,0.0,0.023256,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,0.046512,0.0,0.046512,0.023256,0.046512,0.023256,0.069767,0.023256,0.0,0.0,0.023256,0.023256,0.0,0.0,0.023256,0.023256,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.069767,0.0,0.0,0.0,0.0,0.046512,0.046512,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.023256,0.0,0.023256,0.023256,0.0,0.0,0.023256,0.0,0.0,0.023256
3,Tachbrook,0.0,0.0,0.0,0.0,0.032258,0.032258,0.0,0.032258,0.0,0.0,0.0,0.0,0.064516,0.032258,0.0,0.0,0.0,0.0,0.064516,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.032258,0.0,0.0,0.0,0.032258,0.032258,0.032258,0.0,0.0,0.064516,0.0,0.0,0.064516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.064516,0.064516,0.032258,0.032258,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.032258,0.032258,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0
4,Tokyngton,0.027027,0.0,0.013514,0.0,0.013514,0.067568,0.0,0.0,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.0,0.013514,0.0,0.067568,0.081081,0.0,0.013514,0.0,0.013514,0.013514,0.0,0.013514,0.0,0.0,0.013514,0.0,0.0,0.0,0.040541,0.0,0.027027,0.0,0.0,0.0,0.054054,0.013514,0.013514,0.027027,0.027027,0.0,0.013514,0.0,0.013514,0.0,0.0,0.0,0.0,0.013514,0.013514,0.013514,0.013514,0.013514,0.0,0.0,0.0,0.013514,0.013514,0.013514,0.027027,0.013514,0.040541,0.013514,0.0,0.067568,0.013514,0.013514,0.013514,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.0,0.0


In [31]:
top5_grouped.shape

(5, 83)
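
Taking the mean of the one-hot columns per neighbourhood yields the share of that neighbourhood's venues that fall in each category, so each row of the grouped table sums to 1. A small sketch with made-up venue rows:

```python
import pandas as pd

# Made-up venues for two neighbourhoods (illustrative values only).
venues = pd.DataFrame({
    "Neighbourhood": ["Oval", "Oval", "Oval", "Oval", "Junction", "Junction"],
    "Venue Category": ["Café", "Gym", "Café", "Park", "Pub", "Café"],
})

# One indicator column per category, then the neighbourhood as first column.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of 0/1 indicators = fraction of venues in each category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Because the values are shares rather than counts, a neighbourhood with few venues, such as Northwick Park with six, can show large frequencies from a single venue.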

In [32]:
num_top_venues = 5

for hood in top5_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = top5_grouped[top5_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Junction----
                venue  freq
0  Italian Restaurant  0.09
1                 Pub  0.09
2                Café  0.09
3         Music Venue  0.06
4               Hotel  0.06


----Northwick Park----
           venue  freq
0    Coffee Shop  0.33
1  Grocery Store  0.17
2  Metro Station  0.17
3           Park  0.17
4      Bookstore  0.17


----Oval----
         venue  freq
0         Café  0.09
1          Gym  0.07
2         Park  0.07
3  Flower Shop  0.05
4       Garden  0.05


----Tachbrook----
               venue  freq
0  Indian Restaurant  0.06
1              Plaza  0.06
2              Hotel  0.06
3           Bus Stop  0.06
4        Coffee Shop  0.06


----Tokyngton----
                 venue  freq
0          Coffee Shop  0.08
1       Clothing Store  0.07
2  Sporting Goods Shop  0.07
3                  Bar  0.07
4                Hotel  0.05




In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third take the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = top5_grouped['Neighbourhood']

for ind in np.arange(top5_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(top5_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Junction,Italian Restaurant,Pub,Café,Hotel,Coffee Shop,Music Venue,Turkish Restaurant,Bus Stop,Modern European Restaurant,Ethiopian Restaurant
1,Northwick Park,Coffee Shop,Park,Grocery Store,Metro Station,Bookstore,Yoga Studio,Food Court,English Restaurant,Ethiopian Restaurant,Fast Food Restaurant
2,Oval,Café,Gym,Park,Garden,Flower Shop,Pub,Restaurant,Gay Bar,Hotel,Indian Restaurant
3,Tachbrook,Coffee Shop,Hotel,Plaza,Indian Restaurant,Pizza Place,Bus Stop,Performing Arts Venue,Restaurant,Pub,Pier
4,Tokyngton,Coffee Shop,Sporting Goods Shop,Bar,Clothing Store,Hotel,Sandwich Place,Grocery Store,Restaurant,American Restaurant,Burger Joint
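
The `indicators` list covers 1st to 3rd and falls back to `th` for everything else, which is correct up to 20 but would label position 21 as `21th`. If more columns were ever needed, a general suffix helper could look like this (`ordinal` is a hypothetical name, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Same column headers as in the notebook, built without try/except.
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], "...", columns[-1])
```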


In [37]:
# set number of clusters
kclusters = 5

top5_grouped_clustering = top5_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(top5_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 3, 4, 0], dtype=int32)
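
With five neighbourhoods and `kclusters = 5`, every neighbourhood ends up in a cluster of its own, as the label array shows. Grouping only emerges when the number of clusters is smaller than the number of rows. A sketch on a made-up frequency table with two clearly different profiles; the values and the choice of two clusters are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: neighbourhoods; columns: share of cafés, pubs, gyms (made-up values).
profiles = np.array([
    [0.80, 0.10, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.85, 0.10],
])

# Two clusters for four rows, so similar profiles can share a label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = kmeans.labels_
print(labels)
```

The numeric label assigned to each group is arbitrary; only which rows share a label carries meaning.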

In [40]:
# add clustering labels

neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

top5_merged = df_top5

top5_merged = top5_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

top5_merged.head() # check the last columns!

Unnamed: 0,Postcode,Latitude,Longitude,County,Borough,Neighbourhood,Population,Households,Nearest station,Distance to station,Average Income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
246925,SW8 2AU,51.485134,-0.127283,Greater London,Lambeth,Oval,613.0,402.0,Vauxhall,0.248118,57200,3,Café,Gym,Park,Garden,Flower Shop,Pub,Restaurant,Gay Bar,Hotel,Indian Restaurant
78705,HA1 3GX,51.574325,-0.315569,Greater London,Brent,Northwick Park,529.0,281.0,Northwick Park,0.49912,61900,1,Coffee Shop,Park,Grocery Store,Metro Station,Bookstore,Yoga Studio,Food Court,English Restaurant,Ethiopian Restaurant,Fast Food Restaurant
91620,HA9 0AB,51.558298,-0.284928,Greater London,Brent,Tokyngton,560.0,258.0,Wembley Stadium,0.412166,52500,0,Coffee Shop,Sporting Goods Shop,Bar,Clothing Store,Hotel,Sandwich Place,Grocery Store,Restaurant,American Restaurant,Burger Joint
233039,SW1V 3RA,51.486132,-0.136287,Greater London,Westminster,Tachbrook,343.0,246.0,Pimlico,0.439171,51700,4,Coffee Shop,Hotel,Plaza,Indian Restaurant,Pizza Place,Bus Stop,Performing Arts Venue,Restaurant,Pub,Pier
132977,N7 0EG,51.558271,-0.135016,Greater London,Islington,Junction,620.0,233.0,Tufnell Park,0.277188,60300,2,Italian Restaurant,Pub,Café,Hotel,Coffee Shop,Music Venue,Turkish Restaurant,Bus Stop,Modern European Restaurant,Ethiopian Restaurant


In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(top5_merged['Latitude'], top5_merged['Longitude'], top5_merged['Neighbourhood'], top5_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
df_test = df

## Analysis

## Results

## Discussion

## Conclusion