# London Boroughs

## 1. Introduction

In this notebook we explore London neighborhoods, using crime data together with data science and machine learning techniques to shortlist places to live.

## 2. Business Problem

When you want to move to London, the choice can be overwhelming: there are 32 boroughs with hundreds of neighborhoods to choose from. One criterion for deciding on a location is to avoid areas with high crime. Once you have selected a low-crime area, you can examine which venues and facilities each neighborhood offers and how well they complement your lifestyle. At that point the analysis has narrowed your options from hundreds to fewer than 20!

In this notebook we use data science, together with Foursquare venue details, to help with this daunting task.

## 3. Data

For this project we will scrape the list of London boroughs from Wikipedia https://en.wikipedia.org/wiki/List_of_London_boroughs and borough-level homicide data from https://en.wikipedia.org/wiki/Crime_in_London in order to rank boroughs by homicides. We will also download a comprehensive list of recorded crimes per borough from https://data.london.gov.uk/dataset/recorded_crime_summary to check the mean monthly crime counts per borough.

We will then decide which area to focus on and get a list of London neighborhoods from https://en.wikipedia.org/wiki/List_of_areas_of_London and link their geographical coordinates from https://geohack.toolforge.org

For the selected borough we will use the Foursquare API to get venues in each neighborhood and cluster the neighborhoods using the k-means algorithm from the scikit-learn Python library. All map visualizations use folium.

## 4. Methodology

### Data collection and processing

First we import all the Python modules we need.

In [1]:
import pandas as pd
import numpy as np
import requests 
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium 
print('Libraries imported successfully!')

Libraries imported successfully!


We create a 'soup' of the London Boroughs page

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_London_boroughs"
res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')

borough_list = []
#print(soup)

We then iterate through the table rows and extract the columns we want.

In [3]:
for items in soup.find('table', class_= 'wikitable sortable').find_all('tr')[1::]:
    table_data = items.find_all(['td'])
    boroughs_data = table_data[0] # first table column: borough name
    coords_data = table_data[8]    # ninth table column (index 8): coordinates
    try:
        borough_name = boroughs_data.get_text()
        borough_name = borough_name.split('[')
        borough_name = borough_name[0]
        borough_name = borough_name.strip()
        
        coords = coords_data.get_text()
        coords = coords.split('/')
        lat_long = coords[2]
        lat_long = lat_long.split('(')
        lat_long = lat_long[0]
        lat_long = lat_long.split(';')
        lat = lat_long[0]
        lat = lat.strip()
        long = lat_long[1]
        long = long.strip()
        long = long.replace(u'\ufeff', '')
        lat = float(lat)
        long = float(long)

#       Append the borough name, latitude and longitude to the list
        borough_list.append((borough_name, lat, long))
    except IndexError: pass
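
The string handling inside the loop can also be read as one small function. The sketch below is not part of the original notebook: `parse_coords` is a hypothetical helper, and the sample string is an assumed example of how the coordinates cell is laid out (degrees-minutes-seconds, then decimal degrees, then a `lat; long` pair).

```python
def parse_coords(cell_text):
    """Return (lat, long) as floats from an assumed 'DMS / degrees / lat; long' cell."""
    decimal_part = cell_text.split('/')[2]       # third slash-separated chunk
    decimal_part = decimal_part.split('(')[0]    # drop any trailing "(...)" note
    lat_text, long_text = decimal_part.split(';')
    long_text = long_text.replace('\ufeff', '')  # remove byte-order-mark debris
    return float(lat_text.strip()), float(long_text.strip())


# Assumed sample of the cell text, for illustration only.
sample = "51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E / 51.5607; 0.1557\ufeff"
print(parse_coords(sample))
```

Keeping the parsing in one function makes it easier to try against a single cell before running it over the whole table.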

We end up with borough_list, which includes the geographical coordinates, and create a pandas dataframe from it.

In [4]:
borough_list

[('Barking and Dagenham', 51.5607, 0.1557),
 ('Barnet', 51.6252, -0.1517),
 ('Bexley', 51.4549, 0.1505),
 ('Brent', 51.5588, -0.2817),
 ('Bromley', 51.4039, 0.0198),
 ('Camden', 51.529, -0.1255),
 ('Croydon', 51.3714, -0.0977),
 ('Ealing', 51.513, -0.3089),
 ('Enfield', 51.6538, -0.0799),
 ('Greenwich', 51.4892, 0.0648),
 ('Hackney', 51.545, -0.0553),
 ('Hammersmith and Fulham', 51.4927, -0.2339),
 ('Haringey', 51.6, -0.1119),
 ('Harrow', 51.5898, -0.3346),
 ('Havering', 51.5812, 0.1837),
 ('Hillingdon', 51.5441, -0.476),
 ('Hounslow', 51.4746, -0.368),
 ('Islington', 51.5416, -0.1022),
 ('Kensington and Chelsea', 51.502, -0.1947),
 ('Kingston upon Thames', 51.4085, -0.3064),
 ('Lambeth', 51.4607, -0.1163),
 ('Lewisham', 51.4452, -0.0209),
 ('Merton', 51.4014, -0.1958),
 ('Newham', 51.5077, 0.0469),
 ('Redbridge', 51.559, 0.0741),
 ('Richmond upon Thames', 51.4479, -0.326),
 ('Southwark', 51.5035, -0.0804),
 ('Sutton', 51.3618, -0.1945),
 ('Tower Hamlets', 51.5099, -0.0059),
 ('Waltham

In [5]:
boroughs_df = pd.DataFrame(borough_list, columns=['Borough','Latitude','Longitude'])
boroughs_df

Unnamed: 0,Borough,Latitude,Longitude
0,Barking and Dagenham,51.5607,0.1557
1,Barnet,51.6252,-0.1517
2,Bexley,51.4549,0.1505
3,Brent,51.5588,-0.2817
4,Bromley,51.4039,0.0198
5,Camden,51.529,-0.1255
6,Croydon,51.3714,-0.0977
7,Ealing,51.513,-0.3089
8,Enfield,51.6538,-0.0799
9,Greenwich,51.4892,0.0648


In [6]:
print(boroughs_df.dtypes)

Borough       object
Latitude     float64
Longitude    float64
dtype: object


Let's get London's coordinates using geopy

In [7]:
address = 'London, UK'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


Let's visualize the London boroughs using folium.

In [8]:
# create map of London using latitude and longitude values
map_lon = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough in zip(boroughs_df['Latitude'], boroughs_df['Longitude'], boroughs_df['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_lon)  
    
map_lon

Let's get data on London crime. This will help us select a borough to live in.

In [9]:
CRIME = 'https://en.wikipedia.org/wiki/Crime_in_London' # scrape homicides per borough
homicide_tables = pd.read_html(CRIME)
print(len(homicide_tables))

9


In [10]:
homicides_df = pd.DataFrame(homicide_tables[3])

Sorting the homicides_df dataframe in ascending order gives us the boroughs with the fewest homicides. Richmond upon Thames leads this list.

In [11]:
homicides_df.sort_values(by='Number of homicides 2000 to 2012').head()

Unnamed: 0,Rank,Borough,Number of homicides 2000 to 2012
31,32,Richmond upon Thames,14
30,31,Kingston upon Thames,17
29,30,Kensington & Chelsea,23
28,29,Harrow,24
27,28,Sutton,25


We complement the homicide figures with the latest 24 months of crime data from https://data.london.gov.uk/dataset/recorded_crime_summary and analyze total crimes per borough.

In [12]:
# I have downloaded the crime data to a CSV file stored locally
crime_df = pd.read_csv('MPS_Borough_Level_Crime_(most_recent_24_months).csv')
print(crime_df.shape)
crime_df.rename(columns={'LookUp_BoroughName': 'Borough', 'MajorText': 'Crime'}, inplace=True)
crime_df.head()

(1556, 27)


Unnamed: 0,Crime,MinorText,Borough,201901,201902,201903,201904,201905,201906,201907,...,202003,202004,202005,202006,202007,202008,202009,202010,202011,202012
0,Arson and Criminal Damage,Arson,Barking and Dagenham,5,2,5,5,11,3,5,...,6,2,2,4,4,6,2,7,4,2
1,Arson and Criminal Damage,Criminal Damage,Barking and Dagenham,97,127,138,130,140,113,134,...,107,80,86,121,122,114,116,119,100,108
2,Burglary,Burglary - Business and Community,Barking and Dagenham,45,24,29,27,21,27,31,...,28,29,16,16,28,24,32,21,19,24
3,Burglary,Burglary - Residential,Barking and Dagenham,114,107,99,96,114,96,71,...,97,57,42,63,72,63,54,68,90,91
4,Drug Offences,Drug Trafficking,Barking and Dagenham,6,2,6,5,9,6,11,...,6,15,15,12,21,9,11,14,17,14


Let's compute the mean of the monthly counts in each row and name it 24month.

In [13]:
crime_list = []
for i in range(crime_df.shape[0]):
    
    crime_list.append(round(crime_df.iloc[i,3:].mean(),1))
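
As an alternative to the explicit loop, pandas can take the row-wise mean in one step with `mean(axis=1)`. The sketch below uses a small made-up table (`toy` and its values are assumptions for illustration) with the same layout of three label columns followed by monthly counts.

```python
import pandas as pd

# Made-up data: three label columns, then one column per month.
toy = pd.DataFrame({
    'Crime': ['Burglary', 'Arson'],
    'MinorText': ['Residential', 'Arson'],
    'Borough': ['A', 'B'],
    '201901': [10, 2],
    '201902': [20, 4],
    '201903': [30, 3],
})

# Mean across the month columns of each row, rounded to one decimal place.
row_means = toy.iloc[:, 3:].mean(axis=1).round(1)
print(row_means.tolist())
```

The resulting Series is aligned with the dataframe's index, so it could be assigned directly as a new column.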

In [14]:
crime_df['24month'] = crime_list
# view only 24month mean crime events 
crime_df = crime_df.loc[:,['Crime', 'Borough', '24month']].copy()
crime_df.head()

Unnamed: 0,Crime,Borough,24month
0,Arson and Criminal Damage,Barking and Dagenham,4.8
1,Arson and Criminal Damage,Barking and Dagenham,112.8
2,Burglary,Barking and Dagenham,27.0
3,Burglary,Barking and Dagenham,88.4
4,Drug Offences,Barking and Dagenham,10.0


We can now sum the mean monthly crime counts per borough and sort them in ascending order.

In [46]:
crime_df.groupby('Borough')[['24month']].sum().sort_values(by='24month', ascending=True).head()

Borough,24month
London Heathrow and London City Airports,214.7
Kingston upon Thames,1023.9
Richmond upon Thames,1049.0
Sutton,1123.7
Merton,1167.9
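
The aggregation step can be tried on a few rows first. This is a minimal sketch with made-up numbers (`toy` is an assumption, not the real crime file); it selects the numeric column explicitly, since how a text column such as `Crime` is treated by `sum()` can differ between pandas versions.

```python
import pandas as pd

# Made-up rows in the same shape as crime_df after the 24month column is added.
toy = pd.DataFrame({
    'Crime': ['Burglary', 'Theft', 'Burglary', 'Theft'],
    'Borough': ['Sutton', 'Sutton', 'Merton', 'Merton'],
    '24month': [10.5, 20.0, 5.0, 30.0],
})

# Total of the monthly means per borough, lowest first.
totals = toy.groupby('Borough')['24month'].sum().sort_values()
print(totals)
```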


Richmond upon Thames looks like a reasonable choice: it has the fewest homicides and, leaving aside the airports entry, the second-lowest total crime count.

We select the borough of Richmond upon Thames to live in, but which area? We continue by gathering the information we need from Wikipedia.

In [17]:
# scrape List of London Areas
URL = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')

codes = []
areas_list = []
href_links_list = []
for items in soup.find('table', class_= 'wikitable sortable').find_all('tr')[1::]:
    data = items.find_all(['td'])
    data0 = data[0]
    area_name = data0.text

    data1 = data[1]
    data1 = data1.text
    borough = data1.split('[')
    borough_name = borough[0]
    data5 = data[5]
    code = data5.text
    code = code.strip()
    
    if borough_name == 'Richmond upon Thames':
        codes.append(code)
        areas_list.append((borough_name,area_name,code))

                
# collect the GeoHack link for every location code gathered above
for link in soup.find_all('a', attrs={'href': re.compile("^https://geohack.toolforge.org")}):
    htext = link.text
    if htext in codes:
        hlink = link.get('href')
        href_links_list.append((htext, hlink))

After scraping the data we need, we get a list of neighborhoods in Richmond upon Thames and put them in a dataframe, areas_df.

In [18]:
areas_df = pd.DataFrame(areas_list, columns=['Borough', 'Neighborhood', 'Code'])

We now have a list of neighborhoods in Richmond upon Thames

In [19]:
areas_df

Unnamed: 0,Borough,Neighborhood,Code
0,Richmond upon Thames,Barnes,TQ225765
1,Richmond upon Thames,Castelnau,TQ226776
2,Richmond upon Thames,East Sheen,TQ205755
3,Richmond upon Thames,Eel Pie Island,TQ164731
4,Richmond upon Thames,Fulwell,TQ149719
5,Richmond upon Thames,Ham,TQ175725
6,Richmond upon Thames,Hampton,TQ135705
7,Richmond upon Thames,Hampton Hill,TQ144710
8,Richmond upon Thames,Hampton Wick,TQ176695
9,Richmond upon Thames,Kew,TQ195775


We then translate location codes to coordinates

In [20]:
links_df = pd.DataFrame(href_links_list, columns=['Code','href'])

In [21]:
links_df

Unnamed: 0,Code,href
0,TQ225765,https://geohack.toolforge.org/geohack.php?page...
1,TQ226776,https://geohack.toolforge.org/geohack.php?page...
2,TQ205755,https://geohack.toolforge.org/geohack.php?page...
3,TQ164731,https://geohack.toolforge.org/geohack.php?page...
4,TQ149719,https://geohack.toolforge.org/geohack.php?page...
5,TQ175725,https://geohack.toolforge.org/geohack.php?page...
6,TQ135705,https://geohack.toolforge.org/geohack.php?page...
7,TQ144710,https://geohack.toolforge.org/geohack.php?page...
8,TQ176695,https://geohack.toolforge.org/geohack.php?page...
9,TQ195775,https://geohack.toolforge.org/geohack.php?page...


In [22]:
# create a list to hold coordinates
area_coords = []
# iterate through links_df hrefs and get latitude and longitude values for each row
for row in links_df.itertuples():
    url = row.href
    code = row.Code
    res = requests.get(url).text
    soup1 = BeautifulSoup(res,'lxml')
    
    # read the decimal latitude and longitude from the GeoHack page;
    # note that these names replace the London coordinates set earlier
    latitude = float(soup1.find('span', {'class': 'latitude'}).get_text())
    longitude = float(soup1.find('span', {'class': 'longitude'}).get_text())
        
    area_coords.append((code, latitude, longitude))

print(area_coords)

[('TQ225765', 51.474209, -0.237571), ('TQ226776', 51.484074, -0.23575), ('TQ205755', 51.465651, -0.266694), ('TQ164731', 51.444938, -0.326478), ('TQ149719', 51.434458, -0.348442), ('TQ175725', 51.439318, -0.310856), ('TQ135705', 51.422157, -0.369021), ('TQ144710', 51.42647, -0.355922), ('TQ176695', 51.412335, -0.310413), ('TQ195775', 51.483838, -0.280407), ('TQ205755', 51.465651, -0.266694), ('TQ195765', 51.47485, -0.280745), ('TQ175735', 51.448306, -0.310525), ('TQ185745', 51.457085, -0.295807), ('TQ168742', 51.454742, -0.320362), ('TQ155725', 51.439729, -0.339618), ('TQ159708', 51.424368, -0.334422), ('TQ155735', 51.448717, -0.339292), ('TQ145735', 51.44892, -0.353676)]


We then create area_coords_df dataframe to store coordinates and merge it with the areas_df dataframe.

In [23]:
area_coords_df = pd.DataFrame(area_coords, columns=['Code','Latitude','Longitude'])
area_coords_df.head()

Unnamed: 0,Code,Latitude,Longitude
0,TQ225765,51.474209,-0.237571
1,TQ226776,51.484074,-0.23575
2,TQ205755,51.465651,-0.266694
3,TQ164731,51.444938,-0.326478
4,TQ149719,51.434458,-0.348442


In [24]:
areas_df = areas_df.merge(area_coords_df, how='inner', on='Code').drop_duplicates()
areas_df.reset_index(drop=True)  # result is not assigned, so areas_df keeps its original index
areas_df

Unnamed: 0,Borough,Neighborhood,Code,Latitude,Longitude
0,Richmond upon Thames,Barnes,TQ225765,51.474209,-0.237571
1,Richmond upon Thames,Castelnau,TQ226776,51.484074,-0.23575
2,Richmond upon Thames,East Sheen,TQ205755,51.465651,-0.266694
4,Richmond upon Thames,Mortlake,TQ205755,51.465651,-0.266694
6,Richmond upon Thames,Eel Pie Island,TQ164731,51.444938,-0.326478
7,Richmond upon Thames,Fulwell,TQ149719,51.434458,-0.348442
8,Richmond upon Thames,Ham,TQ175725,51.439318,-0.310856
9,Richmond upon Thames,Hampton,TQ135705,51.422157,-0.369021
10,Richmond upon Thames,Hampton Hill,TQ144710,51.42647,-0.355922
11,Richmond upon Thames,Hampton Wick,TQ176695,51.412335,-0.310413
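
The gaps in the index above come from the merge: East Sheen and Mortlake share one location code, and that code also appears twice in the scraped coordinates, so each of those rows is matched twice before `drop_duplicates` removes the copies. The sketch below reproduces this on a three-row subset taken from the tables above; the names `areas`, `coords`, `merged`, `deduped` and `tidy` are illustrative only.

```python
import pandas as pd

# Two neighborhoods share one location code.
areas = pd.DataFrame({
    'Neighborhood': ['East Sheen', 'Mortlake', 'Barnes'],
    'Code': ['TQ205755', 'TQ205755', 'TQ225765'],
})
# The shared code was scraped twice, so it is listed twice here as well.
coords = pd.DataFrame({
    'Code': ['TQ205755', 'TQ225765', 'TQ205755'],
    'Latitude': [51.465651, 51.474209, 51.465651],
    'Longitude': [-0.266694, -0.237571, -0.266694],
})

merged = areas.merge(coords, how='inner', on='Code')  # shared code matches 2 x 2 times
deduped = merged.drop_duplicates()                    # copies removed, index keeps gaps
tidy = deduped.reset_index(drop=True)                 # assign the result to renumber rows

print(len(merged), len(deduped), tidy.index.tolist())
```

Calling `drop_duplicates` on `coords` before merging would be another way to avoid the repeated rows.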


Let's visualize the neighborhoods in Richmond upon Thames using folium.

In [25]:
# create map of Richmond upon Thames; latitude and longitude now hold the last neighborhood scraped above
map_rich = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, area in zip(areas_df['Latitude'], areas_df['Longitude'], areas_df['Borough'], areas_df['Neighborhood']):
    label = '{}, {}'.format(area,borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rich)  
    
map_rich

### Cluster Richmond neighborhoods

Get venues from Foursquare API

In [26]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
RADIUS = 500 # define radius of search

In [27]:
# create function to get venues for neighborhoods process
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
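
To see what the list comprehension in `getNearbyVenues` extracts without calling the API, the sketch below flattens a hand-written payload. Everything in it is an assumption for illustration: `sample_items` is made up and mirrors only the keys the function reads, and `flatten_items` is a hypothetical helper. It also guards against a venue with an empty `categories` list, a case in which indexing `[0]` would raise an `IndexError`.

```python
import pandas as pd

# Made-up items shaped like the entries the function iterates over.
sample_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 51.4750, 'lng': -0.2400},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Example Green',
               'location': {'lat': 51.4760, 'lng': -0.2350},
               'categories': []}},
]


def flatten_items(name, lat, lng, items):
    """Turn venue items into flat tuples, using None when no category is given."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


rows = flatten_items('Barnes', 51.474209, -0.237571, sample_items)
venues = pd.DataFrame(rows, columns=['Neighborhood', 'Neighborhood Latitude',
                                     'Neighborhood Longitude', 'Venue',
                                     'Venue Latitude', 'Venue Longitude',
                                     'Venue Category'])
print(venues)
```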

In [28]:
richmond_venues = getNearbyVenues(names=areas_df['Neighborhood'],
                                  latitudes=areas_df['Latitude'],
                                  longitudes=areas_df['Longitude'])

Barnes
Castelnau
East Sheen
Mortlake
Eel Pie Island
Fulwell
Ham
Hampton
Hampton Hill
Hampton Wick
Kew
North Sheen
Petersham
Richmond
St Margarets
Strawberry Hill
Teddington
Twickenham
Whitton


In [29]:
# Check the size of the resulting dataframe
print(richmond_venues.shape)
richmond_venues.head()

(357, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Barnes,51.474209,-0.237571,Olympic Studios Cafe + Dining Room,51.475158,-0.240333,Indie Movie Theater
1,Barnes,51.474209,-0.237571,ArteChef,51.474705,-0.241282,Pizza Place
2,Barnes,51.474209,-0.237571,Barn Elmes,51.475235,-0.235042,Park
3,Barnes,51.474209,-0.237571,The Red Lion,51.47552,-0.239,Pub
4,Barnes,51.474209,-0.237571,London Wetland Centre,51.476864,-0.235513,Nature Preserve


Let's have a look at the number of venues per neighborhood in descending order

In [30]:
richmond_venues.groupby('Neighborhood').count().sort_values(by='Venue', ascending=False)

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Hampton Wick,74,74,74,74,74,74
Eel Pie Island,47,47,47,47,47,47
Teddington,30,30,30,30,30,30
North Sheen,28,28,28,28,28,28
Whitton,21,21,21,21,21,21
East Sheen,20,20,20,20,20,20
Mortlake,20,20,20,20,20,20
Barnes,18,18,18,18,18,18
Kew,16,16,16,16,16,16
St Margarets,15,15,15,15,15,15


In [31]:
print('There are {} unique categories.'.format(len(richmond_venues['Venue Category'].unique())))

There are 108 unique categories.


Analyze each neighborhood

In [32]:
# one hot encoding
richmond_onehot = pd.get_dummies(richmond_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
richmond_onehot['Neighborhood'] = richmond_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [richmond_onehot.columns[-1]] + list(richmond_onehot.columns[:-1])
richmond_onehot = richmond_onehot[fixed_columns]
print(richmond_onehot.shape)
richmond_onehot.head()

(357, 109)


Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Australian Restaurant,Bakery,Bar,Beer Bar,Beer Garden,Beer Store,Boat or Ferry,...,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Track,Trail,Train Station,Vietnamese Restaurant,Wine Shop
0,Barnes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Barnes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Barnes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Barnes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Barnes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's create a table with the mean frequency of each venue category per neighborhood.

In [33]:
richmond_grouped = richmond_onehot.groupby('Neighborhood').mean().reset_index()
print(richmond_grouped.shape)
richmond_grouped.head()

(19, 109)


Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Australian Restaurant,Bakery,Bar,Beer Bar,Beer Garden,Beer Store,Boat or Ferry,...,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Track,Trail,Train Station,Vietnamese Restaurant,Wine Shop
0,Barnes,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0
1,Castelnau,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,East Sheen,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Eel Pie Island,0.0,0.0,0.0,0.021277,0.0,0.0,0.021277,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,0.0
4,Fulwell,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
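
The one-hot encoding followed by a grouped mean turns venue rows into category frequencies per neighborhood. A minimal sketch on made-up data (the `toy_venues` rows are assumptions) shows the effect: a neighborhood with two pubs among three venues gets a pub frequency of 2/3.

```python
import pandas as pd

# Made-up venue rows: neighborhood A has 3 venues, B has 1.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Park', 'Park'],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of the indicators is the share of venues in each category.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Because each row of frequencies sums to 1, neighborhoods with very different venue counts remain comparable.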


Let's print the top 5 venues per neighborhood.

In [34]:
num_top_venues = 5

for area in richmond_grouped['Neighborhood']:
    print("----"+area+"----")
    temp = richmond_grouped[richmond_grouped['Neighborhood'] == area].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Barnes----
               venue  freq
0               Park  0.11
1        Sports Club  0.06
2      Movie Theater  0.06
3         Restaurant  0.06
4  Food & Drink Shop  0.06


----Castelnau----
                venue  freq
0                Café  0.18
1   Indian Restaurant  0.09
2      Cosmetics Shop  0.09
3   French Restaurant  0.09
4  Chinese Restaurant  0.09


----East Sheen----
                 venue  freq
0          Coffee Shop  0.15
1                  Pub  0.10
2        Grocery Store  0.10
3          Pizza Place  0.10
4  American Restaurant  0.05


----Eel Pie Island----
                venue  freq
0         Coffee Shop  0.15
1                 Pub  0.15
2  Italian Restaurant  0.11
3   Indian Restaurant  0.04
4            Pharmacy  0.04


----Fulwell----
                venue  freq
0                Café  0.07
1  Chinese Restaurant  0.07
2         Golf Course  0.07
3                 Pub  0.07
4           Gastropub  0.07


----Ham----
               venue  freq
0        Bus Station

Next we put this information in a dataframe (neighborhoods_venues_sorted).

In [35]:
# function to get top venues per neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
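
The ranking logic of this function can be checked on a single hand-made row. In the sketch below, `top_categories` is a hypothetical stand-in that follows the same steps, and the row values are assumptions; the cast to float is an addition of this sketch so the sort works on numbers rather than generic objects.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    """Return the category names with the highest frequencies in one row."""
    frequencies = row.iloc[1:].astype(float)   # skip the Neighborhood label
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index.values[0:num_top_venues]


# Made-up row: first entry is the label, the rest are category frequencies.
toy_row = pd.Series({'Neighborhood': 'A', 'Park': 0.3, 'Pub': 0.5, 'Café': 0.2})
top_two = top_categories(toy_row, 2)
print(list(top_two))
```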

In [36]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = richmond_grouped['Neighborhood']

for ind in np.arange(richmond_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(richmond_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnes,Park,Sports Club,Movie Theater,Restaurant,Food & Drink Shop,Convenience Store,Community Center,Pizza Place,Coffee Shop,Pub
1,Castelnau,Café,Indian Restaurant,Cosmetics Shop,French Restaurant,Chinese Restaurant,Bus Stop,Grocery Store,Lake,Italian Restaurant,Bar
2,East Sheen,Coffee Shop,Pub,Grocery Store,Pizza Place,American Restaurant,Middle Eastern Restaurant,Park,Gelato Shop,Supermarket,Tapas Restaurant
3,Eel Pie Island,Coffee Shop,Pub,Italian Restaurant,Indian Restaurant,Pharmacy,Pizza Place,Grocery Store,Deli / Bodega,Farmers Market,Fast Food Restaurant
4,Fulwell,Café,Chinese Restaurant,Golf Course,Pub,Gastropub,Bus Station,Garden Center,Supermarket,Pizza Place,Convenience Store


### Create neighborhood clusters

We run k-means to cluster the neighborhoods into 5 clusters

In [37]:
# set number of clusters
kclusters = 5

richmond_grouped_clustering = richmond_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(richmond_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:]

array([1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 4, 1],
      dtype=int32)
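
Tallying the label array printed above shows how the 19 neighborhoods are distributed over the 5 clusters. The sketch copies the labels from that output and counts them with `np.bincount`; the variable names are illustrative.

```python
import numpy as np

# Cluster labels copied from the k-means output above.
labels = np.array([1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 4, 1])

# Position i of the result holds the number of neighborhoods with label i.
cluster_sizes = np.bincount(labels, minlength=5)
print(cluster_sizes)
```

One label holds 15 neighborhoods and each of the other four holds a single neighborhood, which is worth keeping in mind when interpreting the clusters.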

We create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

richmond_merged = areas_df

# join the cluster labels and top venues onto areas_df, which holds latitude/longitude for each neighborhood
richmond_merged = richmond_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

richmond_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Richmond upon Thames,Barnes,TQ225765,51.474209,-0.237571,1,Park,Sports Club,Movie Theater,Restaurant,Food & Drink Shop,Convenience Store,Community Center,Pizza Place,Coffee Shop,Pub
1,Richmond upon Thames,Castelnau,TQ226776,51.484074,-0.23575,1,Café,Indian Restaurant,Cosmetics Shop,French Restaurant,Chinese Restaurant,Bus Stop,Grocery Store,Lake,Italian Restaurant,Bar
2,Richmond upon Thames,East Sheen,TQ205755,51.465651,-0.266694,1,Coffee Shop,Pub,Grocery Store,Pizza Place,American Restaurant,Middle Eastern Restaurant,Park,Gelato Shop,Supermarket,Tapas Restaurant
4,Richmond upon Thames,Mortlake,TQ205755,51.465651,-0.266694,1,Coffee Shop,Pub,Grocery Store,Pizza Place,American Restaurant,Middle Eastern Restaurant,Park,Gelato Shop,Supermarket,Tapas Restaurant
6,Richmond upon Thames,Eel Pie Island,TQ164731,51.444938,-0.326478,1,Coffee Shop,Pub,Italian Restaurant,Indian Restaurant,Pharmacy,Pizza Place,Grocery Store,Deli / Bodega,Farmers Market,Fast Food Restaurant


Let's visualize clusters!

In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(richmond_merged['Latitude'], richmond_merged['Longitude'], richmond_merged['Neighborhood'], richmond_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Results: Examining clusters

Cluster 1 

In [40]:
richmond_merged.loc[richmond_merged['Cluster Labels'] == 0, richmond_merged.columns[[1] + list(range(5, richmond_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Petersham,0,Boat or Ferry,Café,Playground,Historic Site,Sports Club,Park,American Restaurant,Portuguese Restaurant,Pool,Plaza
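
The selection expression used for each cluster combines a row filter on the label with a positional choice of columns: position 1 (the neighborhood name) plus everything from position 5 onward (the label and the venue columns). The sketch below applies it to a cut-down, partly made-up frame (`toy_merged` keeps only one venue column) to show which columns survive.

```python
import pandas as pd

# Cut-down frame with the same column order as richmond_merged.
toy_merged = pd.DataFrame({
    'Borough': ['Richmond upon Thames', 'Richmond upon Thames'],
    'Neighborhood': ['Barnes', 'Petersham'],
    'Code': ['TQ225765', 'code-not-shown'],   # second code is a placeholder
    'Latitude': [51.474209, 0.0],             # placeholder coordinates for Petersham
    'Longitude': [-0.237571, 0.0],
    'Cluster Labels': [1, 0],
    '1st Most Common Venue': ['Park', 'Boat or Ferry'],
})

# Column 1 plus columns 5..end, for rows whose label is 0.
wanted = toy_merged.columns[[1] + list(range(5, toy_merged.shape[1]))]
subset = toy_merged.loc[toy_merged['Cluster Labels'] == 0, wanted]
print(subset)
```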


Cluster 2 with 15 neighborhoods

In [41]:
richmond_merged.loc[richmond_merged['Cluster Labels'] == 1, richmond_merged.columns[[1] + list(range(5, richmond_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnes,1,Park,Sports Club,Movie Theater,Restaurant,Food & Drink Shop,Convenience Store,Community Center,Pizza Place,Coffee Shop,Pub
1,Castelnau,1,Café,Indian Restaurant,Cosmetics Shop,French Restaurant,Chinese Restaurant,Bus Stop,Grocery Store,Lake,Italian Restaurant,Bar
2,East Sheen,1,Coffee Shop,Pub,Grocery Store,Pizza Place,American Restaurant,Middle Eastern Restaurant,Park,Gelato Shop,Supermarket,Tapas Restaurant
4,Mortlake,1,Coffee Shop,Pub,Grocery Store,Pizza Place,American Restaurant,Middle Eastern Restaurant,Park,Gelato Shop,Supermarket,Tapas Restaurant
6,Eel Pie Island,1,Coffee Shop,Pub,Italian Restaurant,Indian Restaurant,Pharmacy,Pizza Place,Grocery Store,Deli / Bodega,Farmers Market,Fast Food Restaurant
7,Fulwell,1,Café,Chinese Restaurant,Golf Course,Pub,Gastropub,Bus Station,Garden Center,Supermarket,Pizza Place,Convenience Store
9,Hampton,1,Park,Indian Restaurant,Convenience Store,Coffee Shop,Pedestrian Plaza,Portuguese Restaurant,Pool,Plaza,Playground,Platform
10,Hampton Hill,1,Wine Shop,Grocery Store,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Pet Café,Café,Pub,Italian Restaurant,Mexican Restaurant
11,Hampton Wick,1,Coffee Shop,Pub,Clothing Store,Café,Italian Restaurant,Department Store,Sandwich Place,Bookstore,Hotel,Sushi Restaurant
12,Kew,1,Pub,Park,Coffee Shop,Hotel,Brasserie,Pedestrian Plaza,History Museum,Restaurant,Trail,Playground


Cluster 3

In [42]:
richmond_merged.loc[richmond_merged['Cluster Labels'] == 2, richmond_merged.columns[[1] + list(range(5, richmond_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Richmond,2,Pub,Bakery,Gastropub,Kitchen Supply Store,Park,Portuguese Restaurant,Pool,Plaza,Playground,Platform


Cluster 4

In [43]:
richmond_merged.loc[richmond_merged['Cluster Labels'] == 3, richmond_merged.columns[[1] + list(range(5, richmond_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Ham,3,Bus Station,Trail,German Restaurant,Garden Center,Optical Shop,Pool,Plaza,Playground,Platform,Pizza Place


Cluster 5

In [44]:
richmond_merged.loc[richmond_merged['Cluster Labels'] == 4, richmond_merged.columns[[1] + list(range(5, richmond_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Twickenham,4,Stadium,Rugby Stadium,Gym / Fitness Center,Grocery Store,Supermarket,Nature Preserve,Plaza,Playground,Platform,Pizza Place


## 6. Discussion

From the above analysis we ended up with 5 clusters of neighborhoods in a London borough with a low crime rate, any of which we could choose to live in. Individual preferences and needs can be matched against each cluster's most common venues to make the final selection.

## 7. Conclusions

In this report we showcased how a seemingly overwhelming task, such as selecting a neighborhood in London, can be narrowed down with the help of Python. By gathering relevant data from the internet, such as areas of interest and crime figures, and then applying data science methods to clean, analyse and visualise it, we can reach a level of insight that supports informed decisions.