# Assignment Week 5 Capstone Data Science

### Business Problem

The client, Mr Roy, would like to start up an Indian Restaurant in one of the London Boroughs. He currently owns two successful restaurants in Birmingham but is unfamiliar with the London Boroughs and which one would be best suited to set up his third restaurant.

Mr Roy’s restaurant would open 7 days a week, in the evening only. Ideally, he would like passing trade from people enjoying an evening out. The restaurant would also provide a take-away service, with free delivery within a 3-mile radius.

Mr Roy would like to understand the make-up of the London boroughs and would like a recommendation on which ones best meet his criteria: areas with evening entertainment that also have a good residential population likely to use a take-away service.


### Data Description

The following wiki page will be used to obtain the London Boroughs, the geo data and populations.

https://en.wikipedia.org/wiki/List_of_London_boroughs

The following CSV document shows the business survival rates for each of the London Boroughs. This can be merged with the wiki data to give a further factor for later analysis. The 5-year survival rate will be used.

https://data.london.gov.uk/dataset/business-demographics-and-survival-rates-borough
 
Downloaded data: Business Survival Rate (CSV)

The Foursquare API will be used to retrieve venue categories for each London borough. Each borough's central coordinates will be used as the search centre. As the boroughs differ considerably in size, the search radius will be calculated as half the square root of the borough area (in square miles), converted to metres.
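The radius rule can be sketched as a small helper. This is a minimal illustration, assuming the area is given in square miles and using the same conversion factor and 0.5 multiplier that the notebook applies in a later cell; the name `search_radius_m` is hypothetical and not part of the notebook.

```python
import math

METRES_PER_MILE = 1609.34  # same conversion factor used later in the notebook


def search_radius_m(area_sq_miles):
    """Half the side of a square with the given area, converted to metres."""
    side_miles = math.sqrt(area_sq_miles)
    return side_miles * METRES_PER_MILE * 0.5


# Barking and Dagenham has an area of 13.93 square miles
print(round(search_radius_m(13.93)))  # 3003
```

Treating each borough as a square is a rough approximation, so the search circle will miss the corners of elongated boroughs and may spill into neighbouring ones.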

Using K-means, the venue categories for the different London boroughs will be clustered. The clusters can then be interpreted and summarised. The most suitable areas can then be presented to Mr. Roy.


#### Import libraries required for assignment

In [1]:
#Required to open URL
import requests

#Install BeautifulSoup - comment out once installed
!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

#Pandas library
import pandas as pd

#Install geocoder - comment out once installed
!conda install -c conda-forge geocoder --yes
import geocoder

#Install geopy - comment out once installed
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium # map rendering library

import json # library to handle JSON files

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import numpy as np # library to handle data in a vectorized manner

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



### First stage is to scrape the wiki page for the data and format data into a data frame.

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_London_boroughs').text

#### Parse the HTML of the URL into a Beautiful Soup tree

In [3]:
soup = BeautifulSoup(website_url)

In [4]:
#Find the title
soup.title

<title>List of London boroughs - Wikipedia</title>

#### Bring back the table using find

In [5]:
pcode_table=soup.find('table', class_='wikitable sortable')
#pcode_table

#### Each row in the table starts with "tr" and the data within each row is within "td".

Loop through the table and find the "tr" tags, then the "td" tags. If the row has 10 cells, store the borough name (first cell) in list A, the area in square miles (seventh cell) in list K and the coordinates (taken from the row's eighth "span" element) in list G.

In [6]:
A=[]
G=[]
K=[]


for row in pcode_table.findAll('tr'):
    cells=row.findAll('td')
    subcells=row.findAll('span')
    if len(cells)==10:
        A.append(cells[0].find(text=True))
        K.append(cells[6].find(text=True))
        G.append(subcells[7].find(text=True))
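The filter-by-cell-count idea used in the loop above can be tried out on a tiny table without any network access. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup, so it is an illustration of the technique and not the notebook's own code; the class name `RowCollector` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # strip() removes the trailing newlines seen in the scraped lists
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample: a header row, one data row and one short notes row
SAMPLE = """
<table>
<tr><th>Borough</th><th>Area</th></tr>
<tr><td>Barnet</td><td>33.49
</td></tr>
<tr><td>Notes row</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE)

# Keep only rows with the expected number of data cells (2 in this sample)
data_rows = [row for row in collector.rows if len(row) == 2]
print(data_rows)  # [['Barnet', '33.49']]
```

Checking the cell count is what excludes header rows (which use "th") and any footnote rows with a different layout.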
        

In [7]:
#Have a quick check of the data
print(A)

['Barking and Dagenham', 'Barnet', 'Bexley', 'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey', 'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington', 'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth', 'Lewisham', 'Merton', 'Newham', 'Redbridge', 'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets', 'Waltham Forest', 'Wandsworth', 'Westminster']


In [8]:
print(K)

['13.93\n', '33.49\n', '23.38\n', '16.70\n', '57.97\n', '8.40\n', '33.41\n', '21.44\n', '31.74\n', '18.28\n', '7.36\n', '6.33\n', '11.42\n', '19.49\n', '43.35\n', '44.67\n', '21.61\n', '5.74\n', '4.68\n', '14.38\n', '10.36\n', '13.57\n', '14.52\n', '13.98\n', '21.78\n', '22.17\n', '11.14\n', '16.93\n', '7.63\n', '14.99\n', '13.23\n', '8.29\n']


In [9]:
print(G)

['51.5607°N 0.1557°E', '51.6252°N 0.1517°W', '51.4549°N 0.1505°E', '51.5588°N 0.2817°W', '51.4039°N 0.0198°E', '51.5290°N 0.1255°W', '51.3714°N 0.0977°W', '51.5130°N 0.3089°W', '51.6538°N 0.0799°W', '51.4892°N 0.0648°E', '51.5450°N 0.0553°W', '51.4927°N 0.2339°W', '51.6000°N 0.1119°W', '51.5898°N 0.3346°W', '51.5812°N 0.1837°E', '51.5441°N 0.4760°W', '51.4746°N 0.3680°W', '51.5416°N 0.1022°W', '51.5020°N 0.1947°W', '51.4085°N 0.3064°W', '51.4607°N 0.1163°W', '51.4452°N 0.0209°W', '51.4014°N 0.1958°W', '51.5077°N 0.0469°E', '51.5590°N 0.0741°E', '51.4479°N 0.3260°W', '51.5035°N 0.0804°W', '51.3618°N 0.1945°W', '51.5099°N 0.0059°W', '51.5908°N 0.0134°W', '51.4567°N 0.1910°W', '51.4973°N 0.1372°W']


#### There appear to be some newlines in the columns. Use rstrip on each list to clean them.

In [10]:
#Strip trailing whitespace and newlines from every entry
A = [i.rstrip() for i in A]
K = [i.rstrip() for i in K]
G = [i.rstrip() for i in G]

#### Convert to dataframe and add columns headings

In [11]:
df=pd.DataFrame(A,columns=['Borough'])
df['Area']=K
df['Geo']=G

#### Check the top and bottom of the dataframe to ensure it's captured

In [12]:
df.head()

Unnamed: 0,Borough,Area,Geo
0,Barking and Dagenham,13.93,51.5607°N 0.1557°E
1,Barnet,33.49,51.6252°N 0.1517°W
2,Bexley,23.38,51.4549°N 0.1505°E
3,Brent,16.7,51.5588°N 0.2817°W
4,Bromley,57.97,51.4039°N 0.0198°E


##### Need to tidy the dataframe up

In [13]:
#Check the data types
df.dtypes

Borough    object
Area       object
Geo        object
dtype: object

In [14]:
#Area needs to be converted to float
df['Area'] = df['Area'].astype('float64')
df.dtypes

Borough     object
Area       float64
Geo         object
dtype: object

##### Now need to sort the Geo data into the lats and longs

In [15]:
#Split Geo column by space and assign to a new dataframe
dfgeo= df['Geo'].str.split(' ',expand=True)

In [16]:
dfgeo.head()

Unnamed: 0,0,1
0,51.5607°N,0.1557°E
1,51.6252°N,0.1517°W
2,51.4549°N,0.1505°E
3,51.5588°N,0.2817°W
4,51.4039°N,0.0198°E


In [17]:
#Remove the degrees north suffix
dfgeo[0] = dfgeo[0].str.replace('°N', '', regex=False)
#Convert to float
dfgeo[0] = pd.to_numeric(dfgeo[0],errors='coerce')
#Rename both columns
dfgeo.columns = ['Latitude','Longitude']

In [18]:
dfgeo.head()

Unnamed: 0,Latitude,Longitude
0,51.5607,0.1557°E
1,51.6252,0.1517°W
2,51.4549,0.1505°E
3,51.5588,0.2817°W
4,51.4039,0.0198°E


In [19]:
#Process the longitude, split off the direction and convert to float and integer
dlong = dfgeo['Longitude'].str.split('°',expand=True)
dlong.columns = ['Longitude','Direction']

In [20]:
dlong.dtypes

Longitude    object
Direction    object
dtype: object

In [21]:
dlong['Longitude'] = pd.to_numeric(dlong['Longitude'],errors='coerce')
dlong['Direction'] = dlong['Direction'].str.replace('E', '1')
dlong['Direction'] = dlong['Direction'].str.replace('W', '-1')
dlong['Direction'] = pd.to_numeric(dlong['Direction'],errors='coerce')
dlong.dtypes

Longitude    float64
Direction      int64
dtype: object

In [22]:
#Process direction
dlong['Longitude'] = dlong['Direction'] * dlong['Longitude']

In [23]:
dfgeo['Longitude'] = dlong['Longitude']

In [24]:
#Drop the Geo column and add on the longitude and Latitude columns
df.drop(['Geo'], axis=1, inplace=True)
df['Latitude'] = dfgeo['Latitude']
df['Longitude'] = dfgeo['Longitude']
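The latitude and longitude steps above can also be expressed as one helper that turns a coordinate string into signed floats. This is a sketch under the assumption that every value follows the `51.6252°N 0.1517°W` pattern seen in the scraped list; `parse_geo` is a hypothetical name, and the pattern would need adjusting if the page format differs.

```python
import re

# Matches strings such as '51.6252°N 0.1517°W'
_COORD = re.compile(r'^\s*([\d.]+)°([NS])\s+([\d.]+)°([EW])\s*$')


def parse_geo(text):
    """Convert a coordinate string to signed (latitude, longitude) floats."""
    match = _COORD.match(text)
    if match is None:
        raise ValueError('Unrecognised coordinate string: {!r}'.format(text))
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == 'N' else -1)    # south is negative
    longitude = float(lon) * (1 if ew == 'E' else -1)   # west is negative
    return latitude, longitude


print(parse_geo('51.6252°N 0.1517°W'))  # (51.6252, -0.1517)
```

Raising an error on an unexpected string makes a format change on the source page visible immediately, whereas `errors='coerce'` would silently produce NaN values.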

In [25]:
#Create a search radius column: half the square root of the area (sq miles), converted to metres.
df['Radius'] = df['Area']**(1/2)
df['Radius'] = (df['Radius']*1609.34) * 0.5

In [26]:
df.head()

Unnamed: 0,Borough,Area,Latitude,Longitude,Radius
0,Barking and Dagenham,13.93,51.5607,0.1557,3003.263018
1,Barnet,33.49,51.6252,-0.1517,4656.669159
2,Bexley,23.38,51.4549,0.1505,3890.810359
3,Brent,16.7,51.5588,-0.2817,3288.33493
4,Bromley,57.97,51.4039,0.0198,6126.599065


##### Read in the business survival rates as a CSV

In [27]:
dfbus=pd.read_csv('business_survival_rates.csv')

In [28]:
dfbus.head()

Unnamed: 0,code,area,year,births,1_year_survival_number,1_year_survival_rate,2_year_survival_number,2_year_survival_rate,3_year_survival_number,3_year_survival_rate,4_year_survival_number,4_year_survival_rate,5_year_survival_number,5_year_survival_rate
0,E09000001,City of London,2002,1145,1025,89.5,915,79.9,760,66.4,660,57.6,600,52.4
1,E09000002,Barking and Dagenham,2002,435,410,94.3,335,77.0,250,57.5,205,47.1,175,40.2
2,E09000003,Barnet,2002,2330,2185,93.8,1805,77.5,1290,55.4,1030,44.2,855,36.7
3,E09000004,Bexley,2002,765,710,92.8,595,77.8,470,61.4,385,50.3,325,42.5
4,E09000005,Brent,2002,1635,1530,93.6,1135,69.4,810,49.5,625,38.2,525,32.1


In [29]:
#Rename area to Borough
dfbus.rename(columns={'area':'Borough'}, inplace=True)

In [30]:
#Select just the 2012 rows as this is the latest data; take a copy so later edits do not trigger SettingWithCopyWarning
dfnew = dfbus[dfbus['year'] == 2012].copy()

In [31]:
#Drop the unwanted columns
dfnew.drop(['code','year','births','1_year_survival_number','1_year_survival_rate','2_year_survival_number','2_year_survival_rate'], axis=1, inplace=True)
dfnew.drop(['3_year_survival_number','3_year_survival_rate','4_year_survival_number','4_year_survival_rate','5_year_survival_number'], axis=1, inplace=True)


In [32]:
dfnew.head()

Unnamed: 0,Borough,5_year_survival_rate
510,City of London,37.9
511,Barking and Dagenham,39.6
512,Barnet,40.9
513,Bexley,44.1
514,Brent,39.2


In [33]:
dfnew.dtypes

Borough                 object
5_year_survival_rate    object
dtype: object

In [34]:
dfnew['5_year_survival_rate'] = pd.to_numeric(dfnew['5_year_survival_rate'],errors='coerce')


In [35]:
dfnew.dtypes

Borough                  object
5_year_survival_rate    float64
dtype: object

In [36]:
#merge the survival data with the completed borough data to complete data cleansing
df = df.merge(dfnew,on=['Borough'],how='left')

In [37]:
df['Radius'] = df['Radius'].astype('int64')

In [38]:
df.head()

Unnamed: 0,Borough,Area,Latitude,Longitude,Radius,5_year_survival_rate
0,Barking and Dagenham,13.93,51.5607,0.1557,3003,39.6
1,Barnet,33.49,51.6252,-0.1517,4656,40.9
2,Bexley,23.38,51.4549,0.1505,3890,44.1
3,Brent,16.7,51.5588,-0.2817,3288,39.2
4,Bromley,57.97,51.4039,0.0198,6126,44.6


In [39]:
#Output to check if required
#df.to_csv('London_borough_data.csv')

#### Create a map of London with the boroughs superimposed on top.

In [40]:
#Get the central lats and longs for London
address = 'London, UK'

geolocator = Nominatim(user_agent="lon_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.4893335, -0.144055084527687.


In [41]:
map_london = folium.Map(location=[latitude, longitude])
map_london

In [42]:
# add borough markers to map
for lat, lng, bor in zip(df['Latitude'], df['Longitude'], df['Borough']):
    label = '{}'.format(bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)


In [43]:
map_london

#### Use the Foursquare API to get venue data for the London boroughs

##### Define Foursquare credentials

In [44]:
CLIENT_ID = 'FQK1QA11AK1OPXWQ1X1D4AVU1UKV1OV1SNOSJ04VVQJJGVHZ' # your Foursquare ID
CLIENT_SECRET = 'PWBEB4JGCJVP0XAFQ33YRC43ZMZIWXANE4LTCZ42XQNZRYYP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FQK1QA11AK1OPXWQ1X1D4AVU1UKV1OV1SNOSJ04VVQJJGVHZ
CLIENT_SECRET:PWBEB4JGCJVP0XAFQ33YRC43ZMZIWXANE4LTCZ42XQNZRYYP


##### Define function that will retrieve venue data for a borough

In [45]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    
    LIMIT = 100
    venues_list=[]
    for name, lat, lng, rad in zip(names, latitudes, longitudes, radius):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            rad, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,
            rad,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Latitude', 
                  'Longitude',
                  'Radius',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
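The request URL in the function above is assembled with string formatting. As a hedged alternative, the query string can be built with the standard library's `urlencode`, which escapes special characters such as the comma in the `ll` parameter. The endpoint and parameter names are taken from the function above; `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint copied from getNearbyVenues above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, radius, client_id, client_secret, version, limit=100):
    """Return the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials, for illustration only
url = build_explore_url(51.5607, 0.1557, 3003, 'ID', 'SECRET', '20180605')
print(url)
```

Keeping URL construction in its own function also makes it possible to check the request without calling the API.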

In [46]:
#Get venue data for the different boroughs
borough_venues = getNearbyVenues(names=df['Borough'],latitudes=df['Latitude'],longitudes=df['Longitude'],radius=df['Radius'])

Barking and Dagenham
Barnet
Bexley
Brent
Bromley
Camden
Croydon
Ealing
Enfield
Greenwich
Hackney
Hammersmith and Fulham
Haringey
Harrow
Havering
Hillingdon
Hounslow
Islington
Kensington and Chelsea
Kingston upon Thames
Lambeth
Lewisham
Merton
Newham
Redbridge
Richmond upon Thames
Southwark
Sutton
Tower Hamlets
Waltham Forest
Wandsworth
Westminster


In [47]:
#Have a look at the data
print(borough_venues.shape)
borough_venues

(3100, 8)


Unnamed: 0,Borough,Latitude,Longitude,Radius,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Barking and Dagenham,51.5607,0.1557,3003,Central Park,51.559560,0.161981,Park
1,Barking and Dagenham,51.5607,0.1557,3003,The Range,51.575550,0.180254,Furniture / Home Store
2,Barking and Dagenham,51.5607,0.1557,3003,Hylands Park,51.572074,0.191155,Park
3,Barking and Dagenham,51.5607,0.1557,3003,Costa Coffee,51.576890,0.179497,Coffee Shop
4,Barking and Dagenham,51.5607,0.1557,3003,Harrow Lodge Park,51.555648,0.197926,Park
...,...,...,...,...,...,...,...,...
3095,Westminster,51.4973,-0.1372,2316,Victoria Embankment Gardens,51.508135,-0.122079,Garden
3096,Westminster,51.4973,-0.1372,2316,Halcyon Gallery,51.511452,-0.143736,Art Gallery
3097,Westminster,51.4973,-0.1372,2316,The Connaught,51.510138,-0.149498,Hotel
3098,Westminster,51.4973,-0.1372,2316,Harvey Nichols,51.501543,-0.159742,Department Store


#### Check how many venues were returned for each borough

In [48]:
borough_venues.groupby('Borough').count()

Borough,Latitude,Longitude,Radius,Venue,Venue Latitude,Venue Longitude,Venue Category
Barking and Dagenham,79,79,79,79,79,79,79
Barnet,100,100,100,100,100,100,100
Bexley,100,100,100,100,100,100,100
Brent,100,100,100,100,100,100,100
Bromley,100,100,100,100,100,100,100
Camden,100,100,100,100,100,100,100
Croydon,100,100,100,100,100,100,100
Ealing,100,100,100,100,100,100,100
Enfield,92,92,92,92,92,92,92
Greenwich,100,100,100,100,100,100,100


#### Find out how many unique categories can be curated from all the returned venues

In [49]:
print('There are {} unique categories.'.format(len(borough_venues['Venue Category'].unique())))

There are 263 unique categories.


#### Analyse each borough

In [50]:
# one hot encoding
borough_onehot = pd.get_dummies(borough_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
borough_onehot['Borough'] = borough_venues['Borough'] 

# move borough column to the first column
fixed_columns = [borough_onehot.columns[-1]] + list(borough_onehot.columns[:-1])
borough_onehot = borough_onehot[fixed_columns]

borough_onehot.tail()

Unnamed: 0,Borough,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Vietnamese Restaurant,Warehouse Store,Waterfront,Windmill,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio
3095,Westminster,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3096,Westminster,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3097,Westminster,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3098,Westminster,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3099,Westminster,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [51]:
borough_onehot.shape

(3100, 264)

#### Group rows by borough, taking the mean frequency of occurrence of each category

In [52]:
borough_grouped = borough_onehot.groupby('Borough').mean().reset_index()
borough_grouped

Unnamed: 0,Borough,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Vietnamese Restaurant,Warehouse Store,Waterfront,Windmill,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio
0,Barking and Dagenham,0.0,0.0,0.0,0.025316,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0
1,Barnet,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,...,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Brent,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bromley,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Camden,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0
6,Croydon,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Ealing,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
8,Enfield,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Greenwich,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
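Because each venue row is one-hot encoded, the per-borough mean of a category column is simply that category's share of the borough's venues. The sketch below shows the same calculation with the standard library on made-up data; `category_frequencies` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def category_frequencies(categories):
    """Return {category: share of venues} for one borough's venue categories."""
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}


# Made-up sample: 4 venues returned for one hypothetical borough
sample = ['Pub', 'Pub', 'Park', 'Coffee Shop']
freqs = category_frequencies(sample)
print(freqs['Pub'])  # 0.5
```

On the real data, a value such as 0.025316 for Barking and Dagenham would correspond to two of that borough's 79 returned venues (2 / 79).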


In [53]:
borough_grouped.tail()

Unnamed: 0,Borough,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Vietnamese Restaurant,Warehouse Store,Waterfront,Windmill,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio
27,Sutton,0.0,0.0,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,...,0.0125,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0125,0.0
28,Tower Hamlets,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29,Waltham Forest,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,...,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0
30,Wandsworth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01
31,Westminster,0.0,0.0,0.0,0.01,0.0,0.01,0.03,0.03,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0


In [54]:
borough_grouped.to_csv('borough_grouped.csv')

#### Print each borough along with the top 5 most common venues

In [55]:
num_top_venues = 5

for bor in borough_grouped['Borough']:
    print("---",bor,"---")
    temp = borough_grouped[borough_grouped['Borough'] == bor].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

--- Barking and Dagenham ---
            venue  freq
0   Grocery Store  0.11
1     Coffee Shop  0.09
2            Park  0.06
3     Supermarket  0.06
4  Clothing Store  0.04


--- Barnet ---
                venue  freq
0                Café  0.13
1         Coffee Shop  0.09
2  Turkish Restaurant  0.07
3                Park  0.06
4                 Pub  0.05


--- Bexley ---
                venue  freq
0       Grocery Store  0.16
1                 Pub  0.15
2         Supermarket  0.07
3   Convenience Store  0.04
4  Italian Restaurant  0.04


--- Brent ---
               venue  freq
0        Coffee Shop  0.09
1  Indian Restaurant  0.09
2              Hotel  0.06
3     Clothing Store  0.05
4     Sandwich Place  0.05


--- Bromley ---
                  venue  freq
0                   Pub  0.16
1           Coffee Shop  0.09
2                  Park  0.09
3  Gym / Fitness Center  0.06
4           Pizza Place  0.04


--- Camden ---
         venue  freq
0  Coffee Shop  0.08
1        Hotel  0.05
2

#### Put that into a *pandas* dataframe

In [56]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Create the new dataframe and display the top 10 venues for each borough.

In [57]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
borough_venues_sorted = pd.DataFrame(columns=columns)
borough_venues_sorted['Borough'] = borough_grouped['Borough']

for ind in np.arange(borough_grouped.shape[0]):
    borough_venues_sorted.iloc[ind, 1:] = return_most_common_venues(borough_grouped.iloc[ind, :], num_top_venues)

borough_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barking and Dagenham,Grocery Store,Coffee Shop,Supermarket,Park,Furniture / Home Store,Restaurant,Fast Food Restaurant,Clothing Store,Bus Stop,Multiplex
1,Barnet,Café,Coffee Shop,Turkish Restaurant,Park,Pub,Grocery Store,Bakery,Italian Restaurant,Gym / Fitness Center,Greek Restaurant
2,Bexley,Grocery Store,Pub,Supermarket,Fast Food Restaurant,Train Station,Coffee Shop,Convenience Store,Italian Restaurant,Clothing Store,Pizza Place
3,Brent,Indian Restaurant,Coffee Shop,Hotel,Clothing Store,Sandwich Place,Gym / Fitness Center,Grocery Store,Park,Pizza Place,Portuguese Restaurant
4,Bromley,Pub,Coffee Shop,Park,Gym / Fitness Center,Pizza Place,Italian Restaurant,Supermarket,Mediterranean Restaurant,Gastropub,Indian Restaurant
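The column labels above rely on a three-item suffix list with a fallback to "th", which is correct for the top 10 but would produce labels such as "21th" for longer lists. A small helper, sketched here under the hypothetical name `ordinal`, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Rebuild the same column labels as the cell above
columns = ['Borough'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[10])  # 1st Most Common Venue | 10th Most Common Venue
```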


#### Cluster Boroughs

Run *k*-means to cluster the boroughs into 5 clusters.

In [58]:
# set number of clusters
kclusters = 5

borough_grouped_clustering = borough_grouped.drop('Borough', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(borough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([0, 2, 0, 1, 2, 4, 2, 2, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 2, 2, 2, 2,
       2, 1, 0, 2, 4, 0, 4, 2, 2, 3], dtype=int32)
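To make the clustering step less of a black box, here is a toy version of the assign-then-update loop (Lloyd's algorithm) that k-means is based on. It is not scikit-learn's implementation: the centroids simply start at the first k points, an assumption made so the example is reproducible, whereas `KMeans` uses k-means++ initialisation by default. The data are made-up frequency vectors.

```python
def toy_kmeans(points, k, iterations=100):
    """Toy Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        new_labels = []
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            new_labels.append(distances.index(min(distances)))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Made-up rows: (grocery store share, Indian restaurant share) for 4 areas
points = [[0.16, 0.02], [0.02, 0.15], [0.15, 0.03], [0.01, 0.16]]
labels, centroids = toy_kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

As with the real model, the cluster numbers are arbitrary labels; only the grouping of rows carries meaning.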

Create a new dataframe that includes the cluster as well as the top 10 venues for each borough.

In [59]:
# add clustering labels
borough_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

borough_merged = df

# merge borough_grouped with borough_data to add latitude/longitude for each borough
borough_merged = borough_merged.join(borough_venues_sorted.set_index('Borough'), on='Borough')

borough_merged.tail() # check the last columns!


Unnamed: 0,Borough,Area,Latitude,Longitude,Radius,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Sutton,16.93,51.3618,-0.1945,3310,45.1,0,Pub,Grocery Store,Coffee Shop,Park,Italian Restaurant,Supermarket,Café,Sandwich Place,Clothing Store,Gym / Fitness Center
28,Tower Hamlets,7.63,51.5099,-0.0059,2222,35.9,4,Coffee Shop,Burger Joint,Hotel,Plaza,Bar,Park,Lounge,Italian Restaurant,Pub,Gym / Fitness Center
29,Waltham Forest,14.99,51.5908,-0.0134,3115,37.5,2,Pub,Coffee Shop,Café,Park,Pizza Place,Restaurant,Brewery,Turkish Restaurant,Supermarket,Bakery
30,Wandsworth,13.23,51.4567,-0.191,2926,39.9,2,Pub,Park,Coffee Shop,Café,Pizza Place,Bakery,Supermarket,French Restaurant,Thai Restaurant,Breakfast Spot
31,Westminster,8.29,51.4973,-0.1372,2316,35.6,3,Hotel,Boutique,Theater,Plaza,Clothing Store,Garden,Park,Art Gallery,Art Museum,Lounge


Finally, visualize the resulting clusters

In [60]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(borough_merged['Latitude'], borough_merged['Longitude'], borough_merged['Borough'], borough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster + 1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Closer look at the cluster data

#### Cluster 1

In [61]:
borough_merged.loc[borough_merged['Cluster Labels'] == 0, borough_merged.columns[[0] + [1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Area,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barking and Dagenham,13.93,39.6,0,Grocery Store,Coffee Shop,Supermarket,Park,Furniture / Home Store,Restaurant,Fast Food Restaurant,Clothing Store,Bus Stop,Multiplex
2,Bexley,23.38,44.1,0,Grocery Store,Pub,Supermarket,Fast Food Restaurant,Train Station,Coffee Shop,Convenience Store,Italian Restaurant,Clothing Store,Pizza Place
9,Greenwich,18.28,40.0,0,Pub,Grocery Store,Clothing Store,Coffee Shop,Hotel,Park,Supermarket,Gym / Fitness Center,Bakery,Fast Food Restaurant
14,Havering,43.35,45.2,0,Coffee Shop,Pub,Grocery Store,Park,Supermarket,Fast Food Restaurant,Italian Restaurant,Furniture / Home Store,Café,Bakery
24,Redbridge,21.78,39.9,0,Grocery Store,Park,Indian Restaurant,Supermarket,Pub,Bakery,Fast Food Restaurant,Coffee Shop,Clothing Store,Pizza Place
27,Sutton,16.93,45.1,0,Pub,Grocery Store,Coffee Shop,Park,Italian Restaurant,Supermarket,Café,Sandwich Place,Clothing Store,Gym / Fitness Center


There is a high incidence of grocery stores and pubs in these boroughs. This would indicate these are residential areas. There is a good mix of venues, such as parks, restaurants, shops and coffee shops, indicating these are busy areas during the day, but night life is limited to pubs and a few restaurants. These boroughs are on the outskirts of London, generally to the east.

Cluster name: Residential

#### Cluster 2

In [62]:
borough_merged.loc[borough_merged['Cluster Labels'] == 1, borough_merged.columns[[0]+ [1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Area,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Brent,16.7,39.2,1,Indian Restaurant,Coffee Shop,Hotel,Clothing Store,Sandwich Place,Gym / Fitness Center,Grocery Store,Park,Pizza Place,Portuguese Restaurant
13,Harrow,19.49,43.4,1,Indian Restaurant,Coffee Shop,Pub,Gym / Fitness Center,Grocery Store,Supermarket,Park,Café,Ice Cream Shop,Bar
15,Hillingdon,44.67,44.3,1,Pub,Coffee Shop,Indian Restaurant,Golf Course,Hotel,Supermarket,Park,Bar,Gym / Fitness Center,Grocery Store
16,Hounslow,21.61,43.4,1,Pub,Coffee Shop,Indian Restaurant,Park,Hotel,Grocery Store,Pizza Place,Rugby Stadium,Clothing Store,Gym / Fitness Center
23,Newham,13.98,36.8,1,Hotel,Coffee Shop,Pub,Park,Grocery Store,Clothing Store,Gym / Fitness Center,Supermarket,Plaza,Thai Restaurant


There is a high incidence of Indian restaurants, along with coffee shops, pubs and hotels. This also looks to be a residential area but livelier in the evening. The boroughs are on the outskirts of London, generally to the west and north.

Cluster name: Residential with love of Indian food

#### Cluster 3

In [63]:
borough_merged.loc[borough_merged['Cluster Labels'] == 2, borough_merged.columns[[0] + [1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Area,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Barnet,33.49,40.9,2,Café,Coffee Shop,Turkish Restaurant,Park,Pub,Grocery Store,Bakery,Italian Restaurant,Gym / Fitness Center,Greek Restaurant
4,Bromley,57.97,44.6,2,Pub,Coffee Shop,Park,Gym / Fitness Center,Pizza Place,Italian Restaurant,Supermarket,Mediterranean Restaurant,Gastropub,Indian Restaurant
6,Croydon,33.41,39.4,2,Pub,Coffee Shop,Grocery Store,Park,Mediterranean Restaurant,Café,Clothing Store,Indian Restaurant,Pizza Place,South Indian Restaurant
7,Ealing,21.44,42.5,2,Pub,Coffee Shop,Park,Hotel,Gym / Fitness Center,Pizza Place,Grocery Store,Gastropub,Italian Restaurant,Sushi Restaurant
8,Enfield,31.74,43.7,2,Coffee Shop,Pub,Garden Center,Gym / Fitness Center,Supermarket,Turkish Restaurant,Greek Restaurant,Café,Grocery Store,Park
10,Hackney,7.36,43.5,2,Pub,Café,Coffee Shop,Bakery,Yoga Studio,Bookstore,Cocktail Bar,Brewery,Pizza Place,Park
11,Hammersmith and Fulham,6.33,35.5,2,Pub,Café,Coffee Shop,Park,Gastropub,Thai Restaurant,Indian Restaurant,Hotel,French Restaurant,Pizza Place
12,Haringey,11.42,40.7,2,Café,Turkish Restaurant,Coffee Shop,Park,Pub,Bakery,Pizza Place,Mediterranean Restaurant,Indian Restaurant,Greek Restaurant
17,Islington,5.74,38.1,2,Pub,Café,Coffee Shop,Bakery,Mediterranean Restaurant,Ice Cream Shop,Theater,Pizza Place,Park,Music Venue
18,Kensington and Chelsea,4.68,38.9,2,Pub,Restaurant,Italian Restaurant,Café,Gym / Fitness Center,Hotel,Bakery,Burger Joint,Indian Restaurant,Garden


These boroughs are dominated by pubs, coffee shops and restaurants. There are few grocery stores, indicating the local people enjoy eating out during the day and night.

Cluster name: Pub and café society

#### Cluster 4

In [64]:
borough_merged.loc[borough_merged['Cluster Labels'] == 3, borough_merged.columns[[0]+ [1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Area,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,Westminster,8.29,35.6,3,Hotel,Boutique,Theater,Plaza,Clothing Store,Garden,Park,Art Gallery,Art Museum,Lounge


This cluster is clearly unique, dominated by hotels, theatres and daytime cultural venues. It is in central London.

Cluster name: Cultural

#### Cluster 5

In [65]:
borough_merged.loc[borough_merged['Cluster Labels'] == 4, borough_merged.columns[[0] + [1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Area,5_year_survival_rate,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Camden,8.4,39.5,4,Coffee Shop,Bookstore,Hotel,Pizza Place,Theater,Seafood Restaurant,Bakery,Steakhouse,Canal,Grocery Store
26,Southwark,11.14,39.8,4,Coffee Shop,Hotel,Italian Restaurant,Cocktail Bar,Scenic Lookout,Theater,Café,Grocery Store,Garden,Steakhouse
28,Tower Hamlets,7.63,35.9,4,Coffee Shop,Burger Joint,Hotel,Plaza,Bar,Park,Lounge,Italian Restaurant,Pub,Gym / Fitness Center


This does not look like a heavily residential area, as grocery stores are few. The large number of coffee shops suggests these boroughs are busiest in the daytime and quieter in the evening, although the hotels, theatres and restaurants indicate that tourists staying overnight are also drawn here. These boroughs are in central London.

Cluster name: Coffee shop and tourist.