# The Battle of the Neighbourhoods

In this notebook the neighbourhoods in the city of Toronto will be segmented and clustered based on the postal code and borough information.<br><br>
Part 1: Collecting Toronto neighbourhood data<br>
For the Toronto neighbourhood data, a Wikipedia page exists that has all the information that is necessary to explore and cluster the neighbourhoods in Toronto.<br> The required HTML table from the Wikipedia page will be read into a pandas DataFrame.<br><br>
Part 2: Collecting geographical coordinates of Toronto<br>
After cleaning and preprocessing, the data will be enriched with the geographical coordinates.<br><br>
Part 3: Clustering the neighbourhoods of Toronto<br>
Then the venues in each neighbourhood will be collected from the Foursquare website. The top 10 venues of each neighbourhood will be determined.<br> 
The K-means algorithm is used to cluster the neighbourhoods, which are then visualized on a map of Toronto using Folium.<br>

## Part 1: Collecting the Toronto neighbourhood data.

### Importing libraries

In [1]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

from bs4 import BeautifulSoup # Library for scraping webpage
from IPython.display import display_html # Library for displaying HTML

#!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print('Importing ready!')

Importing ready!


### Retrieve neighbourhoods of Amsterdam from Wikipedia webpage

In [2]:
# Get webpage
source = requests.get('https://en.wikipedia.org/wiki/Category:Neighbourhoods_of_Amsterdam').text
# Scrape webpage
soup = BeautifulSoup(source,'lxml')
# Check title of webpage
print(soup.title)
lst = []
# Each 'mw-category-group' div holds the list entries for one initial letter
for item in soup.find_all('div', class_='mw-category-group'):
    sub_items = item.find_all('li')
    for sub_item in sub_items:
        lst.append(['Amsterdam', sub_item.text])
        
# Get table from webpage
#html_table = str(bsoup.table)
# Display table
#display_html(html_table,raw=True)
lst

<title>Category:Neighbourhoods of Amsterdam - Wikipedia</title>


[['Amsterdam', 'Template:Neighborhoods of Amsterdam'],
 ['Amsterdam', 'Admiralenbuurt'],
 ['Amsterdam', 'Amsteldorp'],
 ['Amsterdam', 'Amsterdam Oud-West'],
 ['Amsterdam', 'Amsterdam Oud-Zuid'],
 ['Amsterdam', 'Amsterdam Science Park'],
 ['Amsterdam', 'Apollobuurt'],
 ['Amsterdam', 'Betondorp'],
 ['Amsterdam', 'Bijlmermeer'],
 ['Amsterdam', 'Binnenstad (Amsterdam)'],
 ['Amsterdam', 'Bos en Lommer'],
 ['Amsterdam', 'Buiksloot'],
 ['Amsterdam', 'Buikslotermeer'],
 ['Amsterdam', 'Buitenveldert'],
 ['Amsterdam', 'Bullewijk'],
 ['Amsterdam', 'Burgwallen Nieuwe Zijde'],
 ['Amsterdam', 'Burgwallen Oude Zijde'],
 ['Amsterdam', 'Chassébuurt'],
 ['Amsterdam', 'Cruquiuseiland'],
 ['Amsterdam', 'Czaar Peterbuurt'],
 ['Amsterdam', 'Dapperbuurt'],
 ['Amsterdam', 'De Aker'],
 ['Amsterdam', 'De Pijp'],
 ['Amsterdam', 'De Wallen'],
 ['Amsterdam', 'Diamantbuurt (Amsterdam)'],
 ['Amsterdam', 'Duivelseiland (Amsterdam)'],
 ['Amsterdam', 'Eastern Docklands'],
 ['Amsterdam', 'Eendracht (Amsterdam)'],
 ['Ams
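
The scraping step above can be tried offline on a small HTML fragment. The sketch below is illustrative only: `SAMPLE_HTML` is a hand-written stand-in for the category page (not the real page source), and `scrape_neighbourhoods` is a hypothetical helper name. It uses Python's built-in `html.parser` instead of `lxml`, and it skips `Template:` entries while collecting, which is an assumption about what should be filtered out.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the category page; not the real page source.
SAMPLE_HTML = """
<div class="mw-category-group"><h3>*</h3>
  <ul><li><a>Template:Neighborhoods of Amsterdam</a></li></ul>
</div>
<div class="mw-category-group"><h3>A</h3>
  <ul><li><a>Admiralenbuurt</a></li><li><a>Amsteldorp</a></li></ul>
</div>
"""


def scrape_neighbourhoods(html, city):
    """Return [city, name] pairs for every list item in a category group."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for group in soup.find_all("div", class_="mw-category-group"):
        for item in group.find_all("li"):
            name = item.get_text(strip=True)
            # Assumption: entries in the Template namespace are not neighbourhoods.
            if name.startswith("Template:"):
                continue
            rows.append([city, name])
    return rows


rows = scrape_neighbourhoods(SAMPLE_HTML, "Amsterdam")
print(rows)
```

Filtering while scraping is only one option; removing the row afterwards in pandas, as this notebook does, keeps the raw list intact for inspection.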

In [3]:
df=pd.DataFrame(lst,columns=['City','Neighbourhood'])
df["Neighbourhood"]  = df["Neighbourhood"].str.strip()
df

Unnamed: 0,City,Neighbourhood
0,Amsterdam,Template:Neighborhoods of Amsterdam
1,Amsterdam,Admiralenbuurt
2,Amsterdam,Amsteldorp
3,Amsterdam,Amsterdam Oud-West
4,Amsterdam,Amsterdam Oud-Zuid
...,...,...
102,Amsterdam,Westerpark (neighbourhood)
103,Amsterdam,Willemspark (Amsterdam)
104,Amsterdam,Zeeburgereiland
105,Amsterdam,Zeeheldenbuurt


In [4]:
df.shape

(107, 2)

### Cleaning and preparing the dataset

In [5]:
# Count the entries that are not neighbourhoods (the category's template page)
df.loc[df.Neighbourhood.str.contains('Template'), 'Neighbourhood'].count()

1

### Remove the template entry from the DataFrame

In [6]:
df1 = df[~df.Neighbourhood.str.contains('Template')]
df1

Unnamed: 0,City,Neighbourhood
1,Amsterdam,Admiralenbuurt
2,Amsterdam,Amsteldorp
3,Amsterdam,Amsterdam Oud-West
4,Amsterdam,Amsterdam Oud-Zuid
5,Amsterdam,Amsterdam Science Park
...,...,...
102,Amsterdam,Westerpark (neighbourhood)
103,Amsterdam,Willemspark (Amsterdam)
104,Amsterdam,Zeeburgereiland
105,Amsterdam,Zeeheldenbuurt


In [7]:
df1.shape

(106, 2)

In [8]:
# Count names that carry a parenthesised qualifier, e.g. "Willemspark (Amsterdam)"
df1.loc[df1.Neighbourhood.str.contains(r'\('), 'Neighbourhood'].count()

19

In [9]:
#df1['Neighbourhood'] = df1['Neighbourhood'].str.replace('\(Amsterdam\)','')
#df1['Neighbourhood'] = df1['Neighbourhood'].str.replace('\(neighbourhood\)','')
# Keep only the part of the name before the opening parenthesis
df1['Neighbourhood'] = df1['Neighbourhood'].str.split(r'\(').str[0]
df1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  app.launch_new_instance()


Unnamed: 0,City,Neighbourhood
1,Amsterdam,Admiralenbuurt
2,Amsterdam,Amsteldorp
3,Amsterdam,Amsterdam Oud-West
4,Amsterdam,Amsterdam Oud-Zuid
5,Amsterdam,Amsterdam Science Park
...,...,...
102,Amsterdam,Westerpark
103,Amsterdam,Willemspark
104,Amsterdam,Zeeburgereiland
105,Amsterdam,Zeeheldenbuurt
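
The `SettingWithCopyWarning` above appears because `df1` was created by filtering `df`, so pandas cannot tell whether the assignment is meant to modify `df` as well. The sketch below shows one common way to avoid it, taking an explicit `.copy()` when filtering. It also strips the qualifier with a regular expression that consumes the space before the parenthesis, since splitting on `(` leaves that trailing space in the name. The sample rows are a small hand-picked subset and the regular expression is an assumption about the name format, not the notebook's original code.

```python
import pandas as pd

# Small hand-picked subset of the scraped names, for illustration.
df = pd.DataFrame({
    "City": ["Amsterdam"] * 4,
    "Neighbourhood": [
        "Template:Neighborhoods of Amsterdam",
        "Amsteldorp",
        "Westerpark (neighbourhood)",
        "Willemspark (Amsterdam)",
    ],
})

# .copy() makes df1 an independent DataFrame, so later column assignments
# are unambiguous and do not trigger SettingWithCopyWarning.
df1 = df[~df["Neighbourhood"].str.contains("Template")].copy()

# Remove a trailing "(...)" qualifier together with the whitespace before it.
df1["Neighbourhood"] = df1["Neighbourhood"].str.replace(
    r"\s*\(.*\)$", "", regex=True
)

print(df1["Neighbourhood"].tolist())
```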


In [10]:
# Verify that no parenthesised qualifiers remain
df1.loc[df1.Neighbourhood.str.contains(r'\('), 'Neighbourhood'].count()

0

In [11]:
df1["address"] = df1["Neighbourhood"] + ', ' +  df1["City"]
df1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Unnamed: 0,City,Neighbourhood,address
1,Amsterdam,Admiralenbuurt,"Admiralenbuurt, Amsterdam"
2,Amsterdam,Amsteldorp,"Amsteldorp, Amsterdam"
3,Amsterdam,Amsterdam Oud-West,"Amsterdam Oud-West, Amsterdam"
4,Amsterdam,Amsterdam Oud-Zuid,"Amsterdam Oud-Zuid, Amsterdam"
5,Amsterdam,Amsterdam Science Park,"Amsterdam Science Park, Amsterdam"
...,...,...,...
102,Amsterdam,Westerpark,"Westerpark , Amsterdam"
103,Amsterdam,Willemspark,"Willemspark , Amsterdam"
104,Amsterdam,Zeeburgereiland,"Zeeburgereiland, Amsterdam"
105,Amsterdam,Zeeheldenbuurt,"Zeeheldenbuurt, Amsterdam"


In [12]:
#df2 = df1[df1.Neighbourhood == 'Zuidas']
df2 = df1.copy()
df2

Unnamed: 0,City,Neighbourhood,address
1,Amsterdam,Admiralenbuurt,"Admiralenbuurt, Amsterdam"
2,Amsterdam,Amsteldorp,"Amsteldorp, Amsterdam"
3,Amsterdam,Amsterdam Oud-West,"Amsterdam Oud-West, Amsterdam"
4,Amsterdam,Amsterdam Oud-Zuid,"Amsterdam Oud-Zuid, Amsterdam"
5,Amsterdam,Amsterdam Science Park,"Amsterdam Science Park, Amsterdam"
...,...,...,...
102,Amsterdam,Westerpark,"Westerpark , Amsterdam"
103,Amsterdam,Willemspark,"Willemspark , Amsterdam"
104,Amsterdam,Zeeburgereiland,"Zeeburgereiland, Amsterdam"
105,Amsterdam,Zeeheldenbuurt,"Zeeheldenbuurt, Amsterdam"


In [57]:
address = 'Amsteldorp, Amsterdam'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Amsteldorp, Amsterdam are 52.3443384, 4.9220313.


In [14]:
#!pip install geopandas
#!pip install geopy
from geopy.extra.rate_limiter import RateLimiter

locator = Nominatim(user_agent="neighbourhoud_explorer")

# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# 2 - create location column by geocoding each address (None when no match is found)
df2['location'] = df2['address'].apply(geocode)

# 3 - create a (latitude, longitude, altitude) tuple from the location column
df2['point'] = df2['location'].apply(lambda loc: tuple(loc.point) if loc else None)

# 4 - split point column into latitude, longitude and altitude columns
#df2[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df2['point'].tolist(), index=df2.index)

df2

Unnamed: 0,City,Neighbourhood,address,location,point
1,Amsterdam,Admiralenbuurt,"Admiralenbuurt, Amsterdam",,
2,Amsterdam,Amsteldorp,"Amsteldorp, Amsterdam","(Huisarts Amsteldorp, Middelhoffstraat, Franke...","(52.3443384, 4.9220313, 0.0)"
3,Amsterdam,Amsterdam Oud-West,"Amsterdam Oud-West, Amsterdam","(HEMA Amsterdam-Kinkerstraat, 313, Kinkerstraa...","(52.3647387, 4.8630105, 0.0)"
4,Amsterdam,Amsterdam Oud-Zuid,"Amsterdam Oud-Zuid, Amsterdam","(Amsterdam-Oud Zuid, Ringweg-Zuid, Zuidas, Zui...","(52.3391253, 4.8661853, 0.0)"
5,Amsterdam,Amsterdam Science Park,"Amsterdam Science Park, Amsterdam","(Amsterdam Science Park, Kruislaan, Watergraaf...","(52.352926, 4.948315, 0.0)"
...,...,...,...,...,...
102,Amsterdam,Westerpark,"Westerpark , Amsterdam","(Westerpark, West, Amsterdam, Noord-Holland, N...","(52.387236349999995, 4.871777328438663, 0.0)"
103,Amsterdam,Willemspark,"Willemspark , Amsterdam","(Café Willemspark, 223, Willemsparkweg, Museum...","(52.3552537, 4.8683772, 0.0)"
104,Amsterdam,Zeeburgereiland,"Zeeburgereiland, Amsterdam","(Zeeburgereiland, Schellingwoude, Amsterdam, N...","(52.372608299999996, 4.965545531374505, 0.0)"
105,Amsterdam,Zeeheldenbuurt,"Zeeheldenbuurt, Amsterdam","(Zeeheldenbuurt, Amsterdam, Noord-Holland, Ned...","(52.389329849999996, 4.888242227776295, 0.0)"
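
To make the role of `min_delay_seconds` concrete, here is a minimal sketch of the idea behind a rate limiter: a wrapper that enforces a minimum gap between consecutive calls. This is not geopy's implementation (the real `RateLimiter` has further options, such as retrying on errors). `rate_limited` and `fake_geocode` are hypothetical names, and the fake geocoder returns placeholder values so the example runs without network access.

```python
import time


def rate_limited(func, min_delay_seconds):
    """Wrap func so consecutive calls start at least min_delay_seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


def fake_geocode(address):
    # Stand-in for locator.geocode; returns placeholder coordinates.
    return (address, 0.0, 0.0)


# A short delay keeps the demo fast; the notebook uses 1 second.
geocode_demo = rate_limited(fake_geocode, min_delay_seconds=0.05)

start = time.monotonic()
results = [geocode_demo(a) for a in ["A, Amsterdam", "B, Amsterdam", "C, Amsterdam"]]
elapsed = time.monotonic() - start
print(len(results), "calls took", round(elapsed, 2), "seconds")
```

With a delay of one second per call, geocoding roughly a hundred addresses takes on the order of a couple of minutes, which is the price of staying within the service's usage limits.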


In [51]:
print(df2.loc[df2["location"].isnull()].count())

df2.loc[df2["location"].isnull()]




City             15
Neighbourhood    15
address          15
location          0
point             0
dtype: int64


Unnamed: 0,City,Neighbourhood,address,location,point
1,Amsterdam,Admiralenbuurt,"Admiralenbuurt, Amsterdam",,
17,Amsterdam,Chassébuurt,"Chassébuurt, Amsterdam",,
35,Amsterdam,Hoofddorppleinbuurt,"Hoofddorppleinbuurt, Amsterdam",,
40,Amsterdam,Jodenbuurt,"Jodenbuurt, Amsterdam",,
46,Amsterdam,Kolenkit District,"Kolenkit District, Amsterdam",,
50,Amsterdam,Middelveldsche Akerpolder,"Middelveldsche Akerpolder, Amsterdam",,
58,Amsterdam,Nieuwendammerdijk en Buiksloterdijk,"Nieuwendammerdijk en Buiksloterdijk, Amsterdam",,
72,Amsterdam,Overtoombuurt,"Overtoombuurt, Amsterdam",,
75,Amsterdam,Prinses Irenebuurt,"Prinses Irenebuurt, Amsterdam",,
78,Amsterdam,Rieteilanden,"Rieteilanden, Amsterdam",,
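
Rather than dropping the neighbourhoods that could not be geocoded, one option is to retry with alternative query strings. The sketch below is an assumption-laden illustration: `geocode_with_fallbacks` is a hypothetical helper, the query variants (appending a country, removing the word "District") are guesses that may or may not help with Nominatim, and `fake_geocode` with its made-up coordinates only stands in for the real geocoder so the example runs offline.

```python
def geocode_with_fallbacks(name, city, geocode, country="Netherlands"):
    """Try several query variants; return (query, result) for the first hit."""
    candidates = [
        f"{name}, {city}",
        f"{name}, {city}, {country}",
        f"{name.replace(' District', '')}, {city}",
    ]
    seen = set()
    for query in candidates:
        if query in seen:  # skip variants that collapse to the same string
            continue
        seen.add(query)
        result = geocode(query)
        if result is not None:
            return query, result
    return None, None


# Stand-in geocoder with made-up coordinates (not real Nominatim output).
KNOWN = {"Kolenkit, Amsterdam": (52.38, 4.84)}


def fake_geocode(query):
    return KNOWN.get(query)


query, result = geocode_with_fallbacks("Kolenkit District", "Amsterdam", fake_geocode)
print(query, result)

missing = geocode_with_fallbacks("Nowhere", "Amsterdam", fake_geocode)
print(missing)
```

In the notebook, the `geocode` argument would be the rate-limited function defined earlier, so the extra attempts still respect the one-second delay.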


In [61]:
df2.dropna(inplace=True)
df2.reset_index(drop=True, inplace=True)
df2

Unnamed: 0,City,Neighbourhood,address,location,point
0,Amsterdam,Amsteldorp,"Amsteldorp, Amsterdam","(Huisarts Amsteldorp, Middelhoffstraat, Franke...","(52.3443384, 4.9220313, 0.0)"
1,Amsterdam,Amsterdam Oud-West,"Amsterdam Oud-West, Amsterdam","(HEMA Amsterdam-Kinkerstraat, 313, Kinkerstraa...","(52.3647387, 4.8630105, 0.0)"
2,Amsterdam,Amsterdam Oud-Zuid,"Amsterdam Oud-Zuid, Amsterdam","(Amsterdam-Oud Zuid, Ringweg-Zuid, Zuidas, Zui...","(52.3391253, 4.8661853, 0.0)"
3,Amsterdam,Amsterdam Science Park,"Amsterdam Science Park, Amsterdam","(Amsterdam Science Park, Kruislaan, Watergraaf...","(52.352926, 4.948315, 0.0)"
4,Amsterdam,Apollobuurt,"Apollobuurt, Amsterdam","(Apollobuurt, Zuid, Amsterdam, Noord-Holland, ...","(52.348072599999995, 4.875559011765657, 0.0)"
...,...,...,...,...,...
86,Amsterdam,Westerpark,"Westerpark , Amsterdam","(Westerpark, West, Amsterdam, Noord-Holland, N...","(52.387236349999995, 4.871777328438663, 0.0)"
87,Amsterdam,Willemspark,"Willemspark , Amsterdam","(Café Willemspark, 223, Willemsparkweg, Museum...","(52.3552537, 4.8683772, 0.0)"
88,Amsterdam,Zeeburgereiland,"Zeeburgereiland, Amsterdam","(Zeeburgereiland, Schellingwoude, Amsterdam, N...","(52.372608299999996, 4.965545531374505, 0.0)"
89,Amsterdam,Zeeheldenbuurt,"Zeeheldenbuurt, Amsterdam","(Zeeheldenbuurt, Amsterdam, Noord-Holland, Ned...","(52.389329849999996, 4.888242227776295, 0.0)"


In [62]:
df2.shape

(91, 5)

In [64]:
# 4 - split point column into latitude, longitude and altitude columns
df2[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df2['point'].tolist(), index=df2.index)
df2

Unnamed: 0,City,Neighbourhood,address,location,point,latitude,longitude,altitude
0,Amsterdam,Amsteldorp,"Amsteldorp, Amsterdam","(Huisarts Amsteldorp, Middelhoffstraat, Franke...","(52.3443384, 4.9220313, 0.0)",52.344338,4.922031,0.0
1,Amsterdam,Amsterdam Oud-West,"Amsterdam Oud-West, Amsterdam","(HEMA Amsterdam-Kinkerstraat, 313, Kinkerstraa...","(52.3647387, 4.8630105, 0.0)",52.364739,4.863010,0.0
2,Amsterdam,Amsterdam Oud-Zuid,"Amsterdam Oud-Zuid, Amsterdam","(Amsterdam-Oud Zuid, Ringweg-Zuid, Zuidas, Zui...","(52.3391253, 4.8661853, 0.0)",52.339125,4.866185,0.0
3,Amsterdam,Amsterdam Science Park,"Amsterdam Science Park, Amsterdam","(Amsterdam Science Park, Kruislaan, Watergraaf...","(52.352926, 4.948315, 0.0)",52.352926,4.948315,0.0
4,Amsterdam,Apollobuurt,"Apollobuurt, Amsterdam","(Apollobuurt, Zuid, Amsterdam, Noord-Holland, ...","(52.348072599999995, 4.875559011765657, 0.0)",52.348073,4.875559,0.0
...,...,...,...,...,...,...,...,...
86,Amsterdam,Westerpark,"Westerpark , Amsterdam","(Westerpark, West, Amsterdam, Noord-Holland, N...","(52.387236349999995, 4.871777328438663, 0.0)",52.387236,4.871777,0.0
87,Amsterdam,Willemspark,"Willemspark , Amsterdam","(Café Willemspark, 223, Willemsparkweg, Museum...","(52.3552537, 4.8683772, 0.0)",52.355254,4.868377,0.0
88,Amsterdam,Zeeburgereiland,"Zeeburgereiland, Amsterdam","(Zeeburgereiland, Schellingwoude, Amsterdam, N...","(52.372608299999996, 4.965545531374505, 0.0)",52.372608,4.965546,0.0
89,Amsterdam,Zeeheldenbuurt,"Zeeheldenbuurt, Amsterdam","(Zeeheldenbuurt, Amsterdam, Noord-Holland, Ned...","(52.389329849999996, 4.888242227776295, 0.0)",52.389330,4.888242,0.0
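
The split above works because the rows without a geocoding result were dropped first. If those rows should be kept instead, the missing points need a placeholder before the tuples are expanded. The sketch below shows one way to do that on a tiny hand-made DataFrame; the values are rounded sample coordinates and the approach is an assumption, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example: the second row has no geocoding result.
demo = pd.DataFrame({
    "Neighbourhood": ["Amsteldorp", "Admiralenbuurt", "Amsterdam Oud-West"],
    "point": [(52.34, 4.92, 0.0), None, (52.36, 4.86, 0.0)],
})

# Replace missing points by a tuple of NaNs so every row has three values.
nan_point = (np.nan, np.nan, np.nan)
filled = [p if p is not None else nan_point for p in demo["point"]]

# Expand the tuples into three columns, aligned on the original index.
coords = pd.DataFrame(
    filled, columns=["latitude", "longitude", "altitude"], index=demo.index
)
demo = demo.join(coords)

print(demo[["Neighbourhood", "latitude", "longitude"]])
```

Rows with NaN coordinates can then be filtered out later with `dropna(subset=["latitude", "longitude"])`, for example just before plotting or clustering.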


# Not executed: `html_table` is never defined (its assignment is commented out in cell 2).
# df_list = pd.read_html(html_table)
# df = df_list[0]
# df.rename(columns={'Postal Code':'Postcode'},inplace=True)
# print(df.shape)
# df