# Segmenting and Clustering Neighborhoods in Toronto

### 1: Scrape the Wiki page using BeautifulSoup

We start by importing the requests library to send HTTP requests.
<br>
BeautifulSoup then parses the page's source code into a BeautifulSoup (soup) object.
<br>
<br>
If we inspect the Wiki page in a browser, the table we need has the class 'wikitable sortable jquery-tablesorter'. The 'jquery-tablesorter' part is added by JavaScript after the page loads, so in the raw HTML returned by requests the class is 'wikitable sortable'. Let's retrieve that table from the soup object with the find() method.

In [1]:
import requests
from bs4 import BeautifulSoup

wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_url,'html.parser')

My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>
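
To make the find() lookup easier to follow, here is a minimal sketch on a small inline HTML string instead of the live Wiki page. The HTML snippet and the variable names are made up for illustration; only BeautifulSoup, find() and find_all() come from the cell above.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second has the class we want.
html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th></tr>
<tr><td>M3A</td><td>North York</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches.
table = soup.find('table', {'class': 'wikitable sortable'})
missing = soup.find('table', {'class': 'no-such-class'})

header_cells = [th.text for th in table.find_all('th')]
first_row = [td.text for td in table.find_all('td')]
print(header_cells, first_row, missing)
```

Checking for None before using the result is a cheap safeguard, since the class names on a live page can change.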

### 2: Turn HTML code into an array

As we can see, each row of the table is wrapped in a 'tr' tag, and each cell in a 'th' tag (for the headers) or a 'td' tag.
<br>
Let's create an empty array (a Python list) words. Then, run a for loop over every row (identified by 'tr').
<br>
For each row, we collect the text of every 'th' or 'td' cell, use text.split() and ' '.join() to strip the surrounding whitespace and newlines, and append the resulting list to our array words.

In [2]:
words = []

for items in My_table.find_all("tr"):
    data = [' '.join(item.text.split()) for item in items.find_all(['th','td'])]
    words.append(data)
    
words[0:5]

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]
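
The split-and-join idiom in the loop above is only there to tidy up whitespace. A small sketch with made-up cell strings shows the effect in isolation:

```python
# Made-up cell texts, similar to what item.text returns for a table cell.
raw_header = 'Neighbourhood\n'
raw_cell = 'Victoria\n  Village \n'

# split() with no arguments breaks on any run of whitespace (spaces, tabs,
# newlines) and drops empty pieces; ' '.join() glues the words back together.
clean_header = ' '.join(raw_header.split())
clean_cell = ' '.join(raw_cell.split())

print(repr(clean_header), repr(clean_cell))
```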

### 3: Turn the array of words into a DataFrame

We now have an array of the rows of the original table. Let's turn it into a DataFrame.
<br>
First, we need to import pandas and DataFrame. We then use the from_records() method to convert the words array into a DataFrame, passing the first row as the header.

In [3]:
import pandas as pd
from pandas import DataFrame

postal_df = DataFrame.from_records(words[1:], columns=words[0])
postal_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
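
As a rough illustration of the header handling, the sketch below builds a DataFrame from a short made-up list of rows. It also shows that, for a plain list of lists like this one, the regular DataFrame constructor should give the same result; treat that equivalence as an observation for this simple case rather than a general rule.

```python
import pandas as pd

# Made-up rows in the same shape as the words array: header first, then data.
rows = [
    ['Postcode', 'Borough', 'Neighbourhood'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]

# The first row supplies the column names, the rest supply the data.
df_records = pd.DataFrame.from_records(rows[1:], columns=rows[0])
df_plain = pd.DataFrame(rows[1:], columns=rows[0])

print(df_records)
```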


### 4: Data preparation

Let's remove the rows where the Borough is Not assigned.
<br>
We then group the DataFrame by Postcode and Borough, joining the aggregated Neighbourhoods into a single comma-separated string. The grouping keys become the index in this operation, so we need to reset it to get them back as columns.
<br>
<br>
Finally, we locate all rows where Neighbourhood is Not assigned, and replace it with the respective Borough.
<br>
The shape of our DataFrame is displayed below.

In [4]:
postal_df = postal_df[postal_df.Borough != 'Not assigned']
postal_df = pd.DataFrame(postal_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join))
postal_df = postal_df.reset_index()

postal_df.loc[postal_df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = postal_df['Borough']

postal_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [5]:
postal_df.shape

(103, 3)
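
The three preparation steps can be traced on a tiny made-up DataFrame. The rows below are illustrative only (the M7A row is invented to show the Not assigned replacement), and the column names follow the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough.
toy = toy[toy['Borough'] != 'Not assigned']

# 2. One row per postcode: join the neighbourhood names, then restore the
#    grouping keys as ordinary columns.
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)

# 3. Where the neighbourhood is still 'Not assigned', fall back to the borough.
mask = grouped['Neighbourhood'] == 'Not assigned'
grouped.loc[mask, 'Neighbourhood'] = grouped.loc[mask, 'Borough']

print(grouped)
```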

### 5: Get coordinates of each neighborhood

Install and import the Geocoder Python package to retrieve the latitude and longitude coordinates of each neighborhood.
<br>
Since the package can be very unreliable, we will run a while loop for each postal code, repeating the query until it returns coordinates.
<br>
Finally, we create two new columns for latitude and longitude in our DataFrame, applying the while loop to each of the postal codes.
<br>
<br>
Since I reached the query limit for Geocoder, I will instead import a CSV with the geographical coordinates as a new DataFrame and merge it with my DataFrame. The Geocoder code is kept below, disabled inside a string, for reference.
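
A while loop that never gives up can hang forever once the query limit is hit. As a hedged alternative, here is a sketch of a bounded retry. The function get_coord_with_retries, its parameters, and the flaky_lookup stand-in are all hypothetical names introduced for this example; no real geocoding service is called.

```python
import time

def get_coord_with_retries(code, lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup(code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(code)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay_seconds)  # pause before trying again
    return None, None  # give up instead of looping forever

# Hypothetical stand-in for an unreliable service: fails twice, then succeeds.
calls = {'count': 0}

def flaky_lookup(code):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]

lat, lng = get_coord_with_retries('M1B', flaky_lookup)
never = get_coord_with_retries('M1B', lambda code: None, max_attempts=2)
print(lat, lng, never)
```

Returning (None, None) after the last attempt leaves it to the caller to decide how to handle postal codes that could not be resolved.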

In [6]:
'''
! pip install geocoder
import geocoder # import geocoder

def get_coord(code):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google(str(code) + ', Toronto, Ontario')
        lat_lng_coords = g.latlng
    
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude


# apply() returns one (latitude, longitude) tuple per row; zip(*...) splits them into two columns
postal_df['Latitude'], postal_df['Longitude'] = zip(*postal_df['Postcode'].apply(get_coord))
postal_df
'''

url = 'http://cocl.us/Geospatial_data'
coordinates = pd.read_csv(url)

postal_df = postal_df.merge(coordinates, how='inner', left_on='Postcode', right_on='Postal Code')
postal_df.drop(columns='Postal Code', inplace=True) # Drop repeated column 'Postal Code'
postal_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
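
An inner merge silently drops postal codes that have no match in the coordinates file. One way to check for that, sketched here on made-up data, is a left merge with pandas' indicator option. The postcode 'M9Z' is invented to force a mismatch.

```python
import pandas as pd

postal = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# indicator=True adds a '_merge' column saying where each row was found.
merged = postal.merge(
    coords, how='left', left_on='Postcode', right_on='Postal Code', indicator=True
)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```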


### 6: Display in a map

We will now display our neighborhoods on a map, according to the coordinates of each postcode.
<br>
First, we import folium, a Python library that allows us to display data on an interactive Leaflet map. Uncomment the first line to install folium.
<br>
Second, we create two variables to center the map. These are the average latitude and longitude of our neighborhoods.
<br>
<br>
Finally, we create the map and add a marker for each neighborhood, with a popup showing the respective postal code.

In [7]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

Latitude = postal_df['Latitude'].mean()
Longitude = postal_df['Longitude'].mean()

# create map
toronto_map = folium.Map(location=[Latitude, Longitude], zoom_start=11)  # named toronto_map to avoid shadowing the built-in map()

# add markers to the map
for lat, lon, poi in zip(postal_df['Latitude'], postal_df['Longitude'], postal_df['Postcode']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(toronto_map)

toronto_map

### 7: Clustering

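We will create clusters of neighborhoods based on geographical proximity, using k-means clustering.
<br>
K-means is available in the scikit-learn (sklearn) library. Let's create 7 clusters in this exercise, but we could try several numbers of clusters and compare their within-cluster sum of squared errors. Since that error keeps shrinking as clusters are added, the usual choice is the point where the improvement levels off (the elbow), rather than simply the lowest error.
<br>
<br>
The algorithm assigns one cluster label to each neighborhood identified by the postal code.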

In [8]:
from sklearn.cluster import KMeans

# Drop columns not needed for K-means
postal_df_cluster = postal_df.drop(columns=['Postcode', 'Borough', 'Neighbourhood'])  # keyword form; a positional axis argument fails on recent pandas

# Fit model for intended number of clusters
n_clusters = 7
kmeans = KMeans(n_clusters = n_clusters).fit(postal_df_cluster)
kmeans.labels_[60:75] 

array([0, 0, 6, 4, 4, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4], dtype=int32)
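
Trying several numbers of clusters could look roughly like the sketch below. It runs on synthetic points generated around three made-up centers rather than on postal_df, and it uses the inertia_ attribute of a fitted KMeans model, which holds the within-cluster sum of squared distances. The n_init and random_state values are arbitrary choices made so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) points around three made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.40], [43.75, -79.25], [43.70, -79.55]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centers])

k_values = range(1, 8)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# Look for the k after which the inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 4))
```

With data like this the drop should flatten out around three clusters, which is the kind of elbow to look for on the real coordinates.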

Let's add the cluster labels to our DataFrame as a new column.

In [9]:
postal_df['Cluster'] = kmeans.labels_
postal_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,2
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,2
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,2
3,M1G,Scarborough,Woburn,43.770992,-79.216917,2
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,2


### 8: Display clusters in a map

Finally, let's recreate the map to display our neighborhoods, this time taking into account the respective cluster.
<br>
Let's start by importing the numpy and matplotlib libraries, to help us color code the neighborhood markers.
<br>
Then, we create the map centered on the average latitude and longitude.
<br>
<br>
We set a color scheme for the clusters and add the neighborhood markers to the map, coloured according to the cluster they belong to.
<br>
As a finishing touch, the popup for each marker displays the postal code and the cluster it belongs to.

In [10]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[Latitude, Longitude], zoom_start=11)

# set color scheme for the clusters
# one evenly spaced rainbow color per cluster, converted to hex strings
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(postal_df['Latitude'], postal_df['Longitude'], postal_df['Postcode'], postal_df['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # labels run from 0 to n_clusters - 1, so they index the color list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
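
To see what the color scheme produces, here is a small standalone sketch of the same idea: sample a matplotlib colormap at evenly spaced positions and convert each sample to a hex string. The labels array is made up for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 7

# Sample the rainbow colormap at n_clusters evenly spaced positions in [0, 1].
rgba_values = cm.rainbow(np.linspace(0, 1, n_clusters))
hex_colors = [colors.rgb2hex(rgba) for rgba in rgba_values]

# Made-up cluster labels, in the integer format that KMeans returns.
labels = np.array([0, 6, 4, 2], dtype=np.int32)
marker_colors = [hex_colors[label] for label in labels]

print(hex_colors)
print(marker_colors)
```

Each cluster label then maps to exactly one color, which keeps the markers on the map visually distinct.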