# Battle of the Neighbourhoods

Import Libraries 

In [1]:
import pandas as pd
!pip install lxml

import json # library to handle JSON files
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests

# The following code below allows me to use SQL with python
# I find using SQL to merge and join dataframes easier, and I'll explain the SQL code further down
!pip install pandasql 
from pandasql import sqldf

from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
import numpy as np # library to handle data in a vectorized manner

print('Libraries imported.')


Libraries imported.


Read the table of Toronto postal codes from the Wikipedia page

In [2]:
url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dataset_list = pd.read_html(url, header=0)

View the data:

In [3]:
# read_html returns a list of dataframes, one per table on the page;
# the first table holds the postal codes, so I store it as 'postal_codes_df'
postal_codes_df = dataset_list[0]

In [4]:
# head(n) shows up to the first n rows of the dataframe
postal_codes_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


# Cleaning the Data

I only want to process rows that have an assigned borough. Therefore, I need to filter out all the rows where Borough is 'Not assigned' using pandas. 

In [5]:
postal_codes_df = postal_codes_df.query("Borough != 'Not assigned'")

Now, let's view the data:

In [6]:
postal_codes_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
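
As a small aside, the same filter can also be written as a boolean mask. The sketch below uses a toy dataframe with three rows copied from the table above (the name `sample` is made up for the example) and checks that both forms return the same rows:

```python
import pandas as pd

# Toy rows copied from the postal code table above
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: keep rows whose Borough is assigned
assigned = sample[sample["Borough"] != "Not assigned"]

# The same filter written with query(), as in the cell above
assigned_query = sample.query("Borough != 'Not assigned'")

print(assigned.equals(assigned_query))
```

Either way, the filtered dataframe keeps its original index labels, which is why the table above starts at index 2 rather than 0.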


Some postal codes appear on several rows, one per neighbourhood, and there is still a 'Not assigned' in the Neighbourhood column. First, I want to combine the neighbourhoods that share a postal code into a single comma-separated row:

In [7]:
postal_codes_df = postal_codes_df.groupby(['Postcode', 'Borough'], as_index=False, sort=False)
postal_codes_df = postal_codes_df.agg(','.join)
postal_codes_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
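
To make the grouping step easier to follow, here is a minimal sketch on three toy rows taken from the data above. It names the column to aggregate explicitly, which is my own variation on the cell above rather than what the notebook runs:

```python
import pandas as pd

# Toy rows: M5A appears twice, once per neighbourhood
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# One row per (Postcode, Borough); neighbourhood names joined with commas.
# sort=False keeps the groups in order of first appearance.
combined = sample.groupby(
    ["Postcode", "Borough"], as_index=False, sort=False
).agg({"Neighbourhood": ",".join})

print(combined)
```

With `as_index=False` the grouping keys stay as ordinary columns, so the result has the same three columns as the input.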


Next, any Neighbourhood that is still 'Not assigned' should take its Borough value. Let's see which rows are affected:

In [8]:
postal_codes_df.query("Neighbourhood == 'Not assigned'")

Unnamed: 0,Postcode,Borough,Neighbourhood
4,M7A,Queen's Park,Not assigned


Only one row meets this exception, so I will replace that 'Not assigned' with the Borough name:

In [9]:
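# copy the Borough value into Neighbourhood wherever it is still 'Not assigned'
not_assigned = postal_codes_df.Neighbourhood == 'Not assigned'
postal_codes_df.loc[not_assigned, 'Neighbourhood'] = postal_codes_df.loc[not_assigned, 'Borough']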
postal_codes_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


Now the data looks better and is more functional! At this point, I want a second dataframe with the latitude and longitude of each postal code, so I'm going to read a CSV file of coordinates into pandas:

In [10]:
geocode_df=pd.read_csv('Geospatial_Coordinates.csv')
geocode_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The new dataframe looks good. Now, I want to merge this dataframe with my original postal_codes dataframe.

For this part, I'm going to use SQL to join the two dataframes. This is an easier and cleaner way to merge dataframes, in my opinion:

In [11]:
query = """
    SELECT p.*, Latitude, Longitude
    FROM postal_codes_df as p
    INNER JOIN geocode_df as g ON p.Postcode = g."Postal Code"
"""

postal_codes_df=sqldf(query)
postal_codes_df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


In this query, p.* selects every column from postal_codes_df, and only Latitude and Longitude are taken from geocode_df. An inner join keeps only the rows whose postal code appears in both tables, matching them on Postcode = 'Postal Code'. Because the 'Postal Code' column itself is not selected, the result has a single 'Postcode' column.
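
For readers who prefer to stay in pandas, the same inner join can be sketched with `DataFrame.merge`. The two toy dataframes below (`postal` and `geo`) are stand-ins built from a few rows shown above, not the full data:

```python
import pandas as pd

postal = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M1B"],
    "Borough": ["North York", "North York", "Scarborough"],
})
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M3A", "M4A", "M1C"],
    "Latitude": [43.806686, 43.753259, 43.725882, 43.784535],
    "Longitude": [-79.194353, -79.329656, -79.315572, -79.160497],
})

# Inner join on the postal code; the key is named differently in each table.
# The right-hand key column is dropped afterwards so only 'Postcode' remains.
merged = (
    postal.merge(geo, how="inner", left_on="Postcode", right_on="Postal Code")
    .drop(columns="Postal Code")
)

print(merged)
```

M1C only exists in `geo`, so the inner join leaves it out, just as the SQL version would.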

Next, I want to plot the latitude and longitude of the postal codes using folium and create a map visualisation:

In [12]:
# the map is centred on the first postal code (M3A); at zoom level 11 all the plotted points are visible
# the loop then walks through the rows by position 'i' and adds one marker per postal code;
# len() gives the number of rows, so every row of the dataframe is plotted
postal_map = folium.Map(location=[43.753259, -79.329656], zoom_start = 11)
for i in range(0, len(postal_codes_df)):
    folium.Marker([postal_codes_df.iloc[i]['Latitude'],postal_codes_df.iloc[i]['Longitude']]).add_to(postal_map)

postal_map
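
The map above is centred on hard-coded coordinates. As an alternative, here is a small sketch that derives the centre from the data by averaging the coordinates. The helper name `map_centre` is hypothetical, and the three rows are copied from the merged table:

```python
import pandas as pd

# A few coordinates copied from the merged dataframe above
coords = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.654260],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

def map_centre(df):
    """Return [latitude, longitude] at the mean of the plotted points."""
    return [df["Latitude"].mean(), df["Longitude"].mean()]

centre = map_centre(coords)
print(centre)
# The result could then be passed on as folium.Map(location=centre, zoom_start=11)
```

A plain average is fine over an area the size of a city; it is only a convenience for centring the view, not a geographic centroid.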

Now I need to segment and cluster the data so we can distinguish neighbourhoods!

I only want boroughs that include the word 'Toronto' in their name. I'm going to use the SQL LIKE operator to create a dataframe of all the postal codes whose borough name contains 'Toronto':

In [13]:

# I'm still using pandasql here, so the query is written in SQL rather than Python

query = """
    SELECT 
        *
    FROM postal_codes_df 
    WHERE Borough like '%Toronto%'
"""
Toronto_df = sqldf(query)
Toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


Then I want to create a dataframe with the rest of the boroughs:

In [14]:
query = """
    SELECT 
        *
    FROM postal_codes_df 
    WHERE Borough not like '%Toronto%'
"""
not_Toronto_df = sqldf(query)
not_Toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
3,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
4,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
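
The two LIKE queries can also be sketched in pandas with `str.contains`. The toy dataframe below is a stand-in with four rows from the tables above. `case=False` is there because SQLite's LIKE ignores case for ASCII letters by default:

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M4E", "M9A"],
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "Etobicoke"],
})

# True where the borough name contains 'Toronto' (like '%Toronto%')
is_toronto = sample["Borough"].str.contains("Toronto", case=False)

toronto = sample[is_toronto]       # WHERE Borough LIKE '%Toronto%'
not_toronto = sample[~is_toronto]  # WHERE Borough NOT LIKE '%Toronto%'

print(toronto)
print(not_toronto)
```

Building the mask once and negating it with `~` guarantees that the two subsets never overlap and together cover every row.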


Next, I want to create the folium map. The code below will first loop through the "Toronto" set coordinates and plot them red on the map. Then it will loop through the non-Toronto set of coordinates and plot them blue on the map. Then it shows the map.

In [15]:
postal_map = folium.Map(location=[43.753259, -79.329656], zoom_start = 11)

for lat, lng, label in zip(Toronto_df['Latitude'], Toronto_df['Longitude'], Toronto_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.3).add_to(postal_map)
    
for lat, lng, label in zip(not_Toronto_df['Latitude'], not_Toronto_df['Longitude'], not_Toronto_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.3).add_to(postal_map)

postal_map
# Click on a dot to see the neighbourhood name(s) for that postal code!

All of the red dots are postal codes in boroughs with 'Toronto' in the name. All the blue dots are postal codes in the other boroughs.
Keep in mind: this is an interactive map! Click on a point to see its neighbourhood name.

Let's look at a dataframe of the boroughs with 'Toronto' in the name:

In [16]:
query = """
    SELECT Distinct "Borough"
    FROM postal_codes_df 
    WHERE Borough like '%Toronto%'
"""
Toronto_df = sqldf(query)
Toronto_df.head(10)

Unnamed: 0,Borough
0,Downtown Toronto
1,East Toronto
2,West Toronto
3,Central Toronto


As we can see, there are only four boroughs with 'Toronto' in the name.

Which borough is the best option for building a new apartment complex? Let's look at each borough using the Foursquare API to find out!

# Toronto Boroughs: Who's got the best amenities?

Let's first take another look at our dataframe of Toronto-named boroughs:

In [22]:
# drop_duplicates keeps one row per Borough name (here, the last occurrence of each)
Toronto_subset=Toronto_df.drop_duplicates(['Borough'], keep='last').copy()
Toronto_subset


Unnamed: 0,Borough
0,Downtown Toronto
1,East Toronto
2,West Toronto
3,Central Toronto


Since we've got four potential boroughs to choose from, let's go through them and look at what's trending in each one:

In [23]:
# Foursquare API credentials (placeholders: substitute your own and keep real keys out of shared notebooks)

CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20180604'
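
One way to keep credentials out of the notebook is to read them from environment variables. This is only a sketch: the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are my own assumptions, not something Foursquare prescribes:

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client id and secret from a mapping; fail loudly if one is missing."""
    # Hypothetical variable names chosen for this example
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```

In the notebook itself the call would be `load_foursquare_credentials()` with no argument, so the values come from the real environment.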



Let's explore what's trending in West Toronto:

In [24]:
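# Toronto_subset only holds the Borough column, so the coordinates come from postal_codes_df;
# the first West Toronto postal code serves as the reference point
west_toronto = postal_codes_df[postal_codes_df['Borough'] == 'West Toronto']
neighborhood_latitude = west_toronto['Latitude'].iloc[0]
neighborhood_longitude = west_toronto['Longitude'].iloc[0]
LIMIT = 100  # limit results from Foursquare to 100 rows
radius = 8046.72  # this is in meters. 5 miles equals 8046.72 meters

neighborhood_latitude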
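
To show how these values fit together, here is a sketch that converts the radius from miles to meters and assembles a request URL without sending it. I am assuming the legacy Foursquare v2 `venues/explore` endpoint here; the notebook does not say which endpoint it goes on to use. The helper name `build_explore_url` is hypothetical, and the coordinates are simply the M5A row from the tables above:

```python
from urllib.parse import urlencode

METERS_PER_MILE = 1609.344  # exact length of the international mile in meters

def build_explore_url(client_id, client_secret, version, lat, lng, radius_m, limit):
    """Assemble a venues/explore URL (assumed endpoint); no request is made."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": round(radius_m),  # whole meters
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

radius = 5 * METERS_PER_MILE  # 5 miles expressed in meters
url = build_explore_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
    43.65426, -79.360636, radius, 100,
)
print(url)
```

Using `urlencode` keeps the query string correctly escaped, which matters once a value contains characters such as commas or spaces.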