# Toronto neighborhoods
In this notebook, I will explore and cluster the neighborhoods of Toronto.

In [2]:
#importing necessary libs
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as BS  #get website data
import requests
import unicodedata
import wget
print("Libs imported")

Libs imported


The Wikipedia page below lists the postal codes of Toronto neighborhoods.

The data is collected with the BeautifulSoup and requests libraries and converted into a pandas DataFrame.

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)
soup = BS(r.text, "html.parser")  # name the parser explicitly to avoid a parser-guessing warning
#print(soup.prettify())

table = soup.find('table')
rows = table.find_all('tr')

l = []
for tr in rows:
    td = tr.find_all('td')
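    thisRow = [cell.text for cell in td]
    # Drop non-ASCII characters, then decode back to str (encode() alone returns bytes on Python 3)
    thisRow = [x.encode('ascii','ignore').decode('ascii') for x in thisRow]
    thisRow = [x.replace('\n','') for x in thisRow]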
    l.append(thisRow)
    
data = pd.DataFrame(l,columns=["PostalCode","Borough","Neighborhood"])
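
The row-extraction step can be tried offline on a small inline table. The sketch below is only an illustration: the HTML string and its header names are hypothetical stand-ins for the real page, and the two data rows reuse values that appear in this notebook's outputs. It assumes that header rows contain only `<th>` cells, so they produce an empty list and can be skipped at parse time.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the page's table; the real page has many more rows.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for tr in soup.find("table").find_all("tr"):
    # get_text(strip=True) removes the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # a row with only <th> cells yields an empty list
        records.append(cells)

sample = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample)
```

Skipping empty rows while parsing means no header row reaches the DataFrame, so nothing has to be dropped by position afterwards.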


Remove all rows whose borough is "Not assigned".

Also, set every "Not assigned" neighborhood to the name of its borough.

In [4]:
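# Keep only rows whose borough is assigned
data = data[data.Borough != "Not assigned"]
# Drop the first remaining row (presumably the table's header row, which has no <td> cells)
data = data.iloc[1:]
data.reset_index(inplace=True,drop=True)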

In [5]:
print(data.iloc[6,:])
for i in range(data.shape[0]):
    if(data.iloc[i,2] == "Not assigned"):
        data.iloc[i,2] = data.iloc[i,1]
print(data.iloc[6,:])

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Not assigned
Name: 6, dtype: object
PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
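
The same replacement can be written without an explicit loop. A minimal sketch, assuming a small hand-made frame (`sample` is a hypothetical name) built from two rows shown in this notebook:

```python
import pandas as pd

# Sample rows taken from the outputs in this notebook.
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Parkwoods", "Not assigned"],
    }
)

# Boolean mask of the rows whose neighborhood is missing.
mask = sample["Neighborhood"] == "Not assigned"
# Copy the borough into the neighborhood column for those rows only.
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
print(sample)
```

Label-based assignment with `.loc` and a boolean mask touches only the matching rows and does not rely on column positions.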


Some postal codes appear on multiple rows of the dataset.

For those rows the borough is always the same, but some postal codes have several neighborhoods.

Next, I will concatenate all the different neighborhoods that share a postal code into a single row.

In [6]:
ll = []
postals = data["PostalCode"].unique()
for postal in postals:
    rows = data[data.PostalCode == postal]
    uniques = rows["Neighborhood"].unique()
    val = ', '.join(str(elem) for elem in uniques)
    borough = rows["Borough"].iloc[0]
    ll.append([postal,borough,val])

data = pd.DataFrame(ll,columns=["PostalCode","Borough","Neighborhood"])
print(data.shape)
print(data[data.PostalCode=="M5A"])

(103, 3)
  PostalCode           Borough               Neighborhood
2        M5A  Downtown Toronto  Harbourfront, Regent Park
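
The concatenation can also be expressed as a group-by aggregation. This is a sketch on hypothetical sample rows: the split of M5A into two input rows is an assumption about how the raw data looked, inferred from the combined output above.

```python
import pandas as pd

# Assumed raw rows: one row per (postal code, neighborhood) pair.
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M5A", "M5A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
    }
)

grouped = (
    # sort=False keeps postal codes in order of first appearance
    sample.groupby("PostalCode", sort=False)
    .agg({"Borough": "first", "Neighborhood": lambda s: ", ".join(s.unique())})
    .reset_index()
)
print(grouped)
```

Taking the first borough per group mirrors the assumption that the borough is identical on every row of a postal code.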


Print the shape and head of the resulting dataset, which finishes the first part of the assignment.

In [7]:
print(data.shape)
print(data.head(10))

(103, 3)
  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Harbourfront, Regent Park
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A      Queen's Park                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District


## Start second part

Now I'm going to get the latitude and longitude of each postal code, which will be used to get Foursquare location data.

In [8]:
import geocoder

def getCoords(postalCode, maxTries=10):
    # Retry a bounded number of times: an unbounded loop never ends
    # if the geocoder keeps returning None.
    for _ in range(maxTries):
        g = geocoder.google('{}, Toronto, Ontario'.format(postalCode))
        lat_lng = g.latlng
        if lat_lng is not None:
            return lat_lng[0], lat_lng[1]
    # No coordinates after maxTries attempts
    return None, None

The geocoder was not able to return the coordinates, so I'm going to read them from a geospatial data file instead.

In [11]:
fileName = wget.download("https://cocl.us/Geospatial_data")

In [33]:
geoData = pd.read_csv(fileName)
print(geoData.shape)
print(geoData.head(10))

(103, 3)
  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476
5         M1J  43.744734 -79.239476
6         M1K  43.727929 -79.262029
7         M1L  43.711112 -79.284577
8         M1M  43.716316 -79.239476
9         M1N  43.692657 -79.264848
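
Attaching these coordinates to the neighborhood table amounts to a join on the postal code. A minimal sketch using `DataFrame.merge`, with two hand-made frames (`neighborhoods` and `coords` are hypothetical names) whose values are copied from outputs in this notebook:

```python
import pandas as pd

neighborhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M1B"],
        "Borough": ["North York", "Scarborough"],
        "Neighborhood": ["Parkwoods", "Rouge, Malvern"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M3A"],
        "Latitude": [43.806686, 43.753259],
        "Longitude": [-79.194353, -79.329656],
    }
)

# A left join keeps every neighborhood row, in its original order,
# and matches on the differently named key columns.
merged = neighborhoods.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code"
).drop(columns="Postal Code")
print(merged)
```

With a left join, a postal code missing from the coordinates file would show up as `NaN` rather than raising a `KeyError`, which makes gaps easy to spot.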


In [39]:
geoData.set_index(geoData["Postal Code"],inplace=True)
lat = []
lng = []
for i in range(data.shape[0]):
    postalCode = data.iloc[i,0]
    thisLat = geoData.loc[postalCode,"Latitude"]
    thisLng = geoData.loc[postalCode,"Longitude"]
    lat.append(thisLat)
    lng.append(thisLng)

data["Latitude"] = lat
data["Longitude"] = lng
print(data.head(10))

  PostalCode           Borough                      Neighborhood   Latitude  \
0        M3A        North York                         Parkwoods  43.753259   
1        M4A        North York                  Victoria Village  43.725882   
2        M5A  Downtown Toronto         Harbourfront, Regent Park  43.654260   
3        M6A        North York  Lawrence Heights, Lawrence Manor  43.718518   
4        M7A      Queen's Park                      Queen's Park  43.662301   
5        M9A         Etobicoke                  Islington Avenue  43.667856   
6        M1B       Scarborough                    Rouge, Malvern  43.806686   
7        M3B        North York                   Don Mills North  43.745906   
8        M4B         East York   Woodbine Gardens, Parkview Hill  43.706397   
9        M5B  Downtown Toronto          Ryerson, Garden District  43.657162   

   Longitude  
0 -79.329656  
1 -79.315572  
2 -79.360636  
3 -79.464763  
4 -79.389494  
5 -79.532242  
6 -79.194353  
7 -79.3521