# Toronto neighborhoods
In this notebook, I explore and cluster the neighborhoods of Toronto.

In [249]:
#importing necessary libs
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as BS  #get website data
import requests
import unicodedata
import wget
import json
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
print("Libs imported")

Libs imported


The Wikipedia page used below lists the postal codes of Toronto neighborhoods.

The data is collected with the BeautifulSoup and requests libraries and converted to a pandas DataFrame.

In [250]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)
soup = BS(r.text, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
#print(soup.prettify())

table = soup.find('table')
rows = table.find_all('tr')

l = []
for tr in rows:
    td = tr.find_all('td')
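    thisRow = [cell.text for cell in td]  # do not reuse the loop variable name tr
    # Drop non-ASCII characters, then decode back to text so str.replace works on Python 2 and 3
    thisRow = [x.encode('ascii','ignore').decode('ascii') for x in thisRow]
    thisRow = [x.replace('\n','') for x in thisRow]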
    l.append(thisRow)
    
data = pd.DataFrame(l,columns=["PostalCode","Borough","Neighborhood"])


Removing all rows whose borough is "Not assigned".

Where only the neighborhood is "Not assigned", it is set to the borough name.

In [251]:
data = data[data.Borough != "Not assigned"]
data = data.iloc[1:,]
data.reset_index(inplace=True,drop=True)

In [252]:
print(data.iloc[6,:])
for i in range(data.shape[0]):
    if(data.iloc[i,2] == "Not assigned"):
        data.iloc[i,2] = data.iloc[i,1]
print(data.iloc[6,:])

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Not assigned
Name: 6, dtype: object
PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
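The same fill can be done without an explicit loop. The following is a minimal sketch on a two-row sample frame (the sample data is made up for illustration), using a boolean mask with `.loc`:

```python
import pandas as pd

# Hypothetical two-row sample mirroring the columns used in this notebook.
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Select the rows whose neighborhood is missing and copy the borough over.
missing = sample["Neighborhood"] == "Not assigned"
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]

print(sample)
```

The mask keeps the assignment aligned by index, so rows that already have a neighborhood are left untouched.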


Some postal codes appear on multiple rows of the dataset.

In those rows the borough is always the same, but the neighborhoods differ.

Next, I concatenate all the different neighborhoods that share a postal code into a single row.

In [253]:
ll = []
postals = data["PostalCode"].unique()
for postal in postals:
    rows = data[data.PostalCode == postal]
    uniques = rows["Neighborhood"].unique()
    val = ', '.join(str(elem) for elem in uniques)
    borough = rows["Borough"].iloc[0]
    ll.append([postal,borough,val])

data = pd.DataFrame(ll,columns=["PostalCode","Borough","Neighborhood"])
print(data.shape)
print(data[data.PostalCode=="M5A"])

(103, 3)
  PostalCode           Borough               Neighborhood
2        M5A  Downtown Toronto  Harbourfront, Regent Park
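The loop above could also be written as a single `groupby` aggregation. This is only a sketch on a small made-up sample, not the code that produced the output shown:

```python
import pandas as pd

# Hypothetical sample: M5A appears twice with different neighborhoods.
rows = pd.DataFrame(
    [["M5A", "Downtown Toronto", "Harbourfront"],
     ["M5A", "Downtown Toronto", "Regent Park"],
     ["M3A", "North York", "Parkwoods"]],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep the first borough per postal code and join the unique neighborhoods.
merged = (
    rows.groupby("PostalCode", sort=False)
        .agg({"Borough": "first",
              "Neighborhood": lambda s: ", ".join(s.unique())})
        .reset_index()
)

print(merged)
```

Passing `sort=False` keeps the postal codes in order of first appearance, which matches the behavior of iterating over `unique()`.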


Printing the shape and head of the resulting dataset, which finishes the first part of the assignment.

In [254]:
print(data.shape)
print(data.head(10))

(103, 3)
  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Harbourfront, Regent Park
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A      Queen's Park                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District


## Start second part

Now, I'm going to get the latitude and longitude of each postal code, which will be used to query Foursquare for nearby venues.

In [255]:
import geocoder

def getCoords(postalCode):
    
    lat_lng = None
    
    while(lat_lng is None):  # Fix the geocoder problem
        g = geocoder.google('{}, Toronto, Ontario'.format(postalCode))
        lat_lng = g.latlng
    
    latitude = lat_lng[0]
    longitude = lat_lng[1]
    return latitude,longitude
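
If the service never returns coordinates, the `while` loop in `getCoords` spins forever. One way to avoid that is a bounded retry. The sketch below assumes a hypothetical `lookup` callable standing in for the geocoding call (it takes a query string and returns `[latitude, longitude]` or `None`); the function name and parameters are illustrative only.

```python
import time

def get_coords_with_retry(lookup, postal_code, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical stand-in for a geocoding call.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        lat_lng = lookup(query)
        if lat_lng is not None:
            return lat_lng[0], lat_lng[1]
        time.sleep(delay_seconds)  # brief pause before trying again
    return None  # the caller decides how to handle a missing result


# Fake lookups used only to demonstrate both outcomes.
def always_fails(query):
    return None

def succeeds(query):
    return [43.806686, -79.194353]

failed = get_coords_with_retry(always_fails, 'M1B')
found = get_coords_with_retry(succeeds, 'M1B')
print(failed, found)
```

Returning `None` after a fixed number of attempts lets the notebook fall back to another data source instead of hanging.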

The geocoder was not able to return the coordinates, so I'm going to get them from a geospatial data file.

In [22]:
#fileName = wget.download("https://cocl.us/Geospatial_data")

In [256]:
geoData = pd.read_csv(fileName)
print(geoData.shape)
print(geoData.head(10))

(103, 3)
  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476
5         M1J  43.744734 -79.239476
6         M1K  43.727929 -79.262029
7         M1L  43.711112 -79.284577
8         M1M  43.716316 -79.239476
9         M1N  43.692657 -79.264848


In [257]:
geoData.set_index(geoData["Postal Code"],inplace=True)
lat = []
lng = []
for i in range(data.shape[0]):
    postalCode = data.iloc[i,0]
    thisLat = geoData.loc[postalCode,"Latitude"]
    thisLng = geoData.loc[postalCode,"Longitude"]
    lat.append(thisLat)
    lng.append(thisLng)

data["Latitude"] = lat
data["Longitude"] = lng
print(data.head(10))

  PostalCode           Borough                      Neighborhood   Latitude  \
0        M3A        North York                         Parkwoods  43.753259   
1        M4A        North York                  Victoria Village  43.725882   
2        M5A  Downtown Toronto         Harbourfront, Regent Park  43.654260   
3        M6A        North York  Lawrence Heights, Lawrence Manor  43.718518   
4        M7A      Queen's Park                      Queen's Park  43.662301   
5        M9A         Etobicoke                  Islington Avenue  43.667856   
6        M1B       Scarborough                    Rouge, Malvern  43.806686   
7        M3B        North York                   Don Mills North  43.745906   
8        M4B         East York   Woodbine Gardens, Parkview Hill  43.706397   
9        M5B  Downtown Toronto          Ryerson, Garden District  43.657162   

   Longitude  
0 -79.329656  
1 -79.315572  
2 -79.360636  
3 -79.464763  
4 -79.389494  
5 -79.532242  
6 -79.194353  
7 -79.3521
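
The row-by-row lookup above can also be expressed as a join on the postal code. This is a minimal sketch on made-up two-row frames, assuming the same column names as `data` and `geoData`:

```python
import pandas as pd

# Hypothetical samples with the column names used in this notebook.
codes = pd.DataFrame({
    "PostalCode": ["M3A", "M1B"],
    "Borough": ["North York", "Scarborough"],
})
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M3A"],
    "Latitude": [43.806686, 43.753259],
    "Longitude": [-79.194353, -79.329656],
})

# A left join keeps every row of `codes`, in its original order.
joined = (
    codes.merge(geo, left_on="PostalCode", right_on="Postal Code", how="left")
         .drop(columns="Postal Code")
)

print(joined)
```

With `how="left"`, a postal code missing from the geospatial file would get `NaN` coordinates instead of raising a `KeyError`.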

# Clustering

The Foursquare credentials are stored in a file with the following structure:

client_id\n

client_secret

In [150]:
# Read the two credential lines and close the file when done
with open('foursquareCredentials.txt') as f:
    lines = [line.rstrip('\n') for line in f]

credentials = dict(
    client_id=lines[0],
    client_secret=lines[1]
)

In [153]:
def getData(latitude,longitude,credentials,query):
    url = 'https://api.foursquare.com/v2/venues/explore'

    params = dict(
      client_id=credentials['client_id'],
      client_secret=credentials['client_secret'],
      v='20180323',
      ll = str(latitude) + ',' + str(longitude),
      query=query,
      radius=500
    )
    
    resp = requests.get(url=url,params=params)
    return json.loads(resp.text)

Each postal code area will be characterized by the number of restaurants, gyms, banks, parks, theaters and schools found around its coordinates. This information comes from the Foursquare API.

In [157]:
def getFeatures(latitude,longitude,credentials):
    rest = getData(latitude,longitude,credentials,"Restaurant")["response"]["totalResults"]
    gym = getData(latitude,longitude,credentials,"Gym")["response"]["totalResults"]
    bank = getData(latitude,longitude,credentials,"Bank")["response"]["totalResults"]
    park = getData(latitude,longitude,credentials,"Park")["response"]["totalResults"]
    movie = getData(latitude,longitude,credentials,"Theater")["response"]["totalResults"]
    school = getData(latitude,longitude,credentials,"school")["response"]["totalResults"]
    
    return [rest,gym,bank,park,movie,school]

In [223]:
new = []
for i in range(data.shape[0]):
    features = getFeatures(data.loc[i,"Latitude"],data.loc[i,"Longitude"],credentials)
    print("Done for {} with latitude {} and longitude {}".format(i, data.loc[i,"Latitude"],data.loc[i,"Longitude"]))
    new.append(features)
clusterData = pd.DataFrame(new,columns=["Restaurant","Gym","Bank","Park","Movies","School"])

Done for 0 with latitude 43.7532586 and longitude -79.3296565
Done for 1 with latitude 43.7258823 and longitude -79.3155716
Done for 2 with latitude 43.6542599 and longitude -79.3606359
Done for 3 with latitude 43.718518 and longitude -79.4647633
Done for 4 with latitude 43.6623015 and longitude -79.3894938
Done for 5 with latitude 43.6678556 and longitude -79.5322424
Done for 6 with latitude 43.8066863 and longitude -79.1943534
Done for 7 with latitude 43.7459058 and longitude -79.352188
Done for 8 with latitude 43.7063972 and longitude -79.309937
Done for 9 with latitude 43.6571618 and longitude -79.3789371
Done for 10 with latitude 43.709577 and longitude -79.4450726
Done for 11 with latitude 43.6509432 and longitude -79.5547244
Done for 12 with latitude 43.7845351 and longitude -79.1604971
Done for 13 with latitude 43.7258997 and longitude -79.340923
Done for 14 with latitude 43.6953439 and longitude -79.3183887
Done for 15 with latitude 43.6514939 and longitude -79.3754179
Done fo

KeyError: 'totalResults'

The KeyError above did not occur on the first, successful run. It appeared in later test runs, after the daily request limit had been reached, so the responses no longer contained 'totalResults'.
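
A defensive lookup avoids the crash when the key is absent. The sketch below is an assumption-laden illustration: the helper name `total_results` is hypothetical, and the two payloads are made-up examples of a successful and a rate-limited response, not captured API output.

```python
def total_results(payload):
    """Return totalResults from a parsed response, or None if it is absent."""
    response = payload.get("response") or {}
    return response.get("totalResults")


# Illustrative payloads only; the real response may carry more fields.
ok_payload = {"meta": {"code": 200}, "response": {"totalResults": 12}}
quota_payload = {"meta": {"code": 429}, "response": {}}

ok_count = total_results(ok_payload)
quota_count = total_results(quota_payload)
print(ok_count, quota_count)
```

Returning `None` makes a failed request distinguishable from a genuine count of zero, so those rows can be retried later rather than silently skewing the features.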

In [227]:
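print(clusterData.shape)
print(clusterData.head(10))
# Scale every feature to [0, 1] so no single venue count dominates the distances
scaler = MinMaxScaler()
scaler = scaler.fit(clusterData)
# Keep the scaled features in clusterData; assigning them to `data` would
# overwrite the postal-code table that is needed later for the map
clusterData = pd.DataFrame(scaler.transform(clusterData),columns=clusterData.columns)
print(clusterData.head(10))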

(103, 6)
          0         1         2         3      4         5
0  0.011236  0.000000  0.000000  0.090909  0.000  0.076923
1  0.022472  0.000000  0.024390  0.000000  0.000  0.153846
2  0.140449  0.148148  0.097561  0.545455  0.750  0.230769
3  0.022472  0.148148  0.024390  0.000000  0.000  0.230769
4  0.162921  0.407407  0.195122  0.272727  0.375  0.384615
5  0.005618  0.000000  0.000000  0.000000  0.000  0.000000
6  0.011236  0.000000  0.000000  0.000000  0.000  0.076923
7  0.022472  0.111111  0.000000  0.181818  0.000  0.230769
8  0.039326  0.111111  0.073171  0.090909  0.000  0.153846
9  0.533708  0.925926  0.219512  0.545455  0.875  0.538462
          0         1         2         3      4         5
0  0.011236  0.000000  0.000000  0.090909  0.000  0.076923
1  0.022472  0.000000  0.024390  0.000000  0.000  0.153846
2  0.140449  0.148148  0.097561  0.545455  0.750  0.230769
3  0.022472  0.148148  0.024390  0.000000  0.000  0.230769
4  0.162921  0.407407  0.195122  0.272727  0.37
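
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using (x - min) / (max - min). The following plain-Python sketch applies that formula to one made-up column of venue counts, to show where values such as 0.25 or 0.5 come from:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


counts = [0, 5, 10, 20]  # hypothetical restaurant counts
scaled = min_max_scale(counts)
print(scaled)
```

Because each column is scaled independently, a category with few venues overall (such as theaters) carries the same weight in the clustering as one with many (such as restaurants).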

The data for clustering is ready.

Applying k-means for clustering.

In [293]:
from sklearn.cluster import KMeans

kClusters = 6
km = KMeans(n_clusters=kClusters)
km = km.fit(clusterData)
print(km.labels_)
print(Counter(km.labels_))

[1 1 5 1 4 1 1 3 1 2 1 1 1 1 1 2 3 1 1 3 4 3 1 1 2 3 1 1 3 1 0 3 1 1 1 3 5
 4 1 1 1 3 0 3 1 1 1 3 0 1 1 1 1 1 4 1 1 1 1 3 1 1 1 1 3 1 1 3 3 3 1 1 3 1
 3 3 1 1 1 3 4 3 1 1 4 1 1 1 1 1 3 3 2 1 1 1 4 0 1 2 3 3 1]
Counter({1: 60, 3: 25, 4: 7, 2: 5, 0: 4, 5: 2})
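
`KMeans` is called above without a `random_state`, so the cluster ids (and possibly the assignments) can change between runs. A minimal sketch on six made-up points shows how fixing the seed makes the labels repeatable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of hypothetical points.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
])

# The same seed gives the same initialization, hence the same labels.
first = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points).labels_
second = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points).labels_

print(first, second)
```

The numeric ids themselves are arbitrary; only the grouping is meaningful, so cluster 1 in one run need not correspond to cluster 1 in a run with a different seed.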


In [294]:
finalData = pd.concat([data,pd.DataFrame(km.labels_)],axis=1,ignore_index=True)
finalData.columns = ["PostalCode","Borough","Neighborhood","Latitude","Longitude","ClusterId"]
print(finalData.shape)
print(finalData.head(10))

(103, 6)
  PostalCode           Borough                      Neighborhood   Latitude  \
0        M3A        North York                         Parkwoods  43.753259   
1        M4A        North York                  Victoria Village  43.725882   
2        M5A  Downtown Toronto         Harbourfront, Regent Park  43.654260   
3        M6A        North York  Lawrence Heights, Lawrence Manor  43.718518   
4        M7A      Queen's Park                      Queen's Park  43.662301   
5        M9A         Etobicoke                  Islington Avenue  43.667856   
6        M1B       Scarborough                    Rouge, Malvern  43.806686   
7        M3B        North York                   Don Mills North  43.745906   
8        M4B         East York   Woodbine Gardens, Parkview Hill  43.706397   
9        M5B  Downtown Toronto          Ryerson, Garden District  43.657162   

   Longitude  ClusterId  
0 -79.329656          1  
1 -79.315572          1  
2 -79.360636          5  
3 -79.464763     

In [243]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: done

## Package Plan ##

  environment location: /home/william/anaconda2

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    altair-2.2.2               |        py27_1001         485 KB  conda-forge
    certifi-2018.11.29         |        py27_1000         145 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ca-certificates-2018.11.29 |       ha4d7672_0         143 KB  conda-forge
    openssl-1.0.2p             |       h470a237_1         3.1 MB  conda-forge
    pandas-0.23.4              |   py27hf8a1672_0        25.7 MB  conda-forge
    conda-4.5.11               |        py27_1000         651 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    numpy-1

In [311]:
# Center the map using finalData, which is guaranteed to hold the coordinates
lat = finalData.loc[0,"Latitude"]
lng = finalData.loc[0,"Longitude"]

map_clusters = folium.Map(location=[lat-0.05,lng],zoom_start=11)

x = np.arange(kClusters)
ys = [i+x+(i*x)**2 for i in range(kClusters)]
colors_array = cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

marker_colors = []

for lat,lng,poi,cluster in zip(finalData["Latitude"],finalData["Longitude"],finalData["Borough"],finalData["ClusterId"]):
    label = folium.Popup(str(poi)+' Cluster '+str(cluster),parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=1).add_to(map_clusters)
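
In the cell above, only the length of `ys` is used, so the palette can be built directly from the number of clusters. A shorter sketch, assuming six clusters as in this notebook:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 6  # number of clusters, matching kClusters above

# Sample k evenly spaced colors from the rainbow colormap as hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

print(palette)
```

Indexing `palette[cluster]` then gives one distinct color per cluster id.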


# Final Result
The map shows the postal code areas used, labeled by borough, with color representing the cluster.

In [312]:
map_clusters