# Segmenting and Clustering Neighbourhoods in the city of Toronto, Canada
#### Applied Data Science Capstone - Week 3 assignment

## Part 1 - Obtaining the list of neighbourhoods
In this part we use the BeautifulSoup package together with the HTML-parsing capabilities of pandas to load the list of Toronto-area postal codes from Wikipedia into a dataframe.

In [6]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

In [7]:
#Finding the table in the Wikipedia page using BeautifulSoup
wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
#naming the parser explicitly avoids the bs4 warning about an unspecified parser
wiki_soup = BeautifulSoup(wiki_page.text, 'html.parser')
html_table = wiki_soup.find(class_='wikitable sortable').prettify()

#Converting the html table to a pandas dataframe
df_list = pd.read_html(html_table)
Toronto_hoods_df = df_list[0]
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
#dropping unassigned Boroughs
Toronto_hoods_df.drop(Toronto_hoods_df[Toronto_hoods_df['Borough']=='Not assigned'].index, inplace=True)
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
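
The filtering step above can also be written as a boolean mask. The following is a minimal sketch on a made-up `toy_df` (not the scraped table), assuming the same column names. Taking an explicit `.copy()` of the filtered rows gives an independent dataframe, which is one common way to avoid pandas' `SettingWithCopyWarning` when columns are added to it afterwards.

```python
import pandas as pd

# Toy rows mirroring the first lines of the scraped table (illustrative only)
toy_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough; .copy() makes the result an
# independent dataframe, and reset_index renumbers the rows from 0
assigned_df = (
    toy_df[toy_df['Borough'] != 'Not assigned']
    .copy()
    .reset_index(drop=True)
)
print(assigned_df)
```

Unlike `drop(..., inplace=True)`, this leaves the original dataframe untouched, so the unfiltered data stays available for inspection.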


In [9]:
Toronto_hoods_df[Toronto_hoods_df['Neighbourhood']=='Not assigned'].index #No unassigned Neighbourhoods

Int64Index([], dtype='int64')

In [10]:
[g for _, g in Toronto_hoods_df.groupby('Postal Code') if len(g) > 1] #No duplicate postal codes

[]
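
The same duplicate check can be expressed with pandas' built-in `duplicated` method instead of a `groupby` comprehension. This is a small sketch on a made-up `toy_df` that deliberately contains one repeated postal code; the real table had none.

```python
import pandas as pd

# Toy data with 'M4A' repeated on purpose (illustrative only)
toy_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M4A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'North York'],
})

# keep=False flags every member of a duplicated group, not just the repeats
duplicate_rows = toy_df[toy_df['Postal Code'].duplicated(keep=False)]
print(duplicate_rows)

# A single yes/no answer for the whole column
has_duplicates = not toy_df['Postal Code'].is_unique
```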

In [11]:
#Clean up the index
Toronto_hoods_df.reset_index(drop=True, inplace=True)
Toronto_hoods_df.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [12]:
Toronto_hoods_df.shape

(103, 3)

We now have a clean dataframe with one row per assigned postal code, each listing its corresponding neighbourhoods.

## Part 2 - Obtaining the coordinates of the neighbourhoods

In [23]:
from geopy.geocoders import ArcGIS #gets latitude and longitude from an address
from geopy.extra.rate_limiter import RateLimiter #helper class for batch processing

In [27]:
#initialise the geolocator and the rate limiter
geolocator = ArcGIS(user_agent='toronto_explorer')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=5, max_retries=5)
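
To illustrate the idea behind wrapping the geocoder, here is a simplified stand-in written in plain Python. It is not geopy's actual implementation: `with_delay_and_retries` and `flaky_lookup` are hypothetical names, and the sketch only mimics the general behaviour of pausing before each call, retrying on errors, and returning `None` once the retries are used up.

```python
import time


def with_delay_and_retries(func, min_delay_seconds=0.0, max_retries=2):
    """Hypothetical helper: pause before each call and retry on any exception.

    Returns None when every attempt fails, so the caller can spot
    and retry the failures later.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            time.sleep(min_delay_seconds)
            try:
                return func(*args, **kwargs)
            except Exception:
                # Swallow the error and try again until attempts run out
                continue
        return None
    return wrapper


# A fake geocoder that fails on its first two calls, then succeeds
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] <= 2:
        raise TimeoutError('simulated network timeout')
    return (query, (43.0, -79.0))


safe_lookup = with_delay_and_retries(flaky_lookup, min_delay_seconds=0.0, max_retries=2)
result = safe_lookup('M9A, Toronto, Ontario, Canada')
print(result)
```

Because failures come back as `None` rather than raising, a batch `apply` over many postal codes can finish even when some requests time out.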

In [28]:
#get the location objects from the geolocator
Toronto_hoods_df['location'] = Toronto_hoods_df['Postal Code'].apply(
    lambda postal_code: geocode('{}, Toronto, Ontario, Canada'.format(postal_code)))

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M9A, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-pac

RateLimiter caught an error, retrying (1/2 tries). Called with (*('M3B, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-pac

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M6M, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-pac

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M6N, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

RateLimiter caught an error, retrying (1/2 tries). Called with (*('M4R, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

RateLimiter caught an error, retrying (1/2 tries). Called with (*('M5R, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M6R, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M7R, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\Christian\Anaconda3\lib\http\client.py", line 1344, in getresponse
    response.begin()
  File "C:\Users\Christian\Anaconda3\lib\http\client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "C:\Users\Christian\Anaconda3\lib\http\client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Users\Christian\Anaconda3\lib\socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\Christian\Anaconda3\li

RateLimiter swallowed an error after 2 retries. Called with (*('M7R, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-packag

RateLimiter caught an error, retrying (1/2 tries). Called with (*('M9V, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-pac

RateLimiter caught an error, retrying (0/2 tries). Called with (*('M1W, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

In [41]:
#check for rows where the geolocation failed
failed_rows = Toronto_hoods_df[Toronto_hoods_df['location'].isnull()].index

In [48]:
#retry only the failed rows
Toronto_hoods_df.loc[failed_rows, 'location'] = Toronto_hoods_df.loc[failed_rows, 'Postal Code'].apply(
    lambda postal_code: geocode('{}, Toronto, Ontario, Canada'.format(postal_code)))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [49]:
#make sure all rows have a location object
Toronto_hoods_df[Toronto_hoods_df['location'].isnull()] #should be empty; if not, run the two previous cells again

Unnamed: 0,Postal Code,Borough,Neighbourhood,location
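
Re-running the two cells by hand can be replaced by a bounded loop. The sketch below uses a made-up `toy_geocode` function and `toy_df` rather than the real geocoder, and the cap `max_passes` is an assumed value chosen for illustration.

```python
import pandas as pd

# Hypothetical geocoder that returns None (a failure) the first time it sees M6M
seen = set()


def toy_geocode(query):
    if query.startswith('M6M') and query not in seen:
        seen.add(query)
        return None
    return 'located: ' + query


toy_df = pd.DataFrame({'Postal Code': ['M3A', 'M6M', 'M4A']})
toy_df['location'] = toy_df['Postal Code'].apply(
    lambda code: toy_geocode('{}, Toronto, Ontario, Canada'.format(code)))

# Retry only the missing rows, with a cap so the loop always terminates
max_passes = 3
for _ in range(max_passes):
    failed_rows = toy_df.index[toy_df['location'].isnull()]
    if len(failed_rows) == 0:
        break
    toy_df.loc[failed_rows, 'location'] = toy_df.loc[failed_rows, 'Postal Code'].apply(
        lambda code: toy_geocode('{}, Toronto, Ontario, Canada'.format(code)))

remaining_failures = int(toy_df['location'].isnull().sum())
print(remaining_failures)
```

If `remaining_failures` is still above zero after the last pass, the affected postal codes need a closer look rather than more retries.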


In [60]:
#extract latitude and longitude from location object
Toronto_hoods_df['Latitude'] = Toronto_hoods_df['location'].apply(
    lambda location: location.latitude)
Toronto_hoods_df['Longitude'] = Toronto_hoods_df['location'].apply(
    lambda location: location.longitude)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [61]:
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,location,Latitude,Longitude
0,M3A,North York,Parkwoods,"(M3A, (43.75245000000007, -79.32990999999998))",43.75245,-79.32991
1,M4A,North York,Victoria Village,"(M4A, (43.73057000000006, -79.31305999999995))",43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","(M5A, (43.65512000000007, -79.36263999999994))",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights","(M6A, (43.72327000000007, -79.45041999999995))",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","(M7A, (43.66253000000006, -79.39187999999996))",43.66253,-79.39188
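
The extraction above raises an `AttributeError` if any `location` is still `None`. A more defensive variant is sketched below, assuming only that each location object exposes `latitude` and `longitude` attributes; `ToyLocation` is a hypothetical stand-in for the geocoder's result type, and the third row is deliberately left without a location.

```python
import pandas as pd


class ToyLocation:
    """Hypothetical stand-in exposing the two attributes used in the notebook."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


toy_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M7R'],
    'location': [ToyLocation(43.75245, -79.32991),
                 ToyLocation(43.73057, -79.31306),
                 None],
})

# getattr with a default yields a missing value instead of raising
# AttributeError when the location is None
toy_df['Latitude'] = toy_df['location'].apply(
    lambda loc: getattr(loc, 'latitude', None))
toy_df['Longitude'] = toy_df['location'].apply(
    lambda loc: getattr(loc, 'longitude', None))
print(toy_df[['Postal Code', 'Latitude', 'Longitude']])
```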
