# Applied Data Science Capstone

### Segmenting and Clustering Neighborhoods in Toronto

Creating a dataframe from the postal codes wiki page data.

In [4]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

Loading page data
+ Loading the postal codes wiki page by URL and creating a BeautifulSoup object, which lets us process the loaded HTML.
+ Searching for the first table on the page, which is our target postal codes table.

In [19]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(wiki_url).text, "lxml")
neighbours_html = soup.find('table')

Creating functions that determine the borough and neighbourhood names.
- For a borough: if the cell contains a link, returns the link text; otherwise returns the cell text.
- For a neighbourhood: if the cell contains a link, returns the link text; otherwise returns the cell text if it is assigned, and falls back to the borough name if it is not.

In [98]:
not_assigned_mark = 'Not assigned'

def get_borough_name(td):
    link = td.find('a')
    if link is None:  # no link in the cell, use the cell text
        return td.text.strip()
    # get_text() also works when the link wraps nested tags, where .string would be None
    return link.get_text().strip()  # link text is always assigned

def get_neighbour_name(borough, td):
    link = td.find('a')
    if link is not None:  # link text is always assigned
        return link.get_text().strip()
    text = td.text.strip()
    if text != not_assigned_mark:  # cell text is assigned
        return text
    return borough  # unassigned neighbourhood falls back to the borough name

Creating a raw data list from the wiki page table body.

For each row in the table body that contains cells and has an assigned borough name, a new row is added to the raw data list.

In [120]:
raw_data = []
for row in neighbours_html.tbody.find_all('tr'):
    # collect all data cells in the row
    tds = row.find_all('td')
    if len(tds) == 0:  # header rows have no <td> cells, skip them
        continue

    postcode = tds[0].text.strip()  # strip surrounding whitespace and newlines
    borough = get_borough_name(tds[1])
    if borough == not_assigned_mark:  # skip rows without an assigned borough
        continue

    neighbour = get_neighbour_name(borough, tds[2])
    raw_data.append([postcode, borough, neighbour])

Loading raw data into raw dataframe.

In [140]:
columns = ['PostalCode', 'Borough', 'Neighborhood']
raw = pd.DataFrame(raw_data, columns=columns)
raw.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Creating the target dataframe by grouping raw by postal code and borough, and joining the grouped neighbourhoods of each group into one string separated by ', '.
After that, restoring the column names.

In [141]:
df = raw.groupby(['PostalCode', 'Borough']).apply(lambda group: ', '.join(group['Neighborhood'])).reset_index()
df.columns = columns
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
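
To make the grouping step above concrete, here is a minimal plain-Python sketch of the same idea. It is not the notebook's method (which uses pandas `groupby`); the three input rows are taken from the `raw.head()` output, and the names `rows`, `grouped` and `result` are introduced only for this illustration.

```python
# Rows in the same (PostalCode, Borough, Neighborhood) layout as the raw dataframe.
rows = [
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M3A', 'North York', 'Parkwoods'),
]

# Collect the neighbourhoods of every (postal code, borough) pair, keeping row order.
grouped = {}
for postcode, borough, neighbourhood in rows:
    grouped.setdefault((postcode, borough), []).append(neighbourhood)

# Join each group's neighbourhoods with ', ' and sort by key,
# mirroring the sorted group keys that groupby produces by default.
result = [
    (postcode, borough, ', '.join(names))
    for (postcode, borough), names in sorted(grouped.items())
]

for line in result:
    print(line)
```

Each postal code ends up on a single row, which is why the grouped dataframe has fewer rows than the raw one.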


In [142]:
df.shape

(103, 3)

## Part 2: add latitude and longitude coordinates

Installing the geocoder library and creating a function that loads coordinates for a postal code.

In [145]:
!pip install geocoder



Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [154]:
import geocoder

def get_postal_index_coordinates(index):
    lat_lng_coords = None
    # retry until the geocoder returns coordinates; there is no retry limit,
    # so this loops forever if every request fails
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(index))
        lat_lng_coords = g.latlng
    latitude, longitude = lat_lng_coords
    return (latitude, longitude)

Since the geocoder approach never terminated with a result, the coordinates are loaded from a CSV dataset instead.
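
One way to keep a lookup like this from hanging is to cap the number of attempts. The following is only a sketch under assumptions, not what this notebook ran: `get_coordinates_with_retries`, `max_attempts` and `delay_seconds` are hypothetical names, and the geocoding call is passed in as a `geocode` function so the retry logic does not depend on any particular service.

```python
import time

def get_coordinates_with_retries(postal_code, geocode, max_attempts=5, delay_seconds=1.0):
    """Return (latitude, longitude), or None if every attempt fails.

    geocode is any callable that takes a query string and returns a
    [lat, lng] pair or None (hypothetical interface for this sketch).
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for attempt in range(1, max_attempts + 1):
        lat_lng = geocode(query)
        if lat_lng is not None:
            return (lat_lng[0], lat_lng[1])
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # short pause before the next attempt
    return None  # give up instead of looping forever
```

With the geocoder library it could be called as `get_coordinates_with_retries('M5A', lambda q: geocoder.google(q).latlng)`; a `None` result then signals that a fallback, such as the CSV dataset, is needed.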

In [157]:
coordinates = pd.read_csv('http://cocl.us/Geospatial_data')

Adding the coordinates to the previous dataframe by matching on postal code.

In [165]:
coord_df = df.join(coordinates.set_index('Postal Code'), on='PostalCode')
coord_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
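
A left join silently leaves `NaN` coordinates for any postal code missing from the lookup table, so a quick check can be worthwhile. The sketch below uses toy stand-ins for `df` and `coordinates`: the names `left`, `right`, `merged` and `missing` are illustrative, the postal code `M9Z` is invented to show the unmatched case, and `merge` is used here in place of the notebook's `join`.

```python
import pandas as pd

# Toy stand-in for the grouped dataframe; 'M9Z' is a made-up code with no coordinates.
left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
# Toy stand-in for the coordinates table, using the column names of the CSV.
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate='one_to_one' raises if a postal code is duplicated on either side.
merged = left.merge(
    right, how='left', left_on='PostalCode', right_on='Postal Code',
    validate='one_to_one',
).drop(columns='Postal Code')

# Postal codes that found no match in the coordinates table.
missing = merged.loc[merged['Latitude'].isna(), 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postal code received coordinates.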


In [167]:
coord_df.shape

(103, 5)