# Segmenting and Clustering Neighborhoods in Toronto

This notebook is part of an assignment for the _Applied Data Science Capstone_ course on [Coursera](https://www.coursera.org).

## 1. Scraping the data

### Import libraries

In [107]:
# for HTTP requests
import requests  

# for HTML scraping
from bs4 import BeautifulSoup 

# for table analysis
import pandas as pd
import numpy as np

# for transforming addresses into latitude/longitude locations
!pip install geocoder
import geocoder



### URL for Wikipedia article

In [108]:
# URL of the Wikipedia page from which to scrape tabular data.
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

### Request & Response

In [109]:
# If the request was successful, the response status is 200.
response = requests.get(wiki_url)
response

<Response [200]>

### Wrangling HTML With BeautifulSoup

In [110]:
# Parse the response content as HTML
soup = BeautifulSoup(response.content, 'html.parser')
#soup

### Viewing HTML content

In [111]:
# Title of Wikipedia page
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

In [112]:
# Find all the tables in the HTML
all_tables=soup.find_all('table')

In [113]:
# Find the table to scrape
right_table = soup.find('table', {"class": 'wikitable sortable'})

In [114]:
# Get the first row of the table, i.e. the header
row0 = right_table.find_all("tr")[0]

# Show the column names
header = [th.text.rstrip() for th in row0.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']

### Scraping the table contents

In [115]:
# Scrape the data and append each column to its own list
c0 = []
c1 = []
c2 = []

# Iterate through the rows of the table
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    # Only extract rows whose borough is assigned
    if len(cells) == 3 and 'Not assigned' not in cells[1].find(text=True):
        c0.append(cells[0].find(text=True).replace('\n', ''))
        c1.append(cells[1].find(text=True).replace('\n', ''))
        c2.append(cells[2].find(text=True).replace('\n', ''))
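
The filtering rule in the loop above can be checked in isolation on a few hand-written rows. The following is a minimal sketch, not the notebook's code: `keep_assigned` and `sample_rows` are hypothetical names, and each row is assumed to be a plain list of cell strings rather than a list of BeautifulSoup tags.

```python
def keep_assigned(rows):
    """Return (postal_code, borough, neighbourhood) tuples for assigned rows.

    Each row is assumed to be a list of cell strings. A header row has no
    data cells and is represented here as an empty list.
    """
    kept = []
    for cells in rows:
        # Same rule as the scraping loop: exactly three cells, borough assigned
        if len(cells) == 3 and 'Not assigned' not in cells[1]:
            kept.append(tuple(cell.replace('\n', '') for cell in cells))
    return kept


# Hypothetical sample input mimicking the text of the table cells
sample_rows = [
    [],  # header row: no data cells
    ['M1A\n', 'Not assigned\n', 'Not assigned\n'],
    ['M3A\n', 'North York\n', 'Parkwoods\n'],
    ['M4A\n', 'North York\n', 'Victoria Village\n'],
]

assigned = keep_assigned(sample_rows)
print(assigned)
```

Separating the rule from the HTML parsing makes it easy to confirm that header rows and unassigned postal codes are dropped before the full page is scraped.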

In [116]:
# Create a dictionary keyed by column name, with 0 as a placeholder value
dict_toronto = {x: 0 for x in header}
dict_toronto

{'Postal Code': 0, 'Borough': 0, 'Neighbourhood': 0}

In [117]:
# Replace each placeholder with the corresponding list of column data.
dict_toronto['Postal Code'] = c0
dict_toronto['Borough'] = c1
dict_toronto['Neighbourhood'] = c2
#dict_toronto
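
The placeholder dictionary and the three assignments above could also be written as a single step with `zip`. A minimal sketch, using short stand-in lists whose values are copied from the table output in this notebook; `names`, `codes`, `boroughs`, `neighbourhoods` and `columns` are hypothetical names chosen so that the notebook's own variables are not overwritten.

```python
# Column names, as returned by the header cell of the notebook
names = ['Postal Code', 'Borough', 'Neighbourhood']

# Stand-in column lists; in the notebook these would be c0, c1 and c2
codes = ['M3A', 'M4A']
boroughs = ['North York', 'North York']
neighbourhoods = ['Parkwoods', 'Victoria Village']

# Pair each column name with its list of values
columns = dict(zip(names, [codes, boroughs, neighbourhoods]))
print(columns['Borough'])
```

The resulting dictionary has the same shape as `dict_toronto` and can be passed to `pd.DataFrame` in the same way.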

### Creating dataframe

In [118]:
# Convert the dictionary to a DataFrame
df_toronto = pd.DataFrame(dict_toronto)

# Size of dataframe
print(f'Shape: {df_toronto.shape}')

# Top 5 records
df_toronto.head(5)

Shape: (103, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## 2. Geographical coordinates

In [119]:
# Function that retrieves the geographical coordinates for a given postal code
def get_coordinates(row):
    # initialize the result to None
    lat_lng_coords = None

    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(row['Postal Code']))
        lat_lng_coords = g.latlng

    # return the (latitude, longitude) pair
    return pd.Series([lat_lng_coords[0], lat_lng_coords[1]])

In [121]:
# Fill coordinates for each row
df_toronto[['Latitude','Longitude']] = df_toronto.apply(get_coordinates, axis=1)

# Last 5 records
df_toronto.tail()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.65319 | -79.51113 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.66659 | -79.38133 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.64869 | -79.38544 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.63278 | -79.48945 |
| 102 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... | 43.62513 | -79.52681 |


## 3. Clustering the data