# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada #

## Introduction

The aim of this project is to write code that identifies neighborhood segments in Toronto and clusters them according to the venues available in the vicinity of each neighborhood.

## Table of Contents

[> Part 1 - Data Scraping](https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/d9d1a6c0-8106-49be-b516-04e5dc7b0a28?projectid=277684b3-e1d9-43e7-83dc-aa7c533cbc8b&context=wdp)

[> Part 2 - Geocoding](https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/d9d1a6c0-8106-49be-b516-04e5dc7b0a28?projectid=277684b3-e1d9-43e7-83dc-aa7c533cbc8b&context=wdp)
___

# Part 1 - Data Scraping

**Input Data [Wikipedia: List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)**

In [58]:
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
import requests

**Download the postal code data and parse the web page**

In [60]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
xml_page_data = BeautifulSoup(source, 'lxml')

In [63]:
class webpage_scrapp:
    """Scrape every sortable wikitable on a page into a pandas DataFrame."""

    def parse_url(self, url):
        # Fetch the page and parse each matching <table> into a DataFrame.
        response = requests.get(url)
        xml_page_data = BeautifulSoup(response.text, 'lxml')
        return [self.parse_html_table(table)
                for table in xml_page_data.find_all('table', class_="wikitable sortable")]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        # First pass: count the data rows and columns, and collect the header titles.
        for row in table.find_all('tr'):
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    n_columns = len(td_tags)

            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    # Header text is kept verbatim, including any trailing '\n'.
                    column_names.append(th.get_text())

        # Safeguard: the header must have one title per data column.
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns, index=range(0, n_rows))

        # Second pass: copy the text of each cell into the pre-sized DataFrame.
        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1

        # Convert a column to float when all its values are numeric; otherwise keep text.
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df
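
As a hedged illustration of the same idea on a smaller scale, the sketch below extracts one table from an inline HTML snippet. The snippet `SAMPLE_HTML` and the helper `table_to_frame` are hypothetical names that do not appear in the notebook, and the built-in `html.parser` backend is assumed instead of `lxml` so that the sketch needs one fewer dependency. Unlike the class above, it strips whitespace from every cell and header.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_frame(table):
    """Turn one <table> element into a DataFrame of stripped cell text."""
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find_all("td")  # skip the header row, which has no <td> cells
    ]
    return pd.DataFrame(rows, columns=header or None)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_frame = table_to_frame(soup.find("table", class_="wikitable"))
print(sample_frame)
```

Building the frame from a list of rows avoids pre-sizing an empty DataFrame and filling it cell by cell, at the cost of assuming every data row has the same number of cells as the header.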

In [65]:
table = webpage_scrapp().parse_url('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0] 
table.head(12)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
| 5 | M5A | Downtown Toronto | Regent Park\n |
| 6 | M6A | North York | Lawrence Heights\n |
| 7 | M6A | North York | Lawrence Manor\n |
| 8 | M7A | Queen's Park | Not assigned\n |
| 9 | M8A | Not assigned | Not assigned\n |


**Ignore rows whose borough is Not assigned**


In [69]:
table = table[table.Borough != 'Not assigned']
table.head(12)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
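
The filtering step can be tried on a few rows in isolation. The sketch below uses a hypothetical three-row frame named `sample`. Its second half, which fills an unknown neighbourhood with the borough name, is an assumption added for illustration and is not something this notebook does. It is one possible way to treat rows such as M7A, where the borough is known but the neighbourhood is not.

```python
import pandas as pd

# Hypothetical sample mirroring three rows of the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose borough is known; .copy() gives an independent frame
# so the assignment below does not touch the original sample.
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Assumption (not in the notebook): reuse the borough name when the
# neighbourhood is still "Not assigned".
unknown = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[unknown, "Neighbourhood"] = assigned.loc[unknown, "Borough"]

print(assigned)
```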


**Replace \n in the table data with a space**

In [70]:
table = table.replace('\n',' ', regex=True)
table.head(12)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
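
Replacing `\n` with a space leaves a trailing space in each value and does not touch the column names. The sketch below compares that approach with stripping whitespace from both cells and headers. The two-row frame `sample` is hypothetical, and the stripping variant is offered as a possible alternative, not as the notebook's method.

```python
import pandas as pd

# Hypothetical sample: one header and every neighbourhood value end in "\n".
sample = pd.DataFrame({
    "Borough": ["North York", "Scarborough"],
    "Neighbourhood\n": ["Parkwoods\n", "Rouge\n"],
})

# Approach used in the notebook: swap each newline for a space (values only).
spaced = sample.replace("\n", " ", regex=True)

# Alternative: strip surrounding whitespace from every text cell and header.
stripped = sample.apply(lambda col: col.str.strip())
stripped.columns = stripped.columns.str.strip()

print(repr(spaced.iloc[0, 1]))    # value keeps a trailing space
print(repr(stripped.iloc[0, 1]))  # value has no trailing whitespace
print(list(stripped.columns))
```

Note that code further down the notebook refers to the column as `'Neighbourhood\n'`, so adopting the stripped header would require updating that reference as well.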


**Combine neighborhoods belonging to the same postcode, then shuffle the rows**

In [71]:
neighborhood_frame = table.groupby(['Postcode','Borough'])['Neighbourhood\n'].apply(lambda x: ", ".join(x.astype(str))).reset_index()
neighborhood_frame = neighborhood_frame.sample(frac=1).reset_index(drop=True)
neighborhood_frame.head(12)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M2M | North York | Newtonbrook , Willowdale |
| 1 | M6H | West Toronto | Dovercourt Village , Dufferin |
| 2 | M8X | Etobicoke | The Kingsway , Montgomery Road , Old Mill North |
| 3 | M2J | North York | Fairview , Henry Farm , Oriole |
| 4 | M5V | Downtown Toronto | CN Tower , Bathurst Quay , Island airport , Ha... |
| 5 | M6J | West Toronto | Little Portugal , Trinity |
| 6 | M5B | Downtown Toronto | Ryerson , Garden District |
| 7 | M4T | Central Toronto | Moore Park , Summerhill East |
| 8 | M6C | York | Humewood-Cedarvale |
| 9 | M5L | Downtown Toronto | Commerce Court , Victoria Hotel |
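
The grouping step can be checked on a handful of rows. The sketch below assumes a hypothetical frame `sample` whose column is named `Neighbourhood` with no trailing newline, and it leaves out the shuffle so the result is deterministic.

```python
import pandas as pd

# Hypothetical sample: M5A appears twice, M3A once.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ", ".
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

Passing `", ".join` directly to `agg` works here because every value is already a string. The `astype(str)` call in the notebook's lambda guards against non-string values.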


**Using the `shape` attribute, print the dataframe dimensions (rows, columns)**

In [73]:
print(neighborhood_frame.shape)

(103, 3)


# Part 2 - Geocoding

**Input Geospatial Data**

In [75]:
url_geo="http://cocl.us/Geospatial_data"
geo_info=pd.read_csv(url_geo)
geo_info.head(12)

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


**Populate Latitude and Longitude based on Toronto City Postal Codes**

In [79]:
print(list(neighborhood_frame))
print(list(geo_info))

full_table = neighborhood_frame.set_index('Postcode').join(geo_info.set_index('Postal Code'))
full_table = full_table.sample(frac=1).reset_index(drop=True)
full_table.head(12)

['Postcode', 'Borough', 'Neighbourhood\n']
['Postal Code', 'Latitude', 'Longitude']


|   | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Scarborough | Maryvale , Wexford | 43.750072 | -79.295849 |
| 1 | Etobicoke | Kingsway Park South West , Mimico NW , The Que... | 43.628841 | -79.520999 |
| 2 | Mississauga | Canada Post Gateway Processing Centre | 43.636966 | -79.615819 |
| 3 | Central Toronto | Forest Hill North , Forest Hill West | 43.696948 | -79.411307 |
| 4 | York | Del Ray , Keelesdale , Mount Dennis , Silverth... | 43.691116 | -79.476013 |
| 5 | Downtown Toronto | Harbourfront , Regent Park | 43.65426 | -79.360636 |
| 6 | Etobicoke | Bloordale Gardens , Eringate , Markland Wood ,... | 43.643515 | -79.577201 |
| 7 | North York | CFB Toronto , Downsview East | 43.737473 | -79.464763 |
| 8 | Scarborough | Clairlea , Golden Mile , Oakridge | 43.711112 | -79.284577 |
| 9 | Central Toronto | Roselawn | 43.711695 | -79.416936 |
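
The joined table no longer carries the postcode, because `reset_index(drop=True)` discards the index in which it was stored. A possible alternative, sketched here under assumptions, uses `merge` so the postcode stays as an ordinary column. The frames `neighbourhoods` and `coords` are hypothetical stand-ins for `neighborhood_frame` and `geo_info`, with a column named `Neighbourhood` without the trailing newline.

```python
import pandas as pd

# Hypothetical stand-in for neighborhood_frame (two postcodes only).
neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M5A"],
    "Borough": ["Scarborough", "Downtown Toronto"],
    "Neighbourhood": ["Rouge, Malvern", "Harbourfront, Regent Park"],
})

# Hypothetical stand-in for geo_info, with one extra postcode.
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M5A"],
    "Latitude": [43.806686, 43.784535, 43.65426],
    "Longitude": [-79.194353, -79.160497, -79.360636],
})

# A left merge keeps every neighbourhood row; validate raises an error
# if a postcode is duplicated on either side.
merged = neighbourhoods.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

With `how="left"`, a postcode that has no coordinates would show up as `NaN` in `Latitude` and `Longitude` instead of silently disappearing, which makes gaps easy to spot.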
