# Toronto Neighborhood Segmenting and Clustering
***
In this notebook I explore the neighborhoods and boroughs of Toronto and group similar neighborhoods into clusters based on the venues found in each area.

### Scraping data from a table into a DataFrame

First, I need to install the table parser, which will allow me to easily get the table from a website. I also import other libraries that will help read the table and create the DataFrame.

In [3]:
! pip install html-table-parser-python3

import urllib.request

from html_table_parser import HTMLTableParser
import pandas as pd
import numpy as np

Collecting html-table-parser-python3
  Downloading html_table_parser_python3-0.1.5-py3-none-any.whl (3.5 kB)
Installing collected packages: html-table-parser-python3
Successfully installed html-table-parser-python3-0.1.5


Here I define a function that takes the URL of a website and returns the contents of that page as bytes.

In [4]:
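def url_get_contents(url):
    # Build the request, then read the raw response body (bytes)
    req = urllib.request.Request(url=url)
    # The with-block closes the connection once the body has been read
    with urllib.request.urlopen(req) as f:
        return f.read()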
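
Some servers reject requests that lack a descriptive User-Agent, and a request without a timeout can hang, so a slightly more defensive variant may be useful. The sketch below is not the function used in this notebook: the name `fetch_page`, the User-Agent string, and the 10-second timeout are all assumptions chosen for illustration.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # The User-Agent value is a placeholder assumption, not a required string.
    req = urllib.request.Request(
        url=url,
        headers={"User-Agent": "toronto-notebook-example/0.1"},
    )
    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read()
```

The return value is still bytes, so it would be decoded with `.decode('utf-8')` in the same way as the output of `url_get_contents`.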

I then use the function to get the contents of the webpage that has the table I need. The parsed table arrives with its header as an ordinary first row, so I set the column names from that row and then drop it before starting to clean the data.

In [36]:
xhtml = url_get_contents('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').decode('utf-8')

parser = HTMLTableParser()

parser.feed(xhtml)

toronto_df = pd.DataFrame(parser.tables[0])

toronto_df.columns = toronto_df.iloc[0]
toronto_df.drop(0, inplace=True)

toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
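
The header fix can also be applied when the DataFrame is built. The sketch below assumes that `parser.tables[0]` is a list of rows with the header as its first row, which is what the output above suggests. The `rows` list is a hand-written stand-in for it, copied from the first rows shown above.

```python
import pandas as pd

# Stand-in for parser.tables[0]: a list of rows whose first row is the header.
rows = [
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M2A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]

# Use the first row as column names and the remaining rows as data.
table_df = pd.DataFrame(rows[1:], columns=rows[0])
print(table_df.head())
```

Built this way, the index starts at 0 rather than 1, and no row has to be dropped afterwards.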


This next line removes every row where the Borough is 'Not assigned', since the borough will be used for the clustering. The index is reset so that it remains sequential, starting at 0.

In [39]:
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)

toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
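
To see what the filter and the index reset each do, here is a minimal sketch on a three-row frame. The `sample` frame is a stand-in built from rows shown above, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Small stand-in for toronto_df, using rows from the table above.
sample = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Boolean mask: True for rows whose Borough has been assigned.
has_borough = sample["Borough"] != "Not assigned"

# Filtering alone keeps the original labels (1 and 2 here).
filtered = sample[has_borough]

# reset_index(drop=True) renumbers from 0 and discards the old labels.
cleaned = filtered.reset_index(drop=True)
print(cleaned)
```

Without `drop=True`, the old labels would be kept as an extra `index` column instead of being discarded.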


Then any Neighbourhood that is 'Not assigned' is changed to the Borough name, so that every remaining row has a usable neighbourhood label.

In [40]:
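# Assigning to `row` inside iterrows() is not guaranteed to update the DataFrame,
# so select the affected rows with a boolean mask and assign through .loc.
unassigned = toronto_df['Neighbourhood'] == 'Not assigned'
toronto_df.loc[unassigned, 'Neighbourhood'] = toronto_df.loc[unassigned, 'Borough']
toronto_df[1:30]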

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
10,M6B,North York,Glencairn
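
None of the rows shown above has an unassigned neighbourhood, so the replacement step can be checked on a small frame where it matters. The sketch below uses `Series.mask`, which replaces values wherever a condition is true. The frame is made up for illustration: the `X0X` row is a hypothetical entry and does not come from the scraped table.

```python
import pandas as pd

# Made-up frame: the second row is hypothetical and exists only to
# exercise the 'Not assigned' case.
sample = pd.DataFrame(
    {
        "Postal Code": ["M3A", "X0X"],
        "Borough": ["North York", "Etobicoke"],
        "Neighbourhood": ["Parkwoods", "Not assigned"],
    }
)

# Where the neighbourhood is unassigned, take the value from Borough instead.
unassigned = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(unassigned, sample["Borough"])
print(sample)
```

Because the whole column is reassigned in one step, the result does not depend on whether pandas returns a view or a copy of each row.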
