# Segmenting and Clustering Neighborhoods in Toronto 1
## The purpose of this notebook is to obtain a table of Toronto neighborhoods by scraping data from Wikipedia, so that the resulting table can be merged with latitude and longitude information.

#### We start by importing the necessary packages: pandas, NumPy, re (regular expressions), and urllib

In [1]:
import pandas as pd
import numpy as np
import re
from urllib.request import urlopen

#### We install Beautiful Soup, a Python library used for scraping web pages

In [2]:
!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.0

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



#### We also install lxml, the parser that Beautiful Soup uses here to parse the page's HTML

In [3]:
conda install -c anaconda lxml --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.0

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


#### We specify the Wikipedia URL containing the Toronto neighbourhoods dataset and pass it to urlopen() to get the HTML of the page. The next step is to create a Beautiful Soup object (soup) from that HTML.

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
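To see what a Beautiful Soup object offers without making a network request, here is a minimal sketch on a hand-written HTML fragment. The fragment, the name `sample_soup`, and the use of the standard-library `'html.parser'` backend are assumptions for illustration; the notebook itself parses the live page with `'lxml'`.

```python
from bs4 import BeautifulSoup

# Hand-written (hypothetical) fragment shaped like two rows of the postal-code table
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook uses 'lxml' instead
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Header cells live in <th> tags, data cells in <td> tags
headers = [th.get_text(strip=True) for th in sample_soup.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_soup.find_all("tr")
]

print(headers)
print(rows)
```

The header row contains no `<td>` cells, so it yields an empty list, which is the same reason an empty row can appear at the top of a table scraped this way.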


#### The code below creates an empty list, extracts the text between the HTML tags of each table row, and appends it to that list (list_rows)

In [5]:
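rows = soup.find_all('tr')  # extract all table rows

tag_pattern = re.compile('<.*?>')  # matches a single HTML tag (non-greedy); compiled once, outside the loop

list_rows = []
for row in rows:
    cells = row.find_all('td')  # data cells of this row
    str_cells = str(cells)  # string form of the list of cells
    clean2 = re.sub(tag_pattern, '', str_cells)  # remove the tags, keep the text
    list_rows.append(clean2)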

df = pd.DataFrame(list_rows)
df.head(10)

Unnamed: 0,0
0,[]
1,"[M1A, Not assigned, Not assigned\n]"
2,"[M2A, Not assigned, Not assigned\n]"
3,"[M3A, North York, Parkwoods\n]"
4,"[M4A, North York, Victoria Village\n]"
5,"[M5A, Downtown Toronto, Harbourfront\n]"
6,"[M6A, North York, Lawrence Heights\n]"
7,"[M6A, North York, Lawrence Manor\n]"
8,"[M7A, Queen's Park, Not assigned\n]"
9,"[M8A, Not assigned, Not assigned\n]"
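The tag-stripping step can be tried in isolation. The sketch below applies the same `'<.*?>'` pattern to a hand-written string (the string is an assumption made up for illustration) and contrasts it with a greedy pattern to show why the `?` matters.

```python
import re

# Hypothetical string form of one table row, written by hand
sample_row = "[<td>M3A</td>, <td>North York</td>, <td>Parkwoods\n</td>]"

tag_pattern = re.compile(r"<.*?>")  # non-greedy: stops at the first '>'
stripped = tag_pattern.sub("", sample_row)
print(repr(stripped))

# A greedy pattern runs from the first '<' to the last '>' and removes the text too
greedy = re.sub(r"<.*>", "", "<td>M3A</td>")
print(repr(greedy))
```

The brackets, commas, and trailing newline survive the substitution, which is why the later cells split on `,` and strip `[`, `]`, and `\n`.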


#### Clean the data

In [13]:
df1 = df[0].str.split(',', expand=True)  # split column 0 on ',' into separate columns
df1[0] = df1[0].str.strip('[')  # strip out the '['
col_labels = soup.find_all('th')  # extract table headers
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()  # header text without the tags
all_header.append(cleantext2)
df2 = pd.DataFrame(all_header)
df3 = df2[0].str.split(',', expand=True)  # split the header string on ',' as well
frames = [df3, df1]  # collect the header data and the body

df_4 = pd.concat(frames)  # concatenate header and body to create one complete table
df_4.drop(df_4.iloc[:, 3:], inplace=True, axis=1)  # drop the additional columns after the third, which hold NaN values (from white spaces)

df_4.head(20)

Unnamed: 0,0,1,2
0,[Postcode,Borough,Neighborhood\n
0,],,
1,M1A,Not assigned,Not assigned\n]
2,M2A,Not assigned,Not assigned\n]
3,M3A,North York,Parkwoods\n]
4,M4A,North York,Victoria Village\n]
5,M5A,Downtown Toronto,Harbourfront\n]
6,M6A,North York,Lawrence Heights\n]
7,M6A,North York,Lawrence Manor\n]
8,M7A,Queen's Park,Not assigned\n]


In [14]:
df_4[2] = df_4[2].str.strip('\n]')  # strip newline (\n) and ']' characters from both ends
df_4.head(10)

Unnamed: 0,0,1,2
0,[Postcode,Borough,Neighborhood
0,],,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
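It may help to recall that `strip` treats its argument as a set of characters rather than as a substring. pandas' `Series.str.strip` applies Python's `str.strip` to each element, so a sketch with plain strings (the sample values are assumptions) shows the behaviour:

```python
# Sample values written by hand for illustration
cell = "Parkwoods\n]"

# Any mix of the listed characters is removed from both ends
cleaned = cell.strip("\n]")
print(repr(cleaned))

# The order of the characters in the argument does not matter
same = cell.strip("]\n")

# Characters in the middle of the string are left alone
inner = "Don\nMills]".strip("\n]")
print(repr(inner))
```

Because only the ends are touched, a neighbourhood name that happens to contain one of these characters internally would be preserved.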


#### Drop duplicated header and NaN rows

In [8]:
df5 = df_4.rename(columns=df_4.iloc[0])
df6 = df5.dropna(axis=0, how='any')
df7 = df6.drop(df6.index[0])
# list column names to examine white spaces and special characters
list(df7.columns.values)

['[Postcode', ' Borough', ' Neighborhood']

#### Change column names to remove spaces and special characters. Use a pattern match (str.contains, which treats the pattern as a regex by default) to remove rows whose Borough is "Not assigned"

In [15]:
df7.columns = ['Postcode', 'Borough', 'Neighbourhood'] #Correct column headings
df7.drop(df7[df7['Borough'].str.contains(pat = 'Not assigned')].index, inplace = True) #drop "Not assigned" boroughs
df7['Neighbourhood'] = df7['Neighbourhood'].str.strip() #Strip out extra white spaces from Neighbourhood column
df7.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Downtown Toronto,Queen's Park
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
14,M3B,North York,Don Mills North
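An equivalent way to express the row filter is a boolean mask, sketched below on a small hand-made frame. The frame `toy` and its values are assumptions for illustration; `regex=False` is used here because the pattern is a literal string.

```python
import pandas as pd

# Hand-made sample with the leading spaces left over from splitting on ','
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": [" Not assigned", " North York", " Queen's Park"],
    "Neighbourhood": [" Not assigned", " Parkwoods", " Not assigned"],
})

# True for rows whose Borough contains the literal text 'Not assigned'
mask = toy["Borough"].str.contains("Not assigned", regex=False)

# Keep the other rows; .copy() avoids modifying a view of toy
kept = toy[~mask].copy()
kept["Neighbourhood"] = kept["Neighbourhood"].str.strip()
print(kept)
```

Only the row with an unassigned borough is removed; a row with an assigned borough but an unassigned neighbourhood is kept for the next step.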


#### The following block of code finds rows whose Neighbourhood is "Not assigned" and replaces that value with the name of the Borough

In [10]:
Neighborhood = []
N = list(df7['Neighbourhood'].str.strip())
B = list(df7['Borough'].str.strip())
for n, m in zip(N, B):
    if n != 'Not assigned':
        Neighborhood.append(n)
    else:
        Neighborhood.append(m)
        
df7['Neighbourhood'] = Neighborhood
df7.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Downtown Toronto,Queen's Park
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
14,M3B,North York,Don Mills North
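The same replacement can also be written without an explicit loop. This is a minimal sketch using `Series.where` on a hand-made frame (the frame and its values are assumptions), not the method used in the cell above.

```python
import pandas as pd

# Hand-made sample for illustration
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# where() keeps values for which the condition is True and
# takes the value from the second argument (the borough) otherwise
toy["Neighbourhood"] = toy["Neighbourhood"].where(
    toy["Neighbourhood"] != "Not assigned", toy["Borough"]
)
print(toy)
```

Both versions give the same result; the vectorized form avoids building intermediate Python lists.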


#### Lastly, we group together Neighbourhoods that share the same Postcode and Borough

In [11]:
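# Note: str.strip() returns a new Series, so these two calls leave df7 unchanged
# unless the result is assigned back to the column
df7['Postcode'].str.strip()
df7['Borough'].str.strip()
# Join the neighbourhood names of each (Postcode, Borough) pair into one comma-separated string
df_8 = df7.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
# Keep positional rows 2 to 104 (103 rows, starting at M1B)
df9 = df_8.iloc[2:105]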
df9.head(25)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M1B,Scarborough,"Rouge, Malvern"
3,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
4,M1E,Scarborough,"Guildwood, Morningside, West Hill"
5,M1G,Scarborough,Woburn
6,M1H,Scarborough,Cedarbrae
7,M1J,Scarborough,Scarborough Village
8,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
9,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
10,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
11,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [12]:
df9.shape

(103, 3)

In [18]:
df9.to_csv('TO_Neighborhoods.csv', index=False)