# Peer-Graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

### This workbook is Gareth Mitchell-Jones' submission for the IBM Data Science Capstone Week 3 Assessment

Having compared the table on Wikipedia with the assessment criteria below, it is clear that either the table's format has changed in several ways since this assessment was written, or we were always being asked to program redundant steps.

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

1. There are <b>no duplicate postal codes</b> in the table, so there is no need to aggregate the rows by postal code and borough using the 'groupby' function. (I will do it anyway just to prove the point.)

2. <b>Neighborhoods are already appended together</b> and separated by the characters " / ". (I will also clean up the text so that it uses commas instead of slashes.)

3. There are <b>no values of 'Not assigned' in the Neighborhood column</b>, so unassigned neighborhoods appear only as blank values (NaN) in the ingested dataframe. These can be selected or excluded with a null check (or a comparison against "''" when the cells are read as empty strings), removing all redundant rows in a single step.

4. There is a <b>1:1 relationship between Borough values of 'Not assigned' and Neighborhood values of 'NaN'</b> (or "''"), so one need only remove the rows whose Borough is 'Not assigned' and the final data set will be complete.
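
A minimal sketch of how these four observations could be checked. It runs on a few made-up rows rather than the live Wikipedia table, and the variable names are illustrative only. Note that pandas reads a blank cell as NaN, which an equality test against an empty string does not match.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values, not the real data)
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [np.nan, "Parkwoods", "Regent Park / Harbourfront"],
})

# Observation 1: no postal code appears more than once
no_duplicates = toy["Postal code"].is_unique

# Observation 3: a blank cell is NaN, so == '' finds nothing while isna() does
matches_empty_string = int((toy["Neighborhood"] == "").sum())
matches_isna = int(toy["Neighborhood"].isna().sum())

# Observation 4: NaN neighborhoods line up exactly with unassigned boroughs
one_to_one = bool(
    (toy["Neighborhood"].isna() == (toy["Borough"] == "Not assigned")).all()
)

print(no_duplicates, matches_empty_string, matches_isna, one_to_one)
```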

### Assessment Request:

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

<img src='https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1586995200000&hmac=YOUFk7OdzAeCzR67bfb002tsUs1CUeNp3U7eVUJ99dU'></img>

3. To create the above dataframe:

<ul type="disc">a) The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</ul>
<ul type="disc">b) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.</ul>
<ul type="disc">c) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.</ul>
<ul type="disc">d) If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.</ul>
<ul type="disc">e) Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</ul>
<ul type="disc">f) In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.</ul>
4. Submit a link to your Notebook on your Github repository. (10 marks)
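
Rules (b) to (d) above can be sketched on a toy table laid out the way the assignment describes it (one row per neighborhood, with 'Not assigned' placeholders). The rows below are invented for illustration and do not come from the Wikipedia page.

```python
import pandas as pd

# Invented rows in the layout the assignment text assumes
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# (b) keep only rows with an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# (d) a 'Not assigned' neighborhood takes the borough name
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# (c) combine rows sharing a postal code into one comma-separated row
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(clean)
print(clean.shape)
```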

Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which would help to learn for more complicated cases of web scraping is using the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good Youtube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

### Useful websites
1. https://github.com/CoreyMSchafer/code
2. https://www.youtube.com/watch?v=ng2o98k983k
3. https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html
4. https://www.youtube.com/watch?v=OXA_ZD1gR6A
5. http://beautiful-soup-4.readthedocs.io/en/latest/


In [1]:
# Define the URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Install packages if needed (based on the YouTube videos / GitHub links).
# The % magics are used because a bare "pip install" is not valid Python in a cell.
%conda install --yes pip
%pip install beautifulsoup4
%pip install lxml
%pip install requests

In [2]:
import numpy as np
import pandas as pd

# Extract tables from Wikipedia
wikitables = pd.read_html(url, attrs={"class": "wikitable"})

# Confirm capture of the table
print("Extracted {num} wikitables".format(num=len(wikitables)))

# Instantiate the dataframe and check size / layout
df = wikitables[0]
df

Extracted 1 wikitables


,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


In [3]:
# Remove rows whose Borough is 'Not assigned', as directed, and recreate the index
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop=True, inplace=True)
df

,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


In [4]:
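# Note: sort_values returns a new DataFrame and the result is not assigned here,
# so df is displayed below in its original order
df.sort_values(by=['Postal code', 'Borough'])
df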

,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


In [5]:
#Check to see if any rows have 'Not assigned' Neighborhoods
df.loc[df['Neighborhood']=='Not assigned']

,Postal code,Borough,Neighborhood


In [6]:
#Check to see if any rows have empty Neighborhoods (NaN or an empty string)
df.loc[df['Neighborhood'].isna() | (df['Neighborhood'] == '')]

,Postal code,Borough,Neighborhood


In [7]:
# Work on an explicit copy so the assignments below do not act on a slice of the original frame
df = df.copy()
# Rename Postal code to PostalCode to match the request
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
# Replace the preset " / " separator with ", "
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)
# Fix the capitalisation of one entry
df['Neighborhood'] = df['Neighborhood'].str.replace('Business reply mail Processing CentrE', 'Business Reply Mail Processing Centre', regex=False)
df


,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
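
Assigning to a column of a DataFrame that was itself produced by filtering another one is the usual trigger for pandas' copy-of-a-slice warning. A minimal sketch, on made-up rows, of taking an explicit copy after filtering so that later assignments act on an independent frame; `source` and `subset` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for the scraped table
source = pd.DataFrame({
    "Borough": ["Not assigned", "North York"],
    "Neighborhood": ["", "Lawrence Manor / Lawrence Heights"],
})

# .copy() makes the filtered frame independent of `source`
subset = source[source["Borough"] != "Not assigned"].copy()

# regex=False treats " / " as a literal string rather than a pattern
subset["Neighborhood"] = subset["Neighborhood"].str.replace(" / ", ", ", regex=False)

print(subset)
```

The original frame is left untouched, which is easy to confirm by printing `source` afterwards.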


In [8]:
df.shape

(103, 3)

In [9]:
# Try the groupby function and check that there are no duplicate rows, even though
# we know it's a redundant task
dfc = df.groupby(['PostalCode','Borough']).agg(lambda x: ', '.join(x))
dfc

PostalCode,Borough,Neighborhood
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
...,...,...
M9N,York,Weston
M9P,Etobicoke,Westmount
M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [10]:
dfc.shape

(103, 1)
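
The shape has one column here rather than three because the grouping keys become the index of the grouped frame. A small sketch on invented rows, assuming the same grouping call as above, shows how `reset_index` would turn the keys back into ordinary columns and how an unchanged row count confirms there were no duplicates to merge.

```python
import pandas as pd

# Invented rows with no duplicate postal codes
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Same grouping as above: the keys move into a two-level index
grouped = toy.groupby(["PostalCode", "Borough"]).agg(lambda x: ", ".join(x))

# reset_index turns the index levels back into regular columns
flat = grouped.reset_index()

# An unchanged row count means no rows were merged
rows_unchanged = len(grouped) == len(toy)

print(grouped.shape, flat.shape, rows_unchanged)
```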

In [11]:
#Replicate this using Beautiful Soup knowing what we know about the data

In [12]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table', {'class': 'wikitable'})
trc = table.find_all('tr')

# Collect the text of every <td> cell, row by row; the header row has no <td> cells
data = []
for row in trc:
    data.append([t.text.strip() for t in row.find_all('td')])

df2 = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df2 = df2[~df2['PostalCode'].isnull()]  # to filter out bad rows
df2 = df2[~df2['Neighborhood'].isnull()]  # to filter out bad rows
# Replace the preset " / " separator with ", "
df2['Neighborhood'] = df2['Neighborhood'].str.replace(' / ', ', ', regex=False)
df2['Neighborhood'] = df2['Neighborhood'].str.replace('Business reply mail Processing CentrE', 'Business Reply Mail Processing Centre', regex=False)
df2 = df2[df2['Borough'] != 'Not assigned']
df2.reset_index(drop=True, inplace=True)
df2.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [13]:
df2.shape

(103, 3)
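
Matching shapes suggest the two approaches agree, but they do not prove the contents are identical. A hedged sketch of a stricter comparison, using two small invented frames in place of `df` and `df2`:

```python
import pandas as pd

# Invented stand-ins for the pandas-scraped and BeautifulSoup-scraped frames
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
right = left.copy()

# equals() is True only when shape, labels, and every value match
frames_match = left.equals(right)

# A single differing cell is enough to make the comparison fail
altered = right.copy()
altered.loc[0, "Neighborhood"] = "Something else"
altered_match = left.equals(altered)

print(frames_match, altered_match)
```

`pd.testing.assert_frame_equal(left, right)` is an alternative that raises an error describing the first difference instead of returning a boolean.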

## Thanks for reviewing my submission