# Segmenting and Clustering Neighborhoods in Toronto

Peer-graded assignment for Week 3 of the Applied Data Science Capstone.

## Create and process Dataframe

### Prepare the environment

In [84]:
# Install the needed libraries
# urllib is a standard library
! pip install beautifulsoup4 # install beautifulsoup4



In [85]:
# --------- Import libraries --------------
import urllib.request # import the library we use to open URLs
from bs4 import BeautifulSoup # import the BeautifulSoup library so we can parse HTML and XML documents
import pandas as pd

In [87]:
# --------- Initialize data by using BeautifulSoup - Web Scraping ---------------
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url) # open the url using urllib.request and put the HTML into the page variable
soup = BeautifulSoup(page, 'html.parser') # parse the HTML into a BeautifulSoup tree; naming the parser explicitly avoids the "no parser was explicitly specified" warning

# --------- Analyze the webpage source code - I prefer using F12 in the browser rather than using prettify ---------------
# print(soup.prettify()) # use this, or view the source of the targeted webpage with F12 in the browser
# Find the <table class="wikitable sortable jquery-tablesorter"> element.
# (The jquery-tablesorter class is added by JavaScript in the browser; the HTML fetched by urllib only has "wikitable sortable".)
# Scroll down a little to see how the table is built: each row starts and ends with <tr> and </tr> tags.
# The header row uses <th> tags, while the data rows beneath it, one per postal code, use <td> tags. These <td> tags are where we extract our data from.
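To make the `<tr>` / `<th>` / `<td>` structure concrete, here is a minimal sketch that parses a tiny hand-written table instead of the live Wikipedia page. The HTML string and the names `html_demo`, `soup_demo`, and `rows` are made up for illustration; only the tag layout mirrors what is described above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table (not the real page content)
html_demo = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html_demo, "html.parser")
table = soup_demo.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")  # the header row has only <th> cells, so this is empty for it
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
# [['M1A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods']]
```

Filtering on `len(cells) == 3` is what skips the header row, because a row made of `<th>` cells yields no `<td>` matches.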

### Get the table out of the html page and save it as a dataframe

In [88]:
# --------- Get the table out of the html page and save it as a dataframe -------------------
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML (handy for inspecting what the page contains)
all_tables = soup.find_all("table")
# select the table we need by its class attribute
right_table = soup.find('table', class_='wikitable sortable')

postcode_col =[]
borough_col =[]
neighborhood_col =[]

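# walk the table rows; header rows contain <th> cells only, so they yield no <td> cells and are skipped
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        # take the first text node of each cell
        postcode_col.append(cells[0].find(string=True))
        borough_col.append(cells[1].find(string=True))
        neighborhood_col.append(cells[2].find(string=True))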

# Create the raw dataframe
df=pd.DataFrame(postcode_col,columns=['PostalCode'])
df['Borough'] = borough_col
df['Neighborhood'] = neighborhood_col

### Pre-process data as instructed in the Assignment

In [93]:
# ------------------ Pre-process data as instructed in the Assignment -----------------------
# Only process the cells that have an assigned borough. Ignore (drop) cells with a borough that is Not assigned.
# .copy() makes the filtered result an independent dataframe, so the assignments below do not raise SettingWithCopyWarning.
df = df[df['Borough'] != 'Not assigned'].copy()
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
not_assigned = df['Neighborhood'] == 'Not assigned\n'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']
# remove the trailing \n characters left over from the HTML cells
df = df.replace(r'\n', '', regex=True)

# combine rows that share the same PostalCode and Borough into one row, joining their neighborhoods with a comma
df = df.groupby(['PostalCode','Borough']).agg({
    'Neighborhood': lambda x: ', '.join(x)
}).reset_index()

# print out the dataframe
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
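The three pre-processing rules (drop unassigned boroughs, fall back to the borough name, merge rows per postal code) are easier to check on a handful of rows than on the full table. The sketch below applies them to a small hand-made dataframe; the rows in `toy` are illustrative, not scraped data, and the mask-based assignment is one possible way to express the fallback rule.

```python
import pandas as pd

# Illustrative rows only; the trailing "\n" imitates the raw scraped cells
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned\n", "Harbourfront\n", "Regent Park\n", "Not assigned\n"],
})

# Rule 1: drop rows whose borough is not assigned
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: a missing neighborhood falls back to the borough name
mask = toy["Neighborhood"] == "Not assigned\n"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

# Strip the newline characters
toy["Neighborhood"] = toy["Neighborhood"].str.replace("\n", "", regex=False)

# Rule 3: one row per (PostalCode, Borough), neighborhoods joined by ", "
result = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(result)
```

With these four input rows, `result` has two rows: M5A with "Harbourfront, Regent Park" and M7A with "Queen's Park" as its neighborhood.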


### Save the processed dataframe to a CSV file

In [94]:
# save the processed dataframe to a csv file for further tasks
df.to_csv('Canada_neighborhood.csv',index=False)
print('Dataframe saved!')

Dataframe saved!
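The `index=False` argument matters when the file is read back later. As a small sketch, the round trip below uses an in-memory buffer instead of a file on disk; `sample` and the helper `round_trip` are hypothetical names introduced only for this illustration.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Woburn"],
})

def round_trip(frame, write_index):
    """Write the frame as CSV to an in-memory buffer and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(sample, write_index=False)
with_index = round_trip(sample, write_index=True)

print(list(without_index.columns))  # ['PostalCode', 'Borough', 'Neighborhood']
print(list(with_index.columns))     # ['Unnamed: 0', 'PostalCode', 'Borough', 'Neighborhood']
```

Writing the index produces an extra unnamed column on reload, which is why the save step leaves it out. Values containing commas, such as "Rouge, Malvern", are quoted automatically by `to_csv` and come back intact.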


### Return the Shape

In [92]:
df.shape

(103, 3)