# Segmenting and Clustering Neighborhoods in Toronto
**Applied Data Science Capstone**

## Webscraping Postal Codes

In the first part of this assignment, we will scrape Wikipedia for the postal codes of Toronto neighborhoods, which are the Canadian postal codes beginning with M.

First, we will load the necessary libraries.

In [1]:
import requests
from bs4 import BeautifulSoup

Next, we will request the page's HTML and then use BeautifulSoup to parse it.

In [2]:
url_wiki = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url_wiki).text
soup = BeautifulSoup(page, 'lxml')
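
The request above assumes the page loads successfully. A slightly more defensive variant is sketched below: it sets a timeout and raises on HTTP error codes before parsing. The helper name `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the original notebook. The built-in `html.parser` is used so the sketch does not depend on `lxml` being installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Parsing itself needs no network access, so it can be tried on a literal string
demo = BeautifulSoup("<table><tr><th>Postcode</th></tr></table>", "html.parser")
print(demo.find("table").find("th").text)
```

Calling `fetch_soup(url_wiki)` would then take the place of the two lines that build `page` and `soup`.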

Once the page is parsed, we need to find the table. Tables are enclosed in `<table>` tags. Each row of a table is enclosed in `<tr>` tags, and the first row holds the header cells in `<th>` tags. We will use the `.text` attribute to get each cell's text without the tags.

In [3]:
table = soup.find_all("table")[0]
header = table.find_all("tr")[0].find_all("th")
cols = [i.text for i in header]
cols

['Postcode', 'Borough', 'Neighbourhood\n']

Since the name of the last column ends with a newline (`"\n"`), we will remove it.

In [4]:
cols[2] = cols[2].replace("\n", "")
cols

['Postcode', 'Borough', 'Neighbourhood']
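
As an alternative to the `replace` call above, BeautifulSoup's `get_text(strip=True)` strips surrounding whitespace from every cell in one pass. The sketch below runs on a small hand-written stand-in for the header row. The markup and the `demo_` names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in for the header row of the Wikipedia table (assumed markup)
demo_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
</table>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")
header_cells = demo_soup.find("table").find("tr").find_all("th")

# get_text(strip=True) removes leading/trailing whitespace, including "\n"
demo_cols = [cell.get_text(strip=True) for cell in header_cells]
print(demo_cols)
```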

To get all of the table values, we will loop over the table rows and extract every value enclosed in `<td>` tags. The text of each row's cells is appended to a list, so that each element of the list is itself a list of row values.

Once the row values are obtained, only the rows with an assigned borough are kept, and the list is then transposed so that each inner list holds one column. Since each value in the last column contains `"\n"`, the newline is removed.

In [5]:
table_rows = table.find_all("tr")

# Collect the cell text of every data row, skipping the header row
rows = []
for tr in table_rows[1:]:
    rows.append([td.text for td in tr.find_all("td")])

# Transform rows to columns, keeping only rows with an assigned borough
vals = [[] for _ in cols]
for row in rows:
    if row[1] != "Not assigned":
        for i, value in enumerate(row):
            vals[i].append(value)

# Remove the newline from each value in the last column
vals[2] = [v.replace("\n", "") for v in vals[2]]
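
The filter-then-transpose step can also be written with list comprehensions and `zip(*...)`. The following is a minimal sketch on toy rows in the same `[postcode, borough, neighbourhood]` layout. The `M1A` row and the `sample_` names are illustrative assumptions, not data taken from the scraped table.

```python
# Toy rows in the same [postcode, borough, neighbourhood] layout (illustrative)
sample_rows = [
    ["M1A", "Not assigned", "Not assigned\n"],
    ["M3A", "North York", "Parkwoods\n"],
    ["M4A", "North York", "Victoria Village\n"],
]

# Keep assigned boroughs only and strip surrounding whitespace from every cell
kept = [[cell.strip() for cell in row]
        for row in sample_rows if row[1] != "Not assigned"]

# zip(*kept) groups the i-th element of every row together, i.e. it transposes
sample_vals = [list(column) for column in zip(*kept)]
print(sample_vals)
# [['M3A', 'M4A'], ['North York', 'North York'], ['Parkwoods', 'Victoria Village']]
```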

The transposed list is converted to a pandas DataFrame by zipping the column names with the column values in a dictionary comprehension.

In [7]:
import pandas as pd
df_post = pd.DataFrame({name:values for name, values in zip(cols, vals)})
df_post.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
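
pandas can also build the frame straight from the row-oriented list, which would make the manual transposition unnecessary. The sketch below assumes a list of rows shaped like `rows` above. The `demo_` names and the `M1A` row are hypothetical and only serve the example.

```python
import pandas as pd

demo_cols = ["Postcode", "Borough", "Neighbourhood"]
demo_rows = [
    ["M1A", "Not assigned", "Not assigned"],  # illustrative unassigned row
    ["M3A", "North York", "Parkwoods"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M5A", "Downtown Toronto", "Regent Park"],
]

# Each inner list becomes one DataFrame row, so no transposition is needed
df_demo = pd.DataFrame(demo_rows, columns=demo_cols)

# A boolean mask drops the rows without an assigned borough
df_demo = df_demo[df_demo["Borough"] != "Not assigned"].reset_index(drop=True)
print(df_demo)
```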


Since some postcodes contain multiple neighbourhoods, we will combine them: we group by Postcode and Borough with the `.groupby()` method, then join the neighbourhood names in each group into a single comma-separated string.

In [8]:
df_post = df_post.groupby(["Postcode", "Borough"])["Neighbourhood"].apply(lambda x: ','.join(x)).reset_index()
df_post.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
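
To see what the grouping does in isolation, here is a small sketch on a three-row toy frame. It passes `", ".join` directly to `.agg()` instead of wrapping it in a lambda, and it adds a space after the comma. Both are stylistic choices made for this example, not what the notebook cell does. The name `df_toy` is hypothetical.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (Postcode, Borough) pair; names are joined within each group
grouped = (
    df_toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The two `M5A` rows collapse into a single row, and `groupby` sorts the result by the group keys, which is why the real output starts at `M1B` rather than `M3A`.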


Finally, we check the shape of the new DataFrame, which has 103 rows and 3 columns.

In [9]:
df_post.shape

(103, 3)
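
Beyond the shape, a couple of quick checks can confirm that the cleaning steps did what was intended. The sketch below is one possible check, not part of the original assignment. The helper `check_postcodes` is hypothetical, and it is demonstrated on toy frames rather than on `df_post`.

```python
import pandas as pd


def check_postcodes(df):
    """Return True if every postcode appears once and no borough is unassigned
    (hypothetical helper)."""
    unique_codes = df["Postcode"].is_unique
    all_assigned = (df["Borough"] != "Not assigned").all()
    return bool(unique_codes and all_assigned)


clean = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})
# Same frame with its first row repeated, so one postcode is duplicated
duplicated = pd.concat([clean, clean.iloc[[0]]], ignore_index=True)

print(check_postcodes(clean), check_postcodes(duplicated))
```

Applied to the grouped frame, the call would be `check_postcodes(df_post)`.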