# Segmenting and Clustering Neighborhoods in Toronto
**Applied Data Science Capstone**

## Webscraping Postal Codes

In the first part of this assignment, we scrape Wikipedia for the postal codes of Toronto neighborhoods.

Since this was described in an earlier assignment, I will post the code here without further explanation.

In [10]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the Wikipedia page and grab the first table on it
url_wiki = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url_wiki).text
soup = BeautifulSoup(page, 'lxml')
table = soup.find_all("table")[0]
table_rows = table.find_all("tr")

# Column names come from the header row; strip() removes the trailing newline
cols = [th.text.strip() for th in table_rows[0].find_all("th")]

# Cell text of every data row (the header row is skipped)
rows = [[td.text.strip() for td in tr.find_all("td")] for tr in table_rows[1:]]

# Skip postal codes whose borough is "Not assigned"
rows = [row for row in rows if row[1] != "Not assigned"]

df_post = pd.DataFrame(rows, columns=cols)
# Combine neighbourhoods that share a postal code into one comma-separated string
df_post = df_post.groupby(["Postcode", "Borough"])["Neighbourhood"].apply(','.join).reset_index()
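
The filtering and grouping steps can be tried on a small hand-made frame without touching the network. This is only a sketch: the `toy` frame and its rows are illustrative stand-ins for the scraped table (the `M1A` row is a made-up example of an unassigned code), and the names `toy`, `assigned`, and `grouped` are not part of the notebook.

```python
import pandas as pd

# Illustrative rows only; in the notebook these come from the scraped table.
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M1B", "M1B", "M1G"],
        "Borough": ["Not assigned", "Scarborough", "Scarborough", "Scarborough"],
        "Neighbourhood": ["Not assigned", "Rouge", "Malvern", "Woburn"],
    }
)

# Keep only postal codes that have a borough assigned.
assigned = toy[toy["Borough"] != "Not assigned"]

# One row per postal code, with its neighbourhoods joined by commas.
grouped = (
    assigned.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)
print(grouped)
```

The two `M1B` rows collapse into a single row, while `M1G` passes through unchanged.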


## Adding Latitude & Longitude Coordinates

I was unable to get geocoder to work, so I will use the `.csv` file that was provided.

In [11]:
coords = pd.read_csv("/home/brkalltheway/Dropbox/Coursera/IBM/capstone/Geospatial_Coordinates.csv")
coords.head()

|   | Postal Code | Latitude  | Longitude  |
|---|-------------|-----------|------------|
| 0 | M1B         | 43.806686 | -79.194353 |
| 1 | M1C         | 43.784535 | -79.160497 |
| 2 | M1E         | 43.763573 | -79.188711 |
| 3 | M1G         | 43.770992 | -79.216917 |
| 4 | M1H         | 43.773136 | -79.239476 |


The `coords` DataFrame needs to be merged with `df_post` on the postal code column. That column is named `"Postal Code"` in `coords` but `"Postcode"` in `df_post`, so we rename it in `coords` first.

In [12]:
coords.rename(columns={"Postal Code" : "Postcode"}, inplace=True)
coords.columns

Index(['Postcode', 'Latitude', 'Longitude'], dtype='object')
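
Renaming is one way to line up the keys. As an alternative sketch, `pd.merge` also accepts differently named key columns through its `left_on` and `right_on` parameters. The frames `left` and `right` below are small stand-ins built from values shown in this notebook, not the actual `df_post` and `coords`.

```python
import pandas as pd

# Small stand-ins for df_post and coords (names are illustrative).
left = pd.DataFrame(
    {"Postcode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]}
)
right = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# Match "Postcode" on the left against "Postal Code" on the right.
merged = pd.merge(left, right, left_on="Postcode", right_on="Postal Code")

# Both key columns survive the merge, so drop the redundant one.
merged = merged.drop(columns="Postal Code")
print(merged)
```

The rename used in this notebook avoids the extra `drop` step, which is why it is arguably the tidier choice here.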

Now that the key column has the same name in both DataFrames, we can merge them.

In [13]:
df = pd.merge(df_post, coords, on="Postcode")
df.head()

|   | Postcode | Borough     | Neighbourhood                        | Latitude  | Longitude  |
|---|----------|-------------|--------------------------------------|-----------|------------|
| 0 | M1B      | Scarborough | Rouge,Malvern                        | 43.806686 | -79.194353 |
| 1 | M1C      | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E      | Scarborough | Guildwood,Morningside,West Hill      | 43.763573 | -79.188711 |
| 3 | M1G      | Scarborough | Woburn                               | 43.770992 | -79.216917 |
| 4 | M1H      | Scarborough | Cedarbrae                            | 43.773136 | -79.239476 |
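
`pd.merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without warning. One way to check for that, sketched here on stand-in data, is a left join with `indicator=True`, which adds a `_merge` column recording where each row came from. The frames `posts_demo` and `coords_demo` and the postal code `M9Z` are hypothetical and chosen only to produce one unmatched row.

```python
import pandas as pd

# Stand-in data; "M9Z" is a hypothetical code with no coordinates.
posts_demo = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
coords_demo = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# A left join keeps every postal code; _merge marks unmatched rows as "left_only".
checked = pd.merge(posts_demo, coords_demo, on="Postcode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list on the real data would suggest that the inner join above did not lose any postal codes.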
