## Segmenting and Clustering Neighborhoods in Toronto (Week 03 - Second)

### From the First Question: Obtaining the Toronto PostalCode DataFrame

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Download the Wikipedia page and parse its HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
text = bs(page.content, 'lxml')

In [2]:
# Locate the postal code table and prepare empty lists for the column names and the rows
table = text.find('table',{'class':'wikitable sortable'})
tr_rows = table.find_all('tr')
df_cols = []
df_rows = []

# Search for header tags. Their text becomes the DataFrame column names
th_rows = table.find_all('th')
for th in th_rows:
    df_cols.append(th.text)

# Search for row tags. Each one becomes a DataFrame row.
# The header row contains no <td> cells, so it yields an empty list.
for tr in tr_rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    df_rows.append(row)
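
The parsing steps above can be tried offline on a tiny inline table. This is only a sketch: the sample HTML is made up for illustration, and the built-in `html.parser` is used in place of `lxml` so that BeautifulSoup is the only dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

# Header cells become the column names (the last one keeps its trailing newline).
columns = [th.text for th in table.find_all('th')]

# Each <tr> becomes a list of cell texts; the header row has no <td>, so it gives [].
rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]

print(columns)
print(rows)
```

The empty first row and the trailing newline in the last column are the two things the next cell has to clean up.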

In [3]:
# Build the DataFrame from df_cols and df_rows, then drop rows with missing values (such as the empty header row).
df = pd.DataFrame(df_rows, columns=df_cols)
df.dropna(inplace = True)
df.rename(columns={'Neighbourhood\n':'Neighborhood', 'Postcode':'PostalCode'}, inplace = True)

# The values in the last column end with an unnecessary '\n' character, so remove it.
df = df.replace('\\n','', regex=True)

# Drop the rows whose Borough is 'Not assigned'.
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames, inplace = True)

# Where Neighborhood is 'Not assigned', use the Borough name instead
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']

# Merge rows that share a PostalCode, joining their Neighborhood names with commas
TorontoPD = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
TorontoPD.shape

(103, 3)
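
As a quick sanity check of the three cleaning rules, here is a minimal sketch on a handful of made-up rows (the values are illustrative, not taken from the scraped table). It uses boolean filtering instead of `drop`, which is an equivalent alternative rather than the notebook's exact code.

```python
import pandas as pd

# Made-up rows that mimic the scraped table (not the real data).
toy = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a Borough.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the Borough name where the Neighborhood is missing.
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# 3. One row per postal code, with the neighborhood names joined by commas.
merged = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
             .apply(', '.join)
             .reset_index())
print(merged)
```

The two `M5A` rows collapse into one, and the `M7A` row takes its Borough name as its Neighborhood.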

### Adding Latitude and Longitude for Each Postal Code to the TorontoPD DataFrame
The original plan was to use geocoder and iterate over each DataFrame row, adding the corresponding (Latitude, Longitude) coordinates as new columns.
After several unsuccessful attempts to get the geographical coordinates from geocoder, I'm using the CSV file of coordinates instead.
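
For reference, a geocoder-based attempt could be structured roughly as below. This is a hedged sketch, not the code that was actually tried: `get_latlng`, `lookup`, `max_tries`, and `pause` are hypothetical names, and the real geocoding call is left as a pluggable function so the retry logic can run without network access.

```python
import time

def get_latlng(postal_code, lookup, max_tries=5, pause=0.0):
    """Call lookup(address) until it returns coordinates or max_tries is reached.

    lookup is any callable returning [lat, lng], or an empty/None value on
    failure. It is a hypothetical stand-in for a real geocoding call such as
    lambda addr: geocoder.arcgis(addr).latlng  (assumes the geocoder package).
    """
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = lookup(address)
        if coords:              # treat None and [] alike as "no result"
            return coords
        time.sleep(pause)       # brief pause before retrying
    return None                 # give up so the caller can fall back to the CSV


# Usage with a fake lookup that fails twice, then succeeds (made-up behaviour).
attempts = []

def fake_lookup(address):
    attempts.append(address)
    return [43.806686, -79.194353] if len(attempts) >= 3 else None

print(get_latlng('M1B', fake_lookup))
```

Capping the number of attempts avoids looping forever when the service never answers, which is the situation that led to the CSV fallback here.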

In [4]:
# A CSV file with all the geographical coordinates will be used:
url_coords = 'http://cocl.us/Geospatial_data'
Toronto_coords = pd.read_csv(url_coords)

# Rename the 'Postal Code' column so it matches the one in TorontoPD
Toronto_coords.rename(columns={'Postal Code':'PostalCode'}, inplace = True)
Toronto_coords.head()

|   | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [5]:
TorontoDF = pd.merge(TorontoPD, Toronto_coords, on = 'PostalCode')
TorontoDF.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
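
Because `pd.merge` performs an inner join by default, any postal code missing from the coordinates file would be dropped without warning. One way to check for that is sketched below on made-up frames; `M9Z` is a hypothetical code with no coordinates, and the variable names are illustrative.

```python
import pandas as pd

# Made-up frames; 'M9Z' is a hypothetical postal code with no coordinates.
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# A left join keeps every postal code. indicator=True adds a '_merge' column
# recording where each row came from, and validate raises an error if a
# postal code appears more than once on either side.
checked = pd.merge(left, right, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no match in the coordinates frame.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join lost no rows.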
