<h1>Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto</h1>

<h2>Submission 1:<br/>Creation of initial data frame.</h2>

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request

In [2]:
# Load the data set for parsing with BeautifulSoup
data_link="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
resp = urllib.request.urlopen(data_link)

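# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(resp.read(), 'html.parser')
# soup.table returns the first <table> on the page, assumed to be the postal code table
table = soup.table
rows = table.find_all('tr')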

<h3>
    Assumptions:
    <ul>
        <li>The postal code cell contains no hyperlinks and needs no extra processing.</li>
    </ul>
    Process the loaded data as follows:
    <ul>
        <li>If a row does not contain exactly 3 cells, drop the row.</li>
        <li>If a cell contains a hyperlink, use the hyperlink's text.</li>
        <li>If a cell does not contain a hyperlink, use the cell's own text.</li>
        <li>Remove all trailing whitespace from neighborhood names.</li>
        <li>If the row's borough is 'Not assigned', drop the row.</li>
        <li>If the row's neighborhood is 'Not assigned', use the borough name as the neighborhood name.</li>
    </ul>
</h3>
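As a minimal sketch, the row rules can be tried out on plain strings, independent of the HTML parsing. The helper name `clean_row` and the sample rows are illustrative assumptions, not part of the notebook or of the scraped page.

```python
NOT_ASSIGNED = 'Not assigned'


def clean_row(cells):
    """Apply the row rules to a list of cell texts (hypothetical helper).

    Returns [postcode, borough, neighborhood], or None when the row
    should be dropped.
    """
    # Rows that do not have exactly 3 cells (e.g. header rows) are dropped
    if len(cells) != 3:
        return None
    postcode, borough, neighborhood = cells
    # Rows without an assigned borough are dropped
    if borough == NOT_ASSIGNED:
        return None
    # Only trailing whitespace is removed from the neighborhood
    neighborhood = neighborhood.rstrip()
    # An unassigned neighborhood falls back to the borough name
    if neighborhood == NOT_ASSIGNED:
        neighborhood = borough
    return [postcode, borough, neighborhood]


# Made-up sample rows, shaped like the cell texts of the table
sample_rows = [
    ['M1A', 'Not assigned', 'Not assigned\n'],
    ['M3A', 'North York', 'Parkwoods\n'],
    ['M7A', "Queen's Park", 'Not assigned\n'],
    ['Postcode', 'Borough'],
]
cleaned = [row for row in map(clean_row, sample_rows) if row is not None]
print(cleaned)
# [['M3A', 'North York', 'Parkwoods'], ['M7A', "Queen's Park", "Queen's Park"]]
```

Keeping the rules in a pure function like this makes them easy to check without fetching the page.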

In [3]:
remapped_data = []
for row in rows:
    cols = row.find_all('td')
    if len(cols) == 3:        
        zipcode = cols[0].string

        if cols[1].find('a'):
            borough = cols[1].a.string
        else:
            borough = cols[1].string

        # Skip boroughs with name 'Not assigned'
        if borough == 'Not assigned':
            continue
            
        if cols[2].find('a'):
            neighborhood = cols[2].a.string.rstrip()
        else:
            neighborhood = cols[2].string.rstrip()

        if neighborhood == 'Not assigned':
            neighborhood = borough

        remapped_data.append([zipcode, borough, neighborhood])
        
# print(remapped_data)

<h3>Since multiple neighborhoods can share one postal code, each postal code gets a single entry whose neighborhood field lists all of its neighborhoods, separated by commas. Each postal code is assumed to cover exactly one borough.</h3>

In [4]:
zipcode_map = {}

for entry in remapped_data:
    zipcode = entry[0]
    if zipcode not in zipcode_map:
        # add borough and neighborhood to the new entry
        zipcode_map[zipcode] = [entry[1], entry[2]]
    else:
        # append this entry neighborhood to the existing record's neighborhood
        existing_entry = zipcode_map[zipcode]
        existing_entry[1] += ", "
        existing_entry[1] += entry[2]

# print(zipcode_map)
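The same aggregation can also be sketched with `collections.defaultdict`, collecting the names in a list and joining them once at the end. This is an alternative illustration under the same one-borough-per-postal-code assumption; the function name `aggregate_neighborhoods` and the sample rows are hypothetical.

```python
from collections import defaultdict


def aggregate_neighborhoods(rows):
    """Group [postcode, borough, neighborhood] rows by postcode."""
    boroughs = {}
    neighborhoods = defaultdict(list)
    for postcode, borough, neighborhood in rows:
        # Keep the first borough seen; one borough per postcode is assumed
        boroughs.setdefault(postcode, borough)
        neighborhoods[postcode].append(neighborhood)
    # Join once at the end instead of growing a string entry by entry
    return {
        postcode: [boroughs[postcode], ', '.join(names)]
        for postcode, names in neighborhoods.items()
    }


# Made-up sample rows in the same shape as remapped_data
sample = [
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Regent Park'],
]
result = aggregate_neighborhoods(sample)
print(result['M5A'])
# ['Downtown Toronto', 'Harbourfront, Regent Park']
```

The result has the same shape as `zipcode_map`: postcode keys mapped to a `[borough, neighborhoods]` pair, in first-seen order.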

In [5]:
# Load the dataframe with the contents of the dictionary in one step
# (row-by-row df.loc assignment gets slow as the input grows)
df = pd.DataFrame(
    [[zipcode, borough, neighborhoods]
     for zipcode, (borough, neighborhoods) in zipcode_map.items()],
    columns=['Postcode', 'Borough', 'Neighborhood'],
)
    
# print(df)

In [6]:
print(df.shape)

(103, 3)


<h2>Submission 2:<br/>Assign location information.</h2>

In [21]:
location_data_source = "https://cocl.us/Geospatial_data"

location_df = pd.read_csv(location_data_source)
# print(location_df)

# Inner join on the postal code; postal codes missing from either table are dropped
df_with_lat_lon = pd.merge(df, location_df, left_on='Postcode', right_on='Postal Code', how='inner')
# 'Postal Code' duplicates 'Postcode' after the merge, so drop it
df_with_lat_lon.drop(['Postal Code'], axis=1, inplace=True)
print(df_with_lat_lon.head())

  Postcode           Borough                      Neighborhood   Latitude  \
0      M3A        North York                         Parkwoods  43.753259   
1      M4A        North York                  Victoria Village  43.725882   
2      M5A  Downtown Toronto         Harbourfront, Regent Park  43.654260   
3      M6A        North York  Lawrence Heights, Lawrence Manor  43.718518   
4      M7A      Queen's Park                      Queen's Park  43.662301   

   Longitude  
0 -79.329656  
1 -79.315572  
2 -79.360636  
3 -79.464763  
4 -79.389494  
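Because an inner merge silently drops postal codes that have no coordinates, it can be worth checking for unmatched keys first. The following is a minimal sketch using the `indicator` option of `pd.merge`; the helper name `unmatched_postcodes`, the postcode `M0Z`, and the borough `Example Borough` are made up for illustration.

```python
import pandas as pd


def unmatched_postcodes(left, right):
    """Return the postcodes in `left` that have no row in `right`."""
    merged = pd.merge(
        left, right,
        left_on='Postcode', right_on='Postal Code',
        how='left', indicator=True,
    )
    # '_merge' is 'left_only' where the coordinates table had no match
    return merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()


# Small sample frames; 'M0Z' is a made-up code with no coordinates
neighborhoods = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M0Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})
missing = unmatched_postcodes(neighborhoods, coordinates)
print(missing)
# ['M0Z']
```

An empty list means the inner merge keeps every row of the left frame.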


In [22]:
print(df)

    Postcode           Borough  \
0        M3A        North York   
1        M4A        North York   
2        M5A  Downtown Toronto   
3        M6A        North York   
4        M7A      Queen's Park   
5        M9A         Etobicoke   
6        M1B       Scarborough   
7        M3B        North York   
8        M4B         East York   
9        M5B  Downtown Toronto   
10       M6B        North York   
11       M9B         Etobicoke   
12       M1C       Scarborough   
13       M3C        North York   
14       M4C         East York   
15       M5C  Downtown Toronto   
16       M6C              York   
17       M9C         Etobicoke   
18       M1E       Scarborough   
19       M4E      East Toronto   
20       M5E  Downtown Toronto   
21       M6E              York   
22       M1G       Scarborough   
23       M4G         East York   
24       M5G  Downtown Toronto   
25       M6G  Downtown Toronto   
26       M1H       Scarborough   
27       M2H        North York   
28       M3H  

<h2>Submission 3:<br/>Visualization of Toronto data.</h2>