# IBM Data Science Capstone - Toronto Neighbourhood Clustering Part 1

We start by importing the libraries and setting up the variables that will be used when we're ready to import and clean the data.

In [207]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import; the pandas.io.json path is deprecated)

# import k-means for the clustering stage
from sklearn.cluster import KMeans

print ('Imported Libraries.')

Imported Libraries.


In [208]:
# Main vars

# URL where we can find the neighbourhoods
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Data Frame to hold the neighbourhood list
tdot_neighbourhoods_df = []


## Merge-a-Plenty

Time to get to work now that our libs and vars have been dealt with.  

I split this off into a function to keep things relatively clean. If there's a more efficient, vector-driven way of doing this, please leave a comment; I'd be happy to explore further!

In [209]:
# Data Cleansing Functions

# postcode_merge: take repeated postal codes and combine the neighbourhoods
def postcode_merge(df):
    
    # Vars needed for processing
    return_df, multi_postcode_df, duplicate_postcodes_list = [], [], []
    
    # Create list of postal codes we need to merge
    duplicate_postcodes_df = df.groupby(['Postcode','Borough']).count() > 1
    duplicate_postcodes_df = duplicate_postcodes_df[duplicate_postcodes_df['Neighbourhood'] == True]
    duplicate_postcodes_df.reset_index(inplace=True)
    duplicate_postcodes_list = duplicate_postcodes_df['Postcode'].to_numpy()
    
    # Split dataframes into done pile and merge pile
    return_df = df[~df['Postcode'].isin(duplicate_postcodes_list)][['Postcode','Borough','Neighbourhood']]
    multi_postcode_df = df[df['Postcode'].isin(duplicate_postcodes_list)][['Postcode','Borough','Neighbourhood']]
     
    # Iterate through the duplicate postcode list, NOT the dataframes themselves
    for postcode in duplicate_postcodes_list:

        # Don't repeat if we already processed the postal code
        if return_df[return_df['Postcode'] == postcode].shape[0] == 0:
            
            # Join the neighbourhoods w/ matching postal code, sort list by neighbourhood name
            append_postcode = postcode
            append_hood =  ', '.join(sorted(map(str, multi_postcode_df[multi_postcode_df['Postcode'] == postcode]['Neighbourhood'].to_numpy())))
            append_borough = multi_postcode_df[multi_postcode_df['Postcode'] == postcode]['Borough'].unique()[0]
            
            # Add entry to return df
            # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
            return_df = pd.concat([return_df, pd.DataFrame({'Postcode':append_postcode,'Borough':append_borough,'Neighbourhood':append_hood}, index=[0])], ignore_index=True)
    
    # Rename the columns and return the clean list of postal codes and neighbourhoods
    return return_df.rename(columns={"Postcode": "PostalCode","Neighbourhood":"Neighborhood"})
        

print ("functions compiled.")

functions compiled.
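
On the question of a more vector-driven approach: one possible sketch is to let `groupby` do the merging in a single pass. The function name `postcode_merge_grouped` and the five sample rows below are made up for illustration; only the column names come from the function above.

```python
import pandas as pd

# Small hand-typed sample in the same shape as the scraped table (illustrative only)
sample_df = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M4S', 'M5P', 'M5P'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Central Toronto',
                'Central Toronto', 'Central Toronto'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Davisville',
                      'Forest Hill West', 'Forest Hill North'],
})

# postcode_merge_grouped: hypothetical groupby-based alternative to postcode_merge
def postcode_merge_grouped(df):
    # One row per (Postcode, Borough); neighbourhood names are sorted, then comma-joined
    merged = (
        df.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(lambda names: ', '.join(sorted(map(str, names))))
          .reset_index()
    )
    return merged.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'})

merged_df = postcode_merge_grouped(sample_df)
print(merged_df)
```

Postal codes that appear only once pass through the same path unchanged, so there is no need to split the frame into a done pile and a merge pile. The row order comes out sorted by the group keys rather than in the order the loop produces, which should not matter here because the table is sorted again later.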


## Read and clean the data

It's now time to scrape the web page and clean up the data. I used pandas to do the scraping, but I would like to retry this one day soon using Beautiful Soup (note to self: Google it).
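
As a rough idea of what the Beautiful Soup version might look like, here is a minimal sketch that parses a hand-written HTML snippet instead of the live page. The `wikitable` class and the three-column layout are assumptions about the page's markup, and `sample_html` is a made-up stand-in, so the selectors would need checking against the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's table; the real markup may differ
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M4S</td><td>Central Toronto</td><td>Davisville</td></tr>
  <tr><td>M5P</td><td>Central Toronto</td><td>Forest Hill North</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', class_='wikitable')  # assumed class name

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row only has <th> cells, so it produces an empty list
        rows.append(cells)

scraped_df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighbourhood'])
print(scraped_df)
```

Against the live page, the snippet would be swapped for the downloaded HTML, for example `BeautifulSoup(requests.get(url).text, 'html.parser')`, and the rest should stay the same as long as the table structure matches these assumptions.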


In [210]:
# Read the Wikipedia data
tdot_neighbourhoods_df = pd.read_html(url)[0]

# Rule 1: Drop unassigned boroughs
tdot_neighbourhoods_df.drop( tdot_neighbourhoods_df[ tdot_neighbourhoods_df['Borough'] == 'Not assigned' ].index , inplace=True)

# Rule 2: Merge the repeated postal codes
tdot_neighbourhoods_df = postcode_merge(tdot_neighbourhoods_df)

# Rule 3: Fill unassigned hoods
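# Use .loc with a boolean mask; chained indexing may assign to a copy and trigger SettingWithCopyWarning
unassigned_hoods = tdot_neighbourhoods_df['Neighborhood'] == 'Not assigned'
tdot_neighbourhoods_df.loc[unassigned_hoods, 'Neighborhood'] = tdot_neighbourhoods_df.loc[unassigned_hoods, 'Borough']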

# Just some final sorting touches
tdot_neighbourhoods_df.sort_values(by=['Borough','Neighborhood'], inplace=True)
tdot_neighbourhoods_df.reset_index(drop=True, inplace=True)
print ("")





## Toronto Neighbourhoods by Postal Code

The end result with all cleansing in the rearview mirror...

In [211]:
tdot_neighbourhoods_df

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M4S | Central Toronto | Davisville |
| 1 | M4P | Central Toronto | Davisville North |
| 2 | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... |
| 3 | M5P | Central Toronto | Forest Hill North, Forest Hill West |
| 4 | M4N | Central Toronto | Lawrence Park |
| 5 | M4T | Central Toronto | Moore Park, Summerhill East |
| 6 | M5R | Central Toronto | North Midtown, The Annex, Yorkville |
| 7 | M4R | Central Toronto | North Toronto West |
| 8 | M5N | Central Toronto | Roselawn |
| 9 | M5H | Downtown Toronto | Adelaide, King, Richmond |


### Let's double-check the row count

We'll check the row count using .shape

In [212]:
print ("Number of rows: ", tdot_neighbourhoods_df.shape[0])

Number of rows:  103
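
Beyond the row count, a few more checks could confirm that the three cleaning rules held. The sketch below is one way to do it; the helper name `cleaning_problems` and the three-row `check_df` are hypothetical stand-ins, not part of the notebook above.

```python
import pandas as pd

# Hypothetical three-row stand-in for tdot_neighbourhoods_df
check_df = pd.DataFrame({
    'PostalCode': ['M4S', 'M4P', 'M5P'],
    'Borough': ['Central Toronto'] * 3,
    'Neighborhood': ['Davisville', 'Davisville North',
                     'Forest Hill North, Forest Hill West'],
})

def cleaning_problems(df):
    """Return a list of problems found; an empty list means the frame looks clean."""
    problems = []
    if (df['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs remain')        # Rule 1
    if not df['PostalCode'].is_unique:
        problems.append('duplicate postal codes remain')     # Rule 2
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighbourhoods remain')  # Rule 3
    return problems

print(cleaning_problems(check_df))
```

Passing `tdot_neighbourhoods_df` instead of the sample would run the same checks on the real table.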


I hope you enjoyed it!