# Data Ingest and Cleanup for Toronto Neighborhood Segmentation and Clustering, Part 2

## This notebook satisfies the requirements to get, transform, and groom the data that will be used for the subsequent neighborhood segmentation and clustering exercise.  This is the SECOND of the submissions for Week 3 of the IBM Applied Data Science Capstone course offered via Coursera.  The assignment may be viewed at https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit.

### Please see the comments in the code for detailed explanations of each processing step.  Thanks!

### Part 1: Get, Transform, and Clean the Data

In [None]:
#install the necessary packages
#commented out here to clean up the notebook
#!pip install bs4;
#!pip install requests;
#!pip install lxml;
#!pip install cchardet;

In [29]:
#import the necessary libraries
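import urllib.request;  #import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded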
import pandas as pd;
from bs4 import BeautifulSoup;

#set the url to scrape
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M';

#get the parsed text from the url using the lxml parser
html = urllib.request.urlopen(url);
bs = BeautifulSoup(html,'lxml');

#use BeautifulSoup4 to find the table we need
table_object = bs.find(lambda tag: tag.name=='table',attrs={"class": "wikitable sortable"}); 

#use BeautifulSoup4 to get all rows in the table
row_objects = table_object.tbody.find_all(lambda tag: tag.name=='tr');

#create and populate a list to stage the data for pandas
toronto_data_row_list = [];

#define a function to get rows by tag (is it header or data?)
def get_rows_by_tag(tr, column_tag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(column_tag)];

#append the header row to the list
toronto_headers = get_rows_by_tag(row_objects[0], 'th');
toronto_data_row_list.append(toronto_headers);

#append the data rows to the list, skipping the header row (it has th cells only, so it would yield an empty row)
for tr in row_objects[1:]:
    toronto_data_row_list.append(get_rows_by_tag(tr, 'td'));

#use pandas to create a dataframe from the list
toronto_pc_df = pd.DataFrame(toronto_data_row_list[1:], columns=toronto_data_row_list[0]);

#delete rows with null Postcode
toronto_pc_df = toronto_pc_df[toronto_pc_df.Postcode.notnull()] 

#delete rows with Borough = 'Not assigned'
toronto_pc_df = toronto_pc_df[toronto_pc_df.Borough != 'Not assigned'] 

#aggregate rows with the same Postcode into a single row:
#keep the first Borough and join the Neighbourhood values into a comma-delimited string
toronto_pc_df = (
    toronto_pc_df.groupby(['Postcode'])
    .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
    .reset_index()
    .reindex(columns=toronto_pc_df.columns)
)

#rename Postcode column to PostalCode as shown in assignment
toronto_pc_df.rename(columns = {'Postcode' : 'PostalCode'}, inplace = True)

#where 'Neighbourhood' is Not assigned, copy the Borough to Neighbourhood
toronto_pc_df.loc[(toronto_pc_df.Neighbourhood=='Not assigned'), 'Neighbourhood'] = toronto_pc_df.Borough

#print the result as a string for inspection
print(toronto_pc_df.to_string())

#display the dataframe in the format specified in the assignment
toronto_pc_df.head(12)


    PostalCode           Borough                                      Neighbourhood
0          M1B       Scarborough                                     Rouge, Malvern
1          M1C       Scarborough             Highland Creek, Rouge Hill, Port Union
2          M1E       Scarborough                  Guildwood, Morningside, West Hill
3          M1G       Scarborough                                             Woburn
4          M1H       Scarborough                                          Cedarbrae
5          M1J       Scarborough                                Scarborough Village
6          M1K       Scarborough        East Birchmount Park, Ionview, Kennedy Park
7          M1L       Scarborough                    Clairlea, Golden Mile, Oakridge
8          M1M       Scarborough    Cliffcrest, Cliffside, Scarborough Village West
9          M1N       Scarborough                        Birch Cliff, Cliffside West
10         M1P       Scarborough  Dorset Park, Scarborough Town Centre, Wexf

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [19]:
#as required by the assignment return the shape of the transformed and cleaned Toronto dataset
toronto_pc_df.shape

(103, 3)
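
The cleaning steps in Part 1 can be checked on a small in-memory sample before running them against the scraped table. The sketch below is illustrative only: the sample rows and the names `sample_df`, `cleaned_df`, and `unassigned` are assumptions, not part of the original notebook. It also fills unassigned neighbourhoods from the Borough before aggregating, which is a variation on the order used above, so that a 'Not assigned' value can never be joined into a comma-delimited string.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as the scraped table.
sample_df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M1B", "Scarborough", "Rouge"],
        ["M1B", "Scarborough", "Malvern"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Drop rows whose Borough is 'Not assigned'.
cleaned_df = sample_df[sample_df["Borough"] != "Not assigned"].copy()

# Where the Neighbourhood is 'Not assigned', copy the Borough into it.
unassigned = cleaned_df["Neighbourhood"] == "Not assigned"
cleaned_df.loc[unassigned, "Neighbourhood"] = cleaned_df.loc[unassigned, "Borough"]

# Collapse to one row per Postcode and rename the key column.
cleaned_df = (
    cleaned_df.groupby("Postcode", as_index=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .rename(columns={"Postcode": "PostalCode"})
)

print(cleaned_df.to_string())
print(cleaned_df.shape)
```

With this sample, the two M1B rows collapse into a single row and the M1A row is dropped, so the result has one row per remaining postal code.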

### Part 2: Get Geocodes for Each Postal Code and Add Them to the Dataset

In [30]:
#repeated attempts to get coordinates with geocoder failed,
#so the coordinates are taken from the csv file instead.

#ingest the data from the csv file into a pandas dataframe
#note that the original Postal Code column name was changed to PostalCode using Excel
#the csv file was downloaded to the development Windows workstation.
toronto_gc_df = pd.read_csv('C:/Users/mike/Desktop/Geospatial_Coordinates.csv', sep = ',')

#merge the original dataframe with the geocode dataframe joining on PostalCode
toronto_pc_gc_df = pd.merge(toronto_pc_df, toronto_gc_df, on='PostalCode')

#print the result for inspection
print(toronto_pc_gc_df.to_string())

#display the dataframe in the format specified in the assignment
toronto_pc_gc_df.head(12)


    PostalCode           Borough                                      Neighbourhood   Latitude  Longitude
0          M1B       Scarborough                                     Rouge, Malvern  43.806686 -79.194353
1          M1C       Scarborough             Highland Creek, Rouge Hill, Port Union  43.784535 -79.160497
2          M1E       Scarborough                  Guildwood, Morningside, West Hill  43.763573 -79.188711
3          M1G       Scarborough                                             Woburn  43.770992 -79.216917
4          M1H       Scarborough                                          Cedarbrae  43.773136 -79.239476
5          M1J       Scarborough                                Scarborough Village  43.744734 -79.239476
6          M1K       Scarborough        East Birchmount Park, Ionview, Kennedy Park  43.727929 -79.262029
7          M1L       Scarborough                    Clairlea, Golden Mile, Oakridge  43.711112 -79.284577
8          M1M       Scarborough    Cliffcrest

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
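
The merge in Part 2 uses the default inner join, which silently drops any postal code that has no match in the csv file. A minimal sketch of one way to surface such rows follows, assuming a left join with the `validate` and `indicator` options of `pd.merge`. The in-memory csv text stands in for the real file, the `M9Z` row is a made-up postal code included only to show the unmatched case, and the names `postal_df`, `geo_df`, `merged_df`, and `unmatched` are hypothetical. The sketch also renames the `Postal Code` column in pandas rather than in Excel.

```python
import io

import pandas as pd

# Two rows copied from the output above, plus a made-up code (M9Z) with no coordinates.
postal_df = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
        "Neighbourhood": [
            "Rouge, Malvern",
            "Highland Creek, Rouge Hill, Port Union",
            "Hypothetical",
        ],
    }
)

# Stand-in for the csv file, using the original 'Postal Code' column name.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""
geo_df = pd.read_csv(io.StringIO(csv_text)).rename(
    columns={"Postal Code": "PostalCode"}
)

# A left join keeps every postal code; validate raises if a key is duplicated
# on either side; indicator adds a _merge column recording where each row matched.
merged_df = pd.merge(
    postal_df,
    geo_df,
    on="PostalCode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# List the postal codes that found no coordinates.
unmatched = merged_df.loc[merged_df["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

If the list of unmatched codes is empty, the left join and the inner join used above return the same rows.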
