# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we will perform the following activities:
  - Data Collection
  - Data Formatting
  - Data Normalization
  - Clustering

## Data Collection

I will collect the data from Wikipedia using the following URL:
  - https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

All postal codes will be considered, except those that are not yet assigned to a borough.

To extract the necessary data from this page, I will need to do some web scraping. I will use Python's beautifulsoup4 package for this.



### Import necessary packages

In [5]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
import geocoder

## Web Scraping

First, retrieve the complete page content.

In [6]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)
# Fail early if the request did not succeed
page.raise_for_status()

## Create a dataframe of coordinates from the given CSV file

At first I tried to use the geocoder API to get the coordinates, but it was unresponsive.

After a couple of failed attempts with the geocoder API, I decided to get the coordinates from the CSV file at the following URL:
  - https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv


Read the CSV from the URL using pandas' read_csv method.

In [9]:
cord_df = pd.read_csv("https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv")
cord_df.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Write a utility function to get the coordinates from the above dataframe

This function takes a Toronto postal code and returns a tuple of latitude and longitude.

In [10]:
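def getCoordinatesFromCSVData(postal_code):
    """Return the (latitude, longitude) tuple for a Toronto postal code."""
    # Select the row(s) of cord_df whose postal code matches
    temp_df = cord_df.loc[cord_df['Postal Code'] == postal_code]
    # Take the coordinates from the first matching row
    latitude = temp_df["Latitude"].values[0]
    longitude = temp_df["Longitude"].values[0]
    return latitude, longitude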
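
The lookup above raises an IndexError when a postal code is missing from the CSV. As a hedged alternative, the sketch below indexes the coordinates by postal code once and returns None for unknown codes. The sample rows are copied from the head() output above; `sample_cord_df`, `coords_by_code` and `get_coordinates` are hypothetical names that are not part of the original notebook.

```python
import pandas as pd

# Sample rows copied from the head() output above
# (assumption: the full CSV has the same three columns)
sample_cord_df = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Index by postal code once so each lookup is a single label access
coords_by_code = sample_cord_df.set_index("Postal Code")


def get_coordinates(postal_code):
    """Hypothetical helper: return (latitude, longitude), or None if the code is unknown."""
    if postal_code not in coords_by_code.index:
        return None
    row = coords_by_code.loc[postal_code]
    return float(row["Latitude"]), float(row["Longitude"])


print(get_coordinates("M1C"))
print(get_coordinates("M9Z"))  # a code that is not in the sample data
```

Returning None leaves it to the caller to decide whether to skip a postal code that has no coordinates or to stop with an error.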

Our utility function getCoordinatesFromCSVData is now ready, so we can start extracting the data to build the required dataframe.

We will use BeautifulSoup's API to parse the HTML and extract the data from the relevant elements.

First, we need to locate the table that holds the postal code information.

In many cells (the content of the span element), the values are not in a standard form:
 - Some postal codes are not assigned to any borough; these need to be ignored
 - Parentheses are not used consistently
 - For some neighborhood names, `<br/>` is present between two words

All of these need to be cleaned up to collect the list of neighborhoods for each postal code.
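
The cleaning rules above can be sketched on plain strings. This is a minimal illustration, not the notebook's implementation: `parse_cell` is a hypothetical helper, the sample inputs are made up, and the sketch assumes the first `<br/>` separates the borough from the neighborhoods. It also treats every later `<br/>` as a space, which simplifies the case-by-case handling used in the notebook.

```python
def parse_cell(raw_html):
    """Hypothetical helper: split one cell's markup into (borough, neighborhoods).

    Returns None for cells whose postal code is not assigned.
    """
    # Assumption: the first <br/> separates the borough from the neighborhood list
    borough, _, rest = raw_html.partition("<br/>")
    borough = borough.strip()
    if borough == "Not assigned":
        return None
    # Drop parentheses and join words that were split by a later <br/>
    rest = rest.replace("(", "").replace(")", "").replace("<br/>", " ")
    # Treat "/" as the separator between neighborhoods and normalize whitespace
    names = [" ".join(part.split()) for part in rest.split("/")]
    return borough, ", ".join(name for name in names if name)


# Made-up sample cells for illustration
print(parse_cell("North York<br/>(Lawrence Manor / Lawrence Heights)"))
print(parse_cell("Not assigned"))
```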

Once the data is cleaned up, for each postal code:
 - Call the getCoordinatesFromCSVData() function to get the latitude and longitude
 - Get the name of the borough
 - Get the list of neighborhoods
 - Create an ordered dictionary for the postal-code-specific data and append it to a list
 
Please note: I used an OrderedDict because, with a plain dictionary, the columns of the target DataFrame did not come out in the desired order.
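
Another way to control the column order, shown here only as a sketch, is to pass an explicit columns list to pd.DataFrame. The record values are copied from the first rows of the result shown later in this notebook; `records` and `column_order` are hypothetical names, and the keys are deliberately written out of order.

```python
import pandas as pd

# Records with keys in an arbitrary order (values copied from the result table)
records = [
    {"Latitude": 43.753259, "Borough": "North York", "PostalCode": "M3A",
     "Neighborhood": "Parkwoods", "Longitude": -79.329656},
    {"Latitude": 43.725882, "Borough": "North York", "PostalCode": "M4A",
     "Neighborhood": "Victoria Village", "Longitude": -79.315572},
]

# The columns argument selects and orders the columns, whatever the key order was
column_order = ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude"]
df = pd.DataFrame(records, columns=column_order)

print(list(df.columns))
```

With this approach the column order is stated in one place instead of depending on how each record was built.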

Finally, convert the whole list to a pandas dataframe to get the desired result.

In [13]:
soup = BeautifulSoup(page.content, 'html.parser')
temp_toronto_data = []

tables = soup.find_all('table', {'rules': 'all'})
for table in tables:
    tbody = table.find('tbody')
    trs = tbody.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        for td in tds:
            para = td.find('p')
            b_tag = para.find('b')
            zipCode = b_tag.get_text()
            span_tag = para.find('span')
            span_tag_text = str(span_tag).replace("<br/>", "####", 1)
            span_tag_text = span_tag_text.replace("<br/>", "|||||")
            #print(span_tag_text)
            span_tag = BeautifulSoup(span_tag_text, 'html.parser')
            
            if "Not assigned" not in span_tag.get_text():
                span_tag_text = span_tag.text
                span_tag_text = span_tag_text.replace('(', '')
                span_tag_text = span_tag_text.replace(')', '')
                span_tag_text = span_tag_text.replace('/', ',')
                info_array = span_tag_text.split("####")
                borough = info_array[0]
                temp_neighbors = info_array[1]
                temp_neighbors = temp_neighbors.replace("Downsview|||||", "Downsview ")
                temp_neighbors = temp_neighbors.replace("Don Mills|||||", "Don Mills ") 
                temp_neighbors = temp_neighbors.replace("Willowdale|||||", "Willowdale ")
                temp_neighbors = temp_neighbors.replace("Northwest|||||Clairville", "Northwest Clairville") 
                temp_neighbors = temp_neighbors.replace("Danforth ||||| East", "Danforth East") 
                temp_neighbors = temp_neighbors.replace("|||||", ',')
                temp_neighbors = temp_neighbors.replace(" ,", ',')
                if temp_neighbors.startswith(","):
                    temp_neighbors = temp_neighbors.replace(",", '', 1).strip()
                #print("PostalCode " + zipCode + " is assigned to Borough : [" + borough + "] Neighborhoods : [" + temp_neighbors + "]")
                latitude, longitude = getCoordinatesFromCSVData(zipCode)
                # Build the record from ordered pairs so the column order is explicit
                temp_toronto_data.append(OrderedDict([
                    ("PostalCode", zipCode), ("Borough", borough), ("Neighborhood", temp_neighbors),
                    ("Latitude", latitude), ("Longitude", longitude)
                ]))

toronto_df = pd.DataFrame(temp_toronto_data)
toronto_df.head(5)

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [14]:
toronto_df.shape

(103, 5)