<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City Project</font></h1>

## Introduction


This worksheet describes the first steps of the project to explore and cluster the neighborhoods of Toronto.

### Page1
1. Scrape the Wikipedia page to get the Canadian postal codes with their boroughs and neighborhoods, using Python's BeautifulSoup library.
2. Transform the retrieved data into a dataframe (or table) under the conditions below:
    1. The dataframe contains three columns: PostalCode, Borough, and Neighborhood.
    2. Only cells that have an assigned borough are processed. Cells whose borough is "Not assigned" are ignored.
    3. More than one neighborhood can exist in one postal code area.
        1. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.
    4. If a cell has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough name.
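
The conditions above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's Page 1 code: the `raw` rows are a hand-made sample (only the M5A neighborhoods come from the example above; the other rows are hypothetical), and the column names simply follow condition 1.

```python
import pandas as pd

# Hand-made sample rows (illustrative assumption, not scraped data).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Condition 2: drop rows whose borough is "Not assigned".
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Condition 4: an unassigned neighborhood takes the borough's name.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# Condition 3: one row per postal code, neighborhoods joined by commas.
postal_df = (
    assigned.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)

print(postal_df)
```

Filling the missing neighborhoods before grouping keeps the join step free of placeholder strings.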

### Page2
1. Use pandas' `read_csv` to get the coordinates of the Canadian postal codes.
2. Merge the previously created dataframe (Canadian postal codes with borough names) with the coordinates.

### Page3
1. Create K-means clusters based on the top 10 venues in each neighbourhood.
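
As a rough sketch of this step, assume each neighbourhood has already been summarised as a row of venue-category frequencies. The `venue_freq` values, the area names, and the category names below are made up for illustration, and the number of clusters (2) is an arbitrary choice. The most common venue categories are ranked per neighbourhood, and the frequency rows are clustered with scikit-learn's `KMeans`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venue-category frequencies per neighbourhood (illustrative only).
venue_freq = pd.DataFrame(
    {
        "Coffee Shop": [0.6, 0.5, 0.0, 0.1],
        "Park": [0.1, 0.1, 0.7, 0.6],
        "Pizza Place": [0.3, 0.4, 0.3, 0.3],
    },
    index=["Area A", "Area B", "Area C", "Area D"],
)


def top_venues(row, n=10):
    """Return up to n category names, most frequent first."""
    return row.sort_values(ascending=False).index[:n].tolist()


# Ranked categories per neighbourhood (at most 10 each).
top10 = {name: top_venues(row) for name, row in venue_freq.iterrows()}

# Cluster the neighbourhoods on their frequency vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venue_freq.values)
clusters = pd.Series(kmeans.labels_, index=venue_freq.index, name="Cluster")

print(top10["Area A"])
print(clusters)
```

The cluster labels themselves are arbitrary integers; only which neighbourhoods share a label is meaningful.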

## Table of Contents

<div class="Page1" style="margin-top: 20px">

<font size = 3>
    <u> Page1 </u>

1.  <a href="#item1">Get Postal Codes of Canada from Wiki</a>

2.  <a href="#item2">Load postal_data into a data frame</a>

3.  <a href="#item3">Loop through the dataset to remove boroughs that are not assigned</a>

4.  <a href="#item4">Display the shape of the valid dataset</a>
    </font>
    </div>


<div class="Page2" style="margin-top: 20px">

<font size = 3>
    <u> Page2 </u>

1.  <a href="#item1">Get Canadian Postal Codes co-ordinates from CSV File</a>

2.  <a href="#item2">Load postal_coordinates_data into a data frame</a>

3.  <a href="#item3">Update the postal_data with co-ordinates</a>
    </font>
    </div>

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0 also exposes this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
import os

print('Libraries imported.')

Libraries imported.


-----------------
<p style="color:red"> <b>  Page 02 </b> </p>
<p style="color:Blue"> <b>    Objective - Get Canadian Postal Codes co-ordinates from CSV File & load into Postal dataframe </b> </p>


<b>Get Canadian Postal Codes co-ordinates from CSV File.</b>

In [10]:
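# Build the path to the coordinates CSV in the current working directory
path = os.getcwd()
filename = "Geospatial_Coordinates.csv"
full_csv_filename = os.path.join(path, filename)

# Read the CSV (columns: Postal Code, Latitude, Longitude) into a dataframe
coordinates_data = pd.read_csv(full_csv_filename)
coordinates_data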

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


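<b>Convert the coordinates dataframe to a NumPy array for row-by-row processing. </b>

In [11]:
# Despite the _df suffix, this is a 2-D NumPy array with one row per CSV line
coordinates_data_df = np.array(coordinates_data)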

<b> Loop through the array to collect each field into its own column list.</b>

In [12]:
row_cnt = 0
# Current values; they persist from one row to the next
new_postal_code = ''
new_latitude = 0
new_longitude = 0

new_entry = {}
postal_code_colm = []
latitude_colm = []
longitude_colm = []
clean_postal_coordinates_data = []

for eachrow in coordinates_data_df:
    row_cnt += 1
    curr_col = 0
    for eachentry in eachrow:
        curr_col += 1
        # Column 1 = postal code, column 2 = latitude, column 3 = longitude
        if (curr_col == 1):
            new_postal_code = eachentry
        elif (curr_col == 2):
            new_latitude = eachentry
        elif (curr_col == 3):
            curr_col = 0
            new_longitude = eachentry

        # NOTE: these appends sit inside the inner loop, so they run once per
        # field. Each CSV row therefore produces three output rows, and the
        # first two still hold latitude/longitude from the previous row.
        new_entry = [new_postal_code, new_latitude, new_longitude]
        clean_postal_coordinates_data.append(new_entry)
        postal_code_colm.append(new_postal_code)
        latitude_colm.append(new_latitude)
        longitude_colm.append(new_longitude)

<b> Display the shape of clean data </b>

In [13]:
column_names = ['PostalCode','Latitude','Longitude']
# Build a dataframe from the list of [postal code, latitude, longitude] rows
clean_postal_coordinates_data_df = pd.DataFrame(clean_postal_coordinates_data,columns=column_names)
print(clean_postal_coordinates_data_df.shape)
# Build an equivalent dataframe from the three column lists
df_2 = pd.DataFrame({'PostalCode': postal_code_colm, 'Latitude' : latitude_colm,'Longitude': longitude_colm})
print(df_2.head())
print(df_2.shape)

(309, 3)
  PostalCode   Latitude  Longitude
0        M1B   0.000000   0.000000
1        M1B  43.806686   0.000000
2        M1B  43.806686 -79.194353
3        M1C  43.806686 -79.194353
4        M1C  43.784535 -79.194353
(309, 3)
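
The row count of 309 is three times the 103 postal codes because the loop appends once per field rather than once per row, and the first two rows of every triple still hold values from the previous postal code. A loop-free alternative is sketched below. It assumes the CSV columns are named `Postal Code`, `Latitude` and `Longitude` as in the preview above, and it uses a three-row inline sample in place of `Geospatial_Coordinates.csv`; the names `coords` and `coords_df` are chosen here for illustration.

```python
import io

import pandas as pd

# Inline stand-in for Geospatial_Coordinates.csv (first rows of the preview above).
sample_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
    "M1E,43.763573,-79.188711\n"
)

coords = pd.read_csv(sample_csv)

# Renaming the key column is enough; no row-by-row loop is needed.
coords_df = coords.rename(columns={"Postal Code": "PostalCode"})

print(coords_df.shape)  # one row per postal code
print(coords_df.head())
```

Keeping the data in a dataframe throughout also preserves the column dtypes that `read_csv` inferred.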


<b> Sort Coordinates dataframe by Postal Code </b>

In [14]:
# Sort both coordinate dataframes by postal code
sorted_df_2 = df_2.sort_values(by ='PostalCode')
print(sorted_df_2.columns)
sorted_clean_postal_coordinates_data_df = clean_postal_coordinates_data_df.sort_values(by ='PostalCode')

Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')


<b> Update the postal_data with co-ordinates. </b>

<b> <u> First </u>, merge (join) the two dataframes on the key "PostalCode". </b>

In [15]:
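# sorted_df_1 comes from Page 1 (postal codes with borough and neighbourhood).
# A left join keeps every row of sorted_df_1 and adds one row per matching row in sorted_df_2.
merged_df = pd.merge(sorted_df_1, sorted_df_2, on='PostalCode', how='left')
print(merged_df)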

    PostalCode           Borough  \
0          M1B       Scarborough   
1          M1B       Scarborough   
2          M1B       Scarborough   
3          M1C       Scarborough   
4          M1C       Scarborough   
5          M1C       Scarborough   
6          M1E       Scarborough   
7          M1E       Scarborough   
8          M1E       Scarborough   
9          M1G       Scarborough   
10         M1G       Scarborough   
11         M1G       Scarborough   
12         M1H       Scarborough   
13         M1H       Scarborough   
14         M1H       Scarborough   
15         M1J       Scarborough   
16         M1J       Scarborough   
17         M1J       Scarborough   
18         M1K       Scarborough   
19         M1K       Scarborough   
20         M1K       Scarborough   
21         M1L       Scarborough   
22         M1L       Scarborough   
23         M1L       Scarborough   
24         M1M       Scarborough   
25         M1M       Scarborough   
26         M1M       Scarbor

In [16]:
# Keep the five columns of interest, in this order
merged_column_names = ['PostalCode','Borough','Neighbourhood','Latitude','Longitude']
merged_df = pd.DataFrame(merged_df,columns=merged_column_names)
print(merged_df.columns)

Index(['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')


<b> Sort the Merged dataframe </b>

In [17]:
# Sort by all five columns; Latitude is sorted in descending order so that,
# within each postal code, the row with the highest latitude comes first
sort_column_names = ['PostalCode','Borough','Neighbourhood','Latitude','Longitude']
sort_ascending=[True, True, True, False, True]

sorted_merged_df = merged_df.sort_values(by = sort_column_names, ascending=sort_ascending)
sorted_merged_df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1B,Scarborough,"Malvern, Rouge",43.806686,0.0
0,M1B,Scarborough,"Malvern, Rouge",0.0,0.0
3,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.806686,-79.194353
4,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.194353
5,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
6,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.784535,-79.160497
8,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
7,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.160497
9,M1G,Scarborough,Woburn,43.770992,-79.216917


<b> Now, remove the duplicate entries (three rows per postal code, carried into the merge by the coordinates loop). </b>

In [18]:
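# Keep only the first row of each (PostalCode, Borough, Neighbourhood) group,
# i.e. the row with the highest latitude after the sort above
subset_column_names = ['PostalCode','Borough','Neighbourhood']
deduped_sorted_merged_df = sorted_merged_df.drop_duplicates(subset = subset_column_names,keep='first')
print(deduped_sorted_merged_df.shape)
deduped_sorted_merged_df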

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
3,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.806686,-79.194353
6,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.784535,-79.160497
9,M1G,Scarborough,Woburn,43.770992,-79.216917
14,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
16,M1J,Scarborough,Scarborough Village,43.773136,-79.239476
19,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.744734,-79.239476
21,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.727929,-79.262029
25,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.284577
28,M1N,Scarborough,"Birch Cliff, Cliffside West",43.716316,-79.239476
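
Because `keep='first'` retains the row with the highest latitude in each group, some postal codes above end up with coordinates carried over from the previous code (M1C, for instance, shows the M1B values). A sketch of a merge that avoids both the duplicates and the de-duplication step is given below, assuming the coordinates frame has exactly one row per postal code. The two small frames are stand-ins built from values shown earlier in this worksheet, and the names `postal_df`, `coords_df` and `merged` are chosen here for illustration. The `validate` argument asks pandas to raise an error if the right-hand key is not unique.

```python
import pandas as pd

# Stand-in for the Page 1 dataframe (values taken from the outputs above).
postal_df = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
    }
)

# Stand-in for the coordinates CSV, one row per postal code.
coords_df = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# Left join on the key; validate raises MergeError if a postal code
# appears more than once in coords_df.
merged = pd.merge(
    postal_df, coords_df, on="PostalCode", how="left", validate="many_to_one"
)

print(merged)
```

With a unique key on the right-hand side, the result has the same number of rows as `postal_df`, so no sorting or `drop_duplicates` pass is needed afterwards.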


<p style="color:red"> <b>  Page 02 - end </b> </p>