# Week 3 Assignment

## Segmenting and Clustering Neighborhoods in Toronto

This notebook has three parts, one for each task in the assignment.

**Part 1** - Loading and cleaning the data  
**Part 2** - Merging the geospatial data (using the CSV approach)  
**Part 3** - Clustering and analysis

---


## Setup

**Import the libraries**

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

**Read the CSV**

In [2]:
week3_data = pd.read_csv('week3_toronto_data.csv')
week3_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


**Print the shape of the data**

In [3]:
# get the Shape
week3_data.shape

(180, 3)

---
## PART 1
This section is for the first part of the assignment question.

### Manage 'Not assigned' values

1. Replace 'Not assigned' with NaN.
1. Drop rows where Borough is NaN.
1. Where Neighborhood is NaN, assign the value of Borough (**Note: there are no rows where only Neighborhood is 'Not assigned'**).
1. Where one postal code has multiple neighborhood rows, merge them into one row with the neighborhoods separated by commas (**Note: the Wiki data already does this, so this step is skipped**).

In [4]:
# replace 'Not assigned' with NaN
week3_data.replace('Not assigned', np.nan, inplace=True)

# drop rows where Borough is NaN
cleaned = week3_data.dropna(subset=['Borough']).reset_index(drop=True)

# check if there are any neighborhoods with NaN
# (NaN never compares equal to itself, so use isna() rather than == np.nan)
total_hood_with_nan = cleaned[cleaned['Neighborhood'].isna()]
print("There are {} neighborhoods with NaN".format(len(total_hood_with_nan)))


# print the cleaned data
cleaned.head()

There are 0 neighborhoods with NaN


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
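
Steps 3 and 4 of the list above are not needed for this dataset, so the notebook does not run them. As a rough sketch of what they could look like, the example below applies all four steps to a small made-up frame. The postal codes, boroughs, and neighborhoods in `demo` are invented for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the assignment data.
demo = pd.DataFrame({
    'Postal Code': ['M1X', 'M1X', 'M2X', 'M3X'],
    'Borough': ['Borough A', 'Borough A', 'Borough B', 'Not assigned'],
    'Neighborhood': ['Hood 1', 'Hood 2', 'Not assigned', 'Not assigned'],
})

# Step 1: mark 'Not assigned' as missing.
demo = demo.replace('Not assigned', np.nan)

# Step 2: drop rows without a borough.
demo = demo.dropna(subset=['Borough'])

# Step 3: a missing neighborhood takes the borough's name.
demo = demo.assign(Neighborhood=demo['Neighborhood'].fillna(demo['Borough']))

# Step 4: one row per postal code, neighborhoods joined with commas.
demo = (
    demo.groupby(['Postal Code', 'Borough'], as_index=False)['Neighborhood']
        .agg(', '.join)
)
print(demo)
```

Using `dropna(subset=['Borough'])` rather than a bare `dropna()` matters here: a bare call would also discard the `M2X` row before step 3 could fill in its neighborhood.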


#### Print the number of rows in the dataframe

In [5]:
print("There are {} rows in the dataframe".format( cleaned.shape[0] ))

There are 103 rows in the dataframe


---
## PART 2
This section is for the second part of the assignment question.

**Read the GeoSpatial Data**

In [6]:
geospatial = pd.read_csv('Geospatial_Coordinates.csv')

geospatial.sort_values(['Postal Code'], ascending=True, inplace=True)
geospatial


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


**Get the shape of the geospatial data**

In [7]:
geospatial.shape

(103, 3)

In [8]:

cleaned_sorted = cleaned.sort_values( ['Postal Code'], ascending=True ).reset_index(drop=True)


# copy the lat/long columns over
cleaned_sorted['Latitude'] = geospatial['Latitude']
cleaned_sorted['Longitude'] = geospatial['Longitude']

cleaned_sorted

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
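
Copying the coordinate columns across works only when both frames hold the same postal codes in the same order with matching index labels. A key-based join does not rely on that. The sketch below shows the idea on two small made-up frames (`hoods` and `coords` are illustrative names, and the values are rounded placeholders); in this notebook the inputs would be `cleaned` and `geospatial`.

```python
import pandas as pd

# Made-up rows for illustration; the real frames come from the two CSV files.
hoods = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})
coords = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],  # deliberately in a different order
    'Latitude': [43.76, 43.81, 43.78],
    'Longitude': [-79.19, -79.19, -79.16],
})

# Join on the shared key; validate raises if a postal code is duplicated
# on either side, which would silently multiply rows otherwise.
merged = hoods.merge(coords, on='Postal Code', how='left', validate='one_to_one')
print(merged)
```

With `how='left'`, any postal code missing from the coordinate file would show up as NaN in `Latitude` and `Longitude` instead of shifting the remaining rows.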


---
## PART 3
This section is for the third part of the assignment question.

### Explanation
The analysis below checks whether K-means groups the Toronto boroughs correctly, namely East, West, Central, and Downtown Toronto.



**Import additional libraries**

In [9]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



print("done")

done


**Get boroughs containing the string "Toronto"**

In [10]:
borough_toronto = cleaned_sorted[ cleaned_sorted['Borough'].str.contains('Toronto')]
borough_toronto.reset_index(drop=True, inplace=True)

# keep only the coordinates; the positional axis argument to drop() was removed in pandas 2.0
borough_toronto_clustering = borough_toronto.drop(columns=['Postal Code', 'Borough', 'Neighborhood'])
borough_toronto_clustering.head()

Unnamed: 0,Latitude,Longitude
0,43.676357,-79.293031
1,43.679557,-79.352188
2,43.668999,-79.315572
3,43.659526,-79.340923
4,43.72802,-79.38879


**Get geo of Toronto**

In [11]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


**Run K-means**

In [25]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(borough_toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       2, 2, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 3, 1], dtype=int32)
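
The label numbers that K-means returns are arbitrary: nothing ties cluster 0 to a particular part of the city. One way to interpret them is to look at the fitted centroids in `cluster_centers_`. The sketch below does this on a handful of made-up coordinates (the `points` array is invented for illustration) and ranks the clusters from west to east by centroid longitude.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (latitude, longitude) points forming two well-separated groups.
points = np.array([
    [43.65, -79.48], [43.66, -79.46], [43.64, -79.45],  # western group
    [43.67, -79.30], [43.68, -79.32], [43.66, -79.31],  # eastern group
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Label numbers are arbitrary, so rank clusters by centroid longitude instead.
west_to_east = np.argsort(km.cluster_centers_[:, 1])
for rank, label in enumerate(west_to_east):
    lat, lon = km.cluster_centers_[label]
    print(f"cluster {label}: centre ({lat:.3f}, {lon:.3f}), west-to-east rank {rank}")
```

The same two lines applied to the fitted `kmeans` object would show which of the four labels sits furthest west or east, assuming the centroids are well separated along the longitude axis.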

**Merge Toronto data with labels**

In [26]:
# work on a real copy so the original frame is left untouched
# (iloc[:] does not guarantee an independent copy)
borough_toronto_new = borough_toronto.copy()

# add clustering labels back to the data
borough_toronto_new.insert(0, 'Cluster Labels', kmeans.labels_)

borough_toronto_new.sort_values('Cluster Labels', ascending=True, inplace=True)
borough_toronto_new

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
36,0,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445
35,0,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325
34,0,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
33,0,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191
32,0,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
31,0,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
30,0,M6G,Downtown Toronto,Christie,43.669542,-79.422564
0,1,M4E,East Toronto,The Beaches,43.676357,-79.293031
38,1,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
1,1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188


**Plot**

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
# note: the markers below index this list with cluster-1, so cluster 0 wraps to the last color (red)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, postal, poi, cluster in zip(borough_toronto_new['Latitude'], borough_toronto_new['Longitude'], borough_toronto_new['Postal Code'], borough_toronto_new['Borough'], borough_toronto_new['Cluster Labels']):
    label = folium.Popup( str(postal) + '(' + str(poi) + '), Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Observations

Some Central, Downtown, and West postal codes have coordinates that lie slightly outside their own borough. Because the clustering uses only latitude and longitude, it groups those postal codes with a neighboring borough.

For example, M6G (Christie, Downtown Toronto; red on the map) is placed in cluster 0 together with the West Toronto postal codes.
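
A cross-tabulation of borough against cluster label is one way to count such mismatches instead of reading them off the map. The sketch below uses a few made-up rows (the `demo` frame is invented for illustration); in this notebook the inputs would be the `Borough` and `Cluster Labels` columns of `borough_toronto_new`.

```python
import pandas as pd

# Made-up labels for illustration; not the notebook's actual results.
demo = pd.DataFrame({
    'Borough': ['West Toronto', 'West Toronto', 'Downtown Toronto',
                'Downtown Toronto', 'East Toronto'],
    'Cluster Labels': [0, 0, 0, 3, 1],
})

# Rows are boroughs, columns are cluster labels, cells are counts.
table = pd.crosstab(demo['Borough'], demo['Cluster Labels'])
print(table)
```

A borough whose row has counts in more than one column has been split across clusters, and a column with counts from more than one borough mixes them.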