### Week 3 Assignment Part 1

General Approach: Use Pandas to load the table at the given URL into a dataframe, then remove entries with no Borough information.

In [1]:
import pandas as pd

# load data from Wikipedia table
df_t = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df_t = df_t[df_t['Borough'] != 'Not assigned']
df_t.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
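
The filtering step can be tried offline on a small hand-made dataframe. This is a minimal sketch only: the `sample` frame and its rows are illustrative stand-ins for the scraped table, and the `reset_index` call is an optional extra that the cell above does not perform.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table; the rows are illustrative only.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Resetting the index is a design choice: without it the surviving rows keep their original labels (2, 3, ...), as in the output above.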


### Week 3 Assignment Part 2

General Approach: Use Pandas to load the geospatial data, then merge it with the previous dataframe on postal code. The Wikipedia table already lists neighborhoods that share a postal code in a single row, so the merge matches one row of coordinates to each postal code.


In [2]:
# load geospatial data
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [3]:
# add coordinate data to main dataframe
df = pd.merge(df_t, df_geo, on='Postal Code')
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
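
The merge behavior can be checked on miniature frames. The sketch below is an assumption-laden illustration: `left` and `right` are hypothetical cut-down versions of `df_t` and `df_geo`, and the coordinates for M5A are deliberately left out to show what an unmatched postal code looks like. It uses the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of df_t and df_geo (M5A intentionally has no coordinates).
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

# Inner join (the default); validate raises MergeError if a postal code repeats on either side.
merged = pd.merge(left, right, on='Postal Code', how='inner', validate='one_to_one')

# Left join with an indicator column to list postal codes that found no coordinates.
check = pd.merge(left, right, on='Postal Code', how='left', indicator=True)
missing = check.loc[check['_merge'] == 'left_only', 'Postal Code'].tolist()

print(merged)
print('Postal codes without coordinates:', missing)
```

Because the default inner join silently drops unmatched rows, a check like `missing` may be worth running once before relying on the merged frame.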


### Week 3 Assignment Part 3
General Approach: Cluster postal codes by their coordinates with k-means, then plot them on a map with each point color coded by its assigned cluster. Repeat for several values of k. Observations are discussed at the end of the notebook.


In [4]:
# install and import Folium
!pip install folium
import folium # map rendering library



In [5]:
from sklearn.cluster import KMeans

# try several values of k to determine the best number of clusters
K = range(5, 9)
COLORS = ['red', 'blue', 'lightgreen', 'purple', 'orange', 'darkgreen', 'darkred', 'black']
models = []
maps = []
for i, k in enumerate(K):
    # fixed random_state so cluster assignments are reproducible between runs
    models.append(KMeans(n_clusters=k, random_state=0).fit(df[['Latitude', 'Longitude']]))
    maps.append(folium.Map(location=[43.6532, -79.3832], zoom_start=11))
    # once k-means has run, loop over points and color code each by its assigned cluster
    point_labels = models[i].labels_
    # pair rows with labels by position, so the loop does not depend on df's index values
    for (_, row), cluster in zip(df.iterrows(), point_labels):
        label = '{}, {}'.format(row['Postal Code'], row['Borough'])
        label = folium.Popup(label, parse_html=True)
        color = COLORS[cluster]
        coords = [row['Latitude'], row['Longitude']]
        folium.CircleMarker(location=coords, radius=5, popup=label, fill=True, color=color, fill_opacity=0.7).add_to(maps[i])

In [10]:
print('k = ' + str(K[0]))
maps[0]

k = 5


In [7]:
print('k = ' + str(K[1]))
maps[1]

k = 6


In [8]:
print('k = ' + str(K[2]))
maps[2]

k = 7


In [9]:
print('k = ' + str(K[3]))
maps[3]

k = 8


### Observations
For every value of k considered, the densely populated Downtown Toronto area is assigned its own cluster. As k increases, the downtown cluster shrinks because its outermost points are reassigned to neighboring clusters. For the larger values of k (7 and 8), the clusters covering the outer areas of the city begin to split, perhaps unnecessarily.
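
One way to put a number on whether extra clusters are "unnecessary" is to compare the inertia (within-cluster sum of squared distances) across values of k and look for the point where it stops dropping sharply. This was not done in the notebook. The sketch below is illustrative only and runs on synthetic points around three made-up centers, not on the Toronto data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points scattered around three made-up centers (not the Toronto data).
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.75, -79.30], [43.70, -79.50]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centers])

# Fit one model per k and record its inertia.
inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print('k = {}: inertia = {:.5f}'.format(k, value))
```

To apply the same idea to the notebook's data, `points` could be swapped for `df[['Latitude', 'Longitude']]` and the range for `K`. Inertia generally keeps falling as k grows, so the useful signal is where the decrease levels off rather than the minimum itself.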