## Question 1

In [16]:
import pandas as pd
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

After importing pandas, we'll use its read_html function to read the tables from the Wikipedia page. It returns them as a list of dataframes. The table we're looking for is the first item in the list.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

df = pd.read_html(url)[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Now that we have the table, let's remove all the rows where Borough has the value of 'Not assigned'. 

In [3]:
df = df[df.Borough != 'Not assigned']
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
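The filter above is a boolean mask: the comparison yields True or False for each row, and indexing with it keeps only the True rows. Here is a minimal sketch on a made-up three-row table (the `toy` dataframe and its rows are hypothetical stand-ins, not the scraped data). The `reset_index` call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-in for the scraped table (hypothetical rows for illustration)
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True for rows whose Borough is assigned
mask = toy['Borough'] != 'Not assigned'
assigned = toy[mask].reset_index(drop=True)  # optional: renumber rows from 0

# 'Not assigned' is a placeholder string, not a null, so isna() would not catch it
remaining = (assigned['Neighbourhood'] == 'Not assigned').sum()

print(assigned)
print(remaining)
```

Without `reset_index`, the surviving rows keep their original index values, which is why the output above starts at 2 and not 0.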


Let's check if there are any neighborhoods remaining with the value of 'Not assigned'.

In [4]:
df[df.Neighbourhood == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


The result is empty, so no rows with a 'Not assigned' neighbourhood remain. Next, let's make sure each postal code appears only once, combining the rows of any postal code that appears more than once.

In [5]:
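# one row per postal code; join neighbourhood names with ', ' (a plain sum() would concatenate strings with no separator)
df = df.groupby('Postal Code', as_index=False).agg({'Borough': 'first', 'Neighbourhood': ', '.join})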
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
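To see what grouping does when a postal code really is duplicated, here is a small sketch on made-up rows (the `toy` dataframe is hypothetical). It assumes we want to keep the first borough and join the neighbourhood names with a comma. Summing string columns would instead glue the names together with no separator.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighbourhoods
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per postal code: keep the first borough, join neighbourhoods with ', '
grouped = toy.groupby('Postal Code', as_index=False).agg(
    {'Borough': 'first', 'Neighbourhood': ', '.join}
)

print(grouped)
# Postal codes are unique after grouping
print(grouped['Postal Code'].is_unique)
```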


Let's take a look at the shape of the resulting dataframe.

In [6]:
df.shape

(103, 3)

So we know there are 103 rows in this table and three columns.

## Question 2

Download the geospatial data and confirm that it was converted to a dataframe successfully.

In [7]:
df_geoloc = pd.read_csv(r'http://cocl.us/Geospatial_data')
df_geoloc

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


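Both dataframes hold one row per postal code, sorted in the same order and sharing the same index, so we can concatenate them column-wise. Note that this only works because the rows line up exactly.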

In [8]:
df = pd.concat([df,df_geoloc['Latitude'],df_geoloc['Longitude']], axis=1)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [9]:
df.shape

(103, 5)

There are 103 rows and 5 columns.
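Concatenating side by side depends on both tables listing the postal codes in the same order. If that could not be guaranteed, a key-based merge would be a safer alternative. The sketch below uses two made-up tables (`neigh` and `geo` are hypothetical names) whose rows are deliberately in different orders. The `validate='one_to_one'` argument makes pandas raise an error if a postal code is duplicated on either side.

```python
import pandas as pd

# Hypothetical neighbourhood table
neigh = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})

# Hypothetical coordinates table, listed in the opposite order
geo = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Match rows by postal code, not by position
merged = neigh.merge(geo, on='Postal Code', how='left', validate='one_to_one')

print(merged)
```

Each postal code ends up with its own coordinates even though the input rows were not aligned.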

## Question 3

Import the libraries for k-means clustering and mapping.

In [10]:
from sklearn.cluster import KMeans
import folium

We're only looking at neighborhoods in Toronto.

In [21]:
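# keep only boroughs whose name contains 'Toronto'; copy() gives an independent dataframe to modify later
df = df[df['Borough'].str.contains('Toronto')].copy()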
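As a quick illustration of the substring filter, here is a sketch on a made-up series of borough names (the `boroughs` series is hypothetical). It assumes a plain substring match is enough, so `regex=False` is passed, and `na=False` treats any missing borough as a non-match.

```python
import pandas as pd

# Hypothetical borough names, including one missing value
boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto', None])

# True where the name contains the literal substring 'Toronto'
is_toronto = boroughs.str.contains('Toronto', regex=False, na=False)

print(boroughs[is_toronto])
```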


Create the k-means model and fit it to the latitude and longitude of the neighborhoods.

In [22]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df[['Latitude','Longitude']])
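To get a feel for what the fitted model exposes, here is a small sketch on six made-up coordinate pairs that form two obvious groups (the points and the choice of two clusters are assumptions for illustration only). `n_init=10` is passed explicitly so the behaviour does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (latitude, longitude) pairs: three points near each of two spots
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.65, -79.37],
    [43.70, -79.45], [43.71, -79.46], [43.70, -79.44],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

labels = model.labels_            # one integer label per input row, starting at 0
centers = model.cluster_centers_  # one (latitude, longitude) centroid per cluster

print(labels)
print(centers)
```

The labels are zero-based and come back in the same order as the input rows, which is what lets them be attached to the dataframe as a new column.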

Add the cluster labels to their corresponding postal codes in the dataframe.

In [23]:
df.insert(0, 'Cluster Labels', kmeans.labels_)


Create the folium map and display the color-coded neighborhood clusters.

In [25]:
latitude = 43.6532
longitude = -79.3832

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per postal code, colored by its cluster label
for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
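The color scheme boils down to sampling one color per cluster from matplotlib's rainbow colormap and converting each to a hex string. Here is a minimal sketch of that step on its own (the names `k`, `palette`, `direct` and `wrapped` are hypothetical). It also shows that Python lists accept negative indices, so an off-by-one index such as `palette[label - 1]` does not raise an error for label 0 but quietly picks the last color.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters (assumed for illustration)

# Sample k evenly spaced colors from the rainbow colormap and convert to hex
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

label = 0
direct = palette[label]       # first color in the palette
wrapped = palette[label - 1]  # index -1 wraps around to the last color

print(palette)
print(direct, wrapped)
```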

### List Clusters

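Dataframes of the identified clusters in Toronto. The cluster labels are zero-based, so Cluster 1 below holds the rows with label 0, Cluster 2 the rows with label 1, and so on.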

#### Cluster 1

In [26]:
df.loc[df['Cluster Labels'] == 0]

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
76,0,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
82,0,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
83,0,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325
84,0,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445


#### Cluster 2

In [27]:
df.loc[df['Cluster Labels'] == 1]

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
50,1,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
51,1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
52,1,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
53,1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
54,1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
55,1,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
56,1,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
57,1,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
58,1,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
59,1,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752


#### Cluster 3

In [28]:
df.loc[df['Cluster Labels'] == 2]

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
44,2,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,2,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,2,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,2,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,2,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,2,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049
63,2,M5N,Central Toronto,Roselawn,43.711695,-79.416936
64,2,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307


#### Cluster 4

In [29]:
df.loc[df['Cluster Labels'] == 3]

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
65,3,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
66,3,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049
67,3,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049
75,3,M6G,Downtown Toronto,Christie,43.669542,-79.422564
77,3,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
78,3,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191


#### Cluster 5

In [30]:
df.loc[df['Cluster Labels'] == 4]

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
37,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,4,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,4,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,4,M4M,East Toronto,Studio District,43.659526,-79.340923
87,4,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
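Instead of one cell per cluster, the same listing could be produced in a single pass with groupby. A rough sketch on a made-up four-row table (the `toy` dataframe is hypothetical), printing each cluster with the one-based numbering used in the headings above:

```python
import pandas as pd

# Hypothetical labelled rows
toy = pd.DataFrame({
    'Cluster Labels': [0, 1, 1, 2],
    'Postal Code': ['M6H', 'M4W', 'M4X', 'M4N'],
})

# Number of postal codes in each cluster
sizes = toy.groupby('Cluster Labels').size()
print(sizes)

# List the postal codes of each cluster, numbering clusters from 1
for label, group in toy.groupby('Cluster Labels'):
    print(f"Cluster {label + 1}: {', '.join(group['Postal Code'])}")
```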
