<h1>Coursera Capstone Project Week 3</h1>

### Scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

### The URL was opened using urllib.request, and the table was scraped using the BeautifulSoup package. Note that the scraped table initially contains the "Not assigned" values. These are removed in the next step.

In [7]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

web = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
web_open = urllib.request.urlopen(web)
parse = BeautifulSoup(web_open, "lxml")
table=parse.find('table') 

A=[]
B=[]
C=[]

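# Walk every table row; only rows with exactly three <td> cells hold data
for row in table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        # Take the first text node of each cell (trailing '\n' is kept for now)
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))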
      
df=pd.DataFrame(A,columns=['Postal code'])
df['Borough']=B
df['Neighborhood']=C

df.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
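
The row-by-row extraction can be tried without network access. The following is a minimal sketch, assuming a small hand-written HTML table (`sample_html` is a hypothetical stand-in, not the live Wikipedia page) and the built-in `html.parser` instead of `lxml`. It uses `get_text(strip=True)`, which drops the trailing newline that the cell above keeps.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (hypothetical sample, not the live page)
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row uses <th>, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

If the text is stripped at this stage, any later comparison against "Not assigned" would need to drop the trailing newline as well.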


### All of the "Not assigned" values were removed and the table was printed again.  

In [8]:
df2=df[df.Borough != 'Not assigned\n'] #Remove "Not assigned" values

df2 = df2.reset_index(drop=True) #Reset the index to start at 0 after the rows are removed

df2.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"
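
The filter above matches the literal string with its trailing newline. A possible alternative, sketched here on a hypothetical three-row sample that mimics the scraped cells, is to strip whitespace first so the comparison does not depend on how the page was parsed.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped cells, trailing newlines included
raw = pd.DataFrame({
    "Postal code": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighborhood": ["\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip surrounding whitespace from every column before filtering
clean = raw.apply(lambda col: col.str.strip())

# The comparison no longer depends on a trailing "\n"
assigned = clean[clean["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```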


### The .csv file was used to get the latitudes and longitudes.

### Import the .csv file into a pandas dataframe.

In [9]:
df3 = pd.read_csv("http://cocl.us/Geospatial_data")
df3.dtypes

df3.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
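
Before merging, it can help to confirm that the coordinates parsed as numbers and that each postal code appears only once. The sketch below assumes two rows copied from the output above and reads them from an in-memory string (`csv_text` is an illustrative name) rather than from the URL.

```python
import io
import pandas as pd

# Two rows copied from the output above, held in memory instead of downloaded
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Coordinates should parse as floats, and each postal code should appear once
print(coords.dtypes)
print(coords["Postal Code"].is_unique)
```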


### Merge the dataframe containing the Boroughs and Neighborhoods with the dataframe containing the latitudes and longitudes on Postal Code.

In [27]:
df2 = df2.replace('\n','', regex=True)
df2.rename(columns={'Postal code': 'Postal Code'}, inplace=True) #Make sure merge column names are identical 

fdf=df2.merge(df3, on=['Postal Code'], how='outer')

fdf.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,Parkview Hill / Woodbine Gardens,43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
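
An outer merge keeps rows that exist in only one of the two dataframes, filling the missing side with NaN. One way to check for such rows is the `indicator` option of `merge`. The following is a small sketch on sample data; the postal code "M9Z" and its coordinates are made up purely to force a mismatch.

```python
import pandas as pd

# Sample rows; "M9Z" is a made-up code used only to force a mismatch
boroughs = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Latitude": [43.753259, 43.725882, 43.0],
    "Longitude": [-79.329656, -79.315572, -79.0],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = boroughs.merge(coords, on="Postal Code", how="outer", indicator=True)

# Rows that did not find a partner in the other dataframe
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Postal Code", "_merge"]])
```

An empty `unmatched` frame would suggest that every postal code was found on both sides.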


### The .shape attribute was used to print the number of rows and columns in the dataframe.

In [28]:
fdf.shape

(103, 5)

### Explore and cluster the neighborhoods in Toronto.

### Install and import the packages needed for clustering and generating maps.

In [3]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1


In [49]:
import folium
import numpy as np 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib
from sklearn.cluster import KMeans

### Generate an initial map showing where all of the neighborhoods are located.

In [76]:
g = folium.Figure(width=1000, height=500)

latitude = 43.666667 #latitude of Toronto
longitude = -79.416667 #Longitude of Toronto

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)


# add markers to map
for lat, lng, borough, neighborhood in zip(fdf['Latitude'], fdf['Longitude'], fdf['Borough'], fdf['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
  


map_toronto.add_to(g)
display(map_toronto)

### Group all of the data by neighborhood in preparation to cluster the data.

In [29]:
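# Average only the numeric coordinate columns; text columns cannot be averaged
toronto_grouped = fdf.groupby('Neighborhood')[['Latitude', 'Longitude']].mean().reset_index()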
toronto_grouped

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Agincourt,43.794200,-79.262029
1,Alderwood / Long Branch,43.602414,-79.543484
2,Bathurst Manor / Wilson Heights / Downsview North,43.754328,-79.442259
3,Bayview Village,43.786947,-79.385975
4,Bedford Park / Lawrence Manor East,43.733283,-79.419750
5,Berczy Park,43.644771,-79.373306
6,Birch Cliff / Cliffside West,43.692657,-79.264848
7,Brockton / Parkdale Village / Exhibition Place,43.636847,-79.428191
8,Business reply mail Processing CentrE,43.662744,-79.321558
9,CN Tower / King and Spadina / Railway Lands / ...,43.628947,-79.394420
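
Grouping matters when the same neighborhood name appears under more than one postal code, because the mean then yields a single coordinate pair per name. A minimal sketch with entirely hypothetical rows (the codes, boroughs, and names below are placeholders):

```python
import pandas as pd

# Hypothetical rows: the same neighborhood name under two postal codes
sample = pd.DataFrame({
    "Postal Code": ["X1A", "X1B", "X2A"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighborhood": ["Shared Name", "Shared Name", "Other Name"],
    "Latitude": [43.0, 44.0, 43.5],
    "Longitude": [-79.0, -80.0, -79.5],
})

# Select the numeric columns explicitly so text columns are never averaged
grouped = (
    sample.groupby("Neighborhood")[["Latitude", "Longitude"]]
    .mean()
    .reset_index()
)
print(grouped)
```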


### Cluster all of the data.

In [30]:
# set number of clusters
kclusters = 5

# keep only the numeric coordinate columns for clustering
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 4, 4, 4, 1, 3, 1, 3, 1], dtype=int32)
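
Because the only features are latitude and longitude, k-means here groups neighborhoods purely by geographic proximity. The sketch below illustrates this on six hypothetical points that form two well-separated groups; `n_init` is passed explicitly as an assumption about which scikit-learn release is installed, since its default has changed over time.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates forming two well-separated groups
points = np.array([
    [43.60, -79.55], [43.61, -79.54], [43.62, -79.56],
    [43.80, -79.20], [43.81, -79.21], [43.79, -79.19],
])

# n_init is set explicitly because its default differs between scikit-learn releases
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)
labels = model.labels_

print(labels)                  # one label per input row
print(model.cluster_centers_)  # one (latitude, longitude) center per cluster
```

The label numbers themselves are arbitrary; only the grouping they describe carries meaning.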

### Add the cluster labels to the dataframe in preparation for mapping the k-means clusters.

In [35]:
# add clustering labels
toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_grouped.head()

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude
0,0,Agincourt,43.7942,-79.262029
1,2,Alderwood / Long Branch,43.602414,-79.543484
2,4,Bathurst Manor / Wilson Heights / Downsview North,43.754328,-79.442259
3,4,Bayview Village,43.786947,-79.385975
4,4,Bedford Park / Lawrence Manor East,43.733283,-79.41975


### Generate a map of the clusters produced by k-means clustering.

In [75]:
f = folium.Figure(width=1000, height=500)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_grouped['Latitude'], toronto_grouped['Longitude'], toronto_grouped['Neighborhood'], toronto_grouped['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters.add_to(f)
map_clusters
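
The color scheme in the cell above amounts to sampling the rainbow colormap at one evenly spaced position per cluster. A shorter sketch of the same idea follows; it assumes `colors.to_hex` in place of `colors.rgb2hex`, and the name `palette` is illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample the colormap at kclusters evenly spaced positions in [0, 1]
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Index by the label itself so cluster 0 takes the first color
for cluster in range(kclusters):
    print(cluster, palette[cluster])
```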

### Examine the Clusters

In [37]:
toronto_grouped.loc[toronto_grouped['Cluster Labels'] == 0, ['Neighborhood']]

Unnamed: 0,Neighborhood
0,Agincourt
12,Cedarbrae
34,Guildwood / Morningside / West Hill
50,Malvern / Rouge
51,Milliken / Agincourt North / Steeles East / L'...
68,Rouge Hill / Port Union / Highland Creek
71,Scarborough Village
86,Upper Rouge
94,Woburn


In [38]:
toronto_grouped.loc[toronto_grouped['Cluster Labels'] == 1, ['Neighborhood']]

Unnamed: 0,Neighborhood
5,Berczy Park
7,Brockton / Parkdale Village / Exhibition Place
9,CN Tower / King and Spadina / Railway Lands / ...
13,Central Bay Street
14,Christie
15,Church and Wellesley
18,Commerce Court / Victoria Hotel
25,Dufferin / Dovercourt Village
29,First Canadian Place / Underground city
31,"Garden District, Ryerson"


In [39]:
toronto_grouped.loc[toronto_grouped['Cluster Labels'] == 2, ['Neighborhood']]

Unnamed: 0,Neighborhood
1,Alderwood / Long Branch
11,Canada Post Gateway Processing Centre
21,Del Ray / Mount Dennis / Keelsdale and Silvert...
24,Downsview
27,Eringate / Bloordale Gardens / Old Burnhamthor...
36,High Park / The Junction South
38,Humber Summit
39,Humberlea / Emery
42,Islington Avenue
45,Kingsview Village / St. Phillips / Martin Grov...


In [40]:
toronto_grouped.loc[toronto_grouped['Cluster Labels'] == 3, ['Neighborhood']]

Unnamed: 0,Neighborhood
6,Birch Cliff / Cliffside West
8,Business reply mail Processing CentrE
16,Clarks Corners / Tam O'Shanter / Sullivan
17,Cliffside / Cliffcrest / Scarborough Village West
22,Don Mills
23,Dorset Park / Wexford Heights / Scarborough To...
26,East Toronto
28,Fairview / Henry Farm / Oriole
33,Golden Mile / Clairlea / Oakridge
41,India Bazaar / The Beaches West


In [41]:
toronto_grouped.loc[toronto_grouped['Cluster Labels'] == 4, ['Neighborhood']]

Unnamed: 0,Neighborhood
2,Bathurst Manor / Wilson Heights / Downsview North
3,Bayview Village
4,Bedford Park / Lawrence Manor East
10,Caledonia-Fairbanks
19,Davisville
20,Davisville North
30,Forest Hill North & West
32,Glencairn
37,Hillcrest Village
40,Humewood-Cedarvale
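
The five cells above could also be condensed into a single summary. The following is a rough sketch using a handful of labelled rows taken from the outputs above (`sample`, `sizes`, and `members` are illustrative names); the same two calls should work on the full `toronto_grouped` dataframe.

```python
import pandas as pd

# A few labelled rows taken from the outputs above
sample = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 2],
    "Neighborhood": ["Agincourt", "Cedarbrae", "Berczy Park", "Christie",
                     "Alderwood / Long Branch"],
})

# Count the neighborhoods in each cluster
sizes = sample["Cluster Labels"].value_counts().sort_index()
print(sizes)

# Collect the neighborhood names per cluster in a single pass
members = sample.groupby("Cluster Labels")["Neighborhood"].apply(list)
print(members.loc[0])
```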
