# Assign_Capstone_W3: Segmenting and Clustering Neighborhoods in Toronto

## Part 1

#### 1.1. Build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below

In [28]:
#Step 1 Importing libraries: 
#1.1. Library for pulling data out of HTML and XML files
from bs4 import BeautifulSoup
#1.2. Library for data analysis (dataframes)
import pandas as pd
#1.3. Library to handle data in a vectorized manner
import numpy as np
#1.4. Library for web scraping
import requests

print('Step 1: Libraries imported.')

Step 1: Libraries imported.


* Parse the Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) with BeautifulSoup:

In [29]:
# Step 2:
#2.1. parsing Wikipedia:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, "lxml")

print('Step 2: Html page was parsed by Soup.')

Step 2: Html page was parsed by Soup.


In [30]:
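# Step 3:
# Locate the postal code table
table = soup.find('table', attrs={ "class" : "wikitable sortable"})

table_rows = table.find_all('tr')

# Collect the <td> cell text of every row; the header row has only <th> cells,
# so its entry is an empty list (skipped later with data[1:])
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
    
print('Step 3: Table rows extracted.')

Step 3: Table rows extracted.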
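
The row loop above can be tried offline on a small inline table. The sketch below is a stand-in that uses the standard library's `html.parser` rather than BeautifulSoup, and the HTML snippet is made up for illustration. It shows that a header row made only of `<th>` cells yields an empty list, which is why the first entry of `data` is skipped when the dataframe is built.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self._in_td = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data     # accumulate text inside the cell


# Made-up sample with the same layout as the Wikipedia table
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
<tr><td>M3A</td><td>North York</td><td> Parkwoods</td></tr>
</table>
"""

parser = TableCells()
parser.feed(sample_html)
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

The first element of `rows` is empty because the header row contains no `<td>` cells.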


* Extract the headers:

In [31]:
headers = [th.text.strip() for th in table.find_all('th')]
print(headers)

['Postcode', 'Borough', 'Neighbourhood']


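* The dataframe will include three columns: Postcode, Borough, and Neighbourhood. The first parsed row is empty because the header row has no td cells, so it is skipped with data[1:].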

In [32]:
df = pd.DataFrame(data[1:], columns=['Postcode', 'Borough', 'Neighbourhood'])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [33]:
df.shape

(287, 3)

* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [34]:
df1 = df[df.Borough != "Not assigned"].reset_index(drop=True)
print(df1[20:30])

   Postcode           Borough       Neighbourhood
20      M1C       Scarborough      Highland Creek
21      M1C       Scarborough          Rouge Hill
22      M1C       Scarborough          Port Union
23      M3C        North York     Flemingdon Park
24      M3C        North York     Don Mills South
25      M4C         East York    Woodbine Heights
26      M5C  Downtown Toronto      St. James Town
27      M6C              York  Humewood-Cedarvale
28      M9C         Etobicoke   Bloordale Gardens
29      M9C         Etobicoke            Eringate


* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [35]:
df_grouped = df1.groupby(["Postcode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
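
To make the combining rule concrete, here is a plain-Python sketch of what the `groupby(...).agg(...)` call computes. It is not the notebook's method, only an illustration on three sample rows taken from the Wikipedia example (M5A with Harbourfront and Regent Park).

```python
from collections import defaultdict

# (Postcode, Borough, Neighbourhood) rows, one neighbourhood per row
rows = [
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M3A", "North York", "Parkwoods"),
]

# Group neighbourhood names under their (Postcode, Borough) key
grouped = defaultdict(list)
for postcode, borough, neighbourhood in rows:
    grouped[(postcode, borough)].append(neighbourhood)

# Join the names with ", " and sort by key, as groupby does by default
combined = [
    (postcode, borough, ", ".join(names))
    for (postcode, borough), names in sorted(grouped.items())
]
print(combined)
```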


* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [36]:
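# Rows yielded by iterrows() are copies, so assigning to them does not change the dataframe;
# write back through a boolean mask instead
not_assigned = df_grouped["Neighbourhood"] == "Not assigned"
df_grouped.loc[not_assigned, "Neighbourhood"] = df_grouped.loc[not_assigned, "Borough"]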
        
df_grouped.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [37]:
df_grouped.shape

(103, 3)

## Part 2: Geocoding

### Import necessary Libraries

2.1. Load the geospatial coordinates CSV:

In [38]:
df_lon_lat = pd.read_csv('Geospatial_Coordinates.csv')
df_lon_lat.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2.2. Rename the first column to Postcode so that it matches the key column of df_grouped:

In [39]:
df_lon_lat.columns=['Postcode','Latitude','Longitude']
df_lon_lat.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2.3. Merge tables to get the latitude and the longitude coordinates of each neighborhood:

In [40]:
Tor_df = df_grouped.merge(df_lon_lat, on='Postcode')
Tor_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
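
`DataFrame.merge` performs an inner join by default, so a postcode missing from either table is dropped without warning. A plain-Python sketch of that behaviour follows; the postcode `M9Z` and its neighbourhood name are made up for illustration and do not come from the data.

```python
# Neighbourhoods keyed by postcode; "M9Z" is a hypothetical entry with no coordinates
neighbourhoods = {
    "M1B": "Rouge, Malvern",
    "M1C": "Highland Creek, Rouge Hill, Port Union",
    "M9Z": "Hypothetical Village",
}
coordinates = {
    "M1B": (43.806686, -79.194353),
    "M1C": (43.784535, -79.160497),
}

# Inner join: keep only postcodes present in both tables
merged = {
    postcode: (name,) + coordinates[postcode]
    for postcode, name in neighbourhoods.items()
    if postcode in coordinates
}

# Postcodes that an inner join would silently drop
unmatched = sorted(set(neighbourhoods) - set(coordinates))
print(len(merged), unmatched)
```

In the notebook, comparing `Tor_df.shape` with `df_grouped.shape` gives the same check.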


2.4. Import additional necessary libraries

In [41]:
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML
    
# function for transforming JSON into a pandas dataframe
try:
    from pandas import json_normalize # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize # older pandas versions

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


2.5. To create an instance of the geocoder we need to define a user_agent. We name our agent foursquare_agent and use the geocoder to find the coordinates of Toronto:

In [42]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


2.6. Define Foursquare Credentials and Version

In [56]:
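import os
# Credentials are read from environment variables so they are not stored in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumption)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
VERSION = '20180604' # Foursquare API version
LIMIT = 30 # maximum number of venues to return
print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials: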


2.7. Create a new dataframe restricted to the eastern boroughs (we plan to relocate to the east of Toronto and want to find the most convenient place to live):

In [44]:
Tor_geo = Tor_df[Tor_df['Borough'].str.contains('East')].reset_index(drop=True)
Tor_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
1,M4C,East York,Woodbine Heights,43.695344,-79.318389
2,M4E,East Toronto,The Beaches,43.676357,-79.293031
3,M4G,East York,Leaside,43.70906,-79.363452
4,M4H,East York,Thorncliffe Park,43.705369,-79.349372


2.8. Create the map of the East district of Toronto

In [45]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=14)

# add markers to map
for lat, lng, label in zip(Tor_geo['Latitude'], Tor_geo['Longitude'], Tor_geo['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

* Explore one of the neighborhoods in the Tor_geo dataframe:

In [46]:
Tor_geo.loc[6, 'Neighbourhood']

'The Danforth West, Riverdale'

* Find the coordinates of this neighbourhood:

In [47]:
# Neighbourhood The Danforth West, Riverdale

Neighbourhood_latitude = Tor_geo.loc[6, 'Latitude'] # neighborhood latitude value
Neighbourhood_longitude = Tor_geo.loc[6, 'Longitude'] # neighborhood longitude value
Neighbourhood_name = Tor_geo.loc[6, 'Neighbourhood'] # neighborhood name

print('Latitude and Longitude values of {} are {}, {}.'.format(Neighbourhood_name, Neighbourhood_latitude, Neighbourhood_longitude))

Latitude and Longitude values of The Danforth West, Riverdale are 43.6795571, -79.352188.


* Find all the bakeries there:

In [48]:
search_query = 'bakery'
radius = 500 # search radius in metres
print(search_query + ' .... OK!')

bakery .... OK!
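
The `radius` is measured in metres from the `ll` point sent to the API. As a sanity check, the haversine sketch below measures how far the neighbourhood coordinates printed above are from the Toronto coordinates found in step 2.5. The haversine formula is a standard great-circle approximation and is not part of the notebook; the mean Earth radius of 6,371 km is an assumption.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


toronto = (43.653963, -79.387207)          # coordinates returned by the geocoder
danforth_west = (43.6795571, -79.352188)   # The Danforth West, Riverdale

distance_m = haversine_m(*toronto, *danforth_west)
print(round(distance_m))
```

The two points are roughly 4 km apart, so a 500 m search around one of them does not cover the other.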


* Define the corresponding URL:

In [49]:
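LIMIT = 20
# Note: latitude and longitude hold the Toronto coordinates from step 2.5, so this search is
# centred on downtown Toronto; pass Neighbourhood_latitude and Neighbourhood_longitude
# to search around The Danforth West, Riverdale instead
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url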

'https://api.foursquare.com/v2/venues/search?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=43.653963,-79.387207&v=20180604&query=bakery&radius=500&limit=20'
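
Formatting the query string by hand works, but the standard library can build and escape it. A minimal sketch with `urllib.parse.urlencode` follows; the helper name `build_search_url` is hypothetical and the credential values are placeholders, not real keys.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Build a Foursquare v2 venue-search URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),   # latitude,longitude of the search centre
        "v": version,
        "query": query,
        "radius": radius,                 # metres
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder credentials for illustration only
example_url = build_search_url("YOUR_ID", "YOUR_SECRET", 43.653963, -79.387207,
                               "20180604", "bakery", 500, 20)
print(example_url)
```

Passing the same dictionary as `params=` to `requests.get` performs this encoding as well, and keeps the credentials out of any string that is echoed in the notebook.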

* Send the GET Request and examine the results:

In [50]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e2440f9c94979001bb6ace3'},
 'response': {'venues': [{'id': '4b9f9b5af964a5206b2e37e3',
    'name': 'Bakery 18',
    'location': {'address': '595 Bay St.',
     'crossStreet': 'at Dundas St. W',
     'lat': 43.656765349412275,
     'lng': -79.38354755464013,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.656765349412275,
       'lng': -79.38354755464013}],
     'distance': 429,
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['595 Bay St. (at Dundas St. W)',
      'Toronto ON',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d16a941735',
      'name': 'Bakery',
      'pluralName': 'Bakeries',
      'shortName': 'Bakery',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1579434244',
    'hasPerk': False},
   {'id': '4eadd3dbe5fad025f84c97f4',
    'name': 'Kin-K

2.9. Get the relevant part of the JSON and transform it into a pandas dataframe:

In [51]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.head(10)

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,location.distance,location.cc,location.city,location.state,location.country,location.formattedAddress,location.postalCode
0,4b9f9b5af964a5206b2e37e3,Bakery 18,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1579434244,False,595 Bay St.,at Dundas St. W,43.656765,-79.383548,"[{'label': 'display', 'lat': 43.65676534941227...",429,CA,Toronto,ON,Canada,"[595 Bay St. (at Dundas St. W), Toronto ON, Ca...",
1,4eadd3dbe5fad025f84c97f4,Kin-Kin Bakery & Bubble Tea,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1579434244,False,595 Bay St.,,43.656198,-79.382403,"[{'label': 'display', 'lat': 43.65619846393864...",460,CA,Toronto,ON,Canada,"[595 Bay St., Toronto ON M5G 2C2, Canada]",M5G 2C2
2,53542f20498eb5c7149ecafd,Bakery & Kaffee Haus,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1579434244,False,238 Queen Street West,,43.650096,-79.390378,"[{'label': 'display', 'lat': 43.6500955307706,...",500,CA,Toronto,ON,Canada,"[238 Queen Street West, Toronto ON M5V 0B5, Ca...",M5V 0B5
3,57462ae4498efddb24245014,Paris Croissant Bakery Cafe,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1579434244,False,595 Bay St,Dundas St West,43.656204,-79.38321,"[{'label': 'display', 'lat': 43.656204, 'lng':...",407,CA,Toronto,ON,Canada,"[595 Bay St (Dundas St West), Toronto ON, Canada]",
4,4afc5b80f964a520e42122e3,Fresh Start Coffee Co.,"[{'id': '4bf58dd8d48988d16d941735', 'name': 'C...",v-1579434244,False,655 Bay St,at Elm St,43.657377,-79.384393,"[{'label': 'display', 'lat': 43.65737716511621...",442,CA,Toronto,ON,Canada,"[655 Bay St (at Elm St), Toronto ON M5G 2K4, C...",M5G 2K4
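
The dotted column names such as `location.lat` come from flattening the nested `location` dictionary of each venue, while list values such as `categories` are left as they are. A simplified sketch of that flattening, written in plain Python rather than with `json_normalize` (the helper `flatten` is hypothetical and the sample record is abbreviated from the response above):

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys; lists are kept as values."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Abbreviated venue record
venue = {
    "name": "Bakery 18",
    "location": {"address": "595 Bay St.", "distance": 429, "city": "Toronto"},
    "categories": [{"name": "Bakery"}],
}

flat_venue = flatten(venue)
print(sorted(flat_venue))
```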


2.10. Define the information of interest and filter the dataframe:

In [52]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so that the assignments below do not raise SettingWithCopyWarning

# function that extracts the category of the venue
def get_category_type(row):
    # fall back to 'venue.categories' when the 'categories' column is absent
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered

Unnamed: 0,name,categories,address,crossStreet,lat,lng,labeledLatLngs,distance,cc,city,state,country,formattedAddress,postalCode,id
0,Bakery 18,Bakery,595 Bay St.,at Dundas St. W,43.656765,-79.383548,"[{'label': 'display', 'lat': 43.65676534941227...",429,CA,Toronto,ON,Canada,"[595 Bay St. (at Dundas St. W), Toronto ON, Ca...",,4b9f9b5af964a5206b2e37e3
1,Kin-Kin Bakery & Bubble Tea,Bakery,595 Bay St.,,43.656198,-79.382403,"[{'label': 'display', 'lat': 43.65619846393864...",460,CA,Toronto,ON,Canada,"[595 Bay St., Toronto ON M5G 2C2, Canada]",M5G 2C2,4eadd3dbe5fad025f84c97f4
2,Bakery & Kaffee Haus,Bakery,238 Queen Street West,,43.650096,-79.390378,"[{'label': 'display', 'lat': 43.6500955307706,...",500,CA,Toronto,ON,Canada,"[238 Queen Street West, Toronto ON M5V 0B5, Ca...",M5V 0B5,53542f20498eb5c7149ecafd
3,Paris Croissant Bakery Cafe,Bakery,595 Bay St,Dundas St West,43.656204,-79.38321,"[{'label': 'display', 'lat': 43.656204, 'lng':...",407,CA,Toronto,ON,Canada,"[595 Bay St (Dundas St West), Toronto ON, Canada]",,57462ae4498efddb24245014
4,Fresh Start Coffee Co.,Café,655 Bay St,at Elm St,43.657377,-79.384393,"[{'label': 'display', 'lat': 43.65737716511621...",442,CA,Toronto,ON,Canada,"[655 Bay St (at Elm St), Toronto ON M5G 2K4, C...",M5G 2K4,4afc5b80f964a520e42122e3
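
The category extraction can also be checked on raw venue dictionaries before any dataframe is built. A small sketch, assuming each venue carries a `categories` list as in the response above; the helper name `first_category_name` and the second sample venue are made up for illustration.

```python
def first_category_name(venue):
    """Return the name of the venue's first category, or None if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


venues_sample = [
    {"name": "Bakery 18", "categories": [{"name": "Bakery", "primary": True}]},
    {"name": "Uncategorised venue (made up)", "categories": []},
]

category_names = [first_category_name(v) for v in venues_sample]
print(category_names)
```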


2.11. Let's visualize the bakeries that are nearby:

In [53]:
dataframe_filtered.name

0                      Bakery 18
1    Kin-Kin Bakery & Bubble Tea
2           Bakery & Kaffee Haus
3    Paris Croissant Bakery Cafe
4         Fresh Start Coffee Co.
Name: name, dtype: object

In [54]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=15) # generate map centred on the search location (Toronto coordinates from step 2.5)

# add a red circle marker to represent the centre of the search
folium.features.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Search centre',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the bakeries as blue circle markers
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

## Thank You for your time!