# Assign_Capstone_W3: Segmenting and Clustering Neighborhoods in Toronto

## Part 1

#### 1.1. Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, and transform that table into a pandas dataframe

In [1]:
# Step 1: Import libraries
# 1.1. Library for pulling data out of HTML and XML files
from bs4 import BeautifulSoup
# 1.2. Library for data analysis with dataframes
import pandas as pd
# 1.3. Library to handle data in a vectorized manner
import numpy as np
# 1.4. Library for sending HTTP requests
import requests

print('Step 1: Libraries imported.')

Step 1: Libraries imported.


* Parse the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M:

In [2]:
# Step 2:
#2.1. parsing Wikipedia:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, "lxml")

print('Step 2: Html page was parsed by Soup.')

Step 2: Html page was parsed by Soup.


In [3]:
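# Step 3:
# Locate the table of postal codes
table = soup.find('table', attrs={"class": "wikitable sortable"})

table_rows = table.find_all('tr')

# Collect the text of every td cell, row by row.
# The header row contains only th cells, so it yields an empty list.
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])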
    
print('Step 3: ')

Step 3: 


* Extract the headers:

In [4]:
headers = [th.text.strip() for th in table.find_all('th')]
print(headers)

['Postcode', 'Borough', 'Neighbourhood']


* The dataframe will include three columns: Postcode, Borough, and Neighbourhood. The first entry of data is the empty header row, so it is skipped:

In [5]:
df = pd.DataFrame(data[1:], columns=['Postcode', 'Borough', 'Neighbourhood'])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [6]:
df.shape

(287, 3)

* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [7]:
df1 = df[df.Borough != "Not assigned"].reset_index(drop=True)
print(df1[20:30])

   Postcode           Borough       Neighbourhood
20      M1C       Scarborough      Highland Creek
21      M1C       Scarborough          Rouge Hill
22      M1C       Scarborough          Port Union
23      M3C        North York     Flemingdon Park
24      M3C        North York     Don Mills South
25      M4C         East York    Woodbine Heights
26      M5C  Downtown Toronto      St. James Town
27      M6C              York  Humewood-Cedarvale
28      M9C         Etobicoke   Bloordale Gardens
29      M9C         Etobicoke            Eringate


* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.

In [8]:
df_grouped = df1.groupby(["Postcode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
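
The first five rows above do not include the M5A example from the text. As a minimal sketch on a hand-made three-row frame (toy data assumed for illustration, not the scraped table), the same kind of `groupby` aggregation merges the two M5A rows like this:

```python
import pandas as pd

# Toy rows, assumed for illustration (names taken from the example in the text)
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Victoria Village"],
})

# One row per (Postcode, Borough) pair, with the neighbourhood names
# joined by ", "
toy_grouped = (
    toy.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
       .agg(", ".join)
)
print(toy_grouped)
```

Grouping on both Postcode and Borough keeps the borough as a regular column instead of trying to join it as text.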


* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [9]:
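# Rows yielded by iterrows() are copies, so assigning to them does not update
# the dataframe; use a boolean mask with .loc instead
unassigned = df_grouped["Neighbourhood"] == "Not assigned"
df_grouped.loc[unassigned, "Neighbourhood"] = df_grouped.loc[unassigned, "Borough"]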
        
df_grouped.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
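
None of the ten rows above had an unassigned neighbourhood, so the rule is not visible in that output. As a quick check on toy data (a Queen's Park row like the one described in the text, plus one ordinary row; both assumed for illustration), a boolean mask with `.loc` writes the borough name into the unassigned cells. The pandas documentation advises against modifying the rows yielded by `iterrows()`, which is why this sketch uses a mask:

```python
import pandas as pd

# Toy rows, assumed for illustration
toy = pd.DataFrame({
    "Postcode": ["M7A", "M4A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Victoria Village"],
})

# Select the rows whose neighbourhood is missing and copy the borough over
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```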


* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [10]:
df_grouped.shape

(103, 3)

## Part 2: Geocoding

### Import necessary Libraries

2.1. Load the CSV file of geospatial coordinates:

In [11]:
df_lon_lat = pd.read_csv('Geospatial_Coordinates.csv')
df_lon_lat.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2.2. Rename the first column to Postcode so that it matches the scraped table:

In [12]:
df_lon_lat.columns=['Postcode','Latitude','Longitude']
df_lon_lat.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2.3. Merge tables to get the latitude and the longitude coordinates of each neighborhood:

In [13]:
Tor_df = df_grouped.merge(df_lon_lat, on='Postcode')
Tor_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
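
`DataFrame.merge` performs an inner join by default, so a postcode with no matching coordinates would be dropped without any warning. A minimal sketch of how this could be checked, using toy frames (the `M9Z` row is a made-up postcode added only to trigger the case):

```python
import pandas as pd

# Toy frames, assumed for illustration; "M9Z" is a made-up postcode
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every postcode; indicator=True adds a "_merge" column
# that says whether a match was found in the coordinates table
checked = left.merge(coords, on="Postcode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list would confirm that the inner join lost no rows.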


2.4. Import additional necessary libraries

In [14]:
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

# function that transforms JSON into a pandas dataframe
# (exposed at the top level of pandas since version 1.0)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


2.5. To create an instance of the geocoder, we need to define a user_agent. We will name our agent foursquare_agent and use the geocoder to find the coordinates of Toronto:

In [15]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


2.6. Define Foursquare Credentials and Version

In [65]:
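CLIENT_ID = 'your-foursquare-id'  # your Foursquare ID (placeholder)
CLIENT_SECRET = 'your-foursquare-secret'  # your Foursquare Secret (placeholder)
VERSION = '20180604'  # Foursquare API version
LIMIT = 30  # maximum number of venues per request
print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials: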
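
One way to keep the credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not something Foursquare or the notebook prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    # The two variable names below are hypothetical; pick any names you like
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

# Usage with a plain dict standing in for the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("Loaded credentials for client", demo_id)
```

In the notebook, the call would be made without an argument so that the real environment is used.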


2.7. Create a new dataframe (we are planning to relocate to the east of Toronto and want to find the most convenient place to live):

In [28]:
Tor_geo = Tor_df[Tor_df['Borough'].str.contains('East')].reset_index(drop=True)
Tor_geo.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
1,M4C,East York,Woodbine Heights,43.695344,-79.318389
2,M4E,East Toronto,The Beaches,43.676357,-79.293031
3,M4G,East York,Leaside,43.70906,-79.363452
4,M4H,East York,Thorncliffe Park,43.705369,-79.349372
5,M4J,East York,East Toronto,43.685347,-79.338106
6,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
7,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
8,M4M,East Toronto,Studio District,43.659526,-79.340923
9,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558


2.8. Create the map of the East district of Toronto

In [29]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, label in zip(Tor_geo['Latitude'], Tor_geo['Longitude'], Tor_geo['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

* Explore one of the neighbourhoods in the Tor_geo dataframe:

In [19]:
Tor_geo.loc[6, 'Neighbourhood']

'The Danforth West, Riverdale'

* Find the coordinates of this neighbourhood:

In [20]:
# Neighbourhood The Danforth West, Riverdale

Neighbourhood_latitude = Tor_geo.loc[6, 'Latitude'] # neighborhood latitude value
Neighbourhood_longitude = Tor_geo.loc[6, 'Longitude'] # neighborhood longitude value
Neighbourhood_name = Tor_geo.loc[6, 'Neighbourhood'] # neighborhood name

print('Latitude and Longitude values of {} are {}, {}.'.format(Neighbourhood_name, Neighbourhood_latitude, Neighbourhood_longitude))

Latitude and Longitude values of The Danforth West, Riverdale are 43.6795571, -79.352188.


* Define the corresponding URL:

In [21]:
LIMIT = 100
radius = 500

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, Neighbourhood_latitude, Neighbourhood_longitude, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&v=20180604&ll=43.6795571,-79.352188&radius=500&limit=100'
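
The query string can also be assembled from a dictionary with the standard library, which makes each parameter easier to read. This is a sketch under the assumption that the endpoint and parameter names are exactly those in the URL above; `build_explore_url` is a hypothetical helper name, and the credentials in the usage line are dummies.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

# Usage with dummy credentials
demo_url = build_explore_url("ID", "SECRET", "20180604", 43.6795571, -79.352188, 500, 100)
print(demo_url)
```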

* Send the GET Request and examine the results:

In [22]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e24ce16b4b684001b10a838'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Greektown',
  'headerFullLocation': 'Greektown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 43,
  'suggestedBounds': {'ne': {'lat': 43.6840571045, 'lng': -79.34597738331301},
   'sw': {'lat': 43.675057095499994, 'lng': -79.35839861668698}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bce4183ef10952197da8386',
       'name': 'Pantheon',
       'location': {'address': '407 Danforth Ave.',
        'crossStreet': 'at Chester Ave.',
        'lat': 43.67762124481265,
        'lng': -79.35143390043564,
        'labeledLatLngs': [{'label': 'di

In [23]:
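# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories', depending on
    # how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']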

2.9. Get the relevant part of the JSON and transform it into a pandas dataframe:

In [24]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Pantheon,Greek Restaurant,43.677621,-79.351434
1,MenEssentials,Cosmetics Shop,43.67782,-79.351265
2,Cafe Fiorentina,Italian Restaurant,43.677743,-79.350115
3,Mezes,Greek Restaurant,43.677962,-79.350196
4,Dolce Gelato,Ice Cream Shop,43.677773,-79.351187


In [25]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

43 venues were returned by Foursquare.


2.10. Explore the neighborhoods: create a function that repeats the same process for every postcode in the dataframe:

In [30]:
# function to construct the dataframe with all the venues (max 100 venues per postal code)
def get_all_venues(postcodes, latitudes, longitudes, radius=500, limit=100):
    venues_list = []
    for postcode, lat, lng in zip(postcodes, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            postcode, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'], 
            v['venue']['categories'][0]['name'])
            for v in results])

    # flatten the per-postcode lists into a single dataframe
    all_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    all_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue Category'
                  ]

    return all_venues
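
`get_category_type` above returns `None` when a venue has no categories. The same guard can be applied when flattening the response items for many postcodes. The helper below is a hypothetical sketch (the name `venue_rows` and the second sample venue are made up for illustration), and it works on plain dictionaries shaped like the `items` list shown earlier:

```python
def venue_rows(items, postcode, lat, lng):
    """Flatten Foursquare 'items' into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else None
        rows.append((
            postcode, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Sample items: the first mirrors the response above, the second is made up
sample_items = [
    {"venue": {"name": "Pantheon",
               "location": {"lat": 43.677621, "lng": -79.351434},
               "categories": [{"name": "Greek Restaurant"}]}},
    {"venue": {"name": "Uncategorised spot",
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": []}},
]
rows = venue_rows(sample_items, "M4K", 43.679557, -79.352188)
print(rows)
```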

In [31]:
all_venues = get_all_venues(Tor_geo['Postcode'], Tor_geo['Latitude'], Tor_geo['Longitude'])

print('The total number of venues returned is ', all_venues.shape[0])

all_venues.head(10)

The total number of venues returned is  203


Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4B,43.706397,-79.309937,Jawny Bakers,43.705783,-79.312913,Gastropub
1,M4B,43.706397,-79.309937,East York Gymnastics,43.710654,-79.309279,Gym / Fitness Center
2,M4B,43.706397,-79.309937,Shoppers Drug Mart,43.705933,-79.312825,Pharmacy
3,M4B,43.706397,-79.309937,TD Canada Trust,43.70574,-79.31227,Bank
4,M4B,43.706397,-79.309937,Pizza Pizza,43.705159,-79.31313,Pizza Place
5,M4B,43.706397,-79.309937,Harvey's,43.71073,-79.308838,Fast Food Restaurant
6,M4B,43.706397,-79.309937,East York Animal Clinic,43.705921,-79.312196,Pet Store
7,M4B,43.706397,-79.309937,St. Clair Ave E & O'Connor Dr,43.705233,-79.313274,Intersection
8,M4B,43.706397,-79.309937,Venice Pizza,43.705921,-79.313957,Pizza Place
9,M4B,43.706397,-79.309937,91 Woodbine Bus (south),43.707646,-79.313808,Bus Line


In [32]:
all_venues.groupby('Postcode').count()

Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4B,12,12,12,12,12,12
M4C,9,9,9,9,9,9
M4E,6,6,6,6,6,6
M4G,33,33,33,33,33,33
M4H,18,18,18,18,18,18
M4J,4,4,4,4,4,4
M4K,43,43,43,43,43,43
M4L,18,18,18,18,18,18
M4M,42,42,42,42,42,42
M7Y,18,18,18,18,18,18


In [33]:
all_venues.groupby('Venue Category').count()

Venue Category,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude
American Restaurant,3,3,3,3,3,3
Athletics & Sports,1,1,1,1,1,1
Auto Workshop,1,1,1,1,1,1
Bagel Shop,1,1,1,1,1,1
Bakery,3,3,3,3,3,3
...,...,...,...,...,...,...
Trail,2,2,2,2,2,2
Video Store,1,1,1,1,1,1
Warehouse Store,1,1,1,1,1,1
Wine Bar,1,1,1,1,1,1


In [36]:
# Unique categories from all the returned venues

print('There are {} unique categories.'.format(len(all_venues['Venue Category'].unique())))

There are 89 unique categories.


* Analyze each postcode in East Toronto:

In [38]:
# one hot encoding
# (note: despite the DowntownToronto_ prefix, these dataframes hold the East Toronto venues)
DowntownToronto_onehot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")

# add the Postcode column back to the dataframe
DowntownToronto_onehot['Postcode'] = all_venues['Postcode'] 

# move the Postcode column to the first position
fixed_columns = [DowntownToronto_onehot.columns[-1]] + list(DowntownToronto_onehot.columns[:-1])
DowntownToronto_onehot = DowntownToronto_onehot[fixed_columns]

DowntownToronto_onehot.head()

Unnamed: 0,Postcode,American Restaurant,Athletics & Sports,Auto Workshop,Bagel Shop,Bakery,Bank,Bar,Beer Store,Bike Shop,...,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Thrift / Vintage Store,Trail,Video Store,Warehouse Store,Wine Bar,Yoga Studio
0,M4B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M4B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4B,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


* Dataframe size:

In [39]:
DowntownToronto_onehot.shape

(203, 90)

* Group the rows by postcode and take the mean of each category column, which gives the frequency of occurrence of each category within the postcode:

In [42]:
DowntownToronto_grouped = DowntownToronto_onehot.groupby('Postcode').mean().reset_index()
DowntownToronto_grouped

Unnamed: 0,Postcode,American Restaurant,Athletics & Sports,Auto Workshop,Bagel Shop,Bakery,Bank,Bar,Beer Store,Bike Shop,...,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Thrift / Vintage Store,Trail,Video Store,Warehouse Store,Wine Bar,Yoga Studio
0,M4B,0.0,0.083333,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0
2,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0
3,M4G,0.0,0.0,0.0,0.030303,0.0,0.030303,0.0,0.030303,0.030303,...,0.0,0.030303,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M4H,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,...,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.055556
5,M4J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4K,0.023256,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.023256
7,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4M,0.047619,0.0,0.0,0.0,0.047619,0.02381,0.02381,0.0,0.0,...,0.0,0.0,0.0,0.02381,0.02381,0.0,0.0,0.0,0.02381,0.02381
9,M7Y,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556
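
To see why the mean of a one-hot column is a frequency, here is a minimal sketch on six made-up venue rows (toy data, not the Foursquare results). Each row has exactly one 1, so the per-postcode means sum to 1:

```python
import pandas as pd

# Toy venue list, assumed for illustration
toy_venues = pd.DataFrame({
    "Postcode": ["M4J", "M4J", "M4J", "M4J", "M4E", "M4E"],
    "Venue Category": ["Park", "Park", "Coffee Shop", "Convenience Store", "Pub", "Trail"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Postcode", toy_venues["Postcode"])

# Mean of each column within a postcode = share of venues in that category
freq = onehot.groupby("Postcode").mean()
print(freq)
```

In this toy frame, two of the four M4J venues are parks, so the Park frequency for M4J is 0.5.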


* Size of the new grouped dataframe:

In [43]:
DowntownToronto_grouped.shape

(10, 90)

* Print each postcode together with its top 5 most common venue categories:

In [44]:
num_top_venues = 5

for hood in DowntownToronto_grouped['Postcode']:
    print("----"+hood+"----")
    temp = DowntownToronto_grouped[DowntownToronto_grouped['Postcode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4B----
                  venue  freq
0           Pizza Place  0.17
1  Fast Food Restaurant  0.17
2          Intersection  0.08
3             Pet Store  0.08
4              Bus Line  0.08


----M4C----
         venue  freq
0     Pharmacy  0.11
1          Spa  0.11
2  Curling Ice  0.11
3  Video Store  0.11
4         Park  0.11


----M4E----
                  venue  freq
0                   Pub  0.17
1                  Park  0.17
2     Health Food Store  0.17
3                 Trail  0.17
4  Other Great Outdoors  0.17


----M4G----
                    venue  freq
0             Coffee Shop  0.12
1     Sporting Goods Shop  0.09
2  Furniture / Home Store  0.06
3            Burger Joint  0.06
4               Pet Store  0.03


----M4H----
                  venue  freq
0          Burger Joint  0.11
1     Indian Restaurant  0.11
2           Yoga Studio  0.06
3  Gym / Fitness Center  0.06
4           Gas Station  0.06


----M4J----
               venue  freq
0               Park  0.50
1     

* Define a function that returns the venue categories of a row sorted in descending order of frequency:

In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
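
A small usage sketch of this function on a made-up row (the frequencies are assumed for illustration). The function is repeated here only so that the block runs on its own:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the Postcode) and sort the remaining frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: a postcode followed by category frequencies
toy_row = pd.Series({"Postcode": "M4J", "Coffee Shop": 0.25, "Park": 0.5, "Pub": 0.0})
top_two = return_most_common_venues(toy_row, 2)
print(top_two)
```

Categories with equal frequency come back in whatever order the sort leaves them, so the ranking among ties should not be read as meaningful.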

* Create the new dataframe and display the top 10 venue categories for each postcode:

In [47]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postcode'] = DowntownToronto_grouped['Postcode']

for ind in np.arange(DowntownToronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(DowntownToronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4B,Pizza Place,Fast Food Restaurant,Pet Store,Gym / Fitness Center,Pharmacy,Bus Line,Intersection,Bank,Athletics & Sports,Gastropub
1,M4C,Spa,Video Store,Park,Pharmacy,Bus Stop,Beer Store,Cosmetics Shop,Curling Ice,Skating Rink,Dessert Shop
2,M4E,Park,Neighborhood,Trail,Pub,Health Food Store,Other Great Outdoors,Curling Ice,Clothing Store,Coffee Shop,Comfort Food Restaurant
3,M4G,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Fish & Chips Shop,Shopping Mall,Liquor Store,Mexican Restaurant,Gym,Grocery Store
4,M4H,Indian Restaurant,Burger Joint,Yoga Studio,Grocery Store,Sandwich Place,Pharmacy,Coffee Shop,Gym / Fitness Center,Gym,Pizza Place


In [48]:
neighborhoods_venues_sorted.shape

(10, 11)

### Cluster the postcodes in East Toronto:

In [49]:
# import k-means for the clustering stage
from sklearn.cluster import KMeans

* Run k-means to cluster the postcodes into 5 clusters:

In [50]:
# set number of clusters
kclusters = 5

DowntownToronto_grouped_clustering = DowntownToronto_grouped.drop(columns='Postcode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(DowntownToronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 2, 3, 0, 0, 1, 0, 0, 0, 0])
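
With only 10 postcodes and 5 clusters, it is worth counting how many postcodes fall into each cluster before interpreting them. A minimal sketch using the label array printed above:

```python
import numpy as np

# Cluster labels copied from the output above
labels = np.array([4, 2, 3, 0, 0, 1, 0, 0, 0, 0])

# Number of postcodes per cluster label (index = cluster label)
sizes = np.bincount(labels, minlength=5)
print(sizes)
```

Six postcodes share label 0, and each of the other four labels contains a single postcode.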

In [52]:
Tor_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
1,M4C,East York,Woodbine Heights,43.695344,-79.318389
2,M4E,East Toronto,The Beaches,43.676357,-79.293031
3,M4G,East York,Leaside,43.70906,-79.363452
4,M4H,East York,Thorncliffe Park,43.705369,-79.349372


In [53]:
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_
neighborhoods_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,M4B,Pizza Place,Fast Food Restaurant,Pet Store,Gym / Fitness Center,Pharmacy,Bus Line,Intersection,Bank,Athletics & Sports,Gastropub,4
1,M4C,Spa,Video Store,Park,Pharmacy,Bus Stop,Beer Store,Cosmetics Shop,Curling Ice,Skating Rink,Dessert Shop,2
2,M4E,Park,Neighborhood,Trail,Pub,Health Food Store,Other Great Outdoors,Curling Ice,Clothing Store,Coffee Shop,Comfort Food Restaurant,3
3,M4G,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Fish & Chips Shop,Shopping Mall,Liquor Store,Mexican Restaurant,Gym,Grocery Store,0
4,M4H,Indian Restaurant,Burger Joint,Yoga Studio,Grocery Store,Sandwich Place,Pharmacy,Coffee Shop,Gym / Fitness Center,Gym,Pizza Place,0


In [54]:
EastToronto_merged = Tor_geo

# join Tor_geo with neighborhoods_venues_sorted to add the top venues and the cluster label to each postcode
EastToronto_merged = EastToronto_merged.join(neighborhoods_venues_sorted.set_index('Postcode'), on='Postcode').reset_index()

EastToronto_merged # check the last columns!

Unnamed: 0,index,Postcode,Borough,Neighbourhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,0,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,Pizza Place,Fast Food Restaurant,Pet Store,Gym / Fitness Center,Pharmacy,Bus Line,Intersection,Bank,Athletics & Sports,Gastropub,4
1,1,M4C,East York,Woodbine Heights,43.695344,-79.318389,Spa,Video Store,Park,Pharmacy,Bus Stop,Beer Store,Cosmetics Shop,Curling Ice,Skating Rink,Dessert Shop,2
2,2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Park,Neighborhood,Trail,Pub,Health Food Store,Other Great Outdoors,Curling Ice,Clothing Store,Coffee Shop,Comfort Food Restaurant,3
3,3,M4G,East York,Leaside,43.70906,-79.363452,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Fish & Chips Shop,Shopping Mall,Liquor Store,Mexican Restaurant,Gym,Grocery Store,0
4,4,M4H,East York,Thorncliffe Park,43.705369,-79.349372,Indian Restaurant,Burger Joint,Yoga Studio,Grocery Store,Sandwich Place,Pharmacy,Coffee Shop,Gym / Fitness Center,Gym,Pizza Place,0
5,5,M4J,East York,East Toronto,43.685347,-79.338106,Park,Coffee Shop,Convenience Store,Comfort Food Restaurant,Comic Shop,Cosmetics Shop,Coworking Space,Curling Ice,Department Store,Dessert Shop,1
6,6,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Fruit & Vegetable Store,Liquor Store,Juice Bar,Indian Restaurant,0
7,7,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,Sandwich Place,Food & Drink Shop,Gym,Movie Theater,Park,Pet Store,Italian Restaurant,Pub,Burrito Place,Burger Joint,0
8,8,M4M,East Toronto,Studio District,43.659526,-79.340923,Café,Coffee Shop,American Restaurant,Bakery,Italian Restaurant,Brewery,Gastropub,Diner,Middle Eastern Restaurant,Latin American Restaurant,0
9,9,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Light Rail Station,Yoga Studio,Spa,Garden Center,Fast Food Restaurant,Farmers Market,Gym / Fitness Center,Comic Shop,Butcher,Pizza Place,0


In [56]:
EastToronto_merged = EastToronto_merged.dropna()

In [57]:
EastToronto_merged['Cluster Labels']

0    4
1    2
2    3
3    0
4    0
5    1
6    0
7    0
8    0
9    0
Name: Cluster Labels, dtype: int32

* Visualize the resulting clusters:

In [58]:

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map centred on Toronto
map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(EastToronto_merged['Latitude'], EastToronto_merged['Longitude'], EastToronto_merged['Postcode'], EastToronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

* Examine each cluster in turn, starting with the first one (cluster label 0):

In [59]:
EastToronto_merged.loc[EastToronto_merged['Cluster Labels'] == 0, EastToronto_merged.columns[[1] + list(range(5, EastToronto_merged.shape[1]))]]

Unnamed: 0,Postcode,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
3,M4G,-79.363452,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Fish & Chips Shop,Shopping Mall,Liquor Store,Mexican Restaurant,Gym,Grocery Store,0
4,M4H,-79.349372,Indian Restaurant,Burger Joint,Yoga Studio,Grocery Store,Sandwich Place,Pharmacy,Coffee Shop,Gym / Fitness Center,Gym,Pizza Place,0
6,M4K,-79.352188,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Fruit & Vegetable Store,Liquor Store,Juice Bar,Indian Restaurant,0
7,M4L,-79.315572,Sandwich Place,Food & Drink Shop,Gym,Movie Theater,Park,Pet Store,Italian Restaurant,Pub,Burrito Place,Burger Joint,0
8,M4M,-79.340923,Café,Coffee Shop,American Restaurant,Bakery,Italian Restaurant,Brewery,Gastropub,Diner,Middle Eastern Restaurant,Latin American Restaurant,0
9,M7Y,-79.321558,Light Rail Station,Yoga Studio,Spa,Garden Center,Fast Food Restaurant,Farmers Market,Gym / Fitness Center,Comic Shop,Butcher,Pizza Place,0


In [60]:
EastToronto_merged.loc[EastToronto_merged['Cluster Labels'] == 1, EastToronto_merged.columns[[1] + list(range(5, EastToronto_merged.shape[1]))]]

Unnamed: 0,Postcode,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
5,M4J,-79.338106,Park,Coffee Shop,Convenience Store,Comfort Food Restaurant,Comic Shop,Cosmetics Shop,Coworking Space,Curling Ice,Department Store,Dessert Shop,1


In [61]:
EastToronto_merged.loc[EastToronto_merged['Cluster Labels'] == 2, EastToronto_merged.columns[[1] + list(range(5, EastToronto_merged.shape[1]))]]

Unnamed: 0,Postcode,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,M4C,-79.318389,Spa,Video Store,Park,Pharmacy,Bus Stop,Beer Store,Cosmetics Shop,Curling Ice,Skating Rink,Dessert Shop,2


In [62]:
EastToronto_merged.loc[EastToronto_merged['Cluster Labels'] == 3, EastToronto_merged.columns[[1] + list(range(5, EastToronto_merged.shape[1]))]]

Unnamed: 0,Postcode,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
2,M4E,-79.293031,Park,Neighborhood,Trail,Pub,Health Food Store,Other Great Outdoors,Curling Ice,Clothing Store,Coffee Shop,Comfort Food Restaurant,3


In [63]:
EastToronto_merged.loc[EastToronto_merged['Cluster Labels'] == 4, EastToronto_merged.columns[[1] + list(range(5, EastToronto_merged.shape[1]))]]

Unnamed: 0,Postcode,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,M4B,-79.309937,Pizza Place,Fast Food Restaurant,Pet Store,Gym / Fitness Center,Pharmacy,Bus Line,Intersection,Bank,Athletics & Sports,Gastropub,4


## Thank You for your time!