# IBM Applied Data Science Capstone Course by Coursera

## Week 3 Part 1, 2 and 3
1. Build a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name in Toronto.
2. Get the geographical coordinates of the neighborhoods in Toronto.
3. Explore and cluster the neighborhoods in Toronto (replicate the same analysis we did to New York City data).

## Part 1

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


In [2]:
#Scrap data from Wikipedia page into a DataFrame
#send the GET request
data = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969').text
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')
# create three lists to store table data
postalCodeList = []
boroughList = []
neighborhoodList = []

In [3]:
# append the data into the respective lists
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if(len(cells) > 0):
        postalCodeList.append(cells[0].text.strip('\n'))
        boroughList.append(cells[1].text.strip('\n'))
        neighborhoodList.append(cells[2].text.strip('\n')) # avoid new lines in neighborhood cell

In [4]:
# create a new DataFrame from the three lists
toronto_df = pd.DataFrame({"PostalCode": postalCodeList,
                           "Borough": boroughList,
                           "Neighborhood": neighborhoodList})

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
# drop cells with a borough that is Not assigned
toronto_df_dropna = toronto_df[toronto_df.Borough != "Not assigned"].reset_index(drop=True)
toronto_df_dropna.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [6]:
# group neighborhoods in the same borough
toronto_df_grouped = toronto_df_dropna.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [7]:
# for Neighborhood="Not assigned", make the value the same as Borough
for index, row in toronto_df_grouped.iterrows():
    if row["Neighborhood"] == "Not assigned":
        row["Neighborhood"] = row["Borough"]
        
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [8]:
# create a new test dataframe
column_names = ["PostalCode", "Borough", "Neighborhood"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_grouped[toronto_df_grouped["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."


In [31]:
# print the number of rows of the cleaned dataframe
toronto_df_grouped.shape

(103, 3)

## Part 2

In [12]:
coordinates = pd.read_csv("https://cocl.us/Geospatial_data")
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
# rename the column "PostalCode"
coordinates.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
# merge two table on the column "PostalCode"
toronto_df_new = toronto_df_grouped.merge(coordinates, on="PostalCode", how="left")
toronto_df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [15]:
# create a new test dataframe
column_names = ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_new[toronto_df_new["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750071,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442


In [17]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto-explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

In [19]:
# filter borough names that contain the word Toronto
borough_names = list(toronto_df_new.Borough.unique())

borough_with_toronto = []

for x in borough_names:
    if "toronto" in x.lower():
        borough_with_toronto.append(x)
        
borough_with_toronto

['East Toronto',
 'Central Toronto',
 'Downtown Toronto',
 'West Toronto',
 'Toronto/York']

In [20]:
# create a new DataFrame with only boroughs that contain the word Toronto
toronto_df_new = toronto_df_new[toronto_df_new['Borough'].isin(borough_with_toronto)].reset_index(drop=True)
print(toronto_df_new.shape)
toronto_df_new.head()

(40, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [21]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

In [22]:
# define Foursquare Credentials and Version
CLIENT_ID = '1UAOA3EYQNPHNXZ2GLPQAD4WQZCESKRJURNJAQW0BPSZLH3P' # your Foursquare ID
CLIENT_SECRET = 'THIU2WDHYWPYB2WV0WJTATLYTAYWVGPCQSCQY42RRD1CEHO1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 1UAOA3EYQNPHNXZ2GLPQAD4WQZCESKRJURNJAQW0BPSZLH3P
CLIENT_SECRET:THIU2WDHYWPYB2WV0WJTATLYTAYWVGPCQSCQY42RRD1CEHO1


In [23]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['PostalCode'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id=1UAOA3EYQNPHNXZ2GLPQAD4WQZCESKRJURNJAQW0BPSZLH3P&client_secret=THIU2WDHYWPYB2WV0WJTATLYTAYWVGPCQSCQY42RRD1CEHO1&v=20180605&ll=43.6534817,-79.3839347&radius=500&limit=100".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

In [24]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(2960, 9)


Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,Nathan Phillips Square,43.65227,-79.383516,Plaza
2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Japango,43.655268,-79.385165,Sushi Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,Indigo,43.653515,-79.380696,Bookstore
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,Poke Guys,43.654895,-79.385052,Poke Place


In [25]:
venues_df.groupby(["PostalCode", "Borough", "Neighborhood"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
PostalCode,Borough,Neighborhood,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M4E,East Toronto,The Beaches,74,74,74,74,74,74
M4K,East Toronto,"The Danforth West, Riverdale",74,74,74,74,74,74
M4L,East Toronto,"India Bazaar, The Beaches West",74,74,74,74,74,74
M4M,East Toronto,Studio District,74,74,74,74,74,74
M4N,Central Toronto,Lawrence Park,74,74,74,74,74,74
M4P,Central Toronto,Davisville North,74,74,74,74,74,74
M4R,Central Toronto,"North Toronto West, Lawrence Park",74,74,74,74,74,74
M4S,Central Toronto,Davisville,74,74,74,74,74,74
M4T,Central Toronto,"Moore Park, Summerhill East",74,74,74,74,74,74
M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park",74,74,74,74,74,74


In [26]:
print('There are {} uniques categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 53 uniques categories.


In [27]:
venues_df['VenueCategory'].unique()[:50]

array(['Neighborhood', 'Plaza', 'Sushi Restaurant', 'Bookstore',
       'Poke Place', 'Bubble Tea Shop', 'Art Museum', 'Shopping Mall',
       'Hotel', 'Cosmetics Shop', 'Ramen Restaurant', 'Coffee Shop',
       'Fast Food Restaurant', 'Monument / Landmark', 'Clothing Store',
       'Concert Hall', 'Bakery', 'Restaurant', 'Theater',
       'Seafood Restaurant', 'Japanese Restaurant',
       'Vegetarian / Vegan Restaurant', 'American Restaurant',
       'Electronics Store', 'Tanning Salon', 'Furniture / Home Store',
       'Department Store', 'Comic Shop', 'Modern European Restaurant',
       'New American Restaurant', 'Steakhouse', 'Gastropub', 'Bank',
       'Music Venue', 'Gym / Fitness Center', 'Latin American Restaurant',
       'Café', 'Pizza Place', 'Thai Restaurant', 'Jazz Club',
       'Middle Eastern Restaurant', 'Smoothie Shop', 'Mexican Restaurant',
       'Diner', 'Breakfast Spot', 'Colombian Restaurant', 'Movie Theater',
       'Shoe Store', 'Food & Drink Shop', 'Salad Pla

In [28]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add postal, borough and neighborhood column back to dataframe
toronto_onehot['PostalCode'] = venues_df['PostalCode'] 
toronto_onehot['Borough'] = venues_df['Borough'] 
toronto_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move postal, borough and neighborhood column to the first column
fixed_columns = list(toronto_onehot.columns[-3:]) + list(toronto_onehot.columns[:-3])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2960, 56)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Art Museum,Bakery,Bank,Bookstore,Breakfast Spot,Bubble Tea Shop,Café,Clothing Store,Cocktail Bar,Coffee Shop,Colombian Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Department Store,Diner,Electronics Store,Fast Food Restaurant,Food & Drink Shop,Furniture / Home Store,Gastropub,Gym / Fitness Center,Hotel,Japanese Restaurant,Jazz Club,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Monument / Landmark,Movie Theater,Music Venue,Neighborhood,New American Restaurant,Pizza Place,Plaza,Poke Place,Ramen Restaurant,Restaurant,Salad Place,Seafood Restaurant,Shoe Store,Shopping Mall,Smoothie Shop,Steakhouse,Sushi Restaurant,Tanning Salon,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,M4E,East Toronto,The Beaches,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [29]:
toronto_grouped = toronto_onehot.groupby(["PostalCode", "Borough", "Neighborhoods"]).mean().reset_index()

print(toronto_grouped.shape)
toronto_grouped

(40, 56)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Art Museum,Bakery,Bank,Bookstore,Breakfast Spot,Bubble Tea Shop,Café,Clothing Store,Cocktail Bar,Coffee Shop,Colombian Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Department Store,Diner,Electronics Store,Fast Food Restaurant,Food & Drink Shop,Furniture / Home Store,Gastropub,Gym / Fitness Center,Hotel,Japanese Restaurant,Jazz Club,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Monument / Landmark,Movie Theater,Music Venue,Neighborhood,New American Restaurant,Pizza Place,Plaza,Poke Place,Ramen Restaurant,Restaurant,Salad Place,Seafood Restaurant,Shoe Store,Shopping Mall,Smoothie Shop,Steakhouse,Sushi Restaurant,Tanning Salon,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant
0,M4E,East Toronto,The Beaches,0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
1,M4K,East Toronto,"The Danforth West, Riverdale",0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
2,M4L,East Toronto,"India Bazaar, The Beaches West",0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
3,M4M,East Toronto,Studio District,0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
4,M4N,Central Toronto,Lawrence Park,0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
5,M4P,Central Toronto,Davisville North,0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
7,M4S,Central Toronto,Davisville,0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
8,M4T,Central Toronto,"Moore Park, Summerhill East",0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",0.027027,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.108108,0.013514,0.067568,0.013514,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.013514,0.013514,0.027027,0.013514,0.027027,0.013514,0.013514,0.013514,0.013514,0.013514,0.013514,0.027027,0.027027,0.013514,0.013514,0.013514


In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PostalCode', 'Borough', 'Neighborhoods']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']
neighborhoods_venues_sorted['Neighborhoods'] = toronto_grouped['Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    row_categories = toronto_grouped.iloc[ind, :].iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    neighborhoods_venues_sorted.iloc[ind, 3:] = row_categories_sorted.index.values[0:num_top_venues]

# neighborhoods_venues_sorted.sort_values(freqColumns, inplace=True)
print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted

(40, 13)


Unnamed: 0,PostalCode,Borough,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
1,M4K,East Toronto,"The Danforth West, Riverdale",Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
2,M4L,East Toronto,"India Bazaar, The Beaches West",Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
3,M4M,East Toronto,Studio District,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
4,M4N,Central Toronto,Lawrence Park,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
5,M4P,Central Toronto,Davisville North,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
7,M4S,Central Toronto,Davisville,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
8,M4T,Central Toronto,"Moore Park, Summerhill East",Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore


### Part 3 

In [34]:
#using k-means clustering
# set number of clusters
kclusters = 1

toronto_grouped_clustering = toronto_grouped.drop(["PostalCode", "Borough", "Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [35]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
toronto_merged = toronto_df_new.copy()

# add clustering labels
toronto_merged["Cluster Labels"] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.drop(["Borough", "Neighborhoods"], 1).set_index("PostalCode"), on="PostalCode")

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!

(40, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore


In [36]:
# sort the results by Cluster Labels
print(toronto_merged.shape)
toronto_merged.sort_values(["Cluster Labels"], inplace=True)
toronto_merged

(40, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
22,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
23,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
24,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
25,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
26,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
27,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
28,M5W,Downtown Toronto,Stn A PO Boxes,43.646435,-79.374846,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
21,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
29,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.38228,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore


In [37]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, post, bor, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup('{} ({}): {} - Cluster {}'.format(bor, post, poi, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
22,Central Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
23,Central Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
24,Central Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
25,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
26,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
27,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
28,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
21,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore
29,Downtown Toronto,0,Clothing Store,Coffee Shop,American Restaurant,Hotel,Plaza,Restaurant,Seafood Restaurant,Diner,Cosmetics Shop,Bookstore


In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [41]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


In [42]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


So most categories fall into cluster 1 itself.