<a href="https://colab.research.google.com/github/epflyingzhang/ibm_data_science_capstone/blob/master/200325_Segmenting_and_Clustering_Neighborhoods_in_Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains the code and Markdown text for the peer-graded assignment of the Applied Data Science Capstone project, week 3.

# Task 1
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:<img src = "https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1585267200000&hmac=1rN6Weo44UlwxiHAxhCwnyYUqaWt563nAiBAevGOpro" width = 400>

To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
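
The cleaning rules listed above can be sketched in plain Python before any pandas code is written. This is only an illustration under assumptions: it expects the table as one `(PostalCode, Borough, Neighborhood)` tuple per row, the function name `clean_rows` is hypothetical, and the sample rows are illustrative (the M5A pair follows the example in the task text, the other rows are made up).

```python
def clean_rows(rows):
    """Apply the assignment's cleaning rules to (postal_code, borough, neighborhood) rows."""
    merged = {}  # postal_code -> (borough, [neighborhoods]); dicts keep insertion order
    for postal_code, borough, neighborhood in rows:
        # Rule: ignore cells whose borough is "Not assigned".
        if borough == "Not assigned":
            continue
        # Rule: a "Not assigned" neighborhood takes the borough's name.
        if neighborhood == "Not assigned":
            neighborhood = borough
        # Rule: several rows with the same postal code are combined into one.
        _, names = merged.setdefault(postal_code, (borough, []))
        if neighborhood not in names:
            names.append(neighborhood)
    return [(code, borough, ", ".join(names))
            for code, (borough, names) in merged.items()]


# Illustrative input rows (not scraped data).
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
]
cleaned = clean_rows(sample_rows)
print(cleaned)
print(len(cleaned))  # number of rows, the plain-Python analogue of df.shape[0]
```

The notebook below does not follow this row-by-row shape because the scraped table arrives as a grid of cells, one postal code per cell.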



In [0]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# scrape the table from the HTML and store it as raw data
from io import StringIO

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
# replace line break mark with ";"
# (wrapped in StringIO because newer pandas versions deprecate passing literal HTML strings)
df_raw = pd.read_html(StringIO(str(table).replace('<br/>', ';')))[0]
df_raw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1A;Not assigned,M2A;Not assigned,M3A;North York;(Parkwoods),M4A;North York;(Victoria Village),M5A;Downtown Toronto;(Regent Park / Harbourfront),M6A;North York;(Lawrence Manor / Lawrence Heig...,M7A;Queen's Park; / Ontario Provincial Government,M8A;Not assigned,M9A;Etobicoke;(Islington Avenue)
1,M1B;Scarborough;(Malvern / Rouge),M2B;Not assigned,M3B;North York;(Don Mills);North,M4B;East York;(Parkview Hill / Woodbine Gardens),"M5B;Downtown Toronto;(Garden District, Ryerson)",M6B;North York;(Glencairn),M7B;Not assigned,M8B;Not assigned,M9B;Etobicoke;(West Deane Park / Princess Gard...
2,M1C;Scarborough;(Rouge Hill / Port Union / Hig...,M2C;Not assigned,M3C;North York;(Don Mills);South;(Flemingdon P...,M4C;East York;(Woodbine Heights),M5C;Downtown Toronto;(St. James Town),M6C;York;(Humewood-Cedarvale),M7C;Not assigned,M8C;Not assigned,M9C;Etobicoke;(Eringate / Bloordale Gardens / ...
3,M1E;Scarborough;(Guildwood / Morningside / Wes...,M2E;Not assigned,M3E;Not assigned,M4E;East Toronto;(The Beaches),M5E;Downtown Toronto;(Berczy Park),M6E;York;(Caledonia-Fairbanks),M7E;Not assigned,M8E;Not assigned,M9E;Not assigned
4,M1G;Scarborough;(Woburn),M2G;Not assigned,M3G;Not assigned,M4G;East York;(Leaside),M5G;Downtown Toronto;(Central Bay Street),M6G;Downtown Toronto;(Christie),M7G;Not assigned,M8G;Not assigned,M9G;Not assigned


In [3]:
# reshape raw data
df = pd.DataFrame(df_raw.values.reshape((df_raw.shape[0] * df_raw.shape[1], 1)), columns=['raw_txt'])

# take first three characters as PostalCode
df['PostalCode'] = df['raw_txt'].str[:3]

# remaining string for further analysis
df['after_postal_code'] = df['raw_txt'].str[4:]

# remove "Not assigned" rows
row_before = len(df)
df = df[~(df['after_postal_code'] == 'Not assigned')]
print("removed {} 'Not assigned' out of {} rows. remaining rows: {}.".format(
    row_before - len(df), row_before, len(df)))


removed 77 'Not assigned' out of 180 rows. remaining rows: 103.


In [4]:
# some exploratory analysis, then continue cleaning

# check that PostalCode is unique
print("{} postal codes are all unique:".format(len(df)), len(df) == len(df['PostalCode'].unique()))

# check assumption: 'after_postal_code' contains exactly one pair of brackets
print("\n'after_postal_code' does not contain exactly one pair of brackets:")
print(df[df['after_postal_code'].str.count(r"\(") != 1]['after_postal_code'])  # results show 2 rows have two pairs of brackets and 1 row has none


# take the string before ";(" as borough
df['Borough'] = df['after_postal_code'].str.split(r";\(").apply(lambda x: x[0])

# check Boroughs that contain ";" 
print("\n Boroughs contain ; sign:")
print(df[df['Borough'].str.count(";") > 0]['Borough'])

# Boroughs: replace "; /" with " /", replace ";" with ", " (literal matches, so regex=False)
df['Borough'] = df['Borough'].str.replace("; /", " /", regex=False).str.replace(";", ", ", regex=False)
print("\nAfter replace:")
print(df[df['Borough'].str.count(",") > 0]['Borough']) 

# Neighborhood column: everything after the first ";("
df['Neighborhood'] = df['after_postal_code'].str.split(r";\(").apply(lambda x: ";(".join(x[1:]) if len(x) > 1 else "")

# replace " / " with ", "
df['Neighborhood'] = df['Neighborhood'].str.replace(" / ", ", ", regex=False)
# remove ")" at the end
df['Neighborhood'] = df['Neighborhood'].apply(lambda x: x[:-1] if len(x) > 1 and x[-1]==")" else x)
# replace ");" with "-", replace ";(" and "; (" with ", " (literal matches, so regex=False)
df['Neighborhood'] = df['Neighborhood'].str.replace(");", "-", regex=False).str.replace(";(", ", ", regex=False).str.replace("; (", ", ", regex=False)

# if Neighborhood is "", use Borough
df['Neighborhood'] = df[['Borough', 'Neighborhood']].apply(lambda x: x['Borough'] if x['Neighborhood']=="" else x['Neighborhood'], axis=1)


103 postal codes are all unique: True

'after_postal_code' does not contain exactly one pair of brackets:
6      Queen's Park; / Ontario Provincial Government
20    North York;(Don Mills);South;(Flemingdon Park)
65       North York;(Downsview);East ; (CFB Toronto)
Name: after_postal_code, dtype: object

 Boroughs contain ; sign:
6          Queen's Park; / Ontario Provincial Government
57                                East York;East Toronto
114    Mississauga;Canada Post Gateway Processing Centre
148     Downtown Toronto;Stn A PO Boxes;25 The Esplanade
152                                  Etobicoke;Northwest
168    East Toronto;Business reply mail Processing Ce...
Name: Borough, dtype: object

After replace:
57                               East York, East Toronto
114    Mississauga, Canada Post Gateway Processing Ce...
148    Downtown Toronto, Stn A PO Boxes, 25 The Espla...
152                                 Etobicoke, Northwest
168    East Toronto, Business reply mail Processing C...
Name: Borough, dtype: object
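
The string-cleaning steps in the cell above can also be read as a single function applied to one raw cell at a time. The sketch below mirrors those steps in plain Python; the function name `parse_cell` is hypothetical, and the sample strings are taken from the raw table shown earlier.

```python
def parse_cell(raw):
    """Split one raw table cell into (PostalCode, Borough, Neighborhood), or None if unassigned."""
    postal_code, rest = raw[:3], raw[4:]
    if rest == "Not assigned":
        return None

    # Everything before the first ";(" is the borough.
    parts = rest.split(";(")
    borough = parts[0].replace("; /", " /").replace(";", ", ")

    # Everything after the first ";(" is the neighborhood text.
    neighborhood = ";(".join(parts[1:]).replace(" / ", ", ")
    if len(neighborhood) > 1 and neighborhood.endswith(")"):
        neighborhood = neighborhood[:-1]
    neighborhood = (neighborhood.replace(");", "-")
                                .replace(";(", ", ")
                                .replace("; (", ", "))

    # An empty neighborhood falls back to the borough name.
    return postal_code, borough, neighborhood or borough


samples = [
    "M1A;Not assigned",
    "M5A;Downtown Toronto;(Regent Park / Harbourfront)",
    "M3B;North York;(Don Mills);North",
    "M7A;Queen's Park; / Ontario Provincial Government",
]
for cell in samples:
    print(parse_cell(cell))
```

Keeping the logic in one function makes it easier to test the unusual cells (two pairs of brackets, or none) individually.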

In [0]:
# keep only relevant columns as output
df_out = df[['PostalCode', 'Borough', 'Neighborhood']].reset_index(drop=True)

# pd.set_option('display.max_rows', 1000)
# pd.set_option('display.max_columns', 5)
# # pd.set_option('display.max_colwidth', 100)
# print(df_out)
# print("\nThe shape of output df is: ", df_out.shape)
# pd.reset_option('all')


In [30]:
df_out.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park / Ontario Provincial Government,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills-North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


# End of Task 1

# Task 2
Get the coordinates of the neighborhoods using the Python geocoder package.

The geocoder package does not seem to work, so the coordinates are read from a CSV file instead.

In [6]:
df_ll = pd.read_csv('http://cocl.us/Geospatial_data')
df_ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [0]:
df_with_ll = df_out.merge(df_ll, how='left', left_on='PostalCode', right_on='Postal Code')
df_with_ll.drop('Postal Code', inplace=True, axis=1)

In [8]:
print(df_with_ll.shape)
df_with_ll.head(10)

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park / Ontario Provincial Government,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills-North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
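
A left merge keeps every row of the left table even when no coordinates are found for its postal code, so it is worth checking for gaps afterwards. The sketch below shows the same idea in plain Python with a dictionary lookup. The function name `left_join_coordinates` and the postal code `M0Z` are made up for illustration; the two real coordinates are copied from the output above. Where pandas would fill missing values with `NaN`, this sketch uses `None`.

```python
def left_join_coordinates(rows, coordinates):
    """Attach (latitude, longitude) to each row, keeping every row like a left merge."""
    joined, missing = [], []
    for postal_code, borough, neighborhood in rows:
        lat, lng = coordinates.get(postal_code, (None, None))
        if lat is None:
            missing.append(postal_code)  # remember codes without coordinates
        joined.append((postal_code, borough, neighborhood, lat, lng))
    return joined, missing


rows = [
    ("M3A", "North York", "Parkwoods"),
    ("M1B", "Scarborough", "Malvern, Rouge"),
    ("M0Z", "Example Borough", "Example Neighborhood"),  # made-up code with no coordinates
]
coordinates = {
    "M3A": (43.753259, -79.329656),
    "M1B": (43.806686, -79.194353),
}
joined, missing = left_join_coordinates(rows, coordinates)
print(len(joined), "rows kept;", "codes without coordinates:", missing)
```

In the notebook itself, the equivalent check would be to count missing values in the `Latitude` column of `df_with_ll`.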


# End of Task 2

# Task 3
Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

- add enough Markdown cells to explain what you decided to do and to report any observations you make.
- generate maps to visualize your neighborhoods and how they cluster together

First, we take the neighborhoods in Toronto and visualize them on a map.

In [9]:
# get all neighborhoods in boroughs whose names contain the word Toronto
df_cluster = df_with_ll[df_with_ll['Borough'].str.contains('Toronto')]
# df_cluster = df_with_ll

df_cluster.shape

(39, 5)

In [10]:
df_cluster.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [0]:
from geopy.geocoders import Nominatim
import folium

In [12]:
# get the coordinates of Toronto
address = 'Toronto'
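geolocator = Nominatim(user_agent="toronto_explorer")  # recent geopy versions require a user_agent; this name is an arbitrary placeholder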
toronto = geolocator.geocode(address)
toronto



Location(Toronto, Golden Horseshoe, Ontario, M6K 1X9, Canada, (43.653963, -79.387207, 0.0))

In [13]:
# create map using latitude and longitude values
map_toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_cluster['Latitude'], df_cluster['Longitude'], df_cluster['Borough'], df_cluster['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

Get nearby venue data via the Foursquare API

In [0]:
import os

# Foursquare credentials, read from environment variables instead of being hard-coded
# (the variable names below are placeholders chosen for this notebook)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT = 50
RADIUS = 1000

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
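
The request-and-extract step in the function above can be split into two small helpers that are easy to test without calling the API. This is a sketch under assumptions: the helper names `build_explore_url` and `extract_venues` are hypothetical, the response layout is inferred only from the keys the function above reads, and the sample payload is made up. Unlike the original, the extraction tolerates a venue with an empty category list.

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL, letting urlencode handle escaping of the parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_URL + '?' + urlencode(params)


def extract_venues(payload, name, lat, lng):
    """Turn a decoded JSON response into one tuple per venue."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up payload with the same nesting the notebook's function expects.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

url = build_explore_url('ID', 'SECRET', '20180605', 43.65426, -79.360636, 1000, 50)
venues = extract_venues(sample_payload, 'Regent Park, Harbourfront', 43.65426, -79.360636)
print(url)
print(venues)
```

Separating URL construction from parsing also makes it possible to cache raw responses and re-run the parsing step without spending API quota.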

In [16]:
toronto_venues = getNearbyVenues(names=df_cluster['Neighborhood'],
                                   latitudes=df_cluster['Latitude'],
                                   longitudes=df_cluster['Longitude']
                                  )

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth ; East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canad

In [17]:
print(toronto_venues.shape)
toronto_venues.head()
toronto_venues.groupby('Neighborhood').count()

(1798, 7)


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,50,50,50,50,50,50
"Brockton, Parkdale Village, Exhibition Place",50,50,50,50,50,50
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",14,14,14,14,14,14
Central Bay Street,50,50,50,50,50,50
Christie,50,50,50,50,50,50
Church and Wellesley,50,50,50,50,50,50
"Commerce Court, Victoria Hotel",50,50,50,50,50,50
Davisville,50,50,50,50,50,50
Davisville North,50,50,50,50,50,50
"Dufferin, Dovercourt Village",50,50,50,50,50,50


Transform toronto_venues data into features

In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 unique categories.


In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

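# add neighborhood column back to dataframe
# caution: Foursquare also has a venue category named "Neighborhood"; if it appears in the
# results, this assignment overwrites that category column in place instead of appending a new one
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
# (this assumes 'Neighborhood' is the last column, which does not hold in the case described above)
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]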

toronto_onehot.head()
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Business Service,Butcher,Café,...,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soccer Stadium,Soup Place,South American Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stationery Store,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tech Startup,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.06,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,...,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.06,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.02,0.04,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0
5,Church and Wellesley,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.02,0.02,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
6,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,...,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
7,Davisville,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0
8,Davisville North,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.02,0.02,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.02,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.12,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
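
Taking the mean of one-hot rows per neighborhood is the same as computing, for each category, the share of that neighborhood's venues that fall in the category. That is why neighborhoods with 50 venues show multiples of 0.02, while the CN Tower area with 14 venues shows 0.071429 (1/14). The sketch below reproduces the calculation in plain Python; the function name `category_frequencies` and the sample venues are illustrative only.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    frequencies = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        frequencies[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return frequencies


# Illustrative (neighborhood, category) pairs, not real API results.
sample_venues = [
    ("A", "Café"), ("A", "Café"), ("A", "Park"),
    ("B", "Pub"),
]
freqs = category_frequencies(sample_venues)
print(freqs)
```

Because each row is a set of shares that sums to 1, neighborhoods with very few venues get coarse, high-valued features, which can influence the clustering step.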


In [20]:
toronto_grouped.describe()

Unnamed: 0,Yoga Studio,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Business Service,Butcher,Café,Candy Store,...,School,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soccer Stadium,Soup Place,South American Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stationery Store,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tech Startup,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop
count,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,...,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0
mean,0.006737,0.001832,0.001832,0.013004,0.000513,0.000513,0.000513,0.000513,0.005641,0.000513,0.003611,0.004615,0.001068,0.004679,0.004308,0.027487,0.007644,0.018996,0.000513,0.001026,0.002051,0.003611,0.000513,0.012308,0.001026,0.000513,0.003816,0.012308,0.000513,0.001026,0.010321,0.008269,0.003077,0.006688,0.003611,0.002564,0.000513,0.001026,0.064573,0.001068,...,0.001026,0.001832,0.009231,0.001026,0.000513,0.001047,0.003946,0.001026,0.00156,0.000513,0.000513,0.000513,0.003147,0.002564,0.002121,0.001538,0.000513,0.004777,0.004615,0.018668,0.002564,0.001538,0.000675,0.002051,0.009814,0.000513,0.014612,0.005803,0.000513,0.000513,0.000513,0.000513,0.002927,0.006892,0.001538,0.001026,0.013344,0.005198,0.00366,0.000513
std,0.010695,0.011438,0.011438,0.015671,0.003203,0.003203,0.003203,0.003203,0.009118,0.003203,0.009069,0.012533,0.006672,0.009889,0.010209,0.02141,0.019918,0.027891,0.003203,0.006405,0.006147,0.012058,0.003203,0.019257,0.004469,0.003203,0.00959,0.021332,0.003203,0.004469,0.015232,0.015927,0.00731,0.01161,0.009069,0.016013,0.003203,0.004469,0.041718,0.006672,...,0.004469,0.011438,0.014397,0.004469,0.003203,0.004563,0.012346,0.004469,0.005475,0.003203,0.003203,0.003203,0.008781,0.006774,0.006368,0.005399,0.003203,0.008889,0.008537,0.028807,0.006774,0.005399,0.004214,0.006147,0.012938,0.003203,0.012265,0.011447,0.003203,0.003203,0.003203,0.003203,0.012231,0.018808,0.005399,0.004469,0.016118,0.012838,0.007938,0.003203
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
75%,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.02,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.08,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031364,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
max,0.04,0.071429,0.071429,0.04,0.02,0.02,0.02,0.02,0.02,0.02,0.04,0.06,0.041667,0.041667,0.047619,0.08,0.095238,0.12,0.02,0.04,0.02,0.06,0.02,0.06,0.02,0.02,0.041667,0.1,0.02,0.02,0.06,0.0625,0.02,0.040816,0.04,0.1,0.02,0.02,0.2,0.041667,...,0.02,0.071429,0.04,0.02,0.02,0.020833,0.047619,0.02,0.020833,0.02,0.02,0.02,0.04,0.02,0.022727,0.02,0.02,0.026316,0.02,0.142857,0.02,0.02,0.026316,0.02,0.04,0.02,0.04,0.04,0.02,0.02,0.02,0.02,0.071429,0.1,0.02,0.02,0.06,0.06,0.022727,0.02


In [21]:
toronto_grouped.shape

(39, 235)

In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [23]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Beer Bar,Seafood Restaurant,Bakery,Cocktail Bar,Japanese Restaurant,Farmers Market,Hotel,Creperie
1,"Brockton, Parkdale Village, Exhibition Place",Restaurant,Bakery,Coffee Shop,Gift Shop,Café,Vegetarian / Vegan Restaurant,Furniture / Home Store,Breakfast Spot,Arts & Crafts Store,Sandwich Place
2,"CN Tower, King and Spadina, Railway Lands, Har...",Harbor / Marina,Coffee Shop,Park,Sculpture Garden,Dog Run,Dance Studio,Café,Garden,Scenic Lookout,Track
3,Central Bay Street,Coffee Shop,Japanese Restaurant,Ice Cream Shop,Italian Restaurant,Chinese Restaurant,Tea Room,Yoga Studio,Pizza Place,Sandwich Place,Bubble Tea Shop
4,Christie,Café,Korean Restaurant,Grocery Store,Coffee Shop,Cocktail Bar,Indian Restaurant,Pizza Place,Ethiopian Restaurant,Ice Cream Shop,Spa
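
The ranking and the column naming used above can be sketched without pandas. The helper names `ordinal` and `most_common_categories` are hypothetical, and the frequencies are made-up values. The `ordinal` helper is a small generalization: the notebook's `indicators` list is enough for ten columns, while this version also handles numbers such as 11 or 21.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


def most_common_categories(frequencies, num_top):
    """Return the num_top category names with the highest frequency."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:num_top]]


# Made-up frequencies for a single neighborhood.
sample = {'Café': 0.08, 'Coffee Shop': 0.10, 'Bakery': 0.04, 'Park': 0.02}
top = most_common_categories(sample, 3)
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(len(top))]
print(dict(zip(labels, top)))
```

Ties between equal frequencies keep their original order here because Python's sort is stable; pandas may order ties differently.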


Cluster neighborhoods

In [24]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

toronto_clus = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clus)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 2, 4, 4, 4, 1, 4, 4, 2], dtype=int32)
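
For intuition about what the clustering call does, here is a minimal pure-Python sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the scikit-learn implementation: scikit-learn's `KMeans` chooses initial centroids with k-means++ by default, while this sketch simply uses the first `k` points so that the result is deterministic. The function names and the sample points are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_kmeans(points, k, max_iter=100):
    """Basic k-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two obvious groups of 2-D points (made-up data).
sample_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centroids = simple_kmeans(sample_points, k=2)
print(labels)
print(centroids)
```

In the notebook each point is a neighborhood's vector of category frequencies rather than a 2-D coordinate, but the loop is the same.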

Create a new df with cluster results and top 10 venue categories

In [26]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_cluster

# join the cluster labels and top venues onto df_cluster, which carries latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].apply(int)

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Café,Bakery,Park,Mexican Restaurant,Pub,Italian Restaurant,Theater,Yoga Studio,Shoe Store
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,4,Coffee Shop,Clothing Store,Electronics Store,Bookstore,Theater,Ramen Restaurant,Italian Restaurant,Restaurant,American Restaurant,Cosmetics Shop
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Japanese Restaurant,Cocktail Bar,Beer Bar,Bakery,Farmers Market,Italian Restaurant,Restaurant,Hotel
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Pub,Breakfast Spot,Park,Tea Room,Caribbean Restaurant,Japanese Restaurant,Beach,Bakery,Coffee Shop,French Restaurant
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Café,Beer Bar,Seafood Restaurant,Bakery,Cocktail Bar,Japanese Restaurant,Farmers Market,Hotel,Creperie


Visualize the resulting clusters:

In [27]:
from matplotlib import cm, colors

# create map
map_clusters = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [28]:
toronto_merged.sort_values(by='Cluster Labels') # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Sushi Restaurant,Coffee Shop,Italian Restaurant,Pharmacy,Bank,Pet Store,Skating Rink,Café,Japanese Restaurant,Bakery
86,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049,0,Coffee Shop,Italian Restaurant,Sushi Restaurant,Park,Spa,Gym,Liquor Store,Restaurant,Café,Pub
83,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,0,Park,Italian Restaurant,Coffee Shop,Gym,Grocery Store,Pub,Sandwich Place,Pizza Place,Restaurant,Café
68,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307,0,Park,Café,Bank,Italian Restaurant,Coffee Shop,Skating Rink,Sushi Restaurant,Bakery,Gym / Fitness Center,Trail
47,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Coffee Shop,Beach,Indian Restaurant,Bakery,Fast Food Restaurant,Café,Sandwich Place,Grocery Store,Burrito Place,Brewery
100,M7Y,"East Toronto, Business reply mail Processing C...",Enclave of M4L,43.662744,-79.321558,0,Park,Coffee Shop,Brewery,Pizza Place,Fast Food Restaurant,Pet Store,Italian Restaurant,Sushi Restaurant,Burrito Place,Breakfast Spot
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Pub,Breakfast Spot,Park,Tea Room,Caribbean Restaurant,Japanese Restaurant,Beach,Bakery,Coffee Shop,French Restaurant
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Café,Beer Bar,Seafood Restaurant,Bakery,Cocktail Bar,Japanese Restaurant,Farmers Market,Hotel,Creperie
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,1,Café,Concert Hall,Asian Restaurant,Restaurant,American Restaurant,Seafood Restaurant,Coffee Shop,Pizza Place,Salon / Barbershop,Burrito Place
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Japanese Restaurant,Cocktail Bar,Beer Bar,Bakery,Farmers Market,Italian Restaurant,Restaurant,Hotel


# Observations:
- Neighborhoods in the biggest cluster (No. 2) are characterized by cafés, coffee shops and bars.
- Cluster No. 0 consists of neighborhoods in East and Central Toronto.
- Cluster No. 1 is characterized by a high density of restaurants, pubs and cafés in or near Downtown Toronto.
- Neighborhoods in Cluster No. 2 have parks and coffee shops.
- Cluster No. 3 consists of only one neighborhood, Lawrence Park, which is situated on the outskirts of the city.
- Cluster No. 4 is the biggest cluster, with neighborhoods spread across different parts of Toronto, likely representing the most common residential surroundings.
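
Statements about which cluster is largest can be checked by counting the labels directly. The sketch below does this with `collections.Counter`. The helper name `cluster_sizes` is hypothetical, and the sample input is only the first ten labels printed after the clustering cell, so the counts are not the full result for all 39 neighborhoods.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return (cluster, size) pairs, largest cluster first."""
    return Counter(labels).most_common()


# First ten labels shown in the notebook output; the full array has 39 entries.
sample_labels = [1, 1, 2, 4, 4, 4, 1, 4, 4, 2]
sizes = cluster_sizes(sample_labels)
print(sizes)
```

In the notebook, passing `kmeans.labels_.tolist()` to the same function would give the sizes for the full clustering.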


# End of Task 3