# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto


For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code that scrapes the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, to obtain the data in the table of postal codes and transform it into a pandas dataframe like the one shown below:

![dataframe](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1603843200000&hmac=Oz4Fl8SLRUys6_OdzPYFxi4nTTgUg9dJ-5Ze2-FOzIc)


## Question 1: Scrape the table from the Wikipedia page

_Note: The table-scraping approach follows the instructions found at https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059_

### Scrape the table

In [1]:
# import the libraries
import requests  # for getting the HTML contents of the website
import lxml.html as lh  # for parsing the relevant fields
import pandas as pd 
import numpy as np

In [2]:
# scrape table cells
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#create a handle, page, to handle the contents of the website
page = requests.get(url)

#store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [4]:
#Parse table header, the first row as the header

#create an empty list
col=[]
i=0

# for each cell of the header row, store its name together with an empty list for the column values
for t in tr_elements[0]:
    i+=1
    name =t.text_content()
    print(i,name)
    col.append((name,[]))

1 Postal Code

2 Borough

3 Neighbourhood



In [5]:
# Fill the column lists that will become the pandas DataFrame
# Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]

    # If the row does not have 3 cells, the //tr data is not from our table
    if len(T) != 3:
        break

    # Append the text of each cell to the list of the matching column
    for i, t in enumerate(T.iterchildren()):
        col[i][1].append(t.text_content())

In [6]:
[len(C) for (title,C) in col]

[181, 181, 181]

In [7]:
# create the DataFrame
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

df.head()

Unnamed: 0,Postal Code\n,Borough\n,Neighbourhood\n
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [8]:
df.shape

(181, 3)

In [9]:
df.columns

Index(['Postal Code\n', 'Borough\n', 'Neighbourhood\n'], dtype='object')

**The column names and the cell values contain an unexpected trailing '\n'.**
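The trailing newline comes from the whitespace stored inside each table cell of the page source. One way to avoid the later cleanup is to strip every cell as it is read. The sketch below illustrates the idea with the standard library only; `TableRowParser` and `sample_html` are made-up names and data that mimic the layout of the table, not the code used in this notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the stripped text of each <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # strip() drops the trailing '\n' that sits inside each cell
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# made-up sample that mimics the cell layout of the postal code table
sample_html = """
<table>
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

With lxml, the same effect could be obtained by calling `.strip()` on the result of `text_content()` before appending it.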

### Data Cleaning
#### Remove "\n" after each element in the dataframe

In [10]:
df.columns =['Postcode','Borough','Neighbourhood'] # rename the columns in order to remove the '\n'
df.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [11]:
df.Neighbourhood.dtypes  #check the datatypes of the Neighbourhood column

dtype('O')

In [12]:
df['Postcode'] = df['Postcode'].str.strip('\n') # strip'\n' from the Postcode column
df['Borough'] = df['Borough'].str.strip('\n') # strip'\n' from the Borough column
df['Neighbourhood'] = df['Neighbourhood'].str.strip('\n') # strip'\n' from the Neighbourhood column

In [13]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Ignore cells **without an assigned borough**

In [14]:
# Remove the rows without an assigned borough name and reset the index
df = df.loc[df['Borough'] != 'Not assigned']
df = df.loc[df['Postcode'] != '']  # also remove rows where Postcode is empty

df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [15]:
df.shape[0]

103

#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [16]:
for i in range(0,df.shape[0]):
    if df.iloc[i,2] =='Not assigned':
        print(i)
        df.iloc[i,2]=df.iloc[i,1] # Assign neighborhood as the borough name
        print(df.iloc[i,2])
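The same rule can also be applied without an explicit loop by using a boolean mask. This is a minimal sketch on made-up data (`sample` is hypothetical and only reuses the column names from above).

```python
import pandas as pd

# made-up rows; the second one has a borough but no neighbourhood
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# select the rows to fix, then copy the borough name into the neighbourhood
missing = sample['Neighbourhood'] == 'Not assigned'
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']
print(sample)
```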

In [17]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


#### Use the _.shape_ attribute to print the number of rows of your dataframe

In [18]:
df.shape

(103, 3)

In [19]:
len(df.Postcode.unique())

103

The number of distinct postcodes equals the number of rows, so every postcode in dataframe *df* is unique.

## Question 2: Find the location for each postal code

### Using the geocoder to retrieve the location of a postal code

*The location of each postcode is read from the provided CSV file because the geocoder package is not reliable.*

In [20]:
df_postcode_csv = pd.read_csv('http://cocl.us/Geospatial_data')
df_postcode_csv.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [21]:
df1=df.copy()
df1['Latitude'] = 0.0
df1['Longitude'] = 0.0
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,0.0,0.0
1,M4A,North York,Victoria Village,0.0,0.0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",0.0,0.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0.0,0.0


In [22]:
dict_postcode_lat = dict(zip(df_postcode_csv['Postal Code'], df_postcode_csv['Latitude']))
dict_postcode_longit = dict(zip(df_postcode_csv['Postal Code'], df_postcode_csv['Longitude']))

In [23]:
# get the latitudes and longitudes

for index, candidate in df.iterrows():
    df1.iloc[index, 3] = dict_postcode_lat[candidate['Postcode']]
    df1.iloc[index,4] = dict_postcode_longit[candidate['Postcode']]
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
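The two lookup dictionaries and the row-by-row loop could also be replaced by a single left merge on the postcode. The sketch below uses small made-up stand-ins (`boroughs` and `coords`) for *df* and *df_postcode_csv*; the coordinates are copied from the output above.

```python
import pandas as pd

# stand-in for df
boroughs = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})

# stand-in for df_postcode_csv, deliberately in a different row order
coords = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# a left merge keeps every row and the row order of the left dataframe
located = (
    boroughs
    .merge(coords, left_on='Postcode', right_on='Postal Code', how='left')
    .drop(columns='Postal Code')
)
print(located)
```

A postcode that is missing from the CSV would end up with NaN coordinates here, whereas the dictionary lookup would raise a `KeyError`.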


In [24]:
df = df1.copy()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


## Question 3: Explore and cluster the neighborhoods

### Keep only the boroughs whose name contains the word "Toronto" to reduce the workload, and store the result back in dataframe df

In [25]:
df = df[df['Borough'].str.contains('Toronto',regex=False)].reset_index(drop=True) 
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [26]:
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [27]:
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [28]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [29]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

### Define Foursquare Credentials and Version

In [30]:
CLIENT_ID = '1KJBD3JC00AXKBV4A3JLHEOYWLKL2BKHLFCHQB1IPWIAA4AI' # your Foursquare ID
CLIENT_SECRET = 'G1PQBEFQ2HENKMG0XW3EJ4NTJPBV24A2U0PKBAM3NTFRCW0J' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 1KJBD3JC00AXKBV4A3JLHEOYWLKL2BKHLFCHQB1IPWIAA4AI
CLIENT_SECRET:G1PQBEFQ2HENKMG0XW3EJ4NTJPBV24A2U0PKBAM3NTFRCW0J
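Hard-coded credentials end up in every shared copy of a notebook. A common alternative is to read them from environment variables and to mask them when printing. The sketch below is one possible way to do that; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for illustration, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None


def mask(secret, visible=4):
    """Keep only the first few characters of a secret readable."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# a plain dict stands in for os.environ in this example
example_env = {
    'FOURSQUARE_CLIENT_ID': 'my-client-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-client-secret',
}
client_id, client_secret = load_foursquare_credentials(example_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```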


### Explore the first neighbourhood in the dataframe

Get the postcode that identifies the neighbourhood

In [31]:
df.loc[0,'Postcode']

'M5A'

Get the neighbourhood's latitude and longitude values

In [32]:
neighborhood_latitude = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df.loc[0, 'Postcode'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of M5A are 43.6542599, -79.3606359.


### Get top 100 venues that are in Regent Park, Harbourfront (M5A) within a radius of 500 meters

First, create the GET request URL. Name your URL **url**

In [33]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
url


'https://api.foursquare.com/v2/venues/search?client_id=1KJBD3JC00AXKBV4A3JLHEOYWLKL2BKHLFCHQB1IPWIAA4AI&client_secret=G1PQBEFQ2HENKMG0XW3EJ4NTJPBV24A2U0PKBAM3NTFRCW0J&ll=43.6542599,-79.3606359&v=20180605&radius=500&limit=100'
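The query string can also be assembled from a dictionary, which keeps each parameter on its own line and takes care of escaping. This is a sketch with the standard library; `build_search_url` is a hypothetical helper, and the `'ID'` and `'SECRET'` values are placeholders.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the venues/search URL with the same parameters as the cell above."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' leaves the comma in the ll value unescaped
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params, safe=',')


example_url = build_search_url('ID', 'SECRET', 43.6542599, -79.3606359, '20180605', 500, 100)
print(example_url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument.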

Send the GET request and examine the results

In [34]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f975341e407617138a0b8ed'},
 'response': {'venues': [{'id': '5bdc6c2bba57b4002c4c71a8',
    'name': 'Oldtown Bodega',
    'location': {'lat': 43.653966,
     'lng': -79.360752,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.653966,
       'lng': -79.360752}],
     'distance': 34,
     'postalCode': 'M5A 1L6',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['Toronto ON M5A 1L6', 'Canada']},
    'categories': [{'id': '4bf58dd8d48988d16d941735',
      'name': 'Café',
      'pluralName': 'Cafés',
      'shortName': 'Café',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1603752769',
    'hasPerk': False},
   {'id': '4bc70f5d14d7952126a066e9',
    'name': 'Sackville Playground',
    'location': {'address': '420 king st E',
     'lat': 43.65465604258614,
     'lng': -79.359870

In [35]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the 'venue.'-prefixed column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the JSON and structure it into a pandas dataframe

In [36]:
import json # library to handle JSON files

In [37]:
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0 or later)

In [38]:
venues = results['response']['venues']
nearby_venues = json_normalize(venues) # flatten JSON
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Oldtown Bodega,Café,43.653966,-79.360752
1,Sackville Playground,Park,43.654656,-79.359871
2,Body Blitz Spa East,Spa,43.654735,-79.359874
3,TTC Streetcar #503 Kingston Rd,Moving Target,43.663549,-79.337669
4,Tandem Coffee,Coffee Shop,43.653559,-79.361809


In [39]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


### Explore neighbourhoods in Toronto

Create a function to repeat the same process for all the neighbourhoods in Toronto

In [40]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
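The list comprehension above indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back with an empty category list. A more defensive variant could move the extraction into a small helper. The sketch below is an assumption about how that might look; `venue_row` and both sample items are made up and only contain the fields that the function above reads.

```python
def venue_row(item):
    """Flatten one item of an explore response into (name, lat, lng, category)."""
    venue = item['venue']
    location = venue['location']
    categories = venue.get('categories') or []
    # use None instead of failing when no category is attached
    category = categories[0]['name'] if categories else None
    return venue['name'], location['lat'], location['lng'], category


# made-up items shaped like the fields read by getNearbyVenues
with_category = {'venue': {'name': 'Tandem Coffee',
                           'location': {'lat': 43.653559, 'lng': -79.361809},
                           'categories': [{'name': 'Coffee Shop'}]}}
without_category = {'venue': {'name': 'Unnamed Spot',
                              'location': {'lat': 43.65, 'lng': -79.36},
                              'categories': []}}

print(venue_row(with_category))
print(venue_row(without_category))
```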

Write the code to run the above function on each neighborhood and create a new dataframe called **Toronto_venues**.

In [41]:
Toronto_venues = getNearbyVenues(names=df['Postcode'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )


M5A
M7A
M5B
M5C
M4E
M5E
M5G
M6G
M5H
M6H
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M4N
M5N
M4P
M5P
M6P
M4R
M5R
M6R
M4S
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y


In [42]:
print(Toronto_venues.shape)
Toronto_venues.head()

(1624, 7)


Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Check how many venues were returned for each postcode

In [43]:
Toronto_venues.groupby('Postcode').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M4E,4,4,4,4,4,4
M4K,43,43,43,43,43,43
M4L,19,19,19,19,19,19
M4M,37,37,37,37,37,37
M4N,3,3,3,3,3,3
M4P,9,9,9,9,9,9
M4R,18,18,18,18,18,18
M4S,33,33,33,33,33,33
M4T,2,2,2,2,2,2
M4V,14,14,14,14,14,14


Find out how many unique categories can be curated from all the returned venues

In [44]:
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 237 unique categories.


### Analyze each neighbourhood (postcode)

In [45]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add postcode column back to dataframe
Toronto_onehot['Postcode'] = Toronto_venues['Postcode'] 

# move postcode column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Postcode,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
Toronto_onehot.shape

(1624, 238)

Group rows by postcode, taking the mean of the frequency of occurrence of each category

In [47]:
Toronto_grouped = Toronto_onehot.groupby('Postcode').mean().reset_index()
Toronto_grouped

Unnamed: 0,Postcode,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,...,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.023256
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054054,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.027027
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556
7,M4S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0


In [48]:
Toronto_grouped.shape

(39, 238)
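Because each venue contributes exactly one 1 to its one-hot row, the mean per postcode is the share of that postcode's venues falling into each category, and the shares of a postcode add up to 1. A small sketch on made-up data (`sample_venues` is hypothetical) shows this.

```python
import pandas as pd

# made-up venues: four in M4E and two in M4T
sample_venues = pd.DataFrame({
    'Postcode': ['M4E', 'M4E', 'M4E', 'M4E', 'M4T', 'M4T'],
    'Venue Category': ['Trail', 'Pub', 'Health Food Store', 'Neighborhood', 'Trail', 'Park'],
})

# one column per category, with a 1 in the column of the venue's category
onehot = pd.get_dummies(sample_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Postcode', sample_venues['Postcode'])

# the mean of the 0/1 columns is the relative frequency within each postcode
grouped = onehot.groupby('Postcode').mean().reset_index()
print(grouped)
```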

In [49]:
Toronto_grouped.dtypes

Postcode                  object
Afghan Restaurant        float64
Airport                  float64
Airport Food Court       float64
Airport Gate             float64
                          ...   
Video Game Store         float64
Vietnamese Restaurant    float64
Wine Bar                 float64
Women's Store            float64
Yoga Studio              float64
Length: 238, dtype: object

#### Print each postcode along with the top 5 most common venues

In [50]:
num_top_venues = 5

for postcode in Toronto_grouped['Postcode']:
    print("----"+postcode+"----")
    temp = Toronto_grouped[Toronto_grouped['Postcode'] == postcode].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    #temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4E----
                venue  freq
0               Trail  0.25
1   Health Food Store  0.25
2        Neighborhood  0.25
3                 Pub  0.25
4  Mexican Restaurant     0


----M4K----
                    venue       freq
0        Greek Restaurant   0.162791
1             Coffee Shop  0.0930233
2      Italian Restaurant  0.0697674
3  Furniture / Home Store  0.0465116
4               Bookstore  0.0465116


----M4L----
                  venue       freq
0                  Park   0.105263
1           Coffee Shop  0.0526316
2               Brewery  0.0526316
3  Fast Food Restaurant  0.0526316
4     Fish & Chips Shop  0.0526316


----M4M----
                 venue       freq
0          Coffee Shop  0.0810811
1              Brewery  0.0540541
2                 Café  0.0540541
3  American Restaurant  0.0540541
4               Bakery  0.0540541


----M4N----
                 venue      freq
0                 Park  0.333333
1             Bus Line  0.333333
2          Swim School  0.333
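The frequencies printed above are not rounded to two decimals. After the transpose, the `freq` column mixes the postcode string with floats and therefore has dtype `object`, which `DataFrame.round` leaves untouched. The sketch below converts the column to float first; the one-row `grouped` frame is made up, with values taken from the M4K output.

```python
import pandas as pd

# made-up one-row stand-in for Toronto_grouped
grouped = pd.DataFrame({
    'Postcode': ['M4K'],
    'Greek Restaurant': [0.162791],
    'Coffee Shop': [0.0930233],
})

temp = grouped[grouped['Postcode'] == 'M4K'].T.reset_index()
temp.columns = ['venue', 'freq']
temp = temp.iloc[1:].copy()            # drop the Postcode row
dtype_before = temp['freq'].dtype      # object: the column held a string and floats

temp['freq'] = temp['freq'].astype(float)   # make the column numeric so round() applies
temp = temp.round({'freq': 2})
top = temp.sort_values('freq', ascending=False).reset_index(drop=True)
print(top)
```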

#### Put that into a pandas dataframe

Write a function to sort the venues in descending order

In [51]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each neighborhood

In [52]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postcode'] = Toronto_grouped['Postcode']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Pub,Health Food Store,Trail,Neighborhood,Yoga Studio,Distribution Center,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Ice Cream Shop,Furniture / Home Store,Bookstore,Juice Bar,Spa,Pub
2,M4L,Park,Sushi Restaurant,Fish & Chips Shop,Steakhouse,Pub,Brewery,Fast Food Restaurant,Italian Restaurant,Restaurant,Pizza Place
3,M4M,Coffee Shop,American Restaurant,Bakery,Brewery,Café,Gastropub,Yoga Studio,Fish Market,Park,Neighborhood
4,M4N,Park,Bus Line,Swim School,Dim Sum Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


## Cluster postcodes

Run k-means to cluster the neighborhoods into 5 clusters.

In [53]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [54]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Postcode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 2, 1, 1, 1, 4, 1], dtype=int32)
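For reference, the standard k-means objective is the within-cluster sum of squared distances. Writing $x_i$ for the row of category frequencies of postcode $i$, $C_j$ for the set of rows assigned to cluster $j$, and $\mu_j$ for the centroid of that cluster (notation introduced here for illustration), with $k = 5$ in this notebook:

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

scikit-learn reports the value reached for this sum in `kmeans.inertia_`.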

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [55]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = df

# join the sorted venues onto df to add the cluster label and the top venues for each postcode
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Postcode'), on='Postcode')

Toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Yoga Studio,Event Space,Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1,Coffee Shop,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Restaurant,Café,Portuguese Restaurant,Chinese Restaurant,Park
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Bookstore,Pizza Place,Middle Eastern Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Cocktail Bar,Restaurant,Gastropub,American Restaurant,Beer Bar,Gym,Moroccan Restaurant,Department Store
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Pub,Health Food Store,Trail,Neighborhood,Yoga Studio,Distribution Center,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
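In this run every postcode received a cluster label. If a postcode had returned no venues, it would be absent from the grouped data, and the join would leave NaN in *Cluster Labels* and turn the column into floats, which cannot be used to index a list of colours. The sketch below shows one way to check for this and clean it up; both dataframes are made up, and `M9Z` is a hypothetical postcode without venues.

```python
import pandas as pd

# made-up stand-ins: M9Z has coordinates but no cluster label
locations = pd.DataFrame({
    'Postcode': ['M5A', 'M9Z'],
    'Latitude': [43.65426, 43.7],
    'Longitude': [-79.360636, -79.5],
})
labels = pd.DataFrame({'Postcode': ['M5A'], 'Cluster Labels': [1]})

merged = locations.join(labels.set_index('Postcode'), on='Postcode')

# postcodes that did not receive a label
unlabelled = merged[merged['Cluster Labels'].isna()]
print(unlabelled['Postcode'].tolist())

# drop them and restore the integer dtype of the labels
merged = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(merged)
```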


Visualization

In [56]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [57]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Postcode'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters