Purpose: Access and parse a web page in order to build a dataframe from one of its tables.

Install and import the "Requests" library for page access.

In [1]:
#!conda install -c anaconda requests -y

In [2]:
import requests

In [3]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Download the Wikipedia Page

In [4]:
wikipedia_page = requests.get(wikipedia_link)
page = wikipedia_page.text
#print(page)

Install and import "BeautifulSoup" to parse the page

In [5]:
#!conda install -c anaconda BeautifulSoup4 -y

In [6]:
from bs4 import BeautifulSoup

Install supporting html parsing libraries

In [7]:
#!conda install -c anaconda lxml -y

In [8]:
#!conda install -c anaconda html5lib -y

Start parsing the page

In [9]:
soup = BeautifulSoup(page, 'lxml')

Isolate the table section of the page

In [10]:
#If the page had several "table" tags, we could select one by its class, for example:
#table_section = soup.find('table', class_='wikitable sortable')
#Since we only need the first table here, the simpler approach is to grab
#the first "tbody" element (the body of the first "table" section).
table_section = soup.tbody
#print(soup.tbody.prettify())

Import Pandas

In [11]:
import pandas as pd

Parse the table and add each row to a new dataframe

In [12]:
res = []
for tr in table_section.find_all('tr'):
    cells = tr.find_all('td')
    # keep the stripped text of every non-empty cell in this row
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Postcode","Borough","Neighborhood"])
df.head(15)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
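
The row-by-row parsing above can be tried offline on a tiny table. This is only a sketch: `sample_html` is a made-up stand-in for the downloaded page (not the real Wikipedia markup), and it uses Python's built-in "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; not the real Wikipedia markup.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        sample_rows.append(cells)

print(sample_rows)
```

The header row is skipped because it contains "th" cells rather than "td" cells, which is why the column names are passed to the dataframe by hand.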


First, let's drop the rows that are "Not assigned" in the "Borough" column

In [13]:
#First, Drop rows that are "Not assigned" in "Borough" column 
df = df[df.Borough != 'Not assigned']
df.head(15)

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


Replace "Not assigned" Neighborhood values with the Borough value

In [14]:
#Purpose: Replace 'Not assigned' Neighborhood values with the Borough value
#
#This line of code works fine but needs the NumPy library; it should be a little faster
#>>> df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

#This line of code also works, in a different way, using only Pandas
#df.Neighborhood[df.Neighborhood == 'Not assigned'] = df.Borough

#I decided to use this form because it seems simpler
df.Neighborhood.replace('Not assigned',df.Borough,inplace=True)
df.head(15)

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
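
Another way to express the same replacement is `Series.mask`, which assigns the result back to the column instead of modifying it in place. This is a sketch on a small hand-made frame (`sample` is not the notebook's `df`), offered as an alternative rather than as what the cell above does.

```python
import pandas as pd

# Small hand-made frame for illustration only.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Where the condition is True, take the value from the Borough column instead.
sample["Neighborhood"] = sample["Neighborhood"].mask(
    sample["Neighborhood"] == "Not assigned", sample["Borough"]
)

print(sample)
```

Assigning the result back avoids the chained in-place call on a column, which newer pandas versions may warn about.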


Combine "Neighborhood" data/rows that have the same Postcode/Borough

In [15]:
#Combine "Neighborhood" values that share the same Postcode and Borough
#Note: We need the 'apply' method for this type of grouping
#This line of code works fine
#df_group = df.groupby(['Postcode', 'Borough']).apply(lambda group: ','.join(group['Neighborhood']))
#I found this line of code, which seems a little simpler
df_grouped = df.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
df_grouped.head(15)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
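
To see what the grouping does, here is the same pattern on three hand-made rows. It is a sketch only: `sample` and `combined` are illustrative names, and `agg` is used where the cell above uses `apply`; for a plain string join both should give the same result.

```python
import pandas as pd

# Three hand-made rows: two neighborhoods share postcode M1B.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Join the neighborhood names of each (Postcode, Borough) group into one string.
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)

print(combined)
```

Rows keep their original order inside each group, so "Rouge" still comes before "Malvern" after the join.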


Finally, show the shape of the final grouped dataframe.

In [16]:
df_grouped.shape

(103, 3)

## Phase 2: Import Latitude and Longitude and apply it to the dataframe above.

Download the csv file containing the Postal Latitude & Longitude data

In [17]:
!wget -O Coordinates.csv http://cocl.us/Geospatial_data

--2019-06-14 16:43:30--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2019-06-14 16:43:30--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-06-14 16:43:31--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-06-14 16:43:31--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjc

Convert the csv file data into a new dataframe.

In [18]:
df_geo = pd.read_csv("Coordinates.csv")

In [19]:
df_geo.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Next, we need to add the two new geo columns to the "df_grouped" dataframe above.
Before we do that, let's give the postal code columns the same name so we
can use that column to join the two dataframes.

In [20]:
#Rename the postal column so it matches the name in the above dataframe
df_geo = df_geo.rename(columns = {"Postal Code": "Postcode"}) 
df_geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Let's add the Latitude & Longitude columns from this dataframe to the above dataframe,
using "Postcode" as the key.

In [21]:
df_postal_geo = pd.merge(df_grouped, df_geo, on='Postcode')
df_postal_geo

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
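
`pd.merge` defaults to an inner join, so a postcode missing from the coordinates file would silently disappear from the result. The sketch below shows one way to check for that with `how='left'` and `indicator=True`. The frames are hand-made, and the postcode "M9Z" is a hypothetical value added only to force a mismatch.

```python
import pandas as pd

# Hand-made frames; "M9Z" is a hypothetical postcode with no coordinates.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every row of `left`; the _merge column says where each row matched.
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()

print(unmatched)
```

An empty `unmatched` list would mean every postcode found its coordinates, in which case the inner join loses nothing.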


## Phase 3: Map Clustering

In [25]:
import numpy as np # library to handle data in a vectorized manner

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [26]:
#First, let's focus our dataframe on a specific borough, "North York"
df_NY = df_postal_geo[df_postal_geo.Borough == 'North York']
df_NY = df_NY.reset_index(drop=True)
df_NY.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview,Henry Farm,Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills,York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493


In [27]:
#Let's get the Latitude and Longitude of "M2H, Hillcrest Village, North York"
NY_latitude = df_NY.loc[df_NY['Neighborhood'] == 'Hillcrest Village', 'Latitude'].iat[0]
NY_longitude = df_NY.loc[df_NY['Neighborhood'] == 'Hillcrest Village', 'Longitude'].iat[0]
print(NY_latitude)
print(NY_longitude)

43.8037622
-79.3634517


In [28]:
# create map of North York using latitude and longitude values
map_northyork = folium.Map(location=[NY_latitude, NY_longitude], zoom_start=10)

# add neighborhood/markers to map
for lat, lng, borough, neighborhood in zip(df_NY['Latitude'], df_NY['Longitude'], df_NY['Borough'], df_NY['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyork)  
    
map_northyork

Let's explore the neighborhoods and segment them using the Foursquare API

In [29]:
CLIENT_ID = 'CITQUKHYRNH24TDRB3E5FWK03UFKVSIEWLLX5R2DV1Q5G3JT' # your Foursquare ID
CLIENT_SECRET = 'TOEXJBCWJHPXLK5EFKMEBKX5FBOMOLZPGEHJ3CULSPYM03WE' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CITQUKHYRNH24TDRB3E5FWK03UFKVSIEWLLX5R2DV1Q5G3JT
CLIENT_SECRET:TOEXJBCWJHPXLK5EFKMEBKX5FBOMOLZPGEHJ3CULSPYM03WE


Let's pick one neighborhood to focus on

In [30]:
neighborhood_name = df_NY.loc[9, 'Neighborhood']
neighborhood_name

'Don Mills North'

In [31]:
#Let's get the Latitude and Longitude of the "Don Mills North" neighborhood
Donmills_latitude = df_NY.loc[df_NY['Neighborhood'] == 'Don Mills North', 'Latitude'].iat[0]
Donmills_longitude = df_NY.loc[df_NY['Neighborhood'] == 'Don Mills North', 'Longitude'].iat[0]

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               Donmills_latitude, 
                                                               Donmills_longitude))

Latitude and longitude values of Don Mills North are 43.745905799999996, -79.352188.


Let's get the top 5 venues within a 2-kilometer radius

In [32]:
LIMIT = 5 # limit of number of venues returned by Foursquare API

radius = 2000 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Donmills_latitude, 
    Donmills_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=CITQUKHYRNH24TDRB3E5FWK03UFKVSIEWLLX5R2DV1Q5G3JT&client_secret=TOEXJBCWJHPXLK5EFKMEBKX5FBOMOLZPGEHJ3CULSPYM03WE&v=20180605&ll=43.745905799999996,-79.352188&radius=2000&limit=5'
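
The query string above is assembled by hand with `format`. A sketch of the same idea using the standard library's `urlencode` follows. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; only the endpoint and parameter names come from the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Hypothetical helper: build the venues/explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes values (for example, the comma inside "ll").
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, not real ones.
example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605",
                                43.7459058, -79.352188, 2000, 5)
example_params = parse_qs(urlsplit(example_url).query)
print(example_params["ll"])
```

Keeping the parameters in a dictionary also makes it easy to pass them to `requests.get(url, params=...)` instead of building the string yourself.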

In [33]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d03d0656bdee60039413375'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4c18e819d4d9c9284e19f029-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/gym_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d175941735',
         'name': 'Gym / Fitness Center',
         'pluralName': 'Gyms or Fitness Centers',
         'primary': True,
         'shortName': 'Gym / Fitness'}],
       'id': '4c18e819d4d9c9284e19f029',
       'location': {'address': '1380 Don Mills Road',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'Mallard Rd',
        'distance': 455,
        'formattedAddress': ['1380 Don Mills Road (Mallard Rd)',
         'Toronto ON M3B 2X2',
         'Canada'],


Let's extract the categories of the venues

In [34]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [35]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,LA Fitness,Gym / Fitness Center,43.747665,-79.347077
1,Island Foods,Caribbean Restaurant,43.745866,-79.346035
2,VIA CIBO | italian streetfood,Italian Restaurant,43.754067,-79.357951
3,Matsuda Japanese Cuisine & Teppanyaki,Japanese Restaurant,43.745494,-79.345821
4,Galleria Supermarket,Supermarket,43.75352,-79.349518
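
The flattening step can be tried without calling the API. In this sketch `sample_items` is a hand-made stand-in that mimics only the fields used above; the second venue is hypothetical and has an empty category list to show the `None` case.

```python
import pandas as pd

# Hand-made items shaped like the 'items' list in the API response.
sample_items = [
    {"venue": {"name": "LA Fitness",
               "categories": [{"name": "Gym / Fitness Center"}],
               "location": {"lat": 43.747665, "lng": -79.347077}}},
    {"venue": {"name": "Hypothetical venue",  # made up, has no category
               "categories": [],
               "location": {"lat": 43.75, "lng": -79.35}}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(sample_items)

# Keep the first category name, or None when the list is empty.
flat["category"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

print(flat[["venue.name", "category"]])
```

Lists such as `categories` are not flattened by `json_normalize`, which is why a separate step is needed to pull the name out of each list.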


In [36]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


Let's repeat the same process for all neighborhoods in North York

In [37]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
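
One thing to watch in the function above: `v['venue']['categories'][0]` raises an IndexError if a venue comes back with an empty category list. Below is a sketch of a more defensive row extractor that can be tested without the network. The helper name `extract_venue_rows` and the sample items are assumptions for illustration, not part of the Foursquare API.

```python
def extract_venue_rows(name, lat, lng, items):
    """Hypothetical helper: turn API 'items' into row tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing when no category is present.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hand-made items; the second venue is made up and has no category.
sample_items = [
    {"venue": {"name": "Tastee", "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.807722, "lng": -79.356798}}},
    {"venue": {"name": "Hypothetical venue", "categories": [],
               "location": {"lat": 43.80, "lng": -79.36}}},
]

sample_rows = extract_venue_rows("Hillcrest Village", 43.803762, -79.363452, sample_items)
print(sample_rows)
```

Separating the row extraction from the HTTP request also keeps the request loop short and makes this part easy to check on its own.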

Call the function for each neighborhood in North York and store the results in a new dataframe, NorthYork_venues

In [38]:
NorthYork_venues = getNearbyVenues(names=df_NY['Neighborhood'],
                                   latitudes=df_NY['Latitude'],
                                   longitudes=df_NY['Longitude']
                                  )

Hillcrest Village
Fairview,Henry Farm,Oriole
Bayview Village
Silver Hills,York Mills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Bathurst Manor,Downsview North,Wilson Heights
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Bedford Park,Lawrence Manor East
Lawrence Heights,Lawrence Manor
Glencairn
Downsview,North Park,Upwood Park
Humber Summit
Emery,Humberlea


In [39]:
print(NorthYork_venues.shape)
NorthYork_venues.head()

(120, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Chatime Willowdale,43.791326,-79.367506,Bubble Tea Shop
1,Hillcrest Village,43.803762,-79.363452,Tastee,43.807722,-79.356798,Bakery
2,Hillcrest Village,43.803762,-79.363452,Bayview Golf & Country Club,43.809391,-79.375285,Golf Course
3,Hillcrest Village,43.803762,-79.363452,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,43.798391,-79.369187,Korean Restaurant
4,Hillcrest Village,43.803762,-79.363452,Galati,43.797831,-79.36941,Grocery Store


In [40]:
NorthYork_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor,Downsview North,Wilson Heights",5,5,5,5,5,5
Bayview Village,5,5,5,5,5,5
"Bedford Park,Lawrence Manor East",5,5,5,5,5,5
"CFB Toronto,Downsview East",5,5,5,5,5,5
Don Mills North,5,5,5,5,5,5
Downsview Central,5,5,5,5,5,5
Downsview Northwest,5,5,5,5,5,5
Downsview West,5,5,5,5,5,5
"Downsview,North Park,Upwood Park",5,5,5,5,5,5
"Emery,Humberlea",5,5,5,5,5,5


In [41]:
print('There are {} unique categories.'.format(len(NorthYork_venues['Venue Category'].unique())))

There are 71 unique categories.


In [42]:
# one hot encoding
NorthYork_onehot = pd.get_dummies(NorthYork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NorthYork_onehot['Neighborhood'] = NorthYork_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [NorthYork_onehot.columns[-1]] + list(NorthYork_onehot.columns[:-1])
NorthYork_onehot = NorthYork_onehot[fixed_columns]

NorthYork_onehot.head()

Unnamed: 0,Neighborhood,Airport,Asian Restaurant,Bagel Shop,Bakery,Bank,Beer Store,Bike Shop,Bookstore,Boutique,...,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Tennis Stadium,Thai Restaurant,Toy / Game Store,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
NorthYork_onehot.shape

(120, 72)

In [45]:
#Next, let's group rows by neighborhood and take the mean of each category column (its frequency of occurrence)
NorthYork_grouped = NorthYork_onehot.groupby('Neighborhood').mean().reset_index()
NorthYork_grouped

Unnamed: 0,Neighborhood,Airport,Asian Restaurant,Bagel Shop,Bakery,Bank,Beer Store,Bike Shop,Bookstore,Boutique,...,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Tennis Stadium,Thai Restaurant,Toy / Game Store,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store
0,"Bathurst Manor,Downsview North,Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park,Lawrence Manor East",0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto,Downsview East",0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2
4,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0
6,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0
7,Downsview West,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
8,"Downsview,North Park,Upwood Park",0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0
9,"Emery,Humberlea",0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
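
To make the frequency values concrete, here is the same one-hot-then-mean pattern on three hand-made venues. This is a sketch: the neighborhoods "A" and "B" are placeholders, and `dtype=float` is passed so the indicator columns are numeric regardless of the pandas version's default.

```python
import pandas as pd

# Three hand-made venues in two placeholder neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Bakery", "Park", "Bakery"],
})

# One indicator column per category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of an indicator column is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean().reset_index()

print(freq)
```

Neighborhood "A" has one bakery out of two venues, so its Bakery frequency is 0.5. With 5 venues per neighborhood, as in the notebook, the values come in steps of 0.2.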


Let's print each neighborhood along with the top 3 most common venues

In [46]:
num_top_venues = 3

for hood in NorthYork_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = NorthYork_grouped[NorthYork_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))

----Bathurst Manor,Downsview North,Wilson Heights----
           venue  freq
0  Deli / Bodega   0.2
1    Bridal Shop   0.2
2     Restaurant   0.2
----Bayview Village----
                 venue  freq
0   Chinese Restaurant   0.6
1  Japanese Restaurant   0.2
2      Bubble Tea Shop   0.2
----Bedford Park,Lawrence Manor East----
         venue  freq
0   Bagel Shop   0.2
1  Coffee Shop   0.2
2   Restaurant   0.2
----CFB Toronto,Downsview East----
                venue  freq
0             Airport   0.2
1        Climbing Gym   0.2
2  Turkish Restaurant   0.2
----Don Mills North----
                  venue  freq
0  Gym / Fitness Center   0.2
1    Italian Restaurant   0.2
2  Caribbean Restaurant   0.2
----Downsview Central----
                   venue  freq
0  Vietnamese Restaurant   0.4
1                 Bakery   0.2
2     Falafel Restaurant   0.2
----Downsview Northwest----
            venue  freq
0           Hotel   0.4
1  Tennis Stadium   0.4
2   Historic Site   0.2
----Downsview West----
 

Let's define a function that sorts the venues in descending order of frequency

In [47]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [48]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = NorthYork_grouped['Neighborhood']

for ind in np.arange(NorthYork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(NorthYork_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor,Downsview North,Wilson Heights",Mediterranean Restaurant,Park,Deli / Bodega,Restaurant,Bridal Shop,Coffee Shop,Convenience Store,Cosmetics Shop,Dessert Shop,Discount Store
1,Bayview Village,Chinese Restaurant,Japanese Restaurant,Bubble Tea Shop,Warehouse Store,Deli / Bodega,Dessert Shop,Discount Store,Doner Restaurant,Electronics Store,Event Space
2,"Bedford Park,Lawrence Manor East",Italian Restaurant,Restaurant,Café,Coffee Shop,Bagel Shop,Furniture / Home Store,French Restaurant,Fish Market,Fast Food Restaurant,Falafel Restaurant
3,"CFB Toronto,Downsview East",Warehouse Store,Liquor Store,Climbing Gym,Airport,Turkish Restaurant,Thai Restaurant,Furniture / Home Store,French Restaurant,Fish Market,Fast Food Restaurant
4,Don Mills North,Gym / Fitness Center,Caribbean Restaurant,Japanese Restaurant,Supermarket,Italian Restaurant,Electronics Store,Deli / Bodega,Dessert Shop,Discount Store,Doner Restaurant
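
Note that many categories tie at the same frequency (for example 0.2, or 0.0 for categories that never occur), so the order among tied columns depends on the sort and is not meaningful. One possible alternative is `Series.nlargest`, which by default keeps tied values in their original order. This is a sketch on a hand-made row, not the notebook's data.

```python
import pandas as pd

# Hand-made row shaped like one row of the grouped frequency table.
row = pd.Series({
    "Neighborhood": "Sample",
    "Bakery": 0.2,
    "Park": 0.6,
    "Bank": 0.2,
    "Hotel": 0.0,
})

# Skip the name, convert to float, and take the two largest frequencies.
# With the default keep='first', ties keep their original column order.
top_two = row.iloc[1:].astype(float).nlargest(2).index.tolist()

print(top_two)
```

Here "Bakery" and "Bank" tie at 0.2, and "Bakery" is returned because it comes first in the row.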


Run k-means to cluster the neighborhoods into 5 clusters.

In [49]:
# set number of clusters
kclusters = 5

NorthYork_grouped_clustering = NorthYork_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(NorthYork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 4, 3, 1, 4, 1], dtype=int32)
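
As a sanity check of what the labels mean, here is k-means on four hand-made frequency vectors that clearly fall into two groups. It is a sketch only: the data is invented, and `n_init=10` is set explicitly because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented "neighborhoods" described by two category frequencies each.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_

print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label. The first two rows should land in one cluster and the last two in the other.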

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [50]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

NorthYork_merged = df_NY

# join neighborhoods_venues_sorted onto df_NY to add the cluster label and top venues for each neighborhood
NorthYork_merged = NorthYork_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

NorthYork_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,4,Grocery Store,Golf Course,Bakery,Korean Restaurant,Bubble Tea Shop,Electronics Store,Deli / Bodega,Dessert Shop,Discount Store,Doner Restaurant
1,M2J,North York,"Fairview,Henry Farm,Oriole",43.778517,-79.346556,1,Movie Theater,Electronics Store,Pharmacy,Shopping Mall,Toy / Game Store,French Restaurant,Fish Market,Fast Food Restaurant,Falafel Restaurant,Event Space
2,M2K,North York,Bayview Village,43.786947,-79.385975,0,Chinese Restaurant,Japanese Restaurant,Bubble Tea Shop,Warehouse Store,Deli / Bodega,Dessert Shop,Discount Store,Doner Restaurant,Electronics Store,Event Space
3,M2L,North York,"Silver Hills,York Mills",43.75749,-79.374714,1,Steakhouse,Italian Restaurant,Bagel Shop,Furniture / Home Store,Fish Market,Cosmetics Shop,Deli / Bodega,Dessert Shop,Discount Store,Doner Restaurant
4,M2M,North York,"Newtonbrook,Willowdale",43.789053,-79.408493,1,Coffee Shop,Café,Hookah Bar,Grocery Store,Asian Restaurant,Bakery,Bank,Golf Course,General Entertainment,Furniture / Home Store


In [51]:
#Let's visualize the resulting clusters
# create map
map_clusters = folium.Map(location=[NY_latitude, NY_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NorthYork_merged['Latitude'], NorthYork_merged['Longitude'], NorthYork_merged['Neighborhood'], NorthYork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
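
A note on the colors: the cell above indexes the palette with `rainbow[cluster-1]`, so cluster 0 wraps around to the last color through Python's negative indexing. Each cluster still gets its own color, but the mapping is easy to misread. Below is a sketch that indexes the palette by the label directly; `palette` and `color_for` are illustrative names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow color per cluster, converted to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label 0..kclusters-1 straight to its own color.
color_for = {label: palette[label] for label in range(kclusters)}

print(color_for)
```

In the marker loop this would be used as `color=color_for[cluster]`, assuming the labels are the integers produced by k-means.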