<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto Project</font></h1>

In [1]:
import numpy as np 
import pandas as pd

import requests


## Import the Wikipedia table with BeautifulSoup

In [2]:
# import BeautifulSoup
from bs4 import BeautifulSoup

In [3]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
website_url = requests.get(url).text
soup = BeautifulSoup(website_url,'lxml')

In [4]:
my_table = soup.find('table',{'class':'wikitable sortable'})
print(my_table)

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

We now have the table in HTML format.

Next, we create an empty dataframe with the columns we want and fill it with the data from my_table, reading the td cells in order.
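
The same idea can be tried offline on a tiny table. This is a minimal sketch, not the notebook's own method: `sample_html` is hand-written to mimic the Wikipedia layout, the built-in `html.parser` is used instead of `lxml`, and the cells are grouped per `tr` row rather than counted modulo 3.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written table with the same layout as the Wikipedia one (assumption).
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

records = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) removes the trailing newline of each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        records.append(cells)

sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Grouping by row keeps the three columns aligned even if a row were to have a missing cell, which the modulo counter cannot detect.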

In [5]:
# create empty dataframe

column_names = ['PostalCode', 'Borough', 'Neighborhood'] 
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
# read my_table cell by cell and fill each of the three columns
# In each group of 3 cells, the first is the postal code, the second the borough and the third the neighborhood

i=0
postalcode=[]
borough=[]
neighborhood=[]

for row in my_table.find_all('td'):
    if i%3==0:
        postalcode.append(row.text)
    if i%3==1:
        borough.append(row.text)
    if i%3==2:
        neighborhood.append(row.text)
    i+=1

neighborhoods['PostalCode']=postalcode
neighborhoods['Borough']=borough
neighborhoods['Neighborhood']=neighborhood

In [7]:
# Check the result
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [8]:
# Remove the trailing newline character from each cell
# (regex=False makes the literal replacement explicit, whatever the pandas version)

neighborhoods['PostalCode'] = neighborhoods['PostalCode'].str.replace('\n', '', regex=False)
neighborhoods['Borough'] = neighborhoods['Borough'].str.replace('\n', '', regex=False)
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.replace('\n', '', regex=False)

In [9]:
# Check the result, this time it should be ok
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
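
As an alternative sketch, the newline can be removed from every column in one pass with `str.strip`. The `raw` frame below is a two-row stand-in for the scraped table, not the real data.

```python
import pandas as pd

# Two-row stand-in for the scraped table (assumption).
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "North York\n"],
    "Neighborhood": ["Not assigned\n", "Parkwoods\n"],
})

# Strip leading and trailing whitespace, including "\n", from every column.
cleaned = raw.apply(lambda col: col.str.strip())
print(cleaned)
```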


In [10]:
# Count number of rows in order to later check if we correctly suppressed unassigned rows
neighborhoods.shape

(180, 3)

The table now has the expected three columns and clean cell values.

We still need to clean the data:
- remove the rows where the borough is 'Not assigned'
- combine rows that share the same postal code
- when a neighborhood is not assigned, give it the name of its borough

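## Remove rows with an unassigned borough

The dropna function does not help here: it only removes missing values (NaN), whereas these cells contain the literal string 'Not assigned'.
So the idea is to filter the dataframe and keep only the rows whose borough is not 'Not assigned'.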
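
A small sketch of the difference, on a made-up three-row frame called `demo`: `dropna` leaves the placeholder string alone, while a boolean mask (or a prior conversion of the placeholder to NaN) removes the row.

```python
import numpy as np
import pandas as pd

# Made-up three-row frame for illustration.
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# dropna() finds nothing to drop: 'Not assigned' is an ordinary string, not NaN.
unchanged = demo.dropna()

# Option 1: boolean mask, as done in this notebook.
by_mask = demo[demo["Borough"] != "Not assigned"]

# Option 2: turn the placeholder into NaN first, then dropna() works.
by_dropna = demo.replace("Not assigned", np.nan).dropna(subset=["Borough"])

print(len(unchanged), len(by_mask), len(by_dropna))
```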

In [11]:
#Count number of unassigned boroughs
neighborhoods[neighborhoods['Borough']=='Not assigned'].shape

(77, 3)

In [12]:
# Slice the dataframe by keeping only the rows with an assigned Borough
neighborhoods=neighborhoods[neighborhoods['Borough']!='Not assigned']

In [13]:
# Check the results on the first rows
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [14]:
# Check the result with the number of rows. We expect to have now 180 - 77 = 103 rows
neighborhoods.shape

(103, 3)

The rows with an unassigned borough are gone, which reduces the number of rows from 180 to 103.

## Combine rows with the same Postal Code
Or at least check whether there are any, because the table seems to have combined them already.

In [15]:
neighborhoods['PostalCode'].value_counts()

M9P    1
M4S    1
M4K    1
M1G    1
M5B    1
      ..
M8V    1
M9M    1
M5L    1
M2L    1
M1C    1
Name: PostalCode, Length: 103, dtype: int64

Each postal code is already unique.
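
Had duplicates been present, they could have been merged with a groupby. The sketch below uses a made-up frame `dup` in which M5A is split over two rows; it is only an illustration of the technique, since the real table needs no such step.

```python
import pandas as pd

# Made-up frame in which one postal code appears twice.
dup = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per postal code, neighborhoods joined with a comma.
combined = (
    dup.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```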

## Assign Borough name to unassigned neighborhood

In [16]:
# Count number of unassigned neighborhood
neighborhoods[neighborhoods['Neighborhood']=='Not assigned'].shape

(0, 3)

There is no unassigned neighborhood left.

In the Wikipedia table, every row with an unassigned neighborhood also has an unassigned borough, so these rows were already dropped.
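
If such rows had existed, the borough name could have been copied over as follows. This is a sketch on hypothetical data: the postal code `X1A` and the borough `Example Borough` are invented for the illustration.

```python
import pandas as pd

# Hypothetical rows: 'X1A' and 'Example Borough' do not exist in the real table.
demo = pd.DataFrame({
    "PostalCode": ["X1A", "M3A"],
    "Borough": ["Example Borough", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighborhood is the placeholder, take the borough name instead.
unassigned = demo["Neighborhood"] == "Not assigned"
demo.loc[unassigned, "Neighborhood"] = demo.loc[unassigned, "Borough"]
print(demo)
```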

## Final Table - End of part 1

In [17]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [18]:
neighborhoods.shape

(103, 3)

# Part 2 - Getting latitude and longitude

The coordinates are read from the csv file Geospatial_Coordinates.csv.

In [19]:
# read the csv file in a dataframe
coord = pd.read_csv('Geospatial_Coordinates.csv')

In [20]:
# Check the dataframe
coord.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [21]:
# rename Postal Code column to have the exact same name as in neighborhood
coord.rename(columns={'Postal Code':'PostalCode'},inplace=True)
coord.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [22]:
# merge the dataframes with the key being the Postal Code
Neighborhood_table = pd.merge(neighborhoods, coord, on='PostalCode')

In [23]:
Neighborhood_table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [24]:
Neighborhood_table.shape

(103, 5)
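
The shape (103, 5) confirms that every postal code found its coordinates. An inner merge would otherwise drop unmatched codes silently, so a more explicit check can be useful. The sketch below is one way to do it, on small stand-in frames `left` and `right`; the code `M9Z` is invented so that one row has no match.

```python
import pandas as pd

# Stand-in frames; 'M9Z' is a made-up code without coordinates.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Hypothetical"]})
right = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# how='left' keeps every neighborhood row, validate checks key uniqueness,
# indicator adds a '_merge' column telling where each row came from.
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```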

# Part 3 - Exploring neighborhoods

Since the data is organized by postal code, we identify neighborhoods by their postal code.

In [25]:
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [26]:
Neighborhood_table['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

### Show Postal Codes locations

In [27]:
# centering the map around postal code M4S that seems to be central
index = Neighborhood_table[Neighborhood_table['PostalCode']=='M4S'].index[0]
latitude = Neighborhood_table['Latitude'].loc[index]
longitude = Neighborhood_table['Longitude'].loc[index]

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for code, lat, lng, borough, neighborhood in zip(Neighborhood_table['PostalCode'], Neighborhood_table['Latitude'], Neighborhood_table['Longitude'], Neighborhood_table['Borough'], Neighborhood_table['Neighborhood']):
    label = '{},{}'.format(code, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

    
map_toronto
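
Instead of picking a postal code that seems central, the map center can also be computed as the mean of all coordinates. A minimal sketch, using the first three rows of the merged table as a stand-in named `points`:

```python
import pandas as pd

# First three rows of the merged table, used as a stand-in.
points = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Latitude": [43.753259, 43.725882, 43.65426],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# Arithmetic mean of the coordinates; fine for an area as small as a city.
center = [points["Latitude"].mean(), points["Longitude"].mean()]
print(center)
```

The resulting list can be passed as the `location` argument of `folium.Map`.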

In [28]:
# Define Foursquare credentials (replace the placeholders with your own keys)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


### Explore Downtown Toronto Postal Code M5A

In [29]:
neighborhood_latitude = Neighborhood_table['Latitude'][2] # neighborhood latitude value
neighborhood_longitude = Neighborhood_table['Longitude'][2] # neighborhood longitude value

neighborhood_name = Neighborhood_table['PostalCode'][2] # postal code, used here as the neighborhood identifier

radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
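
The request URL can also be assembled from a dictionary of parameters, which escapes each value and avoids the stray `&` after the `?`. This is a sketch only: `build_explore_url` is a hypothetical helper, and `"ID"` and `"SECRET"` are dummy values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL; urlencode escapes every value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Dummy credentials, coordinates of postal code M5A.
demo_url = build_explore_url("ID", "SECRET", "20180605", 43.65426, -79.360636, 500, 100)
print(demo_url)
```

Passing the same dictionary through the `params` argument of `requests.get` performs an equivalent encoding.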

In [30]:
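def get_category_type(row):
    """Return the name of the first category of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        # after json_normalize the column is named 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']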

In [31]:
results

{'meta': {'code': 200, 'requestId': '5f9fc6e22be1a37c28e4db66'},
 'response': {'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 44,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.653446723052674,
          'lng': -79.3620167174383}],
        'distance': 143,
       

In [32]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


In [33]:
print('{} venues were returned by Foursquare around postal Code {}'.format(nearby_venues.shape[0],neighborhood_name))

44 venues were returned by Foursquare around postal Code M5A
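
The flattening step can be reproduced offline. The sketch below feeds `pd.json_normalize` a hand-written list `items` shaped like `results['response']['groups'][0]['items']`, reduced to the fields used here; the two venues are taken from the table above.

```python
import pandas as pd

# Hand-written items with only the fields used in this notebook (assumption).
items = [
    {"venue": {"name": "Roselle Desserts",
               "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.653447, "lng": -79.362017}}},
    {"venue": {"name": "Tandem Coffee",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.653559, "lng": -79.361809}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
dotted_columns = list(flat.columns)

# Keep the name of the first category, then shorten the column names.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```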


### Explore Neighborhoods in Downtown Toronto

In [34]:
# create the Downtown Toronto dataframe
DowntownToronto_data = Neighborhood_table[Neighborhood_table['Borough']=='Downtown Toronto']
DowntownToronto_data.shape

(19, 5)

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
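
One fragile point in the function above is `v['venue']['categories'][0]`, which raises an IndexError for a venue without any category. A possible guard is sketched below; `venue_rows` is a hypothetical helper that only does the row extraction, and `sample_items` is hand-written test data.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one postal code into flat tuples.

    Venues without any category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hand-written items; the second venue is invented and has no category.
sample_items = [
    {"venue": {"name": "Roselle Desserts",
               "location": {"lat": 43.653447, "lng": -79.362017},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]
rows = venue_rows("M5A", 43.65426, -79.360636, sample_items)
print(rows)
```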

In [36]:
DowntownToronto_venues = getNearbyVenues(names=DowntownToronto_data['PostalCode'],
                                   latitudes=DowntownToronto_data['Latitude'],
                                   longitudes=DowntownToronto_data['Longitude']
                                  )

In [37]:
print(DowntownToronto_venues.shape)
DowntownToronto_venues.head()

(1248, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


In [38]:
DowntownToronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4W,4,4,4,4,4,4
M4X,48,48,48,48,48,48
M4Y,75,75,75,75,75,75
M5A,44,44,44,44,44,44
M5B,100,100,100,100,100,100
M5C,85,85,85,85,85,85
M5E,55,55,55,55,55,55
M5G,68,68,68,68,68,68
M5H,100,100,100,100,100,100
M5J,100,100,100,100,100,100


In [39]:
print('There are {} unique categories.'.format(len(DowntownToronto_venues['Venue Category'].unique())))

There are 213 unique categories.


### Analyze Each Neighborhood

In [40]:
DowntownToronto_onehot = pd.get_dummies(DowntownToronto_venues[['Venue Category']], prefix="", prefix_sep="")
DowntownToronto_onehot['Neighborhood'] = DowntownToronto_venues['Neighborhood'] 
fixed_columns = [DowntownToronto_onehot.columns[-1]] + list(DowntownToronto_onehot.columns[:-1])
DowntownToronto_onehot = DowntownToronto_onehot[fixed_columns]


print(DowntownToronto_onehot.shape)
DowntownToronto_onehot.head()

(1248, 213)


Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
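
Note that there are 213 unique categories, yet the one-hot frame still has 213 columns after the 'Neighborhood' column is added, and 'Yoga Studio' is the column that was moved to the front. This suggests that one venue category is itself named 'Neighborhood', so the assignment overwrote that dummy column instead of appending a new one. The sketch below reproduces the situation on made-up data and shows one possible way around it; the new column name 'Neighborhood (category)' is an arbitrary choice.

```python
import pandas as pd

# Made-up venues; one of them has the category 'Neighborhood'.
venues = pd.DataFrame({
    "Neighborhood": ["M5A", "M5A", "M5B"],
    "Venue Category": ["Bakery", "Neighborhood", "Yoga Studio"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
n_categories = onehot.shape[1]  # 3 dummy columns, one of them named 'Neighborhood'

# Plain assignment would overwrite the existing 'Neighborhood' dummy column.
# Renaming that dummy first keeps both pieces of information.
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot.columns.tolist())
```

The grouping step works in both variants, because `groupby('Neighborhood')` followed by `reset_index()` puts the key column first.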


In [41]:
# grouping rows by neighborhood and mean of frequencies
DowntownToronto_grouped = DowntownToronto_onehot.groupby('Neighborhood').mean().reset_index()
DowntownToronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,M4W,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
1,M4X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M4Y,0.026667,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,...,0.013333,0.013333,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,0.0
3,M5A,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M5B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0


In [42]:
DowntownToronto_grouped.shape

(19, 213)
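
Taking the mean of one-hot columns per group yields the relative frequency of each category in that group. A small sketch with made-up venues, arranged so that M4W gets the same frequencies as in this notebook (Park 0.50, Playground 0.25, Trail 0.25), and cross-checked against a normalised cross-tabulation:

```python
import pandas as pd

# Made-up venue list: four venues for M4W, one for M5A.
venues = pd.DataFrame({
    "Neighborhood": ["M4W", "M4W", "M4W", "M4W", "M5A"],
    "Venue Category": ["Park", "Park", "Playground", "Trail", "Bakery"],
})

onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
freq = onehot.groupby(venues["Neighborhood"]).mean()

# Same numbers from a cross-tabulation normalised per row.
check = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")
print(freq)
```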

In [43]:
#  print the 5 top venues for each neighborhood

num_top_venues = 5

for hood in DowntownToronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = DowntownToronto_grouped[DowntownToronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4W----
                 venue  freq
0                 Park  0.50
1           Playground  0.25
2                Trail  0.25
3  Moroccan Restaurant  0.00
4  Martial Arts School  0.00


----M4X----
         venue  freq
0  Coffee Shop  0.08
1   Restaurant  0.06
2         Café  0.06
3  Pizza Place  0.06
4          Pub  0.04


----M4Y----
                 venue  freq
0          Coffee Shop  0.09
1  Japanese Restaurant  0.05
2     Sushi Restaurant  0.05
3              Gay Bar  0.05
4           Restaurant  0.04


----M5A----
            venue  freq
0     Coffee Shop  0.18
1             Pub  0.07
2          Bakery  0.07
3            Park  0.07
4  Breakfast Spot  0.05


----M5B----
             venue  freq
0   Clothing Store  0.10
1      Coffee Shop  0.09
2             Café  0.04
3   Cosmetics Shop  0.03
4  Bubble Tea Shop  0.03


----M5C----
          venue  freq
0   Coffee Shop  0.07
1          Café  0.06
2    Restaurant  0.05
3  Cocktail Bar  0.05
4      Beer Bar  0.04


----M5E----
    

In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
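
The function above always returns `num_top_venues` names, even when a postal code has fewer categories. For M4W, which has only three categories, every entry past the third has frequency 0 and is just filler. A possible variant that keeps only categories actually present is sketched below; `top_categories` is a hypothetical helper and the frequencies in `grouped` are made up.

```python
import pandas as pd

# Made-up frequency table with the same layout as DowntownToronto_grouped.
grouped = pd.DataFrame({
    "Neighborhood": ["M4W", "M5A"],
    "Bakery": [0.0, 0.07],
    "Coffee Shop": [0.0, 0.18],
    "Park": [0.50, 0.05],
    "Trail": [0.25, 0.0],
})

def top_categories(frequencies, n):
    """Return up to n category names, most frequent first, ignoring zeros."""
    present = frequencies[frequencies > 0]
    return present.nlargest(n).index.tolist()

# Using the postal code as index leaves only numeric columns in each row.
numeric = grouped.set_index("Neighborhood")
top = {code: top_categories(row, 3) for code, row in numeric.iterrows()}
print(top)
```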

In [45]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Downtownneighborhoods_venues_sorted = pd.DataFrame(columns=columns)
Downtownneighborhoods_venues_sorted['PostalCode'] = DowntownToronto_grouped['Neighborhood']

for ind in np.arange(DowntownToronto_grouped.shape[0]):
    Downtownneighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(DowntownToronto_grouped.iloc[ind, :], num_top_venues)

Downtownneighborhoods_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Park,Trail,Playground,Creperie,Doner Restaurant,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
1,M4X,Coffee Shop,Café,Pizza Place,Restaurant,Pub,Chinese Restaurant,Market,Bakery,Park,Italian Restaurant
2,M4Y,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Yoga Studio,Men's Store,Mediterranean Restaurant,Hotel,Pub
3,M5A,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Theater,Yoga Studio,Mexican Restaurant,Restaurant
4,M5B,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Cosmetics Shop,Japanese Restaurant,Ramen Restaurant,Hotel,Italian Restaurant,Lingerie Store


### Cluster Neighborhoods

To get more relevant results, we consider all of Toronto, not just Downtown Toronto.

In [48]:
# repeat the previous steps for all of Toronto
AllToronto_data = Neighborhood_table

AllToronto_venues = getNearbyVenues(names=AllToronto_data['PostalCode'],
                                   latitudes=AllToronto_data['Latitude'],
                                   longitudes=AllToronto_data['Longitude']
                                  )
AllToronto_onehot = pd.get_dummies(AllToronto_venues[['Venue Category']], prefix="", prefix_sep="")
AllToronto_onehot['Neighborhood'] = AllToronto_venues['Neighborhood'] 
fixed_columns = [AllToronto_onehot.columns[-1]] + list(AllToronto_onehot.columns[:-1])
AllToronto_onehot = AllToronto_onehot[fixed_columns]

AllToronto_grouped = AllToronto_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Allneighborhoods_venues_sorted = pd.DataFrame(columns=columns)
Allneighborhoods_venues_sorted['PostalCode'] = AllToronto_grouped['Neighborhood']

for ind in np.arange(AllToronto_grouped.shape[0]):
    Allneighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(AllToronto_grouped.iloc[ind, :], num_top_venues)


In [49]:
# set number of clusters
kclusters = 4

AllToronto_grouped_clustering = AllToronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(AllToronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])

[1 0 1 1 1 1 1 1 1 1]
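
To see what k-means does with such frequency vectors, here is a minimal sketch on a made-up matrix `X` with two obvious groups. `n_init` is set explicitly because its default has changed across scikit-learn releases; the cluster sizes are then counted with `np.bincount`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: two park-heavy rows and three coffee-heavy rows.
X = np.array([
    [0.50, 0.00, 0.25],
    [0.45, 0.05, 0.30],
    [0.02, 0.20, 0.00],
    [0.00, 0.18, 0.01],
    [0.03, 0.22, 0.02],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Number of rows assigned to each cluster.
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```

The label numbers themselves are arbitrary; only the grouping they induce is meaningful.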


In [50]:
# add clustering labels
Allneighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = AllToronto_data

# join the sorted venues (with their cluster labels) to the Toronto table on PostalCode
Toronto_merged = Toronto_merged.join(Allneighborhoods_venues_sorted.set_index('PostalCode'), on='PostalCode')



Toronto_merged.head() 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Park,Food & Drink Shop,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Intersection,Pizza Place,Coffee Shop,Hockey Arena,Portuguese Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Café,Theater,Hotel,Chocolate Shop,Spa
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Women's Store,Coffee Shop,Event Space,Furniture / Home Store,Gift Shop,Boutique,Accessories Store,Vietnamese Restaurant,Convenience Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Yoga Studio,Sushi Restaurant,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Restaurant,Café,Portuguese Restaurant


In [51]:
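# drop the postal codes without a cluster label: no venue was returned for them, so the join left NaN
Toronto_merged=Toronto_merged.dropna()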

In [52]:
Toronto_merged.shape

(100, 16)
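
The drop from 103 to 100 rows and the float labels (1.0, 2.0) both come from the join: postal codes without a match receive NaN, which forces the label column to float. A sketch on stand-in frames, where `M9Z` is a made-up code without venues:

```python
import pandas as pd

# Stand-in frames; 'M9Z' is a made-up code that has no cluster label.
table = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"]})
labels = pd.DataFrame({"PostalCode": ["M3A", "M4A"], "Cluster Labels": [2, 1]})

merged = table.join(labels.set_index("PostalCode"), on="PostalCode")
# The unmatched row gets NaN, which turns the integer labels into floats.
dtype_before = merged["Cluster Labels"].dtype

# After removing the unmatched rows, the labels can be cast back to int.
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```

With integer labels, the `int(cluster)` conversions used when drawing the map are no longer needed.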

In [53]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['PostalCode'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [54]:
Toronto_merged['Cluster Labels'].value_counts()

1.0    86
2.0    10
0.0     3
3.0     1
Name: Cluster Labels, dtype: int64