### The following notebook is intended to be developed for the final IBM Data Science project.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Segmenting and Clustering Neighborhoods in Toronto

### 1. Creating a new Dataframe from Web Page

let's scratch the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, and then build a pandas data frame.

In [3]:
!pip install lxml



In [4]:
import lxml.html as lh
import requests

get the content of the web page, then the rows of the table containing the zip codes of the Toronto neighborhoods are extracted.

In [5]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
content = lh.fromstring(page.content)
trs = content.xpath('//table//tbody//tr')
print("Rows: ", len(trs))

Rows:  185


the first row contains the titles of the table, with which the dataframe is constructed.

In [6]:
colums=[]

for i in range(len(trs[0])):
    name = trs[0][i].text_content()
    colums.append(str(name).replace('\n',''))
    
neighborhoods = pd.DataFrame(columns=colums)
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighbourhood


the rows are traversed to add each neighborhood to the dataframe.

In [7]:
for i in range(1,len(trs)):
    row = trs[i]
    if(len(row) == 3):
        neighborhoods = neighborhoods.append({'Postal Code': str(row[0].text_content()).replace('\n',''),
                                              'Borough': str(row[1].text_content()).replace('\n',''),
                                              'Neighbourhood': str(row[2].text_content()).replace('\n','')
                                             }, ignore_index=True)

In [8]:
neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [9]:
neighborhoods.shape

(181, 3)

Rows with null values "Not assigned" are removed, as they are not actual neighborhoods.

In [10]:
neighborhoods.replace("Not assigned", np.nan, inplace = True)
neighborhoods.replace("", np.nan, inplace = True)
neighborhoods.dropna(inplace = True)

In [11]:
neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [12]:
neighborhoods.sort_values(by='Postal Code', ascending=True, inplace = True)

In [13]:
neighborhoods["Postal Code"].unique()

array(['M1B', 'M1C', 'M1E', 'M1G', 'M1H', 'M1J', 'M1K', 'M1L', 'M1M',
       'M1N', 'M1P', 'M1R', 'M1S', 'M1T', 'M1V', 'M1W', 'M1X', 'M2H',
       'M2J', 'M2K', 'M2L', 'M2M', 'M2N', 'M2P', 'M2R', 'M3A', 'M3B',
       'M3C', 'M3H', 'M3J', 'M3K', 'M3L', 'M3M', 'M3N', 'M4A', 'M4B',
       'M4C', 'M4E', 'M4G', 'M4H', 'M4J', 'M4K', 'M4L', 'M4M', 'M4N',
       'M4P', 'M4R', 'M4S', 'M4T', 'M4V', 'M4W', 'M4X', 'M4Y', 'M5A',
       'M5B', 'M5C', 'M5E', 'M5G', 'M5H', 'M5J', 'M5K', 'M5L', 'M5M',
       'M5N', 'M5P', 'M5R', 'M5S', 'M5T', 'M5V', 'M5W', 'M5X', 'M6A',
       'M6B', 'M6C', 'M6E', 'M6G', 'M6H', 'M6J', 'M6K', 'M6L', 'M6M',
       'M6N', 'M6P', 'M6R', 'M6S', 'M7A', 'M7R', 'M7Y', 'M8V', 'M8W',
       'M8X', 'M8Y', 'M8Z', 'M9A', 'M9B', 'M9C', 'M9L', 'M9M', 'M9N',
       'M9P', 'M9R', 'M9V', 'M9W'], dtype=object)

103 total neighborhoods

In [14]:
neighborhoods.shape

(103, 3)

In [15]:
neighborhoods.to_csv('neighborhoods.csv', index=False)  

### 2. get neighborhood coordinates

the geopy library is used to obtain the latitude and longitude values of the city of Toronto; three forms of query are used since only one is not enough to obtain the values.

The geopy library is not very optimal to obtain the coordinates of the neighborhoods, therefore, it is necessary to search with different queries and even then it is not possible to obtain all the coordinates for all the neighborhoods.

In [None]:
# !conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: \ 

In [16]:
from geopy.geocoders import Nominatim

neighborhoods_list = []
url = 'https://maps.googleapis.com/maps/api/geocode/json'

for i in range(0,neighborhoods.shape[:1][0]-1):
    row = neighborhoods.iloc[i,0:3]
    query = '{}, {}, {}'.format(row[0], row[1], row[2])
    
    geolocator = Nominatim(user_agent="user_agent")
    location = geolocator.geocode(query)
    
    if(location != None):
        neighborhoods_list.append([row[0], row[1], row[2], location.latitude, location.longitude])
    else:
        query = '{} Toronto, {}'.format(row[0], row[1])
        location = geolocator.geocode(query)
        if(location != None):
            neighborhoods_list.append([row[0], row[1], row[2], location.latitude, location.longitude])
        else:
            query = 'Toronto, Ontario {}, Canadá'.format(row[0])
            location = geolocator.geocode(query)
            if(location != None):
                neighborhoods_list.append([row[0], row[1], row[2], location.latitude, location.longitude])

neighborhoods_ll = None
neighborhoods_ll = pd.DataFrame(neighborhoods_list)
neighborhoods_ll.columns = ['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']

coordinates were only obtained for 21 neighborhoods.

In [17]:
neighborhoods_ll.head(24)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.773077,-79.257774
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.773077,-79.257774
2,M1G,Scarborough,Woburn,43.759824,-79.225291
3,M1W,Scarborough,"Steeles West, L'Amoreaux West",43.812528,-79.307979
4,M2J,North York,"Fairview, Henry Farm, Oriole",43.765498,-79.364269
5,M2M,North York,"Willowdale, Newtonbrook",43.785962,-79.416031
6,M2N,North York,"Willowdale, Willowdale East",43.77398,-79.413833
7,M3A,North York,Parkwoods,43.761224,-79.323986
8,M3C,North York,Don Mills,43.732822,-79.346961
9,M3K,North York,Downsview,43.735823,-79.478709


The csv file "Geospatial_Coordinates" will be used, because the geopy library is not very good at obtaining the coordinates of the neighborhoods, likewise, it was tried to use other libraries but the one that worked best was geopy library.

we load the file:

In [19]:
Geospatial_df = pd.read_csv('https://cocl.us/Geospatial_data')
Geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


we combine it with the data frame that contains the neighborhoods

In [20]:
neighborhoods_merged = neighborhoods.merge(Geospatial_df, left_on='Postal Code', right_on='Postal Code')
neighborhoods_merged.head(50)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


And make sure how many boroughs and neighborhoods the dataset set has.

In [21]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods_merged['Borough'].unique()),
        neighborhoods_merged.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


### 3. Explore and cluster the neighborhoods in Toronto

First, we get the coordinates of Toronto, Ontario, Canada

In [24]:
address = 'Toronto, Ontario, Canadá'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 43.6534817, -79.3839347.


In [25]:
# !conda install -c conda-forge folium=0.5.0 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: - 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
                                                                                                                                                         /failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - cffi -> python[version='2.7.*|3.5.*|3.6.*|3.6.9|3.6.9|3.6.9|3.6.9|>=3.6,<3.7.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0|>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|3.4.*',build='0_73_pypy|3_73_pypy|2_73_pypy|1_73_pypy']
  - rsa -> python[version='2.7.*|3.4.*|3.5.*|3.6.*']

Yo

In [28]:
!pip install folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
[K     |████████████████████████████████| 93 kB 2.6 MB/s eta 0:00:011
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


we created a map with all the Toronto neighborhoods

In [29]:
# create map of New York using latitude and longitude values
import folium
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods_merged['Latitude'],
                                           neighborhoods_merged['Longitude'],
                                           neighborhoods_merged['Borough'],
                                           neighborhoods_merged['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [30]:
# The code was removed by Watson Studio for sharing.


In [31]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

We make use of the foursquare library to obtain the venues of each neighborhood, within a radius of 500 and a limit of 100, in the end we all add them to a dataframe which will be used for analysis and classification.

In [32]:
venues_list=[]
radius = 500

for code, bor, name, lat, lng in zip(neighborhoods_merged['Postal Code'],
                                     neighborhoods_merged['Borough'],
                                     neighborhoods_merged['Neighbourhood'],
                                     neighborhoods_merged['Latitude'],
                                     neighborhoods_merged['Longitude']):
    
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    print('Neighbourhood: {}, Venues: {}'.format(name, len(results)))
    
    venues_list.append([(code,bor,name,lat,lng,
                         v['venue']['name'],
                         v['venue']['location']['lat'],
                         v['venue']['location']['lng'],
                         v['venue']['categories'][0]['name']) for v in results])
    
toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
toronto_venues.columns = ['PostalCode', 'Borough', 'Neighbourhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 
                          'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

Neighbourhood: Malvern, Rouge, Venues: 1
Neighbourhood: Rouge Hill, Port Union, Highland Creek, Venues: 2
Neighbourhood: Guildwood, Morningside, West Hill, Venues: 8
Neighbourhood: Woburn, Venues: 4
Neighbourhood: Cedarbrae, Venues: 8
Neighbourhood: Scarborough Village, Venues: 3
Neighbourhood: Kennedy Park, Ionview, East Birchmount Park, Venues: 4
Neighbourhood: Golden Mile, Clairlea, Oakridge, Venues: 10
Neighbourhood: Cliffside, Cliffcrest, Scarborough Village West, Venues: 2
Neighbourhood: Birch Cliff, Cliffside West, Venues: 4
Neighbourhood: Dorset Park, Wexford Heights, Scarborough Town Centre, Venues: 5
Neighbourhood: Wexford, Maryvale, Venues: 6
Neighbourhood: Agincourt, Venues: 5
Neighbourhood: Clarks Corners, Tam O'Shanter, Sullivan, Venues: 12
Neighbourhood: Milliken, Agincourt North, Steeles East, L'Amoreaux East, Venues: 3
Neighbourhood: Steeles West, L'Amoreaux West, Venues: 12
Neighbourhood: Upper Rouge, Venues: 0
Neighbourhood: Hillcrest Village, Venues: 5
Neighbourhood

for some neighborhoods, venues could not be obtained, where the value is 0

In [33]:
toronto_venues.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,SEBS Engineering Inc. (Sustainable Energy and ...,43.782371,-79.15682,Construction & Landscaping
3,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
4,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [34]:
toronto_venues[toronto_venues['Neighbourhood'] == 'Alderwood, Long Branch']

Unnamed: 0,PostalCode,Borough,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2071,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Il Paesano Pizzeria & Restaurant,43.60128,-79.545028,Pizza Place
2072,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Timothy's Pub,43.600165,-79.544699,Pub
2073,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Toronto Gymnastics International,43.599832,-79.542924,Gym
2074,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Tim Hortons,43.602396,-79.545048,Coffee Shop
2075,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Pizza Pizza,43.60534,-79.547252,Pizza Place
2076,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Subway,43.599152,-79.544395,Sandwich Place
2077,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Rexall,43.601951,-79.545694,Pharmacy


convert each category into a column using pandas.get_dummies, then include the neighborhoods in the first column of this new dataframe.

In [35]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2136, 274)


Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [36]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped[1:30]

(96, 274)


Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0


let's write a function to sort the venues in descending order

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Clothing Store,Lounge,Breakfast Spot,Skating Rink,Latin American Restaurant,Electronics Store,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,"Alderwood, Long Branch",Pizza Place,Sandwich Place,Coffee Shop,Pub,Pharmacy,Gym,Greek Restaurant,Discount Store,Department Store,Dessert Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pharmacy,Ice Cream Shop,Bridal Shop,Shopping Mall,Sandwich Place,Diner,Middle Eastern Restaurant,Restaurant
3,Bayview Village,Japanese Restaurant,Café,Chinese Restaurant,Bank,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Greek Restaurant,Thai Restaurant,Locksmith,Liquor Store,Comfort Food Restaurant,Juice Bar,Butcher


### 4. Cluster Neighborhoods

To start creating the clusters, we are going to remove the first column of neighborhoods.

In [51]:
toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', 1)
toronto_grouped_clustering.head()

Unnamed: 0,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Run k-means to cluster the neighborhood into 5 clusters.

In [52]:
from sklearn.cluster import KMeans
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(toronto_grouped_clustering)
labels = k_means.labels_

print(labels)

[0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0
 0 0 2 0 0 0 0 0 0 3 0 0 0 3 0 0 0 3 0 0 0 2 0 0 3 0 0 0 3 0 2 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 4 0 3 0 0 0 0 0 3 1]


In [53]:
kmeans_1 = KMeans(n_clusters=num_clusters, random_state=0)
kmeans_1.fit(toronto_grouped_clustering)
labels = kmeans_1.labels_

print(labels)

[1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 1 4 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 3 1 0 1 1 1 1 1 0 2]


Add the group to the 10 most common places dataframe

In [54]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', k_means.labels_)
neighborhoods_venues_sorted.head()

ValueError: cannot insert Cluster Labels, already exists

We are going to mix the dataframe with the initial dataframe, to include the information of the neighborhoods, in addition to the group to which it belongs and the 10 most popular vanues.

In [55]:
toronto_merged = neighborhoods_merged
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
print(toronto_merged.shape)
toronto_merged.head()

(103, 16)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2.0,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,4.0,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,4.0,Breakfast Spot,Restaurant,Electronics Store,Medical Center,Rental Car Location,Intersection,Mexican Restaurant,Bank,Yoga Studio,Doner Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Coffee Shop,Mexican Restaurant,Korean Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,4.0,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant


the column of the cluster, we convert it into intero to be able to plot on the map

In [56]:
toronto_merged = toronto_merged.fillna(0)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,4,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,4,Breakfast Spot,Restaurant,Electronics Store,Medical Center,Rental Car Location,Intersection,Mexican Restaurant,Bank,Yoga Studio,Doner Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Coffee Shop,Mexican Restaurant,Korean Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,4,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant


In [57]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'],
                                  toronto_merged['Longitude'],
                                  toronto_merged['Neighbourhood'],
                                  toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 5. Examine Clusters

let's examine cluster 1

In [58]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,
                   toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",0,Playground,Park,Bakery,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
16,Scarborough,Upper Rouge,0,0,0,0,0,0,0,0,0,0,0
21,North York,"Willowdale, Newtonbrook",0,0,0,0,0,0,0,0,0,0,0
23,North York,York Mills West,0,Park,Convenience Store,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
25,North York,Parkwoods,0,Park,Food & Drink Shop,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Electronics Store
40,East York,"East Toronto, Broadview North (Old East York)",0,Intersection,Park,Convenience Store,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
44,Central Toronto,Lawrence Park,0,Park,Swim School,Bus Line,Yoga Studio,Drugstore,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
50,Downtown Toronto,Rosedale,0,Park,Playground,Trail,Yoga Studio,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
64,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",0,Trail,Park,Sushi Restaurant,Jewelry Store,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
72,North York,Glencairn,0,Pizza Place,Park,Japanese Restaurant,Pub,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dim Sum Restaurant


let's examine cluster 2

In [59]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,
                   toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",1,Print Shop,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,Dim Sum Restaurant


let's examine cluster 3

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,
                   toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,"Malvern, Rouge",2,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service


let's examine cluster 4

In [61]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,
                   toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,North York,"York Mills, Silver Hills",3,Martial Arts School,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store


let's examine cluster 5

In [62]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,
                   toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,"Rouge Hill, Port Union, Highland Creek",4,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,Scarborough,"Guildwood, Morningside, West Hill",4,Breakfast Spot,Restaurant,Electronics Store,Medical Center,Rental Car Location,Intersection,Mexican Restaurant,Bank,Yoga Studio,Doner Restaurant
3,Scarborough,Woburn,4,Coffee Shop,Mexican Restaurant,Korean Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
4,Scarborough,Cedarbrae,4,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant
5,Scarborough,Scarborough Village,4,Playground,Smoke Shop,Jewelry Store,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,North York,Humber Summit,4,Pizza Place,Furniture / Home Store,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Diner
97,North York,"Humberlea, Emery",4,Baseball Field,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Field
99,Etobicoke,Westmount,4,Pizza Place,Coffee Shop,Discount Store,Sandwich Place,Chinese Restaurant,Intersection,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Dim Sum Restaurant
101,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",4,Pizza Place,Grocery Store,Fried Chicken Joint,Sandwich Place,Pharmacy,Liquor Store,Beer Store,Fast Food Restaurant,Gluten-free Restaurant,Department Store
