# Segmenting and Clustering Neighborhoods in Toronto

This notebook aims to explore and cluster the neighborhoods in Toronto.

## PART 1

In Part 1, code will be built to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The table of postal codes on that page is scraped and transformed into a pandas dataframe.    *The version of the webpage as of 6th April 2020 was used for this notebook.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import pandas as pd  # library for data analysis
!pip install bs4
from bs4 import BeautifulSoup # this package will be used for parsing the html content
import requests # import the "requests" library to fetch the page content
# Beautiful Soup supports a number of third-party Python parsers. One is the lxml parser, which is recommended for speed.
# The comment is kept off the pip line because trailing text there is passed to pip as extra requirements.
!pip install lxml
import lxml # import library

Now we can define the webpage we wish to scrape and get the data from the page

In [2]:
data_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" #this is the webpage we will scrape for the data
source = requests.get(data_url).text #get request to fetch raw HTML data
#you can also do in one line: source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
soup = BeautifulSoup(source, 'lxml') # pass the page source to the bs4 constructor and parse it with the lxml parser

In [4]:
print(soup.title.text) #print webpage title. The ".text" inclusion gets the title without html tags. 

List of postal codes of Canada: M - Wikipedia


We then search for the table on the webpage where we wish to scrape the data from

In [5]:
#find the table on the webpage. Table class can be found by inspecting html elements on the webpage
pcode_table = soup.find('table', attrs={'class':'wikitable'})  

First we want to find the table headers and add them to a Dataframe

In [6]:
pcode_table_data = pcode_table.tbody.find_all("tr")  # find all 'tr' (table row) elements of pcode_table
column_headers = [] # create an empty list for the column headers, then fill it with text from the webpage
for th in pcode_table_data[0].find_all("th"): # the table headings are 'th' elements
    column_headers.append(th.text.replace('\n', ' ').strip()) # replace newline characters with spaces, then strip leading and trailing whitespace
print(column_headers)

['Postal code', 'Borough', 'Neighborhood']


In [50]:
#the following also works the same way:
#table=soup.find('table', attrs={'class':'wikitable'}) 
#headers = []
#for th in table.find_all("th"):
#    headers.append(th.text.replace('\n', ' ').strip())
#print(headers)


Now we create a dataframe which contains the headers

In [7]:
df = pd.DataFrame(columns = column_headers) #create dataframe 'df' which will contain the headers
df

Unnamed: 0,Postal code,Borough,Neighborhood


Then we get the row data and fill the rest of the Dataframe with this info

In [8]:
#get table row data
data = []
for row in pcode_table_data:
    td=[] #create empty list 'td'
    for t in row.find_all('td'):   #find all elements with attribute tag 'td'
        td.append(t.text.strip())  #add into td list and strip whitespace etc. 
    data.append(td)    #add td list into data list

In [9]:
#add row data to df
df = pd.DataFrame(data, columns = column_headers)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


We only want to process the cells that have an assigned borough, so we ignore rows whose borough is 'Not assigned'.

In [10]:
df1 = df[df.Borough != 'Not assigned'] # create new dataframe 'df1' that does not contain Boroughs with 'Not assigned' values
df1.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,,,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
6,M6A,North York,Lawrence Manor / Lawrence Heights


Inspection of the Dataframe shows that there is an empty row in the first row which contains 'None' in all cells of the row. We must remove this row. 

In [11]:
df2 = df1[~df1['Borough'].isnull()]  # create a new Dataframe 'df2' which has no 'bad' rows. 

df2.reset_index(drop=True, inplace=True) #reset the table row index
df2.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
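
The all-empty first row comes from the header row of the HTML table: it holds only `th` cells, so `row.find_all('td')` returns an empty list for it. A minimal sketch of dropping such rows before the DataFrame is built, using a hand-written stand-in for the scraped `data` list (the sample rows are copied from the outputs above; `rows` is a name introduced here for illustration):

```python
import pandas as pd

column_headers = ['Postal code', 'Borough', 'Neighborhood']

# Stand-in for the scraped list: the header row yields an empty list
data = [
    [],
    ['M1A', 'Not assigned', ''],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]

# Keep only the rows that actually contain cells
rows = [row for row in data if row]

df = pd.DataFrame(rows, columns=column_headers)
print(df.shape)
```

Filtering at this stage would make the later null check unnecessary, at the cost of hiding the fact that the header row was ever part of the loop.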


Now let's replace the '/' separators in the 'Neighborhood' column with commas.

In [12]:
df2 = df2.copy() # make df2 an explicit copy so the assignment below does not raise SettingWithCopyWarning
df2['Neighborhood'] = df2['Neighborhood'].str.replace('/', ',') # replaces forward slashes with commas


In [13]:
df2.shape #use .shape to view the dimensions of the table

(103, 3)
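
Assigning to a column of a filtered DataFrame is the pattern that pandas can flag with `SettingWithCopyWarning`. As a rough sketch, the same cleaning steps can be written as one chain that never writes into a filtered frame. The three-row `raw` frame and the name `clean` are made up for illustration, and `regex=False` is an addition here to treat `/` as a literal string:

```python
import pandas as pd

raw = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['', 'Parkwoods', 'Regent Park / Harbourfront'],
})

clean = (
    raw.loc[raw['Borough'] != 'Not assigned']        # drop unassigned boroughs
       .assign(Neighborhood=lambda d: d['Neighborhood'].str.replace('/', ',', regex=False))
       .reset_index(drop=True)                       # renumber the remaining rows
)
print(clean)
```

`assign` returns a new DataFrame, so `raw` is left untouched.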

## PART 2

In Part 2 we obtain the latitude and longitude of each Toronto postal code so that we can begin to explore and cluster the neighborhoods.

We now have a dataframe holding the postal code, borough name and neighborhood name of each area. To utilise Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

In [14]:
#add Geo-spatial data from provided link
dfgeo= pd.read_csv("http://cocl.us/Geospatial_data") #read in csv data from url into a pandas dataframe
dfgeo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We can then merge the new location data with the original dataframe from Part 1

In [15]:
dfgeo2 = dfgeo.rename(columns={'Postal Code':'Postalcode'}) #create a new df with renamed postalcode column for the geo data
df3 = df2.rename(columns={'Postal code':'Postalcode'}) #do the same for df2

In [16]:
merged_data=pd.merge(dfgeo2, df3, on='Postalcode') #now we can merge the 2 dataframes on 'Postalcode'
merged_data.head()

Unnamed: 0,Postalcode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern , Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill , Port Union , Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae


In [17]:
geo_data=merged_data[['Postalcode','Borough','Neighborhood','Latitude','Longitude']] #rearrange columns and store as new df

In [18]:
geo_data.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill , Port Union , Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
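
`pd.merge` performs an inner join by default, so a postal code present in only one of the two tables would be dropped without notice. A possible way to check for that, sketched on toy data: the coordinates are taken from the output above, while the code `M9Z`, the borough `Hypothetical` and the names `boroughs`, `coords`, `checked` and `missing` are invented for this example.

```python
import pandas as pd

boroughs = pd.DataFrame({
    'Postalcode': ['M1B', 'M1C', 'M9Z'],   # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coords = pd.DataFrame({
    'Postalcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every borough row; indicator=True adds a '_merge' column,
# and validate raises an error if a postal code is duplicated on either side
checked = boroughs.merge(coords, on='Postalcode', how='left',
                         validate='one_to_one', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code found its coordinates.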


## PART 3

### In Part 3 we will start to explore the Toronto area using clustering and the Foursquare API data. 

First we will install all the relevant python packages

In [19]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

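try:
    from pandas import json_normalize # pandas 1.0 and later expose json_normalize at the top level
except ImportError:
    from pandas.io.json import json_normalize # older pandas versions; used to clean the json and structure it into a pandas dataframe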






### We can now create a new dataframe which contains only Toronto data

In [20]:
#only include rows that have 'Toronto' somewhere in the 'Borough' column
dftoronto = geo_data[geo_data['Borough'].str.contains('Toronto',regex=False)] 
dftoronto

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West , Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar , The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park , Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Summerhill West , Rathnelly , South Hill , For...",43.686412,-79.400049


### We can now use the Toronto data to create a map using the folium library. The map shows a marker for each neighborhood in the data.

In [21]:
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

for lat,lng,borough,neighborhood in zip(dftoronto['Latitude'],dftoronto['Longitude'],dftoronto['Borough'],dftoronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

### Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [22]:
import os

# Read the credentials from environment variables instead of hard-coding and printing them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumed convention; set them before starting the notebook.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


Let's explore the first neighborhood in our dataframe

In [23]:
dftoronto.reset_index(drop=True, inplace=True) #reset the table row index

In [24]:
dftoronto.loc[0, 'Neighborhood'] # get the name of the first neighborhood. 'The Beaches' is returned.

'The Beaches'

Get The Beaches latitude and longitude values.

In [25]:
neighborhood_latitude = dftoronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dftoronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dftoronto.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


### Now, let's get the top 100 venues that are in The Beaches within a radius of 500 meters

In [26]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

In [27]:
#create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
# the URL embeds CLIENT_ID and CLIENT_SECRET, so it is not displayed here
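
The query string can also be assembled from a dictionary, which keeps each parameter visible and avoids hand-formatting the `&` separators. A minimal sketch using the standard library, with the same parameter names as the URL above; the credential values are placeholders and `params` is a name introduced here:

```python
from urllib.parse import urlencode

# Placeholder credentials; real values should not be committed to a notebook
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '43.67635739999999,-79.2930312',   # latitude,longitude of The Beaches
    'radius': 500,
    'limit': 100,
}

# safe=',' leaves the comma in 'll' unescaped
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(url)
```

`requests.get` also accepts such a dictionary through its `params` argument, which would make the manual concatenation unnecessary.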

In [28]:
#send the get request and examine results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e990e1b2115360020842459'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4bd461bc77b29c74a07d9282-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/hikingtrail_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d159941735',
         'name': 'Trail',
         'pluralName': 'Trails',
         'primary': True,
         'shortName': 'Trail'}],
       'id': '4bd461bc77b29c74a07d9282',
       'location': {'address': 'Glen Manor',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'Queen St.',
        'distance': 89,
        'formattedAddress': ['Glen Manor (Queen St.)', 'Toronto ON', 'Canada'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682

In [29]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the flattened dataframe uses the dotted column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Now we are ready to clean the json and structure it into a pandas dataframe.

In [31]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


In [32]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.
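
To make the flattening step easier to follow, here is a small sketch on a hand-written payload shaped like two entries of `results['response']['groups'][0]['items']`. The payload is trimmed to the fields used above and is an assumption about the response shape, not the full response; `pd.json_normalize` is the top-level name available in pandas 1.0 and later.

```python
import pandas as pd

# Trimmed, hand-written stand-in for two items of the explore response
items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'categories': [{'name': 'Trail'}],
               'location': {'lat': 43.676821, 'lng': -79.293942}}},
    {'venue': {'name': 'Grover Pub and Grub',
               'categories': [{'name': 'Pub'}],
               'location': {'lat': 43.679181, 'lng': -79.297215}}},
]

# Nested dictionaries become dotted column names; lists are kept as they are
flat = pd.json_normalize(items)
print(sorted(flat.columns))
```

The `venue.categories` column still holds a list per row, which is why a helper such as `get_category_type` is needed to pull out the category name.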


### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [33]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
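
The expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A more defensive way to build one row could look like the sketch below. `venue_row` is a hypothetical helper, not part of the notebook, and the two items are hand-written (the second one is invented to show the empty case):

```python
def venue_row(name, lat, lng, item):
    """Build one output row from a single item of the explore response."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # Fall back to None when the venue has no category
    category = categories[0]['name'] if categories else None
    return (name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)

# Hand-written items; the second one has no category
items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'categories': [{'name': 'Trail'}],
               'location': {'lat': 43.676821, 'lng': -79.293942}}},
    {'venue': {'name': 'Uncategorised Place',
               'categories': [],
               'location': {'lat': 43.68, 'lng': -79.29}}},
]

rows = [venue_row('The Beaches', 43.676357, -79.293031, item) for item in items]
print(rows)
```

Rows with a `None` category can then be dropped or kept explicitly, rather than stopping the whole loop.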

In [34]:
#run the above function on each neighborhood and create a new dataframe called Toronto_venues
Toronto_venues = getNearbyVenues(names=dftoronto['Neighborhood'],
                                   latitudes=dftoronto['Latitude'],
                                   longitudes=dftoronto['Longitude']
                                  )

The Beaches
The Danforth West , Riverdale
India Bazaar , The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park , Summerhill East
Summerhill West , Rathnelly , South Hill , Forest Hill SE , Deer Park
Rosedale
St. James Town , Cabbagetown
Church and Wellesley
Regent Park , Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond , Adelaide , King
Harbourfront East , Union Station , Toronto Islands
Toronto Dominion Centre , Design Exchange
Commerce Court , Victoria Hotel
Roselawn
Forest Hill North & West
The Annex , North Midtown , Yorkville
University of Toronto , Harbord
Kensington Market , Chinatown , Grange Park
CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport
Stn A PO Boxes
First Canadian Place , Underground city
Christie
Dufferin , Dovercourt Village
Little Portugal , Trinity
Brockton , Parkdale Village , Exhibition Place
High Park , 

In [35]:
#let's check the size of the resulting dataframe
print(Toronto_venues.shape)
Toronto_venues.head()

(1635, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West , Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


### Let's analyse the venues returned for Toronto by neighborhood and venue type

In [36]:
#Let's check how many venues were returned for each neighborhood
Toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,56,56,56,56,56,56
"Brockton , Parkdale Village , Exhibition Place",23,23,23,23,23,23
Business reply mail Processing CentrE,16,16,16,16,16,16
"CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport",14,14,14,14,14,14
Central Bay Street,64,64,64,64,64,64
Christie,18,18,18,18,18,18
Church and Wellesley,77,77,77,77,77,77
"Commerce Court , Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,8,8,8,8,8,8


In [37]:
#Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 231 unique categories.


### Analyze Each Neighborhood looking at the most common venue types

In [38]:
# one hot encoding
toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = Toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
toronto_onehot.shape

(1635, 231)
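
The head of `toronto_onehot` above starts with `Yoga Studio` rather than `Neighborhood`, and the frame has 231 columns although 231 unique categories plus one neighborhood column would give 232. A plausible explanation is that Foursquare uses `Neighborhood` as a venue category too (Upper Beaches is labelled that way earlier in the notebook), so `get_dummies` already produces a column of that name and the assignment overwrites it instead of appending a new last column. The sketch below reproduces the effect on a three-row toy frame and shows one possible workaround; the label `Neighborhood (category)` and the variable `collision` are assumptions made here.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Studio District'],
    'Venue Category': ['Trail', 'Neighborhood', 'Café'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# A category shares the name of the key column, so a plain assignment would overwrite it
collision = 'Neighborhood' in onehot.columns

# Rename the clashing category, then insert the key explicitly as the first column
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(list(onehot.columns))
```

Using `insert(0, ...)` also removes the need to reorder columns afterwards.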

In [40]:
#Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0
1,"Brockton , Parkdale Village , Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower , King and Spadina , Railway Lands , ...",0.0,0.071429,0.071429,0.071429,0.071429,0.142857,0.071429,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.025974,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Commerce Court , Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
toronto_grouped.shape

(39, 231)
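
Because each row of the one-hot table marks exactly one category, the mean per neighborhood is the share of that neighborhood's venues falling in each category. For instance, the value 0.017857 for Berczy Park equals 1/56, one venue out of the 56 returned for it, and 0.015625 for Central Bay Street equals 1/64. A small sketch on made-up data (the neighborhoods `A` and `B` and the name `shares` are illustrative):

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Pub', 'Park', 'Park'],
})

# One indicator column per category, cast to float so the mean is numeric in every pandas version
onehot = pd.get_dummies(venues['Venue Category']).astype(float)

# Mean of the indicators per neighborhood = share of venues per category
shares = onehot.groupby(venues['Neighborhood']).mean()
print(shares)
```

Each row of `shares` sums to 1, which is a quick consistency check on the grouped table.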

In [42]:
#Let's print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.05
1  Seafood Restaurant  0.04
2         Cheese Shop  0.04
3          Restaurant  0.04
4      Farmers Market  0.04


----Brockton , Parkdale Village , Exhibition Place----
            venue  freq
0            Café  0.13
1  Breakfast Spot  0.09
2       Nightclub  0.09
3     Coffee Shop  0.09
4             Gym  0.04


----Business reply mail Processing CentrE----
              venue  freq
0       Yoga Studio  0.06
1     Garden Center  0.06
2        Comic Shop  0.06
3              Park  0.06
4  Recording Studio  0.06


----CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport----
                venue  freq
0     Airport Service  0.14
1            Boutique  0.07
2  Airport Food Court  0.07
3        Airport Gate  0.07
4      Airport Lounge  0.07


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.19
1       Sandwich Place  0.06


In [43]:
#let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [44]:
import numpy as np

In [45]:
#Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Beer Bar,Café,Cheese Shop,Cocktail Bar,Italian Restaurant,Farmers Market,Seafood Restaurant,Restaurant
1,"Brockton , Parkdale Village , Exhibition Place",Café,Breakfast Spot,Nightclub,Coffee Shop,Grocery Store,Intersection,Restaurant,Bar,Stadium,Bakery
2,Business reply mail Processing CentrE,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop
3,"CN Tower , King and Spadina , Railway Lands , ...",Airport Service,Harbor / Marina,Sculpture Garden,Bar,Coffee Shop,Boat or Ferry,Boutique,Rental Car Location,Airport Terminal,Airport Lounge
4,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Burger Joint,Bubble Tea Shop,Ice Cream Shop,Salad Place,Japanese Restaurant,Sushi Restaurant
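
`return_most_common_venues` always returns `num_top_venues` names, so for a neighborhood with fewer distinct categories than that, the tail of the list consists of categories whose frequency is zero. One possible variant that leaves those slots empty is sketched below; `top_categories` is a hypothetical helper and the sample row is invented.

```python
import pandas as pd

def top_categories(row, n):
    """Return up to n category names with non-zero frequency, padded with None."""
    freqs = pd.to_numeric(row.iloc[1:])                    # skip the leading 'Neighborhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False)  # drop categories that never occur
    names = list(freqs.index[:n])
    return names + [None] * (n - len(names))

row = pd.Series({'Neighborhood': 'Toy Hood', 'Trail': 0.5, 'Pub': 0.3,
                 'Café': 0.2, 'Yoga Studio': 0.0})
top = top_categories(row, 4)
print(top)
```

Whether empty slots or zero-frequency names are preferable depends on how the table is read later; this is a design choice rather than a correction.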


### Cluster Neighborhoods

In [46]:
#Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # pass axis by keyword; the positional form is not accepted by newer pandas versions

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
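
The first ten labels are all 0, which hints that one cluster may hold most neighborhoods, though ten labels alone cannot confirm it. Counting the members of each cluster is a quick check. The labels below are made up for illustration and are not the notebook's result; in the notebook the array would be `kmeans.labels_`.

```python
import numpy as np

# Made-up labels standing in for kmeans.labels_ (39 neighborhoods, 5 clusters)
labels = np.array([0] * 33 + [1, 2, 2, 3, 4, 4])

# Number of neighborhoods assigned to each cluster, indexed by cluster label
sizes = np.bincount(labels, minlength=5)
print(sizes)
```

A very uneven split would suggest trying a different number of clusters or a different feature representation before interpreting the map.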

In [48]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = dftoronto

# join neighborhoods_venues_sorted onto dftoronto so that each neighborhood keeps its latitude/longitude and gains its cluster label and top venues
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Health Food Store,Pub,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
1,M4K,East Toronto,"The Danforth West , Riverdale",43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Restaurant,Bookstore,Ice Cream Shop,Furniture / Home Store,Fruit & Vegetable Store,Pub,Pizza Place
2,M4L,East Toronto,"India Bazaar , The Beaches West",43.668999,-79.315572,0,Sandwich Place,Fast Food Restaurant,Pet Store,Food & Drink Shop,Park,Movie Theater,Pub,Restaurant,Burrito Place,Liquor Store
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Gastropub,Brewery,Bakery,American Restaurant,Yoga Studio,Comfort Food Restaurant,Seafood Restaurant,Sandwich Place
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Swim School,Bus Line,Women's Store,Diner,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
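
`join` keeps every row of `dftoronto`, so a neighborhood for which Foursquare returned no venues would be missing from `neighborhoods_venues_sorted` and receive `NaN` in `Cluster Labels`, turning the column into floats that cannot be used to index a colour list. That does not appear to happen in the run shown here, where the labels print as integers. As a precaution, a guard along these lines could be applied before mapping (toy frames with invented names):

```python
import pandas as pd

hoods = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [0, 2]})

merged = hoods.join(labels.set_index('Neighborhood'), on='Neighborhood')

# 'B' has no label: drop it, then restore the integer dtype
merged = (merged.dropna(subset=['Cluster Labels'])
                .astype({'Cluster Labels': int}))
print(merged)
```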


### Finally, let's visualise the results

In [52]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1], # cluster 0 wraps around to the last color in the list
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters