# Segmenting and Clustering Neighborhoods in Toronto

## First get the page from Wikipedia

For this we will use **Beautiful Soup**, as suggested.

### Install Beautiful Soup in case it is not already installed

In [1]:
!pip install beautifulsoup4



### Make some imports

In [2]:
import requests
import numpy as np
import pandas as pd

### We use a variable to hold the URL

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Retrieve the table

The table is retrieved along with the rest of the page, so we will have to clean it properly.

**Beautiful Soup** has a function to _prettify_ the page, which makes the HTML easier to inspect.

In [4]:
page = requests.get(url).text

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page,'lxml')
# print(soup.prettify())

### Locate the table

Looking at the prettified output, we can see which tags hold the table, and we retrieve it.

In [6]:
table = soup.find('table',{'class':'wikitable sortable'})

### Loop over the table

#### Loop over the rows
#### And, inside each row, over the cells

Adding everything to a dataframe

In [7]:
boroughs = []
for row in table.find_all("tr"):
    arrayrow = []
    cells = row.find_all("td")
    for cell in cells:
        celtext = cell.text.replace('\n','')
        arrayrow.append(celtext)
    boroughs.append(arrayrow)

df_boroughs = pd.DataFrame(boroughs)
df_boroughs.columns = ['PostalCode','Borough','Neighborhood']
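
The loop above appends one list per `tr`, including rows that have no `td` cells at all, which is most likely where the empty first row comes from (a header row built from `th` cells). A minimal sketch of the same loop that skips such rows, assuming a small made-up HTML fragment in place of the Wikipedia page and the built-in `html.parser` backend instead of `lxml`:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment that mimics the layout of the Wikipedia table
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) also removes the trailing newline in the last cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # a header row holds only <th> cells, so it yields an empty list
        rows.append(cells)

df_demo = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(df_demo)
```

With this variant there is no empty row to drop afterwards, at the cost of one extra `if` inside the loop.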

### Clean

### First, rows that are not assigned

When both the borough and the neighborhood are _Not assigned_, we drop the row.

### Then the first row, which is empty

### Fix neighborhoods

When only the neighborhood is _Not assigned_, we read the borough and set the neighborhood to the same name.

In [8]:
df_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [9]:
df_boroughs.drop(
    df_boroughs[(df_boroughs.Borough == 'Not assigned') &
                (df_boroughs.Neighborhood == 'Not assigned')].index, inplace=True)

In [10]:
df_boroughs = df_boroughs.iloc[1:]

##### How many neighborhoods are still not assigned?

One

In [11]:
df_boroughs[df_boroughs.Neighborhood == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
9,M7A,Queen's Park,Not assigned


### Assign Borough name to not assigned neighborhoods

This is pretty straightforward: we just need to select the rows whose neighborhood is _Not assigned_ and assign them the name of the borough.

In [12]:
df_boroughs.loc[df_boroughs.Neighborhood == 'Not assigned', 'Neighborhood'] = df_boroughs.Borough
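
The cleaning steps can also be written with explicit boolean masks and without `inplace=True`. The following is only a sketch on a made-up three-row dataframe (`df_demo`, `both_missing` and `needs_name` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up rows covering the three cases seen in the real table
df_demo = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Step 1: drop rows where both columns are "Not assigned"
both_missing = (df_demo["Borough"] == "Not assigned") & (
    df_demo["Neighborhood"] == "Not assigned"
)
df_clean = df_demo[~both_missing].copy()

# Step 2: where only the neighborhood is missing, reuse the borough name
needs_name = df_clean["Neighborhood"] == "Not assigned"
df_clean.loc[needs_name, "Neighborhood"] = df_clean.loc[needs_name, "Borough"]

print(df_clean)
```

Taking a `.copy()` after filtering keeps pandas from warning about assignments on a slice of the original dataframe.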

In [13]:
df_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


### Group neighborhoods of the same borough

This is done by grouping by PostalCode and Borough and applying a join to the neighborhoods separated by commas

In [14]:
df_result = df_boroughs.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()

In [15]:
df_result.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
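
To see what the grouping does on a small scale, here is a sketch with three made-up rows taken from the table above. It uses `.agg` rather than `.apply` and a `", "` separator, both of which are choices of this example and not what the notebook cell does:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M6A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
    }
)

# One row per (PostalCode, Borough); neighborhood names are joined in their original order
grouped = (
    df_demo.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The two rows for M5A collapse into a single row, while M6A is left as it was.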


## Final results

The shape is **(103, 3)**, meaning there are 103 distinct postal codes, each with a proper neighborhood name, grouped by postal code and borough.

In [16]:
df_result.shape

(103, 3)

!pip install geocoder

### Geocoder didn't work

So we just load the coordinates from the CSV file.

In [17]:
lat_lng_coords = pd.read_csv('Geospatial_Coordinates.csv')

In [18]:
lat_lng_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We see that the column _Postal Code_ is spelled differently from _PostalCode_ in our dataframe, so we rename it.

In [19]:
lat_lng_coords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

Now we just merge them

In [20]:
df_result = pd.merge(df_result, lat_lng_coords, on='PostalCode')

And voilà, there we have it.

In [21]:
df_result.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
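
`pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would silently disappear from the result. A hedged sketch of a more defensive merge on made-up data (the postal code `M9Z` and the names `left`, `right` and `unmatched` are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is made up and has no coordinates
        "Borough": ["Scarborough", "Scarborough", "Made-up"],
    }
)
right = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

merged = pd.merge(
    left,
    right,
    on="PostalCode",
    how="left",             # keep every row of the left dataframe
    validate="one_to_one",  # raise if a postal code is duplicated on either side
    indicator=True,         # add a _merge column telling where each row came from
)

# Postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used in the notebook and this left join give the same rows.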


In [22]:
from geopy.geocoders import Nominatim
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [23]:
import folium

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_result['Latitude'], df_result['Longitude'], df_result['Borough'], df_result['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        popup=label,
        color='green',
        fill=True,
        fill_color='#31ffcc',
        fill_opacity=0.6).add_to(map_toronto)
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [24]:
CLIENT_ID = '13UICRQSJWVMKWXWJTT0TRQJMK4K3MF2EROEVPBGRMPBZ4VD' # your Foursquare ID
CLIENT_SECRET = '0J4P5ZL5E4KMFIHJRYWTS0QJAFWNXKMWNFURVKGQIGVNZFMI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 13UICRQSJWVMKWXWJTT0TRQJMK4K3MF2EROEVPBGRMPBZ4VD
CLIENT_SECRET:0J4P5ZL5E4KMFIHJRYWTS0QJAFWNXKMWNFURVKGQIGVNZFMI
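
Credentials typed into a notebook and printed in its output end up in every shared copy of it. One possible alternative, sketched here under the assumption that the values are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (both names, and the helper functions, are made up for this example):

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


def mask(secret, visible=4):
    """Keep the first few characters and hide the rest, for safe printing."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demo with a plain dict standing in for os.environ; the values are placeholders
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id-123",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret-456",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(client_id))
print("CLIENT_SECRET: " + mask(client_secret))
```

In a real session the function would be called without arguments so that it reads `os.environ`.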


In [25]:
df_boroughs = pd.merge(df_boroughs, lat_lng_coords, on='PostalCode')


We create a dataframe with the neighborhoods whose name contains _Toronto_

In [26]:
df_toronto = df_boroughs[df_boroughs['Neighborhood'].str.contains("Toronto")]

In [27]:
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
60,M4J,East York,East Toronto,43.685347,-79.338106
62,M5J,Downtown Toronto,Toronto Islands,43.640816,-79.381752
70,M3K,North York,CFB Toronto,43.737473,-79.464763
75,M5K,Downtown Toronto,Toronto Dominion Centre,43.647177,-79.381576
130,M4R,Central Toronto,North Toronto West,43.715383,-79.405678


### Explore Neighborhoods in Toronto


#### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [28]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    LIMIT = 100
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
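
The function above indexes straight into the JSON, so it would raise if a response came back without `groups`, or if a venue had an empty `categories` list. The sketch below separates the two concerns, building the query parameters and flattening the response, and tolerates those cases. The helper names and the sample payload are made up; only the field names are taken from the function above:

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Query parameters for the explore endpoint, usable as requests.get(url, params=...)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # fall back to a single empty dict when the venue has no category
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name"),
        ))
    return rows


# Made-up payload containing only the fields read above
fake_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Demo Cafe",
                               "location": {"lat": 43.65, "lng": -79.38},
                               "categories": [{"name": "Café"}]}},
                    {"venue": {"name": "Uncategorized Spot",
                               "location": {"lat": 43.66, "lng": -79.39},
                               "categories": []}},
                ]
            }
        ]
    }
}

rows = extract_venues(fake_payload, "Demo Neighborhood", 43.6, -79.4)
params = build_explore_params("id", "secret", "20180605", 43.6, -79.4)
print(rows)
```

A venue without a category ends up with `None` in the last position, which can be filtered out before the one-hot encoding step.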

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [29]:
toronto_venues = getNearbyVenues(
    names=df_toronto['Neighborhood'],
    latitudes=df_toronto['Latitude'],
    longitudes=df_toronto['Longitude']
)

East Toronto
Toronto Islands
CFB Toronto
Toronto Dominion Centre
North Toronto West
University of Toronto
New Toronto


#### Let's find out how many unique categories can be curated from all the returned venues

In [30]:
print('There are {} unique categories.'.format(
    len(toronto_venues['Venue Category'].unique())))

There are 100 unique categories.


### Analyze Each Neighborhood

In [31]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,American Restaurant,Aquarium,Art Gallery,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,...,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [32]:
toronto_g = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_g

Unnamed: 0,Neighborhood,Yoga Studio,Airport,American Restaurant,Aquarium,Art Gallery,Asian Restaurant,Bagel Shop,Bakery,Bank,...,Steakhouse,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Wine Bar
0,CFB Toronto,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,New Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,North Toronto West,0.052632,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Toronto Dominion Centre,0.0,0.0,0.04,0.0,0.01,0.01,0.0,0.02,0.0,...,0.02,0.0,0.0,0.02,0.01,0.01,0.01,0.01,0.0,0.01
5,Toronto Islands,0.0,0.0,0.0,0.05,0.01,0.0,0.0,0.02,0.01,...,0.01,0.01,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.01
6,University of Toronto,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,...,0.0,0.0,0.027778,0.0,0.0,0.027778,0.0,0.0,0.027778,0.0
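
Because every venue row has exactly one 1 in its one-hot columns, the mean per neighborhood is the share of venues in each category, and each row of the grouped table sums to 1. A small sketch on made-up venues (neighborhoods `A` and `B` are invented; `dtype=int` is passed so the columns hold 0/1 regardless of the pandas version):

```python
import pandas as pd

venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "B", "B"],
        "Venue Category": ["Café", "Café", "Park", "Bank", "Park", "Park"],
    }
)

# One column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 values per neighborhood = fraction of venues in that category
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Neighborhood `A` has four venues, two of them cafés, so its `Café` frequency is 0.5.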


#### Let's put each neighborhood along with the top 10 most common venues into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
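
The same idea can be tried on a single made-up row. This sketch uses `Series.nlargest` instead of a full sort, which is a substitution made for this example; `top_categories` is a hypothetical name:

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in one grouped row."""
    # drop the label column and make sure the remaining values are numeric
    freqs = row.drop("Neighborhood").astype(float)
    return freqs.nlargest(n).index.tolist()


row = pd.Series({"Neighborhood": "Demo", "Bank": 0.1, "Café": 0.6, "Park": 0.3})
print(top_categories(row, 2))
```

When many categories share a frequency of 0.0, as in the sparser neighborhoods, the categories listed after the real ones are only tie-breaks and carry no information.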

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [34]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_g['Neighborhood']

for ind in np.arange(toronto_g.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_g.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,CFB Toronto,Snack Place,Airport,Bus Stop,Park,Wine Bar,College Gym,Comfort Food Restaurant,Concert Hall,Convenience Store,Dance Studio
1,East Toronto,Coffee Shop,Park,Convenience Store,Discount Store,College Arts Building,College Gym,Comfort Food Restaurant,Concert Hall,Dance Studio,Deli / Bodega
2,New Toronto,Café,Pharmacy,Sandwich Place,Liquor Store,Restaurant,Fast Food Restaurant,Bakery,Fried Chicken Joint,Gym,Coffee Shop
3,North Toronto West,Coffee Shop,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Rental Car Location,Salon / Barbershop,Sandwich Place,Mexican Restaurant,Dessert Shop,Park
4,Toronto Dominion Centre,Coffee Shop,Café,Hotel,American Restaurant,Italian Restaurant,Gastropub,Restaurant,Deli / Bodega,Seafood Restaurant,Steakhouse
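
The column names above rely on a three-item suffix list plus an exception for everything else, which is fine for 10 columns but would produce names such as "21th" if more were requested. A possible helper, sketched here with the hypothetical name `ordinal`:

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th (and 111th, ...) are irregular
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column names as in the notebook, built without try/except
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```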


## Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [35]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_gc = toronto_g.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_gc)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 3, 4, 0, 0, 0], dtype=int32)
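
With 7 neighborhoods and 5 clusters, four of the clusters above contain a single neighborhood each. To make clear what the algorithm does with the frequency vectors, here is a tiny NumPy sketch of the basic k-means loop (Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation, which among other things uses a smarter initialization by default; the function name and the toy points are made up:

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Tiny k-means for illustration: alternate assignment and update steps."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # start from k distinct rows picked at random (fancy indexing returns a copy)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


# Two obvious groups of made-up 2-D points
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = lloyd_kmeans(toy_points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which points share a label matters.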

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [36]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto = df_toronto

# join the cluster labels and top venues onto df_toronto, which already holds latitude/longitude for each neighborhood
toronto = toronto.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
60,M4J,East York,East Toronto,43.685347,-79.338106,1,Coffee Shop,Park,Convenience Store,Discount Store,College Arts Building,College Gym,Comfort Food Restaurant,Concert Hall,Dance Studio,Deli / Bodega
62,M5J,Downtown Toronto,Toronto Islands,43.640816,-79.381752,0,Coffee Shop,Hotel,Aquarium,Italian Restaurant,Café,Brewery,Pizza Place,Fried Chicken Joint,Scenic Lookout,Restaurant
70,M3K,North York,CFB Toronto,43.737473,-79.464763,2,Snack Place,Airport,Bus Stop,Park,Wine Bar,College Gym,Comfort Food Restaurant,Concert Hall,Convenience Store,Dance Studio
75,M5K,Downtown Toronto,Toronto Dominion Centre,43.647177,-79.381576,0,Coffee Shop,Café,Hotel,American Restaurant,Italian Restaurant,Gastropub,Restaurant,Deli / Bodega,Seafood Restaurant,Steakhouse
130,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,4,Coffee Shop,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Rental Car Location,Salon / Barbershop,Sandwich Place,Mexican Restaurant,Dessert Shop,Park


Finally, let's visualize the resulting clusters

In [37]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood'], toronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters