# Segmenting and Clustering Neighborhoods in Toronto

---

## Summary

In this notebook, I have segmented and clustered the neighborhoods in Toronto, Canada, based on their nearby venues. In order to do this, I have first scraped the postal codes, boroughs, and neighborhoods of Toronto from a table from Wikipedia and formatted these into a DataFrame. After that, I have downloaded the geographical coordinates from Coursera, and merged these with the existing DataFrame. These coordinates made me able to perform API calls to Foursquare, which results in a DataFrame of each venue in each neighborhood, including their geographical coordinates and category.

Using these venues and **K-Means Clustering**, the neighborhoods were splitted into **four clusters**: 

* The **first cluster** contains 67 neighborhoods which, compared to the other clusters, maily include a park, coffee shops, and pizza places. 
* The **second cluster** contains 12 neighborhoods, and all these neighborhoods include a coffee shop and a cafe. Also, this cluster contain seven vegetarian restaurants and six art galleries, which are not in the top 10 of the other clusters. 
* The **third cluster** only contains one neighborhood. However, this neighborhood includes a wide variation of venues, from a bakery and a grocery store to a spa and a pub.
* The **last cluster** contains 18 neighborhoods, which all contain a coffee shop. However, these neighborhoods have a wide variation of places to have a dinner. For example, this cluster include in total 15 restaurants, 12 sandwich places, 10 sushi restaurants, and 8 Italian restaurant.

---

## Getting PostalCode, Borough, and Neighborhoods

First, the required data is scraped from Wikipedia using requests and BeautifulSoup. After that, a for loop is used to append each row data into a list. This list is used to remove the "not assigned" neighborhoods and format the values. Finally, this list is converted to a two-dimensional NumPy, in order to split the columns from the values, and to create a DataFrame.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd

# 1. Getting the webpage content of Wikipedia
wikipedia = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").content

# 2. Converting the HTML content to a BeautifulSoup object
wikipedia_soup = BeautifulSoup(wikipedia, "html.parser")

# 3. Assigning the table content from Wikipedia to variable
neighborhoods = wikipedia_soup.table

# 4. Converting the tags to lists
neighborhoods_lists = []
for tag in neighborhoods.find_all("tr"):
    temp_list = []
    temp_split = tag.text.split("\n")
    for i in range(len(temp_split)):
        if i in [1, 3, 5]:
            temp_list.append(temp_split[i])
    neighborhoods_lists.append(temp_list)

# 5. Removing the "not assigned" boroughs
neighborhoods_rows = []
for lst in neighborhoods_lists:
    if "Not assigned" in lst[1]:
        continue
    else:
        lst[2] = re.sub(" /", ",", lst[2])
        neighborhoods_rows.append(lst)

# 6. Converting the lists into a NumPy array and separates the columns from the values
neighborhoods_rows = np.array(neighborhoods_rows)
neighborhoods_columns = neighborhoods_rows[0,:]
neighborhoods_values = neighborhoods_rows[1:,:]

# 7. Creating a DataFrame of the neighborhoods in Toronto
neighborhoods_df = pd.DataFrame(
    neighborhoods_values,
    columns = neighborhoods_columns
).rename(columns={"Postal code": "PostalCode"}).sort_values(by=["PostalCode"])

# 8. Printing the number of rows in the DataFrame
print("This DataFrame contains {} rows.".format(neighborhoods_df.shape[0]))

neighborhoods_df.head()

This DataFrame contains 103 rows.


Unnamed: 0,PostalCode,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae


---

## Getting Latitude and Longitude Coordinates

In order to make API calls to Foursquare, the georaphical coordinate of each neighborhood is required. Therefore, the CSV file from Coursera is used and merged with the existing DataFrame. This results in the DataFrame below.

In [2]:
# 1. Loading the coordinates csv into notebook
coordinates_df = pd.read_csv("Files/Geospatial_Coordinates.csv").rename(columns={"Postal Code": "PostalCode"})

# 2. Merging neighborhoods_df with coordinates_df
neighborhoods_df = pd.merge(neighborhoods_df, coordinates_df)

neighborhoods_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


---

## Exploring the Neighborhood (Malvern, Rouge)

In order to explore every neighborhood in Toronto, I have first explored one neighborhood: Malvern, Rouge. In the code below, I have made the first API call with the geographical coordinate of Malvern, Rouge. This results in a JSON file, including the nearby venues in the neighborhood. These data is converted in DataFrame and formatted in such a way that it is readable.

In [3]:
from pandas.io.json import json_normalize

# 1. Defining Foursquare credentials and version
CLIENT_ID = "ITYEMLDDKSZTIRSHWI2SXOPCGFJ3AWQY5JKF3PZVKD5EJLIP"
CLIENT_SECRET = "MFSJZJIIEGJPW2H1TLX1ZEKPACYPWS3KRIOK3HNO3ARJIBQM"
VERSION = "20200504"
radius = 500
limit = 100

# 2. Setting the neighborhood's coordinates
neighborhood_name = neighborhoods_df.loc[0, "Neighborhood"]
neighborhood_latitude = neighborhoods_df.loc[0, "Latitude"]
neighborhood_longitude = neighborhoods_df.loc[0, "Longitude"]

# 3. Creating URL and sends the GET request
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    limit
)
results = requests.get(url).json()

# 4. Getting the items in JSON
venues = results["response"]["groups"][0]["items"]

# 5. Converting JSON into a DataFrame
venues_df = json_normalize(venues)

# 6. Filtering the columns
filtered_columns = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
venues_df = venues_df.loc[:, filtered_columns]

# 7. Returning the value of the name key in venue.categories
venues_df["category"] = venues_df["venue.categories"].apply(lambda x: x[0]["name"])

# 8. Renaming the columns
venues_df.columns = [x.split(".")[-1] for x in venues_df.columns]

# 9. Printing the number of venues returned by Foursquare
print("{} venues were returned by Foursquare.".format(venues_df.shape[0]))

# 10. Adding the neighborhood to each row
venues_df["neighborhood"] = neighborhood_name

# 11. Setting neighborhood as first column
venues_df = venues_df[["neighborhood", "name", "category", "lat", "lng"]]

venues_df

2 venues were returned by Foursquare.


Unnamed: 0,neighborhood,name,category,lat,lng
0,"Malvern, Rouge",Wendy’s,Fast Food Restaurant,43.807448,-79.199056
1,"Malvern, Rouge",Interprovincial Group,Print Shop,43.80563,-79.200378


---

## Exploring All Neighborhoods in Toronto

In the code below, API calls are done for each neighborhood in the first DataFrame. These JSON files are appended to a list, including the neigborhood of each venue. Just like the code above, these JSON files are converted and appended to a single DataFrame and formatted in such a way that the neighborhood, category and geographical coordinates of each venue are shown. 

In [8]:
# 1. Assinging necessary column values to variables
neighborhood_values = neighborhoods_df["Neighborhood"].values
latitude_values = neighborhoods_df["Latitude"].values
longitude_values = neighborhoods_df["Longitude"].values

# 2. Sends the GET results for each neighborhood in the DataFrame and appends its JSON in all_venues
all_results = []
all_venues_neighborhoods = []
for hood, lat, lng in zip(neighborhood_values, latitude_values, longitude_values):
    
    # 2.1. Generates the URL
    temp_url = "https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit{}".format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, limit)
    
    # 2.2. Sends the GET request
    temp_results = requests.get(temp_url).json()["response"]["groups"][0]["items"]
    
    # 2.3. Appends the results to all_venues
    all_results.append(temp_results)
    
    # 2.4. Generates a temporary DataFrame in order to get the number of rows
    temp_df = json_normalize(temp_results)
    
    # 2.5. Appends the neighborhood of each row in the temporary DataFrame
    for i in range(temp_df.shape[0]):
        all_venues_neighborhoods.append(hood)

print("Number of results: {}".format(len(all_results)))
print("Number of venues: {}".format(len(all_venues_neighborhoods)))

# 3. Converting each JSON into a DataFrame and appends it to all_venues_df
all_venues_df = pd.DataFrame()
for result in all_results:
    all_venues_df = all_venues_df.append(json_normalize(result), sort=False)

# 4. Formatting the DataFrame
all_venues_df["Category"] = all_venues_df["venue.categories"].apply(lambda x: x[0]["name"])
all_venues_df["Neighborhood"] = all_venues_neighborhoods
filtered_columns = ["Neighborhood", "venue.name", "Category", "venue.location.lat", "venue.location.lng"]
all_venues_df = all_venues_df.loc[:, filtered_columns].reset_index(drop=True)
all_venues_df.columns = ["Neighborhood", "Venue", "Venue Category", "Venue Latitude", "Venue Longitude"]

all_venues_df.head()

Number of results: 103
Number of venues: 1338


Unnamed: 0,Neighborhood,Venue,Venue Category,Venue Latitude,Venue Longitude
0,"Malvern, Rouge",Wendy’s,Fast Food Restaurant,43.807448,-79.199056
1,"Malvern, Rouge",Interprovincial Group,Print Shop,43.80563,-79.200378
2,"Rouge Hill, Port Union, Highland Creek",Royal Canadian Legion,Bar,43.782533,-79.163085
3,"Rouge Hill, Port Union, Highland Creek",Affordable Toronto Movers,Moving Target,43.787919,-79.162977
4,"Guildwood, Morningside, West Hill",RBC Royal Bank,Bank,43.76679,-79.191151


In [9]:
# 5. Printing the number of unique venue categories in Toronto neighborhoods
print("There are {} unique categories".format(len(all_venues_df["Venue Category"].unique())))

# 6. Returning the top 5 neighborhoods with the most venues
all_venues_df.groupby("Neighborhood")["Venue"].count().reset_index().sort_values(by=["Venue"], ascending=False).reset_index(drop=True).head()

There are 235 unique categories


Unnamed: 0,Neighborhood,Venue
0,Willowdale,35
1,Church and Wellesley,30
2,"Garden District, Ryerson",30
3,"Kensington Market, Chinatown, Grange Park",30
4,"First Canadian Place, Underground city",30


----

## Analyzing Each Neighborhood in Toronto

In order to know which venues are the most common in each neighborhood, the categories are converted into dummy variables, and grouped by neigborhood, which results in a count of each category in each neighborhood. After that, these counts are sorted and categorized, so that a top N is created for each neighborhood. These places are used as columns in a pivot table (neighborhood as index, and venue als value), which shows the most common venues for each neighborhood (from left to right).

In [10]:
# 1. Splitting the values in Venue Category into dummy variables
toronto_dummies = pd.get_dummies(
    all_venues_df["Venue Category"],
    prefix = "",
    prefix_sep = ""
)

# 2. Adding neighborhood as first column into dummies DataFrame
toronto_dummies["neighborhood"] = all_venues_df["Neighborhood"]
fixed_columns = [toronto_dummies.columns[-1]] + list(toronto_dummies.columns[:-1])
toronto_dummies = toronto_dummies[fixed_columns]

print(toronto_dummies.shape)
toronto_dummies.head()

(1338, 236)


Unnamed: 0,neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# 3. Grouping rows by each neighborhood
toronto_grouped = toronto_dummies.groupby("neighborhood").sum().reset_index()

# 5. Creating a DataFrame containing the most common venues in each neighborhood
toronto_common_venues = pd.DataFrame()
for hood in toronto_grouped["neighborhood"]:
    temp_df = toronto_grouped[toronto_grouped["neighborhood"] == hood].T.reset_index()
    temp_df.columns = ["Venue", "Count"]
    temp_df = temp_df.loc[1:]
    temp_df = temp_df[temp_df["Count"] > 0]
    temp_df = temp_df.sort_values(by=["Count"], ascending=False).reset_index(drop=True)
    temp_df["Most Common"] = [i+1 for i in range(temp_df.shape[0])]
    temp_df["Neighborhood"] = hood
    temp_df = temp_df[["Neighborhood", "Most Common", "Venue", "Count"]]
    toronto_common_venues = toronto_common_venues.append(temp_df)
toronto_common_venues.reset_index(drop=True)

toronto_common_venues.head()

Unnamed: 0,Neighborhood,Most Common,Venue,Count
0,Agincourt,1,Breakfast Spot,1
1,Agincourt,2,Clothing Store,1
2,Agincourt,3,Latin American Restaurant,1
3,Agincourt,4,Lounge,1
0,"Alderwood, Long Branch",1,Pizza Place,2


In [12]:
# 6. Creating a pivot table where each row shows the top 10 most common venues for each neighborhood
toronto_common_venues_pivot = toronto_common_venues.pivot(
    columns = "Most Common",
    index = "Neighborhood",
    values = "Venue"
).fillna("")

toronto_common_venues_pivot.head()

Most Common,1,2,3,4,5,6,7,8,9,10,...,21,22,23,24,25,26,27,28,29,30
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Agincourt,Breakfast Spot,Clothing Store,Latin American Restaurant,Lounge,,,,,,,...,,,,,,,,,,
"Alderwood, Long Branch",Pizza Place,Athletics & Sports,Coffee Shop,Gym,Pharmacy,Pool,Pub,Sandwich Place,Skating Rink,,...,,,,,,,,,,
"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pharmacy,Sushi Restaurant,Supermarket,Shopping Mall,Sandwich Place,Restaurant,Pizza Place,Middle Eastern Restaurant,...,,,,,,,,,,
Bayview Village,Bank,Café,Chinese Restaurant,Japanese Restaurant,,,,,,,...,,,,,,,,,,
"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Sandwich Place,Restaurant,Juice Bar,Sushi Restaurant,Pub,Pizza Place,Pharmacy,Liquor Store,...,,,,,,,,,,


---

## Clustering the Neighborhoods in Toronto

Because of the dummy variables of the venue categories, I am able to apply the K-Means Clustering model to the dataset. For this dataset, I have used 4 clusters. The clusters are shown in the map below. _Note: Some neighborhoods had "NaN" als value in Cluster Labels because these neighborhoods don't have any nearby venues. These are grouped as cluster 5 in order to convert the values as int, but aren't used in the analysis._

In [25]:
from sklearn.cluster import KMeans

# 1. Setting the number of clusters
clusters = 4

# 2. Dropping the neighborhood column
toronto_grouped_clustering = toronto_grouped.drop("neighborhood", axis=1)

# 3. Running K-Mean Clustering
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(toronto_grouped_clustering)

print(kmeans.labels_)

# 4. Adding clustering labels to toronto_common_venues_pivot
#toronto_common_venues_pivot.insert(0, "Cluster Labels", kmeans.labels_)

# 5. Merging neighborhoods_df with toronto_common_venues_pivot
toronto_common_venues_df = pd.DataFrame(toronto_common_venues_pivot.to_records())
toronto_merged = neighborhoods_df.join(toronto_common_venues_df.set_index("Neighborhood"), on="Neighborhood")
toronto_merged["Cluster Labels"] = toronto_merged["Cluster Labels"].fillna(4).astype("int")

print(toronto_merged.shape[0])
toronto_merged.head()

[1 1 0 1 0 3 1 0 1 1 1 1 1 2 0 0 1 1 3 0 1 1 0 1 1 1 1 1 2 3 1 0 1 1 1 1 0
 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 2 2 3 1 1 1 0 1 1 1 3 0
 1 3 0 1 0 1 0 1 1 2 0 1 1 1 0 1 1 1 1]
103


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1,2,3,4,...,21,22,23,24,25,26,27,28,29,30
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1,Fast Food Restaurant,Print Shop,,,...,,,,,,,,,,
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,1,Bar,Moving Target,,,...,,,,,,,,,,
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1,Bank,Breakfast Spot,Electronics Store,Intersection,...,,,,,,,,,,
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Indian Restaurant,Korean Restaurant,,...,,,,,,,,,,
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Athletics & Sports,Bakery,Bank,Caribbean Restaurant,...,,,,,,,,,,


In [26]:
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

# 6. Finding the coordinates of Toronto, Canada
address = "Toronto, CA"
geolocator = Nominatim(user_agent="toronto_finder")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# 7. Creating a map of Toronto
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# 8. Creating colors for each cluster
x = np.arange(clusters)
y = [i+x+(i*x)**2 for i in range(clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(y)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# 9. Adding markers to the map
latitude_values = toronto_merged["Latitude"].values
longitude_values = toronto_merged["Longitude"].values
neighborhood_values = toronto_merged["Neighborhood"].values
cluster_values = toronto_merged["Cluster Labels"].values

for lat, lng, hood, cluster in zip(latitude_values, longitude_values, neighborhood_values, cluster_values):
    label = folium.Popup("{} Cluster {}".format(hood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7
    ).add_to(map_clusters)

map_clusters

---

## Analyzing Clusters

In order to analyze these four clusters, I have created a function that returns (1) the number of neighborhoods in each cluster, and (2) the top 10 most common venues in each cluster. This data give us an idea of how these clusters differ from each other. 

It turns out that the first cluster contains neighborhoods that include lots of venues where you are able to have a lunch or diner. The second cluster contains neighborhoods that mainly include a park and some coffee shops, cafes and restaurants. The third cluster contains five neighborhoods that have a wide variation of nearby venues, including a yoga studio and a theater. The fourth cluster contains a wide variation of restaurants, including some vegetarian restaurants.

In [32]:
def get_top_10_venues(dataframe, cluster):
    temp_columns = ["Neighborhood", "Cluster Labels"] + [i for i in dataframe.columns[6:]]
    temp_df1 = dataframe[dataframe["Cluster Labels"] == cluster-1][temp_columns]
    temp_array = temp_df1.values[:,2:]
    temp_list = []
    for neighborhood in temp_array:
        for venue in neighborhood:
            if not venue == "":
                temp_list.append(venue)
    temp_df2 = pd.DataFrame({
        "id": [i for i in range(len(temp_list))],
        "venue": temp_list
    })
    temp_df2 = temp_df2.groupby("venue").id.count().reset_index()
    temp_df2 = temp_df2.sort_values(by="id", ascending=False).reset_index(drop=True)
    temp_df2 = temp_df2.rename(columns={
        "id": "Count",
        "venue": "Venue"
    })
    if temp_df1.shape[0] == 1:
        print("This cluster counts 1 neighborhood.")
    else:
        print("This cluster counts {} neighborhoods.".format(temp_df1.shape[0]))
    return temp_df2.head(10)

### Cluster 1

In [33]:
get_top_10_venues(toronto_merged, 1)

This cluster counts 23 neighborhoods.


Unnamed: 0,Venue,Count
0,Coffee Shop,22
1,Café,19
2,Restaurant,17
3,Italian Restaurant,16
4,Sandwich Place,14
5,Pizza Place,12
6,Sushi Restaurant,10
7,Grocery Store,10
8,Park,9
9,Diner,9


### Cluster 2

In [34]:
get_top_10_venues(toronto_merged, 2)

This cluster counts 64 neighborhoods.


Unnamed: 0,Venue,Count
0,Park,22
1,Coffee Shop,16
2,Pizza Place,15
3,Bank,14
4,Sandwich Place,11
5,Fast Food Restaurant,10
6,Grocery Store,9
7,Pharmacy,9
8,Liquor Store,9
9,Intersection,8


### Cluster 3

In [35]:
get_top_10_venues(toronto_merged, 3)

This cluster counts 5 neighborhoods.


Unnamed: 0,Venue,Count
0,Coffee Shop,5
1,Café,4
2,Yoga Studio,3
3,Japanese Restaurant,3
4,Theater,3
5,Bakery,3
6,Sandwich Place,3
7,Restaurant,3
8,Park,3
9,Hotel,2


### Cluster 4

In [36]:
get_top_10_venues(toronto_merged, 4)

This cluster counts 6 neighborhoods.


Unnamed: 0,Venue,Count
0,Café,6
1,Restaurant,6
2,Coffee Shop,6
3,Vegetarian / Vegan Restaurant,5
4,Seafood Restaurant,5
5,Art Gallery,5
6,Japanese Restaurant,4
7,Hotel,4
8,Gastropub,4
9,American Restaurant,4


---