<h2>Segmenting and Clustering Neighborhoods in Toronto Project</h2>

Step 1 is obtaining postal codes from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This project will use the BeautifulSoup package to scrape the site and convert the data to a dataframe.

In [1]:
# !pip install beautifulsoup4
# !pip install html5lib

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

Creating a BeautifulSoup object from a URL requires the requests package to fetch the HTML document first.
The response text is then parsed into a BeautifulSoup object.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
file=requests.get(url)
text=file.text
# name the parser explicitly; html5lib is the one installed in the first cell
soup=bs(text, 'html5lib')

We can use BeautifulSoup's find_all to collect every 'table' element and save the result into table. We can then pass table to read_html from Pandas, which returns a list of dataframes, one per table. In our case the first table is the one we want, so we save it as our "neigh" dataframe.

In [4]:
table = soup.find_all('table')
df = pd.read_html(str(table))
neigh=df[0]

Next, remove the postal codes whose borough is 'Not assigned'.

In [5]:
neigh=neigh[neigh.Borough !='Not assigned']

Then combine rows that share a postal code, joining their neighbourhood names with commas.

In [6]:
neigh=neigh.groupby(['Postcode','Borough'], as_index=False).agg({'Neighbourhood':lambda x: ', '.join(x)})

Finally, give any unnamed neighbourhood the name of its borough.

In [7]:
neigh.loc[(neigh.Neighbourhood=='Not assigned'),'Neighbourhood']=neigh.loc[(neigh.Neighbourhood=='Not assigned'),'Borough']
neigh.shape

(103, 3)
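The three cleaning steps above can be tried on a small table before running them on the scraped data. This is a minimal sketch: the rows in `raw` are made up for illustration and only mimic the column names of the Wikipedia table.

```python
import pandas as pd

# Made-up rows that mimic the shape of the scraped table (illustration only)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                      'Regent Park', 'Not assigned'],
})

# 1. drop the rows that have no borough
clean = raw[raw.Borough != 'Not assigned']

# 2. one row per postcode, with the neighbourhood names joined by commas
clean = clean.groupby(['Postcode', 'Borough'], as_index=False).agg(
    {'Neighbourhood': lambda x: ', '.join(x)})

# 3. the borough name stands in for a missing neighbourhood name
unnamed = clean.Neighbourhood == 'Not assigned'
clean.loc[unnamed, 'Neighbourhood'] = clean.loc[unnamed, 'Borough']

print(clean)
```

Five input rows become three: M1A is dropped, the two M5A rows are merged, and M7A takes its borough name.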

In [8]:
#! pip install geocoder

In [9]:
import geocoder

Creating a new data frame and adding columns for latitude and longitude

In [None]:
neigh_latlng=neigh.copy()  # copy, so that neigh itself is left unchanged
neigh_latlng['Latitude']=""
neigh_latlng['Longitude']=""

I couldn't get the Google geocoder service to work reliably, so I used ArcGIS instead. It is a bit slow, but works fine.

In [None]:
for label,row in neigh_latlng.iterrows():
    # one request per postcode; reuse the result for both columns
    latlng = geocoder.arcgis(row['Postcode']).latlng
    neigh_latlng.loc[label,'Latitude'] = latlng[0]
    neigh_latlng.loc[label,'Longitude'] = latlng[1]
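Geocoding services occasionally return nothing for a request, in which case indexing the result fails. One possible guard is a small retry wrapper. The sketch below is an assumption, not part of the original notebook: `latlng_with_retry` and `fake_lookup` are hypothetical names, and the lookup is passed in as a function so the example runs offline with made-up coordinates.

```python
import time

def latlng_with_retry(postcode, lookup, attempts=3, pause=0.0):
    """Call lookup(postcode) until it yields a (lat, lng) pair or attempts run out."""
    for _ in range(attempts):
        latlng = lookup(postcode)
        if latlng:  # None or an empty list means the service gave no answer
            return latlng[0], latlng[1]
        time.sleep(pause)  # wait before asking again
    return None, None

# Stand-in for the geocoding service: fails once, then answers.
# The coordinates are made up for illustration.
calls = {'n': 0}

def fake_lookup(postcode):
    calls['n'] += 1
    return None if calls['n'] == 1 else [43.65, -79.38]

lat, lng = latlng_with_retry('M5A', fake_lookup)
print(lat, lng, 'after', calls['n'], 'calls')
```

In the notebook, the lookup could be `lambda pc: geocoder.arcgis(pc).latlng`, with a non-zero `pause` to avoid hammering the service.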

### For the neighborhood clustering, we will follow the same analysis methodology as the New York Lab.

#### I continued to use the geocoder library to get the latitude and longitude values of Toronto

In [None]:
address = 'Toronto, ON'

location = geocoder.arcgis(address)
latitude = location.latlng[0]
longitude = location.latlng[1]
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
#! pip install folium
import folium

Nearly all of the following code is taken from the New York lab and modified to suit this exercise.

As before - let's take a look at all of the neighborhoods

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neigh_latlng['Latitude'], neigh_latlng['Longitude'], neigh_latlng['Borough'], neigh_latlng['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [None]:
CLIENT_ID = 'B252YBG5JPJO0YHRIPXJTFR1YFSLEUGABHOG5IZDB04KEGGT' # your Foursquare ID
CLIENT_SECRET = '2IYT5VMBXIDT2HDKXSORJILR0JMKUTSIVLEJI30HEUGHS0WU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

This function collects up to 100 venues within 500 meters of each postcode center.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
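The flattening step inside the function can be checked without calling the API. The sketch below is a hedged illustration: `venues_from_payload` is a hypothetical helper, and `sample` is a made-up payload that contains only the keys the function above reads, not the full response schema. It also guards against a venue that has no category, which would otherwise raise an IndexError.

```python
def venues_from_payload(name, lat, lng, payload):
    """Flatten one explore-style response into one tuple per venue."""
    items = payload['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category at all
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up payload with only the keys used above (illustration only)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.651, 'lng': -79.381},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.652, 'lng': -79.382},
               'categories': []}},
]}]}}

rows = venues_from_payload('Sample Neighbourhood', 43.65, -79.38, sample)
for row in rows:
    print(row)
```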

This calls the above function and saves the data.

In [None]:
toronto_venues = getNearbyVenues(names=neigh_latlng['Neighbourhood'],
                                   latitudes=neigh_latlng['Latitude'],
                                   longitudes=neigh_latlng['Longitude']
                                  )

This shows us how many different categories exist in our data set.

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

The next three steps pre-process the data and prepare it for the k-means clustering algorithm. They convert the 2358 venues into a single row for each postcode, giving the relative frequency of each venue category.

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
toronto_onehot.shape

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

In [None]:
toronto_grouped.shape
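To see why averaging the one-hot columns yields relative frequencies, here is a minimal sketch on made-up data. The neighbourhood names and categories are invented for illustration, and `dtype=float` is passed to `get_dummies` only so the indicator columns are plain numbers in every pandas version.

```python
import pandas as pd

# Made-up venues: neighbourhood A has three cafes and one park
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Cafe', 'Cafe', 'Cafe', 'Park', 'Bakery'],
})

# one indicator column per category, one row per venue
onehot = pd.get_dummies(venues[['Venue Category']],
                        prefix="", prefix_sep="", dtype=float)
onehot['Neighborhood'] = venues['Neighborhood']

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighbourhoods with many venues and neighbourhoods with few are put on the same scale.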

To simplify the clustering further, the 10 most common venues in each postcode are identified.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
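The ranking function can be checked on a single hand-made row. This sketch repeats the function so it runs on its own; the frequencies in `row` are invented for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighbourhood name), then rank the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up frequency row in the same layout as toronto_grouped
row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.1, 'Cafe': 0.6, 'Park': 0.3})

top = return_most_common_venues(row, 2)
print(list(top))
```

Categories with a frequency of 0 are still ranked, so a postcode with fewer than 10 distinct categories fills its remaining columns with categories it does not actually contain.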

We can now run our clustering algorithm, k-means.

In [None]:
from sklearn.cluster import KMeans

In [None]:
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

First, we must select the optimum k, or number of clusters. I elected to use the silhouette score to find the best k.

In [None]:
from sklearn.metrics import silhouette_score
import seaborn as sns
sil = []
kmax = 15

# dissimilarity would not be defined for a single cluster, thus, minimum number of clusters should be 2
for k in range(2, kmax+1):
  kmeans_k_det = KMeans(n_clusters = k, random_state=0).fit(toronto_grouped_clustering)
  labels = kmeans_k_det.labels_
  sil.append(silhouette_score(toronto_grouped_clustering, labels, metric = 'euclidean'))
x = list(range(2, kmax+1))  # the k values that were actually evaluated
sns.lineplot(x=x, y=sil)
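The selection of k by silhouette score can also be done without reading it off a plot. The sketch below is an illustration on synthetic data, not the Toronto data: three well-separated groups of points are generated with NumPy, and each score is stored under its k so the best value can be looked up directly. The names `points`, `scores`, and `best_k` are introduced here for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of 30 points each
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# keep each score next to the k that produced it
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Storing the scores in a dictionary keyed by k avoids any mismatch between the list of scores and the axis values used for plotting.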

We can see that the best value of k is 4, as it produces the highest silhouette score.

In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neigh_latlng

# join the cluster labels and top venues onto the postcode table, which already holds latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged # check the last columns!

As it turns out, not all of the postcodes returned venue data. For this exercise, dropping those rows is the simplest way to deal with the missing data.

In [None]:
toronto_merged=toronto_merged.dropna()
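Before dropping rows, it can help to see which postcodes are affected. The sketch below reproduces the join on made-up tables; `postcodes`, `labels`, and `missing` are names introduced for the example.

```python
import pandas as pd

# Made-up tables: neighbourhood B returned no venues, so it has no cluster label
postcodes = pd.DataFrame({'Postcode': ['M1A', 'M2B', 'M3C'],
                          'Neighbourhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'],
                       'Cluster Labels': [0, 1]})

# left join: every postcode is kept, unmatched ones get NaN
merged = postcodes.join(labels.set_index('Neighborhood'), on='Neighbourhood')

# list the rows that would be lost, then drop them
missing = merged[merged['Cluster Labels'].isna()]
print(missing['Neighbourhood'].tolist())

merged = merged.dropna()
print(merged)
```

Because of the NaN values introduced by the join, the cluster labels are stored as floats even after the rows are dropped, which is why they are cast with `int` when they are used as indices later.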

In [None]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

We can now take a look at how the clusters appear on a map.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
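The color lookup used for the markers can be inspected on its own. This is a minimal sketch that builds one hex color per cluster directly from the colormap; `palette` is a name introduced for the example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# one evenly spaced color per cluster, converted to hex strings for folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# the map cell indexes with cluster-1, so label 0 wraps around to the last color
for cluster in range(kclusters):
    print(cluster, palette[cluster - 1])
```

The wrap-around for label 0 relies on Python's negative indexing. Every cluster still gets its own color, but indexing with `cluster` directly would be easier to follow.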

Finally - we can take a closer look at what each postcode had in common to drive the clustering.

For example - label '0' appears to contain mainly parks, fields, and farms.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]