# Segmenting and Clustering Neighborhoods in Toronto  
_by Francisco Peretti_

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, however, Toronto's neighborhood data is not readily available on the internet. Part of what makes data science interesting is that each project is challenging in its own way, so you need to be agile and sharpen your ability to learn new libraries and tools quickly as each project demands.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

---
# 1. Toronto Data Scraping

In [1]:
!pip install selenium
# !pip install beautifulsoup4

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [47]:
#Import urllib which simplifies HTTP and socket management
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
import pandas as pd
import numpy as np

import requests
import folium 

from geopy.geocoders import Nominatim


In [3]:
# Web scraping

# Create an SSL context that skips hostname checking (certificates are still verified)
ctx = ssl.create_default_context()
ctx.check_hostname = False

# Toronto neighbourhoods Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto'

# Retrieve HTML and parse
html = urllib.request.urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify()[0:500])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of neighbourhoods in Toronto - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f32d


In [5]:
# Find the tables within the HTML and choose the second one, which holds the neighbourhood data
table = soup.find_all('table', attrs={'class': 'wikitable sortable'})[1]

# Select the table rows
table_rows = table.find_all('tr')

rows = []
head_flag = True

for tr in table_rows:

  # The first row holds the column names
  if head_flag:
    header = [th.text.split('\n')[0] for th in tr.find_all('th')]
    head_flag = False
    continue

  # Every other row becomes a list of its cell texts
  row = [td.text.split('\n')[0] for td in tr.find_all('td')]

  # Append the row list to build a matrix
  rows.append(row)

# Transform the table matrix into a DataFrame
toronto_df = pd.DataFrame(rows)

# Drop unused columns, set the column names and drop duplicates
toronto_df.drop([4, 5], axis=1, inplace=True)
# toronto_df.columns = np.array(header).reshape(1,-1)[0][0:-1]

toronto_df.columns = ['CDN number', 'Neighborhood', 'Borough', 'Neigh_covered']
toronto_df.drop_duplicates(inplace=True)

print(toronto_df.shape)
toronto_df.head()

(140, 4)


| | CDN number | Neighborhood | Borough | Neigh_covered |
|---|---|---|---|---|
| 0 | 129 | Agincourt North | Scarborough | Agincourt and Brimwood |
| 1 | 128 | Agincourt South-Malvern West | Scarborough | Agincourt and Malvern |
| 2 | 20 | Alderwood | Etobicoke | Alderwood |
| 3 | 95 | Annex | Old City of Toronto | The Annex and Seaton Village |
| 4 | 42 | Banbury-Don Mills | North York | Don Mills |


---
# 2. Data wrangling  
Find the coordinates of each neighborhood using geopy.

In [18]:
geolocator = Nominatim(user_agent="foursquare_agent")

ll = []
for n, covered in zip(toronto_df['Neighborhood'], toronto_df['Neigh_covered']):
  location = geolocator.geocode(n + ', Toronto, Canada')

  # If geopy returns no result, use the first part of the neighborhood name
  if location is None:
    location = geolocator.geocode(n.split('-')[0] + ', Toronto, Canada')

  # If there is still no result, use the second part of the name, if it exists
  try:
    if location is None:
      location = geolocator.geocode(n.split('-')[1] + ', Toronto, Canada')
  except Exception:
    pass

  # Try the first word of the first part; if that lookup raises, try its second word
  try:
    if location is None:
      location = geolocator.geocode(n.split('-')[0].split(' ')[0] + ', Toronto, Canada')
  except Exception:
    if location is None:
      location = geolocator.geocode(n.split('-')[0].split(' ')[1] + ', Toronto, Canada')

  # If nothing has been found so far, try this row's Neigh_covered value
  if location is None:
    location = geolocator.geocode(covered + ', Toronto, Canada')

  # If nothing has been found at all, append [NaN, NaN]
  if location is None:
    ll.append([np.nan, np.nan])
    continue

  # Append latitude and longitude to the matrix
  ll.append([location.latitude, location.longitude])
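
The fallback chain in the cell above can also be written as an ordered list of candidate queries that is tried until one succeeds. The following is a minimal sketch of that idea, not the notebook's actual code: `candidate_queries` and `first_location` are hypothetical helper names, and `geocode` is any callable that returns `None` when nothing is found (for example `geolocator.geocode`).

```python
def candidate_queries(name, covered=None, suffix=', Toronto, Canada'):
    """Return geocoding queries for one neighborhood, most specific first."""
    parts = name.split('-')
    candidates = [name, parts[0]]
    if len(parts) > 1:
        candidates.append(parts[1])
    # First word of the first part, e.g. 'Banbury' for 'Banbury-Don Mills'
    candidates.append(parts[0].split(' ')[0])
    if covered:
        candidates.append(covered)

    # Drop empty and repeated candidates while keeping their order
    seen, queries = set(), []
    for candidate in candidates:
        candidate = candidate.strip()
        if candidate and candidate not in seen:
            seen.add(candidate)
            queries.append(candidate + suffix)
    return queries


def first_location(queries, geocode):
    """Return the first non-None result of geocode(query), or None."""
    for query in queries:
        location = geocode(query)
        if location is not None:
            return location
    return None


# Stub geocoder for illustration only: it knows a single query
known = {'Don Mills, Toronto, Canada': (43.7, -79.3)}
queries = candidate_queries('Banbury-Don Mills', 'Don Mills')
print(queries)
print(first_location(queries, known.get))
```

With a real geocoder the call would look like `first_location(candidate_queries(n, covered), geolocator.geocode)`. Because the public Nominatim service expects requests to be throttled, wrapping the geocoder with geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) may be worth considering for a loop over all 140 neighborhoods.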


In [26]:
# Append Latitudes and Longitudes of each neighbourhood to Toronto's DataFrame
toronto_df['Latitude'] = np.array(ll)[:,0]
toronto_df['Longitude'] = np.array(ll)[:,1]

toronto_df = toronto_df[['Borough', 'Neighborhood', 'Latitude', 'Longitude']]

print('Count of missing coordinates: ', toronto_df['Latitude'].isnull().sum())
toronto_df.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Scarborough | Agincourt North | 43.808038 | -79.266439 |
| 1 | Scarborough | Agincourt South-Malvern West | 43.785353 | -79.278549 |
| 2 | Etobicoke | Alderwood | 43.601717 | -79.545232 |
| 3 | Old City of Toronto | Annex | 43.670338 | -79.407117 |
| 4 | North York | Banbury-Don Mills | 43.734804 | -79.357243 |
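
Neighborhoods whose lookup failed carry `NaN` coordinates, and such rows are likely to cause errors or meaningless requests once they reach the map and the venue queries. A minimal sketch of separating them out first, using a toy frame in place of `toronto_df` (the 'Hypothetical Place' row is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for toronto_df; the second row mimics a failed geocoding lookup
df = pd.DataFrame({
    'Neighborhood': ['Annex', 'Hypothetical Place', 'Alderwood'],
    'Latitude': [43.670338, np.nan, 43.601717],
    'Longitude': [-79.407117, np.nan, -79.545232],
})

# Keep a record of what could not be located, then drop those rows
missing = df[df['Latitude'].isnull()]
located = df.dropna(subset=['Latitude', 'Longitude']).reset_index(drop=True)

print(missing['Neighborhood'].tolist())
print(located.shape)
```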


---
# 3. Data exploration

In [41]:
# Create a map of Toronto using latitude and longitude values

city = 'Sunnybrook, Toronto, Canada'

# Geocode the map centre once and reuse the result
center = geolocator.geocode(city)
latitude = center.latitude
longitude = center.longitude

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11, width='70', height='70' )

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

In [44]:
#@title Foursquare credentials
# Hidden cell
CLIENT_ID = 'YLROMIFNBX1AXMIYRRL04CMNBU5L2KSYL5E13X0T1LHY014G' # your Foursquare ID
CLIENT_SECRET = 'TO40LHBWD5WWOJCOVFK53RDUQOCX0RDRON1JH43JYE0JIT0O' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
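
Since the notebook is submitted through a public repository, one option is to keep the credentials out of the cell and read them from the environment instead. This is a sketch under assumptions: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not part of any Foursquare tooling.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping such as os.environ."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        # Fail early with a readable message instead of sending empty credentials
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


# Illustration with a plain dict; in the notebook the default os.environ would be used
client_id, client_secret = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'example-id',
    'FOURSQUARE_CLIENT_SECRET': 'example-secret',
})
print(client_id)
```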


In [45]:
# From Foursquare Lab
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
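
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. Below is a small sketch of a more defensive version of that flattening step. `extract_venues` is a hypothetical helper, and the sample `items` list is invented data that only mimics the response shape used in the function above.

```python
def extract_venues(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat rows."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Invented sample mimicking results = response['groups'][0]['items']
items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.67, 'lng': -79.40},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled B', 'location': {'lat': 43.68, 'lng': -79.41},
               'categories': []}},
]
rows = extract_venues('Annex', 43.670338, -79.407117, items)
print(rows)
```

Rows with a `None` category can then be dropped or labelled before the one-hot encoding step.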

In [None]:
toronto_venues = getNearbyVenues(names = toronto_df['Neighborhood'],
                                   latitudes = toronto_df['Latitude'],
                                   longitudes = toronto_df['Longitude']
                                  )

In [51]:
print(toronto_venues.shape)
toronto_venues.head()

(3142, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Agincourt North | 43.808038 | -79.266439 | Menchie's | 43.808338 | -79.268288 | Frozen Yogurt Shop |
| 1 | Agincourt North | 43.808038 | -79.266439 | Saravanaa Bhavan South Indian Restaurant | 43.810117 | -79.269275 | Indian Restaurant |
| 2 | Agincourt North | 43.808038 | -79.266439 | Booster Juice | 43.809915 | -79.269382 | Juice Bar |
| 3 | Agincourt North | 43.808038 | -79.266439 | Shoppers Drug Mart | 43.808894 | -79.269854 | Pharmacy |
| 4 | Agincourt North | 43.808038 | -79.266439 | Dollarama | 43.808894 | -79.269854 | Discount Store |


In [54]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.groupby('Neighborhood').count()

There are 286 unique categories.


| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Agincourt North | 25 | 25 | 25 | 25 | 25 | 25 |
| Agincourt South-Malvern West | 13 | 13 | 13 | 13 | 13 | 13 |
| Alderwood | 7 | 7 | 7 | 7 | 7 | 7 |
| Annex | 45 | 45 | 45 | 45 | 45 | 45 |
| Banbury-Don Mills | 4 | 4 | 4 | 4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| Wychwood | 53 | 53 | 53 | 53 | 53 | 53 |
| Yonge and Eglinton | 68 | 68 | 68 | 68 | 68 | 68 |
| Yonge-St.Clair | 57 | 57 | 57 | 57 | 57 | 57 |
| York University Heights | 19 | 19 | 19 | 19 | 19 | 19 |


In [61]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,...,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,South American Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Supermarket,Sushi Restaurant,Syrian Restaurant,Taco Place,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Tibetan Restaurant,Tour Provider,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [64]:
toronto_onehot.columns.to_list().index('Neighborhood')

195
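
An index of 195 rather than 0 suggests that the venue categories themselves include one named `Neighborhood`. In that case the assignment in the one-hot cell overwrites that dummy column in place instead of appending a new one, and the reordering step moves an unrelated column to the front. A minimal sketch of one way around the collision, on toy data, is to insert the neighborhood under a different label; `Neighborhood Name` is a hypothetical choice, and later cells would need to use the same label.

```python
import pandas as pd

# Toy stand-in for toronto_venues with a category literally called 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['Annex', 'Annex', 'Alderwood'],
    'Venue Category': ['Park', 'Neighborhood', 'Bakery'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# Insert the neighborhood at position 0 under a label that cannot clash
# with a venue category, leaving every dummy column intact
onehot.insert(0, 'Neighborhood Name', venues['Neighborhood'])

print(onehot.columns.tolist())
```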

---
# 4. Clustering