# Segmenting and Clustering Neighborhoods in Toronto

## Part 1 : Getting Data

**We will get data from wikipedia website using Web Scraping**

In [1]:
#import libraries
from bs4 import BeautifulSoup #library for web scraping
import requests
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table = soup.table

In [3]:
#Find the respective data from tag and store in list
postcode = []
borough = []
neighborhood = []

for cell in table.find_all('td'):
    postcode.append(cell.b.text)
    borough.append(cell.span.text.split('(')[0])
    try:
        neigh = cell.span.text
        #split neighborhood by brackets and repace / with ,
        neighbor = ''.join(neigh.split('(')[1].split(')'))
        neighborhood.append(neighbor.replace('/',','))  
    except Exception as e:
        neighborhood.append('Not assigned')

**Converting data into Dataframe**

In [4]:
df = pd.DataFrame(list(zip(postcode,borough,neighborhood)),columns=['PostalCode','Borough','Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"


In [5]:
df.shape

(180, 3)

## Data Cleaning

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [6]:
#first check
len(df[df['Borough'] == 'Not assigned'])

77

In [7]:
#drop rows
df = df.drop(df[df['Borough'] == 'Not assigned'].index)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park / Ontario Provincial Government,Not assigned


In [8]:
#check again
len(df[df['Borough'] == 'Not assigned'])

0

In [9]:
df.shape

(103, 3)

2. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [10]:
#first check
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M7A,Queen's Park / Ontario Provincial Government,Not assigned


In [11]:
#replace
df.loc[df['Neighborhood'] == 'Not Assigned', 'Neighborhood'] = df['Borough']

In [12]:
#check again
df[df['Neighborhood'] == 'Not Assigned'][0:5]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [13]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park / Ontario Provincial Government,Not assigned


In [14]:
#number of rows and column of dataframe
df.shape

(103, 3)

#   

## Part 2 : Extracting the Latitude and Longitude

In [15]:
#Second dataframe which has latitude and longitude 
path = 'http://cocl.us/Geospatial_data'
df_lat_lng = pd.read_csv(path)
df_lat_lng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
#Changing the column name Postal code to Postcode to merge the two data frames together
df_lat_lng.columns = ['PostalCode','Latitude','Longitude']
df_lat_lng.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
df_lat_lng.shape

(103, 3)

In [18]:
#Merging of both dataframe
df_can = pd.merge(df,df_lat_lng, on='PostalCode')
df_can.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park / Ontario Provincial Government,Not assigned,43.662301,-79.389494


In [19]:
df.shape

(103, 3)

#     

## Part 3 : Explore and cluster the neighborhoods in Toronto

In [22]:
import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


**Find how many boroughs and neighborhoods**

In [40]:
print("The dataframe has {} Borough and {} Neighborhood".format(len(df_can['Borough'].unique()),len(df_can['Neighborhood'])))

The dataframe has 15 Borough and 103 Neighborhood


**Use geopy library to get the latitude and longitude values of Toronto**

In [60]:
address = "Toronto, Ontario"
geolocator = Nominatim(user_agent='to_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The coordinates of Toronto are {} , {}".format(latitude,longitude))

The coordinates of Toronto are 43.653963 , -79.387207


**Create a map of Toronto with neighborhoods superimposed on top**

In [87]:
toronto_map = folium.Map(
    location = [lat,lng],
    zoom_start = 11
)

for lat,lng,borough,neighborhood in zip(df_can['Latitude'],df_can['Longitude'],df_can['Borough'],df_can['Neighborhood']):
    label = '{}, {}'.format(neighborhood,borough)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius = 10,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        popup = label
    ).add_to(toronto_map)
    
toronto_map

**Utilizing the Foursquare API to explore the neighborhoods and segment them.**

In [103]:
LIMIT = 100
CLIENT_ID = 'KAJZOUZF20UXDL2H3MOEBLTIHGQ2HK1PDJ14Z1R0QCAJCYVA' # your Foursquare ID
CLIENT_SECRET = 'UOJCAODP5HOBJS1W451QQ3JQ3HSAWPQ2TL2WWMCJXLFAYD33' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: KAJZOUZF20UXDL2H3MOEBLTIHGQ2HK1PDJ14Z1R0QCAJCYVA
CLIENT_SECRET:UOJCAODP5HOBJS1W451QQ3JQ3HSAWPQ2TL2WWMCJXLFAYD33


**Let's explore the neighborhood in our dataframe**

In [104]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [105]:
toronto_venues = getNearbyVenues(names=df_can['Neighborhood'],
                                   latitudes=df_can['Latitude'],
                                   longitudes=df_can['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park , Harbourfront
Lawrence Manor , Lawrence Heights
Not assigned
Islington Avenue
Malvern , Rouge
Don MillsNorth
Parkview Hill , Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park , Princess Gardens , Martin Grove , Islington , Cloverdale
Rouge Hill , Port Union , Highland Creek
Don MillsSouth
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate , Bloordale Gardens , Old Burnhamthorpe , Markland Wood
Guildwood , Morningside , West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor , Wilson Heights , Downsview North
Thorncliffe Park
Richmond , Adelaide , King
Dufferin , Dovercourt Village
Scarborough Village
Fairview , Henry Farm , Oriole
Northwood Park , York University
The Danforth  East
Harbourfront East , Union Station , Toronto Islands
Little Portugal , Trinity
Kennedy Park , Ionview , East Birchmount Park
Bayview Village
DownsviewEast  
Th

**Our new dataset**

In [137]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [138]:
toronto_venues.shape

(2247, 7)

**Let's check how many venues were returned for each neighborhood**

In [157]:
neigh_venue = toronto_venues.groupby('Neighborhood')['Venue'].count()
neigh_venue.to_frame()[0:5]

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Agincourt,5
"Alderwood , Long Branch",9
"Bathurst Manor , Wilson Heights , Downsview North",21
Bayview Village,4
"Bedford Park , Lawrence Manor East",26


**Let's find out how many unique categories can be curated from all the returned venues**

In [164]:
print('There are {} unique venue categories'.format(len(toronto_venues['Venue Category'].unique())))

There are 274 unique venue categories
