# Segmenting and Clustering Neighborhoods in Toronto

A Jupyter Notebook for the Applied Data Science Capstone, as a part of "IBM Data Science" course on Coursera.

## 1. Scraping Data of Postal Codes and Neighborhoods

In this section we will scrape the Wikipedia page ([List of postal codes of Canada](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) to extrac neighborhoods in the city of Toronto.In this list corresponds to the postal codes with the first letter M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario.

Then we will convert data to a _pandas_ dataframe by wrangling and cleaning the data.

First of all, we will import necessary libraries.

#### Import necessary libraries

In [130]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

# Visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import seaborn as sns
import folium # map rendering library

# Libraries for scraping and communicate with websites
import requests
from bs4 import BeautifulSoup

# File handeling
import json # library to handle JSON files
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# import k-means from clustering stage
from sklearn.cluster import KMeans

Then we'll use wikipedia page to extrace data of postal codes, boroughs and neighbourhoods. For this purpose we would first extract the page content and make a _soup_ of different elements. Then based on the type of our element, here _table_ and its _class_, here _wikitable sortable_, we would extract the table contents.

In [48]:
# scrap data from wikipedia page
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_wiki = requests.get(url_wiki).text
soup = BeautifulSoup(page_wiki, 'lxml')

# get right table to scrap
table = soup.find('table',{'class':'wikitable sortable'})

The content is still not clean and we have our data between lots of `<th> ... <\th>`, `<tr> ... <\tr>`, and `<td> ... <\td>`. We would use `.find` and `.find_all` methods to find these parts and select only the text part. 

In [75]:
# extract header of the table
header = [th.text.rstrip() for th in table.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']

In [76]:
# extract cells of the table
# consider an empty list for each column of the table
postal_code = []
borough = []
neighbourhood = []

for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells)==3:    # only extract table body not the heading
        postal_code.append(cells[0].find(text=True).rstrip())
        borough.append(cells[1].find(text=True).rstrip())
        neighbourhood.append(cells[2].find(text=True).rstrip())

In the next step we will make our datafram from the extracted `postal_code`, `borough`, `neighbourhood`, which were as `list`. We will use `header` for the column's names.

In [77]:
# make a dataframe of the table
toronto_data = pd.DataFrame(zip(postal_code, borough, neighbourhood), columns = header)
toronto_data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In the next step we will remove rows which their `Borough`s value is `Not assigned`, and we'll make sure to reset the indecies.

In [78]:
# drop rows which Borough not assigned
toronto_data.drop(toronto_data.index[toronto_data['Borough']=='Not assigned'], inplace=True)
# reset index
toronto_data.reset_index(drop=True, inplace=True)
toronto_data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Check if there is any `Not assigned` value after cleaning the dataframe or not.

In [79]:
'Not assigned' in toronto_data.Neighbourhood.values

False

We are done for this step as there is no `Not assigned` value for the `Neighbourhood` in the `toronto_data`.

In [80]:
toronto_data.shape

(103, 3)

## 2. Add Location Information (Latitude, Longitude)

In this part we will load a `.csv` file contains postal code of Toronto and corresponding geospatial coordinate.

In [81]:
# load geospatial coordinate file
toronto_geo_coor = pd.read_csv('Torento_Geospatial_Coordinates.csv')
toronto_geo_coor

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Now we can merge `toronto_geo_coor` and `toronto_data` dataframes together based on `Postal Code` column and then sort it.

In [82]:
toronto_data = pd.merge(toronto_data, toronto_geo_coor, on='Postal Code')
toronto_data.sort_values(by=['Postal Code'], inplace=True)
toronto_data.reset_index(drop=True, inplace=True)
toronto_data

    Postal Code      Borough  \
0           M1B  Scarborough   
1           M1C  Scarborough   
2           M1E  Scarborough   
3           M1G  Scarborough   
4           M1H  Scarborough   
..          ...          ...   
98          M9N         York   
99          M9P    Etobicoke   
100         M9R    Etobicoke   
101         M9V    Etobicoke   
102         M9W    Etobicoke   

                                         Neighbourhood   Latitude  Longitude  
0                                       Malvern, Rouge  43.806686 -79.194353  
1               Rouge Hill, Port Union, Highland Creek  43.784535 -79.160497  
2                    Guildwood, Morningside, West Hill  43.763573 -79.188711  
3                                               Woburn  43.770992 -79.216917  
4                                            Cedarbrae  43.773136 -79.239476  
..                                                 ...        ...        ...  
98                                              Weston  43.706

## 3. Explore and cluster the neighborhoods in Toronto

In this section we will use Foursquare API to explore neighborhoods in Toronto and find out the most common venue categories in each neighborhood. We'll use different features to group the neighborhoods into clusters by use of the k-means clustering algorithm.

First of all we want to create a map of Toronto with neighborhoods superimposed on top.

In [86]:
# create map of Toronto using latitude and longitude values
toronto_latitude = 43.741
toronto_longitude = -79.373

map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough']):
    label = borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now we can visualize the map of Downtown and its corresponding neighborhoods.

#### Utilizing the Foursquare API to explore the neighborhoods and segment them

In [103]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'L222QY5EXUBMGH3IZS0FVPRPWPBYWULC4ATCGIXS1WORB35V' # your Foursquare ID
CLIENT_SECRET = '4DVPOSTMFDK5X4Q2XIA50JADZ0PJRDP5UONOHAQVTREEMCMT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

### Explore neighborhoods in Toronto

We know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [104]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [107]:
toronto_venues = getNearbyVenues(toronto_data['Neighbourhood'],
                                   toronto_data['Latitude'],
                                   toronto_data['Longitude'])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)
The Danforth West, 

Let's check the size of the resulting dataframe:

In [108]:
print(toronto_venues.shape)
toronto_venues.head()

(2129, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Malvern, Rouge",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,SEBS Engineering Inc. (Sustainable Energy and ...,43.782371,-79.15682,Construction & Landscaping
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank


Let's check the first 20 top venue categories and their numbers:

In [112]:
toronto_venues['Venue Category'].value_counts()[0:20]

Coffee Shop             183
Café                     93
Restaurant               63
Pizza Place              55
Park                     54
Italian Restaurant       44
Hotel                    42
Bakery                   42
Sandwich Place           41
Japanese Restaurant      40
Clothing Store           33
Sushi Restaurant         32
Gym                      31
Fast Food Restaurant     29
Grocery Store            29
Bar                      28
American Restaurant      26
Bank                     25
Seafood Restaurant       23
Thai Restaurant          22
Name: Venue Category, dtype: int64

Number of unique categories:

In [129]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 271 uniques categories.


Now let's groupby by neighborhoods:

In [131]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",6,6,6,6,6,6
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
...,...,...,...,...,...,...
"Willowdale, Willowdale East",35,35,35,35,35,35
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,3,3,3,3,3,3
Woodbine Heights,7,7,7,7,7,7
