# Capstone Project Week 3 (Part 1)

# Preparing Dataset of Neighborhoods in Toronto

#### by We You Toh

## Introduction

This three-part assignment guides the reader through the process of using data to discover some interesting things about the neighborhoods in Toronto. In the first part, I will prepare a dataframe that consists of the neighborhoods in Toronto and their respective latitude and longitude values. I will also use the Folium library to visualize the neighborhoods in Toronto.

In the second part, I will be using the Foursquare API to explore and analyze the neighborhoods in Toronto. I will use the **explore** function to get the most common venue categories in each neighborhood.

In the last part, I will use the neighborhood and venue data retrieved from the second part to group the neighborhoods into clusters. I will use the *k*-means clustering algorithm to gain some insights about the neighborhoods in Toronto.  I will once again use the Folium library to visualize the emerging clusters among the neighborhoods in Toronto.

>*__Note:__ This notebook is publicly shared in a GitHub repository. The Folium interactive maps and some Markdown features don't display on GitHub the same way as they do on a local host. To interact with these features, it may be necessary to download the Jupyter notebook and host it locally. If you wish, you may re-run the whole notebook on your own. You may also wish to provide your own Foursquare API credentials.*
>
> *__Tip:__ If you run the notebook on a local host, set the notebook to 'Trusted' to enable JavaScript display. Otherwise, the Folium maps may not display properly.*

<div class="alert alert-block alert-info" style="margin-top: 10px"></div>

## Table of Contents

  
### Part 1
1. [**Scraping Postal Codes of Toronto and Preparing Dataset**](#item1)

### Part 2
2. **Explore Neighborhoods in Toronto**
3. **Analyze Each Neighborhood**

### Part 3
4. **Cluster Neighborhoods** 
5. **Examine Clusters**

<div class="alert alert-block alert-info" style="margin-top: 0px"></div>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
# For web-scraping
from bs4 import BeautifulSoup # library for efficient web data extraction
import requests # library to handle requests

# For data-handling
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# For geographical-related tasks
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')


Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.17.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


<a id='item1'></a>

## 1. Scraping Postal Codes of Toronto and Preparing Dataset

In [2]:
# Retrieving postal codes from the internet
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
page.raise_for_status() # stop early if the page could not be downloaded
soup = BeautifulSoup(page.content, "lxml")

In [3]:
# Creating the dataframe
data = []
table = soup.find('table', attrs={'class':"wikitable sortable"})
rows = table.find_all('tr')

for row in rows:
    cols = [cell.text.strip() for cell in row.find_all('td')]
    if cols: # skip rows without <td> cells, such as the header row
        data.append(cols)

df = pd.DataFrame(data, columns = ['PostalCode','Borough','Neighborhood'])

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
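
The row-by-row extraction above can be tried offline on a small HTML snippet. The sketch below is only an illustration: it swaps BeautifulSoup for Python's built-in `html.parser` so that it runs without extra dependencies, and the `TableRowParser` class and `sample_html` string are hypothetical names and data, not part of the original notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold <th> cells only, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Made-up snippet that mimics the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
sample_rows = parser.rows
print(sample_rows)
```

As in the notebook's loop, the header row is dropped because it contains no `<td>` cells, and trailing newlines inside cells are stripped.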


#### Data Wrangling

In [4]:
# Removing postal codes not assigned to a borough
df = df[df.Borough != 'Not assigned'].copy()

# Replace unassigned neighborhoods with their borough's name.
unassigned = df.Neighborhood == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']


In [5]:
# A quick survey over the data frame
print("Number of postal codes: ", len(df.PostalCode.unique()))
print("Number of boroughs: ", len(df.Borough.unique()))
print("Number of neighborhoods: ", len(df.Neighborhood.unique()))

Number of postal codes:  103
Number of boroughs:  11
Number of neighborhoods:  210
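
To see the two wrangling rules in isolation, here is a minimal sketch on a three-row toy dataframe. The rows are illustrative only and the variable names `toy`, `cleaned`, and `missing` are assumptions made for this example.

```python
import pandas as pd

# Illustrative rows: one without a borough, one complete, one without a neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# Rule 1: drop rows without a borough. The .copy() gives an independent
# dataframe, so the assignment below does not raise a chained-assignment warning.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is missing, fall back to the borough name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

The order matters: filtering on the borough first guarantees that every remaining row has a usable borough name to copy.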


In [6]:
# Group neighborhoods by their postal codes and boroughs
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [7]:
df.shape

(103, 3)
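
The grouping step collapses several rows per postal code into one row with a comma-separated neighborhood list. A small sketch on toy data (the dataframe `toy` and the result name `grouped` are assumptions for this example) shows the effect:

```python
import pandas as pd

# Two neighborhoods share postal code M1B; M1G has a single neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Join the neighborhood names within each (postal code, borough) group.
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```

Within each group the names keep their original row order, which is why the first row reads "Rouge, Malvern" rather than being sorted alphabetically.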

The dataset covers a total of 11 boroughs and 210 neighborhoods in 103 postal codes. In order to segment the neighborhoods and explore them, we need a dataset that contains the 11 boroughs, the neighborhoods that exist in each borough, and the latitude and longitude coordinates of each postal code.

The latitude and longitude coordinates of each postal code are retrieved from the file provided in the Coursera assignment: https://cocl.us/Geospatial_data.

In [8]:
!wget -q -O 'toronto_data.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

Next, let's load the data.

In [9]:
postalcode_latlong = pd.read_csv('toronto_data.csv')

Let's take a quick look at the data.

In [10]:
postalcode_latlong.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


#### Transform the data into a *pandas* dataframe

In [11]:
# Merging the latitude and longitude coordinates into a new dataframe.
neighborhoods = df.join(postalcode_latlong.set_index('Postal Code'), on='PostalCode')

neighborhoods.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
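
A left join silently leaves `NaN` coordinates for any postal code that is missing from the coordinates file, so it can be worth checking for unmatched rows. The sketch below is one possible check on toy data; it uses `DataFrame.merge` instead of `join`, and the postal code `M9Z` is a made-up value chosen to have no match.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical and has no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Align the key name, keep every row of `codes`, and let pandas raise an
# error if a postal code appears more than once on either side.
merged = codes.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)

# Postal codes that received no coordinates from the join.
unmatched = merged.loc[merged["Latitude"].isna(), "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that all 103 postal codes received coordinates.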


Let's make sure that the dataset has all 11 boroughs and 103 postal codes.

In [12]:
print('The dataframe has {} boroughs and {} postal codes.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]))

The dataframe has 11 boroughs and 103 postal codes.


#### Use geopy library to get the latitude and longitude values of Toronto.

In [13]:
address = 'Toronto, Ontario'

# Nominatim expects an identifying user agent; the name below is an arbitrary placeholder
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.
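
Geocoding depends on a network service, and `geocode` returns `None` when it finds no match, so the cell above can fail on a re-run. One way to guard against that is a small wrapper with a fallback value. This is a sketch under stated assumptions: `geocode_with_fallback` is a hypothetical helper, it catches all exceptions for simplicity, and the stub functions stand in for `geolocator.geocode` so that the example runs offline.

```python
from collections import namedtuple


def geocode_with_fallback(geocode, address, fallback):
    """Return (latitude, longitude) for `address`, or `fallback` if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:  # broad on purpose in this sketch; narrow it in real use
        return fallback
    if location is None:  # the service answered but found no match
        return fallback
    return (location.latitude, location.longitude)


# Stand-ins for geolocator.geocode, used only to exercise the helper offline.
Location = namedtuple("Location", ["latitude", "longitude"])


def fake_ok(address):
    return Location(43.653963, -79.387207)


def fake_empty(address):
    return None


def fake_error(address):
    raise TimeoutError("service unavailable")


# Coordinates printed earlier in this notebook, reused as the fallback.
toronto = (43.653963, -79.387207)

print(geocode_with_fallback(fake_ok, "Toronto, Ontario", toronto))
print(geocode_with_fallback(fake_empty, "Nowhere", toronto))
print(geocode_with_fallback(fake_error, "Toronto, Ontario", toronto))
```

In the notebook, the first argument would be `geolocator.geocode`.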


#### Create a map of Toronto with neighborhoods superimposed on top.

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], 
                                           neighborhoods['Longitude'], 
                                           neighborhoods['Borough'], 
                                           neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
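
Instead of a fixed `zoom_start`, the map view can also be derived from the data. The sketch below only computes the bounding box of a few coordinates taken from the table above; the variable names are assumptions, and passing the result to the map is left as a comment so that the example runs without Folium installed.

```python
import pandas as pd

# A few coordinates from the merged dataframe shown earlier.
points = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# South-west and north-east corners of the smallest box containing all points.
south_west = [float(points["Latitude"].min()), float(points["Longitude"].min())]
north_east = [float(points["Latitude"].max()), float(points["Longitude"].max())]
bounds = [south_west, north_east]

print(bounds)

# With Folium available, the map could then be fitted to the markers:
# map_toronto.fit_bounds(bounds)
```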

In the next part of the assignment, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Recap

Here's a recap of the size of the dataframe.

In [7]:
df.shape

(103, 3)

### Thank you for checking out this notebook!

This concludes Part 1 of the assignment. I hope you have found this notebook interesting!

This notebook was created by We You Toh.