# Applied Data Science - Capstone Project Notebook

This notebook is primarily used to document progress on the Capstone Project (part of Coursera's Applied Data Science class).

## Week 3 Activities

- Scrape the list of Toronto neighborhoods off the following Wikipedia page [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
- Get the latitude and the longitude coordinates of each neighborhood
- Explore and cluster the neighborhoods in Toronto
- Generate maps to visualize Toronto neighborhoods and how they cluster together


### Scraping Wikipedia

In this activity I use the [`pandas.read_html`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html) function. Unfortunately, the wiki page contains some information that I don't need in my dataframe. I get rid of it by:
- using a *regexp* that matches the Canadian postal code format for the Toronto area
- selecting just the first dataframe
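
The regexp idea can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual filter: the pattern `M\d[A-Z]` (the letter M, one digit, one uppercase letter) is my assumption based on the codes shown in the table, and `is_toronto_code` is a hypothetical helper name.

```python
import re

# Assumed pattern: Toronto codes in the table look like "M3A"
# (the letter M, one digit, one uppercase letter). Deliberately loose.
TORONTO_CODE = re.compile(r"M\d[A-Z]")

def is_toronto_code(value):
    """Return True if value is a string that fully matches the assumed pattern."""
    return isinstance(value, str) and TORONTO_CODE.fullmatch(value) is not None

# Sample cell values: two valid codes, a non-Toronto code, a header, a malformed code
codes = ["M3A", "M5A", "K1A", "Postal Code", "M10"]
toronto_codes = [c for c in codes if is_toronto_code(c)]
print(toronto_codes)
```

`pandas.read_html` also accepts a `match` argument (a string or compiled regexp) that restricts the returned tables to those containing matching text, which may be a more direct way to apply such a pattern.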

In [1]:

import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

# form the URL
postal_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# fetch the needed data
postal_df = pd.read_html(postal_url)[0]
postal_df.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [2]:
# get rid of the rows where borough is "Not assigned" (original row labels are kept)
postal_df = postal_df[postal_df['Borough'] != 'Not assigned']
postal_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [3]:
# show the resulting shape (rows, columns) of the neighborhoods dataframe
postal_df.shape

(103, 3)
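
The filtered dataframe keeps its original row labels (2, 3, 4, ... in the output of cell 2). If a clean 0-based index is wanted, one option is to renumber after filtering. The sketch below uses a small toy dataframe built from the first rows of the scraped table. The names `sample` and `assigned` are illustrative only.

```python
import pandas as pd

# Toy rows copied from the first lines of the scraped table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned.shape)
print(assigned.index.tolist())
```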

### Getting Latitude and Longitude coordinates for each neighborhood

Unfortunately, `geocoder` seems to be an unreliable method for coordinate lookup. I've also noticed that both Google and Bing insist on an API key associated with billing information. Since I'm not interested in spending actual money on an experiment, I will use the coordinates CSV file provided in the assignment: [https://cocl.us/Geospatial_data](https://cocl.us/Geospatial_data).

In [4]:
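# URL of the coordinates CSV file provided in the assignment
geo_url = "https://cocl.us/Geospatial_data"
# load the latitude/longitude of each postal code
geo_df = pd.read_csv(geo_url)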
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [5]:
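# join neighborhoods with their coordinates on the shared 'Postal Code' column
toronto_df = pd.merge(postal_df, geo_df, on='Postal Code')
# use the American spelling for the column name
toronto_df.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)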
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
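
`pd.merge` performs an inner join by default, so any postal code missing from the coordinates file would silently disappear from the result. A hedged sketch of how such a join could be checked is shown below. It uses toy dataframes, and the `M9Z` row is a hypothetical code added on purpose so that one row has no match.

```python
import pandas as pd

# Toy stand-ins for postal_df and geo_df; "M9Z" is a hypothetical code
# that deliberately has no coordinates
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
geo = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every postal row; indicator=True adds a "_merge" column,
# and validate raises if a postal code is duplicated on either side
checked = pd.merge(postal, geo, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that the inner join in cell 5 dropped no neighborhoods.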
