## Applied Data Science Capstone Assignment 2: Segmenting and Clustering Neighborhoods in Toronto

### For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [2]:
#importing the necessary libraries
import pandas as pd

3. To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

4. Submit a link to your Notebook on your GitHub repository. (10 marks)

- Note: There are different website scraping libraries and packages in Python. One of the most common packages is BeautifulSoup. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

- The package is so popular that there is a plethora of tutorials and examples of how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

- Use the BeautifulSoup package or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe
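
As one illustration of the "any other way" option, the following is a minimal sketch that pulls the rows out of an HTML table using only the standard library's `html.parser` (swapped in here for BeautifulSoup so that no extra package is needed). The `TableRows` class and the `sample_html` snippet are hypothetical stand-ins, not part of the assignment; the real page would first have to be downloaded.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup standing in for the Wikipedia table.
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

The resulting `records` list can be passed to `pd.DataFrame(records, columns=header)` if a dataframe is wanted.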

The first task is to parse the data from Wikipedia:

In [3]:
from IPython.display import IFrame
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
IFrame(url, width=800, height=350)

The data could be parsed with BeautifulSoup, but it is more straightforward to use pandas and its read_html function:

In [4]:
data, = pd.read_html(url, match="Postal code", skiprows=1)
data.columns = ["PostalCode", "Borough", "Neighborhood"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M2A,Not assigned,
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Regent Park / Harbourfront
4,M6A,North York,Lawrence Manor / Lawrence Heights


In [5]:
data.shape

(179, 3)

Only process the cells that have an assigned borough. Ignore cells with a borough that is "Not assigned".

In [6]:
data = data[data["Borough"] != "Not assigned"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Regent Park / Harbourfront
4,M6A,North York,Lawrence Manor / Lawrence Heights
5,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

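The solution is to group the data by PostalCode and aggregate the remaining columns. For Borough, it is sufficient to pick the first item of each group's series; for Neighborhood, the items are joined together using ", ".join(s). Note that in the table as parsed above, M5A already occupies a single row with its neighborhoods separated by " / ".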

In [7]:
borough_func = lambda s: s.iloc[0]
neighborhood_func = lambda s: ", ".join(s)
agg_funcs = {"Borough": borough_func, "Neighborhood": neighborhood_func}
data_temp = data.groupby(by="PostalCode").aggregate(agg_funcs)
data_temp.head()

PostalCode,Borough,Neighborhood
M1B,Scarborough,Malvern / Rouge
M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
M1E,Scarborough,Guildwood / Morningside / West Hill
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
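
The rows shown above each already hold a single postal code, so the effect of the aggregation is easier to see on a small made-up frame. The sketch below assumes hypothetical rows in which M5A appears twice, mirroring the example in the assignment text. The names `toy` and `merged` are illustrative only, and the string `"first"` is used as a shorthand for the `s.iloc[0]` lambda.

```python
import pandas as pd

# Hypothetical rows: M5A is listed twice with different neighborhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Keep the first borough per postal code and join the neighborhoods.
merged = (
    toy.groupby("PostalCode", as_index=False)
       .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(merged)
```

Passing `as_index=False` keeps PostalCode as an ordinary column, so no separate `reset_index()` step is needed in this variant.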


Some postprocessing is needed. Reset the index and restore the original column order:

In [8]:
data = data_temp.reset_index()[data.columns]
data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West


Checking for "Not assigned" and NaN (blank) values:

In [9]:
data[data["Neighborhood"] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [10]:
data["Neighborhood"].isnull().sum()

0

In [11]:
data.shape

(103, 3)

If a neighborhood is "Not assigned", replace it with the borough name:

In [12]:
for (j, row) in data.iterrows():
    if row["Neighborhood"] == "Not assigned":
        borough = row["Borough"]
        print("Replace \"Not assigned\" => %s in row %i" % (borough, j))
        # Write through data.at: assigning to `row` would only change a copy
        data.at[j, "Neighborhood"] = borough
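
The same rule can also be applied without a row loop. The following is a minimal sketch using a boolean mask with `.loc`, run on a hypothetical two-row frame (`toy`) because the real data contains no "Not assigned" neighborhoods at this point.

```python
import pandas as pd

# Hypothetical frame: the first row has a borough but no neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Harbourfront"],
})

# Select the affected rows and copy the borough into the neighborhood column.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Assigning through `.loc` modifies the dataframe itself, which is what the replacement step requires.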

In [13]:
data.iloc[83:88]

Unnamed: 0,PostalCode,Borough,Neighborhood
83,M6R,West Toronto,Parkdale / Roncesvalles
84,M6S,West Toronto,Runnymede / Swansea
85,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,Business reply mail Processing CentrE


In [14]:
data.shape

(103, 3)

# Determining coordinates for each neighborhood

In [1]:
# !pip install geopy
# !pip install folium
# !pip install geocoder

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/53/fc/3d1b47e8e82ea12c25203929efb1b964918a77067a874b2c7631e2ec35ec/geopy-1.21.0-py2.py3-none-any.whl (104kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.21.0
Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [15]:
locations = pd.read_csv("https://cocl.us/Geospatial_data")
locations.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
locations.columns = ["PostalCode", "Latitude", "Longitude"]
data2 = pd.merge(data, locations, on='PostalCode')
data2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


To check that the merge was successful, look up postal code M5G, which should be located at (43.657952, -79.387383):

In [17]:
data2[data2["PostalCode"] == "M5G"]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [18]:
data2[data2["PostalCode"] == "M5A"]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
53,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636


The row ordering of the dataframe in the assignment is unknown, but each postal code shown here has the expected latitude and longitude attached.
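
Spot checks on individual postal codes can be complemented by a check over the whole frame. The sketch below is one possible approach, not what the notebook does: it uses a left merge with `indicator=True` to list postal codes that found no coordinates, and `validate="one_to_one"` to raise an error if a postal code is duplicated on either side. The frames and the postal code `M9Z` are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: M9Z has no matching row in the coordinates table.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M5G", "M5A", "M9Z"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Hypothetical Borough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M5G", "M5A"],
    "Latitude": [43.657952, 43.654260],
    "Longitude": [-79.387383, -79.360636],
})

# A left merge keeps every neighborhood row; "_merge" records where each row matched.
checked = neighborhoods.merge(
    coords, on="PostalCode", how="left", indicator=True, validate="one_to_one"
)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code received coordinates.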

# Explore data

Let's keep only the rows where Borough contains the word "Toronto", then explore and cluster those.

In [19]:
some = data2[data2['Borough'].str.contains("Toronto")]
some.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,The Danforth West / Riverdale,43.679557,-79.352188
42,M4L,East Toronto,India Bazaar / The Beaches West,43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [20]:
some.shape

(39, 5)
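
The substring filter above relies on every Borough value being a string. As a hedged sketch on a hypothetical series with one missing value, passing `na=False` makes missing entries count as non-matches so the mask stays purely boolean, and `regex=False` treats the pattern as literal text.

```python
import pandas as pd

# Hypothetical borough values, including one missing entry.
boroughs = pd.Series(["East Toronto", "North York", "Downtown Toronto", None])

# Literal substring match; missing values are treated as "does not contain".
mask = boroughs.str.contains("Toronto", regex=False, na=False)
print(boroughs[mask].tolist())
```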

# Map of Toronto

In [25]:
from geopy.geocoders import Nominatim
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.


In [27]:
data2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [29]:
import folium
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(data2['Latitude'], data2['Longitude'], data2['Borough'], data2['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto