# Segmentation and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.
1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
3. To create the above dataframe:
  - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
  - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
  - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
  - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
  - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
  - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your Github repository. (10 marks)
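
The three cleaning rules in step 3 can be sketched on a small hand-made table before they are applied to the scraped data. This is a minimal illustration, not the notebook's own pipeline: the rows below are typed in by hand to mimic the Wikipedia table, and the names `raw`, `clean`, and `combined` are chosen here for illustration only.

```python
import pandas as pd

# Hand-made rows that mimic the Wikipedia table (not the scraped data).
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Rule 3: one row per postal code, neighborhoods joined with a comma.
combined = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
print(combined.shape[0])  # number of rows
```

Grouping on both `PostalCode` and `Borough` keeps all three columns that the assignment asks for.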


## 1. Notebook creation and package imports

In [4]:
# import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import urllib.request

!conda install -c anaconda beautifulsoup4 --yes


## 2. Scrape Wikipedia with urllib.request and write the HTML to a file

In [5]:
url       = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req       = urllib.request.urlopen(url)
article   = req.read().decode()
file_html = 'List_of_postal_codes_of_Canada:_M.html'
file_txt  = 'List_of_postal_codes_of_Canada:_M.txt'

with open(file_html, 'w', encoding='utf-8') as fileout:
    fileout.write(article)

In [6]:
from bs4 import BeautifulSoup

# Load article, turn into soup and get the <table>s.
with open(file_html, encoding='utf-8') as filein:
    article = filein.read()
soup    = BeautifulSoup(article, 'html.parser')
tables  = soup.find_all('table', class_='sortable')

# Search through the tables for the one with the headings we want.
# The page labels the first column 'Postcode' (see the printed output below).
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:3] == ['Postcode', 'Borough', 'Neighbourhood']:
        break
        
print(ths)

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
</th>]


In [7]:
# Extract the columns and write them to a comma-delimited text file.

with open(file_txt, 'w') as fileout:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        PostalCode, Borough, Neighbourhood = [td.text.strip() for td in tds[:3]]
        print(', '.join([PostalCode, Borough, Neighbourhood]), file=fileout)
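
Joining fields with `', '` works as long as no field itself contains a comma followed by a space. The following is a minimal sketch of an alternative, assuming hand-made rows (the `M9Z` row is hypothetical and exists only to show the problem case): the standard `csv` module quotes such fields so that they survive a round trip through `pd.read_csv`.

```python
import csv
import io

import pandas as pd

# Hand-made rows; the second neighbourhood contains a comma on purpose.
rows = [
    ("M3A", "North York", "Parkwoods"),
    ("M9Z", "Example Borough", "First Place, Second Place"),  # hypothetical row
]

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# read_csv understands the quoting, so every row keeps three columns.
buffer.seek(0)
df_check = pd.read_csv(
    buffer, header=None, names=["PostalCode", "Borough", "Neighbourhood"]
)
print(df_check)
```

An in-memory buffer stands in for the text file here; the same writer could be pointed at a file opened with `newline=''`.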


## 3. Create the dataframe referenced in assignment page.

In [8]:
# Read in the txt file. A multi-character separator requires the python parsing engine.
df = pd.read_csv(file_txt, header=None, names=['PostalCode', 'Borough', 'Neighbourhood'], sep=', ', engine='python')
# print(df.shape)

# Ignore cells with a borough that is 'Not assigned'.
df = df[df.Borough != 'Not assigned'].copy()

# If a cell has a borough but a 'Not assigned' neighbourhood, then the neighbourhood will be the same as the borough.
# Use .loc with a boolean mask to avoid chained assignment.
not_assigned = df.Neighbourhood == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']


  


In [9]:
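# Create a custom aggregator that joins the neighbourhoods of one postal code.
agg_Neighbourhoods_from_PostalCode = lambda a: ', '.join(a)

# Execute the groupby and the aggregation, then reset the index.
# Note: grouping on PostalCode alone drops the Borough column, hence the two columns reported below.
df = df.groupby(by='PostalCode').agg({'Neighbourhood':agg_Neighbourhoods_from_PostalCode}).reset_index()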

df.shape

(103, 2)

## 4. Submit a link to your Notebook on your Github repository. (10 marks)

In [None]:
# Complete!

## 5. Use the Geocoder package or the csv file to create a dataframe.
The Geocoder package can be very unreliable. If you are not able to get the geographical coordinates of the neighborhoods with it, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
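
The "try the geocoder, otherwise use the csv file" idea can be sketched as a small retry-with-fallback helper. Everything here is an assumption made for illustration: `coordinates_with_fallback`, `flaky_lookup`, and `csv_coordinates` are hypothetical names, and the lookup function is a stand-in rather than a call into the Geocoder package.

```python
import time


def coordinates_with_fallback(postal_code, lookup, fallback, attempts=3, pause=0.0):
    """Try `lookup` a few times, then fall back to a pre-loaded table."""
    for _ in range(attempts):
        result = lookup(postal_code)  # expected to return (lat, lng) or None
        if result is not None:
            return result
        time.sleep(pause)
    # Every attempt failed: use the table, or None if the code is unknown.
    return fallback.get(postal_code)


# Stand-in for an unreliable geocoder: it always fails in this demo.
def flaky_lookup(postal_code):
    return None


# Stand-in for the csv file, holding the M1B coordinates listed in this notebook.
csv_coordinates = {"M1B": (43.806686, -79.194353)}

print(coordinates_with_fallback("M1B", flaky_lookup, csv_coordinates))
```

Passing the lookup in as a function keeps the helper independent of any particular geocoding library.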

In [12]:
df_geo = pd.read_csv('Geospatial_Coordinates.csv')
df_geo.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_geo.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [15]:
# Get the latitude and the longitude coordinates of each neighborhood
df_final = pd.merge(df, df_geo, on='PostalCode')
df_final

| | PostalCode | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | M1B | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
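
`pd.merge` performs an inner join by default, so a postal code that is missing from either table disappears without warning. The following is a minimal sketch of one way to check for that, assuming small stand-in tables (`left` and `right` are illustrative names, and `M0X` is a made-up code with no coordinates).

```python
import pandas as pd

# Small stand-ins for df and df_geo; 'M0X' is a made-up code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M0X"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union", "Nowhere"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postal code; indicator=True adds a '_merge' column
# that records whether each row was found in both tables or only in the left one.
checked = pd.merge(left, right, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.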


## 6. Explore and cluster the neighborhoods in Toronto. 

- You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:
- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.
- Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)
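
The option of working only with boroughs that contain the word Toronto can be sketched as a string filter. This is a minimal illustration that assumes a dataframe which still carries a `Borough` column (the grouped dataframe built earlier in this notebook drops it); the rows and the names `boroughs` and `toronto_only` are hand-made for this example.

```python
import pandas as pd

# Hand-made sample; the real dataframe would come from the scraping step.
boroughs = pd.DataFrame({
    "PostalCode": ["M4E", "M5A", "M1B", "M8V"],
    "Borough": ["East Toronto", "Downtown Toronto", "Scarborough", "Etobicoke"],
})

# Keep only the rows whose borough name contains the word 'Toronto'.
mask = boroughs["Borough"].str.contains("Toronto")
toronto_only = boroughs[mask].reset_index(drop=True)

print(toronto_only)
```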

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

6.1. <a href="#item1">Import Libraries</a>

6.2. <a href="#item2">Explore Neighborhoods in Toronto</a>

6.3. <a href="#item3">Analyze Each Neighborhood</a>

6.4. <a href="#item4">Cluster Neighborhoods</a>

6.5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

## 6.1. Import Libraries

In [24]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0+ exposes this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')


## 6.2. Explore Neighborhoods in Toronto
Create a function to repeat the same process for all the neighborhoods in Toronto.

Use the geopy library to get the latitude and longitude values of Toronto.
To define an instance of the geocoder, we need to define a user_agent.
We will name our agent toronto_explorer, as shown below.

In [25]:
address    = 'Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location   = geolocator.geocode(address)
latitude   = location.latitude
longitude  = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [33]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map (each popup shows the neighbourhood names and the postal code)
for lat, lng, postal_code, neighborhood in zip(df_final['Latitude'], df_final['Longitude'],
                                               df_final['PostalCode'], df_final['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, postal_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

Folium is a great visualization library. Feel free to zoom into the above map, and click on each circle marker to reveal the neighborhood names and their postal code.

However, for illustration purposes, let's simplify the above map and explore a single postal code area first. So let's slice the original dataframe and create a new dataframe that holds only the M8V row.

In [35]:
# M1V, M4V, M8V
toronto_data = df_final[df_final['PostalCode'] == 'M8V'].reset_index(drop=True)
toronto_data.head()


| | PostalCode | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | M8V | Humber Bay Shores, Mimico South, New Toronto | 43.605647 | -79.501321 |


In [37]:
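CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # placeholder: replace with your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # placeholder: replace with your own Foursquare secret
VERSION = '20190922' # Foursquare API version
radius = '50' # note: getNearbyVenues uses its own radius parameter (default 500), not this variable
LIMIT = '100'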

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
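
The two steps inside the function, building the request URL and flattening the response, can be tried without network access. The following is a minimal sketch under stated assumptions: `build_explore_url` and `venue_rows` are hypothetical helpers, the credentials are dummy strings, and `sample` is a hand-made payload that only mirrors the keys indexed above (`response`, `groups`, `items`, `venue`), not real API output.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL; urlencode percent-encodes each value."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    })
    return "https://api.foursquare.com/v2/venues/explore?" + query


def venue_rows(payload):
    """Flatten a payload shaped like the one indexed in getNearbyVenues."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        # Guard against a venue that has an empty category list.
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


# Hand-made payload for illustration only.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.6, "lng": -79.5},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Spot B", "location": {"lat": 43.7, "lng": -79.4},
               "categories": []}},
]}]}}

url = build_explore_url("DUMMY_ID", "DUMMY_SECRET", "20190922",
                        43.605647, -79.501321, 500, 100)
rows = venue_rows(sample)
print(url)
print(rows)
```

Separating URL construction from response parsing makes each piece testable on its own before any real request is sent.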

#### Run the above function on each neighborhood and create a new dataframe called toronto_venues.

In [36]:
# toronto_venues = getNearbyVenues(names=df_final['Neighbourhood'],
#                                  latitudes=df_final['Latitude'],
#                                  longitudes=df_final['Longitude']
#                                  )