# Segmenting and clustering neighbourhoods in Toronto
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

We will use it to scrape the data for Toronto neighbourhoods from the Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
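
To make the idea of navigating a parse tree concrete, here is a minimal sketch. The HTML snippet is a made-up stand-in shaped like the Wikipedia table (it is not the live page), and `sample_html` and `rows` are illustrative names only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the Wikipedia postal code table (not the live page).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the stripped text of every header or data cell, one list per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

for row in rows:
    print(row)
```

Because `get_text` reads through nested tags, the link around "North York" does not need special handling.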



In [1]:
# Install third-party packages
!conda install -c conda-forge folium=0.5.0 --yes # skip this line if you have already completed the Foursquare API lab
!conda install -c conda-forge geopy --yes # skip this line if you have already completed the Foursquare API lab
!pip install beautifulsoup4

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge
    altair-3.2.0               |           py36_0         770 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ca-certificates-2019.9.11  |       hecc5488_0         144 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.3 MB

The following NEW packages will be 

In [2]:
# Import all libraries

# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# library for web page scraping
from bs4 import BeautifulSoup, SoupStrainer

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

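# Import urllib to fetch the HTML page over HTTP.
# urllib.request must be imported explicitly; `import urllib` alone does not guarantee the submodule is loaded.
import urllib.request
import urllib as http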

print('Libraries imported.')

Libraries imported.


Start by fetching the full HTML of the page with Python's urllib library, then use Beautiful Soup to keep only the part we need: the HTML table of Toronto neighbourhoods.
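
The fetch step on its own can be sketched as a small helper. This is only an illustration under stated assumptions: `fetch_html` is a hypothetical name, not part of the notebook, and the demonstration uses a `data:` URL so that it runs without network access.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        # The context manager closes the connection even if read() raises.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Use the declared charset if there is one, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
    except OSError as err:  # URLError and HTTPError are OSError subclasses
        print("Unable to retrieve page:", err)
        return None
    return raw.decode(charset)


# A data: URL keeps the demonstration offline; swap in the Wikipedia URL for real use.
demo = fetch_html("data:text/html;charset=utf-8,%3Cp%3EToronto%3C%2Fp%3E")
print(demo)
```

Returning `None` on failure lets the caller decide whether to stop or retry, instead of continuing with an undefined response object.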

In [12]:
# Set up the predefined URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Use urllib to get the HTML source from the predefined URL
try:
    encoded_data = http.request.urlopen(url)
except OSError as err:
    print("Unable to retrieve page:", err)
    raise  # stop here: the rest of the cell needs the page content
mybytes = encoded_data.read()
encoded_data.close()

# Decode the raw bytes into UTF-8 text
html_doc = mybytes.decode("utf8")
# print(html_doc)

# Set up the HTML filter so that only <table> elements are parsed
only_tags_with_table_class = SoupStrainer('table')
# Tables from the page (including the postal code table) as a prettified HTML string
html_table = BeautifulSoup(html_doc, 'html.parser', parse_only=only_tags_with_table_class).prettify()
print(html_table)

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="
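
A strainer on the tag name `table` keeps every table on a page, not just the one with the postal codes. The sketch below illustrates this on a made-up fragment (`sample_page` is hypothetical, not the live page) and shows one possible way to pick the data table by its `wikitable` class after parsing, rather than by position.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page fragment: a paragraph, the data table, and a navigation table.
sample_page = """
<p>Intro text that should be discarded.</p>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
<table class="navbox"><tr><td>Navigation links</td></tr></table>
"""

# parse_only discards everything that is not inside a <table> element.
tables_only = SoupStrainer("table")
soup = BeautifulSoup(sample_page, "html.parser", parse_only=tables_only)

print(len(soup.find_all("table")))  # counts the tables that survived the strainer
print(soup.find("p"))               # the paragraph was never added to the tree

# Select the data table by class instead of relying on its position.
postal_table = soup.find("table", class_="wikitable")
cells = [td.get_text(strip=True) for td in postal_table.find_all("td")]
print(cells)
```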

Convert the HTML to a DataFrame and remove every row that has "Not assigned" in any column.

In [56]:
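# pandas.read_html returns a list of DataFrames, one per table; keep the first.
# The HTML string is wrapped in StringIO because newer pandas versions deprecate passing literal HTML.
from io import StringIO
df_postal = pd.read_html(StringIO(html_table))[0]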
display(df_postal.shape)
display(df_postal.columns)

# Remove rows that have "Not assigned" in any column
df_postal = df_postal[df_postal.Postcode != 'Not assigned']
df_postal = df_postal[df_postal.Borough != 'Not assigned']
df_postal = df_postal[df_postal.Neighbourhood != 'Not assigned']
display(df_postal.shape)
display(df_postal)


(288, 3)

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

(210, 3)

    Postcode           Borough     Neighbourhood
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
5        M5A  Downtown Toronto       Regent Park
6        M6A        North York  Lawrence Heights
7        M6A        North York    Lawrence Manor
10       M9A         Etobicoke  Islington Avenue
11       M1B       Scarborough             Rouge
12       M1B       Scarborough           Malvern
14       M3B        North York   Don Mills North

In [63]:
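# groupby returns a lazy GroupBy object; with no aggregation and no assignment,
# the result is discarded and df_postal is unchanged
df_postal.groupby('Postcode')
df_postal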

    Postcode           Borough     Neighbourhood
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
5        M5A  Downtown Toronto       Regent Park
6        M6A        North York  Lawrence Heights
7        M6A        North York    Lawrence Manor
10       M9A         Etobicoke  Islington Avenue
11       M1B       Scarborough             Rouge
12       M1B       Scarborough           Malvern
14       M3B        North York   Don Mills North
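
The output above is unchanged because `groupby` by itself does not modify the DataFrame. Assuming the goal of the grouping is one row per postcode with its neighbourhoods combined (the notebook does not state this), a sketch of such an aggregation could look like the following. The sample rows are copied from the output above; `grouped` is an illustrative name.

```python
import pandas as pd

# A few rows copied from the output above.
sample = pd.DataFrame(
    {
        "Postcode": ["M5A", "M5A", "M6A", "M6A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "North York", "North York"],
        "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Lawrence Manor", "Parkwoods"],
    }
)

# Group by postcode and borough, then join each group's neighbourhoods into one string.
# sort=False keeps the groups in order of first appearance.
grouped = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

The aggregated result has to be assigned to a name (here `grouped`), since the call returns a new DataFrame rather than changing `sample` in place.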
