# Notebook for Clustering Neighbourhoods in the City of Toronto


## Part 1 of the course assignment

### Import the necessary libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import urllib.request
from bs4 import BeautifulSoup

### Scraping
Scrape the data from the given URL and assign the HTTP response to a variable 'page'.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
page

<http.client.HTTPResponse at 0x7f2c152f9d68>

We use the BeautifulSoup package to parse the HTML response held in the 'page' variable and store the parsed document in a new variable called 'soup'. From this object we can extract various parts of the page. Here, the page title and all the tables on the page are extracted. We can also inspect the full HTML of the page with the prettify() function.

In [3]:
soup = BeautifulSoup(page,"lxml")
print(soup.title)
print(soup.title.string)
#print(soup.prettify())
print(soup.find_all('table'))

<title>List of postal codes of Canada: M - Wikipedia</title>
List of postal codes of Canada: M - Wikipedia
[<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Ro
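
As a small offline illustration of the same calls, the sketch below parses a hand-written HTML string instead of the live Wikipedia page. The 'sample_html' string and the variable names are made up for this example, and it uses Python's built-in 'html.parser' rather than 'lxml' so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia HTML).
sample_html = """
<html><head><title>Sample postal codes</title></head>
<body>
<table class="wikitable sortable"><tr><th>Postal Code</th></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string                            # text inside <title>
all_tables = sample_soup.find_all("table")                       # every <table> element
wiki_tables = sample_soup.find_all("table", class_="wikitable")  # filter by CSS class

print(page_title)
print(len(all_tables), len(wiki_tables))
```

Filtering by class, as in the last call, is usually easier to read than printing every table on the page.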

Select the table needed for the analysis and store it in a variable 'table' so that it can be accessed and modified easily.

In [4]:
table = soup.find('table',class_='wikitable sortable')
table


<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

Collect the three cells of each table row into separate lists, then arrange those lists into an organised dataframe.

In [5]:
A=[]
B=[]
C=[]
for row in table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        
df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighbourhood']=C
df.head(10)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
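
The row-by-row extraction can also be sketched on a tiny hand-written table. This is only an illustration under stated assumptions: 'sample_html' is made up, 'html.parser' replaces 'lxml', and 'get_text(strip=True)' is swapped in for 'find(text=True)' so that the trailing newline inside each cell is removed at extraction time.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row holds <th> cells only, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

sample_df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighbourhood"])
print(sample_df)
```

Building one list of rows instead of three parallel lists keeps the columns aligned by construction.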


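### Ignore cells with a borough that is Not assigned
First, count how often each value occurs in the Borough column to see how many rows are 'Not assigned'. The scraped values still carry a trailing newline, which is why each name and its count print on separate lines and why the filter compares against 'Not assigned\n'.
Then create a new df without the 'Not assigned' rows.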

In [6]:
print(df['Borough'].value_counts())
df=df[df.Borough != 'Not assigned\n']
df=df.reset_index(drop=True)
df.head()

Not assigned
        77
North York
          24
Downtown Toronto
    19
Scarborough
         17
Etobicoke
           12
Central Toronto
      9
West Toronto
         6
York
                 5
East York
            5
East Toronto
         5
Mississauga
          1
Name: Borough, dtype: int64


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
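
A slightly more defensive variant of this filter, shown on made-up rows, compares the stripped text so that it works whether or not the values carry a trailing newline. The sample data and the names 'sample_df', 'assigned' and 'filtered' are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that keep the trailing newline seen in the scraped values.
sample_df = pd.DataFrame({
    "Postal Code": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Compare against the stripped text so surrounding whitespace does not matter.
assigned = sample_df["Borough"].str.strip() != "Not assigned"
filtered = sample_df[assigned].reset_index(drop=True)

print(filtered.shape)
```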


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [7]:
# The scraped values still end in '\n' at this point, so compare the stripped text
df.loc[df['Neighbourhood'].str.strip()=='Not assigned','Neighbourhood']=df['Borough']
df.head(10)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
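
To see the rule in action, here is a minimal sketch on made-up data, since the first ten rows above contain no such case. The postal code "M9Z" and the frame 'sample_df' are hypothetical.

```python
import pandas as pd

# Made-up rows; "M9Z" is a hypothetical code with a borough but no neighbourhood.
sample_df = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Etobicoke"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask of the rows whose neighbourhood is missing.
unassigned = sample_df["Neighbourhood"].str.strip() == "Not assigned"

# Copy the borough into the neighbourhood for those rows only.
sample_df.loc[unassigned, "Neighbourhood"] = sample_df.loc[unassigned, "Borough"]

print(sample_df)
```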


More than one neighbourhood can exist in one postal code area. So, using the groupby() function, we combine all rows that share a postal code into a single row whose Neighbourhood cell lists all of their neighbourhoods, separated by commas.

In [8]:
df_g=df.groupby(['Postal Code','Borough'],sort=False).agg(','.join)
df=df_g.reset_index()
df.head()
df.shape

(103, 3)
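
The effect of the grouping is easier to see on a few made-up rows in which one postal code appears twice. The data and names below are assumptions for illustration, and the sketch selects the 'Neighbourhood' column explicitly before aggregating.

```python
import pandas as pd

# Made-up rows: postal code M5A appears twice with different neighbourhoods.
sample_df = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M5A"],
    "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Regent Park", "Parkwoods", "Harbourfront"],
})

grouped = (
    sample_df.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)   # join the neighbourhood strings of each group
    .reset_index()    # turn the group keys back into ordinary columns
)

print(grouped)
```

With 'sort=False' the groups keep the order in which each postal code first appears.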

## --- Part 2 ---

Here, the geographical coordinates (latitude and longitude) of each postal code are read from the given CSV file and stored in a new dataframe named 'df_loc'.

In [9]:
!wget -q -O 'Toronto_location.csv'  http://cocl.us/Geospatial_data
df_loc = pd.read_csv('Toronto_location.csv')
df_loc.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Set the three column headings of 'df_loc' explicitly, so that the key column is named 'Postal Code', the same as in 'df'.

In [10]:
df_loc.columns = ['Postal Code','Latitude','Longitude']
df_loc.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
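
An alternative to assigning the whole list of headings is to rename a column by name, which does not depend on column order. The sketch below reads a made-up CSV string from memory. The header 'PostalCode' is a hypothetical variant used only to show the rename and is not claimed to be the header of the real file.

```python
import io
import pandas as pd

# Made-up CSV text; the "PostalCode" header is hypothetical.
csv_text = """PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

sample_loc = pd.read_csv(io.StringIO(csv_text))

# Rename by name; columns that are not listed keep their headings.
sample_loc = sample_loc.rename(columns={"PostalCode": "Postal Code"})

print(sample_loc.columns.tolist())
```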


Then we use '.str.strip()' to remove leading and trailing whitespace from the common 'Postal Code' column in both data frames so that the keys match exactly, and merge the two data frames into a single data frame to make the next step (clustering) more convenient.

In [13]:
print(df.dtypes)
print(df_loc.dtypes)
df['Postal Code']=df['Postal Code'].str.strip()
df_loc['Postal Code']=df_loc['Postal Code'].str.strip()
df_toronto=pd.merge(df,df_loc,how='outer',on='Postal Code')
print(df_toronto.shape)
df_toronto['Neighbourhood']=df_toronto['Neighbourhood'].str.strip()
df_toronto.head()

Postal Code      object
Borough          object
Neighbourhood    object
dtype: object
Postal Code     object
Latitude       float64
Longitude      float64
dtype: object
(103, 5)


| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
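
One way to check that an outer merge did not silently leave rows unmatched is the 'indicator' option of 'pd.merge', which adds a '_merge' column naming the source of each row. The sketch below is an illustration on made-up frames ('left', 'right'), not the notebook's actual data, although the coordinates are copied from the tables above.

```python
import pandas as pd

# Made-up frames; the left keys keep a trailing newline, as scraped text might.
left = pd.DataFrame({
    "Postal Code": ["M3A\n", "M4A\n"],
    "Borough": ["North York", "North York"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A", "M1B"],
    "Latitude": [43.725882, 43.753259, 43.806686],
    "Longitude": [-79.315572, -79.329656, -79.194353],
})

# Normalise the key in both frames before merging.
left["Postal Code"] = left["Postal Code"].str.strip()
right["Postal Code"] = right["Postal Code"].str.strip()

merged = pd.merge(left, right, how="outer", on="Postal Code", indicator=True)

# Rows that exist in only one of the two frames.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

If 'unmatched' is empty, every postal code was found in both frames and the outer merge gives the same rows as an inner merge.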
