## Toronto Neighbourhood
#### Made by: <a href = "https://www.facebook.com/henriquemmenezes">Henrique Chaves</a>
#### <a href = "https://www.linkedin.com/in/henrique-c-a6b0a5121/">My LinkedIn</a>

### 1st STEP: Web scraping and data cleaning.

First, import the libraries required for web scraping and data cleaning.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

Next, I request the Wikipedia page that contains the table of Toronto neighbourhoods with a GET request, and then parse the response with BeautifulSoup's HTML parser.

In [2]:
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
if req.status_code == 200:
    content = req.content
    print("OK.")
else:
    print("Error.")

OK.
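In the cell above, `content` is only defined when the status code is 200, so a failed request would only show up later as a NameError. A minimal sketch of a stricter variant follows. The helper name `fetch_page` and the 10-second `timeout` are illustrative assumptions, not part of the original notebook.

```python
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def fetch_page(url, timeout=10):
    """Return the raw body of `url`, raising on HTTP errors instead of printing."""
    # `timeout` (an assumed value) stops the call from waiting forever
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out here):
# content = fetch_page(WIKI_URL)
```

With this shape, a failed download stops the notebook at the request itself rather than in a later cell.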


In [3]:
soup = BeautifulSoup(content, "html.parser")
table = soup.find(name = "table")
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

Then I convert the table to a string and let pandas parse it into a DataFrame.

In [4]:
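from io import StringIO
table_str = str(table)
# Newer pandas versions expect a file-like object rather than a literal HTML string
dftotal = pd.read_html(StringIO(table_str))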
dftotal

[    Postcode           Borough  \
 0        M1A      Not assigned   
 1        M2A      Not assigned   
 2        M3A        North York   
 3        M4A        North York   
 4        M5A  Downtown Toronto   
 5        M5A  Downtown Toronto   
 6        M6A        North York   
 7        M6A        North York   
 8        M7A      Queen's Park   
 9        M8A      Not assigned   
 10       M9A         Etobicoke   
 11       M1B       Scarborough   
 12       M1B       Scarborough   
 13       M2B      Not assigned   
 14       M3B        North York   
 15       M4B         East York   
 16       M4B         East York   
 17       M5B  Downtown Toronto   
 18       M5B  Downtown Toronto   
 19       M6B        North York   
 20       M7B      Not assigned   
 21       M8B      Not assigned   
 22       M9B         Etobicoke   
 23       M9B         Etobicoke   
 24       M9B         Etobicoke   
 25       M9B         Etobicoke   
 26       M9B         Etobicoke   
 27       M1C       

pd.read_html returns a list of DataFrames, so I take the first one, which is the table I need.

In [5]:
df = dftotal[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
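The same DataFrame can also be built directly from the parsed rows, without going through pd.read_html. The sketch below assumes a small inline HTML snippet that mimics the structure of the Wikipedia table. The names `html`, `header`, `records` and `df_manual` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (same three columns)
html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
</tbody></table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Header cells are <th>, data cells are <td>; strip=True removes the trailing newlines
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]

df_manual = pd.DataFrame(records, columns=header)
print(df_manual)
```

This route only needs BeautifulSoup's built-in "html.parser", at the cost of handling the header and data rows by hand.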


Many rows are labelled "Not assigned". I will convert every "Not assigned" in the Borough column to NaN and then drop those rows.

In [6]:
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df["Borough"] = df["Borough"].replace("Not assigned", np.nan)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,,Not assigned


In [7]:
df = df.dropna(axis = 0)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [8]:
df.reset_index(drop = True, inplace = True)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
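The three steps above (replace with NaN, drop, reset the index) can also be written as a single boolean filter. This is a minimal sketch on a made-up four-row frame. The names `sample` and `assigned` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A", "M8A"],
    "Borough": ["Not assigned", "North York", "Queen's Park", "Not assigned"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Rows that have a Borough but no Neighbourhood (M7A here) are kept, which matches the result of the cells above.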


Now I'll look for any "Not assigned" values remaining in the Neighbourhood column.

In [9]:
df.loc[df["Neighbourhood"] == "Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


Only one row still has "Not assigned" as its Neighbourhood, so I'll give it the same name as the Borough in that row.

In [10]:
df["Neighbourhood"].replace("Not assigned", "Queen's Park", inplace = True)
df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
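The cell above hard-codes "Queen's Park" and triggers a SettingWithCopyWarning. One way to avoid both, sketched here on a made-up two-row frame, is to copy the Borough value into the Neighbourhood column wherever it is unassigned and assign the result back to the column. The names `sample` and `unassigned` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Manor", "Not assigned"],
})

# Where the neighbourhood is unassigned, take the value from Borough in the same row
unassigned = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(unassigned, sample["Borough"])
print(sample)
```

Because the replacement value comes from each row's own Borough, the same line would still work if more than one row were unassigned.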


Now there are no "Not assigned" values left.
<br>
Let's check whether each Postcode is linked to one neighbourhood or to several.

In [11]:
df_grouped = df.groupby("Postcode").count()
df_grouped

Postcode,Borough,Neighbourhood
M1B,2,2
M1C,3,3
M1E,3,3
M1G,1,1
M1H,1,1
M1J,1,1
M1K,3,3
M1L,3,3
M1M,3,3
M1N,2,2


So, several postcodes are linked to more than one neighbourhood. <br>
I'll collect all the neighbourhoods of each postcode and put them together in a single row.

In [12]:
rows = []
for postcode, row in df_grouped.iterrows():
    # All rows of the original frame that share this postcode
    subset = df.loc[df["Postcode"] == postcode]
    rows.append({
        "Postcode": postcode,
        "Borough": subset["Borough"].values[0],
        # Join every neighbourhood of this postcode into one string
        "Neighbourhood": " , ".join(subset["Neighbourhood"]),
    })

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df_new = pd.DataFrame(rows, columns = ["Postcode", "Borough", "Neighbourhood"])

df_new

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge , Malvern"
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park"
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge"
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff , Cliffside West"
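The loop above can also be expressed as a single groupby aggregation. This is a minimal sketch on a made-up three-row frame, keeping the same " , " separator. The names `sample` and `combined` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postcode: keep the first borough, join all neighbourhood names
combined = (
    sample.groupby("Postcode", as_index=False)
    .agg({"Borough": "first", "Neighbourhood": " , ".join})
)
print(combined)
```

Taking the first Borough per group assumes, as the loop does, that a postcode belongs to a single borough.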


That looks good. Let's check the shape of df_new.

In [13]:
df_new.shape

(103, 3)

Next, I'll import the geopy library to get the latitude and longitude of each postal code.

In [14]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="toronto_cluster")

In [49]:
df_lat_lon = pd.DataFrame(columns = ["Latitude", "Longitude"])
df_lat_lon

Unnamed: 0,Latitude,Longitude


In [50]:
records = []
for index, row in df_new.iterrows():
    location = geolocator.geocode("{}, Toronto, Ontario".format(row["Postcode"]))
    if location is None:
        # The geocoder found nothing for this postcode
        latitude = np.nan
        longitude = np.nan
    else:
        latitude = location.latitude
        longitude = location.longitude
    records.append({"Latitude": latitude, "Longitude": longitude})
    print("Postcode: {} , Latitude: {}, Longitude: {}".format(row["Postcode"], latitude, longitude))

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df_lat_lon = pd.DataFrame(records, columns = ["Latitude", "Longitude"])

df_lat_lon

Postcode: M1B , Latitude: 43.653963, Longitude: -79.387207
Postcode: M1C , Latitude: 43.653963, Longitude: -79.387207
Postcode: M1E , Latitude: nan, Longitude: nan
Postcode: M1G , Latitude: 43.6449033, Longitude: -79.3818364
Postcode: M1H , Latitude: nan, Longitude: nan
Postcode: M1J , Latitude: nan, Longitude: nan
Postcode: M1K , Latitude: nan, Longitude: nan
Postcode: M1L , Latitude: nan, Longitude: nan
Postcode: M1M , Latitude: nan, Longitude: nan
Postcode: M1N , Latitude: nan, Longitude: nan
Postcode: M1P , Latitude: nan, Longitude: nan
Postcode: M1R , Latitude: nan, Longitude: nan
Postcode: M1S , Latitude: nan, Longitude: nan
Postcode: M1T , Latitude: nan, Longitude: nan
Postcode: M1V , Latitude: nan, Longitude: nan
Postcode: M1W , Latitude: 43.6449033, Longitude: -79.3818364
Postcode: M1X , Latitude: nan, Longitude: nan
Postcode: M2H , Latitude: nan, Longitude: nan
Postcode: M2J , Latitude: 43.6449033, Longitude: -79.3818364
Postcode: M2K , Latitude: nan, Longitude: nan
Postcode:

Unnamed: 0,Latitude,Longitude
0,43.653963,-79.387207
1,43.653963,-79.387207
2,,
3,43.644903,-79.381836
4,,
5,,
6,,
7,,
8,,
9,,
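The geocoding loop can be wrapped in a small helper that takes the geocoding function as an argument, which makes it possible to try the logic without calling Nominatim. In this sketch `geocode_postcodes`, `FakeLocation` and `fake_geocode` are hypothetical names, and the fake coordinates are only placeholders for illustration.

```python
from collections import namedtuple

import pandas as pd


def geocode_postcodes(postcodes, geocode):
    """Return one row per postcode; coordinates are NaN when `geocode` finds nothing."""
    records = []
    for code in postcodes:
        location = geocode("{}, Toronto, Ontario".format(code))
        if location is None:
            latitude, longitude = float("nan"), float("nan")
        else:
            latitude, longitude = location.latitude, location.longitude
        records.append({"Postcode": code, "Latitude": latitude, "Longitude": longitude})
    return pd.DataFrame(records, columns=["Postcode", "Latitude", "Longitude"])


# Stand-in geocoder for illustration only; it knows a single query
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(query):
    known = {"M1B, Toronto, Ontario": FakeLocation(43.806686, -79.194353)}
    return known.get(query)


coords = geocode_postcodes(["M1B", "M1E"], fake_geocode)
print(coords)

# With geopy, the real callable could be throttled to one request per second:
# from geopy.extra.rate_limiter import RateLimiter
# geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# coords = geocode_postcodes(df_new["Postcode"], geocode)
```

Keeping the Postcode next to each coordinate pair also makes it easy to see which lookups failed.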


The geocoder could not find a location for most postal codes, and the few results it did return repeat the same coordinates for different postal codes. So I'll download a CSV file that contains the latitude and longitude of each Toronto postal code.

In [51]:
!wget -O latlongtoronto.csv https://cocl.us/Geospatial_data

--2019-10-14 20:11:40--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.194
Connecting to cocl.us (cocl.us)|169.48.113.194|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-10-14 20:11:43--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-10-14 20:11:44--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-10-14 

Then I'll load this data into a DataFrame.

In [52]:
df_lat_lon = pd.read_csv("latlongtoronto.csv")
df_lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now I'll join the two DataFrames on the postal code, and then drop the duplicate postal code column, since the merged DataFrame has two of them.

In [56]:
df_new = df_new.merge(df_lat_lon, how = "left", left_on = "Postcode", right_on = "Postal Code")
df_new.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476


In [57]:
df_new.drop(["Postal Code"], axis = 1, inplace = True)
df_new.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [58]:
df_new.rename(columns = {"Postcode": "PostalCode"}, inplace = True)
df_new.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
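The merge, drop and rename steps can be combined by renaming both key columns to PostalCode before merging, so no duplicate column is created. This is a minimal sketch on two made-up two-row frames whose values are taken from the tables above. The names `neighbourhoods`, `coordinates` and `merged` are illustrative, and `validate="one_to_one"` is an added check that is not in the original cells.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge , Malvern", "Highland Creek , Rouge Hill , Port Union"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = (
    neighbourhoods.rename(columns={"Postcode": "PostalCode"})
    .merge(
        coordinates.rename(columns={"Postal Code": "PostalCode"}),
        on="PostalCode",
        how="left",
        # Raise an error if a postal code appears more than once on either side
        validate="one_to_one",
    )
)
print(merged)
```

The left join keeps every neighbourhood row even if the CSV has no coordinates for it, in which case Latitude and Longitude would be NaN.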
