<b><text size = 14>Examination of Toronto Neighborhoods</text></b>

The link we need to scrape is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We will use the BeautifulSoup package to scrape this Wikipedia page.

In [13]:
from bs4 import BeautifulSoup ## importing BeautifulSoup package 
import requests 

In [14]:
source  = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text ## requesting the text from the website
soup = BeautifulSoup(source,'lxml') ## storing this text as a beautifulSoup object

In [15]:
My_table = soup.find('table',{'class':'wikitable sortable'}) ## operating on our BeautifulSoup object to find the table in the text
table_body = My_table.find('tbody')
## table_body  ## checking table content to see if we got the right table

Next we import pandas, the library that provides dataframes, and load the scraped rows into a dataframe.  I also wrote the data to a .csv file so I could examine the entire set by eye.

In [16]:
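import pandas as pd
l = []
for tr in table_body.find_all("tr"): ## find_all is the current name for findAll
    td = tr.find_all('td')
    row = [cell.text for cell in td] ## text of each cell; a separate name avoids shadowing the loop variable tr
    l.append(row)
df = pd.DataFrame(l, columns=["Postcode","Borough", "Neighborhood"])
df.to_csv('check.csv')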
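
To see what the row loop above produces, here is a minimal sketch run on a small made-up table rather than the live page. The HTML string and the names `sample_html`, `sample_soup` and `sample_rows` are assumptions for illustration only. It uses Python's built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the Wikipedia page.
sample_html = (
    "<table class='wikitable sortable'><tbody>"
    "<tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>"
    "<tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>"
    "</tbody></table>"
)

# html.parser ships with Python, so this sketch does not need lxml.
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find("tbody").find_all("tr"):
    cells = tr.find_all("td")  # header cells are <th>, so the header row yields no cells
    sample_rows.append([cell.text for cell in cells])

print(sample_rows)
```

The header row comes back as an empty list because its cells are `<th>` rather than `<td>`, which is why the first row of the dataframe is empty. The last cell also keeps its trailing line break.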

I noticed there are trailing line breaks in the 'Neighborhood' column, so the function defined here gets rid of those pesky line breaks by keeping only the text before the first one.

In [17]:
def func(string):
    if string is not None:
        s = string.split("\n")
        return s[0]
    else:
        return string
## applying the function to the column to chop off the pesky extra '\n' values
df['Neighborhood']= df['Neighborhood'].apply(func)        
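
The same cleanup can probably be done without a helper function by using the pandas string accessor. This is only a sketch on made-up values (the `sample` dataframe is hypothetical), not a replacement that has been checked against the scraped table.

```python
import pandas as pd

# Hypothetical values imitating the scraped column, including a missing entry.
sample = pd.DataFrame({"Neighborhood": ["Parkwoods\n", "Victoria Village\n", None]})

# Split each string on the line break and keep the first piece;
# missing values are passed through as missing rather than raising an error.
sample["Neighborhood"] = sample["Neighborhood"].str.split("\n").str[0]

print(sample["Neighborhood"].tolist())
```

Like the function above, this keeps everything before the first line break and leaves missing entries alone.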

In [18]:
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


The first row was all None values (it comes from the table's header row, which has no td cells), so I chopped it off using the drop function on the dataframe.

In [19]:
df=df.drop(df.index[0])

Then I checked to make sure it was chopped off.

In [20]:
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


Next, I wanted to delete the postal codes that had neither a borough nor a neighborhood assigned to them.

In [22]:
## Here we create a new dataframe that is the same as the old one but without the postal codes that have neither a borough nor a neighborhood assigned.
## DataFrame.append was removed in pandas 2.0, so the rows are selected with a boolean mask instead of being appended one by one.
unassigned = (df['Borough']=="Not assigned") & (df['Neighborhood']=="Not assigned")
## reset_index gives the new dataframe a fresh 0..n-1 index, as appending with ignore_index=True did
df22 = df.loc[~unassigned, ['Borough', 'Neighborhood', 'Postcode']].reset_index(drop=True)
df22.head()
df22.count()

Borough         211
Neighborhood    211
Postcode        211
dtype: int64

Some postal codes had a borough but no neighborhood, so I just made the neighborhood equal to the borough.

In [23]:
## Creating a fresh dataframe in which unassigned neighborhood values are set to the same value as the borough
## (a copy plus a masked assignment, since DataFrame.append was removed in pandas 2.0)
df222 = df22.copy()
no_neighborhood = df222['Neighborhood']=="Not assigned"
df222.loc[no_neighborhood, 'Neighborhood'] = df222.loc[no_neighborhood, 'Borough']
df222.head()
df222.count()
df222.to_csv('checkmiddle.csv')

Some postal codes appear on more than one row.  A postal code seems to have only one borough, but it can have multiple neighborhoods.  The loop below compares each row with the row that follows it.  If the two rows share a postal code and also a neighborhood, the current row is a duplicate entry, so we stop there and don't add it to the new dataframe.  If they share a postal code but the neighborhoods differ, we add the current neighborhood to the front of the next row's string and again skip the current row.  A row is added to the new dataframe only when the next row has a different postal code, or when it is the last row.  So we get one row per postal code with a comma-separated list of its neighborhoods.

In [24]:
rows = [] ## rows collected for the new dataframe (DataFrame.append was removed in pandas 2.0)
last = len(df222) - 1 ## index of the final row, which has no following row to compare against
for index, row in df222.iterrows():
    ## read from df222 directly rather than from row, because earlier iterations may have updated the neighborhood
    if(index != last and df222['Postcode'][index]==df222['Postcode'][index+1]):
        if(df222['Neighborhood'][index]!=df222['Neighborhood'][index+1]):
            ## same postal code, different neighborhood: carry the names forward to the next row
            ## (.loc avoids the chained assignment used before)
            df222.loc[index+1, 'Neighborhood'] = df222['Neighborhood'][index] + ", " + df222['Neighborhood'][index+1]
    else:
        ## last row for this postal code, so keep it
        rows.append({'Borough': df222['Borough'][index],'Neighborhood': df222['Neighborhood'][index],'Postcode': df222['Postcode'][index]})
df2222 = pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Postcode'])
df2222.to_csv('checkend.csv')
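
The merging step above could also be sketched with a pandas groupby. This is an alternative under assumptions, not the method used in this notebook: the `sample` rows are made up, and unlike the loop, which only compares neighboring rows, this version also merges rows for the same postal code that are not adjacent.

```python
import pandas as pd

# Hypothetical rows imitating the cleaned table:
# one postal code on several rows, including one exact duplicate row.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Regent Park", "Queen's Park"],
})

merged = (
    sample.drop_duplicates()                       # drop exact duplicate rows first
    .groupby(["Postcode", "Borough"], sort=False)  # one group per postal code, in original order
    ["Neighborhood"]
    .agg(", ".join)                                # join the neighborhood names in each group
    .reset_index()
)

print(merged)
```

Each postal code ends up on a single row, with its neighborhoods joined by ", " in the order they first appeared.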

We end up with 103 unique postal code values. The result looks okay when I check it against the CSV files.

In [25]:
df2222.shape

(103, 3)

In [45]:
df2222

Unnamed: 0,Borough,Neighborhood,Postcode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,"Harbourfront, Regent Park",M5A
3,North York,"Lawrence Heights, Lawrence Manor",M6A
4,Queen's Park,Queen's Park,M7A
5,Etobicoke,Islington Avenue,M9A
6,Scarborough,"Rouge, Malvern",M1B
7,North York,Don Mills North,M3B
8,East York,"Woodbine Gardens, Parkview Hill",M4B
9,Downtown Toronto,"Ryerson, Garden District",M5B


The shape is (103, 3). In case that is not the expected shape, the dataframe is printed out above so you can see that the contents look right.

Now we create a new dataframe with the geographic coordinates.

In [26]:
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


In [42]:
geocode = Nominatim(user_agent="toronto_explorer")

I couldn't for the life of me figure out how to pass a postal code to the geolocator.geocode function, so I just passed the borough with ", ON, Canada" tagged onto it.  Maybe I will see how other people did it... Passing only the borough doesn't seem correct, because every postal code in the same borough then gets the same coordinates, but that's all I could figure out since the postal code didn't seem to be a valid input.

In [43]:
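geolocator = Nominatim(user_agent="toronto_explorer") ## create the geocoder once instead of on every pass through the loop
coord_rows = [] ## rows collected for the new dataframe (DataFrame.append was removed in pandas 2.0)
for index, row in df2222.iterrows():
    address = row['Borough'] + ', ON, Canada'
    location = geolocator.geocode(address)
    ## geocode returns None when nothing matches, so guard before reading the coordinates
    latitude = location.latitude if location is not None else None
    longitude = location.longitude if location is not None else None
    coord_rows.append({'Borough': row['Borough'],'Neighborhood': row['Neighborhood'],'Postcode': row['Postcode'], 'Latitude': latitude, 'Longitude': longitude})
data_frame_with_coords = pd.DataFrame(coord_rows, columns=['Borough', 'Latitude', 'Longitude', 'Neighborhood', 'Postcode'])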
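
One possible way to query by postal code instead of borough is a structured query. The sketch below assumes that geopy's Nominatim `geocode` accepts a dictionary with `postalcode` and `country` keys, and I have not checked whether Nominatim resolves three-character Toronto codes this way. The helpers `postcode_query` and `lookup_coords` are hypothetical names, and a stand-in geocoder with made-up coordinates is used so the example runs offline.

```python
def postcode_query(postcode):
    """Build a structured query dict.

    Assumption: the 'postalcode' and 'country' keys match what
    geopy's Nominatim.geocode accepts for structured queries.
    """
    return {"postalcode": postcode, "country": "Canada"}


def lookup_coords(geocode, postcode):
    """Return (latitude, longitude) for a postal code, or (None, None) if nothing is found.

    `geocode` is any callable shaped like geolocator.geocode.
    """
    location = geocode(postcode_query(postcode))
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


# Stand-in for a geocoder result; the coordinates are made up.
class FakeLocation:
    latitude = 1.0
    longitude = 2.0


def fake_geocode(query):
    """Offline stand-in that only 'knows' one postal code."""
    return FakeLocation() if query["postalcode"] == "M5A" else None


print(lookup_coords(fake_geocode, "M5A"))
print(lookup_coords(fake_geocode, "M0Z"))
```

With a real geocoder the call would look like `lookup_coords(geolocator.geocode, "M5A")`. Nominatim's public service is rate limited, so the requests may need to be spaced out.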



In [44]:
data_frame_with_coords

Unnamed: 0,Borough,Latitude,Longitude,Neighborhood,Postcode
0,North York,43.770817,-79.413300,Parkwoods,M3A
1,North York,43.770817,-79.413300,Victoria Village,M4A
2,Downtown Toronto,43.655115,-79.380219,"Harbourfront, Regent Park",M5A
3,North York,43.770817,-79.413300,"Lawrence Heights, Lawrence Manor",M6A
4,Queen's Park,43.659980,-79.390369,Queen's Park,M7A
5,Etobicoke,43.671459,-79.552492,Islington Avenue,M9A
6,Scarborough,43.773077,-79.257774,"Rouge, Malvern",M1B
7,North York,43.770817,-79.413300,Don Mills North,M3B
8,East York,43.691339,-79.327821,"Woodbine Gardens, Parkview Hill",M4B
9,Downtown Toronto,43.655115,-79.380219,"Ryerson, Garden District",M5B
