<h1> Scraping Postal Codes Of Torronto City </h1>

<h4>Importing dependencies</h4>

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import geocoder

<h4>Requesting source page</h4>

In [2]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

<h4>Using BeautifulSoup for scraping the data and geocoder for obtaining coordinates</h4>

In [8]:
soup = BeautifulSoup(source, 'lxml')
table = soup.find('tbody')
postcode = []
borough = []
neighborhood = []
latitudes = []
longitudes = []
for row in table.findAll('tr'):
    temp = []
    for each_data in row.findAll('td'):
        temp.append(each_data.text)
    if len(temp) and temp[1] != 'Not assigned': #Ignoring cells with a borough that is Not assigned
        postcode.append(temp[0])
        borough.append(temp[1])
        temp_2 = temp[2].rstrip('\n')
        if temp_2 == 'Not assigned': #If a cell has a Not assigned neighborhood
            temp2 = temp[1]          #then the neighborhood will be the same as the borough
        neighborhood.append(temp_2)
        
        #lat_lng_coords = None
        #while(lat_lng_coords is None):
            #g = geocoder.google('{}, Toronto, Ontario'.format(temp[0]))
            #lat_lng_coords = g.latlng

        #latitudes.append(lat_lng_coords[0])
        #longitudes.append(lat_lng_coords[1])
data = {'Postal Code': postcode, 'Borough': borough, 'Neighborhood': neighborhood} #The dataframe will consist of three columns:
                                                                                  #PostalCode, Borough, and Neighborhood

<h4>Storing the scraped data in pandas dataframe</h4>

In [10]:
df = pd.DataFrame(data)                                                           
df = df.groupby("Postal Code").agg(lambda x:','.join(set(x)))

<h4>The number of rows of the dataframe</h4>

In [11]:
df.shape

(103, 2)

<h4>Displaying any 10 data from the dataframe</h4>

In [12]:
df.sample(10)

Unnamed: 0_level_0,Borough,Neighborhood
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M5G,Downtown Toronto,Central Bay Street
M9R,Etobicoke,"St. Phillips,Kingsview Village,Martin Grove Ga..."
M3M,North York,Downsview Central
M6K,West Toronto,"Parkdale Village,Exhibition Place,Brockton"
M5P,Central Toronto,"Forest Hill North,Forest Hill West"
M7A,Queen's Park,Not assigned
M4M,East Toronto,Studio District
M4J,East York,East Toronto
M5K,Downtown Toronto,"Toronto Dominion Centre,Design Exchange"
M9P,Etobicoke,Westmount


<h4>Get the latitude and the longitude coordinates of each neighborhood using geospatial data</h4>   
The http://cocl.us/Geospatial_data (csv file) contains geographical coordinates of each postal code of Torronto

In [14]:
df2 = pd.read_csv('Geospatial_Coordinates.csv')
df = pd.merge(df, df2, on = 'Postal Code')

<h4>Examing the resulting dataframe</h4> 

In [18]:
df.sample(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
86,M7R,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819
71,M6A,North York,"Lawrence Manor,Lawrence Heights",43.718518,-79.464763
98,M9N,York,Weston,43.706876,-79.518188
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",43.784535,-79.160497
73,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
42,M4L,East Toronto,"India Bazaar,The Beaches West",43.668999,-79.315572
14,M1V,Scarborough,"Agincourt North,Milliken,L'Amoreaux East,Steel...",43.815252,-79.284577
79,M6L,North York,"Upwood Park,North Park,Downsview",43.713756,-79.490074
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
35,M4B,East York,"Parkview Hill,Woodbine Gardens",43.706397,-79.309937


In [24]:
df.groupby('Borough').count()

Unnamed: 0_level_0,Postal Code,Neighborhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central Toronto,9,9,9,9
Downtown Toronto,18,18,18,18
East Toronto,5,5,5,5
East York,5,5,5,5
Etobicoke,12,12,12,12
Mississauga,1,1,1,1
North York,24,24,24,24
Queen's Park,1,1,1,1
Scarborough,17,17,17,17
West Toronto,6,6,6,6


In [25]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.
