# Scraping Wikipedia to get Toronto neighbourhoods

## Getting the data from the website

Importing needed packages

In [37]:
# import needed packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

Getting the page source code and initializing it into BeautifulSoup

In [5]:
# setting variables
page_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# getting page and using BeautifulSoup to read it
page_html = requests.get(page_url).text
# name the parser explicitly so the result does not depend on which parsers are installed
page_resource = BeautifulSoup(page_html, 'lxml')

Now, let's get the first table on the page (the one that holds the postal codes), loop through every row and collect the data into lists.

<div style="background-color:#ccc;padding:10px">
    <strong>Note:</strong> The first row is the header, so for index == 0 the cell values are stored in a separate list.
</div>

In [220]:
# getting the first table of the page (i.e., the table that contains the desired information)
table_resource = page_resource.table

# loop through all rows, collecting data rows and header cells into separate lists
dict_toLocationInfo = []
dict_toLocationInfo_columns = []
for index, row in enumerate(table_resource.find_all('tr')):
    # data
    if (index > 0):
        row_info = []
        for cell in row.find_all('td'):
            row_info.append(cell.text.replace('\n', ''))
        dict_toLocationInfo.append(row_info)
        
    # headers/columns
    else:
        for cell in row.find_all('th'):
            dict_toLocationInfo_columns.append(cell.text.replace('\n', ''))
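
The same row-by-row idea can be tried offline on a small HTML string. The sketch below is only an illustration: `SAMPLE_HTML` and `parse_table` are hypothetical names, the sample rows are copied from the data shown in this notebook, and it uses the built-in `html.parser` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: one header row, two data rows.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""


def parse_table(html):
    """Return (columns, rows) for the first <table> found in html."""
    table = BeautifulSoup(html, "html.parser").table
    columns, rows = [], []
    for tr in table.find_all("tr"):
        # get_text(strip=True) removes the trailing newline of each cell
        headers = [th.get_text(strip=True) for th in tr.find_all("th")]
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if headers:
            columns = headers      # header row
        elif cells:
            rows.append(cells)     # data row
    return columns, rows


columns, rows = parse_table(SAMPLE_HTML)
print(columns)
print(rows)
```

Detecting the header by its `th` cells, rather than by row position, is a design choice of this sketch and not something the notebook itself does.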

Show a sample of data

In [224]:
# showing samples
print('Data sample: \n',dict_toLocationInfo[0:9])
print('\n\nColumns: \n', dict_toLocationInfo_columns)

Data sample: 
 [['M1A', 'Not assigned', 'Not assigned'], ['M2A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Harbourfront'], ['M5A', 'Downtown Toronto', 'Regent Park'], ['M6A', 'North York', 'Lawrence Heights'], ['M6A', 'North York', 'Lawrence Manor'], ['M7A', "Queen's Park", 'Not assigned']]


Columns: 
 ['Postcode', 'Borough', 'Neighbourhood']


In [228]:
# merging data to Pandas DataFrame
toPostalDf = pd.DataFrame(dict_toLocationInfo)
toPostalDf.columns = dict_toLocationInfo_columns
print(toPostalDf.head(10))

# doing a backup
toPostalDf.to_csv('toPostalDf.csv')

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights
7      M6A        North York    Lawrence Manor
8      M7A      Queen's Park      Not assigned
9      M8A      Not assigned      Not assigned


## Cleaning the dataset  

First, drop all rows where **Borough** is *Not assigned*

In [261]:
# cleaning the "Not assigned" entries
toPostalDf = toPostalDf[toPostalDf['Borough'] != 'Not assigned']
toPostalDf = toPostalDf.reset_index(drop=True)
toPostalDf['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)
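
As a small illustration of what this filter does and does not remove, here is a sketch on three rows taken from the data sample above. The name `sample` is hypothetical. Note that the row for Queen's Park survives, although its Neighbourhood is still *Not assigned*; that case is handled in a later step.

```python
import pandas as pd

# Three rows copied from the scraped sample shown earlier.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows with an assigned Borough, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```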

Create a new dataframe grouped by postcode, so that each postcode (including the duplicated ones) appears exactly once

In [102]:
toGroupedDf = toPostalDf.groupby(['Postcode']).count().reset_index()

Turn the *Postcode* column of the original dataframe into an index

In [137]:
toPostalDf.set_index('Postcode', inplace=True)
toPostalDf.head()

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights


Now, for every postcode in the grouped dataframe: join its neighbourhoods, drop the rows with that postcode from the copy of the original dataframe, and then insert a single new row holding the merged neighbourhoods

In [203]:
# merge duplicated postcodes by joining their neighbourhoods
# work on a copy so the original dataframe stays untouched
toPostalDf2 = toPostalDf.copy()
for postalcode in toGroupedDf['Postcode']:
    # get neighbourhoods: a Series if the postcode is duplicated, a plain string otherwise
    neighbourhoods = toPostalDf2.loc[postalcode, 'Neighbourhood']
    if (not isinstance(neighbourhoods, str)):
        # join the names directly so that spaces inside a name are preserved
        neighbourhoods = ', '.join(neighbourhoods)
    
    # get borough
    borough = toPostalDf2.loc[postalcode, 'Borough']
    if (not isinstance(borough, str)):
        borough = borough.unique()[0]
    
    # drop rows that have the duplicated entry
    toPostalDf2.drop(index=postalcode, inplace=True)
    
    # add new row with joined values (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    toPostalDf2 = pd.concat([toPostalDf2, pd.DataFrame({'Borough': borough, 'Neighbourhood': neighbourhoods}, index=[postalcode])])
toPostalDf2.reset_index(inplace=True)
toPostalDf2.head()

,index,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
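
For comparison, the same merge can probably be expressed without an explicit loop by using `groupby` with `agg`. This is an alternative sketch, not the approach used in this notebook; `sample` and `merged` are hypothetical names and the rows are copied from the data sample above. It assumes each postcode belongs to a single borough, so taking the first borough of each group is enough.

```python
import pandas as pd

# Rows copied from the scraped sample: two duplicated postcodes and one unique.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One row per postcode: keep the first borough, join the neighbourhood names.
merged = (
    sample.groupby("Postcode", as_index=False)
          .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(merged)
```

Because the names are joined with `", ".join`, spaces inside a neighbourhood name are kept as they are.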


Replace *Not assigned* in Neighbourhood with the name of the Borough

In [249]:
# assign the Borough label to every neighbourhood that is 'Not assigned'
# (only one row, Queen's Park, needs it)
not_assigned = toPostalDf2['Neighbourhood'] == 'Not assigned'
toPostalDf2.loc[not_assigned, 'Neighbourhood'] = toPostalDf2.loc[not_assigned, 'Borough']

Check that the original and the copied dataframe have the same number of unique postcodes (integrity check)

In [262]:
# check integrity
# Postcode is the index of the original dataframe after set_index above
print('Original dataframe postcode numbers: ',len(toPostalDf.index.unique()))
print('Preprocessed dataframe postcode Numbers: ', len(toPostalDf2['index'].unique()))

Original dataframe postcode numbers:  103
Preprocessed dataframe postcode Numbers:  103
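
Comparing counts alone would not notice if one postcode were swapped for another. A slightly stricter check, sketched below under the assumption that both frames expose the postcode as a column, compares the sets of postcodes and also verifies that the processed frame has no duplicates left. The helper `same_postcodes` and the frames `before` and `after` are hypothetical.

```python
import pandas as pd


def same_postcodes(before, after, key="Postcode"):
    """True if both frames hold the same postcodes and `after` has no duplicates."""
    return set(before[key]) == set(after[key]) and bool(after[key].is_unique)


# Toy frames: `before` has a duplicated postcode, `after` has it merged.
before = pd.DataFrame({"Postcode": ["M5A", "M5A", "M3A"]})
after = pd.DataFrame({"Postcode": ["M3A", "M5A"]})

print(same_postcodes(before, after))
```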


In [265]:
# restore the original column names on df2 (the 'index' column becomes 'Postcode')
toPostalDf2.columns = dict_toLocationInfo_columns
toPostalDf2.to_csv('toPostalDf2.csv')

Finally, print the shape of the final dataframe

In [266]:
# printing the shape
print('The shape of DataFrame is: ', toPostalDf2.shape)

The shape of DataFrame is:  (103, 3)
