The purpose of this notebook is to scrape a Wikipedia page, then select and save the table that lists all the neighborhoods of the city of Toronto.

In [104]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Get the content of the page.

In [105]:
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
centent_wiki = requests.get(url_wiki).text
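
If the request fails, `requests.get(...).text` still returns whatever body came back, for example an error page. A minimal sketch of a stricter download step is shown here; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors.

    `fetch_html` and the 10-second default are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text

# Usage (performs a network request):
# centent_wiki = fetch_html(url_wiki)
```

With this variant, a timeout or an HTTP error stops the notebook at the download cell instead of surfacing later as a missing table.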

Initialize the soup object from the HTML content of the wiki page, using the lxml parser.

In [106]:
soup = BeautifulSoup(centent_wiki,'lxml')

Select the table from the parsed HTML.

In [107]:
table = soup.find('table',{'class':'wikitable sortable'})

Get the table header. 

In [108]:
columns_names = table.find_all('th')
columns_names = [col.text.strip() for col in  columns_names]
columns_names

['Postcode', 'Borough', 'Neighbourhood']

Initialize our final dataframe.

In [109]:
df_toronto = pd.DataFrame(columns=columns_names)

Get all cells of the table and save them in df_toronto.

In [110]:
table_rows = table.find_all('tr')  # select all rows
rows = []
for row in table_rows:
    cells = [td.text.strip() for td in row.find_all('td')]
    if len(cells) == 3:  # skip the header row, which has no <td> cells
        rows.append(dict(zip(columns_names, cells)))
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df_toronto = pd.DataFrame(rows, columns=columns_names)
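
The table selection, header extraction and row parsing above can be tried without a network connection. The sketch below runs the same steps on a small inline snippet; `sample_html` is made-up markup that only mimics the layout of the Wikipedia table, and it uses the built-in `html.parser` rather than `lxml` so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# Column names come from the <th> cells.
header = [th.text.strip() for th in sample_table.find_all('th')]

# Each data row becomes a dict keyed by column name.
records = []
for tr in sample_table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == len(header):  # the header row has no <td> cells
        records.append(dict(zip(header, cells)))

print(header)
print(records)
```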

In [111]:
df_toronto.shape

(289, 3)

In [112]:
df_toronto.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop the rows whose borough is 'Not assigned'.


In [113]:
# .copy() makes the filtered frame independent, so later in-place edits do not warn
df_toronto = df_toronto[df_toronto['Borough']!='Not assigned'].copy()

In [114]:
df_toronto.shape

(212, 3)

In [115]:
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Combine all neighborhoods that have the same postal code.

In [116]:
# Combine all neighbourhoods that share the same postal code
def combine_neighborhoods(pcode):
    same_code = df_toronto['Postcode']==pcode
    x = df_toronto[same_code]
    if x.shape[0] != 1:
        # join the neighbourhood names and write the result to every matching row
        new_nei = ', '.join(x['Neighbourhood'])
        df_toronto.loc[same_code, 'Neighbourhood'] = new_nei
    return df_toronto

In [117]:
set_pcodes = set(df_toronto.Postcode.tolist()) # get the set of postal codes
for pcode in set_pcodes:
    df_toronto = combine_neighborhoods(pcode)

In [118]:
df_toronto.shape

(212, 3)

Now we can remove the duplicate rows.

In [119]:
df_toronto.drop_duplicates(inplace=True)

In [120]:
df_toronto.shape

(103, 3)

In [121]:
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront, Regent Park"
6,M6A,North York,"Lawrence Heights, Lawrence Manor"
8,M7A,Queen's Park,Not assigned
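
The combine-then-deduplicate steps above can also be expressed as a single `groupby`. This is an alternative sketch, not the method used in the notebook; the `sample` frame is made-up data in the same shape as `df_toronto`, and the approach assumes each postcode belongs to exactly one borough.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as df_toronto.
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# Group by postcode and borough, then join the neighbourhood names of each group.
# sort=False keeps the groups in order of first appearance.
combined = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index()
)
print(combined)
```

This yields one row per postcode directly, so no separate `drop_duplicates` call is needed.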


Check that the number of distinct postcodes equals the number of rows.

In [122]:
df_toronto.Postcode.nunique()==df_toronto.shape[0]

True

Replace 'Not assigned' neighbourhood values with the corresponding borough (e.g. row 8).

In [123]:
def correct_neighborhood( row):
    if row['Neighbourhood'] == 'Not assigned' :
        return row['Borough']
    return row['Neighbourhood']

In [124]:
df_toronto.Neighbourhood = df_toronto.apply(correct_neighborhood,axis=1)
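
The same replacement can be written without a row-wise `apply`. A minimal sketch using `Series.where` follows; `sample` is made-up data, and this is an alternative to the notebook's approach rather than a correction of it.

```python
import pandas as pd

# Hypothetical sample with one unassigned neighbourhood.
sample = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Keep the neighbourhood where it is assigned; otherwise take the borough.
is_assigned = sample['Neighbourhood'] != 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].where(is_assigned, sample['Borough'])
print(sample)
```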

We can see that the 'Neighbourhood' cell of the former row 8 (postcode M7A) has been replaced by its borough.

In [128]:
df_toronto.reset_index(drop=True,inplace=True)
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [126]:
print('The number of rows is',df_toronto.shape[0])

The number of rows is 103


In [129]:
df_toronto.to_csv('List_neighbourhood_toronto_city.csv',index=False)
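
A quick way to check that the saved file reloads cleanly, including neighbourhood values that contain commas, is a write-and-read round trip. The sketch below uses an in-memory buffer and made-up sample rows instead of the real file, so it leaves nothing on disk.

```python
import io
import pandas as pd

# Hypothetical sample rows; the second neighbourhood value contains a comma.
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})

# Write to an in-memory buffer with the same options as the notebook, then read back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Because `to_csv` quotes fields that contain the delimiter, the combined neighbourhood names come back as single values.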