## Data collection 

In this notebook I prepare a data set of Toronto neighborhoods from Wikipedia's list of Canadian postal codes beginning with M.

In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [2]:
#!conda install -c conda-forge beautifulsoup4 --yes

First step: download the Wikipedia page that contains our data set.

In [3]:
wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

With the help of BeautifulSoup, I find the first 'table' tag in the HTML page. I then collect every row of that table by finding all of its 'tr' tags.

In [5]:
soup = BeautifulSoup(wiki_page,'html.parser')
main_table = soup.find('table')
# Skip the first row because it is the table header.
rows = main_table.find_all('tr')[1:]
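
To make the row extraction concrete, here is a minimal sketch on a tiny inline HTML string instead of the live page. The `sample_html` snippet and the `sample_*` names are made up for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one table with a header row and two data rows.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')        # first <table> in the document
sample_rows = sample_table.find_all('tr')[1:]   # drop the header row

# Each remaining row holds three <td> cells; strip surrounding whitespace.
parsed = [[cell.text.strip() for cell in row.find_all('td')] for row in sample_rows]
print(parsed)
```

Slicing with `[1:]` only works as intended when the header is the first 'tr' of the table, which is the assumption made here.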

I loop over the table rows and collect the three cells of each row into a dataframe.

In [6]:
columns_name = ['PostalCode', 'Borough', 'Neighborhood']
# DataFrame.append was removed in pandas 2.0, so collect the rows first
# and build the dataframe once at the end.
records = []
for row in rows:
    raw_element = row.find_all('td')
    records.append({'PostalCode': raw_element[0].text.strip(),
                    'Borough': raw_element[1].text.strip(),
                    'Neighborhood': raw_element[2].text.strip()})
df = pd.DataFrame(records, columns=columns_name)

In [8]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Omit the rows whose Borough value is 'Not assigned'.

In [None]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df.head(10)

For rows whose Neighborhood is 'Not assigned', replace the Neighborhood value with the Borough value.

In [16]:
df['Neighborhood'] = df.apply((lambda row:row['Borough']  if row['Neighborhood'] == 'Not assigned' else row['Neighborhood']), axis=1)
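
The same replacement can also be written without a row-wise `apply`. The sketch below uses `Series.where` on a small made-up dataframe; the second row (`X0X`, `Some Borough`) is a placeholder and not real data.

```python
import pandas as pd

# Toy data: the first row mirrors the table above, the second row is hypothetical.
toy_na = pd.DataFrame({
    'PostalCode': ['M5A', 'X0X'],
    'Borough': ['Downtown Toronto', 'Some Borough'],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# Keep Neighborhood where it is assigned; otherwise fall back to the Borough value.
assigned = toy_na['Neighborhood'] != 'Not assigned'
toy_na['Neighborhood'] = toy_na['Neighborhood'].where(assigned, toy_na['Borough'])
print(toy_na)
```

`where` keeps the values for which the condition is true and takes the replacement elsewhere, so it should give the same result as the lambda above while operating on whole columns.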

Group the dataframe by postal code and join all neighborhoods that share a postal code into a single row of a new dataframe.

In [11]:
# Collect one record per postal code, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0).
records = []
for postalcode, group in df.groupby('PostalCode'):
    # Join every neighborhood that shares this postal code.
    neighborhood = ', '.join(group['Neighborhood'].values)
    borough = group['Borough'].values[0]
    records.append({'PostalCode': postalcode, 'Borough': borough, 'Neighborhood': neighborhood})
new_df = pd.DataFrame(records, columns=columns_name)
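
As an alternative to the explicit loop, the grouping step can probably be expressed as a single `groupby(...).agg(...)` call. This is a sketch on a three-row sample taken from the values shown in this notebook; `toy_groups` and `merged` are illustrative names, and it assumes the borough is the same for every row of a postal code.

```python
import pandas as pd

toy_groups = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per postal code: keep the first borough, join the neighborhoods.
merged = (
    toy_groups.groupby('PostalCode', as_index=False)
              .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)
print(merged)
```

Passing `as_index=False` keeps `PostalCode` as a regular column, so the result has the same three columns as the loop-based version.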

In [12]:
new_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [17]:
print('Final dataframe has {} rows.'.format(new_df.shape[0]))

Final dataframe has 103 rows.


In [18]:
new_df.shape

(103, 3)