# <font color=blue>Segmenting and Clustering Neighborhoods in Toronto</font>
# <font color=blue>Step A: Web page scraping</font>

### Using the Beautiful Soup package to scrape the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

#### **Getting the html source code:**

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#print(source[:500])  # preview the raw HTML (soup is only created in the next cell)

#### **Parsing the table and storing our table headers (Postcode, Borough, Neighbourhood)**

In [3]:
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table', class_="wikitable sortable")
#print(table)

table_columns=[]

for header in table.find_all('th'):
    table_columns.append(header.text)
print("Here are our table columns:", table_columns)

Here are our table columns: ['Postcode', 'Borough', 'Neighbourhood\n']


#### **Remove the trailing '\n' from Neighbourhood**

In [4]:
# strip() removes surrounding whitespace, including the trailing newline
table_columns[2] = table_columns[2].strip()
table_columns

['Postcode', 'Borough', 'Neighbourhood']

#### **Getting all table rows in a list of lists, named "data"**

In [5]:
data = []
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
data[0:5]

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]

In [6]:
#check number of rows
len(data)

290

In [7]:
#deleting empty first row corresponding to the headers
del data[0]  
data[0:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
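
The header and row extraction above can be tried offline on a small inline table. The following is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` is a hypothetical snippet that mimics the layout of the Wikipedia table, `parse_table` is an illustrative helper name, and Python's built-in `html.parser` is assumed in place of `lxml`. It skips the header row while parsing instead of deleting it afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first 'wikitable' table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    # get_text(strip=True) drops surrounding whitespace such as '\n'
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Filtering out empty rows during parsing makes the later `del data[0]` step unnecessary, at the cost of silently dropping any other row that has no `<td>` cells.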

#### **Create our dataframe**

In [8]:
df = pd.DataFrame(data, columns = table_columns)
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [9]:
df.shape

(289, 3)

#### **Ignore rows where Borough is not assigned**

In [10]:
#get the indices of the rows where Borough is Not assigned
todropindex = df[ df['Borough'] == 'Not assigned'].index
 
# Delete these rows from dataFrame
df.drop(todropindex , inplace=True)

#reset dataframe index
df = df.reset_index(drop=True)
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


In [11]:
df.shape

(212, 3)
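
The same filtering can also be written as a single boolean-mask selection instead of collecting indices and calling `drop`. This is a minimal sketch on a hypothetical toy dataframe (`toy` and its three rows are made up for illustration); it is an alternative formulation, not the code the notebook ran.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as the scraped table.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

The mask version returns a new dataframe and leaves `toy` untouched, which can be easier to reason about than an in-place `drop`.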

In [17]:
#df.to_csv('mydata.csv', index=False)
#dftest=pd.read_csv('mydata.csv')
#dftest.head()

#### **Combine rows with the same Postcode by comma joining the corresponding neighbourhoods**

In [13]:
df = df.groupby('Postcode').agg({'Borough': 'first', 'Neighbourhood' : ', '.join }).reset_index()
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [15]:
df.shape

(103, 3)
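
To see what the `groupby(...).agg(...)` call does, it can help to run it on a few rows. The sketch below uses a hypothetical three-row dataframe (`toy`) in which one postcode appears twice; the aggregation itself mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data: postcode M5A appears twice.
toy = pd.DataFrame(
    {
        "Postcode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# One row per Postcode: keep the first Borough, comma-join the neighbourhoods.
combined = (
    toy.groupby("Postcode")
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
print(combined)
```

Because `groupby` sorts by the group key by default, the result is ordered by `Postcode`, which is why the notebook's output starts at `M1B` rather than `M3A`.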

In [16]:
df.to_csv('mydata_modified.csv', index=False)

#### **For cases where Neighbourhood is not assigned, use the Borough name as the Neighbourhood name**

In [18]:
#find the indices of such cases
check = df.Neighbourhood[df.Neighbourhood == 'Not assigned'].index.tolist()
check

[85]

#### **So there's only one such case. Let's inspect:** 

In [20]:
df.loc[check]

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Not assigned |


#### **Use the Borough name as the Neighbourhood name**

In [22]:
df.loc[check,'Neighbourhood'] = df.loc[check,'Borough']

In [23]:
df.loc[check]

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
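
The replacement can also be driven by a boolean mask rather than a list of index labels. The following is a minimal sketch on a hypothetical two-row dataframe (`toy`); the variable name `unassigned` is an illustration only.

```python
import pandas as pd

# Hypothetical toy data with one unassigned neighbourhood.
toy = pd.DataFrame(
    {
        "Postcode": ["M7A", "M3A"],
        "Borough": ["Queen's Park", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }
)

# Where Neighbourhood is 'Not assigned', copy the value from Borough.
unassigned = toy["Neighbourhood"] == "Not assigned"
toy.loc[unassigned, "Neighbourhood"] = toy.loc[unassigned, "Borough"]
print(toy)
```

This handles any number of unassigned rows in one step, so it would still work if the source table later contained more than one such case.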


#### **Save our final dataframe into a csv file**

In [24]:
df.to_csv('mydata_final.csv', index=False)
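
A quick way to check that a saved file can be read back unchanged is a round trip through `to_csv` and `read_csv`. The sketch below is an assumption-laden illustration: it uses a hypothetical two-row dataframe (`toy`) and an in-memory `io.StringIO` buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy data; note the commas inside the Neighbourhood values.
toy = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.values.tolist())
```

Values that contain commas are quoted on write, so they survive the round trip. Passing `index=False` matters here: without it, the index is written as an extra unnamed column and shows up as `Unnamed: 0` when the file is read back.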

#### **Print the number of rows of our dataframe**

In [26]:
print("The number of rows of our dataframe is:", df.shape[0])

The number of rows of our dataframe is: 103
