# Segmenting and Clustering Neighborhoods in Toronto: Part 1

In this notebook, we're going to scrape a Wikipedia [page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) for the table that maps postal codes to boroughs and neighborhoods in Toronto.

Let's begin by importing the libraries that will be used.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

Next, we need to download the web page and put it into a variable. We will use the **Requests** module.

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
page.status_code

200

Now we have a **Response** object. Its '_status_code_' attribute tells us whether the request was successful. Here it is 200, so it was.
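A status code is easy to overlook when it is only displayed as cell output. As a minimal sketch, the check can be made explicit with `raise_for_status()`, a method available on every Requests response. The helper name `ensure_ok` is hypothetical, and the two responses below are built by hand, an assumption made only so that the example runs without a network connection.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged, or raise if the status is 4xx/5xx."""
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response


# Hand-built responses stand in for real downloads so the sketch runs offline.
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 404

ensure_ok(good)  # passes silently

try:
    ensure_ok(bad)
    failed = False
except requests.HTTPError:
    failed = True
```

In this notebook, calling `page.raise_for_status()` right after the download would have the same effect.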

Using the **Beautiful Soup** library, we'll parse the HTML data.

In [3]:
# First, let's create a variable that holds the page parsed as a BeautifulSoup object.
soup = bs(page.content, 'html.parser')

In [4]:
#print(soup.prettify()) # uncomment this line to get the HTML structure of the page

In [5]:
# Here, we get the title of the page to confirm we downloaded the right one
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

Moving on, we'll find the table we're interested in.

In [6]:
postcode_table = soup.find('table', class_='wikitable sortable')
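One detail worth knowing: when `class_` is given a string with a space in it, Beautiful Soup compares it against the exact value of the `class` attribute, so the lookup depends on the order of the classes. The sketch below uses invented markup (not the Wikipedia page) to show the difference from a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Invented markup: the same two classes as above, but in the opposite order.
html = '<table class="sortable wikitable"><tr><td>M3A</td></tr></table>'
demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because the order differs.
exact = demo_soup.find("table", class_="wikitable sortable")

# CSS selector: requires both classes, in any order.
by_selector = demo_soup.select_one("table.wikitable.sortable")

found_exact = exact is not None
found_by_selector = by_selector is not None
```

Both calls return `None` when nothing matches, so checking the result before using it avoids a confusing `AttributeError` later on.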

And let's transform it into a dataframe.

In [7]:
# We create three lists that will be converted into our columns later
A = []
B = []
C = []

# Now, we loop over the rows of the original table and fill our lists;
# header rows contain no <td> cells, so the length check skips them
for r in postcode_table.find_all('tr'):
    cells = r.find_all('td')
    if len(cells) == 3:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
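To see how the row filter and the text extraction behave, here is a small sketch on an invented two-row table (the markup is an assumption, not a copy of the Wikipedia page). It also shows that taking only the first text node of a cell would lose text if the cell contained links, whereas `get_text()` returns the whole cell.

```python
from bs4 import BeautifulSoup

# Invented table: a header row of <th> cells and one data row whose cells contain links.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr>
    <td>M5A</td>
    <td><a href="#">Downtown Toronto</a></td>
    <td><a href="#">Regent Park</a>, <a href="#">Harbourfront</a>
</td>
  </tr>
</table>
"""
table = BeautifulSoup(html, "html.parser")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has zero <td> cells and is skipped
        # get_text() joins every text node in the cell; strip() drops the newline
        rows.append(tuple(cell.get_text().strip() for cell in cells))

# For comparison: the first text node of the last cell only.
first_node = table.find_all("td")[2].find(string=True)
```

Here `rows` holds the full neighborhood text, while `first_node` stops at the first link.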

In [8]:
# Finally, we transform those lists into a dataframe
df = pd.DataFrame(A, columns=['PostalCode'])
df['Borough'] = B
df['Neighborhood'] = C
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
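As a side note, the same kind of dataframe can be built in a single step from a dictionary of lists. This is only a sketch with invented sample values. The names `postal_codes`, `boroughs` and `neighborhoods` are hypothetical stand-ins for the three lists created earlier.

```python
import pandas as pd

# Invented sample values in the same layout as the scraped lists.
postal_codes = ["M3A", "M4A"]
boroughs = ["North York", "North York"]
neighborhoods = ["Parkwoods", "Victoria Village"]

# Each dictionary key becomes a column; the column order follows the key order.
sample = pd.DataFrame({
    "PostalCode": postal_codes,
    "Borough": boroughs,
    "Neighborhood": neighborhoods,
})
```

All lists must have the same length, otherwise pandas raises a `ValueError`.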


Let's remove the trailing newline character (`\n`) that came along with our string values.

In [9]:
for col in df.columns:
    df[col] = df[col].apply(lambda x: x.rstrip('\n'))

Now, we need to remove the rows that have _"Not assigned"_ as the value of the **'Borough'** column.

In [10]:
# Here, we check whether the rows with "Not assigned" in "Borough" are the same rows that have "Not assigned" in "Neighborhood"
df[df['Borough'].eq('Not assigned')].index == df[df['Neighborhood'].eq('Not assigned')].index

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

Since the "Not assigned" rows are the same for "Borough" and "Neighborhood", no row has an assigned borough with an unassigned neighborhood, so it won't be necessary to fill in any neighborhood.
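For illustration, here is how a mismatch could be detected and repaired if it ever showed up. This is a sketch on invented rows, and reusing the borough name as the neighborhood is an assumed repair rule, not something the data above requires. Comparing the two boolean columns with `equals` also avoids the error that an element-wise `==` raises when the two indexes have different lengths.

```python
import pandas as pd

# Invented rows: the second one has a borough but no neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M7A", "M3A"],
    "Borough": ["Not assigned", "Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

borough_missing = sample["Borough"].eq("Not assigned")
neighborhood_missing = sample["Neighborhood"].eq("Not assigned")

# True only if exactly the same rows are unassigned in both columns.
same_rows = borough_missing.equals(neighborhood_missing)

# Rows with an assigned borough but an unassigned neighborhood.
mask = ~borough_missing & neighborhood_missing
n_to_fix = int(mask.sum())

# Assumed repair rule: reuse the borough name as the neighborhood for those rows only.
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
```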

In [11]:
# Remove the "Not assigned" rows
df = df[df['Borough'] != 'Not assigned']
df.reset_index(inplace=True, drop=True)
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Lastly, let's get the final **shape** of the created dataframe.

In [12]:
df.shape

(103, 3)

We'll save this dataframe into a CSV file so we don't need to repeat this whole process in the following notebooks. The line below is commented out; uncomment it to write the file.

In [13]:
#df.to_csv('post_code_toronto_1.csv', index=False)
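To check that a saved file can be read back as expected, a round trip like the following can help. It is a sketch on an invented two-row dataframe, written to a temporary directory (an assumption made so that the example leaves no file behind).

```python
import os
import tempfile

import pandas as pd

# Invented sample in the same layout as our dataframe.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "post_code_toronto_1.csv")
    sample.to_csv(path, index=False)  # index=False: do not write the row index
    restored = pd.read_csv(path)      # read while the temporary file still exists

restored_columns = list(restored.columns)
restored_values = restored.to_dict("list")
```

Without `index=False`, the row index is written as an unnamed first column, which `read_csv` then loads as a column called `Unnamed: 0`.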