# Segmenting and Clustering Neighborhoods in Toronto

### Libraries

Let's import all needed libraries.

In [0]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Step 1: Scraping

Let's get the raw HTML of the Wikipedia page:

In [0]:
wikipedia_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(wikipedia_page)
response.raise_for_status()  # stop early if the request failed
source = response.text

And feed it to BeautifulSoup:

In [0]:
soup = BeautifulSoup(source, 'lxml')

We find the table with the data we're looking for:

In [0]:
table = soup.find('table')

And we extract said data, according to the assignment's specifications:

* The dataframe needs to have three columns: PostalCode, Borough, and Neighborhood
* We only process the cells that have an assigned borough and ignore cells with a borough that is Not assigned.
* Neighborhoods with the same postal code will be combined into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
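The four rules above can be sketched independently of the HTML parsing. This is a minimal illustration, not the notebook's own implementation: `build_records` is a hypothetical helper, the input is assumed to be already-extracted `(postal code, borough, neighborhood)` tuples, and the sample rows are illustrative only.

```python
def build_records(rows):
    """Apply the assignment's rules to (postal code, borough, neighborhood) tuples."""
    records = {}  # postal code -> record; dicts keep insertion order (Python 3.7+)
    for postcode, borough, neighborhood in rows:
        postcode = postcode.strip()
        borough = borough.strip()
        neighborhood = neighborhood.strip()
        # Ignore cells whose borough is "Not assigned"
        if borough == 'Not assigned':
            continue
        # A borough with a "Not assigned" neighborhood reuses the borough name
        if neighborhood == 'Not assigned':
            neighborhood = borough
        if postcode in records:
            # Same postal code: combine neighborhoods, separated by a comma
            records[postcode]['Neighborhood'] += ', ' + neighborhood
        else:
            records[postcode] = {'PostalCode': postcode,
                                 'Borough': borough,
                                 'Neighborhood': neighborhood}
    return list(records.values())


# Illustrative sample rows (assumed input shape, not scraped data)
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]

records = build_records(sample_rows)
for record in records:
    print(record)
```

Keying the records by postal code makes the "already seen" check a dictionary lookup instead of a scan over the whole list.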

In [24]:
# initialize data list
data = []

# cycle through table rows
for row in table.find_all('tr'):
  # extract the current row's cells, stripping surrounding whitespace and newlines
  entry = [item.text.strip() for item in row.find_all('td')]
  # skip header rows (no <td> cells) and postal codes without an assigned borough
  if len(entry) < 3 or entry[1] == 'Not assigned':
    continue
  postcode, borough, neighbourhood = entry[:3]
  # a borough with an unassigned neighbourhood uses the borough name instead
  if neighbourhood == 'Not assigned':
    neighbourhood = borough
  # look for an existing entry with the same postal code
  existing = next((item for item in data if item['Postcode'] == postcode), None)
  if existing is None:
    # if there is none, add a new entry
    data.append({'Postcode': postcode, 'Borough': borough, 'Neighbourhood': neighbourhood})
  else:
    # otherwise, append the neighbourhood to the existing entry
    existing['Neighbourhood'] += ', ' + neighbourhood

# create dataframe from data list
df = pd.DataFrame(data=data, columns=['Postcode', 'Borough', 'Neighbourhood'])
# rename columns
df.columns = ['PostCode', 'Borough', 'Neighborhood']
df.head(10)

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
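
As a quick sanity check, the resulting rows can be tested against the assignment's rules. The following is a minimal sketch that is not part of the original notebook: `check_records` is a hypothetical helper, and it assumes each record is a dictionary with the `PostCode`, `Borough`, and `Neighborhood` keys used in the dataframe above.

```python
def check_records(records):
    """Return a list of rule violations; an empty list means all checks passed."""
    problems = []
    seen = set()
    for rec in records:
        code = rec['PostCode']
        # Each postal code should appear in exactly one row
        if code in seen:
            problems.append(f'duplicate postal code: {code}')
        seen.add(code)
        # Rows without an assigned borough should have been dropped
        if rec['Borough'] == 'Not assigned':
            problems.append(f'unassigned borough: {code}')
        # Unassigned neighborhoods should have been replaced by the borough
        if 'Not assigned' in rec['Neighborhood'].split(', '):
            problems.append(f'unassigned neighborhood: {code}')
    return problems


# A few rows copied from the output above
rows = [
    {'PostCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostCode': 'M5A', 'Borough': 'Downtown Toronto', 'Neighborhood': 'Harbourfront, Regent Park'},
    {'PostCode': 'M7A', 'Borough': "Queen's Park", 'Neighborhood': "Queen's Park"},
]

# A deliberately broken row, added only to show what a failure looks like
bad_rows = rows + [{'PostCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Not assigned'}]

print(check_records(rows))
print(check_records(bad_rows))
```

With the dataframe from this step, the records could be obtained with `df.to_dict('records')` and passed to the helper directly.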
