# Assignment of week 3 - Segmenting and Clustering Neighborhoods in Toronto

## Part 1

To create the dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have complete information and are neither greyed out nor marked as not assigned.
    - For each cell, the postal code will go under the PostalCode column, the first line under the postal code will go under Borough, and the remaining lines will go under the Neighborhood column formatted nicely and separated with commas as shown in the sample dataframe. 
    - For example, for cell (1, 3) on the Wikipedia page, M3A will go under PostalCode, North York will go under Borough, and Parkwoods will go under Neighborhood.
- If a cell has only one line under the postal code, like cell (1, 7), then that line will go under the Borough and the Neighborhood columns. So for cell (1, 7), the value of the Borough and the Neighborhood column will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
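
The row-building rules above can be sketched as a small helper. This is only an illustration of the rules, not the parser used later in this notebook; `build_row` and its `lines` argument (the text lines found under the postal code) are hypothetical names introduced here.

```python
def build_row(postal_code, lines):
    """Return (PostalCode, Borough, Neighborhood), or None for an unassigned cell.

    `lines` is assumed to hold the text lines under the postal code:
    the first one is the borough, the remaining ones are neighborhoods.
    """
    if not lines or lines[0] == "Not assigned":
        return None  # greyed-out / unassigned cells are skipped
    borough = lines[0]
    # With a single line, the borough doubles as the neighborhood.
    neighborhood = ", ".join(lines[1:]) if len(lines) > 1 else borough
    return postal_code, borough, neighborhood


print(build_row("M3A", ["North York", "Parkwoods"]))
print(build_row("M1A", ["Not assigned"]))
```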

## Import needed libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Get the Wikipedia page

In [2]:
wikipediaPage = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

## Extract the info from the page

In [3]:
# parse the page content into a BeautifulSoup object
soup = BeautifulSoup(wikipediaPage.content, 'html5lib')

# get the table with the needed info
table = soup.find_all('table')[0]

# extract all <td> from the table
tds = table.find_all('td')

### Content of a <td\>
##### Valid
    <td style="vertical-align:top;">
        <p>
            <b>M1B</b>
            <br/>
            <span style="font-size:80%;">
                <a href="/wiki/Scarborough,_Toronto" title="Scarborough, Toronto">Scarborough</a>
                <br/>(<a href="/wiki/Malvern,_Toronto" title="Malvern, Toronto">Malvern</a> / <a href="/wiki/Rouge,_Toronto" title="Rouge, Toronto">Rouge</a>)</span>
        </p>
    </td>

##### Invalid
    <td style="width:11%; vertical-align:top; color:#ccc;">
        <p>
            <b>M1A</b>
            <br/>
            <span style="font-size:80%;">
                <i>Not assigned</i>
            </span>
        </p>
    </td>
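
To see which text fragments a parser gets out of these two cells, the sketch below collects the non-empty, stripped strings of each snippet. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; for these snippets the result should match what `td.stripped_strings` yields. `TextCollector` and `cell_strings` are hypothetical helper names, and the markup is abbreviated (styles dropped, links replaced by `#`).

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty, stripped text fragments of an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


def cell_strings(html):
    """Return the list of stripped text fragments found in `html`."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.strings


# Abbreviated versions of the valid and invalid cells shown above.
valid = ('<td><p><b>M1B</b><br/><span><a href="#">Scarborough</a>'
         '<br/>(<a href="#">Malvern</a> / <a href="#">Rouge</a>)</span></p></td>')
invalid = '<td><p><b>M1A</b><br/><span><i>Not assigned</i></span></p></td>'

print(cell_strings(valid))    # ['M1B', 'Scarborough', '(', 'Malvern', '/', 'Rouge', ')']
print(cell_strings(invalid))  # ['M1A', 'Not assigned']
```

The first fragment is the postal code, the second is either the borough or the text "Not assigned", and everything from the opening parenthesis onwards describes the neighborhoods.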

## Define functions for parsing the html

In [8]:
# split each string of the input list on the given separator and output a flat list of the resulting pieces
def extractString(strInputList, separator):
    strOutputList = []
    for si in strInputList:
        splitted = si.split(separator)
        strOutputList.extend(splitted)
    return strOutputList
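
As an illustration of what this helper produces, here is an equivalent sketch written as a list comprehension, applied three times in a row to a sample string. `split_all` is a hypothetical name; the behaviour is meant to match `extractString`.

```python
def split_all(strings, separator):
    """Split every string on `separator` and flatten the pieces into one list."""
    return [piece for s in strings for piece in s.split(separator)]


pieces = split_all(["(Malvern/Rouge)"], "(")
pieces = split_all(pieces, ")")
pieces = split_all(pieces, "/")
print(pieces)  # ['', 'Malvern', 'Rouge', '']
```

Note the empty strings at both ends: they come from splitting right at the parentheses and have to be filtered out afterwards.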


In [9]:
# compute the neighborhood string based on a list of strings extracted from the td data
def computeNeighborhood(inputStringList):
    neighborhood = None
    
    processed = extractString(inputStringList, '(')
    processed = extractString(processed, ')')
    processed = extractString(processed, '/')
    
    for s in processed:
        sClean = s.strip()
        if sClean != '':
            if neighborhood is None:
                neighborhood = sClean
            else:
                neighborhood = neighborhood + ', ' + sClean
    
    return neighborhood
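
The same cleanup could also be done with a single regular-expression split. The sketch below is an alternative formulation, not the function used in this notebook; `neighborhood_from_text` is a hypothetical name.

```python
import re


def neighborhood_from_text(text):
    """Join the names found between '(', ')' and '/' with ', '; None if there are none."""
    names = [part.strip() for part in re.split(r"[()/]", text)]
    names = [name for name in names if name]
    return ", ".join(names) if names else None


print(neighborhood_from_text("(Malvern/Rouge)"))  # Malvern, Rouge
print(neighborhood_from_text(""))                 # None
```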


In [10]:
# function to parse the <td> and return an entry for the dataframe : return PostalCode, Borough, Neighborhood
def parseTableData(td):
    postalCode = None
    borough = None
    neighborhood = None
    otherString = ''
    foundFirst = False

    index = 0
    for string in td.stripped_strings:
        if index == 0:
            postalCode = str(string).strip()
        elif index == 1:
            if string == 'Not assigned':
                break
            else:
                borough = str(string).strip()
        else:
            if '(' in string:
                foundFirst = True
            if foundFirst:
                otherString = otherString + str(string).strip()
            
        index = index + 1

    if postalCode is not None and borough is not None:
        neighborhood = computeNeighborhood([otherString])
        if neighborhood is None:
            neighborhood = borough
    
    return postalCode, borough, neighborhood
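
To trace the control flow without a real `<td>`, the following sketch applies the same steps to a plain list of already-stripped strings. It is a simplified stand-in under that assumption, not a replacement for `parseTableData`; `parse_strings` is a hypothetical name.

```python
def parse_strings(strings):
    """Mirror parseTableData on a plain list of already-stripped strings."""
    if len(strings) < 2 or strings[1] == "Not assigned":
        # no borough: the caller treats a None field as an invalid entry
        return (strings[0] if strings else None), None, None
    postal_code, borough = strings[0], strings[1]
    rest = strings[2:]
    # keep only the fragments from the first one containing '(' onwards
    start = next((i for i, s in enumerate(rest) if "(" in s), len(rest))
    text = "".join(rest[start:])
    # treat '(', ')' and '/' all as separators between neighborhood names
    names = [n.strip() for n in text.replace("(", "/").replace(")", "/").split("/")]
    names = [n for n in names if n]
    # a cell without neighborhoods reuses the borough as the neighborhood
    neighborhood = ", ".join(names) if names else borough
    return postal_code, borough, neighborhood


print(parse_strings(["M1B", "Scarborough", "(", "Malvern", "/", "Rouge", ")"]))
print(parse_strings(["M1A", "Not assigned"]))
```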


## Create and fill the dataframe

In [11]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

#### Assumption: for an invalid <td\> (greyed out or not assigned), parseTableData returns a tuple containing at least one None, so such entries can be skipped

In [12]:
# fill the dataframe
rows = []
counter = 0
for td in tds:
    data = parseTableData(td)
    # skip invalid data
    if None in data:
        counter = counter + 1
        continue
    rows.append(dict(zip(column_names, data)))
# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)
print('Skipped {} invalid entries and added {} valid ones.'.format(counter, len(rows)))

Skipped 77 invalid entries and added 103 valid ones.


In [13]:
neighborhoods.shape

(103, 3)
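
`shape` is a `(rows, columns)` tuple, so the row count alone is its first element. A minimal sketch on a hypothetical two-row frame (`sample` stands in for the real `neighborhoods` dataframe):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real `neighborhoods` dataframe.
sample = pd.DataFrame(
    [("M3A", "North York", "Parkwoods"), ("M1B", "Scarborough", "Malvern, Rouge")],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

n_rows, n_cols = sample.shape
print(n_rows)  # 2
```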