In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

## Scraping an HTML page
- Use the **requests** package to send a **GET** request to the *Wikipedia list of postal codes of Canada* page
- Store the text of the response
- Parse the HTML with a **BeautifulSoup** object
- Use the object's **find** method to locate the first **"table"** tag

In [2]:
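# Download the page and keep the response body (the HTML text)
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
# find() returns the first <table> element in the document
tab = soup.find('table')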
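
The parsing step can also be tried offline on a small hand-written HTML string. The sketch below is illustrative only: the markup is an assumption about how a cell might look, not a copy of the Wikipedia page, and it uses Python's built-in **html.parser** instead of **lxml** so that no extra parser is needed. The names `sample_html`, `sample_soup`, `sample_tab` and `sample_cells` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table with two cells, each holding a postal code
# in <b> and a "Borough(Neighborhood)" string in <span>.
sample_html = """
<table>
  <tr>
    <td><p><b>M1A</b><br><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><br><span>North York(Parkwoods)</span></p></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_tab = sample_soup.find("table")       # first <table> tag, or None if absent
sample_cells = sample_tab.find_all("td")     # every <td> inside that table

for td in sample_cells:
    # td.p and td.span are shortcuts for td.find("p") and td.find("span")
    print(td.p.text[:3], "|", td.span.text)
```

Because **find** returns None when no matching tag exists, checking the result before using it is a reasonable safeguard when the page layout may change.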

## Structuring the found table into a DataFrame
- Create an empty list to store the data
- Find all **"td"** tags in the table and iterate over them
- If the cell text is *Not assigned*, the iteration step is skipped
- Otherwise the text is parsed and cleaned to extract the desired fields: **PostalCode, Borough, Neighborhood**

In [3]:
table_contents = []

# Each <td> holds one postal code together with its borough and neighborhoods
for row in tab.find_all('td'):
    text = row.span.text
    if text == 'Not assigned':
        continue  # skip unassigned postal codes
    cell = {}
    cell['PostalCode'] = row.p.text[:3]
    cell['Borough'] = text.split('(')[0]
    # Keep the part inside the parentheses and turn ' /' separators into commas
    cell['Neighborhood'] = text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)

df = pd.DataFrame(table_contents)
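
The string cleanup is easier to check in isolation. The following is a minimal sketch, assuming the cell text has the form `Borough(Name / Name)`; the helper `split_cell_text` and the sample strings are hypothetical and only mirror that format. It is a variation on the chained calls above rather than a copy: it also strips the borough and returns an empty neighborhood, instead of raising an error, when the text has no parenthesis.

```python
def split_cell_text(text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name')."""
    borough, _, rest = text.partition("(")
    # Drop the closing parenthesis, then split on '/' and trim each name
    names = [name.strip() for name in rest.rstrip(")").split("/")]
    return borough.strip(), ", ".join(names)


print(split_cell_text("North York(Parkwoods)"))
print(split_cell_text("Downtown Toronto(Regent Park / Harbourfront)"))
```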

## Assigning the Borough as the Neighborhood
- A lambda function applied to each row of the DataFrame copies the Borough value into Neighborhood whenever the Neighborhood is *Not assigned*

In [4]:
df['Neighborhood'] = df.apply(lambda row: row['Borough'] if row['Neighborhood']=='Not assigned' else row['Neighborhood'], axis = 1)
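
The same replacement can be written without a row-wise lambda. The sketch below uses **Series.mask** on made-up sample rows (the `sample` frame and its values are hypothetical); it is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the condition is True, take the value from Borough instead
unassigned = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(unassigned, sample["Borough"])

print(sample)
```

Working on whole columns avoids calling a Python function once per row, which tends to matter only on larger tables.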

## Grouping neighborhoods with the same postal code
- Group the rows by PostalCode and join the Neighborhood values of each group with **", "** to achieve the desired output
- **transform** writes the joined string back to every row of the group, so the number of rows does not change

In [5]:
df['Neighborhood'] = df[['PostalCode', 'Neighborhood']].groupby(['PostalCode'])['Neighborhood'].transform(lambda x: ', '.join(x))
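
To see what **transform** does to the row count, here is a small sketch on made-up data (the `sample` frame is hypothetical). It also shows **agg** as one possible alternative for cases where a single row per postal code is wanted; that alternative is not part of this notebook.

```python
import pandas as pd

# Made-up sample rows: two neighborhoods share the postal code M5A
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# transform: one joined string per group, repeated on every row of that group
joined = sample.groupby("PostalCode")["Neighborhood"].transform(lambda x: ", ".join(x))

# agg: one output row per group
collapsed = sample.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)

print(joined.tolist())
print(collapsed)
```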

### DataFrame final shape

In [6]:
df.shape

(103, 3)