## This notebook will scrape a table from Wikipedia using Beautiful Soup

#### First, we will import the necessary libraries:

In [6]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

#### We use requests to fetch the page, and Beautiful Soup to parse the page's HTML source

In [2]:
source = requests.get('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
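
A slightly more defensive version of the fetch step might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook. Calling `raise_for_status()` makes a failed request raise an error instead of silently handing an error page to the parser.

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Intended usage (performs a network request, so it is left commented out):
# source = fetch_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```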

#### We only want the table containing the postal data, so we store that table's HTML in its own variable

In [3]:
table = soup.find_all('table')[0]
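
Indexing `find_all('table')[0]` depends on the postal-code table being the first table on the page. As an alternative sketch, the table could be selected by CSS class instead. The inline HTML and the `wikitable` class below are assumptions for illustration only; check the live page's markup before relying on them.

```python
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the page source (not the real Wikipedia HTML).
sample_html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

# The built-in 'html.parser' is used here only to keep the sketch dependency-light.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching element, or None if nothing matches.
sample_table = sample_soup.find('table', class_='wikitable')
if sample_table is None:
    raise ValueError('No table with class "wikitable" found')

headers = [th.get_text(strip=True) for th in sample_table.find_all('th')]
print(headers)
```

Selecting by class skips the unrelated first table, which positional indexing would have returned.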

#### The read_html function in pandas returns a list of DataFrames, one per table; here the list contains only one. We extract it by indexing.

In [26]:
df_list = pd.read_html(str(table))
df = df_list[0]
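
To illustrate why the result is a list, the sketch below runs `pd.read_html` on a made-up snippet containing two tables. Wrapping the HTML in `StringIO` is an optional extra step, on the understanding that recent pandas versions warn when given a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, purely for illustration.
two_tables = """
<table><tr><th>Postal code</th><th>Borough</th></tr>
<tr><td>M3A</td><td>North York</td></tr></table>
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
"""

frames = pd.read_html(StringIO(two_tables))
print(len(frames))           # one DataFrame per <table> element
first = frames[0]
print(list(first.columns))   # column names come from the <th> header row
```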

#### The only cleaning this project requires is removing rows whose borough is 'Not assigned'.

In [28]:
df_cleaned = df[df['Borough']!='Not assigned']
df_cleaned.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
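
The filter works by building a boolean mask: `df['Borough'] != 'Not assigned'` is `True` for the rows to keep, and indexing with that mask drops the rest. A minimal sketch on made-up rows (the `toy` frame is illustrative only, not the scraped data):

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

mask = toy['Borough'] != 'Not assigned'   # boolean Series, one entry per row
toy_cleaned = toy[mask]

print(mask.tolist())
print(toy_cleaned.index.tolist())         # original row labels are preserved
```

Because the original row labels survive the filter, the output above starts at index 2 rather than 0. Calling `reset_index(drop=True)` on the result would renumber the rows from 0 if that is preferred.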


#### Lastly, we print out the shape to make sure we have the right number of rows and columns.

In [30]:
df_cleaned.shape

(103, 3)
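
The same check could be written as assertions, so that a change in the source page fails loudly instead of passing unnoticed. This is only a sketch: `check_cleaned` is a hypothetical helper, and the expected shape is passed in rather than hard-coded.

```python
import pandas as pd

def check_cleaned(frame, expected_shape):
    """Hypothetical helper: verify the shape and that no unassigned boroughs remain."""
    assert frame.shape == expected_shape, f'unexpected shape {frame.shape}'
    assert (frame['Borough'] != 'Not assigned').all(), 'unassigned boroughs remain'

# Usage on a made-up two-row frame; in this notebook the call would be
# check_cleaned(df_cleaned, (103, 3)).
demo = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})
check_cleaned(demo, (2, 3))
```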