<div align='center'><h1>Toronto Web Scraping</h1></div>
<div align='center'><h3>by Darvesh Gorhe</h3></div>

This notebook contains the code needed to scrape [this](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) Wikipedia page, which lists the postal codes, boroughs, and neighborhoods of Toronto. The goal is a pandas dataframe holding every postal code that has been assigned a borough, together with its neighborhoods. To achieve this we:
1. Import the modules in the cell below
2. Extract information in each cell via BeautifulSoup4
3. Fill a dataframe with that information
4. Clean that dataframe to remove unwanted rows/cells

### Importing Necessary Libraries

In [1]:
# Library for storing, manipulating, analyzing data
import pandas as pd

# Web Scraping
from bs4 import BeautifulSoup as bs
import requests
import html5lib

### Getting Table as HTML

In [2]:
# Getting HTML from Wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
t = requests.get(url) # Requesting the raw HTML from wikipedia
t.encoding = 'utf-8' # Making sure it's encoded in utf-8
soup = bs(t.text, 'html5lib') # Storing it as a BeautifulSoup Object

table = soup.find('table') # Extracting the table of interest (assumed here to be the first table on the page)

rows = table.find_all('tr') # Extracting rows
rows = rows[1:] # Selecting only the data rows (skipping the header row)
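
The row extraction above can be tried offline on a small hand-written table. The following is a minimal sketch, not the notebook's exact method: `SAMPLE_HTML` and `extract_rows` are hypothetical names, the built-in `html.parser` backend is swapped in for `html5lib` so that nothing extra needs installing, and `get_text(strip=True)` is used in place of `.string`, since `.string` returns `None` when a cell has more than one child node (for example a link followed by whitespace).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td><a href="#">Downtown Toronto</a> </td><td>Regent Park / Harbourfront</td></tr>
</table>
"""


def extract_rows(html):
    """Return one list of cell strings per data row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # the header row holds <th> cells only, so skip it
            continue
        # get_text(strip=True) joins all text in the cell and trims whitespace
        records.append([cell.get_text(strip=True) for cell in cells])
    return records


sample_rows = extract_rows(SAMPLE_HTML)
for record in sample_rows:
    print(record)
```

Skipping rows without `<td>` cells is an alternative to slicing off the first row, and it also tolerates tables with more than one header row.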

### Putting Data Into Dataframe

In [3]:
# Column names for the dataframe, plus a list to collect one record per table row
cols = ['Postal Code', 'Borough', 'Neighborhood']
records = []

# Retrieving each cell and storing its text
for row in rows:
    row_n = row.find_all('td')

    postal_code = str(row_n[0].string).replace('\n', '')
    borough = str(row_n[1].string).replace('\n', '')
    neighborhood = str(row_n[2].string).replace('\n', '')

    records.append({'Postal Code': postal_code,
                    'Borough': borough,
                    'Neighborhood': neighborhood})

# Building the dataframe once (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(records, columns=cols)

### Cleaning Dataframe

In [4]:
# astype returns a new dataframe, so the result has to be assigned back
df = df.astype({'Postal Code':'object', 
                'Borough':'object', 
                'Neighborhood':'object'}) # storing every column with the pandas object dtype

# Replacing forward slashes (/) with commas (,) across the whole column at once
# (avoids chained assignment such as df['Neighborhood'][n] = ...)
df['Neighborhood'] = df['Neighborhood'].str.replace('/', ', ', regex=False)

# Removing any unassigned boroughs
df = df[df.Borough != 'Not assigned']

# Resetting the index of the dataframe
df.reset_index(drop=True, inplace=True)
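
The cleaning steps can be checked on a few hand-written rows before running them on the scraped table. This is a minimal sketch under stated assumptions: `sample` and `clean` are hypothetical names, the rows are illustrative rather than scraped, and the separator is matched with a regular expression that also swallows the spaces around each slash, which is a small variation on the plain replacement used in the cell above.

```python
import pandas as pd

# Illustrative rows only; this is not the scraped data.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park / Harbourfront"],
})


def clean(frame):
    """Return a cleaned copy: comma separators, assigned boroughs only."""
    out = frame.copy()
    # Variation on the notebook: also absorb whitespace around each slash
    out["Neighborhood"] = out["Neighborhood"].str.replace(
        r"\s*/\s*", ", ", regex=True
    )
    # Keep only rows whose borough has been assigned
    out = out[out["Borough"] != "Not assigned"]
    # Renumber the remaining rows from 0
    return out.reset_index(drop=True)


cleaned = clean(sample)
print(cleaned)
```

Working on a copy leaves the input dataframe untouched, which makes it easy to rerun the cleaning after changing a step.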

### Result

In [5]:
print("The following dataframe has", df.shape[0], "rows", '\n')

df.head()

The following dataframe has 103 rows 



| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
