# Create a New Toronto Neighborhoods Dataset

## Introduction

In this notebook we scrape a Wikipedia page to gather the information needed to build a data frame of Toronto neighborhoods.
After getting the data from the Wikipedia page, we clean it and save it.

Before we get the data and start exploring it, let's import all the dependencies we will need.

In [15]:
import pandas as pd # For managing the dataframe
#!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup # For webscraping
import requests # library to handle requests

print('Libraries imported.')

Libraries imported.


#### 1. First we need to get the data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To get the data, we open the page and inspect its HTML. The table with the useful content is the only table with class `wikitable`, so we can scrape its content with BeautifulSoup.

Let's use the `requests` library to get the full content of the web page.

In [16]:
# instantiate headers to be sent with the request
headers = requests.utils.default_headers()

# Define wikipedia url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Set headers to emulate a firefox browser in Ubuntu OS
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

# Do the request (headers must be passed by keyword; the second positional argument is params)
r = requests.get(url, headers=headers)

# Get the raw content of the request result
raw_html = r.content
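
One detail of this call can be checked offline. In `requests.get(url, params=None, **kwargs)` the second positional argument is `params`, so a headers dictionary only reaches the HTTP headers when it is passed with the `headers=` keyword. The sketch below builds two requests without sending them and compares the results; the URL and the `demo-agent` value are placeholders assumed for illustration, not part of the notebook.

```python
import requests

# Placeholder values, assumed for illustration only
demo_url = 'https://example.org/page'
demo_headers = {'User-Agent': 'demo-agent'}

# Passed as params, the dictionary ends up in the query string
as_params = requests.Request('GET', demo_url, params=demo_headers).prepare()

# Passed as headers, it ends up in the HTTP headers and the URL is unchanged
as_headers = requests.Request('GET', demo_url, headers=demo_headers).prepare()

print(as_params.url)
print(as_headers.url)
print(as_headers.headers['User-Agent'])
```

Preparing a request this way never touches the network, which makes it a cheap way to confirm what would be sent before running the real call.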

#### 2. Now we use BeautifulSoup to parse the page and extract the part we need.

We want to get the table with class `wikitable`, then extract all the rows from that table.

In [17]:
# Parse with BeautifulSoup
soup = BeautifulSoup(raw_html, 'html.parser')

# The content is in a table with class 'wikitable', which is the only table with this class on the page
table = soup.find('table', class_='wikitable')

# Get all the rows of the table
table_rows = table.find_all('tr')
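
To see what this lookup returns without hitting the network, here is a minimal sketch on a small inline HTML string. The markup is a hypothetical, simplified stand-in for the real page (the header labels and the `infobox` table are assumptions), but the `find` and `find_all` calls are the same as in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first table whose class list contains 'wikitable'
sample_table = sample_soup.find('table', class_='wikitable')

# find_all('tr') returns every row, including the header row
sample_rows = sample_table.find_all('tr')

# Count the <td> cells per row; a header row made of <th> cells has none
cells_per_row = [len(tr.find_all('td')) for tr in sample_rows]

print(len(sample_rows))
print(cells_per_row)
```

In this sample the header row contributes zero `<td>` cells, which is why a row built only from `<td>` elements can come out empty.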

#### 3. Build a dataframe with the content of all the rows in the table

To build a data frame from the table rows, we loop through the rows, find all `<td>` elements in each one, collect their text into a list, and append that list to the list `l`. We then create a data frame from `l` with the columns `'PostalCode'`, `'Borough'` and `'Neighborhood'`.

In [18]:
# Instantiate a list
l = []

# For each row in the table, get all the cells in that row, create a
# list with the text content of those cells and append it to l.
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    l.append(row)

# instantiate the dataframe
df = pd.DataFrame(l, columns=['PostalCode', 'Borough', 'Neighborhood'])
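
As a variant of this loop (not the approach used in this notebook), the sketch below skips rows that have no `<td>` cells and strips surrounding whitespace while collecting the text, so less cleanup is needed afterwards. The inline HTML is a hypothetical sample assumed for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table, assumed for illustration only
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

records = []
for tr in sample_rows:
    # get_text(strip=True) removes leading and trailing whitespace, including newlines
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip rows without <td> cells, such as the header row
        records.append(cells)

sample_df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```

The trade-off is that filtering happens during scraping rather than in a separate cleaning step, which is convenient here but hides the raw table shape.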

#### 4. Clean the dataset

To clean the dataset, we need to remove all empty rows and all rows in which `Borough` is `Not assigned`, and set `Neighborhood` equal to `Borough` where `Neighborhood` is `Not assigned`.

In [19]:
# Remove the first row, which is empty (the header row contains no <td> cells)
df = df.iloc[1:]

# Delete rows whose Borough is "Not assigned"
df = df[df.Borough != "Not assigned"]

# Reset index after removing lines
df = df.reset_index(drop=True)

# Remove the trailing newline character \n from Neighborhood values
# (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
df.Neighborhood = df.Neighborhood.str.replace(r'\n$', '', regex=True)

# Set Neighborhood equal to Borough if the value of Neighborhood == "Not assigned"
df.loc[df.Neighborhood == "Not assigned", "Neighborhood"] = df.loc[df.Neighborhood == "Not assigned", "Borough"]

df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
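
The cleaning rules can be checked in isolation on a tiny hand-made frame. This is a minimal sketch with made-up rows that mimic the scraped values (trailing newline, `Not assigned` placeholders); the names `sample`, `cleaned` and `missing` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hand-made rows mimicking the scraped values, assumed for illustration only
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned\n', 'Parkwoods\n', 'Not assigned\n'],
})

# Drop rows without a borough and renumber the index
cleaned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)

# Strip the trailing newline; regex=True makes the pattern a regular expression
cleaned['Neighborhood'] = cleaned['Neighborhood'].str.replace(r'\n$', '', regex=True)

# Where the neighborhood is missing, fall back to the borough name
missing = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighborhood'] = cleaned.loc[missing, 'Borough']

print(cleaned)
```

Note that the newline has to be stripped before the comparison with `"Not assigned"`, otherwise `"Not assigned\n"` would not match.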


#### 5. Save data in a .csv file

In [20]:
df.to_csv("TorontoNeighborhood.csv")

Now we have a file ready to be used in the next iterations of the project.
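
One thing to keep in mind when reading the file back: `to_csv` writes the index as an unnamed first column by default, and `read_csv` then loads it as a column called `Unnamed: 0`. The sketch below shows the difference on a small made-up frame, using in-memory text instead of a file; passing `index=False` is one possible way to avoid the extra column, not something the notebook itself does.

```python
import io
import pandas as pd

# Small made-up frame, assumed for illustration only
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# to_csv returns the CSV text when no path is given
csv_with_index = sample.to_csv()
csv_without_index = sample.to_csv(index=False)

# Read both variants back from memory
with_index = pd.read_csv(io.StringIO(csv_with_index))
without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, the file can be kept as written and loaded with `pd.read_csv(path, index_col=0)` so the first column is used as the index again.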