# Part I. Creating Toronto's Postal Code pandas DataFrame

Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform it into a pandas dataframe.




![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1573084800000&hmac=wMQca_jeiHZfDcFSbTSX2e5wUoX7Ur7DNAqir1u5G-c)



To create the above dataframe:

* The dataframe will consist of three columns, i.e., PostalCode, Borough, and Neighborhood.
* Only process the cells that have an assigned borough and ignore the cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods, i.e., Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table of the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
* Clear your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
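
The rules above can be tried on a few hand-written rows before touching the scraped table. The following is a minimal sketch, not the notebook's actual pipeline: the `raw` and `toy` frames are hypothetical, the postal codes other than M5A are placeholders, and the column name `Postcode` is assumed to match the scraped table.

```python
import pandas as pd

# Hypothetical rows modeled on the examples in the list above (not scraped data).
raw = pd.DataFrame(
    {
        "Postcode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Keep only rows that have an assigned borough.
toy = raw[raw["Borough"] != "Not assigned"].copy()

# A neighborhood that is "Not assigned" takes the name of its borough.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

# Rows sharing a postal code are merged into one comma-separated row.
toy = toy.groupby(["Postcode", "Borough"])["Neighborhood"].agg(", ".join).reset_index()

print(toy)
print(toy.shape)  # (number of rows, number of columns)
```

Filling in the borough name before grouping keeps the literal string "Not assigned" out of any joined neighborhood list. With the real data either order may work, so treat this ordering as a design choice rather than a requirement.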

Postal codes beginning with M are located within the city of Toronto in the province of Ontario. 


In [2]:
#   get_ipython().system(u' pip install beautifulsoup4')
#   get_ipython().system(u' pip install wikipedia')
#   get_ipython().system(u' pip install lxml')

In [3]:
# Importing necessary libraries
import pandas as pd
import wikipedia as wp
from bs4 import BeautifulSoup

## Scraping the List of Postal Codes of Canada

In [4]:
# Fetch the page by its title (the text of the page's h1 heading) and get its HTML
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")

In [5]:
# read_html returns a list of tables; the postal code table is taken to be the first one (index 0)
df = pd.read_html(html, header = 0)[0]

## Data Cleaning

In [7]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
df = df[df.Borough != 'Not assigned']

In [9]:
# Combine the neighborhoods of rows that share a postcode into one comma-separated string, as shown in row 11 of the table above
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()


In [10]:
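# An unassigned neighbourhood is labelled with its corresponding borough.
# A masked .loc assignment is used because changes to `row` inside iterrows() are not written back to df.
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']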

In [11]:
df

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |


## Exporting the Toronto Postal Code Dataframe to a CSV File

In [12]:
df.to_csv('postal_code.csv', index=False)
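
The `index=False` argument matters when the file is read back later. The sketch below shows the difference on a small hypothetical frame (`toy` is illustrative, not the notebook's `df`) and writes to an in-memory string rather than to `postal_code.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same columns as the notebook's dataframe.
toy = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)

# to_csv() without a path returns the CSV text; StringIO lets read_csv parse it.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Writing the index adds an unlabeled first column, which read_csv names "Unnamed: 0".
print(list(with_index.columns))
# With index=False only the three data columns are written.
print(list(without_index.columns))
```

Values that contain commas, such as the joined neighborhood lists, are quoted by `to_csv`, so they come back as a single field.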