# Neighborhoods in Toronto
[1. Create dataframe]

---

## 1. Intro

This notebook is part of the capstone **Peer-graded Assignment**. <br>
Its goal is to build a dataframe **in the required format** from a Wikipedia page. <br>
Wiki page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

#### The dataframe should have the following format:
- The dataframe will consist of ``three columns: PostalCode, Borough, and Neighborhood``
- Only process the cells that have an assigned borough. ``Ignore cells with a borough that is Not assigned``.
- More than one neighborhood can exist in one postal code area. These rows will be ``combined`` into one row with the ``neighborhoods separated with a comma`` as shown in row 11 in the above table.
- If a ``cell has a borough but a Not assigned neighborhood``, then the ``neighborhood`` will be the ``same as the borough.``
- Clean your Notebook and add Markdown cells to ``explain your work and any assumptions you are making.``
- In the last cell of your notebook, use the ``.shape method`` to print the number of rows of your dataframe.
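
The rules above can also be written as a small self-check. The following is only a sketch: `check_format` is a hypothetical helper that is not part of the assignment, the toy rows are made up, and the column names are passed in as a parameter because the scraped table uses `Postcode` and `Neighbourhood` rather than the names in the requirements.

```python
import pandas as pd


def check_format(df, columns=("PostalCode", "Borough", "Neighborhood")):
    """Return a list of rule violations (hypothetical helper, empty if none)."""
    code_col, borough_col, hood_col = columns
    problems = []
    if list(df.columns) != list(columns):
        problems.append(f"columns are {list(df.columns)}, expected {list(columns)}")
        return problems  # the remaining checks rely on these column names
    if (df[borough_col] == "Not assigned").any():
        problems.append("some rows still have a 'Not assigned' borough")
    if (df[hood_col] == "Not assigned").any():
        problems.append("some rows still have a 'Not assigned' neighborhood")
    if df[code_col].duplicated().any():
        problems.append("some postal codes appear in more than one row")
    return problems


# Made-up rows that satisfy the rules.
good = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Harbourfront, Regent Park", "Queen's Park"],
    }
)

# Made-up rows that break three rules at once.
bad = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park"],
    }
)

print(check_format(good))
print(check_format(bad))
```

Returning a list of messages instead of raising on the first failure makes it possible to see every broken rule in a single run.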




## 2. Dataset scraping

In [2]:
from bs4 import BeautifulSoup
import requests

Set the target URL of the Wikipedia page:

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [4]:
source = requests.get(url).text
soup = BeautifulSoup(source,'lxml')

Get the table we want:

In [5]:
html_table = soup.find('table', class_='wikitable sortable')
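
As far as the Beautiful Soup documentation describes it, a `class_` value that contains a space is compared against the whole class attribute string, so the lookup above may stop matching if the page ever adds another class to the table. A CSS selector is one possible alternative. The sketch below runs on a hypothetical inline snippet (the extra `jquery-tablesorter` class is an assumption for illustration) and uses the built-in `html.parser` so that it needs no network access.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same kind of table, with one extra class added.
html = """
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>Postcode</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The selector requires both classes but tolerates additional ones.
table = soup.select_one("table.wikitable.sortable")
if table is None:
    raise ValueError("no table with classes 'wikitable' and 'sortable' found")

print(table["class"])
```

Checking for `None` right after the lookup gives a clear error message instead of an `AttributeError` in a later cell.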

In [6]:
print(html_table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

Extract the table head:

In [7]:
th = html_table.tbody.find_all('th')
print(th)

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
</th>]


In [8]:
head=[th[0].text,th[1].text,th[2].text]
head

['Postcode', 'Borough', 'Neighbourhood\n']

Notice that the last element in head ends with '\n', which needs to be removed:

In [9]:
head=[th[0].text,th[1].text,th[2].text[:-1]]
head

['Postcode', 'Borough', 'Neighbourhood']
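
Slicing with `[:-1]` assumes that exactly one trailing character has to go. A slightly more forgiving variant, sketched here on a hypothetical inline header row, lets Beautiful Soup strip surrounding whitespace through `get_text(strip=True)`:

```python
from bs4 import BeautifulSoup

# Hypothetical header row with a trailing newline in the last cell.
html = "<table><tr><th>Postcode</th><th>Borough</th><th>Neighbourhood\n</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# strip=True removes leading and trailing whitespace from each cell's text.
head = [th.get_text(strip=True) for th in soup.find_all("th")]
print(head)  # ['Postcode', 'Borough', 'Neighbourhood']
```

This gives the same result whether or not a cell ends with a newline.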

In [10]:
import pandas as pd

Create an empty dataframe that uses the table head as its column names:

In [15]:
df = pd.DataFrame(columns=head)
df

Unnamed: 0,Postcode,Borough,Neighbourhood


Fill the dataframe row by row. <br>
Rows of the HTML table are skipped if their borough is "Not assigned".

In [16]:
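x = 0  # index of the next row to write
for line in html_table.tbody.find_all('tr')[1:]:  # skip the header row
    cell = line.find_all('td')
    # [:-1] drops the trailing '\n' from the last cell
    row = [cell[0].text, cell[1].text, cell[2].text[:-1]]
    if row[1] != 'Not assigned':   # keep only rows with an assigned borough
        df.loc[x] = row
        x = x + 1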

print(x, 'rows added to the dataframe!')

211 rows added to the dataframe!
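
Assigning through `df.loc[x]` enlarges the dataframe one row at a time. An alternative is to collect the rows in a plain list and build the dataframe once at the end. This is a sketch on a hypothetical three-row table, not the notebook's own method; `demo` and the inline HTML are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods
</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only complete rows whose borough is assigned.
    if len(cells) == 3 and cells[1] != "Not assigned":
        rows.append(cells)

# Build the dataframe in one step from the collected rows.
demo = pd.DataFrame(rows, columns=["Postcode", "Borough", "Neighbourhood"])
print(len(demo), "rows kept")
```

The `len(cells) == 3` guard is an extra safety check so that a malformed row is skipped rather than raising an `IndexError`.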


Check the dataframe

In [17]:
df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [18]:
df.describe()

Unnamed: 0,Postcode,Borough,Neighbourhood
count,211,211,211
unique,103,11,209
top,M9V,Etobicoke,Runnymede
freq,8,45,2


We now have a rough table with 211 rows; rows with a "Not assigned" borough have already been dropped.

-------------

## 3. Dataframe formatting

#### Next we need to modify it to the desired format

Check how many rows have a "Not assigned" neighborhood

In [19]:
df.query('Neighbourhood=="Not assigned"')

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


Only "Queen's Park" is affected, so we set its neighborhood to be the same as its borough:
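
If more than one row were affected, the same rule could be applied to all of them at once with a boolean mask. A minimal sketch on made-up rows (`demo` is a hypothetical stand-in for the scraped dataframe):

```python
import pandas as pd

# Made-up rows; only the second one has a "Not assigned" neighbourhood.
demo = pd.DataFrame(
    {
        "Postcode": ["M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Harbourfront", "Not assigned"],
    }
)

# Select every affected row, then copy the borough into the neighbourhood column.
missing = demo["Neighbourhood"] == "Not assigned"
demo.loc[missing, "Neighbourhood"] = demo.loc[missing, "Borough"]

print(demo)
```

Because the mask selects rows by value, this does not depend on knowing the row index in advance.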

In [20]:
# Use .loc for the assignment to avoid chained indexing
df.loc[6, 'Neighbourhood'] = df.loc[6, 'Borough']
df.loc[6, 'Neighbourhood']

"Queen's Park"

For each postcode, we need to combine all of its neighborhoods into a single row, separated by commas:

In [22]:
NB_list=df.groupby(by=['Postcode','Borough']).apply(lambda x:x.Neighbourhood.str.cat(sep=','))

In [23]:
df1=pd.DataFrame({'Neighbourhood':NB_list})

In [24]:
df1.reset_index(inplace=True)
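
The three cells above can also be expressed as a single chain. The sketch below uses made-up rows (`demo` and `combined` are hypothetical names) and joins each group's neighbourhoods with `",".join` instead of a lambda; it is an equivalent formulation under those assumptions, not a replacement for the cells above.

```python
import pandas as pd

# Made-up rows: two postcodes with two neighbourhoods each.
demo = pd.DataFrame(
    {
        "Postcode": ["M5A", "M5A", "M1B", "M1B"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough", "Scarborough"],
        "Neighbourhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern"],
    }
)

# Group by postcode and borough, join the neighbourhood names of each group,
# then turn the group keys back into ordinary columns.
combined = (
    demo.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)

print(combined)
```

`groupby` sorts the group keys by default, which is why the result is ordered by postcode rather than by the order of appearance in the page.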

In [29]:
df1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


In [30]:
df1.tail(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
93,M9A,Etobicoke,Islington Avenue
94,M9B,Etobicoke,"Cloverdale,Islington,Martin Grove,Princess Gar..."
95,M9C,Etobicoke,"Bloordale Gardens,Eringate,Markland Wood,Old B..."
96,M9L,North York,Humber Summit
97,M9M,North York,"Emery,Humberlea"
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."
102,M9W,Etobicoke,Northwest


In [26]:
df1.describe()

Unnamed: 0,Postcode,Borough,Neighbourhood
count,103,103,103
unique,103,11,103
top,M1N,North York,Roselawn
freq,1,24,1


In [27]:
df1.shape

(103, 3)

This is the final dataframe, with 103 rows.

Finally, we export the dataframe as a CSV file:

In [28]:
df1.to_csv('ZIP_canada.csv',index=False)
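
Because the combined neighbourhood values contain commas themselves, it can be worth confirming that the export survives a round trip. This sketch uses made-up rows and writes to an in-memory buffer instead of `ZIP_canada.csv`, so it leaves no file behind; `demo` and `restored` are hypothetical names.

```python
import io

import pandas as pd

# Made-up rows; the first neighbourhood value contains a comma.
demo = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1G"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge,Malvern", "Woburn"],
    }
)

# Write the CSV text into memory rather than onto disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and compare with the original values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.shape)
```

pandas quotes fields that contain the delimiter, so `Rouge,Malvern` is read back as one value rather than two columns.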