# Exploring Toronto's Neighborhoods [(Part 2 Below)](#Part-2---Appending-Geolocation-Data)

_Borirak Opasanont_  
_04Apr20_  

Peer-graded assignment for week 3 of the IBM Data Science Capstone course on Coursera.

## Part 1 - Scraping Neighborhood Data

Build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

In [18]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

### Scraping data

I use BeautifulSoup to extract the table from Wikipedia. First, the `requests` library fetches the page, and the HTML is parsed into a BeautifulSoup object, `soup`. Opening the page in Chrome and using __Inspect Element__ shows that the table has the class "wikitable", so I search for that class and store the result in `table`.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url).text

# Create a BeautifulSoup object
soup = BeautifulSoup(page, 'xml')

table = soup.find(class_='wikitable') 

print(table.prettify())

<table class="wikitable">
 <tbody>
  <tr>
   <th>
    Postal code
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighborhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    North York
   </td>
   <td>
    Parkwoods
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    North York
   </td>
   <td>
    Victoria Village
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Regent Park / Harbourfront
   </td>
  </tr>
  <tr>
   <td>
    M6A
   </td>
   <td>
    North York
   </td>
   <td>
    Lawrence Manor / Lawrence Heights
   </td>
  </tr>
  <tr>
   <td>
    M7A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Queen's Park / Ontario Provincial Government
   </td>
  </tr>
  <tr>
   <td>
    M8A
   </td>
   <td>
    Not assigned
   </

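Next, I extract the data from the table row by row. The code for the table headings is left commented out because the column names are set by hand when the dataframe is built.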

In [4]:
# Get table column names (commented out because the names are set by hand below)
# headings = table.find_all('th')   # 'th' marks a header cell
# column_names = []
# for h in headings:
#     column_names.append(h.text.strip())

# Get table data row by row
rows = table.find_all('tr')   # 'tr' marks a table row
data = []
for row in rows:
    data.append([t.text.strip() for t in row.find_all('td')])   # 'td' marks a data cell

# Print some data to see what it looks like
# print(column_names)
print(data[0:5])
print("Number of data rows: ", len(data))

[[], ['M1A', 'Not assigned', ''], ['M2A', 'Not assigned', ''], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village']]
Number of data rows:  181


The table on the Wikipedia page has 180 rows, excluding the heading. Our list has 181 entries, the first of which is empty because the heading row contains only `th` cells and no `td` cells. So we have captured all the data.

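To illustrate why the first entry is empty, here is a minimal sketch on a hypothetical three-row miniature of the table (the `html` string and the name `data_rows` are assumptions for illustration, not part of the notebook). It uses Python's built-in `html.parser` so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (not the real page).
html = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find(class_="wikitable")

# The heading row holds only <th> cells, so find_all('td') returns nothing for it.
data = [[td.text.strip() for td in row.find_all("td")]
        for row in table.find_all("tr")]
print(data[0])

# Dropping empty lists removes the heading row before building a dataframe.
data_rows = [row for row in data if row]
print(len(data), len(data_rows))
```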
Next, let's put them into a Pandas dataframe.

In [5]:
# Now put everything into a Pandas dataframe
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out the first empty row
print(df.shape)
df.head()

(180, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 1 | M1A | Not assigned | |
| 2 | M2A | Not assigned | |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park / Harbourfront |


Great, the dataframe has 180 rows, matching the Wikipedia table.

### Cleaning the data

__The assignment requires that we ignore cells with a borough that is Not assigned.__

__Combine postal codes with more than one neighborhood into one row, with the neighborhoods separated by commas.__  
Unlike the example in the assignment instructions, the table no longer repeats M5A across several rows, so the Wikipedia page must have been updated since. I write the code anyway.

__If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.__  
The current table actually has no "Not assigned" neighborhood under an assigned borough. Again, I write the code anyway.

In [6]:
# Ignore unassigned boroughs
df_cleaned = df[df['Borough'] != 'Not assigned']

# Combine postal codes that have more than one neighborhood
df_cleaned = df_cleaned.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
# Wikipedia now separates neighborhoods with ' / '; switch to commas
# (assigning the result back avoids an in-place replace on a column selection)
df_cleaned['Neighborhood'] = df_cleaned['Neighborhood'].str.replace(' /', ',', regex=False)

# Assign borough to neighborhood if neighborhood is Not assigned
df_cleaned.loc[df_cleaned['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_cleaned['Borough']

df_cleaned.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |

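Since the current Wikipedia table never triggers the second and third rules, the cleaning logic can be checked on a small made-up dataframe. This is a sketch only: the rows in `toy` are hypothetical and chosen so that each rule fires once.

```python
import pandas as pd

# Hypothetical rows (not from Wikipedia), chosen to hit all three cleaning rules.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],      # rule 1: row is dropped
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],   # rule 2: merged with the row above
        ["M7A", "Queen's Park", "Not assigned"],      # rule 3: neighborhood becomes the borough
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: ignore unassigned boroughs.
toy_cleaned = toy[toy["Borough"] != "Not assigned"]

# Rule 2: one row per postal code, neighborhoods joined by commas.
toy_cleaned = (
    toy_cleaned.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(lambda x: ", ".join(x))
    .reset_index()
)

# Rule 3: fall back to the borough name when the neighborhood is not assigned.
mask = toy_cleaned["Neighborhood"] == "Not assigned"
toy_cleaned.loc[mask, "Neighborhood"] = toy_cleaned.loc[mask, "Borough"]

print(toy_cleaned)
```

The four input rows collapse to two: one for M5A with both neighborhoods joined, and one for M7A whose neighborhood is filled in from its borough.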

Finally, the assignment asks for the `.shape` attribute to be printed in the last cell to show the number of rows in the dataframe.

In [7]:
df_cleaned.shape

(103, 3)

## Part 2 - Appending Geolocation Data

In [10]:
# read csv file into a dataframe
coord_df = pd.read_csv('Geospatial_Coordinates.csv')
coord_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [17]:
# merge coordinate data into the postal code data
df_geo = pd.merge(df_cleaned, coord_df, how='left', left_on='PostalCode', right_on='Postal Code')
df_geo.drop('Postal Code', axis=1, inplace=True)
df_geo.head(12)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |

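A left merge silently leaves `NaN` coordinates for any postal code that is missing from the CSV file, so it can be worth checking the result. The sketch below is one possible check, not part of the assignment: it uses the `validate` and `indicator` options of `pd.merge` on two hypothetical miniature dataframes (`postal` and `coords` stand in for `df_cleaned` and `coord_df`, and the code `M9Z` is made up to force a miss).

```python
import pandas as pd

# Hypothetical miniature stand-ins for df_cleaned and coord_df.
postal = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],   # 'M9Z' is made up and has no coordinates
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = pd.merge(
    postal, coords, how="left",
    left_on="PostalCode", right_on="Postal Code",
    validate="one_to_one",   # raises MergeError if a postal code repeats on either side
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Postal codes that found no match in the coordinate table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would mean that every postal code received a latitude and longitude.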