# Part 1

The dataframe of Toronto neighbourhoods was built from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M according to the following criteria:

   * The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
   * Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
   * More than one neighborhood can exist in one postal code area. Such rows are combined into a single row, with the neighborhoods separated by commas.
   * If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
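
The last rule turned out not to be needed for this data, but it can be sketched on a small made-up table. The rows below are hypothetical and do not come from the Wikipedia page; only the column names (Postcode, Borough, Neighbourhood) match the ones used in the code in this notebook.

```python
import pandas as pd

# Hypothetical sample rows (placeholder values, not real postal codes)
sample = pd.DataFrame({
    "Postcode": ["X1A", "X2B", "X3C"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighbourhood": ["Not assigned", "North Side", "Not assigned"],
})

# Keep only rows with an assigned borough
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is not assigned, fall back to the borough name
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

The `.copy()` call avoids modifying a view of the original dataframe when the fallback values are written.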

    

The Beautiful Soup Python library was used to pull data from the Wikipedia page.

In [1]:
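import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

#Download the Wikipedia page and parse the HTML
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(req.content, 'lxml')

#Take the first table on the page
table = soup.find_all('table')[0]

#read_html returns a list of dataframes, so keep the first one
neighbourhood = pd.read_html(StringIO(str(table)))[0]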

The data was cleaned by removing rows whose Borough is Not assigned and consolidating rows that share a postcode. No rows were found with an assigned Borough and a Not assigned Neighbourhood, so the fallback to the borough name was not needed.

In [2]:
#Remove cells with Borough not assigned
neighbourhood2 = neighbourhood[neighbourhood.Borough != 'Not assigned']

#Check if Not assigned values removed from Borough
neighbourhood2.loc[neighbourhood2.Borough == 'Not assigned'].count()


Postcode         0
Borough          0
Neighbourhood    0
dtype: int64

In [3]:
#Check for cells with Neighbourhood not assigned
neighbourhood2[(neighbourhood2['Neighbourhood'] == 'Not assigned') & (neighbourhood2['Borough'] != 'Not assigned')].count()

Postcode         0
Borough          0
Neighbourhood    0
dtype: int64

In [7]:
#Combine neighbourhoods that share a postcode and borough into one comma-separated row
neighbourhood3 = neighbourhood2.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
neighbourhood3.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [5]:
neighbourhood3.shape

(103, 3)
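
The shape alone does not show that each postcode ended up on exactly one row. A minimal sketch of such a check, run here on made-up rows rather than the real table, applies the same grouping and then tests the Postcode column for uniqueness.

```python
import pandas as pd

# Hypothetical rows: postcode X1A is listed twice with different neighbourhoods
raw = pd.DataFrame({
    "Postcode": ["X1A", "X1A", "X2B"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighbourhood": ["North", "South", "Centre"],
})

# Same consolidation as above: one comma-separated row per postcode and borough
grouped = (
    raw.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

# True only if no postcode appears on more than one row
one_row_per_postcode = grouped["Postcode"].is_unique

print(grouped.shape, one_row_per_postcode)
```

Because the grouping key includes Borough, the check would come out `False` if a single postcode were listed under two different boroughs.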

# Part 2

In order to use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood. The geographical coordinates were obtained from the CSV file http://cocl.us/Geospatial_data.

In [8]:
#Import csv file with coordinates
coord = pd.read_csv("http://cocl.us/Geospatial_data")
coord.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now, we need to add the geographical data to the neighbourhood dataframe. To do so, we first rename the column Postal Code to Postcode so that both dataframes share the same key column.

In [10]:
#Rename Postal Code column to Postcode
coord2 = coord.rename(columns = {'Postal Code':'Postcode'})
coord2.head()


|   | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
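
Renaming is one way to line up the key columns. As an alternative sketch, `pd.merge` can also join on differently named columns through its `left_on` and `right_on` parameters. The two frames below contain made-up values and only mimic the column layout of the real data.

```python
import pandas as pd

# Hypothetical frames that mirror the column names of the two real dataframes
left = pd.DataFrame({
    "Postcode": ["X1A", "X2B"],
    "Borough": ["Borough A", "Borough B"],
})
right = pd.DataFrame({
    "Postal Code": ["X1A", "X2B"],
    "Latitude": [1.0, 2.0],
    "Longitude": [-1.0, -2.0],
})

# Join on the differently named key columns
merged = pd.merge(left, right, left_on="Postcode", right_on="Postal Code")

# Both key columns survive the merge, so drop the redundant one
merged = merged.drop(columns="Postal Code")

print(merged)
```

Renaming first, as done in this notebook, avoids the extra `drop` step because a shared key name produces a single key column.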


Next, we merge the two dataframes on the Postcode column.

In [11]:
#Join neighbourhood data and coordinate data
result = pd.merge(neighbourhood3, coord2, on='Postcode')
result.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
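
By default `pd.merge` performs an inner join, so a postcode with no matching coordinates would silently drop out of the result. A minimal sketch of how this could be checked, using made-up frames, is a left join with `indicator=True`, which adds a `_merge` column, plus `validate="one_to_one"`, which raises an error if a postcode is duplicated on either side.

```python
import pandas as pd

# Hypothetical frames: postcode X3C has no coordinates
areas = pd.DataFrame({
    "Postcode": ["X1A", "X2B", "X3C"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
})
coords = pd.DataFrame({
    "Postcode": ["X1A", "X2B"],
    "Latitude": [1.0, 2.0],
    "Longitude": [-1.0, -2.0],
})

# Keep every row from areas and record where each row found a match
checked = pd.merge(
    areas, coords, on="Postcode",
    how="left", validate="one_to_one", indicator=True,
)

# Postcodes present in areas but absent from coords
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()

print(unmatched)
```

An empty `unmatched` list on the real dataframes would indicate that the inner join above kept every postcode.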
