### Neighborhoods in Canada Project

This project covers the neighborhoods of the city of Toronto. The neighborhood data was not readily available online, so it is scraped from Wikipedia using Python's BeautifulSoup and requests libraries.

In the first step, the libraries are imported.

In [24]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests
import sys, json # library to handle JSON data
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files
from geopy.geocoders import Nominatim # geocoder client for OpenStreetMap's Nominatim service
import io # in-memory text streams, used to read downloaded CSV content
import lxml.html as lh # library for processing XML and HTML

print('Libraries imported.')

Libraries imported.


### Scraping Data From Wikipedia

Wikipedia is blocked in my country, so I used the WikiZero mirror instead.
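The `q` parameter of the mirror URL used in this section appears to be the Base64-encoded address of the original Wikipedia page. A minimal sketch for checking which page is actually being scraped, using only the standard library (the variable names are illustrative, not part of the original notebook):

```python
import base64
from urllib.parse import urlparse, parse_qs

mirror_url = (
    "https://www.wikizeroo.org/index.php?q="
    "aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N"
)

# Pull the q parameter out of the query string
encoded = parse_qs(urlparse(mirror_url).query)["q"][0]

# Decode it back to the URL of the page being mirrored
original_url = base64.b64decode(encoded).decode("utf-8")
print(original_url)
```

Decoding the parameter this way is a quick sanity check that the mirror points at the intended list of postal codes before any parsing is done.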

In [25]:
#Make request to webpage
page = requests.get("https://www.wikizeroo.org/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N")
doc = lh.fromstring(page.content)

#Select all table rows from the page
tr_elements = doc.xpath('//tr')

#Collect the column names from the cells of the first (header) row
col_list = []
for t in tr_elements[0]:
    name = t.text_content()
    col_list.append(name.strip("\n"))

#Columns of neighborhoods dataframe
print(col_list)

['Postcode', 'Borough', 'Neighbourhood']
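The same header and row extraction can be tried offline before hitting the live page. The sketch below is only an illustration: the HTML string is made up to mimic the layout of the postal code table, and it uses BeautifulSoup with Python's built-in `html.parser` rather than the lxml route used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the layout of the postal code table
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Header names come from the <th> cells of the first row
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Data rows come from the <td> cells of the remaining rows
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

print(columns)
print(records)
```

Using `get_text(strip=True)` removes surrounding whitespace and newlines in one step, which avoids a separate `strip("\n")` call on each cell.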


In [26]:
#BeautifulSoup used to get table data.

soup= BeautifulSoup(page.text,'html.parser')
list_of_rows = []

for row in soup.find_all('tr'):
    list_of_cells = []
    for col in row.find_all('td'):
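        #Strip the newline characters around the cell text
        list_of_cells.append(col.text.strip("\n"))
    #Keep only rows with one cell per column; this skips the header row,
    #which has no <td> cells, and rows with a different cell count
    if len(list_of_cells) == len(col_list):
        list_of_rows.append(list_of_cells)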

#Create dataframe from lists
df= pd.DataFrame(list_of_rows,columns=col_list)

#"Not Assigned" neighborhoods replaced with borough.
#.loc assigns to df itself; chained indexing (df[mask]["col"] = ...) only changes a temporary copy
not_assigned = df.Neighbourhood == "Not assigned"
df.loc[not_assigned, "Neighbourhood"] = df.loc[not_assigned, "Borough"]

#Filter table; .copy() makes neighborhoods an independent dataframe
neighborhoods = df[df.Borough != "Not assigned"].copy()
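Assigning through chained indexing, as in `df[mask]["col"] = value`, writes to a temporary copy and is what triggers pandas' SettingWithCopyWarning, so the original dataframe is left unchanged. A minimal sketch of the `.loc` form on made-up rows shaped like the scraped table (the sample values are illustrative only):

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table
df = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# One .loc call selects rows and column together, so the assignment
# is applied to df itself rather than to a temporary copy
mask = df["Neighbourhood"] == "Not assigned"
df.loc[mask, "Neighbourhood"] = df.loc[mask, "Borough"]

print(df)
```

After the assignment, only the row that had "Not assigned" takes its borough name; the other row keeps its neighbourhood.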


In [27]:
#group neighborhoods by postcodes and combine them into one row with a comma

new_neighborhoods= neighborhoods.groupby(["Postcode","Borough"]).Neighbourhood.agg([('count'), ('Neighbourhood', ', '.join)])
neighborhoods2 =new_neighborhoods.copy()
neighborhoods2.reset_index(inplace=True)

print("Neighborhoods in Toronto dataframe size is:", neighborhoods2.shape[0], "rows,", neighborhoods2.shape[1], "columns")

neighborhoods2.head()


Neighborhoods in Toronto dataframe size is: 103 rows, 4 columns


|   | Postcode | Borough | count | Neighbourhood |
|---|---|---|---|---|
| 0 | M1B | Scarborough | 2 | Rouge, Malvern |
| 1 | M1C | Scarborough | 3 | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | 3 | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | 1 | Woburn |
| 4 | M1H | Scarborough | 1 | Cedarbrae |
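The grouping step can also be written with keyword-style named aggregation, which may read more clearly than a list of tuples. This is a sketch on three rows taken from the output above, and it assumes pandas 0.25 or later, where named aggregation is available:

```python
import pandas as pd

# Three ungrouped rows, one neighbourhood per row
rows = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Each keyword names an output column and gives the function that fills it
grouped = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(count="count", Neighbourhood=", ".join)
        .reset_index()
)

print(grouped)
```

Each postcode ends up on one row, with its neighbourhoods joined by a comma and a count of how many were combined.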


### Geocoding

In [41]:
#Create address column for geocoding

neighborhoods2["Address"] = "Toronto,"+neighborhoods2.Borough + ","+ neighborhoods2.Postcode
neighborhoods2.head()

|   | Postcode | Borough | count | Neighbourhood | Address | Coordinates |
|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | 2 | Rouge, Malvern | Toronto,Scarborough,M1B | (43.773077, -79.257774) |
| 1 | M1C | Scarborough | 3 | Highland Creek, Rouge Hill, Port Union | Toronto,Scarborough,M1C | (43.773077, -79.257774) |
| 2 | M1E | Scarborough | 3 | Guildwood, Morningside, West Hill | Toronto,Scarborough,M1E | |
| 3 | M1G | Scarborough | 1 | Woburn | Toronto,Scarborough,M1G | |
| 4 | M1H | Scarborough | 1 | Cedarbrae | Toronto,Scarborough,M1H | |


I used the OSM (Nominatim) geocoding API to geocode the addresses because it is easy to use and free.

In [32]:
#Exceptions handled by the geocoder function
from geopy.exc import GeocoderTimedOut, GeocoderQuotaExceeded

#Create geolocator object with a unique user agent
geolocator = Nominatim(user_agent="CourseraCapstoneEA3WeekSubmission@ibm.com")

#Create a geocoder function to apply to the dataframe
def geocoder(address):
    try:
        location = geolocator.geocode(address, timeout=5)
    except GeocoderTimedOut:
        #Mark timed-out requests so they can be told apart from missing results
        return "timeout"
    except GeocoderQuotaExceeded:
        return np.nan
    #geocode() returns None when no match is found
    if location is None:
        return np.nan
    return location.latitude, location.longitude

neighborhoods2['Coordinates'] = neighborhoods2['Address'].apply(geocoder)
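Timed-out requests could be retried instead of being recorded as "timeout". The helper below is a hypothetical sketch, not part of the original notebook: `geocode_with_retry` and its parameters are made-up names, the geocoding function is passed in so the example runs offline, and Python's built-in `TimeoutError` stands in for geopy's timeout exception.

```python
import math
import time
from types import SimpleNamespace

def geocode_with_retry(geocode_fn, address, retries=3, delay=1.0):
    """Return (latitude, longitude), or (nan, nan) if no result is obtained.

    geocode_fn is any callable that returns an object with .latitude and
    .longitude attributes, or None when the address is not found.
    """
    for _ in range(retries):
        try:
            location = geocode_fn(address)
        except TimeoutError:  # stand-in for geopy's timeout exception
            time.sleep(delay)  # wait before the next attempt
            continue
        if location is None:
            break  # address not found; retrying will not help
        return (location.latitude, location.longitude)
    return (math.nan, math.nan)

# Fake geocoder that times out once and then answers (for illustration only)
calls = {"n": 0}

def flaky_geocoder(address):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return SimpleNamespace(latitude=43.806686, longitude=-79.194353)

result = geocode_with_retry(flaky_geocoder, "Toronto,Scarborough,M1B", delay=0)
print(result)
```

Returning a two-element tuple in every case keeps the column's type consistent, which makes it easier to split the result into separate latitude and longitude columns later.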

The OSM geocoder did not return coordinates for every row, so I used the coordinates file from the provided link instead.

In [42]:
url="http://cocl.us/Geospatial_data"

s = requests.get(url).content
coordinates_ = pd.read_csv(io.StringIO(s.decode('utf-8')))
coordinates_.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [43]:
#Merge postal codes and their coordinates with the neighbourhood dataframe
final_df = pd.merge(
    neighborhoods2, coordinates_,
    left_on="Postcode", right_on="Postal Code", how="left",
)[["Postcode", "Borough", "Neighbourhood", "Latitude", "Longitude"]]

final_df.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
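Because the merge is a left join, a postcode without a match in the coordinates file would silently get empty latitude and longitude values. A small sketch of how this could be checked with the `indicator` and `validate` options of `merge`; the postcode "M9Z" is made up here to force a non-match, and the coordinates are copied from the output above:

```python
import pandas as pd

# "M9Z" is a hypothetical postcode with no entry in the coordinates table
left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"]})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = left.merge(
    right,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",  # raise if a postcode appears more than once on either side
    indicator=True,         # add a _merge column describing where each row came from
)

# Rows marked left_only found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postcode received coordinates.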
