# Coursera Capstone Week 3 Toronto

## Basic Instructions

In this assignment, you will explore, segment, and cluster the neighborhoods of Toronto based on postal code and borough information. Unlike for New York, however, the neighborhood data is not readily available on the internet. Part of what makes data science interesting is that each project is challenging in its own way, so you need to be agile and able to pick up new libraries and tools quickly, depending on the project.

For the Toronto neighborhood data, a Wikipedia page has all the information needed to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame so that it has a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that was done on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## Import all Libraries for this Assignment

In [303]:
import pandas as pd
# for webscraping
import requests
from bs4 import BeautifulSoup
# for getting location data
import geocoder
# for generating and displaying maps
import folium
from IPython.display import IFrame

print('Libraries imported')

Libraries imported


## Toronto Neighborhoods

### Scrape Relevant Data from the Internet

In [29]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
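response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
html = response.text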
soup = BeautifulSoup(html, 'html5lib')
chars_scraped = len(soup.get_text())
if chars_scraped > 0:
    print(chars_scraped, 'characters scraped from', url)
else:
    print('empty document!!!')

16935 characters scraped from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


### Get Relevant Data from Scraped Website

Postal Codes and Neighborhood Names are stored in a table.

In [51]:
tables = soup.find_all('table')
print('There are', len(tables), 'tables on this webpage.')
# print first 250 characters of each table to identify the correct table
for index, table in enumerate(tables):
    print(' ', '---------- table index ' + str(index) + '----------',
          ' ', str(table)[:250], sep='\n')

There are 3 tables on this webpage.
 
---------- table index 0----------
 
<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">

<tbody><tr>
<td style="width:11%; vertical-align:top; color:#ccc;">
<p><b>M1A</b><br/><span style="font-size:85%;"><i>Not assign
 
---------- table index 1----------
 
<table class="navbox">
<tbody><tr>
<td style="width:36px; text-align:center"><a class="image" href="/wiki/File:Flag_of_Canada.svg" title="Flag of Canada"><img alt="Flag of Canada" data-file-height="600" data-file-width="1200" decoding="async" height=
 
---------- table index 2----------
 
<table cellspacing="0" style="background-color: #F8F8F8;" width="100%">
<tbody><tr>
<td style="text-align:center; border:1px solid #aaa;"><a href="/wiki/Newfoundland_and_Labrador" title="Newfoundland and Labrador">NL</a>
</td>
<td style="text-align:c


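The first table (index 0) holds the postal codes, boroughs, and neighborhood names, as its first cell shows: `<p><b>M1A</b><br/><span style="font-size:85%;"><i>Not assign`. In this example, `M1A` in the `<b>` tag is the postal code, and the `<span>` holds the borough and neighborhood text (here, "Not assigned").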

In [206]:
all_table_rows = []

for row in tables[0].tbody.find_all('tr'):
    for cells in row.find_all('td'):
        all_table_rows.append([cells.find('b').get_text(), cells.find('span').get_text()])

print(len(all_table_rows), 'cells were found in this table')
print('---------- 5. element of 1. row ----------')
print(all_table_rows[4])

180 cells were found in this table
---------- 5. element of 1. row ----------
['M5A', 'Downtown Toronto(Regent Park / Harbourfront)']


In [270]:
df = pd.DataFrame(all_table_rows)
display(df.head())

Unnamed: 0,0,1
0,M1A,Not assigned
1,M2A,Not assigned
2,M3A,North York(Parkwoods)
3,M4A,North York(Victoria Village)
4,M5A,Downtown Toronto(Regent Park / Harbourfront)


### Clean the DataFrame

In [271]:
# get a description of the dataframe
display(df.describe())
# delete all rows without Borough/Neighborhood
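# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = df[df[1] != 'Not assigned'].copy()
# split 'Borough(Neighborhood)' at '('; the third column ('Empty') is discarded
df[['Borough', 'Neighborhood', 'Empty']] = df[1].str.split('(', expand=True)
df.drop(columns=[1, 'Empty'], inplace=True)
df.rename(columns={0: 'PostalCode'}, inplace=True)
# regex=False: ')' is meant literally, not as a regular-expression character
df['Neighborhood'] = df['Neighborhood'].str.replace(')', '', regex=False)
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)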
df['Borough'] = df['Borough'].str.replace('Downtown TorontoStn A PO Boxes25 The Esplanade', 'Downtown Toronto')
df['Borough'] = df['Borough'].str.replace('MississaugaCanada Post Gateway Processing Centre', 'Mississauga')
df['Borough'] = df['Borough'].str.replace('East TorontoBusiness reply mail Processing Centre969 Eastern', 'East Toronto')
df['Borough'] = df['Borough'].str.replace('EtobicokeNorthwest', 'Etobicoke Northwest')
df['Borough'] = df['Borough'].str.replace('East YorkEast Toronto', 'East York/East Toronto')
display(df.head())

Unnamed: 0,0,1
count,180,180
unique,180,104
top,M1X,Not assigned
freq,1,77


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Queen's Park,Ontario Provincial Government
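
The split-and-replace steps above can also be read as a single parsing rule applied to each cell. The following is only a sketch of that rule in plain Python, not the notebook's method: `split_cell` and `CELL_PATTERN` are hypothetical names, and the sketch assumes each cell has the form `Borough(Neighborhood / Neighborhood)`. The irregular cells that are patched by hand above would still need their own handling.

```python
import re

# Hypothetical helper, not part of the notebook: splits one scraped cell
# such as 'North York(Parkwoods)' into (borough, neighborhoods).
CELL_PATTERN = re.compile(r'^(?P<borough>[^(]+)\((?P<hoods>[^)]*)\)')


def split_cell(text):
    match = CELL_PATTERN.match(text)
    if match is None:
        # no parenthesised part: treat the whole text as the borough
        return text.strip(), ''
    borough = match.group('borough').strip()
    # the page separates neighborhoods with ' / '; use ', ' instead
    hoods = match.group('hoods').replace(' / ', ', ').strip()
    return borough, hoods


print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('North York(Parkwoods)'))
```

Applied to the two sample cells, this yields the same borough and neighborhood values as the rows shown above.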


### Check for Null-Values

In [276]:
display(df.isna().sum())

PostalCode      0
Borough         0
Neighborhood    0
dtype: int64

## Get Latitudes and Longitudes per Postal Code

### Using Geocoder (or at least trying to)

In [284]:
pc_lat_lng = []

print('Working', end=': ')

for postal_code in df['PostalCode']:

    print(postal_code, end='')
    
    lat_lng_coords = None
    while(lat_lng_coords is None):
        print(' . ', end='')
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    pc_lat_lng.append([postal_code, lat_lng_coords[0], lat_lng_coords[1]])
    
    print('done', end=' - ')
    
print(pc_lat_lng[:5])

Working M3A.

KeyboardInterrupt: 
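
The `while` loop above never ends if the geocoder keeps returning `None`, which is why the cell had to be interrupted. One way to avoid this is to cap the number of attempts. The sketch below is an assumption-laden illustration rather than the notebook's code: `lookup_with_retries` and `always_fails` are hypothetical names, and the lookup function is passed in so the example runs without network access. In the notebook, the lookup could be something like `lambda q: geocoder.google(q).latlng`.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            # wait before the next attempt (no wait after the last one)
            time.sleep(delay_seconds)
    return None


# Stand-in for the real geocoder call; it always fails, like the cell above.
def always_fails(query):
    return None


coords = lookup_with_retries(always_fails, 'M3A, Toronto, Ontario')
print(coords)
```

With a bounded loop, a postal code that cannot be resolved ends up as `None` and can be skipped or filled from the .csv file instead of blocking the whole run.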

### Read the Provided .csv File in Case Geocoder Doesn't Work

In [288]:
df_latlng = pd.read_csv('Geospatial_Coordinates.csv')
print('This dataframe has', df_latlng.shape[0], 'rows')
display(df_latlng.head(2))

This dataframe has 103 rows


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497


### Merge with Existing DataFrame

In [291]:
df_complete = df.merge(df_latlng, left_on='PostalCode', right_on='Postal Code')
df_complete.drop(columns='Postal Code', inplace=True)
df_complete.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
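
`merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped silently. A possible check, sketched here on toy data, is a left join with `indicator=True`. The frames `left` and `right` and the postal code `M9Z` are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# Toy stand-ins for df and df_latlng; 'M9Z' is invented so one row stays unmatched.
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Made-up Borough']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column
# that records whether each row found a partner in `right`.
checked = left.merge(right, how='left', left_on='PostalCode',
                     right_on='Postal Code', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty list would mean every postal code received coordinates. Here the list contains only the invented code.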


## Explore Neighborhoods

### Visualize all Neighborhoods

This is merely meant to give an idea of how the neighborhoods are distributed. I'm not familiar with them, and I guess many other users aren't either.

In [317]:
# use mean values of latitude and longitude to define center of map
latitude = df_complete['Latitude'].mean()
longitude = df_complete['Longitude'].mean()

# generate basic map
nbh_map = folium.Map(location=[latitude, longitude], zoom_start=10)

colors = ['lightblue', 'mediumblue', 'blue', 'darkblue', 'navy',
          'lightgreen', 'palegreen', 'green', 'darkgreen', 'forestgreen',
          'coral', 'red', 'darkred']

for i, borough in enumerate(df_complete['Borough'].unique()):
    # cycle through the palette so more boroughs than colors cannot raise an IndexError
    color = colors[i % len(colors)]
    df_borough = df_complete[df_complete['Borough'] == borough]
    # add one marker per postal code in this borough
    for lat, lng, postal in zip(df_borough['Latitude'], df_borough['Longitude'],
                                df_borough['PostalCode']):
        label = postal + ' ' + borough
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            color='black',
            weight=1,
            popup=label,
            fill=True,
            fill_color=color,
            fill_opacity=1
        ).add_to(nbh_map)

# save and display map
nbh_map.save('nbh_map.html')
display(IFrame(src='nbh_map.html', width=600, height=400))