<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

Peer-graded assignment for Coursera's "_Applied Data Science Capstone_" course, the last in the _IBM Data Science Professional Certificate_ specialization.

<h2>Question 1: Getting the list of Neighborhoods</h2>

<b>OBJECTIVE:</b> Create a dataframe with all of Toronto's postcodes along with their respective boroughs and neighborhoods.<br>
<b>METHOD:</b> Web scraping of a Wikipedia page.<br>
<b>RESULTS:</b>

The following function uses Beautiful Soup to scrape tables from a given URL and returns a dataframe with the data:

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(url, table_number=0):
    """Download `url` and return its `table_number`-th HTML table as a dataframe."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find_all('table')[table_number]

    rows = []
    column_names = []

    for row in table.find_all('tr'):
        # Header cells (<th>) provide the column names
        for th in row.find_all('th'):
            column_names.append(th.get_text().strip())

        # Data cells (<td>) provide one record per table row
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            rows.append([td.get_text().strip() for td in td_tags])

    df = pd.DataFrame.from_records(rows)
    df.columns = column_names

    return df
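
The parsing logic can be checked without a network call by feeding it a small HTML string. The sketch below is a variant written for illustration, not the notebook's function: `parse_html_table` and `SAMPLE_HTML` are hypothetical names, the markup only imitates the shape of the Wikipedia table, and the header is read from the first row that contains `<th>` cells so that later header rows cannot lengthen the column list.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_html_table(html_doc, table_number=0):
    """Parse the table_number-th <table> found in html_doc into a dataframe."""
    soup = BeautifulSoup(html_doc, 'html.parser')
    table = soup.find_all('table')[table_number]

    column_names = []
    rows = []
    for row in table.find_all('tr'):
        th_tags = row.find_all('th')
        # Only the first row with <th> cells is treated as the header.
        if th_tags and not column_names:
            column_names = [th.get_text().strip() for th in th_tags]
        td_tags = row.find_all('td')
        if td_tags:
            rows.append([td.get_text().strip() for td in td_tags])

    # Fall back to default integer column labels if no header row was found.
    return pd.DataFrame(rows, columns=column_names or None)

df_sample = parse_html_table(SAMPLE_HTML)
print(df_sample)
```

Separating the download from the parsing in this way makes the parsing step easy to test on saved or hand-written HTML.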

First we use the function above to scrape the Wikipedia page.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df_neighborhoods = scrape_table(url,0)
shape1 = df_neighborhoods.shape
print('We scraped a table with {} rows and {} columns.\n'.format(shape1[0],shape1[1]))
df_neighborhoods.head()

We scraped a table with 289 rows and 3 columns.



| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


We will work with this dataframe to clean it and format it according to the assignment's requirements.

In [3]:
# Removing entries for which Borough is 'Not assigned'
df_neighborhoods = df_neighborhoods[df_neighborhoods['Borough'] != 'Not assigned'].reset_index(drop=True)

shape2 = df_neighborhoods.shape
print('{} entries were removed because they didn\'t have useful information.'.format(shape1[0]-shape2[0]))
print('The dataframe now has {} rows. Each corresponds to a neighborhood.\n'.format(shape2[0]))

df_neighborhoods.head()

77 entries were removed because they didn't have useful information.
The dataframe now has 212 rows. Each corresponds to a neighborhood.



| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


In [4]:
# Where the neighbourhood is 'Not assigned', use the name of its borough instead
unassigned = df_neighborhoods['Neighbourhood'] == 'Not assigned'
df_neighborhoods.loc[unassigned, 'Neighbourhood'] = df_neighborhoods.loc[unassigned, 'Borough']

In [5]:
# Now we need to create a table in which each Postcode appears just once.
# Multiple neighbourhoods must be merged in a single row
# The final dataframe should have 1 row for each unique postcode

postcode_list = df_neighborhoods['Postcode'].unique()

num_unique_postcodes = len(postcode_list)
print('We have {} unique postcodes.\n'.format(num_unique_postcodes))

We have 103 unique postcodes.



In [6]:
borough_list       = []
neighbourhood_list = []

for postcode in postcode_list:
    
    borough_list.append( df_neighborhoods[df_neighborhoods['Postcode']==postcode]['Borough'].unique()[0] )
    
    tmp_list=[]
    for neighbourhood in df_neighborhoods[ df_neighborhoods['Postcode']==postcode ]['Neighbourhood']:
        tmp_list.append(neighbourhood)
    neighbourhood_list.append(', '.join( tmp_list ))

df_neighborhoods_combined = pd.DataFrame.from_records( zip(postcode_list,borough_list,neighbourhood_list) )
df_neighborhoods_combined.columns = ['Postcode','Borough','Neighbourhoods']

df_neighborhoods_combined.head(12)

| | Postcode | Borough | Neighbourhoods |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [7]:
df_neighborhoods_combined.shape

(103, 3)
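
The same one-row-per-postcode result can probably be reached with a pandas `groupby`, which avoids filtering the dataframe once per postcode. The following is a minimal sketch under assumptions, not the code used in this notebook: `df_toy` is a hypothetical dataframe built from a few rows shown in the tables above, and the sketch assumes that each postcode belongs to exactly one borough (which it checks first).

```python
import pandas as pd

# Hypothetical toy data in the same shape as df_neighborhoods.
df_toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# Assumption check: every postcode maps to exactly one borough.
assert (df_toy.groupby('Postcode')['Borough'].nunique() == 1).all()

# sort=False keeps the postcodes in order of first appearance.
df_toy_combined = (
    df_toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index()
          .rename(columns={'Neighbourhood': 'Neighbourhoods'})
)
print(df_toy_combined)
```

Grouping on both `Postcode` and `Borough` keeps the borough column in the result without a separate lookup.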

<hr>

<h2>Question 2: Getting the coordinates for each postcode</h2>

<b>OBJECTIVE:</b> Get the coordinates (latitude and longitude) of each of Toronto's neighborhoods.<br>
<b>METHOD:</b> Geocoding APIs.<br>
<b>RESULTS:</b>
It is no longer possible to use the Google Maps API for free, and the provided alternative failed to return coordinates.<br>
I looked for other alternatives but didn't find a service that can recognize all neighborhoods.<br>

Therefore, I'm using the file provided by the instructor.

In [8]:
df_postcodes = pd.read_csv('Geospatial_Coordinates.csv')
df_postcodes.columns = ['Postcode','Latitude','Longitude']
df_postcodes.head(12)

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


In [9]:
df_postcodes.shape

(103, 3)

Making a single dataframe with all the information we have at the moment.

In [10]:
df_geoinfo = pd.merge( df_neighborhoods_combined , df_postcodes, on=['Postcode'], how='inner')
df_geoinfo.head(12)

| | Postcode | Borough | Neighbourhoods | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |


In [11]:
df_geoinfo.shape

(103, 5)
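
An inner merge silently drops any postcode that appears in only one of the two dataframes. Here the row count stays at 103, which suggests nothing was lost, but an explicit check can be sketched with the `indicator` option of `pd.merge`. The dataframes below are hypothetical toy data: `M5A` is deliberately left without coordinates and `M1B` without a borough so that the check has something to report.

```python
import pandas as pd

# Hypothetical toy inputs; the mismatches are intentional.
df_left = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
df_right = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

# An outer merge with indicator=True adds a '_merge' column that says
# whether each row came from the left side, the right side, or both.
df_check = pd.merge(df_left, df_right, on='Postcode', how='outer', indicator=True)

# Rows not marked 'both' would be dropped by an inner merge.
unmatched = df_check[df_check['_merge'] != 'both']
print(unmatched[['Postcode', '_merge']])
```

An empty `unmatched` dataframe would mean that the inner merge kept every postcode from both sides.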

<hr>