# Capstone Project for Data Science

In this notebook I will work on my final project for the Data Science course offered by IBM through Coursera.

<h1>The project: Segmenting and Clustering Neighborhoods in Toronto</h1>

In this project, I will segment the neighborhoods of Toronto and then cluster them into similar groups using the k-means clustering algorithm.

<h2>Data acquisition</h2>

I will use data from Wikipedia for this project; the page containing the data is [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The idea is to obtain boroughs and neighborhoods based on their postal codes. To achieve this, I will use the BeautifulSoup and Pandas packages.

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Opening the Wikipedia page
contents = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [3]:
# Reading the contents and creating a Beautiful Soup object
html_contents = contents.read()
parser = BeautifulSoup(html_contents, 'html.parser')

In [4]:
# Creating an empty list that will store the data
toronto_data = []

In [5]:
# The data is stored in the form of a table, so I use the Beautiful Soup object to find it
table = parser.find('tbody')

# Iterating over the table cells (each <td> holds one postal code) and formatting the data
for row in table.find_all('td'):
    # Skip postal codes that have no borough assigned
    if row.span.text == 'Not assigned':
        continue
    observation = {}
    observation['PostalCode'] = row.p.text[0:3]
    observation['Borough'] = (row.p.text[3:]).split('(')[0]
    observation['Neighborhood'] = ((((row.span.text[3:]).split('(')[1]).strip(')').replace('/', ',')).replace(')', ' ').strip(' ')).replace(' ,', ',')
    toronto_data.append(observation)
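
The chained string operations above can be hard to follow. As a rough sketch of the same idea, the snippet below splits one cell's text into its three fields with a regular expression. It assumes the cell text looks like `M5ADowntown Toronto(Regent Park / Harbourfront)`, which is inferred from the slicing above rather than checked against the page. `CELL_PATTERN` and `parse_cell` are hypothetical names.

```python
import re

# Assumed cell layout: postal code, borough, then neighborhoods in parentheses,
# e.g. "M5ADowntown Toronto(Regent Park / Harbourfront)".
CELL_PATTERN = re.compile(r'^\s*([A-Z]\d[A-Z])\s*([^(]*?)\s*(?:\((.*)\))?\s*$', re.DOTALL)

def parse_cell(text):
    """Return a dict with PostalCode, Borough and Neighborhood, or None if the text does not match."""
    match = CELL_PATTERN.match(text)
    if match is None:
        return None
    postal_code, borough, neighborhoods = match.groups()
    # Neighborhoods are assumed to be separated by "/"; join them with ", " instead
    names = [name.strip() for name in (neighborhoods or '').split('/') if name.strip()]
    return {'PostalCode': postal_code, 'Borough': borough, 'Neighborhood': ', '.join(names)}

print(parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)'))
```

Returning `None` for text that does not start with a postal code makes it easy to skip unexpected cells instead of raising an exception partway through the loop.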

In [6]:
data = pd.DataFrame(toronto_data)
data

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [7]:
# Some values in the Borough column have extra text fused onto the borough name; let's fix them
data['Borough']=data['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                         'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                         'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                         'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
data

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
# Let's check the shape of the dataframe
data.shape

(103, 3)

<h2>First attempt: Getting the latitude and longitude for each neighborhood based on their postal code using geocoder</h2>

<b>NOTE: This didn't work </b>

First of all, I need to install the geocoder package so that I can get the latitude and longitude values for each neighborhood.
Although this did not work as expected, I'm keeping this section to show what I tried: the geocoder API didn't return anything, so the loop never ended.

In [None]:
!pip install geocoder

In [None]:
import geocoder

In [None]:
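postal_codes = list(data['PostalCode'])
latitudes = []
longitudes = []
# Look up each postal code in turn
for postal_code in postal_codes:
    lat_lon = None
    # Keep asking until coordinates come back; this never ends if the service always returns None
    while lat_lon is None:
        # f-string so the postal code is substituted into the query
        g = geocoder.google(f'{postal_code}, Toronto, Ontario')
        lat_lon = g.latlng
    latitudes.append(lat_lon[0])
    longitudes.append(lat_lon[1])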
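
Because the `while` loop above retries forever, a lookup that always returns `None` hangs the notebook. One possible guard, sketched below, is to cap the number of attempts and record a missing value instead. `lookup_with_retries` and `fake_lookup` are hypothetical names; `lookup` stands in for whatever geocoding call is used, so nothing here depends on the geocoder package.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or max_attempts is reached."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # Wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None

# Stand-in for a geocoding call that never finds anything
def fake_lookup(query):
    return None

coordinates = lookup_with_retries(fake_lookup, 'M5A, Toronto, Ontario')
latitude, longitude = coordinates if coordinates is not None else (None, None)
print(latitude, longitude)
```

With a cap in place, a postal code that cannot be resolved ends up with missing coordinates, and the loop moves on to the next one.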

<h2>Second attempt: Getting the latitude and longitude using a CSV file</h2>

IBM provided a CSV file containing the coordinates for each postal code, so I'm going to use that data to keep working.

In [9]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
data.sort_values(by='PostalCode', inplace = True)

In [11]:
data = data.merge(lat_long, how='inner', on=lat_long['Postal Code'])

In [12]:
data

Unnamed: 0,key_0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,M1B,Scarborough,"Malvern, Rouge",M1B,43.806686,-79.194353
1,M1C,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",M1C,43.784535,-79.160497
2,M1E,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
...,...,...,...,...,...,...,...
98,M9N,M9N,York,Weston,M9N,43.706876,-79.518188
99,M9P,M9P,Etobicoke,Westmount,M9P,43.696319,-79.532242
100,M9R,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",M9R,43.688905,-79.554724
101,M9V,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",M9V,43.739416,-79.588437
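
The merge above passes a Series as `on`, which appears to pair the rows by position (hence the prior sort) and leaves the extra `key_0` and `Postal Code` columns in the result. A key-based merge is one possible alternative, sketched here on a few rows copied from the tables above. `neighborhoods` and `coordinates` are stand-in names for `data` and `lat_long`, and the sketch assumes the two key columns are named `PostalCode` and `Postal Code` as shown in the outputs.

```python
import pandas as pd

# Small stand-ins for the two tables, deliberately in different row orders
neighborhoods = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge Hill, Port Union, Highland Creek', 'Malvern, Rouge',
                     'Guildwood, Morningside, West Hill'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

# Match rows on the postal code values, so no prior sorting is needed
merged = neighborhoods.merge(coordinates, how='inner',
                             left_on='PostalCode', right_on='Postal Code')
# Drop the duplicated key column and order the rows by postal code
merged = (merged.drop(columns='Postal Code')
                .sort_values('PostalCode')
                .reset_index(drop=True))
print(merged)
```

Matching on the key values keeps each neighborhood attached to its own coordinates even if one table is missing a postal code or is ordered differently.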
