<a href="https://colab.research.google.com/github/agmorcillo/Coursera_Capstone/blob/main/Segmenting_and_Clustering_Neighborhoods_in_Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Segmenting and Clustering Neighborhoods in Toronto**

# **Part 1**

### Loading libraries - Part 1

In [58]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np

### Scraping the URL using BeautifulSoup

In [59]:
# Fetch the Wikipedia page with requests
html = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(html, 'lxml')

# Take the first table on the page and collect its rows
table = soup.find("table")
table_rows = table.tbody.find_all("tr")



clean = []
for tr in table_rows:
    cells = tr.find_all("td")
    row = [cell.text for cell in cells]

    # Keep only rows with an assigned borough; skip rows whose borough is "Not assigned".
    if row != [] and row[1] != "Not assigned\n":
        # If the borough is assigned but the neighbourhood is "Not assigned", use the borough name as the neighbourhood.
        if "Not assigned\n" in row[2]:
            row[2] = row[1]
        clean.append(row)

# Dataframe with the main columns
df = pd.DataFrame(clean, columns = ["PostalCode", "Borough", "Neighbourhood"])

# Remove "\n" from the Neighbourhood, Borough and PostalCode columns

df["Neighbourhood"] = df["Neighbourhood"].str.replace("\n","")
df["Borough"] = df["Borough"].str.replace("\n","")
df["PostalCode"] = df["PostalCode"].str.replace("\n","")

df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
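
The filtering rules used in the loop above can be sketched as a small standalone function. This is only an illustration: `clean_rows` and the sample rows (including the made-up code `M9Z` and `Example Borough`) are hypothetical, and the sketch strips whitespace before comparing instead of matching `"Not assigned\n"` directly.

```python
def clean_rows(rows):
    """Drop rows without a borough; fall back to the borough name
    when the neighbourhood is 'Not assigned'."""
    cleaned = []
    for postal_code, borough, neighbourhood in rows:
        # Strip the trailing newline that each scraped cell carries.
        postal_code = postal_code.strip()
        borough = borough.strip()
        neighbourhood = neighbourhood.strip()
        if borough == "Not assigned":
            continue  # no borough: ignore the row entirely
        if neighbourhood == "Not assigned":
            neighbourhood = borough  # reuse the borough name
        cleaned.append([postal_code, borough, neighbourhood])
    return cleaned


# Hypothetical rows shaped like the scraped table cells.
sample_rows = [
    ["M1A\n", "Not assigned\n", "Not assigned\n"],
    ["M3A\n", "North York\n", "Parkwoods\n"],
    ["M9Z\n", "Example Borough\n", "Not assigned\n"],
]
cleaned_rows = clean_rows(sample_rows)
print(cleaned_rows)
```

Stripping the cells up front would also make the three later `str.replace("\n", "")` calls unnecessary, at the cost of departing slightly from the notebook's original flow.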


**Group all neighborhoods using postal code**

In [60]:
df = df.groupby(["PostalCode", "Borough"])["Neighbourhood"].apply(", ".join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
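
To see what the `groupby(...).apply(", ".join)` step does when a postal code appears on more than one row, here is a minimal sketch on a toy frame. The three rows are invented for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Toy data: "M5A" appears twice with different neighbourhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Victoria Village"],
})

# One row per (PostalCode, Borough); neighbourhoods joined with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Because `groupby` sorts by key by default, the result is ordered by postal code, which is why the grouped frame above starts at `M1B` rather than `M3A`.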


## **Question 1 Final Answer:**

In [61]:
print("Shape: rows and columns", df.shape)

Shape: rows and columns (103, 3)


# **Part 2**

### Loading libraries - Part 2

In [62]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

**Load the coordinates from the Geospatial_Coordinates.csv file:**

In [63]:
df_geo_coor = pd.read_csv("sample_data/Geospatial_Coordinates.csv")
df_geo_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [64]:
# Merge Latitude and Longitude using PostalCode
df_toronto = pd.merge(df, df_geo_coor, how='left', left_on = 'PostalCode', right_on = 'Postal Code')

# drop the duplicate "Postal Code" column that came from the coordinates file
df_toronto.drop("Postal Code", axis=1, inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
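
A left merge silently leaves `NaN` coordinates for any postal code that is missing from the coordinates file. One way to check for that, sketched here on toy frames, is the `indicator=True` option of `pd.merge`. The frames and the code `M0X` are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M0X"]})  # "M0X" is a made-up code
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(
    left, right, how="left",
    left_on="PostalCode", right_on="Postal Code",
    indicator=True,
)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.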


**How many postal-code areas (rows) there are in each borough**

In [65]:
df_toronto.groupby('Borough').count()['Neighbourhood']

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Scarborough         17
West Toronto         6
York                 5
Name: Neighbourhood, dtype: int64

**Estimate Toronto's centre from the dataset itself by averaging Latitude and Longitude**

In [66]:
lat_toronto = df_toronto['Latitude'].mean()
lon_toronto = df_toronto['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(lat_toronto, lon_toronto))

The geographical coordinates of Toronto are 43.70460773398059, -79.39715291165048


# **Part 3**

## Loading libraries - Part 3

In [67]:
from pandas import json_normalize  # transform JSON data into a pandas DataFrame
import folium # map rendering library

# import k-means from clustering stage
from sklearn.cluster import KMeans

**Get the latitude and longitude of Toronto**

In [68]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto city are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto city are 43.6534817, -79.3839347.
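
The geocoded centre differs from the mean of the postal-code coordinates printed earlier. As a rough sketch, the gap between the two can be measured with the haversine formula. The helper `haversine_km` and the Earth radius of 6371 km are assumptions introduced here; the two coordinate pairs are the values printed in this notebook.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(a))


mean_centre = (43.70460773398059, -79.39715291165048)  # mean of the dataset
geocoded_centre = (43.6534817, -79.3839347)            # Nominatim result
offset_km = haversine_km(*mean_centre, *geocoded_centre)
print(round(offset_km, 2))
```

The two estimates sit a few kilometres apart, so which one is used as the map centre shifts the initial view slightly but does not affect the clustering.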


**Create a base map of Toronto centred on the geocoded coordinates**

In [69]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

# **Using KMeans clustering for the clustering of the neighbourhoods**

In [74]:
k = 5
# cluster on the coordinates only (Latitude and Longitude)
toronto_clustering = df_toronto.drop(columns=['PostalCode', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# store each row's cluster label as the first column
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)

In [75]:
df_toronto

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,2,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,2,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,2,M1G,Scarborough,Woburn,43.770992,-79.216917
4,2,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...,...
98,0,M9N,York,Weston,43.706876,-79.518188
99,0,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,0,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,0,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
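
For readers who want to see what the clustering step does to latitude and longitude pairs, below is a minimal NumPy sketch of Lloyd's algorithm, the iteration behind k-means. It is not scikit-learn's implementation (which adds k-means++ initialisation and multiple restarts); `lloyd_kmeans` and the six toy points are hypothetical.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups of (latitude, longitude) pairs.
toy_points = np.array([
    [43.80, -79.20], [43.78, -79.18], [43.79, -79.21],
    [43.65, -79.55], [43.66, -79.53], [43.64, -79.56],
])
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

Since the only features are coordinates, the resulting clusters are purely geographic: nearby postal codes share a label, as the Scarborough and Etobicoke rows above suggest.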


# **Create a Map with the clusters**

In [76]:
# matplotlib supplies the colour map used below
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# set color scheme for the clusters: one rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map; cluster labels run from 0 to k-1, so they index rainbow directly
for lat, lon, neighbourhood, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighbourhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters