# Coursera Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto - Part 1


# Table of Contents

<div class = "alert alert-block alert-info" style = "margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Imports</a>

2.  <a href="#item2">Loading the data</a>

3.  <a href="#item3">Scraping the data</a>

4.  <a href="#item4">Data Wrangling</a>

5.  <a href="#item5">Loading Foursquare Credentials</a>

6.  <a href="#item6">Getting Latitude and Longitude</a>

7.  <a href="#item7">Pre Processing</a>

8.  <a href="#item8">Clustering Neighborhoods</a>

9.  <a href="#item9">Analyzing the clusters</a>

</font>
</div>

# 1. Imports

In [1]:
!conda install --yes beautifulsoup4

!pip install lxml

!conda install -c conda-forge folium=0.5.0 --yes

!conda install -c conda-forge geopy --yes


In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files


from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import silhouette_samples, silhouette_score

# import k-means from clustering stage
from sklearn.cluster import KMeans

from bs4 import BeautifulSoup
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


# 2. Loading data

## 2.1 Loading Canada's Postal code, Borough and Neighbourhood information

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text 
soup = BeautifulSoup(source, 'lxml') # parse the page as HTML with the lxml parser installed above
table = soup.find('table')

In [4]:
columns_names = ['Postalcode','Borough','Neighbourhood']
df = pd.DataFrame(columns = columns_names)


## 2.2 Loading the Geospatial Coordinates

In [5]:
dfgc = pd.read_csv('Geospatial_Coordinates.csv')
dfgc.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


# 3 Scraping the data

In [6]:
for tr in table.find_all('tr'):
    row_data=[]
    for td in tr.find_all('td'):
        row_data.append(td.text.strip())         
    if len(row_data)==3:
        df.loc[len(df)] = row_data
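
Appending with `df.loc[len(df)]` works, but it grows the DataFrame one row at a time. A minimal sketch of the same three-cell filter that collects the rows in a list and builds the DataFrame once is shown below. The `scraped_rows` list and the name `df_sketch` are made up for illustration; they only mimic the shape of the cell texts pulled from each `tr`.

```python
import pandas as pd

# Hypothetical stand-in for the cell texts extracted from each <tr>;
# the first entry mimics the header row, which has no <td> cells.
scraped_rows = [
    [],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
    ['M5A', 'Downtown Toronto'],  # malformed row with a missing cell
]

column_names = ['Postalcode', 'Borough', 'Neighbourhood']

# Keep only complete rows, then build the DataFrame in a single step
complete_rows = [row for row in scraped_rows if len(row) == len(column_names)]
df_sketch = pd.DataFrame(complete_rows, columns=column_names)

print(df_sketch)
```

The header row and the malformed row are skipped for the same reason as in the loop above: neither yields exactly three cells.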


# 4 Data Wrangling

## 4.1 Replacing the "Not assigned" values in the Neighborhood column

In [7]:
# Where a neighbourhood is "Not assigned", use the borough name instead
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

## 4.2 Dropping the rows with "Not assigned" in the Borough column

In [8]:
# Keep only the rows whose borough is assigned
df = df[df['Borough'] != 'Not assigned']

In [9]:
df = df.reset_index(drop = True) # drop=True keeps the old index from being added as a column
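
The wrangling rules in this section (a neighbourhood without a name takes its borough's name, rows without a borough are dropped, and the index is rebuilt) can be checked on a toy table. This is only a sketch: the data is made up, and `clean_postal_table` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy data (made up for illustration) covering the three cases in the table
raw = pd.DataFrame({
    'Postalcode': ['M1A', 'M3A', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

def clean_postal_table(frame):
    """Drop unassigned boroughs, then fill unassigned neighbourhoods."""
    cleaned = frame[frame['Borough'] != 'Not assigned'].copy()
    missing = cleaned['Neighbourhood'] == 'Not assigned'
    cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']
    return cleaned.reset_index(drop=True)

cleaned = clean_postal_table(raw)
print(cleaned)
```

Using a boolean mask with `.loc` writes the change back into the frame, which a bare `Series.replace(...)` call without assignment does not.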

In [10]:
df.head()

|   | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [11]:
df.shape

(103, 3)

In [12]:
dfgc.rename(columns={'Postal Code' : 'Postalcode'}, inplace = True)

## 4.3 Merging the two dataframes

In [13]:
fulldf = pd.merge(df, dfgc, on = 'Postalcode') # inner join on the shared Postalcode column
fulldf.head()

|   | Postalcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
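
`pd.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would disappear from the result without any warning. A minimal sketch of one way to check for that, using made-up toy frames (`postal`, `coords`) rather than the real data:

```python
import pandas as pd

# Toy frames (made-up codes and coordinates) with one code missing on each side
postal = pd.DataFrame({'Postalcode': ['M3A', 'M4A', 'M0Z'],
                       'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Nowhere']})
coords = pd.DataFrame({'Postalcode': ['M3A', 'M4A', 'M1B'],
                       'Latitude': [43.75, 43.73, 43.81],
                       'Longitude': [-79.33, -79.32, -79.19]})

# A left join keeps every postal-code row; indicator=True adds a `_merge`
# column that says whether a match was found in `coords`.
checked = postal.merge(coords, on='Postalcode', how='left',
                       indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()

print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join dropped nothing; `validate='one_to_one'` additionally raises if a postal code appears twice on either side.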


In [14]:
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names)

## 4.4 Creating a new Dataframe

In [15]:
# Keep only the columns used for clustering and rebuild the index
toronto_data = fulldf.drop(columns = ['Postalcode', 'Borough']).reset_index(drop = True)
toronto_data.head()

|   | Neighbourhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 |
| 1 | Victoria Village | 43.725882 | -79.315572 |
| 2 | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


# 5 Loading Foursquare Credentials

In [20]:

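import os

# Read the credentials from the environment rather than hard-coding them
# (the environment variable names are an assumption; use whatever you have set)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604' # Foursquare API version date
LIMIT = 30 # maximum number of venues returned per request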



# 6 Getting Latitude and Longitude

In [18]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


# 7 Pre Processing

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )
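
The flattening step inside `getNearbyVenues` can be tried without network access or credentials. The sketch below runs it on a mock payload that contains only the fields the function reads; the venue values are made up, and `flatten_venues` is a hypothetical helper introduced for illustration.

```python
# Mock payload shaped like the fields the function reads:
# response -> groups[0] -> items -> venue -> name / location / categories.
# The venue values are made up for illustration.
mock_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Cafe A',
                           'location': {'lat': 43.7501, 'lng': -79.3301},
                           'categories': [{'name': 'Coffee Shop'}]}},
                {'venue': {'name': 'Park B',
                           'location': {'lat': 43.7510, 'lng': -79.3290},
                           'categories': [{'name': 'Park'}]}},
            ]
        }]
    }
}

def flatten_venues(name, lat, lng, payload):
    """Return one tuple per venue, in the column order used by getNearbyVenues."""
    items = payload['response']['groups'][0]['items']
    return [
        (name, lat, lng,
         item['venue']['name'],
         item['venue']['location']['lat'],
         item['venue']['location']['lng'],
         item['venue']['categories'][0]['name'])
        for item in items
    ]

rows = flatten_venues('Parkwoods', 43.753259, -79.329656, mock_payload)
for row in rows:
    print(row)
```

Each neighbourhood contributes one tuple per venue, so the final DataFrame has as many rows as venues returned, not as many rows as neighbourhoods.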

## Analyze Neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()
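
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues in each category. A small sketch on made-up data (the `venues` frame is not Foursquare output) makes the numbers easy to verify by hand:

```python
import pandas as pd

# Toy venue list (made up): two neighbourhoods, three categories
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Park', 'Park'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Neighbourhood A has four venues, two of them cafes, so its `Cafe` value is 0.5; every row of category shares sums to 1.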

In [None]:
# Return the names of the num_top_venues most frequent venue categories in one row
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()
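
The `indicators` list assigns 'st', 'nd' and 'rd' by position, which is enough for ten columns but would label a 21st column "21th". If more columns were ever needed, a small helper could build the labels instead. `ordinal` below is a hypothetical helper, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(rank))
    for rank in range(1, num_top_venues + 1)
]
print(columns[:4])
```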

# 8 Clustering Neighborhoods

In [None]:
toronto_grouped_clustering = toronto_grouped.drop(columns = 'Neighbourhood')

In [None]:
def plot(x, y, xlabel, ylabel):
    plt.figure(figsize=(20,10))
    plt.plot(np.arange(2, x), y, 'o-')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(np.arange(2, x))
    plt.show()

## 8.1 Finding the Best K

In [None]:
max_range = 15 #Max range 15 (number of clusters)

indices = []
scores = []

for toronto_clusters in range(2, max_range) :
    
    # Run k-means clustering and keep the cluster label of each neighbourhood
    toronto_gc = toronto_grouped_clustering
    labels = KMeans(n_clusters = toronto_clusters, init = 'k-means++', random_state = 0).fit_predict(toronto_gc)
    
    # Mean silhouette score for this number of clusters
    score = silhouette_score(toronto_gc, labels)
    
    # Appending the index and score to the respective lists
    indices.append(toronto_clusters)
    scores.append(score)

In [None]:
plot(max_range, scores, "No. of clusters", "Silhouette Score")

In [None]:
opt_value = 13
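
Instead of reading the value off the plot, the number of clusters can also be picked in code by taking the k with the highest mean silhouette score. The sketch below does this on synthetic data (three well-separated blobs, not the Toronto venues), so the result is known in advance; `points`, `candidate_ks` and `best_k` are names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (not the Toronto data): three tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

candidate_ks = list(range(2, 7))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=0).fit_predict(points)
    scores.append(silhouette_score(points, labels))

# Pick the k with the highest mean silhouette score
best_k = candidate_ks[int(np.argmax(scores))]
print(best_k)
```

On real venue data the curve is usually much flatter than in this toy case, so it is still worth looking at the plot before trusting the maximum.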

## 8.2 Using the Best K

In [None]:
toronto_clusters = opt_value

# Run k-means clustering
toronto_gc = toronto_grouped_clustering
kmeans = KMeans(n_clusters = toronto_clusters, init = 'k-means++', random_state = 0).fit(toronto_gc)

In [None]:
# Add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
toronto_final = toronto_data
toronto_final = toronto_final.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_final.dropna(inplace = True)
toronto_final['Cluster Labels'] = toronto_final['Cluster Labels'].astype(int)
toronto_final.head()

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# Set up one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, toronto_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(toronto_final['Latitude'], toronto_final['Longitude'], toronto_final['Neighbourhood'], 
                                  toronto_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster ' + str(cluster + 1) + ')', parse_html=True)
    map_clusters.add_child(
        folium.features.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7))
       
map_clusters

# 9 Analyzing the clusters

In [None]:
val = 1
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 2
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 3
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 4
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 5
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 6
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 7
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 8
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 9
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 10
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 11
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 12
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]

In [None]:
val = 13
toronto_final.loc[toronto_final['Cluster Labels'] == (val - 1), toronto_final.columns[[0] + np.arange(4, toronto_final.shape[1]).tolist()]]
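
The thirteen cells above differ only in `val`, so the same tables could be produced in one pass with `groupby`. A minimal sketch on a toy stand-in for `toronto_final` (the values are made up, and `final_sketch` and `cluster_tables` are names introduced for illustration):

```python
import pandas as pd

# Toy stand-in for toronto_final (values made up for illustration)
final_sketch = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Latitude': [43.70, 43.71, 43.72, 43.73],
    'Longitude': [-79.40, -79.41, -79.42, -79.43],
    'Cluster Labels': [0, 1, 0, 2],
    '1st Most Common Venue': ['Cafe', 'Park', 'Cafe', 'Bar'],
})

# Same column selection as above: the name plus everything after the label column
display_columns = ['Neighbourhood'] + list(final_sketch.columns[4:])

cluster_tables = {
    int(label) + 1: group[display_columns]  # 1-based, to match the `val` counter
    for label, group in final_sketch.groupby('Cluster Labels')
}

for number, table in cluster_tables.items():
    print('Cluster {}: {} neighbourhood(s)'.format(number, len(table)))
```

`cluster_tables[1]` then holds the same rows as the `val = 1` cell, and the loop adapts automatically if the number of clusters changes.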