<h1 align=center><font size = 5>Battle of Neighborhood (Segmenting and Clustering Neighborhood)</font></h1> 

My name is Muhammad Ariqleesta Hidayat (github: https://github.com/ariqleesta). This notebook is my project for the Applied Data Science Capstone course of the IBM Data Science Professional Certificate on Coursera. The objective of this project is to segment neighborhoods based on the distance between them.

In [138]:
# Environment
import os

# Random
import random
random.seed(2)

# Data Manipulation Tools
import pandas as pd
import numpy as np

# Data Visualization Tools
import matplotlib.pyplot as plt 
import seaborn as sns
from colour import Color
%matplotlib inline

# libraries for displaying images
from IPython.display import HTML, Image, display


# Location Modules
import folium
from geopy.geocoders import Nominatim

# Web Scraping Modules
import requests 
from bs4 import BeautifulSoup as bs

folium.__version__

'0.5.0'

In case I need to make Foursquare API calls, I store my credentials first. Because I have to share this notebook for grading, I enter them with getpass so that they are not saved in the notebook.

In [None]:
import getpass

# Foursquare API Access
CLIENT_ID = getpass.getpass('Enter your Foursquare Client Id') # your Foursquare ID
CLIENT_SECRET = getpass.getpass('Enter your Foursquare Secret') # your Foursquare Secret
ACCESS_TOKEN = getpass.getpass('Enter your Foursquare Access Token') # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 15

Enter your Foursquare Client Id ················································
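
Another way to keep secrets out of a shared notebook is to read them from environment variables and fall back to a hidden prompt only when a variable is missing. The sketch below is one possible approach, not something Foursquare prescribes: the helper `read_secret` and the variable name `FOURSQUARE_CLIENT_ID` are hypothetical, and the dummy value is set in the example only so that it runs without prompting.

```python
import getpass
import os


def read_secret(env_name, prompt):
    """Return the secret stored in env_name, or prompt for it if unset.

    Both this helper and the variable names passed to it are hypothetical.
    """
    value = os.environ.get(env_name)
    if value:
        return value
    # getpass hides the typed value, so it never shows up in the cell output
    return getpass.getpass(prompt)


# Dummy value for illustration; a real run would set the variable outside the notebook
os.environ["FOURSQUARE_CLIENT_ID"] = "dummy-client-id"
client_id = read_secret("FOURSQUARE_CLIENT_ID", "Enter your Foursquare Client Id")
print("Client id loaded:", bool(client_id))
```

With this pattern the notebook can be re-run without retyping the credentials, and nothing secret is written into a cell.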


## Gathering and Cleaning the Data
I need to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.<br/>
First I request the page and parse the HTML with the *html.parser* parser from BeautifulSoup. Then I get the "table" tag from the HTML using the **BeautifulSoup.find_all()** method and read the extracted table into a dataframe. Next I drop the rows whose Borough is **"Not assigned"**. If a Neighbourhood is **"Not assigned"** but has a Borough name, I replace the **"Not assigned"** value with **its Borough**.

In [67]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Scraping and form into pandas dataframe
html = requests.get(url).text
soup = bs(html, 'html.parser')
table = soup.find_all('table',class_="wikitable sortable")
df = pd.read_html(str(table))[0]

# Drop the rows whose Borough is "Not assigned"
df.drop(index = df[df["Borough"] == "Not assigned"].index, inplace = True)

# Count the "Not assigned" Neighbourhood values that would need to be replaced by their Borough
print("Number of \"Not assigned\" Neighborhood: ",len(df[df["Neighbourhood"] == "Not assigned"]))
# There are no "Not assigned" Neighbourhood values, so nothing needs to be replaced

# Let's see the shape of the data
print("This data has {} rows and {} columns".format(df.shape[0], df.shape[1]) )

Number of "Not assigned" Neighborhood:  0
This data has 103 rows and 3 columns
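
Since the scraped table had no "Not assigned" neighbourhoods, the replacement rule described above never had to run. As a minimal sketch of how both cleaning rules could be applied, the example below uses a small toy dataframe; the rows (including the postal code "M9Z") are made up for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy rows standing in for the scraped table (made-up values)
raw = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M3A", "M5A", "M9Z"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto", "Etobicoke"],
        "Neighbourhood": [
            "Not assigned",
            "Parkwoods",
            "Regent Park, Harbourfront",
            "Not assigned",
        ],
    }
)

# Rule 1: drop rows whose Borough is "Not assigned"
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: a "Not assigned" Neighbourhood takes the name of its Borough
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

clean = clean.reset_index(drop=True)
print(clean)
```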


In [68]:
# Let's see the 5 rows of the data
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## Getting Coordinates of each Postal Code
Before utilizing the Foursquare location data, I need to get the **latitude** and **longitude** coordinates of each neighborhood. I can find the location of each postal code by using the Python geocoder package. However, this package can be very unreliable, so I sometimes have to retry several times before it returns the geographical coordinates of a given postal code. In this case, I can simply get the locations from this .csv file: http://cocl.us/Geospatial_data

In [69]:
location = pd.read_csv("http://cocl.us/Geospatial_data")
location.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
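
If the geocoding route were used instead of the CSV file, the persistence mentioned above could be wrapped in a retry loop. This is only a sketch under stated assumptions: `geocode_with_retry` is a hypothetical helper, and `lookup` is a placeholder for whatever function queries the geocoding service, not a call from a real library. A fake lookup that fails twice stands in for the unreliable service.

```python
import time


def geocode_with_retry(lookup, postal_code, max_attempts=5, delay_seconds=1.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out.

    lookup is any callable returning a (latitude, longitude) tuple or None;
    it is a placeholder, not a function from a real geocoding library.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            # Wait a little before asking the service again
            time.sleep(delay_seconds)
    return None


# Stand-in lookup that fails twice before answering, to mimic a flaky service
calls = {"count": 0}


def flaky_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)  # M1B coordinates as listed in the CSV output


result = geocode_with_retry(flaky_lookup, "M1B", delay_seconds=0.0)
print(result, "after", calls["count"], "attempts")
```

Capping the number of attempts keeps the notebook from hanging forever when the service never answers; the caller then has to handle a `None` result.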


Now I can join these two dataframes with respect to the postal code.

In [70]:
df_new = pd.merge(df, location, how = "left", on = "Postal Code")
df_new.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
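
A left join keeps every neighbourhood row even when no coordinates are found for it, so it can be worth checking the result. The sketch below shows one way to do that on toy data with the `validate` and `indicator` parameters of `pd.merge`; the postal code "M0X" and its borough are made up to force an unmatched row.

```python
import pandas as pd

neighbourhoods = pd.DataFrame(
    {
        # "M0X" is a made-up code with no coordinates
        "Postal Code": ["M3A", "M4A", "M0X"],
        "Borough": ["North York", "North York", "Example Borough"],
    }
)
coordinates = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# validate raises an error if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row was found
merged = pd.merge(
    neighbourhoods,
    coordinates,
    how="left",
    on="Postal Code",
    validate="one_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", unmatched)
```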


## Creating a Map
I use the folium package to create a map. First I need to center the map on **Toronto, Ontario**. Then I plot the location of each neighborhood on that map. I can get the coordinates of Toronto, Ontario by using the **geopy.geocoders.Nominatim** geolocator from the geopy module.

In [95]:
center = 'Toronto, Ontario'
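geolocator = Nominatim(user_agent="foursquare_agent")
# Use a separate name so the coordinates dataframe `location` is not overwritten
center_location = geolocator.geocode(center)
latitude = center_location.latitude
longitude = center_location.longitude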
print(latitude, longitude)

43.6534817 -79.3839347


Now I can create a map centered on this latitude and longitude. I name the map **"neigh_map"** and set the starting zoom to 11, which I think is the most appropriate. Then I loop through the dataframe to plot each postal code's location on **neigh_map**. I want to see the segmentation by Borough, so I give each Borough a different color.

In [137]:
neigh_map = folium.Map(location = [latitude, longitude], zoom_start = 11,)

# Get all unique Borough
allBorough = df_new.Borough.unique().tolist()

# Generate a color palette and associate each Borough with its own color
start_color = Color('#33ccff')
colors = list(start_color.range_to(Color('#ff99cd'),len(allBorough)))
# hex_l always gives the long '#rrggbb' form, so no color gets filtered out
colors = [c.hex_l for c in colors]

for lat, lng, post, bor in zip(df_new.Latitude, df_new.Longitude, df_new["Postal Code"], df_new.Borough):
    #print(lat, lng, post)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color=colors[allBorough.index(bor)],
        popup=folium.Popup(post + ' - ' + bor),
        fill = True,  # fill the circle so that fill_color and fill_opacity take effect
        fill_color=colors[allBorough.index(bor)],
        fill_opacity=0.6,
    ).add_to(neigh_map)
    
display(neigh_map)
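
The palette step above can also be sketched without the colour package. The example below is an illustration rather than a drop-in replacement: the helpers `hex_to_rgb` and `linear_palette` are hypothetical, and they interpolate in RGB space, so the intermediate shades may differ from those produced by `Color.range_to`. Storing the result in a dictionary keyed by borough also avoids the repeated `allBorough.index(bor)` lookups.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to a tuple of three ints in 0..255."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def linear_palette(start_hex, end_hex, n):
    """Return n '#rrggbb' colors evenly spaced between two colors in RGB space."""
    start, end = hex_to_rgb(start_hex), hex_to_rgb(end_hex)
    palette = []
    for i in range(n):
        # t runs from 0.0 (start color) to 1.0 (end color)
        t = i / (n - 1) if n > 1 else 0.0
        rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
        palette.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return palette


# Example borough names; the notebook would pass its own allBorough list
boroughs = ["North York", "Downtown Toronto", "Scarborough"]
borough_color = dict(zip(boroughs, linear_palette("#33ccff", "#ff99cd", len(boroughs))))
print(borough_color)
```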

Now I can easily do the segmentation with **FastMarkerCluster** from **folium.plugins**. Marker clusters are a good way to simplify a map containing many markers. When the map is zoomed out, nearby markers are combined into a cluster, which separates again as the map is zoomed in. Marker clusters are also easy to set up and use in Folium.

In [92]:
from folium.plugins import FastMarkerCluster

neigh_map = folium.Map(location = [latitude, longitude], zoom_start = 10,)

neigh_map.add_child(FastMarkerCluster(df_new[['Latitude', 'Longitude']].values.tolist()))
                                            
neigh_map
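
To give a feel for why fewer clusters appear when the map is zoomed out, the sketch below groups points into square grid cells of different sizes. It is a simplified illustration only and is not meant to reproduce the algorithm the plugin uses; the helper `grid_clusters` and the cell sizes are assumptions chosen for the example.

```python
from collections import defaultdict


def grid_clusters(points, cell_size_deg):
    """Group (lat, lng) points into square cells cell_size_deg degrees wide.

    Returns a dict mapping a cell index (row, col) to the points inside it.
    """
    clusters = defaultdict(list)
    for lat, lng in points:
        cell = (int(lat // cell_size_deg), int(lng // cell_size_deg))
        clusters[cell].append((lat, lng))
    return dict(clusters)


# First five coordinates from the merged table above
points = [
    (43.753259, -79.329656),
    (43.725882, -79.315572),
    (43.654260, -79.360636),
    (43.718518, -79.464763),
    (43.662301, -79.389494),
]

# A coarse grid stands in for a zoomed-out map, a fine grid for a zoomed-in one
for cell_size in (0.5, 0.05, 0.005):
    print(cell_size, "->", len(grid_clusters(points, cell_size)), "clusters")
```

With the coarsest grid all five points share one cell, while the finest grid puts each point in its own cell, which mirrors how clusters split apart as the zoom level increases.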