<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>


## Introduction

In this assignment, we explore, segment, and cluster the neighborhoods of Toronto based on postal code and borough information.

All the neighborhood data we need is available on a Wikipedia page. We will scrape that page, clean and wrangle the data, and read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will explore and cluster the neighborhoods of Toronto.

### Task 1 - Creating and Cleaning the Dataset

First, let's install and import the libraries that we will use.

In [2]:
#download the geocoding library
!pip install geocoder
#download the folium library to visualize geospatial data
!pip install folium

In [3]:
# this module helps with web scraping
from bs4 import BeautifulSoup 
# this module helps us to download a web page
import requests

#for data manipulation and analysis
import pandas as pd
#for computing data
import numpy as np

# import geocoder
import geocoder 
# map rendering library
import folium 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

### Scraping the Wikipedia page

This <a href='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>Wikipedia</a> page contains an HTML table listing the Canadian postal codes that begin with M. First we will get the contents of the web page as text and store it in a variable called data. Then we will pass it to the *BeautifulSoup* constructor and find the HTML table in the page.

In [4]:
#The below url contains an html table with data of postal codes of Canada: M.
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [5]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [6]:
#pass the data into the BeautifulSoup constructor
soup = BeautifulSoup(data,"html5lib")

In [7]:
#find a html table in the web page
table = soup.find('table') 

### Creating the Dataset

After finding the table, we create the dataframe from its cells. As seen on the web page, some cells in the table are marked 'Not assigned'. We skip these cells while building the dataframe. We also strip symbols such as '/', '(' and ')' from the cell text. The result is a dataframe with the columns Postal Code, Borough, and Neighborhood.

In [8]:
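table_contents = []
for row in table.find_all('td'):
    # skip postal codes that have no borough assigned
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    # the first three characters of the cell text are the postal code
    cell['Postal Code'] = row.p.text[:3]
    # the span text is expected to look like "Borough(Neighborhood / Neighborhood)"
    cell['Borough'] = row.span.text.split('(')[0]
    cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)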

# print(table_contents)
df=pd.DataFrame(table_contents)
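
The string handling in the loop above can be tried in isolation. The sketch below is a minimal illustration, not the notebook's own code: `split_cell` is a hypothetical helper, and the sample strings assume the cell text has the form `Borough(Neighborhood / Neighborhood)`, which is what the parsing code implies.

```python
def split_cell(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into (borough, neighborhoods)."""
    # everything before the first '(' is the borough name
    borough, _, rest = text.partition('(')
    # drop the closing ')' and turn ' /' separators into commas
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhoods


# sample strings are assumptions about the page's cell format
print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('North York(Parkwoods)'))
```

Keeping the parsing in a small function like this makes it easy to check against a few cells by hand before running it over the whole table.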


Check for duplicate rows based on the Postal Code column.

In [9]:
# Select all duplicate rows based on one column
duplicateRowsDF = df[df.duplicated(['Postal Code'])]
print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')

Duplicate Rows based on a single column are:
Empty DataFrame
Columns: [Postal Code, Borough, Neighborhood]
Index: []


Let's examine the dataframe that we created.

In [10]:
df.head()

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [11]:
df.shape

(103, 3)

### Task 2 - Get the coordinates of each neighborhood

Using the provided <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv">Geospatial Coordinates</a> CSV file, we get the latitude and longitude values of each neighborhood.

In [12]:
df_coordinates = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
df_coordinates.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now that we have the coordinates of each neighborhood, we can build the dataframe we will work with. We use pandas' merge function and match the two tables on the *Postal Code* column.

In [13]:
merged_df = pd.merge(df,df_coordinates,on='Postal Code')
merged_df.head()

|  | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
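
By default `pd.merge` performs an inner join, so a postal code missing from the coordinates file would silently drop out of the result. The sketch below shows one possible way to check for that with `how='left'` and `indicator=True`. The two small frames are toy stand-ins for `df` and `df_coordinates`, and the postal code `M9Z` and its borough are made up for illustration.

```python
import pandas as pd

# toy stand-ins for df and df_coordinates; 'M9Z' is a made-up postal code
neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighborhood row; indicator=True adds a '_merge' column
checked = pd.merge(neighborhoods, coordinates, on='Postal Code',
                   how='left', indicator=True)

# rows marked 'left_only' found no coordinates and would vanish in an inner join
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.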


In [14]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(merged_df['Borough'].unique()),
        len(merged_df['Neighborhood'].unique())
    )
)

The dataframe has 15 boroughs and 103 neighborhoods.


### Task 3 - Explore and cluster the neighborhoods in Toronto

Let's visualize the data to understand it better. We can use the geopy library to get Toronto's latitude and longitude values.

In [15]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
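
geopy's `geocode` returns `None` when it finds no match, in which case reading `location.latitude` raises an `AttributeError`. The sketch below is one possible guard, not part of the original notebook: `coords_or_fallback` and `TORONTO_FALLBACK` are hypothetical names, and a `SimpleNamespace` stands in for the geocoder result so the example runs offline.

```python
from types import SimpleNamespace

# fallback taken from the coordinates the notebook printed for Toronto
TORONTO_FALLBACK = (43.6534817, -79.3839347)


def coords_or_fallback(location, fallback=TORONTO_FALLBACK):
    """Return (latitude, longitude), or the fallback when geocoding found nothing."""
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


# SimpleNamespace stands in for a geocoder result so the sketch needs no network
found = SimpleNamespace(latitude=43.7, longitude=-79.4)
print(coords_or_fallback(found))
print(coords_or_fallback(None))
```

In the notebook this would be called as `latitude, longitude = coords_or_fallback(geolocator.geocode(address))`.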



#### Create a map of Toronto with neighborhoods superimposed on top.


In [16]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_toronto)
    
map_toronto

#### Choose the boroughs whose names contain the word "Toronto"

Following the suggestion in the instructions, let's work only with the boroughs whose names contain the word Toronto. To do this, we create a new dataframe.

In [17]:
# .copy() makes the filtered frame independent of merged_df before we add columns to it
toronto_df = merged_df[merged_df['Borough'].str.contains("Toronto")].copy()
toronto_df.head()

|  | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
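
`Series.str.contains` interprets its pattern as a regular expression by default and propagates missing values into the mask. The sketch below shows the same filter with `regex=False` and `na=False`. The small series is illustrative and is not taken from the scraped data.

```python
import pandas as pd

# illustrative borough names, including a missing value
boroughs = pd.Series(['Downtown Toronto', 'North York', None, 'East Toronto'])

# regex=False treats "Toronto" as a literal substring;
# na=False maps missing values to False so the mask is purely boolean
mask = boroughs.str.contains('Toronto', regex=False, na=False)
selected = boroughs[mask].tolist()
print(selected)
```

For a plain word such as "Toronto" the regex and literal interpretations give the same matches, so this mainly matters when the search text contains characters like `.` or `(`.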


In [18]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_df['Borough'].unique()),
        len(toronto_df['Neighborhood'].unique())
    )
)

The dataframe has 7 boroughs and 39 neighborhoods.


#### Create a map of Toronto to understand the new dataframe


In [19]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [20]:
# create map of Toronto using latitude and longitude values
map_toronto_boroughs = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_toronto_boroughs)
    
map_toronto_boroughs

### Cluster Neighborhoods

Run _k_-means to group the neighborhoods into 7 clusters, using latitude and longitude as the features.

In [21]:
# set number of clusters
kclusters = 7

toronto_grouped_clustering = toronto_df[['Latitude','Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([5, 5, 5, 2, 5, 5, 0, 5, 3, 6], dtype=int32)
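
The notebook fixes the number of clusters at 7. One common way to sanity-check such a choice is to compare the model's inertia (the within-cluster sum of squared distances) across several values of k. The sketch below does this on synthetic points rather than on `toronto_df`. The three group centers, the noise scale, and the explicit `n_init=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic (latitude, longitude) points in three tight groups; illustrative only
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.70, -79.30], [43.68, -79.45]])
points = np.vstack([c + rng.normal(scale=0.002, size=(10, 2)) for c in centers])

# inertia typically shrinks as k grows; a sharp bend suggests a reasonable k
inertias = {}
for k in range(2, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 6))
```

On the real data the loop would fit on `toronto_df[['Latitude', 'Longitude']]` instead of `points`.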



To visualize the results, insert the cluster labels into the dataframe.

In [22]:
toronto_df.insert(0, 'Cluster Labels', kmeans.labels_)

In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood'], toronto_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters