# **The Battle of the Neighborhoods**

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
!pip install beautifulsoup4
!pip install lxml

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


For the Singapore neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Singapore. We will scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format.

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/Planning_Areas_of_Singapore').text #will ping the website and return HTML of the website.
from bs4 import BeautifulSoup
soup=BeautifulSoup(website_url,'lxml') #creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping
print(soup.prettify()) #will enable us to view how the tags are nested in the document

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Planning Areas of Singapore - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"00a0d238-ead1-4f8a-ab30-a41ccde9eb20","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Planning_Areas_of_Singapore","wgTitle":"Planning Areas of Singapore","wgCurRevisionId":955442265,"wgRevisionId":955442265,"wgArticleId":2224605,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using deprecated image syntax","Urban planning in Singapore","Subdivisions of Singapore"],"wgPageConten

We know the data resides within an HTML table, so we first ask Beautiful Soup to retrieve every instance of the table tag on the page and store them in a list called all_tables.

In [3]:
all_tables=soup.find_all("table")
all_tables

[<table class="infobox vevent" style="width:22em;float: right; width: 250px; font-size: 90%; text-align: left; border-spacing: 3px;"><tbody><tr><th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size: 130%; background-color: #F0F0F0; vertical-align: middle">Planning Areas of Singapore</th></tr><tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Singapore_MP2008._Urban_Planning_Areas.svg"><img alt="Singapore MP2008. Urban Planning Areas.svg" data-file-height="452" data-file-width="710" decoding="async" height="191" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Singapore_MP2008._Urban_Planning_Areas.svg/300px-Singapore_MP2008._Urban_Planning_Areas.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Singapore_MP2008._Urban_Planning_Areas.svg/450px-Singapore_MP2008._Urban_Planning_Areas.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Singapore_MP2008._Urban_Planning_Areas.sv

Looking through the output of "all_tables", we can see that the class attribute of our chosen table is "wikitable sortable". We can use this to have Beautiful Soup return only this particular table and keep it in a variable called "right_table".

In [4]:
right_table=soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable">
<tbody><tr>
<th>Name <small>(<a href="/wiki/English_language" title="English language">English</a>)</small>
</th>
<th><a href="/wiki/Malay_language" title="Malay language">Malay</a>
</th>
<th><a href="/wiki/Chinese_language" title="Chinese language">Chinese</a>
</th>
<th><a href="/wiki/Pinyin" title="Pinyin">Pinyin</a>
</th>
<th><a href="/wiki/Tamil_language" title="Tamil language">Tamil</a>
</th>
<th>Region
</th>
<th>Area (km2)
</th>
<th>Population<sup class="reference" id="cite_ref-7"><a href="#cite_note-7">[7]</a></sup>
</th>
<th>Density (/km2)
</th></tr>
<tr>
<td><a href="/wiki/Ang_Mo_Kio" title="Ang Mo Kio">Ang Mo Kio</a>
</td>
<td>
</td>
<td>宏茂桥
</td>
<td>Hóng mào qiáo
</td>
<td>ஆங் மோ கியோ
</td>
<td><a href="/wiki/North-East_Region,_Singapore" title="North-East Region, Singapore">North-East</a>
</td>
<td>13.94
</td>
<td>163,950
</td>
<td>13,400
</td></tr>
<tr>
<td><a href="/wiki/Bedok" title="Bedok">Bedok</a>
</td>
<td>*
</td>
<td>勿洛
</td>
<td>

We know that the table is set up in rows (starting with 'tr' tags) with the data sitting within 'td' tags in each row. We aren’t too worried about the header row with the 'th' elements as we know what each of the columns represents by looking at the table.

We know we have to start looping through the rows to get the data for every neighborhood in the table. The table is well structured with each neighborhood having its own defined row. This makes things somewhat easier.
We will set up nine empty lists (A, B, C, D, E, F, G, H, I) to store our data in.

To start, we use the Beautiful Soup ‘find_all’ function again, this time to look for ‘tr’ tags. We then set up a for loop over the resulting list so that Python works through the rows one by one.

Within the loop we use find_all again to search each row for ‘td’ tags. We store these in a variable called ‘cells’ and then check that there are nine items in the ‘cells’ list (i.e. one for each column).

If there are, we use the find(text=True) option to extract the first text string from within each 'td' element in that row and append it to the A-I lists we created at the start of this step. The header row has no 'td' cells, so this check skips it.

In [5]:
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
H=[]
I=[]

for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==9:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        E.append(cells[4].find(text=True))
        F.append(cells[5].find(text=True))
        G.append(cells[6].find(text=True))
        H.append(cells[7].find(text=True))
        I.append(cells[8].find(text=True))
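
The row-by-row extraction described above can also be sketched without any third-party package. The block below is only an illustration: it swaps Beautiful Soup for the standard library's `html.parser`, and `TableRowParser` and `SAMPLE_HTML` are hypothetical names with a made-up two-column table standing in for the Wikipedia page. Unlike `find(text=True)`, which returns only the first text string in a cell, this sketch joins all text found inside each cell.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> element.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical two-column sample standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Region</th></tr>
<tr><td><a href="/wiki/Ang_Mo_Kio">Ang Mo Kio</a></td><td>North-East</td></tr>
<tr><td>Bedok</td><td>East</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)

# Keep only rows with the expected number of <td> cells; the header row has none.
expected_cells = 2
data_rows = [row for row in parser.rows if len(row) == expected_cells]
print(data_rows)
```

Collecting whole rows and filtering on the cell count plays the same role as the `len(cells)==9` check in the notebook.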

We’ll create a dataframe with pandas, assigning each of the lists A-I into columns with the names of our source table columns.

In [6]:
df=pd.DataFrame(A,columns=['Neighborhood'])
df['Malay']=B
df['Chinese']=C
df['Pinyin']=D
df['Tamil']=E
df['Region']=F
df['Area']=G
df['Population']=H
df['Density']=I
df = df.replace('\n','', regex=True) #removes newlines from pandas dataframe cells
df

Unnamed: 0,Neighborhood,Malay,Chinese,Pinyin,Tamil,Region,Area,Population,Density
0,Ang Mo Kio,,宏茂桥,Hóng mào qiáo,ஆங் மோ கியோ,North-East,13.94,163950,13400
1,Bedok,*,勿洛,Wù luò,பிடோக்,East,21.69,279380,13000
2,Bishan,,碧山,Bì shān,பீஷான்,Central,7.62,88010,12000
3,Boon Lay,,文礼,Wén lǐ,பூன் லே,West,8.23,30,3.6
4,Bukit Batok,*,武吉巴督,Wǔjí bā dū,புக்கிட் பாத்தோக்,West,11.13,153740,14000
5,Bukit Merah,*,红山,Hóng shān,புக்கிட் மேரா,Central,14.34,151980,11000
6,Bukit Panjang,*,武吉班让,Wǔjí bān ràng,பக்கிட் பஞ்சாங்,West,8.99,139280,15000
7,Bukit Timah,*,武吉知马,Wǔjí zhī mǎ,புக்கித் திமா,Central,17.53,77430,4400
8,Central Water Catchment,Kawasan Tadahan Air Tengah,中央集水区,Zhōngyāng jí shuǐ qū,மத்திய நீர் நீர்ப்பிடிப்பு,North,37.15,*,*
9,Changi,*,樟宜,Zhāng yí,சாங்கி,East,40.61,1830,80.62


We will drop the columns that we aren't interested in.

In [7]:
df = df.drop(["Malay","Chinese","Pinyin","Tamil","Area","Population","Density"], axis=1)
df

Unnamed: 0,Neighborhood,Region
0,Ang Mo Kio,North-East
1,Bedok,East
2,Bishan,Central
3,Boon Lay,West
4,Bukit Batok,West
5,Bukit Merah,Central
6,Bukit Panjang,West
7,Bukit Timah,Central
8,Central Water Catchment,North
9,Changi,East


Now, we attempt to get the geographical coordinates of each neighborhood using the Geocoder Python package.

In [8]:
# import geocoder
!pip install geocoder
import geocoder

# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates (this retries indefinitely if the lookup keeps failing)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Singapore, Singapore'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
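
Because the loop above keeps retrying until a result arrives, a lookup that never succeeds would hang the notebook. One possible variation, offered only as a sketch, is to cap the number of attempts. Here `get_latlng_with_retries`, `max_attempts`, `delay_seconds`, and the `flaky_lookup` stand-in are all hypothetical names introduced for illustration; the stand-in replaces the real geocoding call so the block runs offline.

```python
import time


def get_latlng_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # brief pause before trying again
    return None  # the caller decides how to handle a failed lookup


# Hypothetical stand-in for a geocoding call: fails once, then succeeds.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 2 else [1.37161, 103.84546]


result = get_latlng_with_retries(flaky_lookup, "Ang Mo Kio, Singapore, Singapore")
never = get_latlng_with_retries(lambda q: None, "Nowhere", max_attempts=2)
print(result, never)
```

In the notebook, the `lookup` argument could be something like `lambda q: geocoder.arcgis(q).latlng`, which mirrors the call used in `get_latlng`.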



In [9]:
coordinates = [get_latlng(neighborhood) for neighborhood in df["Neighborhood"].tolist()]
coordinates

[[1.3716100000000324, 103.84546000000006],
 [1.3242500000000632, 103.95297000000005],
 [1.3507900000000745, 103.85110000000009],
 [1.3480500000000575, 103.71216000000004],
 [1.349520000000041, 103.75277000000006],
 [1.283070000000066, 103.81667000000004],
 [1.3787700000000314, 103.76977000000005],
 [1.3404100000000199, 103.77221000000009],
 [1.2904100000000653, 103.85211000000004],
 [1.3699600000000487, 103.99311000000006],
 [1.3699600000000487, 103.99311000000006],
 [1.3861600000000749, 103.74618000000004],
 [1.3143800000000283, 103.76537000000008],
 [1.3769100000000662, 103.95534000000004],
 [1.3114700000000425, 103.88218000000006],
 [1.371240000000057, 103.89162000000005],
 [1.3343700000000354, 103.74367000000007],
 [1.339490000000069, 103.70739000000003],
 [1.3155814524669893, 103.8677221291336],
 [1.4196700000000533, 103.70232000000004],
 [1.4136480149046893, 103.79271014165711],
 [1.2957900000000677, 103.89544000000006],
 [1.2785700000000588, 103.85762000000005],
 [1.321440000000

We’ll now create a dataframe with pandas for the coordinates.

In [10]:
df_coordinates = pd.DataFrame(coordinates, columns=['Latitude', 'Longitude'])
df_coordinates

Unnamed: 0,Latitude,Longitude
0,1.37161,103.84546
1,1.32425,103.95297
2,1.35079,103.8511
3,1.34805,103.71216
4,1.34952,103.75277
5,1.28307,103.81667
6,1.37877,103.76977
7,1.34041,103.77221
8,1.29041,103.85211
9,1.36996,103.99311


We then add the coordinates to the neighborhood dataframe as two new columns. Both dataframes share the same default index, so the rows line up.

In [11]:
df['Latitude'] = df_coordinates['Latitude']
df['Longitude'] = df_coordinates['Longitude']
df

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
0,Ang Mo Kio,North-East,1.37161,103.84546
1,Bedok,East,1.32425,103.95297
2,Bishan,Central,1.35079,103.8511
3,Boon Lay,West,1.34805,103.71216
4,Bukit Batok,West,1.34952,103.75277
5,Bukit Merah,Central,1.28307,103.81667
6,Bukit Panjang,West,1.37877,103.76977
7,Bukit Timah,Central,1.34041,103.77221
8,Central Water Catchment,North,1.29041,103.85211
9,Changi,East,1.36996,103.99311


Let's now save the dataframe as a CSV file.

In [12]:
df.to_csv("df.csv", index=False)

We will use the geopy library to get the latitude and longitude values of Singapore. To create an instance of the geocoder, we need to define a user_agent. We will name our agent *sg_explorer*, as shown below.

In [13]:
address = 'Singapore, Singapore'

geolocator = Nominatim(user_agent="sg_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Singapore are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Singapore are 1.357107, 103.8194992.


We shall now create a map of Singapore with the neighborhoods superimposed on top.

In [14]:
# create map of Singapore using latitude and longitude values
map_singapore = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for neighborhood, region, lat, lng in zip(df['Neighborhood'], df['Region'], df['Latitude'], df['Longitude']):
    label = '{},{}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_singapore)
    
map_singapore

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [15]:
#Define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20200717' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


Let's get up to 100 venues within a radius of 500 meters of each neighborhood's coordinates.

In [16]:
#create the GET request URL

LIMIT = 100

radius = 500

def getNearbyVenues(names, regions, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, reg, lat, lng in zip(names, regions, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            reg,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Region',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
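
To make the two halves of this function easier to follow, here is a small offline sketch of them in isolation: building the query string and flattening the response. It assumes the response shape implied by the code above (`response` → `groups[0]` → `items` → `venue`). `build_explore_url`, `extract_venues`, and the sample payload are hypothetical; the second sample venue is invented to show one way of handling an empty category list, which the notebook code does not guard against.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL; urlencode takes care of escaping the values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


def extract_venues(payload):
    """Flatten a response dict into (name, lat, lng, category) tuples."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


# Hypothetical payload shaped like the response the function above indexes into.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "NTUC FairPrice",
               "location": {"lat": 1.371507, "lng": 103.847082},
               "categories": [{"name": "Supermarket"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 1.0, "lng": 103.0},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20200717", 1.37161, 103.84546)
rows = extract_venues(sample_payload)
print(url)
print(rows)
```

Note that `urlencode` percent-encodes the comma in the `ll` value, which is ordinary URL escaping.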

We now write the code to run the above function on each neighborhood and create a new dataframe called singapore_venues.

In [17]:
singapore_venues = getNearbyVenues(names=df['Neighborhood'],
                                   regions=df['Region'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Ang Mo Kio
Bedok
Bishan
Boon Lay
Bukit Batok
Bukit Merah
Bukit Panjang
Bukit Timah
Central Water Catchment
Changi
Changi Bay
Choa Chu Kang
Clementi
Downtown Core
Geylang
Hougang
Jurong East
Jurong West
Kallang
Lim Chu Kang
Mandai
Marina East
Marina South
Marine Parade
Museum
Newton
North-Eastern Islands
Novena
Orchard
Outram
Pasir Ris
Paya Lebar
Pioneer
Punggol
Queenstown
River Valley
Rochor
Seletar
Sembawang
Sengkang
Serangoon
Simpang
Singapore River
Southern Islands
Straits View
Sungei Kadut
Tampines
Tanglin
Tengah
Toa Payoh
Tuas
Western Islands
Western Water Catchment
Woodlands
Yishun


Let's check the size of the resulting dataframe.

In [18]:
print(singapore_venues.shape)
singapore_venues.head()

(1630, 8)


Unnamed: 0,Neighborhood,Region,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Ang Mo Kio,North-East,1.37161,103.84546,Face Ban Mian 非板面 (Ang Mo Kio),1.372031,103.847504,Noodle House
1,Ang Mo Kio,North-East,1.37161,103.84546,NTUC FairPrice,1.371507,103.847082,Supermarket
2,Ang Mo Kio,North-East,1.37161,103.84546,Xi Xiang Feng Yong Tau Foo 喜相逢酿豆腐,1.371975,103.846408,Chinese Restaurant
3,Ang Mo Kio,North-East,1.37161,103.84546,MOS Burger,1.36917,103.847831,Burger Joint
4,Ang Mo Kio,North-East,1.37161,103.84546,Kam Jia Zhuang Restaurant,1.368167,103.844118,Asian Restaurant


Let's check how many venues were returned for each neighborhood.

In [19]:
singapore_venues.groupby('Neighborhood').count()

Neighborhood,Region,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Ang Mo Kio,54,54,54,54,54,54,54
Bedok,10,10,10,10,10,10,10
Bishan,42,42,42,42,42,42,42
Boon Lay,12,12,12,12,12,12,12
Bukit Batok,28,28,28,28,28,28,28
Bukit Merah,13,13,13,13,13,13,13
Bukit Panjang,6,6,6,6,6,6,6
Bukit Timah,59,59,59,59,59,59,59
Central Water Catchment,79,79,79,79,79,79,79
Changi,8,8,8,8,8,8,8


Let's find out how many unique categories there are among all the returned venues.

In [20]:
print('There are {} unique categories.'.format(len(singapore_venues['Venue Category'].unique())))

There are 219 unique categories.


It is now time to analyze each neighborhood.

In [21]:
# one hot encoding
singapore_onehot = pd.get_dummies(singapore_venues[['Venue Category']], prefix="", prefix_sep="")

# add region column back to dataframe
singapore_onehot['Region'] = singapore_venues['Region'] 

# move region column to the first column
fixed_columns = [singapore_onehot.columns[-1]] + list(singapore_onehot.columns[:-1])
singapore_onehot = singapore_onehot[fixed_columns]

# add neighborhood column back to dataframe
singapore_onehot['Neighborhood'] = singapore_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [singapore_onehot.columns[-1]] + list(singapore_onehot.columns[:-1])
singapore_onehot = singapore_onehot[fixed_columns]

singapore_onehot.head()

Unnamed: 0,Neighborhood,Region,Accessories Store,Airport,Airport Terminal,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Video Game Store,Vietnamese Restaurant,Watch Shop,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Ang Mo Kio,North-East,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ang Mo Kio,North-East,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Ang Mo Kio,North-East,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Ang Mo Kio,North-East,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Ang Mo Kio,North-East,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [22]:
singapore_onehot.shape

(1630, 221)

Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [23]:
singapore_grouped = singapore_onehot.drop(columns='Region').groupby('Neighborhood').mean().reset_index()

# add region column back to dataframe, looked up by neighborhood name
singapore_grouped['Region'] = singapore_grouped['Neighborhood'].map(df.set_index('Neighborhood')['Region'])
singapore_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Terminal,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Vietnamese Restaurant,Watch Shop,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Region
0,Ang Mo Kio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,North-East
1,Bedok,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,East
2,Bishan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Central
3,Boon Lay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,West
4,Bukit Batok,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,West
5,Bukit Merah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Central
6,Bukit Panjang,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,West
7,Bukit Timah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033898,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Central
8,Central Water Catchment,0.0,0.0,0.0,0.0,0.0,0.025316,0.0,0.0,0.012658,...,0.0,0.0,0.0,0.012658,0.0,0.0,0.012658,0.0,0.0,North
9,Changi,0.0,0.25,0.125,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,East
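
To see what these grouped means represent, note that averaging a 0/1 indicator column over a neighborhood's venues gives the share of that neighborhood's venues in that category. The following plain-Python sketch illustrates the idea with a handful of hypothetical venues; the `venues` list and the `category_share` helper are made up for illustration and are not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, category) pairs, one per venue.
venues = [
    ("Rochor", "Indian Restaurant"),
    ("Rochor", "Indian Restaurant"),
    ("Rochor", "Coffee Shop"),
    ("Rochor", "Hotel"),
    ("Bedok", "Supermarket"),
    ("Bedok", "Coffee Shop"),
]

# Count venues per category within each neighborhood.
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1


def category_share(neighborhood, category):
    """Share of a neighborhood's venues in a category (mean of the one-hot column)."""
    total = sum(counts[neighborhood].values())
    return counts[neighborhood][category] / total


print(category_share("Rochor", "Indian Restaurant"))  # 2 of 4 venues
print(category_share("Bedok", "Indian Restaurant"))   # 0 of 2 venues
```

One consequence is that neighborhoods with very few venues can show large shares from a single venue, which is worth keeping in mind when comparing values.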


Let's confirm the new size.

In [24]:
singapore_grouped.shape

(55, 221)

Now, we shall create a dataframe to view only the mean of the frequency of Indian Restaurants in each neighborhood.

In [25]:
indian_restaurants = singapore_grouped[["Neighborhood","Indian Restaurant"]].copy() # copy so later column inserts don't act on a view
indian_restaurants

Unnamed: 0,Neighborhood,Indian Restaurant
0,Ang Mo Kio,0.0
1,Bedok,0.0
2,Bishan,0.0
3,Boon Lay,0.083333
4,Bukit Batok,0.0
5,Bukit Merah,0.076923
6,Bukit Panjang,0.0
7,Bukit Timah,0.033898
8,Central Water Catchment,0.025316
9,Changi,0.0


Now, we run k-means to cluster the neighborhoods in Singapore into three clusters.

In [26]:
# set number of clusters
kclusters = 3

indian_restaurants_clustering = indian_restaurants.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(indian_restaurants_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 2, 0, 2, 0, 0, 0, 0], dtype=int32)
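
Since the clustering here runs on a single feature, the underlying idea can be sketched in a few lines of plain Python. This is a simplified Lloyd-style iteration with hand-picked starting centers, not scikit-learn's `KMeans` (which chooses its own initialization), so the label numbers it produces are arbitrary and need not match the array above. `kmeans_1d` and the starting centers are assumptions made for illustration; the input values are a subset of the Indian Restaurant frequencies shown in this notebook.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Cluster 1-D values by alternating assignment and center updates."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # leave a center where it is if it has no members
                centers[k] = sum(members) / len(members)
    return labels, centers


# Ang Mo Kio, Bedok, Boon Lay, Bukit Merah, Bukit Timah,
# Central Water Catchment, Marine Parade, Rochor
frequencies = [0.0, 0.0, 0.083333, 0.076923, 0.033898, 0.025316, 0.125, 0.384615]

labels, centers = kmeans_1d(frequencies, centers=[0.0, 0.1, 0.4])
print(labels)
print(centers)
```

On this subset the sketch separates the near-zero values, the mid-range values, and Rochor's much higher value into three groups. A different choice of starting centers could lead to a different split.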

Let's create a new dataframe that includes the cluster labels and coordinates.

In [27]:
# add clustering labels
indian_restaurants.insert(0, 'Cluster Labels', kmeans.labels_)

# merge with df to add latitude/longitude for each neighborhood
singapore_merged = df.join(indian_restaurants.set_index('Neighborhood'), on='Neighborhood')

singapore_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Region,Latitude,Longitude,Cluster Labels,Indian Restaurant
0,Ang Mo Kio,North-East,1.37161,103.84546,0,0.0
1,Bedok,East,1.32425,103.95297,0,0.0
2,Bishan,Central,1.35079,103.8511,0,0.0
3,Boon Lay,West,1.34805,103.71216,2,0.083333
4,Bukit Batok,West,1.34952,103.75277,0,0.0
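
`DataFrame.join` performs a left join by default, so every row of `df` is kept and any neighborhood without a matching cluster label would receive a missing value. The sketch below mimics that behavior with plain dictionaries so the effect is easy to see. The third row, "Hypothetical Area", is invented for illustration, and `cluster_by_name` is a made-up lookup table.

```python
# Left-hand rows, mirroring df; the last entry is hypothetical.
neighborhoods = [
    {"Neighborhood": "Ang Mo Kio", "Region": "North-East"},
    {"Neighborhood": "Boon Lay", "Region": "West"},
    {"Neighborhood": "Hypothetical Area", "Region": "West"},
]

# Right-hand lookup, mirroring the cluster labels; no entry for the hypothetical row.
cluster_by_name = {"Ang Mo Kio": 0, "Boon Lay": 2}

# Left join: every left-hand row is kept, unmatched keys get None.
merged = [
    {**row, "Cluster Labels": cluster_by_name.get(row["Neighborhood"])}
    for row in neighborhoods
]

# Rows worth inspecting before using the label as a list index.
unmatched = [row["Neighborhood"] for row in merged if row["Cluster Labels"] is None]
print(merged)
print(unmatched)
```

Checking for unmatched rows before plotting is a cheap safeguard, since code that uses the cluster label as a list index cannot handle a missing value.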


Finally, let's visualize the resulting clusters.

In [28]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(singapore_merged['Latitude'], singapore_merged['Longitude'], singapore_merged['Neighborhood'], singapore_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We can now examine each cluster.

In [29]:
singapore_merged.loc[singapore_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude,Cluster Labels,Indian Restaurant
0,Ang Mo Kio,North-East,1.37161,103.84546,0,0.0
1,Bedok,East,1.32425,103.95297,0,0.0
2,Bishan,Central,1.35079,103.8511,0,0.0
4,Bukit Batok,West,1.34952,103.75277,0,0.0
6,Bukit Panjang,West,1.37877,103.76977,0,0.0
7,Bukit Timah,Central,1.34041,103.77221,0,0.033898
8,Central Water Catchment,North,1.29041,103.85211,0,0.025316
9,Changi,East,1.36996,103.99311,0,0.0
10,Changi Bay,East,1.36996,103.99311,0,0.0
11,Choa Chu Kang,West,1.38616,103.74618,0,0.0


In [30]:
singapore_merged.loc[singapore_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude,Cluster Labels,Indian Restaurant
36,Rochor,Central,1.30413,103.85029,1,0.384615


In [31]:
singapore_merged.loc[singapore_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude,Cluster Labels,Indian Restaurant
3,Boon Lay,West,1.34805,103.71216,2,0.083333
5,Bukit Merah,Central,1.28307,103.81667,2,0.076923
14,Geylang,Central,1.31147,103.88218,2,0.075
23,Marine Parade,Central,1.32144,103.87004,2,0.125
25,Newton,Central,1.31218,103.83912,2,0.041667
34,Queenstown,Central,1.29966,103.80172,2,0.043478
54,Yishun,North,1.43621,103.83582,2,0.058824
