# Applied Data Science Capstone Project

## Segmenting and Clustering the Mass Rapid Transit (MRT) Train Stations in Singapore

# Introduction

The Mass Rapid Transit (MRT) is a heavy rail rapid transit system that constitutes the bulk of the railway network in Singapore. With the exception of the forested core and the island's rural northwest, it spans the length and width of the city-state's main island. It is one of Singapore's two main modes of public transport, the other being public bus services.

Currently, there are 5 MRT lines in Singapore: **East-West Line, North-South Line, North-East Line, Circle Line and Downtown Line**.

As the MRT is used extensively by most Singaporeans for their daily commute, the areas around its stations often enjoy a good flow of human traffic.

In this Capstone Project for the Applied Data Science Professional Certificate by IBM, I will analyse the establishments around each MRT Station. This will help us decide what type of business to set up and which location is ideal for starting it.


# Data

This section describes the datasets required and their sources.

### MRT Station Data
The Station Names, Postal Codes, Latitudes and Longitudes of all MRT Stations in Singapore come from a dataset collated by a GitHub user (https://github.com/xkjyeah/singapore-postal-codes/blob/master/mrt_stations.json).

### Venue Data
Data on the businesses and other venues around each MRT Station will be retrieved through the Foursquare API.

---


## Exploratory Data Analysis

##### Importing the required libraries

In [1]:
import pandas as pd #import pandas library
import requests #import requests library
from bs4 import BeautifulSoup #import BeautifulSoup for web scraping
import numpy as np #import numpy library
from geopy.geocoders import Nominatim #import Nominatim to retrieve Singapore's Longitude and Latitude
import matplotlib.cm as cm #import matplotlib's colourmap
import matplotlib.colors as colors #matplotlib's colour library
from sklearn.cluster import KMeans #for clustering of MRT Stations
import json #to access Foursquare data
import folium #for data visualisation
print('Importing done!')

Importing done!


##### Importing and cleaning up the MRT Station data by removing unnecessary columns and missing values

In [2]:
with open('mrt_stations.json') as data_file:
    mrt = json.load(data_file)
    
mrt_data = pd.json_normalize(mrt, 'Possible Locations', ['Station', 'Station Name'])
mrt_data
column_titles = ['Station', 'Station Name','LATITUDE', 'LONGITUDE']
mrt_data = mrt_data.reindex(columns = column_titles)
mrt_data['LATITUDE'] = pd.to_numeric(mrt_data['LATITUDE'], errors = 'coerce')
mrt_data['LONGITUDE'] = pd.to_numeric(mrt_data['LONGITUDE'], errors = 'coerce')
mrt_data = mrt_data.groupby('Station Name', as_index=False).agg(lambda x: x.tolist())
mrt_data.loc[mrt_data['Station Name'] == 'Jurong East']

Unnamed: 0,Station Name,Station,LATITUDE,LONGITUDE
52,Jurong East,"[NS1, EW24]","[1.33315261987297, 1.33315261987297]","[103.742286544006, 103.742286544006]"


In [3]:
mrt_data = pd.read_csv('mrt_stations.csv')
mrt_data.drop(['ADDRESS','BLK_NO','POSTAL','ROAD_NAME','SEARCHVAL','X','Y'], axis = 1, inplace = True)
mrt_data.dropna(inplace = True)
mrt_data

Unnamed: 0,BUILDING,LATITUDE,LONGITUDE,Station,Station Name
0,JURONG EAST MRT STATION,1.333153,103.742287,NS1,Jurong East
1,BUKIT BATOK MRT STATION,1.349034,103.749567,NS2,Bukit Batok
2,BUKIT GOMBAK MRT STATION,1.358612,103.751791,NS3,Bukit Gombak
3,CHOA CHU KANG MRT STATION,1.385363,103.744371,NS4,Choa Chu Kang
4,YEW TEE MRT STATION,1.397535,103.747405,NS5,Yew Tee
...,...,...,...,...,...
178,TAMPINES WEST MRT STATION,1.345515,103.938437,DT31,Tampines West
179,TAMPINES MRT STATION,1.353302,103.945145,DT32,Tampines
181,TAMPINES EAST MRT STATION,1.356191,103.954634,DT33,Tampines East
182,UPPER CHANGI MRT STATION,1.341740,103.961473,DT34,Upper Changi


##### Getting the coordinates of Singapore

In [4]:
address = 'Singapore, SG'

geolocator = Nominatim(user_agent="Singapore")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The latitude and longitude of Singapore are {latitude}, {longitude} respectively')

The latitude and longitude of Singapore are 1.357107, 103.8194992 respectively


##### Creating a map of Singapore with the MRT Stations superimposed on top

In [5]:
map_SG = folium.Map(location=[latitude, longitude], zoom_start=11)

# (outline colour, fill colour) for each MRT line, keyed by station-code prefix
line_colours = {
    'DT': ('#273f8a', '#939fc4'),  # Downtown Line
    'CC': ('#fcfc00', '#d3d332'),  # Circle Line
    'NE': ('#5f418b', '#9480b2'),  # North-East Line
    'EW': ('#21891d', '#79b877'),  # East-West Line
    'NS': ('#990000', '#ff6666'),  # North-South Line
}

for lat, long, name, station in zip(mrt_data['LATITUDE'], mrt_data['LONGITUDE'],
                                    mrt_data['Station Name'], mrt_data['Station']):
    label = folium.Popup(f'{station}, {name}', parse_html=True)
    # draw the marker in the colours of the first matching line;
    # station codes that match none of the prefixes are skipped
    for prefix, (colour, fill_colour) in line_colours.items():
        if prefix in station:
            folium.CircleMarker(location=[lat, long],
                                radius=5,
                                popup=label,
                                color=colour,
                                fill=True,
                                fill_color=fill_colour,
                                fill_opacity=0.7).add_to(map_SG)
            break

map_SG

# Methodology

In this project, I aim to gather the most commonly visited places around each MRT Station and find out the categories of these venues. This will be done by using Foursquare to extract venue and venue category data within a 500 m radius of each MRT Station.
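
As a rough sketch of the request described above, the query string can be assembled with the standard library instead of manual string formatting. The endpoint and parameter names are the ones used in the cells further down; the helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Endpoint used later in this notebook
EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, radius=500, limit=100,
                      client_id='CLIENT ID', client_secret='CLIENT SECRET',
                      version='20180605'):
    """Return the explore URL for one station (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',  # latitude,longitude of the station
        'radius': radius,      # search radius in metres
        'limit': limit,        # maximum number of venues returned
    }
    # safe=',' keeps the comma inside the ll parameter unescaped
    query = urlencode(params, safe=',')
    return f'{EXPLORE_URL}?{query}'


url = build_explore_url(1.333153, 103.742287)
print(url)
```

Building the query from a dictionary keeps each parameter visible in one place, which makes it harder to mix up the order of the positional `format` arguments.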

Next, we will count how many venues of each category there are around each MRT Station by grouping the venues together. After that, we will group the data by MRT Station so that we can find the most common types of businesses around each station.
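
The counting and grouping step can be illustrated in plain Python before it is done with *pandas*. The sketch below uses made-up station names and venue lists (all values are hypothetical) to show how per-station counts become per-station shares, which is the quantity the one-hot encoding and `groupby(...).mean()` cells compute later on.

```python
from collections import Counter, defaultdict

# Hypothetical (station, venue category) pairs standing in for the Foursquare results
venues = [
    ('Station A', 'Coffee Shop'),
    ('Station A', 'Coffee Shop'),
    ('Station A', 'Food Court'),
    ('Station A', 'Supermarket'),
    ('Station B', 'Hotel'),
    ('Station B', 'Hotel'),
    ('Station B', 'Hotel'),
    ('Station B', 'Café'),
]

# Count how often each category appears around each station
counts = defaultdict(Counter)
for station, category in venues:
    counts[station][category] += 1

# Convert counts to shares so that busy and quiet stations are comparable
frequencies = {
    station: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for station, counter in counts.items()
}

# Show the two most common categories and the shares for each station
for station, counter in counts.items():
    print(station, counter.most_common(2), frequencies[station])
```

Working with shares rather than raw counts matters because the number of venues returned varies widely between stations.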

We will then use *k*-means clustering to group MRT Stations with similar common venues so that we can focus on the clusters we are interested in. This will give us a handful of options for where to set up our food business.
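
To make the clustering idea concrete, here is a bare-bones version of the two alternating steps behind *k*-means (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is an illustrative sketch only, not the scikit-learn implementation used later in this notebook; the station names, the two-dimensional feature vectors and the fixed starting centroids are all assumptions.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on a list of equal-length tuples.

    Illustrative only: starts from the given centroids and stops when the
    centroids no longer move or the iteration budget is used up.
    """
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical stations described by (coffee-shop share, hotel share)
stations = {
    'Station A': (0.30, 0.00),
    'Station B': (0.25, 0.05),
    'Station C': (0.02, 0.40),
    'Station D': (0.00, 0.35),
}
points = list(stations.values())
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
for name, label in zip(stations, labels):
    print(name, '-> cluster', label)
```

In the real analysis each station is described by several hundred category shares rather than two, but the assignment and update steps are the same.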

##### Defining Foursquare credentials and version

In [1]:
CLIENT_ID = 'CLIENT ID'
CLIENT_SECRET = 'CLIENT SECRET'
VERSION = '20180605' 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CLIENT ID
CLIENT_SECRET:CLIENT SECRET


##### Exploring the data using the first MRT Station

In [7]:
print(mrt_data.loc[0, 'Station Name'])
print('\n-------\n')

Station_latitude = mrt_data.loc[0, 'LATITUDE'] 
Station_longitude = mrt_data.loc[0, 'LONGITUDE'] 

Station_name = mrt_data.loc[0, 'Station Name'] 

print(f'Latitude and longitude values of {Station_name} are {Station_latitude}, {Station_longitude}.')

Jurong East

-------

Latitude and longitude values of Jurong East are 1.33315261987297, 103.742286544006.


##### Creating GET request via Foursquare API

In [8]:
LIMIT = 100
radius = 400
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID,CLIENT_SECRET,VERSION,Station_latitude,Station_longitude,radius,LIMIT)

##### Sending GET request and extracting the category of venues

In [9]:
results = requests.get(url).json()

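def get_category_type(row):
    # the category list sits under 'categories' or 'venue.categories',
    # depending on how the response was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # venues without a category yield None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']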
    
print('Done!')

Done!


##### Converting data from JSON to *pandas* data frame

In [10]:
venues = results['response']['groups'][0]['items']

nearby_venues = pd.json_normalize(venues) 

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print(nearby_venues.head(), '\n ----- \n',f'{nearby_venues.shape[0]} venues were returned by Foursquare.')

                                                name              categories  \
0                                             UNIQLO          Clothing Store   
1                                          MUJI 無印良品  Furniture / Home Store   
2                          Song Fa Bak Kut Teh 松發肉骨茶      Chinese Restaurant   
3                                           The Rink            Skating Rink   
4  Tonkatsu by Ma Maison とんかつ マメゾン (Tonkatsu by M...     Japanese Restaurant   

        lat         lng  
0  1.333175  103.743160  
1  1.333187  103.743064  
2  1.333394  103.743420  
3  1.333424  103.740345  
4  1.333668  103.742818   
 ----- 
 64 venues were returned by Foursquare.
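
One quick sanity check on the search radius is to compute how far a returned venue lies from the station. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km (an approximation); the function name `haversine_m` is hypothetical, and the coordinates are the Jurong East and UNIQLO values printed above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


station = (1.33315261987297, 103.742286544006)  # Jurong East
venue = (1.333175, 103.743160)                  # UNIQLO, first row of the output
distance = haversine_m(*station, *venue)
print(f'{distance:.0f} m from the station; within 400 m: {distance <= 400}')
```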


### Now let's explore all the stations!

##### Extracting the venues for each MRT Station

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Station', 
                  'Station Latitude', 
                  'Station Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

mrt_venues = getNearbyVenues(names=mrt_data['Station Name'],
                                   latitudes=mrt_data['LATITUDE'],
                                   longitudes=mrt_data['LONGITUDE']
                                  )


Jurong East
Bukit Batok
Bukit Gombak
Choa Chu Kang
Yew Tee
Kranji
Marsiling
Woodlands
Admiralty
Sembawang
Yishun
Khatib
Yio Chu Kang
Ang Mo Kio
Bishan
Braddell
Toa Payoh
Novena
Newton
Orchard
Somerset
Dhoby Ghaut
City Hall
Raffles Place
Marina Bay
Marina South Pier
Pasir Ris
Tampines
Simei
Tanah Merah
Bedok
Kembangan
Eunos
Paya Lebar
Aljunied
Kallang
Lavender
Bugis
City Hall
Raffles Place
Tanjong Pagar
Outram Park
Tiong Bahru
Redhill
Queenstown
Commonwealth
Buona Vista
Dover
Clementi
Jurong East
Chinese Garden
Lakeside
Boon Lay
Pioneer
Joo Koon
Gul Circle
Tuas Crescent
Tuas West Road
Tuas Link
Expo
Changi Airport
HarbourFront
Outram Park
Chinatown
Clarke Quay
Dhoby Ghaut
Little India
Farrer Park
Boon Keng
Potong Pasir
Woodleigh
Serangoon
Kovan
Hougang
Buangkok
Sengkang
Punggol
Dhoby Ghaut
Bras Basah
Esplanade
Promenade
Nicoll Highway
Stadium
Mountbatten
Dakota
Paya Lebar
MacPherson
Tai Seng
Bartley
Serangoon
Lorong Chuan
Bishan
Marymount
Caldecott
Botanic Gardens
Farrer Road
Holland Vi
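
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A defensive variant might look like the following sketch; the helper name `first_category` and the sample items are hypothetical, shaped like the response items used above.

```python
def first_category(item, default=None):
    """Return the first category name of a venue item, or `default` if none."""
    categories = item.get('venue', {}).get('categories') or []
    return categories[0]['name'] if categories else default


# Hypothetical items mimicking the structure of the explore response
items = [
    {'venue': {'name': 'Venue 1', 'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue 2', 'categories': []}},  # no category attached
]

rows = [(v['venue']['name'], first_category(v, default='Uncategorised'))
        for v in items]
print(rows)
```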

##### Take a look at the results

In [12]:
print(mrt_venues.shape)
mrt_venues.head()

(5864, 7)


Unnamed: 0,Station,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Jurong East,1.333153,103.742287,UNIQLO,1.333175,103.74316,Clothing Store
1,Jurong East,1.333153,103.742287,MUJI 無印良品,1.333187,103.743064,Furniture / Home Store
2,Jurong East,1.333153,103.742287,Song Fa Bak Kut Teh 松發肉骨茶,1.333394,103.74342,Chinese Restaurant
3,Jurong East,1.333153,103.742287,The Rink,1.333424,103.740345,Skating Rink
4,Jurong East,1.333153,103.742287,Tonkatsu by Ma Maison とんかつ マメゾン (Tonkatsu by M...,1.333668,103.742818,Japanese Restaurant


##### Grouping the venues by MRT Stations

In [13]:
mrt_venues.groupby('Station').count()

Station,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Admiralty,16,16,16,16,16,16
Aljunied,38,38,38,38,38,38
Ang Mo Kio,40,40,40,40,40,40
Bartley,11,11,11,11,11,11
Bayfront,98,98,98,98,98,98
...,...,...,...,...,...,...
Woodleigh,8,8,8,8,8,8
Yew Tee,10,10,10,10,10,10
Yio Chu Kang,21,21,21,21,21,21
Yishun,49,49,49,49,49,49


In [14]:
categories = len(mrt_venues['Venue Category'].unique())

print(f'There are {categories} unique categories.')

There are 312 unique categories.


### Moving on to analysing each MRT Station

##### Applying one-hot encoding to the venues

In [15]:
mrt_onehot = pd.get_dummies(mrt_venues[['Venue Category']], prefix="", prefix_sep="")

mrt_onehot['Station'] = mrt_venues['Station'] 

fixed_columns = [mrt_onehot.columns[-1]] + list(mrt_onehot.columns[:-1])
mrt_onehot = mrt_onehot[fixed_columns]

mrt_onehot.head()

Unnamed: 0,Station,ATM,Accessories Store,Airport,Airport Lounge,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Water Park,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Yunnan Restaurant
0,Jurong East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Jurong East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Jurong East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Jurong East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Jurong East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
mrt_onehot.shape

(5864, 313)

##### Grouping the venues by MRT Station and taking the mean frequency of each category

In [17]:
mrt_grouped = mrt_onehot.groupby('Station').mean().reset_index()
mrt_grouped

Unnamed: 0,Station,ATM,Accessories Store,Airport,Airport Lounge,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Water Park,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Yunnan Restaurant
0,Admiralty,0.125,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Aljunied,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Ang Mo Kio,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bartley,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bayfront,0.000,0.020408,0.0,0.0,0.0,0.000000,0.0,0.020408,0.020408,...,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,Woodleigh,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
115,Yew Tee,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
116,Yio Chu Kang,0.000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
117,Yishun,0.000,0.000000,0.0,0.0,0.0,0.020408,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
mrt_grouped.shape

(119, 313)

##### Let's take a look at the top 5 most common venues of each MRT Station

In [19]:
num_top_venues = 5

for mrt in mrt_grouped['Station']:
    print("----"+mrt+"----")
    temp = mrt_grouped[mrt_grouped['Station'] == mrt].T.reset_index()
    temp.columns = ['Venue','Frequency']
    temp = temp.iloc[1:]
    temp['Frequency'] = temp['Frequency'].astype(float)
    temp = temp.round({'Frequency': 2})
    print(temp.sort_values('Frequency', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n -----\n')

----Admiralty----
              Venue  Frequency
0               ATM       0.12
1              Park       0.12
2       Supermarket       0.12
3     Metro Station       0.06
4  Basketball Court       0.06

 -----

----Aljunied----
                           Venue  Frequency
0             Chinese Restaurant       0.11
1                   Noodle House       0.08
2               Asian Restaurant       0.08
3  Vegetarian / Vegan Restaurant       0.08
4                     Food Court       0.05

 -----

----Ang Mo Kio----
                 Venue  Frequency
0          Coffee Shop       0.10
1           Food Court       0.08
2         Dessert Shop       0.08
3  Japanese Restaurant       0.05
4      Bubble Tea Shop       0.05

 -----

----Bartley----
           Venue  Frequency
0   Noodle House       0.18
1    Bus Station       0.18
2       Bus Stop       0.18
3  Metro Station       0.09
4      Pet Store       0.09

 -----

----Bayfront----
                 Venue  Frequency
0                Hote

4         Harbor / Marina       0.04

 -----

----Marina South Pier----
                Venue  Frequency
0       Boat or Ferry       0.14
1      Breakfast Spot       0.14
2  Mexican Restaurant       0.14
3                 Bar       0.14
4      History Museum       0.14

 -----

----Marsiling----
           Venue  Frequency
0  Grocery Store       0.11
1    Coffee Shop       0.11
2     Food Court       0.11
3    Bus Station       0.11
4    Pizza Place       0.06

 -----

----Marymount----
                   Venue  Frequency
0     Chinese Restaurant       0.33
1             Food Court       0.17
2                 Bakery       0.17
3        Thai Restaurant       0.17
4  Outdoors & Recreation       0.17

 -----

----Mattar----
              Venue  Frequency
0       Coffee Shop       0.18
1        Food Court       0.14
2  Asian Restaurant       0.09
3       Gas Station       0.09
4       Bus Station       0.09

 -----

----Mountbatten----
                  Venue  Frequency
0          Noodle 

                 Venue  Frequency
0    Korean Restaurant       0.06
1                 Café       0.06
2  Japanese Restaurant       0.06
3                Hotel       0.06
4          Coffee Shop       0.05

 -----

----Telok Blangah----
                Venue  Frequency
0         Bus Station       0.12
1          Food Court       0.12
2  Chinese Restaurant       0.08
3                 Gym       0.08
4         Supermarket       0.04

 -----

----Tiong Bahru----
                 Venue  Frequency
0   Chinese Restaurant       0.19
1          Coffee Shop       0.11
2                 Café       0.08
3           Food Court       0.08
4  Japanese Restaurant       0.06

 -----

----Toa Payoh----
                Venue  Frequency
0         Coffee Shop       0.14
1  Chinese Restaurant       0.10
2        Dessert Shop       0.07
3          Food Court       0.07
4         Snack Place       0.07

 -----

----Tuas Crescent----
               Venue  Frequency
0      Train Station       0.33
1    Harbor / 

##### Sorting the data into a *pandas* data frame in descending order

In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

##### Now we'll look at the top 10 most common venues of each MRT Station

In [21]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']


columns = ['Station']
for ind in np.arange(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except IndexError:
        columns.append(f'{ind+1}th Most Common Venue')


mrt_venues_sorted = pd.DataFrame(columns=columns)
mrt_venues_sorted['Station'] = mrt_grouped['Station']

for ind in np.arange(mrt_grouped.shape[0]):
    mrt_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mrt_grouped.iloc[ind, :], num_top_venues)

mrt_venues_sorted.head()

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Admiralty,ATM,Supermarket,Park,Bakery,Food Truck,Food Court,Bus Station,Smoke Shop,Metro Station,Seafood Restaurant
1,Aljunied,Chinese Restaurant,Noodle House,Asian Restaurant,Vegetarian / Vegan Restaurant,Food Court,Dim Sum Restaurant,Coffee Shop,Seafood Restaurant,Indian Restaurant,Boarding House
2,Ang Mo Kio,Coffee Shop,Food Court,Dessert Shop,Supermarket,Fast Food Restaurant,Bubble Tea Shop,Japanese Restaurant,Bank,Modern European Restaurant,Shopping Mall
3,Bartley,Bus Station,Noodle House,Bus Stop,Pet Store,Seafood Restaurant,Food Truck,Building,Metro Station,Filipino Restaurant,Fish & Chips Shop
4,Bayfront,Hotel,Boutique,Tea Room,Roof Deck,Casino,Waterfront,Lounge,Japanese Restaurant,Italian Restaurant,Garden


## Clustering of Data

##### Using *k*-means clustering to form clusters

In [22]:
kclusters = 5

mrt_grouped_clustering = mrt_grouped.drop(columns='Station')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mrt_grouped_clustering)

kmeans.labels_[0:10] 

array([1, 1, 2, 1, 0, 0, 0, 2, 2, 0], dtype=int32)

##### Adding the clusters into our data frame

In [23]:
mrt_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

mrt_merged = mrt_data

mrt_merged = mrt_merged.join(mrt_venues_sorted.set_index('Station'), on='Station Name')
mrt_merged = mrt_merged.dropna()
mrt_merged['Cluster Labels'] = mrt_merged['Cluster Labels'].astype(int)

mrt_merged.head()

Unnamed: 0,BUILDING,LATITUDE,LONGITUDE,Station,Station Name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,JURONG EAST MRT STATION,1.333153,103.742287,NS1,Jurong East,0,Japanese Restaurant,Food Court,Café,Chinese Restaurant,Coffee Shop,Shopping Mall,Department Store,Sandwich Place,Korean Restaurant,Bubble Tea Shop
1,BUKIT BATOK MRT STATION,1.349034,103.749567,NS2,Bukit Batok,2,Coffee Shop,Chinese Restaurant,Food Court,Malay Restaurant,Bus Station,Frozen Yogurt Shop,Grocery Store,Café,Sandwich Place,Mobile Phone Shop
2,BUKIT GOMBAK MRT STATION,1.358612,103.751791,NS3,Bukit Gombak,2,Food Court,Coffee Shop,Chinese Restaurant,Stadium,Sandwich Place,Juice Bar,Malay Restaurant,Fast Food Restaurant,Lake,Supermarket
3,CHOA CHU KANG MRT STATION,1.385363,103.744371,NS4,Choa Chu Kang,2,Coffee Shop,Fast Food Restaurant,Asian Restaurant,Furniture / Home Store,Bakery,Playground,Sandwich Place,Chinese Restaurant,Café,Noodle House
4,YEW TEE MRT STATION,1.397535,103.747405,NS5,Yew Tee,0,Fast Food Restaurant,Asian Restaurant,Japanese Restaurant,Café,Sandwich Place,Diner,Pool,Shopping Mall,Food Court,Food & Drink Shop


##### Now we'll create a map of Singapore with our clusters

In [29]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, long, poi, cluster in zip(mrt_merged['LATITUDE'], mrt_merged['LONGITUDE'], 
                                   mrt_merged['Station Name'], mrt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results

### Cluster 1 (Red)

In [30]:
mrt_merged.loc[mrt_merged['Cluster Labels'] == 0, mrt_merged.columns[[0,1] + list(range(5, mrt_merged.shape[1]))]]


Unnamed: 0,BUILDING,LATITUDE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,JURONG EAST MRT STATION,1.333153,0,Japanese Restaurant,Food Court,Café,Chinese Restaurant,Coffee Shop,Shopping Mall,Department Store,Sandwich Place,Korean Restaurant,Bubble Tea Shop
4,YEW TEE MRT STATION,1.397535,0,Fast Food Restaurant,Asian Restaurant,Japanese Restaurant,Café,Sandwich Place,Diner,Pool,Shopping Mall,Food Court,Food & Drink Shop
7,WOODLANDS MRT STATION,1.436067,0,Japanese Restaurant,Shopping Mall,Café,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Asian Restaurant,Frozen Yogurt Shop,Clothing Store,Electronics Store
20,NOVENA MRT STATION,1.320441,0,Café,Coffee Shop,Japanese Restaurant,Hotel,Asian Restaurant,Italian Restaurant,Ramen Restaurant,Restaurant,Hainan Restaurant,Sandwich Place
23,ORCHARD MRT STATION,1.303980,0,Boutique,Cosmetics Shop,Sushi Restaurant,Hotel,Shopping Mall,Bakery,Coffee Shop,Japanese Restaurant,Bubble Tea Shop,Department Store
...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,JALAN BESAR MRT STATION,1.305171,0,Indian Restaurant,Chinese Restaurant,Hotel,Café,Vegetarian / Vegan Restaurant,Food Court,Bakery,Dessert Shop,Dumpling Restaurant,Hostel
169,BENDEMEER MRT STATION,1.313673,0,Hostel,BBQ Joint,Vegetarian / Vegan Restaurant,Restaurant,Coffee Shop,Seafood Restaurant,Café,Noodle House,Soccer Stadium,Soup Place
179,TAMPINES MRT STATION,1.353302,0,Bakery,Café,Clothing Store,Dessert Shop,Coffee Shop,Gym,Chinese Restaurant,Thai Restaurant,Pharmacy,Japanese Restaurant
182,UPPER CHANGI MRT STATION,1.341740,0,Café,Soccer Field,Event Space,Convenience Store,Pool,Restaurant,Gym / Fitness Center,Gym Pool,Metro Station,Asian Restaurant


### Cluster 2 (Purple)

In [31]:
mrt_merged.loc[mrt_merged['Cluster Labels'] == 1, mrt_merged.columns[[0,1] + list(range(5, mrt_merged.shape[1]))]]


Unnamed: 0,BUILDING,LATITUDE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,KRANJI MRT STATION,1.425087,1,Racetrack,Stadium,Lighthouse,Noodle House,Bus Line,Bus Station,Bus Stop,Go Kart Track,Gym,Discount Store
9,ADMIRALTY MRT STATION,1.440585,1,ATM,Supermarket,Park,Bakery,Food Truck,Food Court,Bus Station,Smoke Shop,Metro Station,Seafood Restaurant
13,YIO CHU KANG MRT STATION,1.381756,1,Noodle House,Food Court,Volleyball Court,Cafeteria,Gym / Fitness Center,Bus Stop,Bus Station,Bus Line,Vegetarian / Vegan Restaurant,Fast Food Restaurant
18,BRADDELL MRT STATION,1.340469,1,Noodle House,Chinese Restaurant,Food Court,Café,Bakery,Thai Restaurant,Seafood Restaurant,Asian Restaurant,Fast Food Restaurant,Hakka Restaurant
21,NEWTON MRT STATION,1.31232,1,Chinese Restaurant,Italian Restaurant,Hotel Bar,Seafood Restaurant,Indian Restaurant,Thai Restaurant,Japanese Restaurant,Gym / Fitness Center,Food Court,Noodle House
40,KEMBANGAN MRT STATION,1.321038,1,Chinese Restaurant,Indian Restaurant,Bus Stop,Juice Bar,Malay Restaurant,Shopping Mall,Bistro,Supermarket,Noodle House,Train Station
41,EUNOS MRT STATION,1.319784,1,Chinese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Noodle House,Food,Grocery Store,Gym,Bubble Tea Shop,Train Station,Asian Restaurant
44,ALJUNIED MRT STATION,1.316433,1,Chinese Restaurant,Noodle House,Asian Restaurant,Vegetarian / Vegan Restaurant,Food Court,Dim Sum Restaurant,Coffee Shop,Seafood Restaurant,Indian Restaurant,Boarding House
45,KALLANG MRT STATION,1.311489,1,Food Court,Snack Place,Hostel,BBQ Joint,Supermarket,Seafood Restaurant,Chinese Restaurant,Rock Club,Noodle House,Indian Restaurant
57,QUEENSTOWN MRT STATION,1.294551,1,Food Court,Noodle House,Chinese Restaurant,Seafood Restaurant,Train Station,Spa,Pool,Stadium,Café,Italian Restaurant


### Cluster 3 (Blue)

In [32]:
mrt_merged.loc[mrt_merged['Cluster Labels'] == 2, mrt_merged.columns[[0,1] + list(range(5, mrt_merged.shape[1]))]]


Unnamed: 0,BUILDING,LATITUDE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,BUKIT BATOK MRT STATION,1.349034,2,Coffee Shop,Chinese Restaurant,Food Court,Malay Restaurant,Bus Station,Frozen Yogurt Shop,Grocery Store,Café,Sandwich Place,Mobile Phone Shop
2,BUKIT GOMBAK MRT STATION,1.358612,2,Food Court,Coffee Shop,Chinese Restaurant,Stadium,Sandwich Place,Juice Bar,Malay Restaurant,Fast Food Restaurant,Lake,Supermarket
3,CHOA CHU KANG MRT STATION,1.385363,2,Coffee Shop,Fast Food Restaurant,Asian Restaurant,Furniture / Home Store,Bakery,Playground,Sandwich Place,Chinese Restaurant,Café,Noodle House
6,MARSILING MRT STATION,1.432521,2,Coffee Shop,Food Court,Bus Station,Grocery Store,Indian Restaurant,Music Venue,Fast Food Restaurant,Hainan Restaurant,Pet Store,BBQ Joint
10,SEMBAWANG MRT STATION,1.449051,2,Fast Food Restaurant,Coffee Shop,Convenience Store,Asian Restaurant,Supermarket,Bistro,Fish & Chips Shop,Shopping Mall,Chinese Restaurant,BBQ Joint
11,YISHUN MRT STATION,1.429443,2,Food Court,Coffee Shop,Chinese Restaurant,Bus Line,Café,Supermarket,Fast Food Restaurant,Hainan Restaurant,Italian Restaurant,Fried Chicken Joint
12,KHATIB MRT STATION,1.417383,2,Coffee Shop,Skate Park,Supermarket,Asian Restaurant,Indian Restaurant,Train Station,Convenience Store,Pharmacy,Shopping Mall,Bus Stop
14,ANG MO KIO MRT STATION,1.369933,2,Coffee Shop,Food Court,Dessert Shop,Supermarket,Fast Food Restaurant,Bubble Tea Shop,Japanese Restaurant,Bank,Modern European Restaurant,Shopping Mall
16,BISHAN MRT STATION,1.350839,2,Food Court,Coffee Shop,Café,Chinese Restaurant,Bubble Tea Shop,Supermarket,Japanese Restaurant,Asian Restaurant,Cosmetics Shop,Ice Cream Shop
19,TOA PAYOH MRT STATION,1.332629,2,Coffee Shop,Chinese Restaurant,Food Court,Snack Place,Dessert Shop,Cosmetics Shop,Bakery,Bubble Tea Shop,Steakhouse,Supermarket


### Cluster 4 (Green)

In [33]:
mrt_merged.loc[mrt_merged['Cluster Labels'] == 3, mrt_merged.columns[[0,1] + list(range(5, mrt_merged.shape[1]))]]


Unnamed: 0,BUILDING,LATITUDE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
86,LITTLE INDIA MRT STATION,1.3068,3,Indian Restaurant,Vegetarian / Vegan Restaurant,Restaurant,Music Venue,Bakery,General College & University,Park,Food,Coffee Shop,Motel
153,LITTLE INDIA MRT STATION,1.3068,3,Indian Restaurant,Vegetarian / Vegan Restaurant,Restaurant,Music Venue,Bakery,General College & University,Park,Food,Coffee Shop,Motel
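
Little India appears twice because the station table holds one row per station code, and an interchange has a code for each line it serves. Joining the per-name cluster labels back onto that table therefore repeats the label. The sketch below reproduces this on toy data (NE7 and DT12 are Little India's codes, the cluster values mirror the results above, and the data frames themselves are made up) and shows one possible way to keep a single row per station name.

```python
import pandas as pd

# Toy station table: one row per station code, so interchanges repeat
stations = pd.DataFrame({
    'Station': ['NE7', 'DT12', 'NS1'],
    'Station Name': ['Little India', 'Little India', 'Jurong East'],
})

# Toy cluster labels: one row per station name
clusters = pd.DataFrame({
    'Station': ['Little India', 'Jurong East'],
    'Cluster Labels': [3, 0],
})

# Same join pattern as the notebook: match cluster rows on the station name
merged = stations.join(clusters.set_index('Station'), on='Station Name')
print(merged)

# Optional: keep a single row per station name before reporting clusters
unique_stations = merged.drop_duplicates(subset='Station Name')
print(unique_stations)
```

Whether to deduplicate depends on the goal: the map needs only one marker per location, while a per-line breakdown would keep every station code.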


### Cluster 5 (Orange)

In [34]:
mrt_merged.loc[mrt_merged['Cluster Labels'] == 4, mrt_merged.columns[[0,1] + list(range(5, mrt_merged.shape[1]))]]


Unnamed: 0,BUILDING,LATITUDE,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
71,TUAS WEST ROAD MRT STATION,1.329989,4,Soup Place,Yunnan Restaurant,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Flea Market,Flower Shop,Food,Food & Drink Shop


## Discussion

Based on the results shown above, Singapore can indeed be seen as a food hub, as food places fill at least the top 4 most common venues in most of the clusters. Several interesting insights emerged while analysing the data.

Firstly, clusters dominated by food courts and coffee shops cover a large area of Singapore. This shows that local coffee shops make up the majority of the food places in the country.

Next, although Singapore is divided into housing, industrial and business districts, the clusters do not follow these boundaries. This is interesting, as I would have expected business districts to have more restaurants and higher-priced venues. Instead, the business district is split across several clusters, which suggests that district type is not a reasonable way to group the stations.

Lastly, I have observed that shopping malls tend to appear in the lower half of the top 10 most common venues. This could be a result of Singaporeans shifting towards online shopping, which removes the need to visit physical malls.

## Conclusion

In conclusion, the purpose of this project was to study the various MRT Stations to identify possible businesses to start. To this end, we have gathered sufficient data and explored it through data analysis and data visualisation. Through this process, we have offered a variety of choices to anyone interested in exploring any form of business.

Business competition is only one of many factors to take into account when deciding to start a business. For example, factors such as human traffic and rental prices also have to be considered before making a decision.

Furthermore, as the data is requested from Foursquare, we are only able to get data contributed by Foursquare users. This may not be a very accurate representation of Singaporeans' top venues.