# IBM Data Science Capstone Project

## Identify Good Locations to Open a New Flower Shop

### 1. Background
In this capstone project, I use data from the **Foursquare API** (https://developer.foursquare.com/docs/api) to identify good locations for opening a flower shop in the Sydney city area. The approach is to observe the neighbourhoods of existing flower shops and then find other locations that closely resemble them but have little or no competition.

### 2. Problem Description
It is always difficult to open or expand a business in a location we are not familiar with. We may already understand the type of business we want to start, but the characteristics of a new location may be completely different. For example, a florist who wants to start her own business might have worked in a flower shop for a number of years; she knows her regular customers well, how busy the area is, and so on. Because it is almost impossible to spend time researching every area in person, data analysis is a good starting point for identifying locations with a higher chance of success, or at least locations that closely resemble the surroundings of flower shops that remain in business.


### 3. Data


#### 3.1. Data Description
To keep this project to a manageable scope, we limit the target locations to the surrounding areas (within 500m) of all train stations within 50km of the Sydney CBD (*latitude: -33.86785, longitude: 151.20732*). The rationale for choosing the areas around train stations is that higher pedestrian traffic is in general more favourable for operating a flower shop.

All data will be sourced from the Foursquare API. We will obtain geolocations as well as other venue data, e.g. the types of venues within the **target area**, to measure the similarity between the surrounding areas.


#### 3.2. Data to Collect
1. Geographic data for train stations in the target area.
2. Location data for all flower shops in the target area, identifying those within 500m of any train station.
3. Other neighbourhood data, i.e. the types of venues within 500m of each train station (to group locations by similarity).

Install necessary packages:

In [None]:
# Install the folium package for displaying maps (comment out the install line if it is already installed)
import sys
!{sys.executable} -m pip install folium

In [None]:
# Install the geopy package for geographic data (comment out the install line if it is already installed)
!{sys.executable} -m pip install geopy

Import required libraries for data cleaning, calling API and displaying maps:

In [43]:
# import libraries
import requests
import pandas as pd
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.distance import geodesic
from sklearn.cluster import KMeans

Set up Foursquare API credentials:

In [44]:
# Foursquare credentials removed; replace the placeholders below with your own values
foursquare_data = {
    'CLIENT_ID': 'YOUR_CLIENT_ID',
    'CLIENT_SECRET': 'YOUR_CLIENT_SECRET',
    'VERSION': 20190715
}

# Setup credentials
CLIENT_ID = foursquare_data['CLIENT_ID'] # Foursquare Client ID
CLIENT_SECRET = foursquare_data['CLIENT_SECRET'] # your Foursquare Secret
VERSION = foursquare_data['VERSION'] # Foursquare API version
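
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below assumes hypothetical variable names (`FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET`, `FOURSQUARE_VERSION`); none of them is defined by Foursquare, and the helper `load_foursquare_credentials` is illustrative only.

```python
import os

def load_foursquare_credentials(env=None):
    """Read Foursquare credentials from a mapping (defaults to os.environ).

    The variable names are hypothetical, chosen for this sketch only.
    """
    env = os.environ if env is None else env
    required = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError('Missing environment variables: {}'.format(', '.join(missing)))
    return {
        'CLIENT_ID': env['FOURSQUARE_CLIENT_ID'],
        'CLIENT_SECRET': env['FOURSQUARE_CLIENT_SECRET'],
        # API version date; falls back to the date used in this notebook
        'VERSION': env.get('FOURSQUARE_VERSION', '20190715'),
    }
```

The returned dictionary has the same keys as `foursquare_data`, so it could be used in its place.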

List all **train station** and **flower shop** locations using the Foursquare API. To do so, we need the `categoryId` values specified in the documentation (https://developer.foursquare.com/docs/resources/categories).

In [45]:
# Global variables
# LIMIT: maximum number of records returned by the API call
LIMIT = 100 

# Sydney location
syd_lat = -33.86785
syd_long = 151.20732

# Foursquare API 'categoryId' values for train stations and flower shops
train_cat_id = '4bf58dd8d48988d129951735'
flower_cat_id = '4bf58dd8d48988d11b951735'

Create a function to get all **venues** within a certain radius of a location.

In [46]:
# Function to get the venue list for one category around a single location
def get_venues(latitudes, longitudes, categoryId, radius):
    # create the API request URL, centred on the location passed in
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            latitudes,
            longitudes,
            categoryId,
            radius,
            LIMIT)

    # make the GET request
    results = requests.get(url).json()['response']['venues']

    # keep only the name and coordinates of each venue
    venues_list = [(
                 v['name'],
                 v['location']['lat'],
                 v['location']['lng']) for v in results]

    venues_df = pd.DataFrame(venues_list, columns=['Name', 'Latitude', 'Longitude'])
    return venues_df
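
The query string can also be assembled from a dictionary and handed to `requests.get` through its `params` argument, which takes care of URL encoding. A minimal sketch follows; `build_search_params` is a hypothetical helper, and the parameter names are the ones that appear in the URL above.

```python
def build_search_params(client_id, client_secret, version,
                        latitude, longitude, category_id, radius, limit):
    """Return the query parameters for a venues/search request as a dict."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(latitude, longitude),  # "lat,lng" pair
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }

# Example with placeholder credentials (no request is sent here)
params = build_search_params('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 20190715,
                             -33.86785, 151.20732,
                             '4bf58dd8d48988d129951735', 50000, 100)
# The dict could then be passed as:
# requests.get('https://api.foursquare.com/v2/venues/search', params=params)
```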

Create a list of all **train stations** within **50km** of **Sydney CBD**.

In [47]:
# distance from the Sydney CBD in meters. i.e. 50,000 = 50km
cbd_radius = 50000
stations_df = get_venues(syd_lat, syd_long, train_cat_id, cbd_radius)
stations_df.columns=['Name','Latitude','Longitude',]
stations_df.head()

Unnamed: 0,Name,Latitude,Longitude
0,Wynyard Station (Main Concourse),-33.865676,151.206163
1,Town Hall Station,-33.873644,151.206781
2,Town Hall Station (Main Concourse),-33.874051,151.20685
3,Ashfield Station,-33.887855,151.125538
4,Museum Station (Concourse),-33.8767,151.209732


Clean the data by removing items with **Concourse**, **Platform** or **Interchange** in the station name, since these represent facilities or areas within a train station rather than the station itself.

In [48]:
# remove items with Concourse, Platform, Interchange in the name
stations_df = stations_df[~stations_df['Name'].str.contains('Concourse')]
stations_df = stations_df[~stations_df['Name'].str.contains('Platform')]
stations_df = stations_df[~stations_df['Name'].str.contains('Interchange')]
stations_df.head()

Unnamed: 0,Name,Latitude,Longitude
1,Town Hall Station,-33.873644,151.206781
3,Ashfield Station,-33.887855,151.125538
5,Kings Cross Station,-33.874495,151.222429
6,Fairfield Station,-33.872459,150.956831
7,Kogarah Station,-33.962524,151.132609
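
The three filters can also be combined into a single regular expression. The sketch below uses a small hand-made dataframe with names taken from the output above rather than the API result.

```python
import pandas as pd

# Toy sample of station names taken from the output above
sample_df = pd.DataFrame({'Name': ['Wynyard Station (Main Concourse)',
                                   'Town Hall Station',
                                   'Town Hall Station (Main Concourse)',
                                   'Ashfield Station',
                                   'Museum Station (Concourse)']})

# '|' means "or", so a name containing any of the three words is dropped
facility_pattern = 'Concourse|Platform|Interchange'
cleaned_df = sample_df[~sample_df['Name'].str.contains(facility_pattern)]
print(cleaned_df['Name'].tolist())
```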


In [49]:
# print number of train stations
print('There are {} stations within 50km of Sydney CBD.'.format(stations_df.shape[0]))

There are 38 stations within 50km of Sydney CBD.


Create a list of all **flower shops** within **50km** of **Sydney CBD**.

In [50]:
# distance from the Sydney CBD in meters. i.e. 50,000 = 50km
cbd_radius = 50000
flower_shops_df = get_venues(syd_lat, syd_long, flower_cat_id, cbd_radius)
flower_shops_df.columns=['Name','Latitude','Longitude']
flower_shops_df.head()

Unnamed: 0,Name,Latitude,Longitude
0,Flowers On Martin Place,-33.8676,151.208
1,Colors On Stems,-33.772752,150.968903
2,Bubble Nini Tea,-33.88582,151.19912
3,Cuppa Flower,-33.902409,151.203652
4,Westflowers,-33.773892,150.696385


In [51]:
# print number of flower shops
print('There are {} flower shops within 50km of Sydney CBD.'.format(flower_shops_df.shape[0]))

There are 50 flower shops within 50km of Sydney CBD.


Create a map of **Sydney** with **train stations** and **flower shops** superimposed on top.

In [52]:
# create map of Sydney using latitude and longitude values
syd_map = folium.Map(location=[syd_lat, syd_long], zoom_start=9)

# add station markers to map
for lat, lng, name in zip(stations_df['Latitude'], stations_df['Longitude'], stations_df['Name']):
    label = name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(syd_map)  

# add flower shop markers to map
for lat, lng, name in zip(flower_shops_df['Latitude'], flower_shops_df['Longitude'], flower_shops_df['Name']):
    label = name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(syd_map)  

syd_map

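Calculate the distance from each flower shop to every train station and keep the shop-station pairs that are close together. Note that the code below uses a threshold of 1km, which is wider than the 500m used for the station surroundings.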

In [53]:
near_station_shops_list = []

for shop_lat, shop_lng, shop_name in zip(flower_shops_df['Latitude'], flower_shops_df['Longitude'], flower_shops_df['Name']):
    flower_shop_loc = (shop_lat, shop_lng)

    for station_lat, station_lng, station_name in zip(stations_df['Latitude'], stations_df['Longitude'], stations_df['Name']):
        station_loc = (station_lat, station_lng)
        if geodesic(flower_shop_loc, station_loc).km <= 1:
            near_station_shops_list.append([shop_name, shop_lat, shop_lng, station_name])
            
near_station_shops_df = pd.DataFrame(near_station_shops_list)
near_station_shops_df.columns = ['Name', 
                                 'Latitude', 
                                 'Longitude',
                                 'Station']
near_station_shops_df.head()

Unnamed: 0,Name,Latitude,Longitude,Station
0,Flowers On Martin Place,-33.8676,151.208,Town Hall Station
1,Bubble Nini Tea,-33.88582,151.19912,Redfern Station
2,Bubble Nini Tea,-33.88582,151.19912,Central Station
3,Cuppa Flower,-33.902409,151.203652,Green Square Station
4,Fika by Cuppa Flower,-33.884921,151.200009,Redfern Station
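
To see what the distance check does for a single pair, the great-circle distance can be approximated with the haversine formula using only the standard library. This is a sketch of an alternative, not what `geopy.distance.geodesic` computes internally: it assumes a spherical Earth with a mean radius of 6371 km, so results can differ slightly from the geodesic distance.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(loc_a, loc_b):
    """Approximate great-circle distance in km between two (lat, lng) pairs."""
    lat1, lng1 = map(radians, loc_a)
    lat2, lng2 = map(radians, loc_b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Coordinates taken from the outputs above
town_hall = (-33.873644, 151.206781)
martin_place_flowers = (-33.8676, 151.208)
distance_km = haversine_km(martin_place_flowers, town_hall)
print(round(distance_km, 2))
```

For this pair the approximation gives roughly 0.7 km, so the shop passes the `<= 1` km check in the loop above but would not pass a 500m check.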


In [54]:
near_station_shops_grouped = near_station_shops_df.groupby('Name').count()
near_station_shops_grouped.head()

Unnamed: 0_level_0,Latitude,Longitude,Station
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Blooming Trails,1,1,1
Bright Flowers,1,1,1
Bubble Nini Tea,2,2,2
Buds and Bowers,1,1,1
Butterfly Blooms Garden Centre,1,1,1


In [56]:
# print number of flower shops close to train stations
print('There are {} flower shops within 50km of Sydney CBD that are close to a train station (within 1km).'.format(near_station_shops_grouped.shape[0]))

There are 21 flower shops within 50km of Sydney CBD that are close to a train station (within 1km).


Display a map showing all **train stations** together with only those **flower shops** that are close to a train station.

In [57]:
# create a new map of Sydney using latitude and longitude values
syd_map_shops_near_stations = folium.Map(location=[syd_lat, syd_long], zoom_start=9)

# add station markers to map
for lat, lng, name in zip(stations_df['Latitude'], stations_df['Longitude'], stations_df['Name']):
    label = name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(syd_map_shops_near_stations)  

# add markers for flower shops near stations to map
for lat, lng, name in zip(near_station_shops_df['Latitude'], near_station_shops_df['Longitude'], near_station_shops_df['Name']):
    label = name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(syd_map_shops_near_stations)  

syd_map_shops_near_stations

In [100]:
# print the share of flower shops operating near train stations
print('{}% of flower shops operate near Sydney train stations'.format(round(near_station_shops_grouped.shape[0]/flower_shops_df.shape[0]*100,2)))

42.0% of flower shops operate near Sydney train stations


Create a function to get nearby venues of all stations in the **target area**.

In [58]:
# reuse the get-nearby-venues function created in a previous exercise
def get_nearby_venues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
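
The list comprehension above assumes every venue has at least one category; an item with an empty `categories` list would raise an `IndexError`. A defensive variant is sketched below. `extract_venue_row` is a hypothetical helper, and the item layout is assumed from the fields accessed in the code above.

```python
def extract_venue_row(station_name, station_lat, station_lng, item):
    """Flatten one item of an explore response into a tuple.

    The category is None when the venue has no categories.
    """
    venue = item['venue']
    categories = venue.get('categories') or []
    category_name = categories[0]['name'] if categories else None
    return (station_name,
            station_lat,
            station_lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category_name)

# Hand-made item mimicking the fields used above
sample_item = {'venue': {'name': 'Kinokuniya',
                         'location': {'lat': -33.872456, 'lng': 151.207525},
                         'categories': [{'name': 'Bookstore'}]}}
row = extract_venue_row('Town Hall Station', -33.873644, 151.206781, sample_item)
```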

In [59]:
# set limit to 100 venues
LIMIT = 100

# get all stations' surrounding
stations_surrounding = get_nearby_venues(names=stations_df['Name'],
                                         latitudes=stations_df['Latitude'],
                                         longitudes=stations_df['Longitude']
                                         )


In [60]:
stations_surrounding.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Town Hall Station,-33.873644,151.206781,Kinokuniya,-33.872456,151.207525,Bookstore
1,Town Hall Station,-33.873644,151.206781,Grandma's Bar,-33.872138,151.205636,Cocktail Bar
2,Town Hall Station,-33.873644,151.206781,MUJI 無印良品,-33.872634,151.207226,Miscellaneous Shop
3,Town Hall Station,-33.873644,151.206781,Marble Bar,-33.871971,151.20714,Cocktail Bar
4,Town Hall Station,-33.873644,151.206781,Kings Comics,-33.875246,151.207837,Bookstore


In [61]:
# print number of venues surrounding train stations
print('Number of venues surrounding stations: {}'.format(stations_surrounding.shape[0]))

Number of venues surrounding stations: 1414


In [64]:
stations_surrounding.groupby('Neighborhood')['Venue'].count()

Neighborhood
Artarmon Station                  21
Ashfield Station                  41
Bankstown Station                 57
Burwood Station                   44
Campsie Station                   19
Central Station                   82
Chatswood Station                 42
Clyde Station                      4
Domestic Airport Station          43
Fairfield Station                 19
Glenfield Station                  5
Gordon Station                    18
Green Square Station              20
Hornsby Station                   44
International Airport Station     37
Kings Cross Station              100
Kogarah Station                   26
Macquarie University Station      57
Meadowbank Station                14
Milsons Point Station             52
Narwee Station                     6
Newtown Station                   75
Parramatta Station                87
Penrith Station                   35
Petersham Station                 22
Redfern Station                   70
Rhodes Station           

Number of unique categories among all the returned venues:

In [65]:
print('There are {} unique categories.'.format(len(stations_surrounding['Venue Category'].unique())))

There are 195 unique categories.


### 4. Problem Solving Approach

To identify suitable shop locations, the data is used as follows:
1. After collecting the neighbourhood data, i.e. the types of venues around each train station, perform **one-hot encoding** on the venue categories and list the top 20 venue categories surrounding each train station.
2. Use *k*-means clustering to assign each train station to a cluster, so that stations within a cluster have highly similar surroundings.
3. Calculate the average number of flower shops per station in each cluster and identify the cluster with the highest density of flower shops (the **best cluster**).
4. Within the **best cluster**, find the stations with no flower shops, or the fewest, as the **target locations**.
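
Steps 3 and 4 can be sketched on a toy example with plain Python. The station names, cluster labels and shop counts below are made up for illustration and are not results from this analysis.

```python
from collections import defaultdict

# (station, cluster label, number of nearby flower shops); illustrative values
toy_stations = [
    ('Station A', 0, 0),
    ('Station B', 1, 2),
    ('Station C', 1, 0),
    ('Station D', 1, 1),
    ('Station E', 2, 1),
    ('Station F', 2, 0),
]

# Step 3: mean number of flower shops per station in each cluster
counts_by_cluster = defaultdict(list)
for _, cluster, shop_count in toy_stations:
    counts_by_cluster[cluster].append(shop_count)
mean_by_cluster = {c: sum(v) / len(v) for c, v in counts_by_cluster.items()}
best_cluster = max(mean_by_cluster, key=mean_by_cluster.get)

# Step 4: stations in the best cluster that have no flower shop yet
target_stations = [name for name, cluster, shop_count in toy_stations
                   if cluster == best_cluster and shop_count == 0]
print(best_cluster, target_stations)
```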

#### 4.1. Perform One-Hot Encoding
Encode the venue categories surrounding each train station:

In [66]:
# one hot encoding
stations_onehot = pd.get_dummies(stations_surrounding[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood (station) column back to dataframe
stations_onehot['Station'] = stations_surrounding['Neighborhood'] 

# move station column to the first column
fixed_columns = [stations_onehot.columns[-1]] + list(stations_onehot.columns[:-1])
stations_onehot = stations_onehot[fixed_columns]

stations_onehot.head()

Unnamed: 0,Station,Accessories Store,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio
0,Town Hall Station,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Town Hall Station,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Town Hall Station,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Town Hall Station,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Town Hall Station,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by station and take the mean of the frequency of occurrence of each category.

In [67]:
# calculate the mean frequency of occurrence of each type of venue surrounding the station
stations_grouped = stations_onehot.groupby('Station').mean().reset_index()
stations_grouped.head()

Unnamed: 0,Station,Accessories Store,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio
0,Artarmon Station,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ashfield Station,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0
2,Bankstown Station,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.157895,0.0,0.0,0.0,0.0
3,Burwood Station,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0
4,Campsie Station,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0
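
To see why the mean of the one-hot columns gives a frequency, consider a toy example with made-up venues. Each row is one venue, so averaging a category's 0/1 column within a station yields the share of that station's venues belonging to the category.

```python
import pandas as pd

# Made-up venues for two stations
toy_venues = pd.DataFrame({
    'Neighborhood': ['Station A', 'Station A', 'Station A', 'Station A', 'Station B'],
    'Venue Category': ['Café', 'Café', 'Bakery', 'Park', 'Café'],
})

# One 0/1 column per category (dtype=int keeps the values numeric)
toy_onehot = pd.get_dummies(toy_venues[['Venue Category']],
                            prefix="", prefix_sep="", dtype=int)
toy_onehot['Station'] = toy_venues['Neighborhood']

# Mean of each 0/1 column per station = relative frequency of the category
toy_grouped = toy_onehot.groupby('Station').mean().reset_index()
print(toy_grouped)
```

In this toy data, half of Station A's four venues are cafés, so its `Café` value is 0.5, while Station B's single venue gives 1.0.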


In [68]:
num_top_venues = 20

for hood in stations_grouped['Station']:
    print("----"+hood+"----")
    temp = stations_grouped[stations_grouped['Station'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Artarmon Station----
                            venue  freq
0                            Café  0.19
1                 Thai Restaurant  0.10
2             Japanese Restaurant  0.10
3                     Coffee Shop  0.05
4               Convenience Store  0.05
5                          Bakery  0.05
6                            Park  0.05
7                       BBQ Joint  0.05
8                Asian Restaurant  0.05
9                  Sandwich Place  0.05
10                    Pizza Place  0.05
11               Sushi Restaurant  0.05
12         Furniture / Home Store  0.05
13                          Plaza  0.05
14               Ramen Restaurant  0.05
15                          Motel  0.05
16               Pedestrian Plaza  0.00
17           Pakistani Restaurant  0.00
18  Paper / Office Supplies Store  0.00
19                    Pastry Shop  0.00


----Ashfield Station----
                  venue  freq
0   Dumpling Restaurant  0.07
1              Platform  0.07
2   Japanese Resta

                     venue  freq
0                     Café  0.25
1      Sporting Goods Shop  0.10
2        Electronics Store  0.10
3              Coffee Shop  0.10
4            Train Station  0.05
5              Bus Station  0.05
6   Furniture / Home Store  0.05
7                      Gym  0.05
8          Thai Restaurant  0.05
9                      Bar  0.05
10          Clothing Store  0.05
11             Supermarket  0.05
12               Pet Store  0.05
13      Persian Restaurant  0.00
14                    Pier  0.00
15         Other Nightlife  0.00
16                   Plaza  0.00
17              Playground  0.00
18                Platform  0.00
19    Pakistani Restaurant  0.00


----Hornsby Station----
                    venue  freq
0                    Café  0.23
1    Fast Food Restaurant  0.07
2       Electronics Store  0.05
3                     Gym  0.05
4       Korean Restaurant  0.05
5                  Bakery  0.05
6           Movie Theater  0.05
7        Asian Restaurant

                    venue  freq
0                    Café  0.32
1   Portuguese Restaurant  0.18
2             Beer Garden  0.05
3                    Pool  0.05
4             Pizza Place  0.05
5      Falafel Restaurant  0.05
6                Pharmacy  0.05
7      Chinese Restaurant  0.05
8                    Park  0.05
9                     Pub  0.05
10          Bowling Green  0.05
11          Grocery Store  0.05
12            Yoga Studio  0.05
13              Mini Golf  0.00
14               Pie Shop  0.00
15                 Market  0.00
16       Pedestrian Plaza  0.00
17  Performing Arts Venue  0.00
18     Persian Restaurant  0.00
19       Malay Restaurant  0.00


----Redfern Station----
                    venue  freq
0                    Café  0.24
1                     Bar  0.10
2         Thai Restaurant  0.07
3                  Bakery  0.07
4                     Pub  0.06
5             Pizza Place  0.04
6             Coffee Shop  0.03
7    Fast Food Restaurant  0.03
8     Japanese

                            venue  freq
0                   Historic Site   0.2
1                           Beach   0.2
2                            Park   0.2
3                       Pet Store   0.2
4                   Train Station   0.2
5           Performing Arts Venue   0.0
6                    Noodle House   0.0
7                          Office   0.0
8                    Optical Shop   0.0
9                 Other Nightlife   0.0
10           Pakistani Restaurant   0.0
11  Paper / Office Supplies Store   0.0
12                    Pastry Shop   0.0
13               Pedestrian Plaza   0.0
14              Accessories Store   0.0
15                      Newsstand   0.0
16             Persian Restaurant   0.0
17                       Pharmacy   0.0
18                       Pie Shop   0.0
19                           Pier   0.0




First, let's write a function to sort the venues in descending order.

In [69]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 20 venues for each station.

In [70]:
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Station']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Station'] = stations_grouped['Station']

for ind in np.arange(stations_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(stations_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Artarmon Station,Café,Thai Restaurant,Japanese Restaurant,Plaza,Ramen Restaurant,Bakery,Sushi Restaurant,BBQ Joint,Furniture / Home Store,...,Park,Sandwich Place,Coffee Shop,Motel,Convenience Store,Pizza Place,Farmers Market,Fast Food Restaurant,Fish Market,Flea Market
1,Ashfield Station,Dumpling Restaurant,Platform,Japanese Restaurant,Shanghai Restaurant,Electronics Store,Asian Restaurant,Chinese Restaurant,Supermarket,Malay Restaurant,...,Sushi Restaurant,Department Store,Polish Restaurant,Bakery,Pub,Falafel Restaurant,Restaurant,Bubble Tea Shop,Café,Bar
2,Bankstown Station,Vietnamese Restaurant,Café,Sports Bar,Middle Eastern Restaurant,Buffet,Gym,Chinese Restaurant,Steakhouse,Asian Restaurant,...,Italian Restaurant,Shopping Mall,Burger Joint,Sandwich Place,Bus Station,Portuguese Restaurant,Record Shop,Multiplex,Pizza Place,Coffee Shop
3,Burwood Station,Café,Noodle House,Chinese Restaurant,Supermarket,Fast Food Restaurant,Japanese Restaurant,Coffee Shop,Department Store,Salad Place,...,Middle Eastern Restaurant,Cantonese Restaurant,Big Box Store,Sandwich Place,Pub,Multiplex,Fried Chicken Joint,Hookah Bar,Optical Shop,Ice Cream Shop
4,Campsie Station,Malay Restaurant,Chinese Restaurant,Korean Restaurant,Liquor Store,Fried Chicken Joint,Fast Food Restaurant,Shopping Mall,Sushi Restaurant,Bus Station,...,Gym,Department Store,Vietnamese Restaurant,Pharmacy,Japanese Restaurant,Duty-free Shop,Farmers Market,Flea Market,Dim Sum Restaurant,Fish Market
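
The column labels above fall back to the 'th' suffix whenever the index runs past the three listed suffixes, which is enough for 20 columns. A more general helper that also handles numbers such as 11, 12, 13 and 21 is sketched below; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same column labels as above for the top 20 venues
columns = ['Station'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 21)]
```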


#### 4.2. Cluster Stations by Similar Surroundings

Use *k*-means clustering to assign each train station to a cluster, so that stations within a cluster have highly similar surroundings. Here the stations are grouped into 5 clusters.

In [71]:
# set number of clusters
kclusters = 5

stations_grouped_clustering = stations_grouped.drop(columns='Station')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(stations_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 2, 2, 2, 0, 2, 2, 3, 2, 2], dtype=int32)
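
The number of clusters is fixed by hand above. One common way to compare candidate values of *k* is the silhouette score, sketched here on a small synthetic dataset rather than the station data; the points and the range of *k* are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic groups of points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # the silhouette score is close to 1 when clusters are compact and far apart
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

The same loop could be run on `stations_grouped_clustering` to check whether another value of *k* separates the stations more cleanly.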

Create a new dataframe that includes the cluster label as well as the top 20 venues for each station.

In [72]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [73]:
# merge dataframes to add latitude/longitude for each station
stations_merged = stations_df.join(neighborhoods_venues_sorted.set_index('Station'), on='Name')
stations_merged.rename(columns={'Name':'Station'},inplace=True)
stations_merged.head()

Unnamed: 0,Station,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
1,Town Hall Station,-33.873644,151.206781,2,Japanese Restaurant,Café,Coffee Shop,Hotel,Cocktail Bar,Malay Restaurant,...,Tea Room,Hotel Bar,Ramen Restaurant,Record Shop,Burger Joint,Shopping Mall,Hobby Shop,Speakeasy,Bar,Clothing Store
3,Ashfield Station,-33.887855,151.125538,2,Dumpling Restaurant,Platform,Japanese Restaurant,Shanghai Restaurant,Electronics Store,Asian Restaurant,...,Sushi Restaurant,Department Store,Polish Restaurant,Bakery,Pub,Falafel Restaurant,Restaurant,Bubble Tea Shop,Café,Bar
5,Kings Cross Station,-33.874495,151.222429,1,Café,Italian Restaurant,Coffee Shop,Pub,Bar,Thai Restaurant,...,Burger Joint,Gym,Mexican Restaurant,Sandwich Place,Japanese Restaurant,Dumpling Restaurant,Indian Restaurant,Bakery,Speakeasy,Hotel
6,Fairfield Station,-33.872459,150.956831,2,Fast Food Restaurant,Thai Restaurant,Asian Restaurant,Tennis Court,Bar,Sandwich Place,...,Supermarket,Department Store,Bowling Alley,Iraqi Restaurant,Pharmacy,Flea Market,Food & Drink Shop,Farmers Market,Food Court,Fish Market
7,Kogarah Station,-33.962524,151.132609,1,Café,Thai Restaurant,Pizza Place,Pub,Portuguese Restaurant,Supermarket,...,Bakery,Convenience Store,Fast Food Restaurant,Lebanese Restaurant,Japanese Restaurant,Indian Restaurant,Platform,Coffee Shop,Train Station,Bagel Shop


In [74]:
# count number of stations in each cluster
stations_merged.groupby('Cluster Labels')['Station'].count()

Cluster Labels
0     2
1    14
2    19
3     2
4     1
Name: Station, dtype: int64

Display station clusters in a map.

In [75]:
# create map
map_clusters = folium.Map(location=[syd_lat, syd_long], zoom_start=9)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(stations_merged['Latitude'], stations_merged['Longitude'], stations_merged['Station'], stations_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Insert the number of flower shops into the cluster dataframe.

In [76]:
num_shops_station_df = near_station_shops_df.groupby('Station').count().reset_index()
num_shops_station_df.drop(['Latitude','Longitude'],axis=1, inplace=True)
num_shops_station_df.rename(columns={'Name':'Flower Shop Count'},inplace=True)
num_shops_station_df.head()

Unnamed: 0,Station,Flower Shop Count
0,Bankstown Station,1
1,Central Station,3
2,Green Square Station,2
3,International Airport Station,1
4,Kings Cross Station,2


In [77]:
stations_merged = stations_merged.join(num_shops_station_df.set_index('Station'), on='Station')

In [78]:
stations_merged.fillna(0,inplace=True)
stations_merged.head()

Unnamed: 0,Station,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,Flower Shop Count
1,Town Hall Station,-33.873644,151.206781,2,Japanese Restaurant,Café,Coffee Shop,Hotel,Cocktail Bar,Malay Restaurant,...,Hotel Bar,Ramen Restaurant,Record Shop,Burger Joint,Shopping Mall,Hobby Shop,Speakeasy,Bar,Clothing Store,3.0
3,Ashfield Station,-33.887855,151.125538,2,Dumpling Restaurant,Platform,Japanese Restaurant,Shanghai Restaurant,Electronics Store,Asian Restaurant,...,Department Store,Polish Restaurant,Bakery,Pub,Falafel Restaurant,Restaurant,Bubble Tea Shop,Café,Bar,0.0
5,Kings Cross Station,-33.874495,151.222429,1,Café,Italian Restaurant,Coffee Shop,Pub,Bar,Thai Restaurant,...,Gym,Mexican Restaurant,Sandwich Place,Japanese Restaurant,Dumpling Restaurant,Indian Restaurant,Bakery,Speakeasy,Hotel,2.0
6,Fairfield Station,-33.872459,150.956831,2,Fast Food Restaurant,Thai Restaurant,Asian Restaurant,Tennis Court,Bar,Sandwich Place,...,Department Store,Bowling Alley,Iraqi Restaurant,Pharmacy,Flea Market,Food & Drink Shop,Farmers Market,Food Court,Fish Market,0.0
7,Kogarah Station,-33.962524,151.132609,1,Café,Thai Restaurant,Pizza Place,Pub,Portuguese Restaurant,Supermarket,...,Convenience Store,Fast Food Restaurant,Lebanese Restaurant,Japanese Restaurant,Indian Restaurant,Platform,Coffee Shop,Train Station,Bagel Shop,1.0


Calculate the mean number of flower shops per station in each cluster.

In [88]:
stations_selected = stations_merged[['Station','Cluster Labels','Latitude','Longitude','Flower Shop Count']]
cluster_count_df = stations_selected.groupby('Cluster Labels').mean(numeric_only=True).reset_index()
cluster_count_df[['Cluster Labels','Flower Shop Count']]

Unnamed: 0,Cluster Labels,Flower Shop Count
0,0,0.0
1,1,1.0
2,2,0.578947
3,3,0.0
4,4,0.0


Find the cluster with highest concentration of flower shops.

In [89]:
cluster_count_df = cluster_count_df.sort_values(by=['Flower Shop Count'],ascending=False).reset_index(drop=True)
print('The cluster with the highest flower shop density is {}, which has an average of {} per station.'.format(
    cluster_count_df['Cluster Labels'][0],
    round(cluster_count_df['Flower Shop Count'][0], 2)))


The cluster with the highest flower shop density is 1, which has an average of 1.0 per station.


List all stations in the cluster with highest flower shop density.

In [90]:
best_stations_df = stations_selected.loc[stations_merged['Cluster Labels'] == cluster_count_df['Cluster Labels'][0]]
best_stations_df

Unnamed: 0,Station,Cluster Labels,Latitude,Longitude,Flower Shop Count
5,Kings Cross Station,1,-33.874495,151.222429,2.0
7,Kogarah Station,1,-33.962524,151.132609,1.0
12,Rhodes Station,1,-33.83083,151.086965,0.0
15,Artarmon Station,1,-33.808965,151.183467,0.0
21,Macquarie University Station,1,-33.777471,151.118091,0.0
22,St Leonards Station,1,-33.82315,151.194199,0.0
23,Milsons Point Station,1,-33.846497,151.212057,1.0
30,Green Square Station,1,-33.905861,151.20272,2.0
31,Narwee Station,1,-33.947483,151.070773,0.0
32,Redfern Station,1,-33.891989,151.19875,4.0


List all stations in the best cluster for operating a flower shop that are currently **without** a flower shop.

In [91]:
final_list = best_stations_df[best_stations_df['Flower Shop Count'] == 0]
final_list

Unnamed: 0,Station,Cluster Labels,Latitude,Longitude,Flower Shop Count
12,Rhodes Station,1,-33.83083,151.086965,0.0
15,Artarmon Station,1,-33.808965,151.183467,0.0
21,Macquarie University Station,1,-33.777471,151.118091,0.0
22,St Leonards Station,1,-33.82315,151.194199,0.0
31,Narwee Station,1,-33.947483,151.070773,0.0
46,Hornsby Station,1,-33.703385,151.098361,0.0


In [92]:
# create map
map_final = folium.Map(location=[syd_lat, syd_long], zoom_start=9)

# add markers to the map
markers_colors = []
for lat, lon, station in zip(final_list['Latitude'], final_list['Longitude'], final_list['Station']):
    label = folium.Popup(station, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_final)
       
map_final