# An analysis of potential theme park locations for the metropolitan Melbourne Area

This notebook contains the capstone project work done for the IBM Professional Certificate on Data Science. Complete details of this project are available in this [report](https://github.com/andremun/Coursera_Capstone/blob/master/report/report.pdf).

## Introduction

Melbourne, Victoria is the second largest metropolitan area in Australia, with an estimated 5.191 million inhabitants as of 2019. According to [Mercer's Quality of Living City Ranking](https://mobilityexchange.mercer.com/Insights/quality-of-living-rankings), it is one of the cities with the highest quality of life in the world, and it has one of the best education systems in the country according to [Quacquarelli Symonds Best Student Cities](https://www.topuniversities.com/city-rankings/2018#sorting=rank+custom=rank+order=desc+search=), making it an attractive city for families. However, Melbourne lacks large theme and attraction parks. According to the state government of Victoria, [the only two parks](https://liveinmelbourne.vic.gov.au/discover/things-to-do-in-melbourne/childrens-activities-in-melbourne) in the region are:
	
- Luna Park, located in the inner-city suburb of St. Kilda, 8 kilometres from the central business district, with an area of over 11 thousand square metres. It opened in 1912, contains 20 attractions and operates year round.
- Adventure Park, located in the nearby city of Geelong, 92 kilometres from the central business district, with an area of over 2 square kilometres. It opened in 1994, contains 20 attractions and operates from October to April, a period that covers the summer months in Australia.
	
Therefore, **the aim of this project is to identify the location of a new attraction park** that is closer to the central business district than Adventure Park and has an available area for expansion larger than Luna Park. The location should be within the metropolitan region, close to population growth corridors, with a high number of families nearby. Ideally, the location should be close to a suburban train station and other amenities like shopping centres, museums or zoological parks. The project does not focus on the economic feasibility of the park, but the families in the vicinity should have a moderate income that would allow them to visit the park often. Interested parties would be developers, government entities, and families.

This notebook is divided into the following sections: (1) Importing libraries, where all the necessary packages are loaded into memory, (2) Data collection and cleaning, where the data from the existing datasets are pre-processed, (3) ...


## 1 - Importing Libraries

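For this work, standard libraries such as `pandas`, `numpy`, `sklearn` and `matplotlib` will be used, together with `geopy` for geocoding and distances, `requests` for the API calls and `folium` for map rendering.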

In [62]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.distance import geodesic
import requests # library to handle requests
import matplotlib.cm as cm # Matplotlib and associated plotting modules
import matplotlib.colors as colors
from sklearn.cluster import KMeans # import k-means from clustering stage
from sklearn.metrics import silhouette_score
import folium # map rendering library
print('Libraries imported.')

Libraries imported.


## 2 - Data collection and cleaning

### Postcode data

The first dataset corresponds to the **postcodes** for the State of Victoria, as provided by [Zen10](https://zen10.com.au/melbourne-suburb-list/), a Search Engine Optimisation consultancy based in Melbourne. The data is in the file `victoria_postcodes.csv` and was imported using the `pandas` package. It is noteworthy that some suburbs have more than one postcode associated with them. Therefore, the data was pre-processed so that each unique suburb name is paired with the list of its postcodes. Moreover, any suburb that does not belong to the Melbourne metropolitan area (those marked with the regions `vic far country` and `vic country`) was removed.

In [61]:
file_suburb_pcode = 'victoria_postcodes.csv'
data_suburb_pcode = pd.read_csv(file_suburb_pcode, names=['Postcode','Suburb','Region'])
data_suburb_pcode['Suburb'] = data_suburb_pcode['Suburb'].str.lower()
data_suburb_pcode['Region'] = data_suburb_pcode['Region'].str.lower()
data_suburb_pcode = data_suburb_pcode.loc[data_suburb_pcode['Region']!='vic far country']
data_suburb_pcode = data_suburb_pcode.loc[data_suburb_pcode['Region']!='vic country']
suburb_pcodes = pd.DataFrame(columns=['Suburb','Postcodes'])
suburb_names = data_suburb_pcode['Suburb'].value_counts().index

# For every unique suburb
for ii in range(0,len(suburb_names)):
    # Collect every postcode listed for that suburb
    idx = data_suburb_pcode['Suburb'] == suburb_names[ii]
    pcs = data_suburb_pcode.loc[idx]['Postcode'].values.tolist()
    suburb_pcodes.loc[ii] = [suburb_names[ii],pcs]
    
suburb_pcodes.head(5)

| | Suburb | Postcodes |
|---|---|---|
| 0 | melbourne | [3000, 3001, 3004] |
| 1 | dandenong south | [3164, 3175] |
| 2 | coburg | [3058] |
| 3 | mount burnett | [3781] |
| 4 | fitzroy north | [3068] |
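
The same suburb-to-postcodes grouping can also be written with `groupby`. The snippet below is only a sketch on a small illustrative frame: the rows mimic the `Postcode`/`Suburb`/`Region` layout, the `vic metro` label and the Ballarat row are assumptions made for the example, and the result is sorted by suburb name rather than by frequency.

```python
import pandas as pd

# Illustrative rows only; they mimic the layout of victoria_postcodes.csv.
sample = pd.DataFrame({
    'Postcode': [3000, 3001, 3004, 3058, 3350],
    'Suburb': ['Melbourne', 'Melbourne', 'Melbourne', 'Coburg', 'Ballarat'],
    'Region': ['VIC METRO', 'VIC METRO', 'VIC METRO', 'VIC METRO', 'VIC COUNTRY'],
})
sample['Suburb'] = sample['Suburb'].str.lower()
sample['Region'] = sample['Region'].str.lower()

# Drop the non-metropolitan regions in a single step.
metro = sample[~sample['Region'].isin(['vic far country', 'vic country'])]

# Collect every postcode of a suburb into one list per suburb.
suburb_pcodes_sketch = (metro.groupby('Suburb')['Postcode']
                             .apply(list)
                             .reset_index(name='Postcodes'))
print(suburb_pcodes_sketch)
```

Compared with the explicit loop, this avoids growing a dataframe row by row, which tends to be slow for long suburb lists.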


### Geographical data

The second dataset contains the **geographical** information for each suburb in the Melbourne metropolitan area. This data was originally in [GeoJSON format](https://github.com/stephenmuss/suburb-boundaries-geojson/blob/master/vic.json), and is more thoroughly documented in the [report](https://github.com/andremun/Coursera_Capstone/blob/master/report/report.pdf). This file was tricky to pre-process in a Jupyter notebook, due to the nested structure of the `geometry` variable, which contains the coordinates of the boundary points. Given the limited amount of time for this project, I decided to use [MATLAB](https://www.mathworks.com/) to obtain the centroid and the area in square kilometres of each suburb. The interested reader can check the MATLAB script provided [here](). The result of this pre-processing is the `greater_melbourne_suburb_data.csv` file, which is loaded using `pandas` and joined with the postcode data.

In [40]:
file_suburb_coord = 'greater_melbourne_suburb_data.csv'
data_suburb_coord = pd.read_csv(file_suburb_coord)
data_suburb_coord['Suburb'] = data_suburb_coord['Suburb'].str.lower()
data_suburb = data_suburb_coord
data_suburb = data_suburb.join(suburb_pcodes.set_index('Suburb'), on='Suburb')
data_suburb.head(5)

| | Suburb | Longitude | Latitude | Area | Postcodes |
|---|---|---|---|---|---|
| 0 | abbotsford | 144.999797 | -37.804704 | 2.22337 | [3067] |
| 1 | aberfeldie | 144.897425 | -37.759636 | 1.971027 | [3040] |
| 2 | airport west | 144.881337 | -37.723862 | 4.653908 | [3042] |
| 3 | albanvale | 144.768545 | -37.746106 | 2.416946 | [3021] |
| 4 | albert park | 144.963193 | -37.844753 | 4.345716 | [3206] |
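
The centroid and area columns were produced in MATLAB, and that script is not reproduced here. As a rough Python equivalent, the sketch below applies the shoelace formula to a single boundary ring after a local flat (equirectangular) projection. The function name, the mean Earth radius and the test square are assumptions of this example; suburbs with holes or several polygons would need extra handling.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, an assumption of this sketch


def ring_centroid_and_area(ring):
    """Return (lon, lat, area_km2) for one polygon ring of (lon, lat) degrees."""
    lon_ref, lat_ref = ring[0]
    # The mean latitude sets the east-west scale of the local flat projection.
    lat0 = math.radians(sum(lat for _, lat in ring) / len(ring))
    kx = EARTH_RADIUS_KM * math.cos(lat0)
    ky = EARTH_RADIUS_KM

    # Project to kilometres relative to the first vertex.
    pts = [(math.radians(lon - lon_ref) * kx, math.radians(lat - lat_ref) * ky)
           for lon, lat in ring]
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring if needed

    # Shoelace formula for the signed area and the centroid.
    twice_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    cx /= 3.0 * twice_area
    cy /= 3.0 * twice_area

    # Convert the centroid back to degrees.
    lon_c = lon_ref + math.degrees(cx / kx)
    lat_c = lat_ref + math.degrees(cy / ky)
    return lon_c, lat_c, abs(twice_area) / 2.0


# A 0.01 x 0.01 degree square east of the CBD, used only as a sanity check.
square = [(145.00, -37.81), (145.01, -37.81), (145.01, -37.80), (145.00, -37.80)]
lon_c, lat_c, area_km2 = ring_centroid_and_area(square)
print(lon_c, lat_c, area_km2)
```

The flat projection is a reasonable approximation at suburb scale; for larger regions an equal-area projection would be the safer choice.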


For our purpose, it is also important to know the distance of each suburb to Melbourne's Central Business District (CBD). The first step is to use the `geopy` package to determine the coordinates of the CBD. Next, these coordinates will be compared to the coordinates of the suburb centroids.

In [76]:
melbourne_cbd_address = 'Melbourne, VIC'

geolocator = Nominatim(user_agent = "explorer")
melbourne_cbd_geoloc = geolocator.geocode(melbourne_cbd_address)
latitude = melbourne_cbd_geoloc.latitude
longitude = melbourne_cbd_geoloc.longitude
melb_cbd_coords = (latitude, longitude)
print('The geographical coordinates of Melbourne are {}, {}.'.format(latitude, longitude))
data_suburb['Distance to CBD'] = pd.Series(dtype='object')

# Iterate over index labels so that reads and writes address the same row,
# even when rows have been dropped from the dataframe
for ii in data_suburb.index:
    suburb_coords = (data_suburb.loc[ii,'Latitude'], data_suburb.loc[ii,'Longitude'])
    data_suburb.loc[ii,'Distance to CBD'] = geodesic(melb_cbd_coords, suburb_coords).km
    
data_suburb.head()

The geographical coordinates of Melbourne are -37.8142176, 144.9631608.


| | Suburb | Longitude | Latitude | Area | Postcodes | Population | Population under 15 | Affluent Population | Density | Afluent Ratio | Distance to CBD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abbotsford | 144.999797 | -37.804704 | 2.22337 | [3067] | 8199 | 648.0 | 2815.0 | 3687.65 | 0.343335 | 3.3946 |
| 1 | aberfeldie | 144.897425 | -37.759636 | 1.971027 | [3040] | 25939 | 4281.0 | 6833.0 | 13160.1 | 0.263426 | 8.38031 |
| 2 | airport west | 144.881337 | -37.723862 | 4.653908 | [3042] | 15762 | 2687.0 | 3106.0 | 3386.83 | 0.197056 | 12.3511 |
| 3 | albanvale | 144.768545 | -37.746106 | 2.416946 | [3021] | 54190 | 9489.0 | 3454.0 | 22420.9 | 0.0637387 | 18.7373 |
| 4 | albert park | 144.963193 | -37.844753 | 4.345716 | [3206] | 10366 | 1805.0 | 3721.0 | 2385.34 | 0.358962 | 3.38922 |
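
As a cross-check on the `geodesic` values, the great-circle distance can be computed with the haversine formula using only the standard library. This is a sketch that assumes a spherical Earth with a mean radius of 6371.0088 km, so it will differ slightly from the ellipsoidal distance returned by `geopy`; the function name is chosen for this example.

```python
import math


def haversine_km(coord_a, coord_b, radius_km=6371.0088):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, coord_a)
    lat2, lon2 = map(math.radians, coord_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


# CBD coordinates and the abbotsford centroid, both taken from the outputs above.
melb_cbd = (-37.8142176, 144.9631608)
abbotsford = (-37.804704, 144.999797)
distance = haversine_km(melb_cbd, abbotsford)
print(round(distance, 2))
```

The result should land close to the 3.39 km reported for abbotsford in the table, which suggests the `Distance to CBD` column is in kilometres as intended.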


### Demographical data

The third and fourth datasets contain **demographic** information for each postcode in the Melbourne metropolitan area. This data was extracted from the [2016 Australian Census Data](https://www.abs.gov.au/websitedbs/D3310114.nsf/Home/Assuring%20Census%20Data%20Quality), using the [TableBuilder Application](https://www.abs.gov.au/websitedbs/censushome.nsf/home/tablebuilder). The third dataset corresponds to the population by age (in five-year increments) and is in the `age_by_postcode.csv` file, while the fourth dataset corresponds to the population by weekly income and is in the `income_by_postcode.csv` file. From this data we are interested in calculating:

1. The population size of each suburb.
2. The target population, that is, children under the age of 15.
3. The density of each suburb, defined as the population divided by the area in square kilometres.
4. The affluent population of each suburb, that is, those with an income at or above the national median of 66,000 dollars per year, or about 1,300 dollars per week before taxes.
5. The affluent ratio of each suburb, defined as the ratio between the affluent population and the total population.

Note that suburbs whose population is zero are of no interest; hence, they are removed from the table. These results are appended to our `data_suburb` dataframe.

In [41]:
file_age_by_pcode = 'age_by_postcode.csv'
data_age_by_pcode = pd.read_csv(file_age_by_pcode)
data_age_by_pcode.set_index('Postcode', inplace = True)

file_income_by_pcode = 'income_by_postcode.csv'
data_income_by_pcode = pd.read_csv(file_income_by_pcode)
data_income_by_pcode.set_index('Postcode', inplace = True)

ages_by_suburb = pd.DataFrame(columns=data_age_by_pcode.columns)
incomes_by_suburb = pd.DataFrame(columns=data_income_by_pcode.columns)

for ii in data_suburb.index:
    # Cast the suburb's postcodes to str to match the index of the census tables
    pcs = [str(pc) for pc in data_suburb['Postcodes'].iloc[ii]]

    # Sum the census rows of every postcode that belongs to this suburb
    idx = data_age_by_pcode.index.isin(pcs)
    ages_by_suburb.loc[ii] =  data_age_by_pcode.loc[idx].sum()

    idx = data_income_by_pcode.index.isin(pcs)
    incomes_by_suburb.loc[ii] =  data_income_by_pcode.loc[idx].sum()

ages_by_suburb.index = data_suburb['Suburb']
ages_by_suburb.rename(columns={'Total': 'Population'}, inplace = True)
ages_by_suburb['Population under 15'] = ages_by_suburb[['0-4 years','5-9 years','10-14 years']].sum(axis=1)
ages_by_suburb = ages_by_suburb[['Population','Population under 15']]

incomes_by_suburb.index = data_suburb['Suburb']
incomes_by_suburb.rename(columns={'Total': 'Population by Income'}, inplace = True)
incomes_by_suburb['Affluent Population'] = incomes_by_suburb[['1,250-1,499 (65,000-77,999)',
                                                              '1,500-1,749 (78,000-90,999)',
                                                              '1,750-1,999 (91,000-103,999)',
                                                              '2,000-2,999 (104,000-155,999)',
                                                              '3,000 or more (156,000 or more)']].sum(axis=1)
incomes_by_suburb = incomes_by_suburb[['Affluent Population']]

data_suburb = data_suburb.join(ages_by_suburb, on='Suburb')
data_suburb = data_suburb.join(incomes_by_suburb, on='Suburb')

data_suburb = data_suburb.loc[data_suburb['Population']!=0]

data_suburb['Density'] = data_suburb[['Population']].divide(other = data_suburb['Area'], axis=0)
data_suburb['Afluent Ratio'] = data_suburb[['Affluent Population']].divide(other = data_suburb['Population'], axis=0)

data_suburb.head(5)

| | Suburb | Longitude | Latitude | Area | Postcodes | Population | Population under 15 | Affluent Population | Density | Afluent Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abbotsford | 144.999797 | -37.804704 | 2.22337 | [3067] | 8199 | 648.0 | 2815.0 | 3687.65 | 0.343335 |
| 1 | aberfeldie | 144.897425 | -37.759636 | 1.971027 | [3040] | 25939 | 4281.0 | 6833.0 | 13160.1 | 0.263426 |
| 2 | airport west | 144.881337 | -37.723862 | 4.653908 | [3042] | 15762 | 2687.0 | 3106.0 | 3386.83 | 0.197056 |
| 3 | albanvale | 144.768545 | -37.746106 | 2.416946 | [3021] | 54190 | 9489.0 | 3454.0 | 22420.9 | 0.0637387 |
| 4 | albert park | 144.963193 | -37.844753 | 4.345716 | [3206] | 10366 | 1805.0 | 3721.0 | 2385.34 | 0.358962 |
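
To make the two derived indicators concrete, the short sketch below recomputes them for abbotsford from the values shown in the table. The helper name is hypothetical; it only restates the arithmetic performed by the `divide` calls in the cell above.

```python
def suburb_indicators(population, area_km2, affluent_population):
    """Return (density, affluent_ratio) for a single suburb."""
    density = population / area_km2                   # inhabitants per square kilometre
    affluent_ratio = affluent_population / population  # share of affluent residents
    return density, affluent_ratio


# abbotsford: population 8199, area 2.22337 km^2, affluent population 2815
density, ratio = suburb_indicators(8199, 2.22337, 2815)
print(round(density, 2), round(ratio, 6))
```

Both values should agree with the `Density` and `Afluent Ratio` columns for abbotsford up to the rounding of the displayed area.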


### Foursquare API setup

The final set of data will be obtained from the Foursquare API. Two things are of interest:

1. The location of the nearest bus, light rail or train station.
2. The location of other amenities such as shopping malls, museums and zoological parks.

The first step is to set up the API calls with the ID and SECRET, which have been stored in a separate file, not available in the repository for security reasons.

In [55]:
fsquare_secret_file = '../secret.csv'
fsquare_secret_data = pd.read_csv(fsquare_secret_file)
CLIENT_ID = fsquare_secret_data['CLIENT_ID'][0]
CLIENT_SECRET = fsquare_secret_data['CLIENT_SECRET'][0]
VERSION = '20180605'
LIMIT = 100

Light Rail Station
Metro Station
Outlet Mall
Playground
Recreation Center
Shopping Mall
Shopping Plaza
Train Station
Zoo
Zoo Exhibit


In [56]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(name, 
                             lat, 
                             lng, 
                             v['venue']['name'], 
                             v['venue']['location']['lat'], 
                             v['venue']['location']['lng'],  
                             v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Suburb', 
                             'Suburb Latitude', 
                             'Suburb Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return(nearby_venues)
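
The request URL above is assembled by string formatting. An alternative, sketched below, is to keep the query parameters in a dictionary and let the HTTP library encode them. The helper name and the placeholder credentials are assumptions of this example; the endpoint and parameter names are the ones already used in `getNearbyVenues`.

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_params(lat, lng, client_id, client_secret,
                         version='20180605', radius=500, limit=100):
    """Collect the query parameters of one venues/explore request."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }


# Placeholder credentials; the real ones are read from the secret file.
params = build_explore_params(-37.804704, 144.999797, 'MY_ID', 'MY_SECRET')
query = urlencode(params)
print(EXPLORE_URL + '?' + query)

# With the requests package, the same dictionary could be passed directly:
# requests.get(EXPLORE_URL, params=params, timeout=10)
```

Passing `params` and a `timeout` to `requests.get` keeps the encoding in one place and prevents a single slow response from stalling the loop over all suburbs.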

In [57]:
melbourne_venues = getNearbyVenues(names = data_suburb['Suburb'],
                                   latitudes = data_suburb['Latitude'],
                                   longitudes = data_suburb['Longitude'])
print(melbourne_venues.shape)
melbourne_venues.head(10)

abbotsford
aberfeldie
airport west
albanvale
albert park
albion
alphington
altona
altona meadows
altona north
ardeer
armadale
arthurs creek
arthurs seat
ascot vale
ashbourne
ashburton
ashwood
aspendale
aspendale gardens
attwood
avondale heights
avonsleigh
bacchus marsh
badger creek
balaclava
balliang
balnarring
balnarring beach
balwyn
balwyn north
bangholme
baxter
bayles
bayswater
bayswater north
beaconsfield
beaconsfield upper
beaumaris
belgrave
belgrave heights
belgrave south
bellfield
bend of islands
bentleigh
bentleigh east
berwick
beveridge
bittern
black rock
blackburn
blackburn north
blackburn south
blairgowrie
blind bight
bonbeach
boneo
boronia
botanic ridge
box hill
box hill north
box hill south
braeside
braybrook
briar hill
brighton
brighton east
broadmeadows
brookfield
brooklyn
brunswick
brunswick east
brunswick west
bulla
bulleen
bullengarook
bundoora
burnley
burnside
burnside heights
burwood
burwood east
cairnlea
calder park
camberwell
campbellfield
cannons creek
canterbury

| | Suburb | Suburb Latitude | Suburb Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | abbotsford | -37.804704 | 144.999797 | Lentil As Anything | -37.802724 | 145.003507 | Vegetarian / Vegan Restaurant |
| 1 | abbotsford | -37.804704 | 144.999797 | Three Bags Full | -37.807318 | 144.996603 | Café |
| 2 | abbotsford | -37.804704 | 144.999797 | The Kitchen at Weylandts | -37.805311 | 144.997345 | Café |
| 3 | abbotsford | -37.804704 | 144.999797 | Abbotsford Convent | -37.802574 | 145.00441 | Cultural Center |
| 4 | abbotsford | -37.804704 | 144.999797 | Slow Food Market | -37.802481 | 145.003597 | Farmers Market |
| 5 | abbotsford | -37.804704 | 144.999797 | Abbotsford Convent Gardens | -37.802454 | 145.00351 | Garden |
| 6 | abbotsford | -37.804704 | 144.999797 | The Park Hotel | -37.802769 | 144.997029 | Pub |
| 7 | abbotsford | -37.804704 | 144.999797 | Retreat Hotel | -37.801126 | 144.997548 | Pub |
| 8 | abbotsford | -37.804704 | 144.999797 | Salvos Store | -37.80598 | 144.99826 | Thrift / Vintage Store |
| 9 | abbotsford | -37.804704 | 144.999797 | Kappaya Japanese Soul Food | -37.80259 | 145.003582 | Japanese Restaurant |
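
Since only transport and family amenities matter for this project, one possible next step is to keep just the venues whose category appears in the list printed after the credentials cell. The sketch below assumes that list is the intended filter. The abbotsford rows come from the table above, while the other venue names are hypothetical placeholders.

```python
from collections import Counter

# Categories printed earlier in the notebook, assumed here to be the filter.
CATEGORIES_OF_INTEREST = {
    'Light Rail Station', 'Metro Station', 'Outlet Mall', 'Playground',
    'Recreation Center', 'Shopping Mall', 'Shopping Plaza', 'Train Station',
    'Zoo', 'Zoo Exhibit',
}

# (suburb, venue, category); the last three venue names are made up.
venues = [
    ('abbotsford', 'Three Bags Full', 'Café'),
    ('abbotsford', 'The Park Hotel', 'Pub'),
    ('albert park', 'Hypothetical Tram Stop', 'Light Rail Station'),
    ('albert park', 'Hypothetical Playground', 'Playground'),
    ('alphington', 'Hypothetical Station', 'Train Station'),
]

# Count the venues of interest found in each suburb.
counts = Counter(suburb for suburb, _, category in venues
                 if category in CATEGORIES_OF_INTEREST)
print(counts)
```

On the full `melbourne_venues` dataframe, the same filter could be expressed with `melbourne_venues['Venue Category'].isin(...)`.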


## 3 - Setting up map information

In [54]:
# create map of Melbourne using latitude and longitude values
map_melbourne_metro = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, suburb in zip(data_suburb['Latitude'], data_suburb['Longitude'], data_suburb['Suburb']):
    label = '{}, VIC'.format(suburb)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_melbourne_metro)  
    
map_melbourne_metro

In [59]:
# one hot encoding
melbourne_onehot = pd.get_dummies(melbourne_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
melbourne_onehot['Suburb'] = melbourne_venues['Suburb'] 

# move neighborhood column to the first column
fixed_columns = [melbourne_onehot.columns[-1]] + list(melbourne_onehot.columns[:-1])
melbourne_onehot = melbourne_onehot[fixed_columns]

# Now we see the frequency data for each type of venue in Melbourne
melbourne_grouped = melbourne_onehot.groupby('Suburb').mean().reset_index()
melbourne_grouped.head(10)

| | Suburb | ATM | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | American Restaurant | Antique Shop | Arcade | Art Gallery | ... | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio | Yunnan Restaurant | Zoo | Zoo Exhibit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abbotsford | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aberfeldie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | airport west | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | albanvale | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | albert park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | alphington | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | altona | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | altona meadows | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | altona north | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | ardeer | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
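
Because the real frequency table is dominated by zeros, a tiny example may make the encoding easier to follow. The sketch below repeats the one-hot and group-mean steps on a hypothetical four-venue frame; the suburb names and venues are made up for illustration.

```python
import pandas as pd

# Hypothetical venues: three in 'suburb a' and one in 'suburb b'.
toy = pd.DataFrame({
    'Suburb': ['suburb a', 'suburb a', 'suburb a', 'suburb b'],
    'Venue Category': ['Café', 'Café', 'Pub', 'Park'],
})

# One indicator column per category, then the suburb name as first column.
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='')
onehot.insert(0, 'Suburb', toy['Suburb'])

# The mean of the indicators is the share of each category within a suburb.
grouped = onehot.groupby('Suburb').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to one, so suburbs with few venues are described by a handful of large shares, which is worth keeping in mind when the rows are later clustered.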


In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Suburb']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Suburb'] = melbourne_grouped['Suburb']

for ind in np.arange(melbourne_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(melbourne_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head(10)

| | Suburb | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | Café | Pub | Garden | Cultural Center | Thrift / Vintage Store | Vegetarian / Vegan Restaurant | Convenience Store | Farmers Market | Sporting Goods Shop | Japanese Restaurant |
| 1 | Airport West | Grocery Store | Food Truck | Fish & Chips Shop | Fish Market | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court | Zoo Exhibit |
| 2 | Albanvale | Furniture / Home Store | Zoo Exhibit | Food Truck | Fish & Chips Shop | Fish Market | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court |
| 3 | Albert Park | Athletics & Sports | Café | Breakfast Spot | Sports Club | Light Rail Station | Park | Tennis Court | Golf Course | Australian Restaurant | Playground |
| 4 | Albion | Market | Deli / Bodega | Café | Food Truck | Fish Market | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court |
| 5 | Alphington | Fast Food Restaurant | Farmers Market | Train Station | Liquor Store | Convenience Store | Gym / Fitness Center | Thai Restaurant | Food & Drink Shop | Fish Market | Flea Market |
| 6 | Altona | Thai Restaurant | Discount Store | Furniture / Home Store | Convenience Store | Zoo Exhibit | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court |
| 7 | Altona Meadows | Park | Dog Run | Convenience Store | Business Service | Fish & Chips Shop | Zoo Exhibit | Food Court | Flea Market | Flower Shop | Food |
| 8 | Altona North | Badminton Court | Business Service | Zoo Exhibit | Fast Food Restaurant | Fish Market | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court |
| 9 | Ardeer | Garden Center | Motel | Gift Shop | Football Stadium | Flea Market | Flower Shop | Food | Food & Drink Shop | Food Court | Food Truck |
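
The column labels above rely on a three-element suffix list plus an exception for everything else, which works for ten venues but would label an eleventh-plus position such as 21 as "21th". A small helper, sketched below under a name chosen for this example, produces the English ordinal suffix for any positive integer.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Suburb'] + ['{} Most Common Venue'.format(ordinal(i))
                        for i in range(1, num_top_venues + 1)]
print(columns)
```

For the ten columns used in this notebook the labels are identical to the ones in the table.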


In [35]:
melbourne_grouped_clustering = melbourne_grouped.drop(columns='Suburb')

# silhouette_score needs between 2 and n_samples - 1 clusters,
# so the search stops one short of the number of suburbs
n_samples = melbourne_grouped_clustering.shape[0]
k_vals = range(2, n_samples)
silhouette_avg = np.zeros(len(k_vals))

for i in range(0,len(k_vals)):
    # run k-means clustering
    kmeans = KMeans(n_clusters = k_vals[i], random_state = 0)
    cluster_labels = kmeans.fit_predict(melbourne_grouped_clustering)
    
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg[i] = silhouette_score(melbourne_grouped_clustering, cluster_labels)
    print("For n_clusters =", k_vals[i], "The average silhouette_score is :", round(silhouette_avg[i],4))


For n_clusters = 2 The average silhouette_score is : 0.2627
For n_clusters = 3 The average silhouette_score is : -0.0152
For n_clusters = 4 The average silhouette_score is : 0.0021
For n_clusters = 5 The average silhouette_score is : 0.0133
For n_clusters = 6 The average silhouette_score is : 0.0266
For n_clusters = 7 The average silhouette_score is : 0.0276
For n_clusters = 8 The average silhouette_score is : 0.0408
For n_clusters = 9 The average silhouette_score is : 0.0489
For n_clusters = 10 The average silhouette_score is : 0.0342
For n_clusters = 11 The average silhouette_score is : 0.0383
For n_clusters = 12 The average silhouette_score is : 0.051
For n_clusters = 13 The average silhouette_score is : 0.0422
For n_clusters = 14 The average silhouette_score is : 0.0495
For n_clusters = 15 The average silhouette_score is : 0.0556
For n_clusters = 16 The average silhouette_score is : 0.0261
For n_clusters = 17 The average silhouette_score is : 0.0619
...
For n_clusters = 137 The average silhouette_score is : 0.1329
For n_clusters = 138 The average silhouette_score is : 0.1326
For n_clusters = 139 The average silhouette_score is : 0.1316
For n_clusters = 140 The average silhouette_score is : 0.1323
For n_clusters = 141 The average silhouette_score is : 0.1327
For n_clusters = 142 The average silhouette_score is : 0.142
For n_clusters = 143 The average silhouette_score is : 0.1264
For n_clusters = 144 The average silhouette_score is : 0.1273
For n_clusters = 145 The average silhouette_score is : 0.1282
For n_clusters = 146 The average silhouette_score is : 0.1279
For n_clusters = 147 The average silhouette_score is : 0.1283
For n_clusters = 148 The average silhouette_score is : 0.1274
For n_clusters = 149 The average silhouette_score is : 0.1371
For n_clusters = 150 The average silhouette_score is : 0.1256
For n_clusters = 151 The average silhouette_score is : 0.1326
For n_clusters = 152 The average silhouette_score is : 0.1368
...
For n_clusters = 270 The average silhouette_score is : 0.1277
For n_clusters = 271 The average silhouette_score is : 0.1275
For n_clusters = 272 The average silhouette_score is : 0.128
For n_clusters = 273 The average silhouette_score is : 0.1282
For n_clusters = 274 The average silhouette_score is : 0.1279
For n_clusters = 275 The average silhouette_score is : 0.1269
For n_clusters = 276 The average silhouette_score is : 0.1264
For n_clusters = 277 The average silhouette_score is : 0.1259
For n_clusters = 278 The average silhouette_score is : 0.1264
For n_clusters = 279 The average silhouette_score is : 0.1263
For n_clusters = 280 The average silhouette_score is : 0.1255
For n_clusters = 281 The average silhouette_score is : 0.1242
For n_clusters = 282 The average silhouette_score is : 0.1244
For n_clusters = 283 The average silhouette_score is : 0.1255
For n_clusters = 284 The average silhouette_score is : 0.1263
For n_clusters = 285 The average silhouette_score is : 0.1251
...

  return self.fit(X, sample_weight=sample_weight).labels_

For n_clusters = 346 The average silhouette_score is : 0.1105
For n_clusters = 347 The average silhouette_score is : 0.1105
For n_clusters = 348 The average silhouette_score is : 0.1105
For n_clusters = 349 The average silhouette_score is : 0.1105
For n_clusters = 350 The average silhouette_score is : 0.1105
For n_clusters = 351 The average silhouette_score is : 0.1105
For n_clusters = 352 The average silhouette_score is : 0.1105
For n_clusters = 353 The average silhouette_score is : 0.1105
For n_clusters = 354 The average silhouette_score is : 0.1105
For n_clusters = 355 The average silhouette_score is : 0.1105
For n_clusters = 356 The average silhouette_score is : 0.1105
For n_clusters = 357 The average silhouette_score is : 0.1105
For n_clusters = 358 The average silhouette_score is : 0.1105
For n_clusters = 359 The average silhouette_score is : 0.1105
For n_clusters = 360 The average silhouette_score is : 0.1105
For n_clusters = 361 The average silhouette_score is : 0.1105
For n_clusters = 362 The average silhouette_score is : 0.1105
For n_clusters = 363 The average silhouette_score is : 0.1105
For n_clusters = 364 The average silhouette_score is : 0.1105
For n_clusters = 365 The average silhouette_score is : 0.1105
For n_clusters = 366 The average silhouette_score is : 0.1105
For n_clusters = 367 The average silhouette_score is : 0.1105
For n_clusters = 368 The average silhouette_score is : 0.1105
For n_clusters = 369 The average silhouette_score is : 0.1105
For n_clusters = 370 The average silhouette_score is : 0.1105
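
One common way to use these scores is to rank the candidate values of k by their average silhouette. The sketch below does this for a handful of scores copied from the output above; the dictionary is only a sample of the full run, and the notebook itself does not state which k is finally chosen.

```python
# A sample of (k, average silhouette) pairs copied from the printed output.
scores = {
    2: 0.2627, 3: -0.0152, 4: 0.0021, 9: 0.0489, 17: 0.0619,
    142: 0.142, 149: 0.1371, 152: 0.1368, 346: 0.1105,
}

# The k with the highest average silhouette, and the three best candidates.
best_k = max(scores, key=scores.get)
top_three = sorted(scores, key=scores.get, reverse=True)[:3]
print(best_k, top_three)
```

In this sample k = 2 scores highest, but a two-way split may be too coarse to tell candidate suburbs apart, so the runners-up are worth inspecting on the map before settling on a value.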