# Data Science Capstone

### IBM Data Science Professional Certificate 

## 1.Business Problem

Melbourne is known as a city of coffee; some coffee cups even carry the slogan “We love to make coffee for the city that loves to drink it.”
Hence, this project aims to verify whether Melbourne residents enjoy drinking coffee by exploring the Melbourne suburbs to find the 3 most common venue categories in each area. In addition, the suburbs are clustered using k-means to find which areas have more coffee lovers than others.

This report is aimed at people who plan to travel to Melbourne and experience a coffee tour. It is also useful for people who want to start a new business and are looking for a good location to open a café.

## 2.Data

The suburb names, suburb coordinates and postal information for Melbourne are extracted from the Internet (URL: https://www.matthewproctor.com/full_australian_postcodes_vic). The coordinates are used for map generation and as input for the Foursquare API, which provides venue information for each suburb.

In the following, I focus mainly on the venue category, refining the categories and clustering them into major groups, which simplifies the analysis and makes a clearer visualization possible.

**Libraries Used to Develop the Project:**  
1).Pandas: For creating and manipulating dataframes.   
2).Folium: Python visualization library used to show the distribution of the suburb clusters on an interactive Leaflet map.   
3).Scikit-learn: For k-means clustering.   
4).JSON: Library to handle JSON files.   
5).Geocoder: To retrieve location data.   
6).Beautiful Soup and Requests: To scrape the web page and handle HTTP requests.   
7).Matplotlib: Python plotting module.   

### Install libraries

In [1]:
!pip install geocoder
!pip install folium



In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import geocoder

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 3.Methodology

### 3.1 Extract and Clean Data

The raw data obtained from the Internet may contain many blank or duplicate entries, so it needs to be cleaned before analysis.

First, we download the data from the Internet and create a CSV file to store it.

In [3]:
import requests

from bs4 import BeautifulSoup

url = 'https://www.matthewproctor.com/full_australian_postcodes_vic'

results = requests.get(url)
#print(results.text)

soup = BeautifulSoup(results.text,'lxml')
#print(soup.prettify())

In [4]:
import csv

# newline='' avoids blank lines between rows on some platforms
csv_file = open('Melbourne.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Postcode', 'Locality', 'Longitude', 'Latitude', 'SA4 Name'])

# skip the header row, then keep only complete rows (14 cells)
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    if len(tds) == 14:
        Postcode = tds[1].text
        Locality = tds[2].text
        Longitude = tds[4].text
        Latitude = tds[5].text
        SA4_Name = tds[11].text
        # write inside the `if` so that incomplete rows are skipped
        # instead of repeating the previous row's values
        csv_writer.writerow([Postcode, Locality, Longitude, Latitude, SA4_Name])

csv_file.close()
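
The row-filtering logic of the scraping loop can be checked offline on a small inline table. This is only a minimal sketch: the HTML snippet and the `parse_rows` helper are made up for illustration, and it uses the built-in `html.parser` backend rather than `lxml`. It keeps only rows with the expected number of cells and skips everything else.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: a header row, one complete
# row and one short row that should be skipped.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Locality</th><th>Longitude</th></tr>
  <tr><td>3000</td><td> MELBOURNE </td><td>144.956776</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""

def parse_rows(html, expected_cells):
    """Return the stripped cell texts of every row with `expected_cells` cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        tds = tr.find_all("td")
        if len(tds) == expected_cells:
            rows.append([td.text.strip() for td in tds])
    return rows

rows = parse_rows(SAMPLE_HTML, expected_cells=3)
print(rows)
```

Stripping the cell text here also removes stray whitespace before the values reach the CSV file.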


In [5]:
Melbourne_df = pd.read_csv('Melbourne.csv')
Melbourne_df[['Locality']] = Melbourne_df[['Locality']].astype(str)

Melbourne_df.dtypes
#Melbourne_df[Melbourne_df.columns] = Melbourne_df.apply(lambda x: x.str.strip('\n '))
Melbourne_df.head()

Unnamed: 0,Postcode,Locality,Longitude,Latitude,SA4 Name
0,3000,MELBOURNE,144.956776,-37.817403,Melbourne - Inner
1,3001,MELBOURNE,144.956776,-37.817403,Melbourne - Inner
2,3002,EAST MELBOURNE,144.982207,-37.818517,Melbourne - Inner
3,3003,WEST MELBOURNE,144.949592,-37.810871,Melbourne - Inner
4,3004,MELBOURNE,144.970161,-37.844246,Melbourne - Inner


Some entries in the "SA4 Name" column are empty, so we use the following command to drop the rows that contain NaN.

In [6]:
Melbourne_df = Melbourne_df[Melbourne_df['SA4 Name'].notnull()]
Melbourne_df.head()

Unnamed: 0,Postcode,Locality,Longitude,Latitude,SA4 Name
0,3000,MELBOURNE,144.956776,-37.817403,Melbourne - Inner
1,3001,MELBOURNE,144.956776,-37.817403,Melbourne - Inner
2,3002,EAST MELBOURNE,144.982207,-37.818517,Melbourne - Inner
3,3003,WEST MELBOURNE,144.949592,-37.810871,Melbourne - Inner
4,3004,MELBOURNE,144.970161,-37.844246,Melbourne - Inner


Then, we drop the rows that share the same coordinates.

In [7]:
#  Drop same coords
Melbourne_df1 = Melbourne_df.drop_duplicates(subset=['Longitude','Latitude'])

In [8]:
Melbourne = Melbourne_df1[Melbourne_df1['SA4 Name'].str.contains("Melbourne - Inner")]
Melbourne.index = range(len(Melbourne))
#Melbourne.reset_index(drop=True)
Melbourne.head()

Unnamed: 0,Postcode,Locality,Longitude,Latitude,SA4 Name
0,3000,MELBOURNE,144.956776,-37.817403,Melbourne - Inner
1,3002,EAST MELBOURNE,144.982207,-37.818517,Melbourne - Inner
2,3003,WEST MELBOURNE,144.949592,-37.810871,Melbourne - Inner
3,3004,MELBOURNE,144.970161,-37.844246,Melbourne - Inner
4,3005,WORLD TRADE CENTRE,144.950858,-37.824608,Melbourne - Inner
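
The three cleaning steps above (drop missing regions, drop repeated coordinates, keep the inner-city region) can be tried on a toy table. The data below is invented for illustration; only the column names and the pandas calls mirror the notebook.

```python
import pandas as pd

# Toy data (made up): one missing region, one repeated coordinate pair,
# and one row outside the inner-city region.
df = pd.DataFrame({
    "Locality": ["A", "B", "C", "D"],
    "Longitude": [144.95, 144.95, 144.98, 145.10],
    "Latitude": [-37.81, -37.81, -37.82, -37.90],
    "SA4 Name": ["Melbourne - Inner", "Melbourne - Inner", None, "Melbourne - West"],
})

# Drop rows with no region, then keep the first row per coordinate pair.
cleaned = (
    df[df["SA4 Name"].notnull()]
    .drop_duplicates(subset=["Longitude", "Latitude"])
)

# Keep only the inner-city rows and renumber the index from zero.
inner = cleaned[cleaned["SA4 Name"].str.contains("Melbourne - Inner")].reset_index(drop=True)
print(inner)
```

Note that `drop_duplicates` keeps the first locality at each coordinate, so a different locality name sharing the same point is silently discarded.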


In [10]:
address = 'Melbourne'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude_x = location.latitude
longitude_y = location.longitude
print('The geographical coordinates of Melbourne are {}, {}.'.format(latitude_x, longitude_y))

The geographical coordinates of Melbourne are -37.8142176, 144.9631608.


## 4.Map of Melbourne suburbs

In [11]:
map_Melbourne = folium.Map(location=[latitude_x, longitude_y], zoom_start=10)

for lat, lng, loc in zip(Melbourne['Latitude'], Melbourne['Longitude'], Melbourne['Locality']):
    
    label = '{}'.format(loc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Melbourne)  
    
map_Melbourne

In [12]:
#Foursquare API
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own Foursquare Secret; never publish real keys
VERSION = '20201213'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: '+CLIENT_ID)
print('CLIENT_SECRET: '+CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [13]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened frames name the column 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

## 5.Categories of Nearby Venues

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # making GET request
        results = requests.get(url).json()
        venue_results =results['response']['groups'][0]['items']
        #venue_results= requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in venue_results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Locality', 
                  'Locality Latitude', 
                  'Locality Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
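
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, so a venue without any category would raise an `IndexError`. The sketch below shows one defensive way to flatten a response, assuming the same nesting the function reads (`response` → `groups` → `items` → `venue`). The `extract_venue_rows` helper and the sample payload are hypothetical and are not part of the notebook or of the Foursquare API.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style payload into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Made-up payload mirroring the keys read by getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe One", "location": {"lat": -37.81, "lng": 144.96},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot", "location": {"lat": -37.82, "lng": 144.97},
               "categories": []}},
]}]}}

rows = extract_venue_rows("MELBOURNE", -37.8174, 144.9568, sample)
empty = extract_venue_rows("NOWHERE", 0.0, 0.0, {"response": {}})
print(rows)
print(empty)
```

A locality that returns no venues simply contributes no rows, which is also why it later ends up without a cluster label.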

In [15]:
# Nearby Venues
Melbourne_venues = getNearbyVenues(names=Melbourne['Locality'],
                                   latitudes=Melbourne['Latitude'],
                                   longitudes=Melbourne['Longitude']
                                  )

MELBOURNE
EAST MELBOURNE
WEST MELBOURNE
MELBOURNE
WORLD TRADE CENTRE
SOUTH WHARF
SOUTHBANK
DOCKLANDS
UNIVERSITY OF MELBOURNE
FLEMINGTON
MOONEE PONDS
ROYAL MELBOURNE HOSPITAL
HOTHAM HILL
MELBOURNE UNIVERSITY
CARLTON
CARLTON NORTH
BRUNSWICK SOUTH
BRUNSWICK
BRUNSWICK EAST
SUMNER
FITZROY
COLLINGWOOD
ABBOTSFORD
CLIFTON HILL
NORTHCOTE
THORNBURY
ALPHINGTON
COTHAM
KEW EAST
BALWYN
DEEPDENE
BALWYN NORTH
TEMPLESTOWE
TEMPLESTOWE LOWER
DONCASTER
BURNLEY
BURNLEY NORTH
AUBURN SOUTH
AUBURN
CAMBERWELL
CAMBERWELL EAST
MONT ALBERT
BOX HILL
BOX HILL NORTH
BLACKBURN
CHAPEL STREET NORTH
HAWKSBURN
ARMADALE
KOOYONG
CAULFIELD EAST
GLEN IRIS
BURWOOD EAST
CAULFIELD JUNCTION
CAULFIELD
BOORAN ROAD PO
BENTLEIGH EAST
PRAHRAN
ST KILDA
BALACLAVA
BRIGHTON ROAD
ELSTERNWICK
BRIGHTON
BRIGHTON EAST
HAMPTON
MOORABBIN
HIGHETT
SANDRINGHAM
CHELTENHAM
BEAUMARIS
MENTONE
ASPENDALE
BONBEACH
HEATHERTON
BENTLEIGH
SOUTH MELBOURNE
ALBERT PARK
MELBOURNE


In [16]:
print('There are {} unique categories.'.format(len(Melbourne_venues['Venue Category'].unique())))
Melbourne_venues.groupby('Locality').count().head()

There are 207 unique categories.


Locality,Locality Latitude,Locality Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
ABBOTSFORD,22,22,22,22,22,22
ALBERT PARK,26,26,26,26,26,26
ALPHINGTON,4,4,4,4,4,4
ARMADALE,13,13,13,13,13,13
ASPENDALE,6,6,6,6,6,6


In [17]:
# one hot encoding
Melbourne_onehot = pd.get_dummies(Melbourne_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Melbourne_onehot['Locality'] = Melbourne_venues['Locality'] 

# move neighborhood column to the first column
fixed_columns = [Melbourne_onehot.columns[-1]] + list(Melbourne_onehot.columns[:-1])
Melbourne_onehot = Melbourne_onehot[fixed_columns]
Melbourne_grouped = Melbourne_onehot.groupby('Locality').mean().reset_index()
Melbourne_onehot.head(5)

Unnamed: 0,Locality,Arepa Restaurant,Art Gallery,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Australian Restaurant,Austrian Restaurant,Auto Garage,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bar,Baseball Field,Basketball Court,Beach,Beer Garden,Bistro,Board Shop,Bookstore,Boutique,Bowling Green,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Buffet,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,Candy Store,Car Wash,Cemetery,Chaat Place,Chinese Restaurant,City Hall,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Gym,College Theater,Concert Hall,Convenience Store,Costume Shop,Creperie,Cricket Ground,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dive Bar,Dry Cleaner,Dumpling Restaurant,Egyptian Restaurant,Electronics Store,Eye Doctor,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Fishing Spot,Flea Market,Food Court,Food Truck,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Harbor / Marina,Health & Beauty Service,Hobby Shop,Hockey Arena,Home Service,Hostel,Hotel,Hotel Bar,Hungarian Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Kitchen Supply Store,Korean Restaurant,Lebanese Restaurant,Light Rail Station,Liquor Store,Lounge,Malay Restaurant,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Mini Golf,Miscellaneous Shop,Modern European Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Multiplex,Museum,Music Store,Music Venue,Nightclub,Noodle House,Opera House,Outlet Mall,Paper / Office Supplies Store,Park,Pedestrian Plaza,Peking Duck Restaurant,Performing Arts Venue,Persian Restaurant,Pet Store,Pharmacy,Pier,Pizza Place,Playground,Plaza,Polish Restaurant,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Recreation Center,Rental Service,Restaurant,River,Rock Climbing Spot,Rock Club,Roof Deck,Sake Bar,Salad Place,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shop & Service,Shopping Mall,Soccer Field,Social Club,Soup Place,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Szechuan Restaurant,Taco Place,Tea Room,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Thrift / Vintage Store,Tibetan Restaurant,Tour Provider,Train Station,Tram Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Yunnan Restaurant,Zoo Exhibit
0,MELBOURNE,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,MELBOURNE,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,MELBOURNE,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,MELBOURNE,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,MELBOURNE,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [42]:
Melbourne_onehot.shape

(1163, 208)
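
To make the encoding step concrete, here is a small sketch on invented data. After one-hot encoding, taking the mean per locality gives the share of each venue category in that locality, so every row of the grouped table sums to 1. The toy `venues` frame is an assumption; the pandas calls match the ones used in the cell above.

```python
import pandas as pd

# Toy venue list (made up): four venues in X, two in Y.
venues = pd.DataFrame({
    "Locality": ["X", "X", "X", "X", "Y", "Y"],
    "Venue Category": ["Café", "Café", "Pub", "Park", "Park", "Café"],
})

# One indicator column per category, stored as floats so the mean is a share.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Locality", venues["Locality"])

# Mean of the indicators per locality = frequency of each category.
grouped = onehot.groupby("Locality").mean()
print(grouped)
```

In this toy table half of the venues in both localities are cafés, even though X has twice as many venues as Y; the frequencies ignore the absolute number of venues.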

In [18]:
num_top_venues = 3
for hood in Melbourne_grouped['Locality']:
    print("---- "+hood+" ----")
    temp =Melbourne_grouped[Melbourne_grouped['Locality'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- ABBOTSFORD ----
                 venue  freq
0                 Café  0.18
1                  Pub  0.14
2  Japanese Restaurant  0.05


---- ALBERT PARK ----
          venue  freq
0          Café  0.15
1         Beach  0.15
2  Tram Station  0.08


---- ALPHINGTON ----
            venue  freq
0  Farmers Market  0.25
1  Rental Service  0.25
2            Café  0.25


---- ARMADALE ----
                 venue  freq
0                 Café  0.38
1         Tram Station  0.31
2  Japanese Restaurant  0.08


---- ASPENDALE ----
                 venue  freq
0                 Café  0.17
1  Sporting Goods Shop  0.17
2        Jewelry Store  0.17


---- AUBURN ----
          venue  freq
0          Café  0.27
1  Tram Station  0.09
2          Park  0.09


---- AUBURN SOUTH ----
  venue  freq
0  Café  0.13
1   Gym  0.07
2   Bar  0.07


---- BALACLAVA ----
               venue  freq
0               Café  0.18
1        Coffee Shop  0.14
2  Convenience Store  0.09


---- BALWYN ----
(remaining output truncated)

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Most Common Venues in Each Melbourne Suburb

In [20]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

columns = ['Locality']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond the 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

Locality_venues_sorted = pd.DataFrame(columns=columns)
Locality_venues_sorted['Locality'] = Melbourne_grouped['Locality']

for ind in np.arange(Melbourne_grouped.shape[0]):
    Locality_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Melbourne_grouped.iloc[ind, :], num_top_venues)

Locality_venues_sorted.describe()

Unnamed: 0,Locality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
count,74,74,74,74
unique,74,24,40,43
top,MELBOURNE,Café,Coffee Shop,Pub
freq,1,43,6,5


In [43]:
Locality_venues_sorted.head()

Unnamed: 0,Cluster Labels,Locality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,2,ABBOTSFORD,Café,Pub,Convenience Store
1,2,ALBERT PARK,Café,Beach,Tram Station
2,0,ALPHINGTON,Rental Service,Café,Train Station
3,0,ARMADALE,Café,Tram Station,Playground
4,2,ASPENDALE,Sporting Goods Shop,Home Service,Café


The summary table above makes it clear that cafés and coffee shops are the most popular venues in Melbourne: Café is the 1st most common venue in 43 of the 74 localities.

## 6.Clustering

K-means clustering was used to group the locations into three clusters. The different clusters were displayed on a map of Melbourne, as shown in the following figure:

In [34]:
# Using K-Means to cluster the localities into 3 clusters
Melbourne_grouped_clustering = Melbourne_grouped.drop(columns='Locality')
kmeans = KMeans(n_clusters=3, random_state=0).fit(Melbourne_grouped_clustering)
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 0, 0,
       2, 2, 0, 0, 0, 2, 1, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 2,
       2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 0, 2], dtype=int32)
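
The notebook fixes the number of clusters at three without showing how that value was chosen. One common sanity check, sketched here on synthetic data, is to compare the k-means inertia (within-cluster sum of squared distances) for several values of `k` and look for the point where it stops dropping sharply. The blobs and the range of `k` below are assumptions for illustration, not results from the Melbourne dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the frequency table: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

On real venue frequencies the drop is usually less clear-cut than on these blobs, so this check is a guide rather than a rule.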

In [22]:
Locality_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# start from the cleaned suburb table so each row keeps its postcode and coordinates
Melbourne_merged = Melbourne

# add the cluster label and the top venues of each locality to the suburb table
Melbourne_merged = Melbourne_merged.join(Locality_venues_sorted.set_index('Locality'), on='Locality')

Melbourne_merged.head()# check the last columns!

Unnamed: 0,Postcode,Locality,Longitude,Latitude,SA4 Name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,3000,MELBOURNE,144.956776,-37.817403,Melbourne - Inner,2.0,Café,Coffee Shop,Japanese Restaurant
1,3002,EAST MELBOURNE,144.982207,-37.818517,Melbourne - Inner,2.0,Cricket Ground,Tennis Stadium,Bar
2,3003,WEST MELBOURNE,144.949592,-37.810871,Melbourne - Inner,2.0,Café,Indian Restaurant,Pub
3,3004,MELBOURNE,144.970161,-37.844246,Melbourne - Inner,2.0,Café,Coffee Shop,Japanese Restaurant
4,3005,WORLD TRADE CENTRE,144.950858,-37.824608,Melbourne - Inner,2.0,Café,Bar,Hotel


### Map of Clusters

In [65]:
# create map
map_clusters = folium.Map(location=[latitude_x, longitude_y], zoom_start=11)

# one fixed color per cluster label
cluster_colors = {0: '#f53623', 1: '#170ada', 2: '#1a8606'}

# localities without any returned venue have no cluster label, so skip them
labelled = Melbourne_merged.dropna(subset=['Cluster Labels'])

# add markers to the map
for lat, lon, nei, cluster in zip(labelled['Latitude'],
                                  labelled['Longitude'],
                                  labelled['Locality'],
                                  labelled['Cluster Labels']):
    cluster = int(cluster)  # labels become floats after the join
    label = folium.Popup('{} Cluster {}'.format(nei, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=cluster_colors[cluster],
        fill=True,
        fill_color=cluster_colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters


In [60]:
Melbourne_merged.loc[Melbourne_merged['Cluster Labels'] == 0,Melbourne_merged.columns[[1] + list(range(5, Melbourne_merged.shape[1]))]]

Unnamed: 0,Locality,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
8,UNIVERSITY OF MELBOURNE,0.0,Café,Coffee Shop,Pub
10,MOONEE PONDS,0.0,Café,Japanese Restaurant,Greek Restaurant
11,ROYAL MELBOURNE HOSPITAL,0.0,Café,Coffee Shop,Pub
12,HOTHAM HILL,0.0,Café,Pub,Football Stadium
15,CARLTON NORTH,0.0,Café,Breakfast Spot,Pub
17,BRUNSWICK,0.0,Café,Bar,Grocery Store
21,COLLINGWOOD,0.0,Café,Japanese Restaurant,Pizza Place
24,NORTHCOTE,0.0,Café,Bar,Pizza Place
25,THORNBURY,0.0,Café,Playground,Zoo Exhibit
26,ALPHINGTON,0.0,Rental Service,Café,Train Station


In [61]:
Melbourne_merged.loc[Melbourne_merged['Cluster Labels'] == 1,Melbourne_merged.columns[[1] + list(range(5, Melbourne_merged.shape[1]))]]

Unnamed: 0,Locality,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
19,SUMNER,1.0,Tram Station,Park,Convenience Store
43,BOX HILL NORTH,1.0,Golf Course,Park,Zoo Exhibit
50,GLEN IRIS,1.0,Park,Playground,Café
54,BOORAN ROAD PO,1.0,Playground,Coffee Shop,Tram Station
71,BONBEACH,1.0,Park,Playground,Soccer Field


In [62]:
Melbourne_merged.loc[Melbourne_merged['Cluster Labels'] == 2,Melbourne_merged.columns[[1] + list(range(5, Melbourne_merged.shape[1]))]]

Unnamed: 0,Locality,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,MELBOURNE,2.0,Café,Coffee Shop,Japanese Restaurant
1,EAST MELBOURNE,2.0,Cricket Ground,Tennis Stadium,Bar
2,WEST MELBOURNE,2.0,Café,Indian Restaurant,Pub
3,MELBOURNE,2.0,Café,Coffee Shop,Japanese Restaurant
4,WORLD TRADE CENTRE,2.0,Café,Bar,Hotel
5,SOUTH WHARF,2.0,Bar,Hotel,Australian Restaurant
6,SOUTHBANK,2.0,Bar,Performing Arts Venue,Theater
7,DOCKLANDS,2.0,Café,Hotel,Portuguese Restaurant
9,FLEMINGTON,2.0,Park,Café,Hotel
13,MELBOURNE UNIVERSITY,2.0,Zoo Exhibit,Convenience Store,Sculpture Garden


## 7.Result

As the tables above show, Café is the 1st most common venue for almost every locality in cluster 0, the exceptions being Rental Service and Thrift / Vintage Store. The second cluster (label 1) is related to sports venues and tram stations. In the last cluster (label 2) the 1st most common venue varies, but Café still accounts for almost 33% of them.
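
A per-cluster share such as the "almost 33%" figure can be computed rather than read off the tables. The sketch below uses `pandas.crosstab` on invented rows that only mimic the shape of the merged table; the numbers it prints are not the notebook's results.

```python
import pandas as pd

# Made-up rows in the same shape as the merged table.
merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 2, 2, 2],
    "1st Most Common Venue": ["Café", "Café", "Rental Service", "Park",
                              "Tram Station", "Café", "Bar", "Park"],
})

# Share of each venue category among the top venues of every cluster.
shares = pd.crosstab(merged["Cluster Labels"],
                     merged["1st Most Common Venue"],
                     normalize="index")
cafe_share = shares["Café"]
print(cafe_share.round(2))
```

With `normalize="index"` each row of `shares` sums to 1, so the Café column directly gives the fraction of localities in a cluster whose top venue is a café.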

## 8.Discussion

From the above data and analysis, it is clear that Melbourne residents enjoy drinking coffee. Besides, the localities in cluster 0, whose most common venue is a café, are dispersed across the city. There is no single area where locals go to drink coffee, which suggests that Melbourne has many good cafés.

On the other hand, this also creates an opportunity for people who want to open a new café. Based on the map above, the following suburbs would be good choices:   
1.CAULFIELD EAST   
2.BOX HILL   
3.ALPHINGTON   
4.KEW EAST   
5.FLEMINGTON   
People in these areas do go to cafés, but less often than people in other areas, which could be an opportunity to start a new business.

## 9.Conclusion

This project analyzed data from the Melbourne area to determine whether Melbourne residents enjoy drinking coffee and to find which suburbs have more coffee lovers than others. By using k-means clustering, it was possible to identify promising locations from the cluster map. Therefore, an entrepreneur opening a new coffee shop in any of the suburbs listed above would likely have a higher number of potential customers.