<h1> Daniel Setyanata - Capstone Project </h1>

<h2> 1. Introduction </h2>

<h3> 1.1 Background </h3>

Melbourne, Victoria, Australia is famous for its population's affinity for coffee and brunch, which makes it one of the best cities in which to open a cafe. However, doing so is not easy because of the number of cafes already operating in the area. Using Foursquare data, I hope to find the best location to open a cafe: one where demand is still high and relatively few cafes are already operating. The population figures used are the 2018 figures, as they are the more recent ones.

The data used is the list of areas of Greater Melbourne from [Wikipedia](https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne) and the Foursquare data that shows cafes in Victoria.

<h3> 1.2 Target Audience </h3>

The target audience for this report is people who are looking to open a cafe in Victoria and trying to find the best location to open the cafe. Additionally, investors looking to invest in cafes in Victoria might benefit from this report.

<h2> 2. Data Acquisition and Cleaning </h2>

First, we scrape the data from [Wikipedia](https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne) and keep the Greater Melbourne areas only. The 'Council seat' column holds the name locals usually use for each area, so we rename it to 'Area'. Density and population are also used to determine which area is better for opening a cafe. For density, we only consider areas with densities higher than 2000, as lower values might not make an area worthwhile for a cafe; in the code below this is done by sorting by density and keeping the 14 densest areas.

In [10]:
from lxml import html
from bs4 import BeautifulSoup

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import; the pandas.io.json path was deprecated in pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium

In [84]:
r = requests.get("https://en.wikipedia.org/wiki/Local_government_areas_of_Victoria#Greater_Melbourne")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find_all('table')[0] 
df2 = pd.read_html(str(table))[0]
df2.rename(columns={'Council seat': 'Area', '(2018)[2][1]': 'Pop', 'Land area[1]': 'Land area', 'Density(2018)[1]': 'Density'}, inplace=True)
col1=df2['Area']['Area']
col2=df2['Land area']['Density']
col3=df2['Population']['Pop']
df = pd.DataFrame(list(zip(col1, col2, col3)), 
               columns =['Area', 'Density', 'Population'])
df.sort_values(by='Density',ascending=False,inplace=True)
df.reset_index(drop=True,inplace=True)
df = df.loc[0:13,:]
df.head()

Unnamed: 0,Area,Density,Population
0,St Kilda,5466,113200
1,Richmond,5041,98521
2,Melbourne,4550,169961
3,Malvern,4530,116207
4,Caulfield North,3977,153858
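
The density criterion described above can also be written as an explicit filter instead of a fixed row count. The following is only a minimal sketch on a toy dataframe: the first three rows are copied from the output above, while "Example Shire" and the constant name `DENSITY_THRESHOLD` are hypothetical and added purely for illustration.

```python
import pandas as pd

# First three rows come from the output above; "Example Shire" is a
# made-up low-density row used only to show the filter dropping something.
areas = pd.DataFrame({
    "Area": ["St Kilda", "Richmond", "Melbourne", "Example Shire"],
    "Density": [5466, 5041, 4550, 150],
})

DENSITY_THRESHOLD = 2000  # cut-off stated in the text

# Keep only areas above the threshold, densest first.
dense_areas = (
    areas[areas["Density"] > DENSITY_THRESHOLD]
    .sort_values("Density", ascending=False)
    .reset_index(drop=True)
)
print(dense_areas)
```

Filtering on the value itself keeps the selection correct even if the source table gains or loses rows.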


We use Geopy to find the latitude and longitude of each area in Victoria. However, several of the returned coordinates are wrong because other places in the world share the same name. We therefore manually replace the wrong values with the correct ones, which were found using a Google search.

In [85]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='Australia_explorer')
def city_coordinates(city_name):
    # Geocode once and reuse the result for both coordinates
    location = geolocator.geocode(city_name)
    return location.latitude, location.longitude
df['Latitude'],df['Longitude'] = zip(*df['Area'].apply(city_coordinates))
df.loc[0,'Latitude'],df.loc[0,'Longitude'] = -37.8640, 144.9820
df.loc[1,'Latitude'],df.loc[1,'Longitude'] = -37.8230, 144.9980
df.loc[3,'Latitude'],df.loc[3,'Longitude'] = -37.8572, 145.0342
df.loc[5,'Latitude'],df.loc[5,'Longitude'] = -37.7413, 144.9666
df.loc[6,'Latitude'],df.loc[6,'Longitude'] = -37.7431, 145.0081
df.loc[7,'Latitude'],df.loc[7,'Longitude'] = -37.8321, 145.0637
df.loc[9,'Latitude'],df.loc[9,'Longitude'] = -37.9525, 145.0123
df=df.loc[0:13,]
df.head()

Unnamed: 0,Area,Density,Population,Latitude,Longitude
0,St Kilda,5466,113200,-37.864,144.982
1,Richmond,5041,98521,-37.823,144.998
2,Melbourne,4550,169961,-37.814218,144.963161
3,Malvern,4530,116207,-37.8572,145.0342
4,Caulfield North,3977,153858,-37.870828,145.021801
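
One possible way to reduce the name clashes mentioned above is to qualify each query with the state and country before geocoding, and to handle the case where no match is returned. This is a sketch under assumptions, not the method used in this report: `qualified_query`, `safe_coordinates` and `fake_geocode` are hypothetical helpers, and the stub stands in for `geolocator.geocode` so the example runs without network access. The Richmond coordinates are the manually corrected values from the cell above.

```python
from types import SimpleNamespace

def qualified_query(area, region="Victoria, Australia"):
    # Append the region so that e.g. "Richmond" is not matched elsewhere.
    return f"{area}, {region}"

def safe_coordinates(area, geocode):
    # `geocode` is any callable that returns an object with .latitude and
    # .longitude, or None when nothing is found (as Nominatim's geocode does).
    location = geocode(qualified_query(area))
    if location is None:
        return None, None
    return location.latitude, location.longitude

def fake_geocode(query):
    # Hypothetical stand-in for geolocator.geocode, for offline illustration.
    known = {
        "Richmond, Victoria, Australia":
            SimpleNamespace(latitude=-37.8230, longitude=144.9980),
    }
    return known.get(query)

print(safe_coordinates("Richmond", fake_geocode))
print(safe_coordinates("Nowhere", fake_geocode))
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`; whether the qualified query removes every wrong match would still need to be checked by hand.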


Next, we use the Foursquare API to retrieve cafes and/or restaurants around each area. The API requires a Client ID and Client Secret, specified below. We keep only venues whose top-level category is Food, since our focus is cafes and/or restaurants.

In [40]:
CLIENT_ID = '4FZSETCZATEWPQ2KIONQUFQFQ0WYGKJNPIKUM44AWBJD5CZY'
CLIENT_SECRET = 'TWQI0NR2QXZQHKHRL1MFV1GHB00PWHAVRSVEQBMNSDRPZ145'
VERSION = '20180605'

from pandas import json_normalize
categories_loaded = False
categories_hierarchy = {}
top_level_categories = {}

def getAllCategories(force_load=False):
    global categories_loaded
    global categories_hierarchy
    if(force_load or not categories_loaded):
        url='https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION
        )
        categories_hierarchy = requests.get(url).json()['response']['categories']
        categories_loaded=True
    return categories_hierarchy

def assignCategory(category_name,json_tree):
    global top_level_categories
    top_level_categories[json_tree['name']] = category_name
    for category in json_tree['categories']:
        assignCategory(category_name,category)
    
def assignTopLevelCategory():
    global top_level_categories
    global categories_hierarchy
    getAllCategories()
    for category in categories_hierarchy:
        top_level = category['name']
        top_level_categories[top_level] = top_level
        for child_category in category['categories']:
            assignCategory(top_level,child_category)
        
def getNearbyVenues(nbrs,lats,lngs,radius=5000):
    global top_level_categories
    venues = []
    for nbr,lat,lng in zip(nbrs,lats,lngs):
        url='https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            100
        )
        items = requests.get(url).json()['response']['groups'][0]['items']
        venues += [[nbr,lat,lng,item['venue']['name'],item['venue']['location']['lat'],
                    item['venue']['location']['lng'],
                   top_level_categories[item['venue']['categories'][0]['name']]] for item in items]
        
    nbr_df = pd.DataFrame(venues)
    nbr_df.columns = ['Area','Area Latitude','Area Longitude','Venue','Venue Latitude','Venue Longitude','Venue Category']
    return nbr_df

assignTopLevelCategory()
venues_df = getNearbyVenues(df.loc[:,'Area'],df.loc[:,'Latitude'],df.loc[:,'Longitude'])
cafes_df=venues_df[venues_df['Venue Category'] == 'Food']
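
To illustrate what the category helpers above are doing, here is a minimal sketch of the same idea on a small in-memory tree: every category, however deeply nested, is mapped to the name of its top-level ancestor. The tree below is a made-up toy, not the real Foursquare response, and `flatten_categories` is a hypothetical name.

```python
def flatten_categories(tree):
    """Map each category name to the name of its top-level ancestor."""
    mapping = {}

    def visit(node, top_name):
        mapping[node["name"]] = top_name
        for child in node.get("categories", []):
            visit(child, top_name)

    for top in tree:
        visit(top, top["name"])
    return mapping

# Toy hierarchy for illustration only.
toy_tree = [
    {"name": "Food", "categories": [
        {"name": "Coffee Shop", "categories": []},
        {"name": "Asian Restaurant", "categories": [
            {"name": "Thai Restaurant", "categories": []},
        ]},
    ]},
    {"name": "Shop & Service", "categories": [
        {"name": "Bookstore", "categories": []},
    ]},
]

mapping = flatten_categories(toy_tree)
print(mapping["Thai Restaurant"])                  # Food
print(mapping.get("Unknown Category", "Unknown"))  # Unknown
```

Looking names up with `.get` and a default, as in the last line, avoids a `KeyError` when a venue reports a category that is missing from the mapping.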

<h2> 3. Data Analysis and Evaluation </h2>

There are two ways to determine which area is the best for opening a cafe.

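First, we can compare the number of cafes/restaurants already operating near each area, where fewer existing venues means less competition. The counts cover Food-category venues among the up to 100 venues Foursquare returns within 5 km of each area's coordinates. From the dataframe cafes_count, Greensborough, Melbourne, Richmond, Glen Waverley, Moonee Ponds and St Kilda have the fewest, making them the six best areas in Victoria by this measure.

Second, we can compare the population density of each area, where a higher density suggests more potential customers. From df[['Area','Density']], St Kilda, Richmond, Melbourne, Malvern, Caulfield North and Coburg are the six densest, and so the six best areas by this measure.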

In [89]:
# Count Food venues per area; grouping keeps each count attached to its own
# area name instead of relying on two separately sorted lists lining up
cafes_count = cafes_df.groupby('Area')['Venue Category'].count().reset_index(name='Count')
cafes_count.sort_values(by='Count',ascending=True,inplace=True)
cafes_count

Unnamed: 0,Area,Count
5,Greensborough,46
7,Melbourne,47
11,Richmond,53
4,Glen Waverley,55
8,Moonee Ponds,55
13,St Kilda,59
12,Sandringham,60
3,Footscray,61
9,Nunawading,63
2,Coburg,66
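
A per-area count like the one above silently omits any area that has no Food venue at all. The sketch below shows one way to keep such areas with a count of zero. It uses a tiny made-up venue table, and "Example Area" is a hypothetical area with no venues, so none of these numbers are results from this report.

```python
import pandas as pd

# Made-up venues for illustration only.
venues = pd.DataFrame({
    "Area": ["Melbourne", "Melbourne", "Richmond", "Richmond", "Richmond"],
    "Venue Category": ["Food", "Shop & Service", "Food", "Food",
                       "Arts & Entertainment"],
})
all_areas = ["Melbourne", "Richmond", "Example Area"]

food = venues[venues["Venue Category"] == "Food"]

# Count rows per area, then reindex so areas without Food venues get 0.
counts = (
    food.groupby("Area").size()
    .reindex(all_areas, fill_value=0)
    .rename("Count")
    .rename_axis("Area")
    .reset_index()
    .sort_values("Count")
)
print(counts)
```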


In [90]:
df[['Area','Density']]

Unnamed: 0,Area,Density
0,St Kilda,5466
1,Richmond,5041
2,Melbourne,4550
3,Malvern,4530
4,Caulfield North,3977
5,Coburg,3567
6,Preston,3022
7,Camberwell,3013
8,Footscray,2927
9,Sandringham,2841


<h2> 4. Conclusion </h2>

Based on the analysis above, we can conclude that Melbourne, Richmond and St Kilda would be the best areas in which to open a cafe: they are among the most densely populated areas and have fewer cafes/restaurants than most of the other areas considered.

Further analysis and criteria could be used to improve this report. For example, the demographics of the population could also affect the profitability of a cafe (e.g. an older population might be less interested in brunch). The economic condition of the population could also be a factor in whether an area would be a good location for a cafe.