# Table of contents
* [1. Introduction: Business Problem](#introduction)
* [2. Data](#data)
* [3. Methodology](#methodology)
* [4. Analysis](#analysis)
* [5. Results and Discussion](#results)
* [6. Conclusion](#conclusion)

## 1. Introduction: Business Problem <a name="introduction"></a>

Hochiminh City is the most populous city in Vietnam, with a population of 8.4 million (13 million in the metropolitan area) as of 2017. As a major gateway to Vietnam, the city received over 8.6 million international visitors in 2019. A newly opened restaurant would therefore have a very large pool of potential customers.
So if someone wants to open a Japanese restaurant in Hochiminh City, where should it be located to be most profitable?

## 2. Data <a name="data"></a>

To answer this question, we use district-level data for Hochiminh City: district name, population, area, and population density.
Geolocation data from the Foursquare API will also be used to collect the number, type, and location of restaurants and other venues in every neighborhood.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from geopy.geocoders import Nominatim
import warnings
warnings.filterwarnings('ignore')
import folium

In [2]:
address = 'Hochiminh, VN'
geolocator = Nominatim(user_agent="HCM_explorer")
To_location = geolocator.geocode(address)
To_latitude = To_location.latitude
To_longitude = To_location.longitude
print('The geographical coordinates of Hochiminh City are {}, {}.'.format(To_latitude, To_longitude))

The geographical coordinates of Hochiminh City are 10.7888764, 106.7034958.


First, import the district data for Hochiminh City:

In [3]:
HCM=pd.read_excel('data/HCM.xlsx')
dist_1_latitude=HCM['Latitude'][0]
dist_1_longitude=HCM['Longitude'][0]
HCM.head()

Unnamed: 0,Name,Area (km2),Population,Population Density (person/km2),Latitude,Longitude
0,District 1,7.72,142000,18394,10.775659,106.700424
1,District 2,49.79,180000,3615,10.787273,106.74981
2,District 3,4.92,190000,38618,10.78437,106.684409
3,District 4,4.18,175000,41866,10.757826,106.701297
4,District 5,4.27,159000,37237,10.754028,106.663375


Second, import the venue data from Foursquare:

In [4]:
# Foursquare API credentials (placeholders; substitute your own and keep them out of shared notebooks)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'
LIMIT = 100 
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    dist_1_latitude, 
    dist_1_longitude, 
    radius, 
    LIMIT)

import requests

# test request for District 1; getNearbyVenues repeats it for every district
results = requests.get(url).json()
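
One way to keep the API credentials out of the notebook is to read them from environment variables and let the standard library encode the query string. The sketch below is only an illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Build an explore URL; credentials are read from the environment.

    The environment variable names are an assumption made for this sketch.
    """
    params = {
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters such as the comma in "ll"
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# District 1 coordinates taken from the table above
example_url = build_explore_url(10.775659, 106.700424)
```

The resulting string can be passed to `requests.get` in the same way as the hand-formatted `url`.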

In [5]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Processing ',name,'.....')
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
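
The function above indexes straight into the JSON response, so a district with no results or a venue without a category would raise a `KeyError` or `IndexError`. A more defensive way to flatten one response might look like the sketch below. The helper name `extract_venues` and the `"Unknown"` fallback label are assumptions for illustration, and the sample payload is made up to mirror the fields the notebook reads.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows, tolerating missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows


# made-up payload with the same shape as the fields used above
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue A",
                   "location": {"lat": 10.0, "lng": 106.0},
                   "categories": [{"name": "Sushi Restaurant"}]}},
        {"venue": {"name": "Venue B",
                   "location": {"lat": 10.1, "lng": 106.1},
                   "categories": []}},
    ]}]}
}
sample_rows = extract_venues(sample_payload, "District X", 10.05, 106.05)
```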

In [7]:
# get the venue data for every district in Hochiminh City
HCM_Venues = getNearbyVenues(names=HCM['Name'],
                                   latitudes=HCM['Latitude'],
                                   longitudes=HCM['Longitude']
                                  )

Processing  District 1 .....
Processing  District 2 .....
Processing  District 3 .....
Processing  District 4 .....
Processing  District 5 .....
Processing  District 6 .....
Processing  District 7 .....
Processing  District 8 .....
Processing  District 9 .....
Processing  District 10 .....
Processing  District 11 .....
Processing  District 12 .....
Processing  Binh Tan District .....
Processing  Binh Thanh District .....
Processing  Go Vap District .....
Processing  Phu Nhuan District .....
Processing  Tan Binh District .....
Processing  Tan Phu District .....
Processing  Thu Duc District .....


We now have the venue data for every district of Hochiminh City:

In [8]:
HCM_Venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,District 1,10.775659,106.700424,Pasteur Street Brewing Company,10.77522,106.700894,Brewery
1,District 1,10.775659,106.700424,Liberty Central Saigon Citypoint Hotel,10.774758,106.700795,Hotel
2,District 1,10.775659,106.700424,The Old Compass Cafe,10.774816,106.700685,Café
3,District 1,10.775659,106.700424,CGV Cinemas Liberty CityPoint,10.774763,106.700766,Multiplex
4,District 1,10.775659,106.700424,O Lé,10.774772,106.699524,Spanish Restaurant


Keep only the venues that are restaurants:

In [9]:
# filter and reset the index in one step to avoid modifying a slice in place
df_restaurant = HCM_Venues[HCM_Venues['Venue Category'].str.contains('Restaurant')].reset_index(drop=True)
df_restaurant.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,District 1,10.775659,106.700424,O Lé,10.774772,106.699524,Spanish Restaurant
1,District 1,10.775659,106.700424,Huong Lai,10.777258,106.700219,Vietnamese Restaurant
2,District 1,10.775659,106.700424,Secret Garden,10.777422,106.699926,Vietnamese Restaurant
3,District 1,10.775659,106.700424,Mountain Retreat,10.77437,106.700532,Vietnamese Restaurant
4,District 1,10.775659,106.700424,TukTuk Thai Bistro,10.777389,106.700521,Thai Restaurant


## 3. Methodology <a name="methodology"></a>

In this project, we investigate the district data for Hochiminh City (population, area, and population density) to find out whether any of these correlate with the number of restaurants returned by the Foursquare API. If so, we can narrow down the neighborhoods in which to open the restaurant.

After that, we use k-means clustering to segment the districts of Hochiminh City by restaurant type and see which kinds of restaurant are already common in each district, so that we can avoid direct competition and choose a better place.

## 4. Analysis <a name="analysis"></a>

### 4.1 Correlation

Let's see how many restaurants there are in each district:

In [10]:
df_restaurant_num=df_restaurant['Neighborhood'].value_counts().to_frame()
df_restaurant_num

Unnamed: 0,Neighborhood
District 1,34
District 3,23
District 4,12
District 5,8
Tan Binh District,7
District 7,6
District 10,5
District 11,3
Go Vap District,2
District 2,2


Let's see how many venues there are in each district:

In [11]:
df_venue_num=HCM_Venues['Neighborhood'].value_counts().to_frame()
df_venue_num

Unnamed: 0,Neighborhood
District 1,100
District 3,40
District 10,23
District 4,20
District 5,19
Tan Binh District,17
District 7,15
Phu Nhuan District,13
Binh Thanh District,10
Thu Duc District,6


Let's add the above info to our dataframe

In [37]:
df_HCM=pd.merge(HCM,df_restaurant_num,how='left',left_on='Name',right_index=True)
df_HCM.rename(columns={'Neighborhood':'Number of Restaurant'},inplace=True)
df_HCM_new=pd.merge(df_HCM,df_venue_num,how='left',left_on='Name',right_index=True)
df_HCM_new.rename(columns={'Neighborhood':'Number of Venue'},inplace=True)
df_HCM_new.fillna(value=0,inplace=True)


In [38]:
df_HCM_new.head(18)

Unnamed: 0,Name,Area (km2),Population,Population Density (person/km2),Latitude,Longitude,Number of Restaurant,Number of Venue
0,District 1,7.72,142000,18394,10.775659,106.700424,34.0,100.0
1,District 2,49.79,180000,3615,10.787273,106.74981,2.0,4.0
2,District 3,4.92,190000,38618,10.78437,106.684409,23.0,40.0
3,District 4,4.18,175000,41866,10.757826,106.701297,12.0,20.0
4,District 5,4.27,159000,37237,10.754028,106.663375,8.0,19.0
5,District 6,7.14,233000,32633,10.748093,106.635236,0.0,5.0
6,District 7,35.69,360000,10087,10.734034,106.721579,6.0,15.0
7,District 8,19.11,424000,22187,10.724088,106.628626,0.0,2.0
8,District 9,114.0,397000,3482,10.84284,106.828685,0.0,0.0
9,District 10,5.72,234000,40909,10.774596,106.667954,5.0,23.0
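
The merge-and-rename steps above rely on the column name that `value_counts().to_frame()` produces, which differs between pandas versions. A version-independent way to attach both counts is sketched below on made-up toy data; the frame names `districts`, `venues`, `counts` and `summary` are illustrative, not taken from the notebook.

```python
import pandas as pd

# toy stand-ins for HCM and HCM_Venues
districts = pd.DataFrame({"Name": ["District 1", "District 2", "District 9"]})
venues = pd.DataFrame({
    "Neighborhood": ["District 1", "District 1", "District 1", "District 2"],
    "Venue Category": ["Hotel", "Thai Restaurant", "Sushi Restaurant", "Café"],
})

is_restaurant = venues["Venue Category"].str.contains("Restaurant", na=False)

# one explicitly named column per count, indexed by district name
counts = pd.DataFrame({
    "Number of Restaurant": venues[is_restaurant].groupby("Neighborhood").size(),
    "Number of Venue": venues.groupby("Neighborhood").size(),
})

# a left merge keeps districts without venues; their counts become 0
summary = districts.merge(counts, how="left", left_on="Name", right_index=True)
count_cols = ["Number of Restaurant", "Number of Venue"]
summary[count_cols] = summary[count_cols].fillna(0).astype(int)
```

Filling only the two count columns, rather than the whole frame, avoids masking missing values in unrelated columns.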


Let's look at the correlation between the number of restaurants and the other variables:

In [13]:
from scipy import stats
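# restrict to numeric columns; recent pandas versions raise on the text column 'Name'
df_HCM_new.select_dtypes('number').corr().loc['Number of Restaurant']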

Area (km2)                        -0.371799
Population                        -0.495202
Population Density (person/km2)    0.189533
Latitude                          -0.197038
Longitude                         -0.129953
Number of Restaurant               1.000000
Number of Venue                    0.950511
Name: Number of Restaurant, dtype: float64

We found that two correlation coefficients are noticeable: Number of Restaurant vs. Population (-0.495202) and Number of Restaurant vs. Number of Venue (0.950511). So let's check their p-values:

In [14]:
pearson_coef, p_value = stats.pearsonr(df_HCM_new['Number of Restaurant'],df_HCM_new['Population'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )

The Pearson Correlation Coefficient is -0.49520242127936664  with a P-value of P = 0.03109951445332337


**=> there is a moderate negative correlation between Number of Restaurant and Population, statistically significant at the 0.05 level (P ≈ 0.031)**

In [15]:
pearson_coef, p_value = stats.pearsonr(df_HCM_new['Number of Restaurant'],df_HCM_new['Number of Venue'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )

The Pearson Correlation Coefficient is 0.95051089106888  with a P-value of P = 4.672574022418637e-10


**=> there is a strong positive correlation between Number of Restaurant and Number of Venue (P < 0.001)**
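
For readers who want to see what the coefficient measures, the following sketch computes Pearson's r directly from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` is illustrative. The sample uses the venue and restaurant counts of five districts from the table above (Districts 1, 3, 10, 4 and 5), so its value is not expected to match the coefficient computed on all districts.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# illustrative subset: Districts 1, 3, 10, 4, 5
venue_counts = [100, 40, 23, 20, 19]
restaurant_counts = [34, 23, 5, 12, 8]
r_subset = pearson_r(venue_counts, restaurant_counts)
```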

The top 5 districts by Number of Venue:

In [16]:
df_HCM_new.sort_values(by=['Number of Venue'],ascending=False).head()

Unnamed: 0,Name,Area (km2),Population,Population Density (person/km2),Latitude,Longitude,Number of Restaurant,Number of Venue
0,District 1,7.72,142000,18394,10.775659,106.700424,34.0,100.0
2,District 3,4.92,190000,38618,10.78437,106.684409,23.0,40.0
9,District 10,5.72,234000,40909,10.774596,106.667954,5.0,23.0
3,District 4,4.18,175000,41866,10.757826,106.701297,12.0,20.0
4,District 5,4.27,159000,37237,10.754028,106.663375,8.0,19.0


### 4.2 K-means

Let's count the restaurants in each district, and then the number of distinct restaurant categories in Hochiminh City:

In [17]:
df_restaurant.groupby('Neighborhood').count().iloc[:,0]

Neighborhood
Binh Thanh District     1
District 1             34
District 10             5
District 11             3
District 12             1
District 2              2
District 3             23
District 4             12
District 5              8
District 7              6
Go Vap District         2
Phu Nhuan District      2
Tan Binh District       7
Name: Neighborhood Latitude, dtype: int64

In [18]:
print('There are {} unique categories.'.format(len(df_restaurant['Venue Category'].unique())))

There are 23 unique categories.


In [19]:
# one hot encoding
HCM_onehot = pd.get_dummies(df_restaurant[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
HCM_onehot['Neighborhood'] = df_restaurant['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [HCM_onehot.columns[-1]] + list(HCM_onehot.columns[:-1])
HCM_onehot = HCM_onehot[fixed_columns]

HCM_onehot.head()

Unnamed: 0,Neighborhood,Argentinian Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Fast Food Restaurant,French Restaurant,German Restaurant,Hawaiian Restaurant,...,Middle Eastern Restaurant,North Indian Restaurant,Restaurant,Scandinavian Restaurant,Seafood Restaurant,Spanish Restaurant,Sushi Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,District 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,District 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,District 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,District 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,District 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [20]:
HCM_grouped = HCM_onehot.groupby('Neighborhood').mean().reset_index()
HCM_grouped.head()

Unnamed: 0,Neighborhood,Argentinian Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant,Eastern European Restaurant,Fast Food Restaurant,French Restaurant,German Restaurant,Hawaiian Restaurant,...,Middle Eastern Restaurant,North Indian Restaurant,Restaurant,Scandinavian Restaurant,Seafood Restaurant,Spanish Restaurant,Sushi Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,Binh Thanh District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,District 1,0.029412,0.058824,0.0,0.0,0.029412,0.0,0.029412,0.029412,0.029412,...,0.029412,0.029412,0.058824,0.0,0.0,0.029412,0.058824,0.058824,0.029412,0.352941
2,District 10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8
3,District 11,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
4,District 12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
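
Taking the mean of one-hot rows per district gives the share of each restaurant category in that district. The minimal sketch below shows this on toy rows that mirror District 10 (four Vietnamese restaurants and one Korean) and District 11 (one restaurant in each of three categories); the variable names are illustrative.

```python
import pandas as pd

restaurants = pd.DataFrame({
    "Neighborhood": ["District 10"] * 5 + ["District 11"] * 3,
    "Venue Category": (
        ["Vietnamese Restaurant"] * 4 + ["Korean Restaurant"]
        + ["Vietnamese Restaurant", "Seafood Restaurant", "Asian Restaurant"]
    ),
})

# one column per category, 1.0 where the row belongs to that category
onehot = pd.get_dummies(restaurants["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", restaurants["Neighborhood"])

# the mean of a 0/1 column is the fraction of rows equal to 1
freq = onehot.groupby("Neighborhood").mean()
```

Each row of `freq` sums to 1, so the values can be read directly as category shares within a district.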


In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

OK, we can now see the frequency of each kind of restaurant in each district:

In [22]:
num_top_venues = 5

for hood in HCM_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = HCM_grouped[HCM_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Binh Thanh District----
                           venue  freq
0          Vietnamese Restaurant   1.0
1             Mexican Restaurant   0.0
2  Vegetarian / Vegan Restaurant   0.0
3                Thai Restaurant   0.0
4               Sushi Restaurant   0.0


----District 1----
                   venue  freq
0  Vietnamese Restaurant  0.35
1    Japanese Restaurant  0.06
2        Thai Restaurant  0.06
3       Sushi Restaurant  0.06
4             Restaurant  0.06


----District 10----
                           venue  freq
0          Vietnamese Restaurant   0.8
1              Korean Restaurant   0.2
2             Chinese Restaurant   0.0
3      Middle Eastern Restaurant   0.0
4  Vegetarian / Vegan Restaurant   0.0


----District 11----
                           venue  freq
0          Vietnamese Restaurant  0.33
1             Seafood Restaurant  0.33
2               Asian Restaurant  0.33
3             Mexican Restaurant  0.00
4  Vegetarian / Vegan Restaurant  0.00


----District 12--

In [23]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Restaurant'.format(ind+1, indicators[ind]))
    except IndexError:  # positions after the 3rd use the 'th' suffix
        columns.append('{}th Most Common Restaurant'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = HCM_grouped['Neighborhood']

for ind in np.arange(HCM_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(HCM_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,Binh Thanh District,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
1,District 1,Vietnamese Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,Korean Restaurant
2,District 10,Vietnamese Restaurant,Korean Restaurant,Thai Restaurant,Italian Restaurant,Asian Restaurant
3,District 11,Vietnamese Restaurant,Asian Restaurant,Seafood Restaurant,Japanese Restaurant,Chinese Restaurant
4,District 12,Restaurant,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant
5,District 2,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
6,District 3,Vietnamese Restaurant,French Restaurant,Seafood Restaurant,Asian Restaurant,Korean Restaurant
7,District 4,Seafood Restaurant,Vietnamese Restaurant,Mexican Restaurant,Fast Food Restaurant,Japanese Restaurant
8,District 5,Chinese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Asian Restaurant,Japanese Restaurant
9,District 7,Vietnamese Restaurant,Sushi Restaurant,Scandinavian Restaurant,Japanese Restaurant,Italian Restaurant
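
Note that the ranking above always returns five categories, even when a district has fewer than five categories present: Binh Thanh District has a single (Vietnamese) restaurant, so its 2nd to 5th entries are categories with frequency 0 in an arbitrary order. One possible refinement, sketched here with the hypothetical helper `top_present_categories`, is to rank only the categories that actually occur.

```python
import pandas as pd


def top_present_categories(freq_row, n=5):
    """Return up to n categories with non-zero frequency, most frequent first."""
    present = freq_row[freq_row > 0]
    ranked = present.sort_values(ascending=False, kind="stable")
    return ranked.index.tolist()[:n]


# frequencies as reported above for District 10
district_10 = pd.Series({
    "Asian Restaurant": 0.0,
    "Japanese Restaurant": 0.0,
    "Korean Restaurant": 0.2,
    "Vietnamese Restaurant": 0.8,
})
top_district_10 = top_present_categories(district_10)
```

With this variant a category appears in a district's list only if at least one such restaurant was found there, which makes statements like "Japanese is the 4th most common" easier to interpret.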


In [24]:
from sklearn.cluster import KMeans #importing KMeans

kclusters = 5
HCM_grouped_clustering = HCM_grouped.drop(columns='Neighborhood')  # positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(HCM_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 3, 0, 1, 2, 0, 3, 1, 3, 3])
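
The number of clusters (`kclusters = 5`) is fixed by hand above. A common way to sanity-check such a choice is to compare the silhouette score for several values of k. The sketch below does this on synthetic, well-separated points, not on the district data; the helper name `best_k_by_silhouette` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values, random_state=0):
    """Fit k-means for each k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# synthetic data: three tight groups of 10 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
```

With only a handful of districts the scores can be noisy, so they are better treated as a rough guide than as a definitive answer.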

In [25]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_) # add clustering labels

In [26]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,0,Binh Thanh District,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
1,3,District 1,Vietnamese Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,Korean Restaurant
2,0,District 10,Vietnamese Restaurant,Korean Restaurant,Thai Restaurant,Italian Restaurant,Asian Restaurant
3,1,District 11,Vietnamese Restaurant,Asian Restaurant,Seafood Restaurant,Japanese Restaurant,Chinese Restaurant
4,2,District 12,Restaurant,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant


In [27]:
df_HCM_new.head()

Unnamed: 0,Name,Area (km2),Population,Population Density (person/km2),Latitude,Longitude,Number of Restaurant,Number of Venue
0,District 1,7.72,142000,18394,10.775659,106.700424,34.0,100.0
1,District 2,49.79,180000,3615,10.787273,106.74981,2.0,4.0
2,District 3,4.92,190000,38618,10.78437,106.684409,23.0,40.0
3,District 4,4.18,175000,41866,10.757826,106.701297,12.0,20.0
4,District 5,4.27,159000,37237,10.754028,106.663375,8.0,19.0


In [28]:
# join the cluster labels and the top-5 restaurant categories onto the district data
df_HCM_new = df_HCM_new.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Name',how='inner')

And now we can see the 5 most common restaurant categories for each district:

In [29]:
df_HCM_new

Unnamed: 0,Name,Area (km2),Population,Population Density (person/km2),Latitude,Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,District 1,7.72,142000,18394,10.775659,106.700424,34.0,100.0,3,Vietnamese Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,Korean Restaurant
1,District 2,49.79,180000,3615,10.787273,106.74981,2.0,4.0,0,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
2,District 3,4.92,190000,38618,10.78437,106.684409,23.0,40.0,3,Vietnamese Restaurant,French Restaurant,Seafood Restaurant,Asian Restaurant,Korean Restaurant
3,District 4,4.18,175000,41866,10.757826,106.701297,12.0,20.0,1,Seafood Restaurant,Vietnamese Restaurant,Mexican Restaurant,Fast Food Restaurant,Japanese Restaurant
4,District 5,4.27,159000,37237,10.754028,106.663375,8.0,19.0,3,Chinese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Asian Restaurant,Japanese Restaurant
6,District 7,35.69,360000,10087,10.734034,106.721579,6.0,15.0,3,Vietnamese Restaurant,Sushi Restaurant,Scandinavian Restaurant,Japanese Restaurant,Italian Restaurant
9,District 10,5.72,234000,40909,10.774596,106.667954,5.0,23.0,0,Vietnamese Restaurant,Korean Restaurant,Thai Restaurant,Italian Restaurant,Asian Restaurant
10,District 11,5.14,209000,40661,10.762974,106.650084,3.0,4.0,1,Vietnamese Restaurant,Asian Restaurant,Seafood Restaurant,Japanese Restaurant,Chinese Restaurant
11,District 12,52.74,620000,11756,10.867153,106.641332,1.0,2.0,2,Restaurant,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant
13,Binh Thanh District,20.78,499000,24013,10.810583,106.709142,1.0,10.0,0,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant


In [30]:
# visualize the resulting clusters

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[To_latitude, To_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_HCM_new['Latitude'], df_HCM_new['Longitude'], df_HCM_new['Name'], df_HCM_new['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters



In [31]:
df_HCM_new.loc[df_HCM_new['Cluster Labels'] == 0, df_HCM_new.columns[[1] + list(range(5, df_HCM_new.shape[1]))]]

Unnamed: 0,Area (km2),Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
1,49.79,106.74981,2.0,4.0,0,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
9,5.72,106.667954,5.0,23.0,0,Vietnamese Restaurant,Korean Restaurant,Thai Restaurant,Italian Restaurant,Asian Restaurant
13,20.78,106.709142,1.0,10.0,0,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant
15,4.88,106.680264,2.0,13.0,0,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant,Dim Sum Restaurant


In [32]:
df_HCM_new.loc[df_HCM_new['Cluster Labels'] == 1, df_HCM_new.columns[[1] + list(range(5, df_HCM_new.shape[1]))]]

Unnamed: 0,Area (km2),Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
3,4.18,106.701297,12.0,20.0,1,Seafood Restaurant,Vietnamese Restaurant,Mexican Restaurant,Fast Food Restaurant,Japanese Restaurant
10,5.14,106.650084,3.0,4.0,1,Vietnamese Restaurant,Asian Restaurant,Seafood Restaurant,Japanese Restaurant,Chinese Restaurant


In [33]:
df_HCM_new.loc[df_HCM_new['Cluster Labels'] == 2, df_HCM_new.columns[[1] + list(range(5, df_HCM_new.shape[1]))]]

Unnamed: 0,Area (km2),Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
11,52.74,106.641332,1.0,2.0,2,Restaurant,Vietnamese Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant


In [34]:
df_HCM_new.loc[df_HCM_new['Cluster Labels'] == 3, df_HCM_new.columns[[1] + list(range(5, df_HCM_new.shape[1]))]]

Unnamed: 0,Area (km2),Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,7.72,106.700424,34.0,100.0,3,Vietnamese Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,Korean Restaurant
2,4.92,106.684409,23.0,40.0,3,Vietnamese Restaurant,French Restaurant,Seafood Restaurant,Asian Restaurant,Korean Restaurant
4,4.27,106.663375,8.0,19.0,3,Chinese Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Asian Restaurant,Japanese Restaurant
6,35.69,106.721579,6.0,15.0,3,Vietnamese Restaurant,Sushi Restaurant,Scandinavian Restaurant,Japanese Restaurant,Italian Restaurant
16,22.43,106.652597,7.0,17.0,3,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Seafood Restaurant,Restaurant


In [35]:
df_HCM_new.loc[df_HCM_new['Cluster Labels'] == 4, df_HCM_new.columns[[1] + list(range(5, df_HCM_new.shape[1]))]]

Unnamed: 0,Area (km2),Longitude,Number of Restaurant,Number of Venue,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
14,19.73,106.66529,2.0,4.0,4,Vietnamese Restaurant,Fast Food Restaurant,Japanese Restaurant,Asian Restaurant,Chinese Restaurant


## 5. Results and Discussion <a name="results"></a>

Our analysis shows a strong correlation between Number of Restaurant and Number of Venue, so it seems that a restaurant should be opened where commercial activity is highest. Taking the number of venues as a proxy for that activity, the top 5 districts are 1, 3, 10, 4 and 5. Within these 5 districts, the ratio of restaurants to venues is lowest in District 10 (5/23 ≈ 0.22) and District 1 (34/100 = 0.34), compared with 0.42 to 0.60 in Districts 5, 3 and 4, so these 2 districts look like the places we are searching for.
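
The restaurant-to-venue ratio mentioned above can be checked with a few lines of plain Python. The counts are copied from the top-5 table in section 4.1; the variable names are illustrative.

```python
# (number of restaurants, number of venues) for the top 5 districts by venue count
counts = {
    "District 1": (34, 100),
    "District 3": (23, 40),
    "District 10": (5, 23),
    "District 4": (12, 20),
    "District 5": (8, 19),
}

# share of venues that are restaurants in each district
ratios = {name: restaurants / venues for name, (restaurants, venues) in counts.items()}

# districts ordered from the lowest to the highest ratio
ranked = sorted(ratios, key=ratios.get)

for name in ranked:
    print(f"{name}: {ratios[name]:.2f}")
```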

From the ranking of the most common restaurant types prepared for the k-means clustering, we found that in District 1 the Japanese restaurant is the 4th most common type. So a Japanese restaurant opened in District 1 would face quite a number of rivals. In District 10, however, the Japanese restaurant is not among the top 5 most common types, so District 10 is a likely place to open the Japanese restaurant.

## 6. Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Hochiminh City in order to aid stakeholders in narrowing down the search for an optimal location for a new Japanese restaurant. By calculating the distribution of restaurants from Foursquare data and examining its relation to the characteristics of each district, we have chosen District 10 as the starting point for the stakeholders' final decision. The exact location should be decided only after stakeholders consider other characteristics of the district, such as attractiveness, real estate availability, real estate prices, and traffic.