# Capstone Project - The Battle of Neighborhoods

# Introduction

## Business Problem

This analysis is aimed at people who want to open a new catering business in Hong Kong but have limited funds and little knowledge of the industry.<br>
It uses Foursquare location data to support their decision.


Hong Kong is crowded with shops, buildings and people. Although it is densely populated, a business is hard to sustain because rent is a major cost.  <br>
Location affects the number of customers, so choosing the right one is vital to a store's profitability and, in turn, to its chances of survival. <br>
So, suppose you want to open a new restaurant in Hong Kong: which neighbourhood should you choose?

In this analysis, we use coffee shops as an example, as they are one of the most popular types of restaurant in Hong Kong. <br>
We will explore the catering industry in different neighbourhoods of Hong Kong.

Here are several questions to answer: <br>
*Which locations have the most/least coffee shops?*<br>
*Is there a relationship between the number of nearby facilities and the number of coffee shops?* <br>
*Which locations are similar to one another in terms of the distribution of catering businesses?* 



# Data

To solve this problem, we need the following data and tools. 
1. The names of the neighborhoods in Hong Kong <br>
The neighborhood names are retrieved from Midland Realty, a well-known real estate company (https://www.midland.com.hk/).
2. The coordinates of the neighborhoods in Hong Kong <br>
We need the latitude and longitude to plot the locations for visualization. <br>
They are also used to explore each neighborhood with the Foursquare API. <br>
We will use GeoPy's Nominatim to get the latitude and longitude of each neighborhood.
3. The Foursquare API <br>
We will use the Foursquare API to retrieve the venues, grouped by category, for each neighborhood.
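
As a rough illustration of the third item, the sketch below assembles a request URL for Foursquare's legacy v2 `venues/explore` endpoint. The helper name `build_explore_url`, the placeholder credentials, the radius, the limit and the version date are all assumptions made for this example, not values taken from this analysis.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      radius=500, limit=100, version="20180605"):
    """Return a venues/explore URL for one coordinate pair (hypothetical helper)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # "latitude,longitude"
        "radius": radius,                # search radius in metres (assumed value)
        "limit": limit,                  # maximum number of venues (assumed value)
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Coordinates of Tsim Sha Tsui as found later in this notebook
url = build_explore_url(22.298872, 114.1741181, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The URL is only built here, not requested, so the example runs without credentials or network access.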

# Methodology

This analysis has 4 main sections.
1. Data collection  <br>
Scrape the neighborhood names from the website, clean the data, and prepare the coordinate information   
2. Data exploration  <br>
Visualise the neighborhood locations, search for venues, and understand the data with descriptive statistics 
3. Data preparation   <br>
Prepare the data for the clustering algorithm
4. Modelling  <br>
Use a clustering algorithm to group the neighborhoods into clusters
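
The outline above does not spell out what the data preparation step involves. One common approach, assumed here purely for illustration, is to turn each neighborhood's venue categories into a vector of category shares that a clustering algorithm can compare. The venue pairs and the helper `category_frequencies` below are made up for this sketch.

```python
from collections import Counter

# Hypothetical (neighborhood, venue category) pairs; not real Foursquare results.
venues = [
    ("Mongkok", "Coffee Shop"),
    ("Mongkok", "Noodle House"),
    ("Mongkok", "Coffee Shop"),
    ("Central District", "Coffee Shop"),
    ("Central District", "Bar"),
]

def category_frequencies(pairs):
    """Map each neighborhood to {category: share of its venues}."""
    counts = {}
    for neighborhood, category in pairs:
        counts.setdefault(neighborhood, Counter())[category] += 1
    # Use the same sorted category list for every neighborhood so the
    # resulting vectors are directly comparable.
    categories = sorted({category for _, category in pairs})
    return {
        hood: {cat: counter[cat] / sum(counter.values()) for cat in categories}
        for hood, counter in counts.items()
    }

features = category_frequencies(venues)
print(features["Mongkok"])
```

Using shares rather than raw counts keeps a neighborhood with many venues from dominating the comparison.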

## Data collection

#### Scraping the community names from Midland Realty
<a id='5'></a>

#### Import BeautifulSoup and scrape the text

In [179]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://en.midland.com.hk/estate/Kowloon-20"
html = requests.get(url).text
soup = bs(html, 'html.parser')

#### Find the text after parsing

In [180]:
ta = soup.find('nav', {'id': 'districts'})
Children = ta.findChildren("a" , recursive=False)
subdistricts = []
for child in Children:
    subdistricts.append(child.text)
subdistricts

['Kowloon (1049)',
 'Tsim Sha Tsui                                      (87)',
 'Kowloon Station                                    (8)',
 'Yau Ma Tei                                         (10)',
 'Kingspark                                          (12)',
 'Mongkok                                            (105)',
 'Tai Kok Tsui                                       (36)',
 'Olympic                                            (11)',
 'Lai Chi Kok                                        (8)',
 'Mei Foo                                            (11)',
 'Cheung Sha Wan / Sham Shui Po                      (189)',
 'Yau Yat Chuen                                      (35)',
 'Kowloon Tong / Beacon Hill                         (72)',
 'Ho Man Tin                                         (130)',
 'Hung Hom                                           (55)',
 'To Kwa Wan                                         (68)',
 'Kai Tak                                            (7)',
 'Kowloon City       

#### Convert the list into data frame and display the first few rows

In [181]:
Kowloon_Location = pd.DataFrame({'District': subdistricts[0],'neighborhood': subdistricts[1:]})
Kowloon_Location.head()

| | District | neighborhood |
|---|---|---|
| 0 | Kowloon (1049) | Tsim Sha Tsui ... |
| 1 | Kowloon (1049) | Kowloon Station ... |
| 2 | Kowloon (1049) | Yau Ma Tei ... |
| 3 | Kowloon (1049) | Kingspark ... |
| 4 | Kowloon (1049) | Mongkok ... |


#### Repeat the above for Hong Kong Island and concatenate the dataframes

In [182]:
url = "https://en.midland.com.hk/estate/Hong-Kong-10"
html = requests.get(url).text
soup2 = bs(html, 'html.parser')
ta = soup2.find('nav', {'id': 'districts'})
Children = ta.findChildren("a" , recursive=False)
subdistricts = []
for child in Children:
    subdistricts.append(child.text)
# The first link is the district itself; the rest are its neighborhoods
HK_Island_Location = pd.DataFrame({'District': subdistricts[0],'neighborhood': subdistricts[1:]})
print(HK_Island_Location.head())

# Stack the Kowloon and Hong Kong Island rows into one dataframe
HK_Location = pd.concat([Kowloon_Location, HK_Island_Location],ignore_index=True)

           District                                       neighborhood
0  Hong Kong (1397)  Chai Wan                                      ...
1  Hong Kong (1397)  Heng Fa Chuen (Chai Wan)                      ...
2  Hong Kong (1397)  Shau Kei Wan                                  ...
3  Hong Kong (1397)  Sai Wan Ho / Taikoo                           ...
4  Hong Kong (1397)  Quarry Bay                                    ...
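
Since the same parsing steps run once per district page, they could be wrapped in a small function. The sketch below is one way to do that, not the notebook's own code: `extract_subdistricts` and the sample HTML are made up for illustration, and the sample only mimics the `<nav id="districts">` structure that the cells above rely on.

```python
from bs4 import BeautifulSoup

def extract_subdistricts(html):
    """Return the text of each direct <a> child of <nav id="districts">."""
    soup = BeautifulSoup(html, "html.parser")
    nav = soup.find("nav", {"id": "districts"})
    if nav is None:
        # The page layout may have changed; fail with a clear message.
        raise ValueError("no <nav id='districts'> element found")
    return [a.get_text() for a in nav.find_all("a", recursive=False)]

# Tiny made-up snippet standing in for a downloaded page.
sample_html = """
<nav id="districts">
  <a>Kowloon (1049)</a>
  <a>Tsim Sha Tsui   (87)</a>
  <span><a>nested link, skipped</a></span>
</nav>
"""
labels = extract_subdistricts(sample_html)
print(labels)
```

The explicit `None` check gives a readable error if the page no longer contains the expected element, instead of an `AttributeError` on the next line.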


#### Remove the unwanted brackets and numbers

In [183]:
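import re
def removedigit(x):
    # Keep the span from the first to the last letter or space, which
    # drops the trailing "(count)", then trim the leftover padding
    x = re.findall("([a-zA-Z ]+.+[a-zA-Z ]+)", x)[0].strip()
    return x

HK_Location = HK_Location.applymap(removedigit)
HK_Location.head()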

| | District | neighborhood |
|---|---|---|
| 0 | Kowloon | Tsim Sha Tsui |
| 1 | Kowloon | Kowloon Station |
| 2 | Kowloon | Yau Ma Tei |
| 3 | Kowloon | Kingspark |
| 4 | Kowloon | Mongkok |
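
The pattern in `removedigit` depends on regex backtracking, which can be hard to follow. A possibly simpler alternative, shown here as a sketch and not as the notebook's method, removes only a trailing "(number)" suffix. The name `strip_count` is hypothetical, and the count in the third sample label is made up because the scraped output above is truncated.

```python
import re

# Optional whitespace, a parenthesised number, and trailing whitespace
# at the end of the label, e.g. "   (87)".
COUNT_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")

def strip_count(label):
    """Remove a trailing "(number)" listing count and surrounding spaces."""
    return COUNT_SUFFIX.sub("", label).strip()

samples = [
    "Kowloon (1049)",
    "Tsim Sha Tsui                 (87)",
    "Heng Fa Chuen (Chai Wan)      (12)",  # count invented for the example
]
for raw in samples:
    print(repr(strip_count(raw)))
```

Because the pattern only matches digits inside the final parentheses, a name such as "Heng Fa Chuen (Chai Wan)" keeps its own bracketed part.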


#### Some additional adjustments...
To avoid confusion and make sure the search returns the correct coordinates:
1. Several neighborhoods in Hong Kong have similar names. For the search, I will append the district name to the neighborhood name.
2. Some names can be confusing, such as Kowloon Station, which is actually a metro station. I will rename or drop them.
3. Some rows use "/" to separate two or more neighborhoods that are in fact distinct. I will split on the "/" and create an additional row for each. 

##### Split and create additional rows

In [184]:
HK_Location2 = pd.concat([pd.Series(row['District'], row['neighborhood'].split('/'))              
                    for _, row in HK_Location.iterrows()]).reset_index().rename(index=str, columns={"index": "neighborhood", 0: "District"})

HK_Location2[HK_Location2.columns] = HK_Location2.apply(lambda x: x.str.strip())
HK_Location2

| | neighborhood | District |
|---|---|---|
| 0 | Tsim Sha Tsui | Kowloon |
| 1 | Kowloon Station | Kowloon |
| 2 | Yau Ma Tei | Kowloon |
| 3 | Kingspark | Kowloon |
| 4 | Mongkok | Kowloon |
| 5 | Tai Kok Tsui | Kowloon |
| 6 | Olympic | Kowloon |
| 7 | Lai Chi Kok | Kowloon |
| 8 | Mei Foo | Kowloon |
| 9 | Cheung Sha Wan | Kowloon |
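
For comparison, the same split can be written with `DataFrame.explode`, available in pandas 0.25 and later. This is an alternative sketch on a two-row sample shaped like `HK_Location`, not the code used in this notebook; `sample` and `split_rows` are names invented for the example.

```python
import pandas as pd

# Two sample rows in the same shape as HK_Location.
sample = pd.DataFrame({
    "District": ["Kowloon", "Kowloon"],
    "neighborhood": ["Cheung Sha Wan / Sham Shui Po", "Mongkok"],
})

split_rows = (
    sample.assign(neighborhood=sample["neighborhood"].str.split("/"))
          .explode("neighborhood")   # one row per element of each list
          .reset_index(drop=True)
)
# Remove the spaces left around each name by the "/" split.
split_rows["neighborhood"] = split_rows["neighborhood"].str.strip()
print(split_rows)
```

This version keeps the original column order and copies the district onto each new row.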


##### Rename neighborhoods / drop rows

In [185]:
HK_Location2.loc[HK_Location2.neighborhood == 'Kingspark', 'neighborhood'] = 'Kings park'
HK_Location2.loc[HK_Location2.neighborhood == 'Tseung Kwan O Station', 'neighborhood'] = 'Tseung Kwan O'
HK_Location2.loc[HK_Location2.neighborhood == 'Yau Tong', 'neighborhood'] = 'Yau Tong Station'
HK_Location2.loc[HK_Location2.neighborhood == 'Heng Fa Chuen (Chai Wan)', 'neighborhood'] = 'Heng Fa Chuen'
HK_Location2.loc[HK_Location2.neighborhood == 'Central Mid-Levels', 'neighborhood'] = 'Mid-Levels'
HK_Location2.loc[HK_Location2.neighborhood == 'Central', 'neighborhood'] = 'Central District'

# Dropping ("Mid-Levels" includes the rows renamed from "Central Mid-Levels" above)
to_drop = ["Tiu Keng Leng", "Lohas Park", "Hang Hau", "Kowloon Station",
           "Mid-Levels", "North Point Midlevel", "Mid Level East", "North",
           "Hong Kong West", "Western Mid-Levels", "Tai Hang", "Stanley",
           "Shek O", "Tai Tam"]
HK_Location2.drop(HK_Location2[HK_Location2.neighborhood.isin(to_drop)].index, inplace=True)


HK_Location2.reset_index(drop=True, inplace=True)
HK_Location2

| | neighborhood | District |
|---|---|---|
| 0 | Tsim Sha Tsui | Kowloon |
| 1 | Yau Ma Tei | Kowloon |
| 2 | Kings park | Kowloon |
| 3 | Mongkok | Kowloon |
| 4 | Tai Kok Tsui | Kowloon |
| 5 | Olympic | Kowloon |
| 6 | Lai Chi Kok | Kowloon |
| 7 | Mei Foo | Kowloon |
| 8 | Cheung Sha Wan | Kowloon |
| 9 | Sham Shui Po | Kowloon |


The location names are ready.
Now it's time to use Nominatim to look up the coordinates of each neighborhood.

#### Import the necessary libraries 

In [186]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
#import matplotlib.cm as cm
#import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


#### Create a new dataframe to contain the coordinate information

In [187]:
Location = pd.DataFrame(columns=['Neighborhood','District','Latitude', 'Longitude','searching'])
Location

| | Neighborhood | District | Latitude | Longitude | searching |
|---|---|---|---|---|---|


In [188]:
def Nominatimcoordinate(address,neighborhood,district):
    # geocode() returns None when nothing is found, so the attribute
    # access below raises and the caller can retry with another address
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of' , address,  'are {}, {}.'.format(latitude, longitude))
    Location.loc[len(Location)] = [neighborhood,district,latitude,longitude,address]

geolocator = Nominatim(user_agent="ibm")
for index, row in HK_Location2.iterrows():
    neighborhood = row['neighborhood']
    district = row['District']
    address = row['neighborhood'] + " " + row['District']
    try:
        Nominatimcoordinate(address,neighborhood,district)
    except Exception:
        # Fall back to searching with "Hong Kong" instead of the district
        address = row['neighborhood'] + " " + "Hong Kong"
        Nominatimcoordinate(address,neighborhood,district)    

The geographical coordinates of Tsim Sha Tsui Kowloon are 22.298872, 114.1741181.
The geographical coordinates of Yau Ma Tei Kowloon are 22.3069384, 114.1644631.
The geographical coordinates of Kings park Kowloon are 22.3105718, 114.1742986.
The geographical coordinates of Mongkok Kowloon are 22.3196852, 114.1683968.
The geographical coordinates of Tai Kok Tsui Kowloon are 22.3210684, 114.1612096.
The geographical coordinates of Olympic Kowloon are 22.317798, 114.160234013365.
The geographical coordinates of Lai Chi Kok Kowloon are 22.3294409, 114.146655017682.
The geographical coordinates of Mei Foo Hong Kong are 22.33880415, 114.136548836551.
The geographical coordinates of Cheung Sha Wan Kowloon are 22.336246, 114.1517566.
The geographical coordinates of Sham Shui Po Kowloon are 22.3300955, 114.1609403.
The geographical coordinates of Yau Yat Chuen Kowloon are 22.3310303, 114.1744603.
The geographical coordinates of Kowloon Tong Kowloon are 22.331632, 114.1800196.
The geographical coordinates of Beacon Hill 

In [189]:
Location.head(10)

| | Neighborhood | District | Latitude | Longitude | searching |
|---|---|---|---|---|---|
| 0 | Tsim Sha Tsui | Kowloon | 22.298872 | 114.174118 | Tsim Sha Tsui Kowloon |
| 1 | Yau Ma Tei | Kowloon | 22.306938 | 114.164463 | Yau Ma Tei Kowloon |
| 2 | Kings park | Kowloon | 22.310572 | 114.174299 | Kings park Kowloon |
| 3 | Mongkok | Kowloon | 22.319685 | 114.168397 | Mongkok Kowloon |
| 4 | Tai Kok Tsui | Kowloon | 22.321068 | 114.16121 | Tai Kok Tsui Kowloon |
| 5 | Olympic | Kowloon | 22.317798 | 114.160234 | Olympic Kowloon |
| 6 | Lai Chi Kok | Kowloon | 22.329441 | 114.146655 | Lai Chi Kok Kowloon |
| 7 | Mei Foo | Kowloon | 22.338804 | 114.136549 | Mei Foo Hong Kong |
| 8 | Cheung Sha Wan | Kowloon | 22.336246 | 114.151757 | Cheung Sha Wan Kowloon |
| 9 | Sham Shui Po | Kowloon | 22.330095 | 114.16094 | Sham Shui Po Kowloon |
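
The lookup above retries with "Hong Kong" when the district query fails, which is why the Mei Foo row shows a different search string. The sketch below expresses that fallback without relying on an exception. `geocode_with_fallback`, `FakeLocation` and `fake_geocode` are hypothetical names, and the stand-in geocoder exists only so the example runs without network access.

```python
def geocode_with_fallback(geocode, neighborhood, district, fallback="Hong Kong"):
    """Try "<neighborhood> <district>", then "<neighborhood> <fallback>".

    `geocode` is any callable that returns an object with .latitude and
    .longitude, or None when nothing is found (as geopy geocoders do).
    """
    for query in (f"{neighborhood} {district}", f"{neighborhood} {fallback}"):
        location = geocode(query)
        if location is not None:
            return query, location.latitude, location.longitude
    return None  # both queries failed; the caller decides what to do

# Stand-in geocoder for demonstration only; it knows a single address.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude, self.longitude = latitude, longitude

def fake_geocode(query):
    known = {"Mei Foo Hong Kong": FakeLocation(22.33880415, 114.136548836551)}
    return known.get(query)

result = geocode_with_fallback(fake_geocode, "Mei Foo", "Kowloon")
print(result)
```

With geopy, `geolocator.geocode` could be passed in place of `fake_geocode`. If request pacing becomes a concern, geopy's `RateLimiter` from `geopy.extra.rate_limiter` can wrap the geocode call to add a delay between requests.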
