# Capstone Project

Challenge description:
The municipal government of Seoul is planning new development in underdeveloped areas of the city. It intends to build a new apartment community for newlywed couples and families with young children. I am a researcher at the municipal research center in charge of studying relatively underdeveloped areas of Seoul, and I am responsible for reporting my findings to the mayor, subject to the following conditions from the mayor.

- Conditions:
1. The new development plan is mostly about constructing new-generation apartments for residential purposes, especially for newlywed couples and families with young children. Thus, the businesses around the target area should be appropriate for such residents (in other words, there should not be many businesses such as pubs, nightclubs, karaoke venues, and motels nearby).
2. The districts close to the Han River or the central part of Seoul, such as Yongsan-gu, Gangnam-gu, Jung-gu, Jongro-gu, Yeongdeungpo-gu, Yangcheon-gu, Songpa-gu, Seongdong-gu, Seodaemun-gu, Dongdaemun-gu, Gwangjin-gu, Mapo-gu, Seocho-gu, Dongjak-gu, Gangdong-gu, and Guro-gu, are considered overdeveloped because city renewal plans have already been carried out in these areas over the last decades. These areas are already densely populated with small businesses and large financial and technology companies. As a local citizen of Seoul, I am well aware of this fact and would like to investigate areas other than these, preferably the ones on the outer edge of the city.
3. The mayor would like to avoid areas close to an airport, as they can be noisy for the new apartment community, as well as areas close to mountains, as they can result in a significant and unnecessary increase in construction and development costs.


<h2> Importing necessary tools and libraries </h2>

In [30]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Libraries imported.


# Importing data of Seoul

In [31]:
# Import the Seoul district data (organized into a CSV file; source: Wikipedia)
import pandas as pd
df = pd.read_csv('seoul_data.csv')
df = df.drop(["Hangul", "Hanja"], axis = 1)
df

Unnamed: 0,Code,Neighborhood,Latitude,Longitude
0,010NN ~ 012NN,Gangbuk District,37.6396,127.0257
1,013NN ~ 015NN,Dobong District,37.6688,127.0471
2,015NN ~ 019NN,Nowon District,37.6542,127.0568
3,020NN ~ 023NN,Jungnang District,37.6066,127.0927
4,024NN ~ 026NN,Dongdaemun District,37.5744,127.04
5,027NN ~ 029NN,Seongbuk District,37.5891,127.0182
6,030NN ~ 032NN,Jongno District,37.573,126.9794
7,033NN ~ 035NN,Eunpyeong District,37.6027,126.9291
8,036NN ~ 038NN,Seodaemun District,37.5791,126.9368
9,039NN ~ 042NN,Mapo District,37.5638,126.9084
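
Condition 2 can be applied to this table directly. The following is a minimal standard-library sketch, not part of the original analysis, that drops the districts considered overdeveloped from the ten rows displayed above. The names `OVERDEVELOPED`, `shown_districts`, and `candidates` are hypothetical, and the district spellings in the set were normalized by hand to the "District" form used in the CSV, which is an assumption about how the two spellings correspond.

```python
# Districts named in condition 2, rewritten by hand in the "<Name> District"
# form used by seoul_data.csv. The mapping between the two romanizations
# (for example Jongro -> Jongno) is an assumption made for this sketch.
OVERDEVELOPED = {
    "Yongsan District", "Gangnam District", "Jung District", "Jongno District",
    "Yeongdeungpo District", "Yangcheon District", "Songpa District",
    "Seongdong District", "Seodaemun District", "Dongdaemun District",
    "Gwangjin District", "Mapo District", "Seocho District",
    "Dongjak District", "Gangdong District", "Guro District",
}

# The ten rows shown in the table above, in their original order.
shown_districts = [
    "Gangbuk District", "Dobong District", "Nowon District",
    "Jungnang District", "Dongdaemun District", "Seongbuk District",
    "Jongno District", "Eunpyeong District", "Seodaemun District",
    "Mapo District",
]


def candidates(districts, excluded=OVERDEVELOPED):
    """Return the districts that are not on the exclusion list, keeping order."""
    return [d for d in districts if d not in excluded]


remaining = candidates(shown_districts)
print(remaining)
```

With a pandas dataframe the same filter could be written as `df[~df['Neighborhood'].isin(OVERDEVELOPED)]`; the list version is used here only to keep the sketch free of dependencies.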


In [32]:
#Getting the geographical coordinates of Seoul City
address = 'Seoul, Seoul'

geolocator = Nominatim(user_agent="seoul_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Seoul are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Seoul are 37.5551121, 126.9726473.


In [33]:
# create map of Seoul using latitude and longitude values
map_seoul = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, code, neighborhood in zip(df['Latitude'], df['Longitude'], df['Code'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_seoul)  
    
map_seoul

<h2> Analysis of districts </h2>

Let us investigate the areas that are considered the outskirts of Seoul. The western districts are not desirable because there is an airport nearby, which would cause a lot of noise for the new apartment community. The districts in the north are also unfavorable because Bukhan Mountain (Bukhansan) runs along them, which would make the cost of leveling the land and constructing new apartments skyrocket. Guri, the city just east of Seoul, is already heavily populated and has many local business facilities, so the eastern districts of Seoul are already quite developed and have little space left for new development. There is also an airport close to Seongnam, a city just southeast of Seoul. That leaves the southwestern districts of Seoul as the strongest candidates, so we will look at Geumcheon-gu and Gwanak-gu. Let's take a look at Gwanak-gu first.

In [34]:
# Simplifying the map for segmenting and clustering of the Gwanak-gu area
gwanak_data = df[df['Neighborhood'] == 'Gwanak District'].reset_index(drop=True)
gwanak_data.head()


Unnamed: 0,Code,Neighborhood,Latitude,Longitude
0,087NN ~ 089NN,Gwanak District,37.4784,126.9516


In [35]:
#Getting the geographical coordinates of Gwanak District
address = 'Gwanak-gu, Seoul'

geolocator = Nominatim(user_agent="gwanak_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Gwanak District are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Gwanak District are 37.4672582, 126.9482884.


In [36]:
#Visualizing neighborhoods of Gwanak District
# create map of Gwanak District using latitude and longitude values
map_gwanak = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(gwanak_data['Latitude'], gwanak_data['Longitude'], gwanak_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_gwanak)  
    
map_gwanak

Looking at the map, we find that Gwanak District is not an ideal candidate because there is a mountain in the area, Gwanak Mountain. Let's take a look at Geumcheon District instead.

In [37]:
#Simplifying the map for segmenting and clustering of Geumcheon-gu area
guemcheon_data = df[df['Neighborhood'] == 'Geumcheon District'].reset_index(drop=True)
guemcheon_data.head()

Unnamed: 0,Code,Neighborhood,Latitude,Longitude
0,085NN ~ 086NN,Geumcheon District,37.4519,126.902


In [38]:
#Getting the geographical coordinates of Geumcheon District
address = 'Geumcheon-gu, Seoul'

geolocator = Nominatim(user_agent="geumcheon_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Geumcheon District are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Geumcheon District are 37.4565, 126.8954.


In [39]:
#Visualizing neighborhoods of Geumcheon District
# create map of Geumcheon District using latitude and longitude values
map_guemcheon = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(guemcheon_data['Latitude'], guemcheon_data['Longitude'], guemcheon_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_guemcheon)  
    
map_guemcheon

Compared with Gwanak District, where Gwanak Mountain runs along the area, the map above shows no mountains in Geumcheon District. This makes Geumcheon District a more appropriate area for the city's new residential development plan.
Now we will use the Foursquare API to explore the neighborhood of Geumcheon District in detail. First, we declare the credentials for the Foursquare API.

In [8]:
CLIENT_ID = 'S1FQAUOOGLPF2G11YPPAP5H2EPXBZ2CVZ3XEKEN3MRJPNCA2' # your Foursquare ID
CLIENT_SECRET = '2Y1NE0TMAYGSNOCDMRGVRW5BXAOSWYHYR4CB2BY3FAEFQFMU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: S1FQAUOOGLPF2G11YPPAP5H2EPXBZ2CVZ3XEKEN3MRJPNCA2
CLIENT_SECRET:2Y1NE0TMAYGSNOCDMRGVRW5BXAOSWYHYR4CB2BY3FAEFQFMU


In [9]:
#Getting the name of the neighborhood
guemcheon_data.loc[0, 'Neighborhood']

'Geumcheon District'

In [10]:
#Getting the coordinates of Geumcheon District
neighborhood_latitude = guemcheon_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = guemcheon_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = guemcheon_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Geumcheon District are 37.4519, 126.902.


Let's get the top 100 venues in Geumcheon District within a radius of 2 km.

In [11]:
# Top 100 venues within 2km of radius for Geumcheon District
LIMIT = 100
radius = 2000
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url




'https://api.foursquare.com/v2/venues/explore?&client_id=S1FQAUOOGLPF2G11YPPAP5H2EPXBZ2CVZ3XEKEN3MRJPNCA2&client_secret=2Y1NE0TMAYGSNOCDMRGVRW5BXAOSWYHYR4CB2BY3FAEFQFMU&v=20180605&ll=37.4519,126.902&radius=2000&limit=100'
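
The request URL above is assembled with positional string formatting, which makes it easy to swap two arguments by accident. As an alternative, here is a minimal sketch that builds the same query string from named parameters using only the standard library. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones already used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build a venues/explore URL from named parameters (hypothetical helper)."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value instead of encoding it as %2C
    return base + "?" + urlencode(params, safe=",")


# Placeholder credentials, for illustration only
example_url = build_explore_url(
    "YOUR_ID", "YOUR_SECRET", "20180605", 37.4519, 126.902, 2000, 100
)
print(example_url)
```

When the `requests` library is used, the same dictionary can instead be passed as `requests.get(base, params=params)`, which performs the encoding itself.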

In [14]:
#Sending the GET request to see the results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d9d5f0d02a172001b3a49a9'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Geum-cheon-gu',
  'headerFullLocation': 'Geum-cheon-gu, Seoul',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 29,
  'suggestedBounds': {'ne': {'lat': 37.46990001800002,
    'lng': 126.92463159336815},
   'sw': {'lat': 37.433899981999986, 'lng': 126.87936840663185}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c0a1915ffb8c9b667076b61',
       'name': '강강술래',
       'location': {'address': '금천구 시흥대로 191',
        'crossStreet': '시흥점',
        'lat': 37.45087970385289,
        'lng': 126.90182367238292,
        'labeledLatLngs': [{'label': 'display',

In [15]:
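# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on the dataframe
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # a venue may have no category assigned
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']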

In [16]:
#Forming pandas dataframe for venues in Geumcheon District
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,강강술래,BBQ Joint,37.45088,126.901824
1,타요키즈카페,Recreation Center,37.449274,126.88954
2,Starbucks (스타벅스),Coffee Shop,37.448713,126.903119
3,동흥관,Chinese Restaurant,37.455069,126.898628
4,Fitness Center,Gym,37.46737,126.897865


In [17]:
#The number of venues returned by Foursquare
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

29 venues were returned by Foursquare.
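
As a sanity check on the 2 km search radius, the distance between the district coordinates and a returned venue can be estimated with the haversine formula. The sketch below is an illustration under a spherical-Earth assumption, not part of the original notebook. The function name `haversine_m` and the constant `EARTH_RADIUS_M` are hypothetical, and the venue coordinates are taken from the second row of the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


center = (37.4519, 126.902)      # Geumcheon District coordinates from the CSV
venue = (37.449274, 126.88954)   # second venue in the table above

distance = haversine_m(center[0], center[1], venue[0], venue[1])
print(round(distance), distance <= 2000)
```

The venue lies a little over one kilometre from the district coordinates, which is consistent with the radius passed in the request.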


<H2> Exploring Neighborhoods in Geumcheon District </H2>

In [18]:
#The function to explore neighborhoods in Geumcheon District
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [19]:
#Run the above function to explore neighborhoods in Geumcheon District in detail
guemcheon_venues = getNearbyVenues(names=guemcheon_data['Neighborhood'],
                                   latitudes=guemcheon_data['Latitude'],
                                   longitudes=guemcheon_data['Longitude']
                                  )

Geumcheon District


In [20]:
print(guemcheon_venues.shape)
guemcheon_venues.head()

(29, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Geumcheon District,37.4519,126.902,강강술래,37.45088,126.901824,BBQ Joint
1,Geumcheon District,37.4519,126.902,타요키즈카페,37.449274,126.88954,Recreation Center
2,Geumcheon District,37.4519,126.902,Starbucks (스타벅스),37.448713,126.903119,Coffee Shop
3,Geumcheon District,37.4519,126.902,동흥관,37.455069,126.898628,Chinese Restaurant
4,Geumcheon District,37.4519,126.902,Fitness Center,37.46737,126.897865,Gym


In [21]:
#Check how many venues are returned for Geumcheon District
guemcheon_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Geumcheon District,29,29,29,29,29,29


In [22]:
#Find how many unique categories of businesses are identified
print('There are {} unique categories.'.format(len(guemcheon_venues['Venue Category'].unique())))

There are 18 unique categories.


<H2> Analyzing Each Neighborhood </H2>

In [23]:
# one hot encoding
guemcheon_onehot = pd.get_dummies(guemcheon_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
guemcheon_onehot['Neighborhood'] = guemcheon_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [guemcheon_onehot.columns[-1]] + list(guemcheon_onehot.columns[:-1])
guemcheon_onehot = guemcheon_onehot[fixed_columns]

guemcheon_onehot.head()

Unnamed: 0,Neighborhood,BBQ Joint,Bakery,Big Box Store,Buffet,Bus Stop,Café,Chinese Restaurant,Coffee Shop,Donut Shop,Fast Food Restaurant,Gym,Hotel,Ice Cream Shop,Metro Station,Multiplex,Recreation Center,Supermarket,Trail
0,Geumcheon District,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Geumcheon District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Geumcheon District,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Geumcheon District,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,Geumcheon District,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [24]:
#Examine the new dataframe size
guemcheon_onehot.shape

(29, 19)

In [25]:
#Find the mean of the frequency of occurrence of each business category
guemcheon_grouped = guemcheon_onehot.groupby('Neighborhood').mean().reset_index()
guemcheon_grouped

Unnamed: 0,Neighborhood,BBQ Joint,Bakery,Big Box Store,Buffet,Bus Stop,Café,Chinese Restaurant,Coffee Shop,Donut Shop,Fast Food Restaurant,Gym,Hotel,Ice Cream Shop,Metro Station,Multiplex,Recreation Center,Supermarket,Trail
0,Geumcheon District,0.034483,0.206897,0.034483,0.034483,0.034483,0.034483,0.034483,0.103448,0.068966,0.103448,0.034483,0.034483,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483


In [26]:
#Confirm the new size
guemcheon_grouped.shape

(1, 19)
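
Taking the mean of a one-hot column is the same as dividing the number of venues in that category by the total number of venues. The following standard-library sketch illustrates that arithmetic. The per-category counts are not original output: they were back-computed from the grouped frequencies above (for example, 0.206897 × 29 ≈ 6 for Bakery), so treat them as an inference, and the variable names are hypothetical.

```python
from collections import Counter

# Venue counts per category, back-computed from the grouped frequencies
# (frequency * 29 venues). These counts are inferred, not original output.
category_counts = Counter({
    "BBQ Joint": 1, "Bakery": 6, "Big Box Store": 1, "Buffet": 1,
    "Bus Stop": 1, "Café": 1, "Chinese Restaurant": 1, "Coffee Shop": 3,
    "Donut Shop": 2, "Fast Food Restaurant": 3, "Gym": 1, "Hotel": 1,
    "Ice Cream Shop": 1, "Metro Station": 1, "Multiplex": 1,
    "Recreation Center": 1, "Supermarket": 2, "Trail": 1,
})

total = sum(category_counts.values())

# Equivalent of groupby('Neighborhood').mean() for a single neighborhood
frequencies = {cat: n / total for cat, n in category_counts.items()}

print(total)
print(round(frequencies["Bakery"], 6))
print(round(frequencies["Coffee Shop"], 6))
```

Because most categories occur exactly once, their frequencies are tied at 1/29, so the order in which tied categories appear in a ranking depends only on how the sort breaks ties.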

In [27]:
#Print each neighborhood with top 5 most common venues
num_top_venues = 5

for hood in guemcheon_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = guemcheon_grouped[guemcheon_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Geumcheon District----
                  venue  freq
0                Bakery  0.21
1  Fast Food Restaurant  0.10
2           Coffee Shop  0.10
3           Supermarket  0.07
4            Donut Shop  0.07




In [28]:
#Putting it into pandas dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [29]:
#New dataframe with top 10 venues for Geumcheon District
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = guemcheon_grouped['Neighborhood']

for ind in np.arange(guemcheon_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(guemcheon_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Geumcheon District,Bakery,Coffee Shop,Fast Food Restaurant,Donut Shop,Supermarket,Big Box Store,Buffet,Bus Stop,Café,Chinese Restaurant
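
The column labels above are built by indexing a three-element suffix list and falling back to "th" when the index is out of range. That works for the top 10, but a rank such as 21 would be labeled "21th". A small helper avoids the issue; the sketch below is one possible way to write it, and the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the other teens always take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Rebuild the same column labels as the cell above for the top 10 venues
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```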


<H3> Conclusion & Proposal </H3>

The examination of the businesses in Geumcheon District shows that the food industry is the dominant type of business. Four of the top five venue categories in this area are bakeries, fast food restaurants, coffee shops, and donut shops. A bus stop and a metro station also appear among the returned venues, which suggests that the area has access to public transportation and makes it a better-qualified candidate for a new apartment community. A trail is listed among the venues as well. The mayor would also be happy that pubs, karaoke venues, and motels do not appear among the 18 venue categories returned for this area. Last but not least, the area is not surrounded by mountains, airports, or other facilities that may cause discomfort for local residents. Thus, I recommend Geumcheon District as the number one candidate area for the new residential community development plan.
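
The check against the mayor's first condition can also be made explicit. The sketch below compares the venue categories from the one-hot table with a set of unwanted business types. The labels in `UNWANTED` are assumed category names chosen for this illustration, so they may need adjusting to whatever labels the API actually returns, and all variable names are hypothetical.

```python
# The 18 venue categories that appear in the one-hot table for Geumcheon District
observed = [
    "BBQ Joint", "Bakery", "Big Box Store", "Buffet", "Bus Stop", "Café",
    "Chinese Restaurant", "Coffee Shop", "Donut Shop", "Fast Food Restaurant",
    "Gym", "Hotel", "Ice Cream Shop", "Metro Station", "Multiplex",
    "Recreation Center", "Supermarket", "Trail",
]

# Assumed labels for the businesses ruled out by condition 1
UNWANTED = {"Pub", "Nightclub", "Karaoke Bar", "Motel"}

# Categories that are both observed and unwanted
flagged = sorted(set(observed) & UNWANTED)
print(flagged)
```

An empty result only means that none of the assumed labels occur among the 29 venues returned for a single 2 km query, so it supports the recommendation without proving that no such businesses exist in the district.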
