
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Bogotá D.C.</font></h1>

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install folium

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes  # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

Data taken from the Bogotá open data portal: https://datosabiertos.bogota.gov.co

The dataset is published there as a GeoJSON file, so we can fetch it directly with a `wget` command. Let's go ahead and do that.

In [5]:
!wget -q -O 'bogota_data.json' https://datosabiertos.bogota.gov.co/dataset/b0c66a77-3230-4d0c-a119-dead7f9b8b8e/resource/9c3829e3-6b4b-4aac-a3e5-297fe0127b67/download/egba.geojson
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

Next, let's load the data.

In [6]:
with open('bogota_data.json') as json_data:
    bogota_data = json.load(json_data)
#bogota_data

All the relevant data is in the *features* key, which is a list of records, one per establishment, each tagged with the neighborhood it belongs to. So let's define a new variable that holds this list.

In [7]:
neighborhoods_data = bogota_data['features']
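
Before flattening the data, it can help to confirm the structure described above. The following is a minimal sketch that runs on a small hand-written sample rather than the downloaded file; `summarize_geojson` and `sample` are hypothetical names that do not appear in the original notebook, and the top-level `FeatureCollection` type is assumed from the GeoJSON format.

```python
from collections import Counter

def summarize_geojson(collection):
    """Return the feature count, geometry types, and property keys of a GeoJSON dict."""
    features = collection.get('features', [])
    geometry_types = Counter(f['geometry']['type'] for f in features)
    property_keys = sorted({key for f in features for key in f['properties']})
    return {
        'n_features': len(features),
        'geometry_types': dict(geometry_types),
        'property_keys': property_keys,
    }

# Hypothetical sample shaped like a record from the file (assumption, not real output)
sample = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature',
         'properties': {'NOMBRE_EST': 'CHIBCHOMBIA', 'SECTOR_CAT': 'LA MACARENA'},
         'geometry': {'type': 'Point', 'coordinates': [-74.066259, 4.613799]}},
    ],
}

summary = summarize_geojson(sample)
print(summary)
```

With the real file, `summarize_geojson(bogota_data)` should report the same kind of summary for the full list of features.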

Let's take a look at the first item in this list.

In [8]:
neighborhoods_data[0]

{'type': 'Feature',
 'properties': {'SUBCATEGOR': 'L',
  'NOMBRE_EST': 'CHIBCHOMBIA',
  'DIRECCION': 'CL 27 # 4 - 49 P 1',
  'LOC': '03',
  'SECTOR_CAT': 'LA MACARENA',
  'LATITUD': 4.613799,
  'LONGITUD': -74.066259},
 'geometry': {'type': 'Point',
  'coordinates': [-74.06625899961625, 4.613799000073394]}}

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [9]:
# define the dataframe columns
column_names = ['Borough','Name','Address','Subcategory', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [10]:
neighborhoods.head()

| | Borough | Name | Address | Subcategory | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|---|


Then let's loop through the data, collect one row per feature, and build the dataframe from those rows.

In [11]:
rows = []
for data in neighborhoods_data:
    properties = data['properties']
    rows.append({'Borough': properties['LOC'],
                 'Name': properties['NOMBRE_EST'],
                 'Address': properties['DIRECCION'],
                 'Subcategory': properties['SUBCATEGOR'],
                 'Neighborhood': properties['SECTOR_CAT'],
                 'Latitude': properties['LATITUD'],
                 'Longitude': properties['LONGITUD']})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)
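
The same flattening can also be expressed without an explicit loop. This is a sketch, not the notebook's method: it uses `pd.json_normalize` on a hand-written two-record sample, and `features`, `rename_map`, and `flat` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical two-record sample with the same property keys as the real file
features = [
    {'type': 'Feature',
     'properties': {'SUBCATEGOR': 'L', 'NOMBRE_EST': 'CHIBCHOMBIA',
                    'DIRECCION': 'CL 27 # 4 - 49 P 1', 'LOC': '03',
                    'SECTOR_CAT': 'LA MACARENA',
                    'LATITUD': 4.613799, 'LONGITUD': -74.066259},
     'geometry': {'type': 'Point', 'coordinates': [-74.066259, 4.613799]}},
    {'type': 'Feature',
     'properties': {'SUBCATEGOR': 'L', 'NOMBRE_EST': 'BOGOTA BEER COMPANY S A',
                    'DIRECCION': 'KR 4A # 27 - 3', 'LOC': '03',
                    'SECTOR_CAT': 'LA MACARENA',
                    'LATITUD': 4.614092, 'LONGITUD': -74.06589},
     'geometry': {'type': 'Point', 'coordinates': [-74.06589, 4.614092]}},
]

# Map the flattened 'properties.*' columns to the notebook's column names
rename_map = {
    'properties.LOC': 'Borough',
    'properties.NOMBRE_EST': 'Name',
    'properties.DIRECCION': 'Address',
    'properties.SUBCATEGOR': 'Subcategory',
    'properties.SECTOR_CAT': 'Neighborhood',
    'properties.LATITUD': 'Latitude',
    'properties.LONGITUD': 'Longitude',
}

flat = pd.json_normalize(features)                         # nested keys become dotted columns
flat = flat[list(rename_map)].rename(columns=rename_map)   # keep and rename the needed columns
print(flat)
```

Building the dataframe in one call keeps the numeric columns as floats, which matters later when coordinates are averaged per neighborhood.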

Quickly examine the resulting dataframe.

In [12]:
neighborhoods.head(100)

| | Borough | Name | Address | Subcategory | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 3 | CHIBCHOMBIA | CL 27 # 4 - 49 P 1 | L | LA MACARENA | 4.613799 | -74.066259 |
| 1 | 3 | RESTAURANTE ROMULO Y REMO MACARENA | KR 4A # 26D - 90 | L | LA MACARENA | 4.613799 | -74.066259 |
| 2 | 3 | BOGOTA BEER COMPANY S A | KR 4A # 27 - 3 | L | LA MACARENA | 4.614092 | -74.06589 |
| 3 | 3 | PRESTO BTA CALLE 27 | KR 7 # 27 - 38 LC 1 ED COLISSEUM | L | SAN DIEGO | 4.614158 | -74.069222 |
| 4 | 3 | LA HAMBURGUESERIA DE LA MACARENA | KR 4A # 27 - 27 | L | LA MACARENA | 4.614193 | -74.06644 |
| 5 | 3 | MC DONALD S CENTRO INTERNACIONAL | KR 10 # 26 - 55 | L | SAN DIEGO | 4.614194 | -74.070309 |
| 6 | 3 | BOCA COCINA LATINA | KR 10 # 27 - 51 LC 174 ED Residencias Tequenda... | L | SAN DIEGO | 4.614208 | -74.070278 |
| 7 | 3 | GAUDI RESTAURANTE ESPAÑOL | KR 4A # 27 - 52 | K | LA MACARENA | 4.614401 | -74.066068 |
| 8 | 3 | REPUBLIK MUSEO | TV 6 # 27 - 50 LC 2 | L | SAN DIEGO | 4.614652 | -74.068623 |
| 9 | 3 | ARCHIES CENTRO INTERNACIONAL | KR 7 # 27 - 80 | J | SAN DIEGO | 4.614853 | -74.069126 |


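Next, check how many boroughs and rows the dataframe contains. Note that the second number printed is the row count (`neighborhoods.shape[0]`), not the number of distinct neighborhoods.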

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 15 boroughs and 515 neighborhoods.
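
Since the count above is based on rows, the number of distinct neighborhoods has to be computed separately. A minimal sketch on a hypothetical four-row miniature of the `neighborhoods` dataframe (`df` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical miniature of the `neighborhoods` dataframe
df = pd.DataFrame({
    'Borough': ['03', '03', '03', '03'],
    'Neighborhood': ['LA MACARENA', 'LA MACARENA', 'SAN DIEGO', 'SAN DIEGO'],
    'Name': ['A', 'B', 'C', 'D'],
})

n_rows = len(df)                                  # one row per record
n_boroughs = df['Borough'].nunique()              # distinct borough codes
n_neighborhoods = df['Neighborhood'].nunique()    # distinct neighborhood names

print('{} rows, {} boroughs, {} neighborhoods'.format(n_rows, n_boroughs, n_neighborhoods))
```

Applied to the real dataframe, `neighborhoods['Neighborhood'].nunique()` would give the neighborhood count that the printed message implies.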


#### Use the geopy library to get the latitude and longitude values of Bogotá D.C.

To create an instance of the geocoder, we need to supply a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [14]:
address = 'Bogota, Colombia'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Bogotá D.C. are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Bogotá D.C. are 4.59808, -74.0760439.
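
The geocoder's `geocode` method returns `None` when it finds no match, in which case reading `location.latitude` would fail. The sketch below guards against that; `geocode_or_default`, `BOGOTA_FALLBACK`, and the two stand-in functions are hypothetical names, and the stand-ins replace the real network call so the example runs offline.

```python
from types import SimpleNamespace

# Fallback taken from the coordinates printed by the cell above
BOGOTA_FALLBACK = (4.59808, -74.0760439)

def geocode_or_default(geocode, address, default):
    """Call geocode(address); return (lat, lon), or `default` if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)

# Stand-ins for geolocator.geocode (assumptions used only for this offline demo)
def found(address):
    return SimpleNamespace(latitude=4.6, longitude=-74.08)

def not_found(address):
    return None

print(geocode_or_default(found, 'Bogota, Colombia', BOGOTA_FALLBACK))
print(geocode_or_default(not_found, 'no such place', BOGOTA_FALLBACK))
```

In the notebook, the first argument would be `geolocator.geocode` rather than a stand-in.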


#### Create a map of Bogotá with neighborhoods superimposed on top.

In [15]:
# create map of Bogotá using latitude and longitude values
map_bogota = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bogota)  
    
map_bogota
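
The fixed `zoom_start=10` is a guess at a suitable zoom level. One alternative is to compute the bounding box of the plotted points and let the map fit itself to it. The helper below is a sketch using only the standard library; `bounding_box` and `points` are hypothetical names, and the sample coordinates are three rows from the table above.

```python
def bounding_box(points):
    """Return [[min_lat, min_lon], [max_lat, max_lon]] for (lat, lon) pairs."""
    points = list(points)
    if not points:
        raise ValueError('need at least one point')
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]

# Three sample coordinates from the dataframe shown earlier
points = [(4.613799, -74.066259), (4.614158, -74.069222), (4.614853, -74.069126)]
bounds = bounding_box(points)
print(bounds)
```

With folium, the result could then be passed to `map_bogota.fit_bounds(bounds)`, assuming the points come from `zip(neighborhoods['Latitude'], neighborhoods['Longitude'])`.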

The map above plots every record in the dataset, with each marker labeled by its neighborhood and borough.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [16]:
CLIENT_ID = 'U5HBM0544TWPO4G52TXO5POYFSGGZKHZV0LD1R4CLRND01GN' # your Foursquare ID
CLIENT_SECRET = 'JXOBNNXG3Y3GVUTNR20Q3RY5JGDHEC1LRS3AO1NXRVPZJKIP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: U5HBM0544TWPO4G52TXO5POYFSGGZKHZV0LD1R4CLRND01GN
CLIENT_SECRET: JXOBNNXG3Y3GVUTNR20Q3RY5JGDHEC1LRS3AO1NXRVPZJKIP
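
Hard-coding API keys in a notebook makes them easy to leak when the notebook is shared. A common alternative is to read them from environment variables and print only a masked form. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; these names, `load_credentials`, `mask`, and `demo_env` are hypothetical and not part of the original notebook.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # assumed variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # assumed variable name
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

def mask(value, visible=4):
    """Show only the first few characters of a secret."""
    return value[:visible] + '*' * (len(value) - visible)

# Demo with a fake environment so the sketch runs without real keys
demo_env = {'FOURSQUARE_CLIENT_ID': 'abcd1234', 'FOURSQUARE_CLIENT_SECRET': 'wxyz5678'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```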


#### Let's explore the first neighborhood in our dataframe.

Compute one representative point per neighborhood by averaging the coordinates of its records.

In [17]:
neighborhood_fs = neighborhoods.groupby('Neighborhood')[['Latitude', 'Longitude']].mean().reset_index()

neighborhood_fs

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | AEROPUERTO EL DORADO | 4.690904 | -74.134005 |
| 1 | ANTIGUO COUNTRY | 4.671517 | -74.057248 |
| 2 | BELLAVISTA | 4.655442 | -74.054903 |
| 3 | BOCHICA II | 4.71094 | -74.112074 |
| 4 | CEDRITOS | 4.717639 | -74.033637 |
| 5 | CEMENTERIO JARDINES APOGEO | 4.597093 | -74.176075 |
| 6 | CENTRO ADMINISTRATIVO | 4.596079 | -74.075121 |
| 7 | CHAPINERO CENTRAL | 4.640445 | -74.065861 |
| 8 | CHAPINERO NORTE | 4.652679 | -74.061606 |
| 9 | CHICO NORTE | 4.676578 | -74.048629 |


Get the first neighborhood's name, latitude, and longitude.

In [18]:

neighborhood_latitude = neighborhood_fs.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhood_fs.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name= neighborhood_fs.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of AEROPUERTO EL DORADO are 4.690904263157895, -74.1340046842105.


First, let's create the GET request URL. Name your URL **url**.

In [19]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION,  
    neighborhood_latitude, 
    neighborhood_longitude,
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=U5HBM0544TWPO4G52TXO5POYFSGGZKHZV0LD1R4CLRND01GN&client_secret=JXOBNNXG3Y3GVUTNR20Q3RY5JGDHEC1LRS3AO1NXRVPZJKIP&v=20180605&ll=4.690904263157895,-74.1340046842105&radius=500&limit=100'
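
String formatting works here because every value is URL-safe, but building the query with `urllib.parse.urlencode` handles escaping automatically and keeps each parameter named. This is a sketch with placeholder credentials; `build_explore_url` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),   # latitude,longitude pair
        'radius': radius,                 # search radius in meters
        'limit': limit,                   # maximum number of venues
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials (assumptions), coordinates from the cell above
demo_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180605',
                             4.690904263157895, -74.1340046842105, 500, 100)

# Parse the URL back to confirm the parameters survived encoding
query = parse_qs(urlparse(demo_url).query)
print(query['ll'], query['radius'], query['limit'])
```

The same effect can be had by passing the dictionary to `requests.get(base_url, params=params)`, which encodes the query string itself.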

Send the GET request and examine the results.

In [20]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f1f75a1531fe8660078541a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Aeropuerto El Dorado',
  'headerFullLocation': 'Aeropuerto El Dorado, Bogotá',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 22,
  'suggestedBounds': {'ne': {'lat': 4.695404267657899,
    'lng': -74.12949798601662},
   'sw': {'lat': 4.686404258657891, 'lng': -74.13851138240437}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '55ef708d498ef7f599495935',
       'name': 'BBC La Bodega - Terminal Puente Aéreo',
       'location': {'lat': 4.693784745511199,
        'lng': -74.13515286278391,
        'labeledLatLngs': [{'label': 'display',
          'lat': 
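
The response reports its status under `meta.code`, and the venues sit under `response.groups[0].items`. Checking the status before indexing gives a clearer error when a request fails, for example because of invalid credentials. A minimal sketch on hand-written sample responses; `get_recommended_items`, `ok_response`, and `bad_response` are hypothetical names, and the samples are trimmed stand-ins for real API output.

```python
def get_recommended_items(results):
    """Return the list of recommended items, or raise if the request failed."""
    code = results.get('meta', {}).get('code')
    if code != 200:
        raise ValueError('Foursquare request failed with code {}'.format(code))
    groups = results.get('response', {}).get('groups', [])
    # An empty list of groups means no venues were recommended
    return groups[0].get('items', []) if groups else []

# Trimmed sample responses (assumptions shaped like the output above)
ok_response = {
    'meta': {'code': 200},
    'response': {'groups': [{'type': 'Recommended Places',
                             'name': 'recommended',
                             'items': [{'venue': {'name': 'BBC La Bodega - Terminal Puente Aéreo'}}]}]},
}
bad_response = {'meta': {'code': 400}}

items = get_recommended_items(ok_response)
print(len(items), items[0]['venue']['name'])
```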

In [21]:
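# function that extracts the category name of a venue
def get_category_type(row):
    # rows come either from a raw venue ('categories')
    # or from a flattened dataframe ('venue.categories')
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']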

Now we are ready to clean the JSON and structure it into a *pandas* dataframe.

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | BBC La Bodega - Terminal Puente Aéreo | Brewery | 4.693785 | -74.135153 |
| 1 | Juan Valdez Café | Café | 4.693891 | -74.135197 |
| 2 | Top Sushi Wok | Sushi Restaurant | 4.691265 | -74.136686 |
| 3 | Juan Valdez Café | Café | 4.693489 | -74.134734 |
| 4 | Movich Buro 26 | Hotel | 4.689655 | -74.130106 |


In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

22 venues were returned by Foursquare.
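
To see how the 500 m search radius relates to the returned venues, the great-circle distance between the neighborhood's average point and a venue can be computed with the haversine formula. The sketch below uses the coordinates shown in the outputs above; `haversine_m` and `EARTH_RADIUS_M` are hypothetical names, and the Earth is assumed to be a sphere with a mean radius of 6,371 km.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Average point of AEROPUERTO EL DORADO and the first venue in the response
center = (4.690904263157895, -74.1340046842105)
venue = (4.693784745511199, -74.13515286278391)

distance = haversine_m(center[0], center[1], venue[0], venue[1])
print('{:.0f} m from the search center'.format(distance))
```

The first venue lands well inside the 500 m radius, which is consistent with the `radius` value passed in the request.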


<a id='item2'></a>