# Proyecto Capstone para IBM Data Science Certificate

## Introduction

This laboratory is part of week 4 of the final project of the Big Data professional certification carried out by IBM through Coursera. In this work, explore, segment and group the neighborhoods in the city of Barranquilla.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download, Data Setup and Explore Dataset</a>

2. <a href="#item2">Create initial table with 103 postal codes ('Postcode', 'Borough','Neighborhood') </a>

3. <a href="#item3">Concatinate table from part 1 with Geospacial Coordinates ('Latititude', 'Longitude')</a>

4. <a href="#item4">Generate maps to visual neighborhoods and how they cluster using geopy & folium</a>

</font>
</div>



## Download, Data Setup and Explore Dataset

In [1]:
import numpy as np # library to handle data in a vectorized manner
import time
import pandas as pd # library for data analsysis
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

!pip install beautifulsoup4

#mapping tools
!pip install geopy 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
# map rendering library
import folium  

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Collecting beautifulsoup4
[?25l  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
[K     |████████████████████████████████| 112kB 25.5MB/s eta 0:00:01
[?25hCollecting soupsieve>1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.0 soupsieve-2.0
Libraries imported.


In [2]:
!conda install -c anaconda xlrd --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [3]:
coordinates_df = pd.DataFrame( data = { 
    'Neighborhood': ['Altos De Riomar','Miramar','Andalucia','Altos Del Limon','El Golf','Riomar','Villa Country','El Tabor','Alto Prado','Prado','Villa Campestre','Los Nogales','Villa Del Mar','Ciudad Jardin','Boston','Las Tres Aves Marias','El Recreo'], 
    'Latitude' : ['11.015578','11.003472','11.016125','11.013992', '11.008695', '11.011865', '11.006075', '11.002278', 
                  '11.001832', '11.001382', '11.023403', '10.993977', '11.005024', '10.994720','10.986946', '11.021004', 
                  '10.980112' ],
    'Longitude': ['-74.820644','-74.835111','-74.815162','-74.826282', '-74.808828','-74.831650', '-74.804908', 
                  '-74.828940', '-74.809796', '-74.798429', '-74.862230', '-74.827701', '-74.827649', '-74.818642',
                  '-74.793942', '-74.808954', '-74.800932' ]})

In [4]:
coordinates_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Altos De Riomar,11.015578,-74.820644
1,Miramar,11.003472,-74.835111
2,Andalucia,11.016125,-74.815162
3,Altos Del Limon,11.013992,-74.826282
4,El Golf,11.008695,-74.808828


In [5]:
coordinates_df.to_csv("Barrios_coord.csv", index=False)

In [6]:
coordinates_df.info

<bound method DataFrame.info of             Neighborhood   Latitude   Longitude
0        Altos De Riomar  11.015578  -74.820644
1                Miramar  11.003472  -74.835111
2              Andalucia  11.016125  -74.815162
3        Altos Del Limon  11.013992  -74.826282
4                El Golf  11.008695  -74.808828
5                 Riomar  11.011865  -74.831650
6          Villa Country  11.006075  -74.804908
7               El Tabor  11.002278  -74.828940
8             Alto Prado  11.001832  -74.809796
9                  Prado  11.001382  -74.798429
10       Villa Campestre  11.023403  -74.862230
11           Los Nogales  10.993977  -74.827701
12         Villa Del Mar  11.005024  -74.827649
13         Ciudad Jardin  10.994720  -74.818642
14                Boston  10.986946  -74.793942
15  Las Tres Aves Marias  11.021004  -74.808954
16             El Recreo  10.980112  -74.800932>

In [7]:
coordinates_df['Latitude'] = coordinates_df['Latitude'].astype('float', errors = 'ignore')
coordinates_df['Longitude'] = coordinates_df['Longitude'].astype('float', errors = 'ignore')
coordinates_df.dtypes

Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [8]:
address = 'Barranquilla, Colombia'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Barranquilla are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of Barranquilla are 10.9799669, -74.8013085.


In [9]:


Barranquilla_map = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, neighborhood in zip(coordinates_df['Latitude'], coordinates_df['Longitude'], coordinates_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(Barranquilla_map)  
    
Barranquilla_map

#### Explore a neighborhood using Foursquare API

In [10]:
CLIENT_ID = 'N1FXT1RQOUZAZCMKURH3PFJYM5VGHNSMZQHK4JXWVQ5DDI1Y' # my Foursquare ID
CLIENT_SECRET = 'PLQF0OEC3F2QSJREOWM2YSTJTOAWWAAP324E1EWBFA0AGHTG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
radius = 500
LIMIT = 100

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: N1FXT1RQOUZAZCMKURH3PFJYM5VGHNSMZQHK4JXWVQ5DDI1Y
CLIENT_SECRET:PLQF0OEC3F2QSJREOWM2YSTJTOAWWAAP324E1EWBFA0AGHTG


#### Let's explore the 'studio district'

In [11]:
#define objects for 'Studio District' index [3] in df_toronto
neighborhood_latitude = coordinates_df.loc[3, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = coordinates_df.loc[3, 'Longitude'] # neighborhood longitude value
neighborhood_name = coordinates_df.loc[3, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Altos Del Limon are 11.013992, -74.826282.


#### Now, let's get the top 100 venues that are in Studio District within a radius of 500 meters.

In [12]:
#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display GET request URL

'https://api.foursquare.com/v2/venues/explore?&client_id=N1FXT1RQOUZAZCMKURH3PFJYM5VGHNSMZQHK4JXWVQ5DDI1Y&client_secret=PLQF0OEC3F2QSJREOWM2YSTJTOAWWAAP324E1EWBFA0AGHTG&v=20180605&ll=11.013992,-74.826282&radius=500&limit=100'

In [13]:
results = requests.get(url).json()
results; # remove ';' to see json data

#### Clean the json and structure it into a pandas dataframe.

In [14]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [15]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Parque Bulevar Buenavista,Park,11.015699,-74.826604
1,Centro Comercial Buenavista I,Shopping Mall,11.013285,-74.827622
2,Centro Comercial Buenavista II,Shopping Mall,11.014186,-74.828039
3,El Giratorio,Gastropub,11.012761,-74.827267
4,Salvator's Pizza,Pizza Place,11.014151,-74.826245


In [16]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

82 venues were returned by Foursquare.


In [17]:
map_studio = folium.Map(location=[neighborhood_latitude, neighborhood_longitude], zoom_start=17)

# add markers to map
for lat, lng, name, categories in zip(nearby_venues['lat'], nearby_venues['lng'], nearby_venues['name'], nearby_venues['categories']):
  label = '{},{}'.format(categories,name)
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
      [lat, lng],
      radius=5,
      popup=label,
      color='blue',
      fill=True,
      fill_color='#3186cc',
      fill_opacity=0.7).add_to(map_studio) 
    
map_studio

#### Cluster Analysis of Venues across all neighborhoods

In [18]:
# create a function to get all venues for each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [19]:
#run function for all toronto neighborhoods and create df 'toronto_venues'
Barranquilla_venues = getNearbyVenues(names=coordinates_df['Neighborhood'],
                                   latitudes=coordinates_df['Latitude'],
                                   longitudes=coordinates_df['Longitude']
                                  )

Altos De Riomar
Miramar
Andalucia
Altos Del Limon
El Golf
Riomar
Villa Country
El Tabor
Alto Prado
Prado
Villa Campestre
Los Nogales
Villa Del Mar
Ciudad Jardin
Boston
Las Tres Aves Marias
El Recreo


In [20]:
print(Barranquilla_venues.shape)
Barranquilla_venues.head()

(396, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Altos De Riomar,11.015578,-74.820644,Margarita Saieh De Jassir,11.013333,-74.819209,Dessert Shop
1,Altos De Riomar,11.015578,-74.820644,La Masía,11.013817,-74.822861,Soccer Field
2,Altos De Riomar,11.015578,-74.820644,"Patinodromo ""Rafael Naranjo Pertuz""",11.013164,-74.82369,Skating Rink
3,Altos De Riomar,11.015578,-74.820644,Carreta & Paja,11.014282,-74.822123,Steakhouse
4,Altos De Riomar,11.015578,-74.820644,El Chuzón de Joselo,11.013687,-74.822538,Fast Food Restaurant


In [21]:
Barranquilla_venues.groupby('Neighborhood').count().head()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alto Prado,50,50,50,50,50,50
Altos De Riomar,6,6,6,6,6,6
Altos Del Limon,82,82,82,82,82,82
Andalucia,13,13,13,13,13,13
Boston,7,7,7,7,7,7


In [22]:
print('There are {} unique categories.'.format(len(Barranquilla_venues['Venue Category'].unique())))

There are 94 unique categories.


#### Create 'one hot' file with dummy values by venue category

In [23]:
# one hot encoding
Barranquilla_onehot = pd.get_dummies(Barranquilla_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Barranquilla_onehot['Neighborhood'] = Barranquilla_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Barranquilla_onehot.columns[-1]] + list(Barranquilla_onehot.columns[:-1])
Barranquilla_onehot = Barranquilla_onehot[fixed_columns]

Barranquilla_onehot.head()


Unnamed: 0,Neighborhood,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Entertainment,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beer Garden,Big Box Store,Bistro,Bookstore,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Caribbean Restaurant,Casino,Clothing Store,Coffee Shop,Comfort Food Restaurant,Convention Center,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Donut Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flower Shop,Food,Food Court,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gastropub,Gluten-free Restaurant,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Hotel,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Juice Bar,Karaoke Bar,Latin American Restaurant,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Nightclub,Park,Peruvian Restaurant,Pharmacy,Pie Shop,Pizza Place,Playground,Pub,Recreation Center,Restaurant,Roof Deck,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Skating Rink,Snack Place,Soccer Field,Soup Place,Spanish Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Thai Restaurant,Theme Park Ride / Attraction,Theme Restaurant,Toy / Game Store,Vegetarian / Vegan Restaurant,Women's Store
0,Altos De Riomar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Altos De Riomar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,Altos De Riomar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Altos De Riomar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,Altos De Riomar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [24]:
Barranquilla_grouped = Barranquilla_onehot.groupby('Neighborhood').mean().reset_index()
Barranquilla_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Entertainment,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beer Garden,Big Box Store,Bistro,Bookstore,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Caribbean Restaurant,Casino,Clothing Store,Coffee Shop,Comfort Food Restaurant,Convention Center,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Donut Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flower Shop,Food,Food Court,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gastropub,Gluten-free Restaurant,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Hotel,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Juice Bar,Karaoke Bar,Latin American Restaurant,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Nightclub,Park,Peruvian Restaurant,Pharmacy,Pie Shop,Pizza Place,Playground,Pub,Recreation Center,Restaurant,Roof Deck,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Skating Rink,Snack Place,Soccer Field,Soup Place,Spanish Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Thai Restaurant,Theme Park Ride / Attraction,Theme Restaurant,Toy / Game Store,Vegetarian / Vegan Restaurant,Women's Store
0,Alto Prado,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.02,0.02,0.0,0.02,0.02,0.02,0.0,0.02,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.06,0.0,0.06,0.02,0.02,0.02,0.02,0.0,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.04,0.04,0.0,0.06,0.0,0.0,0.0,0.04,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.06,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0
1,Altos De Riomar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Altos Del Limon,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,0.012195,0.02439,0.0,0.0,0.0,0.0,0.02439,0.012195,0.012195,0.0,0.0,0.02439,0.012195,0.0,0.0,0.012195,0.02439,0.012195,0.0,0.0,0.012195,0.012195,0.036585,0.02439,0.012195,0.012195,0.0,0.060976,0.0,0.0,0.0,0.012195,0.02439,0.012195,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.036585,0.04878,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,0.012195,0.012195,0.012195,0.0,0.02439,0.0,0.0,0.012195,0.036585,0.012195,0.0,0.0,0.012195,0.0,0.02439,0.036585,0.0,0.02439,0.012195,0.036585,0.012195,0.012195,0.012195,0.012195,0.0,0.0,0.02439,0.012195,0.04878,0.0,0.0,0.012195,0.012195,0.02439,0.0,0.012195
3,Andalucia,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.076923,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.230769,0.0,0.0,0.0,0.153846,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Boston,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
Barranquilla_grouped.shape

(17, 95)

#### Let's print each neighborhood along with the top 5 most common venues

In [26]:
num_top_venues = 5

for hood in Barranquilla_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Barranquilla_grouped[Barranquilla_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Alto Prado----
                venue  freq
0  Italian Restaurant  0.06
1               Hotel  0.06
2                 Bar  0.06
3          Steakhouse  0.06
4         Pizza Place  0.06


----Altos De Riomar----
                  venue  freq
0          Soccer Field  0.17
1              Pie Shop  0.17
2          Dessert Shop  0.17
3  Fast Food Restaurant  0.17
4            Steakhouse  0.17


----Altos Del Limon----
                  venue  freq
0  Fast Food Restaurant  0.06
1      Sushi Restaurant  0.05
2        Ice Cream Shop  0.05
3         Shopping Mall  0.04
4                 Hotel  0.04


----Andalucia----
            venue  freq
0     Pizza Place  0.23
1      Restaurant  0.15
2     Supermarket  0.08
3  Sandwich Place  0.08
4        Pharmacy  0.08


----Boston----
                  venue  freq
0  Caribbean Restaurant  0.29
1  Gym / Fitness Center  0.14
2          Soccer Field  0.14
3      Department Store  0.14
4    Seafood Restaurant  0.14


----Ciudad Jardin----
                

#### First, let's write a function to sort the venues in descending order.

In [27]:
def return_most_common_venues(row, num_top_venues):

    row_categories = row.iloc[1:]

    row_categories_sorted = row_categories.sort_values(ascending=False)

    

    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Barranquilla_grouped['Neighborhood']

for ind in np.arange(Barranquilla_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Barranquilla_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Alto Prado,Italian Restaurant,Hotel,Pizza Place,Bar,Steakhouse
1,Altos De Riomar,Pie Shop,Steakhouse,Dessert Shop,Fast Food Restaurant,Soccer Field
2,Altos Del Limon,Fast Food Restaurant,Sushi Restaurant,Ice Cream Shop,Hotel,Department Store
3,Andalucia,Pizza Place,Restaurant,Supermarket,Pharmacy,Park
4,Boston,Caribbean Restaurant,Farmers Market,Seafood Restaurant,Department Store,Soccer Field


#### Cluster neighborhoods

In [29]:
# set number of clusters
kclusters = 5

Barranquilla_grouped_clustering = Barranquilla_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Barranquilla_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5]

array([0, 0, 0, 0, 4], dtype=int32)

In [30]:
Barranquilla_merged = coordinates_df

# add clustering labels
Barranquilla_merged ['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
Barranquilla_merged  = Barranquilla_merged .join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Barranquilla_merged .head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Altos De Riomar,11.015578,-74.820644,0,Pie Shop,Steakhouse,Dessert Shop,Fast Food Restaurant,Soccer Field
1,Miramar,11.003472,-74.835111,0,Gym / Fitness Center,Department Store,Nightclub,Shopping Mall,Fast Food Restaurant
2,Andalucia,11.016125,-74.815162,0,Pizza Place,Restaurant,Supermarket,Pharmacy,Park
3,Altos Del Limon,11.013992,-74.826282,0,Fast Food Restaurant,Sushi Restaurant,Ice Cream Shop,Hotel,Department Store
4,El Golf,11.008695,-74.808828,4,Bar,Pizza Place,Sushi Restaurant,Bakery,Italian Restaurant


In [31]:
Barranquilla_merged.tail()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
12,Villa Del Mar,11.005024,-74.827649,0,Fast Food Restaurant,Middle Eastern Restaurant,Restaurant,Seafood Restaurant,Nightclub
13,Ciudad Jardin,10.99472,-74.818642,0,Burger Joint,Middle Eastern Restaurant,Steakhouse,Bakery,Gastropub
14,Boston,10.986946,-74.793942,0,Caribbean Restaurant,Farmers Market,Seafood Restaurant,Department Store,Soccer Field
15,Las Tres Aves Marias,11.021004,-74.808954,0,Fast Food Restaurant,Park,Farmers Market,Comfort Food Restaurant,Convention Center
16,El Recreo,10.980112,-74.800932,2,Soccer Field,Fast Food Restaurant,Restaurant,Department Store,Bakery


## Final: cluster map by venues

In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Barranquilla_merged['Latitude'], Barranquilla_merged['Longitude'], Barranquilla_merged['Neighborhood'], Barranquilla_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Anexos

In [33]:
Barranquilla_merged.to_csv("Barranquilla_clusters.csv", index=False)

In [34]:
Barranquilla_clusters= Barranquilla_merged

In [35]:
Barranquilla_clusters.drop(columns=['1st Most Common Venue','2nd Most Common Venue','3rd Most Common Venue','4th Most Common Venue','5th Most Common Venue'], axis=1, inplace = True, errors = 'ignore')
Barranquilla_clusters.head(20)

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
0,Altos De Riomar,11.015578,-74.820644,0
1,Miramar,11.003472,-74.835111,0
2,Andalucia,11.016125,-74.815162,0
3,Altos Del Limon,11.013992,-74.826282,0
4,El Golf,11.008695,-74.808828,4
5,Riomar,11.011865,-74.83165,0
6,Villa Country,11.006075,-74.804908,0
7,El Tabor,11.002278,-74.82894,2
8,Alto Prado,11.001832,-74.809796,2
9,Prado,11.001382,-74.798429,1


In [36]:
Barranquilla_clusters.to_csv("Barranquilla_clusters.csv", index=False)

In [37]:
coordinates_df.drop(columns=['Cluster Labels'], axis=1, inplace = True, errors = 'ignore')
coordinates_df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Altos De Riomar,11.015578,-74.820644
1,Miramar,11.003472,-74.835111
2,Andalucia,11.016125,-74.815162
3,Altos Del Limon,11.013992,-74.826282
4,El Golf,11.008695,-74.808828
5,Riomar,11.011865,-74.83165
6,Villa Country,11.006075,-74.804908
7,El Tabor,11.002278,-74.82894
8,Alto Prado,11.001832,-74.809796
9,Prado,11.001382,-74.798429


In [38]:
coordinates_df.to_csv("Barrios_coord.csv", index=False)