## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Introduction</a>
    
2. <a href="#item2">Objectives</a>

3. <a href="#item3">Data</a>

4. <a href="#item4">Methodology</a>

5. <a href="#item5">Analysis</a>

6. <a href="#item6">Results and Discussion</a>  
    
7. <a href="#item7">Conclusion</a>  
    
</font>
</div>

## 1. Introduction <a name="item1"></a>

This project compares two important Peruvian cities: Lima and Callao.

Lima is the capital and most populous city of the Republic of Peru. It is located on the central coast of the country, on the shores of the Pacific Ocean, and forms an extensive urban area known as Metropolitan Lima, flanked by the coastal desert and extending over the valleys of the Chillón, Rímac and Lurín rivers. According to the 2017 Peruvian census, Lima has more than 8.5 million inhabitants, while its urban agglomeration has more than 11 million inhabitants, 30% of the Peruvian population.

Callao is a port city in the Constitutional Province of Callao, in central-western Peru on the country's central coast. It lies on the shores of the Pacific Ocean, west of the province of Lima and 15 kilometers from the historic center of Lima, with which it forms a conurbation.


## 2. Objectives <a name="item2"></a>

This project classifies areas using Foursquare venue data, segmentation and clustering.
The main objective is to segment areas of Lima and Callao based on the most common venues reported by Foursquare.
In addition, segmentation and clustering are used to determine the following:

* Whether the two cities are similar or different.

* How each area within a city can be classified: residential, tourist, sports, art or other.


## 3. Data <a name="item3"></a>

The data for this project were collected from Wikipedia pages and restructured into CSV files for easier reading and manipulation.
The files are available at:

- https://github.com/cpalominoch/Coursera_Capstone/blob/master/Lima_df.csv

- https://github.com/cpalominoch/Coursera_Capstone/blob/master/Callao_df.csv

The coordinates were obtained using the Google Geocoding API.

The Foursquare API supplies the venue data used for segmentation and clustering.
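
The links above point to GitHub "blob" pages, which return HTML rather than CSV. A minimal sketch of converting such a link to its raw-content form is shown below. The helper name `github_blob_to_raw` is hypothetical, and the sketch assumes the repository is public and that GitHub's `raw.githubusercontent.com` URL layout applies.

```python
from urllib.parse import urlparse


def github_blob_to_raw(blob_url):
    """Convert a github.com 'blob' page URL into its raw-content URL."""
    parts = urlparse(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: {}".format(blob_url))
    owner, repo, _, branch = segments[:4]
    file_path = "/".join(segments[4:])
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        owner, repo, branch, file_path
    )


LIMA_CSV = github_blob_to_raw(
    "https://github.com/cpalominoch/Coursera_Capstone/blob/master/Lima_df.csv"
)
CALLAO_CSV = github_blob_to_raw(
    "https://github.com/cpalominoch/Coursera_Capstone/blob/master/Callao_df.csv"
)
print(LIMA_CSV)
print(CALLAO_CSV)
```

With network access, the resulting URLs could be passed directly to `pd.read_csv` instead of a local file name.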


## 4. Methodology <a name="item4"></a>

In this project, we look for similarities and differences between Lima and Callao across areas of interest such as sports, food and tourism.

In the first step, we collect the postal code, district and area for Lima and Callao.
The districts of Santiago de Surco (Lima) and Callao (Callao) are taken as samples, and their areas are shown on maps.

In the second step, we use the Foursquare API to obtain venues in the areas surrounding Santiago de Surco and Callao.

In the third step, we list each area along with its 5 most common venue categories.
We present a map of all those locations and group them with k-means clustering to identify general areas that differ or are similar between the two cities.
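
To make the clustering step concrete, here is a small didactic sketch of k-means (Lloyd's algorithm) in plain Python. It is not the implementation used in the analysis, which relies on scikit-learn's `KMeans`. The area names, the three category columns and the fixed starting centroids are all hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; `centroids` are caller-chosen starting centres."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical frequency profiles: share of venues that are
# [restaurants, nightclubs, parks] in four made-up areas.
profiles = {
    "Area_A": [0.8, 0.0, 0.2],
    "Area_B": [0.7, 0.1, 0.2],
    "Area_C": [0.1, 0.8, 0.1],
    "Area_D": [0.0, 0.9, 0.1],
}
points = list(profiles.values())
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(dict(zip(profiles, labels)))
```

Areas with similar venue mixes end up with the same label. scikit-learn differs mainly in how it picks the starting centroids, which is why the notebook fixes `random_state`.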

## 5. Analysis <a name="item5"></a>

In [3]:
#Setup libraries

!pip install geopy

import numpy as np
import pandas as pd
!pip install geocoder
import geocoder

#!conda install -c conda-forge geopy --yes 


In [4]:
#Read the CSV file containing the Lima data

df_Lima = pd.read_csv("Lima_df.csv", encoding='ISO-8859-1')
df_Lima.head()

Unnamed: 0,Postal Code,District,Area
0,15011,Ate,Ate_15011
1,15012,Ate,Ate_15012
2,15023,Ate,Ate_15023
3,15479,Ate,Ate_15479
4,15483,Ate,Ate_15483


In [5]:
#Examine data of Lima

print('Lima dataframe has {} district and {} areas.'.format(
        len(df_Lima['District'].unique()),
        df_Lima.shape[0]
    )
)

#grouping data by District
df_Lima.groupby('District').count()

Lima dataframe has 162 district and 286 areas.


Unnamed: 0_level_0,Postal Code,Area
District,Unnamed: 1_level_1,Unnamed: 2_level_1
Alis,1,1
Ambar,1,1
Andajes,1,1
Antioquáa,1,1
Arahuay,1,1
...,...,...
Villa el Salvador,6,6
Vitis,1,1
Viñac,1,1
Yauyos,1,1


In [6]:
#Read the CSV file containing the Callao data

df_Callao = pd.read_csv("Callao_df.csv", encoding='ISO-8859-1')
df_Callao.head()

Unnamed: 0,Postal Code,District,Area
0,7001,Callao,Callao_7001
1,7006,Carmen de la Legua Reynoso,Carmen de la Legua Reynoso_7006
2,7011,Bellavista,Bellavista_7011
3,7016,La Perla,La Perla_7016
4,7021,La Punta,La Punta_7021


In [7]:

#Examine data of Callao

print('Callao dataframe has {} district and {} areas.'.format(
        len(df_Callao['District'].unique()),
        df_Callao.shape[0]
    )
)

#grouping data by District
df_Callao.groupby('District').count()


Callao dataframe has 7 district and 15 areas.


Unnamed: 0_level_0,Postal Code,Area
District,Unnamed: 1_level_1,Unnamed: 2_level_1
Ancón,1,1
Bellavista,1,1
Callao,5,5
Carmen de la Legua Reynoso,1,1
La Perla,1,1
La Punta,1,1
Ventanilla,5,5


In [8]:
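#Google API key, read from the environment instead of being hardcoded in the notebook
#(the variable name GOOGLE_API_KEY is an assumption)

import os
GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']

#Function to get the latitude and longitude of a Peruvian postal code

def get_latlng(postal_code, max_attempts=5):
    lat_lng_coords = None
    attempts = 0
    # retry failed lookups, but give up after max_attempts instead of looping forever
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Peru'.format(postal_code), key=GOOGLE_API_KEY)
        lat_lng_coords = g.latlng
        attempts += 1
    # keep the two-column shape even when every attempt failed
    return lat_lng_coords if lat_lng_coords is not None else [None, None]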


In [9]:
#New column for Lima dataframe

postal_codes1 = df_Lima['Postal Code']    
coords = [ get_latlng(postal_code) for postal_code in postal_codes1.tolist() ]

df_Lima_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df_Lima['Latitude'] = df_Lima_coords['Latitude']
df_Lima['Longitude'] = df_Lima_coords['Longitude']

df_Lima.head(10)

Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Read timed out. (read timeout=5.0)


Unnamed: 0,Postal Code,District,Area,Latitude,Longitude
0,15011,Ate,Ate_15011,-12.034139,-76.949903
1,15012,Ate,Ate_15012,-12.056509,-76.949903
2,15023,Ate,Ate_15023,-12.099424,-76.970067
3,15479,Ate,Ate_15479,-12.012905,-76.797125
4,15483,Ate,Ate_15483,-12.025183,-76.820198
5,15487,Ate,Ate_15487,-12.027929,-76.877861
6,15491,Ate,Ate_15491,-12.037547,-76.903801
7,15494,Ate,Ate_15494,-12.041299,-76.923973
8,15498,Ate,Ate_15498,-12.027729,-76.932617
9,15063,Barranco,Barranco_15063,-12.157589,-77.016144


In [10]:
#New column for Callao dataframe

postal_codes2 = df_Callao['Postal Code']    
coords1 = [ get_latlng(postal_code1) for postal_code1 in postal_codes2.tolist() ]

df_Callao_coords = pd.DataFrame(coords1, columns=['Latitude', 'Longitude'])
df_Callao['Latitude'] = df_Callao_coords['Latitude']
df_Callao['Longitude'] = df_Callao_coords['Longitude']

df_Callao.head(10)

Unnamed: 0,Postal Code,District,Area,Latitude,Longitude
0,7001,Callao,Callao_7001,37.926805,-122.056054
1,7006,Carmen de la Legua Reynoso,Carmen de la Legua Reynoso_7006,-12.046231,-77.088103
2,7011,Bellavista,Bellavista_7011,-12.064096,-77.111121
3,7016,La Perla,La Perla_7016,-12.070231,-77.122629
4,7021,La Punta,La Punta_7021,-12.059843,-77.139888
5,7026,Callao,Callao_7026,-12.043178,-77.11256
6,7031,Callao,Callao_7031,-12.00575,-77.119752
7,7036,Callao,Callao_7036,-12.013203,-77.099613
8,7041,Callao,Callao_7041,-12.025722,-77.098174
9,7051,Ventanilla,Ventanilla_7051,-11.882764,-77.119752
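
Row 0 of the table above (`Callao_7001`) has a positive latitude and a longitude near -122, which places it in the northern hemisphere and therefore outside Peru. A quick sanity check along the following lines could flag such rows before they are sent to Foursquare. The bounding box is an approximate, outward-rounded assumption for mainland Peru, and `in_peru` is a hypothetical helper name.

```python
# Approximate bounding box for mainland Peru (assumed, rounded outward).
PERU_LAT = (-18.5, 0.0)
PERU_LNG = (-81.5, -68.5)


def in_peru(lat, lng):
    """Return True when the point falls inside the rough Peru bounding box."""
    return PERU_LAT[0] <= lat <= PERU_LAT[1] and PERU_LNG[0] <= lng <= PERU_LNG[1]


# First three rows of the Callao table above: (area, latitude, longitude).
rows = [
    ("Callao_7001", 37.926805, -122.056054),
    ("Carmen de la Legua Reynoso_7006", -12.046231, -77.088103),
    ("Bellavista_7011", -12.064096, -77.111121),
]

suspect = [area for area, lat, lng in rows if not in_peru(lat, lng)]
print(suspect)
```

Flagged rows can then be geocoded again with a more specific query, or dropped, so that they do not distort the venue counts and clusters for Callao.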


In [11]:
#Map of Lima

from geopy.geocoders import Nominatim
import folium

address = 'Lima, Peru'
geolocator =  Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Lima using latitude and longitude values
map_Lima = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Lima['Latitude'], df_Lima['Longitude'], df_Lima['District'], df_Lima['Area']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Lima)  
    
map_Lima

In [12]:
#Map of Callao

address = 'Callao, Peru'
geolocator =  Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Callao using latitude and longitude values
map_Callao = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Callao['Latitude'], df_Callao['Longitude'], df_Callao['District'], df_Callao['Area']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Callao)  
    
map_Callao

In [38]:
#Slice the original dataframe and create a new dataframe of the Santiago de Surco

df_Surco = df_Lima[df_Lima['District'] == 'Santiago de Surco'].reset_index(drop=True)

#Get the geographical coordinates of Santiago de Surco, Lima

address = 'Santiago de Surco, Lima'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Santiago de Surco using latitude and longitude values
map_Surco = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, district,postalcode in zip(df_Surco['Latitude'], df_Surco['Longitude'], df_Surco['District'],df_Surco['Postal Code']):
     label = '{}, {}'.format(district, postalcode)
     folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Surco)  
    
map_Surco

In [39]:
#Slice the original dataframe and create a new dataframe of Callao

df_Callao_d = df_Callao[df_Callao['District'] == 'Callao'].reset_index(drop=True)

#Get the geographical coordinates of Callao

address = 'Callao, Callao'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Callao using latitude and longitude values

map_Callao_d = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map

for lat, lng, district,postalcode in zip(df_Callao_d['Latitude'], df_Callao_d['Longitude'], df_Callao_d['District'],df_Callao_d['Postal Code']):
     label = '{}, {}'.format(district, postalcode)
     folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Callao_d)  
    
map_Callao_d

In [15]:
#Using the Foursquare API to get venues in the area surrounding Santiago de Surco.

import os
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)

#Define Foursquare credentials and version
#(read from the environment instead of being hardcoded; the variable names are assumptions)

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'

#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.

neighborhood_latitude = df_Surco.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_Surco.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = df_Surco.loc[0, 'Area'] # neighborhood name

#get the top 100 venues that are in Santiago de Surco within a radius of 500 meters
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the results
results = requests.get(url).json()

#borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for Santiago de Surco, Lima.'.format(nearby_venues.shape[0]))
nearby_venues.head()

25 venues were returned by Foursquare for Santiago de Surco, Lima.


Unnamed: 0,name,categories,lat,lng
0,Embarcadero 41 Fusión,Seafood Restaurant,-12.120941,-76.99654
1,El Piombino,Snack Place,-12.120579,-76.99533
2,La Casa del Ceviche,Seafood Restaurant,-12.120855,-76.996136
3,Wing Factory,Wings Joint,-12.121674,-76.992644
4,Pisco y Pesca,Seafood Restaurant,-12.118546,-76.997427
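
The request URL above is assembled with positional `str.format` arguments, which makes it easy to swap two values by accident. As an alternative sketch, the same query string can be built from named parameters with the standard library. `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore request URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes every value, including the comma in `ll`.
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; coordinates of the first Santiago de Surco area.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180604",
                        -12.120241, -76.995987)
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"], query["limit"])
```

Because each value is tied to its parameter name, a reordered argument list can no longer send the radius where the limit belongs.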


In [16]:
#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.

neighborhood_latitude = df_Callao_d.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_Callao_d.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = df_Callao_d.loc[0, 'Area'] # neighborhood name

#get the top 100 venues that are in Callao within a radius of 500 meters
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the results
results = requests.get(url).json()

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for Callao, Callao.'.format(nearby_venues.shape[0]))
nearby_venues.head()

22 venues were returned by Foursquare for Callao, Callao.


Unnamed: 0,name,categories,lat,lng
0,iLoveKickboxing,Boxing Gym,37.926769,-122.056626
1,Renaissance ClubSport Walnut Creek Hotel,Hotel,37.925007,-122.056368
2,Bay Club Walnut Creek,Gym / Fitness Center,37.925261,-122.056904
3,Parada,Peruvian Restaurant,37.926811,-122.055849
4,Taheri's Mediterranean,Mediterranean Restaurant,37.928452,-122.05759


In [17]:
#function to repeat the same process to all area
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Area', 
                  'Area Latitude', 
                  'Area Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#run the above function on each neighborhood and create a new dataframe
Surco_venues = getNearbyVenues(names=df_Surco['Area'],
                                   latitudes=df_Surco['Latitude'],
                                   longitudes=df_Surco['Longitude']
                                  )
#check the size of the resulting dataframe
print(Surco_venues.shape)
Surco_venues.head()

Santiago de Surco_15038
Santiago de Surco_15039
Santiago de Surco_15049
Santiago de Surco_15054
(89, 7)


Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Santiago de Surco_15038,-12.120241,-76.995987,Embarcadero 41 Fusión,-12.120941,-76.99654,Seafood Restaurant
1,Santiago de Surco_15038,-12.120241,-76.995987,El Piombino,-12.120579,-76.99533,Snack Place
2,Santiago de Surco_15038,-12.120241,-76.995987,La Casa del Ceviche,-12.120855,-76.996136,Seafood Restaurant
3,Santiago de Surco_15038,-12.120241,-76.995987,Wing Factory,-12.121674,-76.992644,Wings Joint
4,Santiago de Surco_15038,-12.120241,-76.995987,Pisco y Pesca,-12.118546,-76.997427,Seafood Restaurant
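
`getNearbyVenues` indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so an error response or a venue without a category raises an exception partway through the loop. The sketch below shows one defensive way to flatten a response of the same shape. `extract_venue_rows` is a hypothetical helper, and the second venue in the sample payload is made up to exercise the empty-category case.

```python
def extract_venue_rows(area, lat, lng, response_json):
    """Flatten one explore response into rows, tolerating missing pieces."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        location = venue.get("location", {})
        rows.append((area, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Sample payload in the shape the notebook reads; the second venue is hypothetical.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Wing Factory",
               "location": {"lat": -12.121674, "lng": -76.992644},
               "categories": [{"name": "Wings Joint"}]}},
    {"venue": {"name": "Uncategorized venue",
               "location": {"lat": -12.12, "lng": -76.99},
               "categories": []}},
]}]}}

sample_rows = extract_venue_rows("Santiago de Surco_15038",
                                 -12.120241, -76.995987, sample_payload)
for row in sample_rows:
    print(row)
```

An empty or error payload yields an empty list, so the loop over areas can continue and the affected area simply contributes no venues.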


In [18]:

#run the above function on each neighborhood and create a new dataframe
Callao_venues = getNearbyVenues(names=df_Callao_d['Area'],
                                   latitudes=df_Callao_d['Latitude'],
                                   longitudes=df_Callao_d['Longitude']
                                  )

#check the size of the resulting dataframe
print(Callao_venues.shape)
Callao_venues.head()

Callao_7001
Callao_7026
Callao_7031
Callao_7036
Callao_7041
(42, 7)


Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Callao_7001,37.926805,-122.056054,iLoveKickboxing,37.926769,-122.056626,Boxing Gym
1,Callao_7001,37.926805,-122.056054,Renaissance ClubSport Walnut Creek Hotel,37.925007,-122.056368,Hotel
2,Callao_7001,37.926805,-122.056054,Bay Club Walnut Creek,37.925261,-122.056904,Gym / Fitness Center
3,Callao_7001,37.926805,-122.056054,Parada,37.926811,-122.055849,Peruvian Restaurant
4,Callao_7001,37.926805,-122.056054,Taheri's Mediterranean,37.928452,-122.05759,Mediterranean Restaurant


In [19]:
#check how many venues were returned for each area
print('There are {} unique categories in Santiago de Surco.'.format(len(Surco_venues['Venue Category'].unique())))
Surco_venues.groupby('Area').count()

There are 42 unique categories in Santiago de Surco.


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Santiago de Surco_15038,25,25,25,25,25,25
Santiago de Surco_15039,39,39,39,39,39,39
Santiago de Surco_15049,13,13,13,13,13,13
Santiago de Surco_15054,12,12,12,12,12,12


In [20]:
#check how many venues were returned for each area
print('There are {} unique categories in Callao.'.format(len(Callao_venues['Venue Category'].unique())))
Callao_venues.groupby('Area').count()

There are 32 unique categories in Callao.


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Callao_7001,22,22,22,22,22,22
Callao_7026,5,5,5,5,5,5
Callao_7031,2,2,2,2,2,2
Callao_7036,12,12,12,12,12,12
Callao_7041,1,1,1,1,1,1


## Analyze Lima

In [21]:
# one hot encoding
Surco_onehot = pd.get_dummies(Surco_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Surco_onehot['Area'] = Surco_venues['Area'] 

# move neighborhood column to the first column
fixed_columns = [Surco_onehot.columns[-1]] + list(Surco_onehot.columns[:-1])
Surco_onehot = Surco_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(Surco_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
Surco_grouped = Surco_onehot.groupby('Area').mean().reset_index()

#examine the new dataframe size after one hot encoding
print('{} rows were returned after grouping.'.format(Surco_grouped.shape[0]))


89 rows were returned after one hot encoding.
4 rows were returned after grouping.
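
Taking the mean of one-hot columns per area is the same as computing, for each area, the share of its venues that fall in each category. The following standard-library sketch shows that calculation on made-up data. `category_frequencies` is a hypothetical helper, and unlike the pandas version it omits categories with a share of zero.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """rows: iterable of (area, category) -> {area: {category: share}}."""
    counts = defaultdict(Counter)
    for area, category in rows:
        counts[area][category] += 1
    return {
        area: {cat: n / sum(c.values()) for cat, n in c.items()}
        for area, c in counts.items()
    }


# Hypothetical venue list: one (area, category) pair per venue.
toy_rows = [
    ("Area_A", "Bakery"),
    ("Area_A", "Bakery"),
    ("Area_A", "Park"),
    ("Area_A", "Pub"),
    ("Area_B", "Bakery"),
]
freqs = category_frequencies(toy_rows)
print(freqs)
```

Each area's shares sum to 1, which is what makes areas with very different venue counts comparable in the clustering step.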


In [22]:

#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in Surco_grouped['Area']:
    print("----"+hood+"----")
    temp = Surco_grouped[Surco_grouped['Area'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Santiago de Surco_15038----
                 venue  freq
0               Bakery  0.16
1   Seafood Restaurant  0.16
2          Snack Place  0.08
3                 Pool  0.08
4  Peruvian Restaurant  0.08


----Santiago de Surco_15039----
                  venue  freq
0           Pizza Place  0.10
1           Coffee Shop  0.10
2        Sandwich Place  0.10
3                  Park  0.08
4  Fast Food Restaurant  0.08


----Santiago de Surco_15049----
                venue  freq
0           Nightclub  0.31
1  Athletics & Sports  0.15
2                 Pub  0.15
3               Trail  0.08
4              Bakery  0.08


----Santiago de Surco_15054----
                 venue  freq
0     Asian Restaurant  0.17
1  Fried Chicken Joint  0.17
2              Dog Run  0.08
3                 Park  0.08
4           Restaurant  0.08




In [23]:
#put into a pandas dataframe

#write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 8 venues for each neighborhood
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Area']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['Area'] = Surco_grouped['Area']

for ind in np.arange(Surco_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Surco_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted.head()

Unnamed: 0,Area,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Santiago de Surco_15038,Bakery,Seafood Restaurant,Peruvian Restaurant,Pool,Snack Place,Wings Joint,Sandwich Place,Italian Restaurant
1,Santiago de Surco_15039,Pizza Place,Sandwich Place,Coffee Shop,Fast Food Restaurant,Park,Chinese Restaurant,Bakery,Peruvian Restaurant
2,Santiago de Surco_15049,Nightclub,Athletics & Sports,Pub,Soccer Stadium,Trail,Donut Shop,Bakery,Music Venue
3,Santiago de Surco_15054,Asian Restaurant,Fried Chicken Joint,BBQ Joint,Gym / Fitness Center,Burger Joint,Park,Peruvian Restaurant,Dog Run
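
The column headers above come from indexing the list `['st', 'nd', 'rd']` and falling back to `th`. That works for the 8 columns used here, but it would produce "21th" if the number of columns ever reached 21. A small sketch of a general ordinal helper follows; the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 8
columns = ["Area"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```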


In [24]:
#K-means clustering for Lima (Santiago de Surco)

from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

Surco_grouped_clustering = Surco_grouped.drop(columns='Area')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Surco_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

#create a new dataframe that includes the cluster as well as the top 8 venues for each neighborhood.
Surco_merged = df_Surco.copy()

# add clustering labels
Surco_merged['Cluster Labels'] = kmeans.labels_

# join areas_venues_sorted to add the most common venues for each area
Surco_merged = Surco_merged.join(areas_venues_sorted.set_index('Area'), on='Area')

Surco_merged.head()



Unnamed: 0,Postal Code,District,Area,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,15038,Santiago de Surco,Santiago de Surco_15038,-12.120241,-76.995987,0,Bakery,Seafood Restaurant,Peruvian Restaurant,Pool,Snack Place,Wings Joint,Sandwich Place,Italian Restaurant
1,15039,Santiago de Surco,Santiago de Surco_15039,-12.13089,-76.984468,0,Pizza Place,Sandwich Place,Coffee Shop,Fast Food Restaurant,Park,Chinese Restaurant,Bakery,Peruvian Restaurant
2,15049,Santiago de Surco,Santiago de Surco_15049,-12.137855,-77.013264,1,Nightclub,Athletics & Sports,Pub,Soccer Stadium,Trail,Donut Shop,Bakery,Music Venue
3,15054,Santiago de Surco,Santiago de Surco_15054,-12.163562,-76.994547,2,Asian Restaurant,Fried Chicken Joint,BBQ Joint,Gym / Fitness Center,Burger Joint,Park,Peruvian Restaurant,Dog Run
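
The cluster labels are assigned to the district dataframe by position. That only works when the grouped frame and the district frame list the same areas in the same order, and it fails with a length mismatch if Foursquare returns no venues for some area. A minimal sketch of attaching labels by area name instead is shown below. Both helper names are hypothetical, and the reordered area list with the extra `Santiago de Surco_99999` entry is made up for illustration.

```python
def labels_by_area(grouped_areas, labels):
    """Pair each clustered area name with its cluster label."""
    if len(grouped_areas) != len(labels):
        raise ValueError("one label per grouped area is required")
    return dict(zip(grouped_areas, labels))


def attach_labels(all_areas, label_map, missing=None):
    """Return one label per area, using `missing` for areas that were not clustered."""
    return [label_map.get(area, missing) for area in all_areas]


# Areas in the order of the grouped frame, with the labels shown in the table above.
grouped_areas = [
    "Santiago de Surco_15038",
    "Santiago de Surco_15039",
    "Santiago de Surco_15049",
    "Santiago de Surco_15054",
]
label_map = labels_by_area(grouped_areas, [0, 0, 1, 2])

# Hypothetical district frame: different order, plus one area without venues.
all_areas = [
    "Santiago de Surco_15054",
    "Santiago de Surco_15038",
    "Santiago de Surco_99999",
]
attached = attach_labels(all_areas, label_map)
print(attached)
```

In pandas, the same idea amounts to adding the labels to the grouped frame and then doing a left merge on `Area`, so unclustered areas end up with a missing label rather than a wrong one.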


In [40]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Finally, let's visualize the resulting clusters

#get the geographical coordinates of Santiago de Surco
address = 'Santiago de Surco, Lima'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Santiago de Surco using latitude and longitude values
Surco_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Surco_merged['Latitude'], Surco_merged['Longitude'], Surco_merged['Area'], Surco_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(Surco_clusters)

Surco_clusters

## Analyze Callao

In [26]:

# one hot encoding
Callao_onehot = pd.get_dummies(Callao_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Callao_onehot['Area'] = Callao_venues['Area'] 

# move neighborhood column to the first column
fixed_columns = [Callao_onehot.columns[-1]] + list(Callao_onehot.columns[:-1])
Callao_onehot = Callao_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(Callao_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
Callao_grouped = Callao_onehot.groupby('Area').mean().reset_index()

#examine the new dataframe size after one hot encoding
print('{} rows were returned after grouping.'.format(Callao_grouped.shape[0]))


42 rows were returned after one hot encoding.
5 rows were returned after grouping.


In [27]:
#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in Callao_grouped['Area']:
    print("----"+hood+"----")
    temp = Callao_grouped[Callao_grouped['Area'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Callao_7001----
                  venue  freq
0           Coffee Shop  0.09
1                 Hotel  0.09
2          Liquor Store  0.05
3  Gym / Fitness Center  0.05
4                 Trail  0.05


----Callao_7026----
               venue  freq
0             Market   0.2
1          BBQ Joint   0.2
2      Big Box Store   0.2
3        Fish Market   0.2
4  Electronics Store   0.2


----Callao_7031----
              venue  freq
0           Airport   0.5
1    Breakfast Spot   0.5
2            Market   0.0
3             Trail   0.0
4  Sushi Restaurant   0.0


----Callao_7036----
                venue  freq
0               Plaza  0.25
1        Burger Joint  0.17
2  Chinese Restaurant  0.08
3              Bakery  0.08
4                Park  0.08


----Callao_7041----
              venue  freq
0            Bakery   1.0
1           Airport   0.0
2            Market   0.0
3             Trail   0.0
4  Sushi Restaurant   0.0




In [28]:
#create the new dataframe and display the top 8 venues for each neighborhood
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Area']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['Area'] = Callao_grouped['Area']

for ind in np.arange(Callao_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Callao_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted.head()

| | Area | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Callao_7001 | Hotel | Coffee Shop | Mediterranean Restaurant | Bakery | Boxing Gym | Breakfast Spot | Café | Chinese Restaurant |
| 1 | Callao_7026 | BBQ Joint | Big Box Store | Electronics Store | Market | Fish Market | Train Station | Gym / Fitness Center | Bakery |
| 2 | Callao_7031 | Airport | Breakfast Spot | Trail | BBQ Joint | Bakery | Big Box Store | Boxing Gym | Burger Joint |
| 3 | Callao_7036 | Plaza | Burger Joint | Sushi Restaurant | Bakery | Restaurant | Pizza Place | Chinese Restaurant | Peruvian Restaurant |
| 4 | Callao_7041 | Bakery | Train Station | Trail | BBQ Joint | Big Box Store | Boxing Gym | Breakfast Spot | Burger Joint |
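
Some lower-ranked entries in this table come from categories whose frequency is 0.0. Callao_7041, for example, shows Bakery at 1.0 and every other listed category at 0.0 in the frequency printout, so its 2nd to 8th entries appear to be tie-break fillers rather than venues that were found. The sketch below shows one way to rank only the categories that occur, assuming the frequencies are available as a plain dictionary. The helpers `ordinal` and `rank_nonzero` are hypothetical and are not the notebook's `return_most_common_venues`.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def rank_nonzero(freqs, num_top):
    """Rank categories by frequency, keeping only those with a frequency above zero.

    Slots that would otherwise be filled by zero-frequency categories are set to None.
    """
    present = [(name, freq) for name, freq in freqs.items() if freq > 0]
    present.sort(key=lambda item: item[1], reverse=True)
    names = [name for name, _ in present[:num_top]]
    return names + [None] * (num_top - len(names))


# Frequencies printed earlier for Callao_7041 (only the rows shown in that output).
callao_7041 = {"Bakery": 1.0, "Airport": 0.0, "Market": 0.0, "Trail": 0.0, "Sushi Restaurant": 0.0}

columns = [f"{ordinal(i)} Most Common Venue" for i in range(1, 4)]
print(dict(zip(columns, rank_nonzero(callao_7041, 3))))
```

Leaving empty slots as `None` makes it visible that an area has fewer distinct categories than `num_top`, instead of filling the table with categories that were never observed there.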


In [29]:
# K-means clustering for Callao

# set number of clusters
kclusters = 3

Callao_grouped_clustering = Callao_grouped.drop(columns='Area')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Callao_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# create a new dataframe that includes the cluster label as well as the top 8 venues for each area
Callao_merged = df_Callao_d.copy()

# add clustering labels
Callao_merged['Cluster Labels'] = kmeans.labels_

# join areas_venues_sorted onto Callao_merged to add the most common venues for each area
Callao_merged = Callao_merged.join(areas_venues_sorted.set_index('Area'), on='Area')

Callao_merged.head() # check the last columns!

| | Postal Code | District | Area | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7001 | Callao | Callao_7001 | 37.926805 | -122.056054 | 1 | Hotel | Coffee Shop | Mediterranean Restaurant | Bakery | Boxing Gym | Breakfast Spot | Café | Chinese Restaurant |
| 1 | 7026 | Callao | Callao_7026 | -12.043178 | -77.11256 | 1 | BBQ Joint | Big Box Store | Electronics Store | Market | Fish Market | Train Station | Gym / Fitness Center | Bakery |
| 2 | 7031 | Callao | Callao_7031 | -12.00575 | -77.119752 | 2 | Airport | Breakfast Spot | Trail | BBQ Joint | Bakery | Big Box Store | Boxing Gym | Burger Joint |
| 3 | 7036 | Callao | Callao_7036 | -12.013203 | -77.099613 | 1 | Plaza | Burger Joint | Sushi Restaurant | Bakery | Restaurant | Pizza Place | Chinese Restaurant | Peruvian Restaurant |
| 4 | 7041 | Callao | Callao_7041 | -12.025722 | -77.098174 | 0 | Bakery | Train Station | Trail | BBQ Joint | Big Box Store | Boxing Gym | Breakfast Spot | Burger Joint |
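
Assigning `kmeans.labels_` directly to a column of `Callao_merged` gives correct results only when that table has the same rows, in the same order, as `Callao_grouped`. The sketch below illustrates a key-based alternative in plain Python, matching labels to rows through the `Area` value. The function `attach_labels` and the sample inputs are hypothetical and only meant to show the idea.

```python
def attach_labels(rows, clustered_areas, labels):
    """Attach a cluster label to each row by matching on 'Area', not on row position.

    Rows whose area was not clustered (for example, no venues were found) get None.
    """
    label_by_area = dict(zip(clustered_areas, labels))
    return [{**row, "Cluster Labels": label_by_area.get(row["Area"])} for row in rows]


# Hypothetical inputs: three areas in the source table, only two of them clustered.
rows = [{"Area": "Callao_7001"}, {"Area": "Callao_7026"}, {"Area": "Callao_7031"}]
clustered_areas = ["Callao_7001", "Callao_7031"]  # row order of the grouped table
labels = [1, 2]                                   # same order as the clustering output

for row in attach_labels(rows, clustered_areas, labels):
    print(row)
```

In pandas, the same idea would be to store the labels next to `Area` in the grouped table and join on `Area`, so that a missing or reordered area cannot shift labels onto the wrong row.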


In [30]:
# Finally, visualize the resulting clusters

# get the geographical coordinates of Callao
address = 'Callao, Callao'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
Callao_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Callao_merged['Latitude'], Callao_merged['Longitude'], Callao_merged['Area'], Callao_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(Callao_clusters)
       
Callao_clusters
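
Before reading the map, it is worth checking that every marker falls inside Peru. In the merged table for Callao, the row for Callao_7001 has latitude 37.926805 and longitude -122.056054, which lies in the northern hemisphere and far from Callao, so that point was probably geocoded to a different place. The sketch below flags such rows. The bounding box is a rough assumption made for illustration, and `PERU_BOUNDS` and `inside_bounds` are hypothetical names.

```python
# Approximate bounding box for Peru in decimal degrees (an assumption for
# illustration, not an authoritative boundary).
PERU_BOUNDS = {"lat": (-18.5, 0.0), "lon": (-81.5, -68.5)}


def inside_bounds(lat, lon, bounds=PERU_BOUNDS):
    """Return True when the point lies inside the given latitude/longitude box."""
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Coordinates copied from the merged Callao table.
areas = [
    ("Callao_7001", 37.926805, -122.056054),
    ("Callao_7026", -12.043178, -77.11256),
    ("Callao_7031", -12.00575, -77.119752),
    ("Callao_7036", -12.013203, -77.099613),
    ("Callao_7041", -12.025722, -77.098174),
]

suspect = [name for name, lat, lon in areas if not inside_bounds(lat, lon)]
print(suspect)  # ['Callao_7001']
```

A flagged area would need to be geocoded again, and its venues fetched again, before its cluster assignment can be interpreted.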

## 6. Results and Discussion <a name="item6"></a>

In [31]:
# Cluster 1 (label 0) for Lima (Santiago de Surco)
Surco_merged.loc[Surco_merged['Cluster Labels'] == 0, Surco_merged.columns[[2] + list(range(5, Surco_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Santiago de Surco_15038 | 0 | Bakery | Seafood Restaurant | Peruvian Restaurant | Pool | Snack Place | Wings Joint | Sandwich Place | Italian Restaurant |
| 1 | Santiago de Surco_15039 | 0 | Pizza Place | Sandwich Place | Coffee Shop | Fast Food Restaurant | Park | Chinese Restaurant | Bakery | Peruvian Restaurant |


In [32]:
# Cluster 2 (label 1) for Lima (Santiago de Surco)
Surco_merged.loc[Surco_merged['Cluster Labels'] == 1, Surco_merged.columns[[2] + list(range(5, Surco_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Santiago de Surco_15049 | 1 | Nightclub | Athletics & Sports | Pub | Soccer Stadium | Trail | Donut Shop | Bakery | Music Venue |


In [33]:
# Cluster 3 (label 2) for Lima (Santiago de Surco)
Surco_merged.loc[Surco_merged['Cluster Labels'] == 2, Surco_merged.columns[[2] + list(range(5, Surco_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Santiago de Surco_15054 | 2 | Asian Restaurant | Fried Chicken Joint | BBQ Joint | Gym / Fitness Center | Burger Joint | Park | Peruvian Restaurant | Dog Run |


In [34]:
# Cluster 1 (label 0) for Callao
Callao_merged.loc[Callao_merged['Cluster Labels'] == 0, Callao_merged.columns[[2] + list(range(5, Callao_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Callao_7041 | 0 | Bakery | Train Station | Trail | BBQ Joint | Big Box Store | Boxing Gym | Breakfast Spot | Burger Joint |


In [35]:
# Cluster 2 (label 1) for Callao
Callao_merged.loc[Callao_merged['Cluster Labels'] == 1, Callao_merged.columns[[2] + list(range(5, Callao_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Callao_7001 | 1 | Hotel | Coffee Shop | Mediterranean Restaurant | Bakery | Boxing Gym | Breakfast Spot | Café | Chinese Restaurant |
| 1 | Callao_7026 | 1 | BBQ Joint | Big Box Store | Electronics Store | Market | Fish Market | Train Station | Gym / Fitness Center | Bakery |
| 3 | Callao_7036 | 1 | Plaza | Burger Joint | Sushi Restaurant | Bakery | Restaurant | Pizza Place | Chinese Restaurant | Peruvian Restaurant |


In [37]:
# Cluster 3 (label 2) for Callao
Callao_merged.loc[Callao_merged['Cluster Labels'] == 2, Callao_merged.columns[[2] + list(range(5, Callao_merged.shape[1]))]]

| | Area | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Callao_7031 | 2 | Airport | Breakfast Spot | Trail | BBQ Joint | Bakery | Big Box Store | Boxing Gym | Burger Joint |


In the district of Santiago de Surco (Lima), the most common venue categories relate to restaurants, sports, pubs, parks, and fast food.

In the district of Callao (Callao), the most common venue categories relate to restaurants, gyms, fast food, hotels, markets, and the airport.
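
The category lists in the cluster tables can also be compared directly. The sketch below collects every category that appears in the top-venue lists of each district and reports which are shared and which are unique to one side. The venue lists are copied from the tables in this section, while `all_categories` and the variable names are illustrative. Lower-ranked entries may include categories with zero frequency, so the result is a rough indication only.

```python
# Top-venue lists copied from the cluster tables (one list per area).
surco = {
    "Santiago de Surco_15038": ["Bakery", "Seafood Restaurant", "Peruvian Restaurant", "Pool",
                                "Snack Place", "Wings Joint", "Sandwich Place", "Italian Restaurant"],
    "Santiago de Surco_15039": ["Pizza Place", "Sandwich Place", "Coffee Shop", "Fast Food Restaurant",
                                "Park", "Chinese Restaurant", "Bakery", "Peruvian Restaurant"],
    "Santiago de Surco_15049": ["Nightclub", "Athletics & Sports", "Pub", "Soccer Stadium",
                                "Trail", "Donut Shop", "Bakery", "Music Venue"],
    "Santiago de Surco_15054": ["Asian Restaurant", "Fried Chicken Joint", "BBQ Joint",
                                "Gym / Fitness Center", "Burger Joint", "Park",
                                "Peruvian Restaurant", "Dog Run"],
}
callao = {
    "Callao_7001": ["Hotel", "Coffee Shop", "Mediterranean Restaurant", "Bakery",
                    "Boxing Gym", "Breakfast Spot", "Café", "Chinese Restaurant"],
    "Callao_7026": ["BBQ Joint", "Big Box Store", "Electronics Store", "Market",
                    "Fish Market", "Train Station", "Gym / Fitness Center", "Bakery"],
    "Callao_7031": ["Airport", "Breakfast Spot", "Trail", "BBQ Joint",
                    "Bakery", "Big Box Store", "Boxing Gym", "Burger Joint"],
    "Callao_7036": ["Plaza", "Burger Joint", "Sushi Restaurant", "Bakery",
                    "Restaurant", "Pizza Place", "Chinese Restaurant", "Peruvian Restaurant"],
    "Callao_7041": ["Bakery", "Train Station", "Trail", "BBQ Joint",
                    "Big Box Store", "Boxing Gym", "Breakfast Spot", "Burger Joint"],
}


def all_categories(city):
    """Collect every category that appears in any area's top-venue list."""
    return {category for venues in city.values() for category in venues}


surco_categories = all_categories(surco)
callao_categories = all_categories(callao)

shared = surco_categories & callao_categories
only_surco = surco_categories - callao_categories
only_callao = callao_categories - surco_categories

print("Shared:", sorted(shared))
print("Only Santiago de Surco:", sorted(only_surco))
print("Only Callao:", sorted(only_callao))
```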

## 7. Conclusion <a name="item7"></a>

The analysis suggests that, in the areas sampled, both cities share venue categories associated with business, gastronomy, and tourism.

Lima is the capital of Peru, while Callao is a major center of international exchange thanks to its airport and port.