## Clustering of Boroughs based on Cultural events on the city of Lima

#### Introduction

##### Overview
The goal of this project is to group districts of my city based on cultural activities like concerts, expositions, theatre, festivals, etc., and also on the entertainment venues around it. In order to determinate what is the most fun district to go a have a good time and to choose which district fit more to our preferences.


##### Problem
Information about scheduled cultural events is needed, even those that have already passed. That data will help us to determinate which districts have the most cultural activities, which can be used for recommendations systems


##### Interest
People with specific interests can use this information to select which place is more fun to go out. Also for foreign people who want to know more about Lima, and use this classification when they decide to make a trip to visit.

#### Data
It can be done a clustering with only Foursquare's cultural venues around a district, but for more accurate precision, it is a good idea to include information about past and future events. Fortunately, I found a web page where it can extract it (here), it contains tables with information like the type of event, address, name, district, and price. For simplicity, I will use only information around March and Abril of 2019.

First, I scraped this web page List of districts, to get the districts of Lima. I used all the tool I learned so far to return coordinates and to search venues with Foursquare, but we need to just filter by cultural venues. The more difficult part is to scrap the cultural agenda web page because it is needed to make a loop to search for every single day around two months. 

For the features, I decided to count the numbers of cultural venues that Foursquare will give me, and also the count for every type of cultural event from the cultural agenda web page.



In [37]:
## Import packages to scrap the internet

import pandas as pd
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim 
import folium
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors



#### Get a list of districs of Lima
scrap wikipedia https://en.wikipedia.org/wiki/List_of_districts_of_Lima


In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_districts_of_Lima'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')


In [5]:
# temporarily output
#print(soup.prettify())

table_rows = soup.table.tbody.find_all('tr')
df_districts = pd.DataFrame(columns = ['District','PostalCode'])
for row in table_rows:
    row_list = []
    for item in row.find_all('td'):
        row_list.append(item.text.strip())
    if len(row_list) == 8:
        df_districts = df_districts.append({'District':row_list[0],'PostalCode':row_list[6]}, ignore_index=True)
    
df_districts
## save to csv file
#df_districts.to_csv('districts_lima.csv')


## change name of district Lima for 'Cercado de Lima', for compatibility
df_districts.replace({'Lima': 'Cercado de Lima'}, inplace=True)
df_districts


## get lat lon

df_districts_latlon = pd.read_csv('latlon.csv', index_col=0)
df_districts_latlon.drop('PostalCode', axis=1, inplace=True)
df_districts_latlon

Unnamed: 0,District,latitude,longitude
0,Ancón,-11.714147,-77.111861
1,Ate,-12.038249,-76.893745
2,Barranco,-12.144676,-77.023201
3,Breña,-12.05996,-77.052672
4,Carabayllo,-11.809484,-76.999271
5,Chaclacayo,-11.988587,-76.759217
6,Chorrillos,-12.195981,-77.012527
7,Cieneguilla,-12.076801,-76.77911
8,Comas,-11.933186,-77.045026
9,El Agustino,-12.044897,-76.99875


#### Get data from data agenda web site
In this section I want to retrieve all the information from  https://www.enlima.pe/ 

In [68]:
## test for a single date
## url = https://www.enlima.pe/calendario-cultural/dia/2019-03-01

url_agenda = 'https://www.enlima.pe/calendario-cultural/dia/2019-03-01'

response_agenda = requests.get(url_agenda)
soup_agenda = BeautifulSoup(response_agenda.text, 'html.parser')

In [69]:
# temporarily output
#print(soup_agenda.prettify())

table_rows = soup_agenda.table.tbody.find_all('tr')
df_agenda = pd.DataFrame(columns = ['Time','Type', 'EventName', 'Place', 'District', 'Price'])

for row in table_rows:
    row_list = []
    for item in row.find_all('td'):
        row_list.append(item.text.strip())
    if len(row_list) == 6:
        df_agenda = df_agenda.append(
            {'Time':row_list[0],
             'Type':row_list[1],
             'EventName': row_list[2],
             'Place': row_list[3],
             'District': row_list[4],
             'Price': row_list[5]}, ignore_index=True)
        

## events on Lima
df_agenda = df_agenda[df_agenda['District'] == 'Lima']
df_agenda


Unnamed: 0,Time,Type,EventName,Place,District,Price
2,,Cine,Suspiria,Varias sedes - Lima,Lima,S/ 25
3,,Cine,Siempre serás mi hijo,Salas de cine comercial,Lima,S/ 25
10,10:00 am,Exposición,Zimoun,Espacio Fundación Telefónica,Lima,GRATIS
18,11:00 am,Exposición,Festival Internacional de Acuarela IWS - ICPNA...,Varias sedes - Lima,Lima,GRATIS
20,5:00 pm,Otros,Kontedores,Kontenedores (Boulevard de Asia),Lima,
21,6:00 pm,Taller,Film 16 Milímetros,Espacio Fundación Telefónica,Lima,GRATIS
25,8:30 pm,Teatro,¿Qué hacemos con Walter?,Teatro Luigi Pirandello,Lima,S/ 30 a S/ 95


#### Make a loop for the days on March and Abril


In [65]:
## function to get url 
def get_url_enlima(month=3, day=1):
    return 'https://www.enlima.pe/calendario-cultural/dia/2019-{0:02d}-{1:02d}'.format(month, day)

print(get_url_enlima(3,1))
print(get_url_enlima(3,15))



https://www.enlima.pe/calendario-cultural/dia/2019-03-01
https://www.enlima.pe/calendario-cultural/dia/2019-03-15


In [71]:
df_agenda_list = []
for month in range(2,6):
    for day in range(1,32):## 31 days on March 30 days on Abril
        
        if month == 2 and day > 28:
            pass
        if month == 4 and day == 31:
            pass
        else:
            url_agenda = get_url_enlima(month=month, day=day)
            print(url_agenda)
            response_agenda = requests.get(url_agenda)
            soup_agenda = BeautifulSoup(response_agenda.text, 'html.parser')
            table_rows = soup_agenda.table.tbody.find_all('tr')
            df_agenda = pd.DataFrame(columns = ['Time','Type', 'EventName', 'Place', 'District', 'Price'])

            for row in table_rows:
                row_list = []
                for item in row.find_all('td'):
                    row_list.append(item.text.strip())
                if len(row_list) == 6:
                    df_agenda = df_agenda.append(
                        {'Time':row_list[0],
                         'Type':row_list[1],
                         'EventName': row_list[2],
                         'Place': row_list[3],
                         'District': row_list[4],
                         'Price': row_list[5]}, ignore_index=True)
            df_agenda_list.append(df_agenda)
            
            

https://www.enlima.pe/calendario-cultural/dia/2019-02-01
https://www.enlima.pe/calendario-cultural/dia/2019-02-02
https://www.enlima.pe/calendario-cultural/dia/2019-02-03
https://www.enlima.pe/calendario-cultural/dia/2019-02-04
https://www.enlima.pe/calendario-cultural/dia/2019-02-05
https://www.enlima.pe/calendario-cultural/dia/2019-02-06
https://www.enlima.pe/calendario-cultural/dia/2019-02-07
https://www.enlima.pe/calendario-cultural/dia/2019-02-08
https://www.enlima.pe/calendario-cultural/dia/2019-02-09
https://www.enlima.pe/calendario-cultural/dia/2019-02-10
https://www.enlima.pe/calendario-cultural/dia/2019-02-11
https://www.enlima.pe/calendario-cultural/dia/2019-02-12
https://www.enlima.pe/calendario-cultural/dia/2019-02-13
https://www.enlima.pe/calendario-cultural/dia/2019-02-14
https://www.enlima.pe/calendario-cultural/dia/2019-02-15
https://www.enlima.pe/calendario-cultural/dia/2019-02-16
https://www.enlima.pe/calendario-cultural/dia/2019-02-17
https://www.enlima.pe/calendari

In [72]:
# check the lenght
print(len(df_agenda_list))

types_of_events = []
districts = []
for df_agenda in df_agenda_list:
    types_of_events.extend(df_agenda['Type'].unique())
    districts.extend(df_agenda['District'].unique())
    
types_of_events = list(set(types_of_events))
districts = list(set(districts))

print(len(districts))
districts

123
20


['Punta Hermosa',
 'San Miguel',
 'Pueblo Libre',
 'Breña',
 'Comas',
 'Lima',
 'San Isidro',
 'Callao',
 'Magdalena del Mar',
 'Barranco',
 'Lince',
 'San Borja',
 'Santiago de Surco',
 'Chorrillos',
 'Cercado Callao',
 'Jesús María',
 'Miraflores',
 'Ica',
 'Cercado de Lima',
 'Surquillo']

#### Filter by Disctrict

From the scraping process, we obtain many events, filter by district to see first insights

In [73]:
# function to get cultural agenda by district name
def get_events_by_district(district='Cercado de Lima'):
    df_agenda_district = pd.DataFrame(columns = ['Time','Type', 'EventName', 'Place', 'District', 'Price'])
    for df_agenda in df_agenda_list:
        df_agenda_district = df_agenda_district.append(df_agenda[df_agenda['District'] == district], ignore_index=True)
    return df_agenda_district

In [74]:
df_cercado_Lima = get_events_by_district()
df_cercado_Lima.head()



Unnamed: 0,Time,Type,EventName,Place,District,Price
0,9:00 am,Exposición,Papeles sobre el suelo,Instituto Italiano de Cultura,Cercado de Lima,GRATIS
1,10:00 am,Exposición,Lima. 484 aniversario,Galería Municipal Pancho Fierro,Cercado de Lima,GRATIS
2,7:00 pm,Teatro,Laberinto,AAA Asociación de Artistas Aficionados,Cercado de Lima,S/ 15
3,8:00 pm,Teatro,Laberinto,AAA Asociación de Artistas Aficionados,Cercado de Lima,S/ 15
4,9:00 am,Exposición,Papeles sobre el suelo,Instituto Italiano de Cultura,Cercado de Lima,GRATIS


In [75]:
## let's get some insights on District of Cercado de Lima
df_cercado_Lima.dtypes

Time         object
Type         object
EventName    object
Place        object
District     object
Price        object
dtype: object

In [76]:
# how many types of events exists

df_cercado_Lima['Type'].value_counts()


Exposición               282
Teatro                    41
Niños                     39
Cine                      16
Otros                     15
Conciertos                13
Artes Escénicas            5
Artes Escénicas, Cine      2
Taller                     1
Name: Type, dtype: int64

In [81]:
#df_cercado_Lima['EventName'].value_counts()

In [82]:
#df_cercado_Lima['Place'].value_counts()

In [77]:

# one hot encoding
cercadoLima_onehot = pd.get_dummies(df_cercado_Lima[['Type']], prefix="", prefix_sep="")

cercadoLima_onehot['District'] = df_cercado_Lima['District'] 

# move neighborhood column to the first column
fixed_columns = [cercadoLima_onehot.columns[-1]] + list(cercadoLima_onehot.columns[:-1])
cercadoLima_onehot = cercadoLima_onehot[fixed_columns]

cercadoLima_onehot.head()

Unnamed: 0,District,Artes Escénicas,"Artes Escénicas, Cine",Cine,Conciertos,Exposición,Niños,Otros,Taller,Teatro
0,Cercado de Lima,0,0,0,0,1,0,0,0,0
1,Cercado de Lima,0,0,0,0,1,0,0,0,0
2,Cercado de Lima,0,0,0,0,0,0,0,0,1
3,Cercado de Lima,0,0,0,0,0,0,0,0,1
4,Cercado de Lima,0,0,0,0,1,0,0,0,0


In [78]:
cercadoLima_grouped = cercadoLima_onehot.groupby('District').mean().reset_index()

In [79]:
cercadoLima_grouped


Unnamed: 0,District,Artes Escénicas,"Artes Escénicas, Cine",Cine,Conciertos,Exposición,Niños,Otros,Taller,Teatro
0,Cercado de Lima,0.012077,0.004831,0.038647,0.031401,0.681159,0.094203,0.036232,0.002415,0.099034


#### Get average type events for District


In [81]:
df_district_avg_events = pd.DataFrame(columns = ['District'] + types_of_events)

for district in districts:
    print("District name = ", district)
    df_events_by_district = get_events_by_district(district)
    events_onehot = pd.get_dummies(df_events_by_district[['Type']], prefix="", prefix_sep="")
    events_onehot['District'] = df_events_by_district['District'] 
   
    events_grouped = events_onehot.groupby('District').mean().reset_index()

    df_district_avg_events = df_district_avg_events.append(events_grouped, ignore_index=True)


df_district_avg_events.fillna(0, inplace=True)

df_district_avg_events = df_district_avg_events[df_district_avg_events['District'] != 'Ica']
df_district_avg_events = df_district_avg_events[['District'] + types_of_events]
df_district_avg_events


District name =  Punta Hermosa
District name =  San Miguel
District name =  Pueblo Libre
District name =  Breña
District name =  Comas
District name =  Lima
District name =  San Isidro
District name =  Callao
District name =  Magdalena del Mar
District name =  Barranco
District name =  Lince
District name =  San Borja
District name =  Santiago de Surco
District name =  Chorrillos
District name =  Cercado Callao
District name =  Jesús María
District name =  Miraflores
District name =  Ica
District name =  Cercado de Lima
District name =  Surquillo


Unnamed: 0,District,Cine,Taller,Exposición,Otros,Niños,Teatro,"Artes Escénicas, Cine",Conciertos,Artes Escénicas
0,Punta Hermosa,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,San Miguel,0.0,0.2,0.4,0.0,0.0,0.0,0.0,0.3,0.1
2,Pueblo Libre,0.0,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0
3,Breña,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Comas,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,Lima,0.134892,0.068345,0.303957,0.156475,0.070144,0.160072,0.0,0.084532,0.021583
6,San Isidro,0.225705,0.0,0.652038,0.012539,0.043887,0.047022,0.0,0.003135,0.015674
7,Callao,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8,Magdalena del Mar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,Barranco,0.006726,0.073991,0.784753,0.008969,0.0,0.089686,0.0,0.033632,0.002242


In [82]:
df_district_avg_events.dtypes

District                  object
Cine                     float64
Taller                   float64
Exposición               float64
Otros                    float64
Niños                    float64
Teatro                   float64
Artes Escénicas, Cine    float64
Conciertos               float64
Artes Escénicas          float64
dtype: object

#### Merge dataframes with latitude and longitude



In [83]:
df_lima = df_districts_latlon.join(df_district_avg_events.set_index('District'), on='District')
df_lima.fillna(0, inplace=True)
df_lima


Unnamed: 0,District,latitude,longitude,Cine,Taller,Exposición,Otros,Niños,Teatro,"Artes Escénicas, Cine",Conciertos,Artes Escénicas
0,Ancón,-11.714147,-77.111861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ate,-12.038249,-76.893745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Barranco,-12.144676,-77.023201,0.006726,0.073991,0.784753,0.008969,0.0,0.089686,0.0,0.033632,0.002242
3,Breña,-12.05996,-77.052672,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carabayllo,-11.809484,-76.999271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Chaclacayo,-11.988587,-76.759217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Chorrillos,-12.195981,-77.012527,0.23913,0.0,0.086957,0.673913,0.0,0.0,0.0,0.0,0.0
7,Cieneguilla,-12.076801,-76.77911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Comas,-11.933186,-77.045026,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,El Agustino,-12.044897,-76.99875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [84]:
CLIENT_ID = 'WNTCUUMRO1SFIFXDKRLVZBRDY3FXATPKL1NL2QDAJYZRN1IV' # your Foursquare ID
CLIENT_SECRET = '1JB0KD13UETR1CWTGRZBUF0I1HETPKFY2SA2U01ZM4D5U3V2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


Your credentails:
CLIENT_ID: WNTCUUMRO1SFIFXDKRLVZBRDY3FXATPKL1NL2QDAJYZRN1IV
CLIENT_SECRET:1JB0KD13UETR1CWTGRZBUF0I1HETPKFY2SA2U01ZM4D5U3V2


In [85]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [86]:
lima_venues = getNearbyVenues(names=df_lima['District'],
                                   latitudes=df_lima['latitude'],
                                   longitudes=df_lima['longitude']
                                  )

Ancón
Ate
Barranco
Breña
Carabayllo
Chaclacayo
Chorrillos
Cieneguilla
Comas
El Agustino
Independencia
Jesús María
La Molina
La Victoria
Cercado de Lima
Lince
Los Olivos
Lurigancho
Lurín
Magdalena del Mar
Miraflores
Pachacamac
Pucusana
Pueblo Libre
Puente Piedra
Punta Hermosa
Punta Negra
Rímac
San Bartolo
San Borja
San Isidro
San Juan de Lurigancho
San Juan de Miraflores
San Luis
San Martín de Porres
San Miguel
Santa Anita
Santa María del Mar District
Santa Rosa
Santiago de Surco
Surquillo
Villa El Salvador
Villa María del Triunfo
Cercado Callao


In [87]:
lima_venues.shape

(2091, 7)

In [88]:
lima_venues.rename(columns={'Neighborhood': 'District'}, inplace=True)


In [89]:
# one hot encoding
lima_onehot = pd.get_dummies(lima_venues[['Venue Category']], prefix="", prefix_sep="")
lima_onehot['District'] = lima_venues['District'] 

# move neighborhood column to the first column
fixed_columns = [lima_onehot.columns[-1]] + list(lima_onehot.columns[:-1])
lima_onehot = lima_onehot[fixed_columns]

lima_onehot.head()

Unnamed: 0,District,Airport,American Restaurant,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Ancón,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ate,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Ate,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Ate,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Ate,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [90]:
lima_grouped = lima_onehot.groupby('District').mean().reset_index()
lima_grouped

Unnamed: 0,District,Airport,American Restaurant,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Ancón,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Barranco,0.0,0.0,0.0,0.0,0.02,0.02,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0
3,Breña,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carabayllo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Cercado Callao,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0
6,Cercado de Lima,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Chaclacayo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Chorrillos,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Cieneguilla,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [91]:
df_lima = df_lima.join(lima_grouped.set_index('District'), on='District')
df_lima


Unnamed: 0,District,latitude,longitude,Cine,Taller,Exposición,Otros,Niños,Teatro,"Artes Escénicas, Cine",...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Ancón,-11.714147,-77.111861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ate,-12.038249,-76.893745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Barranco,-12.144676,-77.023201,0.006726,0.073991,0.784753,0.008969,0.0,0.089686,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0
3,Breña,-12.05996,-77.052672,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carabayllo,-11.809484,-76.999271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Chaclacayo,-11.988587,-76.759217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Chorrillos,-12.195981,-77.012527,0.23913,0.0,0.086957,0.673913,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Cieneguilla,-12.076801,-76.77911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Comas,-11.933186,-77.045026,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,El Agustino,-12.044897,-76.99875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [92]:
#df_lima.drop('latitude', axis=1, inplace=True)
#df_lima.drop('longitude', axis=1, inplace=True)
df_lima.describe()

Unnamed: 0,latitude,longitude,Cine,Taller,Exposición,Otros,Niños,Teatro,"Artes Escénicas, Cine",Conciertos,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
count,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,...,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0
mean,-12.089182,-76.980339,0.049639,0.021279,0.150748,0.08786,0.016122,0.013384,0.00011,0.041904,...,0.002278,0.000455,0.001818,0.000227,0.000455,0.000455,0.000455,0.001069,0.000227,0.002125
std,0.15283,0.107911,0.164543,0.082382,0.294552,0.239448,0.074215,0.049661,0.000728,0.159431,...,0.010576,0.002107,0.004952,0.001508,0.002107,0.002107,0.002107,0.004524,0.001508,0.007458
min,-12.487005,-77.169422,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-12.148801,-77.051989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-12.077092,-77.009454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,-12.017144,-76.920672,0.001682,0.0,0.021739,0.014939,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,-11.714147,-76.759217,1.0,0.5,1.0,1.0,0.483254,0.304428,0.004831,1.0,...,0.052632,0.01,0.02,0.01,0.01,0.01,0.01,0.027027,0.01,0.043478


#### Data Analysis and Feature Selection


In [53]:
#df_lima.corr()

In [93]:
from scipy import stats

pearson_coef, p_value = stats.pearsonr(df_lima['Bar'], df_lima['Conciertos'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

The Pearson Correlation Coefficient is -0.03366166275730461  with a P-value of P = 0.8282710401868173


In [94]:
pearson_coef, p_value = stats.pearsonr(df_lima['Stadium'], df_lima['Conciertos'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

The Pearson Correlation Coefficient is -0.04054556224043486  with a P-value of P = 0.7938506392393652


In [95]:
pearson_coef, p_value = stats.pearsonr(df_lima['Teatro'], df_lima['Theater'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

The Pearson Correlation Coefficient is 0.04447328325009404  with a P-value of P = 0.7743791111759418


#### Machine Learning, Clustering

we select K means beacause we wanna group similar districts without having any previus information or labeled districts

In [103]:

# set number of clusters
kclusters = 4
lima_grouped_clustering = df_lima.drop(['latitude','longitude','District'], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(lima_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
kmeans.labels_


array([3, 1, 0, 1, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       3, 2, 1, 2, 3, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 3, 0, 0, 1, 1, 0])

In [104]:
# add clustering labels
#df_lima.drop('Cluster Labels', axis=1, inplace=True)
df_lima.insert(0, 'Cluster Labels', kmeans.labels_)

df_lima.head()

Unnamed: 0,Cluster Labels,District,latitude,longitude,Cine,Taller,Exposición,Otros,Niños,Teatro,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,3,Ancón,-11.714147,-77.111861,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Ate,-12.038249,-76.893745,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,Barranco,-12.144676,-77.023201,0.006726,0.073991,0.784753,0.008969,0.0,0.089686,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0
3,1,Breña,-12.05996,-77.052672,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,Carabayllo,-11.809484,-76.999271,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
#df_lima.dtypes

#### Visualizations


In [105]:
latitude_lima = -12.04641
longitude_lima = -77.0449447

In [107]:


# create map
map_clusters = folium.Map(location=[latitude_lima, longitude_lima], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_lima['latitude'], df_lima['longitude'], df_lima['District'], df_lima['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
