# The Battle of Neighborhoods (Granada)

Granada is the capital city of the province of Granada, in the autonomous community of Andalusia, Spain. In this notebook we are going to explore, segment, and cluster the towns in the province of Granada.

In the 2005 national census, the population of the city of Granada proper was 236,982, and the population of the entire urban area was estimated to be 472,638, ranking as the 13th-largest urban area of Spain. About 3.3% of the population did not hold Spanish citizenship, the largest number of these people (31%; or 1% of the total population) coming from South America.

First of all, let's import all required libraries:

In [1]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes
!conda install lxml --yes

!pip install geocoder
!pip install xlrd

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import folium # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors
import json # library to handle JSON files
import geocoder # to get coordinates

from pandas.io.json import json_normalize # transforms a JSON file into a pandas dataframe
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from IPython.display import Image # libraries for displaying images
from IPython.core.display import HTML # libraries for displaying images
from sklearn.cluster import KMeans # import k-means from clustering stage


In [2]:
def get_latlng(NBH):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Granada, Granada, Spain'.format(NBH))
        lat_lng_coords = g.latlng
    return lat_lng_coords
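
The loop in `get_latlng` never exits if the geocoder keeps returning `None`. Below is a minimal sketch of a bounded variant. The name `get_latlng_bounded`, the `max_tries` and `pause` parameters, and the injected `geocode` callable are assumptions made for illustration so the retry logic can be exercised without network access.

```python
import time

def get_latlng_bounded(query, geocode, max_tries=5, pause=0.0):
    """Return [lat, lng] for `query`, or None after `max_tries` failed attempts.

    `geocode` is any callable that returns [lat, lng] or None (hypothetical
    interface, chosen so the sketch does not depend on a live service).
    """
    for _ in range(max_tries):
        coords = geocode(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before the next attempt
    return None


# Stand-in geocoder that fails twice before answering (illustrative only).
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [37.18, -3.59]

result = get_latlng_bounded("Albaicín, Granada, Granada, Spain", flaky_geocode)
never = get_latlng_bounded("nowhere", lambda q: None, max_tries=3)
```

In the notebook, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`, which is the same call the original function makes.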

## Data Collection

#### Municipalities in the province of Granada

Get the municipalities' information from the official website of the Junta de Andalucía, the body responsible for gathering statistical information about Andalusia.

In [3]:
dfM0 = pd.read_excel('https://www.juntadeandalucia.es/institutodeestadisticaycartografia/sima/datos/smex99.xls',skiprows=9,header=1)

Split latitude and longitude into separate columns

In [4]:
dfM0.insert(3,"Latitud" ,dfM0["Coordenadas del núcleo principal. 2019"].str.split(",", n = 1, expand = True)[0].astype(float))
dfM0.insert(4,"Longitud",dfM0["Coordenadas del núcleo principal. 2019"].str.split(",", n = 1, expand = True)[1].astype(float))
dfM0.drop(columns =["Coordenadas del núcleo principal. 2019"], inplace = True)
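
As a plain-Python illustration of what the split does to a single cell, here is a small sketch. It assumes the raw column holds strings of the form `"lat,lon"`; the helper name `split_coordinates` is hypothetical, and the sample value is the Agrón row shown further down.

```python
def split_coordinates(text):
    """Split a 'lat,lon' string at the first comma and convert both parts to float."""
    lat, lon = text.split(",", 1)
    return float(lat), float(lon)

agron = split_coordinates("37.030104,-3.829328")
```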

This DataFrame contains a large number of columns:

In [5]:
list(dfM0)

['Provincia',
 'CodMun',
 'Municipio',
 'Latitud',
 'Longitud',
 'Extensión superficial. 2019',
 'Perímetro. 2019',
 'Altitud sobre el nivel del mar. 2019',
 'Número de núcleos que componen el municipio. 2018',
 'Distancia a la capital. 2019',
 'Población total. 2018',
 'Población. Hombres. 2018',
 'Población. Mujeres. 2018',
 'Población en núcleos. 2018',
 'Población en diseminados. 2018',
 'Edad media. 2018',
 'Porcentaje de población menor de 20 años. 2018',
 'Porcentaje de población mayor de 65 años. 2018',
 'Incremento relativo de la población en diez años. 2018',
 'Número de extranjeros. 2018',
 'Principal procedencia de los extranjeros residentes. 2018',
 'Porcentaje que representa respecto total de extranjeros. 2018',
 'Emigraciones. 2017',
 'Inmigraciones. 2017',
 'Nacimientos. 2017',
 'Defunciones. 2017',
 'Matrimonios. 2017',
 'Centros de Infantil. 2016',
 'Centros de Primaria. 2016',
 'Centros de Enseñanza Secundaria Obligatoria. 2016',
 'Centros  de Bachillerato. 2016',
 '

We are going to work only with a few columns

In [6]:
dfM1 = dfM0[['Provincia','Municipio','Latitud','Longitud']]

Print the number of provinces and municipalities

In [7]:
print('The dataframe has {} provinces and {} municipalities.'.format(len(dfM0['Provincia'].unique()),len(dfM0['Municipio'].unique())))

The dataframe has 8 provinces and 785 municipalities.


Let's see the names of the provinces and how many municipalities there are in each one

In [8]:
dfM1.groupby('Provincia').count().iloc[:,0:1].sort_values(by='Municipio',ascending=False).reset_index()

|  | Provincia | Municipio |
|---|---|---|
| 0 | Granada | 174 |
| 1 | Sevilla | 106 |
| 2 | Almería | 103 |
| 3 | Málaga | 103 |
| 4 | Jaén | 97 |
| 5 | Huelva | 80 |
| 6 | Córdoba | 77 |
| 7 | Cádiz | 45 |


As the table shows, Granada has by far the most municipalities of any province in Andalusia.

Now let's select only the towns in the province of Granada:

In [9]:
dfM2 = dfM1[dfM1['Provincia'].str.contains("Granada")].reset_index(drop=True)
dfM2.head()

|  | Provincia | Municipio | Latitud | Longitud |
|---|---|---|---|---|
| 0 | Granada | Agrón | 37.030104 | -3.829328 |
| 1 | Granada | Alamedilla | 37.581725 | -3.244387 |
| 2 | Granada | Albolote | 37.230548 | -3.657339 |
| 3 | Granada | Albondón | 36.827281 | -3.211111 |
| 4 | Granada | Albuñán | 37.227319 | -3.13335 |


#### Neighborhoods in Granada

Get the neighborhoods' information from the Wikipedia page on the districts of Granada

In [10]:
dfN0 = pd.read_html('https://es.wikipedia.org/wiki/Distritos_de_Granada')[0].drop(columns='Ubicación').replace({" y ":", ","-":", ",r'\.': ''},regex=True)

Separate the neighborhoods into different cells

In [11]:
dfN1 = pd.DataFrame([sub.split(",") for sub in dfN0.Barrios])

Restructure the data into two columns: district and neighborhood

In [12]:
data=[]
for i in range(0, 8):
    for j in range(0, 8):
        data.append([dfN0.Distrito[j],dfN1[i][j]])

Create a DataFrame with the info above

In [13]:
dfN2 = pd.DataFrame(data,columns=['Distrito','Barrio'])

Remove null values

In [14]:
dfN3 = dfN2[dfN2["Barrio"].notnull()].reset_index(drop=True)
print('Granada has {} neighborhoods.'.format(len(dfN3['Barrio'].unique())))

Granada has 40 neighborhoods.
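
The reshaping in cells 10 to 14 can also be sketched in plain Python. The version below is an illustration, not the notebook's code: the sample rows are made up, the helper name `explode_districts` is hypothetical, and it adds whitespace stripping that the original cells do not perform. It also emits pairs district by district, whereas the notebook's nested loop emits them by position within each district.

```python
def explode_districts(rows):
    """Turn (district, 'a, b y c') rows into (district, neighborhood) pairs."""
    pairs = []
    for district, barrios in rows:
        # Same normalisation idea as cell 10: " y " and "-" become ", ", periods are dropped.
        cleaned = barrios.replace(" y ", ", ").replace("-", ", ").replace(".", "")
        for name in cleaned.split(","):
            name = name.strip()  # extra step: remove the spaces left by the split
            if name:
                pairs.append((district, name))
    return pairs

# Made-up sample rows, not the actual Wikipedia table.
sample = [("Distrito A", "Uno, Dos y Tres"), ("Distrito B", "Cuatro-Cinco.")]
pairs = explode_districts(sample)
```

Because each district's list is walked to its own length, this form does not rely on a fixed grid size such as the hard-coded `range(0, 8)`.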


Create a list with the coordinates of every neighborhood

In [15]:
LL = pd.DataFrame([get_latlng(NBH) for NBH in dfN3["Barrio"].tolist()], columns=['Latitude', 'Longitude'])

Add coordinates to the DataFrame

In [16]:
dfN3['Latitud' ] = LL['Latitude' ]
dfN3['Longitud'] = LL['Longitude']
dfN3.head()

|  | Distrito | Barrio | Latitud | Longitud |
|---|---|---|---|---|
| 0 | Albayzín | Albaicín | 37.18358 | -3.59253 |
| 1 | Beiro | Joaquina Eguaras | 37.20211 | -3.61112 |
| 2 | Centro | Centro | 37.17652 | -3.60395 |
| 3 | Chana | Angustias | 37.202432 | -3.609718 |
| 4 | Genil | Bola de Oro | 37.1628 | -3.58298 |


## Data Visualization

Get the coordinates of Granada

In [17]:
location = Nominatim(user_agent = "foursquare_agent").geocode('Granada, Granada, Spain')

In [18]:
# use a descriptive name instead of `map`, which shadows the Python built-in
map_granada = folium.Map(location = [location.latitude, location.longitude], zoom_start = 13)

# municipalities of the province in green
for lat, lng, M in zip(dfM2.Latitud, dfM2.Longitud, dfM2.Municipio):
    folium.features.CircleMarker([lat, lng],radius=5.0,color='green',popup=M).add_to(map_granada)
# neighborhoods of the city in blue
for lat, lng, N in zip(dfN3.Latitud, dfN3.Longitud, dfN3.Barrio):
    folium.features.CircleMarker([lat, lng],radius=2.5,color='blue' ,popup=N).add_to(map_granada)

map_granada

Now, let's concentrate on the neighborhoods of Granada (blue points)

## Getting Venues

Define Foursquare credentials

In [19]:
ID  = 'YOUR_FOURSQUARE_CLIENT_ID'     # placeholder: supply your own client ID
sec = 'YOUR_FOURSQUARE_CLIENT_SECRET' # placeholder: do not publish real secrets in a notebook
ver = '20180604'

Create the DataFrame of Venues

In [20]:
lim = 100 # maximum number of venues returned per request
rad = 500 # search radius around each neighborhood, in meters

venues_list = []

for N, lat, lon in zip(dfN3['Barrio'],dfN3['Latitud'],dfN3['Longitud']):
    # query the explore endpoint around the neighborhood's coordinates
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(ID,sec,ver,lat,lon,rad,lim)
    results = requests.get(url).json()["response"]['groups'][0]['items']
    # keep the venue name, coordinates, and first category for each result
    venues_list.append([(N,lat,lon,v['venue']['name'],v['venue']['location']['lat'],v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

# flatten the per-neighborhood lists into one row per venue
dfV = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
dfV.columns = ['Neighborhood','NLat','NLon','Venue','VLat','VLon','VCat']

print('There are {} venues'.format(dfV.shape[0]))

There are 1055 venues
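
The request loop indexes straight into the JSON response, so a missing `groups` entry or a venue without categories would raise an exception. The sketch below shows one defensive way to build the URL and extract rows. The helper names `build_explore_url` and `extract_venues` are hypothetical, and the payload shape is only what the notebook's own indexing implies, not a statement about the Foursquare API.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL used in the notebook.
BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Assemble the query string with urlencode instead of manual formatting."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)

def extract_venues(payload, neighborhood, lat, lon):
    """Return one tuple per venue; the category is None when a venue lists none."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lon, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Fake payload for illustration; not real API output.
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "B", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]}]}}
url = build_explore_url("ID", "SECRET", "20180604", 37.17652, -3.60395, 500, 100)
rows = extract_venues(fake, "Centro", 37.17652, -3.60395)
```

Rows with a `None` category would need to be dropped or relabeled before the one-hot encoding step.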


Let's find out how many unique categories can be curated from all the returned venues

In [21]:
print('There are {} unique categories.'.format(len(dfV['VCat'].unique())))

There are 108 unique categories.


Let's homogenize the categories by collapsing every category whose name contains a given keyword into a single label

In [22]:
dfV.loc[dfV['VCat'].str.contains("Restaurant"),'VCat'] = 'Restaurant'
dfV.loc[dfV['VCat'].str.contains("Bar"       ),'VCat'] = 'Bar'
dfV.loc[dfV['VCat'].str.contains("Cafe"      ),'VCat'] = 'Café'
dfV.loc[dfV['VCat'].str.contains("Stadium"   ),'VCat'] = 'Stadium'
dfV.loc[dfV['VCat'].str.contains("Gym"       ),'VCat'] = 'Gym'
dfV.loc[dfV['VCat'].str.contains("Shop"      ),'VCat'] = 'Shop'
dfV.loc[dfV['VCat'].str.contains("Pub"       ),'VCat'] = 'Pub'
dfV.loc[dfV['VCat'].str.contains("Store"     ),'VCat'] = 'Store'

print('There are {} homogenized categories.'.format(len(dfV['VCat'].unique())))

There are 61 homogenized categories.
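
The keyword rules are case-sensitive substring matches applied one after another, which has a few side effects worth knowing. The plain-Python sketch below mirrors the same rule list on single strings. The function name `homogenize` is hypothetical, and "Tapas Restaurant" and "Barbershop" are made-up inputs; "Gastropub", "Bookstore", and "Café" appear in the tables later in this notebook.

```python
# Same (keyword, label) pairs and order as in cell 22.
RULES = [("Restaurant", "Restaurant"), ("Bar", "Bar"), ("Cafe", "Café"),
         ("Stadium", "Stadium"), ("Gym", "Gym"), ("Shop", "Shop"),
         ("Pub", "Pub"), ("Store", "Store")]

def homogenize(category, rules=RULES):
    """Apply each rule in turn to the current label, as the sequential .loc updates do."""
    for keyword, label in rules:
        if keyword in category:  # case-sensitive substring test
            category = label
    return category

examples = {name: homogenize(name)
            for name in ["Tapas Restaurant", "Gastropub", "Bookstore", "Café", "Barbershop"]}
```

"Gastropub" and "Bookstore" survive because the lowercase "pub" and "store" do not match the capitalized keywords, "Café" is untouched by the unaccented keyword "Cafe", and a name such as "Barbershop" would be folded into "Bar".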


Number of venues by neighborhood

In [23]:
dfV.groupby('Neighborhood').count().iloc[:,0:1].sort_values(by='NLat',ascending=False).reset_index()

|  | Neighborhood | NLat |
|---|---|---|
| 0 | Sacromonte | 100 |
| 1 | Encina | 87 |
| 2 | San Matías | 76 |
| 3 | Centro | 68 |
| 4 | Realejo | 60 |
| 5 | Vergeles | 55 |
| 6 | Figares | 54 |
| 7 | Lancha del Genil | 50 |
| 8 | Plaza de Toros | 48 |
| 9 | Sagrario | 47 |


Number of venues by category

In [24]:
dfV.groupby('VCat').count().iloc[:,0:1].sort_values(by='Neighborhood',ascending=False).reset_index()

|  | VCat | Neighborhood |
|---|---|---|
| 0 | Restaurant | 364 |
| 1 | Shop | 97 |
| 2 | Bar | 76 |
| 3 | Hotel | 68 |
| 4 | Plaza | 63 |
| ... | ... | ... |
| 56 | Bus Line | 1 |
| 57 | Botanical Garden | 1 |
| 58 | Tea Room | 1 |
| 59 | BBQ Joint | 1 |


#### One Hot Encoding

In [25]:
OHE = pd.get_dummies(dfV[['VCat']], prefix="", prefix_sep="") # one hot encoding
OHE['Neighborhood'] = dfV['Neighborhood'] # add neighborhood column back to dataframe
fixcols = [OHE.columns[-1]] + list(OHE.columns[:-1]) # move neighborhood column to the first column
OHE = OHE[fixcols]

Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [26]:
OHG = OHE.groupby('Neighborhood').mean().reset_index()
OHg = OHG.drop(columns='Neighborhood') # keep only the numeric frequency columns for clustering
OHg.shape

(40, 61)

The matrix we are going to use for clustering has 40 neighborhoods and 61 categories
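
Taking the mean of one-hot columns per neighborhood amounts to computing, for each neighborhood, the share of its venues that fall in each category. A plain-Python sketch of that computation follows, using made-up venue pairs and a hypothetical helper name `category_frequencies`.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """venues: list of (neighborhood, category) pairs.

    Returns the sorted category names and, per neighborhood, a list of
    relative frequencies in that same category order.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    categories = sorted({category for _, category in venues})
    matrix = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        matrix[neighborhood] = [counter[c] / total for c in categories]
    return categories, matrix

# Made-up sample, not the notebook's data.
sample = [("A", "Bar"), ("A", "Bar"), ("A", "Café"), ("B", "Hotel")]
categories, matrix = category_frequencies(sample)
```

Each row sums to 1, so a neighborhood with one venue and a neighborhood with a hundred venues carry the same weight in the clustering step.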

## Clustering

Run k-means to group the Granada neighborhoods into K = 5 clusters, then list the ten most common venue categories for each neighborhood

In [34]:
K = 5

kmeans = KMeans(n_clusters = K, random_state=0).fit(OHg) # run k-means clustering

ntv = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(ntv):
    try:
        columns.append('{}{}'.format(ind+1, indicators[ind]))
    except IndexError: # positions after the third use the 'th' suffix
        columns.append('{}th'.format(ind+1))
        
NVS = pd.DataFrame(columns=columns)

NVS['Neighborhood'] = OHG['Neighborhood']

for ind in np.arange(OHG.shape[0]):
    NVS.iloc[ind, 1:] = OHG.iloc[ind, :].iloc[1:].sort_values(ascending=False).index.values[0:ntv]
    
NVS.insert(0,'Cluster',kmeans.labels_) # add clustering labels

dfN4 = dfN3.join(NVS.set_index('Neighborhood'), on='Barrio') # merge grouped with data to add latitude/longitude for each neighborhood
dfN5 = dfN4[dfN4['Cluster'].notnull()].reset_index(drop = True)
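
The column labels and the per-neighborhood ranking built in this cell can be sketched in plain Python as follows. The helper names `ordinal` and `top_categories` are hypothetical, the sample frequencies are made up, and the alphabetical tie-break is an assumption of this sketch; the notebook's `sort_values` call may order ties differently.

```python
def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... for a positive integer n."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

def top_categories(freqs, n):
    """freqs: dict of category -> frequency; return the n most frequent names."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

columns = [ordinal(i) for i in range(1, 11)]
top2 = top_categories({"Bar": 0.5, "Café": 0.3, "Hotel": 0.2, "Gym": 0.0}, 2)
padded = top_categories({"Shop": 1.0, "Winery": 0.0, "Café": 0.0}, 3)
```

As `padded` shows, a neighborhood with fewer distinct categories than the requested number still gets a full list, filled with zero-frequency categories. This is a likely explanation for the long identical tails in some rows of the cluster tables.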

Create the map

In [35]:
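# center on the city coordinates from cell 17 rather than on `lat`, `lon` left over from an earlier loop
map_clusters = folium.Map(location=[location.latitude, location.longitude], zoom_start=13)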

# set color scheme for the clusters
x = np.arange(K)
ys = [i + x + (i*x)**2 for i in range(K)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dfN5['Latitud'], dfN5['Longitud'], dfN5['Barrio'], dfN5['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Analysis of clusters

Summary of venue counts by category and cluster

In [36]:
df1 = dfV.merge(NVS.iloc[:,0:2], on='Neighborhood', how='left')
df2 = df1[['Venue','VCat','Cluster']]
df3 = df2.pivot_table(index='VCat',columns='Cluster',aggfunc='count',fill_value=0)
df3

| VCat | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|---|
| Art Museum | 2 | 0 | 0 | 0 | 0 |
| Arts & Entertainment | 0 | 0 | 0 | 0 | 1 |
| BBQ Joint | 1 | 0 | 0 | 0 | 0 |
| Bar | 13 | 1 | 0 | 3 | 59 |
| Bed & Breakfast | 1 | 0 | 0 | 0 | 6 |
| ... | ... | ... | ... | ... | ... |
| Store | 13 | 0 | 0 | 0 | 17 |
| Supermarket | 0 | 0 | 0 | 0 | 6 |
| Tea Room | 0 | 0 | 0 | 0 | 1 |
| Theater | 2 | 0 | 0 | 0 | 6 |


Now let's examine the clusters:

**Cluster 0**

In [45]:
dfN5.loc[dfN5['Cluster'] == 1-1, dfN5.columns[[1] + list(range(5, dfN5.shape[1]))]]

|  | Barrio | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Joaquina Eguaras | Restaurant | Bus Station | Store | Park | Shop | Plaza | Campground | Boutique | Café | Bus Line |
| 3 | Angustias | Park | Café | Bus Station | Store | Shop | Restaurant | Hotel | Breakfast Spot | Plaza | Boutique |
| 12 | Camino de los Neveros | Store | Shop | Restaurant | Park | Gym | Movie Theater | Boutique | College Library | Campground | Castle |
| 15 | Vergeles | Restaurant | Historic Site | Shop | Bar | Plaza | Hotel | Bistro | Garden | Performing Arts Venue | Café |
| 16 | Haza Grande | Pharmacy | Brewery | Plaza | Stadium | Hotel | Restaurant | College Residence Hall | Castle | Church | College Library |
| 18 | Realejo | Restaurant | Historic Site | Shop | Bar | Hotel | Plaza | Bistro | Garden | Performing Arts Venue | Café |
| 26 | Bobadilla | Shop | Store | Intersection | Rock Club | Concert Hall | Castle | Church | College Library | College Residence Hall | Winery |
| 28 | Casería de Montijo | Shop | Garden | Bar | Restaurant | Café | Concert Hall | Castle | Church | College Library | College Residence Hall |
| 35 | La Paz | Shop | Store | Plaza | Park | Café | Diner | Deli / Bodega | Dance Studio | Dry Cleaner | Concert Hall |
| 39 | San Ildefonso | Café | Hotel | Restaurant | Plaza | Shop | Scenic Lookout | Historic Site | Garden | Deli / Bodega | Dance Studio |


**Cluster 1**

In [38]:
dfN5.loc[dfN5['Cluster'] == 2-1, dfN5.columns[[1] + list(range(5, dfN5.shape[1]))]]

|  | Barrio | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | El Fargue | Restaurant | Winery | Café | Gastropub | Garden | Food | Dry Cleaner | Diner | Deli / Bodega | Dance Studio |
| 22 | Rosaleda | Restaurant | Bar | Winery | Campground | Gastropub | Garden | Food | Dry Cleaner | Diner | Deli / Bodega |


**Cluster 2**

In [39]:
dfN5.loc[dfN5['Cluster'] == 3-1, dfN5.columns[[1] + list(range(5, dfN5.shape[1]))]]

|  | Barrio | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | Rey Badis | Shop | Winery | Café | Gastropub | Garden | Food | Dry Cleaner | Diner | Deli / Bodega | Dance Studio |


**Cluster 3**

In [40]:
dfN5.loc[dfN5['Cluster'] == 4-1, dfN5.columns[[1] + list(range(5, dfN5.shape[1]))]]

|  | Barrio | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Almanjáyar | Bar | Breakfast Spot | Winery | Campground | Gastropub | Garden | Food | Dry Cleaner | Diner | Deli / Bodega |
| 21 | Cartuja | Gym | Bar | Lake | College Residence Hall | Park | Castle | Gastropub | Garden | Food | Dry Cleaner |
| 30 | Cerrillo de Maracena | Gym | Stadium | Breakfast Spot | Park | Diner | Bar | Food | Dry Cleaner | Garden | Deli / Bodega |
| 32 | Parque Nueva Granada | Gym | Lake | College Residence Hall | Park | General College & University | Gastropub | Garden | Food | Dry Cleaner | Diner |


**Cluster 4**

In [48]:
dfN5.loc[dfN5['Cluster'] == 5-1, dfN5.columns[[1] + list(range(5, dfN5.shape[1]))]]

|  | Barrio | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albaicín | Restaurant | Scenic Lookout | Café | Plaza | Hotel | Historic Site | Hostel | Dance Studio | Garden | Brewery |
| 2 | Centro | Restaurant | Shop | Hotel | Café | Bar | Plaza | Bed & Breakfast | Pub | Pizza Place | Hostel |
| 4 | Bola de Oro | Restaurant | Bar | Stadium | Winery | Concert Hall | Castle | Church | College Library | College Residence Hall | Deli / Bodega |
| 6 | Camino de Ronda | Restaurant | Hotel | Pizza Place | Store | Bar | Pub | Shop | Dry Cleaner | Sandwich Place | Bookstore |
| 7 | Zaidín | Restaurant | Theater | Supermarket | Winery | Concert Hall | Castle | Church | College Library | College Residence Hall | Deli / Bodega |
| 9 | La Cruz | Restaurant | Pizza Place | Gym | Café | Campground | Store | Dry Cleaner | Diner | Deli / Bodega | Concert Hall |
| 10 | Sagrario | Restaurant | Hotel | Shop | Bar | Café | Park | Pub | Gym | Hostel | Bus Station |
| 11 | Chana | Restaurant | Pizza Place | Stadium | Breakfast Spot | Park | Winery | College Library | Campground | Castle | Church |
| 13 | Campo Verde | Café | Supermarket | Restaurant | Winery | Garden | Food | Dry Cleaner | Diner | Deli / Bodega | Dance Studio |
| 14 | Figares | Restaurant | Hotel | Shop | Diner | Pizza Place | Gym | Bar | Café | Nightclub | General College & University |
