# Capstone Project 
# *Find myself the best neighborhood to live in Barcelona*

### Yu Deng

## 1. Introduction

This report is for the capstone course of IBM Data Science, a professional certification series provided by Coursera.com. In it, I use many of the data science tools taught across the nine courses of the series to produce a data analysis report on my own.


I am a business school student in China. This autumn I will go on exchange to ESADE in Barcelona, one of the best business schools in Europe, and stay there for about 3-4 months. The school doesn't offer on-campus accommodation, so I have to rent an apartment together with my schoolmates. Deciding where to live is a tough task for me. However, the capstone course introduces the Foursquare API and methods to cluster neighborhoods based on their similarity, so I decided to apply similar methods to choose the best place to live.

My analysis goes through the following steps:

    1. Pick out neighborhoods near my school, ESADE, as candidates.
    2. Use data from Foursquare to explore popular venues in the selected neighborhoods.
    3. Cluster these neighborhoods based on their similarity with the help of machine learning.
    4. Sort out the most common venues in each neighborhood and calculate their proportions.
    5. Transform my needs and interests into a vector of scores for the different venue types. Multiply the neighborhoods dataframe by this vector to choose the best place to live.
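
Step 5 above amounts to a matrix-vector product. A minimal sketch of the idea follows; the venue categories, proportions and preference weights are all made up for illustration and do not come from the analysis:

```python
import pandas as pd

# Hypothetical share of each venue category per neighborhood (rows sum to 1).
venue_share = pd.DataFrame(
    {
        "Restaurant": [0.50, 0.20, 0.30],
        "Park": [0.10, 0.50, 0.20],
        "Gym": [0.40, 0.30, 0.50],
    },
    index=["Neighborhood A", "Neighborhood B", "Neighborhood C"],
)

# Hypothetical preference vector: how much I care about each venue category.
preferences = pd.Series({"Restaurant": 3, "Park": 1, "Gym": 2})

# Matrix-vector product: one weighted score per neighborhood, highest first.
scores = venue_share.dot(preferences).sort_values(ascending=False)
best = scores.index[0]

print(scores)
print("Best match:", best)
```

`DataFrame.dot` aligns the dataframe's columns with the index of the series, so the preference vector needs exactly one entry per venue column.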

## 2. Data

The data for my report consists of three parts:
    
    1. The geospatial data of ESADE Business School.
    2. The list of neighborhoods in Barcelona and their geospatial data.
    3. Information on venues in specific neighborhoods provided by Foursquare.com.
    
I will go through the data collection step by step.

Before I start, it's necessary to import all the packages required.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

from geopy.geocoders import Nominatim

from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import folium

print('Succeed!')

Succeed!


### 1. Use the geopy library to get the geographical coordinates of ESADE Business School

In [3]:
esade_add = 'ESADE Business School'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(esade_add)
esade_lat = location.latitude
esade_lng = location.longitude
print('The geographical coordinates of ESADE Business School are {}, {}.'.format(esade_lat, esade_lng))

The geographical coordinates of ESADE Business School are 41.3947771, 2.1150794468981124.


### 2. Choose the districts around ESADE and find their constituent neighborhoods

The Wikipedia page Districts of Barcelona (https://en.wikipedia.org/wiki/Districts_of_Barcelona) gives basic information about every district in Barcelona, including size, population, neighborhoods and so on. First I scrape the table and transform it into a dataframe.

In [17]:
url = "https://en.wikipedia.org/wiki/Districts_of_Barcelona"
barce_wiki = pd.read_html(url)[3]
barce_wiki.head()

Unnamed: 0,Number,District,Size km2,Population,Density inhabitants/km2,Neighbourhoods,Councilman[2],Party
0,1,Ciutat Vella,4.49,111290,24786,"La Barceloneta, El Gòtic, El Raval, Sant Pere,...",Jordi Rabassa i Massons,Barcelona en Comú
1,2,Eixample,7.46,262485,35586,"L'Antiga Esquerra de l'Eixample, La Nova Esque...",Jordi Martí Grau,Barcelona en Comú
2,3,Sants-Montjuïc,21.35,177636,8321,"La Bordeta, la Font de la Guatlla, Hostafrancs...",Marc Serra Solé,Barcelona en Comú
3,4,Les Corts,6.08,82588,13584,"les Corts, la Maternitat i Sant Ramon, Pedralbes",Xavier Marcé Carol,Socialists' Party of Catalonia
4,5,Sarrià-Sant Gervasi,20.09,140461,6992,"El Putget i Farró, Sarrià, Sant Gervasi - la B...",Albert Batlle i Bastardas,Socialists' Party of Catalonia


I do some basic cleaning and keep only the *District*, *Density* and *Neighbourhoods* columns.

In [23]:
barce_district = barce_wiki.drop(['Number', 'Size km2', 'Population', 'Councilman[2]', 'Party'], axis=1)
barce_district.rename(columns = {'Density inhabitants/km2': 'Density'}, inplace=True)
barce_district.head()

Unnamed: 0,District,Density,Neighbourhoods
0,Ciutat Vella,24786,"La Barceloneta, El Gòtic, El Raval, Sant Pere,..."
1,Eixample,35586,"L'Antiga Esquerra de l'Eixample, La Nova Esque..."
2,Sants-Montjuïc,8321,"La Bordeta, la Font de la Guatlla, Hostafrancs..."
3,Les Corts,13584,"les Corts, la Maternitat i Sant Ramon, Pedralbes"
4,Sarrià-Sant Gervasi,6992,"El Putget i Farró, Sarrià, Sant Gervasi - la B..."


Using the same geopy package, I get the geographical coordinates of each district.

In [42]:
districts = barce_district['District']
districts_lat = []
districts_lng = []

# Create the geocoder once and reuse it for every district.
geolocator = Nominatim(user_agent="my-application")

for district in districts:
    # Append ", Barcelona" so the name is resolved within the city.
    location = geolocator.geocode(district + ", Barcelona", timeout=10)
    districts_lat.append(location.latitude)
    districts_lng.append(location.longitude)

barce_district['Latitude'] = districts_lat
barce_district['Longitude'] = districts_lng

barce_district

Unnamed: 0,District,Density,Neighbourhoods,Latitude,Longitude
0,Ciutat Vella,24786,"La Barceloneta, El Gòtic, El Raval, Sant Pere,...",41.374962,2.173265
1,Eixample,35586,"L'Antiga Esquerra de l'Eixample, La Nova Esque...",41.393394,2.166085
2,Sants-Montjuïc,8321,"La Bordeta, la Font de la Guatlla, Hostafrancs...",41.340234,2.133347
3,Les Corts,13584,"les Corts, la Maternitat i Sant Ramon, Pedralbes",41.385244,2.132863
4,Sarrià-Sant Gervasi,6992,"El Putget i Farró, Sarrià, Sant Gervasi - la B...",41.413043,2.108356
5,Gràcia,28660,"Vila de Gràcia, el Camp d'en Grassot i Gràcia ...",41.410171,2.155136
6,Horta-Guinardó,14217,"El Baix Guinardó, El Guinardó, Can Baró, El Ca...",41.42854,2.143597
7,Nou Barris,20520,"Can Peguera, Canyelles, Ciutat Meridiana, La G...",41.446727,2.172565
8,Sant Andreu,21737,"Baró de Viver, Bon Pastor, El Congrés i els In...",41.437439,2.196859
9,Sant Martí,20466,"El Besòs i el Maresme, el Clot, El Camp de l'A...",41.406782,2.203655
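
The geocoding loop assumes that every lookup succeeds. geopy geocoders return `None` when nothing is found, and public services such as Nominatim expect requests to be spaced out. The sketch below shows one way to guard against both. The helper `geocode_places`, its parameters and the stand-in geocoder are hypothetical names introduced for illustration; only the Les Corts coordinates are taken from the table above.

```python
import time
from typing import Callable, Iterable, Optional

import pandas as pd


def geocode_places(
    names: Iterable[str],
    geocode: Callable[[str], Optional[object]],
    suffix: str = ", Barcelona",
    pause_seconds: float = 1.0,
) -> pd.DataFrame:
    """Look up each name and return a Name/Latitude/Longitude dataframe.

    `geocode` is any callable returning an object with `.latitude` and
    `.longitude`, or None when nothing is found.
    """
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # Keep the row but mark the coordinates as missing.
            rows.append((name, float("nan"), float("nan")))
        else:
            rows.append((name, location.latitude, location.longitude))
        time.sleep(pause_seconds)  # space out requests to the service
    return pd.DataFrame(rows, columns=["Name", "Latitude", "Longitude"])


class FakeLocation:
    """Stand-in for a geocoding result, used only in this demonstration."""

    def __init__(self, latitude: float, longitude: float):
        self.latitude = latitude
        self.longitude = longitude


# Stand-in geocoder: a dictionary lookup that returns None for unknown names.
known = {"Les Corts, Barcelona": FakeLocation(41.385244, 2.132863)}
demo = geocode_places(["Les Corts", "Nowhere"], known.get, pause_seconds=0)
print(demo)
```

In the notebook, the `geocode` argument could be `Nominatim(user_agent="my-application").geocode`. Rows with missing coordinates can then be inspected before plotting.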


Using the folium package, I plot the coordinates of ESADE Business School and all districts on a map.

In [52]:
barce_add = 'Barcelona'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(barce_add, timeout=10)
barce_lat = location.latitude
barce_lng = location.longitude

barce_map = folium.Map(location=[barce_lat, barce_lng], zoom_start=12)

for lat, lng, district in zip(barce_district['Latitude'], barce_district['Longitude'], barce_district['District']):
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=district,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(barce_map)  

folium.Marker([esade_lat, esade_lng], popup='ESADE').add_to(barce_map)

barce_map

Naturally, I want to live near my school to keep the commute short, so I choose the five districts closest to ESADE on the map above: ***Les Corts, Sarrià-Sant Gervasi, Eixample, Gràcia and Horta-Guinardó***.

In [55]:
barce_district_candi = barce_district.iloc[[1,3,4,5,6],:].reset_index(drop = True)
barce_district_candi

Unnamed: 0,District,Density,Neighbourhoods,Latitude,Longitude
0,Eixample,35586,"L'Antiga Esquerra de l'Eixample, La Nova Esque...",41.393394,2.166085
1,Les Corts,13584,"les Corts, la Maternitat i Sant Ramon, Pedralbes",41.385244,2.132863
2,Sarrià-Sant Gervasi,6992,"El Putget i Farró, Sarrià, Sant Gervasi - la B...",41.413043,2.108356
3,Gràcia,28660,"Vila de Gràcia, el Camp d'en Grassot i Gràcia ...",41.410171,2.155136
4,Horta-Guinardó,14217,"El Baix Guinardó, El Guinardó, Can Baró, El Ca...",41.42854,2.143597
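
The five districts were picked by eye from the map. As a cross-check, the straight-line distance from ESADE to each district can be computed with the haversine formula. This is only a sketch: it uses the coordinates printed earlier in this report, measures distance to a single geocoded point per district rather than to the district's boundary, and ignores actual travel routes. The names `haversine_km` and `district_coords` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates as printed earlier in this report.
esade_lat, esade_lng = 41.3947771, 2.1150794468981124

district_coords = pd.DataFrame({
    "District": ["Ciutat Vella", "Eixample", "Sants-Montjuïc", "Les Corts",
                 "Sarrià-Sant Gervasi", "Gràcia", "Horta-Guinardó",
                 "Nou Barris", "Sant Andreu", "Sant Martí"],
    "Latitude": [41.374962, 41.393394, 41.340234, 41.385244, 41.413043,
                 41.410171, 41.42854, 41.446727, 41.437439, 41.406782],
    "Longitude": [2.173265, 2.166085, 2.133347, 2.132863, 2.108356,
                  2.155136, 2.143597, 2.172565, 2.196859, 2.203655],
})

# Distance from ESADE to every district, then the five smallest.
district_coords["Distance_km"] = haversine_km(
    esade_lat, esade_lng,
    district_coords["Latitude"], district_coords["Longitude"],
)
nearest_five = district_coords.nsmallest(5, "Distance_km")
print(nearest_five[["District", "Distance_km"]])
```

With these coordinates, the ranking should select the same five districts as the visual inspection.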


### 3. Acquire the venues data in neighborhoods

In this section, I split the districts dataframe into the neighborhoods within each district, and then use Foursquare.com to explore the venues in each neighborhood. Since the venues data leads naturally into the next section (clustering and further analysis), I leave it to the third part of my report. Here I simply do the split.

In [62]:
neigh_split = barce_district_candi['Neighbourhoods'].str.split(', ', expand=True).stack().to_frame() 
neigh_split = neigh_split.reset_index(level=1, drop=True).rename(columns={0:'Neighborhood'}) 

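# Keep only District and Neighborhood. Assumption: the district-level Density,
# Latitude and Longitude columns and the comma-separated Neighbourhoods string
# are not needed at the neighborhood level.
barce_neigh = barce_district_candi.join(neigh_split).drop(columns=['Density', 'Neighbourhoods', 'Latitude', 'Longitude'])
barce_neigh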
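
The same one-row-per-neighborhood split can also be written with `DataFrame.explode`, which avoids the `stack` and `join` round trip. This is a sketch on a small stand-in dataframe: the Les Corts row is copied from the table above, and the second row is a hypothetical placeholder.

```python
import pandas as pd

districts = pd.DataFrame({
    "District": ["Les Corts", "Example District"],
    "Neighbourhoods": [
        "les Corts, la Maternitat i Sant Ramon, Pedralbes",
        "Example One, Example Two",  # hypothetical placeholder row
    ],
})

neighborhoods = (
    districts
    # Turn each comma-separated string into a list of names.
    .assign(Neighborhood=districts["Neighbourhoods"].str.split(", "))
    # Emit one row per list element, repeating the district columns.
    .explode("Neighborhood")
    .drop(columns="Neighbourhoods")
    .reset_index(drop=True)
)
print(neighborhoods)
```

`explode` requires pandas 0.25 or later. The result has one row per neighborhood, each with its parent district alongside.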