# Battle of the neighbourhoods

In this capstone project for the IBM Professional Certificate in Data Science (via [Coursera](https://www.coursera.org)), we have to leverage web scraping, Foursquare data (venues around a given location), and machine learning to solve a problem of our choosing.

## Clustering European cities

### Background

The problem that I have chosen to investigate is: how similar are European cities? Which cities are similar? Which are dissimilar? On what grounds are they similar/dissimilar? If you were interested in moving but didn't know where to, how would you decide which cities made the shortlist? 
<br><br> 
Supposing you are interested in moving to a(nother) European city, you might first want an overview of the sorts of amenities and venues that different cities have. Perhaps you want to know which cities are big on nature, have a good LGBT scene, or have lots of diverse restaurants. Or perhaps you have already lived in Paris and want to see which other European cities are similar, so that you can move to a city you're likely to love (or avoid them completely if you hated living in Paris). You might also want to see whether Paris is more similar to other cities in France or to cities in other countries.

### How we will go about this

We will tackle this problem in five steps. First, we will use web scraping to obtain the top cities by population size (within city limits) in the European Union (EU). [This Wikipedia page](https://en.wikipedia.org/wiki/List_of_cities_in_the_European_Union_by_population_within_city_limits) lists all cities in the EU with a population of more than 300,000. As noted on the webpage, population is calculated as the number of people living within the city limits. This might not include people living in the larger urban area, depending on how the city defines its own city limits. For example, Paris is the most populous city in the EU if we include the wider urban area but only the 4th by city limits. However, as we are not particularly interested in population size as a predictor of city similarity, we can safely ignore this.
<br><br>
Below is some sample code walking us through how we will approach this problem. 

In [22]:
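# Example code

import io

import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_cities_in_the_European_Union_by_population_within_city_limits"
response = requests.get(url)
response.raise_for_status() # Stop early if the page could not be fetched
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string to read_html
eu_cities_data = pd.read_html(io.StringIO(response.text))[0]
# This assumes that the first table on the page has exactly these six columns
eu_cities_data.columns = ["City", "Country", "Population", "CensusDate", "Ref", "Photo"]
eu_cities_data = eu_cities_data.drop(columns = ["CensusDate", "Ref", "Photo"])
eu_cities_data.head()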

| | City | Country | Population |
|---|---|---|---|
| 0 | Berlin | Germany | 3669495 |
| 1 | Madrid | Spain | 3348536 |
| 2 | Rome | Italy | 2856133 |
| 3 | Paris | France | 2140526 |
| 4 | Vienna | Austria | 1921153 |


Above we can see the top five cities in the EU by population size. 
<br><br>
Secondly, we will obtain the latitudes and longitudes for each of the cities using geopy. 

In [23]:
# Example code

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent = 'myapplication')

for city in range(0,5):
    cityname = eu_cities_data.loc[city, "City"]
    location = geolocator.geocode(cityname)
    print(location.address)
    print("The latitude of {} is {} and the longitude is {}".format(cityname, location.latitude, location.longitude))
    if city < 4: # i.e., not last
        print("")

Berlin, 10117, Deutschland
The latitude of Berlin is 52.5170365 and the longitude is 13.3888599

Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28001, España
The latitude of Madrid is 40.4167047 and the longitude is -3.7035825

Roma, Roma Capitale, Lazio, Italia
The latitude of Rome is 41.8933203 and the longitude is 12.4829321

Paris, Île-de-France, France métropolitaine, France
The latitude of Paris is 48.8566969 and the longitude is 2.3514616

Wien, Österreich
The latitude of Vienna is 48.2083537 and the longitude is 16.3725042


(Isn't it also fun to see what different cities and countries are called in their local languages?)

In [24]:
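# Getting latitude and longitude data for each city

from geopy.extra.rate_limiter import RateLimiter

# Nominatim's usage policy allows at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds = 1)

latitudes, longitudes = [], []
for city_name in eu_cities_data["City"]:
    location = geocode(city_name)
    # geocode returns None when nothing matches, so guard against that
    latitudes.append(location.latitude if location else None)
    longitudes.append(location.longitude if location else None)

eu_cities_data["Latitude"] = latitudes
eu_cities_data["Longitude"] = longitudes

eu_cities_data.head()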

| | City | Country | Population | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Berlin | Germany | 3669495 | 52.517 | 13.3889 |
| 1 | Madrid | Spain | 3348536 | 40.4167 | -3.70358 |
| 2 | Rome | Italy | 2856133 | 41.8933 | 12.4829 |
| 3 | Paris | France | 2140526 | 48.8567 | 2.35146 |
| 4 | Vienna | Austria | 1921153 | 48.2084 | 16.3725 |


Thirdly, we will use Foursquare API data to find the venues in each city. We will set the search radius quite high (to 5 km) in order to capture as many venues in the city as possible.

In [25]:
# Here client ID and client secret will need to be filled in and the limit will be set higher

CLIENT_ID = ""
CLIENT_SECRET = ""
VERSION = "20180605" # Foursquare API version
LIMIT = 50 # default Foursquare API limit value is 100 but we will probably use something much higher in the actual project
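
Rather than typing the credentials into the notebook, one option is to read them from environment variables. The following is a minimal sketch of that idea, not part of the project code: the function name `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and chosen only for illustration.

```python
import os

def load_foursquare_credentials(env = os.environ):
    """Return (client_id, client_secret) from the given mapping, or raise if either is missing."""
    # The variable names below are hypothetical; use whatever names suit your setup
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None

# Example with an explicit mapping standing in for the real environment
example_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(example_env)
```

Calling the function without an argument would read from the real environment, which keeps the secret out of any notebook that gets shared.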

In [26]:
# Defining a function which finds venues within a radius of 5 km around each city's latitude and longitude and returns them as a table

def getNearbyVenues(city, country, latitude, longitude, radius = 5000):
    venues_list = []
    # The loop variables get their own names so that they do not shadow the function's parameters
    for city_name, country_name, lat, lng in zip(city, country, latitude, longitude):
#        print("{}, {}".format(city_name, country_name)) # This will be included in the final project but for now we just want to give a taste of what we will do
        # Create the API request; requests builds the query string from params
        url = "https://api.foursquare.com/v2/venues/explore"
        params = {
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "v": VERSION,
            "ll": "{},{}".format(lat, lng),
            "radius": radius,
            "limit": LIMIT}
        # Make the GET request
        results = requests.get(url, params = params).json()["response"]["groups"][0]["items"]
        # Keep only the relevant information for each nearby venue
        for v in results:
            categories = v["venue"]["categories"]
            venues_list.append((
                city_name,
                country_name,
                lat,
                lng,
                v["venue"]["name"],
                v["venue"]["location"]["lat"],
                v["venue"]["location"]["lng"],
                # A venue may come back without a category, so fall back to None
                categories[0]["name"] if categories else None))
    nearby_venues = pd.DataFrame(venues_list, columns = [
        "City",
        "Country",
        "City latitude",
        "City longitude",
        "Venue",
        "Venue latitude",
        "Venue longitude",
        "Venue category"])
    return nearby_venues

In [27]:
# Example code 

eu_venues = getNearbyVenues(city = eu_cities_data["City"], 
                            country = eu_cities_data["Country"], 
                            latitude = eu_cities_data["Latitude"],
                            longitude = eu_cities_data["Longitude"]
                           )
eu_venues.head()

| | City | Country | City latitude | City longitude | Venue | Venue latitude | Venue longitude | Venue category |
|---|---|---|---|---|---|---|---|---|
| 0 | Berlin | Germany | 52.517037 | 13.38886 | Dussmann das KulturKaufhaus | 52.518312 | 13.388708 | Bookstore |
| 1 | Berlin | Germany | 52.517037 | 13.38886 | Dussmann English Bookshop | 52.518223 | 13.389239 | Bookstore |
| 2 | Berlin | Germany | 52.517037 | 13.38886 | Lafayette Gourmet | 52.514385 | 13.389569 | Gourmet Shop |
| 3 | Berlin | Germany | 52.517037 | 13.38886 | Konzerthaus Berlin | 52.513639 | 13.391795 | Concert Hall |
| 4 | Berlin | Germany | 52.517037 | 13.38886 | Gendarmenmarkt | 52.51357 | 13.39272 | Plaza |
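
To make the 5 km search radius concrete, here is a minimal sketch that measures the great-circle distance between the Berlin city coordinates and the first venue in the table above, using the haversine formula. The function name `haversine_m` and the mean Earth radius of 6,371 km are choices made for this illustration and are not part of the project code.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (an approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Coordinates taken from the first row of the venue table
berlin = (52.517037, 13.38886)
dussmann = (52.518312, 13.388708)

distance = haversine_m(*berlin, *dussmann)
print("Distance: {:.0f} m, within 5 km: {}".format(distance, distance <= 5000))
```

A check like this could also be used to see how far from the city centre the returned venues actually lie, which says something about how well a single 5 km circle covers a large city.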


Fourthly, we will find the most common types of venue in each city. Fifthly and finally, we will cluster the cities based on that information.
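
These last two steps can be sketched roughly as follows. This is only an illustration under stated assumptions. The text does not say which clustering algorithm will be used, so k-means is assumed here. It is written in plain Python to keep the example self-contained, although a library implementation would likely be more convenient in the project itself. The Berlin rows come from the sample output above, whereas the rows for Madrid, Rome and Vienna are made up for the example.

```python
from collections import Counter

# (city, venue category) pairs. The Berlin rows come from the sample output;
# the rows for the other cities are invented purely for illustration.
venues = [
    ("Berlin", "Bookstore"), ("Berlin", "Bookstore"), ("Berlin", "Gourmet Shop"),
    ("Berlin", "Concert Hall"), ("Berlin", "Plaza"),
    ("Madrid", "Restaurant"), ("Madrid", "Restaurant"), ("Madrid", "Park"), ("Madrid", "Plaza"),
    ("Rome", "Restaurant"), ("Rome", "Park"), ("Rome", "Park"), ("Rome", "Plaza"),
    ("Vienna", "Concert Hall"), ("Vienna", "Concert Hall"), ("Vienna", "Bookstore"), ("Vienna", "Plaza"),
]

# Step four: count how often each category occurs in each city
counts = {}
for city, category in venues:
    counts.setdefault(city, Counter())[category] += 1

top_category = {city: c.most_common(1)[0][0] for city, c in counts.items()}

# Turn the counts into one frequency vector per city, using a shared category order
cities = sorted(counts)
categories = sorted({category for _, category in venues})
vectors = [[counts[city][cat] / sum(counts[city].values()) for cat in categories]
           for city in cities]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations = 10):
    """Plain k-means; the first k vectors seed the centroids so the result is reproducible."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid
        labels = [min(range(k), key = lambda j: squared_distance(v, centroids[j]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = [sum(column) / len(members) for column in zip(*members)]
    return labels

# Step five: cluster the cities by their venue profiles
cluster_of = dict(zip(cities, kmeans(vectors, k = 2)))
print(top_category)
print(cluster_of)
```

Using category frequencies rather than raw counts keeps a city with many returned venues from dominating the distance calculation. The number of clusters (`k = 2`) is arbitrary here and would need to be chosen with care on the real data.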

In [28]:
# Library for displaying maps

!pip install folium
import folium
print("")
print("Folium installed!")
print("")


Folium installed!



In [29]:
lat_europe = 54.5260
long_europe = 15.2551

map_europe = folium.Map(location = [lat_europe, long_europe], zoom_start = 4)

for lat, lng, city, country, population in zip(eu_cities_data["Latitude"], 
                                               eu_cities_data["Longitude"], 
                                               eu_cities_data["City"], 
                                               eu_cities_data["Country"], 
                                               eu_cities_data["Population"]):
    label = "{}, {} (population: {:,})".format(city, country, population)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = "darkmagenta",
        fill = True,
        fill_color = "mediumvioletred",
        fill_opacity = 0.7).add_to(map_europe)
    
map_europe

(In the event that the above map does not load, you can also view a static image of it [here](https://github.com/annahudson/Coursera_Capstone/blob/main/EU_most_populous_cities.png))