# Final Capstone Project  
## The Battle of the Neighborhoods

### Abstract

_In this project, we will compare different cities in a given list.
The cities will be clustered using the k-means algorithm into different clusters.
The program is set to be general, so it may be used to aid in any decision making related to going to a new city.
The application created by this project helps any person going through this common procedure.
Even though this project only gets information from the Foursquare API to categorize the different cities, it can easily be generalized to add new features in the future._

## Table of contents
* [Introduction](#Introduction)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results and Discussion](#Results)
* [Conclusion](#Conclusion)

## 1 - Introduction <a name="Introduction"></a>

<!-- where you discuss the business problem and who would be interested in this project. -->
Many times someone needs to go to a city - either for a temporary visit or a permanent move -, a choice between different places is possible.
However, this decision is not easy as it involves many factors: size of the city, available restaurants and recreational places, prices and many others not always related to the place itself.
These situations usually require a reasonable amount of research and analysis, and there are many available tools that can help in different phases of this process.
<a href="https://www.wikipedia.com">Wikipedia</a>, <a href="https://www.google.com/flights">Google Flights</a>, <a href="https://www.booking.com">Booking.com</a>, <a href="https://maps.google.com/">Google Maps</a> and <a href="https://www.foursquare.com/">Foursquare</a> are examples of instruments that are used to get familiar with a city, obtain forms and prices on how to get there, and the places available for staying and entertainment.

In all of these tools, the users need to analyze each option by themselves and then, after withdrawing weaker candidates or strengthening another one with the obtained information, make a verdict.
Often, one still feels insecure about the choice made.
In this sense, a tool that automatically groups cities in a given list, returning useful information about each group certainly assist the procedure.
Such application can be used with unknown cities in an iterative way to reduce the possible alternatives, or also to include known cities as a reference.

With this comparison, we may answer questions such as:
- Is a big city with many restaurants more equal to a small city with many restaurants, or a big city with more parks?
- Which city is more like Dublin: Cologne or Amsterdam?
- Is Shanghai, Tokyo or Johannesburg more different than São Paulo?

In Chapter 2, we discuss the data that will be used.
Chapter 3 is devoted to explain the methods used in this work.
The analysis and build up of the functions is done in Chapter 4.
These functions are then used in Chapter 5 to obtain results for different groups of cities.
Finally, in Chapter 6 we summarize our findinds and discuss possible ways to expand this work.

## 2 - Data <a name="Data"></a>

<!-- where you describe the data that will be used to solve the problem and the source of the data. -->
To compare the different cities, we will obtain an extensive list of venues using the Foursquare API.
The categories of the venues - restaurant, park, stadium, etc. - will be used to cluster the cities in different groups, which will be displayed on a map and the final _character_ of each city will be printed out.
The total number of venues fetched from the API will also be used, and can be seen as a measure of the city size.
Other types of data may be added later, to make the comparison more complete.

## 3 - Methodology <a name="Methodology"></a>

<!-- which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why. -->
Our goal is to compare different cities. 
With this intent, we are going to use K-Means clustering algorithm, which labels the cities in a chosen number of labels.
Since the final result may depend on the initial choice of the centroids, we chose to initialize the method with 10 different initial configurations.
The number of clusters will be chosen depending on the number of cities in the list.
This method provides an initial idea of which cities are more similar or distinct.
To quantify the distinction between the cities, we calculate the distance matrix using the Euclidean distance from the `scipy` package.
This provides a numerical quantity that is used to distinguish either common or uncommon places.
We also create a `Folium` map to display the resulting clustering.

The features to be used on the methods above are the categories of the venues that will be fetched from the Foursquare API.
On top of that, we will use the property 'distance' of the venues provided by Foursquare.
However, one must be very careful when trying to define quantities from the distances of the venues. 
In this project, we initially defined features such as: the total distance of venues (summing up all the distances of the venues), the maximum distance (obtained from the fetched venue that is located further from the location but still inside the city). 
Those quantities were intented to give an idea of the size of the city. 
Nevertheless, since the fecthing process does not acquire all the venues equally, the resulting values did not reflect reality: small cities like Jülich had larger values than Amsterdan or New York.
To avoid outlier values of distances, the median was also tried - but without success.

Finally, we have then defined the quantity _density_, which is obtained from the number of venues inside a given radius (in this case, 200km).
Even then, the _density_ value must carefully considered for two reasons: some cities does not have all the venues registered in the database and the geography of the place may hinder large density of venues even in large cities (as happens, for example, in Rio de Janeiro).
For this reason, an optional input was added to use the density or not when clustering the cities (use_dentity=True or False).

Since the numbers obtained for the features built upon the average of venues categories are small compared to the density one, we need to use feature scaling.
This not only avoids overestimating the importante of the density feature, but is also useful for future extensions that may be included.

The clustering, distance matrix and map is returned by a single function that takes as an input a list of addresses, the number of clusters (default value is 3) and the option use_density. If the later is chosen to be True, the marker size is calculated with the value of the density. Separate functions for each stage are also provided.

## 4 - Analysis <a name="Analysis"></a>

We first import the libraries required for the analysis:

In [25]:
# Numerical library
import numpy as np 

# Dataframe library
import pandas as pd 

# To obtain latitude and longitude values for a given addresses
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim

# library to handle JSON files
import json 

# library to handle requests to the API
import requests 

And define the sentitive data for the Foursquare API:

In [26]:
# The code was removed by Watson Studio for sharing.

### Generic Functions

To make a tool that may be used to answer different types of questions regarding city comparisons, it is convenient to build generic functions.

* Function to get latitude and longitude from an address as an input:

In [27]:
def get_latitude_longitude(address):
    geolocator = Nominatim(user_agent="city_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    return [latitude,longitude]

* Function to build a DataFrame with the address and it's geographical coordinates (Latitude and Longitude):

In [28]:
def build_latitude_longitude_df(addresses):
    latitude_longitude = []
    for ind,address in enumerate(addresses):
        latitude_longitude.append( get_latitude_longitude(address) )
        # print('The geograpical coordinates of {} are {}, {}.'.format(address, latitude_longitude[ind][0], latitude_longitude[ind][1]))

    # Creating a dataframe object
    df = pd.DataFrame({'Address':addresses,'Latitude':np.array(latitude_longitude)[:,0],'Longitude':np.array(latitude_longitude)[:,1]})
    return df

* Function that fetches Venues of different cities:

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000, LIMIT=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&intent=browse&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['venues']

        # Check if list is empty
        if(len(results) == 0):
            print('City {} returned 0 venues by Foursquare.'.format(name))
            continue
        
        # Checking first 10 results for possible alternate names for the city (e.g., for cities written in a different language)
        alt_names = set([name])
        for v in results[0:10]:
            try:
                city = v['location']['city']
                alt_names.add(v['location']['city'])
            except:
                #print ("city not found in {}".format(v))
                pass
        # print('Alternate names: {}'.format(alt_names))

        # Return only relevant information for each nearby venue:

        for v in results:
            # Avoid venues with empty categories
            try:
                category = v['categories'][0]['name']
            except:
#                 print ("category not found in {}".format(v))
                continue
            
            
            # Filtering venues outside of the city:
            try:
                city = v['location']['city']
            except:
                city = None
#                 print ("city not found in {}".format(v))
#                 categories_list = v['venue']['location']['city']  row['venue.categories']
            if (city not in alt_names):
#                 print('Venue in city {} instead of {}.'.format(city, name))
                continue

            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['name'], 
                v['location']['lat'], 
                v['location']['lng'],
                v['location']['distance'],
                v['categories'][0]['name'])])
#         print('For neighborhood {}, {} venues were returned by Foursquare.'.format(name,len(results)))

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Address', 
                  'Address Latitude', 
                  'Address Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue Distance',
                  'Venue Category']
    print('{} uniques categories were obtained, which will be transformed into features.'.format(len(nearby_venues['Venue Category'].unique())))
    return(nearby_venues)

* Function to transform the Categories into columns with binary values:

In [30]:
def build_one_hot_df(df,column,extra_columns):
    venues_onehot = pd.get_dummies(df[[column]], prefix="", prefix_sep="")

    # add extra columns back to dataframe...
    venues_onehot[extra_columns] = df[extra_columns]

    # ...and moving to first position:
    venues_onehot = venues_onehot.set_index(extra_columns).reset_index()
    return venues_onehot

* We also build a function to use when grouping the values for each city. This function will also generate new features that may represent the city:

In [31]:
def build_features(df):
    d = {}
    d = df.mean()
    d['Density'] = df[df['Venue Distance'] < 200.0]['Address'].count()/200.0

    return pd.Series(d)

* And finally, a wrapping function that takes a list of cities and builds the final DataFrame to be used with different Machine Learning methods.
This function performs the following tasks:
 - Gets the Latitude and Longitude of the cities;
 - Obtain the list of venues, their categories and their distances;
 - Transforms the categories into features (columns) with binary values;
 - Groups the different cities by taking the average for each category, and optionally the number of venues inside a radius ('Density').

In [32]:
def build_dataframe(addresses, use_density=True):
    df_in = build_latitude_longitude_df(addresses)

    venues = getNearbyVenues(names=df_in['Address'],
                             latitudes=df_in['Latitude'],
                             longitudes=df_in['Longitude']
                            )

    venues_onehot = build_one_hot_df(venues, 'Venue Category', ['Address','Venue Distance'])

    venues_onehot = venues_onehot.groupby('Address').apply(build_features).drop('Venue Distance', axis=1)

    if(use_density==False):
        venues_onehot.drop('Density', axis=1, inplace=True)

    return venues_onehot

### Feature scaling

Even though all our features have values between 0 and 1, the numbers corresponding to the average categories can be small compared to the density (when this is included).
Besides, if further extensions are made to the feature list, normalizing the values will avoid overestimating or underestimating the importance of a certain feature.
Therefore, we scale the features using the `StandardScaler` from the `sklearn` preprocessing library

In [84]:
from sklearn.preprocessing import StandardScaler

In [89]:
def scale_features(x):
    x = x.values #returns a numpy array
    min_max_scaler = MinMaxScaler()
    scaled_features = StandardScaler().fit_transform(x)
    return scaled_features

### Distance calculation

To quantify the difference between the cities, we calculate the distance matrix using the _Euclidean distance_.
We first import the `scipy` library:

In [13]:
import scipy

In [77]:
def calc_dist_matrix(feature_df):
    leng = feature_df.shape[0]
    D = scipy.zeros([leng,leng])

    for i in range(leng):
        for j in range(leng):
            D[i,j] = scipy.spatial.distance.euclidean(feature_df[i], feature_df[j])
    return D

### Clustering cities

To recognize the characteristics of each city after performing some machine learning algorithm, we can analyse the top venues in a civen city using the following function:

In [33]:
# Defining the function to return the most common category
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Defining the function to build a DF with the most common venues as columns
def build_top_venues_df(input_df,num_top_venues = 10):
    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Address']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Address'] = input_df['Address']

    for ind in np.arange(input_df.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(input_df.iloc[ind, :], num_top_venues)

    return neighborhoods_venues_sorted

So, we build a Dataframe with all the features (venues categories and, optionally, density), and apply the K-Means Clustering algorithm to group the cities. After clustering the different cities, we use the function above to print the main categories for each city and their respective cluster.

In [34]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [87]:
df_complete = build_dataframe(addresses, use_density=True)
df_complete

259 uniques categories were obtained, which will be transformed into features.


Unnamed: 0_level_0,Accessories Store,Adult Boutique,Airport Service,American Restaurant,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Train Station,Trattoria/Osteria,Travel Agency,Turkish Restaurant,University,Well,Wine Bar,Winery,Women's Store,Density
Address,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Amsterdam, Netherlands",0.0,0.028986,0.0,0.007246,0.0,0.014493,0.007246,0.014493,0.0,0.007246,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.69
"Cologne, Germany",0.007812,0.0,0.0,0.0,0.007812,0.0,0.007812,0.0,0.0,0.007812,...,0.007812,0.0,0.007812,0.007812,0.0,0.007812,0.007812,0.0,0.0,0.61
"Dublin, Ireland",0.00813,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01626,0.0,0.0,0.0,0.00813,0.0,0.03252,0.59
"Juelich, Germany",0.0,0.0,0.0,0.01087,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.315
"London, England",0.0,0.0,0.0,0.00885,0.0,0.00885,0.0,0.0,0.00885,0.0,...,0.0,0.0,0.017699,0.0,0.0,0.0,0.0,0.0,0.00885,0.51
"New York, USA",0.0,0.0,0.0,0.007692,0.0,0.0,0.0,0.0,0.0,0.007692,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015385,0.615
"Sao Paulo, Brazil",0.0,0.0,0.006579,0.013158,0.0,0.0,0.0,0.006579,0.0,0.006579,...,0.0,0.0,0.006579,0.0,0.006579,0.0,0.0,0.006579,0.0,0.725


In [91]:
pd.DataFrame(scale_features(df_complete),columns=df_complete.columns, index=df_complete.index)

Unnamed: 0_level_0,Accessories Store,Adult Boutique,Airport Service,American Restaurant,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Train Station,Trattoria/Osteria,Travel Agency,Turkish Restaurant,University,Well,Wine Bar,Winery,Women's Store,Density
Address,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Amsterdam, Netherlands",-0.63228,2.44949,-0.408248,0.088411,-0.408248,2.034685,1.496441,2.204617,-0.408248,0.838042,...,-0.408248,-0.408248,-0.979637,-0.408248,-0.408248,-0.408248,-0.63228,-0.408248,-0.710156,0.881308
"Cologne, Germany",1.536617,-0.408248,-0.408248,-1.453253,2.44949,-0.608067,1.662713,-0.57796,-0.408248,0.993271,...,2.44949,-0.408248,0.12839,2.44949,-0.408248,2.44949,1.536617,-0.408248,-0.710156,0.244492
"Dublin, Ireland",1.624783,-0.408248,-0.408248,-1.453253,-0.408248,-0.608067,-0.631831,-0.57796,-0.408248,-1.148887,...,-0.408248,-0.408248,1.3265,-0.408248,-0.408248,-0.408248,1.624783,-0.408248,2.138281,0.085288
"Juelich, Germany",-0.63228,-0.408248,-0.408248,0.859243,-0.408248,-0.608067,-0.631831,-0.57796,-0.408248,-1.148887,...,-0.408248,2.44949,-0.979637,-0.408248,-0.408248,-0.408248,-0.63228,-0.408248,-0.710156,-2.103767
"London, England",-0.63228,-0.408248,-0.408248,0.429487,-0.408248,1.005649,-0.631831,-0.57796,2.44949,-1.148887,...,-0.408248,-0.408248,1.530582,-0.408248,-0.408248,-0.408248,-0.63228,-0.408248,0.064971,-0.551528
"New York, USA",-0.63228,-0.408248,-0.408248,0.183282,-0.408248,-0.608067,-0.631831,-0.57796,-0.408248,0.960314,...,-0.408248,-0.408248,-0.979637,-0.408248,-0.408248,-0.408248,-0.63228,-0.408248,0.637374,0.284293
"Sao Paulo, Brazil",-0.63228,-0.408248,2.44949,1.346084,-0.408248,-0.608067,-0.631831,0.685183,-0.408248,0.655035,...,-0.408248,-0.408248,-0.046562,-0.408248,2.44949,-0.408248,-0.63228,2.44949,-0.710156,1.159915


In [126]:
def group_and_cluster(addresses, kclusters = 3, use_density=True):
    # Building Dataframe with all the Features to be used in k-means clustering
    df_complete = build_dataframe(addresses, use_density=use_density)

    # Scaling features
    scaled_features = scale_features(df_complete)
    
    # Calculating the distance matrix between the cities and transforming to DataFrame:
    dist_matrix = pd.DataFrame(calc_dist_matrix(scaled_features), columns=addresses, index=addresses)

    # Recovering Dataframe with scaled features:
    # df_complete = pd.DataFrame(scaled_features,columns=df_complete.columns, index=df_complete.index)

    # run k-means clustering
    kmeans = KMeans(init = "k-means++", n_clusters=kclusters, n_init = 12).fit(scaled_features)

    # Creating DataFrame with top venue categories
    if(use_density == True):
        df_new = df_complete.drop('Density', axis=1)
    else:
        df_new = df_complete
    top_venues_df = build_top_venues_df(df_new.reset_index())


    # Add Density and Clustering labels
    if(use_density==True):
        top_venues_df.insert(0, 'Density', df_complete.reset_index()['Density'])
    top_venues_df.insert(0, 'Cluster Labels', kmeans.labels_)

    
    cities = build_latitude_longitude_df(addresses)

    # Merge the top venue categories dataframe with the one with city names and latitude/longitude
    # The merge on the 'right' is meant to avoid the rare case that an addresses returns no venues from the Foursquare API
    top_venues_df_final = cities.join(top_venues_df.set_index('Address'), on='Address', how='right')

    return dist_matrix, top_venues_df_final

### Visualizing the results

Lastly, to visualize the cities and their cluster, we use a Folium map. First, importing the required libraries:

In [41]:
# Matplotlib color modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Map rendering library
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
import folium

And we define a complete function that creates the dataframe and cluster the cities, returning an interactive map with their labels printed out, as well as a dataframe with the top 10 venues characteritics:

In [106]:
def group_cluster_map(addresses,kclusters = 3, use_density=True):

    # Create Database with features and cluster cities using k-means
    dist_matrix, df_clustered = group_and_cluster(addresses, kclusters=kclusters, use_density=use_density)

    # Create Folium map
    map_clusters = folium.Map()

    # Fit minimum and maximum latitude and longitude
    map_clusters.fit_bounds([[df_clustered['Latitude'].min(), df_clustered['Longitude'].min()], [df_clustered['Latitude'].max(), df_clustered['Longitude'].max()]])

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # Add markers to the map:
    if(use_density==False): # Not using density
        markers_colors = []
        for lat, lon, poi, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Address'], df_clustered['Cluster Labels']):
            label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
            folium.CircleMarker(
                [lat, lon],
                radius=5,
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7).add_to(map_clusters)
    else: # Using density for marker size
        markers_colors = []
        for lat, lon, poi, size, cluster in zip(df_clustered['Latitude'], df_clustered['Longitude'], df_clustered['Address'], df_clustered['Density'], df_clustered['Cluster Labels']):
            label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
            folium.CircleMarker(
                [lat, lon],
                radius=10*(size**2),
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7).add_to(map_clusters)
    return df_clustered, dist_matrix, map_clusters

## 5 - Results and Discussion <a name="Results"></a> 

<!-- where you discuss the results. -->
<!-- where you discuss any observations you noted and any recommendations you can make based on the results. -->
With the generic functions defined in the previous chapter, we can now use it to compare different cities and try to answer a few questions.

### Big and small cities

We start by comparing big cities (Dublin, London, Cologne, Amsterdam, New York and São Paulo), as well as a small one (Jülich, Germany):

In [107]:
addresses = ['Dublin, Ireland', 'London, England', 'Cologne, Germany', 'Juelich, Germany', 'Amsterdam, Netherlands', 'New York, USA', 'Sao Paulo, Brazil']
df_final, dist_matrix, map_clusters = group_cluster_map(addresses,kclusters = 3, use_density=True)

259 uniques categories were obtained, which will be transformed into features.


The returned Dataframe lists the most common categories of venues, as well as the cluster calculated by the K-Means algorithm:

In [100]:
df_final

Unnamed: 0,Address,Latitude,Longitude,Cluster Labels,Density,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Dublin, Ireland",53.349764,-6.260273,0,0.59,Mobile Phone Shop,Coffee Shop,Salon / Barbershop,Bus Line,Bus Stop,Women's Store,Café,Fast Food Restaurant,Ice Cream Shop,Convenience Store
1,"London, England",51.507322,-0.127647,0,0.51,Office,Outdoor Sculpture,Taxi,Government Building,Italian Restaurant,Coworking Space,Embassy / Consulate,Tech Startup,Coffee Shop,Business Service
2,"Cologne, Germany",50.938361,6.959974,2,0.61,Hotel,Office,Bar,Italian Restaurant,German Restaurant,Plaza,Gay Bar,Doctor's Office,Residential Building (Apartment / Condo),Historic Site
3,"Juelich, Germany",50.922093,6.361102,0,0.315,Miscellaneous Shop,Pizza Place,Salon / Barbershop,Optical Shop,Café,Bank,Pharmacy,Clothing Store,Shoe Store,Bakery
4,"Amsterdam, Netherlands",52.37454,4.897976,2,0.69,Bar,Coffee Shop,Burger Joint,Office,Adult Boutique,Marijuana Dispensary,Bridge,Café,Hostel,Hotel
5,"New York, USA",40.712728,-74.006015,2,0.615,Office,Taxi,Food Truck,Government Building,College Classroom,Building,Lawyer,Bar,Bus Station,General College & University
6,"Sao Paulo, Brazil",-23.550651,-46.633382,1,0.725,Office,Pharmacy,College Academic Building,Historic Site,Cosmetics Shop,Building,Courthouse,Church,Music Venue,Outdoor Sculpture


We can note from the results above that a small city (Jülich) may be clustered together with the big cities in spite of other big cities such as São Paulo being in a separate cluster.  
The distance matrix provides a quantification for those differences:

In [105]:
dist_matrix

Unnamed: 0,"Dublin, Ireland","London, England","Cologne, Germany","Juelich, Germany","Amsterdam, Netherlands","New York, USA","Sao Paulo, Brazil"
"Dublin, Ireland",0.0,24.246767,24.564217,25.478818,24.393808,24.775783,25.935727
"London, England",24.246767,0.0,24.303802,24.877851,24.405267,24.845097,25.162867
"Cologne, Germany",24.564217,24.303802,0.0,22.601027,22.62778,25.166785,25.695264
"Juelich, Germany",25.478818,24.877851,22.601027,0.0,23.770524,24.673967,26.207041
"Amsterdam, Netherlands",24.393808,24.405267,22.62778,23.770524,0.0,22.914864,24.833543
"New York, USA",24.775783,24.845097,25.166785,24.673967,22.914864,0.0,25.365967
"Sao Paulo, Brazil",25.935727,25.162867,25.695264,26.207041,24.833543,25.365967,0.0


Note that the distance from São Paulo to all the other cities tend to be larger - even though other cities are also disparate (as Dublin from Jülich, or New York from Cologne).  
In this sense, the city that is more similar to Dublin is London, followed closely by Amsterdam and then Cologne.

The clusters can be plotted on a map, which can provide further insights to the results:

In [146]:
map_clusters

It seems that geographical distance have a large influence on the similarity or difference between the cities.
For example, Jülich and Cologne, nearby cities, have the closest Euclidian distance.
Another correlation can originate from historical reasons: New York and London are relatively similar to each other.

### Large differences

We now try to find the difference between very disparate big cities - a question that a traveller or an explorer may have.
We compare the cities of Shanghai, Tokyo, Johannesburg and São Paulo. 
As a mean of comparison, we also include Cologne and Jülich, the cities with shortest distance from the previous example.

In this case, the density will not be used, since Foursquare provide very different number of venues in each of the cities.

In [139]:
addresses = ['Shanghai, China', 'Tokyo, Japan', 'Johannesburg, South Africa', 'Sao Paulo, Brazil', 'Cologne, Germany', 'Juelich, Germany']
df_final_2, dist_matrix_2, map_clusters_2 = group_cluster_map(addresses, kclusters = 3, use_density=False)

240 uniques categories were obtained, which will be transformed into features.


We start by looking at the clusters and the most common venues:

In [140]:
df_final_2

Unnamed: 0,Address,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Shanghai, China",31.232274,121.469175,1,Chinese Restaurant,Coffee Shop,Office,Theater,Art Museum,Japanese Restaurant,Shopping Mall,Tea Room,Convenience Store,Italian Restaurant
1,"Tokyo, Japan",35.682839,139.759455,1,Historic Site,Japanese Restaurant,Office,Convenience Store,Italian Restaurant,Coworking Space,Chinese Restaurant,Hotel Bar,Police Station,Lounge
2,"Johannesburg, South Africa",-26.205,28.049722,1,Office,Building,Café,Automotive Shop,Clothing Store,Portuguese Restaurant,Fast Food Restaurant,Bank,Coworking Space,Breakfast Spot
3,"Sao Paulo, Brazil",-23.550651,-46.633382,2,Office,Pharmacy,College Academic Building,Historic Site,Building,Cosmetics Shop,Courthouse,Church,Monument / Landmark,Spa
4,"Cologne, Germany",50.938361,6.959974,1,Hotel,Office,Bar,Plaza,German Restaurant,Italian Restaurant,Gay Bar,Residential Building (Apartment / Condo),Road,Brewery
5,"Juelich, Germany",50.922093,6.361102,0,Miscellaneous Shop,Pizza Place,Salon / Barbershop,Café,Optical Shop,Pharmacy,Bank,Clothing Store,Doner Restaurant,Bakery


In this case, the city of São Paulo stands in a separate cluster, indicating that it is still the most different from the other ones.
This is confirmed by the distance matrix:

In [141]:
dist_matrix_2

Unnamed: 0,"Shanghai, China","Tokyo, Japan","Johannesburg, South Africa","Sao Paulo, Brazil","Cologne, Germany","Juelich, Germany"
"Shanghai, China",0.0,23.11047,24.129704,24.264237,23.550396,23.752523
"Tokyo, Japan",23.11047,0.0,23.546244,24.225114,23.327369,23.288283
"Johannesburg, South Africa",24.129704,23.546244,0.0,25.396546,24.318102,23.828016
"Sao Paulo, Brazil",24.264237,24.225114,25.396546,0.0,24.630309,25.496425
"Cologne, Germany",23.550396,23.327369,24.318102,24.630309,0.0,22.96663
"Juelich, Germany",23.752523,23.288283,23.828016,25.496425,22.96663,0.0


Note that, even though the distance between Cologne and Jülich is still the smallest one, K-Means clustering placed them in different clusters.

We can extend now the clustering results between the big cities above by decreasing the list size:

In [142]:
addresses = ['Shanghai, China', 'Tokyo, Japan', 'Johannesburg, South Africa', 'Sao Paulo, Brazil']
df_final_3, dist_matrix_3, map_clusters_3 = group_cluster_map(addresses, kclusters = 3, use_density=False)

193 uniques categories were obtained, which will be transformed into features.


In [143]:
df_final_3

Unnamed: 0,Address,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Shanghai, China",31.232274,121.469175,2,Chinese Restaurant,Coffee Shop,Office,Convenience Store,Noodle House,Italian Restaurant,Shopping Mall,Japanese Restaurant,Theater,Art Museum
1,"Tokyo, Japan",35.682839,139.759455,1,Historic Site,Japanese Restaurant,Convenience Store,Office,Police Station,Italian Restaurant,Lounge,Coworking Space,Hotel Bar,Chinese Restaurant
2,"Johannesburg, South Africa",-26.205,28.049722,1,Office,Building,Café,Automotive Shop,Clothing Store,Bank,Fast Food Restaurant,Portuguese Restaurant,Coworking Space,Doctor's Office
3,"Sao Paulo, Brazil",-23.550651,-46.633382,0,Office,Pharmacy,College Academic Building,Cosmetics Shop,Building,Historic Site,Church,Courthouse,Music Venue,Road


In [144]:
dist_matrix_3

Unnamed: 0,"Shanghai, China","Tokyo, Japan","Johannesburg, South Africa","Sao Paulo, Brazil"
"Shanghai, China",0.0,22.822803,22.040519,21.559651
"Tokyo, Japan",22.822803,0.0,23.566859,24.281156
"Johannesburg, South Africa",22.040519,23.566859,0.0,21.715092
"Sao Paulo, Brazil",21.559651,24.281156,21.715092,0.0


We see now that São Paulo seems closer to Shanghai and Johannesburg, and Tokyo is the most different one.
Although this seems to be an unintuitive change compared to the previous result, this may be caused by the feature scaling that now does not include Cologne and Jülich.

## 6 - Conclusion <a name="Conclusion"></a>

<!-- section where you conclude the report. -->
In this work, we have compared cities using K-Means clustering and Euclidian distances.
This was done using the categories of the venues in a city.
We have also defined the number of venues inside a given radius as a measure of the density of a city.
This allowed us to answer the question: can a large city be more like a small city that has similar venues than another big city with different venues?
Our results showed that it can, and geographical and historical correlations may play a bigger role than the size of the city.

We have also compared cities in different corners of the world, to see how so contrasting cities would be compared by our algorithm.
The distances we obtained were, in fact, comparable to the ones for cities that are alike.
These results, however, are certainly not exact.
Our list of features only took into account the list of venues in a city.
A realistic result should also include the cost of living (such as average rental and price of food), language, density of streets, number of cars, and so on.
Not only that, but our results may also be affected by the amount of venues retrieved from locations where Foursquare is not common.
Even though the comparisons made by our algorithm can already provide important insights regarding the cities, the extension of the features and improvement of the data will certainly make them more realistic.
Another possible extension for this project is to implement other functionalities.
The Decision Tree method, for example, can refine the results by displaying where and how one city diverge from the other.

Finally, apart from the numerical results obtained in this work, the functions defined here may also be used as examples and templates for future projects.