# Project: Clustering European Capitals

## Table of Contents
* 1. Introduction
* 2. The Data
* 3. Methodology
* 4. Results and Discussion
* 5. Conclusion

## 1. Introduction

### 1.1 Background

Europe is a region with outstanding landscapes, home to many different cultures and rich in history, literature, architecture and music. It exposes its inhabitants and travelers to multiple languages, cultures, cuisines and many other experiences.
For people who want to travel or move to a European city, it would be useful to know which city is most similar to the one they currently live in. Beyond that, for the sake of curiosity, it would be interesting to know which venues are the most common in each city.


### 1.2 The Problem

The problem consists of clustering the capitals of all European countries based on their most common venues. This project has two main objectives:
 * __First__: Cluster the European capitals based on their similarity.
 * __Second__: Determine which European capitals are most similar to a given city. In this project, Toronto is the 'given city'.


## 2. The Data

### 2.1 Collecting the Data

To collect the data for this problem, the first step was to scrape a [Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe) page to get all the capitals of Europe. The page lists the internationally recognised sovereign states in Europe. However, since Vatican City is located inside Rome, which is itself a European capital, Vatican City was excluded from this analysis, leaving 49 capitals.
After collecting the relevant information from the Wikipedia page, the Nominatim geocoder was used to get the latitude and longitude of each European capital.

In the next step, the [Foursquare API](https://developer.foursquare.com/) was used to collect a large number of venues for each capital. Almost 700 venues were collected per capital, within a radius of 30,000 meters of its latitude and longitude. The following data was collected for each venue: its name, its category, its latitude and its longitude. All this information was then stored in a DataFrame, with duplicates removed. In the end, 25,332 venues in 489 categories were collected. The resulting DataFrame looks like the image below, where each row represents a venue.

[Table 1](Images/Table1.png)
![Table 1](Images/Table1.png)



### 2.2 Cleaning and Preparing the Data

After collecting the data, it was necessary to clean it and prepare it for the clustering step. To do so, one column was created per venue category, and the DataFrame was grouped by capital so that each row represents a capital and each cell holds the frequency of a venue category in that city. After this step, the resulting DataFrame looked like the table below.

[Table 2](Images/Table2.png)
![Table 2](Images/Table2.png)
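The reshaping described in this section can be sketched in plain Python as follows. This is only an illustration, not the project's actual code: the sample venues and the helper name `category_frequencies` are hypothetical, and "frequency" is assumed here to mean the share of a capital's venues that fall in each category.

```python
from collections import Counter, defaultdict

# Hypothetical sample: one (capital, venue category) pair per collected venue.
venues = [
    ("Lisbon", "Plaza"),
    ("Lisbon", "Bar"),
    ("Lisbon", "Bar"),
    ("Lisbon", "Museum"),
    ("Oslo", "Grocery Store"),
    ("Oslo", "Bar"),
]


def category_frequencies(venues):
    """Return {capital: {category: share of that capital's venues}}.

    Assumption: frequency is a relative share, so each row sums to 1.
    """
    counts = defaultdict(Counter)
    for capital, category in venues:
        counts[capital][category] += 1

    # Every capital gets the same columns; absent categories become 0.0.
    categories = sorted({category for _, category in venues})
    table = {}
    for capital, counter in counts.items():
        total = sum(counter.values())
        table[capital] = {c: counter[c] / total for c in categories}
    return table


freq = category_frequencies(venues)
```

Giving every capital the same set of columns matters for the clustering step, since each row must be a vector of the same length.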


## 3. Methodology

### 3.1 Criteria and Analysis

After preparing the data, some exploratory analysis was carried out. First, it was noted that some distinct categories mean almost the same thing, for instance *Coffee Shop* and *Café*, or *Gym* and *Gym / Fitness Center*. Since these are among the most frequent categories, the first two were merged under the label *Coffee* and the last two under *Gym*.
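One way to express this merge is a small alias table applied before the frequencies are computed. The sketch below is an assumption about how it could be done; the names `CATEGORY_ALIASES`, `normalize_category` and `merge_category_counts` are hypothetical.

```python
# Aliases taken from the text: near-synonymous categories share one label.
CATEGORY_ALIASES = {
    "Coffee Shop": "Coffee",
    "Café": "Coffee",
    "Gym": "Gym",
    "Gym / Fitness Center": "Gym",
}


def normalize_category(category):
    """Map a raw category to its merged label, or keep it unchanged."""
    return CATEGORY_ALIASES.get(category, category)


def merge_category_counts(counts):
    """Sum the venue counts of categories that map to the same label."""
    merged = {}
    for category, n in counts.items():
        label = normalize_category(category)
        merged[label] = merged.get(label, 0) + n
    return merged


# Hypothetical counts for a single capital.
example = merge_category_counts(
    {"Coffee Shop": 3, "Café": 2, "Gym / Fitness Center": 1, "Bar": 4}
)
```

Keeping the aliases in one dictionary makes it easy to add further pairs later, such as *Pub* and *Bar*.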

After that, a DataFrame was built to show the 10 most frequent categories for each capital, as shown in the table below.

 [Table 3](Images/Table_3.png)
![Table 3](Images/Table_3.png)

The objective of this project is to identify how similar the cities are based on their venue categories, and more specifically on their peculiarities. Since some venue categories are among the most frequent in almost all cities, it was decided to disregard any category that appears in the Top 5 most frequent venues of 40 or more capitals. The next step was therefore to identify these categories. The following bar plot shows how often each category appears among the Top 5 categories in **Table 3**.

![Bar 1](Images/Bar_1.png)

As shown in the figure above, the venue categories *Coffee*, *Supermarket*, *Gym* and *Park* appear among the most frequent in at least 40 of our 49 capitals. As a criterion for this work, these categories were disregarded.
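The filtering rule above can be sketched as follows. This is a hedged illustration rather than the original code: the function names and the toy table are hypothetical, and the toy example uses a Top 2 and a threshold of 3 capitals in place of the Top 5 and 40 capitals used in the project.

```python
from collections import Counter


def top_n(frequencies, n=5):
    """Return the n most frequent categories of one capital (ties by name)."""
    present = [item for item in frequencies.items() if item[1] > 0]
    ranked = sorted(present, key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:n]]


def ubiquitous_categories(table, n=5, min_capitals=40):
    """Categories that are in the top n of at least min_capitals capitals."""
    appearances = Counter()
    for frequencies in table.values():
        appearances.update(top_n(frequencies, n))
    return {c for c, k in appearances.items() if k >= min_capitals}


def drop_categories(table, to_drop):
    """Return a copy of the table without the given categories."""
    return {
        capital: {c: f for c, f in frequencies.items() if c not in to_drop}
        for capital, frequencies in table.items()
    }


# Hypothetical toy table with three capitals.
toy = {
    "A": {"Coffee": 0.5, "Bar": 0.3, "Plaza": 0.2},
    "B": {"Coffee": 0.6, "Bar": 0.1, "Plaza": 0.3},
    "C": {"Coffee": 0.4, "Bar": 0.35, "Plaza": 0.25},
}
common = ubiquitous_categories(toy, n=2, min_capitals=3)
filtered = drop_categories(toy, common)
```

In the toy table only *Coffee* is in every capital's Top 2, so it is the only category removed.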


### 3.2 Clustering

Now that the data is prepared, it's time to cluster our capitals. Since our data is not labeled and we are trying to identify patterns among the capitals, the unsupervised K-Means algorithm was chosen. The [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) library was used for this task.

The capitals were divided into 10 different clusters. The figure below shows a map of Europe with our clusters, where each color represents a different cluster. Each cluster is analysed in detail in the Results section.

![Europe](Images/Europe.png)

The second part of this project is to identify which European capitals Toronto is most similar to, based on its venues. All the steps described in this report were applied again to obtain the data for Toronto, except for the clustering step. Instead, the model fitted above was used to predict the cluster label of the new city. The results can be seen in the Results section.
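Predicting a label with a fitted K-Means model amounts to assigning the new point to the cluster whose centroid is nearest. The sketch below illustrates that idea in plain Python instead of calling sklearn; the column names, centroids and city frequencies are made-up values, and the helper names are hypothetical. Note that the new city's vector has to use the same columns, in the same order, as the data the model was fitted on.

```python
def align(frequencies, columns):
    """Build a feature vector in the training column order.

    Categories missing from the new city become 0.0; categories the
    model never saw are ignored.
    """
    return [frequencies.get(c, 0.0) for c in columns]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to the vector."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(vector, centroids[i]),
    )


# Hypothetical training columns and centroids of an already fitted model.
columns = ["Bar", "Grocery Store", "Plaza"]
centroids = [
    [0.6, 0.1, 0.3],
    [0.2, 0.7, 0.1],
]

# Hypothetical frequencies for the new city, including an unseen category.
new_city = {"Grocery Store": 0.5, "Bar": 0.3, "Sushi Restaurant": 0.2}

vector = align(new_city, columns)
label = nearest_cluster(vector, centroids)
```

Here the new city lands in the second cluster (index 1), whose centroid is dominated by *Grocery Store*.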


## 4. Results and Discussion
### 4.1 Clusters of European Capitals

This section presents all the clusters obtained with the K-Means algorithm:

### Cluster 1
The table below shows all the capitals that ended up in the first cluster, along with their 10 most common venues.

[Table 4](Images/Cluster_1.png)
![Table_Cluster 1](Images/Cluster_1.png)

The bar plot below shows the 5 most frequent venue categories in Table 4.
<img src="Images/Bar_Cluster1.png" alt="Drawing" style="width: 700px;"/>

It seems that all the capitals in this cluster have *Plaza* as one of their most frequent venue categories, followed by *Bar* and *Theater*.

### Cluster 2
The table below shows all the capitals that ended up in the second cluster, along with their 10 most common venues.

[Table 5](Images/Cluster_2.png)

![Table_Cluster 2](Images/Cluster_2.png)

The bar plot below shows the 5 most frequent venue categories in Table 5.
<img src="Images/Bar_Cluster2.png" alt="Drawing" style="width: 700px;"/>

Looking at the table and the bar plot, we can see that all three capitals have the categories *Bar* and *Restaurant* in common. However, the only capital that has all of these top categories in its list of most frequent venues is __Andorra la Vella__, while *Bar* and *Restaurant* are the only two categories that __Vaduz__ and __Podgorica__ share. This suggests that the last two cities are each more similar to __Andorra la Vella__ than to each other.

### Cluster 3

The table below shows all the capitals that ended up in the third cluster, along with their 10 most common venues.

[Table 6](Images/Cluster_3.png)

![Table_Cluster 3](Images/Cluster_3.png)

The bar plot below shows the 5 most frequent venue categories in Table 6.
<img src="Images/Bar_Cluster3.png" alt="Drawing" style="width: 700px;"/>

All the capitals in this cluster appear to have the categories *Restaurant*, *Pub* and *Theater* in common. They apparently differ from Cluster 1 in not having *Plaza* among their most frequent venue categories.

### Cluster 4
The table below shows all the capitals that ended up in the 4th cluster, along with their 10 most common venues.

[Table 7](Images/Cluster_4.png)

![Table_Cluster 4](Images/Cluster_4.png)

The bar plot below shows the 5 most frequent venue categories in Table 7.

![Bar_4](Images/Bar_Cluster4.png)

This cluster has only two capitals, which have the categories *Italian Restaurant* and *Ice Cream Shop* in common. It is worth noting that San Marino's territory is located entirely inside Italy, so it makes sense that these two capitals ended up in the same cluster.

### Cluster 5

The table below shows the only capital that ended up in the 5th cluster, along with its 10 most common venues. Since there is just one capital in this cluster, the bar plot is not shown. This capital may be considered an outlier, being dissimilar from the other European capitals.

[Table 8](Images/Cluster_5.png)

![Table_Cluster 5](Images/Cluster_5.png)


### Cluster 6
The table below shows all the capitals that ended up in the 6th cluster, along with their 10 most common venues.

[Table 9](Images/Cluster_6.png)

![Table_Cluster 6](Images/Cluster_6.png)

The bar plot below shows the 5 most frequent venue categories in Table 9.

![Bar_6](Images/Bar_Cluster6.png)

As we can see, this cluster contains all the Scandinavian capitals plus the capital of Switzerland. We can also note that all these capitals have *Grocery Store* among their most frequent venue categories. Perhaps they ended up in the same cluster because all these capitals are famous for their chocolate.

### Cluster 7

The table below shows the only capital that ended up in the 7th cluster, along with its 10 most common venues. Since there is just one capital in this cluster, the bar plot is not shown. This capital may be considered an outlier, being dissimilar from the other European capitals.

[Table 10](Images/Cluster_7.png)

![Table_Cluster 7](Images/Cluster_7.png)


### Cluster 8
The table below shows all the capitals that ended up in the 8th cluster, along with their 10 most common venues.

[Table 11](Images/Cluster_8.png)

![Table_Cluster 8](Images/Cluster_8.png)


The bar plot below shows the 5 most frequent venue categories in Table 11.
![Bar_8](Images/Bar_Cluster8.png)

One thing to notice here is that all the capitals have not only five but actually six of their most frequent venue categories in common. This cluster may contain the capitals that are most similar to each other.

### Cluster 9
The table below shows the only capital that ended up in the 9th cluster, along with its 10 most common venues. Since there is just one capital in this cluster, the bar plot is not shown. This capital may be considered an outlier, being dissimilar from the other European capitals.

[Table 12](Images/Cluster_9.png)

![Table_Cluster 9](Images/Cluster_9.png)


### Cluster 10
The table below shows all the capitals that ended up in the 10th cluster, along with their 10 most common venues.

[Table 13](Images/Cluster_10.png)

![Table_Cluster 10](Images/Cluster_10.png)

The bar plot below shows the 5 most frequent venue categories in Table 13.
![Bar_10](Images/Bar_Cluster10.png)

In this cluster, all the capitals have the category *Bar* in common. In fact, *Bar* is the __most__ frequent venue category in all of these capitals, followed by *Restaurant*, *Plaza*, *Theater* and *Museum*.

### 4.2 Testing For Toronto

After clustering all the European capitals, it is time to answer the following question: "Which of these clusters would Toronto fit into?" To answer it, venue data for Toronto was collected, cleaned and prepared so that our model could classify the city. From this data, the following table with Toronto's 10 most frequent categories was created.


[Table 14](Images/Toronto_Table.png)

![Table_Toronto](Images/Toronto_Table.png)

According to our model, Toronto fits in cluster number 6. The map below shows the European capitals that are most similar to Toronto based on their venue categories.

![Map2](Images/Toronto_Map.png)

Analysing the results, it seems that Toronto ended up in cluster 6 because its list of most frequent venues has 3 venue categories in common with almost every capital in that cluster.

## 5. Conclusion

Analysing the results, it seems that the model managed to separate the European capitals reasonably well based on their venues. The model found Toronto to be more similar to the Scandinavian capitals than to the other European capitals, probably because *Grocery Store* was one of the most frequent venue categories collected in all of these cities.

It was noticed that there are different categories that mean nearly the same thing, such as *Pub* and *Bar*, or *Coffee Shop* and *Café*. One recommendation would therefore be to review such categories and merge each group into a single category.

For future improvements to the model, it would be worth clustering these capitals with a different number of clusters or with a different algorithm. It would also be valuable to take other parameters into consideration, such as the quality of living or the cost of living in each city.
