#  Capstone Project: The Battle of the Neighborhoods

##  Introduction

My friends living in Cologne, Germany wish to relocate for professional reasons to Berlin, Germany. They reached out to me for recommendations on the Berlin suburbs. Rental prices are not their priority. They mentioned metrics like:

* number of shops
* number of restaurants
* number of cafes
* number of services (banks, ATMs, offices, etc)

They know such a decision is very subjective, but since I have also lived in Cologne and know its suburbs quite well, they asked me whether I could make a mapping between the Cologne and Berlin suburbs. Since they are familiar with the Cologne suburbs, such a mapping gives them a first impression of the Berlin suburbs and makes it easier to reach a decision tailored to their needs.

##  Data needed for this project

In order to tackle such a problem, I collected the following data:

1. I web-scraped the official names of the [Berlin](https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin#Localities) and [Cologne](https://de.wikipedia.org/wiki/Liste_der_Stadtbezirke_und_Stadtteile_K%C3%B6lns) suburbs and queried a representative node for each suburb from the **Open Street Maps API**. I corrected the data to reflect the officially recognized suburbs and their names. At the end of this data extraction and transformation, all officially recognized suburbs of Berlin and Cologne were represented as nodes in WGS 84 (latitude/longitude) coordinates.

2. The **Foursquare API** was queried around each representative suburb node using a radius that was determined separately for Berlin and Cologne. I wanted to capture enough of each suburb's *character* while allowing the queries of neighboring suburbs to overlap, in order to blend the suburb boundaries. This is because the reality on the ground is not bound by administrative boundaries. A neighborhood can extend across a suburb boundary and maintain its character. In such a case, if most of the neighborhood lies on one side of the boundary, its similarity with the neighboring suburb will be missed unless the queries are allowed to overlap.

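3. For the transformation from World Geodetic System (WGS 84) latitude/longitude to Cartesian coordinates in meters I used the coordinate reference system (CRS) [EPSG:5243](https://epsg.io/5243), which is appropriate for Germany. EPSG:5243 is ETRS89 / LCC Germany, a Lambert conformal conic projection rather than a Universal Transverse Mercator (UTM) zone.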

## Methodology

Once the correct list of suburbs was acquired and geolocated, I visualized the results with **folium** for a final inspection. The next task was to determine the **Foursquare** query radius to use for each city. For that, I identified the nearest neighbor of each suburb and computed the corresponding Euclidean distance. I then computed percentile statistics and visualized the distribution of distances separately for each city. In the end, I settled on radii in the range of the lower 10th percentile of nearest-neighbor suburb distances in each city.
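
The radius selection described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the coordinates are hypothetical projected points in meters, and the `inclusive` percentile method is an assumption.

```python
import math
import statistics


def nearest_neighbor_distances(points):
    """For each (x, y) point in meters, return the distance to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


# Hypothetical projected suburb nodes in meters (not the project's data).
nodes = [(0, 0), (1000, 0), (1000, 1500), (4000, 1500), (4000, 5500)]

nn = nearest_neighbor_distances(nodes)
# With n=10, quantiles() returns the nine decile cut points;
# the first one is the 10th percentile.
p10 = statistics.quantiles(nn, n=10, method="inclusive")[0]
print(sorted(nn), p10)
```

In practice the percentile would be rounded to a convenient query radius for each city.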

I queried **Foursquare** for food, shop and service venues separately in each case and combined the results for each suburb. I kept only the venue categories that were found in both Berlin and Cologne. For each city I removed the suburbs in the lower quartile of venue counts as outliers. I then pooled the venue frequencies across the two cities and removed from the analysis the venue categories in the lower quartile of frequencies.
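
The three filtering steps can be sketched as follows. The suburb labels and counts are made up for illustration, and treating "lower quartile" as "at or below the first quartile" is an assumption about the exact cutoff.

```python
import statistics
from collections import Counter


def above_first_quartile(totals):
    """Return the keys whose value lies strictly above the first quartile."""
    q1 = statistics.quantiles(list(totals.values()), n=4, method="inclusive")[0]
    return {key for key, value in totals.items() if value > q1}


def restrict(city, categories):
    """Keep only the given venue categories in every suburb of a city."""
    return {
        suburb: Counter({c: n for c, n in counts.items() if c in categories})
        for suburb, counts in city.items()
    }


def filter_cities(city_a, city_b):
    # 1. Keep venue categories that occur in both cities.
    common = set().union(*city_a.values()) & set().union(*city_b.values())
    kept = []
    for city in (restrict(city_a, common), restrict(city_b, common)):
        # 2. Per city, drop suburbs in the lower quartile of venue counts.
        totals = {suburb: sum(counts.values()) for suburb, counts in city.items()}
        keep = above_first_quartile(totals)
        kept.append({s: c for s, c in city.items() if s in keep})
    # 3. Pool category frequencies over both cities and drop the rare ones.
    pooled = Counter()
    for city in kept:
        for counts in city.values():
            pooled.update(counts)
    frequent = above_first_quartile(pooled)
    return tuple(restrict(city, frequent) for city in kept)


# Hypothetical venue counts per suburb (labels B1..B4 and C1..C4 are placeholders).
berlin = {
    "B1": Counter({"Café": 6, "Bakery": 4, "Supermarket": 3, "Currywurst Joint": 2}),
    "B2": Counter({"Café": 3, "Bakery": 2, "Supermarket": 2, "Pharmacy": 1}),
    "B3": Counter({"Café": 1}),
    "B4": Counter({"Bakery": 2, "Pharmacy": 2, "Supermarket": 1}),
}
cologne = {
    "C1": Counter({"Café": 5, "Bakery": 3, "Supermarket": 2, "Brauhaus": 4}),
    "C2": Counter({"Café": 2, "Pharmacy": 1, "Supermarket": 1}),
    "C3": Counter({"Bakery": 1}),
    "C4": Counter({"Café": 1, "Bakery": 2, "Pharmacy": 1}),
}

berlin_f, cologne_f = filter_cities(berlin, cologne)
print(sorted(berlin_f), sorted(cologne_f))
```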

My goal was to make a map of Cologne suburbs to Berlin suburbs by minimizing their dissimilarity based on the number and type of their venues. I quantified suburb dissimilarity by the **normalized Euclidean distance in feature space**. The normalization consisted of dividing by the number of features entering the analysis, in order to make the dissimilarity measure invariant to changes in this number.
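
A minimal sketch of this dissimilarity measure, assuming that "normalized" means the plain Euclidean distance divided by the number of features (the function name is hypothetical):

```python
import math


def normalized_euclidean(u, v):
    """Euclidean distance between two feature vectors, divided by the feature count."""
    if len(u) != len(v) or not u:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return math.dist(u, v) / len(u)


# Two hypothetical suburbs described by four features each.
d = normalized_euclidean([0, 0, 0, 0], [3, 4, 0, 0])
print(d)  # Euclidean distance 5 over 4 features
```

Identical feature vectors give a dissimilarity of zero, and larger values mean less similar suburbs.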

To make it easy for my friends, I reported the two most similar Berlin suburbs for each Cologne suburb. For each pair I also reported the features in which the two suburbs were most similar and most dissimilar.

##  Results

I identified all [96 of the Berlin suburbs](https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin#Localities) and all [86 of the Cologne suburbs](https://de.wikipedia.org/wiki/Liste_der_Stadtbezirke_und_Stadtteile_K%C3%B6lns) using the **Open Street Maps API**. Below I show snapshots of the city maps indicating the representative node for each suburb.

<img src='images/Berlin_suburbs.png' width='100%' title='Berlin'/>
<img src='images/Cologne_suburbs.png' width='100%' title='Cologne'/>

The next task was to decide on the **Foursquare** query radius (in meters) around the representative points of the suburbs for each city separately. For that, I converted the latitude/longitude coordinates to projected Cartesian coordinates (EPSG:5243) and computed the Euclidean distance of each suburb to its nearest neighbor. I show below a visualization of these nearest-neighbor distances for the suburbs of the two cities; the zoom level may differ between the two maps.

<img src='images/Berlin_nearest-neighbors.png' width='100%' title='Berlin'/>
<img src='images/Cologne_nearest-neighbors.png' width='100%' title='Cologne'/>

The distributions of the nearest-neighbor distances for the two cities are visualized below. In the end, I chose a query radius close to the first quartile of nearest-neighbor distances for each city, namely 1400 meters for Berlin and 1000 meters for Cologne.

<table><tr>
    <td><img src='images/Berlin_histogram_distances.png' width='100%' title='Berlin'/></td>
    <td><img src='images/Cologne_histogram_distances.png' width='100%' title='Cologne'/></td>
</tr></table>

Food, shops and services queries to **Foursquare** were done separately. I consulted the [general categories provided by Foursquare](https://developer.foursquare.com/docs/build-with-foursquare/categories/) and decided to use the following broad categories:

| category | categoryId |
| :--: | :--: |
| food | 4d4b7105d754a06374d81259 |
| shop & service | 4d4b7105d754a06378d81259 |
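
As an illustration of how one such query could be assembled, the sketch below builds a request URL for a single suburb node without sending it. It assumes the legacy Foursquare v2 `venues/search` endpoint with `intent=browse`; the source does not state which endpoint was used, and the credentials, version date, and coordinates are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint (legacy Foursquare v2 API); not confirmed by the notebook.
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

CATEGORY_IDS = {
    "food": "4d4b7105d754a06374d81259",
    "shop & service": "4d4b7105d754a06378d81259",
}


def build_search_url(lat, lon, radius_m, category_id,
                     client_id, client_secret, version="20200101", limit=50):
    """Build a venue search URL around a point, restricted to one category."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # API version date (placeholder)
        "ll": f"{lat},{lon}",            # WGS 84 latitude,longitude
        "intent": "browse",              # search within the given radius
        "radius": radius_m,              # meters
        "categoryId": category_id,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Hypothetical Berlin node with the 1400 m radius used for Berlin.
url = build_search_url(52.52, 13.405, 1400, CATEGORY_IDS["food"],
                       client_id="YOUR_ID", client_secret="YOUR_SECRET")
print(url)
```

One URL per category and suburb would be requested and the returned venues combined per suburb.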

I identified and kept for the analysis only the venue categories common to the two cities. As expected, some remote suburbs of Berlin and Cologne did not turn up many venues. I show below the distribution of the number of venues per suburb for each city. In the end, I decided to remove suburbs in the bottom 25%. I was sure my friends would not be interested in those anyway.

<table><tr>
    <td><img src='images/Berlin_histogram_number_of_venues.png' width='100%' title='Berlin'/></td>
    <td><img src='images/Cologne_histogram_number_of_venues.png' width='100%' title='Cologne'/></td>
</tr></table>

Finally, I pooled the venue frequencies across all suburbs of both cities to get a final estimate of feature frequency in my dataset, and I removed from further analysis the features in the bottom 25% of frequencies.

At this point I was ready for the main analysis. I standardized the features across suburbs so that the most frequent ones, like supermarkets, would not dominate the analysis. Then, for each suburb of Cologne present in the analysis I computed its **normalized Euclidean distance in feature space** to all the Berlin suburbs present in the analysis. This resulted in a **dissimilarity metric** that was invariant to the number of features entering the analysis. Its range could vary. Values **very close to zero** indicated **strongly similar suburbs**. Values in the range of **[0.1, 0.2]** indicated **somewhat similar suburbs** and the rest indicated **dissimilar suburbs**.
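
The standardization step can be sketched as a z-score per feature column. This assumes the population standard deviation and that all suburbs in the analysis are standardized together; the numbers are hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Z-score every column of a list of equal-length rows (one row per suburb)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column carries no information, so it maps to zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical counts: column 0 is a frequent feature, column 1 a rare one.
counts = [[10, 1], [20, 3], [30, 5]]
z = standardize_columns(counts)
print(z)
```

After standardization both columns vary on the same scale, so the frequent feature no longer dominates the distance.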

For each suburb of Cologne I picked the two most similar suburbs of Berlin, i.e. those that had the smallest dissimilarity metric. I also identified the top features where the suburbs were most similar and most dissimilar and included them in the final table map. In the table below, I show the final map of Cologne to Berlin suburbs that this analysis produced.
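
A minimal sketch of this matching step is given below. It assumes that the dissimilarity is the Euclidean distance divided by the number of features and that features are ranked by the absolute difference of their standardized values; the suburb labels and numbers are hypothetical.

```python
import math


def two_closest(cologne_vec, berlin_vecs):
    """Return the two (dissimilarity, name) pairs with the smallest dissimilarity."""
    n = len(cologne_vec)
    scored = sorted(
        (math.dist(cologne_vec, vec) / n, name) for name, vec in berlin_vecs.items()
    )
    return scored[:2]


def rank_features(u, v, names, k=2):
    """Return the k most similar and the k most dissimilar feature names."""
    order = sorted(range(len(names)), key=lambda i: abs(u[i] - v[i]))
    similar = [names[i] for i in order[:k]]
    dissimilar = [names[i] for i in order[::-1][:k]]
    return similar, dissimilar


features = ["Bakery", "Café", "Supermarket", "Wine Shop"]
cologne_suburb = [0.5, -1.0, 0.2, 1.5]          # standardized features
berlin_suburbs = {                               # placeholder suburb labels
    "P": [0.4, -0.8, 1.2, 1.5],
    "Q": [2.0, 1.0, 0.0, -1.0],
    "R": [0.5, -1.0, 0.2, 0.5],
}

best = two_closest(cologne_suburb, berlin_suburbs)
similar, dissimilar = rank_features(cologne_suburb, berlin_suburbs["P"], features)
print(best, similar, dissimilar)
```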

    # Behold the map of Cologne to Berlin suburbs
    pd.set_option('display.max_rows', None)
    final

| Cologne suburb | Berlin suburb | dissimilarity | top similar features | top dissimilar features |
| :--: | :--: | :--: | :--: | :--: |
| Altstadt-Nord | Steglitz | 0.17342 | Boutique, Lebanese Restaurant, Gastropub, Gift Shop, Fast Food Restaurant, French Restaurant | Locksmith, Thai Restaurant, Korean Restaurant, Smoke Shop, Seafood Restaurant, Lottery Retailer |
| Altstadt-Nord | Tiergarten | 0.185759 | Women's Store, Lebanese Restaurant, Sushi Restaurant, Bookstore, Paper / Office Supplies Store, Japanese Restaurant | Adult Boutique, Seafood Restaurant, Music Store, Food Truck, Kebab Restaurant, Persian Restaurant |
| Altstadt-Süd | Charlottenburg | 0.153988 | Mexican Restaurant, Men's Store, Bagel Shop, Jewelry Store, Indian Restaurant, Middle Eastern Restaurant | Halal Restaurant, Laundromat, Doner Restaurant, Cosmetics Shop, Kebab Restaurant, Taverna |
| Altstadt-Süd | Steglitz | 0.167122 | Men's Store, Persian Restaurant, Mexican Restaurant, Vietnamese Restaurant, Bagel Shop, Farmers Market | Irish Pub, Currywurst Joint, Locksmith, Smoke Shop, Sushi Restaurant, Lottery Retailer |
| Bayenthal | Heinersdorf | 0.108275 | Astrologer, Deli / Bodega, Organic Grocery, Drugstore, Market, Cosmetics Shop | Currywurst Joint, Discount Store, Motorcycle Shop, Fried Chicken Joint, Fast Food Restaurant, Health & Beauty Service |
| Bayenthal | Waidmannslust | 0.108364 | Fish Market, Astrologer, Deli / Bodega, Organic Grocery, Market, Cosmetics Shop | Business Service, Medical Supply Store, Health & Beauty Service, Gastropub, Liquor Store, Big Box Store |
| Bickendorf | Weißensee | 0.090279 | Entertainment Service, Market, Rental Service, Wine Shop, Photography Studio, Massage Studio | Film Studio, Diner, Lottery Retailer, Print Shop, Mobile Phone Shop, Bakery |
| Bickendorf | Britz | 0.098458 | Entertainment Service, Astrologer, Home Service, Flower Shop, Rental Service, Wine Shop | IT Services, Garden Center, Pet Store, Chinese Restaurant, Furniture / Home Store, ATM |
| Bilderstöckchen | Biesdorf | 0.048144 | Market, Arts & Crafts Store, Business Service, Pharmacy, Electronics Store, Bakery | Discount Store, Home Service, Construction & Landscaping, Big Box Store, Pet Store, Supermarket |
| Bilderstöckchen | Dahlem | 0.052537 | Arts & Crafts Store, Business Service, Pharmacy, Shipping Store, Shopping Mall, Drugstore | Garden Center, Lawyer, German Restaurant, Asian Restaurant, IT Services, Café |


On the one hand, there were Cologne suburbs very close in *character* to the Berlin ones. On the other hand, there were also suburbs which seemed to be unique in their Cologne *character* and their closest counterpart in Berlin was quite dissimilar. This can also be seen by the distribution of the **dissimilarity** metric shown below.

<img src='images/Final_histogram_distance.png' width='100%' title='Berlin'/>

## Discussion

I found the Cologne to Berlin suburb map very interesting. It turned out that the Cologne city center suburbs were very distant from all choices in Berlin and, vice versa, no Berlin city center suburb was found similar to any of the Cologne suburbs. City centers seem to have a unique, non-transferable *character*. It should be said, though, that the *character* of a suburb cannot be captured only by looking at food venues, shops and services. Nevertheless, this simple first approach to the problem gave my friends some broad-stroke ideas about the Berlin suburbs to get them going and help them narrow down their choices.

On the Data Science side of things, I can see many areas where this analysis could be expanded and improved. First, I would move away from the **Foursquare API** limitations and switch completely to the **Open Street Maps API**, where it is possible to query multipolygon areas for arbitrary features. That way the whole suburb area can be queried for features instead of a fixed radius around a node. Also, there is a wealth of information that one can add to the suburb *character* features that is not even available at **Foursquare**, like number of trees, area of the suburb covered by greenery, public transportation density, etc. Although such an analysis would take me beyond the scope of this project, I still consider it an interesting project to work on in the future.

The question asked by my friends was a very specific one. I was tempted to use machine-learning methods like K-means clustering in order to identify groups of suburbs across the cities. However, such an analysis would not have answered the question that my friends asked. If anything, it would have left them more confused having to choose by themselves among the suburbs that co-clustered with a Cologne suburb. I chose a way of analysis that was tailored to the exact question asked: *Can you make a map of the Cologne to Berlin suburbs for us?*

## Conclusions

A map of the Cologne to Berlin suburbs was made by looking at very basic information like food venues, shops and various services. Suburbs of Cologne were mapped to the two closest suburbs in Berlin, listing also the distance in Euclidean feature space as well as information about the top similar and top dissimilar features involved. The results were quite satisfactory. This analysis can serve as a blueprint for future work that can expand it and improve on the type of features used in order to capture more accurately the *character* of suburbs. 