[< Back to the main notebook](./index.md)


# Detour no.3: Getting the OpenStreetMap data

> This is a rendered version of a Jupyter notebook. The source notebook can be found [in my GitHub repository](https://github.com/barjin/ndbi023-project), along with the data used in this analysis.

In this notebook, I download OpenStreetMap data on different points of interest in the city of Prague. This will help us with the analysis of the city's rental prices - perhaps the proximity to certain points of interest can explain the differences in prices?

OpenStreetMap is a great source of data for this kind of analysis. It is a collaborative project that creates a free editable map of the world. The data is available under the Open Database License (ODbL), which allows us to use it for our analysis.

For querying the map data, we can use the Overpass API. It is a read-only API that uses its own query language (Overpass QL, which reads a bit like pattern matching) to retrieve data from the OpenStreetMap database. The API is free to use, but it limits the number of requests you can make in a given time period.
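Because of these request limits, a query can occasionally be rejected. One way to cope, shown here only as a minimal sketch, is to retry with exponential backoff. The names `retry_with_backoff` and `flaky_query` are hypothetical (they are not part of any Overpass client or of the original notebook), and the rate-limited call is simulated so that the snippet runs offline.

```python
import time


def retry_with_backoff(fn, retriable=(Exception,), attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); after a retriable error wait base_delay * 2**i seconds and try again."""
    for i in range(attempts):
        try:
            return fn()
        except retriable:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)


# Stand-in for a rate-limited request: fails twice, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests (simulated)")
    return "ok"


waits = []  # record the delays instead of actually sleeping
outcome = retry_with_backoff(flaky_query, retriable=(RuntimeError,), sleep=waits.append)
print(outcome, waits)
```

With a real client, `fn` would wrap the actual request, and `retriable` would be narrowed to whatever exception that client raises when it is rate-limited.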

For interfacing our Python code in this notebook with the Overpass API, we can use the `overpy` library. It is a Python wrapper for the Overpass API, which makes it easy to query the OpenStreetMap data from Python code.

Let's start by fetching the data about schools in Prague.

### Schools

The importance of a school in the vicinity of a rental property is quite obvious. Families with children will be more interested in properties that are close to schools. This is especially true for families with younger children, who might not be able to travel long distances to school.

For fetching such institutions, we can query the Overpass API for nodes, ways, and relations with the `amenity=school` tag ([OSM definition here](https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dschool)).
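The queries in this notebook interpolate a Python tuple straight into the query text. The sketch below shows why that works: the string form of a tuple of floats has the same shape as an Overpass bounding-box filter, `(south, west, north, east)`. It only builds and prints the text, so no request is sent.

```python
# Approximate bounding box of Prague as (south, west, north, east)
bounding_box = (49.96, 14.28, 50.17, 14.68)

# The f-string renders the tuple as "(49.96, 14.28, 50.17, 14.68)",
# which lands directly after the tag filter.
query = f"""
[timeout:25];(
  nwr["amenity"="school"]{bounding_box};
);
out body;
"""

print(query)
```

This shortcut relies on the tuple holding exactly four plain numbers in that order; building the filter explicitly would be the safer choice for anything more dynamic.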


In [118]:
import overpy

# Approximate bounding box of Prague as (south, west, north, east)
bounding_box = (49.96, 14.28, 50.17, 14.68)

api = overpy.Overpass()

result = api.query(f"""
[timeout:25];(
  nwr["amenity"="school"]{bounding_box};
);
out body;
>;
out skel qt;
""")

As a result of the collaborative nature of OpenStreetMap, the data might not be complete or up-to-date. On top of this, different schools are tagged differently in the database - while some of them are only `node`s, others are marked as `way`s (a polygon around the perimeter of the school) or `relation`s (a group of nodes and ways that form a school). We will have to handle these different types of data in our analysis.

Since schools are (relatively) small in size, we can approximate all of the schools as points (nodes) in our analysis. This will simplify our analysis and make it easier to compare the schools with other points of interest.

| ![Schools in Prague](./img/osm/overpass_school_types.png) |
|:--:|
| *Different types of schools in the OpenStreetMap data - blue outlines are `way`s, red outlines are `relation`s, and the points are `node`s.* |
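To make the point approximation concrete, here is a small sketch on a made-up outline (the coordinates are hypothetical, not taken from the data). It averages the vertices of a polygon. One caveat is that a closed OSM way lists its first node again at the end, so a plain vertex mean counts that corner twice. For school-sized outlines the shift is small, but it is worth knowing about.

```python
def vertex_mean(coords):
    """Average the (lat, lon) pairs of an outline."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return sum(lats) / len(lats), sum(lons) / len(lons)


# Hypothetical (exaggerated) square outline; the last pair repeats the first one,
# the way a closed OSM way does.
ring = [(50.0, 14.0), (50.0, 14.5), (50.5, 14.5), (50.5, 14.0), (50.0, 14.0)]

with_closing = vertex_mean(ring)          # repeated corner counted twice
without_closing = vertex_mean(ring[:-1])  # every corner counted once
print(with_closing, without_closing)
```

The second result is the true center of the square, while the first one is pulled slightly towards the repeated corner.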

In [109]:
import pandas as pd

def get_way_center(way: overpy.Way):
    nodes = way.get_nodes(resolve_missing=True)
    lats = [node.lat for node in nodes]
    lons = [node.lon for node in nodes]
    return float(sum(lats) / len(lats)), float(sum(lons) / len(lons))

def process_ways(ways):
    l = []

    for way in ways:
        if not way.tags:  # skip ways without any tags
            continue

        lat, lon = get_way_center(way)

        l.append({
            "name": way.tags.get("name") or way.id,
            "lat": lat,
            "lon": lon,
        })

    return pd.DataFrame(l)

def process_relations(relations):
    l = []

    for rel in relations:
        # Skip untagged relations; named ones are kept, unnamed ones fall back to their id
        if not rel.tags:
            continue
        for m in rel.members:
            if m.role == "outer":
                lat, lon = get_way_center(m.resolve(resolve_missing=True))

                l.append({
                    "name": rel.tags.get("name") or rel.id,
                    "lat": lat,
                    "lon": lon,
                })
                break

    return pd.DataFrame(l)

def process_nodes(nodes):
    l = []

    for node in nodes:
        if node.tags.get("name") is None:
            continue

        l.append({
            "name": node.tags.get("name"),
            "lat": node.lat,
            "lon": node.lon,
        })

    return pd.DataFrame(l)

ways_df = process_ways(result.ways)
print("processed ways")
relations_df = process_relations(result.relations)
print("processed relations")
nodes_df = process_nodes(result.nodes)
print("processed nodes")

schools_df = pd.concat([ways_df, relations_df, nodes_df], ignore_index=True)


processed ways
processed relations
processed nodes


In [110]:
schools_df.to_csv("./data/osm/osm_schools.csv", index=False)

### Parks

Just like with schools, we would assume that parks are a desirable feature in the vicinity of a rental property. They provide a green space for recreation and relaxation, and they can also improve the air quality in the area.

For fetching parks, we can query the Overpass API for elements with the `leisure=park` tag ([OSM definition here](https://wiki.openstreetmap.org/wiki/Tag:leisure%3Dpark)); the query below asks only for relations. We add a few more tags to our query to make sure we get all the parks in the city of Prague - some prominent parks (like Obora Hvězda or Prokopské údolí) are tagged differently (`leisure=nature_reserve` or `landuse=forest`).

|![Parks in Prague](./img/osm/overpass_parks.png)|
|:--:|
| Result for the park query from Overpass API. |

Unfortunately, unlike schools, parks are usually quite large, and approximating them with just one point might cause us to lose some information.

Because of this, we retain all the nodes on the outer boundary of each park. This will allow us to calculate the distance from each rental property to the nearest park, which can be a useful feature in our analysis.
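As a rough sketch of how such a distance could be computed later on, the snippet below takes the smallest great-circle (haversine) distance from a property to any boundary node. The function names and the sample coordinates are hypothetical; the rows only mimic the shape used in this notebook (`lat`, `lon`, and the id of the park the node belongs to, `in`).

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_park(lat, lon, park_nodes):
    """Return (distance_m, park_id) for the closest boundary node."""
    return min((haversine_m(lat, lon, n["lat"], n["lon"]), n["in"]) for n in park_nodes)


# Hypothetical boundary nodes of two parks and one hypothetical property.
park_nodes = [
    {"lat": 50.08, "lon": 14.43, "in": 1},
    {"lat": 50.10, "lon": 14.42, "in": 2},
]
distance_m, park_id = nearest_park(50.08, 14.42, park_nodes)
print(round(distance_m), park_id)
```

This is a brute-force scan over all nodes. That is fine for a sketch, but a spatial index would likely be needed for many properties and many nodes.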

In [111]:
result = api.query(f"""
[timeout:25];
(
  relation["leisure"="park"]{bounding_box};
  relation["leisure"="nature_reserve"]{bounding_box};
  relation["landuse"="forest"]{bounding_box};
);
out body;
>;
out skel qt;
""")

In [112]:
parks = []
nodes = []

for counter, rel in enumerate(result.relations, start=1):
    if counter % 50 == 0:
        print("relations:", counter, "/", len(result.relations))
    parks.append({
        "name": rel.tags.get("name") or rel.id,
        "id": rel.id,
    })

    for m in rel.members:
        if m.role == "outer":
            for n in m.resolve().get_nodes():
                nodes.append({
                    "lat": n.lat,
                    "lon": n.lon,
                    "in": rel.id,
                })



relations: 50 / 490
relations: 100 / 490
relations: 150 / 490
relations: 200 / 490
relations: 250 / 490
relations: 300 / 490
relations: 350 / 490
relations: 400 / 490
relations: 450 / 490


In [117]:
parks_df = pd.DataFrame(parks)
nodes_df = pd.DataFrame(nodes)

parks_df.to_csv("./data/osm/osm_parks.csv", index=False)
nodes_df.to_csv("./data/osm/osm_parks_nodes.csv", index=False)

### Euronet ATMs

| ![Euronet ATM](./img/osm/euronet_atm.png) |
|:--:|
| A Euronet ATM product photo. |

Euronet ATMs are a common sight in the city of Prague. They are usually located in high-traffic areas, like shopping malls or tourist attractions - and they are known for their predatory exchange rates and high service fees. While I don't expect their presence itself to have a significant impact on the rental prices in the locality, it could serve as a marker for high-traffic areas in the city - which could be correlated with higher rental prices.

Inspired by [Peter Fabor's article](https://blog.apify.com/google-maps-data-tourism/) on mass tourism analysis in Mallorca, Spain, we can query the OpenStreetMap data for Euronet ATMs in Prague. We can use the `brand=Euronet` and `operator=Euronet` tags to find these ATMs in the city.

| ![Euronet ATMs in Mallorca, Peter Fabor](./img/osm/euronet_fabor_mallorca.png) |
|:--:|
| A heatmap of Euronet ATMs in Mallorca, Spain, by Peter Fabor. See the [original post here](https://www.linkedin.com/posts/fabor_how-to-identify-ugly-mass-tourism-places-activity-7131587606370230272-O5DS/) |

Retrieving this data is fairly easy, as all of the ATMs are marked as `node`s in the OpenStreetMap database. We can skip the `way` and `relation` processing - at least for this analysis.

In [119]:
result = api.query(f"""
[timeout:25];
(
  node["amenity"="atm"]["brand"="Euronet"]{bounding_box};
  node["amenity"="atm"]["operator"="Euronet"]{bounding_box};
);
out body;
>;
out skel qt;
""")

In [124]:
import pandas as pd

atms = pd.DataFrame([{ "lat": node.lat, "lon": node.lon } for node in result.nodes])

In [126]:
atms.to_csv("./data/osm/osm_atms.csv", index=False)

| ![Euronet ATMs in Prague](./img/osm/overpass_atms.png) |
|:--:|
| Result for the Euronet ATM query from Overpass API. Note the high density of ATMs in the city center, along Nerudova Street leading up to Prague Castle, and at the Prague Zoo entrance. |
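One possible way to turn these points into a "high-traffic" marker is to count the ATMs within some radius of a property. The sketch below does that with a flat-earth (equirectangular) approximation, which should be adequate over a few kilometres. The function names, the 500 m default radius, and the coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in metres (short ranges only)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(dx, dy)


def atms_within(lat, lon, atms, radius_m=500):
    """Count the ATM points lying within radius_m of (lat, lon)."""
    return sum(
        1 for a in atms
        if approx_distance_m(lat, lon, a["lat"], a["lon"]) <= radius_m
    )


# Hypothetical ATM locations: two close together, one a few kilometres away.
atms = [
    {"lat": 50.0870, "lon": 14.4210},
    {"lat": 50.0880, "lon": 14.4200},
    {"lat": 50.1170, "lon": 14.4060},
]
nearby = atms_within(50.0875, 14.4205, atms)
print(nearby)
```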

### Highways

As the last feature retrieved from the OpenStreetMap data, we can query the major roads in the city of Prague. In OSM, the `highway` key covers roads of all kinds, so the largest road classes can be a good indicator of the traffic density in the area.
High traffic density can be a nuisance for residents, as it causes noise and air pollution. This can be a factor in the rental prices in the city - especially in Prague, where there is a magistrála (a highway) running straight through the city center.

For fetching the highways, we can query the Overpass API for ways with the `highway` tag ([OSM definition here](https://wiki.openstreetmap.org/wiki/Key:highway)). We filter the highways by type and keep the three largest road classes in the city of Prague: `motorway`, `trunk`, and `primary`.

In [127]:
result = api.query(f"""
    [out:json][timeout:25];
    (
        way["highway"="trunk"]{bounding_box};
        way["highway"="motorway"]{bounding_box};
        way["highway"="primary"]{bounding_box};
    );
    out body;
    >;
    out skel qt;
""")

In [134]:
nodes = []

for way in result.ways:
    for node in way.get_nodes(resolve_missing=True):
        nodes.append({
            "lat": node.lat,
            "lon": node.lon,
            "in": way.id,
            "type": way.tags.get("highway"),
        })

In [135]:
pd.DataFrame(nodes).to_csv("./data/osm/osm_highways_nodes.csv", index=False)

| ![Highways in Prague](./img/osm/overpass_roads.png) |
|:--:|
| Result for the highway query from Overpass API. Note the magistrála running through the city center, and the D1 highway leading out of the city. |
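Road nodes can be far apart on straight stretches, so the distance to the nearest node may overstate how far a property is from the road itself. A hedged sketch of an alternative is to measure the distance to each segment between consecutive nodes of a way, in a local flat projection. The helper names and coordinates are hypothetical, and the sketch assumes that the nodes of a way are kept in their original order.

```python
import math

M_PER_DEG_LAT = 111_320  # rough metres per degree of latitude


def to_local_xy(lat, lon, lat0, lon0):
    """Metres east/north of (lat0, lon0); a flat approximation for city scale."""
    x = (lon - lon0) * M_PER_DEG_LAT * math.cos(math.radians(lat0))
    y = (lat - lat0) * M_PER_DEG_LAT
    return x, y


def distance_to_segment_m(p, a, b):
    """Distance from point p to segment a-b; all three are (lat, lon) pairs."""
    ax, ay = to_local_xy(*a, *p)  # p becomes the origin of the local frame
    bx, by = to_local_xy(*b, *p)
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:  # degenerate segment: both ends coincide
        return math.hypot(ax, ay)
    # Projection parameter of the origin onto a-b, clamped to the segment
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len2))
    return math.hypot(ax + t * dx, ay + t * dy)


def distance_to_way_m(p, way_coords):
    """Smallest distance from p to any segment of an ordered node list."""
    return min(distance_to_segment_m(p, a, b) for a, b in zip(way_coords, way_coords[1:]))


# Hypothetical straight east-west road with only two nodes, and a point next to its middle.
way = [(50.000, 14.40), (50.000, 14.50)]
point = (50.001, 14.45)

to_segment = distance_to_way_m(point, way)
to_node = min(math.hypot(*to_local_xy(*n, *point)) for n in way)
print(round(to_segment), round(to_node))
```

In this toy case the point is roughly a hundred metres from the road but several kilometres from either node, which is the gap the segment-based distance is meant to close.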

This concludes the data fetching from the OpenStreetMap database. All that is left is to use these datasets to create new features for our analysis.
