<h1>Explore additional info present in tags on OpenStreetMap toilets</h1>

On top of the amenity=toilets tag identifying a public toilet, users can add extra tags to give more information about the toilet (https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dtoilets).

First I'm going to load the data. I'm using a nested dict: tag > value > value count.

In [1]:
from collections import defaultdict
import json

possible_tags = defaultdict(lambda: defaultdict(lambda: 0))
toilet_count = 0

with open("eu_toilets.json", "r", encoding="utf-16-le") as source:
    for line in source:
        entry = json.loads(line.replace("\ufeff", ""))
        toilet_count += 1

        for tag in entry["tags"].keys():
            possible_tags[tag][entry["tags"][tag]] += 1

# Sanity check on extraction quality, should be true since that's the target tag value
print(toilet_count == possible_tags["amenity"]["toilets"])

True
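
To make the tag > value > value count structure concrete, here is a minimal sketch on three made-up entries. The sample data and the names `sample_entries`, `sample_tags` and `sample_plain` are hypothetical and only mimic the shape of the lines in the source file.

```python
from collections import defaultdict

# Hypothetical sample entries, shaped like the lines of the source file
sample_entries = [
    {"tags": {"amenity": "toilets", "fee": "no"}},
    {"tags": {"amenity": "toilets", "fee": "yes", "wheelchair": "yes"}},
    {"tags": {"amenity": "toilets", "fee": "no"}},
]

# Outer key: tag name, inner key: tag value, inner value: number of uses
sample_tags = defaultdict(lambda: defaultdict(int))
for entry in sample_entries:
    for tag, value in entry["tags"].items():
        sample_tags[tag][value] += 1

# Convert to plain dicts so the structure prints cleanly
sample_plain = {tag: dict(values) for tag, values in sample_tags.items()}
print(sample_plain)
```

Each tag key maps to its own small counter, so a value that has never been seen starts at 0 without any explicit initialisation.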


<h2>Let's take a look at the most common tags in our data</h2>
Collapsing the nested dict into a flat dict: tag > tag count

In [2]:
from dataclasses import dataclass

@dataclass
class Tag:
    name: str
    count: int


# Total number of uses per tag key, whatever the value
tag_count = {tag: sum(values.values()) for tag, values in possible_tags.items()}
tag_count_list = [Tag(name=name, count=count) for name, count in tag_count.items()]
tag_count_list.sort(key=lambda x: x.count, reverse=True)
tag_count_list[:10]

[Tag(name='amenity', count=153378),
 Tag(name='fee', count=59543),
 Tag(name='wheelchair', count=52399),
 Tag(name='access', count=49521),
 Tag(name='toilets:disposal', count=30904),
 Tag(name='building', count=30338),
 Tag(name='unisex', count=23937),
 Tag(name='changing_table', count=19310),
 Tag(name='male', count=14016),
 Tag(name='female', count=13704)]

The most used tags seem related to toilets, but unfortunately no additional tag is present on even half of the toilets: the most common one, fee, appears on about 39% of them. Since the data is crowdsourced, it's possible that users do not add them often. Another explanation is that for more complex toilets represented as ways or relations the info is present but was not extracted. This info may be difficult to recover since the node containing it may not be directly linked to the toilet way/relation.
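
As a quick check of how sparse the extra tags are, the share of toilets carrying each tag can be computed from the counts printed above. This is a minimal sketch: the numbers are copied by hand from that output, and `toilet_total`, `top_tag_counts` and `coverage` are illustrative names. In the notebook itself the same values would come from `toilet_count` and `tag_count`.

```python
# Counts copied from the output above; amenity=toilets is present on every entry
toilet_total = 153378
top_tag_counts = {
    "fee": 59543,
    "wheelchair": 52399,
    "access": 49521,
    "toilets:disposal": 30904,
}

# Share of toilets carrying each tag
coverage = {name: count / toilet_total for name, count in top_tag_counts.items()}

for name, share in coverage.items():
    print(f"{name}: {share:.1%}")
```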

In [3]:
# useful tags according to openstreetmap wiki

wiki_tags = [
    "fee",
    "opening_hours",
    "wheelchair",
    "changing_table",
    "toilets:disposal",
    "toilets:position",
    "access",
    "description",
    "drinking_water",
    "indoor",
    "name",
    "operator",
    "supervised",
    "female",
    "male",
    "unisex",
    "child",
    "gender_segregated",
]

# .get(..., 0) avoids a KeyError if a wiki tag never appears in the data
wiki_tag_counts = [Tag(name=name, count=tag_count.get(name, 0)) for name in wiki_tags]
wiki_tag_counts.sort(key=lambda x: x.count, reverse=True)
wiki_tag_counts[:10]

[Tag(name='fee', count=59543),
 Tag(name='wheelchair', count=52399),
 Tag(name='access', count=49521),
 Tag(name='toilets:disposal', count=30904),
 Tag(name='unisex', count=23937),
 Tag(name='changing_table', count=19310),
 Tag(name='male', count=14016),
 Tag(name='female', count=13704),
 Tag(name='operator', count=11764),
 Tag(name='opening_hours', count=9913)]

No surprise here: apart from amenity and building, the most used tag keys are the ones described on the wiki.

<h2>Exploring tag values</h2>

Let's see what's actually in the tag values. I'm going to focus on tags used on at least 25% of toilets, which means the top 3 most used tags after amenity: fee, wheelchair and access.

<h3>Fee: is the toilet free?</h3>

In [4]:
len(possible_tags["fee"].keys())

168

That's a lot of values. There is no constraint, so users can enter whatever they want, and there are likely invalid or non-standard values in there. I'll first filter for values used more than 5 times.

In [5]:
[(x, possible_tags["fee"][x]) for x in possible_tags["fee"].keys() if possible_tags["fee"][x] > 5]

[('no', 48639),
 ('20p', 7),
 ('yes', 10544),
 ('0.50€', 21),
 ('0.50 EUR', 8),
 ('€0.50', 9),
 ('0.50', 6),
 ('0.5€', 7),
 ('0,50€', 6),
 ('2 PLN', 8),
 ('0.5', 14),
 ('donation', 30)]

It looks like some users used this field to record the fee amount instead of a plain yes/no.

In [6]:
possible_tags["fee"]["donation"]

30

The donation value is interesting, but I'm not keeping it for just 30 uses out of roughly 153,000 toilets.

In [7]:
[(x, possible_tags["fee"][x]) for x in possible_tags["fee"].keys() if possible_tags["fee"][x] > 50]

[('no', 48639), ('yes', 10544)]

Only yes and no have significant usage, so I'm keeping them as the valid values. This matches the recommended tagging policy on the wiki.
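
One possible way to apply that decision when cleaning entries is sketched below. `VALID_FEE_VALUES` and `clean_fee` are hypothetical names, and the whitespace and case normalisation is an extra assumption that is not part of the analysis above.

```python
from typing import Optional

# Assumption: only these two values are kept, everything else is discarded
VALID_FEE_VALUES = {"yes", "no"}


def clean_fee(raw: Optional[str]) -> Optional[str]:
    """Return 'yes' or 'no', or None when the value is missing or non-standard."""
    if raw is None:
        return None
    # Assumption: surrounding whitespace and letter case are not meaningful
    value = raw.strip().lower()
    return value if value in VALID_FEE_VALUES else None


print(clean_fee("no"))
print(clean_fee("0.50 EUR"))
```

Returning None for amounts such as "0.50 EUR" drops the information that a fee exists. Mapping any price-like value to "yes" would be an alternative, but that goes beyond what was decided here.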

<h3>Wheelchair: is it wheelchair accessible?</h3>

In [8]:
len(possible_tags["wheelchair"].keys())

19

In [9]:
[(x, possible_tags["wheelchair"][x]) for x in possible_tags["wheelchair"].keys() if possible_tags["wheelchair"][x] > 50]

[('yes', 33163), ('no', 14845), ('designated', 704), ('limited', 3633)]

The wiki recommends only yes/no/limited, but designated probably refers to toilets explicitly designed for wheelchair users and has some usage, so I'm keeping it.

<h3>Access: who can access it?</h3>

In [10]:
len(possible_tags["access"].keys())

44

In [11]:
[(x, possible_tags["access"][x]) for x in possible_tags["access"].keys() if possible_tags["access"][x] > 50]

[('yes', 31563),
 ('customers', 12458),
 ('key', 51),
 ('no', 435),
 ('permissive', 2106),
 ('private', 2082),
 ('permit', 286),
 ('destination', 56),
 ('public', 337)]

Now this one is problematic. I'm targeting public toilets, but this tag reveals that some are actually not public. I'm going to have to filter out toilets where the access value is not yes or public.

In [12]:
n_not_public = sum(possible_tags["access"][tag_value] for tag_value in possible_tags["access"].keys() if tag_value not in ("yes", "public"))
n_not_public

17621

In [13]:
n_not_public/toilet_count

0.11488609839742336

It looks like roughly 11.5% of the toilets in the data are not public. I need to filter them out before creating the Elasticsearch index.
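
A minimal sketch of such a filter is shown below, under the assumption that toilets without any access tag are treated as public. That is consistent with how n_not_public was counted, since only toilets with an explicit non-public access value were included. `PUBLIC_ACCESS_VALUES`, `is_public` and the sample entries are hypothetical.

```python
# Access values considered public, as decided above
PUBLIC_ACCESS_VALUES = {"yes", "public"}


def is_public(entry: dict) -> bool:
    """Keep toilets with no access tag, or with access set to yes/public."""
    access = entry.get("tags", {}).get("access")
    # Assumption: a missing access tag means the toilet is public
    return access is None or access in PUBLIC_ACCESS_VALUES


# Hypothetical entries, shaped like the lines of the source file
sample_toilets = [
    {"tags": {"amenity": "toilets", "access": "yes"}},
    {"tags": {"amenity": "toilets", "access": "customers"}},
    {"tags": {"amenity": "toilets"}},
    {"tags": {"amenity": "toilets", "access": "private"}},
]

public_toilets = [entry for entry in sample_toilets if is_public(entry)]
print(len(public_toilets))
```

The same predicate could be applied line by line while reading the source file, so non-public toilets never reach the indexing step.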