In [1]:
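import json
import os
import urllib.request  # `import urllib` alone is not guaranteed to load the `request` submodule

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()  # loads TOKEN (the Mapillary access token) from a local .env file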

# The Data
To classify whether an image contains tactile paving, we need labeled images. Luckily, crowd-sourced data provides exactly this. On OpenStreetMap (OSM), contributors can place "nodes" (points) and tag whether tactile paving is present there. They can also link a node to Mapillary, a platform that hosts street-level images and places them on a map. Combining the two, we can get images labeled "yes" or "no" for tactile paving.

First, I used [overpass turbo](https://overpass-turbo.eu/#), a web front end for the Overpass API, to filter for nodes that have both a Mapillary image and a tactile paving tag. I exported these nodes as a GeoJSON file and extracted the Mapillary key from each one. These keys are the original unique image IDs. When Meta bought Mapillary, the image IDs were changed, so an OSM Mapillary link now redirects to a URL that contains a new unique image ID. I used `urllib` to follow the redirects and collect the image IDs that the Mapillary API accepts. With these image IDs, I queried the Mapillary API and extracted the URL for downloading each 256px image. I then used `requests` to download the images and split them into train, validation, and test directories. I repeated this for a similar number of images that did not contain tactile paving. In total, roughly 12,000 images were downloaded, about half labeled as tactile paving and half as no tactile paving.

OSM query -> redirect url -> Mapillary API query -> image download

All images were downloaded from [Mapillary](https://www.mapillary.com/). Mapillary does not endorse me or the use of the images in this project.

Let's get to it.

## Tactile Paving Images

First, export the query results (see query.txt) from overpass turbo. The export includes every node that has `tactile_paving=yes` and a `mapillary` tag.
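The contents of `query.txt` are not reproduced here. As a rough sketch of what such a query could look like in Overpass QL, the hypothetical helper `build_overpass_query` below assembles a query for nodes that carry both a given `tactile_paving` value and a `mapillary` tag. The helper name, the timeout, and the exact filter are assumptions for illustration, not the query actually used.

```python
def build_overpass_query(tactile_value, timeout_s=60):
    """Build an Overpass QL query string (hypothetical; not the contents of query.txt)."""
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        # Keep nodes with the requested tactile_paving value AND any mapillary tag.
        f'node["tactile_paving"="{tactile_value}"]["mapillary"];\n'
        "out body;"
    )


print(build_overpass_query("yes"))
```

Without a bounding box or area filter, a query like this matches nodes worldwide, so a real query may well restrict the search region.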

In [2]:
with open("export-tp.geojson") as f:
    osm_data = json.load(f)

In [3]:
mapillary_pkeys = []

for node in osm_data["features"]:
    mapillary_pkeys.append(node["properties"]["mapillary"])

Now we convert each key to a Mapillary URL. That URL redirects to one that has a different key (probably from when Facebook bought the company). We need *that* image ID to query the Mapillary API.

In [5]:
for i in range(len(mapillary_pkeys)):
    mapillary_pkeys[i] = "https://www.mapillary.com/map/im/" + mapillary_pkeys[i]

In [13]:
def resolve(url):
    return urllib.request.urlopen(url).geturl() # gets redirect url

In [14]:
resolve(mapillary_pkeys[0])

'https://www.mapillary.com/app?pKey=164164019045134'
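Resolving several thousand URLs one by one means a single network error would stop the whole loop. A minimal sketch of a more forgiving variant follows, assuming that skipping a failed URL is acceptable. `resolve_or_none` and its `opener` parameter are hypothetical names; `opener` exists only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def resolve_or_none(url, opener=urllib.request.urlopen, timeout=10):
    """Return the final URL after redirects, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.geturl()
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return None
```

With this variant, the calling loop could store `None` for failed URLs and filter them out before the ID extraction step.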

In [16]:
new_urls = []
i = 0

for url in mapillary_pkeys:

    new_urls.append(resolve(url))

    i += 1
    if i % 100 == 0: # save progress every 100 urls
        print(i)
        with open("urls.txt", "w") as f:
            f.writelines("%s\n" % url for url in new_urls)

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
5000
5100
5200
5300
5400
5500
5600
5700
5800
5900
6000
6100
6200


In [17]:
len(new_urls)

6203

In [23]:
with open("urls.txt", "w") as f:
    f.writelines("%s\n" % url for url in new_urls)

The resolved URLs have the structure `https://www.mapillary.com/app/?pKey=3890424561071633`, and we need the portion after the equals sign. This is the image ID that the API expects.

In [18]:
image_ids = []

for url in new_urls:
    image_ids.append(url.split("=", 1)[-1])
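Splitting on `=` works when `pKey` is the only query parameter. As an alternative sketch, the standard library's `urllib.parse` can read the parameter by name, which also copes with URLs that carry no `pKey` at all. `extract_pkey` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlparse


def extract_pkey(url):
    """Return the value of the pKey query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("pKey")
    return values[0] if values else None


print(extract_pkey("https://www.mapillary.com/app/?pKey=3890424561071633"))
# 3890424561071633
print(extract_pkey("https://www.mapillary.com/app"))
# None
```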

Let's push these ids through the API.

In [19]:
d = {
    "thumb_256_url": [],
    "id": []
}
i = 0  # counts successful API responses

TOKEN = os.getenv("TOKEN")  # Mapillary access token from the .env file

for image_id in image_ids:  # `image_id` avoids shadowing the built-in `id`
    url = f"https://graph.mapillary.com/{image_id}?access_token={TOKEN}&fields=thumb_256_url" # can also use 1024 or 2048 px

    response = requests.get(url)

    if response.status_code != 200:
        print(image_id)  # request failed: print the ID so it can be inspected later
    else:
        rJson = response.json()

        d["thumb_256_url"].append(rJson["thumb_256_url"])
        d["id"].append(rJson["id"])

        i += 1
        if i % 100 == 0: # save every 100
            print(i)
            with open("api-urls.txt", "w") as f:
                f.writelines("%s\n" % url for url in d["thumb_256_url"])


100
O8LGVyzT9a76e4QaRBTlbQ
200
300
R-I7E4DMevSj6P0fZD2ExQ
400
500
600
700
KKmc1rB3-ZNWmW7uEd0wwg
Ydr6ZIgBZrcRtN3-6jYdwA
sbOETBVsFYDO39srP76uFg
MFg6dBGJeZxXJV8pIXKxBg
350mVLzDAYpurfujJuziJA
800
W6a8rLLooGT9Xa4ffDuK3g
bmzhKkOYnMSyMPpwn4q7KA
900
1000
ImRTdJ7tn4Y55sa1IWQBVQ
1100
1200
Q_Uv1qxO6TrnIPuKPtVD2w
1300
zl1fyilM66NmOc77uHR8XA
1400
prRfLARvAdQ57uQxuvz35Q
dUzXVQn7QKWfQI9t4sh2Wg
4FWceGJJaA51kTR8mZS8NQ
h5ay5jzmdRmYD7kK-jAw0g
R-I7E4DMevSj6P0fZD2ExQ
RAgrKCZsKwNLDJkmCA3EuQ
uoLKRwU8B95QGLqamLki9A
1500
HSRnlIgOSWLX3Bz2vanNzQ
ADoYEmF4dfyxnXcHfq-Paw
P6b8g3HZUNORJY7dd7w99g
1600
C8ejC9u-5ydVDinzttpfBA
KxTimNnMQYMMGvQMT5tTlw
1700
kJlQkmH4GQxwblcaIo94Hg
WsuFGt31zfLltN548v1V0A
lh4rvo6_15M9d-Avk95DTg
1800
1900
tX19XtBQITTEy-8L-4W1bA
2000
2100
yNn8VsAJEMdIAjHWx-X9gA
t59cOniN6Gev7zoA74PmPC
2200
7ZVWQk3X4uGq2QCvRFPo3A
wN2IMdUow4EOoY05ZOz7oA
dnzLOO4ErueLA0Y0sS1gRA
2300
6ba_vmD2_MV5j1Y_hEKz1A
2400
1HpiiXvYZUDZ6YbqDhVQOA
Y8MU8LP_--VOMCJNf5-vqg
5lrftrqZgqSG3W2tVe8uWg
2500
Hm9S2hs0XMW-0kaYt0CaVQ
1XkKYU2iDH

In [20]:
print(f"{i} images available")

6110 images available
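The loop above prints failed IDs but does not keep them. A minimal sketch of collecting successes and failures separately follows, so the failures can be retried or inspected later. The names `collect_thumb_urls` and `fetch_from_mapillary` are hypothetical, and the `fetch` parameter is an assumption added so the logic can be exercised without calling the API.

```python
def collect_thumb_urls(image_ids, fetch):
    """Split IDs into found thumbnails and failures.

    `fetch` is any callable that takes an image ID and returns
    (status_code, parsed_json); it stands in for the real HTTP request.
    """
    found, failed = {}, []
    for image_id in image_ids:
        status, payload = fetch(image_id)
        if status == 200 and "thumb_256_url" in payload:
            found[image_id] = payload["thumb_256_url"]
        else:
            failed.append(image_id)
    return found, failed


def fetch_from_mapillary(image_id, token):
    """Real fetcher mirroring the request in the notebook (not called in the demo)."""
    import requests  # imported lazily so the demo below needs no network or extra packages

    response = requests.get(
        f"https://graph.mapillary.com/{image_id}",
        params={"access_token": token, "fields": "thumb_256_url"},
        timeout=30,
    )
    payload = response.json() if response.status_code == 200 else {}
    return response.status_code, payload


# Offline demo with canned responses instead of the API.
canned = {"1": (200, {"thumb_256_url": "https://example.com/1.jpg", "id": "1"})}
found, failed = collect_thumb_urls(["1", "2"], lambda image_id: canned.get(image_id, (404, {})))
print(found)   # {'1': 'https://example.com/1.jpg'}
print(failed)  # ['2']
```

In the notebook, the call would look like `collect_thumb_urls(image_ids, lambda image_id: fetch_from_mapillary(image_id, TOKEN))`.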


Nice! Now we have the download url for each image. Let's download them.

In [24]:
for i in range(len(d["id"])):

    img = requests.get(d["thumb_256_url"][i]).content  # raw image bytes

    # Split by position: 0-4000 train, 4001-5399 validation, 5400 and up test.
    # The target directories must already exist.
    if i <= 4000:
        with open(f"split-images/train/tp/tp.{i}.jpg", "wb") as f:
            f.write(img)
    elif i < 5400:  # i > 4000 is implied by the branch above
        with open(f"split-images/validation/tp/tp.{i}.jpg","wb") as f:
            f.write(img)
    else:
        with open(f"split-images/test/tp/tp.{i}.jpg", "wb") as f:
            f.write(img)

This results in a roughly 65/23/12 train/validation/test split. Let's do the same with images without tactile paving.
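The split percentages can be checked by replaying the index thresholds from the download loop over the 6110 available images. `split_for` is a hypothetical helper that only mirrors those thresholds.

```python
from collections import Counter


def split_for(i):
    """Mirror the index thresholds used in the download loop."""
    if i <= 4000:
        return "train"
    if i < 5400:
        return "validation"
    return "test"


n_images = 6110  # count reported by the API step
counts = Counter(split_for(i) for i in range(n_images))
for name in ("train", "validation", "test"):
    print(name, counts[name], f"{counts[name] / n_images:.1%}")
# train 4001 65.5%
# validation 1399 22.9%
# test 710 11.6%
```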

## No Tactile Paving

In [25]:
with open("export-notp.geojson") as f:
    osm_data = json.load(f)

In [26]:
mapillary_pkeys = []

for node in osm_data["features"]:
    mapillary_pkeys.append(node["properties"]["mapillary"])

In [27]:
for i in range(len(mapillary_pkeys)):
    mapillary_pkeys[i] = "https://www.mapillary.com/map/im/" + mapillary_pkeys[i]

In [28]:
new_urls = []
i = 0

for url in mapillary_pkeys:

    new_urls.append(resolve(url))

    i += 1
    if i % 100 == 0: # save progress every 100 urls
        print(i)
        with open("urls-notp.txt", "w") as f:
            f.writelines("%s\n" % url for url in new_urls)

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
5000
5100
5200
5300
5400
5500
5600
5700
5800
5900
6000
6100
6200


In [29]:
image_ids = []

for url in new_urls:
    image_ids.append(url.split("=", 1)[-1])

In [30]:
d = {
    "thumb_256_url": [],
    "id": []
}
i = 0  # counts successful API responses

for image_id in image_ids:  # `image_id` avoids shadowing the built-in `id`
    url = f"https://graph.mapillary.com/{image_id}?access_token={TOKEN}&fields=thumb_256_url" # can also use 1024 or 2048 px

    response = requests.get(url)

    if response.status_code != 200:
        print(image_id)  # request failed: print the ID so it can be inspected later
    else:
        rJson = response.json()

        d["thumb_256_url"].append(rJson["thumb_256_url"])
        d["id"].append(rJson["id"])

        i += 1
        if i % 100 == 0: # save every 100
            print(i)
            with open("api-urls-notp.txt", "w") as f:
                f.writelines("%s\n" % url for url in d["thumb_256_url"])

https%3A
100
200
Ik-9uCYHxhH_kFNxH3QAeQ
http%3A
http%3A
http%3A
http%3A
q6e7OdZioE1aYEJlP5e2fQ
7a15XZaXaoya_w8mXaPyUQ
http%3A
http%3A
http%3A
http%3A
http%3A
300
400
http%3A
500
600
700
ZSUKnVB3R7GUgYCyUmKKvA
800
AVLVfDxVoqWoXfXsqZ44VQ
082xOQ3C5RIkZoiwSYhG3g
z3fDvSHGYXVVEwKXOrClAg
gd7fOpDcfZriW4MZd_0OZw
-jsVrdipzPIuixwj5rYs8Q
900
1000
BKALaNoH-5j8Oi4ivPJ8nQ
3n75O4Hfk84OcsuNEkSf0Q
1100
1200
1300
Mqu9-l3Lzmt0mImWL1tnsw
Izy0-lkkbxZ4IZnyuM7oRA
1400
dvx6LXLe5JDBGPBzHcTkYg
1500
SxuxFi12PP7ryXcVTwJyWw
twqj9ayAOqn5JXG-3dRd8g
UCsQ_LM1jP5tsqrHr9OEaQ
1W8R3VTdhuZNgSliJuoyow
1600
roOITnfJZPGcp4ds4lm7HA
1700
TbwMjgED_uygmjmhmUIX4A%3B-XP1sfVYDICN8nsbZ6YMIw
1800
Ge4YkkghNaVzPf1MzBiWdA%3Bh4Bs4aimMbNASEjeIIbVYg
zriQ4E5wCxZrE3sBD62hwA
1900
cDeAPJtpWfG-uK-p13kfaw
2000
0oqfzUtod1ctG6ZYdwSL_A
ig6in9phf5m7wy4q35lyxr
2100
2200
53fbckCzece6n9meXUA6ZQ
UzmyM11q4K0M92IWgkTEeg
jZsqoCOjVxETJ0pW3cqHAA
Vx5x8GXVuefxpq3xwgnFAg
RjpHDmExkjc3R73RnuBSfQ
AIw2vu5QUFzM0YLjXm8o8w
2300
HR9RpuiaquaorP6NGJ1PhA
iQKI0iYf95sq09us-4g

In [31]:
print(f"{i} available images")

6096 available images
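Several IDs printed as failures in the output above contain `%3B`, the URL-encoded form of `;`, or begin with `http%3A`, the URL-encoded form of `http:`. This suggests, though the notebook does not confirm it, that some `mapillary` tags hold several keys separated by semicolons, or a full URL instead of a bare key. A minimal sketch of cleaning tag values before building URLs, under that assumption, could look like this. `split_mapillary_tag` is a hypothetical helper, and the key pattern is a guess based on the keys visible in the output.

```python
import re

# Assumption: a bare Mapillary key consists only of letters, digits, "_" and "-".
_KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def split_mapillary_tag(value):
    """Split a tag value on ';' and keep only the parts that look like bare keys."""
    parts = (part.strip() for part in value.split(";"))
    return [part for part in parts if _KEY_PATTERN.fullmatch(part)]


print(split_mapillary_tag("TbwMjgED_uygmjmhmUIX4A;-XP1sfVYDICN8nsbZ6YMIw"))
# ['TbwMjgED_uygmjmhmUIX4A', '-XP1sfVYDICN8nsbZ6YMIw']
print(split_mapillary_tag("http://example.com/photo"))
# []
```

Values that are full URLs are simply dropped in this sketch; extracting a key from them would require knowing their exact format.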


In [32]:
for i in range(len(d["id"])):

    img = requests.get(d["thumb_256_url"][i]).content  # raw image bytes

    # Split by position: 0-4000 train, 4001-5399 validation, 5400 and up test.
    # The target directories must already exist.
    if i <= 4000:
        with open(f"split-images/train/notp/notp.{i}.jpg", "wb") as f:
            f.write(img)
    elif i < 5400:  # i > 4000 is implied by the branch above
        with open(f"split-images/validation/notp/notp.{i}.jpg","wb") as f:
            f.write(img)
    else:
        with open(f"split-images/test/notp/notp.{i}.jpg", "wb") as f:
            f.write(img)
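Both download loops assume that the `split-images/<split>/<label>/` folders already exist, because `open(..., "wb")` does not create missing directories. A small sketch of a helper that creates them on demand follows. `save_image` is a hypothetical name, and the demo writes placeholder bytes to a temporary folder rather than real image data.

```python
import tempfile
from pathlib import Path


def save_image(data, root, split, label, index):
    """Write bytes to root/split/label/label.index.jpg, creating folders as needed."""
    target_dir = Path(root) / split / label
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{label}.{index}.jpg"
    target.write_bytes(data)
    return target


# Demo in a throwaway directory with placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = save_image(b"placeholder bytes", tmp, "train", "notp", 0)
    print(path.name)  # notp.0.jpg
```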

And there it is! After hours of scraping, calling, and downloading, all the photos are ready.