# Create class labels for PlacesAudio from Places205 image paths

The audio-visual embedding models of Harwath et al.

- [NIPS 2016 model](https://papers.nips.cc/paper/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf)
- [ACL 2017 model](https://arxiv.org/pdf/1701.07481.pdf)
- [DAVEnet model](https://github.com/dharwath/DAVEnet-pytorch)

are difficult to train from scratch. An initial warm-up phase could help, but the audio data has no labels. However, the images paired with the audio captions are organized into classes that are visible in their paths (e.g. `c/cottage_garden/gsun_c43911d6f8ff4efb5e99dc6ac7e47a8e.jpg`). The aim of this notebook is to extract the classes from the image paths and use them as classification labels for the audio. Each audio caption has exactly one corresponding image and thus one label.
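As a minimal sketch of the idea, the snippet below applies the regular expression used in this notebook to the example path above. The helper name `class_from_path` is hypothetical and only serves as illustration. Because the capture group is greedy, a path with a nested class directory would yield the full nested name; whether such paths occur in the data is an assumption here.

```python
import re

# Pattern used in this notebook: an optional single-letter directory,
# then the class name, then a file name that starts with "gsun".
CLASS_PATTERN = re.compile(r"[a-z]?/(.+)/gsun")


def class_from_path(image_path):
    """Return the class part of an image path, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = CLASS_PATTERN.search(image_path)
    return match.group(1) if match else None


example = "c/cottage_garden/gsun_c43911d6f8ff4efb5e99dc6ac7e47a8e.jpg"
print(class_from_path(example))  # cottage_garden
```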

First we process the data used in the NIPS 2016 paper. The image/caption pairs and paths are listed in `nips_train.json`.

In [1]:
import json
import re           # Regexps
import warnings     # Place warnings if any anomalies encountered

data_path = "/teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro"
train_json = "metadata/nips_train.json"
classes_file = "metadata/Places205_classes.txt"

with open(data_path + "/" + train_json) as f:
    data_train = json.load(f)
    
with open(data_path + "/" + classes_file) as f:
    classes = f.read().splitlines()
    
for i in range(len(data_train["data"])):
    # Replace key "image" with "label"
    data_train["data"][i]["label"] = data_train["data"][i].pop("image")
    # Find the class from the image path using regexp
    match = re.search(r'[a-z]?/(.+)/gsun', data_train["data"][i]["label"])
    if match:
        if match.group(1) in classes:
            # Use an index number instead of the word label
            data_train["data"][i]["label"] = classes.index(match.group(1))
        else:
            warnings.warn("Did not find label '{}' among Places205 classes.".format(match.group(1)))
    else:
        warnings.warn("Matching regexp to '{}' failed".format(data_train["data"][i]["label"]))
        
output_train = "metadata/nips_classification_train.json"
with open(data_path + "/" + output_train, 'w') as f:
    json.dump(data_train, f, indent=4)

Do the same for the validation dataset.

In [2]:
val_json = "metadata/val.json"

with open(data_path + "/" + val_json) as f:
    data_val = json.load(f)

for i in range(len(data_val["data"])):
    # Replace key "image" with "label"
    data_val["data"][i]["label"] = data_val["data"][i].pop("image")
    # Find the class from the image path using regexp
    match = re.search(r'[a-z]?/(.+)/gsun', data_val["data"][i]["label"])
    if match:
        if match.group(1) in classes:
            # Use an index number instead of the word label
            data_val["data"][i]["label"] = classes.index(match.group(1))
        else:
            warnings.warn("Did not find label '{}' among Places205 classes.".format(match.group(1)))
    else:
        warnings.warn("Matching regexp to '{}' failed".format(data_val["data"][i]["label"]))
        
output_val = "metadata/classification_val.json"
with open(data_path + "/" + output_val, 'w') as f:
    json.dump(data_val, f, indent=4)
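
The two cells above repeat the same relabeling loop. One possible way to share it is sketched below. The function name `add_labels`, the toy class list, and the `"wav"` key in the toy entry are hypothetical and used only for illustration. The sketch also assumes that class names in the class list are unique, since it replaces the repeated `classes.index` lookup with a dictionary.

```python
import re
import warnings

CLASS_PATTERN = re.compile(r"[a-z]?/(.+)/gsun")


def add_labels(entries, classes):
    """Replace each entry's "image" key with a "label" key, in place.

    The label is the index of the class in `classes`. If the path does not
    match or the class is unknown, the image path is kept as the label and
    a warning is issued, mirroring the cells above.
    """
    # Assumes unique class names, so each name maps to exactly one index
    class_to_index = {name: idx for idx, name in enumerate(classes)}
    for entry in entries:
        path = entry.pop("image")
        entry["label"] = path
        match = CLASS_PATTERN.search(path)
        if match is None:
            warnings.warn("Matching regexp to '{}' failed".format(path))
        elif match.group(1) not in class_to_index:
            warnings.warn(
                "Did not find label '{}' among Places205 classes.".format(match.group(1))
            )
        else:
            entry["label"] = class_to_index[match.group(1)]
    return entries


# Toy data for illustration only; not taken from the real metadata files
toy_classes = ["abbey", "cottage_garden"]
toy_entries = [
    {
        "wav": "wavs/example.wav",  # hypothetical key
        "image": "c/cottage_garden/gsun_c43911d6f8ff4efb5e99dc6ac7e47a8e.jpg",
    }
]
add_labels(toy_entries, toy_classes)
print(toy_entries[0]["label"])  # 1
```

With such a helper, each cell would reduce to loading the JSON file, calling `add_labels(data["data"], classes)`, and dumping the result.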