# Metadata file handling for PlacesAudio and Places205

The following audio-visual embedding models by Harwath et al. use PlacesAudio:

- [NIPS 2016 model](https://papers.nips.cc/paper/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf)
- [ACL 2017 model](https://arxiv.org/pdf/1701.07481.pdf)
- [DAVEnet model](https://github.com/dharwath/DAVEnet-pytorch)

In this notebook, we process the metadata files of PlacesAudio in two ways:
 1. Create small samples of our own for local test runs on a desktop machine.
 2. Create .json files from the lists provided with PlacesAudio (under `metadata/lists`). The lists define the subsets of the full PlacesAudio dataset that were used in the NIPS and ACL papers.

First we make the necessary imports and define some utility functions. The utility functions assume the following JSON file structure:

    {
        "image_base_path": "/path/to/images/",
        "audio_base_path": "/path/to/audio/",
        "data": [
            {
                "uttid": "A1A6D2RDPGVX5F-GSUN_C4E9B966E3F4AF2A83AF01C8ACFB47BB",
                "speaker": "A1A6D2RDPGVX5F",
                "asr_text": "a wooden table with a lobster in the center and plates are around the tape",
                "wav": "wavs/13/utterance_374111.wav",
                "image": "r/restaurant/gsun_c4e9b966e3f4af2a83af01c8acfb47bb.jpg"
            },
            ...
            {
                "uttid": "A13G469LJFEIYZ-GSUN_48C633C668C194469102D3B8E0BDE81C",
                "speaker": "A13G469LJFEIYZ",
                "asr_text": "a woman sitting in a small cluttered office",
                "wav": "wavs/375/utterance_286274.wav",
                "image": "h/home_office/gsun_48c633c668c194469102d3b8e0bde81c.jpg"
            }
        ]
    }
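
Before running the utilities on a new file, it can help to confirm that the file follows this layout. The following is a minimal sketch, not part of the original notebook: the helper name `check_metadata` is hypothetical, and treating all five per-item keys as required is an assumption based only on the example above.

```python
import json

# Assumption: these key sets are read off the example structure above.
REQUIRED_TOP_LEVEL = ("image_base_path", "audio_base_path", "data")
REQUIRED_ITEM_KEYS = ("uttid", "speaker", "asr_text", "wav", "image")

def check_metadata(metadata):
    """Return a list of problems found; the list is empty if the layout matches."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in metadata:
            problems.append("missing top-level key: {}".format(key))
    for index, item in enumerate(metadata.get("data", [])):
        for key in REQUIRED_ITEM_KEYS:
            if key not in item:
                problems.append("item {:d} is missing key: {}".format(index, key))
    return problems

# A one-item document in the layout shown above.
example = json.loads("""
{
    "image_base_path": "/path/to/images/",
    "audio_base_path": "/path/to/audio/",
    "data": [
        {
            "uttid": "A13G469LJFEIYZ-GSUN_48C633C668C194469102D3B8E0BDE81C",
            "speaker": "A13G469LJFEIYZ",
            "asr_text": "a woman sitting in a small cluttered office",
            "wav": "wavs/375/utterance_286274.wav",
            "image": "h/home_office/gsun_48c633c668c194469102d3b8e0bde81c.jpg"
        }
    ]
}
""")
problems = check_metadata(example)
print(problems)
```

An empty list means every expected key is present; the check does not verify that the referenced wav and image files exist.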

In [1]:
import json

# Copy base paths from a given json
def copy_base_paths(input_dict):
    
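    # Keep every top-level entry except the 'data' list.
    # Compare with != here: `k not in 'data'` is a substring test on the
    # string 'data', so it would also drop keys such as 'a' or 'dat'.
    return {key: value for key, value in input_dict.items() if key != 'data'}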

# Copies a sample of data from given json file to an output json file
def copy_sample(input_path, output_path, sample_size=1000):
    
    print("Copying a sample of size {:d} from {}".format(sample_size, input_path))
    with open(input_path) as f:
        inputs = json.load(f)

    # Copy the image and audio paths at the top of the json first...
    outputs = copy_base_paths(inputs)
    # ... and then copy a sample of the data.
    outputs['data'] = inputs['data'][0:sample_size]
    
    print("Writing output to {}".format(output_path))
    with open(output_path, 'w') as f:
        json.dump(outputs, f, indent=4)
    print("Finished.\n")

# Copies image paths from input and writes them to the output one path per line
def copy_image_paths(input_path, output_path):
    
    with open(input_path) as f:
        inputs = json.load(f)
    
    print("Writing image paths from {}\n to {}".format(input_path, output_path))
    base_path = inputs['image_base_path']
    with open(output_path, 'w') as f:
        for item in inputs['data']:
            f.write(base_path + item['image'] + "\n")
    print("Finished.\n")
            
# Create an output json file from an input json file using a list of utterance ids
def json_from_uttid_list(uttid_path, input_path, output_path):
    
    print("Using utterance ids from {}".format(uttid_path))
    with open(uttid_path) as f:
        uttids = [line.rstrip() for line in f]
    print("Number of uttids: {:d}".format(len(uttids)))
    
    print("Input json is {}".format(input_path))
    with open(input_path) as f:
        inputs = json.load(f)
    
    print("Copying data that matches utterance ids...")
    outputs = copy_base_paths(inputs)
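    # Look ids up in a set: scanning a list of ~100k ids for every item is very slow.
    uttid_set = set(uttids)
    outputs['data'] = [x for x in inputs['data'] if x['uttid'] in uttid_set]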
    
    print("Writing output to {}".format(output_path))
    with open(output_path, 'w') as f:
        json.dump(outputs, f, indent=4)
    print("Done, number of elements in output data is: {}\n".format(len(outputs['data'])))
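
The filtering step in `json_from_uttid_list` keeps the entries whose `uttid` appears in the list. When the output ends up smaller than the list, it can be useful to know which ids were not found in the input. The sketch below shows one way to do that in isolation; `filter_by_uttids` is a hypothetical helper and the ids in the demo are made up for illustration.

```python
def filter_by_uttids(data, uttids):
    """Split a filter into matched items and the ids that matched nothing.

    data   -- list of dicts, each with an 'uttid' key
    uttids -- iterable of utterance ids to keep
    """
    wanted = set(uttids)  # set lookup instead of scanning a list per item
    matched = [item for item in data if item["uttid"] in wanted]
    found = {item["uttid"] for item in matched}
    missing = sorted(wanted - found)  # ids in the list but absent from data
    return matched, missing

# Made-up ids, for illustration only.
data = [
    {"uttid": "spk1-IMG_A", "wav": "wavs/0/a.wav"},
    {"uttid": "spk2-IMG_B", "wav": "wavs/0/b.wav"},
    {"uttid": "spk3-IMG_C", "wav": "wavs/0/c.wav"},
]
matched, missing = filter_by_uttids(data, ["spk1-IMG_A", "spk3-IMG_C", "spk9-IMG_Z"])
print(len(matched), missing)  # prints: 2 ['spk9-IMG_Z']
```

The order of `matched` follows the order of `data`, not the order of the id list.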


### Small sample
First, take a subset of image/caption pairs from `train.json` and write it to a separate file. Then write the image paths of that subset to a text file. This list is used for collecting the images from Triton, where the images are stored in the computer vision group's database folders. Audio is stored in the `teamwork` corpus folder, so a corresponding list is not needed for the wavs. After that, we write the image paths of the validation dataset in the same way.

In [2]:
data_path = "/teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/"
train_json = "metadata/train.json"
val_json = "metadata/val.json"
val_image_path = "metadata/val_image_paths.txt"
sample_size = 1000

train_1k = "metadata/train_1k.json"
train_1k_image_path = "metadata/train_image_paths.txt"

copy_sample(data_path + train_json, data_path + train_1k, sample_size=sample_size)
copy_image_paths(data_path + train_1k, data_path + train_1k_image_path)
copy_image_paths(data_path + val_json, data_path + val_image_path)

Copying a sample of size 1000 from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train.json
Writing output to /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train_1k.json
Finished.
Writing image paths from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train_1k.json
 to /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train_image_paths.txt
Finished.
Writing image paths from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/val.json
 to /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/val_image_paths.txt
Finished.
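
The notebook does not show how the listed images are actually collected. As a rough sketch under stated assumptions, a path list like the ones written above could be used to copy the images into a local folder while preserving the directory layout below the image base path. The helper `collect_images` and all paths in the demo are hypothetical; the real transfer from Triton may well use a different tool.

```python
import os
import shutil
import tempfile

def collect_images(path_list_file, image_base_path, destination_dir):
    """Copy every image listed in path_list_file to destination_dir.

    The layout relative to image_base_path is preserved.
    Returns the listed paths that were not found on disk.
    """
    missing = []
    with open(path_list_file) as f:
        for line in f:
            source = line.strip()
            if not source:
                continue  # skip blank lines
            if not os.path.isfile(source):
                missing.append(source)
                continue
            relative = os.path.relpath(source, image_base_path)
            target = os.path.join(destination_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copyfile(source, target)
    return missing

# Demo with throwaway files instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "images")
    os.makedirs(os.path.join(base, "r", "restaurant"))
    image = os.path.join(base, "r", "restaurant", "example.jpg")
    with open(image, "wb") as f:
        f.write(b"placeholder bytes, not a real jpeg")

    list_file = os.path.join(tmp, "image_paths.txt")
    with open(list_file, "w") as f:
        f.write(image + "\n")
        f.write(os.path.join(base, "h", "home_office", "absent.jpg") + "\n")

    dest = os.path.join(tmp, "local_copy")
    missing = collect_images(list_file, base, dest)
    copied_ok = os.path.isfile(os.path.join(dest, "r", "restaurant", "example.jpg"))
    print(copied_ok, len(missing))  # prints: True 1
```

Returning the missing paths rather than raising makes it easy to see which entries of the list could not be collected.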


### JSON from an utterance ID list

We recreate the NIPS and ACL datasets using the utterance id lists provided and the full data json file.

In [3]:
train_json = "metadata/train.json"
uttid_lists_and_output = [("metadata/lists/nips_2016_train_uttids", "metadata/nips_train.json"), 
                          ("metadata/lists/nips_2016_val_uttids", "metadata/nips_val.json"),
                          ("metadata/lists/acl_2017_train_uttids", "metadata/acl_train.json"),
                          ("metadata/lists/acl_2017_val_uttids", "metadata/acl_val.json")]

for (uttid_list, output_file) in uttid_lists_and_output:
    json_from_uttid_list(data_path + uttid_list, 
                         data_path + train_json,
                         data_path + output_file)

Using utterance ids from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/lists/nips_2016_train_uttids
Number of uttids: 116111
Input json is /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train.json
Copying data that matches utterance ids...
Writing output to /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/nips_train.json
Done, number of elements in output data is: 115162
Using utterance ids from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/lists/nips_2016_val_uttids
Number of uttids: 1000
Input json is /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/train.json
Copying data that matches utterance ids...
Writing output to /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/nips_val.json
Done, number of elements in output data is: 993
Using utterance ids from /teamwork/t40511_asr/c/PlacesAudio400k/PlacesAudio_400k_distro/metadata/lists/acl_2017_t