## Purpose

This notebook tries to find every wav file that contains known, annotated sounds. The main steps we will follow are:

1. Read in all the annotated JSON from the 4 different files of annotated JSONs
2. Get the labels of the sounds, where in the image they are, and what image they correspond to
3. Attempt to match the spectrograms with the wav files that generated them
4. Create a file with the label, location in image, image name, wav file name for recordkeeping purposes

We'll start by importing some Python libraries.

In [14]:
#Imports
import numpy as np
import pandas as pd
import json
import os

Now we will pull the information from the annotated JSON stored in raw_data so we can get data labels, sound coordinates, and image file names.

In [42]:
#Creating a df to save the data in
annotation_df = pd.DataFrame(columns=['sound','points','image_name'])
annotation_df.head()

#Getting the path to the JSON files & listing files
#path_to_json = '../raw_data/MLFigs_Labeled_Oct_26_Chris/20181204T100004-File-0.json' #test file
path_to_json = '../raw_data/MLFigs_Labeled_Oct_26_Chris/'
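#Keeping only .json files so stray files (e.g. hidden files) don't break json.load
json_file_list = [f for f in os.listdir(path_to_json) if f.endswith('.json')]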

#Iterate through json files to get annotated image info
for file in json_file_list:
    path_to_file = os.path.join(path_to_json, file)
    
    #Loading the JSON data & turning into dict (the with block closes the file)
    with open(path_to_file) as annotated_file:
        annotated_dict = json.load(annotated_file)
    #annotated_dict.keys()

    #Pulling out the labels, points, and image path
    image_name = annotated_dict['imagePath']
    if len(annotated_dict['shapes']) > 0:
        for shape in annotated_dict['shapes']:
            sound = shape['label']
            points = shape['points']
            annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name]
    else:
        sound = None
        points = None
        annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name]

#Saving output for future use
annotation_df.to_csv('../intermediate_data/annotated_info.csv')
    
#Examining head of file
print(annotation_df.head())

     sound                                             points  \
0  mooring  [[322.18518518518516, 592.1111111111111], [388...   
1  mooring  [[979.8328328328328, 549.1481481481482], [1054...   
2  mooring  [[1865.8888888888887, 611.3703703703703], [191...   
3  mooring  [[737.9879518072288, 739.7831325301204], [821....   
4  mooring  [[1975.3373493975903, 919.301204819277], [2034...   

                   image_name  
0  20181204T100004-File-0.png  
1  20181204T100004-File-0.png  
2  20181204T100004-File-0.png  
3  20181204T100004-File-1.png  
4  20181204T100004-File-1.png  
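
As a minimal sketch of the extraction step above, the per-file logic can be written as a small function and tried on an in-memory example. The function name `annotation_rows` and the example JSON are hypothetical; only the `imagePath`, `shapes`, `label`, and `points` keys come from the loop above.

```python
import json
import pandas as pd

# Hypothetical stand-in for one annotation file. The real files are read
# from disk; only the keys used in the loop above are included here.
example_json = """
{
  "imagePath": "example-File-0.png",
  "shapes": [
    {"label": "mooring", "points": [[10.0, 20.0], [30.0, 40.0]]},
    {"label": "humpback", "points": [[50.0, 60.0], [70.0, 90.0]]}
  ]
}
"""

def annotation_rows(annotated_dict):
    """Return one (sound, points, image_name) row per shape.

    An image with no shapes yields a single row with sound and points
    set to None, so that it is still counted.
    """
    image_name = annotated_dict['imagePath']
    shapes = annotated_dict.get('shapes', [])
    if not shapes:
        return [(None, None, image_name)]
    return [(shape['label'], shape['points'], image_name) for shape in shapes]

rows = annotation_rows(json.loads(example_json))
example_df = pd.DataFrame(rows, columns=['sound', 'points', 'image_name'])
print(example_df)
```

Collecting the rows in a list and building the DataFrame once avoids growing it row by row, which may matter if the number of annotation files increases.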


Now we need to match the annotated spectrograms with the wav files that generated them. We start from the "image_name": everything before the ".png" (call it XX) identifies a file called XXMetadata. That XXMetadata file contains a "FileName" entry whose value carries the title of the matching hydrophone recording. As an example:

    image_name: 20181204T100004-File-0.png
    XX: 20181204T100004-File-0
    Metadata file: 20181204T100004-File-0Metadata
    FileName contained in the Metadata file: 181204-100002-437599-806141979_Spectrograms_20Hz.mat
    Hydrophone recording name: 181204-100002-437599-806141979.wav

We need those hydrophone recording names so we can make a list of the recordings to copy into raw_data for proof-of-concept (POC) testing.
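
The name mapping described above can be sketched with two small helpers. The function names are hypothetical, and the `_Spectrograms_20Hz.mat` suffix is an assumption taken from the single example above; other metadata files may use a different suffix. Reading "FileName" out of the metadata file is not shown, since its format is not described here.

```python
import os

# Assumption: suffix taken from the one example above.
SPECTROGRAM_SUFFIX = '_Spectrograms_20Hz.mat'

def metadata_name_from_image(image_name):
    """Turn 'XX.png' into 'XXMetadata'."""
    stem, _extension = os.path.splitext(image_name)
    return stem + 'Metadata'

def wav_name_from_file_name(file_name, suffix=SPECTROGRAM_SUFFIX):
    """Turn the metadata 'FileName' value into the hydrophone wav name."""
    if not file_name.endswith(suffix):
        # Fail loudly rather than guess at an unexpected naming pattern
        raise ValueError("Unexpected FileName format: {0}".format(file_name))
    return file_name[:-len(suffix)] + '.wav'

print(metadata_name_from_image('20181204T100004-File-0.png'))
print(wav_name_from_file_name(
    '181204-100002-437599-806141979_Spectrograms_20Hz.mat'))
```

Raising an error on an unexpected suffix is a design choice for this sketch: a silent mismatch would produce a wav name that does not exist.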

You'll see from the .head() printout above that images are repeated. We will start by getting a unique list of images in our annotation files.

In [51]:
#Getting unique images
unique_images = annotation_df['image_name'].unique()

#Checking that got all of the images
if len(unique_images) == len(json_file_list):
    print("Got all {0} images - this will include images with no annotations".format(len(unique_images)))
    print("Dropping {0} images which don't have any sound annotations".format
          (len(annotation_df.loc[annotation_df['sound'].isna()])))
    #Dropping only on 'sound' so the drop matches the count printed above
    annotated_info_df = annotation_df.dropna(subset=['sound'])
    if len(annotated_info_df) == len(annotation_df)-len(annotation_df.loc[annotation_df['sound'].isna()]):
        print("Images with no annotations successfully dropped")
    else:
        print("ERROR - did not drop correct number of images with no sounds")
        print("PROGRAMMER NEEDS TO CHECK WHY")
else:
    print("ERROR - pulled {0} unique images when was expecting {1}".format(
        len(unique_images), len(json_file_list)))
    print()
    #Stripping the png off of the image names
    image_names = list(annotation_df['image_name'])
    image_start = [name.replace('.png', '') for name in image_names]
    set_image_names = set(image_start)

    #Stripping the end off of the json files
    json_start = [json_name.replace('.json', '') for json_name in json_file_list]
    set_json_names = set(json_start)

    #Compare the sets in both directions (both lists can be non-empty)
    missing_json = list(set_image_names.difference(set_json_names))
    if len(missing_json) > 0:
        print("The files which have images but not annotation JSONs are:")
        print(missing_json)
    missing_image = list(set_json_names.difference(set_image_names))
    if len(missing_image) > 0:
        print("The files which have annotation JSONs but not images are:")
        print(missing_image)
    print()
    print("CHECK ORIGINAL DATA TO SEE WHY FILES ARE MISSING")

Got all 346 images - this will include images with no annotations
Dropping 10 images which don't have any sound annotations
Images with no annotations successfully dropped


Now we will examine which kinds of sounds we have captured and how many of each. Ideally, our VGGish algorithm will identify each of these sounds, plus the sounds which occur between them in the audio file.

In [57]:
#Counting numbers of sounds
annotated_info_df.groupby('sound').count()

            points  image_name
sound                         
airplane        31          31
fish            57          57
flow noise     670         670
helicopter      23          23
humpback      1208        1208
mooring        332         332
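
Since `groupby('sound').count()` repeats the same count in every column, a single-column alternative is `value_counts()`. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical), not the real annotation data.

```python
import pandas as pd

# Made-up rows for illustration only; not the real annotation data
toy_df = pd.DataFrame({
    'sound': ['mooring', 'mooring', 'fish'],
    'image_name': ['a.png', 'a.png', 'b.png'],
})

# One count per sound label, sorted from most to least frequent
counts = toy_df['sound'].value_counts()
print(counts)
```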
