## Purpose

The purpose of this code is to map all .wav files to their spectrograms
and most recent annotations. 

This code will allow the team to join datapoints output from VGGish
to the original spectrogram annotations from Chris and team. This will
allow the VGGish team to ground their clusters in known annotations.

Input(s):
1. Path to unique_images_annotations.csv file. This file contains the
    image_file_name, image_file_path, and json_file_path for all the
    annotated Fred Olsen images from Dec 2018 to Feb 2019.

Output(s):
1. CSV file with each wav file name, metadata file name, image file name,
    json file name, annotation names (e.g., mooring, whale, airplane), and
    the annotation coordinates

Usage:
1. This code will generate a csv which will be joined to the output datapoints
    from the VGGish model.

We'll start by importing some Python libraries.

In [29]:
#Imports
import numpy as np
import pandas as pd
import json
import os
import matplotlib.pyplot as plt
from os import listdir
from os.path import isfile, join

Next, we pull the information from the annotated JSON files stored on the external hard drive to get the data labels, sound coordinates, and image file names. This is tricky because the same image may have been annotated multiple times. To ensure that we use the most recent annotation for each image, we rely on "unique_images_annotations.csv", which is generated by the "unique_images_annotations.ipynb" notebook. If the CSV file doesn't exist, stop now and run that notebook to generate it.

In [3]:
#Creating a df to save the data in
annotation_df = pd.DataFrame(columns=['sound','points','image_name', 'json_file_path'])
annotation_df.head()

#Getting the path to the JSON files & listing files
uniq_annotation_df = pd.read_csv('unique_images_annotations.csv')
json_file_list = list(uniq_annotation_df['json_file_path'])

#Iterate through json files to get annotated image info
for file in json_file_list:
    #Loading the JSON data & turning into dict
    with open(file) as annotated_file:
        annotated_dict = json.load(annotated_file)
    #annotated_dict.keys()

    #Pulling out the labels, points, and image path
    image_name = annotated_dict['imagePath']
    if len(annotated_dict['shapes']) > 0:
        for shape in annotated_dict['shapes']:
            sound = shape['label']
            points = shape['points']
            annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    else:
        sound = None
        points = None
        annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    
#Validating that images paired with json from unique annotations file are the same as the images referenced in the JSON files themselves
comb_df = pd.merge(left = uniq_annotation_df, right = annotation_df, how = 'left', left_on='json_file_path', right_on='json_file_path')
comb_df['image_file_name_tester'] = comb_df['image_file_name']+'.png'
comb_df['name_compare'] = np.where((comb_df['image_file_name_tester'] == comb_df['image_name']), 0, 1)
if comb_df['name_compare'].sum() != 0:
    print("The following json files contain different image filenames than those they were aligned to in the unique_image_annotations code")
    print(comb_df.loc[comb_df['name_compare']!=0])
else:
    pass

#Validating we got all the json files from the original unique annotations dataframe in the new annotation df
if len(uniq_annotation_df) != len(annotation_df['json_file_path'].unique()):
    print("ERROR - not all JSON contained in the annotation_df")
else:
    pass

#Examining head of file
print(annotation_df.head())

        sound                                             points  \
0     mooring  [[382.56626506024094, 614.4819277108433], [458...   
1  helicopter  [[1920.952380952381, 1048.1904761904761], [216...   
2     mooring  [[424.7349397590361, 748.2168674698794], [490....   
3     mooring  [[1209.0722891566263, 938.5783132530119], [127...   
4     mooring  [[1645.2168674698794, 744.6024096385541], [173...   

                    image_name  \
0   20181204T100004-File-8.png   
1   20181204T100004-File-8.png   
2  20181204T113004-File-16.png   
3  20181204T113004-File-16.png   
4  20181204T113004-File-16.png   

                                      json_file_path  
0  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
1  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
2  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
3  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
4  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
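
The row-building logic in the cell above can also be written as a small helper that returns plain dictionaries, which are then turned into a dataframe in one step. This is only a sketch: the function name `rows_from_annotation` and the in-memory example are illustrative assumptions, not part of the original notebook. It relies only on the `imagePath`, `shapes`, `label`, and `points` keys already used above.

```python
import json


def rows_from_annotation(annotated_dict, json_path):
    """Return one row per annotated shape, or a single empty row if there are none."""
    image_name = annotated_dict["imagePath"]
    shapes = annotated_dict.get("shapes", [])
    if not shapes:
        # Keep the image in the table even when nothing was annotated
        return [{"sound": None, "points": None,
                 "image_name": image_name, "json_file_path": json_path}]
    return [{"sound": shape["label"], "points": shape["points"],
             "image_name": image_name, "json_file_path": json_path}
            for shape in shapes]


# Hypothetical in-memory annotation; real files would be read with json.load
example = json.loads(
    '{"imagePath": "20181204T100004-File-8.png", "shapes": ['
    '{"label": "mooring", "points": [[1.0, 2.0], [3.0, 4.0]]}, '
    '{"label": "helicopter", "points": [[5.0, 6.0], [7.0, 8.0]]}]}'
)
rows = rows_from_annotation(example, "example.json")
empty_rows = rows_from_annotation({"imagePath": "x.png", "shapes": []}, "empty.json")
print(rows)
```

Collecting the rows from every file in one list and calling `pd.DataFrame(all_rows)` once avoids growing the dataframe row by row with `.loc`, which gets slow as the number of annotations increases.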


Now we need to match the annotated spectrograms with the wav files that generated them. We start from the "image_name": everything before the ".png" (call it X) points to a file called XMetadata. Inside that XMetadata file there is a "FileName" field containing the raw title, which matches the hydrophone recording title. As an example:

    image_name: 20181204T100004-File-0.png
    X: 20181204T100004-File-0
    Metadata file: 20181204T100004-File-0Metadata
    FileName: 181204-100002-437599-806141979_Spectrograms_20Hz.mat
    Hydrophone recording name: 181204-100002-437599-806141979.wav

By stripping "_Spectrograms_20Hz.mat" from the FileName we get the prefix of the hydrophone recording name. We start by pulling each Metadata file, retrieving its FileName field, and keeping the part of the title before "_Spectrograms_20Hz.mat". First, we add the Metadata file paths to annotation_df for easy reference.

In [6]:
#Getting the image prefixes
image_names = list(annotation_df['image_name'])
image_start = [name.replace('.png', '') for name in image_names]

#Creating all the metadata file names
metadata_names = []
for image_name in image_start:
    metadata_names.append('D:/1Dec2018_28Feb2019/MLFigsMeta/'+image_name+'Metadata.txt')
    
#Adding the metadata names to the table with annotation info
annotation_df['metadata_file_path'] = metadata_names
annotation_df.head()

Unnamed: 0,sound,points,image_name,json_file_path,metadata_file_path
0,mooring,"[[382.56626506024094, 614.4819277108433], [458...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...
1,helicopter,"[[1920.952380952381, 1048.1904761904761], [216...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...
2,mooring,"[[424.7349397590361, 748.2168674698794], [490....",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...
3,mooring,"[[1209.0722891566263, 938.5783132530119], [127...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...
4,mooring,"[[1645.2168674698794, 744.6024096385541], [173...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...
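
The naming convention from the example above can be sketched as two small string helpers. The function names and the `SPECTROGRAM_SUFFIX` constant are illustrative assumptions. The sketch removes the suffix by name rather than by a fixed character count, so it fails loudly if a FileName does not follow the expected pattern.

```python
# Suffix taken from the FileName example in the text
SPECTROGRAM_SUFFIX = "_Spectrograms_20Hz.mat"


def metadata_name_from_image(image_name):
    """'20181204T100004-File-0.png' -> '20181204T100004-File-0Metadata'."""
    stem = image_name[:-len(".png")] if image_name.endswith(".png") else image_name
    return stem + "Metadata"


def wav_name_from_filename(filename):
    """Swap the spectrogram suffix of a FileName for '.wav'."""
    if not filename.endswith(SPECTROGRAM_SUFFIX):
        raise ValueError("Unexpected FileName format: " + filename)
    return filename[:-len(SPECTROGRAM_SUFFIX)] + ".wav"


metadata_name = metadata_name_from_image("20181204T100004-File-0.png")
wav_name = wav_name_from_filename("181204-100002-437599-806141979_Spectrograms_20Hz.mat")
print(metadata_name)
print(wav_name)
```

Note that the cell above also appends a ".txt" extension and a directory prefix when building the metadata paths. The sketch leaves those out to match the naming example in the text.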


Now we need to open each of those metadata files to extract the FileName, cut off the "_Spectrogram..." and append '.wav' to get the hydrophone recording name.

In [7]:
#List of metadata file names
metadata_file_names = annotation_df['metadata_file_path'].unique()

#Creating a metadata df to store info taken from each file
metadata_df = pd.DataFrame(columns=['metadata_file_path','filename','wav_filename', 'windows_plotted','starttime'])

for file in metadata_file_names:
    #Read all the lines in the filepath & save in array if it exists
    #Otherwise add it to a list for printing later
    try:
        with open(file, 'r') as meta_file:
            text = meta_file.readlines()
        filename = text[1][10:-2]
        starttime = text[2][11:-2]
        windows_plotted = text[3][16:-2]
        wav_filename = 'D:/1Dec2018_28Feb2019/Hydrophone/'+filename[:-24]+'.wav'
        metadata_df.loc[len(metadata_df.index)] = [file, filename, wav_filename, windows_plotted, starttime]
    except FileNotFoundError:
        metadata_df.loc[len(metadata_df.index)] = [file, None, None, None, None]

#Matching the metadata_df information back to annotation_df
meta_annotation_df = pd.merge(left = annotation_df, right = metadata_df, how = 'left', on='metadata_file_path')
#meta_annotation_df.head()

#Seeing how many images don't have associated metadata files
missing_metadata = meta_annotation_df[meta_annotation_df['filename'].isna()].groupby('image_name').count().sort_values('sound', ascending=False).reset_index()
print("{0} images don't have an associated metadata file.".format(len(missing_metadata)))
print("This is {0}% of all images".format(round(len(missing_metadata)*100/len(meta_annotation_df['image_name'].unique()),2)))
print("IDENTIFY ISSUE AND SOLVE WITH CHRIS")

187 images don't have an associated metadata file.
This is 45.72% of all images
SOLVE WITH CHRIS DURING MODEL ITERATION
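
The cell above reads the metadata fields by fixed line and character positions. A sketch of a more forgiving alternative is shown below, assuming each metadata line has the form `Key: value`. That layout, and the key names `StartTime` and `WindowsPlotted`, are assumptions made for illustration (only `FileName` is named in this notebook), so check them against a real metadata file first.

```python
def parse_metadata_lines(lines):
    """Parse 'Key: value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical file contents; the values are taken from the sample rows in this notebook
sample_lines = [
    "Spectrogram metadata\n",
    "FileName: 181204-100002-437599-806141979_Spectrograms_20Hz.mat\n",
    "StartTime: 20181204T100004\n",
    "WindowsPlotted: 9602-10801\n",
]
fields = parse_metadata_lines(sample_lines)
print(fields)
```

Looking fields up by key means a reordered or extra line in a metadata file does not silently shift every value into the wrong column.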


Let's address the issue of the missing metadata files. In an earlier conversation, Chris mentioned an error in the digit representing time in the filename. The image name below contains 20181204T100004, which stands for 4 December 2018 at 10:00:04 am. Let's see what happens if we try to match the missing files to metadata files using only the date, the first 4 digits of the time (HHMM), and the file number (e.g., File-0 in the example below).

    image_name: 20181204T100004-File-0.png
    X: 20181204T100004-File-0
    Metadata file: 20181204T100004-File-0Metadata
    FileName: 181204-100002-437599-806141979_Spectrograms_20Hz.mat
    Hydrophone recording name: 181204-100002-437599-806141979.wav

In [95]:
#Checking the seconds digits - all the images with missing metadata files end with "04" in the time
missing_metadata['image_name'].str[13:15].unique()

#Repeating the steps from above to flexibly match the metadata files based on the date and time out to 4 digits + the file number
#Plan:
#  - match the two dataframes on first n digits and file number
#  - check that each image has a unique metadata file
#  - is there something common about the image seconds and metadata seconds
#  - create list to send to Chris

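#Creating the empty unmatched image and metadata dataframes
#Note: this reuses the name metadata_df, replacing the dataframe built in the earlier metadata cell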
unmatched_image_df = pd.DataFrame(columns=['image_name','firstn','seconds','file_number'])
metadata_df = pd.DataFrame(columns=['metadata_name','firstn','seconds','file_number'])
unmatched_images = list(missing_metadata['image_name'].unique())
#print(unmatched_images)

#Populating the unmatched_image_df
for unmatched_image in unmatched_images:
    unmatched_image_df.loc[len(unmatched_image_df.index)] = [unmatched_image, unmatched_image[:13], unmatched_image[13:15],
                                                             unmatched_image[16:-4]]
#Populating the unmatched metadata_df
hd_path = 'D:/1Dec2018_28Feb2019/MLFigsMeta'
all_metadata_files = [file for file in listdir(hd_path) if isfile(join(hd_path, file))]
for file in all_metadata_files:
    metadata_df.loc[len(metadata_df.index)] = [file, file[:13], file[13:15], file[16:-12]]

#Match the unmatched files on firstn and file_number
fuzzy_match_df = pd.merge(left = unmatched_image_df, right = metadata_df, how = 'inner',
                          on=['firstn', 'file_number'])
if len(fuzzy_match_df) == 0:
    print("No files match based on the file number and YYYYMMDDTHHMM")
else:
    print("YAY we found some files - dig into df to discover which files match")
#NO FILES MATCH BASED ON THE FILE NUMBER AND THE YYYYMMDDTHHMM

No files match based on the file number and YYYYMMDDTHHMM


Unfortunately, no metadata files match the previously unmatched images when we keep the year, month, day, hours, and minutes the same. Perhaps the mismatch extends to the minutes as well as the seconds. We will repeat the matching procedure above, but match only on the YYYYMMDDTHH portion of the image and metadata filenames.

In [97]:
#TRYING SAME LOGIC MATCHING ON YYYY, MM, DD, AND HH ONLY

#Creating the empty unmatched image and metadata dataframes
unmatched_image_11 = pd.DataFrame(columns=['image_name','firstn','min_sec','file_number'])
unmatched_metadata_11 = pd.DataFrame(columns=['metadata_name','firstn','min_sec','file_number'])
unmatched_images_11 = list(missing_metadata['image_name'].unique())
#print(unmatched_images)

#Populating the unmatched_image_11
for unmatched_image in unmatched_images_11:
    unmatched_image_11.loc[len(unmatched_image_11.index)] = [unmatched_image, unmatched_image[:11], unmatched_image[11:15],
                                                             unmatched_image[16:-4]]

#Populating unmatched_metadata_11
for file in all_metadata_files:
    unmatched_metadata_11.loc[len(unmatched_metadata_11.index)] = [
        file, file[:11], file[11:15], file[16:-12]
    ]

#Trying to match on YYYY, MM, DD, T, HH
matching_11 = pd.merge(left = unmatched_image_11, right = unmatched_metadata_11, how='inner',
                       on=['firstn', 'file_number'])

#Testing for matches
if len(matching_11) == 0:
    print("Unfortunately there are still no matches based on YYYYMMDDTHH and file number elements.")
else:
    print("YAY! Some files matched. Check df for more information.")
#THERE ARE NO METADATA FILES THAT MATCH THE UNMATCHED IMAGES ON THE SAME DAY & FILE NUMBER

Unfortunately there are still no matches based on YYYYMMDDTHH and file number elements.
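
The two matching attempts above differ only in how many time digits they keep. As a sketch, the slicing can be replaced by a regular expression that names each part of the filename, with the number of time digits as a parameter. The names `NAME_PATTERN` and `match_key` and the second filename in the example are hypothetical.

```python
import re

# YYYYMMDD, a literal T, HHMMSS, then the file number (e.g. File-0)
NAME_PATTERN = re.compile(r"^(?P<date>\d{8})T(?P<time>\d{6})-(?P<file_number>File-\d+)")


def match_key(name, time_digits):
    """Return (date, first `time_digits` of the time, file number), or None if unparseable."""
    found = NAME_PATTERN.match(name)
    if found is None:
        return None
    parts = found.groupdict()
    return (parts["date"], parts["time"][:time_digits], parts["file_number"])


image_name = "20181204T100004-File-0.png"
# Hypothetical metadata file whose seconds differ from the image name
metadata_name = "20181204T100002-File-0Metadata.txt"

exact_match = match_key(image_name, 6) == match_key(metadata_name, 6)
minute_match = match_key(image_name, 4) == match_key(metadata_name, 4)
print(exact_match, minute_match)
```

Calling `match_key` with 4 digits corresponds to the YYYYMMDDTHHMM attempt and 2 digits to the YYYYMMDDTHH attempt. Names that do not fit the pattern return `None` instead of producing a misaligned slice.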


We are still failing to find any matches between the unmatched images and metadata files. Perhaps the images come from a time period not covered by the metadata. To check, we compare the start and end dates of the metadata with those of the unmatched images.

In [101]:
#Sorting the df for easier comparison
unmatched_image_df = unmatched_image_df.sort_values('firstn', axis= 0, ascending=True)
sorted_metadata_df = metadata_df.sort_values('firstn',axis=0, ascending=True)

#looking at dates for unmatched images
unmatched_image_df['date'] = unmatched_image_df['firstn'].str[:-5]
unmatched_image_date_set= set(unmatched_image_df['date'])
print("There are {0} days which have images with no accompanying metadata.".format(len(unmatched_image_date_set)))

#looking at dates for metadata files
sorted_metadata_df['date'] = sorted_metadata_df['firstn'].str[:-5]
metadata_date_set = set(sorted_metadata_df['date'])
print("There are {0} days with at least one metadata file.".format(len(metadata_date_set)))

#Checking the overlap of days
both_dates = metadata_date_set.intersection(unmatched_image_date_set)
print("There are {0} days which overlap between metadata file coverage and unmatched images.".format(len(both_dates)))


#Understanding the range of days
print("The unmatched image file dates range from {0} to {1}.".format(min(unmatched_image_date_set), max(unmatched_image_date_set)))
print("The metadata file dates range from {0} to {1}.".format(min(metadata_date_set), max(metadata_date_set)))
#print(sorted(metadata_date_set))
#print(sorted(unmatched_image_date_set))

There are 32 days which have images with no accompanying metadata.
There are 42 days with at least one metadata file.
There are 0 days which overlap between metadata file coverage and unmatched images.
The unmatched image file dates range from 20190115 to 20190223.
The metadata file dates range from 20181204 to 20190114.


187

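The date ranges do not overlap: the metadata files end on 20190114, while every unmatched image falls between 20190115 and 20190223. The metadata for that later period therefore appears to be missing rather than misnamed. We should ask Chris for the missing metadata files so we can tie these images back to the wav files processed by VGGish. To quantify what we are missing, how many annotations in each category have matching metadata, and how many do not?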

In [141]:
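#Define function that flags whether a metadata file was matched
def label_matches(row):
    #pd.isna catches both None and NaN, which can each mark a missing filename
    if pd.isna(row['filename']):
        return 0
    else:
        return 1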
    
#Apply that function to the df
meta_annotation_df['matched_metadata'] = meta_annotation_df.apply(label_matches, axis = 1)
meta_annotation_df

#Grouping annotations by metadata matched vs. not matched
print("The distribution of annotated sounds by matched metadata status is as follows:")
print(meta_annotation_df.groupby(['matched_metadata','sound']).size())

The distribution of annotated sounds by matched metadata status is as follows:
matched_metadata  sound        
0                 airplane           18
                  boat                8
                  fish               21
                  flow noise        471
                  helicopter          8
                  humpback         1130
                  mooring            56
1                 airplane           14
                  fish               67
                  flow noise        345
                  helicopter         15
                  humpback          391
                  mooring           405
                  mooring noise       3
dtype: int64


In [143]:
#meta_annotation_df['image_name'].groupby(['matched_metadata', 'image_name']).size()
images_matching_status = meta_annotation_df[['image_name', 'matched_metadata']].drop_duplicates()

print("The distribution of unique images by matched metadata status is as follows:")
print(images_matching_status.groupby('matched_metadata').size())

The distribution of unique images by matched metadata status is as follows:
matched_metadata
0    187
1    222
dtype: int64
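
The matched versus unmatched counts above can also be laid out side by side, which makes it easier to see which sounds lose the most annotations. The sketch below uses a small made-up dataframe (`toy`) in place of `meta_annotation_df`, and `pd.crosstab` to build the table.

```python
import pandas as pd

# Made-up stand-in for meta_annotation_df; a missing filename means no metadata file
toy = pd.DataFrame({
    "sound": ["humpback", "humpback", "mooring", "fish"],
    "filename": ["a.mat", None, "b.mat", None],
})

# Vectorised version of the matched flag: 1 if a filename is present, else 0
toy["matched_metadata"] = toy["filename"].notna().astype(int)

# Rows are sounds, columns are the matched flag (0 or 1), cells are annotation counts
summary = pd.crosstab(toy["sound"], toy["matched_metadata"])
print(summary)
```

With the real dataframe, sorting `summary` by the `0` column would list the sounds with the most annotations still waiting on metadata.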


# AFTER CHRIS PROVIDES NEW METADATA, CHECK WAV FILE MAPPING

In [12]:
#Testing to see if all wav files have been accounted for in the meta_annotation_df

#Getting list of wav files from hard drive
wav_list = [file for file in os.listdir('D:/1Dec2018_28Feb2019/Hydrophone/') if file.endswith('.wav')]

#List of wav files from meta_annotation_df
matched_wav_list = list(meta_annotation_df['wav_filename'].unique())

if len(wav_list) != len(matched_wav_list):
    print("ALERT - {0} files in the hard drive, only {1} in the meta_annotation_df.".format(len(wav_list), len(matched_wav_list)))
else:
    print("All wav files accounted for in meta_annotation_df")

ALERT - 1325 files in the hard drive, only 140 in the meta_annotation_df.
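
The check above compares only the lengths of the two lists. The directory listing holds bare file names, while `wav_filename` holds full paths and can include a missing value for unmatched images, so the counts are not directly comparable. A sketch of a set-based comparison follows; the function name `wav_coverage` and the example inputs are hypothetical.

```python
import os


def wav_coverage(files_on_disk, mapped_paths):
    """Compare wav names on disk with the mapped wav paths, ignoring missing values."""
    on_disk = set(files_on_disk)
    # Reduce full paths to bare file names and drop None/NaN entries
    mapped = {os.path.basename(path) for path in mapped_paths if isinstance(path, str)}
    return {
        "unmapped": sorted(on_disk - mapped),      # on disk but never annotated
        "not_on_disk": sorted(mapped - on_disk),   # annotated but file not found
    }


coverage = wav_coverage(
    ["a.wav", "b.wav"],
    ["D:/example/Hydrophone/a.wav", None, float("nan")],
)
print(coverage)
```

Listing the names in each direction shows which recordings have no annotations and whether any mapped name is absent from the drive, which a length comparison cannot reveal.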


Now that we have wav files mapped back to their annotations, we need to understand when each annotation occurs. We do this to ultimately map back to our VGGish embeddings, which occur every 0.96 seconds. We start by translating the "windows_plotted" back into audio seconds. From Chris' README we know that:

- Windows plotted: Tells which files from the audio file are plotted. If 20 Hz spectrograms and the first 
           minute of the file this would be 1-1201. If the second minute they would be 1202-1401 etc. 
           These were saved because it would be possible to closely map the time of a second from the
           window number and file start time.

Based on that information we can surmise that 1 minute of audio = 1200 windows. We take the start and end window numbers, subtract them, and divide by 1200 to get the fraction of a minute covered by each spectrogram, then multiply by 60 to express it in seconds. In a later block of code we will look at the annotation coordinates to find the more granular location in time within the audio file.

In [161]:
#Breaking apart the windows_plotted into a start_window and stop_window
meta_annotation_df['start_window'] = meta_annotation_df['windows_plotted'].str.split('-').str[0]
meta_annotation_df['stop_window'] = meta_annotation_df['windows_plotted'].str.split('-').str[1]

#Subtracting the two windows to get the total time represented by the spectrogram
meta_annotation_df['time_in_spectrogram_window'] = pd.to_numeric(meta_annotation_df['stop_window']) - pd.to_numeric(meta_annotation_df['start_window'])

#Time represented by spectrogram in seconds
meta_annotation_df['time_in_spectrogram_secs'] = (meta_annotation_df['time_in_spectrogram_window']/1200)*60

#Dropping the extra calculation columns
meta_annotation_df = meta_annotation_df.drop(['start_window','stop_window', 'time_in_spectrogram_window'], axis = 1)
meta_annotation_df.head()

Unnamed: 0,sound,points,image_name,json_file_path,metadata_file_path,filename,wav_filename,windows_plotted,starttime,matched_metadata,time_in_spectrogram_secs
0,mooring,"[[382.56626506024094, 614.4819277108433], [458...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801,20181204T100004,1,59.95
1,helicopter,"[[1920.952380952381, 1048.1904761904761], [216...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801,20181204T100004,1,59.95
2,mooring,"[[424.7349397590361, 748.2168674698794], [490....",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004,1,59.95
3,mooring,"[[1209.0722891566263, 938.5783132530119], [127...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004,1,59.95
4,mooring,"[[1645.2168674698794, 744.6024096385541], [173...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004,1,59.95
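
The window arithmetic can be sketched as plain functions. The 1200 windows per minute figure comes from the README excerpt above. The start offset calculation additionally assumes that window 1 sits at 0 seconds in the wav file, which this notebook does not confirm. The function names are illustrative.

```python
WINDOWS_PER_MINUTE = 1200  # from the README: 1 minute of audio = 1200 windows


def parse_windows(windows_plotted):
    """Split a string such as '9602-10801' into integer start and stop windows."""
    start, stop = (int(part) for part in windows_plotted.split("-"))
    return start, stop


def span_seconds(windows_plotted):
    """Seconds covered by the spectrogram, computed as in the cell above."""
    start, stop = parse_windows(windows_plotted)
    return (stop - start) / WINDOWS_PER_MINUTE * 60


def start_offset_seconds(windows_plotted):
    """Seconds from the start of the wav file, ASSUMING window 1 is at 0 s."""
    start, _ = parse_windows(windows_plotted)
    return (start - 1) / WINDOWS_PER_MINUTE * 60


span = span_seconds("9602-10801")
offset = start_offset_seconds("9602-10801")
print(round(span, 2), round(offset, 2))
```

Because the stop window is inclusive, `stop - start` counts 1199 intervals rather than 1200 windows, which is why the table shows 59.95 seconds instead of a full 60.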


# VERIFY THE BELOW 2/6

In the next block of code we will create columns detailing the start and end times at which each annotated noise occurred. Combined with the "time_in_spectrogram_secs" measurement, this will tell us approximately when in the audio file the annotation occurred. Per Debbie, we know that the "points" are in pixels, and there are X pixels across the horizontal (time) axis. Knowing each file is ~1 minute in length, each pixel is X seconds.

Now that we know when each annotation occurred in the wav file, we can match back to the ~1800 slices created by VGGish.
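
A sketch of that conversion is shown below. The image width is still unknown here, so it is passed in as a parameter, and the value 2000 in the example is purely hypothetical. The sketch also assumes that the time axis spans the full image width with no margins, and that VGGish slice `i` covers the interval starting at `i * 0.96` seconds. Both assumptions need to be verified before use.

```python
import math

VGGISH_HOP_SECONDS = 0.96  # embedding spacing stated earlier in this notebook


def pixel_to_seconds(x_pixel, image_width_px, spectrogram_secs, offset_secs=0.0):
    """Linearly map a horizontal pixel position to seconds in the wav file."""
    return offset_secs + (x_pixel / image_width_px) * spectrogram_secs


def vggish_slice_index(t_seconds):
    """Index of the 0.96 s slice that contains time t_seconds."""
    return math.floor(t_seconds / VGGISH_HOP_SECONDS)


def annotation_slices(points, image_width_px, spectrogram_secs, offset_secs=0.0):
    """First and last slice indices touched by an annotation's [x, y] points."""
    xs = [point[0] for point in points]
    t_start = pixel_to_seconds(min(xs), image_width_px, spectrogram_secs, offset_secs)
    t_end = pixel_to_seconds(max(xs), image_width_px, spectrogram_secs, offset_secs)
    return vggish_slice_index(t_start), vggish_slice_index(t_end)


# Hypothetical 2000-pixel-wide image covering 60 seconds of audio
slices = annotation_slices([[500, 10], [1000, 20]], image_width_px=2000, spectrogram_secs=60)
print(slices)
```

In this toy case the annotation spans 15 to 30 seconds, so it touches slices 15 through 31. Passing the spectrogram's start offset as `offset_secs` would place the annotation within the full recording rather than within the one-minute image.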

Finally, we save the final output as wav_to_annotation.csv.

In [14]:
#Saving the file
print('Saving wav to annotation information for future use')
meta_annotation_df.to_csv('wav_to_annotation.csv')
meta_annotation_df.head()

Saving wav to annotation information for future use


Unnamed: 0,sound,points,image_name,json_file_path,metadata_file_path,filename,wav_filename,windows_plotted,starttime
0,mooring,"[[382.56626506024094, 614.4819277108433], [458...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801,20181204T100004
1,helicopter,"[[1920.952380952381, 1048.1904761904761], [216...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801,20181204T100004
2,mooring,"[[424.7349397590361, 748.2168674698794], [490....",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004
3,mooring,"[[1209.0722891566263, 938.5783132530119], [127...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004
4,mooring,"[[1645.2168674698794, 744.6024096385541], [173...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401,20181204T113004
