# Extract Sample Data
This notebook walks through the steps needed to download the one percent data sample, extract the embeddings and metadata from each record, and store them locally as numpy arrays and JSON files, respectively.

### Pre-requisites: 
You will need access to the Google Cloud Storage bucket for the A20 organization (`gs://a20_dropbox`). Contact Tom (`tomdenton@google.com`) for access.

The data sample accessed in this notebook is ~73 GB. Make sure you have adequate network connectivity and free disk space before running the code in this notebook.
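
As a rough pre-flight check, the free disk space can be compared against the sample size before starting the download. This is a minimal sketch, not part of the original notebook: the helper name `has_enough_space` and the constant `REQUIRED_GB` are hypothetical, and the 73 figure is simply the approximate size quoted above.

```python
import shutil

# Approximate size of the sample quoted above, in gigabytes (assumption).
REQUIRED_GB = 73


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if not has_enough_space(".", REQUIRED_GB):
    print(f"Warning: less than {REQUIRED_GB} GB free in the current directory")
```

The extracted embeddings and metadata are written under the same data directory, so some extra headroom beyond the raw download is advisable.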

**Author:** Leo Thomas - leo@developmentseed.org\
**Last updated:** 2023/06/15

In [1]:
import os
import re
import datetime
import json
import numpy as np

# Set this environment variable before importing tensorflow
# to suppress its log messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf

# Useful package for displaying progress bars
from tqdm.notebook import tqdm

In [2]:
DATA_DIR = os.path.abspath("./one_percent_data_sample")

### 1.0. Retrieve data from Google Storage and store locally

In [3]:
ONE_PERCENT_SAMPLE_URI = "gs://a20_dropbox/one_percent_sep_embeddings_filter"

if not os.path.exists(DATA_DIR): 
    print("Data not found locally, downloading from Google Storage...")
    os.mkdir(DATA_DIR)
    os.system(f"gsutil -m cp -r {ONE_PERCENT_SAMPLE_URI} {DATA_DIR}")

sample_data_filenames = [
    os.path.join(DATA_DIR, file) 
    for file in os.listdir(DATA_DIR) 
    if os.path.isfile(os.path.join(DATA_DIR, file))
]
print(f"{len(sample_data_filenames)} files to process found")

1 files to process found
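
Note that `os.system` only returns the exit status of the command, so a failed download (for example, because of missing bucket access) can go unnoticed. A possible alternative, sketched here under the assumption that `gsutil` is installed and on the `PATH`, is to call it through `subprocess.run` with `check=True`. The helper names `build_download_command` and `download_sample` are hypothetical.

```python
import shutil
import subprocess


def build_download_command(uri, dest):
    """Build the gsutil invocation as an argument list (no shell involved)."""
    return ["gsutil", "-m", "cp", "-r", uri, dest]


def download_sample(uri, dest):
    """Run the download, raising if gsutil is missing or exits with an error."""
    if shutil.which("gsutil") is None:
        raise RuntimeError("gsutil was not found on PATH")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code
    subprocess.run(build_download_command(uri, dest), check=True)
```

With the variables defined above, the call would be `download_sample(ONE_PERCENT_SAMPLE_URI, DATA_DIR)`.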


### 2.0. Open dataset using tensorflow.TFRecordDataset

In [4]:
raw_dataset = tf.data.TFRecordDataset(sample_data_filenames[0])

### 2.1. Print single record
Note that each record is stored as a serialized byte string.

In [5]:
for raw_record in raw_dataset.take(1):
    print(raw_record)

tf.Tensor(b'\n\xcc\xe7\x12\n\xf7\x05\n\tnuisances\x12\xe9\x05\n\xe6\x05\n\xe3\x05\x08\x01\x12\x0c\x12\x02\x08\x0c\x12\x02\x08\x05\x12\x02\x08\x03"\xd0\x05e\xc2U?\xdfzG\xc1\xc4\x17\xdd\xbd\x9f\xd5\'@\x94NY\xc1Kp\xc8\xbf]\xa4\x13@\xc8+\xf3\xc0\x11\x9e\x92\xbf\xb6\x04\xc7@\xd5H!\xc1w\x80\x8d\xc0p\x1b\x01@\xe32\xd2\xc0\x90\xc6\xfd\xbf\x16\xa5\x9d?\x08\x9bM\xc1\xa8\xd7S\xbe\x96\xf5\x0e@\xe5\x16W\xc1\x9e\x04i\xbf\xf5\xa71?L~\xf9\xc0h\xf3b>\xb0\x19\xbc@%\x9a"\xc1C(W\xc0\xfc8\xd9=OI\xc8\xc0\x00j\xc1\xbcP&%>=\x9c?\xc1XSM>\xf7\x1e\x0b@\xd3VR\xc1|T\x8f\xbf\n\xde\x00@ym\x0e\xc1\xf08k\xbf\x9a\xe1\xe9@=\x80%\xc1\xb6\xcc\xa8\xc0\x99\x140\xbf\xa7\x07\xc0\xc0L\x0f\xfd=\xc8\x00\xcd?\x1f<F\xc1\xa4\x18W\xbfR\x9c\xff?RIS\xc1t\x1f]\xbf\xc4v\x0f@\x16\xe1\xe8\xc00\x8d\xab\xbf\x1e\xdb\xfb@\x96\xa5*\xc1\xef\x8d\xbc\xc0\x02\x00\x91?\x95\x8e\xd2\xc0l\xde$\xbf^\x98\x99?Y O\xc1\x8f2\x83\xbe\xf4Y%@\xea\xd3a\xc1V\xba\x91\xbf\xd5_\x00@\xf3(\xeb\xc0\n\x7f\x84\xbf\xcdP\xc0@G\xbe\x1f\xc1c\x19]\xc0\xac\xe9\xc1>+"\x9d\xc0\

### 2.2. Open Record using tensorflow.train.Example 
This gives us a clearer representation of the underlying data, which we need in order to write the extraction code. Each record has the following features: 

- **timestamp_s (float):** the number of seconds between the start of the file and the start of the 1 minute segment represented by this record
- **filename (string):** the file name
- **embedding (byte string):** the embeddings for each of the five second clips within the 1 minute segment represented by the record
- **embedding_shape (List[3] == [12, 5, 1280]):** each record represents 12 five second clips, each with 5 audio channels (channel 0 is the raw audio, channels 1-4 are the outputs of the separation model), and each clip and channel pair corresponds to an embedding of dimension 1280
- **nuisances (byte string):** three scores (empty, speech and unknown) for each clip and channel, used to filter out unwanted embeddings
- **nuisances_shape (List[3] == [12, 5, 3]):** same first two dimensions as `embedding_shape`, followed by the 3 nuisance scores


Reference for loading `tfrecords` using the `tf.train.Example` module: https://www.tensorflow.org/tutorials/load_data/tfrecord


In [6]:
for raw_record in raw_dataset.take(1): 
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

features {
  feature {
    key: "timestamp_s"
    value {
      float_list {
        value: 0
      }
    }
  }
  feature {
    key: "nuisances"
    value {
      bytes_list {
        value: "\010\001\022\014\022\002\010\014\022\002\010\005\022\002\010\003\"\320\005e\302U?\337zG\301\304\027\335\275\237\325\'@\224NY\301Kp\310\277]\244\023@\310+\363\300\021\236\222\277\266\004\307@\325H!\301w\200\215\300p\033\001@\3432\322\300\220\306\375\277\026\245\235?\010\233M\301\250\327S\276\226\365\016@\345\026W\301\236\004i\277\365\2471?L~\371\300h\363b>\260\031\274@%\232\"\301C(W\300\3748\331=OI\310\300\000j\301\274P&%>=\234?\301XSM>\367\036\013@\323VR\301|T\217\277\n\336\000@ym\016\301\3608k\277\232\341\351@=\200%\301\266\314\250\300\231\0240\277\247\007\300\300L\017\375=\310\000\315?\037<F\301\244\030W\277R\234\377?RIS\301t\037]\277\304v\017@\026\341\350\3000\215\253\277\036\333\373@\226\245*\301\357\215\274\300\002\000\221?\225\216\322\300l\336$\277^\230\231?Y O\301\2172\203\276\364Y%@\352\32

### 2.3. Build a parser function
This function is mapped over each record in the dataset to decode it into usable values (numpy arrays and scalars once iterated with `as_numpy_iterator`). See the link above for further explanation.

In [7]:
feature_description = {
    'timestamp_s': tf.io.FixedLenFeature([], tf.float32),
    'filename': tf.io.FixedLenFeature([], tf.string),
    'nuisances': tf.io.FixedLenFeature([], tf.string),
    'nuisances_shape': tf.io.FixedLenFeature([3], tf.int64),
    'embedding': tf.io.FixedLenFeature([], tf.string),
    'embedding_shape': tf.io.FixedLenFeature([3], tf.int64),
    
}

# Define a function to parse the TFRecordDataset
def parse_tfrecord(example_proto):
    # Parse the features from the serialized example
    features = tf.io.parse_single_example(example_proto, feature_description)
    
    # extract embedding as 3D array of float32, from byte string 
    embedding = tf.io.parse_tensor(features["embedding"], out_type=tf.float32)
    nuisances = tf.io.parse_tensor(features["nuisances"], out_type=tf.float32)
    
    return features['timestamp_s'], features["filename"], nuisances, features["nuisances_shape"], embedding, features["embedding_shape"] 


In [11]:
for timestamp_s, filename, nuisances, nuisances_shape, embedding, embedding_shape in raw_dataset.take(1).map(parse_tfrecord).as_numpy_iterator():
    print("Filename: ", filename)
    print("Timestamp: ", timestamp_s)
    print("Embedding_shape: ", embedding_shape)
    print("Nuisances_shape: ", nuisances_shape)
    print("Embedding[0]: ", embedding[0])
    print("Nuisances[0]: ", nuisances[0])

Filename:  b'20201126T200000+0800_Great-Western-Woodlands-Wet-A_433266.flac'
Timestamp:  0.0
Embedding_shape:  [  12    5 1280]
Nuisances_shape:  [12  5  3]
Embedding[0]:  [[ 0.06962189  0.20236634  0.05608569 ...  0.01882326 -0.0416308
  -0.00506338]
 [-0.03542615  0.32632345  0.07448269 ...  0.01448132 -0.03817595
   0.00068017]
 [ 0.06230072 -0.14698084  0.00424265 ...  0.05574412  0.05995841
  -0.02540757]
 [ 0.22989355 -0.04468994  0.05393755 ...  0.02126291  0.10322674
   0.01751855]
 [-0.00296113 -0.04795054  0.06619864 ...  0.06652277  0.14661746
  -0.02096196]]
Nuisances[0]:  [[  0.8349975  -12.467498    -0.10795549]
 [  2.6224134  -13.581684    -1.5659269 ]
 [  2.306907    -7.5990944   -1.1454488 ]
 [  6.219325   -10.080281    -4.4219317 ]
 [  2.0172997   -6.5687118   -1.9826221 ]]


Note: the three values in the `nuisances` array correspond to: empty, speech and unknown.
An embedding should be filtered out (discarded) if any of the following holds: 
- speech > -0.25
- empty > 1.5 (only for the raw audio, or 0th channel)
- empty > 0.0 (only for the separated audio, or channels 1-4)
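
The filtering rules can be sketched as a small predicate that takes a channel index and one `[empty, speech, unknown]` triple. The function name `keep_embedding` and the constant names are hypothetical. The thresholds are the ones listed in the note, and the sample values are copied from the `Nuisances[0]` output printed earlier.

```python
# Thresholds taken from the note above.
SPEECH_THRESHOLD = -0.25
EMPTY_THRESHOLD_RAW = 1.5        # channel 0 (raw audio)
EMPTY_THRESHOLD_SEPARATED = 0.0  # channels 1-4 (separated audio)


def keep_embedding(channel_index, nuisance):
    """Return True if the embedding passes both the speech and the empty filter."""
    empty, speech, _unknown = nuisance
    if speech > SPEECH_THRESHOLD:
        return False
    if channel_index == 0:
        empty_threshold = EMPTY_THRESHOLD_RAW
    else:
        empty_threshold = EMPTY_THRESHOLD_SEPARATED
    return empty <= empty_threshold


# Values copied from the `Nuisances[0]` output above
print(keep_embedding(0, [0.8349975, -12.467498, -0.10795549]))  # True
print(keep_embedding(1, [2.6224134, -13.581684, -1.5659269]))   # False
```

The second call is rejected because its empty score (about 2.62) exceeds the 0.0 threshold that applies to separated channels.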

### 3.0. Setup target directories for extracted data

In [9]:
# target directories
METADATA_DIR = os.path.join(DATA_DIR, "metadata")
EMBEDDINGS_DIR = os.path.join(DATA_DIR, "embeddings")

if not os.path.exists(METADATA_DIR):
    os.mkdir(METADATA_DIR)

if not os.path.exists(EMBEDDINGS_DIR):
    os.mkdir(EMBEDDINGS_DIR)

### 3.1. Extract metadata and embedding for each record in each file of the sample dataset
We will use a regular expression to extract the following values from the filename (except `site_id`, which is looked up in `site_name_to_id_mapping.json` using the site and subsite names): 
- **site_id (int):** ID number of the site
- **file_datetime (timestamp):** timestamp of the file's start
- **timezone (int):** Number of hours offset from UTC at the time and place of recording. Note: in some cases the timezone is given only as `Z`, which we handle by mapping to GMT (UTC+0). This is very likely incorrect since all recording devices are located in Australia.
- **site_name (str):** Full text name of the site
- **subsite_name (str == [Wet-A|Wet-B|Dry-A|Dry-B]):** Full text subsite name
- **file_sequence_id (int)**: A unique ID for the file
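
To make the filename fields concrete, here is a standalone sketch that parses the example filename printed earlier. It uses a simplified pattern rather than the exact regular expression in the cell below, and the names `FILENAME_PATTERN` and `parse_filename` are hypothetical.

```python
import datetime
import re

# Simplified pattern (an assumption, not the notebook's exact regex).
FILENAME_PATTERN = re.compile(
    r"(?P<datetime>\d{8}T\d{6})"
    r"(?P<timezone>\+\d{4}|Z)"
    r"_(?P<site_name>[\w-]+)"
    r"-(?P<subsite_name>(?:Wet|Dry)-[AB])"
    r"_(?P<file_seq_id>\d+)\.flac"
)


def parse_filename(filename):
    """Split a recording filename into its datetime, site and sequence fields."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unrecognized filename: {filename}")
    fields = match.groupdict()
    # A bare "Z" is treated as UTC, as described in the list above
    timezone = "+0000" if fields["timezone"] == "Z" else fields["timezone"]
    file_datetime = datetime.datetime.strptime(
        fields["datetime"] + timezone, "%Y%m%dT%H%M%S%z"
    )
    midnight = file_datetime.replace(hour=0, minute=0, second=0, microsecond=0)
    return {
        "file_datetime": file_datetime,
        "seconds_since_midnight": (file_datetime - midnight).seconds,
        "site_name": fields["site_name"],
        "subsite_name": fields["subsite_name"],
        "file_seq_id": int(fields["file_seq_id"]),
    }


parsed = parse_filename("20201126T200000+0800_Great-Western-Woodlands-Wet-A_433266.flac")
print(parsed["site_name"], parsed["subsite_name"], parsed["seconds_since_midnight"])
# Great-Western-Woodlands Wet-A 72000
```

Because the datetime is timezone-aware, "seconds since midnight" refers to local midnight at the recording's stated offset (20:00 local time is 72000 seconds).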

In [14]:
count = 0

# The file below is generated by sanitizing the site and subsite names from the A20 API
# data = requests.get("https://api.acousticobservatory.org/sites?direction=asc&items=500").json()["data"]
# site_mapping = {
#    d["name"].lower().replace("  ", "-").replace(" ", "-").replace("/", "-").replace("(", "").replace(")", "").replace(":", ""):json.loads(d["notes"])["Point ID"] 
#    for d in data 
#    if d["notes"] is not None
#}

with open("site_name_to_id_mapping.json", "r") as f: 
    site_name_to_id = json.loads(f.read())

for dataset_filename in tqdm(sample_data_filenames): 
    embeddings = []
    metadata = []
    raw_dataset = tf.data.TFRecordDataset(dataset_filename)
    for timestamp_s, filename, nuisances, nuisances_shape, embedding, embedding_shape in raw_dataset.map(parse_tfrecord).as_numpy_iterator():
        [(
            #site_id, 
            file_datetime, 
            timezone, 
            site_name, 
            subsite_name, 
            file_seq_id
        )] = re.findall(
            # I'm quite proud of myself for this regex, but if anyone can see 
            # a way to simplify it, please let me know!
            #r"site_(?P<site_id>\d{4})\/(?P<datetime>\d{8}T\d{6})(?P<timezone>(?:\+\d{4})|Z)_(?P<site_name>(?:\w*|-)*)-(?P<subsite_name>(?:Wet|Dry)-(?:A|B))_(?P<file_seq_id>\d*).flac",
            r"(?P<datetime>\d{8}T\d{6})(?P<timezone>(?:\+\d{4})|Z)_(?P<site_name>(?:\w*|-)*)-(?P<subsite_name>(?:Wet|Dry)-(?:A|B))_(?P<file_seq_id>\d*)\.flac",
            filename.decode("utf-8")
        )
        
        # Some files have just "Z" as timezone, assume UTC in this case
        timezone = "+0000" if timezone == "Z" else timezone
        file_datetime = datetime.datetime.strptime(f"{file_datetime}{timezone}", "%Y%m%dT%H%M%S%z")
        midnight = file_datetime.replace(hour=0, minute=0, second=0)
        file_offset_since_midnight = (file_datetime - midnight).seconds
        
        # `embedding` is a 3D array with Dims [12,5,1280]
        # The first dimension, [0:11] is the distinct 5 second windows
        # within a 60 second period.
        # The second dimension [0:4] is the 5 different audio channels
        # (4 separated + 1 combined audio channel)
        # `nuisances` is a 3D array with Dims [12, 5, 3]
        # The first and second dimensions are identical to those of 
        # the embeddings, and the 3 represents 3 different categories
        # of "issues" that may be present in the embedding: empty (no 
        # animal calls), speech (human voices present in recording) and 
        # unknown. The embeddings should be filtered out as follows: 
        # if speech > -0.25
        # if empty > 1.5 (only for the raw audio, or 0th channel)
        # if empty > 0.0 (only for the separated audio, or channels 1-4)
                
        for temporal_index, embedding_channels in enumerate(embedding):
            for channel_index, _embedding in enumerate(embedding_channels): 

                # check if "speech" is greater than -0.25
                if nuisances[temporal_index][channel_index][1] > -0.25: 
                    continue
                if channel_index == 0 and nuisances[temporal_index][channel_index][0] > 1.5: 
                    continue
                if channel_index != 0 and nuisances[temporal_index][channel_index][0] > 0.0: 
                    continue
                
                
                # build the lookup key once and reuse it in the error message
                site_key = f"{site_name.lower().replace(' ', '-')}-{subsite_name.lower().replace(' ', '-')}"
                site_id = site_name_to_id.get(site_key)
                if not site_id: 
                    raise Exception(f"No site id found for site: {site_key}")
                  
                count +=1

                embeddings.append(_embedding)
                metadata.append({
                    "file_timestamp": int(file_datetime.timestamp()),
                    "file_seconds_since_midnight": file_offset_since_midnight,
                    "recording_offset_in_file": int(timestamp_s + (5*temporal_index)), 
                    "channel_index": channel_index,
                    "site_id": site_id,
                    "site_name": site_name, 
                    "subsite_name": subsite_name, 
                    "file_seq_id": int(file_seq_id),
                    "filename": filename.decode("utf-8")
                })
    
    # extract the base filename and drop the extension
    stripped_filename = os.path.basename(dataset_filename).split('.')[0]
    
    # prep filepaths for generated files
    metadata_filepath = os.path.join(METADATA_DIR, f"{stripped_filename}.json")
    numpy_filepath = os.path.join(EMBEDDINGS_DIR, f"{stripped_filename}.npy")
    
    # write metadata to disk
    with open(metadata_filepath, "w") as f: 
        f.write(json.dumps(metadata))
    
    # write embeddings to disk
    np.save(numpy_filepath, embeddings)
            
print(f"Total number of data records: {count}")

  0%|          | 0/1 [00:00<?, ?it/s]

Total number of data records: 27390
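
Once extraction has finished, each `.npy` file should contain exactly one embedding per entry in the matching `.json` file. The following sketch reloads one such pair and checks that the counts agree. The function name `load_extracted` is hypothetical, and it assumes the files were written by the cell above with `np.save` and `json.dumps`.

```python
import json

import numpy as np


def load_extracted(metadata_path, embeddings_path):
    """Load one metadata/embeddings pair and check that they line up."""
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    embeddings = np.load(embeddings_path)
    if len(metadata) != len(embeddings):
        raise ValueError(
            f"{len(metadata)} metadata records but {len(embeddings)} embeddings"
        )
    return metadata, embeddings
```

For a given `stripped_filename`, the two paths would be the `metadata_filepath` and `numpy_filepath` built in the loop above. Row `i` of the returned array corresponds to `metadata[i]`.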
