# Extract Sample Data
This notebook walks through the steps needed to download the one percent data sample, extract the embeddings and metadata from each record, and store them locally as NumPy arrays and JSON files respectively.

### Pre-requisites: 
You will need access to the Google Storage bucket for the A20 organization (`gs://a20_dropbox`). Contact Tom (`tomdenton@google.com`) for access.

The data sample accessed in this notebook is ~73 GB. Make sure you have adequate network connectivity and free disk space before running the code in this notebook.
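
One way to check the disk-space requirement up front is sketched below. It assumes the ~73 GB figure means roughly 73 × 10^9 bytes, and the helper name `has_free_space` is hypothetical (it is not part of the original notebook). Note that the extracted `.npy` and `.json` files written later in this notebook need additional space on top of the download itself.

```python
import shutil

# Assumption: "~73 GB" is interpreted as 73 * 10**9 bytes.
REQUIRED_BYTES = 73 * 10**9


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if not has_free_space(".", REQUIRED_BYTES):
    print("Warning: less than ~73 GB free on this disk; the download may not fit.")
```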

**Author:** Leo Thomas - leo@developmentseed.org\
**Last updated:** 2023/06/15

In [1]:
import os
import re
import datetime
import json
import numpy as np

# Set this environment variable before importing TensorFlow
# to suppress its log warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf

# Useful package for displaying progress bars
from tqdm.notebook import tqdm

In [None]:
DATA_DIR = os.path.abspath("./one_percent_data_sample")

### 1.0. Retrieve data from Google Storage and store locally

In [2]:
ONE_PERCENT_SAMPLE_URI = "gs://a20_dropbox/one_percent_embeddings"

if not os.path.exists(DATA_DIR): 
    print("Data not found locally, downloading from Google Storage...")
    os.mkdir(DATA_DIR)
    os.system(f"gsutil -m cp -r {ONE_PERCENT_SAMPLE_URI} {DATA_DIR}")

sample_data_filenames = [
    os.path.join(DATA_DIR, file) 
    for file in os.listdir(DATA_DIR) 
    if os.path.isfile(os.path.join(DATA_DIR, file))
]
print(f"{len(sample_data_filenames)} files to process found")

374 files to process found


### 2.0. Open dataset using tensorflow.TFRecordDataset

In [3]:
raw_dataset = tf.data.TFRecordDataset(sample_data_filenames[0])

### 2.1. Print single record
Note that each record is stored as a serialized byte string.

In [4]:
for raw_record in raw_dataset.take(1):
    print(raw_record)

tf.Tensor(b'\n\xae\xe1\x03\n\x1b\n\x0fembedding_shape\x12\x08\x1a\x06\n\x04\x0c\x01\x80\n\n\xac\xe0\x03\n\tembedding\x12\x9d\xe0\x03\n\x99\xe0\x03\n\x95\xe0\x03\x08\x01\x12\r\x12\x02\x08\x0c\x12\x02\x08\x01\x12\x03\x08\x80\n"\x80\xe0\x03#\xfc >\xd5\x16T>W\x86\xcf\xbc\xa3\'\xd5\xbc\xc7\xc4\x93\xbd\x19y\x8c\xbc\xf4>"=\x9bI^=\r\xd2\x13=\xb67E=\x1b\xd0\r>#fD\xbdK\x16&\xbd\xed\'\xed\xbc\x8bt\x0e>@\xa6(<+\xde\xc5<\xd06m\xbcL_0\xbd\x05;\xff\xbc\x87\x82\xfc<\xfdF\xa4\xba\x1a\x1d\xd4=\xfdRW>WI\xc9=\x0f@$=\xc7\xffo\xbc\r\x90\xf2\xbcoF\x13=P\xef\xfd;x\xf2\xb6<XGy>#\x80}\xbc\xffk\xc8=\x97\xba[<\xd1\xcb\x1e>\xd1\xff\xb0=*\x1a\xc0\xbc\xd4\xe2\xbe=\xba\x9cy;\x89\xa7\x88\xbd\x9a\xe0\xdf;\xc1\xb3\x86\xbb\xe22s\xbc]\xab\x19\xbd\xd7\x15s\xbdo\x1a\x1b;S\xbca=L\xff\x17<b\xad2\xbd\xf0\x9b\x04\xbeS\xd6\xbf;^_\xa6\xbd\x0b\xb9\xd7=W\xf5>="J\x1e\xbe[R\xc2<Z\x99\x1b\xbd\xfa\x07d<g#%\xbd\x03\xf7\x81\xbd\x15\xcf\x86\xbcC6\x84<\xd09\x1e;\x98\x87{\xbc\n;\xee=@\xec\xce=\xf2\x00\xbb;\x7f-q=\x9eSC\xbd/\x9f\xb7;\xf7\x1f

### 2.2. Open Record using tensorflow.train.Example 
This will give us a clearer representation of the underlying data in order to write out the extraction code. We will find that the records each have the following features: 

- **timestamp_s (float):** the number of seconds between the start of the file and the start of the 1 minute segment represented by this record (stored in a `float_list`)
- **filename (string):** the path of the source audio file, including the site directory
- **embedding (byte string):** the serialized embeddings for each of the five-second clips within the 1 minute segment represented by the record
- **embedding_shape (list of 3 integers, `[12, 1, 1280]`):** each record represents 12 five-second clips, each with a single channel (i.e. the source audio has not been processed through the separation model), and each clip corresponds to an embedding of dimension 1280
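
As a small illustration of this layout, the sketch below computes the start offset of each five-second clip within its source file, using 6480 as an example `timestamp_s` value. The names `clip_offsets`, `CLIPS_PER_RECORD` and `CLIP_SECONDS` are hypothetical and only used for this example.

```python
from typing import List

CLIPS_PER_RECORD = 12  # first dimension of embedding_shape
CLIP_SECONDS = 5       # each clip covers five seconds of audio


def clip_offsets(timestamp_s: float) -> List[int]:
    """Start offset, in seconds from the start of the file, of each clip in a record."""
    return [int(timestamp_s) + CLIP_SECONDS * i for i in range(CLIPS_PER_RECORD)]


offsets = clip_offsets(6480.0)
print(offsets[0], offsets[-1])  # 6480 6535
```

The twelve offsets together cover exactly one minute of audio, from second 6480 up to (but not including) second 6540.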


Reference for loading `tfrecords` using the `tf.train.Example` module: https://www.tensorflow.org/tutorials/load_data/tfrecord


In [5]:
for raw_record in raw_dataset.take(1): 
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

features {
  feature {
    key: "timestamp_s"
    value {
      float_list {
        value: 6480
      }
    }
  }
  feature {
    key: "filename"
    value {
      bytes_list {
        value: "site_0285/20201014T010000+1030_Arkaba-Dry-A_49500.flac"
      }
    }
  }
  feature {
    key: "embedding"
    value {
      bytes_list {
        value: "\010\001\022\r\022\002\010\014\022\002\010\001\022\003\010\200\n\"\200\340\003#\374 >\325\026T>W\206\317\274\243\'\325\274\307\304\223\275\031y\214\274\364>\"=\233I^=\r\322\023=\2667E=\033\320\r>#fD\275K\026&\275\355\'\355\274\213t\016>@\246(<+\336\305<\3206m\274L_0\275\005;\377\274\207\202\374<\375F\244\272\032\035\324=\375RW>WI\311=\017@$=\307\377o\274\r\220\362\274oF\023=P\357\375;x\362\266<XGy>#\200}\274\377k\310=\227\272[<\321\313\036>\321\377\260=*\032\300\274\324\342\276=\272\234y;\211\247\210\275\232\340\337;\301\263\206\273\3422s\274]\253\031\275\327\025s\275o\032\033;S\274a=L\377\027<b\2552\275\360\233\004\276S\326\277;^_\246\275\013\

### 2.3. Build a parser function
This function will be mapped over each record in the dataset to produce native Python data types. See the link above for further explanation.

In [6]:
feature_description = {
    'timestamp_s': tf.io.FixedLenFeature([], tf.float32),
    'filename': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([], tf.string),
    'embedding_shape': tf.io.FixedLenFeature([3], tf.int64)
}

# Define a function to parse the TFRecordDataset
def parse_tfrecord(example_proto):
    # Parse the features from the serialized example
    features = tf.io.parse_single_example(example_proto, feature_description)
    
    # extract embedding as 3D array of float32, from byte string 
    embedding = tf.io.parse_tensor(features["embedding"], out_type=tf.float32)
    
    return features['timestamp_s'], features["filename"], embedding, features["embedding_shape"]


In [7]:
for timestamp_s, filename, embedding, embedding_shape in raw_dataset.take(1).map(parse_tfrecord).as_numpy_iterator():
    print("Filename: ", filename)
    print("Timestamp: ", timestamp_s)
    print("Embedding_shape: ", embedding_shape)
    print("Embedding[0]: ", embedding[0])

Filename:  b'site_0285/20201014T010000+1030_Arkaba-Dry-A_49500.flac'
Timestamp:  6480.0
Embedding_shape:  [  12    1 1280]
Embedding[0]:  [[ 0.15721183  0.20711835 -0.02533261 ...  0.01862661  0.15726553
  -0.00364249]]
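
These values can be cross-checked against the raw bytes printed in section 2.1, using only the standard library. The sketch below decodes the length prefix `\x80\xe0\x03` that precedes the float data and unpacks the first eight payload bytes. It assumes the payload is little-endian float32, which is consistent with the output above. `decode_varint` is a hypothetical helper written for this example, not a TensorFlow function.

```python
import struct


def decode_varint(data: bytes) -> int:
    """Decode a single protobuf base-128 varint from the start of `data`."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:  # a clear high bit marks the last byte
            return result
    raise ValueError("truncated varint")


# Length prefix copied from the raw record output in section 2.1
payload_len = decode_varint(b"\x80\xe0\x03")

# 12 clips x 1 channel x 1280 values x 4 bytes per float32
expected_len = 12 * 1 * 1280 * 4

# First eight payload bytes from the same output, read as two float32 values
first_two = struct.unpack("<2f", b"#\xfc >\xd5\x16T>")

print(payload_len, expected_len)  # 61440 61440
print(first_two)                  # approximately (0.1572, 0.2071)
```

The decoded length matches the size implied by `embedding_shape`, and the two unpacked floats agree with the first two entries of `Embedding[0]`.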


### 3.0. Setup target directories for extracted data

In [8]:
# target directories
METADATA_DIR = os.path.join(DATA_DIR, "metadata")
EMBEDDINGS_DIR = os.path.join(DATA_DIR, "embeddings")

# create the target directories if they do not exist yet
os.makedirs(METADATA_DIR, exist_ok=True)
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

### 3.1. Extract metadata and embedding for each record in each file of the sample dataset
We will use a regular expression to extract the following values from the filename: 
- **site_id (int):** ID number of the site
- **file_datetime (timestamp):** timestamp of the file's start
- **timezone (str):** UTC offset at the time and place of recording, in `+HHMM` form. Note: some filenames carry `Z` instead of a numeric offset, which we handle by mapping to GMT (UTC+0). This is very likely incorrect since all recording devices are located in Australia.
- **site_name (str):** Full text name of the site
- **subsite_name (str == [Wet-A|Wet-B|Dry-A|Dry-B]):** Full text subsite name
- **file_seq_id (int)**: A unique ID for the file
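
The fields listed above can be illustrated on the filename of the first record. The sketch below is a standalone, simplified variant of the pattern (an assumption, not the exact regex used in the extraction cell), and `parse_filename` and `FILENAME_PATTERN` are hypothetical names.

```python
import datetime
import re

# Assumption: a simplified variant of the pattern, not the notebook's exact regex.
FILENAME_PATTERN = re.compile(
    r"site_(?P<site_id>\d{4})/"
    r"(?P<file_datetime>\d{8}T\d{6})(?P<timezone>\+\d{4}|Z)_"
    r"(?P<site_name>[\w-]+?)-(?P<subsite_name>(?:Wet|Dry)-[AB])_"
    r"(?P<file_seq_id>\d+)\.flac"
)


def parse_filename(filename: str) -> dict:
    """Split a sample filename into its metadata fields."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    fields = match.groupdict()
    # Filenames with "Z" as timezone are treated as UTC
    timezone = "+0000" if fields["timezone"] == "Z" else fields["timezone"]
    fields["timezone"] = timezone
    fields["file_datetime"] = datetime.datetime.strptime(
        fields["file_datetime"] + timezone, "%Y%m%dT%H%M%S%z"
    )
    return fields


parsed = parse_filename("site_0285/20201014T010000+1030_Arkaba-Dry-A_49500.flac")
print(parsed["site_name"], parsed["subsite_name"], parsed["file_datetime"].isoformat())
# Arkaba Dry-A 2020-10-14T01:00:00+10:30
```

Using `fullmatch` with a guard makes an unexpected filename fail with a clear error instead of an unpacking error further down.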

In [9]:
count = 0
for dataset_filename in tqdm(sample_data_filenames): 
    embeddings = []
    metadata = []
    raw_dataset = tf.data.TFRecordDataset(dataset_filename)
    for timestamp_s, filename, embedding, embedding_shape in raw_dataset.map(parse_tfrecord).as_numpy_iterator():
        [(
            site_id, 
            file_datetime, 
            timezone, 
            site_name, 
            subsite_name, 
            file_seq_id
        )] = re.findall(
            # I'm quite proud of myself for this regex, but if anyone can see 
            # a way to simplify it, please let me know!
            r"site_(?P<site_id>\d{4})\/(?P<datetime>\d{8}T\d{6})(?P<timezone>(?:\+\d{4})|Z)_(?P<site_name>(?:\w*|-)*)-(?P<subsite_name>(?:Wet|Dry)-(?:A|B))_(?P<file_seq_id>\d*)\.flac",
            filename.decode("utf-8")
        )
        
        # Some files have just "Z" as timezone, assume UTC in this case
        timezone = "+0000" if timezone == "Z" else timezone
        file_datetime = datetime.datetime.strptime(f"{file_datetime}{timezone}", "%Y%m%dT%H%M%S%z")
        midnight = file_datetime.replace(hour=0, minute=0, second=0)
        file_offset_since_midnight = (file_datetime - midnight).seconds
        
        # `embedding` is a 3D array with dims [12, 1, 1280].
        # We loop over the first dimension to "flatten" 
        # the 12 embeddings per minute
        # and extract the single channel (2nd dimension). 
        # We add each of the 12 embeddings as its own record
        for i, _embedding in enumerate(embedding[:,0]):
            
            count +=1
            
            embeddings.append(_embedding)
            metadata.append({
                "file_timestamp": int(file_datetime.timestamp()),
                "file_seconds_since_midnight": file_offset_since_midnight,
                "recording_offset_in_file": int(timestamp_s + (5*i)), 
                "site_id": site_id, 
                "site_name": site_name, 
                "subsite_name": subsite_name, 
                "file_seq_id": int(file_seq_id),
                "filename": filename.decode("utf-8")
            })
    
    # extract the base filename and remove its extension
    stripped_filename = os.path.basename(dataset_filename).split('.')[0]
    
    # prep filepaths for generated files
    metadata_filepath = os.path.join(METADATA_DIR, f"{stripped_filename}.json")
    numpy_filepath = os.path.join(EMBEDDINGS_DIR, f"{stripped_filename}.npy")
    
    # write metadata to disk
    with open(metadata_filepath, "w") as f: 
        f.write(json.dumps(metadata))
    
    # write embeddings to disk
    np.save(numpy_filepath, embeddings)
            
print(f"Total number of data records: {count}")

  0%|          | 0/374 [00:00<?, ?it/s]

Total number of data records: 14412192
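
As a rough sanity check on this total, the arithmetic below relates it to the record layout described in section 2.2. It is a back-of-the-envelope sketch: the byte count covers only the raw float32 values and ignores any serialization overhead.

```python
TOTAL_EMBEDDINGS = 14_412_192  # total printed by the extraction loop
CLIPS_PER_RECORD = 12          # embeddings per one-minute record
CLIP_SECONDS = 5               # seconds of audio per embedding
EMBEDDING_DIM = 1280           # values per embedding
BYTES_PER_FLOAT32 = 4

records = TOTAL_EMBEDDINGS // CLIPS_PER_RECORD
audio_hours = TOTAL_EMBEDDINGS * CLIP_SECONDS / 3600
raw_bytes = TOTAL_EMBEDDINGS * EMBEDDING_DIM * BYTES_PER_FLOAT32

print(f"{records:,} one-minute records")            # 1,201,016 one-minute records
print(f"{audio_hours:,.0f} hours of audio")         # 20,017 hours of audio
print(f"{raw_bytes / 1e9:.1f} GB of raw float32 values")  # 73.8 GB of raw float32 values
```

The raw float32 size is roughly in line with the ~73 GB download size quoted at the top of the notebook.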
