## Reformat Bird Soundscapes & Training Data to Anqa

Converts a dog's breakfast of sources, annotation programs and methodologies into *Anqa*: a cleaned, unified, ML-friendly tabular data format with time-frequency bounding boxes and a separate file for per-recording metadata.

### Motivation & Summary

* The format should allow quality soundscapes to be built with retention of metadata, and capability for continuous reviewing & improvement.  

* By making the format compatible with the model outputs, human reviewed field data can easily contribute to improved future datasets.

* By working with randomly selected soundscapes rather than model detections we avoid creating models that get increasingly biased by only learning what they already know.  

<img src="../images/anqa_diagram.png" width="900">


The format combined with the methods in WildPyTools should support all the following:

* Create a dataset of soundscapes with a specified maximum length, in `.flac` audio format
* Keep track of any metadata available, with one row per soundscape
* T-F annotations for future processing, training pipelines, or human-in-loop continuous improvement
* Optional conversion to Raven `.selections.txt` files for visualisation or editing
* Methods for direct reviewing and editing in a Jupyter notebook
* Models for detection of unlabelled animal sounds, or human-voice
* Convert weakly/sparsely labelled soundscapes into densely labelled soundscapes of fixed length
* Merge labels and metadata from multiple Anqa datasets into unified parent tables (with updated file paths)

### Methods

**Source Data (Labels plus corresponding .wav, .ogg or .flac files)**
* A previously compiled Anqa dataset
* Raven `.selections.txt` tab-separated tables
* A BirdCLEF - Zenodo style soundscape `.csv` tidytable with one bounding box per row, as per [here](https://zenodo.org/records/7018484)
* An existing BirdCLEF-style metadata dataframe with data such as licenses, lat, long, reviewed-by, reviewed-on, etc.
* AviaNZ (New Zealand) `.data` labels 
* Freebird (New Zealand) `.tag` xml labels

If starting a new dataset from scratch the annotation process should do the following:

* Create the metadata in `.csv`, one row per sound recording, with the column names used here.
* Annotation should use Raven with eBird and iNaturalist codes.
* Mark bounding boxes (frequency and time bounds) on every bird-call.
* Unidentified calls should still be boxed but labelled 'unknown'.
* One minute of annotations per recording is optimal (randomly sampled from a longer recording if need be).
* Include a 'Sex' column: m, f, or left empty.
* Include a 'Rating' column for label decision confidence, on a scale of 0 to 1.
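
As a rough illustration of the annotation rules above, the following sketch checks a single annotation row before it enters a dataset. The row keys (`begin_s`, `end_s`, `low_hz`, `high_hz`, `label`, `sex`, `rating`) and the function name are hypothetical, chosen only for this example; they are not the WildPyTools schema.

```python
ALLOWED_SEX = {"", "m", "f"}


def check_annotation(row: dict) -> list:
    """Return a list of problems found in one annotation row (empty if it looks fine)."""
    problems = []
    if not row["begin_s"] < row["end_s"]:
        problems.append("time bounds are empty or reversed")
    if not row["low_hz"] < row["high_hz"]:
        problems.append("frequency bounds are empty or reversed")
    if not row["label"]:
        problems.append("missing label (use 'unknown' for unidentified calls)")
    if row.get("sex", "") not in ALLOWED_SEX:
        problems.append("sex must be 'm', 'f' or empty")
    rating = row.get("rating")
    if rating is not None and not 0.0 <= rating <= 1.0:
        problems.append("rating must lie between 0 and 1")
    return problems


# Hypothetical rows: one valid, one with three deliberate mistakes
good = {"begin_s": 3.8, "end_s": 4.0, "low_hz": 2272.0, "high_hz": 6515.0,
        "label": "unknown", "sex": "", "rating": 0.6}
bad = dict(good, end_s=3.0, sex="x", rating=1.5)

print(check_annotation(good))
print(check_annotation(bad))
```

A check like this could be run over a whole selection table before conversion, so that mistakes are caught while the annotator still remembers the recording.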

**Intermediate formatting step (Automated)**  

Parses the source files, and creates two dataframes:

1. Metadata, from the existing one if it exists, otherwise a new (mostly empty) one is created
2. Labels, read only from their original 'source' format, for example `.selections.txt` from Raven.

**Propose missing labels (Optional)**   

* Delete any annotations previously labelled 'h_detect' or 'a_detect'
* Run every sample through a voice detection model and label any human voice as 'h_detect'
* Run every sample through an animal sound detection model and label any detections as 'a_detect'
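
The three steps above could be sketched as follows, using plain lists of dicts in place of dataframes. The detector callables here are stand-ins that return fixed time spans; they are assumptions for illustration only and do not represent the models used in this notebook.

```python
AUTO_LABELS = ("h_detect", "a_detect")


def propose_missing_labels(labels, detectors):
    """Drop stale auto-labels, then add fresh proposals from each detector.

    labels:    list of dict rows with 'Filename', 'Start Time (s)', 'End Time (s)', 'Label'
    detectors: mapping of auto-label name -> callable(filename) -> list of (start_s, end_s)
    """
    # Step 1: delete anything a previous run labelled automatically
    kept = [row for row in labels if row["Label"] not in AUTO_LABELS]
    filenames = sorted({row["Filename"] for row in labels})

    # Steps 2 and 3: run each detector over every file and label what it finds
    proposed = []
    for label, detect in detectors.items():
        for filename in filenames:
            for start_s, end_s in detect(filename):
                proposed.append({"Filename": filename,
                                 "Start Time (s)": start_s,
                                 "End Time (s)": end_s,
                                 "Label": label})
    return kept + proposed


labels = [
    {"Filename": "a.flac", "Start Time (s)": 1.0, "End Time (s)": 2.0, "Label": "gcrfin"},
    {"Filename": "a.flac", "Start Time (s)": 5.0, "End Time (s)": 9.0, "Label": "h_detect"},
]
# Stand-in detectors (hypothetical): fixed outputs instead of real models
detectors = {"h_detect": lambda filename: [(5.5, 8.0)],
             "a_detect": lambda filename: []}

updated = propose_missing_labels(labels, detectors)
print([row["Label"] for row in updated])
```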

**Crop or chunk longer recordings to a fixed maximum length (Optional)**  

Restrict each file to a maximum length (default 60 seconds), choosing the window that contains the most identified labels.  If the recording has no labels, a random window is chosen.  The aim is to maximise the value of the labeller's time: 15 different one-minute files offer far more diversity than a single 15-minute file with the same birds in the same background noise.
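
One possible way to pick such a window is sketched below. It is not the `ToAnqa` implementation; the function name and the rule "count only boxes that fit entirely inside the window" are assumptions made for this example.

```python
import random


def choose_crop_start(boxes, duration, max_seconds=60.0, rng=None):
    """Return the start time (s) of the window holding the most whole boxes.

    boxes:    list of (start_s, end_s) tuples for one recording
    duration: recording length in seconds
    Falls back to a random start when the recording has no boxes.
    """
    rng = rng or random.Random()
    if duration <= max_seconds:
        return 0.0
    latest_start = duration - max_seconds
    if not boxes:
        return rng.uniform(0.0, latest_start)

    # An optimal window can always be slid until it starts at a box start
    # or ends at a box end, so only those positions need checking.
    candidates = {0.0, latest_start}
    for start_s, end_s in boxes:
        candidates.add(min(max(start_s, 0.0), latest_start))
        candidates.add(min(max(end_s - max_seconds, 0.0), latest_start))

    def boxes_inside(window_start):
        window_end = window_start + max_seconds
        return sum(1 for s, e in boxes if s >= window_start and e <= window_end)

    # max() keeps the first best candidate, i.e. the earliest such window
    return max(sorted(candidates), key=boxes_inside)


boxes = [(10.0, 12.0), (400.0, 402.0), (410.0, 415.0), (430.0, 431.0)]
start = choose_crop_start(boxes, duration=900.0)
print(start)
```

In this example three of the four boxes sit within a minute of each other, so the chosen window covers that cluster and leaves out the isolated call near the beginning.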

**Crop & randomly assemble sparsely labelled soundscapes**  

Turns a recording with only a few labels (and many missing ones) into something closer to a strongly labelled soundscape, by randomly sampling with replacement from the short crops within it and giving each crop a randomly sized buffer on either side.  This produces a fixed-length 'synthetic' soundscape where none was available before.
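
A minimal sketch of the sampling plan is given below, working on time spans only (no audio is read or written). The function name, the returned tuple layout and the trimming of the final crop are assumptions for illustration, not the WildPyTools behaviour.

```python
import random


def plan_synthetic_soundscape(boxes, source_len, target_len=60.0,
                              max_buffer=1.0, seed=None):
    """Plan which crops to concatenate into a fixed-length soundscape.

    boxes: list of (start_s, end_s) labelled spans in the source recording
    Returns a list of (src_start_s, src_end_s, dest_start_s) tuples.
    """
    boxes = [(s, e) for s, e in boxes if 0.0 <= s < e <= source_len]
    if not boxes:
        raise ValueError("need at least one non-empty labelled span")

    rng = random.Random(seed)
    plan, cursor = [], 0.0
    while target_len - cursor > 1e-9:
        start_s, end_s = rng.choice(boxes)  # sampling with replacement
        # Randomly sized buffer on either side, clipped to the recording
        src_start = max(0.0, start_s - rng.uniform(0.0, max_buffer))
        src_end = min(source_len, end_s + rng.uniform(0.0, max_buffer))
        # Trim the final crop so the total never exceeds target_len
        length = min(src_end - src_start, target_len - cursor)
        plan.append((src_start, src_start + length, cursor))
        cursor += length
    return plan


plan = plan_synthetic_soundscape([(5.0, 6.5), (20.0, 20.8)],
                                 source_len=300.0, target_len=60.0, seed=0)
total = sum(src_end - src_start for src_start, src_end, _ in plan)
print(len(plan), round(total, 3))
```

Each tuple says which slice of the source to copy and where it lands in the output, so the matching bounding boxes can be shifted by `dest_start_s - src_start_s` when the labels are rewritten.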

**Save in Anqa format**  

* (Optional) Re-save the recordings in `.flac` format
* Save the annotations 
* Save the metadata

**Human-in-loop editing**
* Create an updated raven dataset with `.selections.txt` files, plus `annotations` and `metadata`
* Externally review the `.selections.txt` files in Raven, then re-run the process above.
* Precedence will be given to the existing human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.

**Re-Running an existing Anqa Dataset**
* Change the chunk lengths, or take advantage of improved detection/cleaning.
* 'Densify' by removing long periods without labels, randomly selecting and concatenating the remainder
* Precedence given to human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.
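
The precedence rule could be interpreted as "an automated proposal is dropped whenever it overlaps a human label in the same file". That interpretation is an assumption; the sketch below shows it with plain dict rows and hypothetical function names.

```python
def overlaps_in_time(a, b):
    """True if two rows refer to the same file and their time spans overlap."""
    return (a["Filename"] == b["Filename"]
            and a["Start Time (s)"] < b["End Time (s)"]
            and b["Start Time (s)"] < a["End Time (s)"])


def apply_precedence(human_rows, auto_rows):
    """Keep every human label, plus only those auto-labels that clash with none of them."""
    kept_auto = [auto for auto in auto_rows
                 if not any(overlaps_in_time(auto, human) for human in human_rows)]
    return human_rows + kept_auto


human_rows = [{"Filename": "rec.flac", "Start Time (s)": 3.8,
               "End Time (s)": 4.0, "Label": "gcrfin"}]
auto_rows = [
    {"Filename": "rec.flac", "Start Time (s)": 3.7, "End Time (s)": 4.1, "Label": "a_detect"},
    {"Filename": "rec.flac", "Start Time (s)": 10.0, "End Time (s)": 10.5, "Label": "a_detect"},
]

resolved = apply_precedence(human_rows, auto_rows)
print([(row["Label"], row["Start Time (s)"]) for row in resolved])
```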

### Notebook Configuration
(Adjust as needed)

In [1]:
from pathlib import Path
project_root = str(Path().resolve().parent)

In [2]:
#Note: 'anqa_destn' must differ from source directory if saving new audio files
source = 'birdclef_samples'  # Any key of data_sources: anqa_samples, raven_samples, avianz_samples, freebird_samples, birdclef_samples, bc_zenodo_samples
notebook_use = {'project_root': project_root,
                'source': source,
                'anqa_destn': project_root + f'/data/sample_label_formats/small_anqa_datasets/anqa_from_{source}',
                'merger_folder': project_root +  '/data/sample_label_formats/small_anqa_datasets',
                'debug': True,
                'max_seconds': 60,  #Segment and/or crop any data type to this maximum length
                'max_buffer_seconds': 1, #If cropping around bounding boxes a buffer up to this value will be added
                'crop_method': 'bbox', #'keep_all', #any of 'bbox', 'keep_all', 'random',  in future: 'max_annotations', 'max_detections'
                'padding': None, # Any of 'pad', 'noise', 'zero', None
                'save_new_flac_files': True, # If true the original source will be re-written into .flac files of length max_seconds
                'tabular_save_suffix': 'parquet' #csv
                }

### Imports

In [3]:
import os, sys
from dotenv import load_dotenv
import random
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from IPython.display import display
import ipywidgets as widgets
import importlib
button = widgets.Button(description="Continue")
output = widgets.Output()

#%matplotlib widget  
torch.set_num_threads(1)

In [4]:
load_dotenv(dotenv_path = project_root + "/.env")

for p in os.getenv("PYTHONPATH", "").split(os.pathsep):
    if p:
        print(f' Adding Python path: {p}')
        sys.path.append(str(Path(p).resolve()))

 Adding Python path: ../../wildpytools/src
 Adding Python path: ./src


In [5]:
#Only doing like this for convenience whilst I work on WildPyTools
import wildpytools
import wildpytools.audio as wa
importlib.reload(wa)
importlib.reload(wildpytools.io)

SourceDataLoader = wa.SourceDataLoader
ToAnqa = wa.ToAnqa
anqa_to_raven_selections = wa.anqa_to_raven_selections
save_dataframe = wildpytools.io.save_dataframe
load_dataframe = wildpytools.io.load_dataframe
merge_anqua_data = wa.merge_anqa_data

### Data Paths & Values (Simulating all possible options)

In [6]:
# These lines will over-write any existing values from the csv metadata file if it exists.  
# Comment out any items you want to leave unchanged.  
update_metadata_dict = {'collection': 'birdclef_22',
                        'secondary_labels': "['sidekick_bird']",
                        'url' : 'some_web_address',
                        'latitude': 37.0,
                        'longitude': -118.5,
                        'author' : 'Mr Bigglesworth',
                        'licence': 'some_licence',  # NOTE: the metadata column itself is spelled 'license'
                        #'recorded_on': '2025-12-23 09:15:30',
                        'reviewed_on' : '2025-12-23 09:15:30',
                        'reviewed_by': 'Joe Blogs',
                        'models_used' : "['model_1']"
                        }

anqa_samples = { 'source_type': 'anqa',
              'audio_files': project_root + '/data/sample_label_formats/small_anqa_datasets/anqa_from_freebird/audio',
              'metadata_path': project_root + '/data/sample_label_formats/small_anqa_datasets/anqa_from_freebird/metadata.csv',
              'labels_path': project_root + '/data/sample_label_formats/small_anqa_datasets/anqa_from_freebird/annotations.csv',
              'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
              'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
             }

raven_samples = {'source_type': 'raven',
                 'audio_files': project_root + '/data/sample_label_formats/raven_kokako_olly',
                 'naming_csv': project_root + '/data/experiments/exp_2/exp_2_deploy/exp_2_bird_map.csv',
                 'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                 'metadata_path' : None, #project_root + '/data/sample_label_formats/raven_kokako_olly/metadata.csv'
                 'metadata_dict': update_metadata_dict,  # replace with None if not updating
                }

avianz_samples = {'source_type': 'avianz',
               'audio_files': project_root + '/data/sample_label_formats/avianz_janzoon_brian',
               'naming_csv': project_root + '/data/bird_names/avianz_to_ebird.csv',
               'naming_cols' : {'source_name': 'AviaNZ','save_name': 'eBird'},
              }

freebird_samples = {'source_type': 'freebird',
                 'audio_files': project_root + '/data/sample_label_formats/freebird_tier_1',
                 'naming_csv': project_root + '/data/sample_label_formats/freebird_species_list_05112013.csv',
                 'naming_cols' : {'source_name': 'Code', 'save_name': 'eBird'},
                 'metadata_dict': update_metadata_dict
}

birdclef_samples = {
                'source_type': 'birdclef',
                'audio_files': project_root + '/data/sample_label_formats/bc25_audio_files',
                'metadata_path': project_root + '/data/sample_label_formats/birdclef/metadata/train.csv',
                'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                'metadata_dict': update_metadata_dict,  # replace with None if not updating
                }

#Post BirdCLEF annotated soundscapes on zenodo.org:   https://zenodo.org/search?q=birdclef&l=list&p=1&s=10&sort=bestmatch
bc_zenodo_samples = {'source_type': 'bc-zenodo',
                  'audio_files': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/audio',
                  'metadata_dict': update_metadata_dict,
                  'labels_path': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/annotations.csv',
                  'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                  'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
}

data_sources = {'anqa_samples' : anqa_samples,
                'raven_samples' : raven_samples,
                'avianz_samples' : avianz_samples,
                'birdclef_samples' : birdclef_samples,
                'bc_zenodo_samples' : bc_zenodo_samples,
                'freebird_samples' : freebird_samples
                }

data_source = data_sources[notebook_use['source']]
allowed_loader_kwargs = ['metadata_path', 'metadata_dict', 'labels_path']
loader_kwargs = {
    key: val
    for key, val in data_source.items()
    if key in allowed_loader_kwargs and val is not None
}

In [7]:
class Config():
    def __init__(self, options: dict, data_sources: dict):
        self.source_data_type = data_sources[options['source']].get('source_type')
        self.debug = options['debug']
        self.max_seconds = options['max_seconds']
        self.crop_method = options['crop_method']
        self.padding = options['padding']
        self.save_new_flac_files = options['save_new_flac_files']
        self.save_raven_selections = False  # User must manually change at end of the notebook if wanted
        self.source_name = self.save_name = None
        _naming_cols = data_sources[options['source']].get('naming_cols')
        if _naming_cols is not None:
            self.source_name = _naming_cols.get('source_name')
            self.save_name = _naming_cols.get('save_name')


class FilePaths():
    def __init__(self, notebook_use: dict, data_sources: dict):
        _source = notebook_use['source']
        _tabular_save_suffix = notebook_use['tabular_save_suffix']
        _source_cfg = data_sources[_source]
        _metadata = _source_cfg.get('metadata_path')
        _naming = _source_cfg.get('naming_csv')
        _annotations = _source_cfg.get('labels_path')
        
        self.audio_files = Path(data_sources[_source]['audio_files'])
        self.metadata = Path(_metadata) if _metadata is not None else None
        self.naming_csv = Path(_naming) if _naming is not None else None
        self.annotation_src = Path(_annotations) if _annotations is not None else None

        # Output directory (ensure it exists)
        self.anqa_destn = Path(notebook_use.get('anqa_destn'))
        self.anqa_destn.mkdir(parents=True, exist_ok=True)
        self.annotations_destn = self.anqa_destn / f'annotations.{_tabular_save_suffix}'
        self.metadata_destn = self.anqa_destn / f'metadata.{_tabular_save_suffix}'
        self.multi_anqa_fldr = notebook_use['merger_folder']

config = Config(notebook_use, data_sources)
paths = FilePaths(notebook_use, data_sources)

In [8]:
paths.metadata

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/birdclef/metadata/train.csv')

In [9]:
if paths.metadata:
    testing = pd.read_csv(paths.metadata)
    display(testing.head(3))

Unnamed: 0,primary_label,secondary_labels,type,filename,collection,rating,url,latitude,longitude,scientific_name,common_name,author,license
0,12345,[''],['silly antics'],fake_folder/not_here.ogg,santa,0.0,,,,,,Olly,
1,1139490,[''],[''],1139490/CSA36385.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0
2,1139490,[''],[''],1139490/CSA36389.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0


In [10]:
if paths.annotation_src:
    testing = pd.read_csv(paths.annotation_src)
    display(testing.head(3))

In [11]:
class VoiceDetector():
    """Flags human speech in a waveform using the Silero VAD model loaded via torch.hub."""
    def __init__(self, chunk_len, threshold=0.1, no_voice=0, voice=20):
        model, (get_speech_timestamps, _, read_audio, _, _) = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                                                 model='silero_vad', verbose=False)
        self.model = model
        self.chunk_len = chunk_len
        self.threshold = threshold
        self.no_voice = no_voice  # Value written where no speech is detected
        self.voice = voice        # Value written where speech is detected
        self.get_stamps = get_speech_timestamps

    def detect(self, np_wav):
        speech_timestamps = self.get_stamps(torch.Tensor(np_wav), self.model, threshold=self.threshold)
        voice_detect = np.full(len(np_wav), self.no_voice, dtype=float)
        for st in speech_timestamps:
            voice_detect[st['start']: st['end']] = self.voice

        # Downsample to match the power plot axis: pad to a whole number of chunks, then reduce each chunk
        pad_len = (-len(voice_detect)) % self.chunk_len
        voice_detect = np.pad(voice_detect, (0, pad_len), constant_values=self.no_voice)
        voice_detect = voice_detect.reshape((-1, self.chunk_len)).max(axis=1)  # Use max to preserve speech detection
        return voice_detect
    
def calc_signal_pwr(wav, chunk_len, sr=32000):
    power = wav ** 2 
    power = np.pad(power, (0, int(np.ceil(len(power) / chunk_len) * chunk_len - len(power))))
    power = power.reshape((-1, chunk_len)).sum(axis=1)
    return power


def detect_voice(labels: pd.DataFrame,
                 metadata: pd.DataFrame,
                 threshold: float = 0.5,
                 ):
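    """Placeholder: run a voice detection model over each recording to locate any human voice.

    Currently returns its inputs unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one bounding box per row.
        metadata (pd.DataFrame): Metadata, one row per recording.
        threshold (float): Detection threshold for the voice model.
    """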
    return labels, metadata


def detect_animal_sound(labels: pd.DataFrame,
                        metadata: pd.DataFrame,
                        ):

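    """Placeholder: run an object detection model over spectrograms to propose any missing labels for bird or animal sounds.

    Currently returns its inputs unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one bounding box per row.
        metadata (pd.DataFrame): Metadata, one row per recording.
    """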
    return labels, metadata

## Load source data

In [12]:
if paths.metadata is not None:
    meta_df = pd.read_csv(paths.metadata)
    display(meta_df.head(3))
    display(f'The metadata dataframe has length {len(meta_df)}')

Unnamed: 0,primary_label,secondary_labels,type,filename,collection,rating,url,latitude,longitude,scientific_name,common_name,author,license
0,12345,[''],['silly antics'],fake_folder/not_here.ogg,santa,0.0,,,,,,Olly,
1,1139490,[''],[''],1139490/CSA36385.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0
2,1139490,[''],[''],1139490/CSA36389.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0


'The metadata dataframe has length 28565'

Set up the mapping between the label names in the source data and those used in the new dataset

In [13]:
if paths.naming_csv is not None:
    source_names = pd.read_csv(paths.naming_csv)
    source_col = config.source_name
    save_col = config.save_name
    save_name_map = dict(zip(source_names.loc[:, source_col], source_names.loc[:, save_col]))
    display(source_names.head(3))

Unnamed: 0,eBird,CommonName,ScientificName,Class
0,1139490,Ragoniella pulchella,Ragoniella pulchella,insecta
1,1192948,Oxyprora surinamensis,Oxyprora surinamensis,insecta
2,1194042,Copiphora colombiae,Copiphora colombiae,insecta
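
When the source and save columns differ (for example AviaNZ names mapped to eBird codes), the same `dict(zip(...))` idea can be sketched without pandas. The CSV contents below are invented placeholders, and falling back to 'unknown' for unmapped names is an assumption rather than documented `SourceDataLoader` behaviour.

```python
import csv
import io

# Placeholder naming table: both the names and the codes are made up
naming_csv = io.StringIO(
    "AviaNZ,eBird\n"
    "Source Name A,code_a\n"
    "Source Name B,code_b\n"
)

source_col, save_col = "AviaNZ", "eBird"
rows = list(csv.DictReader(naming_csv))
name_map = {row[source_col]: row[save_col] for row in rows}


def rename(label, name_map, default="unknown"):
    """Translate a source label to its saved name (assumed fallback: 'unknown')."""
    return name_map.get(label, default)


print(name_map)
print(rename("Source Name A", name_map), rename("Not In Table", name_map))
```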


In [14]:
paths.audio_files

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/bc25_audio_files')

In [15]:
dataset_loader = SourceDataLoader(source_type = config.source_data_type,
                                  rename_map = save_name_map,
                                  )
print(f'Loading source data of the type {config.source_data_type}')
print(loader_kwargs)

df_label, df_meta = dataset_loader.load_data(paths.audio_files, **loader_kwargs)



Loading source data of the type birdclef
{'metadata_path': '/home/olly/Desktop/Kaytoo/data/sample_label_formats/birdclef/metadata/train.csv', 'metadata_dict': {'collection': 'birdclef_22', 'secondary_labels': "['sidekick_bird']", 'url': 'some_web_address', 'latitude': 37.0, 'longitude': -118.5, 'author': 'Mr Bigglesworth', 'licence': 'some_licence', 'reviewed_on': '2025-12-23 09:15:30', 'reviewed_by': 'Joe Blogs', 'models_used': "['model_1']"}}
These are the label cols: ['Filename', 'Start Time (s)', 'End Time (s)', 'Low Freq (Hz)', 'High Freq (Hz)', 'Label', 'Type', 'Sex', 'Score', 'Delta Time (s)', 'Delta Freq (Hz)', 'Avg Power Density (dB FS/Hz)', 'Filepath']
These are the metadata cols: ['filename', 'collection', 'secondary_labels', 'url', 'latitude', 'longitude', 'author', 'license', 'recorded_on', 'reviewed_by', 'reviewed_on', 'source_filename', 'source_start_s', 'source_end_s', 'models_used']
This is the rename map: {'1139490': '1139490', '1192948': '1192948', '1194042': '119404

Note that we're ignoring any 'secondary' labels at this point, but where a primary/secondary labelling exists it is retained in the metadata.  In the case of AviaNZ-sourced data, any multi-bird lists have been labelled 'unknown'.
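
List-valued fields such as `secondary_labels` appear in the tables as strings like `"['sidekick_bird']"`. A small sketch of parsing them, and of collapsing a multi-species list to 'unknown', is shown below; both helper names are hypothetical, and treating an empty list as 'unknown' is an assumption.

```python
import ast


def parse_list_field(value):
    """Parse a stringified list such as "['a', 'b']", dropping empty entries."""
    if not value:
        return []
    items = ast.literal_eval(value)  # safe evaluation of Python literals only
    return [item for item in items if item]


def single_label(species):
    """Return the only species in the list, otherwise 'unknown'."""
    return species[0] if len(species) == 1 else "unknown"


print(parse_list_field("['sidekick_bird']"))
print(parse_list_field("['']"))
print(single_label(["bird_a", "bird_b"]))
```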

In [16]:
df_label.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz),Filepath
1,1139490/CSA36385.ogg,,,,,1139490,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
2,1139490/CSA36389.ogg,,,,,1139490,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
3,1192948/CSA36358.ogg,,,,,1192948,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...


In [17]:
if config.debug:
    df_label = df_label.head(40)
    df_meta = df_meta.head(40)

In [18]:
df_meta.head(3)

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
1,1139490/CSA36385.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36385.ogg,,,[]
2,1139490/CSA36389.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36389.ogg,,,[]
3,1192948/CSA36358.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3791,-73.7313,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1192948/CSA36358.ogg,,,[]


In [19]:
df_meta.shape

(40, 15)

## Pre-filter the audio files

* Voice detection
* Bird-call detection

In [20]:
df_label, df_meta = detect_voice(df_label, df_meta, threshold=0.5)
df_label, df_meta = detect_animal_sound(df_label, df_meta)

## Convert the source data to Anqa Format

In [21]:
converter = ToAnqa(source_dir=paths.audio_files,
                   destn_dir=paths.anqa_destn,
                   name_map=save_name_map,
                   save_audio=config.save_new_flac_files,
                   max_seconds=config.max_seconds,
                   end_padding=config.padding,
                   crop_method=config.crop_method,
                   n_jobs=4,
                   )

df_labels, df_meta = converter.convert_all(df_label, df_meta)

Converting recordings: 100%|██████████| 40/40 [00:05<00:00,  7.61it/s]


In [22]:
pd.set_option("display.max_colwidth", None)

In [23]:
df_labels.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
0,audio/CSA363850_from_0.flac,0.0,98.0,0.0,16000.0,1139490,[''],,0.0,98.0,16000.0,-43.0
1,audio/CSA363890_from_0.flac,0.0,96.0,0.0,16000.0,1139490,[''],,0.0,96.0,16000.0,-43.3
2,audio/CSA363580_from_0.flac,0.0,116.0,0.0,16000.0,1192948,[''],,0.0,116.0,16000.0,-45.4


In [24]:
df_meta.head(3)

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,audio/CSA363850_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos/IAvH-CSA-36385/IAvH-CSA-36385.mp3,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36385.ogg,0.0,98.5,[]
1,audio/CSA363890_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos/IAvH-CSA-36389/IAvH-CSA-36389.mp3,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36389.ogg,0.0,96.5,[]
2,audio/CSA363580_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos/IAvH-CSA-36358/IAvH-CSA-36358.mp3,7.4,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1192948/CSA36358.ogg,0.0,116.5,[]


In [25]:
dupe_filenames = df_meta[df_meta['filename'].duplicated(keep=False)]
dupe_filenames

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used


In [26]:
save_dataframe(df_meta, paths.metadata_destn, index=False)
save_dataframe(df_labels, paths.annotations_destn, index=False)

## Merge Multiple Anqa Datasets

In [27]:
paths.multi_anqa_fldr

'/home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets'

In [28]:
df_labels, df_meta = merge_anqua_data(paths.multi_anqa_fldr)

Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_bc-zenodo/annotations.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_bc-zenodo/metadata.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_anqa/annotations.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_anqa/metadata.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_anqa_samples/annotations.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_anqa_samples/metadata.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_raven/annotations.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/small_anqa_datasets/anqa_from_raven/metadata.parquet
Loading /home/olly/Desktop/Kaytoo/data/sample_label_formats/smal

In [29]:
df_labels.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
0,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_0.flac,3.8,4.0,2272.0,6515.0,gcrfin,,,,0.2,4243.0,-77.2
1,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_0.flac,4.6,4.8,2939.0,6303.0,gcrfin,,,,0.2,3364.0,-78.8
2,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_0.flac,5.0,5.2,3424.0,4333.0,gcrfin,,,,0.2,909.0,-75.0


In [30]:
df_meta.head(3)

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2015-07-08 06:18:05,Joe Blogs,2025-12-23 09:15:30,HSN_001_20150708_061805.flac,0.0,60.0,['model_1']
1,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_60.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2015-07-08 06:18:05,Joe Blogs,2025-12-23 09:15:30,HSN_001_20150708_061805.flac,60.0,120.0,['model_1']
2,anqa_from_bc-zenodo/HSN_001_20150708_061805_from_120.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2015-07-08 06:18:05,Joe Blogs,2025-12-23 09:15:30,HSN_001_20150708_061805.flac,120.0,180.0,['model_1']


In [31]:
dupe_filenames = df_meta[df_meta['filename'].duplicated(keep=False)]
dupe_filenames

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
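
The merged tables above show filenames prefixed with their dataset folder, which is what keeps them unique across datasets. A minimal sketch of that idea follows, using lists of dicts; it is not the `merge_anqa_data` implementation, and replacing the `audio/` directory with the folder name is an assumption based on the output shown.

```python
from collections import Counter
from pathlib import PurePosixPath


def merge_tables(datasets):
    """Concatenate per-dataset rows, rewriting 'filename' relative to the parent folder.

    datasets: mapping of dataset folder name -> list of row dicts with a 'filename' key
    """
    merged = []
    for folder, rows in datasets.items():
        for row in rows:
            new_row = dict(row)  # leave the input rows untouched
            basename = PurePosixPath(row["filename"]).name
            new_row["filename"] = str(PurePosixPath(folder) / basename)
            merged.append(new_row)
    return merged


def duplicated_filenames(rows):
    """Return the filenames that occur more than once."""
    counts = Counter(row["filename"] for row in rows)
    return sorted(name for name, count in counts.items() if count > 1)


# Two hypothetical datasets that happen to contain a file with the same name
datasets = {
    "anqa_from_raven": [{"filename": "audio/rec_from_0.flac"}],
    "anqa_from_avianz": [{"filename": "audio/rec_from_0.flac"}],
}
merged = merge_tables(datasets)
print([row["filename"] for row in merged])
print(duplicated_filenames(merged))
```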


## (Optional) Create `.selections.txt` for Raven
Set `config.save_raven_selections = True` (final cell), then re-run the cell below manually to produce `.selections.txt` files for viewing in [Raven](https://www.ravensoundsoftware.com/)

In [32]:
if config.save_raven_selections:
    anqa_to_raven_selections(df_labels, paths.anqa_destn)  #paths.new_raven_dataset, df_labels
    selections = list(paths.anqa_destn.rglob('*.selections.txt'))
    display(selections[:3])
    random_file = random.choice(selections)
    selection = pd.read_csv(random_file, sep="\t")
    display(selection.head())

In [33]:
config.save_raven_selections = True
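
For readers unfamiliar with the Raven selection table layout, the sketch below writes Anqa-style label rows as tab-separated text with Raven's standard leading columns. It is not the `anqa_to_raven_selections` implementation: the function name, the fixed `View` and `Channel` values and the `Label` annotation column are assumptions for illustration.

```python
import csv
import io

RAVEN_COLUMNS = ["Selection", "View", "Channel", "Begin Time (s)", "End Time (s)",
                 "Low Freq (Hz)", "High Freq (Hz)", "Label"]


def to_raven_selection_table(rows):
    """Render the label rows of one recording as tab-separated selection-table text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=RAVEN_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        writer.writerow({
            "Selection": number,           # 1-based selection ID
            "View": "Spectrogram 1",       # assumed view name
            "Channel": 1,                  # assumed mono audio
            "Begin Time (s)": row["Start Time (s)"],
            "End Time (s)": row["End Time (s)"],
            "Low Freq (Hz)": row["Low Freq (Hz)"],
            "High Freq (Hz)": row["High Freq (Hz)"],
            "Label": row["Label"],
        })
    return buffer.getvalue()


rows = [{"Start Time (s)": 3.8, "End Time (s)": 4.0,
         "Low Freq (Hz)": 2272.0, "High Freq (Hz)": 6515.0, "Label": "gcrfin"}]
table = to_raven_selection_table(rows)
print(table)
```

Writing the text to a file named after the recording, with the `.selections.txt` suffix, would then let Raven open it alongside the audio.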