## Reformat Bird Soundscapes & Training Data to Anqa

Converts a dog's breakfast of sources, annotation programs and methodologies into *Anqa*: a cleaned, unified, ML-friendly tabular format with time-frequency bounding boxes and a separate file of per-recording metadata.

### Motivation & Summary

* Create a dataset of soundscapes with a specified maximum length, in `.flac` audio format.
* Keep track of any metadata available, with one row per soundscape.
* Store time-frequency (T-F) annotations for future processing, training pipelines, or human-in-loop continuous improvement.
* Optionally convert to Raven `.selections.txt` files for viewing or editing.
* Merge labels and metadata from multiple Anqa datasets into unified parent tables (with updated file paths).

The next step in a model training pipeline would likely be to chunk to shorter crops (20 seconds?) for efficient data-loading, whilst keeping track of the original source file to avoid data leakage during splitting & sampling operations.
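
As a minimal sketch of that chunking step (not part of this notebook or of WildPyTools), the hypothetical helpers below tile a recording into fixed-length crops and assign the train/validation split per source file, so every crop from one recording lands in the same split. The names `crop_windows` and `assign_split`, the 20-second crop length and the hash-based split rule are all assumptions for illustration.

```python
import hashlib


def crop_windows(duration_s, crop_s=20.0):
    """Return (start, end) pairs in seconds that tile a recording of duration_s."""
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + crop_s, duration_s)))
        start += crop_s
    return windows


def assign_split(source_filename, val_fraction=0.2):
    """Deterministically map a source recording to 'train' or 'val'."""
    digest = hashlib.sha256(source_filename.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # pseudo-random value in [0, 1]
    return "val" if bucket < val_fraction else "train"


# The split is decided by the source file, never by the individual crop,
# so crops of one recording cannot leak across splits.
crops = [
    {"source_filename": name, "start_s": s, "end_s": e, "split": assign_split(name)}
    for name, duration in [("120401_05.WAV", 120.0), ("120401_07.WAV", 49.4)]
    for s, e in crop_windows(duration)
]
```

Hashing the file name keeps the assignment stable across re-runs without storing a split table; a grouped splitter from an ML library would serve the same purpose.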

For validation and/or test loaders, the data from those splits could potentially be left in their 1-minute form to more accurately reflect their real-life use-case.

The schema deliberately separates metadata from annotations. Arbitrary extra columns such as 'reviewed by' can be added to the metadata file, while the annotation file stays simple and standardised.
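
The two-table layout can be pictured with plain dictionaries. The sketch below uses column names and values taken from the sample outputs in this notebook; the `with_metadata` helper is hypothetical and only shows that the file name is the link between the two tables.

```python
# One row per recording: free to grow extra columns such as 'reviewed_by'.
metadata_rows = [
    {"filename": "120401_05_from_0.flac", "source_filename": "120401_05.WAV",
     "source_start_s": 0.0, "source_end_s": 60.0, "reviewed_by": "Joe Blogs"},
]

# One row per bounding box: a fixed, minimal set of columns.
annotation_rows = [
    {"Filename": "120401_05_from_0.flac", "Start Time (s)": 7.4, "End Time (s)": 11.0,
     "Low Freq (Hz)": 303.0, "High Freq (Hz)": 12654.0, "Label": "kokako3"},
    {"Filename": "120401_05_from_0.flac", "Start Time (s)": 23.9, "End Time (s)": 28.5,
     "Low Freq (Hz)": 303.0, "High Freq (Hz)": 14927.0, "Label": "kokako3"},
]

metadata_by_file = {row["filename"]: row for row in metadata_rows}


def with_metadata(annotation):
    """Attach the per-recording metadata to a single annotation row."""
    return {**annotation, **metadata_by_file[annotation["Filename"]]}
```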

### TODO (Phase 1):
* A merger method, to take a list of multiple Anqa dataset paths and create a parent one
* Run on all local datasets
* Merge the result
* Publish merged dataset on Zenodo
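
One possible shape for the merger, sketched with the standard library only. It assumes each Anqa dataset folder holds `metadata.csv` and `annotations.csv` with the `filename` / `Filename` columns seen in this notebook, and that prefixing the file name with the dataset folder name is an acceptable way to update file paths. The function name `merge_anqa_datasets` is hypothetical and this is not the planned implementation.

```python
import csv
from pathlib import Path


def merge_anqa_datasets(dataset_dirs, parent_dir):
    """Concatenate the tables of several Anqa datasets into one parent folder."""
    parent_dir = Path(parent_dir)
    parent_dir.mkdir(parents=True, exist_ok=True)
    # (table file, column holding the audio file name) pairs
    for table, column in [("metadata.csv", "filename"), ("annotations.csv", "Filename")]:
        merged, fieldnames = [], []
        for dataset_dir in map(Path, dataset_dirs):
            with open(dataset_dir / table, newline="") as handle:
                reader = csv.DictReader(handle)
                # Keep the union of columns, in first-seen order
                for name in reader.fieldnames or []:
                    if name not in fieldnames:
                        fieldnames.append(name)
                for row in reader:
                    # Prefix with the dataset folder so names stay unique
                    row[column] = f"{dataset_dir.name}/{row[column]}"
                    merged.append(row)
        with open(parent_dir / table, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
            writer.writeheader()
            writer.writerows(merged)
```

Columns missing from one dataset are written as empty cells in the parent table, which fits the idea that the metadata file may carry arbitrary extra columns.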

### TODO (Phase 2):
* Build detection model from all published open-source data available globally
* Integrate Voice Detection
* Integrate Animal Sound Detection
* Build the Anqa dataset reviewing and editing notebook

### Methods

**Source Data (Labels plus corresponding .wav, .ogg or .flac files)**
* A previously compiled Anqa dataset
* Raven `.selections.txt` tab-separated tables
* A BirdCLEF - Zenodo style soundscape `.csv` tidytable with one bounding box per row, as per [here](https://zenodo.org/records/7018484)
* An existing BirdCLEF-style metadata dataframe with data such as licenses, lat, long, reviewed-by, reviewed-on, etc.
* AviaNZ `.data` labels
* Freebird `.tag` xml labels

If starting a new dataset from scratch:

* Create the metadata in `.csv`, one row per sound recording, with the column names used here.
* Annotation should use Raven with eBird and iNaturalist codes.
* Mark bounding boxes (frequency and time bounds) on every bird-call.
* Unidentified calls should still be boxed but labelled 'unknown'.
* One minute of annotations per recording is optimal (randomly sampled from a longer recording if need be).
* A 'Sex' column, m/f or left empty.
* A 'Rating' for label decision confidence on a scale of 1 to 5.
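
The guidelines above lend themselves to a quick automated check before conversion. The following is a sketch only: `validate_annotation` is a hypothetical helper, the column names follow the annotation tables shown later in this notebook, and treating an empty 'Rating' as acceptable is an assumption.

```python
def validate_annotation(row):
    """Return a list of problems found in one annotation row (empty if none)."""
    problems = []
    if not str(row.get("Label", "")).strip():
        problems.append("Label is empty; use 'unknown' for unidentified calls")
    if row.get("Sex", "") not in ("", "m", "f"):
        problems.append("Sex must be 'm', 'f' or empty")
    rating = row.get("Rating", "")
    if rating != "" and rating not in (1, 2, 3, 4, 5):
        problems.append("Rating must be an integer from 1 to 5")
    if not row["Start Time (s)"] < row["End Time (s)"]:
        problems.append("Start time must be before end time")
    if not 0 <= row["Low Freq (Hz)"] < row["High Freq (Hz)"]:
        problems.append("Frequency bounds must satisfy 0 <= low < high")
    return problems


example = {"Label": "kokako3", "Sex": "", "Rating": 4,
           "Start Time (s)": 7.4, "End Time (s)": 11.0,
           "Low Freq (Hz)": 303.0, "High Freq (Hz)": 12654.0}
issues = validate_annotation(example)
```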

**Intermediate formatting step (Automated)**  

Parses the source files, and creates two dataframes:

1. Metadata, taken from the existing file if there is one; otherwise a new (mostly empty) one is created.
2. Labels, read only from their original 'source' format, for example `.selections.txt` from Raven.

**Propose missing labels (Optional)**   

* Delete any annotations previously labelled 'h_detect' or 'a_detect'
* Run a voice detection model over every sample and label any human voice 'h_detect'.
* Run an animal sound detection model over every sample and label any detections 'a_detect'.
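
The delete-then-propose cycle above can be sketched as follows. `refresh_auto_labels` is a hypothetical helper operating on lists of row dictionaries; the detection models themselves are outside this sketch and their output is simply passed in.

```python
AUTO_LABELS = {"h_detect", "a_detect"}


def refresh_auto_labels(annotations, new_detections):
    """Drop stale auto-generated rows, then append the latest detections."""
    kept = [row for row in annotations if row["Label"] not in AUTO_LABELS]
    return kept + list(new_detections)


existing = [
    {"Filename": "a.flac", "Label": "kokako3"},
    {"Filename": "a.flac", "Label": "h_detect"},  # from an earlier run
]
fresh = [{"Filename": "a.flac", "Label": "a_detect"}]
updated = refresh_auto_labels(existing, fresh)
```

Because old auto labels are removed first, re-running the step never accumulates duplicates.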

**Crop or chunk longer recordings to a fixed maximum length (Optional)**  

Restrict each file to a maximum length (default 60 seconds), choosing the window of the recording that contains the most identified labels. If no labels are found, a random window is chosen. The point is to maximise the value of the labeller's time: 15 different one-minute files offer far more diversity than a single 15-minute file with the same birds in the same background noise.
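
A minimal sketch of that window selection, assuming labels are counted by their start time. The notebook configuration lists a label-count crop method as a future option, so this is an illustration of the idea rather than the WildPyTools implementation; `choose_crop_start` is a hypothetical name.

```python
import random


def choose_crop_start(label_starts, duration_s, max_seconds=60.0, rng=random):
    """Pick the start (s) of the max_seconds window holding the most label starts."""
    if duration_s <= max_seconds:
        return 0.0  # nothing to crop
    latest_start = duration_s - max_seconds
    if not label_starts:
        return rng.uniform(0.0, latest_start)  # no labels: random window
    best_start, best_count = 0.0, -1
    # An optimal window can always be shifted to begin at a label start,
    # so only those positions (clamped to the recording) need checking.
    for candidate in sorted(min(s, latest_start) for s in label_starts):
        count = sum(candidate <= s < candidate + max_seconds for s in label_starts)
        if count > best_count:
            best_start, best_count = candidate, count
    return best_start


start = choose_crop_start([5, 100, 110, 130, 400], duration_s=900.0)
```

Here the window starting at 100 s covers three labels (100, 110 and 130), more than any other candidate.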

**Save in Anqa format**  

* (Optional), re-save the recordings in `.flac` format
* Save the annotations 
* Save the metadata

**Human-in-loop editing**
* Create an updated Raven dataset with `.selections.txt` files, plus `annotations` and `metadata`.
* Externally review the `.selections.txt` files in Raven, then re-run the process above.
* Precedence will be given to the existing human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.

**Re-Running an existing Anqa Dataset**
* Change the chunk lengths, or take advantage of improved detection/cleaning.
* Precedence given to human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.
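
One way to read "precedence" is that an automated detection is dropped whenever it overlaps in time with a human-tagged label in the same file. That overlap rule is an assumption made for this sketch, and `apply_precedence` is a hypothetical helper, not the notebook's actual logic.

```python
AUTO_LABELS = {"h_detect", "a_detect"}


def overlaps(a, b):
    """True when two rows are in the same file and their time spans intersect."""
    return (a["Filename"] == b["Filename"]
            and a["Start Time (s)"] < b["End Time (s)"]
            and b["Start Time (s)"] < a["End Time (s)"])


def apply_precedence(annotations):
    """Keep all human labels; keep auto labels only where no human label overlaps."""
    human = [r for r in annotations if r["Label"] not in AUTO_LABELS]
    auto = [r for r in annotations if r["Label"] in AUTO_LABELS]
    kept_auto = [a for a in auto if not any(overlaps(a, h) for h in human)]
    return human + kept_auto


rows = [
    {"Filename": "a.flac", "Start Time (s)": 10.0, "End Time (s)": 20.0, "Label": "kokako3"},
    {"Filename": "a.flac", "Start Time (s)": 15.0, "End Time (s)": 25.0, "Label": "h_detect"},
    {"Filename": "a.flac", "Start Time (s)": 30.0, "End Time (s)": 35.0, "Label": "a_detect"},
    {"Filename": "b.flac", "Start Time (s)": 15.0, "End Time (s)": 25.0, "Label": "h_detect"},
]
resolved = apply_precedence(rows)
```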

### Notebook Configuration
(Adjust as needed)

In [1156]:
from pathlib import Path
project_root = str(Path().resolve().parent)

In [1157]:
#Note: 'anqa_destn' must differ from source directory if saving new audio files

notebook_use = {'project_root': project_root,
                'anqa_destn': project_root + '/data/sample_label_formats/anqa_from_anqa', 
                'source_type': 'anqa', #'bc-zenodo', #any of: 'anqa', 'raven', 'birdclef', 'bc-zenodo', 'avianz', 'freebird'
                'debug': True,
                'max_seconds': 60,  #Segment and/or crop any data type to this maximum length
                'crop_method': 'keep_all', #any of 'keep_all', 'random',  in future: 'max_annotations', 'max_detections'
                'padding': None, # Any of 'pad', 'noise', 'zero', None
                'save_new_flac_files': True, # If true the original source will be re-written into .flac files of length max_seconds
                'tabular_save_suffix': 'csv' #csv
                }
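
Since the options are free-form strings, a small sanity check before running the rest of the notebook can catch typos early. The sketch below is hypothetical (`check_options` is not part of WildPyTools); the allowed values are copied from the comments in the configuration cell, and the directory check mirrors the note that `anqa_destn` must differ from the source directory when saving new audio.

```python
from pathlib import Path

ALLOWED_SOURCE_TYPES = {"anqa", "raven", "birdclef", "bc-zenodo", "avianz", "freebird"}
ALLOWED_CROP_METHODS = {"keep_all", "random"}
ALLOWED_PADDING = {"pad", "noise", "zero", None}


def check_options(options, source_audio_dir):
    """Return a list of configuration problems (empty if the options look sane)."""
    errors = []
    if options["source_type"] not in ALLOWED_SOURCE_TYPES:
        errors.append(f"Unknown source_type: {options['source_type']!r}")
    if options["crop_method"] not in ALLOWED_CROP_METHODS:
        errors.append(f"Unknown crop_method: {options['crop_method']!r}")
    if options["padding"] not in ALLOWED_PADDING:
        errors.append(f"Unknown padding: {options['padding']!r}")
    if options["max_seconds"] <= 0:
        errors.append("max_seconds must be positive")
    same_dir = Path(options["anqa_destn"]).resolve() == Path(source_audio_dir).resolve()
    if options["save_new_flac_files"] and same_dir:
        errors.append("anqa_destn must differ from the source directory when saving audio")
    return errors
```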

### Imports

In [1158]:
import os, sys
from dotenv import load_dotenv
import random
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from IPython.display import display
import ipywidgets as widgets
import importlib
button = widgets.Button(description="Continue")
output = widgets.Output()

#%matplotlib widget  
torch.set_num_threads(1)

In [1159]:
load_dotenv(dotenv_path = project_root + "/.env")  # No space before '/.env', otherwise the file is not found

for p in os.getenv("PYTHONPATH", "").split(os.pathsep):
    if p:
        print(f' Adding Python path: {p}')
        sys.path.append(str(Path(p).resolve()))

 Adding Python path: ../../wildpytools/src
 Adding Python path: ./src


In [1160]:
#Only doing like this for convenience whilst I work on WildPyTools
import wildpytools
import wildpytools.audio as wa
importlib.reload(wa)
importlib.reload(wildpytools.io)

SourceDataLoader = wa.SourceDataLoader
ToAnqa = wa.ToAnqa
anqa_to_raven_selections = wa.anqa_to_raven_selections
save_dataframe = wildpytools.io.save_dataframe

### Data Paths & Values (Simulating all possible options)

In [1161]:
# These lines will over-write any existing values from the csv metadata file if it exists.  
# Comment out any items you want to leave unchanged.  
update_metadata_dict = {'collection': 'birdclef_22',     # A duplicate 'collection' key ('crown_jewels') was silently overridden by this value
                        'secondary_labels': "['sidekick_bird']",
                        'url' : 'some_web_address',
                        'latitude': 37.0,
                        'longitude': -118.5,
                        'author' : 'Mr Bigglesworth',
                        'licence': 'some_licence',       # NB: the metadata table's column is spelled 'license'
                        #'recorded_on': None,            #2025-12-23 09:15:30        
                        'reviewed_on' : '2025-12-23 09:15:30',
                        'reviewed_by': 'Joe Blogs',
                        'models_used' : "['model_1']"
                        }

anqa_data = {'audio_files': project_root + '/data/sample_label_formats/converted_to_anqa/audio',
              'metadata_path': project_root + '/data/sample_label_formats/converted_to_anqa/metadata.csv',
              'labels_path': project_root + '/data/sample_label_formats/converted_to_anqa/annotations.csv',
              'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
              'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
             }

raven_data = {'audio_files': project_root + '/data/sample_label_formats/raven_kokako_olly',
              'naming_csv': project_root + '/data/experiments/exp_2/exp_2_deploy/exp_2_bird_map.csv',
              'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
              'metadata_path' : None, #project_root + '/data/sample_label_formats/raven_kokako_olly/metadata.csv'
              'metadata_dict': update_metadata_dict,  # replace with None if not updating
              }

avianz_data = {'audio_files': project_root + '/data/sample_label_formats/avianz_janzoon_brian',
               'naming_csv': project_root + '/data/bird_names/avianz_to_ebird.csv',
               'naming_cols' : {'source_name': 'AviaNZ','save_name': 'eBird'},
              }

birdclef_data = {
                'audio_files': project_root + '/data/sample_label_formats/bc25_audio_files',
                'metadata_path': project_root + '/data/sample_label_formats/birdclef/metadata/train.csv',
                'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                'metadata_dict': update_metadata_dict,  # replace with None if not updating
                }

#Post BirdCLEF annotated soundscapes on zenodo.org:   https://zenodo.org/search?q=birdclef&l=list&p=1&s=10&sort=bestmatch
bc_zenodo_data = {'audio_files': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/audio',
                  'metadata_dict': update_metadata_dict,
                  'labels_path': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/annotations.csv',
                  'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                  'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                  }

freebird_data = {'audio_files': project_root + '/data/sample_label_formats/freebird_tier_1',
                 'naming_csv': project_root + '/data/sample_label_formats/freebird_species_list_05112013.csv',
                 'naming_cols' : {'source_name': 'Code', 'save_name': 'eBird'},
                 'metadata_dict': update_metadata_dict
                }

data_sources = {'anqa' : anqa_data,
                'raven' : raven_data,
                'avianz' : avianz_data,
                'birdclef' : birdclef_data,
                'bc-zenodo' : bc_zenodo_data,
                'freebird' : freebird_data
                }

data_source = data_sources[notebook_use['source_type']]
allowed_loader_kwargs = ['metadata_path', 'metadata_dict', 'labels_path']
loader_kwargs = {
    key: val
    for key, val in data_source.items()
    if key in allowed_loader_kwargs and val is not None
}

In [1162]:
class Config():
    def __init__(self, options: dict, data_sources: dict):
        self.SOURCE_DATA_TYPE = options['source_type']
        self.DEBUG = options['debug']
        self.MAX_SECONDS = options['max_seconds']
        self.CROP_METHOD = options['crop_method']
        self.PADDING = options['padding']
        self.SAVE_NEW_FLAC_FILES = options['save_new_flac_files']
        self.SAVE_RAVEN_SELECTIONS = False # User must manually change at end of the notebook if wanted
        self.SOURCE_NAME = self.SAVE_NAME = None
        _naming_cols = data_sources[self.SOURCE_DATA_TYPE].get('naming_cols')
        if _naming_cols is not None:
            self.SOURCE_NAME = _naming_cols.get('source_name')
            self.SAVE_NAME = _naming_cols.get('save_name')
        self.DESTN_DEPTH = options.get('destn_depth') or 2


class FilePaths():
    def __init__(self, notebook_use: dict, data_sources: dict):
        _data_type = notebook_use['source_type']
        _tabular_save_suffix = notebook_use['tabular_save_suffix']
        _source_cfg = data_sources[_data_type]
        _metadata = _source_cfg.get('metadata_path')
        _naming = _source_cfg.get('naming_csv')
        _annotations = _source_cfg.get('labels_path')
        
        self.PROJECT_ROOT = Path(notebook_use['project_root'])
        self.AUDIO_FILES = Path(data_sources[_data_type]['audio_files'])
        self.METADATA = Path(_metadata) if _metadata is not None else None
        self.NAMING_CSV = Path(_naming) if _naming is not None else None
        self.ANNOTATION_SRC = Path(_annotations) if _annotations is not None else None

        # Output directory (ensure it exists)
        self.ANQA_DESTN = Path(notebook_use.get('anqa_destn'))
        self.ANQA_DESTN.mkdir(parents=True, exist_ok=True)
        self.ANNOTATIONS_DESTN = self.ANQA_DESTN / f'annotations.{_tabular_save_suffix}'
        self.METADATA_DESTN = self.ANQA_DESTN / f'metadata.{_tabular_save_suffix}'

config = Config(notebook_use, data_sources)
paths = FilePaths(notebook_use, data_sources)

In [1163]:
paths.METADATA

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/metadata.csv')

In [1164]:
if paths.METADATA:
    testing = pd.read_csv(paths.METADATA)
    display(testing.head(3))

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,120401_07_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_07.WAV,0.0,49.4,['model_1']
1,120401_05_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,0.0,60.0,['model_1']
2,120401_05_from_60.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,60.0,120.0,['model_1']


In [1165]:
if paths.ANNOTATION_SRC:
    testing = pd.read_csv(paths.ANNOTATION_SRC)
    display(testing.head())

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Rating,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
0,120401_07_from_0.flac,15.3,16.8,227.0,12427.0,kokako3,,,,1.5,12199.0,-33.4
1,120401_05_from_0.flac,7.4,11.0,303.0,12654.0,kokako3,,,,3.6,12351.0,-31.6
2,120401_05_from_0.flac,23.9,28.5,303.0,14927.0,kokako3,,,,4.6,14624.0,-39.4
3,120401_05_from_0.flac,39.3,46.5,0.0,13563.0,kokako3,,,,7.2,13563.0,-34.6
4,090201_01_from_0.flac,9.1,11.5,443.0,9697.0,kokako3,,,,2.5,9255.0,-53.3


In [1166]:
class VoiceDetector():
    def __init__(self, chunk_len, threshold=0.1, no_voice=0, voice=20):
        # Load Silero VAD and its helper functions from torch.hub
        model, (get_speech_timestamps, _, read_audio, _, _) = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                                                 model='silero_vad', verbose=False)
        self.model = model
        self.chunk_len = chunk_len
        self.threshold = threshold
        self.no_voice = no_voice  # Value assigned where no speech is detected
        self.voice = voice        # Value assigned where speech is detected
        self.get_stamps = get_speech_timestamps

    def detect(self, np_wav):
        speech_timestamps = self.get_stamps(torch.Tensor(np_wav), self.model, threshold=self.threshold)
        voice_detect = np.full_like(np_wav, self.no_voice)
        for st in speech_timestamps:
            voice_detect[st['start']: st['end']] = self.voice
    
        # Pad to a whole number of chunks, then downsample to match the power plot axis
        pad_len = int(np.ceil(len(voice_detect) / self.chunk_len) * self.chunk_len - len(voice_detect))
        voice_detect = np.pad(voice_detect, (0, pad_len), constant_values=self.no_voice)
        voice_detect = voice_detect.reshape((-1, self.chunk_len)).max(axis=1)  # Use max to preserve speech detection
        return voice_detect
    
def calc_signal_pwr(wav, chunk_len, sr=32000):
    power = wav ** 2 
    power = np.pad(power, (0, int(np.ceil(len(power) / chunk_len) * chunk_len - len(power))))
    power = power.reshape((-1, chunk_len)).sum(axis=1)
    return power


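def detect_voice(labels: pd.DataFrame,
                 metadata: pd.DataFrame,
                 threshold: float = 0.5,
                 ):
    """Placeholder: run a voice detection model over each recording to locate any human voice.

    Not yet implemented; the inputs are currently returned unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one bounding box per row.
        metadata (pd.DataFrame): Per-recording metadata, one row per file.
        threshold (float): Detection probability threshold.
    """
    return labels, metadata


def detect_animal_sound(labels: pd.DataFrame,
                        metadata: pd.DataFrame,
                        ):

    """Placeholder: run an object detection model over spectrograms to propose missing bird or animal labels.

    Not yet implemented; the inputs are currently returned unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one bounding box per row.
        metadata (pd.DataFrame): Per-recording metadata, one row per file.
    """
    return labels, metadata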
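
When `detect_voice` is filled in, the speech timestamps will need converting into annotation rows. The sketch below assumes the timestamps are dictionaries with `start` and `end` given in samples, as the `VoiceDetector.detect` method above indexes them. The helper name `timestamps_to_annotations` is hypothetical, and leaving the frequency bounds empty is an assumption.

```python
def timestamps_to_annotations(filename, speech_timestamps, sample_rate, label="h_detect"):
    """Convert sample-indexed speech timestamps into annotation rows in seconds."""
    rows = []
    for stamp in speech_timestamps:
        rows.append({
            "Filename": filename,
            "Start Time (s)": round(stamp["start"] / sample_rate, 1),
            "End Time (s)": round(stamp["end"] / sample_rate, 1),
            "Low Freq (Hz)": None,   # a voice detector gives no frequency bounds
            "High Freq (Hz)": None,
            "Label": label,
        })
    return rows


voice_rows = timestamps_to_annotations(
    "120401_05_from_0.flac",
    [{"start": 16000, "end": 48000}],
    sample_rate=16000,
)
```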

## Load source data

In [1167]:
if paths.METADATA is not None:
    meta_df = pd.read_csv(paths.METADATA)
    display(meta_df.head(3))
    display(f'The metadata dataframe has length {len(meta_df)}')

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,120401_07_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_07.WAV,0.0,49.4,['model_1']
1,120401_05_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,0.0,60.0,['model_1']
2,120401_05_from_60.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,60.0,120.0,['model_1']


'The metadata dataframe has length 14'

Set up the name mapping between the source data and the new dataset.

In [1168]:
save_name_map = None  # Avoids a NameError later when no naming CSV is configured (assumes the loader accepts None)
if paths.NAMING_CSV is not None:
    source_names = pd.read_csv(paths.NAMING_CSV)
    source_col = config.SOURCE_NAME
    save_col = config.SAVE_NAME
    save_name_map = dict(zip(source_names[source_col], source_names[save_col]))
    display(source_names.head(3))

Unnamed: 0,eBird,CommonName,ScientificName,Class
0,1139490,Ragoniella pulchella,Ragoniella pulchella,insecta
1,1192948,Oxyprora surinamensis,Oxyprora surinamensis,insecta
2,1194042,Copiphora colombiae,Copiphora colombiae,insecta
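
To illustrate how such a name map might be applied to labels, here is a standard-library sketch. The CSV content is an invented example in the style of an AviaNZ-to-eBird table, and sending unmapped names to 'unknown' is an assumption; how `SourceDataLoader` actually treats unmapped names is not shown in this notebook.

```python
import csv
import io


def build_name_map(csv_text, source_col, save_col):
    """Build a {source name: saved name} dictionary from naming-CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[source_col]: row[save_col] for row in reader}


def rename_label(label, name_map, default="unknown"):
    """Translate a source label, falling back to a default when unmapped."""
    return name_map.get(label, default)


# Hypothetical naming table, for illustration only
naming_csv_text = "AviaNZ,eBird\nKokako,kokako3\n"
name_map = build_name_map(naming_csv_text, source_col="AviaNZ", save_col="eBird")
renamed = [rename_label(name, name_map) for name in ["Kokako", "Mystery bird"]]
```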


In [1169]:
paths.AUDIO_FILES

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio')

In [1170]:
dataset_loader = SourceDataLoader(source_type = config.SOURCE_DATA_TYPE,
                                  rename_map = save_name_map,
                                  )
print(f'Loading source data of the type {config.SOURCE_DATA_TYPE}')

df_label, df_meta = dataset_loader.load_data(paths.AUDIO_FILES, **loader_kwargs)



Loading source data of the type anqa
[INFO] 141 file(s) exist under /home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio but are not referenced by the metadata dataframe. Examples:
[PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/AA144_BIRP_130226_121325_from_420.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/20190831_181504_from_420.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/iNat1208572_from_180.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/20190831_083004_from_180.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/HSN_001_20150708_061805_from_240.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/converted_to_anqa/audio/20190831_181504_from_0.flac'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/c

Note that we're ignoring any 'secondary' labels at this point, but if there is a primary/secondary labelling it is retained in the metadata. In the case of AviaNZ-sourced data, any multi-bird lists have been labelled 'unknown'.

In [1171]:
df_label.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz),Filepath
0,120401_07_from_0.flac,15.3,16.8,227.0,12427.0,kokako3,,,,1.5,12199.0,-33.4,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
1,120401_05_from_0.flac,7.4,11.0,303.0,12654.0,kokako3,,,,3.6,12351.0,-31.6,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
2,120401_05_from_0.flac,23.9,28.5,303.0,14927.0,kokako3,,,,4.6,14624.0,-39.4,/home/olly/Desktop/Kaytoo/data/sample_label_fo...


In [1173]:
if config.DEBUG:
    df_label = df_label.head(40)
    df_meta = df_meta.head(40)

In [1174]:
df_meta.head(3)

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,120401_07_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_07.WAV,0.0,49.4,['model_1']
1,120401_05_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,0.0,60.0,['model_1']
2,090201_01_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,090201_01.WAV,0.0,60.0,['model_1']


In [1175]:
df_meta.shape

(9, 15)

## Pre-filter the audio files

* Voice detection
* Bird-call detection

In [1176]:
df_label, df_meta = detect_voice(df_label, df_meta, threshold=0.5)  # threshold is a float, not a bool
df_label, df_meta = detect_animal_sound(df_label, df_meta)

## Convert the source data to Anqa Format

In [1177]:
converter = ToAnqa(
                    source_dir=paths.AUDIO_FILES,
                    destn_dir=paths.ANQA_DESTN,
                    name_map=save_name_map,
                    save_audio=config.SAVE_NEW_FLAC_FILES,
                    max_seconds=config.MAX_SECONDS,
                    end_padding=config.PADDING,
                    crop_method=config.CROP_METHOD,
                    n_jobs=4,
                    destn_depth=config.DESTN_DEPTH,
                    )

df_labels, df_meta = converter.convert_all(df_label, df_meta)

Converting recordings: 100%|██████████| 8/8 [00:00<00:00, 23.36it/s]


In [1178]:
df_labels.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
0,120401_07_from_0.flac,15.3,16.8,227.0,12427.0,kokako3,,,,1.5,12200.0,-34.8
1,120401_05_from_0.flac,7.4,11.0,303.0,12654.0,kokako3,,,,3.6,12351.0,-33.0
2,120401_05_from_0.flac,23.9,28.5,303.0,14927.0,kokako3,,,,4.6,14624.0,-40.8


In [1179]:
df_labels.tail(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Score,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
37,090201_02_from_120.flac,26.4,28.5,509.0,5492.0,kokako3,,,,2.1,4983.0,-46.1
38,090201_02_from_120.flac,33.0,37.4,429.0,4554.0,kokako3,,,,4.4,4125.0,-44.9
39,090201_02_from_120.flac,38.9,42.1,429.0,4313.0,kokako3,,,,3.2,3884.0,-41.3


In [1180]:
df_meta.head()

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,120401_07_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_07.WAV,0.0,49.4,['model_1']
1,120401_05_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,120401_05.WAV,0.0,60.0,['model_1']
2,090201_01_from_0.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,090201_01.WAV,0.0,60.0,['model_1']
3,090201_01_from_60.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,090201_01.WAV,60.0,120.0,['model_1']
4,090201_01_from_120.flac,birdclef_22,['sidekick_bird'],some_web_address,37.0,-118.5,Mr Bigglesworth,,2025-12-23 09:15:30,Joe Blogs,2025-12-23 09:15:30,090201_01.WAV,120.0,180.0,['model_1']


In [1181]:
save_dataframe(df_meta, paths.METADATA_DESTN, index=False)
save_dataframe(df_labels, paths.ANNOTATIONS_DESTN, index=False)

### (Optional) Create `.selections.txt` for viewing or editing in Raven
`config.SAVE_RAVEN_SELECTIONS` defaults to `False`. To produce `.selections.txt` files for viewing in [Raven](https://www.ravensoundsoftware.com/), run the final cell to set it to `True`, then re-run the cell before it.

In [1182]:
if config.SAVE_RAVEN_SELECTIONS:
    anqa_to_raven_selections(df_labels, paths.ANQA_DESTN)  #paths.new_raven_dataset, df_labels
    selections = list(paths.ANQA_DESTN.rglob('*.selections.txt'))
    display(selections[:3])
    random_file = random.choice(selections)
    selection = pd.read_csv(random_file, sep="\t")
    display(selection.head())

In [1183]:
config.SAVE_RAVEN_SELECTIONS = True
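
For orientation, a Raven selection table is a tab-separated text file. The sketch below writes one from Anqa-style rows using Raven's usual default columns (`Selection`, `View`, `Channel`, `Begin Time (s)`, and so on) plus a `Label` annotation column. It is an illustration only: the helper `write_selection_table` is hypothetical, and how `anqa_to_raven_selections` names or orders its columns is not shown in this notebook.

```python
import csv

RAVEN_COLUMNS = ["Selection", "View", "Channel", "Begin Time (s)", "End Time (s)",
                 "Low Freq (Hz)", "High Freq (Hz)", "Label"]


def write_selection_table(annotations, path):
    """Write annotation rows for one recording as a tab-separated selection table."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=RAVEN_COLUMNS, delimiter="\t")
        writer.writeheader()
        for number, row in enumerate(annotations, start=1):
            writer.writerow({
                "Selection": number,           # 1-based selection ID
                "View": "Spectrogram 1",
                "Channel": 1,
                "Begin Time (s)": row["Start Time (s)"],
                "End Time (s)": row["End Time (s)"],
                "Low Freq (Hz)": row["Low Freq (Hz)"],
                "High Freq (Hz)": row["High Freq (Hz)"],
                "Label": row["Label"],
            })
```

A file written this way can be read back with `pd.read_csv(path, sep="\t")`, as the cell above does for a randomly chosen selection file.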