## TODO:
Various argument patterns have changed.  Re-do the following: 
 
* Load BC-Zenodo  
* Load Freebird   
* Load Avianz   

Then work on:

* Integrate Voice Detection
* Try out on available datasets
* A merger method that takes a list of Anqa dataset paths and creates a single parent dataset

## Reformat Bird Soundscapes & Training Data to Anqa

Converts a dog's breakfast of sources, annotation programs and methodologies into *Anqa*, a cleaned and unified tabular data format with time-frequency bounding boxes and a separate file for per-recording metadata.

### Motivation & Summary

* Creates a dataset of soundscapes with a specified maximum length, in `.flac` audio format.
* `metadata.parquet` (or `.csv`) contains any available metadata, with one row per soundscape.
* `annotations.parquet` (or `.csv`) contains the bounding-box annotations, for future processing.
* Optional conversion to Raven-visualisable `.selections.txt` files

This notebook demonstrates one step in an ML model training pipeline: unifying data from various sources.  The immediate next step could be human-in-the-loop review using Raven, in which case the updated Raven selections files would need to be re-run through this notebook to update the underlying Anqa annotations.  Alternatively, another method or a custom notebook could work with the Anqa annotations directly.  The step after that would likely be to use the annotated bounding boxes to crop the soundscapes into short clips for ML training.

For validation and testing steps, the data from those splits could be left in their 1-minute soundscapes to more accurately reflect their real-life use-case.

The formatting schema separates metadata from annotations. Arbitrary extra columns such as 'reviewed by' can be added to the metadata `.csv`, whilst the annotations stay in a simple standardised `.csv`, or in `.selections` files.
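As a rough illustration of that separation, the sketch below builds a tiny metadata table and a tiny annotations table and joins them on the filename. The column names are taken from the outputs shown later in this notebook; the rows themselves are invented, and the assumption that the filename is the linking key is ours.

```python
import pandas as pd

# One row per soundscape; arbitrary extra columns (e.g. 'reviewed_by') are allowed.
metadata = pd.DataFrame({
    "filename": ["rec_a.flac", "rec_b.flac"],
    "collection": ["demo", "demo"],
    "reviewed_by": ["reviewer_1", None],
})

# One row per bounding box; a fixed, standardised set of columns.
annotations = pd.DataFrame({
    "Filename": ["rec_a.flac", "rec_a.flac", "rec_b.flac"],
    "Start Time (s)": [1.0, 12.5, 3.0],
    "End Time (s)": [2.5, 14.0, 4.2],
    "Low Freq (Hz)": [1500.0, 2000.0, 800.0],
    "High Freq (Hz)": [4000.0, 6000.0, 2500.0],
    "Label": ["species_x", "unknown", "species_y"],
})

# Per-recording metadata is attached only when needed, via the filename key
# (assumed here to be the link between the two tables).
joined = annotations.merge(
    metadata, left_on="Filename", right_on="filename", how="left"
)
print(joined[["Filename", "Label", "reviewed_by"]])
```

Keeping the annotations table narrow means it can be converted to and from Raven selections without carrying the metadata along.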

### Methods

**Source Data (Labels plus corresponding .wav, .ogg or .flac files)**
* A previously compiled Anqa dataset (To change or standardise the recording lengths, or update detection proposals)
* Raven `.selections` tables
* A BirdCLEF - Zenodo style soundscape `.csv` tidytable with one bounding box per row, as per [here](https://zenodo.org/records/7018484)
* An existing BirdCLEF-style metadata dataframe with data such as licenses, lat, long, reviewed-by, reviewed-on, etc.
* AviaNZ `.data` labels
* Freebird `.tag` xml labels

If starting from scratch, people supplying the audio should be encouraged to supply the metadata as a `.csv` with one row per recording.  For annotation, they should use Raven with eBird and iNaturalist codes and mark bounding boxes (frequency and time bounds) on every bird call, labelling any call they cannot identify as 'unknown'.  One minute of annotations per recording is optimal (randomly sampled from a longer recording if need be).  A 'Sex' and 'Rating' column should ideally be included if practical.

**Intermediate formatting step (Automated)**  

Parses the source files, and creates two dataframes:

1. Metadata, from the existing one if it exists, otherwise a new (mostly empty) one is created
2. Labels.  These are always read from their original source format, for example `.selections` files from Raven.

**Propose missing labels (Optional)**  

* Delete any annotations previously labelled 'h_detect' or 'a_detect'
* Run through every sample, and detect and label any human voice.  Label 'h_detect'
* Run through every sample and detect other animal or bird sounds.  Label 'a_detect'
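The first bullet above can be sketched with a simple filter. This is a minimal illustration, assuming the annotations are held in a dataframe with a `Label` column as in the outputs later in this notebook; `drop_auto_labels` is a hypothetical helper, not part of WildPyTools.

```python
import pandas as pd

AUTO_LABELS = {"h_detect", "a_detect"}  # label names taken from the list above


def drop_auto_labels(labels: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Return a copy of `labels` without previously auto-generated rows."""
    keep = ~labels[label_col].isin(AUTO_LABELS)
    return labels.loc[keep].reset_index(drop=True)


# Invented example rows
labels = pd.DataFrame({
    "Filename": ["a.flac", "a.flac", "b.flac"],
    "Label": ["species_x", "h_detect", "a_detect"],
})
cleaned = drop_auto_labels(labels)
print(cleaned)
```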

**Crop longer recordings to a fixed maximum length (Optional)**
Restrict each file to a maximum length (default 60 seconds), choosing the window of the recording that contains the most identified labels.  If no labels are found, a random window is chosen.  The point of doing this is to maximise the value of the annotator's time: 15 different one-minute files offer much more diversity than a single 15-minute file with the same birds in the same background noise.
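One way the window choice could work is a sliding-window count of overlapping annotations, with a random fallback. The sketch below is an assumption about the approach, not the `ToAnqa` implementation (the configuration below lists `'max_annotations'` as a future crop method); `choose_window_start` and its `step` parameter are hypothetical.

```python
import random

import numpy as np


def choose_window_start(starts, ends, duration, max_seconds=60.0, step=1.0, rng=None):
    """Pick the start (s) of the window that overlaps the most annotations.

    Falls back to a random start when there are no annotations.
    """
    rng = rng or random.Random()
    if duration <= max_seconds:
        return 0.0  # nothing to crop
    last_start = duration - max_seconds
    if len(starts) == 0:
        return rng.uniform(0.0, last_start)
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    # Candidate window starts on a regular grid, clipped to the last valid start
    candidates = np.arange(0.0, last_start + step, step).clip(max=last_start)
    # An annotation overlaps a window if it starts before the window ends
    # and ends after the window starts
    counts = [int(((starts < c + max_seconds) & (ends > c)).sum()) for c in candidates]
    return float(candidates[int(np.argmax(counts))])  # first best window


# Invented example: three calls clustered around 100-131 s, one at 700 s
best = choose_window_start([100, 110, 130, 700], [102, 113, 131, 702], duration=900.0)
print(best)
```

Ties are resolved in favour of the earliest window, which is an arbitrary choice made to keep the sketch deterministic.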

**Save in Anqa format**  

* (Optional), re-save the recordings in `.flac` format
* Save the Annotations 
* Save the metadata

**Human-in-loop editing**
* Create an updated Raven dataset with `.selections.txt` files, plus `annotations` and `metadata`
* Externally review the `.selections.txt` files in Raven, then re-run the process above.
* Precedence will be given to the existing human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.
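The precedence rule could be implemented in several ways. The sketch below shows one possible interpretation, assumed for illustration only: every human label is kept, and a fresh automated proposal is kept only if it does not overlap in time with a human label in the same file. `merge_with_precedence` is a hypothetical helper and the rows are invented.

```python
import pandas as pd


def merge_with_precedence(human: pd.DataFrame, proposed: pd.DataFrame) -> pd.DataFrame:
    """Keep all human labels; keep a proposal only if it does not overlap
    in time with any human label in the same file."""
    keep_mask = []
    for _, row in proposed.iterrows():
        same_file = human[human["Filename"] == row["Filename"]]
        overlaps = (
            (same_file["Start Time (s)"] < row["End Time (s)"])
            & (same_file["End Time (s)"] > row["Start Time (s)"])
        )
        keep_mask.append(not overlaps.any())
    return pd.concat([human, proposed.loc[keep_mask]], ignore_index=True)


human = pd.DataFrame({
    "Filename": ["a.flac"],
    "Start Time (s)": [1.0],
    "End Time (s)": [2.0],
    "Label": ["species_x"],
})
proposed = pd.DataFrame({
    "Filename": ["a.flac", "a.flac", "b.flac"],
    "Start Time (s)": [1.5, 10.0, 1.5],
    "End Time (s)": [3.0, 11.0, 3.0],
    "Label": ["a_detect", "a_detect", "h_detect"],
})
merged = merge_with_precedence(human, proposed)
print(merged)
```

Only time overlap is checked here; a stricter version might also compare the frequency bounds.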

**Re-Running an existing Anqa Dataset**
* The main reasons to do this would be to change the length of the soundscapes, or take advantage of improved detection/cleaning models.
* Precedence will be given to the existing human-tagged labels over anything like automated bird or voice detection.
* Previously auto-labelled annotations will be over-written.

### Notebook Configuration
(Adjust as needed)

In [1]:
from pathlib import Path
project_root = str(Path().resolve().parent)

In [2]:
#Note: 'anqa_destn' must differ from source directory if saving new audio files

notebook_use = {'project_root': project_root,
                'anqa_destn': project_root + '/data/sample_label_formats/converted_to_anqa', 
                'source_type': 'birdclef', #'bc-zenodo', #any of: 'anqa', 'raven', 'birdclef', 'bc-zenodo', 'avianz', 'freebird'
                'debug': True,
                'max_seconds': 60,  #Segment and/or crop any data type to this maximum length
                'crop_method': 'keep_all', #any of 'keep_all', 'random',  in future: 'max_annotations', 'max_detections'
                'padding': None, # Any of 'pad', 'noise', 'zero', None
                'save_new_flac_files': True, # If true the original source will be re-written into .flac files of length max_seconds
                'tabular_save_suffix': 'parquet' #csv
                }

### Imports

In [3]:
import os, sys
from dotenv import load_dotenv
import random
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from IPython.display import display
import ipywidgets as widgets
import importlib
button = widgets.Button(description="Continue")
output = widgets.Output()

#%matplotlib widget  
torch.set_num_threads(1)

In [4]:
load_dotenv(dotenv_path = project_root + "/.env")

for p in os.getenv("PYTHONPATH", "").split(os.pathsep):
    if p:
        print(f' Adding Python path: {p}')
        sys.path.append(str(Path(p).resolve()))

 Adding Python path: ../../wildpytools/src
 Adding Python path: ./src


In [5]:
#Only doing like this for convenience whilst I work on WildPyTools
import wildpytools
import wildpytools.audio as wa
importlib.reload(wa)
importlib.reload(wildpytools.io)

SourceDataLoader = wa.SourceDataLoader
ToAnqa = wa.ToAnqa
anqa_to_raven_selections = wa.anqa_to_raven_selections
save_dataframe = wildpytools.io.save_dataframe

### Data Paths & Values (Simulating all possible options)

In [6]:
# These lines will over-write any existing values from the csv metadata file if it exists.  
# Comment out any items you want to leave unchanged.  
update_metadata_dict = {'collection' : 'birdclef_22',
                        'secondary_labels': "['sidekick_bird']",
                        'url' : 'some_web_address',
                        'latitude': 37.0,
                        'longitude': -118.5,
                        'author' : 'Mr Bigglesworth',
                        'licence': 'some_licence',
                        'recorded_on': '2025-12-23 09:15:30',
                        'reviewed_on' : '2025-12-23 09:15:30',
                        'reviewed_by': 'Joe Blogs',
                        'models_used' : "['model_1']"
                        }

anqa_data = {'audio_files': project_root + '/data/sample_label_formats/converted_to_anqa/audio',
              'metadata_path': project_root + '/data/sample_label_formats/converted_to_anqa/metadata.csv',
              'labels_path': project_root + '/data/sample_label_formats/converted_to_anqa/annotations.csv',
              'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
              'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
             }

raven_data = {'audio_files': project_root + '/data/sample_label_formats/raven_kokako_olly',
              'naming_csv': project_root + '/data/experiments/exp_2/exp_2_deploy/exp_2_bird_map.csv',
              'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
              'metadata_path' : None, #project_root + '/data/sample_label_formats/raven_kokako_olly/metadata.csv'
              'metadata_dict': update_metadata_dict,  # replace with None if not updating
              }

avianz_data = {'audio_files': project_root + '/data/sample_label_formats/avianz_janzoon_brian',
               'naming_csv': "filepath_needed",
               'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
              }

birdclef_data = {
                'audio_files': project_root + '/data/sample_label_formats/bc25_audio_files',
                'metadata_path': project_root + '/data/sample_label_formats/birdclef/metadata/train.csv',
                'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                }

#Post BirdCLEF annotated soundscapes on zenodo.org:   https://zenodo.org/search?q=birdclef&l=list&p=1&s=10&sort=bestmatch
bc_zenodo_data = {'audio_files': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/',
                  'metadata_dict': update_metadata_dict,
                  'labels_path': project_root + '/data/sample_label_formats/zenodo_birdclef/Sierra_Nevada/annotations.csv',
                  'naming_csv': project_root + '/data/sample_label_formats/birdclef/metadata/naming.csv',
                  'naming_cols' : {'source_name': 'eBird','save_name': 'eBird'},
                  }

freebird_data = {'audio_files': project_root + '/data/sample_label_formats/freebird_tier_1',
                 'naming_csv': project_root + '/data/sample_label_formats/freebird_species_list_05112013.csv',
                 'naming_cols' : {'source_name': 'Code', 'save_name': 'eBird'}
                }

data_sources = {'anqa' : anqa_data,
                'raven' : raven_data,
                'avianz' : avianz_data,
                'birdclef' : birdclef_data,
                'bc-zenodo' : bc_zenodo_data,
                'freebird' : freebird_data
                }

data_source = data_sources[notebook_use['source_type']]
allowed_loader_kwargs = ['metadata_path', 'metadata_dict', 'labels_path']
loader_kwargs = {
    key: val
    for key, val in data_source.items()
    if key in allowed_loader_kwargs and val is not None
}

In [7]:
class Config():
    def __init__(self, options: dict, data_sources: dict):
        self.SOURCE_DATA_TYPE = options['source_type']
        self.DEBUG = options['debug']
        self.MAX_SECONDS = options['max_seconds']
        self.CROP_METHOD = options['crop_method']
        self.PADDING = options['padding']
        self.SAVE_NEW_FLAC_FILES = options['save_new_flac_files']
        self.SAVE_RAVEN_SELECTIONS = False # User must manually change at end of the notebook if wanted
        self.SOURCE_NAME = self.SAVE_NAME = None
        _naming_cols = data_sources[self.SOURCE_DATA_TYPE].get('naming_cols')
        if _naming_cols is not None:
            self.SOURCE_NAME = _naming_cols.get('source_name')
            self.SAVE_NAME = _naming_cols.get('save_name')
        self.DESTN_DEPTH = options.get('destn_depth') or 2


class FilePaths():
    def __init__(self, notebook_use: dict, data_sources: dict):
        _data_type = notebook_use['source_type']
        _tabular_save_suffix = notebook_use['tabular_save_suffix']
        _source_cfg = data_sources[_data_type]
        _metadata = _source_cfg.get('metadata_path')
        _naming = _source_cfg.get('naming_csv')
        _annotations = _source_cfg.get('labels_path')
        
        self.PROJECT_ROOT = Path(notebook_use['project_root'])
        self.AUDIO_FILES = Path(data_sources[_data_type]['audio_files'])
        self.METADATA = Path(_metadata) if _metadata is not None else None
        self.NAMING_CSV = Path(_naming) if _naming is not None else None
        self.ANNOTATION_SRC = Path(_annotations) if _annotations is not None else None

        # Output directory (ensure it exists)
        self.ANQA_DESTN = Path(notebook_use.get('anqa_destn'))
        self.ANQA_DESTN.mkdir(parents=True, exist_ok=True)
        self.ANNOTATIONS_DESTN = self.ANQA_DESTN / f'annotations.{_tabular_save_suffix}'
        self.METADATA_DESTN = self.ANQA_DESTN / f'metadata.{_tabular_save_suffix}'

config = Config(notebook_use, data_sources)
paths = FilePaths(notebook_use, data_sources)

In [8]:
paths.METADATA

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/birdclef/metadata/train.csv')

In [9]:
if paths.METADATA:
    testing = pd.read_csv(paths.METADATA)
    display(testing.head(3))

Unnamed: 0,primary_label,secondary_labels,type,filename,collection,rating,url,latitude,longitude,scientific_name,common_name,author,license
0,12345,[''],['silly antics'],fake_folder/not_here.ogg,santa,0.0,,,,,,Olly,
1,1139490,[''],[''],1139490/CSA36385.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0
2,1139490,[''],[''],1139490/CSA36389.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0


In [10]:
if paths.ANNOTATION_SRC:
    testing = pd.read_csv(paths.ANNOTATION_SRC)
    display(testing.head())

In [11]:
class VoiceDetector():
    def __init__(self, chunk_len, threshold=0.1, no_voice=0, voice=20):
        # Load the Silero VAD model and its helper functions from torch.hub
        model, (get_speech_timestamps, _, read_audio, _, _) = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                                                 model='silero_vad', verbose=False)
        self.model = model
        self.chunk_len = chunk_len
        self.threshold = threshold
        self.no_voice = no_voice  # value assigned to samples without speech
        self.voice = voice        # value assigned to samples with detected speech
        self.get_stamps = get_speech_timestamps

    def detect(self, np_wav):
        speech_timestamps = self.get_stamps(torch.Tensor(np_wav), self.model, threshold=self.threshold)
        voice_detect = np.full_like(np_wav, self.no_voice)
        for st in speech_timestamps:
            voice_detect[st['start']: st['end']] = self.voice
    
        #downsample to match the power plot axis
        voice_detect = np.pad(voice_detect, 
                            (0, int(np.ceil(len(voice_detect) / self.chunk_len) * self.chunk_len - len(voice_detect))),
                            constant_values=self.no_voice)
        voice_detect = voice_detect.reshape((-1, self.chunk_len)).max(axis=1)  # Use max to preserve speech detection
        return voice_detect
    
def calc_signal_pwr(wav, chunk_len, sr=32000):
    power = wav ** 2 
    power = np.pad(power, (0, int(np.ceil(len(power) / chunk_len) * chunk_len - len(power))))
    power = power.reshape((-1, chunk_len)).sum(axis=1)
    return power


def detect_voice(labels: pd.DataFrame,
                 metadata: pd.DataFrame,
                 threshold: float = 0.5,
                 ):
    """Run a voice detection model over the spectrogram to try and locate any human-voice

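    Currently a placeholder that returns its inputs unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one row per bounding box
        metadata (pd.DataFrame): Metadata, one row per recording
        threshold (float): Detection threshold passed to the voice model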
    """
    return labels, metadata


def detect_animal_sound(labels: pd.DataFrame,
                        metadata: pd.DataFrame,
                        ):

    """Run an object detection model over spectrograms to propose any missing labels for bird or animal sounds

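    Currently a placeholder that returns its inputs unchanged.

    Args:
        labels (pd.DataFrame): Annotations, one row per bounding box
        metadata (pd.DataFrame): Metadata, one row per recording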
    """
    return labels, metadata
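Both `VoiceDetector.detect` and `calc_signal_pwr` rely on the same pad-then-reshape trick to reduce a sample-level signal to one value per chunk. The sketch below isolates that step on a toy array so the behaviour is easy to check; `chunk_reduce` is a hypothetical helper written for this illustration.

```python
import numpy as np


def chunk_reduce(signal, chunk_len, reducer=np.sum):
    """Zero-pad `signal` to a multiple of `chunk_len`, then reduce each chunk."""
    pad = (-len(signal)) % chunk_len          # samples needed to fill the last chunk
    padded = np.pad(signal, (0, pad))
    return reducer(padded.reshape(-1, chunk_len), axis=1)


wav = np.array([1.0, -1.0, 2.0, 0.0, 3.0])

# Summed power per chunk, as in calc_signal_pwr
power = chunk_reduce(wav ** 2, chunk_len=2)

# Per-chunk detection flag, using max so a chunk is flagged if any sample is
mask = np.zeros_like(wav)
mask[2:4] = 1
flags = chunk_reduce(mask, chunk_len=2, reducer=np.max)

print(power, flags)
```

Using `max` for the detection mask and `sum` for the power keeps the two outputs on the same time axis, so they can be plotted together.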

## Load source data

In [12]:
if paths.METADATA is not None:
    meta_df = pd.read_csv(paths.METADATA)
    display(meta_df.head(3))
    display(f'The metadata dataframe has length {len(meta_df)}')

Unnamed: 0,primary_label,secondary_labels,type,filename,collection,rating,url,latitude,longitude,scientific_name,common_name,author,license
0,12345,[''],['silly antics'],fake_folder/not_here.ogg,santa,0.0,,,,,,Olly,
1,1139490,[''],[''],1139490/CSA36385.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0
2,1139490,[''],[''],1139490/CSA36389.ogg,CSA,0.0,http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Ragoniella pulchella,Ragoniella pulchella,Fabio A. Sarria-S,cc-by-nc-sa 4.0


'The metadata dataframe has length 28565'

Set up the mapping between label names in the source data and the names saved in the new dataset

In [13]:
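# Default to None so later cells still run when no naming file is configured
# (assumes SourceDataLoader and ToAnqa accept a map of None)
save_name_map = None
if paths.NAMING_CSV is not None:
    source_names = pd.read_csv(paths.NAMING_CSV)
    source_col = config.SOURCE_NAME
    save_col = config.SAVE_NAME
    save_name_map = dict(zip(source_names.loc[:, source_col], source_names.loc[:, save_col]))
    display(source_names.head(3))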

Unnamed: 0,eBird,CommonName,ScientificName,Class
0,1139490,Ragoniella pulchella,Ragoniella pulchella,insecta
1,1192948,Oxyprora surinamensis,Oxyprora surinamensis,insecta
2,1194042,Copiphora colombiae,Copiphora colombiae,insecta
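To make the role of the mapping concrete, the sketch below builds a map the same way (`dict(zip(...))`) and applies it to a column of labels. The codes are invented, and leaving unmapped labels unchanged is an assumption made for this illustration; how `SourceDataLoader` actually treats unmapped names is not shown in this notebook.

```python
import pandas as pd

# Hypothetical naming table: source codes and the names to save under
naming = pd.DataFrame({
    "Code": ["src_a", "src_b"],
    "eBird": ["save_a", "save_b"],
})
name_map = dict(zip(naming["Code"], naming["eBird"]))

labels = pd.Series(["src_a", "src_b", "mystery"], name="Label")

# Map known codes; leave anything unmapped as it was (assumed behaviour)
mapped = labels.map(name_map).fillna(labels)

# Report the labels that had no entry in the naming table
unmapped = sorted(set(labels) - set(name_map))
print(mapped.tolist(), unmapped)
```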


In [14]:
paths.AUDIO_FILES

PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/bc25_audio_files')

In [15]:
dataset_loader = SourceDataLoader(source_type = config.SOURCE_DATA_TYPE,
                                  rename_map = save_name_map,
                                  )
print(f'Loading source data of the type {config.SOURCE_DATA_TYPE}')

df_label, df_meta = dataset_loader.load_data(paths.AUDIO_FILES, **loader_kwargs)

if config.DEBUG:
    df_label = df_label.head(40)
    df_meta = df_meta.head(40)

Loading source data of the type birdclef
[PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/bc25_audio_files/saffin/XC528006.ogg'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/bc25_audio_files/tropar/XC541300.ogg'), PosixPath('/home/olly/Desktop/Kaytoo/data/sample_label_formats/bc25_audio_files/blhpar1/XC415538.ogg')]



Note that any 'secondary' labels are ignored at this point, but where a primary/secondary labelling exists it is retained in the metadata.

In [16]:
df_label.head(3)

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Rating,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz),Filepath
1,1139490/CSA36385.ogg,,,,,1139490,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
2,1139490/CSA36389.ogg,,,,,1139490,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...
3,1192948/CSA36358.ogg,,,,,1192948,[''],,0.0,,,,/home/olly/Desktop/Kaytoo/data/sample_label_fo...


In [17]:
df_meta.head(3)

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
1,1139490/CSA36385.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36385.ogg,,,[]
2,1139490/CSA36389.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3206,-73.7128,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36389.ogg,,,[]
3,1192948/CSA36358.ogg,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3791,-73.7313,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1192948/CSA36358.ogg,,,[]


In [18]:
df_meta.shape

(40, 15)

## Pre-filter the audio files

* Voice detection
* Bird-call detection

In [19]:
df_label, df_meta = detect_voice(df_label, df_meta, threshold=0.5)
df_label, df_meta = detect_animal_sound(df_label, df_meta)

## Convert the source data to Anqa Format

In [20]:
converter = ToAnqa(
            source_dir=paths.AUDIO_FILES,
            destn_dir=paths.ANQA_DESTN,
            name_map=save_name_map,
            save_audio=config.SAVE_NEW_FLAC_FILES,
            max_seconds=config.MAX_SECONDS,
            end_padding=config.PADDING,
            crop_method=config.CROP_METHOD,
            n_jobs=4,
            destn_depth=config.DESTN_DEPTH,
            )

df_labels, df_meta = converter.convert_all(df_label, df_meta)

Converting recordings: 100%|██████████| 40/40 [00:05<00:00,  6.70it/s]


In [21]:
df_labels.head()

Unnamed: 0,Filename,Start Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Label,Type,Sex,Rating,Delta Time (s),Delta Freq (Hz),Avg Power Density (dB FS/Hz)
0,CSA36385_from_0.flac,0.0,60.0,0.0,16000.0,1139490,[''],,0.0,60.0,16000.0,-43.0
1,CSA36385_from_60.flac,0.0,38.0,0.0,16000.0,1139490,[''],,0.0,38.0,16000.0,-42.9
2,CSA36389_from_0.flac,0.0,60.0,0.0,16000.0,1139490,[''],,0.0,60.0,16000.0,-43.9
3,CSA36389_from_60.flac,0.0,36.0,0.0,16000.0,1139490,[''],,0.0,36.0,16000.0,-42.5
4,CSA36358_from_0.flac,0.0,60.0,0.0,16000.0,1192948,[''],,0.0,60.0,16000.0,-45.8


In [22]:
df_meta.head()

Unnamed: 0,filename,collection,secondary_labels,url,latitude,longitude,author,license,recorded_on,reviewed_by,reviewed_on,source_filename,source_start_s,source_end_s,models_used
0,CSA36385_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36385.ogg,0.0,60.0,[]
1,CSA36385_from_60.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36385.ogg,60.0,98.9,[]
2,CSA36389_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36389.ogg,0.0,60.0,[]
3,CSA36389_from_60.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.3,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1139490/CSA36389.ogg,60.0,96.5,[]
4,CSA36358_from_0.flac,CSA,[],http://colecciones.humboldt.org.co/rec/sonidos...,7.4,-73.7,Fabio A. Sarria-S,cc-by-nc-sa 4.0,NaT,,NaT,1192948/CSA36358.ogg,0.0,60.0,[]
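The outputs above suggest that, with `crop_method='keep_all'`, each source recording is split into consecutive segments of at most `max_seconds`, named `<stem>_from_<start>.flac`. The sketch below reproduces that pattern for a 98.9-second recording. It is inferred from the displayed output only; `segment_bounds` and `segment_name` are hypothetical helpers, not the `ToAnqa` code.

```python
from pathlib import Path


def segment_bounds(duration, max_seconds=60.0):
    """Split [0, duration] into consecutive segments of at most max_seconds."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + max_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


def segment_name(source_filename, start):
    """Build a segment filename following the pattern seen in the output above."""
    return f"{Path(source_filename).stem}_from_{int(start)}.flac"


bounds = segment_bounds(98.9)
names = [segment_name("1139490/CSA36385.ogg", start) for start, _ in bounds]
print(bounds, names)
```

The `source_start_s` and `source_end_s` metadata columns correspond to these bounds, which makes it possible to trace a segment back to its position in the original file.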


In [23]:
save_dataframe(df_meta, paths.METADATA_DESTN, index=False)
save_dataframe(df_labels, paths.ANNOTATIONS_DESTN, index=False)

### (Optional) Create `.selections` for viewing or editing in Raven
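`config.SAVE_RAVEN_SELECTIONS` is `False` by default, so the first cell below does nothing on a normal run.  To produce `.selections.txt` files for viewing in [Raven](https://www.ravensoundsoftware.com/), run the final cell to set the flag to `True`, then re-run the first cell manually.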

In [24]:
if config.SAVE_RAVEN_SELECTIONS:
    anqa_to_raven_selections(df_labels, paths.ANQA_DESTN)  #paths.new_raven_dataset, df_labels
    selections = list(paths.ANQA_DESTN.rglob('*.selections.txt'))
    display(selections[:3])
    random_file = random.choice(selections)
    selection = pd.read_csv(random_file, sep="\t")
    display(selection.head())

In [25]:
config.SAVE_RAVEN_SELECTIONS = True