# Segments Extraction  

This notebook extracts annotated audio segments from the official Tovanella recordings, using the `Bird_tags_Train.mat` annotation file, and from WABAD. Since BirdNET analyzes 3-second clips, every extracted segment is 3 seconds long.  

Segments are generated with a 50% overlap, shifting by 1.5 seconds between consecutive clips.  
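
The windowing described above can be sketched as follows. This is a minimal illustration, not the code in `birdlib.utils`: the function name `window_starts` and the choice to drop a trailing partial window are assumptions.

```python
def window_starts(total_duration, window=3.0, hop=1.5):
    """Return the start time (in seconds) of every full window in a recording.

    Assumption: windows that would run past the end of the file are dropped.
    """
    if total_duration < window:
        return []
    n_windows = int((total_duration - window) // hop) + 1
    return [i * hop for i in range(n_windows)]


starts = window_starts(600.0)
print(len(starts))   # 399
print(starts[:3])    # [0.0, 1.5, 3.0]
print(starts[-1])    # 597.0
```

Under these assumptions a 600-second recording yields 399 windows and a 60-second one yields 39. That is consistent with the step counts in the progress bars later in this notebook, if the recordings have those durations (which is not stated here).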

## Extraction Process:
1. **`species_dict`**: maps scientific names to common names for all species.  
2. **`category_annots.json`** & **`audio_annots.json`**: store segment annotations for each species in every audio file.  
3. **`audio_info.json`**: provides total duration and sampling rate for each recording.  
4. **`true_segments.json`**: lists the species present in each extracted segment.  

Unannotated segments can be included (labeled as `"None"`) by enabling `generate_None`, treating them as a non-species class.  
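
A rough sketch of how a window could be labeled, including the `"None"` fallback. The real rule lives in `utils.generate_true_segments` and is not shown in this notebook, so the "any overlap counts" criterion, the function name `label_window`, and the sample annotation are all assumptions made for illustration.

```python
def label_window(win_start, annotations, window=3.0):
    """List the labels of all annotations overlapping [win_start, win_start + window).

    Assumption: any overlap, however short, marks the species as present.
    Windows with no overlapping annotation get the "None" label.
    """
    win_end = win_start + window
    labels = set()
    for ann in annotations:
        ann_start = ann["start_time"]
        ann_end = ann_start + ann["duration"]
        if ann_start < win_end and ann_end > win_start:
            labels.add(ann["label"])
    return sorted(labels) if labels else ["None"]


# hypothetical annotation, loosely based on the example printed further down
annots = [{"start_time": 24.4, "duration": 2.0, "label": "Turdus philomelos_Song Thrush"}]

print(label_window(24.0, annots))  # ['Turdus philomelos_Song Thrush']
print(label_window(0.0, annots))   # ['None']
```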

WABAD spans multiple recording sites, so it is handled differently: only the sites that contain the study species are processed.


In [14]:
import os
import json
import csv

import pandas as pd
import scipy.io
# the plain `import utils` was dropped: it was immediately shadowed by this import
from birdlib import utils

In [15]:
DATASET_NAME = 'dataset'
DATASET_PATH = f'/home/giacomoschiavo/segments/{DATASET_NAME}'
AUDIO_SOURCE = '/home/giacomoschiavo/Tovanella'

# Update! Birds_tags_Train_2.mat
In this new dataset, some files are named in this form:

"< code > - < common species > - < scientific name >.mp3" -> "XC636429 - Merlo dal collare - Turdus torquatus.wav"

To simplify processing, they are renamed to "< code >_0.WAV" -> "XC636429_0.WAV"
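
The renaming rule can be sketched as a pure function, which is easy to check before touching any files. The helper name `short_name` is hypothetical; the output follows the `{code}_0.WAV` pattern used by the rename cell below, and names that do not match the three-part pattern are returned unchanged.

```python
import os


def short_name(filename, suffix="_0", ext=".WAV"):
    """Map "<code> - <common name> - <scientific name>.<ext>" to "<code>_0.WAV".

    Filenames that do not split into exactly three " - " separated parts
    are returned unchanged.
    """
    stem, _ = os.path.splitext(filename)
    parts = stem.split(" - ")
    if len(parts) != 3:
        return filename
    return f"{parts[0].strip()}{suffix}{ext}"


print(short_name("XC636429 - Merlo dal collare - Turdus torquatus.wav"))  # XC636429_0.WAV
print(short_name("20190607_030000.WAV"))                                  # 20190607_030000.WAV
```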

First, we need to convert .mp3 to .wav

In [9]:
from pydub import AudioSegment
tovanella_path = '/home/giacomoschiavo/Tovanella'

def convert_mp3_to_wav(path):
    for filename in os.listdir(path):
        if filename.endswith(".mp3"):
            mp3_path = os.path.join(path, filename)
            wav_filename = os.path.splitext(filename)[0] + ".WAV"
            wav_path = os.path.join(path, wav_filename)

            try:
                audio = AudioSegment.from_mp3(mp3_path)
                audio.export(wav_path, format="wav")
                print(f"✅ Converted: {filename} → {wav_filename}")
            except Exception as e:
                print(f"Error with {filename}: {e}")

# convert_mp3_to_wav(tovanella_path)


In [10]:
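for audio in os.listdir(tovanella_path):
    # only rename files named "<code> - <common name> - <scientific name>";
    # checking for ' - ' (not just '-') matches the separator used by split below
    if audio.upper().endswith('.WAV') and ' - ' in audio:
        code = audio.split(' - ')[0]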
        print(audio)
        os.rename(
            os.path.join(tovanella_path, audio),
            os.path.join(tovanella_path, f'{code}_0.WAV')
        )

# Species Dict
Create a dictionary to map the scientific name -> common name

In [16]:
species_dict = utils.get_species_dict("utils/BirdNET_GLOBAL_6K_V2.4_Labels_en_uk.txt")
# export species_dict to json
# with open('utils/species_dict_map.json', 'w') as f:
#     json.dump(species_dict, f)

# Category and Audio Annotation Files

These files hold the annotation data, organized in two different ways.

**`category_annots.json`**: species-centric. For each species, it lists every annotation found across all audio recordings in the Tovanella folder.

**`audio_annots.json`**: audio-centric. For each audio file, it lists every annotation found in that recording.
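
To illustrate how the two views relate, here is a small sketch that derives a species-centric dictionary from an audio-centric one. The entry fields (`start_time`, `duration`, `label`, `file_name`) are borrowed from the WABAD cell later in this notebook. The exact layout produced by `utils.get_audio_category_annots` is not shown, so treat the structure, the helper name `to_category_view`, and the sample values as assumptions.

```python
from collections import defaultdict


def to_category_view(audio_annots):
    """Regroup {file_name: [annotation, ...]} into {label: [annotation, ...]}."""
    category_annots = defaultdict(list)
    for file_name, entries in audio_annots.items():
        for entry in entries:
            category_annots[entry["label"]].append({
                "file_name": file_name,
                "start_time": entry["start_time"],
                "duration": entry["duration"],
                "label": entry["label"],
            })
    return dict(category_annots)


# hypothetical audio-centric annotations
audio_view = {
    "a.WAV": [
        {"start_time": 1.0, "duration": 2.0, "label": "Parus major_Great Tit"},
        {"start_time": 5.0, "duration": 1.0, "label": "Periparus ater_Coal Tit"},
    ],
    "b.WAV": [
        {"start_time": 0.5, "duration": 3.0, "label": "Parus major_Great Tit"},
    ],
}

category_view = to_category_view(audio_view)
print(sorted(category_view))                        # ['Parus major_Great Tit', 'Periparus ater_Coal Tit']
print(len(category_view["Parus major_Great Tit"]))  # 2
```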

In [17]:
# extract annotations from the given file
bird_tags = scipy.io.loadmat('Birds_tags_Train_2.mat')["Bird_tags"] 
# visualize an example, showing all the properties
for i, prop in enumerate(bird_tags[12][0][0][0]):
    print(i, prop)

0 ['Turdus_philomelos']
1 ['20190607_030000.WAV']
2 [[ 5.05964467  1.72461929  1.72461929  5.05964467 24.41782537 26.39896524]]
3 [[24.41782537  5.05964467]
 [24.41782537  1.72461929]
 [26.39896524  1.72461929]
 [26.39896524  5.05964467]
 [24.41782537  5.05964467]]
4 [[2]]
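
The polygon in property 3 looks like a bounding box with one `(time, frequency)` pair per row. That reading is an assumption based only on the values printed above, and the frequency unit is not given. Under that assumption, a start time and duration could be derived as in this sketch (the helper name `polygon_to_interval` is hypothetical):

```python
def polygon_to_interval(polygon):
    """Collapse a list of (time, frequency) points into a time/frequency box.

    Assumption: column 0 is time in seconds and column 1 is frequency.
    """
    times = [point[0] for point in polygon]
    freqs = [point[1] for point in polygon]
    start = min(times)
    return {
        "start_time": start,
        "duration": max(times) - start,
        "freq_low": min(freqs),
        "freq_high": max(freqs),
    }


# values copied from the example printed above
polygon = [
    [24.41782537, 5.05964467],
    [24.41782537, 1.72461929],
    [26.39896524, 1.72461929],
    [26.39896524, 5.05964467],
    [24.41782537, 5.05964467],
]

box = polygon_to_interval(polygon)
print(box["start_time"])          # 24.41782537
print(round(box["duration"], 3))  # 1.981
```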


In [18]:
category_annots, audio_annots, miss = utils.get_audio_category_annots("Bird_tags_Train.mat", AUDIO_SOURCE, species_dict)
category_annots_2, audio_annots_2, miss_2 = utils.get_audio_category_annots("Birds_tags_Train_2.mat", AUDIO_SOURCE, species_dict)

# category_annots.update(category_annots_2)
# audio_annots.update(audio_annots_2)
# with open("utils/category_annots.json", "w") as f:
#     json.dump(category_annots, f)
# with open("utils/audio_annots.json", "w") as f:
#     json.dump(audio_annots, f)
# with open("utils/category_annots_2.json", "w") as f:
#     json.dump(category_annots_2, f)
# with open("utils/audio_annots_2.json", "w") as f:
#     json.dump(audio_annots_2, f)


In [19]:
category_annots_test, audio_annots_test, missing = utils.get_audio_category_annots("Bird_tags_Test.mat", AUDIO_SOURCE, species_dict)

# with open("utils/category_annots_test.json", "w") as f:
#     json.dump(category_annots_test, f)
# with open("utils/audio_annots_test.json", "w") as f:
#     json.dump(audio_annots_test, f)

In [20]:
missing, miss, miss_2

(set(),
 set(),
 {'20190607_030000.WAV', '20190607_040000.WAV', '20190607_050000.WAV'})

In [21]:
# creates species list
species_list = list(category_annots.keys())

# Segments Creation
Creates all the segments listed in Category Info

In [None]:
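def load_or_generate_info(filename, annots, audio_source, save_path):
    # Despite the name, nothing is loaded here: the info is always
    # regenerated from the audio files and the JSON file is overwritten.
    full_path = os.path.join(save_path, filename)
    info = utils.generate_audio_info(audio_source, annots)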
    with open(full_path, 'w') as f:
        json.dump(info, f)
    return info

In [None]:
audio_info = load_or_generate_info('audio_info.json', audio_annots, AUDIO_SOURCE, 'utils')
audio_info_test = load_or_generate_info('audio_info_test.json', audio_annots_test, AUDIO_SOURCE, 'utils')

In [None]:
true_segments_train = utils.generate_true_segments(audio_annots, audio_info)
true_segments_test = utils.generate_true_segments(audio_annots_test, audio_info_test)

In [None]:
import copy
true_segments = copy.deepcopy(true_segments_train)
true_segments.update(true_segments_test)

In [None]:
# SAVE
os.makedirs(f'utils/{DATASET_NAME}', exist_ok=True)
with open(f'utils/{DATASET_NAME}/true_segments_train.json', 'w') as f:
    json.dump(true_segments_train, f)
with open(f'utils/{DATASET_NAME}/true_segments_test.json', 'w') as f:
    json.dump(true_segments_test, f)
with open(f'utils/{DATASET_NAME}/true_segments.json', 'w') as f:
    json.dump(true_segments, f)


In [None]:
utils.generate_segments(audio_source_path=AUDIO_SOURCE,
                  target_path=f"{DATASET_PATH}/train",
                  true_segments=true_segments_train,
                  audio_info=audio_info,
                  generate_None=True)

Processing segments for 20190621_010000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 2489.38it/s]
Processing segments for 20190621_020000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 2537.55it/s]
Processing segments for 20190621_030000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1848.82it/s]
Processing segments for 20190621_040000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1396.85it/s]
Processing segments for 20190621_050000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1677.69it/s]
Processing segments for 20190621_060000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1547.49it/s]
Processing segments for 20190621_070000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1502.15it/s]
Processing segments for 20190621_080000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 1903.92it/s]
Processing segments for 20190621_090000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 2147.36it/s]
Processing segments

In [None]:
utils.generate_segments(audio_source_path=AUDIO_SOURCE,
                  target_path=f"{DATASET_PATH}/test",
                  true_segments=true_segments_test,
                  audio_info=audio_info_test,
                  generate_None=True)

Processing segments for 20190601_000000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 735.08it/s]
Processing segments for 20190601_030000.WAV...: 100%|██████████| 399/399 [00:01<00:00, 390.55it/s]
Processing segments for 20190601_040000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 614.74it/s]
Processing segments for 20190601_050000.WAV...: 100%|██████████| 399/399 [00:01<00:00, 352.68it/s]
Processing segments for 20190601_060000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 748.63it/s]
Processing segments for 20190601_070000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 591.56it/s]
Processing segments for 20190601_080000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 660.65it/s]
Processing segments for 20190601_090000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 555.30it/s]
Processing segments for 20190601_100000.WAV...: 100%|██████████| 399/399 [00:00<00:00, 644.58it/s]
Processing segments for 2

In [24]:
# count segments by species
target_path = f"{DATASET_PATH}/train"
species_count = {species: len(os.listdir(os.path.join(target_path, species))) for species in os.listdir(target_path)}
species_count_df = pd.DataFrame(list(species_count.items()), columns=["Species", "Count"])
species_count_df.sort_values(by="Count", ascending=False).reset_index(drop=True)

|  | Species | Count |
|---|---|---|
| 0 |  | 13374 |
| 1 | Fringilla coelebs_Common Chaffinch | 7901 |
| 2 | Parus major_Great Tit | 3859 |
| 3 | Sylvia atricapilla_Eurasian Blackcap | 3742 |
| 4 | Regulus ignicapilla_Common Firecrest | 2777 |
| 5 | Phylloscopus collybita_Common Chiffchaff | 2014 |
| 6 | Cuculus canorus_Common Cuckoo | 1684 |
| 7 | Erithacus rubecula_European Robin | 1554 |
| 8 | Troglodytes troglodytes_Eurasian Wren | 1159 |
| 9 | Periparus ater_Coal Tit | 883 |


# WABAD Segments Extraction

WABAD segments are extracted with a strategy similar to the one above, adapted to how the dataset is organized: one folder per recording site, each with its own Raven Pro annotation files.

First, only the **less represented species** are targeted. In this analysis, these are the species with **at most 1000 segments** in the training set (`minority_threshold = 1000`); the "non-species" class is excluded.

Next, annotations are extracted directly from WABAD. The site datasets to download are listed in the `wabad_datasets.txt` file. Once the annotations are gathered, audio segments are created with the **same pipeline** used previously: build the category and audio annotations (`category_annots`, `audio_annots`), save the audio details (`audio_info`), and generate the labeled segments (`true_segments`).

Finally, because unannotated segments are already abundant (more than 10,000 "None" samples), they are excluded from this analysis.

In [23]:
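# get the scientific names from the category_annots keys (labels look like
# "<scientific name>_<common name>"); labels without "_" are not species and are skipped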
species_common_name_list = [species.split("_")[0] for species in list(category_annots.keys()) if len(species.split("_")) > 1]

In [25]:
minority_threshold = 1000
train_folder = '/home/giacomoschiavo/segments/dataset/train'
train_species = os.listdir(train_folder)
species_count = {species : len(os.listdir(os.path.join(train_folder, species))) for species in train_species}
species_to_augment = [species.split("_")[0] for species in train_species if species_count[species] <= minority_threshold and len(species.split("_")) != 1]
species_to_augment

['Muscicapa striata',
 'Periparus ater',
 'Regulus regulus',
 'Anthus trivialis',
 'Certhia familiaris',
 'Coccothraustes coccothraustes',
 'Turdus merula',
 'Loxia curvirostra',
 'Lophophanes cristatus']

In [26]:
# 1. Locate site dataset list: `wabad_datasets.txt` in the `utils` folder.
# 2. Manually download and extract ALL listed datasets.
# 3. Place the extracted datasets into the designated `WABAD` folder.

# this is an example of the outcome
# E:\Giacomo\Tovanella\WABAD\BAM\BAM\Raven Pro annotations\BAM_20151116_060801.txt
# WABAD_PATH = "E:/Giacomo/Tovanella/WABAD"
WABAD_PATH = "/home/giacomoschiavo/WABAD/audio"

In [27]:
def extract_wabad_info(folder_path):
    # folder_path = ".../WABAD/BIAL/BIAL/Raven Pro annotations"
    audio_info_wabad = {}
    category_info_wabad = {}
    for txt_file in os.listdir(folder_path):
        complete_path = os.path.join(folder_path, txt_file)
        with open(complete_path, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile, delimiter='\t') 
            for row in reader:
                if row["Species"] not in species_to_augment or "End Time (s)" not in row:
                    continue
                file_name = txt_file.replace(".txt", ".wav")
                start_time = float(row["Begin Time (s)"])
                end_time = float(row["End Time (s)"])
                duration = end_time - start_time
                common_name = species_dict[row["Species"]]
                label = f"{row['Species']}_{common_name}"

                audio_info_entry = {
                    "scientific_name": row["Species"],
                    "common_name": common_name, 
                    "start_time": start_time,
                    "duration": duration,
                    "label": label
                }
                category_info_entry = {
                    "file_name": file_name,
                    "start_time": start_time,
                    "duration": duration,
                    "label": label
                }
                
                if file_name not in audio_info_wabad:
                    audio_info_wabad[file_name] = []
                if label not in category_info_wabad:
                    category_info_wabad[label] = []
                audio_info_wabad[file_name].append(audio_info_entry)
                category_info_wabad[label].append(category_info_entry)
    return audio_info_wabad, category_info_wabad

In [28]:
audio_annots_wabad = {}
category_annots_wabad = {}

for folder in os.listdir(WABAD_PATH):
    if not os.path.isdir(os.path.join(WABAD_PATH, folder)):
        continue
    annotations = os.path.join(WABAD_PATH, folder, folder, "Raven Pro annotations")
    audio_info_update, category_info_update = extract_wabad_info(annotations)
    for audio in audio_info_update.keys():
        if audio not in audio_annots_wabad:
            audio_annots_wabad[audio] = []
        audio_annots_wabad[audio].extend(audio_info_update[audio])
    for category in category_info_update.keys():
        if category not in category_annots_wabad:
            category_annots_wabad[category] = []
        category_annots_wabad[category].extend(category_info_update[category])    

os.makedirs("utils/WABAD", exist_ok=True)  # make sure the output folder exists
with open("utils/WABAD/audio_annots_wabad.json", 'w', encoding='utf-8') as jsonfile:
    json.dump(audio_annots_wabad, jsonfile)

with open("utils/WABAD/category_annots_wabad.json", 'w', encoding='utf-8') as jsonfile:
    json.dump(category_annots_wabad, jsonfile)

In [29]:
# show contribution of WABAD for every species
species_count_wabad = {species_name: len(segms) for species_name, segms in category_annots_wabad.items()}
species_count_wabad_df = pd.DataFrame(list(species_count_wabad.items()), columns=["Species", "Count WABAD"])
merged_df = pd.merge(species_count_df, species_count_wabad_df, on="Species", how="inner")
merged_df.sort_values(by=["Count"], ascending=False)

|  | Species | Count | Count WABAD |
|---|---|---|---|
| 1 | Periparus ater_Coal Tit | 883 | 768 |
| 6 | Turdus merula_Eurasian Blackbird | 658 | 2308 |
| 3 | Anthus trivialis_Tree Pipit | 658 | 176 |
| 2 | Regulus regulus_Goldcrest | 582 | 437 |
| 8 | Lophophanes cristatus_Crested Tit | 490 | 132 |
| 4 | Certhia familiaris_Eurasian Treecreeper | 372 | 96 |
| 5 | Coccothraustes coccothraustes_Hawfinch | 341 | 59 |
| 7 | Loxia curvirostra_Common Crossbill | 280 | 25 |
| 0 | Muscicapa striata_Spotted Flycatcher | 256 | 117 |


In [33]:
folder_path, '/home/giacomoschiavo/WABAD/audio/HAR/HAR/Recordings'

('/home/giacomoschiavo/WABAD/audio/HAR/HAR/Recordings',
 '/home/giacomoschiavo/WABAD/audio/HAR/HAR/Recordings')

In [36]:
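# copy all annotated WABAD audio into a single folder -> run "move_files.py" in the VM
import shutil

WABAD_PATH = "/home/giacomoschiavo/WABAD/audio"
WABAD_AUDIO_SOURCE = "/home/giacomoschiavo/WABAD/all_wabad_audio"
os.makedirs(WABAD_AUDIO_SOURCE, exist_ok=True)  # shutil.copy fails if the destination folder is missing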
for folder in os.listdir(WABAD_PATH):
    if not os.path.isdir(os.path.join(WABAD_PATH, folder)):
        continue
    # ...\BAM\BAM\Recordings
    folder_path = os.path.join(WABAD_PATH, folder, folder, "Recordings")
    all_audio = os.listdir(folder_path)
    for audio in all_audio:
        if audio in audio_annots_wabad.keys():
            if os.path.exists(os.path.join(WABAD_AUDIO_SOURCE, audio)):
                continue
            # os.rename(
            #     os.path.join(folder_path, audio),
            #     os.path.join(WABAD_AUDIO_SOURCE, audio)
            # )
            shutil.copy(
                os.path.join(folder_path, audio), 
                os.path.join(WABAD_AUDIO_SOURCE, audio)
            )
    

In [37]:
# with open("utils/audio_info_wabad.json") as f:
#     audio_info_wabad = json.load(f)

audio_info_wabad = utils.generate_audio_info(WABAD_AUDIO_SOURCE, audio_annots_wabad)
with open("utils/WABAD/audio_info_wabad.json", "w") as f:
    json.dump(audio_info_wabad, f)

In [38]:
true_segments_wabad = utils.generate_true_segments(audio_annots_wabad, audio_info_wabad)
with open("utils/WABAD/true_segments_wabad.json", "w") as f:
    json.dump(true_segments_wabad, f)

In [39]:
# here we generate the segments for WABAD in WABAD_SEGMENTS_PATH folder
WABAD_SEGMENTS_PATH = "/home/giacomoschiavo/WABAD/segments"
os.makedirs(WABAD_SEGMENTS_PATH, exist_ok=True)

In [40]:
utils.generate_segments(WABAD_AUDIO_SOURCE, WABAD_SEGMENTS_PATH, true_segments_wabad, audio_info_wabad, generate_None=False)

Processing segments for PITI_20220313_070800.wav...: 100%|██████████| 39/39 [00:00<00:00, 2720.67it/s]
Processing segments for OESF_20230518_060601.wav...: 100%|██████████| 39/39 [00:00<00:00, 3824.24it/s]
Processing segments for OESF_20230611_144932.wav...: 100%|██████████| 39/39 [00:00<00:00, 6197.30it/s]
Processing segments for PINA_20220603_082300.wav...: 100%|██████████| 39/39 [00:00<00:00, 3068.43it/s]
Processing segments for PINA_20220506_073700.wav...: 100%|██████████| 39/39 [00:00<00:00, 2572.34it/s]
Processing segments for PINA_20220502_083700.wav...: 100%|██████████| 39/39 [00:00<00:00, 3175.71it/s]
Processing segments for PINA_20220504_070800.wav...: 100%|██████████| 39/39 [00:00<00:00, 4982.72it/s]
Processing segments for PINA_20220502_083200.wav...: 100%|██████████| 39/39 [00:00<00:00, 2849.14it/s]
Processing segments for PINA_20220601_083600.wav...: 100%|██████████| 39/39 [00:00<00:00, 6081.

In [41]:
species_count_wabad_fr = {}
for species in os.listdir(WABAD_SEGMENTS_PATH):
    species_count_wabad_fr[species] = len(os.listdir(os.path.join(WABAD_SEGMENTS_PATH, species)))

species_count_wabad_fr_df = pd.DataFrame(list(species_count_wabad_fr.items()), columns=["Species", "Count WABAD FR"])
merged_df = pd.merge(species_count_df, species_count_wabad_fr_df, on="Species", how="inner")
merged_df.sort_values(by=["Count"], ascending=False)

|  | Species | Count | Count WABAD FR |
|---|---|---|---|
| 5 |  | 13374 | 18110 |
| 6 | Parus major_Great Tit | 3859 | 0 |
| 2 | Cuculus canorus_Common Cuckoo | 1684 | 0 |
| 1 | Periparus ater_Coal Tit | 883 | 0 |
| 4 | Anthus trivialis_Tree Pipit | 658 | 0 |
| 9 | Turdus merula_Eurasian Blackbird | 658 | 4301 |
| 3 | Regulus regulus_Goldcrest | 582 | 0 |
| 11 | Lophophanes cristatus_Crested Tit | 490 | 0 |
| 7 | Certhia familiaris_Eurasian Treecreeper | 372 | 0 |
| 8 | Coccothraustes coccothraustes_Hawfinch | 341 | 0 |
