### Uncertainty sampling with filtering & class-specific dynamic adjustments 

The .txt files no longer need to be concatenated, because all predictions are now condensed into a single large .txt file.
We need to select the 32 top-scoring clips for each class.

Iteration 0 is the simple case. No filter is needed because no field-data clips have been incorporated yet, and there is no class precision to check because no previous evaluation exists. For iteration 0 (base) it is probably better to use the base code. **Just select the 32 top-scoring clips per class.**

Later iterations need a dataframe (the training set tracker) to filter out the files already included in the training classes. They also need a dataframe with the precision of each class, computed beforehand. This will only be included after iteration 1, so iteration 1 does not use this part, but subsequent iterations do. If the precision of a class is at or below 0.5 (≤ 50%), select only the 32 top-scoring clips (after filtering out the already included clips). If the precision of a class is above 0.5 (> 50%), select the 64 top-scoring clips for that class.


Goals:

* Select the 32 (or 64) top-scoring clips per class for validation.
* Adjust the number of clips based on the previous iteration's precision:
    - Classes with precision > 50%: select 64 clips, split (manually?) into:
        - Top 32 for computing precision.
        - Next 32 for potential inclusion in the training set.
    - Classes with precision ≤ 50%: select 32 clips.
* Filter out clips already in the training set.
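
The quota rule in the goals above can be sketched as a small helper. This is only an illustration: the function name `clips_to_select`, its parameters, and the example class names and precisions are hypothetical and do not appear in the notebook. The boundary follows the goals list, so a precision of exactly 50% keeps the smaller quota.

```python
def clips_to_select(precision=None, base=32, boosted=64, threshold=0.5):
    """Return how many top-scoring clips to pull for one class.

    precision is None when the class has no previous evaluation
    (for example on iteration 0); the base quota applies in that case.
    """
    if precision is not None and precision > threshold:
        return boosted
    return base


# Hypothetical precisions from a previous iteration (illustrative values only)
example_precision = {"Class A": 0.75, "Class B": 0.50, "Class C": 0.20}
quotas = {name: clips_to_select(p) for name, p in example_precision.items()}
quotas["Class D"] = clips_to_select(None)  # class without a precision yet
print(quotas)
```

Keeping the threshold and both quotas as parameters makes it easy to try other cut-offs without touching the sampling loop.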

#### Load Required Data:

* Combined predictions dataframe (big_df).
* training_set_tracker _to exclude clips already in the training set_.
* Class precision dataframe _from the previous iteration_.

#### Sampling Logic:

* For each class:
    - Exclude filenames present in the tracker for that class.
    - Sort predictions by score and select:
        - Top 32 if precision ≤ 50%.
        - Top 64 if precision > 50% (split into two groups).
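
The sampling logic above excludes tracker filenames per class. A minimal sketch of that per-class exclusion on toy data follows. It assumes the tracker carries a `class` column alongside the filename, which the notebook does not confirm. All dataframes, filenames, and class names here are made up for illustration.

```python
import pandas as pd

# Toy predictions in the same shape as the renamed predictions dataframe
preds = pd.DataFrame({
    "filename": ["a.wav", "a.wav", "b.wav", "c.wav"],
    "offset": [0, 3, 0, 6],
    "class": ["Class A", "Class B", "Class A", "Class A"],
    "score": [0.9, 0.8, 0.7, 0.6],
})

# Hypothetical tracker: ASSUMES a 'class' column next to the filename
tracker = pd.DataFrame({"filename": ["a.wav"], "class": ["Class A"]})
tracker = tracker.drop_duplicates()  # avoid duplicating rows in the merge

# Left anti-join on (filename, class): keep rows with no tracker match
merged = preds.merge(tracker, on=["filename", "class"], how="left", indicator=True)
remaining = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Top-N per class after filtering (N = 2 here to keep the toy small)
top = (remaining.sort_values("score", ascending=False)
                .groupby("class", sort=False)
                .head(2))
print(top)
```

With this approach `a.wav` is dropped only for Class A and stays available for Class B, whereas filtering on the filename alone would remove it for every class.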


In [1]:
# Load required libraries
import pandas as pd
from pathlib import Path
from opensoundscape import Audio

from tqdm.autonotebook import tqdm


In [2]:
# Directories and file paths
annotations_file = '/mnt/d/retraining_BirdNET/iterative_training_2025/output_results_2025/model_0/BirdNET_SelectionTable.txt'
audio_files_dir = '/mnt/d/retraining_BirdNET/iterative_training_2025/input_20%/'
output_dir = '/mnt/d/retraining_BirdNET/iterative_training_2025/segments_validation_2025/model_1/'
tracker_path = '/mnt/d/retraining_BirdNET/model_train_2025/training_set_tracker.csv'  # Path to training set tracker
precision_file = '/mnt/d/retraining_BirdNET/iterative_training_2025/segments_validation_2025/model_0/precision_per_class.csv'  # Path to precision data (after iteration 1)

In [3]:
# Read annotations
df = pd.read_csv(annotations_file, sep='\t', usecols=['Begin Path', 'File Offset (s)', 'Common Name', 'Confidence'])
df.rename(columns={
    'Begin Path': 'file',
    'File Offset (s)': 'offset',
    'Common Name': 'class',
    'Confidence': 'score'
}, inplace=True)


In [4]:
# Extract filename without directory for comparison
df['filename'] = df['file'].apply(lambda x: Path(x).name)

In [5]:
# Load the training set tracker (filter already included clips)
if Path(tracker_path).exists():
    tracker_df = pd.read_csv(tracker_path)
    tracker_filenames = tracker_df['file'].apply(lambda x: Path(x).name).tolist()  # Ignore score and offset
else:
    tracker_filenames = []


In [6]:
# Load precision data (only needed after iteration 1)
if Path(precision_file).exists():
    precision_df = pd.read_csv(precision_file)
    class_precision = dict(zip(precision_df['class'], precision_df['precision']))
else:
    class_precision = {}

In [7]:
# Filter out already-included clips
df = df[~df['filename'].isin(tracker_filenames)]

In [8]:
# Container for top-scoring clips per class
top_entries_per_class = []

# Group by class and select top clips based on precision
for class_name, class_df in df.groupby(by='class'):
    # Skip non-relevant classes (optional)
    if class_name == 'nocall':  # compare with a string; a list never matches
        continue

    # Determine number of clips to select based on precision
    num_clips = 32  # Default
    if class_name in class_precision:
        if class_precision[class_name] > 0.5:
            num_clips = 64

    # Select top-scoring clips
    top_clips = class_df.sort_values(by='score', ascending=False).head(num_clips)
    top_entries_per_class.append(top_clips)

# Combine all top clips into a single DataFrame
all_top_entries = pd.concat(top_entries_per_class, ignore_index=True)
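
The goals mention splitting the 64 clips of a high-precision class into a top block for computing precision and a second block of candidates for the training set. One possible way to label the two blocks automatically is sketched below on toy data. The column names `rank_in_class` and `group`, the toy dataframe, and the reduced block size are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the selected clips: one class with 6 clips, one with 3
toy = pd.DataFrame({
    "class": ["Class A"] * 6 + ["Class B"] * 3,
    "score": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.95, 0.85, 0.75],
})

validation_size = 3  # would be 32 in the workflow described in this notebook

# Rank clips within each class, 0 = highest score
toy = toy.sort_values(["class", "score"], ascending=[True, False])
toy["rank_in_class"] = toy.groupby("class").cumcount()

# First block goes to precision estimation, the remainder are candidates
toy["group"] = "validation"
toy.loc[toy["rank_in_class"] >= validation_size, "group"] = "candidate"

print(toy.groupby(["class", "group"]).size())
```

A class that only received the smaller quota ends up entirely in the `validation` group, so the same labelling works for both quota sizes.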

In [9]:
# Save selected clips into class-specific directories
for _, row in all_top_entries.iterrows():
    # Extract class and output directory
    class_name = row['class']
    class_dir = Path(output_dir) / class_name
    class_dir.mkdir(parents=True, exist_ok=True)

    # Construct the output filename
    score = row['score']
    offset = int(row['offset'])
    filename = row['filename']
    output_filename = f"{score}_{offset}_{filename}"

    # Load and save the audio segment
    audio_path = Path(audio_files_dir) / filename
    audio = Audio.from_file(audio_path, offset=offset, duration=3)  # Assuming 3-second segments
    audio.save(class_dir / output_filename)

print(f"Selected clips have been saved to: {output_dir}")



Selected clips have been saved to: /mnt/d/retraining_BirdNET/iterative_training_2025/segments_validation_2025/model_1/
