# [BirdCLEF23](https://www.kaggle.com/competitions/birdclef-2023) Generating Uniform No-Call Chunks

[Original dataset collected from 3 other comps for BirdCLEF21](https://www.kaggle.com/datasets/christofhenkel/birdclef2021-background-noise)  
[The resulting dataset from this notebook](https://www.kaggle.com/datasets/ollypowell/birdclef-8-sec-ogg)

### Motivation

The motivation for this is to have uniform no-call chunks to mix in with my existing datasets, along with a compatible csv file in the same format.

I am working with 8-second chunks, and have based all my datasets on the same csv format as [this one](/kaggle/input/cleaned-training-labels-21-23-for-birdclef2023), so they are easy to concatenate.

I have saved to ogg this time, because I'm sharing my data between Colab and my own machine.  In theory .wav will load a bit faster, but the file sizes were less practical for me.

### Data sources
- The one labeled *Train soundscapes* was provided for the 2021 competition, and I've assumed it has had everything except the 'no-call' parts removed.  There are more files there, so the dataset could be increased.  [link here](https://www.kaggle.com/competitions/birdclef-2021/data?select=train_soundscapes)
- ff1010bird_nocall comes from the [DCASE 2018](https://dcase.community/challenge2018/task-bird-audio-detection) competition, and was originally sourced from the [freesound project](https://freesound.org/)
- The AICrowd collection comes from [birdCLEF2020](https://www.aicrowd.com/clef_tasks/22/task_dataset_files?challenge_id=211)  (requires a login)

### Output
- A new dataset, with files in a single folder, named `<original_folder>_<original_file_stem>_<n>.<ext>`, where `n` is the chunk index within the source file and `ext` is either `wav` or `ogg`, depending on the choice in config.
- A csv file with filename, file path in the kaggle system, primary class 'no-call', secondary class empty

### Usage
You could either modify the csv format to match your own plans and add it to an existing dataset with a no-call class label, or just use the chunks as background noise for augmentation.
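As a rough illustration of the background-noise option, the sketch below mixes a noise chunk into a clip at a chosen signal-to-noise ratio. The helper name `mix_background`, the `snr_db` parameter and the synthetic signals are assumptions made for this example; none of them are part of this notebook or the dataset.

```python
import numpy as np

def mix_background(signal, noise, snr_db=10.0):
    """Hypothetical helper: add `noise` to `signal` at roughly `snr_db` dB SNR."""
    # Tile or trim the noise so it matches the signal length
    if len(noise) < len(signal):
        reps = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(signal)]

    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if sig_power == 0 or noise_power == 0:
        return signal.copy()  # nothing sensible to scale against

    # Scale the noise so that sig_power / scaled_noise_power == 10^(snr_db / 10)
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return signal + scale * noise

# Synthetic stand-ins: an 8-second tone for the bird clip, white noise for the no-call chunk
sr = 32000
t = np.arange(sr * 8) / sr
clip = 0.5 * np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
background = 0.1 * rng.standard_normal(sr * 8)

mixed = mix_background(clip, background, snr_db=10.0)
```

In practice the `background` array would be one of the 8-second chunks loaded from this dataset, and `snr_db` would probably be drawn at random per sample.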

In [1]:
import os
import numpy as np
import pandas as pd
import soundfile as sf
from joblib import Parallel, delayed
from pathlib import Path
import random 
import librosa
import multiprocessing as mp
from IPython.display import Audio
import shutil 

Configuration for this notebook

In [2]:
class Config:
    SR = 32000 # Sampling rate of all the source files
    CHUNK_DURATION = 8  # Clips the files to this number of seconds.
    CLASS_NAME = 'no-call'
    FILE_TYPE = 'ogg'  # Save to .wav will potentially mean faster loading, but larger files
    NUM_OUT = 1000  # By comparison the mean for all birds in the three comp years is 215 samples each of varying length
    MAX_IMBALANCE = 3 # The maximum ratio of sampling from any dataset compared to the smallest one
    NUM_WORKERS = 4 # For parallel processing


Setup filepaths and create output folders

- Write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
- You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [3]:
in_fldrs = ['/kaggle/input/birdclef2021-background-noise/aicrowd2020_noise_30sec/noise_30sec',  #225 minutes
            '/kaggle/input/birdclef2021-background-noise/ff1010bird_nocall/nocall',  #960 minutes
           '/kaggle/input/birdclef2021-background-noise/train_soundscapes/nocall', ]  #30 minutes (Could add more, but would need to filter for no-call first)

data_folder = Path('/kaggle/input')  # modify to suit
in_csv = data_folder / 'cleaned-training-labels-21-23-for-birdclef2023/train_21_22_23.csv' # for header format only

#out_folder = Path('/kaggle/temp') # will be lost outside current session
out_folder = Path('/kaggle/working/') # to save and make into a dataset
temp_folder = Path('/kaggle/temp') # to store intermediate files

out_dataset_name =  f'birdclef-{str(Config.CHUNK_DURATION)}-sec-{Config.FILE_TYPE}' 
out_csv = out_folder / 'birdclef-nocall.csv'
out_sf_folder = out_folder / 'train_audio'

os.makedirs(out_sf_folder, exist_ok=True)

In [4]:
format_df = pd.read_csv(in_csv)
df = pd.DataFrame(columns = format_df.columns)
df

Unnamed: 0,primary_label,secondary_labels,type,filename,filepath


The goal here is to have roughly the same number of minutes from each of the sound source directories.  Going for simplicity over efficiency, I'm just breaking up all the sources into uniform chunks, and saving them all temporarily. The number of chunks taken from any one source is capped at the size of the smallest source times `Config.MAX_IMBALANCE`.  E.g. 30 minutes x 3 = 90 minutes, so in this case we have (30 + 90 + 90)*60/8 = 1575 8-second files, minus a bit of truncating.  For a total number less than that, set a lower value for `Config.NUM_OUT`.
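The arithmetic above can be checked with a small sketch. The function name `expected_chunk_total` is hypothetical, and the per-source chunk counts are the ones this notebook reports further down when it splits the three folders.

```python
def expected_chunk_total(chunks_per_source, max_imbalance, num_out):
    """Hypothetical helper: how many chunks survive the imbalance cap and NUM_OUT."""
    smallest = min(chunks_per_source)
    cap = smallest * max_imbalance            # most chunks any one source may contribute
    pooled = sum(min(n, cap) for n in chunks_per_source)
    return min(pooled, num_out)               # NUM_OUT truncates the pooled list

# Chunk counts per source folder, as printed by the splitting cell in this notebook
counts = [1339, 5755, 225]

uncapped_by_num_out = expected_chunk_total(counts, max_imbalance=3, num_out=10**6)
with_num_out = expected_chunk_total(counts, max_imbalance=3, num_out=1000)
print(uncapped_by_num_out, with_num_out)  # 1575 1000
```

The smallest source gives 225 chunks (30 minutes), so the other two are each capped at 675, which gives 225 + 675 + 675 = 1575 before `NUM_OUT` is applied.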

In [5]:
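def load_ogg(path):
    y, sr = sf.read(path, always_2d=True)
    y = np.mean(y, 1)  # Average any stereo (X, 2) arrays down to mono
    if not np.isfinite(y).all():
        # Replace NaNs with silence and infinities with the largest finite magnitude
        finite = y[np.isfinite(y)]
        peak = np.max(np.abs(finite)) if finite.size else 0.0
        y = np.nan_to_num(y, nan=0.0, posinf=peak, neginf=-peak)
    return y, len(y)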


def play_audio(file_path):
    audio_abe, sr_abe = librosa.load(file_path)
    return Audio(data=audio_abe, rate=sr_abe)


def file_to_chunks(item_tuple):
    file, save_pth = item_tuple
    parent_name = str(file.parent.name) + '_' + str(file.stem)
    chunk_len = Config.CHUNK_DURATION * Config.SR
    y, length = load_ogg(file)
    if length < chunk_len + 1:
        print(f'{file.name} is too short ({length//Config.SR} seconds)')
    else:
        chunks = [y[i:i+chunk_len] for i in range(0, length, chunk_len)]
        if len(chunks[-1]) != chunk_len:
            chunks = chunks[:-1]
        for idx2, chunk in enumerate(chunks):
            fn = f'{parent_name}_{str(idx2)}.{Config.FILE_TYPE}' # could add {str(save_pth).partition("_")[-1]}  for more uniqueness
            pth = save_pth / fn
            sf.write(pth, chunk, Config.SR)
    return
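The slicing logic in `file_to_chunks` can be tried on a synthetic array without touching any files. This is only a sketch: `split_uniform` is a hypothetical helper that mirrors the list comprehension above, and the silent 30-second array stands in for a real recording.

```python
import numpy as np

def split_uniform(y, chunk_len):
    """Hypothetical helper mirroring the slicing in file_to_chunks."""
    chunks = [y[i:i + chunk_len] for i in range(0, len(y), chunk_len)]
    # Drop the trailing remainder if it is shorter than a full chunk
    if chunks and len(chunks[-1]) != chunk_len:
        chunks = chunks[:-1]
    return chunks

sr = 32000
chunk_len = 8 * sr
y = np.zeros(30 * sr, dtype=np.float32)   # stand-in for a 30-second file

chunks = split_uniform(y, chunk_len)
print(len(chunks), [len(c) for c in chunks])  # 3 [256000, 256000, 256000]
```

A 30-second file yields three 8-second chunks and the last 6 seconds are discarded, which is the "bit of truncating" mentioned earlier.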

In [6]:
all_temps = {}
for idx, folder in enumerate(in_fldrs):
    in_path_list = [p for p in Path(folder).rglob('*.ogg')]
    print(f'Processing {len(in_path_list)} ogg files in folder {folder}')
    temp_subfolder_nm = f'no_calls_{idx}'
    temp_save = temp_folder / temp_subfolder_nm
    os.makedirs(temp_save, exist_ok=True)
    items = [(p, temp_save) for p in in_path_list]
    
    if __name__ == '__main__':
        # Use the configured worker count and close the pool once the folder is done
        with mp.Pool(processes=Config.NUM_WORKERS) as pool:
            pool.map(file_to_chunks, items)
        
    new_files = [p for p in Path(temp_save).rglob(f'*.{Config.FILE_TYPE}')]
    all_temps[temp_subfolder_nm] = new_files 
    print(f'Made {len(new_files)} {Config.CHUNK_DURATION}-second files from folder {idx+1}')

Processing 450 ogg files in folder /kaggle/input/birdclef2021-background-noise/aicrowd2020_noise_30sec/noise_30sec
1439.ogg is too short (5 seconds)
Made 1339 8-second files from folder 1
Processing 5755 ogg files in folder /kaggle/input/birdclef2021-background-noise/ff1010bird_nocall/nocall
Made 5755 8-second files from folder 2
Processing 3 ogg files in folder /kaggle/input/birdclef2021-background-noise/train_soundscapes/nocall
Made 225 8-second files from folder 3


In [7]:
os.listdir(temp_folder)

['no_calls_2', 'no_calls_1', 'no_calls_0']

In [8]:
#Randomly sample
random.seed(2023)
min_length = min([len(val) for val in all_temps.values()])

sampled_temps = []
for path_list in all_temps.values():
    if len(path_list) > min_length * Config.MAX_IMBALANCE:
        random_sample = random.sample(path_list, min_length * Config.MAX_IMBALANCE)
    else:
        random_sample = path_list
    sampled_temps = sampled_temps + random_sample
#Now the list contains similar numbers from the different source folders, to a max ratio of MAX_IMBALANCE

random.shuffle(sampled_temps)

if len(sampled_temps) > Config.NUM_OUT:
    sampled_temps = sampled_temps[:Config.NUM_OUT]

print(f'There are {len(sampled_temps)} filepaths in the list to be moved')

There are 1000 filepaths in the list to be moved


In [9]:
new_rows= []
for path in sampled_temps:
    fn = str(Path(path).name)
    destination = out_sf_folder / fn
    shutil.move(path, destination)
    
    csv_file_path = f'/kaggle/input/{out_dataset_name}/train_audio/{fn}'
    row_data = {'primary_label': Config.CLASS_NAME, 
                'secondary_labels': [], 
                'type': [], 
                'filename':fn,
                'filepath': csv_file_path }
    new_rows.append(row_data)
    
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.to_csv(out_csv)

In [10]:
print(f'There are a total of {len(os.listdir(out_sf_folder))} sound files')
print(f'There are {df.shape[0]} rows in the new labels dataframe')

There are a total of 1000 sound files
There are 1000 rows in the new labels dataframe


Below I'm just checking a random sample of the file paths written to the csv

In [11]:
rand_list = random.sample(range(1, 1000), 200)
for num in rand_list[:4]:
    print(df.iloc[num]['filepath'])

/kaggle/input/birdclef-8-sec-ogg/train_audio/nocall_50677_0.ogg
/kaggle/input/birdclef-8-sec-ogg/train_audio/noise_30sec_1433_1.ogg
/kaggle/input/birdclef-8-sec-ogg/train_audio/nocall_31454_0.ogg
/kaggle/input/birdclef-8-sec-ogg/train_audio/noise_30sec_385_2.ogg


And to verify that a few files have saved correctly and don't contain birds:

In [12]:
def path_from_csv(idx):
    kaggle_parents = '/kaggle/input/' + out_dataset_name
    # Note: .lstrip(kaggle_parents) does not work here, because str.lstrip removes any leading
    # characters found in its argument (a character set), not the prefix as a whole.
    new_path = '../working' + df.iloc[rand_list[idx]]['filepath'].replace(kaggle_parents, '')
    return new_path

path_from_csv(5)

'../working/train_audio/nocall_41300_0.ogg'
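The reason `replace` is used in `path_from_csv` rather than `lstrip` can be seen in a short standalone sketch. The path below is the one returned above with its Kaggle prefix restored; the variable names are only for illustration.

```python
prefix = '/kaggle/input/birdclef-8-sec-ogg'
full_path = prefix + '/train_audio/nocall_41300_0.ogg'

# str.lstrip treats its argument as a set of characters, so it keeps stripping
# into '/train' because '/', 't', 'r', 'a', 'i' and 'n' all occur in the prefix
stripped = full_path.lstrip(prefix)
print(stripped)  # _audio/nocall_41300_0.ogg

# Removing the prefix as a whole string gives the intended result
removed = full_path[len(prefix):] if full_path.startswith(prefix) else full_path
print(removed)   # /train_audio/nocall_41300_0.ogg
```

On Python 3.9 or later, `full_path.removeprefix(prefix)` does the same job as the slicing line.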

In [13]:
play_audio(path_from_csv(0))

In [14]:
play_audio(path_from_csv(1))

In [15]:
play_audio(path_from_csv(20))

In [16]:
play_audio(path_from_csv(45))

In [17]:
play_audio(path_from_csv(70))

In [18]:
play_audio(path_from_csv(150))

In [19]:
play_audio(path_from_csv(190))