# [BirdCLEF23](https://www.kaggle.com/competitions/birdclef-2023) 

## Cropping the ogg files to their first and last 8 seconds

[Notebook adapted from here](https://www.kaggle.com/code/kaerunantoka/birdclef2022-audio-to-numpy-1-4), then [here](https://www.kaggle.com/code/ollypowell/birdclef23-audio-to-numpy), plus some inspiration from Hiroki's helpful comments from [here](https://www.kaggle.com/competitions/birdclef-2023/discussion/397086#2197884).    

The goal is to crop and/or split the whole 2023 dataset into chunks of at most 8 seconds and save them to .ogg files, for convenient batching. For recordings longer than some threshold, the chunks are taken from each end of the recording. Alternatively, the recordings can simply be saved at their original lengths. The motivation for this is:

- Fewer memory issues with ogg than with wav.

- The first and last parts of a recording are assumed to be more likely to contain the primary bird calls, as the recordings were likely cut from longer files. So this reduces the amount of data being loaded with false positives or incorrect labels.

- Future manual editing of sound files is easier with .wav files, should anyone want to go through the recordings and remove the nocall parts. The files can be played on any media player.

- I'm anticipating that I'll go through the less common classes and manually split up the bird calls into shorter segments to reduce class imbalance, and/or run some automated localisation process on the longer files. A maximum chunk length of 8 seconds sounds like a good compromise, as I'll still be able to randomly crop or pad everything to 5 seconds in the dataloader and expect the bird call to be present most of the time.
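
The random crop-or-pad step mentioned in the last point is not part of this notebook. The following is only a minimal sketch of one way it could look, assuming mono NumPy arrays at 32 kHz. The names `TRAIN_SECONDS` and `random_crop_or_pad` are hypothetical and chosen for illustration.

```python
import numpy as np

SR = 32000         # sampling rate, as used in this notebook
TRAIN_SECONDS = 5  # hypothetical training clip length (the 5 seconds mentioned above)


def random_crop_or_pad(y, target_len=SR * TRAIN_SECONDS, rng=None):
    """Return a 1-D array of exactly target_len samples (hypothetical helper)."""
    rng = rng if rng is not None else np.random.default_rng()
    if len(y) >= target_len:
        # Long enough: pick a random window that fits inside the chunk
        start = int(rng.integers(0, len(y) - target_len + 1))
        return y[start:start + target_len]
    # Too short: zero-pad, placing the audio at a random offset
    offset = int(rng.integers(0, target_len - len(y) + 1))
    out = np.zeros(target_len, dtype=y.dtype)
    out[offset:offset + len(y)] = y
    return out
```

With 8 second chunks, a random 5 second window always overlaps most of the chunk, which is why a call near either edge is still likely to be included.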

### Output
A new dataset, with the same file structure as the original dataset, and a new labels csv, updated with any split files.

### Usage
You should just need to modify the two filepaths `data_folder` and `out_folder`, and un-comment the last line.

In [1]:
import os
import numpy as np
import pandas as pd
import soundfile as sf
from joblib import Parallel, delayed
from tqdm import tqdm
from pathlib import Path
from os import sep
import glob
import gc
import random 
import librosa
from IPython.display import Audio

In [2]:
SR = 32000 # Sampling rate of all the source files
MAX_LEN = 8  # Maximum chunk length in seconds; longer recordings are clipped to this.
SPLIT_THRESHOLD = 10 # Recordings of at least this many seconds are split into two chunks, each of length MAX_LEN
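
To make the effect of these two constants concrete, here is a small sketch of the decision applied to each recording further down. The function `plan_chunks` is hypothetical and not part of the notebook. It only mirrors the same three-way logic on sample counts, without touching any audio.

```python
SR = 32000
MAX_LEN = 8
SPLIT_THRESHOLD = 10


def plan_chunks(n_samples, sr=SR, max_len=MAX_LEN, split_threshold=SPLIT_THRESHOLD):
    """Return a list of (filename suffix, start sample, end sample) tuples."""
    max_samples = max_len * sr
    if n_samples < max_samples:
        # Short recording: saved unchanged, no suffix
        return [("", 0, n_samples)]
    if n_samples < split_threshold * sr:
        # Slightly too long: keep only the first MAX_LEN seconds
        return [("_0", 0, max_samples)]
    # Long recording: first and last MAX_LEN seconds
    return [("_1", 0, max_samples), ("_2", n_samples - max_samples, n_samples)]


for seconds in (5, 9, 12, 60):
    print(seconds, plan_chunks(seconds * SR))
```

Note that for recordings between 10 and 16 seconds the two chunks overlap. A 12 second file, for example, yields seconds 0 to 8 and seconds 4 to 12.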

Set up file paths and create output folders

- Write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
- You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [3]:
data_folder = Path('/kaggle/input')  # modify to suit
#out_folder = Path('/kaggle/temp') # will be lost outside current session
out_folder = Path('/kaggle/working/') # modify to suit
in_csv = data_folder / 'cleaned-training-labels-21-23-for-birdclef2023/train_23.csv'
in_sfiles = data_folder / 'birdclef-2023' / 'train_audio'
out_csv = out_folder / 'train_23_cropped.csv'
out_train_folder = out_folder / 'train_audio'
dataset_dir = data_folder / f'birdclef23-first-last-{str(MAX_LEN)}-sec-ogg' / 'train_audio'

paths = {p:str(p.name) for p in Path(in_sfiles).rglob('*.ogg')}
classes = os.listdir(in_sfiles)
df = pd.read_csv(in_csv, index_col=None)

os.makedirs(out_train_folder, exist_ok=True)
for bird_type in classes:
    os.makedirs(out_train_folder / bird_type, exist_ok=True)

print(f'There are a total of {len(classes)} bird class folders')
print(f'There are a total of {len(paths)} bird sound files')

There are a total of 264 bird class folders
There are a total of 16941 bird sound files


Helper functions

In [4]:
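def load_ogg(path):
    """Read an audio file and return (mono signal, number of samples)."""
    y, sr = sf.read(path, always_2d=True)  # y has shape (frames, channels)
    y = np.mean(y, 1) # Average the channels, so stereo (X, 2) arrays become mono
    return y, len(y)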


def save_only(array, path):
    out_path = out_train_folder / f'{str(path.parent.name)}/{str(path.stem)}.ogg'
    sf.write(out_path, array, SR) 
    return
    

def crop_only(array, path):
    out_path = out_train_folder / f'{str(path.parent.name)}/{str(path.stem)}_0.ogg'
    array = array[:SR * MAX_LEN]
    sf.write(out_path, array, SR) 
    return


def split_crop(array, path):
    y1 = array[:SR * MAX_LEN]
    y2 = array[-SR * MAX_LEN:]
    out_path_1 = out_train_folder / f'{str(path.parent.name)}/{str(path.stem)}_1.ogg'
    out_path_2 = out_train_folder / f'{str(path.parent.name)}/{str(path.stem)}_2.ogg'
    sf.write(out_path_1, y1, SR) 
    sf.write(out_path_2, y2, SR)
    return


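def play_audio(file_path):
    # sr=None keeps the file's native sampling rate instead of librosa's default resampling to 22050 Hz
    audio_abe, sr_abe = librosa.load(file_path, sr=None)
    return Audio(data=audio_abe, rate=sr_abe)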
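
As a quick illustration of the down-mixing in `load_ogg`: with `always_2d=True` the array always has shape (frames, channels), so averaging over axis 1 handles mono and stereo files the same way. The sketch below uses made-up sample values and a hypothetical `to_mono` helper rather than real audio.

```python
import numpy as np

# Made-up sample values, shaped (frames, channels) like sf.read(..., always_2d=True)
stereo = np.array([[0.2, 0.4], [1.0, -1.0], [0.0, 0.5]])
mono_as_2d = np.array([[0.1], [0.2], [0.3]])


def to_mono(y):
    """Average across the channel axis, giving a 1-D signal."""
    return np.mean(y, axis=1)


print(to_mono(stereo).shape, to_mono(mono_as_2d).shape)
```

One consequence of plain averaging is that channels in opposite phase cancel out, as in the second stereo frame above.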

In [5]:
data = df.to_dict('records')
new_labels = []

for row_1 in tqdm(data):   # for quick code testing, use data[0:200] instead of data
    fp = Path(row_1['filepath'])
    np_arr, length = load_ogg(fp)
    
    if length < MAX_LEN*SR:
        save_only(np_arr, fp)
        row_1['filename'] = f'{str(fp.stem)}.ogg'
        row_1['filepath'] = str(dataset_dir / f'{str(fp.parent.name)}/{str(fp.stem)}.ogg')
        new_labels.append(row_1)
        
    elif length < SPLIT_THRESHOLD*SR:
        crop_only(np_arr, fp)
        row_1['filename'] = f'{str(fp.stem)}_0.ogg'
        row_1['filepath'] = str(dataset_dir / f'{str(fp.parent.name)}/{str(fp.stem)}_0.ogg')
        new_labels.append(row_1)
        
    else:
        split_crop(np_arr, fp)
        row_2 = row_1.copy()
        row_1['filename'] = f'{str(fp.stem)}_1.ogg'
        row_2['filename'] = f'{str(fp.stem)}_2.ogg'
        row_1['filepath'] =  str(dataset_dir / f'{str(fp.parent.name)}/{str(fp.stem)}_1.ogg')
        row_2['filepath'] =  str(dataset_dir / f'{str(fp.parent.name)}/{str(fp.stem)}_2.ogg')
        new_labels.extend([row_1, row_2])
        
df2 = pd.DataFrame.from_dict(new_labels) 
df2.to_csv(out_csv, index=False)
df2.head()

100%|██████████| 16941/16941 [36:11<00:00,  7.80it/s]


| Unnamed: 0 | primary_label | secondary_labels | type | filename | filepath |
|---|---|---|---|---|---|
| 0 | abethr1 | [] | ['song'] | XC128013_1.ogg | /kaggle/input/birdclef23-first-last-8-sec-ogg/... |
| 1 | abethr1 | [] | ['song'] | XC128013_2.ogg | /kaggle/input/birdclef23-first-last-8-sec-ogg/... |
| 2 | abethr1 | [] | ['call'] | XC363501_1.ogg | /kaggle/input/birdclef23-first-last-8-sec-ogg/... |
| 3 | abethr1 | [] | ['call'] | XC363501_2.ogg | /kaggle/input/birdclef23-first-last-8-sec-ogg/... |
| 4 | abethr1 | [] | ['song'] | XC363502_1.ogg | /kaggle/input/birdclef23-first-last-8-sec-ogg/... |


In [6]:
classes = os.listdir(out_train_folder)
paths = [p for p in Path(out_train_folder).rglob('*.ogg')]

print(f'There are a total of {len(classes)} bird class folders')
print(f'There are a total of {len(paths)} bird sound .ogg files')
print(f'There are {df2.shape[0]} rows in the new labels dataframe')

There are a total of 264 bird class folders
There are a total of 30832 bird sound .ogg files
There are 30832 rows in the new labels dataframe
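
Matching counts are a good sign, but they do not show that every labelled file actually exists. A stricter check could compare names directly. The sketch below is an assumption about how that might be done, not part of the original notebook. It compares `<class>/<file>` pairs rather than full paths, because the CSV paths point at the future dataset location while the files currently sit in the output folder. The helper name and the demo file names are made up.

```python
import tempfile
from pathlib import Path


def compare_labels_to_disk(label_paths, audio_root, pattern="*.ogg"):
    """Return (labelled but missing on disk, on disk but unlabelled)."""
    labelled = {f"{Path(p).parent.name}/{Path(p).name}" for p in label_paths}
    on_disk = {f"{p.parent.name}/{p.name}" for p in Path(audio_root).rglob(pattern)}
    return sorted(labelled - on_disk), sorted(on_disk - labelled)


# Tiny self-contained demo with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train_audio"
    (root / "abethr1").mkdir(parents=True)
    (root / "abethr1" / "XC1_1.ogg").touch()
    (root / "abethr1" / "XC1_2.ogg").touch()
    labels = ["/some/dataset/train_audio/abethr1/XC1_1.ogg",
              "/some/dataset/train_audio/abethr1/XC9_0.ogg"]
    missing, unlabelled = compare_labels_to_disk(labels, root)

print(missing, unlabelled)
```

In this notebook the equivalent call would be something like `compare_labels_to_disk(df2['filepath'], out_train_folder)`, where both returned lists should be empty.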


Below I'm just checking a random sample of the file paths written to the csv

In [7]:
rand_list = random.sample(range(1, df2.shape[0]), 200)
for num in rand_list[:6]:
    print(df2.iloc[num]['filepath'])

/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/subbus1/XC511755_2.ogg
/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/barswa/XC333650_2.ogg
/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/whbcro2/XC166357.ogg
/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/ratcis1/XC503255_1.ogg
/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/blfbus1/XC325824_2.ogg
/kaggle/input/birdclef23-first-last-8-sec-ogg/train_audio/cohmar1/XC581664_1.ogg


And to verify that a few of the files have saved correctly and contain birds

In [8]:
play_audio(paths[rand_list[0]])

In [9]:
play_audio(paths[rand_list[1]])

In [10]:
play_audio(paths[rand_list[2]])

In [11]:
play_audio(paths[rand_list[3]])