
# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%"> 1 | Installing Libraries </p>

In [None]:

    !pip install /kaggle/input/onnxruntime/humanfriendly-10.0-py2.py3-none-any.whl --no-index --find-links /kaggle/input/onnxruntime
    !pip install /kaggle/input/onnxruntime/coloredlogs-15.0.1-py2.py3-none-any.whl --no-index --find-links /kaggle/input/onnxruntime
    !pip install /kaggle/input/onnxruntime/onnxruntime-1.17.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl --no-index --find-links /kaggle/input/onnxruntime

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">2 | Importing Libraries </p>

In [None]:
import os
import gc
import sys
import glob
import time
import shutil
import random
import ast

import warnings
warnings.simplefilter("ignore")
import onnx
import onnxruntime as ort
import wandb

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, GroupKFold, StratifiedKFold, StratifiedGroupKFold
from sklearn import metrics
from sklearn.metrics import mean_squared_error, roc_auc_score
from tqdm.notebook import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

import torch
print(f"pytorch version is {torch.__version__}")
import torch.nn as nn
from torch.cuda import amp

isTrain = False
name = 'bird2024exp1057'

import torchvision
from torchvision.transforms import v2 as transforms

import librosa
import torchaudio
import torchaudio.transforms as audioT

import timm

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">3 | Configuration Class </p>

The `config` class is used to centralize and manage all the configuration settings needed for training and evaluating the model. This approach makes it easier to adjust parameters and ensures consistency across different parts of the code.

```python
class config:
    dir = "/kaggle/input/birdclef-2024/"
```

- **dir**: Specifies the directory where the dataset is located. This allows the code to locate and load the dataset efficiently.

```python
    wave_path = "original_waves/second_30/"
```

- **wave_path**: The sub-directory within the main directory where the preprocessed audio wave files are stored. This helps in organizing the data and facilitates easy access during data loading.

```python
    model_name = 'tf_efficientnet_b0'
```

- **model_name**: The name of the model architecture to be used. Here, `tf_efficientnet_b0` is chosen, which is a variant of EfficientNet. This setting allows for easy switching between different model architectures.

```python
    pool_type = 'avg'
```

- **pool_type**: Specifies the type of pooling layer to use, in this case, average pooling. Pooling layers are used to reduce the spatial dimensions of the feature maps.

```python
    train_duration = 30
```

- **train_duration**: The duration (in seconds) of the audio samples used for training. This sets a fixed length for all training samples, ensuring uniformity.

```python
    slice_duration = 5
```

- **slice_duration**: The duration (in seconds) of the slices used for Short-Time Fourier Transform (STFT) during training. This defines the length of audio segments input to the model.

```python
    test_duration = 5
```

- **test_duration**: The duration (in seconds) of the audio samples used for testing. This ensures that the test data is processed in the same way as the training data.

```python
    train_drop_duration = 1
```

- **train_drop_duration**: Duration (in seconds) of audio to be randomly dropped out during training for augmentation purposes. This helps in making the model robust to missing data.

### Spectrogram Parameters

```python
    sr = 32000
    fmin = 20
    fmax = 15000
    n_mels = 128
    n_fft = n_mels*8
    size_x = 512
```

- **sr**: Sample rate, the number of samples per second in the audio files.
- **fmin**: Minimum frequency for the Mel spectrogram.
- **fmax**: Maximum frequency for the Mel spectrogram.
- **n_mels**: Number of Mel bands to generate.
- **n_fft**: FFT window size in samples, set to eight times the number of Mel bands (128 × 8 = 1024).
- **size_x**: The size of the resulting spectrogram along the time axis.

```python
    hop_length = int(sr*slice_duration / size_x)
    test_hop_length = int(sr*test_duration / size_x)
```

- **hop_length**: Number of audio samples between adjacent frames for the training spectrogram.
- **test_hop_length**: Number of audio samples between adjacent frames for the testing spectrogram.
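
To make the arithmetic concrete, the sketch below recomputes both hop lengths from the values in `config` and estimates how many time frames one 5-second slice produces. The frame-count formula `1 + n_samples // hop_length` assumes a centered STFT (`center=True`), and the names `n_samples` and `n_frames` are illustrative only.

```python
# Values copied from the config class.
sr = 32000
slice_duration = 5
test_duration = 5
size_x = 512

hop_length = int(sr * slice_duration / size_x)      # int(312.5) -> 312
test_hop_length = int(sr * test_duration / size_x)  # same value, both durations are 5 s

# Frame count for one 5-second slice, assuming a centered STFT.
n_samples = sr * slice_duration                      # 160000 samples
n_frames = 1 + n_samples // hop_length               # 513

print(hop_length, test_hop_length, n_frames)
```

Because `int()` truncates 312.5 to 312, the time axis comes out at 513 frames under this assumption rather than exactly `size_x = 512`.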

### Miscellaneous Configuration

```python
    bins_per_octave = 12
```

- **bins_per_octave**: Number of frequency bins per octave. This is a setting for constant-Q style spectrograms; the `MelSpectrogram` layers created in section 7 do not take it.

```python
    nfolds = 5
    inference_folds = [4]
```

- **nfolds**: Number of folds for cross-validation.
- **inference_folds**: Specific folds to use for inference.

```python
    enable_amp = True
    train_batchsize = 32
    valid_batchsize = 1
```

- **enable_amp**: Enables automatic mixed precision to speed up training.
- **train_batchsize**: Batch size for training.
- **valid_batchsize**: Batch size for validation.

```python
    loss_type = "BCEFocalLoss"
```

- **loss_type**: Type of loss function used during training. Here, Binary Cross-Entropy with Focal Loss is used to handle class imbalance.
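
The notebook's own `BCEFocalLoss` implementation is not shown in this section. As a rough illustration of the idea, here is a minimal NumPy sketch of a common binary focal loss formulation. The function name `bce_focal_loss` and the defaults `alpha=0.25` and `gamma=2.0` are assumptions for this example, not values taken from the notebook.

```python
import numpy as np

def bce_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of a binary focal loss (alpha and gamma are assumed defaults)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    eps = 1e-12                              # avoids log(0)
    # Positive term: down-weighted when the model is already confident (probs near 1).
    pos = -alpha * (1.0 - probs) ** gamma * targets * np.log(probs + eps)
    # Negative term: down-weighted when probs is already near 0.
    neg = -(1.0 - alpha) * probs ** gamma * (1.0 - targets) * np.log(1.0 - probs + eps)
    return float(np.mean(pos + neg))

easy = bce_focal_loss([4.0], [1.0])    # confident and correct
hard = bce_focal_loss([-4.0], [1.0])   # confident and wrong
print(easy, hard)
```

The easy example contributes almost nothing to the loss, while the hard one dominates, which is why this family of losses is often used when a few classes are rare.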

```python
    lr = 1.0e-04
    optimizer = 'adan'
    weight_decay = 1.0e-02
    es_patience = 5
    deterministic = True
    enable_amp = True
```

- **lr**: Learning rate for the optimizer.
- **optimizer**: Optimizer type, here 'adan' is chosen.
- **weight_decay**: Weight decay for regularization.
- **es_patience**: Early stopping patience, the number of epochs with no improvement after which training will be stopped.
- **deterministic**: Ensures deterministic behavior for reproducibility.
- **enable_amp**: (Repeated) Enables automatic mixed precision to speed up training.
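
The training loop that consumes `es_patience` is not part of this section. The sketch below shows one common way such a patience counter works. The `EarlyStopping` class is a hypothetical helper, and it assumes that a higher validation score is better.

```python
class EarlyStopping:
    """Hypothetical helper: stop when the score has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=5)
scores = [0.60, 0.65, 0.64, 0.63, 0.65, 0.62, 0.61, 0.60]  # made-up validation scores
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper.step(score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_score)
```

In this made-up run the best score is reached at epoch 1, and training stops at epoch 6 after five epochs without a strict improvement.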

```python
    max_epoch = 9
    aug_epoch = 6
```

- **max_epoch**: Maximum number of epochs for training.
- **aug_epoch**: Number of epochs trained with data augmentation enabled.

### Secondary Label Handling

```python
    useSecondary = True
    secondary_label_value = 0.5
```

- **useSecondary**: Whether to use secondary labels in training.
- **secondary_label_value**: Value assigned to secondary labels during training.
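
As an illustration of what these two settings imply, the sketch below builds a multi-label target vector in which the primary species gets 1.0 and each secondary species gets `secondary_label_value`. The helper `build_target` and the placeholder class names are hypothetical; the notebook's actual target construction is not shown in this section.

```python
import numpy as np

def build_target(primary, secondaries, labels, secondary_value=0.5):
    """Hypothetical helper: multi-label target vector for one recording."""
    index = {label: i for i, label in enumerate(labels)}
    target = np.zeros(len(labels), dtype=np.float32)
    for label in secondaries:
        if label in index:            # ignore labels outside the class list
            target[index[label]] = secondary_value
    target[index[primary]] = 1.0      # primary label always gets full weight
    return target

labels = ["species_a", "species_b", "species_c", "species_d"]  # placeholder class names
target = build_target("species_b", ["species_d"], labels)
print(target)
```

Setting the primary label last means a species listed as both primary and secondary still ends up with the full weight of 1.0.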

### Oversampling

```python
    oversample = False
    oversample_threthold = 60
```

- **oversample**: Whether to oversample minority classes.
- **oversample_threthold**: Threshold for oversampling.

### Random Seed and Logging

```python
    seed = 42
    wandb = True
```

- **seed**: Seed for random number generators to ensure reproducibility.
- **wandb**: Whether to use Weights and Biases for experiment tracking.

### Data Augmentation Flags

```python
    aug_noise = 0.
    aug_gain = 0.0
    aug_wave_pitchshift = 0.0
    aug_wave_shift = 0.
    aug_spec_xymasking = 0.
    aug_spec_coarsedrop = 0.
    aug_spec_hflip = 0.
```

- **aug_noise, aug_gain, aug_wave_pitchshift, aug_wave_shift**: Probabilities of applying each waveform augmentation. Each value is passed as `p` to the corresponding transform, so `0.0` disables it.
- **aug_spec_xymasking, aug_spec_coarsedrop, aug_spec_hflip**: Probabilities of applying each spectrogram augmentation, used in the same way.

### Mixup Parameters

```python
    aug_wave_mixup = 1.0
    aug_spec_mixup = 0.0
    aug_spec_mixup_prob = 0.5
    alpha = 0.95
```

- **aug_wave_mixup**: Flag for wave mixup augmentation.
- **aug_spec_mixup**: Flag for spectrogram mixup augmentation.
- **aug_spec_mixup_prob**: Probability of applying spectrogram mixup.
- **alpha**: Parameter for the Beta distribution used in mixup.

```python
    smoothing_value = 0.0
```

- **smoothing_value**: Label smoothing value.

### Initializing Configuration and Device

```python
cfg = config()
```

- **cfg**: Instantiates the configuration class, making all settings accessible through `cfg`.

```python
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(device)
```

- **device**: Sets the device for computation to GPU if available, otherwise CPU. This ensures that the code can leverage GPU acceleration if available.
- **print(device)**: Prints the selected device to verify the computation environment.



---

### **Key Concepts and Terms of the `config` Class**

#### Audio Processing

1. **Sample Rate (sr)**:
   - **Definition**: The number of samples of audio carried per second, measured in Hz (samples per second).
   - **In Context**: `sr = 32000` means the audio is sampled at 32,000 samples per second.

2. **Waveform**:
   - **Definition**: A graphical representation of the audio signal's amplitude over time.
   - **In Context**: The raw audio data before any processing like creating spectrograms.

3. **Spectrogram**:
   - **Definition**: A visual representation of the spectrum of frequencies in a signal as it varies with time.
   - **In Context**: Used to convert audio signals into a format that can be fed into convolutional neural networks (CNNs).

4. **Short-Time Fourier Transform (STFT)**:
   - **Definition**: A method to analyze the frequency content of a signal by dividing it into short segments and performing Fourier transform on each segment.
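   - **In Context**: `n_fft` sets the length of each analysis window and `hop_length` sets the step between consecutive windows, so adjacent windows overlap whenever `hop_length` is smaller than `n_fft`. `slice_duration` sets how much audio is turned into one spectrogram.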

5. **Mel Spectrogram**:
   - **Definition**: A spectrogram where the frequency axis is converted to the Mel scale, which is more closely aligned with human hearing.
   - **In Context**: Parameters like `n_mels`, `fmin`, and `fmax` are used to create Mel spectrograms.

#### Spectrogram Parameters

1. **fmin and fmax**:
   - **Definition**: The minimum and maximum frequencies for the Mel spectrogram.
   - **In Context**: `fmin = 20` and `fmax = 15000` define the frequency range of interest for the Mel spectrogram.

2. **n_mels**:
   - **Definition**: The number of Mel bands to generate in the spectrogram.
   - **In Context**: `n_mels = 128` means the spectrogram will have 128 Mel frequency bands.

3. **n_fft**:
   - **Definition**: The number of data points used in each block for the FFT (Fast Fourier Transform).
   - **In Context**: `n_fft = n_mels * 8` means each block will use 1024 points.

4. **hop_length**:
   - **Definition**: The number of samples between successive frames.
   - **In Context**: Determines the overlap between adjacent STFT segments, calculated based on `slice_duration` and `size_x`.

5. **bins_per_octave**:
   - **Definition**: Number of frequency bins per octave in a spectrogram.
   - **In Context**: `bins_per_octave = 12` is often used in music and audio analysis.

#### Training Parameters

1. **train_duration and test_duration**:
   - **Definition**: Duration of the audio clips used for training and testing, respectively.
   - **In Context**: Sets a fixed length for audio samples to ensure uniformity during training and testing.

2. **train_drop_duration**:
   - **Definition**: Duration of audio to be randomly dropped during training for data augmentation.
   - **In Context**: `train_drop_duration = 1` indicates dropping 1 second of audio to make the model robust to missing data.

3. **nfolds and inference_folds**:
   - **Definition**: Number of folds for cross-validation and specific folds used for inference.
   - **In Context**: `nfolds = 5` and `inference_folds = [4]` specify 5-fold cross-validation, with only the model from fold index 4 used for inference.

4. **train_batchsize and valid_batchsize**:
   - **Definition**: Number of samples per batch during training and validation.
   - **In Context**: `train_batchsize = 32` and `valid_batchsize = 1` set the batch sizes for these phases.

5. **loss_type**:
   - **Definition**: The type of loss function used during training.
   - **In Context**: `loss_type = "BCEFocalLoss"` is chosen to handle class imbalance.

6. **lr (learning rate)**:
   - **Definition**: The step size at each iteration while moving toward a minimum of the loss function.
   - **In Context**: `lr = 1.0e-04` is the learning rate for the optimizer.

7. **optimizer**:
   - **Definition**: The algorithm used to adjust the weights of the network.
   - **In Context**: `optimizer = 'adan'` specifies the type of optimizer used.

8. **weight_decay**:
   - **Definition**: A regularization technique to reduce overfitting by penalizing large weights.
   - **In Context**: `weight_decay = 1.0e-02` sets the strength of the penalty.

9. **es_patience**:
   - **Definition**: Number of epochs with no improvement after which training will be stopped early.
   - **In Context**: `es_patience = 5` means the training will stop if there is no improvement for 5 consecutive epochs.

10. **max_epoch and aug_epoch**:
    - **Definition**: Maximum number of epochs for training and number of epochs with augmentation.
    - **In Context**: `max_epoch = 9` and `aug_epoch = 6` define the training schedule.

#### **Data Augmentation**

1. **Augmentation Flags**:
   - **Definition**: Various flags that control different types of data augmentation techniques.
   - **In Context**: Flags like `aug_noise`, `aug_gain`, `aug_wave_pitchshift`, etc., control the types and extents of augmentations applied to the audio data.

2. **Mixup Parameters**:
   - **Definition**: Parameters for mixup data augmentation technique.
   - **In Context**: `aug_wave_mixup = 1.0` and `aug_spec_mixup = 0.0` control the application of mixup on waveforms and spectrograms.

#### **Label Handling and Logging**

1. **useSecondary and secondary_label_value**:
   - **Definition**: Whether to use secondary labels and their assigned value.
   - **In Context**: `useSecondary = True` and `secondary_label_value = 0.5` handle secondary labels during training.

2. **wandb**:
   - **Definition**: Flag to enable or disable Weights and Biases for experiment tracking.
   - **In Context**: `wandb = True` enables logging of training runs and results.

3. **seed**:
   - **Definition**: Random seed to ensure reproducibility.
   - **In Context**: `seed = 42` ensures consistent results across different runs.



In [None]:
class config:
    dir = "/kaggle/input/birdclef-2024/"


    wave_path = "original_waves/second_30/"

    model_name = 'tf_efficientnet_b0'

    pool_type = 'avg'

    
    train_duration = 30 
    slice_duration = 5 

    test_duration = 5

    train_drop_duration = 1
    
    # spectrogram parameters
    sr = 32000
    fmin = 20
    fmax = 15000

    n_mels = 128
    n_fft = n_mels*8
    size_x = 512
    
    hop_length = int(sr*slice_duration / size_x)
    test_hop_length = int(sr*test_duration / size_x)
    
    bins_per_octave = 12

    nfolds = 5
    inference_folds = [4]
    
    enable_amp = True
    train_batchsize = 32
    valid_batchsize = 1

    # loss_type = "BCEWithLogitsLoss"
    loss_type = "BCEFocalLoss"

    lr = 1.0e-04 


    optimizer='adan'
    weight_decay = 1.0e-02
    es_patience =  5
    deterministic = True
    enable_amp = True

    max_epoch = 9
    aug_epoch = 6
    

    useSecondary =True
    secondary_label_value = 0.5
    oversample =False
    oversample_threthold = 60
    
    seed = 42

    wandb = True

    ###augmentation flags
    aug_noise            = 0.
    aug_gain             = 0.0
    aug_wave_pitchshift  = 0.0
    aug_wave_shift       = 0.

    aug_spec_xymasking   = 0.
    aug_spec_coarsedrop  = 0.
    aug_spec_hflip       = 0.

    ##mixup param
    aug_wave_mixup       = 1.0
    aug_spec_mixup       = 0.0
    aug_spec_mixup_prob  = 0.5 
    alpha=0.95

    smoothing_value      = 0.0
    # spec_mix_mask_percent = 20
    
cfg = config()

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

print(device)

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">4 | Data Augmentation Pipeline For Training </p>



### Normal Augmentation

Augmentations applied directly to the audio waveform.

#### `Compose`
- Combines multiple augmentation transformations into a single pipeline. Each transformation in the list is applied in sequence.

#### `OneOf`
- Randomly selects one of the provided augmentations to apply. This is useful for introducing variety without overloading the data with too many augmentations at once.

#### `Gain` and `GainTransition`
- `Gain`: Changes the volume of the audio by a random amount between -15 dB and 15 dB.
- `GainTransition`: Changes the volume gradually over a duration, making the audio louder or quieter over time.
- `p=cfg.aug_gain`: Probability of applying the gain augmentation, controlled by `cfg.aug_gain`.
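
To get a feel for what a gain of ±15 dB means, the sketch below converts decibels to a linear amplitude multiplier using the standard amplitude convention, `10 ** (dB / 20)`. The helper name `db_to_amplitude` is illustrative and is not part of the augmentation library.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)

for gain_db in (-15, 0, 15):
    print(gain_db, round(db_to_amplitude(gain_db), 3))
```

So the `Gain` range used here scales the waveform by a factor between roughly 0.18 and 5.6.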

#### `AddGaussianNoise` and `AddColorNoise`
- `AddGaussianNoise`: Adds random Gaussian noise to the audio.
- `AddColorNoise`: Adds colored noise with specific signal-to-noise ratio (SNR) and decay properties.
- `p=cfg.aug_noise`: Probability of applying noise augmentation, controlled by `cfg.aug_noise`.

#### `PitchShift`
- Changes the pitch of the audio by a random amount between -1 and 1 semitones.
- `p=cfg.aug_wave_pitchshift`: Probability of applying pitch shift, controlled by `cfg.aug_wave_pitchshift`.

#### `Shift`
- Shifts the audio in time, effectively moving it forward or backward.
- `p=cfg.aug_wave_shift`: Probability of applying time shift, controlled by `cfg.aug_wave_shift`.

### Albumentations Transform
Augmentations that are applied to the spectrograms, which are visual representations of the audio.

#### `albumentations.XYMasking`
- Randomly masks parts of the spectrogram, blocking out sections in both time and frequency dimensions.
- `num_masks_x=2`, `num_masks_y=1`: Number of masks in x (time) and y (frequency) directions.
- `mask_x_length`, `mask_y_length`: Size of the masks.
- `p=cfg.aug_spec_xymasking`: Probability of applying XY masking, controlled by `cfg.aug_spec_xymasking`.

#### `albumentations.CoarseDropout`
- Randomly drops larger sections of the spectrogram, simulating occlusions or missing data.
- `min_holes=20`, `max_holes=50`: Number of holes to drop in the spectrogram.
- `p=cfg.aug_spec_coarsedrop`: Probability of applying coarse dropout, controlled by `cfg.aug_spec_coarsedrop`.

#### `albumentations.HorizontalFlip`
- Flips the spectrogram horizontally. This can help the model learn features invariant to time reversal.
- `p=cfg.aug_spec_hflip`: Probability of applying horizontal flip, controlled by `cfg.aug_spec_hflip`.

### Compose Albumentations
```python
albumentations_augment = albumentations.Compose(alb_transform)
```
Combines the defined spectrogram augmentations into a single pipeline, so they can be applied in sequence.

### Summary
- **Purpose**: These augmentations are applied to the training data to increase its diversity and help the model generalize better to unseen data.
- **Waveform Augmentations**: Applied directly to the raw audio signal to introduce variations in volume, noise, pitch, and timing.
- **Spectrogram Augmentations**: Applied to the visual representation of the audio to simulate occlusions and learn invariant features.



In [None]:
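if isTrain:
    # Imported here so that inference (isTrain = False) does not require these packages.
    # The waveform transforms are assumed to come from the audiomentations package.
    import albumentations
    from audiomentations import (
        Compose, OneOf, Gain, GainTransition,
        AddGaussianNoise, AddColorNoise, PitchShift, Shift,
    )

    # Waveform augmentations; each probability comes from cfg, and 0.0 disables the step.
    normal_augment = Compose([
        OneOf([
            Gain(min_gain_in_db=-15, max_gain_in_db=15, p=1.0),
            GainTransition(min_gain_in_db=-24.0, max_gain_in_db=6.0,
                           min_duration=0.2, max_duration=6.0, p=1.0)
        ], p=cfg.aug_gain),

        OneOf([
            AddGaussianNoise(p=1),
            AddColorNoise(p=1, min_snr_db=5, max_snr_db=20, min_f_decay=-3.01, max_f_decay=-3.01)
        ], p=cfg.aug_noise),

        PitchShift(min_semitones=-1, max_semitones=1, p=cfg.aug_wave_pitchshift),
        Shift(p=cfg.aug_wave_shift)
    ])

    # Spectrogram augmentations, applied to the Mel spectrogram as if it were an image.
    alb_transform = [
        albumentations.XYMasking(num_masks_x=2, num_masks_y=1,
                                 mask_x_length=cfg.size_x//30, mask_y_length=cfg.n_mels//30,
                                 fill_value=0, mask_fill_value=0, p=cfg.aug_spec_xymasking),
        albumentations.CoarseDropout(fill_value=0, min_holes=20, max_holes=50, p=cfg.aug_spec_coarsedrop),
        albumentations.HorizontalFlip(p=cfg.aug_spec_hflip)
    ]
    albumentations_augment = albumentations.Compose(alb_transform)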

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">5 |  Mixup Data Augmentation Function </p>


### Function: `mixup`

**Purpose**:  
The `mixup` function is used for data augmentation. It combines two examples in the dataset to create a new example. This technique helps the model generalize better by providing it with more varied training examples.

#### Parameters:
- `data`: The input data (usually a batch of samples).
- `targets`: The corresponding labels for the input data.
- `alpha`: A parameter for the Beta distribution, controlling the degree of interpolation between examples.
- `mode`: Specifies how to select the data for mixing. Options are "same_wave" or "other_wave".
---
#### Concepts:

1. **Mixup**: 
   - Mixup is a data augmentation technique where two samples and their labels are linearly interpolated to create a new sample and label.
   - This helps improve the robustness of the model by preventing it from overfitting to the original training data.

2. **Beta Distribution**:
   - `np.random.beta(alpha, alpha)`: Generates a lambda (λ) value from the Beta distribution. This λ determines the mix ratio between the two samples.
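   - Because both shape parameters equal `alpha`, the distribution is symmetric around 0.5. With `alpha` below 1, λ tends to fall near 0 or 1, so each mixed sample stays close to one of the two originals. With `alpha` above 1, λ concentrates around 0.5 and the two samples are blended more evenly. At `alpha = 1`, λ is uniform on [0, 1].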

3. **Interpolation**:
   - The new data is created by interpolating between the original and shuffled data using the lambda (λ) value.
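
The effect of `alpha` on the mix ratio can be checked empirically. The sketch below draws λ from a symmetric Beta distribution for three values of `alpha` and reports how far λ lies from 0.5 on average. It uses the standard library's `random.betavariate` rather than `np.random.beta`, and the helper name, sample count, and seed are arbitrary choices for this example.

```python
import random

def mean_distance_from_half(alpha, n=20000, seed=0):
    """Average |lam - 0.5| for lam drawn from Beta(alpha, alpha)."""
    rng = random.Random(seed)
    return sum(abs(rng.betavariate(alpha, alpha) - 0.5) for _ in range(n)) / n

d_small = mean_distance_from_half(0.2)   # lam mostly near 0 or 1
d_mid = mean_distance_from_half(0.95)    # close to uniform (the config value)
d_large = mean_distance_from_half(5.0)   # lam mostly near 0.5

print(round(d_small, 3), round(d_mid, 3), round(d_large, 3))
```

A larger average distance means λ sits closer to 0 or 1, so each mixed sample is dominated by one of its two sources. With `alpha = 0.95`, λ is spread almost uniformly over [0, 1].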

---

### Code Explanation:

```python
def mixup(data, targets, alpha, mode="same_wave"):
```
- Defines the function with parameters `data`, `targets`, `alpha`, and `mode`.

```python
    if mode == "same_wave":
        data = torch.tensor(data)
        indices = torch.randperm(data.size(0))
        shuffled_data = data[indices]

        lam = np.random.beta(alpha, alpha)
        new_data = data * lam + shuffled_data * (1 - lam)
        return new_data.numpy()
```
- **Mode: "same_wave"**:
  - Converts `data` to a tensor.
  - `torch.randperm(data.size(0))` generates a random permutation of indices.
  - `shuffled_data = data[indices]` shuffles the data based on the random indices.
  - `lam = np.random.beta(alpha, alpha)` generates a lambda value.
  - `new_data = data * lam + shuffled_data * (1 - lam)` creates the new mixed data by combining the original and shuffled data.
  - Returns only the mixed data as a NumPy array. `targets` is not used in this mode, so the caller keeps the original labels.

```python
    elif mode == "other_wave":
        indices = torch.randperm(data.size(0))
        shuffled_data = data[indices]
        shuffled_targets = targets[indices]
    
        lam = np.random.beta(alpha, alpha)
        new_data = data * lam + shuffled_data * (1 - lam)
        new_targets = targets * lam + shuffled_targets * (1 - lam)
    
        return new_data, new_targets
```
- **Mode: "other_wave"**:
  - `indices = torch.randperm(data.size(0))` generates a random permutation of indices.
  - `shuffled_data = data[indices]` and `shuffled_targets = targets[indices]` shuffle the data and targets.
  - `lam = np.random.beta(alpha, alpha)` generates a lambda value.
  - `new_data = data * lam + shuffled_data * (1 - lam)` and `new_targets = targets * lam + shuffled_targets * (1 - lam)` create new mixed data and labels.
  - Returns the new mixed data and labels.
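
The "other_wave" branch can be illustrated without PyTorch. The sketch below is a NumPy re-creation of the same interpolation on a toy batch, not the notebook's function itself; the array shapes, the seed, and the one-hot labels are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 4 "waveforms" of 8 samples each, with one-hot labels over 3 classes.
data = rng.normal(size=(4, 8))
targets = np.eye(3)[[0, 1, 2, 0]]

indices = rng.permutation(len(data))   # partner for each sample
lam = rng.beta(0.95, 0.95)             # alpha = 0.95, as in the config

# Data and labels are interpolated with the same weight.
new_data = lam * data + (1 - lam) * data[indices]
new_targets = lam * targets + (1 - lam) * targets[indices]

print(round(float(lam), 3))
print(new_targets.sum(axis=1))  # each row still sums to 1
```

Because data and labels share the same λ, a mixed sample that is 70% recording A and 30% recording B also carries a label that is 70% A and 30% B.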



In [None]:
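def mixup(data, targets, alpha, mode="same_wave"):
    """Mix samples within a batch; in "other_wave" mode the targets are mixed as well."""

    if mode == "same_wave":
        # Mix rows of the same array; targets are not touched in this mode.
        data = torch.tensor(data)
        indices = torch.randperm(data.size(0))
        shuffled_data = data[indices]

        # lam is the weight of the original sample, (1 - lam) the weight of its partner.
        lam = np.random.beta(alpha, alpha)
        new_data = data * lam + shuffled_data * (1 - lam)
        return new_data.numpy()

    elif mode == "other_wave":
        # Pair each sample with another sample in the batch and mix data and labels alike.
        indices = torch.randperm(data.size(0))
        shuffled_data = data[indices]
        shuffled_targets = targets[indices]

        lam = np.random.beta(alpha, alpha)
        new_data = data * lam + shuffled_data * (1 - lam)
        new_targets = targets * lam + shuffled_targets * (1 - lam)

        return new_data, new_targets

    else:
        # Fail loudly instead of silently returning None for an unknown mode.
        raise ValueError(f"Unknown mixup mode: {mode}")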

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">6 | Spectral Mixup Function</p>


1. **SpecXYMasking**:
   - `spec_xymasking` is an instance of the `XYMasking` transformation from the Albumentations library, which applies masking to an image.
   - The parameters specify the number of masks along the x and y axes, the lengths of these masks, and the fill values for the mask and the image.
   - This transformation is applied with a probability of 1 (i.e., always) during data augmentation.

2. **Function `spec_mixup`**:
   - This function takes `data` (input spectrogram data) and `targets` (corresponding labels).
   - `type = data.dtype`: Stores the data type of the input data.
   - `indices = torch.randperm(data.size(0))`: Generates random permutations of indices.
   - `shuffled_data` and `shuffled_targets` are created by shuffling `data` and `targets` based on the generated indices.
   - `data_transposed` is created by moving axes 2 and 3 of the batch to the front, so that `spec_xymasking` can treat the batch as a single image. After masking, the axes are moved back to the original layout.
   - `diff` is the difference between the original data and the masked data. It is non-zero only where the mask changed a value.
   - `mask` marks those changed positions with 1 and all other positions with 0.
   - `shuffled_data_masked` keeps the shuffled data only inside the masked regions.
   - `new_data` adds the masked original data and `shuffled_data_masked`, so the masked regions are filled with values from the shuffled samples.
   - `lam` is the average fraction of each spectrogram that was masked: the mask sum divided by the batch size and by `cfg.n_mels * cfg.size_x`.
   - `new_targets` mixes the original and shuffled targets with weights `1 - lam` and `lam`, respectively.

3. **Return**:
   - The function returns `new_data` and `new_targets`.
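
The mask-based mixing described above can be sketched in plain NumPy. This is a simplified illustration, not the notebook's function: the mask is a fixed pair of bands instead of the output of `XYMasking`, the batch has a single channel, and the shapes and partner assignment are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of 2 single-channel "spectrograms": shape (batch, channel, mels, time).
data = rng.uniform(0.1, 1.0, size=(2, 1, 8, 16))
targets = np.array([[1.0, 0.0], [0.0, 1.0]])

indices = np.array([1, 0])               # fixed partner assignment for clarity
shuffled_data = data[indices]
shuffled_targets = targets[indices]

# Illustrative mask: one frequency band and one time band (1 = replaced region).
mask = np.zeros_like(data)
mask[:, :, 2:4, :] = 1                   # 2 mel bins masked across all time steps
mask[:, :, :, 5:8] = 1                   # 3 time steps masked across all mel bins

# Keep the original outside the mask and take the partner's values inside it.
new_data = data * (1 - mask) + shuffled_data * mask
lam = mask.mean()                        # fraction of elements taken from the partner
new_targets = targets * (1 - lam) + shuffled_targets * lam

print(lam)
print(new_targets)
```

Here 50 of the 128 cells in each spectrogram come from the partner, so the partner's label receives a weight of 50/128 and the original label keeps the rest.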



---

### **Spectral mixup**

A data augmentation technique commonly used in the field of audio signal processing, especially in tasks such as audio classification or sound event detection. This technique is an extension of the original "mixup" technique, which was initially proposed for image classification tasks.

In the context of audio data, spectral mixup involves mixing two or more audio signals' spectrograms to create new synthetic examples. Here's how it works:

1. **Spectrogram Generation**:
   - Audio signals are converted into spectrograms, which are visual representations of the audio signal's frequency content over time. Spectrograms are created using techniques such as Short-Time Fourier Transform (STFT).

2. **Mixing Spectrograms**:
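   - Spectrograms from different audio samples are linearly combined. This combination involves taking a weighted sum of the spectrograms, where the weights are randomly sampled from a beta distribution.
   - The purpose of mixing spectrograms is to create new training examples that lie on the line segment connecting two original spectrograms in the feature space.
   - The `spec_mixup` function in this notebook uses a mask-based variant instead: regions masked by `XYMasking` are filled from a second spectrogram, and the mixing weight is the masked fraction of the spectrogram rather than a Beta sample.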

3. **Label Mixing**:
   - Along with mixing spectrograms, the corresponding labels (or target outputs) are also mixed based on the same weights used for mixing the spectrograms. This ensures that the labels for the synthetic examples are also a linear combination of the original labels.

4. **Training**:
   - The synthetic spectrograms and their mixed labels are then used for training a neural network model.
   - During training, the model learns from both the original and synthetic examples, effectively expanding the diversity of the training data and improving the model's ability to generalize to unseen data.

Spectral mixup helps in regularizing the model and reducing overfitting by providing a smoother and more continuous distribution of training examples in the feature space. It encourages the model to learn more robust and generalized representations of the input data, leading to improved performance on unseen data.

In [None]:
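if isTrain:
    import albumentations  # only needed for training

    # Mask used by spec_mixup: 2 time masks and 1 frequency mask, always applied (p=1).
    spec_xymasking = albumentations.XYMasking(num_masks_x=2, num_masks_y=1,
                                              mask_x_length=cfg.size_x // 10, mask_y_length=cfg.n_mels // 10,
                                              fill_value=0, mask_fill_value=0, p=1)

def spec_mixup(data, targets):
    # Requires spec_xymasking, which is only defined when isTrain is True.
    type = data.dtype

    # Pair every sample with a randomly chosen partner from the same batch.
    indices = torch.randperm(data.size(0))
    shuffled_data = data[indices]
    shuffled_targets = targets[indices]

    # Move axes 2 and 3 to the front so the whole batch can be masked like one image,
    # then restore the original axis order.
    data = np.array(data)
    data_transposed = np.transpose(data, (2, 3, 1, 0))
    data_transposed = spec_xymasking(image=data_transposed)["image"]
    data_transposed = np.transpose(data_transposed, (3, 2, 0, 1))

    # Positions changed by the mask become 1, all others 0.
    diff = data - data_transposed
    mask = (diff != 0).astype(int)

    # Fill the masked regions with the partner's values.
    shuffled_data_masked = (shuffled_data * mask)

    new_data = torch.tensor(data_transposed, dtype=type) + torch.tensor(shuffled_data_masked, dtype=type)

    # lam is the masked fraction of each spectrogram; it weights the partner's labels.
    lam = mask.sum() / len(data) / (cfg.n_mels*cfg.size_x)
    new_targets = targets * (1-lam) + shuffled_targets *lam

    return new_data, new_targets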


# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">7 | Mel Spectrogram Generation </p>

#### Purpose:
This cell is all about creating Mel spectrograms from audio data. Mel spectrograms are like pictures of sound, and they're crucial for training and testing models to deal with audio, like recognizing bird sounds.

#### Explanation:
1. **spec_layer**: Think of this as the camera we use to take pictures of sound during training. We configure it with the sample rate of the audio (`cfg.sr`), how far it moves forward between pictures (`cfg.hop_length`), how much audio goes into each picture (`cfg.n_fft`), and how many pitch bands each picture has (`cfg.n_mels`). We also tell it the lowest and highest frequencies to pay attention to (`cfg.fmin`, `cfg.fmax`). The layer is then moved to `device`, which is the GPU when one is available and the CPU otherwise.

2. **valid_spec_layer**: This is the same camera, used when we check how good our model is during validation. The only difference in its setup is that it uses `cfg.test_hop_length`. With the current config, where `slice_duration` and `test_duration` are both 5, that is the same value as `cfg.hop_length`.

3. **test_spec_layer**: This one has the same settings as `valid_spec_layer`, but it is kept on the CPU (`.cpu()`). We use it when we're done training and want to see how well our model works on new sounds.

#### Concepts:
- **Mel Spectrogram**: It's like a photo album of sound waves, showing how loud different pitches are over time.
- **torchaudio.transforms.MelSpectrogram**: It's like a magic tool that turns sound into Mel spectrograms.
- **Sample Rate**: How many audio samples there are in each second of sound (32,000 here).
- **Hop Length**: How far the magic tool moves forward between pictures, kind of like the gap between frames in a movie.
- **FFT (Fast Fourier Transform) Points**: How many audio samples go into each picture. More points show finer pitch detail but blur the timing.
- **Mel Bins**: Imagine the magic tool dividing sound into buckets to see how much sound there is at different pitches.
- **Minimum and Maximum Frequency**: The lowest and highest pitches the magic tool pays attention to.
- **Mel Scale**: It's like adjusting the colors in the photo album to match how humans hear sound.
- **Centering and Padding**: Tricks to make sure the pictures look nice, even at the edges.

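To make the hop-length and FFT-size trade-offs concrete, here is a small arithmetic sketch. The numbers (32 kHz audio, 5-second clips, `n_fft=2048`, hop lengths of 500 and 1000) are illustrative assumptions, not values read from `cfg`. It relies on the fact that, with `center=True`, the underlying STFT produces `1 + n_samples // hop_length` frames.

```python
# Illustrative values only; the notebook's real settings live in `cfg`.
sr = 32_000          # samples per second (assumed)
clip_seconds = 5     # clip length in seconds (assumed)
n_fft = 2048         # samples per analysis window (assumed)


def n_frames(n_samples: int, hop_length: int) -> int:
    """Number of STFT frames when center=True (the signal is padded at both ends)."""
    return 1 + n_samples // hop_length


n_samples = sr * clip_seconds
frames_hop_500 = n_frames(n_samples, hop_length=500)
frames_hop_1000 = n_frames(n_samples, hop_length=1000)

# Each FFT bin covers sr / n_fft Hz before the Mel filterbank groups the bins.
hz_per_fft_bin = sr / n_fft

print(frames_hop_500, frames_hop_1000, hz_per_fft_bin)
```

Halving the hop length roughly doubles the spectrogram width, which is one reason the dataset code later crops the time axis to `cfg.size_x`.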

In [None]:
spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels,f_min=cfg.fmin,f_max=cfg.fmax,mel_scale='slaney',center=True, pad_mode='reflect'
).to(device)

valid_spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.test_hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels,f_min=cfg.fmin,f_max=cfg.fmax,mel_scale='slaney',center=True, pad_mode='reflect'
).to(device)

test_spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.test_hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels,f_min=cfg.fmin,f_max=cfg.fmax,mel_scale='slaney',center=True, pad_mode='reflect'
).cpu()

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">8 | Data Preparation Steps</p>

#### 1. Reading Sample Submission Data:
   - **Purpose**: Understanding submission file format.
   - **Action**: Reads the sample submission file.
  

#### 2. Defining Labels:
   - **Purpose**: Storing label information.
   - **Action**: Extracts column names from the sample submission file.
  

#### 3. Reading Training Metadata:
   - **Purpose**: Understanding training data details.
   - **Action**: Reads the training metadata file.
 

#### 4. Creating New Target Column:
   - **Purpose**: Preparing target variable for training.
   - **Action**: Combines primary and secondary labels into a single column.


#### 5. Calculating Length of New Target:
   - **Purpose**: Analyzing label distribution.
   - **Action**: Counts the number of labels per sample.
 

#### 6. Visualizing Label Distribution:
   - **Purpose**: Visualizing label distribution for analysis.
   - **Action**: Generates a bar plot of label distribution.
  

#### 7. Cleaning Duplicated Filenames:
   - **Purpose**: Ensuring data integrity.
   - **Action**: Identifies and removes rows with duplicated filenames.
   - **Steps**:
     - Extracts filenames from paths and removes file extensions.
     - Identifies duplicated filenames.
     - Removes rows with duplicated filenames.
  

#### 8. Resetting DataFrame Index:
   - **Purpose**: Ensuring sequential index after data manipulation.
   - **Action**: Resets the index of the DataFrame.
 

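Steps 4, 5, and 7 can be sketched on a few toy rows without pandas. The rows, labels, and filenames below are made up for illustration; only the string handling mirrors the cell that follows.

```python
import ast
from collections import Counter

# Toy rows standing in for train_metadata.csv (made-up values).
rows = [
    {"primary_label": "birdA", "secondary_labels": "['birdB']", "filename": "birdA/XC1.ogg"},
    {"primary_label": "birdB", "secondary_labels": "[]", "filename": "birdB/XC2.ogg"},
    {"primary_label": "birdA", "secondary_labels": "[]", "filename": "birdA/XC2.ogg"},
]

for row in rows:
    # secondary_labels is stored as the string form of a Python list.
    secondary = ast.literal_eval(row["secondary_labels"])
    row["new_target"] = row["primary_label"] + " " + " ".join(secondary)
    row["len_new_target"] = len(row["new_target"].split())
    # "birdA/XC1.ogg" -> "XC1": drop the folder and the 4-character extension.
    row["filename_tmp"] = row["filename"].split("/")[1][:-4]

# Drop every row whose file stem appears more than once (all copies are removed).
stem_counts = Counter(row["filename_tmp"] for row in rows)
kept = [row for row in rows if stem_counts[row["filename_tmp"]] == 1]

print([row["filename"] for row in kept])
```

Note that both copies of a duplicated stem are removed, not just the extra one, which matches the `~isin(duplicated_filenames)` filter in the notebook.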

In [None]:
sample_submission = pd.read_csv(cfg.dir+"sample_submission.csv")
LABELS = list(sample_submission.set_index("row_id").columns)
LABELS[:5]
train_csv = pd.read_csv(cfg.dir+"train_metadata.csv")
train_csv['new_target'] = train_csv['primary_label'] + ' ' + train_csv['secondary_labels'].map(lambda x: ' '.join(ast.literal_eval(x)))
train_csv['len_new_target'] =train_csv['new_target'].map(lambda x: len(x.split()))
train_csv["len_new_target"].value_counts().plot(kind="bar", figsize=(4,2))
train_csv["filename_tmp"] = train_csv["filename"].map(lambda x:x.split("/")[1][:-4])
duplicated_filenames = train_csv["filename_tmp"].value_counts()[train_csv["filename_tmp"].value_counts() > 1].index
train_csv = train_csv[~train_csv["filename_tmp"].isin(duplicated_filenames)]
train_csv = train_csv.reset_index(drop=True)

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%"> 9 | BirdCLEF Dataset Preparation </p>

#### Purpose:
Prepare the BirdCLEF dataset for training, validation, and testing purposes. It involves defining a custom dataset class, performing necessary data preprocessing steps, and generating Mel spectrograms from audio recordings.

#### Explanation:


1. **`__init__` method**:
    - This method initializes the dataset instance. It takes the data (`df`), a boolean flag for augmentation (`augmentation`), and a mode (`mode`). In `train` and `valid` modes, `df` is a DataFrame with one row per recording (labels and `fileID`), and its index is reset so that positional lookups work. In `test` and `clean` modes, `df` is indexed directly with `self.df[idx]` to get a file path, so it is expected to be a list-like of paths.

2. **`__len__` method**:
    - This method returns the total number of samples in the dataset.

3. **`normalize` method**:
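    - This method min-max scales one spectrogram to the range 0 to 1. Before scaling, it replaces `-inf` values (the log of zero energy) with the mean of the remaining values. It modifies its input in place and assumes the spectrogram is not constant, since a constant input would lead to a division by zero.

4. **`wave_tile_and_cutoff` method**:
    - This method fixes the length of the waveform. If the clip is longer than `cfg.train_drop_duration` seconds, it first discards that many seconds from the start. If what remains is shorter than `cfg.train_duration` seconds, it tiles (repeats) the audio. Finally it truncates to exactly `cfg.train_duration` seconds.

5. **`label_smoothing` method**:
    - This method builds soft targets. The primary label gets `1 - cfg.smoothing_value`, secondary labels get `cfg.secondary_label_value`, and every other class gets a small positive value, `cfg.smoothing_value / (len(LABELS) - 1)`. Soft targets discourage overconfident predictions and can improve generalization.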

6. **`__getitem__` method**:
    - This method retrieves a sample from the dataset based on the provided index (`idx`). It loads audio data, preprocesses it, applies augmentation (if enabled), generates spectrograms, and returns the spectrogram along with its corresponding target label.

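    - For training mode:
        - It loads the pre-computed waveform (`.npy`), fixes its length, applies waveform and spectrogram augmentation (if enabled), generates a log-Mel spectrogram, and returns it with a label-smoothed target.

    - For validation mode:
        - It loads the pre-computed waveform, splits it into `cfg.test_duration`-second chunks, generates one log-Mel spectrogram per chunk, and returns them along with the target repeated for each chunk.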

    - For test mode:
        - It reads the audio file, generates spectrograms, and returns them.

    - For clean mode:
        - It reads the audio file, generates spectrograms, and returns them along with the file path.


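The two preprocessing helpers can be tried on tiny arrays. This is a NumPy sketch with made-up sizes, not the class methods themselves: `tile_and_cut` takes the drop and use lengths as arguments instead of reading `cfg`, and `normalize` here works on a copy and adds a guard against a zero range that the original does not have.

```python
import numpy as np


def tile_and_cut(wave: np.ndarray, drop: int, use: int) -> np.ndarray:
    """Skip the first `drop` samples if the clip is longer than that, then tile/truncate to `use`."""
    if wave.shape[1] > drop:
        wave = wave[:, drop:]
    if wave.shape[1] < use:
        repeats = 1 + use // wave.shape[1]
        wave = np.tile(wave, (1, repeats))
    return wave[:, :use]


def normalize(x: np.ndarray) -> np.ndarray:
    """Replace -inf (log of zero energy) with the mean of the other values, then min-max scale."""
    x = x.copy()
    finite = x[x != -np.inf]
    x[x == -np.inf] = finite.mean()
    x = x - x.min()
    peak = x.max()
    return x / peak if peak > 0 else x  # guard added for this sketch only


wave = np.arange(1, 6, dtype=np.float32)[None, :]  # shape (1, 5): samples 1..5
cut = tile_and_cut(wave, drop=2, use=7)             # drop 1, 2 -> [3, 4, 5], tiled to 7 samples

with np.errstate(divide="ignore"):
    log_spec = np.log(np.array([[0.0, 1.0], [np.e, np.e ** 3]]))  # log(0) = -inf
scaled = normalize(log_spec)

print(cut)
print(scaled)
```

The tiling repeats the clip from its start, so a short recording simply loops until the target length is reached.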

#### Concepts:
- **Custom Dataset Class**: A specialized class tailored for specific dataset handling.
- **Mel Spectrogram**: A visual representation of sound frequency spectrum used for audio analysis.
- **Label Smoothing**: Technique to prevent model overconfidence and improve generalization.
- **Data Preprocessing**: Manipulating data to make it suitable for analysis or model training.
- **Data Augmentation**: Techniques to increase dataset diversity for better model training.
- **Indexing**: Accessing specific elements from the dataset based on their index.

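A worked example helps to see what `label_smoothing` produces. The sketch below reimplements the same arithmetic in NumPy; the five classes, `secondary_value=0.5`, and `smoothing=0.1` are assumed values chosen for illustration, not the notebook's configuration.

```python
import numpy as np


def smooth_labels(target, primary, secondary_value, smoothing, n_labels):
    """Soft targets: primary -> 1 - smoothing, secondary -> secondary_value, rest -> small noise."""
    noise = smoothing / (n_labels - 1)
    secondary = np.clip(target * secondary_value + noise, 0, secondary_value)
    combined = np.clip(primary + secondary, 0, 1)
    return combined - primary * smoothing


# Class 0 is the primary label, class 1 a secondary label (toy example).
target = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # multi-hot: primary + secondary
primary = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one-hot: primary only

soft = smooth_labels(target, primary, secondary_value=0.5, smoothing=0.1, n_labels=5)
print(soft)
```

With these values the primary class ends at 0.9, the secondary class at 0.5, and each remaining class at 0.025.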
In [None]:
class BirdCLEF_Dataset(torch.utils.data.Dataset):
    def __init__(self, df, augmentation=False, mode='train'):
        if mode == 'train':
            self.df = df.reset_index(drop=True)
        elif mode == 'valid':
            self.df = df.reset_index(drop=True)
        else:
            self.df = df
        self.mode = mode
        self.augmentation = augmentation
    
    def __len__(self):
        return len(self.df)

    def normalize(self, x):
        valid_values = x[x != float('-inf')]
        mean_value = np.mean(valid_values)
        x[x == float('-inf')] = mean_value
        

        x = x - x.min()
        x = x / x.max()
        return x

    def wave_tile_and_cutoff(self, data):
      
        drop_duration = cfg.sr*cfg.train_drop_duration
        use_duration  = cfg.sr*cfg.train_duration
        
        if len(data[0]) > drop_duration: 
            data = data[:,drop_duration:]

        if len(data[0]) < use_duration:
            # Repeat the clip often enough to cover use_duration samples.
            n_repeats = 1 + (use_duration) // len(data[0])
            data = np.tile(data, (1, n_repeats))

        data = data[:,:use_duration]
        return data

    def label_smoothing(self, idx, target):
    
        secondary_target = target * cfg.secondary_label_value
    
        out_of_target_noise_intensity = cfg.smoothing_value/(len(LABELS)-1) 
        out_of_target_noise_array = torch.ones(target.shape) * out_of_target_noise_intensity
        
        secondary_target_with_noise = secondary_target + out_of_target_noise_array
        secondary_target_with_noise = torch.clip(secondary_target_with_noise, min=0, max=cfg.secondary_label_value)
    
        primary_target = np.isin(LABELS, self.df.loc[idx, "primary_label"]).astype(int)
        primary_target = torch.tensor(primary_target, dtype=torch.float32)

        primary_and_secondary_target_with_noise = primary_target + secondary_target_with_noise
        new_target = torch.clip(primary_and_secondary_target_with_noise, min=0, max=1)
    
        new_target = new_target - primary_target * cfg.smoothing_value
    
        return new_target

    
    def __getitem__(self, idx):

        if self.mode == 'train':

          
            if cfg.useSecondary == True:
                target = np.isin(LABELS, self.df.loc[idx, "new_target"].split()).astype(int)
            else:
                target = np.isin(LABELS, self.df.loc[idx, "primary_label"].split()).astype(int)
            target = torch.tensor(target, dtype=torch.float32)
          
            target = self.label_smoothing(idx, target)
            
            fileID = self.df.loc[idx, 'fileID'] 
            
            path = f"{cfg.wave_path}{fileID}.npy"
            wave = np.load(path)
            

      
            wave = self.wave_tile_and_cutoff(data=wave)

            
            input_duration = cfg.sr * cfg.slice_duration
            
            
            if self.augmentation == True:
               
                if cfg.aug_wave_mixup > np.random.random():
                    #train_duration -> slice_duration
                    wave_reshape = wave.reshape(-1, input_duration)
                    wave = mixup(data=wave_reshape, targets=target, alpha=cfg.alpha, mode="same_wave")
                    wave = wave[:1,:]
                else:
                    wave = wave[:, :input_duration]
                
     
                wave = normal_augment(samples=wave, sample_rate=cfg.sr)

    
                wave = torch.tensor(wave).to(device)
                mel_spec = spec_layer(wave)
                mel_spec = np.array(mel_spec.cpu())

                mel_spec = np.log(mel_spec)
                for i in range(len(mel_spec)):
                    mel_spec[i] = self.normalize(mel_spec[i])
                mel_spec = torch.tensor(mel_spec)
                mel_spec = mel_spec[:,:,:cfg.size_x]

     
                mel_spec = np.array(mel_spec.cpu())
                mel_spec = np.transpose(mel_spec, (1, 2, 0))                
                mel_spec = albumentations_augment(image=mel_spec)["image"]
                mel_spec = np.transpose(mel_spec, (2, 0, 1))


                
            else:
                wave = wave[:, :input_duration]
                
                wave = torch.tensor(wave).to(device)
                mel_spec = spec_layer(wave)
                mel_spec = np.array(mel_spec.cpu())

                mel_spec = np.log(mel_spec)

                for i in range(len(mel_spec)):
                    mel_spec[i] = self.normalize(mel_spec[i])
                    

                mel_spec = torch.tensor(mel_spec)
                mel_spec = mel_spec[:,:,:cfg.size_x]

            
            # as_tensor accepts both the NumPy array (augmented branch) and the tensor (plain branch).
            mel_spec = torch.as_tensor(mel_spec)

            
            return mel_spec, target

        elif self.mode == 'valid':
            

            if cfg.useSecondary == True:
                target = np.isin(LABELS, self.df.loc[idx, "new_target"].split()).astype(int)
            else:
                target = np.isin(LABELS, self.df.loc[idx, "primary_label"].split()).astype(int)
            target = torch.tensor(target, dtype=torch.float32)
            
            fileID = self.df.loc[idx, 'fileID'] 
            
            path = f"{cfg.wave_path}{fileID}.npy"
            wave = np.load(path)

            wave = self.wave_tile_and_cutoff(data=wave)

            input_duration = cfg.sr*cfg.test_duration
            wave_reshape = wave.reshape(-1, input_duration)

            wave_reshape = torch.tensor(wave_reshape).to(device)
            mel_specs = valid_spec_layer(wave_reshape)
            mel_specs = mel_specs.cpu().numpy()

            mel_specs = np.log(mel_specs)
            for i in range(len(mel_specs)):
                mel_specs[i] = self.normalize(mel_specs[i])
            mel_specs = torch.tensor(mel_specs)
            
            mel_specs = mel_specs[:,:,:cfg.size_x]

            targets = torch.tile(target, dims=(mel_specs.shape[0],1))
            return mel_specs, targets

        elif self.mode == 'test':

            filepath = self.df[idx]
            wave, _  = torchaudio.load(filepath)
            wave = wave[:,:60*4*32000]

            wave_reshaped = wave.reshape(-1, 1, cfg.test_duration*cfg.sr)
            
            mel_spec = test_spec_layer(wave_reshaped)
            mel_spec = np.log(mel_spec)

            mel_spec = np.array(mel_spec)
            for i in range(len(mel_spec)):
                mel_spec[i] = self.normalize(mel_spec[i])
            mel_spec = torch.tensor(mel_spec)

            mel_spec = mel_spec[:,:,:cfg.size_x]
            return mel_spec

        elif self.mode == 'clean':

            filepath = self.df[idx]
            wave, _  = torchaudio.load(filepath)

            wave = wave[:, :6*cfg.test_duration*cfg.sr]

            chunk_length = len(wave[0]) // (cfg.test_duration*cfg.sr)
            
            wave = wave[:,:chunk_length*cfg.test_duration*cfg.sr]

            wave_reshaped = wave.reshape(-1, 1, cfg.test_duration*cfg.sr)
            
            mel_spec = test_spec_layer(wave_reshaped)
            mel_spec = np.log(mel_spec)

            mel_spec = np.array(mel_spec)
            for i in range(len(mel_spec)):
                mel_spec[i] = self.normalize(mel_spec[i])
            mel_spec = torch.tensor(mel_spec)

            return mel_spec, filepath

**Explanation**

1. **`if isTrain:`**: 
    - This conditional block ensures that the following code is executed only when the variable `isTrain` is `True`.

2. **`print("train data")`**:
    - This line simply prints "train data" to indicate that the following visualization is related to training data.

3. **`dataset = BirdCLEF_Dataset(df=train_csv, augmentation=True,  mode="train")`**:
    - This line creates an instance of the `BirdCLEF_Dataset` class for training data. It uses the training DataFrame `train_csv`, enables augmentation, and sets the mode to "train".

4. **`data, target = dataset[270]`**:
    - This line retrieves a specific sample (index 270 in this case) from the dataset. In `train` mode it returns the spectrogram tensor and its label-smoothed target vector.

5. **`fig, ax = plt.subplots(figsize=(6,4))`**:
    - This line creates a figure and axes object for plotting.

6. **`plt.imshow(data[0], cmap="jet", origin="lower")`**:
    - This line plots the spectrogram of the first channel of the data using `imshow` function from matplotlib. It sets the colormap to "jet" and origin to "lower".

7. **`plt.show()`**:
    - This line displays the plotted spectrogram.

8. **`print("validation data")`**:
    - This line prints "validation data" to indicate that the following visualization is related to validation data.

9. **`dataset = BirdCLEF_Dataset(df=train_csv, augmentation=True,  mode="valid")`**:
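    - This line creates an instance of the `BirdCLEF_Dataset` class for validation data. It uses the same training DataFrame `train_csv` and sets the mode to "valid". `augmentation=True` is passed, but the `valid` branch of `__getitem__` never reads that flag, so no augmentation is applied.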

10. **`data, target = dataset[270]`**:
    - Similar to before, this line retrieves a specific sample (index 270) from the dataset. In `valid` mode it returns a stack of spectrograms, one per `cfg.test_duration`-second chunk, and the target repeated once per chunk.

11. **`fig, axes = plt.subplots(figsize=(12,8), nrows=len(data), tight_layout=True)`**:
    - This line creates a figure and axes object for plotting multiple spectrograms. It sets the figure size, number of rows (equal to the length of `data`), and enables tight layout to prevent overlap.

12. **`for idx, ax in enumerate(axes.ravel()):`**:
    - This line iterates over the flattened axes array.

13. **`ax.imshow(data[idx], cmap="jet", origin="lower")`**:
    - Inside the loop, it plots each spectrogram from the `data` array onto the corresponding axis.



In [None]:
if isTrain:
    print("train data")
    dataset = BirdCLEF_Dataset(df=train_csv, augmentation=True,  mode="train")
    data, target = dataset[270]
    fig, ax = plt.subplots(figsize=(6,4))
    plt.imshow(data[0], cmap="jet", origin="lower")
    plt.show()
    
    print("validation data")
    dataset = BirdCLEF_Dataset(df=train_csv, augmentation=True,  mode="valid")
    data, target = dataset[270]
    fig, axes = plt.subplots(figsize=(12,8), nrows=len(data), tight_layout=True)
    for idx, ax in enumerate(axes.ravel()):
        ax.imshow(data[idx], cmap="jet", origin="lower")

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">10 | BirdModel : Flexible Pooling Architecture For Bird Sound Classification </p>

Define a PyTorch neural network model called `BirdModel`, which is tailored for bird sound classification tasks. 

1. **Initialization**:
    - The `BirdModel` constructor accepts `model_name`, `pretrained`, `in_channels`, `num_classes`, and `pool`. As written, `in_channels` and `num_classes` are accepted but not used: the backbone is always built with `in_chans=3`, and the head always outputs `len(LABELS)` logits.
    - `model_name`: Specifies the backbone model architecture.
    - `pretrained`: Boolean flag indicating whether to use pre-trained weights for the backbone.
    - `in_channels`: Number of input channels.
    - `num_classes`: Number of output classes.
    - `pool`: Specifies the pooling strategy ("default", "max", "avg", or "both").

2. **Backbone**:
    - The backbone is selected using `timm.create_model`, allowing the flexibility to choose from various pre-trained models.
    - The `pool` parameter controls the backbone's global pooling. If `pool` is "default", the backbone keeps its own global pooling and returns a pooled feature vector. Otherwise `global_pool=""` disables the backbone's pooling, so it returns the unpooled feature map and the model applies its own pooling layer.

3. **Pooling Layers**:
    - Three types of pooling layers are employed based on the chosen strategy:
        - `max_pooling`: Adaptive max pooling followed by flattening, useful for capturing the most salient features.
        - `avg_pooling`: Adaptive average pooling followed by flattening, useful for capturing overall patterns.
        - `both_pooling_neck`: Batch normalization followed by a linear layer that maps `2*in_features` back to `in_features`. It is meant for concatenated max and average features, but the current `forward` does not use it because the concatenation is commented out.

4. **Head**:
    - The head of the model consists of fully connected layers, responsible for mapping the features extracted by the backbone and pooling layers to the output space.
    - It includes batch normalization, activation functions (e.g., Hardswish), and dropout for regularization, ensuring robustness and preventing overfitting.
    - The final linear layer outputs logits for each class, with the number of units equal to the number of output classes.

5. **Forward Pass**:
    - The `forward` method defines the forward pass of the model.
    - It expands the single-channel spectrogram to 3 channels, applies input normalization, passes the input through the backbone model, applies the specified pooling operation (if applicable), and feeds the features to the head.
    - If pooling is set to "both", it adds the max-pooled and average-pooled features element-wise before passing them to the head.
    - The output of the model is one raw logit per class. The sigmoid (`self.active`) is commented out in `forward`, so a sigmoid must be applied afterwards to obtain probabilities. The loss functions used in this notebook expect logits.


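Because `forward` returns logits, turning them into per-class probabilities is a separate step. The following sketch uses plain Python with made-up logit values to show the conversion; in the notebook this would correspond to something like `torch.sigmoid` applied to the model output.

```python
import math


def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up logits for three classes of one clip.
logits = [-2.0, 0.0, 3.0]
probs = [sigmoid(z) for z in logits]

print([round(p, 4) for p in probs])
```

Each class is scored independently, so the probabilities of one clip do not need to sum to 1. That fits the multi-label setting, where several species can be present in the same recording.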

### Some Related concepts

1. **Normalization (`transforms.Normalize`)**:
   - Normalization is a preprocessing technique used to bring numerical data onto a standard scale. Here it subtracts a per-channel mean of `[0.485, 0.456, 0.406]` and divides by a per-channel standard deviation of `[0.229, 0.224, 0.225]`. These are the standard ImageNet statistics, which is what ImageNet-pretrained backbones expect.
   - `transforms.Normalize` is a transformation provided by PyTorch's `torchvision.transforms` module.

2. **Backbone Model**:
   - The backbone model is the core architecture of the neural network responsible for feature extraction.
   - `timm.create_model` is a function from the `timm` (pytorch-image-models) library used to create various neural network architectures.
   - `model_name` specifies the name of the backbone model architecture, and `pretrained` indicates whether to use pre-trained weights.

3. **Pooling Layers**:
   - Pooling layers reduce the spatial dimensions of feature maps while retaining important information.
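   - `torch.nn.AdaptiveMaxPool2d` and `torch.nn.AdaptiveAvgPool2d` are adaptive pooling layers: you give them the output size you want, and they choose the pooling windows to fit whatever input size arrives. With an output size of 1, each channel's feature map is reduced to a single value.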
   - `torch.nn.Flatten` converts multi-dimensional data into a one-dimensional tensor.
   - `torch.nn.BatchNorm1d` is a batch normalization layer applied to the features before feeding them to the fully connected layers.

4. **Head**:
   - The head of the model typically consists of fully connected layers responsible for classification.
   - `torch.nn.Linear` defines a linear transformation from input to output features.
   - `torch.nn.Hardswish` is an activation function, `x * ReLU6(x + 3) / 6`, which is a cheap piecewise approximation of the Swish function.
   - `torch.nn.Dropout` applies dropout regularization to prevent overfitting by randomly dropping input units during training.

5. **Forward Pass (`forward` method)**:
   - The `forward` method defines how input data flows through the neural network layers, in both training and inference.
   - It applies input normalization, passes data through the backbone model, applies pooling if specified, and feeds the features to the head for classification.
   - The output of the model is the raw logits for each class. Apply a sigmoid to obtain probabilities.

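The pooling options can be illustrated without a real backbone. The sketch below uses NumPy and a random array as a stand-in for the unpooled feature map; the shape `(2, 4, 3, 5)` is an arbitrary toy size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the backbone output: (batch, channels, height, width), toy sizes.
feature_map = rng.normal(size=(2, 4, 3, 5))

# Equivalent in effect to AdaptiveMaxPool2d(1) + Flatten: one value per channel.
max_pooled = feature_map.max(axis=(2, 3))
# Equivalent in effect to AdaptiveAvgPool2d(1) + Flatten.
avg_pooled = feature_map.mean(axis=(2, 3))
# What forward() does for pool="both": element-wise sum, so the width stays at `channels`.
both = max_pooled + avg_pooled

print(max_pooled.shape, avg_pooled.shape, both.shape)
```

Summing keeps the feature width equal to the channel count, which is why the head can be reused unchanged for all pooling modes. Concatenating instead would double the width and require the `both_pooling_neck` projection.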

In [None]:
class BirdModel(torch.nn.Module):
    def __init__(self, model_name, pretrained, in_channels, num_classes, pool="default"):
        super().__init__()

        self.pool = pool
        self.normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        
        if pool == "default":
            self.backbone = timm.create_model(
                model_name=model_name, pretrained=pretrained,
                num_classes=0, in_chans=3)
        else:
            self.backbone = timm.create_model(
                model_name=model_name, pretrained=pretrained,
                num_classes=0, in_chans=3, global_pool="")

        in_features = self.backbone.num_features



        self.max_pooling = torch.nn.Sequential(torch.nn.AdaptiveMaxPool2d(1),
                                               torch.nn.Flatten(start_dim=1, end_dim=-1))
        self.avg_pooling = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1),
                                               torch.nn.Flatten(start_dim=1, end_dim=-1))
        self.both_pooling_neck = torch.nn.Sequential(torch.nn.BatchNorm1d(2*in_features),
                                                     torch.nn.Linear(in_features=2*in_features, out_features=in_features))
        
        self.head = torch.nn.Sequential(
            torch.nn.BatchNorm1d(in_features),
            torch.nn.Linear(in_features=in_features, out_features=256),
            torch.nn.Hardswish(inplace=True),torch.nn.Dropout(0.1),
            torch.nn.Linear(in_features=256, out_features=len(LABELS))  
        )



        self.active = torch.nn.Sigmoid()
    def forward(self, x):
        x = x.expand(-1, 3, -1, -1)
        x = self.normalize(x)
        x = self.backbone(x)

        if self.pool == "max":
            x = self.max_pooling(x)
        elif self.pool == "avg":
            x = self.avg_pooling(x)
        elif self.pool == "both":
            x_max = self.max_pooling(x)
            x_avg = self.avg_pooling(x)
            x = x_max + x_avg
            # x = torch.cat([x_max, x_avg], dim=1)
            # x = self.both_pooling_neck(x)
            
        x = self.head(x)
        # x = self.active(x)
        return x

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">11 | Stratified k-Fold Cross-Validation and Random Seed Setting 🐦</p>


1. **Stratified K-Fold Cross-Validation**:
   - Stratified K-Fold cross-validation is a technique used to evaluate the performance of a machine learning model. It ensures that each fold of the dataset has approximately the same proportion of samples from each class, which is particularly useful for imbalanced datasets.
   - `StratifiedKFold` is a class from the scikit-learn library that splits a dataset into K folds while preserving the percentage of samples for each class.
   - In the `for` loop:
       - `skf.split(train_csv, train_csv['primary_label'])` splits the dataset (`train_csv`) into train and validation sets for each fold, ensuring that each fold maintains the same distribution of classes as the original dataset.
       - `train_index` and `valid_index` contain the indices of samples for the training and validation sets for the current fold.
       - `enumerate(skf.split(...))` iterates over each fold, providing the fold index (`fold`) and the corresponding train/validation indices.
       - `train_csv.loc[valid_index, 'fold'] = int(fold)` assigns the fold index to the validation samples in the `fold` column of the DataFrame `train_csv`, indicating which fold each sample belongs to.

2. **Random Seed Setting**:
   - Setting random seeds ensures reproducibility of results in machine learning experiments. It initializes the random number generators with a fixed seed, so the same sequence of random numbers is generated every time the code is run.
   - `set_random_seed` is a function defined to set the random seed across different random number generators used in the experiment.
   - Inside the function:
       - `random.seed(seed)`, `np.random.seed(seed)`, and `os.environ["PYTHONHASHSEED"] = str(seed)` set the random seed for the Python built-in random number generator, NumPy, and hash randomization, respectively.
       - `torch.manual_seed(seed)` sets the random seed for the PyTorch library for CPU operations.
       - `torch.cuda.manual_seed(seed)` sets the random seed for GPU operations in PyTorch.
       - `torch.backends.cudnn.deterministic = deterministic` ensures deterministic behavior of CuDNN (CUDA Deep Neural Network library) for GPU operations in PyTorch, which can affect the performance but ensures reproducibility.


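To show what "stratified" means in practice, here is a deliberately simplified, pure-Python fold assignment. It is not scikit-learn's `StratifiedKFold` algorithm, only an illustration of the idea: samples of each class are shuffled with a seeded generator and dealt out to folds in turn. The function name and toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def toy_stratified_folds(labels, n_splits, seed):
    """Assign a fold id to every sample so each class is spread evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # seeded, so the split is reproducible
        for position, index in enumerate(indices):
            folds[index] = position % n_splits
    return folds


labels = ["a"] * 6 + ["b"] * 3
folds = toy_stratified_folds(labels, n_splits=3, seed=42)
per_fold = Counter(zip(folds, labels))

print(sorted(per_fold.items()))
```

Each of the three folds receives two samples of class `a` and one of class `b`, mirroring the 2:1 ratio of the full set. Running it again with the same seed gives the same assignment, which is the point of fixing `random_state` and the other seeds.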

In [None]:

skf = StratifiedKFold(n_splits=cfg.nfolds, shuffle=True, random_state=cfg.seed)
for fold, (train_index, valid_index) in enumerate(skf.split(train_csv, train_csv['primary_label'])):
    train_csv.loc[valid_index, 'fold'] = int(fold)
    
    
if isTrain:
    # A bare expression inside an `if` block is not displayed by the notebook, so print it.
    print(train_csv.groupby("fold", as_index=False)["primary_label"].value_counts())
    
    
def set_random_seed(seed: int = 42, deterministic: bool = False):
    """Set seeds"""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)  # type: ignore
    torch.backends.cudnn.deterministic = deterministic  # type: ignore    

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%"> 12 | BCEFocalLoss: Binary Cross-Entropy Focal Loss 🐦</p>


1. **BCEFocalLoss Class**:
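    - This class defines a custom loss function called `BCEFocalLoss`, a focal variant of binary cross-entropy. It is computed independently for every class, so it suits the multi-label setting used here.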
    - It inherits from `nn.Module`, indicating that it's a PyTorch module.

2. **Initialization**:
    - The `__init__` method initializes the loss function with two parameters: `alpha` and `gamma`.
    - `alpha` (default value: 0.25) scales the loss contribution of the positive targets. Unlike the standard focal loss formulation, the negative term is not scaled by `1 - alpha` here.
    - `gamma` (default value: 2.0) controls the degree of focus on hard-to-classify examples.

3. **Forward Method**:
    - The `forward` method computes the loss given model predictions (`preds`) and ground truth labels (`targets`).
    - It first calculates the binary cross-entropy (BCE) loss using `nn.BCEWithLogitsLoss` with the option `reduction='none'` to compute the loss per sample without averaging.
    - `probas = torch.sigmoid(preds)` computes the sigmoid activation of the model predictions to obtain probabilities.

4. **Focal Loss Components**:
    - The loss is split into a positive-target term (`tmp`) and a negative-target term (`smp`).
    - `tmp` covers positive targets. The factor `(1. - probas)**self.gamma` shrinks the loss for positives the model already predicts confidently, so hard (misclassified) positives dominate. This term is also scaled by `alpha`.
    - `smp` covers negative targets. The factor `probas**self.gamma` shrinks the loss for negatives predicted with low probability, so confident false positives dominate.
    - Both `tmp` and `smp` are multiplied by the per-element BCE loss. With soft targets (for example after label smoothing), `targets` and `1 - targets` act as fractional weights on the two terms.

5. **Final Loss Calculation**:
    - The final loss is the element-wise sum of `tmp` and `smp`, averaged over all samples and classes.
    - This mean loss value is returned as the output of the `forward` method.


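The effect of the focal factors is easiest to see on single values. The sketch below recomputes the same formula in plain Python for one logit and one target at a time; the helper names are hypothetical and the logits are chosen so that the sigmoid gives 0.9 or 0.1.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_term(logit: float, target: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Same arithmetic as BCEFocalLoss.forward, for a single element."""
    p = 1.0 / (1.0 + math.exp(-logit))
    bce = bce_with_logits(logit, target)
    positive = target * alpha * (1.0 - p) ** gamma * bce
    negative = (1.0 - target) * p ** gamma * bce
    return positive + negative


confident = math.log(9.0)                 # sigmoid(confident) = 0.9
easy_pos = focal_term(confident, 1.0)     # positive predicted at 0.9
hard_pos = focal_term(-confident, 1.0)    # positive predicted at 0.1
hard_neg = focal_term(confident, 0.0)     # negative predicted at 0.9

print(easy_pos, hard_pos, hard_neg)
```

With `gamma=2`, the easy positive is weighted by `0.1**2 = 0.01` and the hard positive by `0.9**2 = 0.81`, so the hard case contributes well over a thousand times more loss. The hard negative is four times the hard positive because only the positive term is multiplied by `alpha=0.25`.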

In [None]:
class BCEFocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        bce_loss = nn.BCEWithLogitsLoss(reduction='none')(preds, targets)
        probas = torch.sigmoid(preds)

        

        tmp = targets * self.alpha * (1. - probas)**self.gamma * bce_loss
        smp = (1. - targets) * probas**self.gamma * bce_loss
        
        loss = tmp + smp
        loss = loss.mean()
        return loss

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">13 | Initialization Function For Training </p>

1. **Model Initialization**:
   - The function initializes the neural network model using the `BirdModel` class, which is customized for bird sound classification.
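   - It passes the model architecture (`model_name`), whether to use pre-trained weights (`pretrained`), the number of input channels (`in_channels`), the number of output classes (`num_classes`), and the pooling type (`pool`). Note that `BirdModel` as defined above does not read `in_channels` or `num_classes`; it fixes the input to 3 channels and the output to `len(LABELS)`.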

2. **Optimizer Selection**:
   - Depending on the configuration (`cfg.optimizer`), the function selects the optimizer for training.
   - If `cfg.optimizer` is set to `'adan'`, it uses the custom optimizer `Adan` with specific parameters like learning rate (`lr`), betas, and weight decay.
   - Otherwise, it uses the standard AdamW optimizer from PyTorch with parameters such as learning rate (`lr`) and weight decay.

3. **Learning Rate Scheduler**:
   - The function initializes a learning rate scheduler using `torch.optim.lr_scheduler.OneCycleLR`.
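   - This scheduler changes the learning rate after every optimizer step. The peak rate is `max_lr=cfg.lr`, and the schedule spans `cfg.max_epoch * steps_per_epoch` steps, where `steps_per_epoch=len(train_dataloader)` (so `train_dataloader` must already exist when `initialization()` is called). `pct_start` is the fraction of the cycle spent increasing the rate. With `pct_start=0.0` there is effectively no warm-up, so the rate starts near `cfg.lr` and anneals down to `max_lr / (div_factor * final_div_factor)`, which is `cfg.lr / 10` with the values used here.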

4. **Gradient Scaler (Automatic Mixed Precision)**:
   - Automatic Mixed Precision (AMP) is a technique that combines single and half-precision floating-point arithmetic to speed up training while maintaining numerical stability.
   - The function initializes a gradient scaler using `amp.GradScaler` with the option to enable or disable AMP based on the configuration (`cfg.enable_amp`).

5. **Loss Function Initialization**:
   - Depending on the loss type specified in the configuration (`cfg.loss_type`), the function initializes the loss function.
   - If `cfg.loss_type` is set to `"BCEWithLogitsLoss"`, it uses the binary cross-entropy loss with logits (`torch.nn.BCEWithLogitsLoss`).
   - If `cfg.loss_type` is set to `"BCEFocalLoss"`, it uses the custom focal loss function `BCEFocalLoss` with a specified `alpha` value.

6. **Returning Initialized Components**:
   - The function returns the initialized model, optimizer, scheduler, scaler, and loss function. The model and the loss function are moved to the target device (`device`, typically a GPU) before being returned.
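
To make the scheduler settings concrete, here is a minimal pure-Python sketch of the learning-rate bounds and the cosine annealing curve that `OneCycleLR` is documented to use. It does not call PyTorch. The helper names (`cosine_anneal`, `one_cycle_bounds`) and the example `max_lr` value are assumptions for illustration; the notebook's real peak rate comes from `cfg.lr`.

```python
import math

def cosine_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

def one_cycle_bounds(max_lr, div_factor, final_div_factor):
    """Return (initial_lr, min_lr) following the OneCycleLR definitions."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    return initial_lr, min_lr

max_lr = 1e-3  # hypothetical value standing in for cfg.lr
initial_lr, min_lr = one_cycle_bounds(max_lr, div_factor=25, final_div_factor=0.4)

# With pct_start=0.0 the decay phase covers the whole run,
# so the curve goes from max_lr down to min_lr.
n_points = 10
schedule = [cosine_anneal(max_lr, min_lr, i / (n_points - 1)) for i in range(n_points)]
```

Because `final_div_factor` is below 1, the final rate (`max_lr / 10`) ends up above the nominal initial rate (`max_lr / 25`).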



In [None]:
def initialization():
    model = BirdModel(model_name=cfg.model_name, pretrained=True, in_channels=3, num_classes=len(LABELS), pool=cfg.pool_type)
    
    if cfg.optimizer=='adan':
        optimizer = Adan(model.parameters(), lr=cfg.lr, betas=(0.02, 0.08, 0.01), weight_decay=cfg.weight_decay)
    else:
        optimizer = torch.optim.AdamW(params=model.parameters(), lr=cfg.lr, weight_decay=cfg.weight_decay)
    
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer=optimizer, epochs=cfg.max_epoch,
        pct_start=0.0, steps_per_epoch=len(train_dataloader),
        max_lr=cfg.lr, div_factor=25, final_div_factor=4.0e-01
    )
    
    scaler = amp.GradScaler(enabled=cfg.enable_amp)
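    if cfg.loss_type == "BCEWithLogitsLoss":
        loss_func = torch.nn.BCEWithLogitsLoss()
    elif cfg.loss_type == "BCEFocalLoss":
        loss_func = BCEFocalLoss(alpha=1)
    else:
        # Fail early instead of hitting a NameError on the return line
        raise ValueError(f"Unsupported loss_type: {cfg.loss_type!r}")

    return model.to(device), optimizer, scheduler, scaler, loss_func.to(device)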

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">14 | Training And Evaluation Functions 🐦</p>

1. **`train_one_loop` Function**:
   - This function performs one epoch of training.
   - It iterates through the training data (`dataloader`) and updates the model parameters based on the calculated loss.
   - Within each iteration:
     - The data and labels are moved to the appropriate device (`device`).
     - The gradients are zeroed using `optimizer.zero_grad()` to clear the previous gradients.
     - The forward pass and the loss computation (`loss_fn`) run under `amp.autocast` with `bfloat16` when `cfg.enable_amp` is set, which reduces memory use and can speed up training.
     - The loss is scaled by the gradient scaler and backpropagated (`scaler.scale(loss).backward()`); `scaler.step(optimizer)` then updates the parameters and `scaler.update()` adjusts the scale factor.
     - The learning rate is adjusted once per batch using the scheduler (`scheduler.step()`).
     - The loss value is accumulated for monitoring training progress.
   - After processing all batches, the average training loss is calculated and logged (if using Weights & Biases for logging).

2. **`mixup_one_loop` Function**:
   - This function performs one epoch of training with mixup augmentation.
   - Mixup is a data augmentation technique that blends pairs of examples and their corresponding labels.
   - It follows a similar structure to `train_one_loop`, but every batch is augmented before it is fed to the model. A random draw compared against `cfg.aug_spec_mixup_prob` decides which augmentation is used.
   - If the draw exceeds `cfg.aug_spec_mixup_prob`, the batch goes through `mixup` with `mode="other_wave"`; otherwise it goes through `spec_mixup`.
   - The rest of the process, including loss computation and optimization, remains the same.

3. **`evaluate_validation` Function**:
   - This function evaluates the model on the validation dataset (`dataloader`) after each epoch of training.
   - Inside the evaluation loop, the model predicts each validation item, and the raw outputs and ground-truth labels are collected on the CPU.
   - After the loop, the collected outputs are stacked, the validation loss is computed on the raw logits, and a sigmoid converts them to probabilities for the metrics.
   - The metrics are macro AUC, macro precision, micro and macro F1, and micro F1 at probability thresholds of 0.30 and 0.50, computed with the `multiclass_*` metric functions and with scikit-learn's `f1_score`.
   - The function returns the validation loss, the AUC, and the F1 variants; precision and macro F1 are only logged to Weights & Biases.

These functions collectively handle the training and evaluation process of the bird sound classification model, including data processing, model training, and performance evaluation. Additionally, they provide flexibility in choosing different training strategies such as mixup augmentation and support for monitoring training progress using Weights & Biases.
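
The blending step behind mixup can be sketched in a few lines of plain Python. This is a generic illustration on lists, not the notebook's own `mixup` or `spec_mixup` helpers, which are defined elsewhere; the function name `mixup_pair` and the toy inputs are assumptions.

```python
import random

def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.5, rng=random):
    """Blend two examples and their label vectors with one Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam

rng = random.Random(0)  # fixed seed so the toy example is repeatable
x_mix, y_mix, lam = mixup_pair(
    x_a=[1.0, 1.0, 1.0], y_a=[1.0, 0.0],
    x_b=[0.0, 0.0, 0.0], y_b=[0.0, 1.0],
    alpha=0.5, rng=rng,
)
```

The labels are blended with the same weight as the inputs, so the mixed target is a soft label rather than a one-hot vector. This is one reason a loss such as `BCEWithLogitsLoss`, which accepts fractional targets, fits this setup.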

In [None]:
def train_one_loop(model, optimizer, scaler, scheduler, dataloader, loss_fn):
    trainloss = 0; model.train()

    count = 0
    for idx, (data, label) in enumerate(tqdm(dataloader,leave=False ,desc="[train]")):
        # label = label.reshape(-1, len(LABELS))
        
        data, label = data.to(device), label.to(device)
        
        optimizer.zero_grad()
        with amp.autocast(cfg.enable_amp, dtype=torch.bfloat16):
        # with amp.autocast(cfg.enable_amp):
            pred = model.forward(data)
            loss = loss_fn(pred, label)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        scheduler.step()
        
        trainloss += loss.item()
        # print(idx, loss.item())
        # if cfg.wandb == True:
        #     wandb.log({f"train_loss": loss.item(), f"lr":scheduler.get_lr()[0]})
        del data, label, loss
        count += 1
        # if count == 300:
        # break
    trainloss /= len(dataloader)
    if cfg.wandb == True:
        # get_last_lr() is the supported way to read the most recent learning rate
        wandb.log({"train_loss": trainloss, "lr": scheduler.get_last_lr()[0]})
    return model, optimizer, scaler, scheduler, trainloss


def mixup_one_loop(model, optimizer, scaler, scheduler, dataloader, loss_fn):
    trainloss = 0; model.train()

    count = 0
    for idx, (data, label) in enumerate(tqdm(dataloader,leave=False ,desc="[train]")):
        if np.random.random()>cfg.aug_spec_mixup_prob:
            data, label = mixup(data=data, targets=label, alpha=cfg.alpha, mode="other_wave")
        else:
            data, label = spec_mixup(data=data, targets=label)
        data, label = data.to(device), label.to(device)
        
        optimizer.zero_grad()
        with amp.autocast(cfg.enable_amp, dtype=torch.bfloat16):
        # with amp.autocast(cfg.enable_amp):
            pred = model.forward(data)
            loss = loss_fn(pred, label)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        scheduler.step()
        
        trainloss += loss.item()
        # print(idx, loss.item())
        # if cfg.wandb == True:
        #     wandb.log({f"lr":scheduler.get_lr()[0]})
        del data, label, loss
        count += 1
        # if count == 300:
        # break
    trainloss /= len(dataloader)
    if cfg.wandb == True:
        # get_last_lr() is the supported way to read the most recent learning rate
        wandb.log({"train_loss": trainloss, "lr": scheduler.get_last_lr()[0]})
    return model, optimizer, scaler, scheduler, trainloss


@torch.no_grad()  # gradients are not needed during validation
def evaluate_validation(model, dataloader, loss_fn):
    validloss=0
    model.eval()

    preds, trues, targets = [], [], []
    
    for idx, (data, label) in enumerate(tqdm(dataloader,leave=False ,desc="[valid]")):
        # label = label.reshape(-1, len(LABELS))

        d = data[0].unsqueeze(1)
        label = label[0]
        
        d = d.to(device)
        # with amp.autocast(cfg.enable_amp):
        pred = model.forward(d)

        preds.extend(pred.detach().cpu())
        trues.extend(label)
        targets.extend(label.argmax(axis=1))
        
    #======================== metrics ========================#
    # y_preds = torch.stack(preds)
    t = torch.stack(preds)
    t = torch.sigmoid(t)
    targets = torch.tensor(targets)
    y_trues = torch.stack(trues)


    validloss = loss_fn(torch.stack(preds), torch.stack(trues))
    #     # print(idx, loss)
    #     # wandb.log({"valid_loss": loss})

    # validloss /= len(dataloader)
    
    # sk_f1 = metrics.f1_score(np.array(y_trues), np.array(t), average="micro")
    sk_f1_30 = metrics.f1_score(np.array(y_trues), np.array(t) > 0.30, average="micro")
    sk_f1_50 = metrics.f1_score(np.array(y_trues), np.array(t) > 0.50, average="micro")
    
    auc = multiclass_auroc(input=t, target=targets, num_classes=len(LABELS),
                           average="macro").item()

    # auc_micro = multiclass_auroc(input=t, target=targets, num_classes=len(LABELS),
    #                        average="none").item()

    prec = multiclass_precision(input=t, target=targets, num_classes=len(LABELS),
                           average="macro").item()
    # rec = multiclass_recall(input=t, target=targets, num_classes=len(LABELS),
    #                        average="macro").item()
    
    # acc = multilabel_accuracy(input=t, target=targets).item()

    f1 = multiclass_f1_score(input=t, target=targets, num_classes=len(LABELS),
                             average="micro").item()

    f1_macro = multiclass_f1_score(input=t, target=targets, num_classes=len(LABELS),
                             average="macro").item()

    # `targets` is already a tensor, and `.long()` yields int64 directly
    t_03 = (t > 0.3).long()
    f1_03 = multiclass_f1_score(input=t_03, target=targets, num_classes=len(LABELS),
                                average="micro").item()

    t_05 = (t > 0.5).long()
    f1_05 = multiclass_f1_score(input=t_05, target=targets, num_classes=len(LABELS),
                                average="micro").item()

    if cfg.wandb == True:
        wandb.log({f"valid_loss": validloss,
                   f"AUC":auc,
                   # "auc_micro":auc_micro,
                   "precision":prec, 
                   # "recall":rec, 
                   # "accuracy":acc,
                   f"F1":f1,
                   "F1_macro":f1_macro,
                   f"F1 30%":f1_03,
                   f"F1 50%":f1_05})
    return validloss, auc, f1, f1_03, f1_05, sk_f1_30, sk_f1_50
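
The thresholded micro-F1 values (`sk_f1_30`, `sk_f1_50`) can be illustrated with a small pure-Python sketch. It follows the usual micro-averaged definition over all (sample, class) pairs and is meant as a reading aid, not a replacement for scikit-learn's `f1_score`; the helper `micro_f1` and the toy arrays are assumptions.

```python
def micro_f1(y_true, y_prob, threshold):
    """Micro-averaged F1 over all (sample, class) pairs after thresholding."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            pred = p > threshold
            if pred and t == 1:
                tp += 1          # predicted present, truly present
            elif pred and t == 0:
                fp += 1          # predicted present, truly absent
            elif not pred and t == 1:
                fn += 1          # missed a present class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy one-hot labels and probabilities for 3 samples and 3 classes
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_prob = [[0.9, 0.4, 0.1], [0.2, 0.45, 0.1], [0.1, 0.2, 0.8]]

f1_30 = micro_f1(y_true, y_prob, 0.30)
f1_50 = micro_f1(y_true, y_prob, 0.50)
```

In this toy case the lower threshold adds one false positive but recovers a true positive that the 0.50 threshold misses, which is the trade-off the two reported values are meant to expose.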

**Explanation:**



1. **Creating Temporary Parameters Dictionary (`tmp_params`):**
   - This code initializes a dictionary named `tmp_params` by copying the content of the `config` object's variables.
   - It then removes certain reserved attributes (`__module__`, `__dict__`, `__weakref__`, `__doc__`) from the dictionary.
   - This operation essentially extracts all the parameters from the `config` object and stores them in `tmp_params` for further processing or logging.

2. **`get_oversampled_df` Function:**
   - This function oversamples the dataframe `df` to reduce class imbalance, where some classes have far fewer samples than others.
   - It starts a list `new_df` that already contains the original dataframe, so every original row is kept.
   - It identifies the classes (bird species in `primary_label`) whose sample count is below the threshold `cfg.oversample_threthold` and stores them in `low_sample_birds`.
   - For each such class, it concatenates `1 + cfg.oversample_threthold // data_num` copies of that class's rows into `tiled_df`, where `data_num` is the class's original row count.
   - It then takes the slice `tiled_df[data_num:cfg.oversample_threthold]` as `piece`. Skipping the first `data_num` rows avoids duplicating the originals already in `new_df`, and stopping at the threshold brings the class to exactly `cfg.oversample_threthold` rows.
   - Finally, it concatenates the original dataframe and all the `piece` dataframes into a single dataframe and returns it.

Oversampling in this way gives rare classes more weight during training, which can help the model on species with few recordings.

In [None]:
if isTrain == True:
    tmp_params = dict(vars(config))
    del tmp_params['__module__'],tmp_params['__dict__'],tmp_params['__weakref__'],tmp_params['__doc__']

def get_oversampled_df(df):
    
    new_df = [df]

    low_sample_birds = df["primary_label"].value_counts()[df["primary_label"].value_counts() < cfg.oversample_threthold].index
    for bird in low_sample_birds:
        tmp = df[df["primary_label"] == bird]
        data_num = len(tmp)
    
        tiles = 1 + cfg.oversample_threthold // data_num
    
        tile_df = []
        for i in range(tiles):
            tile_df.append(tmp)
    
        tiled_df = pd.concat(tile_df)
        piece = tiled_df[data_num:cfg.oversample_threthold]
        new_df.append(piece)
    
    return pd.concat(new_df)
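
The tiling arithmetic in `get_oversampled_df` can be checked on a toy example. The sketch below mirrors the same slicing logic with plain Python lists instead of dataframes; `oversample_rows`, the row names, and the threshold of 5 are assumptions for illustration.

```python
def oversample_rows(rows, threshold):
    """Return the extra rows needed to bring a class of len(rows) up to `threshold`."""
    data_num = len(rows)
    tiles = 1 + threshold // data_num   # enough whole copies to reach the threshold
    tiled = rows * tiles                # list repetition stands in for pd.concat
    return tiled[data_num:threshold]    # skip the originals, stop at the threshold

rare = ["rec_a", "rec_b"]               # hypothetical class with 2 recordings
extra = oversample_rows(rare, threshold=5)
total = len(rare) + len(extra)          # originals plus the appended piece
```

With 2 recordings and a threshold of 5, three tiles give 6 rows, the slice `[2:5]` keeps 3 of them, and the class ends up with exactly 5 rows.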

**Explanation:**

1. **Condition Check (`isTrain == True`):**
   - This line checks if the `isTrain` variable is `True`, indicating that the code is running in training mode.

2. **Setting Random Seed:**
   - The `set_random_seed` function is called to set a random seed for reproducibility. It ensures that random operations in the code generate the same results across different runs when the same seed is used.

3. **Initializing Weights and Biases (W&B) Logging:**
   - If W&B logging is enabled (`cfg.wandb == True`), the `wandb.init` function is called to initialize a W&B run for logging the training process. The project name and run name are specified, along with the configuration parameters (`tmp_params`).
   
4. **Fold-wise Training:**
   - Training is run for each fold listed in `cfg.inference_folds` (`for fold in cfg.inference_folds:`).
   - For each fold:
     - Rows whose `fold` value differs from the current fold form the training split; rows that match form the validation split.
     - If oversampling is enabled (`cfg.oversample == True`), the training split is oversampled using the `get_oversampled_df` function to address class imbalance.
     - Two training loaders are built from the same split with the `BirdCLEF_Dataset` class, one with augmentation (`augme_dataloader`) and one without (`train_dataloader`), plus a non-augmented validation loader.
     - The model, optimizer, scheduler, scaler, and loss function are initialized using the `initialization` function.
     - The training loop runs for `cfg.max_epoch` epochs. During the first `cfg.aug_epoch` epochs it uses the augmented loader, and each epoch is randomly chosen to be a mixup epoch (`mixup_one_loop`, with probability `cfg.aug_spec_mixup`) or a standard epoch (`train_one_loop`). Later epochs use standard training on the non-augmented loader.
     - After each epoch, the model is evaluated on the validation dataset. If the validation loss improves, the best loss, AUC, and F1 are updated and the checkpoint path for that epoch is recorded.
     - Training progress and performance metrics are logged using print statements.
     - At the end of training for each fold, the weights are written to the recorded checkpoint path, and memory is cleaned up.

5. **W&B Logging of Best Metrics:**
   - If W&B logging is enabled, the best validation loss, F1 score, and AUC score are logged using the `wandb.log` function.

6. **Memory Cleanup:**
   - After training each fold, memory is cleaned up to release GPU memory using `del`, `gc.collect()`, and `torch.cuda.empty_cache()`.




In [None]:
if isTrain == True:
    set_random_seed(seed=42)
    
    
    if cfg.wandb == True:
        wandb.init(project='BirdCLEF_cv_ver2', name=f"{name}",
                   config=tmp_params)
        
    # for fold in range(cfg.nfolds):
    for fold in cfg.inference_folds:
        train_ = train_csv.loc[train_csv["fold"]!=fold]

        if cfg.oversample == True:
            train = get_oversampled_df(df=train_)
        else:
            train = train_
        
        augme_dataset = BirdCLEF_Dataset(df=train, augmentation=True, mode='train')
        augme_dataloader = torch.utils.data.DataLoader(dataset=augme_dataset, batch_size=cfg.train_batchsize, shuffle=True)

        train_dataset = BirdCLEF_Dataset(df=train, augmentation=False, mode='train')
        train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=cfg.train_batchsize, shuffle=True)
        
        valid = train_csv.loc[train_csv["fold"]==fold]
        valid_dataset = BirdCLEF_Dataset(df=valid, augmentation=False, mode='valid')
        valid_dataloader = torch.utils.data.DataLoader(dataset=valid_dataset, batch_size=cfg.valid_batchsize, shuffle=False)
    
        model, optimizer, scheduler, scaler, loss_func =  initialization()
    
    
        best_f1 = 0
        best_auc = 0
        best_loss = 1.00000
        for e in range(cfg.max_epoch):
            start_time = time.time()
            if e < cfg.aug_epoch:
                if cfg.aug_spec_mixup > np.random.random():
                    model, optimizer, scaler, scheduler, train_loss = mixup_one_loop(model=model,optimizer=optimizer,scaler=scaler, 
                                                                                          scheduler=scheduler,dataloader=augme_dataloader, loss_fn=loss_func)
                else:
                    model, optimizer, scaler, scheduler, train_loss = train_one_loop(model=model,optimizer=optimizer,scaler=scaler, 
                                                                                          scheduler=scheduler,dataloader=augme_dataloader, loss_fn=loss_func)

            else:
                model, optimizer, scaler, scheduler, train_loss = train_one_loop(model=model,optimizer=optimizer,scaler=scaler, 
                                                                                          scheduler=scheduler,dataloader=train_dataloader, loss_fn=loss_func)
            
            valid_loss, auc, f1, f1_03, f1_05, sk_f1_30, sk_f1_50 = evaluate_validation(model=model, dataloader=valid_dataloader, loss_fn=loss_func)
            # print(f"epoch {e} , train_loss is {train_loss}, valid_loss is {valid_loss}")
            
            if best_loss > valid_loss:
                end_time = time.time()
                print(f"[epoch {str(e).zfill(2)}] AUC{auc: .4f}, F1{f1: .4f}, F1_03{f1_03: .4f}, F1_05{f1_05: .4f}")
                print(f"[epoch {str(e).zfill(2)}] SKF1_03{sk_f1_30: .4f}, SKF1_05{sk_f1_50: .4f}")
                print(f"[epoch {str(e).zfill(2)}] valid_loss {valid_loss: .6f}")
                print(f"[epoch {str(e).zfill(2)}] update loss {best_loss: .6f} --> {valid_loss: .6f} {(end_time - start_time): .1f}[s]")
                print(f"[epoch {str(e).zfill(2)}] update auc score {best_auc: .6f} --> {auc: .6f} {(end_time - start_time): .1f}[s]")
                model_name = f'{name}/checkpoint/fold_{fold}_snapshot_epoch_{str(e).zfill(2)}.pth'
                # Snapshot the weights now. `model` keeps training in later epochs,
                # so a plain reference would end up holding the final-epoch weights.
                best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
                best_loss = valid_loss
                best_auc = auc
                best_f1 = f1
            else:
                end_time = time.time()
                print(f"[epoch {str(e).zfill(2)}] NOT update loss {best_loss: .6f} <-- {valid_loss: .6f} {(end_time - start_time): .1f}[s]")
                print(f"[epoch {str(e).zfill(2)}] NOT update score {best_auc: .6f} <-- {auc: .6f} {(end_time - start_time): .1f}[s]")

        if cfg.wandb == True:
            wandb.log({f"best_loss": best_loss,
                       f"best_f1": best_f1,
                       f"best_auc":best_auc})

        torch.save(best_state, model_name)
        
        del model, best_state
        gc.collect()
        torch.cuda.empty_cache()
        print("--")
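
One point worth keeping in mind when tracking the best epoch: a model is updated in place during training, so holding a plain reference to it does not freeze its weights. The toy sketch below uses a dictionary in place of a `state_dict` to show the difference between a reference and a copied snapshot; `track_best` and the example values are assumptions for illustration.

```python
import copy

def track_best(epoch_losses, weights_per_epoch):
    """Return (best_epoch, best_snapshot, final_weights) for a simulated run."""
    best_loss = float("inf")
    best_epoch, best_snapshot = None, None
    live_weights = {}  # stands in for the model, mutated in place every epoch
    for epoch, (loss, new_weights) in enumerate(zip(epoch_losses, weights_per_epoch)):
        live_weights.update(new_weights)
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            # Deep copy: later epochs cannot change this snapshot
            best_snapshot = copy.deepcopy(live_weights)
    return best_epoch, best_snapshot, live_weights

best_epoch, best_snapshot, final_weights = track_best(
    epoch_losses=[0.9, 0.4, 0.6],
    weights_per_epoch=[{"w": 1.0}, {"w": 2.0}, {"w": 3.0}],
)
```

Here the lowest loss occurs at epoch 1, and the snapshot keeps that epoch's weights even though the live object has moved on to the epoch 2 values.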


# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">15 | Model Loading For Inference 🐦</p>

1. **Initializing Dictionaries:**
   - Two dictionaries, `models` and `models_names`, are initialized to store loaded models and their corresponding names, respectively.

2. **Iterating Over Folds:**
   - The code iterates over each fold for which inference is required (`for fold in cfg.inference_folds:`).

3. **Loading Trained Model:**
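   - The checkpoint paths for the current fold are collected with `glob.glob` and sorted, and the last one is used. Because the epoch number in the file name is zero-padded, this is the checkpoint from the highest-numbered saved epoch.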
   - The `BirdModel` class is instantiated to create a new model instance with the same architecture as the trained model.
   - The model's state dictionary is loaded from the saved checkpoint file using `torch.load`.
   - The loaded model is set to evaluation mode using `model.eval()`.

4. **Storing Loaded Models and Names:**
   - The loaded model is stored in the `models` dictionary with the fold index as the key.
   - The name of the ONNX file for the model is generated from the checkpoint file path and stored in the `models_names` dictionary.

5. **Printing Model Path and ONNX Name:**
   - The path of the loaded model checkpoint file and the corresponding ONNX file name are printed for verification.



In [None]:
models = dict()
models_names = dict()
# for fold in range(cfg.nfolds):
for fold in cfg.inference_folds:
    bestmodel_path = sorted(glob.glob(f"/kaggle/input/{name}/checkpoint/fold_{fold}*.pth"))[-1]

    print(bestmodel_path)
    model = BirdModel(model_name=cfg.model_name, pretrained=False, in_channels=1, num_classes=len(LABELS))
    model.load_state_dict(torch.load(bestmodel_path, map_location=torch.device('cpu')))
    model = model.eval()
    models[fold] = model

    # splitext only strips the final extension, so dots elsewhere in the path are safe
    models_names[fold] = os.path.splitext(bestmodel_path)[0]+".onnx"
    print(models_names[fold])
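
The checkpoint selection and ONNX naming can be tried without any model files. The sketch below uses made-up file names that follow the notebook's `fold_{fold}_snapshot_epoch_{NN}.pth` pattern; the helper names `latest_checkpoint` and `onnx_name` are assumptions.

```python
import os

def latest_checkpoint(paths):
    """Pick the lexicographically last path, as `sorted(...)[-1]` does."""
    return sorted(paths)[-1]

def onnx_name(checkpoint_path):
    """Swap the final extension for `.onnx`, leaving other dots in the path alone."""
    return os.path.splitext(checkpoint_path)[0] + ".onnx"

# Hypothetical file names following the notebook's naming pattern
paths = [
    "checkpoint/fold_0_snapshot_epoch_10.pth",
    "checkpoint/fold_0_snapshot_epoch_02.pth",
    "checkpoint/fold_0_snapshot_epoch_07.pth",
]
best = latest_checkpoint(paths)
onnx_path = onnx_name(best)
```

The zero padding from `str(e).zfill(2)` matters here: without it, `epoch_7` would sort after `epoch_10` and the wrong file would be picked.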

**Explanation:**


1. **Test Audio Directory and File List:**
   - The directory path where the test audio files are stored is defined using `cfg.dir`.
   - The list of test audio file paths is obtained using `glob.glob` by searching for files with the `.ogg` extension in the test audio directory.
   - The list of file paths is sorted alphabetically.

2. **Test Dataset and DataLoader Setup:**
   - An instance of the `BirdCLEF_Dataset` class is created for the test dataset using the list of file paths obtained in the previous step.
   - A `DataLoader` object is initialized for the test dataset with a batch size of 1 and shuffle set to False. This loader will be used to iterate over the test data during inference.

3. **Input and Output Tensor Configuration:**
   - An example input tensor is initialized with random values to specify the input shape expected by the ONNX models. The shape is defined as `(48, 1, cfg.n_mels, cfg.size_x+1)`.
   - The input and output names used when running the ONNX sessions are set to `x` and `output`.

4. **Model File Paths Retrieval:**
   - The file paths of the ONNX models for each fold are retrieved using `glob.glob`.
   - The paths are sorted to ensure consistency in model loading.

5. **ONNX Session Initialization:**
   - For each fold, the corresponding ONNX model is loaded using `onnx.load`, and the model's graph is accessed.
   - An ONNX inference session is created using `ort.InferenceSession` by passing the serialized model obtained from `onnx_model.SerializeToString()`.
   - The initialized ONNX sessions are stored in the `onnx_sessions` dictionary.

**Concepts Used:**
- **File Path Manipulation:** File paths are obtained using `glob.glob` to locate test audio files and ONNX model files.
- **Dataset and DataLoader:** The test dataset is prepared using `BirdCLEF_Dataset`, and a `DataLoader` is initialized for iterating over the test data.
- **Input and Output Configuration:** An example input tensor and the input and output names (`x`, `output`) are defined for use with the ONNX models.
- **ONNX Model Loading:** ONNX models are loaded using `onnx.load`, and inference sessions are created using `ort.InferenceSession`.
- **Data Storage:** The file paths of ONNX models and the initialized ONNX sessions are stored in dictionaries for easy access during inference.



In [None]:

test_audio_dir = f"{cfg.dir}test_soundscapes/"
file_list = glob.glob(test_audio_dir+"*.ogg")
file_list = sorted(file_list)


test_dataset = BirdCLEF_Dataset(df=file_list, mode="test")
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                              batch_size=1, 
                                              shuffle=False)

input_tensor = torch.randn((48, 1, cfg.n_mels, cfg.size_x+1))  # input shape
output_names=['output']
input_names=["x"]


# models_names = []
models_names = dict()
# for fold in range(cfg.nfolds):
for fold in cfg.inference_folds:
    onnxmodel_path = sorted(glob.glob(f"/kaggle/input/{name}/checkpoint/fold_{fold}*.onnx"))[-1]

    print(onnxmodel_path)

    models_names[fold] = onnxmodel_path
    
    
onnx_sessions = dict()
# for fold in range(cfg.nfolds):
for fold in cfg.inference_folds:

    onnx_model = onnx.load(models_names[fold])
    onnx_model_graph = onnx_model.graph
    onnx_session = ort.InferenceSession(onnx_model.SerializeToString())

    onnx_sessions[fold] = onnx_session    

**Explanation:**


1. **Start Time Measurement:**
   - The current time is recorded using `time.time()` to measure the duration of the inference process.

2. **Inference Loop:**
   - The code iterates over the test data using the `test_dataloader`.
   - For each batch of data:
     - Predictions are collected for each fold using the ONNX models loaded in the `onnx_sessions`.
     - The predictions are obtained by running the ONNX session (`session.run`) on `data[0].numpy()`. Since the loader's batch size is 1, `data[0]` is the input tensor for a single test file, presumably its stacked 5-second segments (matching the `(48, 1, ...)` example input shape), converted to a NumPy array.
     - The raw outputs are converted to a PyTorch tensor, passed through a sigmoid, and appended to the `preds` list.
   - The per-fold probabilities are stacked and averaged along the fold axis (`axis=0`), and the result is stored in `preds_per_batch`.
   - These batch-wise predictions are collected in the `predictions` list.

3. **Prediction Processing:**
   - Once all predictions are collected, they are stacked along the batch dimension to form a single tensor using `torch.stack`. If `predictions` is empty, it remains as is.
   
4. **End Time Measurement:**
   - The current time is recorded again using `time.time()` after the inference loop completes.

5. **Time Calculation:**
   - The total time taken for the inference process is calculated by subtracting the start time from the end time.

**Concepts Used:**
- **Inference Loop:** Iterating over the test data using a `DataLoader` and performing inference for each batch.
- **ONNX Model Inference:** Utilizing the loaded ONNX models to make predictions on the test data.
- **Batch Processing:** Predictions are produced one test file at a time, so only one file's inputs are held in memory at each step.
- **Time Measurement:** Recording start and end times to calculate the duration of the inference process.



In [None]:
start_time = time.time()

predictions = []
for data in tqdm(test_dataloader):
    
    preds = []
    
#     for fold, session in enumerate(onnx_sessions):
    for fold in cfg.inference_folds:
        session = onnx_sessions[fold]
        pred = session.run(output_names, {input_names[0]: data[0].numpy()})[0]
        
        pred = torch.sigmoid(torch.tensor(pred))
        preds.append(pred)
    preds_per_batch = torch.stack(preds, axis=0).mean(axis=0)
    
    predictions.extend(preds_per_batch)
    
# Stack only when there is something to stack; an empty list is left as is
if len(predictions) > 0:
    predictions = torch.stack(predictions)
end_time = time.time()
use_time = end_time - start_time
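
The fold-averaging step can be written out in plain Python to make the order of operations explicit: the sigmoid is applied to each fold's logits first, and the resulting probabilities are then averaged. The helper `average_folds` and the toy logits are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def average_folds(fold_logits):
    """fold_logits[fold][segment][class] -> mean probability per (segment, class)."""
    n_folds = len(fold_logits)
    n_segments = len(fold_logits[0])
    n_classes = len(fold_logits[0][0])
    return [
        [
            sum(sigmoid(fold_logits[f][s][c]) for f in range(n_folds)) / n_folds
            for c in range(n_classes)
        ]
        for s in range(n_segments)
    ]

# Two folds, one segment, two classes
fold_logits = [
    [[0.0, 4.0]],
    [[0.0, 0.0]],
]
avg = average_folds(fold_logits)
```

Averaging probabilities is not the same as applying a sigmoid to the averaged logits. For the second class above, the mean probability is lower than `sigmoid(2.0)`, the value that averaging the logits first would give.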

# <p style="font-family: 'Amiri'; font-size: 3rem; color: Black; text-align: center; margin: 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); background-color: #c9b68b; padding: 20px; border-radius: 20px; border: 7px solid Black; width:95%">16 | Final Submission 📝</p>

Creates a submission file (`submission.csv`) containing the predictions for the test data in the format required for submission. 

1. **Column Selection:**
   - The column names representing bird species from the `sample_submission` DataFrame are extracted using `sample_submission.columns[1:]`. These columns represent the target labels.

2. **DataFrame Initialization:**
   - A new DataFrame `df` is initialized with columns named `'row_id'` and the bird species names.

3. **Row ID Generation:**
   - For each audio file in the `file_list` (representing test audio files), a unique `'row_id'` is generated for each 5-second segment of the audio. The row IDs have the format `<audio_file_name>_<end_time>`, where the end time is the segment's end in seconds (5, 10, ..., 240). The code assumes every test file is 4 minutes long, which gives 48 segments per file.

4. **DataFrame Population:**
   - The generated row IDs are assigned to the `'row_id'` column of the DataFrame.
   - If there are predictions available (`len(predictions) >= 1`), the predictions are assigned to the corresponding bird species columns in the DataFrame. Otherwise, this step is skipped.
     - This step involves populating the DataFrame with the predictions obtained from the previous inference step.

5. **Saving to CSV:**
   - The DataFrame `df` is saved to a CSV file named `"submission.csv"` using `to_csv()`, with the `index` set to `False` to exclude row indices from the output file.


In [None]:
bird_cols = sample_submission.columns[1:]
df = pd.DataFrame(columns=['row_id']+list(bird_cols))


row_list = []
for file in file_list:
    dataname = file.split("/")[-1][:-4]
    for i in range(int(4*60/5)):
        row = f"{dataname}_{(i+1)*5}"
        row_list.append(row)
        
        
        
df['row_id'] = row_list        

# Fill the species columns only when predictions are available
if len(predictions) > 0:
    df[bird_cols] = predictions
    
    
df.to_csv("submission.csv", index=False)     
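
The row ID scheme can be checked on a single file name. The sketch below reproduces the same `<file stem>_<end second>` pattern with the standard library only; the function `make_row_ids` and the example path are assumptions, and the 240-second clip length mirrors the `4*60/5` segment count used in the cell.

```python
import os

def make_row_ids(file_path, clip_seconds=240, segment_seconds=5):
    """Build `<file stem>_<segment end time in seconds>` IDs for one audio file."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    n_segments = clip_seconds // segment_seconds
    return [f"{stem}_{(i + 1) * segment_seconds}" for i in range(n_segments)]

row_ids = make_row_ids("test_soundscapes/soundscape_123.ogg")  # hypothetical file
```

Each file contributes 48 rows, so the number of rows in the submission should equal 48 times the number of test files, which must also match the number of prediction rows.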