<center> <h1> Freesound Audio Tagging </h1> </center>

# 4. Preprocessing and Featurizations

The following featurizations are possible:
1. Removing leading and trailing silences (portions more than 60 dB quieter than the loudest part of the clip)<br><br>
2. Resampling the audio clip
 - Original sampling rate = 44.1 kHz
 - According to some Kaggle kernels, we can lower the sampling rate to 16 kHz without losing much information, which results in faster computation<br><br>
3. Random offsetting/padding:
 - After removing leading and trailing silences, the clip lengths vary from 0 to 30 seconds. One idea is to bring all clips to the same length, i.e. 15 seconds: for longer clips we choose a random 15-second segment, and for shorter clips we pad the clip with zeros on either side.
 - Choosing a random offset adds a form of randomization that helps control overfitting
 - After this step all clips have the same length (15 seconds), which makes it easy to feed the data to the model<br><br>
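The random offsetting/padding idea in step 3 can be sketched on synthetic arrays. This is a minimal illustration, not the notebook's preprocessing function: the helper name `crop_or_pad`, the seeded generator, and the all-ones test clips are assumptions made for the example.

```python
import numpy as np

def crop_or_pad(clip, target_length, rng):
    """Return a copy of `clip` with exactly `target_length` samples.

    Hypothetical helper for illustration; `rng` is a NumPy Generator.
    """
    n = len(clip)
    if n > target_length:
        # Longer clip: keep a random window of target_length samples.
        offset = rng.integers(0, n - target_length + 1)
        return clip[offset:offset + target_length].copy()
    if n < target_length:
        # Shorter clip: split the missing samples randomly between both ends.
        missing = target_length - n
        left = rng.integers(0, missing + 1)
        return np.pad(clip, (left, missing - left), mode="constant")
    # Already the right length: nothing to do.
    return clip.copy()

rng = np.random.default_rng(0)
sampling_rate = 16000
target = 15 * sampling_rate  # 240,000 samples

long_clip = np.ones(20 * sampling_rate, dtype=np.float32)   # 20-second clip
short_clip = np.ones(3 * sampling_rate, dtype=np.float32)   # 3-second clip

cropped = crop_or_pad(long_clip, target, rng)
padded = crop_or_pad(short_clip, target, rng)
print(cropped.shape, padded.shape)
```

Both results have 240,000 samples; the padded clip keeps its 48,000 original samples and is surrounded by zeros.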

### Importing Libraries and Data

In [1]:
import numpy as np
import pandas as pd
import joblib
from pathlib import Path
import os
import shutil
import librosa
from sklearn.model_selection import train_test_split

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
df_combined = pd.read_csv(r'../train_curated.csv')

#### Creating a list of all class labels in sorted order

In [4]:
LABELS=set()
all_labels = list(df_combined['labels'])
for row in all_labels:
    for lab in row.split(r','):
        LABELS.add(lab)

LABELS=list(LABELS)
LABELS.sort()
joblib.dump(LABELS, 'labels.joblib');

### Train Cross Validation Split
We perform a 70-30 train/validation split that is not purely random but stratified, so that the training and validation datasets contain approximately the same distribution of the number of labels per clip
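To show what stratifying on the number of labels per clip achieves, here is a small standard-library sketch on toy data. It is only a stand-in for the idea, not scikit-learn's implementation; `stratified_split`, `share`, and the toy label counts are hypothetical names and values.

```python
import random
from collections import Counter, defaultdict

def stratified_split(label_counts, test_fraction, seed):
    """Split indices so each label-count group keeps roughly the same share.

    Illustrative stand-in only; not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, count in enumerate(label_counts):
        groups[count].append(idx)

    train_idx, val_idx = [], []
    for members in groups.values():
        # Shuffle inside each group, then cut off the validation share.
        rng.shuffle(members)
        n_val = round(len(members) * test_fraction)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy data: 80 clips with one label, 20 clips with two labels.
toy_label_counts = [1] * 80 + [2] * 20
train_idx, val_idx = stratified_split(toy_label_counts, test_fraction=0.3, seed=21)

def share(indices):
    """Fraction of clips per label count within the given subset."""
    counts = Counter(toy_label_counts[i] for i in indices)
    return {k: counts[k] / len(indices) for k in sorted(counts)}

print(share(train_idx))
print(share(val_idx))
```

Because the split is made inside each label-count group, both subsets keep the same 80/20 proportion as the toy data.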

In [5]:
def count_labels(label_string):
    '''
    Description -> Returns the total count of labels in the given comma separated label_string
    Input -> String containing all the labels corresponding to the clip in CSV format ("label1,label2,label3")
    Output -> Number of labels in the label_string (e.g. 3)
    '''
    return len(label_string.split(','))

In [7]:
df_combined['label_count'] = df_combined['labels'].apply(count_labels)

In [8]:
df_combined['label_count'].value_counts()

1    4269
2     627
3      69
4       4
6       1
Name: label_count, dtype: int64

In [11]:
ALL_FILES_DIRECTORY = r'all_data/'
TRAIN_FILES_DIRECTORY = r'train_data/'
VAL_FILES_DIRECTORY = r'val_data/'

In [14]:
# There is only one point with 6 labels which can be considered as an outlier hence we delete the corresponding clip
fname_to_be_deleted = (df_combined[df_combined['label_count'] == 6]['fname']).values[0]
fname_path = ALL_FILES_DIRECTORY + fname_to_be_deleted
os.remove(fname_path)
df_combined = df_combined[df_combined['label_count'] < 6]
y_count=np.array(df_combined['label_count'])
df_train, df_val = train_test_split(df_combined, stratify=y_count, test_size=0.3, random_state=21)

In [31]:
def get_distribution_of_labels_in_a_dataframe(df):
    '''
    Get each distinct number of labels per clip and the percentage of clips having it
    Input  -> Pandas dataframe containing "fname" (filename), "labels" (ground truth labels) and 
             "label_count" (number of ground truth labels for the fname)
    Output -> Pandas dataframe containing "Number of labels" (each unique value in the "label_count") and
              "Percentage of clips having it"
    '''
    # Compute the percentages once and sort by label count so that both columns stay aligned
    percentages = (df['label_count'].value_counts(normalize=True).sort_index() * 100).round(2)
    df_labels_percentage = pd.DataFrame({'Number of labels' : list(percentages.index),
                                         'Percentage of clips having it' : list(percentages.values)})
    return df_labels_percentage

In [32]:
df_labels_percentage_train = get_distribution_of_labels_in_a_dataframe(df_train)
df_labels_percentage_train

   Number of labels  Percentage of clips having it
0                 1                          85.91
1                 2                          12.62
2                 3                           1.38
3                 4                           0.09


In [33]:
df_labels_percentage_val = get_distribution_of_labels_in_a_dataframe(df_val)
df_labels_percentage_val

   Number of labels  Percentage of clips having it
0                 1                          85.92
1                 2                          12.61
2                 3                           1.41
3                 4                           0.07


In [14]:
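# Moving the .wav files present in df_val to the VAL_FILES_DIRECTORY
df_val_filenames = set(df_val['fname'])
all_files = os.listdir(ALL_FILES_DIRECTORY)

# Make sure the validation folder exists so that the moves below cannot fail on a missing folder
os.makedirs(VAL_FILES_DIRECTORY, exist_ok=True)

for file in all_files:
    if file in df_val_filenames:
        src_path = os.path.join(ALL_FILES_DIRECTORY, file)
        dst_path = os.path.join(VAL_FILES_DIRECTORY, file)
        shutil.move(src_path, dst_path)
        
# The files left in ALL_FILES_DIRECTORY form the training set
os.rename(ALL_FILES_DIRECTORY, TRAIN_FILES_DIRECTORY)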

In [15]:
# Saving for future use
df_train.to_csv(r'df_train.csv', index=False)
df_val.to_csv(r'df_val.csv', index=False)
df_combined.to_csv('df_combined.csv', index=False)

### Creating Configuration class

The Configuration object stores the learning parameters that are shared between data generators, models, and training functions. Basically, these are the global variables as far as training is concerned. <br><br>
The parameters used here are: <br>
1. **Sampling rate:** Number of samples taken per second. The original sampling rate of the data is 44.1 kHz, i.e., 44,100 samples were taken per second while recording the audio. We choose a lower sampling rate of 16 kHz, i.e., 16,000 samples per second. This downsampling speeds up model training, which matters given our computational constraints.<br><br>
2. **Audio Duration**: To feed data to the model, each datapoint must have the same dimension, which requires every clip to have the same length (i.e. the same total number of samples). In the EDA we saw that this is not the case, so we choose an audio duration of 15 seconds for each clip: regardless of the actual length of the clip, we select 16,000 * 15 = 240,000 samples for each datapoint. <br>
 - Clips shorter than 15 seconds are padded on either side with silence (zeros) so that the duration becomes exactly 15 seconds.
 - For clips longer than 15 seconds, we choose a random 15-second segment of the clip. This also acts like data augmentation at training time, which helps control overfitting <br><br>
3. **n_classes:** This is the number of unique labels in the modified dataset (containing only datapoints with single labels). It is computed as the number of distinct values in the "labels" column, so on data that still contains multi-label clips it counts distinct label combinations instead <br><br>
4. **Audio length:** This is the total number of samples in the clip. If the sampling rate is S (S samples per second) and the duration of the clip is t seconds, the total number of samples in the clip = S * t <br><br>
5. **Dimensionality:** The audio length above determines the dimensionality of each datapoint, stored as (audio_length, 1)
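The arithmetic behind parameters 1, 2, 4 and 5 can be checked in a few lines. This is only a worked example using the values stated above; the variable names are chosen for illustration and are not part of the Config class.

```python
original_rate = 44100   # Hz, sampling rate of the recordings
target_rate = 16000     # Hz, sampling rate after resampling
audio_duration = 15     # seconds kept per clip

# Audio length = S * t
audio_length = target_rate * audio_duration
# Each datapoint is a single-channel sequence of that many samples.
dim = (audio_length, 1)

# Samples the same 15 seconds would need at the original rate,
# and how much smaller each datapoint becomes after downsampling.
samples_at_original_rate = original_rate * audio_duration
reduction_factor = samples_at_original_rate / audio_length

print(audio_length)               # 240000
print(dim)                        # (240000, 1)
print(samples_at_original_rate)   # 661500
print(round(reduction_factor, 2))
```

Each datapoint is therefore roughly 2.8 times smaller than it would be at 44.1 kHz, which is where the speed-up comes from.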

In [9]:
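# Config class is used to share the global parameters across various functions
class Config():
    def __init__(self,
                 sampling_rate=None,
                 audio_duration=None):
        
        self.sampling_rate = sampling_rate      # samples per second (Hz)
        self.audio_duration = audio_duration    # clip duration in seconds
        # Number of distinct values in the "labels" column (distinct label combinations for multi-label clips)
        self.n_classes = len(df_combined['labels'].value_counts())
        # Total number of samples per clip = S * t
        self.audio_length = self.sampling_rate * self.audio_duration
        # Shape of a single datapoint fed to the model
        self.dim = (self.audio_length, 1)
        
config = Config(sampling_rate=16000, audio_duration=15)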

### Defining the preprocessing function

In [11]:
# Assumption: "write" is scipy.io.wavfile.write, whose (filename, rate, data) signature matches the call below
from scipy.io.wavfile import write

def preprocess_initial(config, data_dir, dest_folder):
    '''
    Objective ->
    This function performs the preprocessing on each audio clip in the data_dir
    
    Input ->
    config: An instance of the above Config() class which is used to determine the configuration parameters used for
    preprocessing
    
    data_dir: Path of the training/testing data folder which contains the .wav files
    
    dest_folder: Path of the folder in which the preprocessed .wav files are stored
    
    Processing ->
    1. Load each .wav file present in the data_dir folder into a NumPy array
    2. Trim the leading and trailing silences (sounds more than 60 dB quieter than the loudest part of the clip)
    3. Select a fixed-length random segment of 15 seconds from each clip to ensure uniform dimensions while training the model
       For clips shorter than 15 seconds, pad the NumPy array with zeros on both ends, split at a random offset
    4. Store the modified clip in "dest_folder"
    '''
    input_length = config.audio_length
    
    # Create the destination folder if it does not exist yet
    os.makedirs(dest_folder, exist_ok=True)
    
    # Iterate through each file in the "data_dir" which contains all the .wav files
    for file in os.listdir(data_dir):
        filepath = os.path.join(data_dir, file)
        # Load the .wav file into a numpy array "data", resampled to config.sampling_rate (16 kHz) with the
        # "kaiser_fast" resampling method, which trades some quality for speed
        data, _ = librosa.load(filepath,
                               sr=config.sampling_rate,
                               res_type='kaiser_fast')
        
        # Trim the leading and trailing silences, i.e., sounds more than 60 dB below the loudest part of the clip
        data, _ = librosa.effects.trim(data, top_db=60)
        
        # Random offset / Padding
        # Case 1: Audio longer than "input_length" samples -> We choose a random segment of "input_length" samples
        if len(data) > input_length:
            max_offset = len(data) - input_length
            offset = np.random.randint(max_offset + 1)
            data = data[offset:(input_length+offset)]
            
        # Case 2: Audio shorter than "input_length" samples -> Pad with zeros on either side of the clip
        elif len(data) < input_length:
            max_offset = input_length - len(data)
            offset = np.random.randint(max_offset + 1)
            data = np.pad(data, (offset, max_offset - offset), "constant")
                
        # Case 3: Audio is exactly "input_length" samples long -> No change is required
     
        dest_path = os.path.join(dest_folder, file)
        
        write(dest_path, config.sampling_rate, data)
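The `top_db=60` threshold is relative to the loudest part of the clip, not an absolute loudness: 60 dB corresponds to an amplitude ratio of 10^(-60/20) = 0.001. The sketch below illustrates that idea sample by sample on a synthetic signal. It is a deliberate simplification (librosa.effects.trim works on frame-wise energy), and `trim_relative_to_peak` is a hypothetical helper, not a librosa function.

```python
import numpy as np

def trim_relative_to_peak(signal, top_db=60.0):
    """Drop leading/trailing samples quieter than `top_db` below the peak.

    Simplified, sample-wise illustration; librosa.effects.trim works on
    frame-wise energy instead.
    """
    peak = np.max(np.abs(signal))
    if peak == 0:
        # A completely silent clip has nothing to keep.
        return signal[:0]
    # Convert the dB threshold into an amplitude threshold.
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = np.flatnonzero(np.abs(signal) >= threshold)
    return signal[loud[0]:loud[-1] + 1]

# Synthetic clip: near-silence, a loud section, near-silence again.
quiet = np.full(200, 0.0001)
loud_part = np.full(1000, 0.5)
clip = np.concatenate([quiet, loud_part, quiet])

trimmed = trim_relative_to_peak(clip, top_db=60.0)
print(len(clip), len(trimmed))
```

The quiet samples sit about 74 dB below the peak, so they fall under the 60 dB threshold and only the 1,000 loud samples remain.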

### Preprocessing for the train and validation dataset

In [12]:
TRAIN_FILES_DIRECTORY = r'train_data/'
TRAIN_DEST_DIRECTORY = r'preprocessed_files_train/'

VAL_FILES_DIRECTORY = r'val_data/'
VAL_DEST_DIRECTORY = r'preprocessed_files_val/'

preprocess_initial(config, TRAIN_FILES_DIRECTORY, TRAIN_DEST_DIRECTORY)
preprocess_initial(config, VAL_FILES_DIRECTORY, VAL_DEST_DIRECTORY)

### Preprocessing for the test dataset

In [14]:
TEST_FILES_DIRECTORY = r'test_data/'
TEST_DEST_DIRECTORY = r'preprocessed_files_test/'

preprocess_initial(config, TEST_FILES_DIRECTORY, TEST_DEST_DIRECTORY)