In [2]:
# render each plot under the code cell that created it
%matplotlib inline
import json  # for working with json files
import sys  # Python system library needed to load custom functions
import numpy as np  # for performing calculations on numerical arrays
import pandas as pd  # home of the DataFrame construct, _the_ most important object for Data Science
import torch  # library to work with PyTorch tensors and to figure out if we have a GPU available
import os     # for changing the directory

from datasets import load_dataset, Audio  # required tools to create, load and process our audio dataset
from transformers import ASTFeatureExtractor, ASTForAudioClassification, TrainingArguments, Trainer  # required classes to perform the model training

sys.path.append('../src')  # add the source directory to the PYTHONPATH. This lets us import local functions and modules.
from gdsc_utils import download_directory, PROJECT_DIR # function to download the needed files from the official GDSC S3 bucket, and the path of our root directory
from config import DEFAULT_BUCKET  # S3 bucket with the GDSC data
from preprocessing import calculate_stats, preprocess_audio_arrays  # functions to calculate dataset statistics and preprocess the dataset with ASTFeatureExtractor
from gdsc_eval import make_predictions, compute_metrics  # functions to create predictions and evaluate them
os.chdir('../..') # changing our directory to root

In [3]:
print(os.getcwd())

/root/data


# Loading the model and doing inference on the test set

If you look back at the *TrainingArguments* class, you will see that we passed an *output_dir* argument that tells 🤗 where to put the checkpoints, which hold the training metadata and the model. We set it to *models/AST*, so let's use this directory to load the feature extractor and the model from the best checkpoint (the path in the cell below comes from our own training run, so adjust it to wherever your checkpoints ended up). Strictly speaking, this step is not necessary: we also set *load_best_model_at_end* to *True* in our *TrainingArguments* object, which ensures that the variable *model* already holds the best model according to the metric of choice. We just want to show you how to load the model from other checkpoints in case you'd like to experiment. With the 🤗 library, loading a checkpoint is just a matter of two lines.


In [5]:
feature_extractor = ASTFeatureExtractor.from_pretrained("experiments/models/sm-training-custom-2023-07-13-09-09-10-608/checkpoint-16110")
model = ASTForAudioClassification.from_pretrained("experiments/models/sm-training-custom-2023-07-13-09-09-10-608/checkpoint-16110")
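If you want to experiment with a different checkpoint, you first need to know which ones exist. The following is a minimal sketch, assuming the checkpoints sit in folders named `checkpoint-<step>` as in the path above. `list_checkpoints` is a hypothetical helper written for this illustration; it is not part of 🤗 or of the tutorial's *src* directory.

```python
import os
import re


def list_checkpoints(output_dir):
    """Return the checkpoint folders inside output_dir, sorted by training step.

    Assumes folders are named 'checkpoint-<step>'; anything else is ignored.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        full_path = os.path.join(output_dir, name)
        if match and os.path.isdir(full_path):
            # Sort numerically by step, so that 1000 comes after 500
            found.append((int(match.group(1)), full_path))
    return [path for _, path in sorted(found)]
```

Any of the returned paths can be passed to *from_pretrained*. Keep in mind that the last entry is only the most recent checkpoint, which is not necessarily the best one.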

Cool! Now let's get the test set data. We need to preprocess it in the same way as we did for training. Let's start by simply loading the dataset and resampling the audio arrays.

In [6]:
MODEL_SAMPLING_RATE = 22050

In [7]:
test_path = 'data/test'
test_dataset = load_dataset("audiofolder", data_dir=test_path).get('train')
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=MODEL_SAMPLING_RATE))

Resolving data files:   0%|          | 0/557 [00:00<?, ?it/s]

Downloading and preparing dataset audiofolder/default to /root/.cache/huggingface/datasets/audiofolder/default-0de20c61adc4af17/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc...


Dataset audiofolder downloaded and prepared to /root/.cache/huggingface/datasets/audiofolder/default-0de20c61adc4af17/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc. Subsequent calls will reuse this data.



In [8]:
test_dataset

Dataset({
    features: ['audio'],
    num_rows: 556
})

In [9]:
test_dataset[0]

{'audio': {'path': '/root/data/data/test/0.wav',
  'array': array([-2.13162821e-13,  4.26325641e-14, -7.03437308e-13, ...,
         -3.38351354e-02,  8.03619623e-02,  2.31235102e-02]),
  'sampling_rate': 22050}}

Since the predictions file needs two columns, file_name and predicted_class_id, let's extract the path of each data point and turn it into a feature called "file_name".

For this purpose we'll use the metadata information from the dataset object that we just created.

So let's get the paths of the audio files.

In [10]:
test_paths = list(test_dataset.info.download_checksums.keys())

Let's inspect the variable.

In [11]:
test_paths[:3]

['/root/data/data/test/0.wav',
 '/root/data/data/test/1.wav',
 '/root/data/data/test/10.wav']

Great! We obtained the paths. Note that the test_paths variable also contains the metadata.csv file with the file names and labels (check it on your own!). We don't need it, so we will use a one-line lambda function to keep only the items that refer to audio files.

Furthermore, we don't need the whole path, just the file name, so we will define another one-liner that returns the string after the last "/" character, which is exactly the file name.

We will use the built-in filter and map functions, which apply a function to every item of a Python iterable, to run the lambda functions defined below.

In [12]:
remove_metadata = lambda x: x.endswith(".wav")
extract_file_name = lambda x: x.split('/')[-1]

test_paths = list(filter(remove_metadata, test_paths))
test_paths = list(map(extract_file_name, test_paths))
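The same two steps can also be written as a single list comprehension. This is only an alternative sketch, not the approach used in the rest of the notebook; `wav_file_names` and `example_paths` are names made up for this illustration.

```python
import os


def wav_file_names(paths):
    """Keep only the .wav paths and strip the directory part from each."""
    return [os.path.basename(p) for p in paths if p.endswith(".wav")]


example_paths = [
    "/root/data/data/test/0.wav",
    "/root/data/data/test/1.wav",
    "/root/data/data/test/10.wav",
    "/root/data/data/test/metadata.csv",  # dropped by the .wav filter
]
print(wav_file_names(example_paths))  # ['0.wav', '1.wav', '10.wav']
```

Using `os.path.basename` instead of splitting on "/" keeps the intent explicit; on the paths shown here both give the same result.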

Let's see if the test_paths variable contains the file names.

In [13]:
test_paths[:3]

['0.wav', '1.wav', '10.wav']

Yes, we indeed have just the file names. Let's create a new column with the file names.

In [14]:
test_dataset = test_dataset.add_column("file_name", test_paths)

Let's inspect the newly created "file_name" feature.

In [15]:
test_dataset

Dataset({
    features: ['audio', 'file_name'],
    num_rows: 556
})

In [16]:
test_dataset[0]

{'audio': {'path': '/root/data/data/test/0.wav',
  'array': array([-2.13162821e-13,  4.26325641e-14, -7.03437308e-13, ...,
         -3.38351354e-02,  8.03619623e-02,  2.31235102e-02]),
  'sampling_rate': 22050},
 'file_name': '0.wav'}
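The first row looks right, but *add_column* attaches the values purely by position, so the list of file names has to be in the same order as the rows of the dataset. Below is a sketch of a sanity check for this. It uses plain dictionaries as a stand-in for dataset rows, with the same "audio"/"path" and "file_name" keys as in the output above; `find_misaligned_rows` is a hypothetical helper, not part of the tutorial code.

```python
import os


def find_misaligned_rows(rows):
    """Return the indices of rows whose file_name differs from the audio path's base name."""
    return [
        index
        for index, row in enumerate(rows)
        if os.path.basename(row["audio"]["path"]) != row["file_name"]
    ]


rows = [
    {"audio": {"path": "/root/data/data/test/0.wav"}, "file_name": "0.wav"},
    {"audio": {"path": "/root/data/data/test/1.wav"}, "file_name": "10.wav"},  # deliberately wrong
]
print(find_misaligned_rows(rows))  # [1]
```

An empty list means every file name matches its audio file. Running such a check over the real dataset may take a while, because every row has to be read.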

Amazing! We have almost finished preprocessing the data. The last step is to pass the audio arrays through our feature extractor and set the format of the "input_values" column from numpy to torch, so that we can safely pass the spectrogram arrays through the model.

In [17]:
test_dataset_encoded = test_dataset.map(lambda x: preprocess_audio_arrays(x, 'audio', 'array', feature_extractor), remove_columns="audio", batched=True, batch_size = 2)
test_dataset_encoded.set_format(type='torch', columns=['input_values'])

Map:   0%|          | 0/556 [00:00<?, ? examples/s]

Now let's tell 🤗 that we want to run the predictions on our GPU (falling back to the CPU if no GPU is available). To do this, we define the *device* variable with the help of the *PyTorch* library.

In [18]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Good, we are set up to perform inference on the test set. Let's use the *make_predictions* function from our *gdsc_eval* module located in the *src* directory. This time we will set the *batch_size* argument to 8 to avoid any out-of-memory issues. We are also dropping the "input_values" column, as we won't need it anymore.

In [19]:
test_dataset_encoded = test_dataset_encoded.map(lambda x: make_predictions(x['input_values'], model, device), batched=True, batch_size=8, remove_columns="input_values")
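With *batched=True*, the function passed to *map* receives a dictionary of columns holding up to *batch_size* rows and returns a dictionary of new columns. The sketch below imitates that contract in plain Python so you can see what happens to the rows. Everything in it is an assumption made for illustration: `predict_batch` is a toy stand-in for *make_predictions* (the real function takes the model and the device, and its implementation is not shown here), `map_batched` is a simplified imitation of the 🤗 *map* method, and the scores are made-up numbers.

```python
def predict_batch(batch):
    """Toy stand-in for make_predictions: index of the largest score in each row."""
    return {
        "predicted_class_id": [
            max(range(len(scores)), key=scores.__getitem__)
            for scores in batch["scores"]
        ]
    }


def map_batched(columns, function, batch_size):
    """Apply function to consecutive slices of the columns and collect the new columns."""
    n_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of column slices with at most batch_size rows
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output


toy = {"scores": [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9]]}
print(map_batched(toy, predict_batch, batch_size=2))  # {'predicted_class_id': [1, 0, 2]}
```

The result does not depend on the batch size; a smaller batch only means that fewer rows are processed at once, which is why lowering it helps with out-of-memory issues.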

Map:   0%|          | 0/556 [00:00<?, ? examples/s]

Let's now create a pandas dataframe from our 🤗 dataset. We should see the columns file_name and predicted_class_id.

In [20]:
test_dataset_encoded_df = test_dataset_encoded.to_pandas()
test_dataset_encoded_df.head()

  file_name  predicted_class_id
0     0.wav                  14
1     1.wav                  60
2    10.wav                  26
3   100.wav                  56
4   101.wav                  57


Great! Now we need to save the dataframe in a csv file and we are ready to send the predictions. We will save it in the directory of our model, to have everything in one place.

In [21]:
test_dataset_encoded_df.to_csv("models/predictions_chunks_epoch10.csv", index=False)

And done! We have our CSV file with the predictions ready. Let's upload it via the challenge website and see our results!

The score is much better than the one from the Random Forest. Remember that in this tutorial we are using a much more powerful model, one that was designed to work with audio data. But since the F1 metric ranges from 0 to 1, there is still some room for improvement. In the next tutorial, we will see how the model performs on the whole dataset. Then you will see what the model is really capable of! In the meantime, you can try the exercises below while making a coffee before the final tutorial.

***
**It is important that you name the columns exactly *file_name* and *predicted_class_id*, otherwise your score won't appear on the leaderboard!**
***
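Before uploading, you can check the header of the saved file yourself. This is a minimal sketch using only the standard library; `has_expected_columns` is a hypothetical helper, and requiring the two columns in this exact order is our assumption, which is stricter than what the note above states.

```python
import csv

# Assumption: exactly these two columns, in this order
EXPECTED_COLUMNS = ["file_name", "predicted_class_id"]


def has_expected_columns(csv_path):
    """Return True if the first row of the CSV file is exactly the expected header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return header == EXPECTED_COLUMNS
```

For example, `has_expected_columns("models/predictions_chunks_epoch10.csv")` should return True for the file saved above. It would return False if *index=False* had been left out of *to_csv*, because pandas would then write the index as an extra first column.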

**Exercise time:**

The last exercises in this notebook are to:
* think about how we could improve the model further, apart from running it on the whole sample. What does your Data Science intuition tell you? Post your thoughts in the Team's channel and gain some recognition for your team! 😃
* try another model from the 🤗 model hub. You will need to import other classes instead of ASTFeatureExtractor and ASTForAudioClassification. You will also need to change the string in the *from_pretrained* method and adjust the preprocessing. Sounds like a lot? Well, this is how we do Data Science! 😃

REMINDER: After finishing your work remember to shut down the instance.