In [None]:
#adapted from https://www.tensorflow.org/tutorials/audio/transfer_learning_audio
#Google Colab alternative: https://colab.research.google.com/drive/1b1Ade24iVxJHhgi32ODMsjfQfXcoA70c?usp=sharing 

# Working with a Pretrained YAMNet

## YAMNet

YAMNet is a pretrained model provided by TensorFlow, trained on the [AudioSet](https://research.google.com/audioset/) environmental sounds dataset. 

You can think of AudioSet as analogous to ``ImageNet``, but for audio.  

```
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos
```

We're going to look at:


* What labels does it give to new YouTube videos? 


* How to use the ``embeddings`` it generates (think ``word embeddings`` from NLP) to train our own audio classifiers. This is conceptually similar to the transfer learning we did in the image domain, but implemented slightly differently. 


### Install ``tensorflow_hub``

First we need to install ``tensorflow_hub``, a package which allows us to access some **pretrained models** from Google.

In [None]:
!pip install tensorflow_hub

In [None]:
import os
from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import librosa

In [None]:
#Load model
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)

In [None]:
#Get class names
class_map_path = yamnet_model.class_map_path().numpy().decode('utf-8')
class_names = list(pd.read_csv(class_map_path)['display_name'])

### Ripping audio from YouTube Videos 

#### youtube-dl

``youtube-dl`` is a great tool for extracting videos from YouTube. You can install it using ``homebrew`` (which you hopefully installed in Week 2!). 

In [None]:
#Install youtube-dl (Mac OSX only)
!brew install youtube-dl

In [None]:
#Function to get video from youtube, extract audio and trim
def get_audio(youtube_id, file_length=60):

    output_filename = youtube_id + ".wav"
    
    #Remove existing files from previous runs (-f: no error if a file is missing)
    !rm -f youtube_audio.m4a full.wav $output_filename
    
    #Get the best available m4a audio stream for the video
    !youtube-dl -ci -f "bestaudio[ext=m4a]" $youtube_id -o 'youtube_audio.m4a'
    
    #Convert to wav
    !ffmpeg -i 'youtube_audio.m4a' -ac 2 -f wav full.wav
    
    #Trim: keep the first file_length seconds (-ss 0 starts at the beginning)
    !ffmpeg -y -ss 0 -i full.wav -t $file_length -c copy $output_filename
    
    #Read into librosa as 16 kHz mono (the input YAMNet expects) and return the samples
    y,sr = librosa.load(output_filename,sr=16000)
    return y

### Try it out

Put the **ID of a YouTube video** into the function below. The first part will get the audio; the second will run it through ``YAMNet`` and show the classification of the audio at various timestamps. 

#### Warning 

We have to download the **whole video** before we trim it down, so pick a video that is between 1 and 3 minutes long.

#### Example

For a YouTube video address like this: https://www.youtube.com/watch?v=Ptxjrmqo2Xo, we only need the part after ``v=``: ``Ptxjrmqo2Xo``
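
If you would rather paste in a full address, the ID can be pulled out with the standard library. This is a minimal sketch that assumes the standard ``watch?v=...`` address form shown above; ``video_id_from_url`` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlparse, parse_qs

def video_id_from_url(url):
    """Return the value of the `v` query parameter from a watch URL."""
    query = parse_qs(urlparse(url).query)
    if "v" not in query:
        raise ValueError(f"No video id found in: {url}")
    return query["v"][0]

print(video_id_from_url("https://www.youtube.com/watch?v=Ptxjrmqo2Xo"))
```

The returned string can then be passed straight to ``get_audio``.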

In [None]:
#The first minute of the audio
file_length = 60

In [None]:
#Get audio and trim to file_length
audio_class_one = get_audio("Ptxjrmqo2Xo", file_length)

In [None]:
#Run through the YAMNet model and get scores for the first minute, roughly one frame every half second
#scores is the output of the final layer: one row per frame, with a score for each class
scores, embeddings_class_one, spec = yamnet_model(audio_class_one)
interval = np.round(file_length/len(scores),1)
all_sounds = [str(i*interval) + "s -> " + class_names[tf.argmax(score)] for i,score in enumerate(scores)]
#Getting the predicted label for each timestamp
all_sounds
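
The cell above keeps only the single highest-scoring class for each frame. It can be informative to see the runners-up as well. Below is a minimal sketch using NumPy on made-up scores; ``top_k_labels`` is a hypothetical helper, and the 0.48 second hop between frames is taken from the YAMNet documentation rather than computed from the audio.

```python
import numpy as np

def top_k_labels(scores, class_names, k=3, hop_seconds=0.48):
    """Return (start_time, [(label, score), ...]) for each frame."""
    scores = np.asarray(scores)
    results = []
    for i, frame_scores in enumerate(scores):
        # Indices of the k largest scores, highest first
        top = np.argsort(frame_scores)[::-1][:k]
        labels = [(class_names[j], float(frame_scores[j])) for j in top]
        results.append((round(i * hop_seconds, 2), labels))
    return results

# Made-up data: 2 frames, 3 classes
demo_names = ["Speech", "Music", "Dog"]
demo_scores = np.array([[0.9, 0.05, 0.2],
                        [0.1, 0.7, 0.3]])
demo_result = top_k_labels(demo_scores, demo_names, k=2)
for start, labels in demo_result:
    print(start, labels)
```

With the real model you would pass ``scores.numpy()`` and ``class_names`` instead of the demo values.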

## Building an Audio Classifier

### Embeddings 

You'll notice that as well as the ``scores`` returned by the model (which we use to check what labels have been applied to each audio frame), we also get something called ``embeddings``. 

Just as word embeddings in NLP provide a **denser, continuous** representation of text data, YAMNet gives us a vector of **1024 numbers** that represents each audio frame. 

So, instead of taking the model and retraining it like we did with the **image classifier**, here we **first run each audio frame through the model** and get the **embeddings**. These then become the features that we save in the dataset. 

We can then build our own simple classifier based on this dataset. The **embedding** is a highly optimised representation of each audio frame, and it works best for audio classification tasks that are similar to the one the original YAMNet was trained on. 
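
To make "a vector of 1024 numbers per frame" concrete, here is a small sketch using random numbers as stand-ins for real embeddings. It is an illustration only: the averaging step and the ``cosine_similarity`` helper are assumptions for this example and are not used elsewhere in the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for YAMNet embeddings: shape (num_frames, 1024)
rng = np.random.default_rng(0)
frames_a = rng.random((5, 1024))
frames_b = rng.random((7, 1024))

# Averaging over frames gives one 1024-value vector per clip
clip_a = frames_a.mean(axis=0)
clip_b = frames_b.mean(axis=0)

print(clip_a.shape)
print(cosine_similarity(clip_a, clip_b))
```

Because every frame is a point in the same 1024-dimensional space, frames (or whole clips) can be compared, clustered, or fed to a classifier like any other feature vector.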

In [None]:
#Get second audio class
audio_class_two = get_audio("D6Cwi-jMzdc",file_length)

In [None]:
scores_2, embeddings_class_two, spec = yamnet_model(audio_class_two)
interval = np.round(file_length/len(scores_2),1)
all_sounds = [str(i*interval) + "s -> " + class_names[tf.argmax(score)] for i,score in enumerate(scores_2)]
all_sounds

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Labels (output)
num_class_one = embeddings_class_one.shape[0]
num_class_two = embeddings_class_two.shape[0]
#class_one = 0, class_two = 1
labels = tf.concat([tf.zeros(num_class_one), tf.ones(num_class_two)],0)

#Features/Embeddings (input)
features = tf.concat([embeddings_class_one,embeddings_class_two],0)

#Test-Train split
#20% for validation testing, the rest for training
X_train, X_val, y_train, y_val = train_test_split(features.numpy(), 
                                                    labels.numpy(),
                                                    test_size=0.2, 
                                                    random_state=42)

#Build dataset: one (embedding, label) pair per frame, served in batches of 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train,y_train)).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices((X_val,y_val)).batch(32)

### Training Callbacks and Early Stopping

We build up our ``Sequential()`` model much the same as we have done in the past, making sure the ``Input`` layer is the right size to accept our embeddings (1024 values for each audio frame). 

We also add in something extra called a ``callback``. This is an object that Keras calls at set points during training; the one we use here runs its check **at the end of every epoch**.

We can make our own custom callbacks, or use some of the built-in ones that come with ``Keras``. Here we use ``EarlyStopping()``, which checks conditions at each epoch and decides whether we should continue with training.

* ``monitor``: Tells us which metric to keep tabs on whilst training. Here, we pick ``loss`` (the training loss). 


* ``patience``: This tells us how many epochs to wait once the ``loss`` has **stopped decreasing** before training is stopped. 


* ``restore_best_weights``: This means that after **early stopping**, we keep the weights that gave us the lowest loss. 
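
The logic behind ``patience`` can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; ``simulate_early_stopping`` and the loss values are made up for the example.

```python
def simulate_early_stopping(losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch losses.

    Epochs are 0-indexed. Training stops once `patience` epochs in a row
    fail to improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(losses) - 1, best_epoch

demo_losses = [0.9, 0.6, 0.5, 0.55, 0.52, 0.51, 0.4]
stop_epoch, best_epoch = simulate_early_stopping(demo_losses, patience=3)
print(stop_epoch, best_epoch)  # prints: 5 2
```

Here training stops at epoch 5, and the best weights are the ones from epoch 2. Note that the loss would have dropped again at epoch 6, which is the trade-off of a small ``patience``.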

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential

In [None]:
#Make model
embedding_size = embeddings_class_one.shape[1]
classifier = Sequential([
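    #the input layer is the size of the embeddings (shape must be a tuple)
    layers.Input(shape=(embedding_size,), dtype=tf.float32,
                          name='input_embedding'),
    layers.Dense(32, activation='relu'),
    #sigmoid outputs a probability, which BinaryCrossentropy expects by default
    layers.Dense(1, activation='sigmoid')
], name='my_model')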

classifier.summary()
classifier.compile(loss=keras.losses.BinaryCrossentropy(),
                 optimizer="adam",
                 metrics=['accuracy'])

callback = keras.callbacks.EarlyStopping(monitor='loss',
                                            patience=3,
                                            restore_best_weights=True)

In [None]:
#Train new classifier with a callback called at the end of every epoch
history = classifier.fit(train_ds,
                         validation_data=val_ds,
                         epochs=20,
                         callbacks=[callback])
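
The classifier predicts one value per audio frame, so judging a whole clip needs one more step. One simple option, sketched below with made-up numbers, is a majority vote over the frames. ``clip_decision`` is a hypothetical helper; in practice ``frame_outputs`` would come from something like ``classifier.predict(embeddings)``, and whether those outputs are probabilities or raw logits depends on the activation of the final layer, hence the ``from_logits`` flag.

```python
import numpy as np

def clip_decision(frame_outputs, from_logits=False, threshold=0.5):
    """Turn per-frame outputs into per-frame labels and one clip label."""
    outputs = np.asarray(frame_outputs, dtype=np.float64).reshape(-1)
    # Convert logits to probabilities with a sigmoid if needed
    probs = 1.0 / (1.0 + np.exp(-outputs)) if from_logits else outputs
    frame_labels = (probs >= threshold).astype(int)
    # Majority vote: class 1 if at least half of the frames say so
    clip_label = int(frame_labels.mean() >= 0.5)
    return frame_labels, clip_label

# Made-up probabilities for 4 frames, shaped like a Dense(1) output
demo_probs = np.array([[0.1], [0.8], [0.7], [0.9]])
frame_labels, clip_label = clip_decision(demo_probs)
print(frame_labels, clip_label)
```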

## Tasks 

1. Try some different YouTube videos and check out the timestamped labels. Do these seem correct?


2. Download another YouTube video and train a binary classifier. Does it work well?