# SoundNet: Learning Sound Representations from Unlabeled Video
---
---
<img src="../TP's/images/IMT_Atlantique.jpg" width="200">

Authors:
`ARIAS Camila and IBARRA Kevin`



The purpose of this notebook is to explain how `SoundNet` works (the maths, the code and the experiments). SoundNet was developed in 2016 to exploit the natural synchronization between vision and sound in order to learn an acoustic representation from a large amount of unlabeled video.

[Scientific article reference](https://arxiv.org/pdf/1610.09001.pdf)  by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016

[SoundNet in Keras](https://github.com/brain-bzh/soundnet_keras): SoundNet built in Keras, with a pre-trained 8-layer model.



# Presentation of the problem

There is a lot of work on object recognition, speech recognition and machine translation that relies on labeled datasets, but much less on natural sound understanding, perhaps because labeled sound is expensive and ambiguous to collect.

>How to transfer discriminative visual knowledge into sound using unlabeled video as a bridge

A deep convolutional network learns directly from the **raw audio** waveform, but the model is trained with visual supervision.

Other studies focus on features such as spectrograms and MFCCs; SoundNet, on the other hand, works on the natural sound itself.
* Trained on 2M videos with a deep fully convolutional network
* Teacher-student model (**transfer model**)
* No ground-truth sound labels
* Videos from Flickr
* Sound stored as MP3 (to be explained in the code)

            
## Architecture
![sound.jpeg](../TP's/images/sound.jpeg)

SoundNet is a deep convolutional network. Its layers are described as follows:

1. `One-dimensional convolutions`

    Why use convolutions on sound data?
    Because they are translation invariant, which reduces the number of parameters.

    Stacked layers detect higher-level features.

2. `Fully convolutional` (there are no fully connected layers, so the input can have any length)
3. `Pooling`
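
The weight-sharing argument above can be illustrated with a minimal NumPy sketch. It is not part of SoundNet: the signal, the kernel and the helper `conv1d_valid` are hypothetical, chosen only to show that one small filter is reused at every position and that shifting the input simply shifts the response.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(200)   # hypothetical 1-D "audio" signal
kernel = rng.standard_normal(8)     # one filter: 8 weights shared across all positions

def conv1d_valid(x, w):
    # Cross-correlation without padding, which is what CNN "convolutions" compute
    return np.correlate(x, w, mode='valid')

shift = 5
shifted = np.roll(signal, shift)    # same signal, delayed by 5 samples

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

# Away from the wrap-around border, the response is the same, just delayed
same_response = np.allclose(out[:-shift], out_shifted[shift:])

# A dense layer producing the same number of outputs would need one weight per input/output pair
conv_params = kernel.size
dense_params = signal.size * out.size
print(same_response, conv_params, dense_params)
```

In this toy setting the filter needs 8 weights regardless of the signal length, whereas the dense alternative grows with both the input and the output size.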
    
    

## Math in training

Now we explain the math behind this project. It is important to note that both video and sound are used during training, but the aim is to learn from sound alone, so the final model does not take video as input.

*Training phase*

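As inputs, the model needs the raw audio waveform $x_i \in \mathbb{R}^D$ and the video $y_i \in \mathbb{R}^{3 \times T \times W \times H}$, where $D$ is the number of audio samples, $T$ the number of frames, and $W$ and $H$ the frame width and height.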
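
To make the training setting more concrete: as we understand the paper, the visual networks act as teachers that produce a class distribution $g(y_i)$ for each video, and the sound network is trained so that its output $f(x_i; \theta)$ stays close to that distribution in terms of KL divergence. The sketch below only illustrates this divergence on toy vectors. The function name `kl_divergence` and the numbers are assumptions made for this illustration, not the authors' training code.

```python
import numpy as np

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete probability vectors."""
    # Clip to avoid log(0) and division by zero
    teacher = np.clip(teacher, eps, 1.0)
    student = np.clip(student, eps, 1.0)
    return float(np.sum(teacher * np.log(teacher / student)))

# Hypothetical distributions over three classes
teacher_probs = np.array([0.7, 0.2, 0.1])   # what the visual teacher "sees"
student_probs = np.array([0.5, 0.3, 0.2])   # what the sound network currently predicts

loss = kl_divergence(teacher_probs, student_probs)
perfect = kl_divergence(teacher_probs, teacher_probs)
print(loss, perfect)
```

The divergence is zero when the student reproduces the teacher exactly and positive otherwise, which is why it can serve as a training loss without any sound labels.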

## Key aspects of the learning setting

# Code: pre-trained model

Finally, we are going to go through the code and explain how it works.

The code given by the teaching staff corresponds to [SoundNet](https://github.com/brain-bzh/soundnet_keras) built in Keras with a pre-trained **8-layer model**.

So, the first thing to do is to obtain a model with the pre-trained weights.
Let's import the libraries.

In [None]:
# -*- coding: utf-8 -*-
from keras.layers import BatchNormalization, Activation, Conv1D, MaxPooling1D, ZeroPadding1D, InputLayer
from keras.models import Sequential
import numpy as np
import librosa #audio library

# The configuration of the layers 

<img src="../TP's/images/layers.png" width="600">

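Now `build_model` builds SoundNet according to the structure defined by the authors. The model takes the raw audio waveform as input. The original network has two outputs, a scene distribution and an object distribution; the code below builds only the `conv8_2` output (401 classes).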

In [None]:
def build_model():
    """
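    Builds the SoundNet model and loads the weights from a given model file (8-layer model is kept at models/sound8.npy).
    :return: the Keras Sequential model with the pre-trained weights set
    """
    # allow_pickle=True is required by recent NumPy versions to load the pickled weight dictionary
    model_weights = np.load('models/sound8.npy', allow_pickle=True, encoding='latin1').item()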
    model = Sequential()
    #Input layer: audio raw waveform (1,length_audio,1)
    model.add(InputLayer(batch_input_shape=(1, None, 1)))

    filter_parameters = [{'name': 'conv1', 'num_filters': 16, 'padding': 32,
                          'kernel_size': 64, 'conv_strides': 2,
                          'pool_size': 8, 'pool_strides': 8}, #pool1

                         {'name': 'conv2', 'num_filters': 32, 'padding': 16,
                          'kernel_size': 32, 'conv_strides': 2,
                          'pool_size': 8, 'pool_strides': 8}, #pool2

                         {'name': 'conv3', 'num_filters': 64, 'padding': 8,
                          'kernel_size': 16, 'conv_strides': 2},

                         {'name': 'conv4', 'num_filters': 128, 'padding': 4,
                          'kernel_size': 8, 'conv_strides': 2},

                         {'name': 'conv5', 'num_filters': 256, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2,
                          'pool_size': 4, 'pool_strides': 4}, #pool5

                         {'name': 'conv6', 'num_filters': 512, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2},

                         {'name': 'conv7', 'num_filters': 1024, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2},

                         {'name': 'conv8_2', 'num_filters': 401, 'padding': 0,
                          'kernel_size': 8, 'conv_strides': 2},#output: VGG 401 classes
                         ]

    for x in filter_parameters:
        #for each [zero_padding - conv - batchNormalization - relu]
        model.add(ZeroPadding1D(padding=x['padding']))
        model.add(Conv1D(x['num_filters'],
                         kernel_size=x['kernel_size'],
                         strides=x['conv_strides'],
                         padding='valid'))
        weights = model_weights[x['name']]['weights'].reshape(model.layers[-1].get_weights()[0].shape)
        biases = model_weights[x['name']]['biases']

        model.layers[-1].set_weights([weights, biases])  #set weights in conv

        if 'conv8' not in x['name']:
            # Every layer except the output layer is followed by batch normalization and ReLU
            gamma = model_weights[x['name']]['gamma']
            beta = model_weights[x['name']]['beta']
            mean = model_weights[x['name']]['mean']
            var = model_weights[x['name']]['var']

            
            model.add(BatchNormalization())
            model.layers[-1].set_weights([gamma, beta, mean, var]) #set weights in batchNormalization
            model.add(Activation('relu'))
            
        if 'pool_size' in x:
            # Max pooling follows conv1, conv2 and conv5 only (3 pooling layers in total)
            model.add(MaxPooling1D(pool_size=x['pool_size'],
                                   strides=x['pool_strides'],
                                   padding='valid'))

    return model
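
Because the network has no fixed input length, the number of output time steps depends on the duration of the clip. The sketch below works through that arithmetic for the layer settings listed in `filter_parameters`. The helper `output_length` and the `LAYERS` table are hypothetical names introduced for this illustration, and the formula assumes symmetric zero padding followed by 'valid' convolutions and pooling, as in the code above.

```python
# (name, padding, kernel_size, conv_stride, pool_size) copied from filter_parameters;
# pool_size is None where no pooling follows, and the pool stride equals the pool size
LAYERS = [
    ('conv1', 32, 64, 2, 8),
    ('conv2', 16, 32, 2, 8),
    ('conv3', 8, 16, 2, None),
    ('conv4', 4, 8, 2, None),
    ('conv5', 2, 4, 2, 4),
    ('conv6', 2, 4, 2, None),
    ('conv7', 2, 4, 2, None),
    ('conv8_2', 0, 8, 2, None),
]

def output_length(num_samples):
    """Number of output time steps for num_samples audio samples (0 if the clip is too short)."""
    n = num_samples
    for _name, pad, kernel, stride, pool in LAYERS:
        # Zero padding on both sides, then a 'valid' convolution
        n = (n + 2 * pad - kernel) // stride + 1
        if pool is not None and n > 0:
            # 'valid' max pooling with stride equal to the pool size
            n = (n - pool) // pool + 1
        if n <= 0:
            return 0
    return n

seconds = 30
steps = output_length(seconds * 22050)  # 22050 Hz is the sample rate used by load_audio
print(steps)
```

Under these assumptions, very short clips produce no output step at all, because the last convolution has a kernel of size 8 and no padding.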


## Audio preprocessing

In [None]:
def preprocess(audio):
    audio *= 256.0  # SoundNet needs the range to be between -256 and 256
    # reshaping the audio data so it fits into the graph (batch_size, num_samples, num_filter_channels)
    audio = np.reshape(audio, (1, -1, 1))
    return audio


def load_audio(audio_file):
    sample_rate = 22050  # SoundNet works on mono audio files with a sample rate of 22050 Hz.
    audio, sr = librosa.load(audio_file, dtype='float32', sr=sample_rate, mono=True)  # load, resample and downmix to mono
    audio = preprocess(audio) #preprocess using soundnet parameters
    return audio
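
The effect of the preprocessing can be checked without any audio file. The sketch below is an illustration only: it uses a synthetic 440 Hz tone instead of `librosa.load`, and `preprocess_copy` is a hypothetical variant of `preprocess` that applies the same scaling and reshaping without modifying its input.

```python
import numpy as np

def preprocess_copy(audio):
    # Same scaling and reshaping as `preprocess`, but the input array is left untouched
    return np.reshape(audio * 256.0, (1, -1, 1))

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate                  # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t).astype('float32')    # hypothetical test tone in [-1, 1]

batch = preprocess_copy(tone)
print(batch.shape)                                        # (batch_size, num_samples, channels)
print(float(batch.min()), float(batch.max()))
```

The result has the three-dimensional layout expected by the input layer, and its values stay within the range of -256 to 256 that the comment in `preprocess` mentions.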

## Construction of the model

In [None]:
from keras.utils import plot_model

#Review of the model and structure
model = build_model()
model.summary()
plot_model(model, to_file='model.png')


# 1. First experiments

In [None]:
def predictions_to_scenes(prediction):
    scenes = []
    with open('../soundnet_keras-master/categories/categories_places2.txt', 'r') as f:
        categories = f.read().split('\n')
        for p in range(prediction.shape[1]):
            scenes.append(categories[np.argmax(prediction[0, p, :])])
    return scenes
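
The mapping from network output to scene names can be tried on a toy array before running the real model. This is a minimal sketch: `top_scene_per_step`, the three category names and the scores are hypothetical, and the category list is passed as an argument instead of being read from `categories_places2.txt`.

```python
import numpy as np

def top_scene_per_step(prediction, categories):
    # prediction has shape (1, time_steps, num_classes); keep the best class at each time step
    best = np.argmax(prediction[0], axis=-1)
    return [categories[i] for i in best]

toy_categories = ['railroad', 'forest', 'street']    # hypothetical labels
toy_prediction = np.array([[[0.1, 0.2, 0.9],
                            [2.0, 0.3, 0.1],
                            [1.5, 0.2, 0.4]]])         # shape (1, 3, 3): three time steps, three classes

scenes = top_scene_per_step(toy_prediction, toy_categories)
print(scenes)
```

Each time step gets its own label, so a long clip yields a sequence of scene names and not a single answer.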


In [None]:
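# What is the prediction?
# Let's load an audio file
audio_test = load_audio('../soundnet_keras-master/railroad_audio.wav')

import IPython.display as ipd
# ipd.Audio needs a 1-D signal and its sample rate when it is given a NumPy array
ipd.Audio(audio_test.flatten(), rate=22050)

# Sounds like a railroad, doesn't it?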

In [None]:
prediction = model.predict(audio_test)
print(prediction.shape)


import matplotlib.pyplot as plt  # needed for plt.figure below
import seaborn as sns
plt.figure(figsize=(8,4))
#sns.countplot(x='label', data=prediction);
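
One possible way to summarize the per-step predictions of a clip is to count how often each scene label appears. The sketch below is an assumption about how such a summary could look, not the notebook's final experiment. The label list is hypothetical and stands in for the output of `predictions_to_scenes(prediction)`.

```python
from collections import Counter

# Hypothetical per-time-step labels, standing in for predictions_to_scenes(prediction)
frame_labels = ['railroad', 'railroad', 'street', 'railroad', 'forest']

counts = Counter(frame_labels)
label, n = counts.most_common(1)[0]     # most frequent scene and its count
share = n / len(frame_labels)           # fraction of time steps with that scene
print(label, n, share)
```

The same counts could then be passed to a bar plot to visualize how the predicted scenes are distributed over the clip.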