# SoundNet: Learning Sound Representations from Unlabeled Video
---
<img src="../TP's/images/IMT_Atlantique.jpg" width="200">

Authors:
`ARIAS Camila and IBARRA Kevin`



The purpose of this notebook is to explain how `SoundNet` works (the maths, the code and the experiments). SoundNet was developed in 2016 to exploit the natural synchronization between vision and sound in order to learn an acoustic representation from a large amount of unlabeled video.

[Scientific article reference](https://arxiv.org/pdf/1610.09001.pdf)  by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016

[SoundNet in Keras](https://github.com/brain-bzh/soundnet_keras) SoundNet, built in Keras with pre-trained 8-layer model.



# Presentation of the problem

Much work on object recognition, speech recognition and machine translation relies on labeled datasets as a reference. Natural sound understanding has not progressed in the same way, possibly because labeled sound is expensive and ambiguous to collect.

> How can discriminative visual knowledge be transferred into sound, using unlabeled video as a bridge?

SoundNet is a deep convolutional network that learns directly from the **raw audio** waveform, while the model is trained with visual supervision.

Other studies focus on features such as spectrograms and MFCCs; SoundNet, on the other hand, focuses on the natural sound itself.
* Trained on 2M videos with a deep fully convolutional network
* Teacher-student model (**transfer model**)
* No ground-truth sound labels
* Videos from Flickr
* Sound stored as MP3 (to be explained in the code)

TODO: describe the dataset

## Architecture
![sound.jpeg](../TP's/images/sound.jpeg)

SoundNet is a deep convolutional network. Its layers are described as follows:

1. `One-dimensional convolutions`

    Why use convolutions on sound data?
    Because they are invariant to translation, which reduces the number of parameters.

    Stacking layers lets the network detect higher-level features.

2. `Fully connected`
3. `Pooling`
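
The translation argument above can be illustrated on a toy signal. This is a minimal NumPy sketch, not SoundNet code: the three-tap `kernel` and the click signal are made up for illustration. It shows that the same small filter gives the same response wherever the event occurs (strictly speaking, translation equivariance), and how few weights it needs compared with a dense layer of the same input and output size.

```python
import numpy as np

kernel = np.array([1.0, -2.0, 1.0])   # one small filter shared across all time positions

signal = np.zeros(20)
signal[5] = 1.0                        # a single click at sample 5
shifted = np.roll(signal, 4)           # the same click, 4 samples later

out = np.convolve(signal, kernel, mode='valid')
out_shifted = np.convolve(shifted, kernel, mode='valid')

# The response to the shifted click is the original response moved by 4 samples.
same_response = np.allclose(out_shifted[4:], out[:-4])
print(same_response)

# Weights needed: shared filter vs. a dense layer mapping 20 inputs to 18 outputs.
conv_weights = kernel.size
dense_weights = signal.size * out.size
print(conv_weights, dense_weights)
```

Pooling layers then make the representation less sensitive to the exact position of the response.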
    
    

## Math in training

We now explain the math behind this project. It is important to mention that the training phase uses both video and sound, but the aim is to learn from sound alone, so the compiled model does not take video as input.

*Training phase*

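The training inputs are the raw audio waveform $$x_i \in \mathbb{R}^D$$ and the video $$y_i \in \mathbb{R}^{3 \times T \times W \times H}$$ where $D$ is the number of audio samples, $T$ the number of frames, and $W$ and $H$ the frame width and height.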
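
The SoundNet paper states the training objective as a KL divergence between the class distribution produced by the visual teacher network on the video and the distribution produced by the sound network on the audio. The snippet below is a minimal NumPy sketch of that loss for a single example. The function name `kl_divergence`, the `eps` constant and the toy distributions are assumptions made for illustration; this is not the authors' training code.

```python
import numpy as np

def kl_divergence(teacher, student, eps=1e-12):
    """D_KL(teacher || student) for two discrete probability vectors."""
    teacher = np.asarray(teacher, dtype=np.float64)
    student = np.asarray(student, dtype=np.float64)
    # eps avoids log(0); it is an illustrative choice, not a value from the paper
    return float(np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps))))

# Toy distributions over 3 classes (made-up numbers).
teacher_probs = np.array([0.7, 0.2, 0.1])   # visual network applied to the video
student_probs = np.array([0.5, 0.3, 0.2])   # sound network applied to the waveform

loss = kl_divergence(teacher_probs, student_probs)
loss_if_equal = kl_divergence(teacher_probs, teacher_probs)
print(loss, loss_if_equal)
```

The loss is zero when the sound network reproduces the teacher's distribution exactly and positive otherwise, which is what pushes the audio representation towards the visual one.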

## Key aspects of the learning setting

* Sound classification
* Implementation

# Code: pre-trained model

Finally, we walk through the code and explain how it works.

The code provided by the teaching staff corresponds to [SoundNet](https://github.com/brain-bzh/soundnet_keras) built in Keras with a pre-trained **8-layer model**.

The first step is to obtain a model with the pre-trained weights.
Let's import the libraries.

In [18]:
# -*- coding: utf-8 -*-
from keras.layers import BatchNormalization, Activation, Conv1D, MaxPooling1D, ZeroPadding1D, InputLayer
from keras.models import Sequential
import numpy as np
import librosa  # audio library

# The configuration of the layers 

<img src="../TP's/images/layers.png" width="600">

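Now `build_model` builds SoundNet according to the structure defined by the authors. The model takes the raw audio waveform as input. The authors describe two outputs, a scene distribution and an object distribution; the code below builds only the `conv8_2` head with 401 classes, which `predictions_to_scenes` later maps to scene categories.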

In [19]:
def build_model():
    """
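    Build the SoundNet model and load the pre-trained weights from a model file
    (the 8-layer model is kept at models/sound8.npy).
    :return: Keras Sequential model with the pre-trained weights set
    """
    # allow_pickle=True is needed on recent NumPy versions to load a pickled dict
    model_weights = np.load('../soundnet_keras-master/models/sound8.npy', encoding='latin1', allow_pickle=True).item()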
    model = Sequential()
    #Input layer: audio raw waveform (1,length_audio,1)
    model.add(InputLayer(batch_input_shape=(1, None, 1)))

    filter_parameters = [{'name': 'conv1', 'num_filters': 16, 'padding': 32,
                          'kernel_size': 64, 'conv_strides': 2,
                          'pool_size': 8, 'pool_strides': 8}, #pool1

                         {'name': 'conv2', 'num_filters': 32, 'padding': 16,
                          'kernel_size': 32, 'conv_strides': 2,
                          'pool_size': 8, 'pool_strides': 8}, #pool2

                         {'name': 'conv3', 'num_filters': 64, 'padding': 8,
                          'kernel_size': 16, 'conv_strides': 2},

                         {'name': 'conv4', 'num_filters': 128, 'padding': 4,
                          'kernel_size': 8, 'conv_strides': 2},

                         {'name': 'conv5', 'num_filters': 256, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2,
                          'pool_size': 4, 'pool_strides': 4}, #pool5

                         {'name': 'conv6', 'num_filters': 512, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2},

                         {'name': 'conv7', 'num_filters': 1024, 'padding': 2,
                          'kernel_size': 4, 'conv_strides': 2},

                         {'name': 'conv8_2', 'num_filters': 401, 'padding': 0,
                          'kernel_size': 8, 'conv_strides': 2},#output: VGG 401 classes
                         ]

    for x in filter_parameters:
        #for each [zero_padding - conv - batchNormalization - relu]
        model.add(ZeroPadding1D(padding=x['padding']))
        model.add(Conv1D(x['num_filters'],
                         kernel_size=x['kernel_size'],
                         strides=x['conv_strides'],
                         padding='valid'))
        weights = model_weights[x['name']]['weights'].reshape(model.layers[-1].get_weights()[0].shape)
        biases = model_weights[x['name']]['biases']

        model.layers[-1].set_weights([weights, biases])  #set weights in conv

        if 'conv8' not in x['name']:
            gamma = model_weights[x['name']]['gamma']
            beta = model_weights[x['name']]['beta']
            mean = model_weights[x['name']]['mean']
            var = model_weights[x['name']]['var']

            
            model.add(BatchNormalization())
            model.layers[-1].set_weights([gamma, beta, mean, var]) #set weights in batchNormalization
            model.add(Activation('relu'))
            
        if 'pool_size' in x:
            #add 3 pooling layers
            model.add(MaxPooling1D(pool_size=x['pool_size'],
                                   strides=x['pool_strides'],
                                   padding='valid'))

    return model


## Audio preprocessing

In [20]:
def preprocess(audio):
    audio = audio * 256.0  # SoundNet needs the range to be between -256 and 256
    # reshape the audio so it fits the model input: (batch_size, num_samples, num_channels)
    audio = np.reshape(audio, (1, -1, 1))
    return audio


def load_audio(audio_file):
    sample_rate = 22050  # SoundNet works on mono audio files with a sample rate of 22050 Hz.
    audio, sr = librosa.load(audio_file, dtype='float32', sr=sample_rate, mono=True)  # load audio
    audio = preprocess(audio)  # preprocess using SoundNet parameters
    return audio
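
To see what the preprocessing produces without needing an audio file, here is a small sketch on a synthetic tone. The 440 Hz sine wave is an assumption standing in for what `librosa.load` would return (a float32 array in [-1, 1]); the `preprocess` function repeats the same two steps as the cell above so that the snippet runs on its own.

```python
import numpy as np

def preprocess(audio):
    # same steps as the notebook's preprocess: scale, then add batch and channel axes
    audio = audio * 256.0
    return np.reshape(audio, (1, -1, 1))

sample_rate = 22050
t = np.arange(sample_rate, dtype=np.float32) / sample_rate   # one second of time stamps
wave = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)      # synthetic tone in [-1, 1]

batch = preprocess(wave)
print(batch.shape)                # (1, 22050, 1)
print(batch.min(), batch.max())   # both within [-256, 256]
```

The first axis is the batch (always 1 here), the second is time in samples, and the third is the single mono channel expected by the input layer.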

## Construction of the model

In [21]:
from keras.utils import plot_model

#Review of the model and structure
model = build_model()
model.summary()
#plot_model(model, to_file='model.png')


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
zero_padding1d_1 (ZeroPaddin (1, None, 1)              0         
_________________________________________________________________
conv1d_1 (Conv1D)            (1, None, 16)             1040      
_________________________________________________________________
batch_normalization_1 (Batch (1, None, 16)             64        
_________________________________________________________________
activation_1 (Activation)    (1, None, 16)             0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (1, None, 16)             0         
_________________________________________________________________
zero_padding1d_2 (ZeroPaddin (1, None, 16)             0         
_________________________________________________________________
conv1d_2 (Conv1D)            (1, None, 32)             16416     
__________

# 1. First experiments

In [22]:
def predictions_to_scenes(prediction):
    scenes = []
    with open('../soundnet_keras-master/categories/categories_places2.txt', 'r') as f:
        categories = f.read().split('\n')
        for p in range(prediction.shape[1]):
            scenes.append(categories[np.argmax(prediction[0, p, :])])
    return scenes
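
`predictions_to_scenes` returns one label per time step. When a single summary for the whole clip is more useful, one option is to turn the scores into probabilities and average them over time. The sketch below assumes that the model outputs unnormalized scores, which is what the code suggests since no activation follows `conv8_2`; the helper names `softmax` and `top_scenes` and the toy categories are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    # subtract the maximum for numerical stability
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def top_scenes(prediction, categories, k=3):
    """Average class probabilities over time and return the k most likely labels."""
    probs = softmax(prediction[0], axis=-1)    # (time_steps, num_classes)
    mean_probs = probs.mean(axis=0)            # (num_classes,)
    best = np.argsort(mean_probs)[::-1][:k]
    return [(categories[i], float(mean_probs[i])) for i in best]

# Toy stand-in for a model output of shape (1, time_steps, num_classes).
toy_categories = ['scene_%d' % i for i in range(5)]
toy_prediction = np.zeros((1, 4, 5))
toy_prediction[0, :, 2] = 5.0   # class 2 has the highest score at every time step
toy_prediction[0, :, 4] = 2.0   # class 4 comes second

ranking = top_scenes(toy_prediction, toy_categories)
print(ranking)
```

With the real model, `prediction` and the list read from `categories_places2.txt` would take the place of the toy arrays.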


In [16]:
# what is the prediction?
# let's load an audio file
audiot, sr = librosa.load('../soundnet_keras-master/railroad_audio.wav', dtype='float32', sr=22050, mono=True)  # load audio

# library to listen to the sound
#import IPython.display as ipd

#ipd.Audio(audiot, rate=sr)  # play the audio (rate is required for NumPy arrays)

# sounds like a railroad, doesn't it?

In [8]:
def predict_scene_from_audio_file(audio_file):
    model = build_model()
    audio = load_audio(audio_file)
    return model.predict(audio)

prediction = predict_scene_from_audio_file('../soundnet_keras-master/railroad_audio.wav')
print(prediction.shape)
#import seaborn as sns
#plt.figure(figsize=(8,4))
#sns.countplot(x='label', data=prediction);

(1, 4, 401)


In [24]:
# histogram of the scores for the first time window
import matplotlib.pyplot as plt

plt.figure(figsize=(8,4))
plt.hist(prediction[0,0,:])

In [None]:
# TODO: experiments with other audio files, including a long one (demo or our own)

In [None]:
# TODO: dataset description

In [None]:
import pandas as pd
test = pd.read_csv('/homes/k18ibarr/Téléchargements/Projet/ESC-50-master/meta/esc50.csv',sep=',')
test.head()
test.info()

In [None]:
data = []
list_target = []
list_category = []

# keep only the clips that belong to the ESC-10 subset (10 classes)
for file_name, target, category, esc10 in zip(test['filename'],
                                              test['target'],
                                              test['category'],
                                              test['esc10']):
    if esc10:
        audio = load_audio('/homes/k18ibarr/Téléchargements/Projet/ESC-50-master/audio/' + file_name)
        data.append(audio)
        list_target.append(target)
        list_category.append(category)

In [None]:
datos = np.asarray([data[155]]).reshape(1,-1,1)  # fails because of the clip length
print(datos.shape)

p = model.predict(datos)  # clips shorter than 5 seconds do not work at the last layer
print(p.shape)
print(predictions_to_scenes(p))
tensor = p.reshape(1,-1)  # assumption: the original used an undefined name `get`; the prediction is flattened instead
print(tensor.shape)

In [None]:
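# audio 1-172649-b-40 (helicopter), index 5 in data
# the clip is repeated three times so that it is long enough for the last layer
datos = np.asarray([data[5],data[5],data[5]]).reshape(1,-1,1)
print(datos.shape)
p = model.predict(datos)  # clips shorter than 5 seconds do not work at the last layer
print(p.shape)
print(predictions_to_scenes(p))
tensor = p.reshape(1,-1)  # assumption: the original used an undefined name `get`; the prediction is flattened instead
print(tensor.shape)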
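
The length problem noted in the comments above can be checked by hand. The sketch below follows the time axis through the padding, kernel, stride and pooling values listed in `filter_parameters`, using the usual output-length formula for 'valid' convolutions and pooling, `(n + 2*pad - kernel) // stride + 1`. The helper names `out_len` and `soundnet_output_steps` are hypothetical, and the arithmetic assumes that Keras follows this formula.

```python
def out_len(n, kernel, stride, pad=0):
    """Length after a 'valid' 1-D convolution or pooling with symmetric zero padding."""
    n = n + 2 * pad
    if n < kernel:
        return 0   # not enough samples for a single window
    return (n - kernel) // stride + 1

# (padding, kernel_size, conv_stride, pool_size or None), copied from filter_parameters
LAYERS = [(32, 64, 2, 8), (16, 32, 2, 8), (8, 16, 2, None), (4, 8, 2, None),
          (2, 4, 2, 4), (2, 4, 2, None), (2, 4, 2, None), (0, 8, 2, None)]

def soundnet_output_steps(num_samples):
    """Number of time steps left at the output for a clip of num_samples samples."""
    n = num_samples
    for pad, kernel, stride, pool in LAYERS:
        n = out_len(n, kernel, stride, pad)
        if pool is not None and n > 0:
            n = out_len(n, pool, pool)   # pool stride equals pool size in this model
        if n == 0:
            return 0                     # the clip ran out before reaching the output
    return n

sample_rate = 22050
for seconds in (5, 15):
    print(seconds, soundnet_output_steps(seconds * sample_rate))
# prints:
# 5 0
# 15 2
```

Under these assumptions a 5-second clip at 22050 Hz reaches the last convolution with only 4 time steps, fewer than its kernel size of 8, so no output can be produced. Repeating the clip three times leaves 2 output steps.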

# Our experiments

1. Find features in the hidden layers (transfer learning), with t-SNE plots
2. Classifier on the features extracted from pool5 (fewer parameters)
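
For the second experiment, a classifier needs one fixed-length vector per clip, whereas a convolutional feature map keeps a time axis whose length depends on the clip duration. One common option, sketched here, is to average the feature map over time. The arrays below are random stand-ins, not real `pool5` activations; the 256 channels match `conv5` in `filter_parameters`, and the helper name `clip_descriptor` is hypothetical.

```python
import numpy as np

def clip_descriptor(feature_map):
    """Average a (1, time_steps, channels) feature map over time -> (channels,)."""
    return feature_map[0].mean(axis=0)

rng = np.random.default_rng(0)
# Random stand-ins for the feature maps of two clips with different durations.
features_short = rng.normal(size=(1, 13, 256))
features_long = rng.normal(size=(1, 40, 256))

# One row per clip, ready to be passed to any standard classifier.
X = np.stack([clip_descriptor(features_short), clip_descriptor(features_long)])
print(X.shape)   # (2, 256)
```

The same matrix `X` could also be the input to a t-SNE projection for the first experiment.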


In [25]:
# TODO: conclusions