### Correlations as a function of 'receptive field'
Speech2Text (S2T) reaches its maximum correlation (corr) at the 2nd layer, while Wav2Letter (W2L) peaks at the 2nd and 3rd layers. How is it possible that both models peak at roughly the same depth (counting layers)? Let's look at the receptive fields in search of an answer.

In [4]:
import numpy as np

def _get_layer_receptive_field(kernels, strides, layer_id):
    """Computes receptive field for the layer at index 'layer_id'.

    Args:
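        kernels: list = kernel sizes of convolution layers in order.
        strides: list = strides of convolution layers in order.
        layer_id: int = index of layer to compute the receptive field for.

    Returns:
        Receptive field (number of samples of input).
    """
    # one output of the target layer sees kernels[layer_id] of its inputs;
    # then walk back towards the raw input: n outputs of layer i-1 span
    # (n - 1)*stride + kernel of that layer's inputs
    samples = kernels[layer_id]
    for i in range(layer_id, 0, -1):
        samples = (samples - 1)*strides[i-1] + kernels[i-1]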

    return samples

def get_receptive_fields(kernels, strides, fs=16000):
    """Computes and prints receptive fields for all the layers of the network,
    given the lists of kernel sizes and strides.

    Args:
        kernels: list = kernel sizes of convolution layers in order.
        strides: list = strides of convolution layers in order.
        fs: int = sampling rate of the input audio in Hz (default 16000).
    """
    print("Calculating receptive fields for all layers...")
    sampling_rates = np.zeros(len(kernels))
    sampling_rates[0] = fs/strides[0]
    for i in range(len(strides)):
        rf_samples = _get_layer_receptive_field(kernels, strides, i)
        rf_ms = rf_samples*1000/fs
        if i > 0:
            # output rate of a layer = output rate of the previous layer / its stride
            sampling_rates[i] = sampling_rates[i-1]/strides[i]
        print(f"Layer {i}, RF: {rf_samples:5d} samples, {rf_ms:4.2f} ms, sampling_rate: {sampling_rates[i]:.0f}Hz, sampling_time: {(1000/sampling_rates[i]):.3f}ms")
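
As a cross-check on the backward recursion used in `_get_layer_receptive_field`, the same receptive fields can be computed in a single forward pass by tracking the cumulative stride (often called the "jump") seen by each layer. The sketch below is a minimal, self-contained version; the name `receptive_fields_forward` is hypothetical and is not part of this notebook or of `auditory_cortex`.

```python
def receptive_fields_forward(kernels, strides):
    """Return the receptive field (in input samples) of every layer.

    Hypothetical helper using the forward recurrence
        rf[l]   = rf[l-1] + (kernels[l] - 1) * jump[l-1]
        jump[l] = jump[l-1] * strides[l]
    where jump[l-1] is the product of the strides of all layers before l.
    """
    rfs = []
    rf, jump = 1, 1  # a single input sample covers itself, with unit spacing
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # k taps spaced `jump` input samples apart
        jump *= s             # spacing between neighbouring outputs of this layer
        rfs.append(rf)
    return rfs


# Kernel/stride lists copied from the Whisper and S2T cells of this notebook.
print(receptive_fields_forward([400, 3, 3], [160, 1, 2]))
print(receptive_fields_forward([400, 5, 5], [160, 2, 2]))
```

For these two lists the forward form gives 400, 720, 1040 and 400, 1040, 2320 samples, the same values as the recorded outputs of the Whisper and S2T cells.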

### Whisper

#### tiny

In [6]:
print("Whisper...!")
# spectrogram with:
#  - window length 25 ms (400 samples)
#  - window stride 10 ms (160 samples)
kernels = [400, 3, 3]
strides = [160, 1, 2]
# the entries after the first (spectrogram) are the kernel sizes and strides of the two convolution layers

get_receptive_fields(kernels, strides)

Whisper...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:   400 samples, 25.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:   720 samples, 45.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 2, RF:  1040 samples, 65.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
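
The Layer 2 figure can be reproduced by hand. Writing $k_i$ and $s_i$ for the kernel size and stride of layer $i$ (notation introduced here only for illustration), `_get_layer_receptive_field` starts from $r = k_2$ and walks back towards the input with $r \leftarrow (r - 1)\,s_{i-1} + k_{i-1}$:

$$
\begin{aligned}
r &= k_2 = 3 \\
r &\leftarrow (3 - 1)\cdot s_1 + k_1 = 2 \cdot 1 + 3 = 5 \\
r &\leftarrow (5 - 1)\cdot s_0 + k_0 = 4 \cdot 160 + 400 = 1040 \ \text{samples}
\end{aligned}
$$

At $f_s = 16000$ Hz this is $1040 / 16000 \times 1000 = 65$ ms.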


### S2T 'receptive field'

In [5]:
print("Speech2Text...!")
kernels = [400, 5, 5]
strides = [160, 2, 2]
# the entries after the first (spectrogram) are the kernel sizes and strides of the convolution layers

get_receptive_fields(kernels, strides)


# # Spectrogram:
# win_length_spect = 25
# stride_spect = 10
# rf_spect = win_length_spect


# # Layer 1:
# kernel_l1 = 5
# stride_l1 = 2

# # Layer 2:
# kernel_l2 = 5
# stride_l2 = 2

# rf_l1 = receptive_field(kernel_l1, win_length_spect, stride_spect, rf_spect)
# rf_l2 = receptive_field(kernel_l2, kernel_l1, stride_l1, rf_l1)

Speech2Text...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:   400 samples, 25.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:  1040 samples, 65.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 2, RF:  2320 samples, 145.00 ms, sampling_rate: 25Hz, sampling_time: 40.000ms


### wav2vec RF

In [2]:
print("Wav2vec...!")
kernels = [10,8,4,4,4,1,1, 2,3,4,5,6,7,8,9,10,11,12,13]
strides = [5,4,2,2,2,1,1, 1,1,1,1,1,1,1,1,1,1,1,1 ]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Wav2vec...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:    10 samples, 0.62 ms, sampling_rate: 3200Hz, sampling_time: 0.312ms
Layer 1, RF:    45 samples, 2.81 ms, sampling_rate: 800Hz, sampling_time: 1.250ms
Layer 2, RF:   105 samples, 6.56 ms, sampling_rate: 400Hz, sampling_time: 2.500ms
Layer 3, RF:   225 samples, 14.06 ms, sampling_rate: 200Hz, sampling_time: 5.000ms
Layer 4, RF:   465 samples, 29.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 5, RF:   465 samples, 29.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 6, RF:   465 samples, 29.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 7, RF:   625 samples, 39.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 8, RF:   945 samples, 59.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 9, RF:  1425 samples, 89.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 10, RF:  2065 samples, 129.06 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 11, RF:  2865 

### wav2vec2 RF
The pre-processor has 7 conv layers, with kernels [10,3,3,3,3,2,2] and strides [5,2,2,2,2,2,2].

In [32]:
print("Wav2vec...!")
kernels = [10,3,3,3,3,2,2, 128]
strides = [5,2,2,2,2,2,2, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Wav2vec...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:    10 samples, 0.62 ms, sampling_rate: 3200Hz
Layer 1, RF:    20 samples, 1.25 ms, sampling_rate: 1600Hz
Layer 2, RF:    40 samples, 2.50 ms, sampling_rate: 800Hz
Layer 3, RF:    80 samples, 5.00 ms, sampling_rate: 400Hz
Layer 4, RF:   160 samples, 10.00 ms, sampling_rate: 200Hz
Layer 5, RF:   240 samples, 15.00 ms, sampling_rate: 100Hz
Layer 6, RF:   400 samples, 25.00 ms, sampling_rate: 50Hz
Layer 7, RF: 41040 samples, 2565.00 ms, sampling_rate: 50Hz


In [2]:
from auditory_cortex.utils import get_receptive_fields

print("Wav2vec...!")
kernels = [10,3,3,3,3,2,2, 128]
strides = [5,2,2,2,2,2,2, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides, fs=16000)

Wav2vec...!
Calculating receptive fields for all layers...
Layer 0, RF:    10 samples, 0.62 ms, sampling_rate: 3200Hz, sampling_time: 0.312ms
Layer 1, RF:    20 samples, 1.25 ms, sampling_rate: 1600Hz, sampling_time: 0.625ms
Layer 2, RF:    40 samples, 2.50 ms, sampling_rate: 800Hz, sampling_time: 1.250ms
Layer 3, RF:    80 samples, 5.00 ms, sampling_rate: 400Hz, sampling_time: 2.500ms
Layer 4, RF:   160 samples, 10.00 ms, sampling_rate: 200Hz, sampling_time: 5.000ms
Layer 5, RF:   240 samples, 15.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 6, RF:   400 samples, 25.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 7, RF: 41040 samples, 2565.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
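
The 50 Hz rate reported for Layer 6 can also be checked by counting output frames directly. The sketch below assumes unpadded convolutions, so each layer maps an input of length `n` to `(n - k) // s + 1` outputs. That padding assumption and the helper name `conv_output_lengths` are illustrative and not taken from this notebook. Only the seven pre-processor layers are included; the kernel-128 positional convolution is left out because it may not fit the unpadded assumption.

```python
def conv_output_lengths(n_samples, kernels, strides):
    """Number of output frames after each layer, assuming no padding."""
    lengths = []
    n = n_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1  # output length of an unpadded ('valid') convolution
        lengths.append(n)
    return lengths


fs = 16000
kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]

# Frame counts per layer for one second of audio at 16 kHz.
print(conv_output_lengths(fs, kernels, strides))
```

Under this assumption one second of audio yields 49 frames after the last layer, one short of the nominal 50 because a full kernel no longer fits at the end of the signal.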


In [2]:
print("Wav2vec...!")
# NOTE: the output recorded below (Layer 0 RF of 8 samples) corresponds to kernels[0] = 8, not 12
kernels = [12, 4, 2, 2, 3, 4]
strides = [5, 3, 2, 1, 1, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Wav2vec...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:     8 samples, 0.50 ms, sampling_rate: 3200Hz, sampling_time: 0.312ms
Layer 1, RF:    23 samples, 1.44 ms, sampling_rate: 1067Hz, sampling_time: 0.937ms
Layer 2, RF:    38 samples, 2.38 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 3, RF:    68 samples, 4.25 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 4, RF:   128 samples, 8.00 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 5, RF:   218 samples, 13.62 ms, sampling_rate: 533Hz, sampling_time: 1.875ms


In [6]:
kernels = [8, 8, 2, 2, 3, 7]
strides = [5, 3, 2, 1, 1, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Calculating receptive field using 'modified_rf'
Layer 0, RF:     8 samples, 0.50 ms, sampling_rate: 3200Hz, sampling_time: 0.312ms
Layer 1, RF:    43 samples, 2.69 ms, sampling_rate: 1067Hz, sampling_time: 0.937ms
Layer 2, RF:    58 samples, 3.62 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 3, RF:    88 samples, 5.50 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 4, RF:   148 samples, 9.25 ms, sampling_rate: 533Hz, sampling_time: 1.875ms
Layer 5, RF:   328 samples, 20.50 ms, sampling_rate: 533Hz, sampling_time: 1.875ms


In [5]:
kernels = [5, 3, 2]
strides = [2, 2, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Calculating receptive field using 'modified_rf'
Layer 0, RF:     5 samples, 0.31 ms, sampling_rate: 8000Hz, sampling_time: 0.125ms
Layer 1, RF:     9 samples, 0.56 ms, sampling_rate: 4000Hz, sampling_time: 0.250ms
Layer 2, RF:    13 samples, 0.81 ms, sampling_rate: 4000Hz, sampling_time: 0.250ms


### W2L original

In [44]:
import numpy as np
print("Receptive Fields for Wav2Letter (original):")
kernels = [320, 11, 11, 11, 11, 13, 13, 13, 17, 17, 17, 21, 21, 21, 25, 25, 25]
strides = [160, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

get_receptive_fields(kernels, strides)


Receptive Fields for Wav2Letter (original):
Calculating receptive field using 'modified_rf'
Layer 0, RF:   320 samples, 20.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:  1920 samples, 120.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 2, RF:  5120 samples, 320.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 3, RF:  8320 samples, 520.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 4, RF: 11520 samples, 720.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 5, RF: 15360 samples, 960.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 6, RF: 19200 samples, 1200.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 7, RF: 23040 samples, 1440.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 8, RF: 28160 samples, 1760.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 9, RF: 33280 samples, 2080.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 10, RF: 38400 samples, 2400.00 ms, sampling_rate: 50Hz, sa

### W2L modified

In [40]:
print("Receptive Fields for proposed w2l:")

kernels = [31,3,3,3,3,3,3,3,7,7,7,7,31]
strides = [20,2,2,2,2,1,1,1,1,1,1,1,1]

get_receptive_fields(kernels, strides)



Receptive Fields for proposed w2l:
Calculating receptive field using 'modified_rf'
Layer 0, RF:    31 samples, 1.94 ms, sampling_rate: 800Hz, sampling_time: 1.250ms
Layer 1, RF:    71 samples, 4.44 ms, sampling_rate: 400Hz, sampling_time: 2.500ms
Layer 2, RF:   151 samples, 9.44 ms, sampling_rate: 200Hz, sampling_time: 5.000ms
Layer 3, RF:   311 samples, 19.44 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 4, RF:   631 samples, 39.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 5, RF:  1271 samples, 79.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 6, RF:  1911 samples, 119.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 7, RF:  2551 samples, 159.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 8, RF:  4471 samples, 279.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 9, RF:  6391 samples, 399.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 10, RF:  8311 samples, 519.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms


### DeepSpeech2

In [6]:
print("Receptive Fields for deepspeech2:")

print("First layer is spectrogram here, conv layers start from layer index 1...!")
kernels = [320,11,11]
strides = [160,2,1]

get_receptive_fields(kernels, strides)


Receptive Fields for deepspeech2:
First layer is spectrogram here, conv layers start from layer index 1...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:   320 samples, 20.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:  1920 samples, 120.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 2, RF:  5120 samples, 320.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
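
To revisit the opening question, it may help to put the receptive fields of S2T and W2L side by side in milliseconds rather than reading them off separate printouts. The sketch below is one possible way to do that. `rf_ms_per_layer` is a hypothetical helper that returns values instead of printing them, and the kernel/stride lists are copied from the S2T and "W2L original" cells. Which W2L variant was used for the correlation analysis is not stated here, so the pairing is an assumption.

```python
def rf_ms_per_layer(kernels, strides, fs=16000):
    """Receptive field of each layer in milliseconds (hypothetical helper)."""
    out = []
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # grow the RF by (k - 1) steps of the current spacing
        jump *= s             # spacing (in input samples) between outputs
        out.append(rf * 1000 / fs)
    return out


models = {
    "S2T": ([400, 5, 5], [160, 2, 2]),
    "W2L original": (
        [320, 11, 11, 11, 11, 13, 13, 13, 17, 17, 17, 21, 21, 21, 25, 25, 25],
        [160, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ),
}

# Compare only the first three layers, where the correlation peaks were reported.
for name, (kernels, strides) in models.items():
    first_three = rf_ms_per_layer(kernels, strides)[:3]
    print(f"{name:>12}: " + ", ".join(f"{ms:.0f} ms" for ms in first_three))
```

With these lists the first three layers come out at 25, 65 and 145 ms for S2T and 20, 120 and 320 ms for W2L, so layers with the same index cover different time spans in the two models.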
