### Correlations as a function of 'receptive field'
Speech2Text (S2T) reaches its maximum correlation (corr) at the 2nd layer. Wav2Letter (W2L) peaks at the 2nd and 3rd layers. How is it possible that both models peak at roughly the same depth (counting layers from the input)? Let's look at the receptive fields of the layers in search of an answer.

In [37]:
import numpy as np

def modified_rf(kernels, strides, layer_id):
    """Receptive field of layer `layer_id`, measured in input samples."""
    samples = kernels[layer_id]
    # Walk back towards the input: each earlier layer widens the window.
    for i in range(layer_id, 0, -1):
        samples = (samples - 1)*strides[i-1] + kernels[i-1]
    return samples

def get_receptive_fields(kernels, strides, fs=16000):
    """Prints the receptive field and output sampling rate of every layer
    for the given arrays of kernels and strides."""
    print("Calculating receptive field using 'modified_rf'")
    sampling_rates = np.zeros(len(kernels))
    sampling_rates[0] = fs/strides[0]
    for i in range(0, len(strides)):
        rf_samples = modified_rf(kernels, strides, i)
        rf_ms = 1000*rf_samples/fs  # samples -> milliseconds (was hard-coded for 16 kHz)
        if i > 0:
            sampling_rates[i] = sampling_rates[i-1]/strides[i]
        print(f"Layer {i}, RF: {rf_samples:5d} samples, {rf_ms:4.2f} ms, sampling_rate: {sampling_rates[i]:.0f}Hz, sampling_time: {(1000/sampling_rates[i]):.3f}ms")
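
As a cross-check on the backward recursion, the same quantity can be written in closed form: the receptive field is 1 plus, for each layer up to the one of interest, (kernel - 1) times the product of all earlier strides. The sketch below is not part of the original analysis; `closed_form_rf` is a hypothetical helper name, and it assumes plain, undilated convolutions.

```python
from math import prod

def closed_form_rf(kernels, strides, layer_id):
    """Receptive field of layer `layer_id` in input samples (closed form).

    Assumes ordinary convolutions without dilation.
    """
    rf = 1
    for l in range(layer_id + 1):
        # Distance, in input samples, between neighbouring inputs of layer l.
        jump = prod(strides[:l])
        rf += (kernels[l] - 1) * jump
    return rf

# Speech2Text configuration used in this notebook.
s2t_kernels, s2t_strides = [320, 5, 5], [160, 2, 2]
print([closed_form_rf(s2t_kernels, s2t_strides, i) for i in range(3)])
```

If both formulations are correct, the values should agree with what `get_receptive_fields` prints for the same kernels and strides.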

### S2T 'receptive field'

In [42]:
print("Speech2Text...!")
kernels = [320, 5, 5]
strides = [160, 2, 2]
# the first entries are the spectrogram window and hop (in samples); the rest are the two conv layers

get_receptive_fields(kernels, strides)


# # Spectrogram:
# win_length_spect = 25
# stride_spect = 10
# rf_spect = win_length_spect


# # Layer 1:
# kernel_l1 = 5
# stride_l1 = 2

# # Layer 2:
# kernel_l2 = 5
# stride_l2 = 2

# rf_l1 = receptive_field(kernel_l1, win_length_spect, stride_spect, rf_spect)
# rf_l2 = receptive_field(kernel_l2, kernel_l1, stride_l1, rf_l1)

Speech2Text...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:   320 samples, 20.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:   960 samples, 60.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 2, RF:  2240 samples, 140.00 ms, sampling_rate: 25Hz, sampling_time: 40.000ms
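
To make the layer-2 figure concrete, the recursion in `modified_rf` can be unrolled by hand. The symbols $r_i$, $k_i$ and $s_i$ (window, kernel and stride at layer $i$) are notation introduced here for illustration only.

```latex
\begin{aligned}
r_2 &= k_2 = 5 \\
r_1 &= (r_2 - 1)\,s_1 + k_1 = 4 \cdot 2 + 5 = 13 \\
r_0 &= (r_1 - 1)\,s_0 + k_0 = 12 \cdot 160 + 320 = 2240 \ \text{samples}
\end{aligned}
```

At 16 kHz, 2240 samples correspond to 2240 / 16 = 140 ms, matching the printed value for layer 2.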


### wav2vec2 RF
The pre-processor has 7 conv layers, with kernels [10,3,3,3,3,2,2] and strides [5,2,2,2,2,2,2].

In [32]:
print("Wav2vec...!")
kernels = [10,3,3,3,3,2,2, 128]
strides = [5,2,2,2,2,2,2, 1]
# the last entries are kernel size and strides of the 'convolution position encoding'

get_receptive_fields(kernels, strides)

Wav2vec...!
Calculating receptive field using 'modified_rf'
Layer 0, RF:    10 samples, 0.62 ms, sampling_rate: 3200Hz
Layer 1, RF:    20 samples, 1.25 ms, sampling_rate: 1600Hz
Layer 2, RF:    40 samples, 2.50 ms, sampling_rate: 800Hz
Layer 3, RF:    80 samples, 5.00 ms, sampling_rate: 400Hz
Layer 4, RF:   160 samples, 10.00 ms, sampling_rate: 200Hz
Layer 5, RF:   240 samples, 15.00 ms, sampling_rate: 100Hz
Layer 6, RF:   400 samples, 25.00 ms, sampling_rate: 50Hz
Layer 7, RF: 41040 samples, 2565.00 ms, sampling_rate: 50Hz
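
The 50 Hz rate at the end of the conv stack follows directly from the product of the strides. A minimal sketch, assuming the 16 kHz input rate used throughout this notebook; the variable names are illustrative.

```python
from math import prod

fs = 16000                            # input sampling rate in Hz (assumed)
conv_strides = [5, 2, 2, 2, 2, 2, 2]  # the 7 pre-processor conv layers only

total_stride = prod(conv_strides)       # input samples consumed per output frame
frame_rate_hz = fs / total_stride       # output frames per second
frame_period_ms = 1000 / frame_rate_hz  # time step between output frames

print(total_stride, frame_rate_hz, frame_period_ms)
```

The positional-encoding convolution has stride 1, so it widens the receptive field without changing the frame rate.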


### W2L original

In [44]:
import numpy as np
print("Receptive Fields for Wav2Letter (original):")
kernels = [320, 11,11,11,11, 13,13,13, 17,17,17, 21,21,21, 25,25,25]
strides = [160, 2, 1 ,1 ,1, 1,1,1, 1,1,1, 1,1,1, 1,1,1 ]

get_receptive_fields(kernels, strides)


Receptive Fields for Wav2Letter (original):
Calculating receptive field using 'modified_rf'
Layer 0, RF:   320 samples, 20.00 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 1, RF:  1920 samples, 120.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 2, RF:  5120 samples, 320.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 3, RF:  8320 samples, 520.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 4, RF: 11520 samples, 720.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 5, RF: 15360 samples, 960.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 6, RF: 19200 samples, 1200.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 7, RF: 23040 samples, 1440.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 8, RF: 28160 samples, 1760.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 9, RF: 33280 samples, 2080.00 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 10, RF: 38400 samples, 2400.00 ms, sampling_rate: 50Hz, sa

### W2L modified

In [40]:
print("Receptive Fields for proposed w2l:")

kernels = [31,3,3,3,3,3,3,3,7,7,7,7,31]
strides = [20,2,2,2,2,1,1,1,1,1,1,1,1]

get_receptive_fields(kernels, strides)



Receptive Fields for proposed w2l:
Calculating receptive field using 'modified_rf'
Layer 0, RF:    31 samples, 1.94 ms, sampling_rate: 800Hz, sampling_time: 1.250ms
Layer 1, RF:    71 samples, 4.44 ms, sampling_rate: 400Hz, sampling_time: 2.500ms
Layer 2, RF:   151 samples, 9.44 ms, sampling_rate: 200Hz, sampling_time: 5.000ms
Layer 3, RF:   311 samples, 19.44 ms, sampling_rate: 100Hz, sampling_time: 10.000ms
Layer 4, RF:   631 samples, 39.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 5, RF:  1271 samples, 79.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 6, RF:  1911 samples, 119.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 7, RF:  2551 samples, 159.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 8, RF:  4471 samples, 279.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 9, RF:  6391 samples, 399.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
Layer 10, RF:  8311 samples, 519.44 ms, sampling_rate: 50Hz, sampling_time: 20.000ms
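
One way to compare the models on a common footing is to ask which layer in each has a receptive field closest to a given duration, rather than comparing layer indices. The sketch below is only an illustration: `rf_ms` and `closest_layer` are hypothetical helpers, and the 140 ms target (the S2T layer-2 value printed earlier) is an arbitrary example, not a result.

```python
def rf_ms(kernels, strides, layer_id, fs=16000):
    """Receptive field of `layer_id` in milliseconds (same recursion as modified_rf)."""
    samples = kernels[layer_id]
    for i in range(layer_id, 0, -1):
        samples = (samples - 1) * strides[i - 1] + kernels[i - 1]
    return 1000 * samples / fs

def closest_layer(kernels, strides, target_ms, fs=16000):
    """Index of the layer whose receptive field is nearest to `target_ms`."""
    return min(range(len(kernels)),
               key=lambda i: abs(rf_ms(kernels, strides, i, fs) - target_ms))

# Kernel/stride lists copied from the cells in this notebook.
models = {
    "S2T": ([320, 5, 5], [160, 2, 2]),
    "W2L original": ([320, 11, 11, 11, 11, 13, 13, 13, 17, 17, 17, 21, 21, 21, 25, 25, 25],
                     [160, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    "W2L modified": ([31, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 31],
                     [20, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]),
}

target_ms = 140.0  # example target only
for name, (k, s) in models.items():
    i = closest_layer(k, s, target_ms)
    print(f"{name}: layer {i}, RF {rf_ms(k, s, i):.2f} ms")
```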
