notes.txt
use inputs bigger than 3s
longer time windows for summarization
Google's production music classification uses longer windows
30s
30000 ms spectrogram (3000 frames at 10 ms per frame)
take maxpool over that or increase stride
first conv: full height (all freq) by 25 frames (250 ms). let's say stride is 1. then pool over time.
music: they didn't maxpool over freq, and it works pretty well.
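The stride-versus-pooling trade-off above is mostly arithmetic, so a rough sketch may help. It assumes a 10 ms frame hop (implied by "25 frames (250ms)") and an unpadded convolution. The function names and the stride of 10 are made up for illustration and are not from the notes.

```python
def conv_output_frames(num_frames, kernel_frames, stride=1):
    """Time steps produced by an unpadded ("valid") convolution along time."""
    if num_frames < kernel_frames:
        raise ValueError("input is shorter than the kernel")
    return (num_frames - kernel_frames) // stride + 1


def max_pool_over_time(feature_rows):
    """Global max pool: keep the largest value of each channel across time.

    feature_rows is a list of time steps, each a list of channel values.
    """
    return [max(channel) for channel in zip(*feature_rows)]


FRAME_MS = 10                      # assumed hop, implied by 25 frames = 250 ms
frames_30s = 30_000 // FRAME_MS    # a 30 s clip, in frames

# The kernel spans all frequencies, so only the time axis shrinks.
steps_stride_1 = conv_output_frames(frames_30s, kernel_frames=25, stride=1)
steps_stride_10 = conv_output_frames(frames_30s, kernel_frames=25, stride=10)

# Pooling over time collapses any number of steps to one vector per clip.
toy_features = [[1, 5], [3, 2], [0, 4]]   # 3 time steps, 2 channels
pooled = max_pool_over_time(toy_features)
```

With stride 1 the conv still emits nearly one output per input frame, so the max pool over time does the summarizing. A larger stride cuts the number of steps before pooling.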
-----
imagine
3 softmax
input entire utterance (3s)
1:1 input to output.
look at 26 frames, predict on frame 20.
do conv, now we have 300 outputs (300 frames for 3s), each a 100-d vector per timestep.
frame by frame: run an LSTM over that, which blends in information about history/time.
now you have 300 outputs, each better than the output of a convnet alone.
train it to recognize all these different accents; it might just work at the frame level.
adding more separation and more complexity to the model works better
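A minimal sketch of the "look at 26 frames, predict on frame 20" slicing, keeping input to output 1:1. Two things here are assumptions, not from the notes: "frame 20" is read as 1-indexed (19 past frames, 6 future frames), and the edges are zero-padded so every one of the 300 frames gets a full window. The helper name is hypothetical, and the utterance is stored time-major (300 frames of 40 values) for convenience.

```python
def context_windows(frames, window=26, target=20, pad_value=0.0):
    """Return one context window per input frame.

    frames: list of frames, each a list of filterbank values (time-major).
    target: 1-indexed position of the predicted frame inside its window
            (assumption: 20 means 19 past frames and 6 future frames).
    """
    past = target - 1
    future = window - target
    pad = [pad_value] * len(frames[0])
    # Zero-pad both ends so the first and last frames also get full windows.
    padded = [pad] * past + list(frames) + [pad] * future
    return [padded[t:t + window] for t in range(len(frames))]


# Toy 3 s utterance: 300 frames x 40 bins, frame t filled with the value t + 1.
utterance = [[float(t + 1)] * 40 for t in range(300)]
windows = context_windows(utterance)

# 300 windows in, 300 windows out; each is 26 frames of 40 values.
num_windows = len(windows)
window_shape = (len(windows[0]), len(windows[0][0]))
```

Each window would be fed to the conv to get one 100-d output, and the LSTM then runs over the resulting 300-step sequence.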
Input: 40x300 utterance (3 seconds long)
conv:
-----
If we're only doing a CNN, then we need to input 40x300 (one utterance) because the input would have to contain the temporal information to be able to tell accents apart.
If we're stacking an LSTM on top of the CNN and hoping that the LSTM will learn the temporal dynamics, then we should input something like 40x26 to 40x50 (one frame plus its context).