
Why create a stack of frames from the signal?


The audio signal changes continuously, so if we generated features from the whole signal as a single frame, those features would capture both the global and the local variations. This is the last thing we want: the desired feature representation should capture the vocal tract and speaker characteristics, which stay approximately constant over short durations rather than over the whole stream. A stack of smaller frames must therefore be created from the signal.

The stacked frames form the new representation of the signal. The frames may overlap in the original signal domain. For example, with 20 ms frames and a 10 ms overlap, the first frame covers [0-20] ms of the signal, the second covers [10-30] ms, and so on. This overlapping helps preserve the temporal continuity between neighboring frames.
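A minimal sketch of this framing step is shown below, assuming a NumPy-based pipeline; `stack_frames` and its parameter names are illustrative, not the library's actual API:

```python
import numpy as np

def stack_frames(signal, fs, frame_ms=20.0, stride_ms=10.0):
    """Slice a 1-D signal into overlapping frames (illustrative sketch).

    With the 20 ms frame / 10 ms stride values from the text,
    frame i covers [i*10, i*10 + 20] ms of the signal.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))  # samples per frame
    stride = int(round(fs * stride_ms / 1000.0))    # hop between frame starts
    num_frames = 1 + (len(signal) - frame_len) // stride
    # Each row of the result is one frame of the original signal.
    return np.stack([signal[i * stride: i * stride + frame_len]
                     for i in range(num_frames)])
```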

It is usual to take frames of 20-40 ms, owing to the quasi-stationary nature of speech: over such short durations, the statistical properties of the signal are approximately constant across frames. Frames longer than 40 ms are considered to lose this stationarity, while frames shorter than 20 ms are considered not to contain enough samples for computing a reliable spectral estimate. The figure below shows the overall process:

[Figure: a visual representation of the frame stacking process.]
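To make the frame-length arithmetic concrete, here is a small worked example; the 16 kHz sampling rate and the 25 ms / 10 ms choice are assumptions for illustration, picked from within the 20-40 ms range discussed above:

```python
fs = 16000                     # assumed sampling rate (Hz)
frame_len = int(fs * 0.025)    # 25 ms frame  -> 400 samples
stride = int(fs * 0.010)       # 10 ms stride -> 160 samples
signal_len = fs * 1            # a 1-second signal

# Number of full frames that fit: 1 + floor((N - frame_len) / stride)
num_frames = 1 + (signal_len - frame_len) // stride
print(frame_len, stride, num_frames)  # 400 160 98
```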
