About the ICSI corpora and this particular recipe
=================================================================================
This recipe builds ASR models using ICSI data [3] (available from LDC) and, where possible,
follows the best practices from the AMI s5b recipe (see also ../ami/s5b).
ICSI data comprises around 72 hours of natural, meeting-style overlapped English speech
recorded at the International Computer Science Institute (ICSI), Berkeley.
Speech is captured using a set of parallel microphones, including close-talk headsets
and several independent distant microphones (i.e. mics that do not form any explicitly
known geometry; see below for an example layout). Recordings are sampled at 16 kHz.
See [1] for more details on ICSI, or [2,3] to access the data.
The corresponding paper describing the ICSI corpus is [4].
[1] http://www1.icsi.berkeley.edu/Speech/mr/
[2] LDC: LDC2004S02 for audio, and LDC2004T04 for transcripts (used in this recipe)
[3] http://groups.inf.ed.ac.uk/ami/icsi/ (free access, but for now only ihm data is available for download)
[4] A Janin, D Baron, J Edwards, D Ellis, D Gelbart, N Morgan, B Peskin,
T Pfau, E Shriberg, A Stolcke, and C Wooters, The ICSI meeting corpus,
in Proc IEEE ICASSP, 2003, pp. 364-367.
ICSI data did not come with any pre-defined train/valid/eval splits, as it was mostly
used as training material for NIST RT evaluations. Some portions of the unreleased ICSI
data (as a part of this corpus) can be found in, for example, the NIST RT04 and RT05 evaluation sets.
To be self-contained, however, this recipe factors out training (67.5 hours), development (2.2 hours)
and evaluation (2.8 hours) sets so as to minimise speaker overlap between partitions
and to avoid known issues with available recordings during evaluation. This recipe follows [5], where
the dev and eval sets use the {Bmr021, Bns00} and {Bmr013, Bmr018, Bro021} meetings, respectively.
[5] S Renals and P Swietojanski, Neural networks for distant speech recognition.
in Proc IEEE HSCMA 2014 pp. 172-176. DOI:10.1109/HSCMA.2014.6843274
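As an illustration only, the partitioning described above can be sketched as a small
shell helper. The function name icsi_split_for is hypothetical and is not part of this
recipe; the meeting IDs are copied verbatim from the text above, and treating every
other meeting as training data is an assumption made for the sketch.

```shell
# Sketch only: icsi_split_for is a hypothetical helper, not a recipe script.
# Meeting IDs are copied verbatim from the README text.
icsi_split_for() {
  case "$1" in
    Bmr021|Bns00)          echo dev ;;
    Bmr013|Bmr018|Bro021)  echo eval ;;
    *)                     echo train ;;  # assumption: everything else is training
  esac
}

# Print the partition for a few meeting IDs mentioned in this README.
for meeting in Bmr021 Bmr013 Bro021 Bmr003; do
  printf '%s %s\n' "$meeting" "$(icsi_split_for "$meeting")"
done
```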
============================================================================================
To train models for an arbitrary mic, go to s5 and run something like the following
(after setting the ICSI_DIR and/or ICSI_TRANS variables in the scripts below):
./run_prepare_shared.sh
and then
./run.sh --mic mic_type
where mic_type depends on whether you want to use an individual headset mic (ihm),
multiple distant (but beamformed) mics (mdm) or a single distant mic (D1...D4).
The multiple distant microphones (mdm) setup uses only up to 4 PZM mics. See below
for more details on notation and the typical meeting layout that ICSI recordings followed.
Look at run.sh for more details on what mic_type is expected to be.
The description below is (mostly) copied from the ICSI documentation for convenience.
=================================================================================
Simple diagram of the seating arrangement in the ICSI meeting room.
The ordering of seat numbers is as specified below, but their
alignment with microphones may not always be as precise as indicated
here. Also, the seat number only indicates where the participant
started the meeting. Since most of the microphones are wireless, participants
were able to move around.
Door
         1         2            3           4
     -----------------------------------------------------------------------
     |                      |                       |                      |   S
     |                      |                       |                      |   c
     |                      |                       |                      |   r
    9|   D1        D2       |   D3  PDA     D4      |                      |   e
     |                      |                       |                      |   e
     |                      |                       |                      |   n
     |                      |                       |                      |
     -----------------------------------------------------------------------
         8         7            6           5
D1, D2, D3, D4 - Desktop PZM microphones
PDA - The mockup PDA with two cheap microphones
The following are the TYPICAL channel assignments, although a handful
of meetings (including Bmr003, Btr001, Btr002) differed in assignment.
The mapping from the above to the actual waveform channels in the corpus,
and to this recipe's names for the single distant mic case, is:
D1 - chanE - (this recipe: sdm3)
D2 - chanF - (this recipe: sdm4)
D3 - chan6 - (this recipe: sdm1)
D4 - chan7 - (this recipe: sdm2)
PDA left - chanC
PDA right - chanD
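
For convenience, the table above can be written as a lookup. This is a sketch only:
icsi_distant_channel and the labels PDA-left / PDA-right are hypothetical names used
for illustration, and the function encodes only the TYPICAL assignments, so it will be
wrong for the handful of meetings that differ.

```shell
# Sketch only: icsi_distant_channel is a hypothetical helper, not a recipe script.
# Prints "<channel> <recipe name>" for the PZM mics and "<channel>" for the PDA mics.
icsi_distant_channel() {
  case "$1" in
    D1)        echo "chanE sdm3" ;;
    D2)        echo "chanF sdm4" ;;
    D3)        echo "chan6 sdm1" ;;
    D4)        echo "chan7 sdm2" ;;
    PDA-left)  echo "chanC" ;;
    PDA-right) echo "chanD" ;;
    *)         echo "unknown mic label: $1" >&2; return 1 ;;
  esac
}

# Example: look up the waveform channel and recipe name for D3.
icsi_distant_channel D3
```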
-----------
Note (Pawel): The mapping for headsets is extracted from the mrt files.
In cases where IHM channels are missing for some speakers in some meetings,
this recipe either backs off to a distant channel (typically D2, the default)
or (optionally) skips this speaker's segments entirely.
This is not the case for the eval set, where all the channels come with the
expected recordings and the split is the same for all conditions (thus allowing
direct comparisons between the IHM, SDM and MDM settings).
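
The back-off behaviour in the note above can be sketched roughly as follows. This is
not the recipe's actual code: pick_channel, its arguments and the policy names
"backoff" and "skip" are hypothetical, and the sketch assumes the back-off target is
D2 under the typical channel assignment (chanF).

```shell
# Sketch only: pick_channel and its arguments are hypothetical, not recipe code.
#   $1: IHM channel found for the speaker in the mrt file (empty if missing)
#   $2: policy when the IHM channel is missing: "backoff" (default) or "skip"
pick_channel() {
  ihm_chan="$1"
  policy="${2:-backoff}"
  if [ -n "$ihm_chan" ]; then
    echo "$ihm_chan"          # headset channel is available, use it
  elif [ "$policy" = "backoff" ]; then
    echo "chanF"              # D2 under the typical channel assignment
  else
    return 1                  # caller should drop this speaker's segments
  fi
}

# Example: a speaker without a headset channel, using the default policy.
pick_channel ""
```

A caller would treat a non-zero return status as "skip this speaker's segments".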