Panagiotis Antoniadis edited this page Jul 23, 2019 · 6 revisions

Dataset Description

To evaluate all implemented models, an email speech dataset was created from some of my mentors' emails. It consists of 100 emails: 90% were used as the training set and the remaining 10% as the test set. The evaluation metric is Word Accuracy.
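The page does not spell out how Word Accuracy is computed. A common definition, assumed here, is 1 - (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions of a minimum-edit alignment between the reference and the recognized text, and N is the number of reference words. The sketch below is only an illustration of that definition; the function names are hypothetical and are not part of the project's scripts.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total edits, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, subs, dels, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, dels, ins)          # match, no cost
            else:
                diagonal = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = cost[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = cost[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare on the total edit count first.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def word_accuracy(reference, hypothesis):
    """Word accuracy under the assumed definition 1 - (S + D + I) / N."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(reference, hypothesis)
    return 1.0 - (subs + dels + ins) / n_words


# Made-up example sentences, not taken from the dataset.
print(word_accuracy("send the report today", "send a report"))
```

Because insertions count as errors under this definition, the value can drop below zero for very poor hypotheses.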


In addition to the tools in the data/scripts directory, a tool for recording the dictations was implemented. Its purpose is the fast recording of a speech dataset, and its usage can be found here.

Generic Language model adaptation

After applying generic language model adaptation, as described in Datasets and Adaptation, we obtain the following results:

| Language model | Acoustic model | Accuracy |
|---|---|---|
| default | default | 69.07% |
| specific | default | 67.14% |
| merged | default | 79.79% |
| merged | adapted (mllr) | 80.32% |
| merged | adapted (map) | 76.98% |
| merged | adapted (mllr + map) | 76.63% |

Domain-specific language model adaptation

Several methods for clustering the emails have been implemented, as described here. To decide which of them to apply, we first evaluate some of them. The analysis follows:

Automatic selection of the number of clusters

| Cluster id | Cluster semantic | No. of sentences | Accuracy |
|---|---|---|---|
| 0 | salutations | 8 | 100.0% |
| 1 | other | 55 | 80.47% |
| 2 | closings | 1 | 100.0% |

  • Elbow method: The result depends on the maximum number of clusters tested. The figures show the sum of squared errors for a maximum of 5, 8 and 10 clusters. The selected number of clusters is the 'knee' of the curve.

8 max clusters:

| Cluster id | Cluster semantic | No. of sentences | Accuracy |
|---|---|---|---|
| 0 | salutations | 23 | 71.88% |
| 1 | other | 26 | 78.81% |
| 2 | closings | 5 | 83.95% |
| 3 | closings | 1 | 100.0% |

10 max clusters:

| Cluster id | Cluster semantic | No. of sentences | Accuracy |
|---|---|---|---|
| 0 | closings | 16 | 86.96% |
| 1 | salutations | 10 | 88.24% |
| 2 | closings | 1 | 100.0% |
| 3 | other | 33 | 78.71% |
| 4 | salutations | 4 | 60.0% |
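The elbow method's dependence on the maximum number of clusters can be illustrated with a small sketch. It assumes one common way of locating the 'knee': take the point of the sum-of-squared-error curve that lies farthest from the straight line joining the first and last points. Both this heuristic and the SSE values are assumptions made for illustration; they are not taken from the project's code or figures.

```python
import math


def knee_index(sse):
    """Index of the point farthest from the line joining the first and last points."""
    if len(sse) < 3:
        raise ValueError("need at least three SSE values to locate a knee")
    x0, y0 = 0, sse[0]
    x1, y1 = len(sse) - 1, sse[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_index, best_distance = 0, -1.0
    for x, y in enumerate(sse):
        # Perpendicular distance from (x, y) to the line through the endpoints.
        distance = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if distance > best_distance:
            best_index, best_distance = x, distance
    return best_index


# Hypothetical SSE values for k = 1..10 clusters (illustrative only).
sse = [1000, 700, 450, 300, 250, 220, 200, 185, 175, 170]

for max_clusters in (5, 8, 10):
    k = knee_index(sse[:max_clusters]) + 1   # index 0 corresponds to k = 1
    print(max_clusters, "max clusters ->", k, "clusters selected")
```

With these made-up values the selected number of clusters changes between a maximum of 5 and a maximum of 8 or 10, which is the sensitivity described above.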

Representing sentences as vectors

  • tfidf: Low accuracy and no semantic information.
  • spacy: Good word representations. A minor issue is that a sentence embedding is the mean of its word embeddings.
  • word2vec: Low accuracy.
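  • cbow:

| Cluster id | Cluster semantic | No. of sentences | Accuracy |
|---|---|---|---|
| 0 | closings | 17 | 86.79% |
| 1 | other | 45 | 80.16% |
| 2 | other | 2 | 70.37% |

  • skipgram:

| Cluster id | Cluster semantic | No. of sentences | Accuracy |
|---|---|---|---|
| 0 | other | 50 | 80.04% |
| 1 | closings | 14 | 93.74% |
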
  • doc2vec: Not ready yet.
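The issue noted for spacy, that a sentence embedding is the mean of its word embeddings, can be made concrete with a minimal sketch. The tiny two-dimensional word vectors below are hypothetical stand-ins for real embeddings, and `sentence_vector` is an illustrative helper, not a function from spaCy or from the project.

```python
def sentence_vector(sentence, word_vectors):
    """Average the vectors of the known words in a sentence (None if no word is known)."""
    words = [w for w in sentence.lower().split() if w in word_vectors]
    if not words:
        return None
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[w][k] for w in words) / len(words) for k in range(dim)]


# Hypothetical 2-dimensional word vectors (illustrative values only).
word_vectors = {
    "dog": [1.0, 0.0],
    "man": [0.0, 1.0],
    "bites": [0.5, 0.5],
}

a = sentence_vector("dog bites man", word_vectors)
b = sentence_vector("man bites dog", word_vectors)
print(a, b, a == b)
```

Averaging discards word order, so the two sentences above receive the same vector even though they mean different things.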

Classifying a new vector in existing clusters

Here, cosine similarity performs better than Euclidean distance in all cases.
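As a sketch of how the two measures can assign the same vector to different clusters, the example below classifies a new vector by comparing it with cluster centroids. Comparing against centroids is an assumption made for illustration, and the vectors are made up; the example shows only that the two measures can disagree, not which one is more accurate.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_by_cosine(vector, centroids):
    """Index of the centroid with the highest cosine similarity."""
    return max(range(len(centroids)), key=lambda i: cosine_similarity(vector, centroids[i]))


def nearest_by_euclidean(vector, centroids):
    """Index of the centroid with the smallest Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: euclidean_distance(vector, centroids[i]))


# Hypothetical centroids and a new sentence vector (illustrative values only).
centroids = [[1.0, 0.0], [4.0, 4.0]]
new_vector = [1.0, 0.9]

print(nearest_by_cosine(new_vector, centroids))     # direction matches centroid 1
print(nearest_by_euclidean(new_vector, centroids))  # position is closer to centroid 0
```

Cosine similarity ignores vector length and compares direction only, which may be one reason it behaves differently from Euclidean distance on sentence embeddings.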

Parameter Selection

Based on the above analysis, we use the maximum silhouette score to determine the number of clusters, and cosine similarity to classify a new vector into an existing cluster. Finally, spacy and skipgram word vectors give the best accuracy. Since the skipgram vectors are trained on the user's emails, we can use them when the email corpus is fairly large (over 100 emails).
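Selecting the number of clusters by maximum silhouette can be sketched as follows. The code computes the mean silhouette score for several candidate labelings and keeps the one with the highest score. The points and labelings are hypothetical; in practice the labelings would come from the clustering step, and the project may well rely on a library implementation rather than on code like this.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_number_of_clusters(points, labelings):
    """Return the key (number of clusters) whose labeling has the highest silhouette."""
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))


# Hypothetical 2-D points forming three tight groups (illustrative only).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labelings = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
print(best_number_of_clusters(points, labelings))
```

Unlike the elbow method, this criterion yields a single score per candidate, so the choice reduces to taking a maximum.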

Acoustic model adaptation

After domain-specific language model adaptation using the above techniques, the acoustic model should be adapted too. As the results below show, MLLR adaptation performs better, since our acoustic model is continuous and MAP adaptation of a continuous model requires more than one hour of recordings.

  • max silhouette with spacy embeddings:

| Cluster id | No. of sentences | Default | MAP | MLLR |
|---|---|---|---|---|
| 0 | 1 | 100.0% | 66.67% | 100% |
| 1 | 13 | 78.26% | 78.26% | 91.3% |
| 2 | 50 | 80.85% | 79.37% | 81.4% |

  • max silhouette with skipgram trained embeddings:

| Cluster id | No. of sentences | Default | MAP | MLLR |
|---|---|---|---|---|
| 0 | 50 | 80.04% | 77.93% | 80.61% |
| 1 | 14 | 93.75% | 91.67% | 93.75% |


The adaptation works! The default acoustic and language models reach 69.07% accuracy with 7 insertions, 47 deletions and 122 substitutions. With acoustic and domain-specific language model adaptation (max silhouette + spacy), the ASR system reaches the accuracy reported above with 4 insertions, 23 deletions and 76 substitutions.
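The error counts quoted above can be compared with a few lines of arithmetic. The sketch assumes that accuracy is 1 - (insertions + deletions + substitutions) / N. The test-set size N is not stated on this page; the value of 569 reference words used below is an inference that happens to reproduce the reported 69.07%, and should be treated as an assumption.

```python
def total_errors(insertions, deletions, substitutions):
    """Total number of word errors."""
    return insertions + deletions + substitutions


def word_accuracy(errors, reference_words):
    """Accuracy under the assumed definition 1 - errors / N."""
    return 1.0 - errors / reference_words


baseline_errors = total_errors(7, 47, 122)   # default models
adapted_errors = total_errors(4, 23, 76)     # after adaptation
relative_reduction = (baseline_errors - adapted_errors) / baseline_errors

# ASSUMPTION: number of reference words, inferred from 69.07%, not given in the source.
ASSUMED_REFERENCE_WORDS = 569
baseline_accuracy = word_accuracy(baseline_errors, ASSUMED_REFERENCE_WORDS)
adapted_accuracy = word_accuracy(adapted_errors, ASSUMED_REFERENCE_WORDS)

print(baseline_errors, adapted_errors, round(relative_reduction, 3))
print(round(100 * baseline_accuracy, 2), round(100 * adapted_accuracy, 2))
```

Independently of the assumed N, the total number of errors drops from 176 to 103, a relative reduction of about 41%.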
