
\chapter{Conclusions}
\label{chapter:conclusions}

Deep learning has set the trend over the past decade for automating tasks such as speech recognition, and scaling up the training data is one way to improve the models. Large-scale training experiments are challenging to carry out because computing resources and infrastructure are limited, so it becomes crucial to use the available assets efficiently.

We introduced the Business speech dataset, which contains around 9 million utterances and 26,000 hours of speech. The dataset is unusual in the diversity of its speakers, and it contains both conversational and prepared speech. Data scalability promises high performance but also brings challenges. As the scale of the data increases, so does the likelihood of inconsistencies in the data, which are hard to track, especially manually.

We presented solutions that enable training models on data on the order of a few terabytes. Sequential access to the dataset through TAR archives is essential for experiments above 2,000 hours of data to be practical at all. In our case, we use WebDataset to achieve this, which also integrates well with PyTorch. It also supports access from multiple nodes and processes and use with multiple GPUs, which helps with distributed training setups.

We use an attention-based encoder-decoder architecture as the default for all the experiments and compare three different training strategies.

In the non-distributed training jobs, the best \acrshort{wer} of the models dropped consistently as we increased the scale of the training data, reaching 14.01\% on the 20,000-hour dataset. We also observed the importance of using large-scale data: the model trained with the 20,000-hour dataset reached the benchmark \acrshort{wer} in 48 hours, compared to 76 hours with the 8,000-hour dataset. Using larger datasets therefore does not necessarily mean longer training times. It can be argued that it is better to train with large datasets even when the time required to complete training is a crucial factor for the experiments.
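
As a rough illustration using only the two training times quoted above, the relative reduction in time to reach the benchmark \acrshort{wer} can be worked out directly. The symbols $T_{8\mathrm{k}}$ and $T_{20\mathrm{k}}$ are introduced here purely for this calculation and denote the training times with the 8,000-hour and 20,000-hour datasets, respectively:
\[
\frac{T_{8\mathrm{k}} - T_{20\mathrm{k}}}{T_{8\mathrm{k}}} = \frac{76 - 48}{76} = \frac{28}{76} \approx 0.37,
\]
that is, roughly 37\% less training time with the larger dataset, assuming the two runs are otherwise comparable.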

Synchronous training worked best in our experiments and provided the best word error rates. We observed a speed-up of $2\times$ when using four \acrshort{gpu}s in data-parallel mode. The best \acrshort{wer} is 10.87\% with the 20,000-hour dataset using 4-\acrshort{gpu} \acrshort{ddp}. Hence, we conclude that synchronous training methods are effective, especially when the hardware setup is homogeneous (similar \acrshort{gpu}s and CPUs in a multinode environment), so that the straggler problem becomes irrelevant. Asynchronous training methods did not fare well in our experiments, but they could improve with hyperparameter optimization targeted at that training strategy. Even though there was a slight improvement in convergence time, the performance metrics were considerably worse than those of the other methods. The \acrshort{wer} learning curve for asynchronous training also does not inspire confidence in the method, as it was very erratic even when the loss value was constant.
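
Taking the reported speed-up of the synchronous setup at face value, its parallel efficiency can be estimated with the standard definition of efficiency as speed-up divided by the number of devices. The symbols $S$ (speed-up), $N$ (number of \acrshort{gpu}s) and $E$ (efficiency) are notation introduced here for illustration only:
\[
E = \frac{S}{N} = \frac{2}{4} = 0.5,
\]
that is, about half of what ideal linear scaling over four \acrshort{gpu}s would give.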

In this thesis, we provide a full workflow for speech recognition with large-scale data that speeds up training. We discuss methods for data storage and data loading, and for making the most efficient use of the available resources through parallelization techniques involving multiple GPUs and processes. The work can be accessed on GitHub\footnote{aalto-speech/BizSpeech\_SpeechBrain: Building an ASR system recipe for BizSpeech data using SpeechBrain. \href{https://github.com/aalto-speech/BizSpeech_SpeechBrain}{https://github.com/aalto-speech/BizSpeech\_SpeechBrain}}.