Commit
[src] Adding GPU/CUDA lattice batched decoder + binary (kaldi-asr#3114)
hugovbraun authored and danpovey committed Jun 19, 2019
1 parent a0b6f3f commit c415cba
Showing 21 changed files with 7,155 additions and 915 deletions.
13 changes: 7 additions & 6 deletions src/Makefile
@@ -4,12 +4,13 @@

SHELL := /bin/bash


SUBDIRS = base matrix util feat tree gmm transform \
fstext hmm lm decoder lat kws cudamatrix \
bin fstbin gmmbin fgmmbin featbin \
-          latbin nnet3 rnnlm chain nnet3bin kwsbin \
-          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin
+          latbin sgmm2 sgmm2bin nnet3 rnnlm chain nnet3bin kwsbin \
+          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin \
+          cudadecoder cudadecoderbin


MEMTESTDIRS = base matrix util feat tree gmm transform \
fstext hmm lm decoder lat kws chain \
@@ -127,7 +128,7 @@ $(EXT_SUBDIRS) : checkversion kaldi.mk mklibdir ext_depend
### Dependency list ###
# this is necessary for correct parallel compilation
#1)The tools depend on all the libraries
-bin fstbin gmmbin fgmmbin sgmm2bin featbin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin: \
+bin fstbin gmmbin fgmmbin sgmm2bin featbin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin cudadecoderbin: \
base matrix util feat tree gmm transform sgmm2 fstext hmm \
lm decoder lat cudamatrix nnet3 ivector chain kws online2 rnnlm

@@ -146,8 +147,6 @@ lm: base util matrix fstext
decoder: base util matrix gmm hmm tree transform lat
lat: base util hmm tree matrix
cudamatrix: base util matrix
-nnet: base util hmm tree matrix cudamatrix
-nnet2: base util matrix lat gmm hmm tree transform cudamatrix
nnet3: base util matrix decoder lat gmm hmm tree transform cudamatrix chain fstext
rnnlm: base util matrix cudamatrix nnet3 lm hmm
chain: lat hmm tree fstext matrix cudamatrix util base
@@ -158,3 +157,5 @@ onlinebin: base matrix util feat tree gmm transform sgmm2 fstext hmm lm decoder
online: decoder gmm transform feat matrix util base lat hmm tree
online2: decoder gmm transform feat matrix util base lat hmm tree ivector cudamatrix nnet3 chain
kws: base util hmm tree matrix lat
+cudadecoder: cudamatrix online2 nnet3 ivector feat fstext lat chain transform
+cudadecoderbin: cudadecoder cudamatrix online2 nnet3 ivector feat fstext lat chain transform
33 changes: 33 additions & 0 deletions src/cudadecoder/Makefile
@@ -0,0 +1,33 @@
all:

EXTRA_CXXFLAGS = -Wno-sign-compare
include ../kaldi.mk

ifeq ($(CUDA), true)

# Make sure we have CUDA_ARCH from kaldi.mk,
ifndef CUDA_ARCH
$(error CUDA_ARCH is undefined, run 'src/configure')
endif

TESTFILES =

OBJFILES = batched-threaded-nnet3-cuda-pipeline.o decodable-cumatrix.o \
cuda-decoder.o cuda-decoder-kernels.o cuda-fst.o

LDFLAGS += $(CUDA_LDFLAGS)
LDLIBS += $(CUDA_LDLIBS)

LIBNAME = kaldi-cudadecoder

ADDLIBS = ../cudamatrix/kaldi-cudamatrix.a ../base/kaldi-base.a ../matrix/kaldi-matrix.a \
../lat/kaldi-lat.a ../util/kaldi-util.a ../matrix/kaldi-matrix.a ../gmm/kaldi-gmm.a \
../fstext/kaldi-fstext.a ../hmm/kaldi-hmm.a ../gmm/kaldi-gmm.a ../transform/kaldi-transform.a \
../tree/kaldi-tree.a ../online2/kaldi-online2.a ../nnet3/kaldi-nnet3.a

# Implicit rule for kernel compilation
%.o : %.cu
$(CUDATKDIR)/bin/nvcc -c $< -o $@ $(CUDA_INCLUDE) $(CUDA_FLAGS) $(CUDA_ARCH) -I../ -I$(OPENFSTINC)
endif

include ../makefiles/default_rules.mk
141 changes: 141 additions & 0 deletions src/cudadecoder/README
@@ -0,0 +1,141 @@
CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA in coordination with Johns Hopkins.
This work is intended to demonstrate efficient GPU utilization across a range
of NVIDIA hardware, from SM_35 onward. The following guide describes how to
use and tune the decoder for your models.

A single speech-to-text stream is not enough work to fully saturate any NVIDIA
GPU. To fully saturate a GPU we need to decode many audio files concurrently.
The solution provided here does this through a combination of batching many
audio files into a single speech pipeline, running multiple pipelines in
parallel on the device, and using multiple CPU threads to perform feature
extraction and determinization. Users of the decoder will need a high-level
understanding of the underlying implementation to know how to tune the
decoder.

The interface to the decoder is defined in "batched-threaded-cuda-decoder.h".
A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc".
Below is a simple usage example.
/*
 * BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 * batchedDecoderConfig.Register(&po);
 * po.Read(argc, argv);
 * ...
 * BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 * CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 * ...
 * // Enqueue all decodes, keeping the keys for later retrieval.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   std::string key = wav_reader.Key();
 *   CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *   keys.push_back(key);
 *   ...
 * }
 *
 * // Collect the results; GetLattice waits for each decode to complete.
 * for (const std::string &key : keys) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 *   ...
 * }
 *
 * CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig and
register its options. This allows the configuration options to be set on the
command line. Next we declare the CudaDecoder with that configuration. Before
we can use the CudaDecoder we need to initialize it with an FST, an
AmNnetSimple, and a TransitionModel.

Next we iterate through the wave files and enqueue them into the decoder by
calling OpenDecodeHandle. Note that the key must be unique for each decode.
Once we have enqueued work we can query the results by calling GetLattice
with the same key we opened the handle on. GetLattice automatically waits
for processing to complete before returning.

The key to good performance is to have many decodes active at the same time,
by opening many decode handles before querying for the lattices.


PERFORMANCE TUNING:

The CudaDecoder has many tuning parameters which can be used to increase
performance across various models and hardware. Note that the optimal
parameters will vary according to the hardware, the model, and the data being
decoded.

Each parameter is briefly described below; an example showing how these
options can be set on the command line follows the option lists.

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work
    to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks like determinization and
    feature extraction. For best performance this should use all spare CPU
    threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as
    large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work.
    Draining the batch allows the nnet3 computation to be better batched.
    Testing has indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice. If
    this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be
    queued and not assigned before OpenDecodeHandle will automatically stall
    the submitting thread. Raising this increases CPU resource usage. This
    should be set to a few thousand at least.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens preallocated in host buffers. If
    this size is exceeded the buffer is reallocated to a larger size,
    consuming more resources.
  max-tokens-per-frame: Maximum tokens in GPU memory per frame. If this
    value is exceeded the beam will tighten and accuracy may decrease.
  max-active: At the end of each frame's computation, we keep only its best
    max-active tokens (arc instantiations).

Device Options:
  use-tensor-cores: Enables tensor cores (fp16 math) for GEMMs. This is
    faster but less accurate. For inference the loss of accuracy is marginal.
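
The sketch below shows one way these options might be wired up, following the
usage example earlier in this README: the config object is registered with a
ParseOptions instance and the flags are then supplied on the command line.
The flag values shown are illustrative placeholders only, and the header path
is simply the one quoted above; both may differ from your setup.

/*
 * #include "util/parse-options.h"
 * // Header name as quoted above; assumed to declare the config class.
 * #include "cudadecoder/batched-threaded-cuda-decoder.h"
 *
 * int main(int argc, char *argv[]) {
 *   using namespace kaldi;
 *   ParseOptions po("Tuning sketch for the batched CUDA decoder.");
 *
 *   BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 *   batchedDecoderConfig.Register(&po);  // registers the options listed above
 *
 *   // A typical invocation could then pass flags such as:
 *   //   --cuda-control-threads=2 --cuda-worker-threads=20 \
 *   //   --max-batch-size=100 --batch-drain-size=20 \
 *   //   --max-outstanding-queue-length=4000 \
 *   //   --beam=15.0 --lattice-beam=8.0 --max-active=10000
 *   po.Read(argc, argv);
 *
 *   // ... construct the BatchedThreadedCudaDecoder, call Initialize(),
 *   // and open decode handles as shown in the usage example above ...
 *   return 0;
 * }
 */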

GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory;
consumer GPUs have much less. For best performance users should run as many
concurrent decodes as possible, so users should favor GPUs with as much
memory as possible. GPUs with less memory may have to sacrifice either
performance or accuracy. On 16 GB GPUs, for example, we are able to support
around 200 concurrent decodes at a time. This translates into 4
cuda-control-threads and a max-batch-size of 50 (4x50). If your model is
larger or smaller than the models we used for testing, you may have to adjust
these values accordingly.
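
As a worked illustration of the arithmetic above (example values only, not a
recommendation), the total number of concurrent decodes is the number of
pipelines times the decodes per pipeline:

/*
 * #include <cstdio>
 * int main() {
 *   const int cuda_control_threads = 4;  // concurrent pipelines (example)
 *   const int max_batch_size = 50;       // decodes per pipeline (example)
 *   // Total concurrent decodes = cuda-control-threads * max-batch-size.
 *   std::printf("concurrent decodes: %d\n",
 *               cuda_control_threads * max_batch_size);  // prints 200
 *   return 0;
 * }
 */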

There are a number of parameters which can be used to control GPU memory
usage. How they impact memory usage and accuracy is discussed below:

max-tokens-per-frame: Controls how much token storage is allocated on the GPU
  for each frame. This buffer cannot be exceeded or reallocated. As this
  buffer gets closer to being exhausted the beam is reduced, possibly reducing
  quality. This should be tuned according to the model and data. For example,
  a highly accurate model could set this value smaller to enable more
  concurrent decodes.

cuda-control-threads: Each control thread runs a concurrent pipeline, so GPU
  memory usage scales linearly with this parameter. This should always be at
  least 2 but should probably not be higher than 4, as more concurrent
  pipelines lead to more driver contention, reducing performance.

max-batch-size: The number of concurrent decodes in each pipeline. Memory
  usage also scales linearly with this parameter. Setting this smaller will
  reduce kernel runtime while increasing launch latency overhead. Ideally
  this should be as large as possible while still fitting into memory. Note
  that currently the maximum allowed value is 200.

== Acknowledgement ==

We would like to thank Daniel Povey, Zhehuai Chen and Daniel Galvez for their help and expertise during the review process.

