Commit
[src] Adding GPU/CUDA lattice batched decoder + binary (kaldi-asr#3114)
hugovbraun authored and danpovey committed Jun 19, 2019
1 parent a0b6f3f commit c415cba
Showing 21 changed files with 7,155 additions and 915 deletions.
13 changes: 7 additions & 6 deletions src/Makefile
@@ -4,12 +4,13 @@

SHELL := /bin/bash


SUBDIRS = base matrix util feat tree gmm transform \
fstext hmm lm decoder lat kws cudamatrix \
bin fstbin gmmbin fgmmbin featbin \
-          latbin nnet3 rnnlm chain nnet3bin kwsbin \
-          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin
+          latbin sgmm2 sgmm2bin nnet3 rnnlm chain nnet3bin kwsbin \
+          ivector ivectorbin online2 online2bin lmbin chainbin rnnlmbin \
+          cudadecoder cudadecoderbin


MEMTESTDIRS = base matrix util feat tree gmm transform \
fstext hmm lm decoder lat kws chain \
@@ -127,7 +128,7 @@ $(EXT_SUBDIRS) : checkversion kaldi.mk mklibdir ext_depend
### Dependency list ###
# this is necessary for correct parallel compilation
#1)The tools depend on all the libraries
-bin fstbin gmmbin fgmmbin sgmm2bin featbin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin: \
+bin fstbin gmmbin fgmmbin sgmm2bin featbin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin cudadecoderbin: \
base matrix util feat tree gmm transform sgmm2 fstext hmm \
lm decoder lat cudamatrix nnet3 ivector chain kws online2 rnnlm

@@ -146,8 +147,6 @@ lm: base util matrix fstext
decoder: base util matrix gmm hmm tree transform lat
lat: base util hmm tree matrix
cudamatrix: base util matrix
-nnet: base util hmm tree matrix cudamatrix
-nnet2: base util matrix lat gmm hmm tree transform cudamatrix
nnet3: base util matrix decoder lat gmm hmm tree transform cudamatrix chain fstext
rnnlm: base util matrix cudamatrix nnet3 lm hmm
chain: lat hmm tree fstext matrix cudamatrix util base
@@ -158,3 +157,5 @@ onlinebin: base matrix util feat tree gmm transform sgmm2 fstext hmm lm decoder
online: decoder gmm transform feat matrix util base lat hmm tree
online2: decoder gmm transform feat matrix util base lat hmm tree ivector cudamatrix nnet3 chain
kws: base util hmm tree matrix lat
+cudadecoder: cudamatrix online2 nnet3 ivector feat fstext lat chain transform
+cudadecoderbin: cudadecoder cudamatrix online2 nnet3 ivector feat fstext lat chain transform
33 changes: 33 additions & 0 deletions src/cudadecoder/Makefile
@@ -0,0 +1,33 @@
all:

EXTRA_CXXFLAGS = -Wno-sign-compare
include ../kaldi.mk

ifeq ($(CUDA), true)

# Make sure we have CUDA_ARCH from kaldi.mk,
ifndef CUDA_ARCH
$(error CUDA_ARCH is undefined, run 'src/configure')
endif

TESTFILES =

OBJFILES = batched-threaded-nnet3-cuda-pipeline.o decodable-cumatrix.o \
cuda-decoder.o cuda-decoder-kernels.o cuda-fst.o

LDFLAGS += $(CUDA_LDFLAGS)
LDLIBS += $(CUDA_LDLIBS)

LIBNAME = kaldi-cudadecoder

ADDLIBS = ../cudamatrix/kaldi-cudamatrix.a ../base/kaldi-base.a ../matrix/kaldi-matrix.a \
../lat/kaldi-lat.a ../util/kaldi-util.a ../matrix/kaldi-matrix.a ../gmm/kaldi-gmm.a \
../fstext/kaldi-fstext.a ../hmm/kaldi-hmm.a ../gmm/kaldi-gmm.a ../transform/kaldi-transform.a \
../tree/kaldi-tree.a ../online2/kaldi-online2.a ../nnet3/kaldi-nnet3.a

# Implicit rule for kernel compilation
%.o : %.cu
$(CUDATKDIR)/bin/nvcc -c $< -o $@ $(CUDA_INCLUDE) $(CUDA_FLAGS) $(CUDA_ARCH) -I../ -I$(OPENFSTINC)
endif

include ../makefiles/default_rules.mk
141 changes: 141 additions & 0 deletions src/cudadecoder/README
@@ -0,0 +1,141 @@
CUDADECODER USAGE AND TUNING GUIDE

INTRODUCTION:

The CudaDecoder was developed by NVIDIA in coordination with Johns Hopkins.
This work is intended to demonstrate efficient GPU utilization across a range
of NVIDIA hardware, from SM_35 onward. The following guide describes how to
use and tune the decoder for your models.

A single speech-to-text stream is not enough work to fully saturate any NVIDIA
GPU. To fully saturate a GPU we need to decode many audio files concurrently.
The solution provided here does this through a combination of batching many
audio files into a single speech pipeline, running multiple pipelines in
parallel on the device, and using multiple CPU threads to perform feature
extraction and determinization. Users of the decoder will need a high-level
understanding of the underlying implementation to know how to tune the
decoder.

The interface to the decoder is defined in "batched-threaded-cuda-decoder.h".
A binary example can be found in "cudadecoderbin/batched-wav-nnet3-cuda.cc".
Below is a simple usage example.
/*
 * BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 * batchedDecoderConfig.Register(&po);
 * po.Read(argc, argv);
 * ...
 * BatchedThreadedCudaDecoder CudaDecoder(batchedDecoderConfig);
 * CudaDecoder.Initialize(*decode_fst, am_nnet, trans_model);
 * ...
 * // Enqueue all decodes, keeping the keys for later retrieval.
 * for (; !wav_reader.Done(); wav_reader.Next()) {
 *   std::string key = wav_reader.Key();
 *   CudaDecoder.OpenDecodeHandle(key, wav_reader.Value());
 *   keys.push_back(key);
 *   ...
 * }
 *
 * // Collect the results; GetLattice waits for each decode to complete.
 * for (const std::string &key : keys) {
 *   CompactLattice clat;
 *   CudaDecoder.GetLattice(key, &clat);
 *   CudaDecoder.CloseDecodeHandle(key);
 *   ...
 * }
 *
 * CudaDecoder.Finalize();
 */

In the code above we first declare a BatchedThreadedCudaDecoderConfig and
register its options. This allows the configuration options to be set on the
command line. Next we declare the CudaDecoder with that configuration. Before
we can use the CudaDecoder we need to initialize it with an FST, an
AmNnetSimple, and a TransitionModel.

Next we iterate through the wave files and enqueue them into the decoder by
calling OpenDecodeHandle. Note that the key must be unique for each decode.
Once we have enqueued work we can query the results by calling GetLattice
with the same key we opened the handle on. GetLattice automatically waits
for processing to complete before returning.

The key to good performance is to have many decodes active at the same time,
by opening many decode handles before querying for the lattices.


PERFORMANCE TUNING:

The CudaDecoder has many tuning parameters which can be used to increase
performance across various models and hardware. Note that the optimal
parameters will vary according to the hardware, the model, and the data being
decoded.

Each parameter is briefly described below; an example showing how these
options can be set on the command line follows the option lists.

BatchedThreadedCudaDecoderOptions:
  cuda-control-threads: Number of CPU threads simultaneously submitting work
    to the device. For best performance this should be between 2 and 4.
  cuda-worker-threads: CPU threads for worker tasks like determinization and
    feature extraction. For best performance this should use all spare CPU
    threads available on the system.
  max-batch-size: Maximum batch size in a single pipeline. This should be as
    large as possible but is expected to be between 50 and 200.
  batch-drain-size: How far to drain the batch before getting new work.
    Draining the batch allows the nnet3 computation to be better batched.
    Testing has indicated that 10-30% of max-batch-size is ideal.
  determinize-lattice: Use cuda-worker-threads to determinize the lattice. If
    this is true then GetRawLattice can no longer be called.
  max-outstanding-queue-length: The maximum number of decodes that can be
    queued and not assigned before OpenDecodeHandle will automatically stall
    the submitting thread. Raising this increases CPU resource usage. This
    should be set to a few thousand at least.

Decoder Options:
  beam: The width of the beam during decoding.
  lattice-beam: The width of the lattice beam.
  ntokens-preallocated: Number of tokens preallocated in host buffers. If
    this size is exceeded the buffer is reallocated to a larger size,
    consuming more resources.
  max-tokens-per-frame: Maximum tokens in GPU memory per frame. If this
    value is exceeded the beam will tighten and accuracy may decrease.
  max-active: At the end of each frame's computation, we keep only its best
    max-active tokens (arc instantiations).

Device Options:
  use-tensor-cores: Enables tensor cores (fp16 math) for GEMMs. This is
    faster but less accurate. For inference the loss of accuracy is marginal.
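
The sketch below shows one way these options might be wired up, following the
usage example earlier in this README: the config object is registered with a
ParseOptions instance and the flags are then supplied on the command line.
The flag values shown are illustrative placeholders only, and the header path
is simply the one quoted above; both may differ from your setup.

/*
 * #include "util/parse-options.h"
 * // Header name as quoted above; assumed to declare the config class.
 * #include "cudadecoder/batched-threaded-cuda-decoder.h"
 *
 * int main(int argc, char *argv[]) {
 *   using namespace kaldi;
 *   ParseOptions po("Tuning sketch for the batched CUDA decoder.");
 *
 *   BatchedThreadedCudaDecoderConfig batchedDecoderConfig;
 *   batchedDecoderConfig.Register(&po);  // registers the options listed above
 *
 *   // A typical invocation could then pass flags such as:
 *   //   --cuda-control-threads=2 --cuda-worker-threads=20 \
 *   //   --max-batch-size=100 --batch-drain-size=20 \
 *   //   --max-outstanding-queue-length=4000 \
 *   //   --beam=15.0 --lattice-beam=8.0 --max-active=10000
 *   po.Read(argc, argv);
 *
 *   // ... construct the BatchedThreadedCudaDecoder, call Initialize(),
 *   // and open decode handles as shown in the usage example above ...
 *   return 0;
 * }
 */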

GPU MEMORY USAGE:

GPU memory is limited. Large GPUs have between 16 and 32 GB of memory;
consumer GPUs have much less. For best performance users should run as many
concurrent decodes as possible, so users should favor GPUs with as much
memory as possible. GPUs with less memory may have to sacrifice either
performance or accuracy. On 16 GB GPUs, for example, we are able to support
around 200 concurrent decodes at a time. This translates into 4
cuda-control-threads and a max-batch-size of 50 (4x50). If your model is
larger or smaller than the models we used for testing, you may have to adjust
these values accordingly.
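
As a worked illustration of the arithmetic above (example values only, not a
recommendation), the total number of concurrent decodes is the number of
pipelines times the decodes per pipeline:

/*
 * #include <cstdio>
 * int main() {
 *   const int cuda_control_threads = 4;  // concurrent pipelines (example)
 *   const int max_batch_size = 50;       // decodes per pipeline (example)
 *   // Total concurrent decodes = cuda-control-threads * max-batch-size.
 *   std::printf("concurrent decodes: %d\n",
 *               cuda_control_threads * max_batch_size);  // prints 200
 *   return 0;
 * }
 */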

There are a number of parameters which can be used to control GPU memory
usage. How they impact memory usage and accuracy is discussed below:

max-tokens-per-frame: Controls how much token storage is allocated on the GPU
  for each frame. This buffer cannot be exceeded or reallocated. As this
  buffer gets closer to being exhausted the beam is reduced, possibly reducing
  quality. This should be tuned according to the model and data. For example,
  a highly accurate model could set this value smaller to enable more
  concurrent decodes.

cuda-control-threads: Each control thread runs a concurrent pipeline, so GPU
  memory usage scales linearly with this parameter. This should always be at
  least 2 but should probably not be higher than 4, as more concurrent
  pipelines lead to more driver contention, reducing performance.

max-batch-size: The number of concurrent decodes in each pipeline. Memory
  usage also scales linearly with this parameter. Setting this smaller will
  reduce kernel runtime while increasing launch latency overhead. Ideally
  this should be as large as possible while still fitting into memory. Note
  that currently the maximum allowed value is 200.

== Acknowledgement ==

We would like to thank Daniel Povey, Zhehuai Chen and Daniel Galvez for their help and expertise during the review process.

