List of papers I have read
- A Comparative Survey of Deep Active Learning
- A Survey of Deep Active Learning
- A Survey on Deep Transfer Learning
- Active Learning for Convolutional Neural Networks: A Core-Set Approach
- An overview of Multi-Task Learning in Deep Neural Networks
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- Deep Networks with Stochastic Depth
- Distilling the Knowledge in a Neural Network
- Domain-Adversarial Training of Neural Networks
- Gaussian Error Linear Units (GELUs)
- Generative Adversarial Nets
- Gradient Episodic Memory for Continual Learning
- Group Normalization
- Layer Normalization
- Learning Loss for Active Learning
- Learning with Pseudo-Ensembles
- Learning without Forgetting
- Long Short-Term Memory
- Mixed Precision Training
- Monotonic Chunkwise Attention
- Online and Linear-Time Attention by Enforcing Monotonic Alignments
- Overcoming catastrophic forgetting in neural networks
- Progressive Neural Networks
- Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
- Representation Learning with Contrastive Predictive Coding
- Searching for Activation Functions
- SGDR: Stochastic Gradient Descent with Warm Restarts
- Unsupervised Data Augmentation for Consistency Training
- ZeRO: Memory optimizations Toward Training Trillion Parameter Models
- A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
- A Comparative Study on Transformer vs RNN in Speech Applications
- A Comparison of Sequence-to-Sequence Models for Speech Recognition
- A deep Learning Approach to Automatic Characterisation of Rhythm in Non-native English Speech
- A pitch extraction algorithm tuned for automatic speech recognition
- A study on data augmentation of reverberant speech for robust speech recognition
- A time delay neural network architecture for efficient modeling of long temporal contexts
- Active Learning for Speech Recognition: the Power of Gradients
- Active Learning for LF-MMI Trained Neural Networks in ASR
- Adaptation Methods for Non-native Speech
- Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition
- Adversarial Multi-task Learning of Deep Neural Networks for Robust Speech Recognition
- Adversarial Training for Multilingual Acoustic Modeling
- An exploration of dropout with LSTMs
- An overview of Automatic Speech Attribute Transcription (ASAT)
- An overview of End-to-end Automatic Speech Recognition
- Audio Augmentation for Speech Recognition
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
- Automatic Speech Recognition for Second Language Learning: How and Why It Actually Works
- Automatic Speech Recognition of Multiple Accented English Data
- Blank Collapse: Compressing CTC emission for the faster decoding
- Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
- ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
- Common Voice: A Massively-Multilingual Speech Corpus
- Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System
- Computer-Assisted Pronunciation Training from Pronunciation Scoring Towards Spoken Language Learning
- Conformer: Convolution-augmented Transformer for Speech Recognition
- Conmer: Streaming Conformer without self-attention for interactive voice assistants
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
- ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
- Continual Learning in Automatic Speech Recognition
- Continual Learning Using Lattice-Free MMI for Speech Recognition
- Coupled Training of Sequence-to-sequence Models for Accented Speech Recognition
- CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition
- Data Augmentation for Deep Neural Network Acoustic Modeling
- Data Augmentation Improves Recognition of Foreign Accented Speech
- Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
- Deep Speech2: End-to-end Speech Recognition in English and Mandarin
- Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
- DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model
- Domain Adversarial Training for Accented Speech Recognition
- E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
- E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
- E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
- Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition
- Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition
- End-to-end Accented Speech Recognition
- End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection
- End-to-end Speech Recognition Using Lattice-free MMI
- End-to-end Speech Recognition with Word-based RNN Language Models
- English Conversational Telephone Speech Recognition by Humans and Machines
- ESPnet: End-to-end Speech Processing Toolkit
- ESPnet-ONNX: Bridging a Gap Between Research and Production
- Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
- ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit of Kaldi
- Exploring Deep Learning Architectures for Automatically Grading Non-native Spontaneous Speech
- Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition
- FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
- Far-Field Automatic Speech Recognition
- Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
- Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
- Full-duplex Speech-to-text System for Estonian
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- Hyperparameter experiments on end-to-end automatic speech recognition
- Improved Accented Speech Recognition using Accent Embeddings and Multi-task Learning
- Improved end-of-query detection for streaming speech recognition
- Improved Noisy Student Training for Automatic Speech Recognition
- Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition
- Intermediate Loss Regularization for CTC-based Speech Recognition
- Introducing Attribute Features to Foreign Accent Recognition
- Iterative Pseudo-Labeling for Speech Recognition
- Japanese and Korean voice search
- Jasper: An End-to-End Convolutional Neural Acoustic Model
- JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning
- Joint CTC-attention based end-to-end speech recognition using multi-task learning
- Language identification with suprasegmental cues: A study based on speech resynthesis
- Leveraging Native Language Information for Improved Accented Speech Recognition
- Lhotse: a speech data representation library for the modern deep learning ecosystem
- Libri-Light: A Benchmark for ASR with Limited or No Supervision
- Librispeech: An ASR Corpus Based on Public Domain Audio Books
- Light Gated Recurrent Units for Speech Recognition
- Listen, Attend and Spell
- MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
- Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
- Memory-Efficient Training of RNN-Transducer with Sampled Softmax
- MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition
- MLLR-based Accent Model Adaptation without Accented Data
- Montreal Forced Aligner: trainable text-speech alignment using Kaldi
- Multi-accent Speech Recognition with Hierarchical Grapheme Based Models
- Multi-dialect Speech Recognition with A Single Sequence-to-sequence Model
- Multi-task Learning for Speech Recognition: An overview
- MUSAN: A Music, Speech, and Noise Corpus
- NeMo: a toolkit for building AI applications using Neural Modules
- Online Continual Learning of End-to-End Speech Recognition Models
- Output-Gate Projected Gated Recurrent Unit for Speech Recognition
- Purely sequence-trained neural networks for ASR based on lattice-free MMI
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
- PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
- PyKaldi: A Python Wrapper for Kaldi
- PyKaldi2: Yet Another Speech Toolkit Based on Kaldi and Pytorch
- Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
- Recent Developments on Espnet Toolkit Boosted By Conformer
- Robust Speech Recognition via Large-Scale Weak Supervision
- Run-and-Back Stitch Search: Novel Block Synchronous Decoding For Streaming Encoder-Decoder ASR
- Sequence Transduction with Recurrent Neural Networks
- Some Commonly Used Speech Feature Extraction Algorithms
- SpecAugment: A simple Data Augmentation Method for Automatic Speech Recognition
- Specaugment on Large Scale Datasets
- Speech Augmentation using Wavenet in Speech Recognition
- Speech Recognition of Multiple Accented English Data Using Acoustic Model Interpolation
- Speech recognition with weighted finite-state transducers
- Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition
- SpeechBrain: A General-Purpose Speech Toolkit
- SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
- Spell My Name: Keyword Boosted Speech Recognition
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
- Streaming Automatic Speech Recognition with the Transformer Models
- Streaming End-to-end Speech Recognition for Mobile Devices
- Streaming Transformer ASR With Blockwise Synchronous Beam Search
- The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
- The Kaldi Speech Recognition Toolkit
- The Pytorch-kaldi Speech Recognition Toolkit
- Towards Online End-to-end Transformer Automatic Speech Recognition
- Towards Fast and Accurate Streaming End-To-End ASR
- Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss
- Using Accent-Specific Pronunciation Modelling for Robust Speech Recognition
- VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
- Vocal Tract Length Perturbation (VTLP) improves speech recognition
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
- Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
- Wav2Letter++: A Fast Open-source Speech Recognition System
- wav2vec: Unsupervised Pre-training for Speech Recognition
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- Word Beam Search: A connectionist Temporal Classification Decoding Algorithm
- Acoustic Model Training Using Self-Attention for Low-Resource Speech Recognition (저자원 환경의 음성인식을 위한 자기 주의를 활용한 음향 모델 학습)
- A Survey on Neural Speech Synthesis
- AutoSpeech: Neural Architecture Search for Speaker Recognition
- Deep Learning Methods in Speaker Recognition: A Review
- Deep speaker: an end-to-end neural speaker embedding system
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
- End-to-End Neural Speaker Diarization with Permutation-Free Objectives
- End-to-End Neural Speaker Diarization with Self-Attention
- End-to-end speaker segmentation for overlap-aware resegmentation
- Generalized End-to-End Loss for Speaker Verification
- Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms
- In defence of metric learning for speaker recognition
- Multi-scale Speaker Diarization with Dynamic Scale Weighting
- Pushing the limits of raw waveform speaker recognition
- RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification
- Speaker Recognition Based on Deep Learning: An Overview
- Speaker Recognition from Raw Waveform with SincNet
- TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context
- Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition
- A Fully Convolutional Neural Network for Speech Enhancement
- Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
- FAST-RIR: Fast neural diffuse room impulse response generator
- Improved Speech Enhancement with the Wave-U-Net
- NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing
- Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms
- Real Time Speech Enhancement in the Waveform Domain
- SEGAN: Speech Enhancement Generative Adversarial Network
- Self-Attentive VAD: Context-Aware Detection of Voice from Noise
- The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
- A review of deep learning based speech synthesis
- Adversarial Audio Synthesis
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
- Deep Voice: Real-time Neural Text-to-Speech
- Emotional Speech Synthesis With Rich And Granularized Control
- FastSpeech: Fast, Robust and Controllable Text to Speech
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
- Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions
- Neural Speech Synthesis with Transformer Network
- Period VITS: Variational Inference with Explicit Pitch Modeling for End-To-End Emotional Speech Synthesis
- Tacotron: Towards end-to-end speech synthesis
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
- VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network
- Waveglow: A flow-based generative network for speech synthesis
- Wavenet: A Generative Model for Raw Audio
- Multimodal Cross- and Self-Attention Network for Speech Emotion Recognition
- Multimodal Emotion Recognition with High-level Speech and Text Features
- Multimodal Speech Emotion Recognition and Ambiguity Resolution
- Multimodal Speech Emotion Recognition Using Audio and Text
- Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
- Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition
- Tensor Fusion Network for Multimodal Sentiment Analysis
- AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
- An algorithm for suffix stripping
- Attention is all you need
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Big Bird: Transformers for Longer Sequences
- BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text
- Electra: Pre-training text encoders as discriminators rather than generators
- FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling
- Language Modeling with Deep Transformers
- Language Models are Unsupervised Multitask Learners
- Longformer: The Long-Document Transformer
- Neural Machine Translation of Rare Words with Subword Units
- Recent Trends in the Use of Deep Learning Models for Grammar Error Handling
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- SimCSE: Simple Contrastive Learning of Sentence Embeddings
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Transformers: State-of-the-Art Natural Language Processing
- Albumentations: fast and flexible image augmentations
- An image is worth 16x16 words: Transformers for image recognition at scale
- Deep residual learning for image recognition
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Squeeze-and-excitation networks
- U-Net: Convolutional Networks for Biomedical Image Segmentation
- Very deep convolutional networks for large-scale image recognition
- You Only Look Once: Unified, Real-Time Object detection
- Neural architecture search with reinforcement learning
- Playing Atari with Deep Reinforcement Learning
- Accent as a Social Symbol
- Calibrating rhythm: First language and second language studies
- Control Methods Used in a Study of the Vowels
- Correlates of linguistic rhythm in the speech signal
- Durational Variability in Speech and the Rhythm Class Hypothesis
- History of ESL Pronunciation Teaching
- Intonation
- Language Discrimination by Newborns: Toward an Understanding of the Role of Rhythm
- Measures of Native and Non-Native Rhythm in a Quantity Language
- On the distinction between 'stress-timed' and 'syllable-timed' languages
- On the Historical Phonotactics of English
- Relations between language rhythm and speech rate
- Stress-timing and Syllable-timing Reanalyzed
- Sound Change And Syllable Structure in Germanic Phonology
- Speech rhythm across turn transitions in cross-cultural talk-in-interaction
- The Environment for Open-syllable Lengthening in Middle English
- The Historical Evolution of English Pronunciation
- The Original ToBI System and the Evolution of the ToBI Framework
- The Past, Present and Future of English Rhythm
- Voice Onset Time (VOT) at 50: Theoretical and practical issues in measuring voicing distinctions
- Grimm's Law: The Push-Chain View and the Drag-Chain View (그림의 법칙: 연쇄 밀기 입장과 연쇄 당김 입장)