Paper

A list of the papers I have read

Paper list

Machine Learning & Deep Learning

A Comparative Survey of Deep Active Learning
A Survey of Deep Active Learning
A Survey on Deep Transfer Learning
Active Learning for Convolutional Neural Networks: A Core-Set Approach
An overview of Multi-Task Learning in Deep Neural Networks
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Deep Networks with Stochastic Depth
Distilling the Knowledge in a Neural Network
Domain-Adversarial Training of Neural Networks
Gaussian Error Linear Units (GELUs)
Generative Adversarial Nets
Gradient Episodic Memory for Continual Learning
Group Normalization
Layer Normalization
Learning Loss for Active Learning
Learning with Pseudo-Ensembles
Learning without Forgetting
Long Short-Term Memory
Mixed Precision Training
Monotonic Chunkwise Attention
Online and Linear-Time Attention by Enforcing Monotonic Alignments
Overcoming catastrophic forgetting in neural networks
Progressive Neural Networks
Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
Representation Learning with Contrastive Predictive Coding
Searching for Activation Functions
SGDR: Stochastic Gradient Descent with Warm Restarts
Unsupervised Data Augmentation for Consistency Training
ZeRO: Memory optimizations Toward Training Trillion Parameter Models

Speech Recognition

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
A Comparative Study on Transformer vs RNN in Speech Applications
A Comparison of Sequence-to-Sequence Models for Speech Recognition
A deep Learning Approach to Automatic Characterisation of Rhythm in Non-native English Speech
A pitch extraction algorithm tuned for automatic speech recognition
A study on data augmentation of reverberant speech for robust speech recognition
A time delay neural network architecture for efficient modeling of long temporal contexts
Active Learning for Speech Recognition: the Power of Gradients
Active Learning for LF-MMI Trained Neural Networks in ASR
Adaptation Methods for Non-native Speech
Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition
Adversarial Multi-task Learning of Deep Neural Networks for Robust Speech Recognition
Adversarial Training for Multilingual Acoustic Modeling
An exploration of dropout with LSTMs
An overview of Automatic Speech Attribute Transcription (ASAT)
An overview of End-to-end Automatic Speech Recognition
Audio Augmentation for Speech Recognition
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Automatic Speech Recognition for Second Language Learning: How and Why It Actually Works
Automatic Speech Recognition of Multiple Accented English Data
Blank Collapse: Compressing CTC emission for the faster decoding
Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
Common Voice: A Massively-Multilingual Speech Corpus
Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System
Computer-Assisted Pronunciation Training from Pronunciation Scoring Towards Spoken Language Learning
Conformer: Convolution-augmented Transformer for Speech Recognition
Conmer: Streaming Conformer without self-attention for interactive voice assistants
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Continual Learning in Automatic Speech Recognition
Continual Learning Using Lattice-Free MMI for Speech Recognition
Coupled Training of Sequence-to-sequence Models for Accented Speech Recognition
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition
Data Augmentation for Deep Neural Network Acoustic Modeling
Data Augmentation Improves Recognition of Foreign Accented Speech
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
Deep Speech2: End-to-end Speech Recognition in English and Mandarin
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model
Domain Adversarial Training for Accented Speech Recognition
E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition
Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition
End-to-end Accented Speech Recognition
End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection
End-to-end Speech Recognition Using Lattice-free MMI
End-to-end Speech Recognition with Word-based RNN Language Models
English Conversational Telephone Speech Recognition by Humans and Machines
ESPnet: End-to-end Speech Processing Toolkit
ESPnet-ONNX: Bridging a Gap Between Research and Production
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit of Kaldi
Exploring Deep Learning Architectures for Automatically Grading Non-native Spontaneous Speech
Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition
FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
Far-Field Automatic Speech Recognition
Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
Full-duplex Speech-to-text System for Estonian
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Hyperparameter experiments on end-to-end automatic speech recognition
Improved Accented Speech Recognition using Accent Embeddings and Multi-task Learning
Improved end-of-query detection for streaming speech recognition
Improved Noisy Student Training for Automatic Speech Recognition
Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition
Intermediate Loss Regularization for CTC-based Speech Recognition
Introducing Attribute Features to Foreign Accent Recognition
Iterative Pseudo-Labeling for Speech Recognition
Japanese and Korean voice search
Jasper: An End-to-End Convolutional Neural Acoustic Model
JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning
Joint CTC-attention based end-to-end speech recognition using multi-task learning
Language identification with suprasegmental cues: A study based on speech resynthesis
Leveraging Native Language Information for Improved Accented Speech Recognition
Lhotse: a speech data representation library for the modern deep learning ecosystem
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Librispeech: An ASR Corpus Based on Public Domain Audio Books
Light Gated Recurrent Units for Speech Recognition
Listen, Attend and Spell
MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
Memory-Efficient Training of RNN-Transducer with Sampled Softmax
MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition
MLLR-based Accent Model Adaptation without Accented Data
Montreal Forced Aligner: trainable text-speech alignment using Kaldi
Multi-accent Speech Recognition with Hierarchical Grapheme Based Models
Multi-dialect Speech Recognition with A Single Sequence-to-sequence Model
Multi-task Learning for Speech Recognition: An overview
MUSAN: A Music, Speech, and Noise Corpus
NeMo: a toolkit for building AI applications using Neural Modules
Online Continual Learning of End-to-End Speech Recognition Models
Output-Gate Projected Gated Recurrent Unit for Speech Recognition
Purely sequence-trained neural networks for ASR based on lattice-free MMI
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
PyKaldi: A Python Wrapper for Kaldi
PyKaldi2: Yet Another Speech Toolkit Based on Kaldi and Pytorch
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
Recent Developments on Espnet Toolkit Boosted By Conformer
Robust Speech Recognition via Large-Scale Weak Supervision
Run-and-Back Stitch Search: Novel Block Synchronous Decoding For Streaming Encoder-Decoder ASR
Sequence Transduction with Recurrent Neural Networks
Some Commonly Used Speech Feature Extraction Algorithms
SpecAugment: A simple Data Augmentation Method for Automatic Speech Recognition
Specaugment on Large Scale Datasets
Speech Augmentation using Wavenet in Speech Recognition
Speech Recognition of Multiple Accented English Data Using Acoustic Model Interpolation
Speech recognition with weighted finite-state transducers
Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition
SpeechBrain: A General-Purpose Speech Toolkit
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
Spell My Name: Keyword Boosted Speech Recognition
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
Streaming Automatic Speech Recognition with the Transformer Models
Streaming End-to-end Speech Recognition for Mobile Devices
Streaming Transformer Asr With Blockwise Synchronous Beam Search
The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
The Kaldi Speech Recognition Toolkit
The Pytorch-kaldi Speech Recognition Toolkit
Towards Online End-to-end Transformer Automatic Speech Recognition
Towards Fast and Accurate Streaming End-To-End ASR
Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss
Using Accent-Specific Pronunciation Modelling for Robust Speech Recognition
VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
Vocal Tract Length Perturbation (VTLP) improves speech recognition
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
Wav2Letter++: A Fast Open-source Speech Recognition System
wav2vec: Unsupervised Pre-training for Speech Recognition
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Word Beam Search: A connectionist Temporal Classification Decoding Algorithm
저자원 환경의 음성인식을 위한 자기 주의를 활용한 음향 모델 학습 (Acoustic Model Training Using Self-Attention for Speech Recognition in Low-Resource Environments)

Speaker Recognition

A Survey on Neural Speech Synthesis
AutoSpeech: Neural Architecture Search for Speaker Recognition
Deep Learning Methods in Speaker Recognition: A Review
Deep speaker: an end-to-end neural speaker embedding system
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
End-to-End Neural Speaker Diarization with Self-Attention
End-to-end speaker segmentation for overlap-aware resegmentation
Generalized End-to-End Loss for Speaker Verification
Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms
In defence of metric learning for speaker recognition
Multi-scale Speaker Diarization with Dynamic Scale Weighting
Pushing the limits of raw waveform speaker recognition
RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification
Speaker Recognition Based on Deep Learning: An Overview
Speaker Recognition from Raw Waveform with SincNet
TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context
Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition

Speech Enhancement

A Fully Convolutional Neural Network for Speech Enhancement
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
FAST-RIR: Fast neural diffuse room impulse response generator
Improved Speech Enhancement with the Wave-U-Net
NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing
Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms
Real Time Speech Enhancement in the Waveform Domain
SEGAN: Speech Enhancement Generative Adversarial Network
Self-Attentive VAD: Context-Aware Detection of Voice from Noise
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results

Speech Synthesis

A review of deep learning based speech synthesis
Adversarial Audio Synthesis
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Deep Voice: Real-time Neural Text-to-Speech
Emotional Speech Synthesis With Rich And Granularized Control
FastSpeech: Fast, Robust and Controllable Text to Speech
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Natural tts synthesis by conditioning wavenet on mel spectrogram predictions
Neural Speech Synthesis with Transformer Network
Period VITS: Variational Inference with Explicit Pitch Modeling for End-To-End Emotional Speech Synthesis
Tacotron: Towards end-to-end speech synthesis
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network
Waveglow: A flow-based generative network for speech synthesis
Wavenet: A Generative Model for Raw Audio

Emotion Recognition

Multimodal Cross- and Self-Attention Network for Speech Emotion Recognition
Multimodal Emotion Recognition with High-level Speech and Text Features
Multimodal Speech Emotion Recognition and Ambiguity Resolution
Multimodal Speech Emotion Recognition Using Audio and Text
Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition
Tensor Fusion Network for Multimodal Sentiment Analysis

Voice Conversion

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Natural Language Processing

An algorithm for suffix stripping
Attention is all you need
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Big Bird: Transformers for Longer Sequences
BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text
Electra: Pre-training text encoders as discriminators rather than generators
FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling
Language Modeling with Deep Transformers
Language Models are Unsupervised Multitask Learners
Longformer: The Long-Document Transformer
Neural Machine Translation of Rare Words with Subword Units
Recent Trends in the Use of Deep Learning Models for Grammar Error Handling
RoBERTa: A Robustly Optimized BERT Pretraining Approach
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SimCSE: Simple Contrastive Learning of Sentence Embeddings
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformers: State-of-the-Art Natural Language Processing

Computer Vision

Albumentations: fast and flexible image augmentations
An image is worth 16x16 words: Transformers for image recognition at scale
Deep residual learning for image recognition
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Rich feature hierarchies for accurate object detection and semantic segmentation
Squeeze-and-excitation networks
U-Net: Convolutional Networks for Biomedical Image Segmentation
Very deep convolutional networks for large-scale image recognition
You Only Look Once: Unified, Real-Time Object detection

Reinforcement Learning

Neural architecture search with reinforcement learning
Playing Atari with Deep Reinforcement Learning

Linguistics

Accent as a Social Symbol
Calibrating rhythm: First language and second language studies
Control Methods Used in a Study of the Vowels
Correlates of linguistic rhythm in the speech signal
Durational Variability in Speech and the Rhythm Class Hypothesis
History of ESL Pronunciation Teaching
Intonation
Language Discrimination by Newborns: Toward an Understanding of the Role of Rhythm
Measures of Native and Non-Native Rhythm in a Quantity Language
On the distinction between 'stress-timed' and 'syllable-timed' languages
On the Historical Phonotactics of English
Relations between language rhythm and speech rate
Stress-timing and Syllable-timing Reanalyzed
Sound Change And Syllable Structure in Germanic Phonology
Speech rhythm across turn transitions in cross-cultural talk-in-interaction
The Environment for Open-syllable Lengthening in Middle English
The Historical Evolution of English Pronunciation
The Original ToBI System and the Evolution of the ToBI Framework
The Past, Present and Future of English Rhythm
Voice Onset Time (VOT) at 50: Theoretical and practical issues in measuring voicing distinctions
그림의 법칙: 연쇄 밀기 입장과 연쇄 당김 입장 (Grimm's Law: The Push-Chain View and the Drag-Chain View)
