Paper

A list of the papers I have read.

Paper list

Machine Learning & Deep Learning

  • A Comparative Survey of Deep Active Learning
  • A Survey of Deep Active Learning
  • A Survey on Deep Transfer Learning
  • Active Learning for Convolutional Neural Networks: A Core-Set Approach
  • An overview of Multi-Task Learning in Deep Neural Networks
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
  • Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
  • Deep Networks with Stochastic Depth
  • Distilling the Knowledge in a Neural Network
  • Domain-Adversarial Training of Neural Networks
  • Gaussian Error Linear Units (GELUs)
  • Generative Adversarial Nets
  • Gradient Episodic Memory for Continual Learning
  • Group Normalization
  • Layer Normalization
  • Learning Loss for Active Learning
  • Learning with Pseudo-Ensembles
  • Learning without Forgetting
  • Long Short-Term Memory
  • Mixed Precision Training
  • Monotonic Chunkwise Attention
  • Online and Linear-Time Attention by Enforcing Monotonic Alignments
  • Overcoming catastrophic forgetting in neural networks
  • Progressive Neural Networks
  • Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
  • Representation Learning with Contrastive Predictive Coding
  • Searching for Activation Functions
  • SGDR: Stochastic Gradient Descent with Warm Restarts
  • Unsupervised Data Augmentation for Consistency Training
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Speech Recognition

  • A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
  • A Comparative Study on Transformer vs RNN in Speech Applications
  • A Comparison of Sequence-to-Sequence Models for Speech Recognition
  • A Deep Learning Approach to Automatic Characterisation of Rhythm in Non-native English Speech
  • A pitch extraction algorithm tuned for automatic speech recognition
  • A study on data augmentation of reverberant speech for robust speech recognition
  • A time delay neural network architecture for efficient modeling of long temporal contexts
  • Active Learning for LF-MMI Trained Neural Networks in ASR
  • Active Learning for Speech Recognition: the Power of Gradients
  • Adaptation Methods for Non-native Speech
  • Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition
  • Adversarial Multi-task Learning of Deep Neural Networks for Robust Speech Recognition
  • Adversarial Training for Multilingual Acoustic Modeling
  • An exploration of dropout with LSTMs
  • An overview of Automatic Speech Attribute Transcription (ASAT)
  • An overview of End-to-end Automatic Speech Recognition
  • Audio Augmentation for Speech Recognition
  • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  • Automatic Speech Recognition for Second Language Learning: How and Why It Actually Works
  • Automatic Speech Recognition of Multiple Accented English Data
  • Blank Collapse: Compressing CTC emission for the faster decoding
  • Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
  • ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
  • Common Voice: A Massively-Multilingual Speech Corpus
  • Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System
  • Computer-Assisted Pronunciation Training from Pronunciation Scoring Towards Spoken Language Learning
  • Conformer: Convolution-augmented Transformer for Speech Recognition
  • Conmer: Streaming Conformer without self-attention for interactive voice assistants
  • Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
  • ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
  • Continual Learning in Automatic Speech Recognition
  • Continual Learning Using Lattice-Free MMI for Speech Recognition
  • Coupled Training of Sequence-to-sequence Models for Accented Speech Recognition
  • CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition
  • Data Augmentation for Deep Neural Network Acoustic Modeling
  • Data Augmentation Improves Recognition of Foreign Accented Speech
  • Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
  • Deep Speech2: End-to-end Speech Recognition in English and Mandarin
  • Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
  • DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model
  • Domain Adversarial Training for Accented Speech Recognition
  • E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
  • E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
  • E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
  • Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition
  • Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition
  • End-to-end Accented Speech Recognition
  • End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection
  • End-to-end Speech Recognition Using Lattice-free MMI
  • End-to-end Speech Recognition with Word-based RNN Language Models
  • English Conversational Telephone Speech Recognition by Humans and Machines
  • ESPnet: End-to-end Speech Processing Toolkit
  • ESPnet-ONNX: Bridging a Gap Between Research and Production
  • Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
  • ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit of Kaldi
  • Exploring Deep Learning Architectures for Automatically Grading Non-native Spontaneous Speech
  • Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition
  • FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
  • Far-Field Automatic Speech Recognition
  • Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
  • FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
  • Full-duplex Speech-to-text System for Estonian
  • Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
  • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
  • Hyperparameter experiments on end-to-end automatic speech recognition
  • Improved Accented Speech Recognition using Accent Embeddings and Multi-task Learning
  • Improved end-of-query detection for streaming speech recognition
  • Improved Noisy Student Training for Automatic Speech Recognition
  • Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition
  • Intermediate Loss Regularization for CTC-based Speech Recognition
  • Introducing Attribute Features to Foreign Accent Recognition
  • Iterative Pseudo-Labeling for Speech Recognition
  • Japanese and Korean voice search
  • Jasper: An End-to-End Convolutional Neural Acoustic Model
  • JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning
  • Joint CTC-attention based end-to-end speech recognition using multi-task learning
  • Language identification with suprasegmental cues: A study based on speech resynthesis
  • Leveraging Native Language Information for Improved Accented Speech Recognition
  • Lhotse: a speech data representation library for the modern deep learning ecosystem
  • Libri-Light: A Benchmark for ASR with Limited or No Supervision
  • Librispeech: An ASR Corpus Based on Public Domain Audio Books
  • Light Gated Recurrent Units for Speech Recognition
  • Listen, Attend and Spell
  • MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
  • Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
  • Memory-Efficient Training of RNN-Transducer with Sampled Softmax
  • MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition
  • MLLR-based Accent Model Adaptation without Accented Data
  • Montreal Forced Aligner: trainable text-speech alignment using Kaldi
  • Multi-accent Speech Recognition with Hierarchical Grapheme Based Models
  • Multi-dialect Speech Recognition with A Single Sequence-to-sequence Model
  • Multi-task Learning for Speech Recognition: An overview
  • MUSAN: A Music, Speech, and Noise Corpus
  • NeMo: a toolkit for building AI applications using Neural Modules
  • Online Continual Learning of End-to-End Speech Recognition Models
  • Output-Gate Projected Gated Recurrent Unit for Speech Recognition
  • Purely sequence-trained neural networks for ASR based on lattice-free MMI
  • Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
  • PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
  • PyKaldi: A Python Wrapper for Kaldi
  • PyKaldi2: Yet Another Speech Toolkit Based on Kaldi and PyTorch
  • QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
  • Recent Developments on ESPnet Toolkit Boosted by Conformer
  • Robust Speech Recognition via Large-Scale Weak Supervision
  • Run-and-Back Stitch Search: Novel Block Synchronous Decoding For Streaming Encoder-Decoder ASR
  • Sequence Transduction with Recurrent Neural Networks
  • Some Commonly Used Speech Feature Extraction Algorithms
  • SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
  • SpecAugment on Large Scale Datasets
  • Speech Augmentation using Wavenet in Speech Recognition
  • Speech Recognition of Multiple Accented English Data Using Acoustic Model Interpolation
  • Speech recognition with weighted finite-state transducers
  • Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition
  • SpeechBrain: A General-Purpose Speech Toolkit
  • SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
  • Spell My Name: Keyword Boosted Speech Recognition
  • Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
  • Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
  • Streaming Automatic Speech Recognition with the Transformer Model
  • Streaming End-to-end Speech Recognition for Mobile Devices
  • Streaming Transformer ASR with Blockwise Synchronous Beam Search
  • The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
  • The Kaldi Speech Recognition Toolkit
  • The PyTorch-Kaldi Speech Recognition Toolkit
  • Towards Fast and Accurate Streaming End-To-End ASR
  • Towards Online End-to-end Transformer Automatic Speech Recognition
  • Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
  • Using Accent-Specific Pronunciation Modelling for Robust Speech Recognition
  • VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
  • Vocal Tract Length Perturbation (VTLP) improves speech recognition
  • vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
  • W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
  • Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
  • Wav2Letter++: A Fast Open-source Speech Recognition System
  • wav2vec: Unsupervised Pre-training for Speech Recognition
  • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
  • Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm
  • Acoustic Model Training Using Self-Attention for Low-Resource Speech Recognition (저자원 환경의 음성인식을 위한 자기 주의를 활용한 음향 모델 학습)

Speaker Recognition

  • AutoSpeech: Neural Architecture Search for Speaker Recognition
  • Deep Learning Methods in Speaker Recognition: A Review
  • Deep speaker: an end-to-end neural speaker embedding system
  • ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
  • End-to-End Neural Speaker Diarization with Permutation-Free Objectives
  • End-to-End Neural Speaker Diarization with Self-Attention
  • End-to-end speaker segmentation for overlap-aware resegmentation
  • Generalized End-to-End Loss for Speaker Verification
  • Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms
  • In defence of metric learning for speaker recognition
  • Multi-scale Speaker Diarization with Dynamic Scale Weighting
  • Pushing the limits of raw waveform speaker recognition
  • RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification
  • Speaker Recognition Based on Deep Learning: An Overview
  • Speaker Recognition from Raw Waveform with SincNet
  • TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context
  • Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition

Speech Enhancement

  • A Fully Convolutional Neural Network for Speech Enhancement
  • Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
  • FAST-RIR: Fast neural diffuse room impulse response generator
  • Improved Speech Enhancement with the Wave-U-Net
  • NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing
  • Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms
  • Real Time Speech Enhancement in the Waveform Domain
  • SEGAN: Speech Enhancement Generative Adversarial Network
  • Self-Attentive VAD: Context-Aware Detection of Voice from Noise
  • The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results

Speech Synthesis

  • A review of deep learning based speech synthesis
  • A Survey on Neural Speech Synthesis
  • Adversarial Audio Synthesis
  • Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
  • Deep Voice: Real-time Neural Text-to-Speech
  • Emotional Speech Synthesis With Rich And Granularized Control
  • FastSpeech: Fast, Robust and Controllable Text to Speech
  • Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
  • HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
  • Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions
  • Neural Speech Synthesis with Transformer Network
  • Period VITS: Variational Inference with Explicit Pitch Modeling for End-To-End Emotional Speech Synthesis
  • Tacotron: Towards end-to-end speech synthesis
  • VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network
  • WaveGlow: A Flow-based Generative Network for Speech Synthesis
  • WaveNet: A Generative Model for Raw Audio
  • YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone

Emotion Recognition

  • Multimodal Cross- and Self-Attention Network for Speech Emotion Recognition
  • Multimodal Emotion Recognition with High-level Speech and Text Features
  • Multimodal Speech Emotion Recognition and Ambiguity Resolution
  • Multimodal Speech Emotion Recognition Using Audio and Text
  • Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
  • Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition
  • Tensor Fusion Network for Multimodal Sentiment Analysis

Voice Conversion

  • AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Natural Language Processing

  • An algorithm for suffix stripping
  • Attention is all you need
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • Big Bird: Transformers for Longer Sequences
  • BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  • FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling
  • Language Modeling with Deep Transformers
  • Language Models are Unsupervised Multitask Learners
  • Longformer: The Long-Document Transformer
  • Neural Machine Translation of Rare Words with Subword Units
  • Recent Trends in the Use of Deep Learning Models for Grammar Error Handling
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  • SimCSE: Simple Contrastive Learning of Sentence Embeddings
  • Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
  • Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers: State-of-the-Art Natural Language Processing

Computer Vision

  • Albumentations: fast and flexible image augmentations
  • An image is worth 16x16 words: Transformers for image recognition at scale
  • Deep residual learning for image recognition
  • EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
  • Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  • Rich feature hierarchies for accurate object detection and semantic segmentation
  • Squeeze-and-excitation networks
  • U-Net: Convolutional Networks for Biomedical Image Segmentation
  • Very deep convolutional networks for large-scale image recognition
  • You Only Look Once: Unified, Real-Time Object Detection

Reinforcement Learning

  • Neural architecture search with reinforcement learning
  • Playing Atari with Deep Reinforcement Learning

Linguistics

  • Accent as a Social Symbol
  • Calibrating rhythm: First language and second language studies
  • Control Methods Used in a Study of the Vowels
  • Correlates of linguistic rhythm in the speech signal
  • Durational Variability in Speech and the Rhythm Class Hypothesis
  • History of ESL Pronunciation Teaching
  • Intonation
  • Language Discrimination by Newborns: Toward an Understanding of the Role of Rhythm
  • Measures of Native and Non-Native Rhythm in a Quantity Language
  • On the distinction between 'stress-timed' and 'syllable-timed' languages
  • On the Historical Phonotactics of English
  • Relations between language rhythm and speech rate
  • Sound Change and Syllable Structure in Germanic Phonology
  • Speech rhythm across turn transitions in cross-cultural talk-in-interaction
  • Stress-timing and Syllable-timing Reanalyzed
  • The Environment for Open-syllable Lengthening in Middle English
  • The Historical Evolution of English Pronunciation
  • The Original ToBI System and the Evolution of the ToBI Framework
  • The Past, Present and Future of English Rhythm
  • Voice Onset Time (VOT) at 50: Theoretical and practical issues in measuring voicing distinctions
  • Grimm's Law: The Push-Chain View versus the Drag-Chain View (그림의 법칙: 연쇄 밀기 입장과 연쇄 당김 입장)