Text-to-Speech Synthesis by Paul Taylor http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.5905&rep=rep1&type=pdf
Experimental and theoretical advances in prosody: A review https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3216045/
Intonational Phonology by Ladd https://books.google.de/books?id=ys_jtGM5WjYC&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false
Adversarial Autoencoders https://arxiv.org/pdf/1511.05644.pdf
https://github.com/Naresh1318/Adversarial_Autoencoder
https://www.cl.uni-heidelberg.de/courses/ws14/deepl/BengioETAL12.pdf
IEMOCAP pdf https://sail.usc.edu/iemocap/Busso_2008_iemocap.pdf
Audio Google papers https://google.github.io/tacotron/

| paper | status | link/tag |
|---|---|---|
| Tacotron: Towards End-to-End Speech Synthesis | finished | https://arxiv.org/pdf/1703.10135.pdf |
| Uncovering Latent Style Factors for Expressive Speech Synthesis | finished | https://arxiv.org/pdf/1711.00520.pdf |
| Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions | finished | https://arxiv.org/pdf/1712.05884.pdf https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html |
| Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | finished | https://arxiv.org/pdf/1803.09047.pdf https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html |
| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | finished | https://arxiv.org/pdf/1803.09017.pdf https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html |

| paper | status | link/tag |
|---|---|---|
| Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis | ICASSP 2019, finished | https://arxiv.org/pdf/1812.04342.pdf http://home.ustc.edu.cn/~zyj008/ICASSP2019/ |
| Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis | finished | https://arxiv.org/pdf/1808.01410.pdf |
| Hierarchical Generative Modeling for Controllable Speech Synthesis | finished | https://arxiv.org/pdf/1810.07217.pdf |
| Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization | finished | https://openreview.net/pdf?id=Bkg9ZeBB37 |
| Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis | can use this | https://goo.gl/Jy8WvF |
| Neural Discrete Representation Learning | read again for clarity | https://arxiv.org/pdf/1711.00937.pdf |
| A Style Control Technique for HMM-Based Speech Synthesis | can't be extended | https://goo.gl/Y9caHX |
| Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data | similar work done by Google | https://arxiv.org/pdf/1709.07902.pdf |
| Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder | Better implementations are present | https://arxiv.org/pdf/1804.02135.pdf |
| A Comparison of Expressive Speech Synthesis Approaches based on Neural Network | great paper. can be used | http://lxie.npu-aslp.org/papers/2018ASMMC-XLM.pdf |
| Investigating context features hidden in End-to-End TTS | good read but not relevant | https://arxiv.org/pdf/1811.01376.pdf |
| Improving Unsupervised Style Transfer in end-to-end Speech Synthesis with end-to-end Speech Recognition | finished | http://speech.ee.ntu.edu.tw/~tlkagk/paper/asr-guided-tacotron.pdf |
| Speech, Prosody, and Machines: Nine Challenges for Prosody Research | reread for the literature review, not for the approach | https://www.isca-speech.org/archive/SpeechProsody_2018/pdfs/_Inv-5.pdf |
| Learning Latent Representations for Speech Generation and Transformation | finished | https://arxiv.org/pdf/1704.04222.pdf |
| Disentangled sequential autoencoder | finished | https://arxiv.org/pdf/1803.02991.pdf |
| Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis | finished | https://arxiv.org/pdf/1807.11470.pdf |
| Feature Based Adaptation for Speaking Style Synthesis | not a great paper wrt my rs view | https://goo.gl/f95mGb |
| Neural TTS Stylization with Adversarial and Collaborative Games (TTS GAN) | ICLR 2019 | https://openreview.net/pdf?id=ByzcS3AcYX https://researchdemopage.wixsite.com/tts-gan |
| Robust and Fine-Grained Prosody Control of End-to-End Speech Synthesis | ICASSP 2019 | https://arxiv.org/pdf/1811.02122.pdf http://neosapience.com/en/research/2018-10-29-icassp/ |

| paper | status | link/tag |
|---|---|---|
| A Comparison of Expressive Speech Synthesis Approaches based on Neural Network | great paper. can be used | http://lxie.npu-aslp.org/papers/2018ASMMC-XLM.pdf |
| Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech | finished | https://arxiv.org/pdf/1706.00612.pdf |
| Emotional Statistical Parametric Speech Synthesis Using LSTM-RNNs | finished | https://ieeexplore.ieee.org/document/8282282 |
| An Investigation to Transplant Emotional Expressions in DNN-based TTS Synthesis | can be used with paper 1 | https://ieeexplore.ieee.org/document/8282231 |
| Unsupervised Clustering of Emotion and Voice Styles for Expressive TTS | finished | https://ieeexplore.ieee.org/document/6288797 |
| A DNN-based emotional speech synthesis by speaker adaptation | similar to other paper | http://www.apsipa.org/proceedings/2018/pdfs/0000633.pdf |
| Speaker Representations for Speaker Adaptation in Multiple Speakers BLSTM-RNN-based Speech Synthesis | not a great paper wrt my rs view | https://goo.gl/LynbNz |
| Emotional transplant in statistical speech synthesis based on emotion additive model | finished | https://www.isca-speech.org/archive/interspeech_2015/papers/i15_0274.pdf |
| Emotional End-to-End Neural Speech synthesizer | finished | https://arxiv.org/pdf/1711.05447.pdf |

| paper | status | link/tag |
|---|---|---|
| VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop | not important | https://arxiv.org/pdf/1707.06588.pdf |
| Char2Wav: End-to-End Speech Synthesis | not important | https://mila.quebec/wp-content/uploads/2017/02/end-end-speech.pdf |
| Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning | not important | https://arxiv.org/pdf/1710.07654.pdf |

PhD thesis: Unsupervised Learning for Expressive Speech Synthesis http://veu.talp.cat/igor/PhD_Igor_Jauk-June2017.pdf
MSc thesis: Reproduction & Improvement of State-of-art TTS model https://github.com/FeiCoding/State_of_the_art_tacotron2_model_reproduction