# Day 1 - NB1 - Googling and finding relevant root resources


- [Introduction](#Introduction)
- [Step 1 - Gather possibly relevant resources](#Step-1---Gather-possibly-relevant-resources)
- [Introduction video 1: Intro to DL for Audio and Speech Applications](#Introduction-video-1:-Intro-to-DL-for-Audio-and-Speech-Applications)
- [Introduction video 2 (Stanford Seminar)](#Introduction-video-2-(Stanford-Seminar))


# Introduction

It is possible that in a few months I will start working on a healthtech project involving speech analysis. Since this field interests me in its own right and I'm not familiar with it, I have decided to start studying it on my own.

This repo is the result and the documentation of my studies as they go forward.

# Step 1 - Gather possibly relevant resources

These resources should allow me to start digging in the field, being the starting point of a tree of links and useful resources.

## Theoretical introductions: 
* <a href="https://www.youtube.com/watch?v=dBAn67ZKbZ4" >Introduction to Deep Learning for Audio and Speech Applications - YouTube</a>
* <a href="https://www.youtube.com/watch?v=RBgfLvAOrss">Stanford Seminar - Deep Learning in Speech Recognition - YouTube</a>
* <a href="https://en.wikipedia.org/wiki/Speech_recognition">Speech Recognition - Wikipedia</a>
* <a href="https://en.wikipedia.org/wiki/Speech_processing">Speech Processing - Wikipedia</a>


## Kaggle speech competitions

* <a href="https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/code?competitionId=7634&amp;sortBy=voteCount">TensorFlow Speech Recognition Challenge | Kaggle</a>
* <a href="https://www.kaggle.com/c/rfcx-species-audio-detection/overview" >Rainforest Connection Species Audio Detection | Kaggle</a>
* <a href="https://www.kaggle.com/c/birdclef-2021/code?competitionId=25954&amp;sortBy=voteCount" >BirdCLEF 2021 - Birdcall Identification | Kaggle</a>


## Other Kaggle resources
* <a href="https://www.kaggle.com/davids1992/speech-representation-and-data-exploration" >Speech representation and data exploration | Kaggle</a>
* <a href="https://www.kaggle.com/alexozerin/end-to-end-baseline-tf-estimator-lb-0-72">End-to-end baseline TF Estimator LB 0.72 | Kaggle</a>
* <a href="https://www.kaggle.com/nandhuelan/wav2vec-wandb-learning-audio-representation/data">Wav2vec+wandb- Learning audio representation 🔥🤗 | Kaggle</a>


## Datasets, models and papers
* <a href="https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html" >Google AI Blog: Launching the Speech Commands Dataset</a>
* <a href="https://huggingface.co/anton-l/wav2vec2-random-tiny-classifier" >anton-l/wav2vec2-random-tiny-classifier · Hugging Face</a>
* <a href="https://huggingface.co/docs/transformers/model_doc/wav2vec2">Wav2Vec2</a>
* <a href="https://arxiv.org/abs/2006.11477" >[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</a>


## More resources

* <a href="https://www.youtube.com/watch?v=Qf4YJcHXtcY">13. Speech Recognition with Convolutional Neural Networks in Keras/TensorFlow</a>
* <a href="https://www.youtube.com/watch?v=9GJ6XeB-vMg">Speech Recognition in Python</a>
* <a href="https://www.youtube.com/watch?v=qV4lR9EWGlY">Sound: Crash Course Physics #18</a>
* <a href="https://www.youtube.com/watch?v=PWVH3Vx3dCI">Speech Recognition Using Python | How Speech Recognition Works In Python | Simplilearn</a>

# Introduction video 1: Intro to DL for Audio and Speech Applications
**<a href="https://www.youtube.com/watch?v=dBAn67ZKbZ4" >Introduction to Deep Learning for Audio and Speech Applications - YouTube (2019)</a>**

## Sections 1 and 2

* <a href="https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html" >Google AI Blog: Launching the Speech Commands Dataset</a>
* Preprocessing is common in audio DNNs: the raw signal is converted with a time-frequency transform into a spectrogram. Since a spectrogram can be treated like an image, CNNs are used in these cases.

![](imgs/cnn.png)

* Other popular nets are LSTMs. In this case, feature extraction is different: time-varying features are extracted.
* Spectrogram. MEL Spectrogram.
* Paper: CNNs for small-footprint Keyword Spotting (2015).
* Automate labelling: use someone else's speech-to-text model
* Tools for manual labelling
* Data augmentation. For example, for keyword spotting:
  * Accelerate, change pitch, change volume, environments (echo), backgrounds
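
A minimal sketch of these augmentations using only NumPy (my own illustration, not code from the video). The function names, the synthetic 440 Hz clip, and all parameter values are assumptions chosen for the example. The speed change here is naive resampling, so it also shifts the pitch; changing pitch and speed independently needs a more elaborate method.

```python
import numpy as np

def change_volume(signal, gain):
    """Scale the amplitude by a linear gain factor."""
    return signal * gain

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip.
    This naive approach also raises the pitch."""
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

def add_echo(signal, sample_rate, delay_s=0.1, decay=0.5):
    """Add one delayed, attenuated copy of the signal to itself."""
    delay = int(delay_s * sample_rate)
    out = signal.copy()
    out[delay:] += decay * signal[:len(signal) - delay]
    return out

def add_background(signal, noise, snr_db):
    """Mix in background noise at a target signal-to-noise ratio (in dB)."""
    noise = noise[:len(signal)]
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = 0.5 * np.sin(2 * np.pi * 440 * t)      # 1 s synthetic stand-in for a keyword
rng = np.random.default_rng(0)
noise = rng.standard_normal(sample_rate)      # stand-in for a background recording

louder = change_volume(clip, 2.0)
faster = change_speed(clip, 1.25)
echoed = add_echo(clip, sample_rate)
noisy = add_background(clip, noise, snr_db=10)
```

Each function returns a new array, so several augmentations can be chained on the same clip to multiply the training examples.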

## Common algorithms for feature extraction, time-frequency transforms, and segmentation:

Feature extraction: (produce lower-rate time-varying metrics)
* MFCC - Mel Frequency Cepstral Coefficients
* GTCC - Gammatone Cepstral Coefficients
* Pitch and Harmonicity
* 11 Spectral Descriptors
* Wavelet scattering

Transformations:
* Mel-Spaced Spectrogram
* Gammatone and Octave Filter Banks
* Modified Discrete Cosine Transform

Segmentation:
* **Voice Activity and Speech Detection**

### Mel Spectrogram (and MFCC)

* NFFT
  * FFT = Fast Fourier Transform 
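  * N stands for the number of points (samples) used in each FFT, i.e. the FFT length
  * Not to be confused with the [Nonequispaced fast Fourier transform](https://www-user.tu-chemnitz.de/~potts/nfft/), a different algorithm that shares the abbreviation
  * FTs transform a wave into the "frequency domain" (must dig into FFT)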
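
A small sketch of what "transforming a wave into the frequency domain" looks like in practice, using NumPy's FFT (my own example, not from the video). The sample rate, the FFT length of 512, and the 1 kHz test tone are assumptions picked so the numbers come out round.

```python
import numpy as np

sample_rate = 16000      # samples per second (assumed)
nfft = 512               # FFT length: number of samples per transform
t = np.arange(nfft) / sample_rate
wave = np.sin(2 * np.pi * 1000 * t)    # 1 kHz tone in the time domain

spectrum = np.abs(np.fft.rfft(wave))               # magnitude per frequency bin
freqs = np.fft.rfftfreq(nfft, d=1 / sample_rate)   # frequency of each bin in Hz

resolution = sample_rate / nfft          # spacing between bins: 31.25 Hz
peak_freq = freqs[np.argmax(spectrum)]   # frequency of the strongest bin

print(len(spectrum), resolution, peak_freq)
```

A real-valued input of length `nfft` yields `nfft // 2 + 1` bins, and the strongest one sits at the tone's frequency. A larger `nfft` gives finer frequency resolution at the cost of coarser time resolution.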
  
![](imgs/spectrogram-and-mel-filterbank.png)

* This is applying FFT to the signal
* Frequency resolution is a parameter
* Enough resolution at lower frequencies (important for humans) means having too much information for higher frequencies. Redundant data.
* Mel Spectrogram - We project the raw FFT points into a series of predefined buckets (bands) that are representative of human perception. This is the Mel filterbank. Quasi-logarithmic y-axis. Only 40 bands. Manageably sized spectrogram.
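
To make the projection into bands concrete, here is a sketch of a triangular Mel filterbank built with NumPy (my own illustration). It assumes the common formula `mel = 2595 * log10(1 + hz / 700)`, which is one convention among several, and the sample rate, FFT length, and random test frame are placeholders for real audio.

```python
import numpy as np

def hz_to_mel(hz):
    # One common Hz-to-mel convention; others exist
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_bands, nfft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.
    Returns an array of shape (n_bands, nfft // 2 + 1)."""
    # n_bands + 2 edges equally spaced in mel, converted back to Hz
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_bands + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.fft.rfftfreq(nfft, d=1 / sample_rate)
    bank = np.zeros((n_bands, len(fft_freqs)))
    for i in range(n_bands):
        lo, centre, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

sample_rate, nfft, n_bands = 16000, 512, 40
bank = mel_filterbank(n_bands, nfft, sample_rate)

# Power spectrum of one windowed frame (random data stands in for audio)
rng = np.random.default_rng(0)
frame = rng.standard_normal(nfft) * np.hanning(nfft)
power = np.abs(np.fft.rfft(frame)) ** 2    # 257 raw FFT points

mel_energies = bank @ power                # reduced to 40 mel bands
```

The filters are narrow at low frequencies and wide at high frequencies, which is how 257 FFT points shrink to 40 bands without losing the detail that matters most to human hearing. Repeating this for every frame of a signal gives the Mel spectrogram.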

### Mel Frequency and Gammatone Cepstral Coefficients (MFCC / GTCC)

* MFCC = Uses Mel Filterbank to put info into predefined bands
* GTCC = Uses Gammatone Filterbank for the same purpose. The filterbank is more complex and accurate.

* Both apply further calculations on top of the filtered spectrogram, plus a log-energy calculation. From the picture, the output seems to be two numbers or two arrays.

* DCTs (Discrete Cosine Transforms) "are like real FFTs".

![](imgs/mfcc.png)

* [Cepstrum - Wikipedia](https://en.wikipedia.org/wiki/Cepstrum)
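
A sketch of the last steps of an MFCC computation, assuming the pipeline in the picture is the usual one: take the log of the filterbank energies, apply a DCT, and keep the first few coefficients. The hand-written DCT-II, the choice of 13 coefficients, and the random input energies are my assumptions for illustration.

```python
import numpy as np

def dct_ii(x):
    """Unnormalised DCT-II of a 1-D array (real input, real output)."""
    n = len(x)
    k = np.arange(n)[:, None]      # output (coefficient) index
    m = np.arange(n)[None, :]      # input (band) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x

def mfcc_from_mel_energies(mel_energies, n_coeffs=13):
    """Log-compress the band energies, then decorrelate them with a DCT."""
    log_energies = np.log(mel_energies + 1e-10)   # small offset avoids log(0)
    return dct_ii(log_energies)[:n_coeffs]

rng = np.random.default_rng(0)
mel_energies = rng.uniform(0.1, 1.0, size=40)   # stand-in for one frame of 40 band energies
coeffs = mfcc_from_mel_energies(mel_energies)
```

The DCT uses only cosines, so real input gives real output, which is one way to read the remark that DCTs "are like real FFTs". For GTCC the same steps would apply on top of a Gammatone filterbank instead of the Mel one.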


#### Example tasks:
Classification
* Speech command recognition
* Acoustic scene recognition
* Classify gender

Regression or seq-to-seq:
* Cocktail party source separation
* Denoising
* Voice activity detection

Features: Mel Spectrogram, [STFT](https://en.wikipedia.org/wiki/Short-time_Fourier_transform), Time-varying features (Spectral descriptors, harmonic ratio)

Architectures: CNNs, FCs, and LSTMs.

# Introduction video 2 (Stanford Seminar)
**[Stanford Seminar - Deep Learning in Speech Recognition (30 Nov 2017)](https://youtu.be/RBgfLvAOrss?t=1799)**

* Speech Recognition starts at minute 30
* ASR = Automatic Speech Recognition

## History

![](imgs/lm.png)

A language model acts as a Bayesian prior!!

Does it say "Colloquium" or "Collaquium"? It's not word-level though, W is a sequence, not a word.

The lambda acts as a weight that balances the acoustic model against the language model.
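
A toy sketch of that weighting, assuming the slide shows the usual combination $\hat{W} = \arg\max_W \log P(X \mid W) + \lambda \log P(W)$. The two candidate sequences and all probabilities below are invented for illustration.

```python
import math

def combined_score(acoustic_logprob, lm_logprob, lam):
    """Acoustic log-likelihood plus the lambda-weighted language model log-prior."""
    return acoustic_logprob + lam * lm_logprob

# Hypothetical scores: the acoustic model slightly prefers the misspelt
# sequence, while the language model considers it very unlikely.
candidates = {
    "the colloquium starts at noon": {"acoustic": math.log(0.40), "lm": math.log(1e-2)},
    "the collaquium starts at noon": {"acoustic": math.log(0.45), "lm": math.log(1e-5)},
}

def best_sequence(candidates, lam):
    """Return the word sequence W with the highest combined score."""
    return max(
        candidates,
        key=lambda w: combined_score(candidates[w]["acoustic"], candidates[w]["lm"], lam),
    )

acoustic_only = best_sequence(candidates, lam=0.0)   # ignores the language model
with_prior = best_sequence(candidates, lam=1.0)      # full Bayesian prior
```

With `lam=0.0` the acoustically closer "collaquium" sequence wins; with `lam=1.0` the language model prior tips the decision to "colloquium".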

For the acoustic model, HMMs. The DRAGON system ('70s).
Neural Networks in the '90s. Hinton. Phoneme Recognition Using Time-Delay Neural Networks.

Historical datasets:
* Resource Management ('80s) - Naval
* ATIS
* WSJ
* Conversational Speech

2006 - Hinton - End of Winter
* Reducing the Dimensionality of Data with NNs
* Deep Boltzmann Machines

Autoencoders

### DL in Speech Recognition

* HMMs still here
* Criterion: Cross-entropy $\rightarrow$ MMI Sequence-level
  * _Did I get the sentence right?_
* Features: MFCC $\rightarrow$ Filter banks
* Batch norm, Distributed SGD, Dropout
* Acoustic Modeling: CNN / CTC (Connectionist Temporal Classification) / CLDNN (LSTMs)
* LMs: RNNs (As the ex president would say: "Things happened")
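
As a note to self on CTC, a sketch of the collapsing rule applied to the per-frame output of a CTC-trained acoustic model: merge consecutive repeated labels, then drop the blank symbol. This is my own illustration rather than something shown in the talk, and the `_` blank symbol and the example frame labels are assumptions.

```python
from itertools import groupby

BLANK = "_"   # assumed symbol for the CTC blank label

def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)

# One label per audio frame; the blank between the two "l" runs
# is what keeps the double letter in the output.
print(ctc_collapse("hh_e_ll_llo__"))   # prints "hello"
```

This only covers the decoding side; training sums the probability over all frame-level labelings that collapse to the target sequence.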

* Task: Speech synthesis (the voice of Siri)
   * Prosody generation. Prosodic. Spectral. 
   * TTS = Text-to-speech
   * Viterbi