# End-to-End Time Domain Audio with fastai

`fastaudio` focuses on spectrograms, and `fastai` use cases tend to focus on classification. We need to go beyond both, so we'll focus on two things:

1. Autoregressive prediction in the time domain. We'll use an LSTM, essentially adapting the language model lessons.

2. Audio-to-audio processing/translation (e.g. audio effects). We'll use stacked 1D convolutions, as in a U-Net.

(You probably noticed already that task #1 is a special case of task #2: translating the input to the same audio shifted ahead by one sample.)
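To make the "shifted ahead by one sample" idea concrete, here is a minimal sketch in plain Python (lists rather than tensors, purely for illustration). The function name `next_sample_pairs` and the tiny stereo clip are made up for this example; they are not part of fastai or fastaudio.

```python
def next_sample_pairs(frames, shift=1):
    """Return (inputs, targets), where targets are `frames` moved ahead by `shift` samples.

    `frames` is a list of per-time-step tuples, one value per audio channel.
    (Hypothetical helper, for illustration only.)
    """
    if not 0 < shift < len(frames):
        raise ValueError("shift must be positive and smaller than the clip length")
    inputs = frames[:-shift]   # everything except the last `shift` frames
    targets = frames[shift:]   # the same audio, `shift` samples later
    return inputs, targets

# A made-up stereo clip: 5 time steps, 2 channels each.
clip = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5), (0.6, 0.7), (0.8, 0.9)]
inputs, targets = next_sample_pairs(clip)
```

Each `targets[i]` is the frame that follows `inputs[i]`, so an audio-to-audio model trained on these pairs is doing next-sample prediction.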

#### "How many channels of audio are we going to use?"
That's up to the dataset! We're certainly not going to assume that it's just mono.

#### "What other fastai datatypes/projects are relevant?"
Three packages are relevant for sequence modeling:

1. `fastaudio`, as we mentioned, is only for spectrogram classification. Its `AudioBlock` builds batches from entire audio files, which then get converted to spectrograms. Instead, we want to progressively grab sequences of audio samples as (uniform-length) chunks.

2. The [Time Series Prediction](https://timeseriesai.github.io/tsai/) package is relevant, but the only time series output it seems to support is ["univariate forecasting"](https://timeseriesai.github.io/tsai/#Univariate-Forecasting). Nope.

3. Language Modeling, e.g. Chapters [10](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb) and [12](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb) from fastbook. Yeah, that's the closest. We can treat the audio samples as if they were word vectors/embeddings: just make the tokenizer and numericalize methods no-ops (or we could use mu-law encoding). The nice thing is that the dimensionality of the embeddings is just the number of audio channels you have.
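The uniform-length chunking from item 1 and the mu-law option from item 3 can be sketched in plain Python as follows. This is an illustration under assumptions, not how fastaudio or fastai do it: the helper names (`chunk_frames`, `mu_law_encode`, `mu_law_decode`), the `hop` parameter, and the sample clip are all hypothetical, and `mu=255` is just the common 8-bit choice. For tensors, `torchaudio.transforms.MuLawEncoding` provides the encoding step.

```python
import math

def chunk_frames(frames, chunk_len, hop=None):
    """Split a clip into uniform-length chunks; a trailing partial chunk is dropped."""
    hop = hop or chunk_len  # hop == chunk_len means the chunks do not overlap
    return [frames[i:i + chunk_len]
            for i in range(0, len(frames) - chunk_len + 1, hop)]

def mu_law_encode(x, mu=255):
    """Compand a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)

def mu_law_decode(q, mu=255):
    """Invert mu_law_encode, up to quantization error."""
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A made-up stereo clip: each frame holds one value per channel.
clip = [(0.0, 0.5), (-1.0, 1.0), (0.25, -0.25), (0.1, 0.2), (0.3, 0.4)]
chunks = chunk_frames(clip, chunk_len=2)
# Encode every channel of every frame; the "embedding" size stays equal to the channel count.
codes = [[tuple(mu_law_encode(s) for s in frame) for frame in chunk]
         for chunk in chunks]
```

With the no-op route you would feed `chunks` to the model directly as floats; with mu-law you would feed `codes` instead and treat each integer like a token ID.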
   
We'll use *some* of fastaudio, but we'll also liberally rewrite/overwrite whatever we want. We'll start with their imports.

In [None]:
#hide
from nbdev.showdoc import *

In [None]:
#all_slow

In [None]:
from fastai.vision.all import *
from fastai.text.all import *
import torch
import torchaudio
import torchaudio.functional as F  # note: shadows the torch.nn.functional `F` that fastai's star imports provide
import torchaudio.transforms as T
from IPython.display import Audio 

from fastproaudio.core import *

use_fastaudio = False
if use_fastaudio:
    from fastaudio.core.all import *
    from fastaudio.augment.all import *
    from fastaudio.ci import skip_if_ci

As for data, we could just use the fastaudio "speakers" dataset for now:

In [None]:
#URLs.SPEAKERS10 = "http://www.openslr.org/resources/45/ST-AEDS-20180100_1-OS.tgz"
#path = untar_data(URLs.SPEAKERS10)

...and then we could learn some kind of inverse effect such as denoising: add noise to the audio files, then train the network to remove it.
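As a rough sketch of the noise-adding half of that idea, the snippet below builds a (noisy input, clean target) pair at a chosen signal-to-noise ratio. It is plain Python and mono for brevity; the function name `add_noise`, the `snr_db` parameter, and the synthetic sine "clean" signal are assumptions made for this example, not anything from fastaudio.

```python
import math
import random

def add_noise(samples, snr_db, rng=None):
    """Return `samples` plus white Gaussian noise scaled to the requested SNR (in dB).

    (Hypothetical helper, for illustration only.)
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(samples, noise)]

# A made-up mono "clean" signal: 1000 samples of a 440 Hz sine at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1000)]
noisy = add_noise(clean, snr_db=20, rng=random.Random(0))
```

The network would then take `noisy` as input and `clean` as the target, which is the same audio-to-audio setup as task #2.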

But what other audio datasets are available? 

* [torchaudio datasets](https://pytorch.org/audio/stable/datasets.html). These are almost all about speech; only GTZAN is musical.  
* I've got the [SignalTrain audio dataset for compressors](https://zenodo.org/record/3824876) although it's 20 GB. 
* [Source separation datasets](https://source-separation.github.io/tutorial/data/datasets.html), i.e. mono-to-many
* [ISMIR has a list of datasets](https://ismir.net/resources/datasets/)
* We can always grab audio and then use Spotify's new [Pedalboard](https://github.com/spotify/pedalboard) to add effects
* [Marco Martinez' Leslie effects dataset](https://zenodo.org/record/3562442) is a bit less than 1 GB. It has "dry" (input) and "tremelo" (target) directories.

Since Christian Steinmetz uses SignalTrain data for his [Micro-TCN](https://github.com/csteinmetz1/micro-tcn), we'll do the same, using the 200 MB "Reduced" version I just made:

In [None]:
path = get_audio_data(URLs.SIGNALTRAIN_LA2A_REDUCED); path

Path('/home/shawley/.fastai/data/SignalTrain_LA2A_Reduced')

Steinmetz uses PyTorch Lightning instead of fastai. We should be able to do the bare-minimum integration by following Zach Mueller's prescription.