<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Many of these routines are duplicates or modifications of code from the "audio-diffusion" repo by Zach Evans, with contributions by Scott Hawley: https://github.com/zqevans/audio-diffusion/blob/main/diffusion/utils.py

## Augmentation routines

Not all of these are used.  Code copied from https://github.com/zqevans/audio-diffusion/blob/main/diffusion/utils.py

In [1]:
#|output: asis
#| echo: false
show_doc(PadCrop)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L27){target="_blank" style="float:right; font-size:smaller"}

### PadCrop

>      PadCrop (n_samples, randomize=True, redraw_silence=True,
>               silence_thresh=-60, max_redraws=2)

Extracts a chunk of `n_samples` samples from a longer signal, optionally from a random position and optionally redrawing chunks that turn out to be silent.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| n_samples |  |  | length of chunk to extract from longer signal |
| randomize | bool | True | draw cropped chunk from a random position in audio file |
| redraw_silence | bool | True | a chunk containing silence will be replaced with a new one |
| silence_thresh | int | -60 | threshold in dB below which we declare to be silence |
| max_redraws | int | 2 | when redrawing silences, don't do it more than this many |
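
To make the parameters above concrete, here is a minimal plain-Python sketch of the crop-with-redraw idea. It is not the `PadCrop` implementation: the names `pad_crop_sketch` and `peak_db`, the use of lists instead of tensors, the right-side zero-padding of short signals, and the peak-level silence test are all assumptions made for illustration.

```python
import math
import random

def peak_db(chunk):
    """Peak level of a chunk in dB relative to full scale (1.0)."""
    peak = max((abs(x) for x in chunk), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def pad_crop_sketch(signal, n_samples, randomize=True, redraw_silence=True,
                    silence_thresh=-60, max_redraws=2, rng=random):
    """Hypothetical stand-in for PadCrop, operating on a mono list of floats."""
    if len(signal) <= n_samples:
        # Signal is too short (or exactly long enough): zero-pad on the right
        return list(signal) + [0.0] * (n_samples - len(signal))
    max_start = len(signal) - n_samples
    chunk = []
    for _ in range(max_redraws + 1):  # first draw plus at most max_redraws retries
        start = rng.randint(0, max_start) if randomize else 0
        chunk = signal[start:start + n_samples]
        # Accept the chunk unless redrawing is enabled and the chunk is silent
        if not (redraw_silence and randomize and peak_db(chunk) < silence_thresh):
            break
    return chunk
```

With `randomize=False` the chunk always starts at sample 0, so redrawing would return the same chunk and is skipped in this sketch.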

In [2]:
#|output: asis
#| echo: false
show_doc(PhaseFlipper)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L57){target="_blank" style="float:right; font-size:smaller"}

### PhaseFlipper

>      PhaseFlipper (p=0.5)

she was PHAAAAAAA-AAAASE FLIPPER, a random invert yeah (randomly inverts the phase of the signal with probability `p`)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| p | float | 0.5 | probability that phase flip will be applied |
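
A phase flip here amounts to multiplying the signal by -1. The following is a minimal sketch of that idea on a plain list of samples; `phase_flip_sketch` and its `rng` argument are hypothetical names, not part of the library.

```python
import random

def phase_flip_sketch(signal, p=0.5, rng=random):
    """With probability p, return the signal with every sample negated."""
    if rng.random() < p:
        return [-x for x in signal]
    return list(signal)
```

An inverted signal sounds identical on its own, which is why this works as a cheap augmentation.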

In [3]:
#|output: asis
#| echo: false
show_doc(FillTheNoise)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L68){target="_blank" style="float:right; font-size:smaller"}

### FillTheNoise

>      FillTheNoise (p=0.33)

randomly adds a bit of noise, just to spice things up

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| p | float | 0.33 | probability that noise will be added |

In [4]:
#|output: asis
#| echo: false
show_doc(RandPool)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L79){target="_blank" style="float:right; font-size:smaller"}

### RandPool

>      RandPool (p=0.2)

In [5]:
#|output: asis
#| echo: false
show_doc(NormInputs)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L91){target="_blank" style="float:right; font-size:smaller"}

### NormInputs

>      NormInputs (do_norm=True)

Normalize inputs to [-1,1]. Useful for quiet inputs

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| do_norm | bool | True | controllable parameter for turning normalization on/off |
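
One common way to normalize to [-1,1] is to divide by the peak absolute value. The sketch below assumes that approach (the source does not state which norm is used) and works on a plain list; `norm_inputs_sketch` is a hypothetical name.

```python
def norm_inputs_sketch(signal, do_norm=True):
    """Scale a list of samples so that its largest absolute value is 1.0."""
    peak = max((abs(x) for x in signal), default=0.0)
    if not do_norm or peak == 0.0:
        # Normalization switched off, or all-zero input (avoid dividing by zero)
        return list(signal)
    return [x / peak for x in signal]
```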

In [6]:
#|output: asis
#| echo: false
show_doc(Mono)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L103){target="_blank" style="float:right; font-size:smaller"}

### Mono

>      Mono ()

convert audio to mono

In [7]:
#|output: asis
#| echo: false
show_doc(Stereo)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L109){target="_blank" style="float:right; font-size:smaller"}

### Stereo

>      Stereo ()

convert audio to stereo

In [8]:
#|output: asis
#| echo: false
show_doc(RandomGain)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L124){target="_blank" style="float:right; font-size:smaller"}

### RandomGain

>      RandomGain (min_gain, max_gain)

apply a random gain to audio
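
As a rough illustration, a random gain can be drawn once per call and applied to every sample. This sketch assumes a linear (not dB) gain drawn uniformly between `min_gain` and `max_gain`; the source does not specify either choice, and `random_gain_sketch` is a hypothetical name.

```python
import random

def random_gain_sketch(signal, min_gain, max_gain, rng=random):
    """Multiply every sample by one gain value drawn from [min_gain, max_gain]."""
    gain = rng.uniform(min_gain, max_gain)
    return [gain * x for x in signal]
```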

## WebDataset support


### Background Info
Refer to the official [WebDataset Repo on GitHub](https://github.com/webdataset/webdataset).

> WebDataset makes it easy to write I/O pipelines for large datasets. Datasets can be stored locally or in the cloud.

They use the word "shards" but never define what "shard" means.  I (S.H.) surmise they mean the groups of data files which are gathered into a series of `.tar` files -- the `.tar` files are the shards? 

cf. Video Tutorial: ["Loading Training Data with WebDataset"](https://www.youtube.com/watch?v=mTv_ePYeBhs).

The recommended usage for AWS S3 can be seen in [this GitHub Issue comment by tmbdev](https://github.com/webdataset/webdataset/issues/21#issuecomment-706008342):

```Python
url = "pipe:s3cmd get s3://bucket/dataset-{000000..000999}.tar -"
dataset = wds.Dataset(url)...
```
> ^[sic.] `s3cmd get` should read `aws s3 cp`. 

That URL is expecting a contiguously-numbered range of .tar files. So if the file numbers are contiguous (no gaps), then we'll have an easy time. Otherwise, there are ways to pass in a long list of similar "pipe:...tar" 'urls' for each and every tar file, which is still not a big deal though it may appear messier. 
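
To show what the `{000000..000999}` notation stands for, here is a small sketch that expands a single numeric range into one URL per tar file. WebDataset performs this expansion itself; `expand_numeric_range` is a hypothetical helper written only for illustration, and it handles just one `{start..end}` range per URL.

```python
import re

def expand_numeric_range(url):
    """Expand one '{start..end}' numeric range into a list of URLs.

    Zero-padding is kept when either bound is padded, e.g. {000..002}.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if match is None:
        return [url]  # nothing to expand
    start, end = match.group(1), match.group(2)
    # Pad to the common width only if a bound has a leading zero
    padded = (len(start) > 1 and start[0] == "0") or (len(end) > 1 and end[0] == "0")
    width = max(len(start), len(end)) if padded else 0
    return [
        url[:match.start()] + str(n).zfill(width) + url[match.end():]
        for n in range(int(start), int(end) + 1)
    ]

urls = expand_numeric_range("pipe:aws s3 cp s3://bucket/dataset-{000000..000003}.tar -")
```

For non-contiguous numbering, a list like `urls` would simply be written out (or generated) with one entry per existing tar file.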

### General utility: `get_s3_contents()`

In [9]:
#|output: asis
#| echo: false
show_doc(get_s3_contents)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L137){target="_blank" style="float:right; font-size:smaller"}

### get_s3_contents

>      get_s3_contents (dataset_path, s3_url_prefix='s3://s-laion-
>                       audio/webdataset_tar', filter='')

Gets a list of names of files or subdirectories on an s3 path

Let's test that on the FSD50K dataset:

In [None]:
#| eval: false
get_s3_contents('FSD50K')

['test', 'train', 'valid']

In [None]:
#| eval: false
get_s3_contents('FSD50K/test')

['0.tar',
 '1.tar',
 '10.tar',
 '11.tar',
 '12.tar',
 '13.tar',
 '14.tar',
 '15.tar',
 '16.tar',
 '17.tar',
 '18.tar',
 '19.tar',
 '2.tar',
 '3.tar',
 '4.tar',
 '5.tar',
 '6.tar',
 '7.tar',
 '8.tar',
 '9.tar',
 'sizes.json']

And let's try filtering for only tar files: 

In [None]:
#| eval: false
tar_names = get_s3_contents('FSD50K/test', filter='tar')
tar_names

['0.tar',
 '1.tar',
 '10.tar',
 '11.tar',
 '12.tar',
 '13.tar',
 '14.tar',
 '15.tar',
 '16.tar',
 '17.tar',
 '18.tar',
 '19.tar',
 '2.tar',
 '3.tar',
 '4.tar',
 '5.tar',
 '6.tar',
 '7.tar',
 '8.tar',
 '9.tar']

### For contiguous file-number lists...

Maybe the range of tar numbers is contiguous. If so, let's have something to output that range:

In [10]:
#|output: asis
#| echo: false
show_doc(get_contiguous_range)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L146){target="_blank" style="float:right; font-size:smaller"}

### get_contiguous_range

>      get_contiguous_range (tar_names)

Given a list of tar file names, return a string of their range if the numbers are contiguous. Otherwise return an empty string.

|    | **Details** |
| -- | ----------- |
| tar_names | list of tar file names, although the .tar part is actually optional |

In [None]:
#| eval: false
cont_range = get_contiguous_range(tar_names)
cont_range

'{0..19}'

Test if leading zeros are preserved:

In [None]:
#| eval: false
get_contiguous_range(['0000'+x for x in tar_names])

'{00000..000019}'

Test zero-element and single element versions:

In [None]:
print(get_contiguous_range([]))
print(get_contiguous_range([1]))


1


And show that '.tar' is optional:

In [None]:
get_contiguous_range(['01','02','3'])

'{01..3}'
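
The behavior shown in these examples can be reproduced with a few lines of plain Python. The sketch below is not the library's code; `contiguous_range_sketch` is a hypothetical name, and it is written only to match the outputs shown above (numeric sorting, original strings kept for the bounds, a bare value for a single element).

```python
def contiguous_range_sketch(tar_names):
    """Return '{first..last}' if the numbers in tar_names have no gaps, else ''."""
    if len(tar_names) == 0:
        return ""
    # Drop an optional '.tar' suffix; str() also tolerates integer entries
    stems = [str(name).replace(".tar", "") for name in tar_names]
    stems.sort(key=int)  # numeric order, so '2' comes before '10'
    numbers = [int(s) for s in stems]
    if numbers != list(range(numbers[0], numbers[-1] + 1)):
        return ""  # a gap or a duplicate: not contiguous
    if len(stems) == 1:
        return stems[0]
    return "{" + stems[0] + ".." + stems[-1] + "}"
```

Sorting by integer value matters because the S3 listing is in lexicographic order (`'1.tar'`, `'10.tar'`, `'11.tar'`, ...).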

....So, if a contiguous range of tar file names is available in a WebDataset directory, then we can just use the native WebDataset creation utilities and can ignore all the other %$#*& that's about to follow below. 

Let's test the simple version first:

In [None]:
#| eval: false
s3_url_prefix='s3://s-laion-audio/webdataset_tar/'
url = f"pipe:aws s3 cp {s3_url_prefix}FSD50K/test/{cont_range}.tar -"  # 'aws get' is not a thing. 'aws cp' is
print(url)
dataset = wds.WebDataset(url)

pipe:aws s3 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/{0..19}.tar -


Hooray, it didn't crash! 

Try dataloader-ing that:

In [None]:
#| eval: false
## NOTE TO SELF: DON'T RUN THIS ON STABILITY CLUSTER HEADNODE
if 'this next part fails' == 'darn it':
    loader = wds.WebLoader(dataset, num_workers=4, batch_size=8)
    #loader = loader.batched(12)
    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

### Non-contiguously-numbered lists of tar files...
Because you could do a test-train-val split by moving the tar files around, the numbering may end up with gaps. This is what all the extra code is for.

A lot of the code predating this was written by LAION who require that the `.json` file(s) for the webdataset(s) be downloaded first. So, let's write a utility for that: 

In [11]:
#|output: asis
#| echo: false
show_doc(download_webdataset_json)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L163){target="_blank" style="float:right; font-size:smaller"}

### download_webdataset_json

>      download_webdataset_json (datasetnames, dataset_split={},
>                                src_prefix='s3://s-laion-audio/webdataset_tar',
>                                dst_prefix='./json_files', force=False)

Downloads the json info of webdataset (sub-)file sizes

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| datasetnames |  |  | list of names of valid AudioDataset datasets / paths |
| dataset_split | dict | {} | keys are dataset names, values are lists of subdirs |
| src_prefix | str | s3://s-laion-audio/webdataset_tar | parent location where the dataset lives |
| dst_prefix | str | ./json_files | local path to save the json |
| force | bool | False | Force new download even if local copy exists |

Test `download_webdataset_json`:

In [None]:
#| eval: false
from types import SimpleNamespace
args = SimpleNamespace(remotedata=True, datasetnames=['FSD50K'],
                       dataset_type="webdataset",
                       dataset_proportion=1, datasetpath='IDK')
download_webdataset_json(args.datasetnames, force=True)

download: s3://s-laion-audio/webdataset_tar/FSD50K/test/sizes.json to json_files/FSD50K/test/sizes.json
download: s3://s-laion-audio/webdataset_tar/FSD50K/train/sizes.json to json_files/FSD50K/train/sizes.json
download: s3://s-laion-audio/webdataset_tar/FSD50K/valid/sizes.json to json_files/FSD50K/valid/sizes.json
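
Judging from the log above, each split's `sizes.json` is mirrored to the same relative path under `dst_prefix`. The sketch below only builds those source/destination pairs and downloads nothing; `plan_json_downloads` is a hypothetical helper, and the list of splits is passed in explicitly rather than discovered on S3.

```python
def plan_json_downloads(datasetnames, splits,
                        src_prefix="s3://s-laion-audio/webdataset_tar",
                        dst_prefix="./json_files"):
    """List (source, destination) pairs for every sizes.json file."""
    pairs = []
    for name in datasetnames:
        for split in splits:
            src = f"{src_prefix}/{name}/{split}/sizes.json"
            dst = f"{dst_prefix}/{name}/{split}/sizes.json"
            pairs.append((src, dst))
    return pairs

pairs = plan_json_downloads(["FSD50K"], ["test", "train", "valid"])
```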


For non-contiguous files, we need a list of urls to every single tar file individually.  That's what this next code from LAION's CLAP repo does:

In [None]:
import json
import os

def get_tar_path_s3(base_s3_path:str,
    train_valid_test:list[str],
    dataset_names:list[str]=[''],
    cache_path:str='',
    recache:bool=False,
    ):
    "Code from LAION CLAP; may not keep. This spits out, for each split, a list of aws cli calls to download every tar file"
    if os.path.isfile(cache_path) and not recache:
        with open(cache_path) as f:
            print("Loading Cache")
            return json.load(f)

    # create the command for collecting the urls of each specific dataset;
    # if `dataset_names` is not given it will search the full base_s3_path
    cmds = [f'aws s3 ls s3://{os.path.join(base_s3_path, name, "")} --recursive | grep /.*.tar' for name in dataset_names]
    # urls are collected
    urls = [os.popen(cmd).read() for cmd in cmds]
    # cleaning the urls to conform with webdataset (skipping blank lines in the listing)
    final_urls = [i.split(' ')[-1] for url in urls for i in url.split('\n') if i.strip()]
    final_urls = [f'pipe:aws s3 --cli-connect-timeout 0 cp s3://{os.path.join(base_s3_path, *i.split("/")[1:])} -' for i in final_urls]
    # Splitting urls by state, e.g. train, test and valid
    final_urls = {state:[url for url in final_urls if state in url] for state in train_valid_test}

    if cache_path:
        with open(cache_path, 'w') as f:
            json.dump(final_urls, f)

    return final_urls
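
The URL-cleaning steps in that function can be tried offline on a sample listing. In the sketch below, `sample_listing` is made-up text in the layout that `aws s3 ls --recursive` prints (date, time, size, object key), and `listing_to_urls` is a hypothetical helper that mirrors the parsing and splitting logic without calling AWS.

```python
# Made-up sample of `aws s3 ls ... --recursive` output: date, time, size, object key
sample_listing = (
    "2022-10-01 12:00:00  104857600 webdataset_tar/FSD50K/test/0.tar\n"
    "2022-10-01 12:00:05  104857600 webdataset_tar/FSD50K/test/1.tar\n"
    "2022-10-01 12:01:00  104857600 webdataset_tar/FSD50K/valid/0.tar\n"
)

def listing_to_urls(listing, base_s3_path, states):
    """Turn `aws s3 ls` output into {state: [pipe URL, ...]}."""
    # The object key is the last space-separated field; skip blank lines
    keys = [line.split(" ")[-1] for line in listing.split("\n") if line.strip()]
    urls = []
    for key in keys:
        # Drop the key's first path component, which base_s3_path already ends with
        tail = "/".join(key.split("/")[1:])
        urls.append(f"pipe:aws s3 --cli-connect-timeout 0 cp s3://{base_s3_path}/{tail} -")
    # A URL belongs to a state if the state name appears anywhere in it
    return {state: [u for u in urls if state in u] for state in states}

urls_by_state = listing_to_urls(sample_listing, "s-laion-audio/webdataset_tar", ["test", "valid"])
```

Note that the split is a plain substring match, so a state name that also appears elsewhere in the path (for example in a dataset name) would be matched as well.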

Let's grab every tar file in the entire FSD50K dataset:

In [None]:
#| eval: false
urls = get_tar_path_s3('s-laion-audio/webdataset_tar',['test', 'valid'], dataset_names=['FSD50K'])
print("urls =",urls)

urls = {'test': ['pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/0.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/1.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/10.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/11.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/12.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/13.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/14.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/15.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/16.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/17.tar -', 'pipe:aws s3 --

Another version that achieves the same effect:

In [None]:
import json
import os
import random

def get_tar_path_from_dataset_name(
    dataset_names, dataset_types, islocal,  dataset_path, proportion=1, ):
    """
    From LAION
    Get tar path from dataset name and type
    """
    if islocal:
        output = []
        for n in dataset_names:
            for s in dataset_types:
                tmp = []
                sizefilepath_ = f"./json_files/{n}/{s}/sizes.json" #  TODO:!!!
                if not os.path.exists(sizefilepath_):
                    continue
                sizes = json.load(open(sizefilepath_, "r"))
                for k in sizes.keys():
                    tmp.append(
                        f"{dataset_path}/{n}/{s}/{k}"
                    )
                if proportion!=1:
                    tmp = random.sample(tmp, int(proportion * len(tmp)))
                output.append(tmp)
        return sum(output, [])
    else:

        output = []
        for n in dataset_names:
            for s in dataset_types:
                tmp = []
                sizefilepath_ = f"./json_files/{n}/{s}/sizes.json"
                if not os.path.exists(sizefilepath_):
                    continue
                sizes = json.load(open(sizefilepath_, "r"))
                for k in sizes.keys():
                    tmp.append(
                        f"pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/{n}/{s}/{k} -"
                    )
                    # TODO: add dataset_path to remote dataset in the future.
                if proportion!=1:
                    tmp = random.sample(tmp, int(proportion * len(tmp)))
                output.append(tmp)
                print("output= ",output)
        return sum(output, [])
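
The remote branch of that function boils down to "one pipe URL per key in `sizes.json`, optionally subsampled". Here is a small offline sketch of that step. `urls_from_sizes` and `fake_sizes` are hypothetical; in particular, the assumption that `sizes.json` maps tar file names to sample counts is a guess, since only the keys are used above.

```python
import random

def urls_from_sizes(sizes, name, split, proportion=1, rng=random):
    """Build one pipe URL per tar listed in `sizes`, optionally subsampled."""
    urls = [
        f"pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/{name}/{split}/{tar} -"
        for tar in sizes
    ]
    if proportion != 1:
        # Keep a random subset; int() rounds the count down
        urls = rng.sample(urls, int(proportion * len(urls)))
    return urls

# Hypothetical sizes.json contents: tar file name -> number of samples (assumed)
fake_sizes = {"0.tar": 512, "1.tar": 512, "2.tar": 512, "3.tar": 400}
all_urls = urls_from_sizes(fake_sizes, "FSD50K", "test")
half_urls = urls_from_sizes(fake_sizes, "FSD50K", "test", proportion=0.5)
```

Because `int()` truncates, a small `proportion` on a dataset with few tar files can yield an empty list.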

Test ^that:

In [None]:
#| eval: false
train_data_tar_path = get_tar_path_from_dataset_name(
    ['FSD50K'],
    ['test','valid'],
    islocal=False,
    proportion=1.0,
    dataset_path='/fsx/shawley/data/webdataset',
)
train_data_tar_path

output=  [['pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/0.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/1.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/2.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/3.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/4.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/5.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/6.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/7.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/8.tar -', 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/9.tar -', 'pipe:aws s3 --cli-connect-ti

['pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/0.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/1.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/2.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/3.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/4.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/5.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/6.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/7.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/8.tar -',
 'pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/FSD50K/test/9.tar -',
 'pipe:aws s3 --cli-connect-ti

And now a massive data-pipelining example from LAION that will definitely get modified for this repo: 

In [12]:
#|output: asis
#| echo: false
show_doc(get_wds_dataset)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L196){target="_blank" style="float:right; font-size:smaller"}

### get_wds_dataset

>      get_wds_dataset (args, model_cfg, is_train, audio_ext='flac',
>                       text_ext='json', max_len=480000, proportion=1.0,
>                       sizefilepath_=None, is_local=None)

Get a dataset for wdsdataloader.

In [13]:
#|output: asis
#| echo: false
show_doc(DataInfo)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L190){target="_blank" style="float:right; font-size:smaller"}

### DataInfo

>      DataInfo (dataloader:DataLoader, sampler:DistributedSampler)

.....yeah no tests for that yet 

In [None]:
#| eval: false
# LAION cut n paste
if 'doesnt work yet' == 'stand by':
    train_data = get_wds_dataset(    args,
        model_cfg,
        is_train,
        audio_ext="flac",
        text_ext="json",
        max_len=480000,
        proportion=1.0,
        sizefilepath_=None,
        is_local=False,)

## AudioDataset class

The flagship class!

In [14]:
#|output: asis
#| echo: false
show_doc(AudioDataset)

---

[source](https://github.com/drscotthawley/aeiou/blob/main/aeiou/datasets.py#L345){target="_blank" style="float:right; font-size:smaller"}

### AudioDataset

>      AudioDataset (*args, **kwds)

Reads from a tree of directories and serves up cropped bits from any and all audio files
found therein. For efficiency, it is best to "chunk" these files first via chunkadelic.
Modified from https://github.com/drscotthawley/audio-diffusion/blob/main/dataset/dataset.py

Quick check to catch minor errors:

In [None]:
dataset = AudioDataset('examples/', augs='Stereo(), PhaseFlipper(), FillTheNoise(), NormInputs()')
signal = dataset[0]  # same as dataset.__getitem__(0)
print("signal.shape =",signal.shape)

print("\nStereo -------------")
dataset2 = AudioDataset('examples/', augs='Stereo(), PhaseFlipper()')
signal2 = dataset2[0]
print("signal2.shape =",signal2.shape)