## Far-Field Audio Synthesis Toolkit (FAST) v0.2: User Guide
This notebook serves as a user guide for the FAST toolkit, which is designed to rapidly synthesise audio for training and fine-tuning audio ML models.

#### v0.2 Enhancements
- Added generation-parameter logging and a dataset regeneration function
- Fixed bugs causing errors for some generation scenarios

### Table of Contents:
- I. [Overview](#i-overview)
- II. [Getting Started](#ii-getting-started)
- III. [Synthesis Walkthrough](#iii-synthesis-walkthrough)
- IV. [Bulk Audio Generation](#iv-bulk-audio-generation)
- V. [Experimental Results (WIP)](#v-experimental-results-wip)
- VI. [Regeneration of Datasets](#vi-regeneration-of-datasets)
- VII. [Contact Us](#vii-contact-us)

### I. Overview

#### I-A: Why FAST?
- To augment training data beyond the datasets conventionally used in commercial and open research
- In particular, to address the shortage of labelled noisy data

#### I-B: Synthesis Approach
- WIP

### II. Getting Started

To get started, clone this repo and set up a new environment with Python 3.10. Then install the libraries listed in requirements.txt using pip.

In [1]:
# ! conda create -n new_env python=3.10.18
# ! conda activate new_env
# ! pip install -r requirements.txt

If you are generating audio data in bulk, place your raw data in the following folders:
|Folder|Description|
|:---|:---|
|"./data/00_raw_speech/"| Folder for clean speech data |
|"./data/01_stationary_noise/" | Folder for stationary noise (e.g. background buzz, human chatter in crowded environment, etc.) |
|"./data/02_non-stationary_noise/" | Folder for non-stationary noise (e.g. Pen dropping, bell-ringing, passing car, etc.) |
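
Before a bulk run it can help to confirm that these folders exist and actually contain audio. The snippet below is a minimal sketch of such a check; the folder names come from the table above, while the helper `prepare_data_folders` is hypothetical and not part of FAST.

```python
from pathlib import Path

# Folder names are taken from the table above; the helper itself is hypothetical.
RAW_DATA_FOLDERS = (
    "data/00_raw_speech",
    "data/01_stationary_noise",
    "data/02_non-stationary_noise",
)


def prepare_data_folders(root="."):
    """Create any missing raw-data folder and count the .wav files in each."""
    counts = {}
    for rel in RAW_DATA_FOLDERS:
        folder = Path(root) / rel
        folder.mkdir(parents=True, exist_ok=True)
        counts[rel] = sum(1 for _ in folder.glob("*.wav"))
    return counts
```

Calling `prepare_data_folders()` from the repo root returns a dictionary of file counts per folder; a count of zero signals that bulk generation would have nothing to draw from in that folder.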

### III. Synthesis Walkthrough
This walkthrough will provide an overview of the speech synthesis components in FAST.

In [1]:
import torch
import torchaudio
import torchaudio.functional as F
import numpy as np
import os
import soundfile as sf # to export librosa arrays into wav
import tempfile
from torchaudio import transforms
# Helper functions to plot graph, spectrum, load audio
from src.helper_functions import plot_waveform, plot_specgram, load_audio_with_pytorch
# Function to convolve audio with IRs
from src.ir_convolve import ir_convolve
# Function to right size audio data after convolution
from src.post_convo_sizer import post_convo_sizer
# Function to add effects to audio
from src.audio_effects_new import audio_effector
# Function to build noises from repo
from src.noise_builder import noise_builder
# Function to attach noise to speech data
from src.audio_stacker import audio_noise_stack
# Function to perform opus encoding decoding
from src.encoding_scripts.opus import encode_opus, decode_opus
# Function to perform bulk audio generation
from src.bulk_generation import bulk_generation
from src.bulk_generation_simple import bulk_generation_simple
# Function to regenerate dataset
from src.regenerate_dataset import regenerate_dataset

# Confirm torch is working
print(torch.__version__)
print(torchaudio.__version__)

2.7.0
2.7.0


#### III-A: Load Audio and Apply Tempo / Pitch Shift Effects (Obtain new clean audio)

In [3]:
## Load an audio clip using load_audio_with_pytorch
sample_audio = "./data/Samples/raw_clips/Gump.wav"
sample_data, sr = load_audio_with_pytorch(sample_audio)

Audio ./data/Samples/raw_clips/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/Samples/raw_clips/Gump.wav resampled to 16000Hz


In [4]:
print(sample_data.shape)

torch.Size([1, 127704])
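
The new shape follows from the resampling step: 351983 samples at 44100 Hz become 127704 samples at 16000 Hz. The sketch below reproduces that arithmetic, assuming the resampler rounds the output length up (which is what the printed shape suggests); `resampled_length` is an illustrative helper, not a FAST function.

```python
import math


def resampled_length(num_samples, orig_sr, target_sr):
    """Number of samples after resampling, assuming the length is rounded up."""
    g = math.gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(num_samples * up) // down)


print(resampled_length(351983, 44100, 16000))  # 127704
print(round(127704 / 16000, 2))                # clip duration in seconds: 7.98
```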


In [5]:
## Implement tempo and pitch shift effects on speech data
## Post-implementation, this will be your "clean" data for the clean-dirty pair
sample_data, sr, _ = audio_effector (sample_data,
                                     tempo_change=True,
                                     pitch_shift=True)

#### III-B: Synthesising Speech with Room Reverberation

In [6]:
print(sample_data.shape)

torch.Size([1, 117047])


In [7]:
## Convolve audio_data with a specific Room Impulse Response file
rir_path = "./data/Impulse_Responses/room_IRs/Room007-00007.wav.npy"
sample_data, sr, size_orig, IR_applied, _ = ir_convolve(sample_data, sr, mode="specific", specific_ir_path=rir_path)

In [8]:
print(sample_data.shape)

torch.Size([1, 125046])


In [9]:
# Right size convolved audio
sample_data = post_convo_sizer(audio_data=sample_data, 
                               size_orig = size_orig, 
                               convo_type="room",
                               IR_applied=IR_applied)

IR peak detected at sample #134


In [10]:
print(sample_data.shape)

torch.Size([1, 117047])
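
The shapes above are what a full convolution would produce: an N-sample signal convolved with an M-sample IR yields N + M - 1 samples, so 125046 samples is consistent with an 8000-sample room IR (117047 + 8000 - 1). The sketch below illustrates this on toy data, together with one plausible way to right-size the result. It assumes the sizer drops the delay before the IR's strongest tap and keeps `size_orig` samples; the internals of `ir_convolve` and `post_convo_sizer` are not shown in this guide, so both helpers here are hypothetical.

```python
def convolve_full(signal, ir):
    """Direct-form full convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def trim_from_ir_peak(convolved, ir, size_orig):
    """Assumed sizing rule: start at the IR's strongest tap, keep size_orig samples."""
    peak = max(range(len(ir)), key=lambda k: abs(ir[k]))
    return convolved[peak:peak + size_orig], peak


speech = [1.0, 2.0, 3.0, 4.0]
ir = [0.0, 0.0, 1.0, 0.5]  # toy IR: 2-sample delay plus one reflection
wet = convolve_full(speech, ir)
sized, peak = trim_from_ir_peak(wet, ir, len(speech))
print(len(wet), peak, sized)  # 7 2 [1.0, 2.5, 4.0, 5.5]
```

Trimming from the peak keeps the reverberant clip time-aligned with the clean clip, which matters when the two are used as a clean-dirty training pair.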


#### III-C: Synthesising Noise

In [11]:
## Generate stationary noise and add effects
noise_stationary_repo = "./data/01_stationary_noise/"
noise_stationary_data, sr , _= noise_builder (sample_data,
                                           noise_stationary_repo,
                                           echo = True,
                                           low_pass = True,
                                           mode = "stationary")

## Generate non-stationary noise and convolve with room impulse response
noise_nonstationary_repo = "./data/02_non-stationary_noise/"
noise_nonstationary_data, sr, _ = noise_builder (sample_data,
                                              noise_nonstationary_repo,
                                              echo = True,
                                              mode = "non-stationary")

Audio ./data/01_stationary_noise/noise-free-sound-0064.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1602133, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])


In [12]:
print(noise_stationary_data.shape)
print(noise_nonstationary_data.shape)

torch.Size([1, 117047])
torch.Size([1, 117047])
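
Both noise clips come out at the same length as the speech clip, even though the source files are much longer (1602133 samples) or much shorter (41143 samples). How `noise_builder` achieves this is not documented here; the sketch below shows one common approach, assuming a random crop for long clips and tiling for short ones. The helper name and its behaviour are illustrative only.

```python
import random


def fit_noise_to_length(noise, target_len, rng=None):
    """Crop a long noise clip at a random offset, or tile a short one, to target_len."""
    rng = rng or random.Random()
    if len(noise) >= target_len:
        start = rng.randint(0, len(noise) - target_len)
        return noise[start:start + target_len]
    repeats = -(-target_len // len(noise))  # ceiling division
    return (noise * repeats)[:target_len]
```

A non-stationary event such as a dropped pen might instead be placed once at a random offset and zero-padded; which behaviour is appropriate depends on the noise type.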


#### III-D: Combining Speech and Noise

In [13]:
## Stack the noises onto each other, then stack the combined noise onto the speech data
# Stack stationary and non-stationary noise first
combined_noise_data = audio_noise_stack(noise_stationary_data, noise_nonstationary_data, SNR=3)

# Stack speech onto noise
sample_data = audio_noise_stack(sample_data, combined_noise_data, SNR=0)

In [14]:
print(sample_data.shape)

torch.Size([1, 117047])
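
The `SNR` argument is the target signal-to-noise ratio in decibels. A minimal sketch of the underlying idea follows, assuming the first argument is treated as the reference signal and the second is rescaled so that 10·log10(P_signal / P_noise) equals the target; the actual internals of `audio_noise_stack` are not shown in this guide, and `mix_at_snr` is a hypothetical stand-in.

```python
import math


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale noise so that signal power / noise power matches snr_db, then add."""
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled)]
    return mixed, scaled
```

With `snr_db=0` the two inputs end up with equal power, while `snr_db=3` leaves the reference roughly twice as powerful as the added noise.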


#### III-E: Simulating Passing of Audio through Fabric

In [15]:
## Convolve data with Fabric IR
fabric_ir_path = "./data/Impulse_Responses/fabric_IRs/ir_9LC_0deg_aligned_smooth.npy"
sample_data, sr, size_orig, IR_applied, _ = ir_convolve(sample_data, 
                                                    sr,
                                                    mode="specific",
                                                    specific_ir_path=fabric_ir_path)

In [16]:
sample_data.shape

torch.Size([1, 117646])

In [17]:
## Right size convolved data
sample_data = post_convo_sizer(audio_data=sample_data, 
                               size_orig = size_orig, 
                               convo_type="fabric",
                               IR_applied=IR_applied)

In [18]:
sample_data.shape

torch.Size([1, 117047])

#### III-F: Simulating Recording of Audio by Mobile Phones

In [19]:
## Convolve data with Mobile Phone IR
phone_ir_path = "./data/Impulse_Responses/handphone_IRs/IR_rog_phone_3.npy"
# Store the returned IR as IR_applied so the sizing step below uses the phone IR
sample_data, sr, size_orig, IR_applied, _ = ir_convolve(sample_data, 
                                                     sr,
                                                     mode="specific",
                                                     specific_ir_path=phone_ir_path)

In [20]:
sample_data.shape

torch.Size([1, 120046])

In [21]:
## Right size convolved data
sample_data = post_convo_sizer(audio_data=sample_data, 
                               size_orig = size_orig, 
                               convo_type="mobile",
                               IR_applied=IR_applied)

IR peak detected at sample #0


In [22]:
sample_data.shape

torch.Size([1, 117047])

#### III-G. Simulating Degradation of Audio from Mobile CODEC Encoding/Decoding

In [23]:
## Export audio into temp folder
# Create a temporary folder to hold the current state of sample_data

with tempfile.TemporaryDirectory() as tmpdirname:
    temp_file_path = os.path.join(tmpdirname, "sample_audio.wav")
    torchaudio.save(temp_file_path, sample_data, sample_rate=sr, encoding="PCM_S", bits_per_sample=16)

    ## Encode and Decode audio
    opus_encoded_path = encode_opus(wav_path = temp_file_path,
                                    tmp_folder=tmpdirname)
    opus_decoded_path = decode_opus(opus_encoded_path=opus_encoded_path,
                                    output_folder="./output/sample_outputs")

File './output/sample_outputs/_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 
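
The overwrite prompt above resembles the one ffmpeg prints when an output file already exists. How `encode_opus` and `decode_opus` invoke their codec is not shown in this guide, so the following is only a sketch under the assumption that ffmpeg with libopus is available: it builds the command lists for an Opus round trip and passes `-y` so existing outputs are overwritten without prompting. The function names and the 16k bitrate are hypothetical.

```python
def opus_encode_command(wav_path, opus_path, bitrate="16k", overwrite=True):
    """Build an ffmpeg command that encodes a WAV file to Opus."""
    cmd = ["ffmpeg"]
    if overwrite:
        cmd.append("-y")  # overwrite existing output without asking
    cmd += ["-i", wav_path, "-c:a", "libopus", "-b:a", bitrate, opus_path]
    return cmd


def opus_decode_command(opus_path, wav_path, sample_rate=16000, overwrite=True):
    """Build an ffmpeg command that decodes Opus back to 16-bit PCM WAV."""
    cmd = ["ffmpeg"]
    if overwrite:
        cmd.append("-y")
    cmd += ["-i", opus_path, "-ar", str(sample_rate), "-c:a", "pcm_s16le", wav_path]
    return cmd
```

Each list can be handed to `subprocess.run(cmd, check=True)`. In a notebook, an unanswered overwrite prompt can stall a cell, which is why the sketch defaults to `overwrite=True`.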

### IV. Bulk Audio Generation

The steps above suffice to prepare a single noisy, reverberant clip. For large-scale training, you can generate audio in bulk using the
function below.

This function randomly draws upon the data stored in your data sub-folders (check documentation for default randomising parameters)
to synthesize a large audio dataset.
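
To make the idea of random drawing concrete, here is a minimal sketch of how per-clip parameters could be sampled. It is not the implementation of `bulk_generation`: the parameter names and the SNR choices are hypothetical, and the real defaults are described in the function's documentation.

```python
import random


def draw_generation_params(speech_files, stationary_files, nonstationary_files, rng):
    """Pick one source file per category plus illustrative SNR settings."""
    return {
        "speech": rng.choice(speech_files),
        "stationary_noise": rng.choice(stationary_files),
        "nonstationary_noise": rng.choice(nonstationary_files),
        "noise_mix_snr_db": rng.choice([0, 3, 6]),    # hypothetical choices
        "speech_snr_db": rng.choice([-5, 0, 5, 10]),  # hypothetical choices
    }


rng = random.Random(42)  # a fixed seed makes the draw repeatable
params = draw_generation_params(["Gump.wav"], ["n1.wav", "n2.wav"], ["e1.wav"], rng)
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the draw independent of any other randomness in the session.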

#### IV-A. Bulk Generation with Fabric IR, Room IR, and Mobile Codec Encoding/Decoding

In [8]:
bulk_generation(number_of_audios = 5)

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #56
Audio ./data/01_stationary_noise/noise-free-sound-0021.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([288725, 1])
Audio ./data/01_stationary_noise/noise-free-sound-0004.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([47784, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0025.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([37547, 1])
IR peak detected at sample #1413


File './output/dirty_samples/0_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

audio ./output/dirty_samples/0_sample_audio_opus_decoded.wav generated!
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #56
Audio ./data/01_stationary_noise/noise-free-sound-0030.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1168196, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0060.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([16718, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0060.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([16718, 1])
IR peak detected at sample #1421


File './output/dirty_samples/1_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

audio ./output/dirty_samples/1_sample_audio_opus_decoded.wav generated!
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #88
Audio ./data/01_stationary_noise/noise-free-sound-0030.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1168196, 1])
IR peak detected at sample #1429


File './output/dirty_samples/2_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

audio ./output/dirty_samples/2_sample_audio_opus_decoded.wav generated!
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #187
Audio ./data/01_stationary_noise/noise-free-sound-0064.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1602133, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
IR peak detected at sample #1416


File './output/dirty_samples/3_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

audio ./output/dirty_samples/3_sample_audio_opus_decoded.wav generated!
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #147
Audio ./data/01_stationary_noise/noise-free-sound-0021.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([288725, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
IR peak detected at sample #1421
audio ./output/dirty_samples/4_sample_audio_opus_decoded.wav generated!


File './output/dirty_samples/4_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

#### IV-B. Bulk Generation with Simple Low/Band-Pass Filters in Place of Fabric IR, Mobile IR, and Mobile Codec Encoding/Decoding

In [3]:
bulk_generation_simple(number_of_audios = 5)

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #134
Audio ./data/01_stationary_noise/noise-free-sound-0030.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1168196, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0060.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([16718, 1])
torch.Size([1, 135263])
<class 'torch.Tensor'>
torch.Size([1, 135263])
<class 'torch.Tensor'>
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #187
Audio ./data/01_stationary_noise/noise-free-sound-0001.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([648584, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size
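
The simple variant swaps measured impulse responses and codec passes for plain filters. The exact filters used by `bulk_generation_simple` are not shown here; as an illustration of the general idea, the sketch below implements a first-order (one-pole) low-pass filter, which attenuates high frequencies in roughly the way a muffling layer would. The function name and cutoff are hypothetical.

```python
import math


def one_pole_low_pass(samples, sample_rate, cutoff_hz):
    """First-order RC-style low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out
```

A constant input passes through essentially unchanged once the filter settles, while a signal alternating at the Nyquist rate is strongly attenuated. A band-pass can be approximated by additionally removing low frequencies, for example by subtracting a second low-pass output with a lower cutoff.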

### V. Experimental Results (WIP)

We are still in the process of using audio synthesised by FAST to train denoising/STT models. We hope to update this section soon,
hopefully with good news!

### VI. Regeneration of Datasets

Bulk generation produces a JSON file in the output folder. This file logs the parameters used to produce a dataset, and the function below can use it to reproduce a previously generated dataset.

Note that this requires the original speech files, noise files, IR files, and codec to be inside their respective folders.
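
The principle behind regeneration is that every random decision made during generation is written to the log, so replaying the log yields the same clips. The sketch below demonstrates this with a toy stand-in for the synthesis chain; the log schema (`clips`, `seed`, `gain`, `n_samples`) is hypothetical and does not reflect the actual format of FAST's experiment log.

```python
import json
import random


def synth_clip(params):
    """Stand-in for the synthesis chain: output depends only on the logged params."""
    rng = random.Random(params["seed"])
    return [round(rng.gauss(0.0, 1.0) * params["gain"], 6)
            for _ in range(params["n_samples"])]


def generate_with_log(n_clips, log_path, master_seed=0):
    """Generate clips and record every parameter needed to rebuild them."""
    rng = random.Random(master_seed)
    log, clips = [], []
    for i in range(n_clips):
        params = {"index": i, "seed": rng.randrange(2 ** 31),
                  "gain": rng.choice([0.5, 1.0]), "n_samples": 4}
        log.append(params)
        clips.append(synth_clip(params))
    with open(log_path, "w") as fh:
        json.dump({"clips": log}, fh, indent=2)
    return clips


def regenerate_from_log(log_path):
    """Rebuild every clip from the logged parameters alone."""
    with open(log_path) as fh:
        log = json.load(fh)
    return [synth_clip(p) for p in log["clips"]]
```

Because the replay reads source material by path, the same requirement applies as in the note above: the referenced input files must still be present and unchanged.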

In [4]:
# Function to regenerate dataset
from src.regenerate_dataset import regenerate_dataset

In [9]:
## Reproduce dataset
test = regenerate_dataset(log_json="./output/experiment_log_250926_010900.json")

5 audio files to be regenerated...
Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #56
Audio ./data/01_stationary_noise/noise-free-sound-0021.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([288725, 1])
Audio ./data/01_stationary_noise/noise-free-sound-0004.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([47784, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0025.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([37547, 1])
IR peak detected at sample #1413


File './output/regenerated_samples/0_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #56
Audio ./data/01_stationary_noise/noise-free-sound-0030.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1168196, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0060.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([16718, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0060.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([16718, 1])
IR peak detected at sample #1421


File './output/regenerated_samples/1_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #88
Audio ./data/01_stationary_noise/noise-free-sound-0030.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1168196, 1])
IR peak detected at sample #1429


File './output/regenerated_samples/2_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #187
Audio ./data/01_stationary_noise/noise-free-sound-0064.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([1602133, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
IR peak detected at sample #1416


File './output/regenerated_samples/3_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

Audio ./data/00_raw_speech/Gump.wav loaded!; Native Sampling_Rate: 44100Hz; Shape: torch.Size([351983, 1])
Audio ./data/00_raw_speech/Gump.wav resampled to 16000Hz
IR peak detected at sample #147
Audio ./data/01_stationary_noise/noise-free-sound-0021.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([288725, 1])
Audio ./data/02_non-stationary_noise/noise-free-sound-0069.wav loaded!; Native Sampling_Rate: 16000Hz; Shape: torch.Size([41143, 1])
IR peak detected at sample #1421
Regeneration Complete!


File './output/regenerated_samples/4_sample_audio_opus_decoded.wav' already exists. Overwrite? [y/N] 

### VII. Contact Us

Have feedback? Have a question? Wish to collaborate?
Perhaps one of our future efforts excites you?
Feel free to reach out to our Speech Researchers at angjunsiong@gmail.com:

- Simon Chee (Boss)
- Avery Khoo (Tech Lead)
- Jun Siong
- Rebecca Oel
- Winfred Kong

Future Efforts:
- (i) Custom RIR: Integrate libraries such as pyroomacoustics and soundscapes to generate bespoke room IRs
- (ii) Evaluation of Denoising / STT Models Trained with FAST Data: Evaluate the utility of synthesised audio in improving denoising/STT models
- (iii) Generate Overlapping Speakers / Time-Stamp Labelling Systems: Expand use cases to the training of speaker separation and VAD models
- (iv) Ablation Studies / FAST Economisation: Investigate the relative importance of each simulation step and streamline/improve the model where applicable
- (v) Harvest Real Noise Data: Put real data through VAD and extract noise component to enrich noise database
- (vi) Create output log for reproducibility of data set: Function to replicate dataset from generation log
- (vii) Streaming input: Reduce storage space by synthesising audio on the fly