## Overview
This notebook illustrates usage of the Tweetynet convolutional neural network for birdsong syllable classification, along with its supporting software. The network processes magnitude spectrogram windows computed from raw audio recordings. It extracts visual features and downsamples the input via 2D convolution and pooling operations, then passes the downsampled activation maps to an LSTM, which feeds a linear readout unit that classifies each discrete time step as either one of the pre-specified syllable types or a period of non-singing. In this notebook, we illustrate how to generate magnitude spectrograms for a variety of parameterizations, generate labelvectors from song annotation files, create datasets for training and evaluation, and build and train a model.

This software's unique contribution is its support for "wideband" spectral input to the Tweetynet neural network. More concretely, our implementation of Tweetynet can process a 3-dimensional input spectrogram window whose depth is created by stacking copies of the same window computed with distinct FFT sizes (and therefore distinct aspect ratios). Tweetynet applies a separate layer 1 convolution and pooling operation to each channel, featurizing and downsampling the inputs to matching dimensions before the second convolutional layer. In addition, the provided tooling for spectrogram and labelvector generation is easily parameterized with wideband input in mind.
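To make the "matching dimensions" idea concrete, here is a minimal NumPy sketch. It assumes three channels with the shapes that FFT sizes 256, 512, and 1024 would produce under an equal-pixel-count scheme (128×500, 256×250, 512×125), and it uses only per-channel max pooling (no convolution) to bring every channel to 128×125 before stacking. The helper `max_pool2d`, the shapes, and the pooling factors are hypothetical illustrations, not values taken from `MultiChannelTweetynet`.

```python
import numpy as np

def max_pool2d(x, pool_f, pool_t):
    """Non-overlapping max pooling of a (freq, time) array; trailing remainders are dropped."""
    f, t = x.shape
    x = x[: f - f % pool_f, : t - t % pool_t]
    return x.reshape(f // pool_f, pool_f, t // pool_t, pool_t).max(axis=(1, 3))

# Hypothetical channel shapes for n_fft = 256, 512, 1024 (same pixel count each)
rng = np.random.default_rng(0)
channels = {
    256: rng.random((128, 500)),
    512: rng.random((256, 250)),
    1024: rng.random((512, 125)),
}
# Hypothetical per-channel (freq, time) pooling factors chosen to reach 128 x 125
pool_factors = {256: (1, 4), 512: (2, 2), 1024: (4, 1)}

pooled = [max_pool2d(channels[n], *pool_factors[n]) for n in (256, 512, 1024)]
stacked = np.stack(pooled)  # one array ready for a shared second layer
print(stacked.shape)  # (3, 128, 125)
```

In the real network each channel would also pass through its own first convolution; pooling alone is shown here only because it is what determines the output dimensions.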

## Dataset

If you have not already done so, proceed to https://figshare.com/articles/dataset/Bengalese_Finch_song_repository/4805749 to collect the dataset most readily compatible with this software package and tutorial.

You are free to download the entire dataset; however, only ```sober.repo1.gy6or6.032212.tar.gz``` is required.

Once downloaded, unpack the archive and place the directory ```032212``` in a location of your choice in preparation for the next step.

## Getting Started
Before starting, visit and review ```/parameters/params.py```. This file specifies key parameters of this notebook and its supporting software. At minimum, you will need to hardcode the following directory parameters:
- ```audio_dir_path``` (this should be the path to directory ```032212```)
- ```annot_dir_path``` (this should be the path to directory ```032212```)
- ```spect_dir_path``` (wherever you want to write spectrograms on disk)

We also recommend frequently referring to ```/parameters/params.py``` and the other ```_params.py``` files as you work through this notebook.

## Setup Directories
Create any missing directories, or remove old files from existing ones. Make sure you have already assigned the directory parameters listed above.

In [None]:
from parameters.params import (
    WINDOWED_LABELVECS_DIR_PATH,
    UNCUT_LABELVECS_DIR_PATH,
    WINDOWED_SPECTS_DIR_PATH,
    UNCUT_SPECTS_DIR_PATH,
)
dirs = [
    WINDOWED_LABELVECS_DIR_PATH,
    UNCUT_LABELVECS_DIR_PATH,
    WINDOWED_SPECTS_DIR_PATH,
    UNCUT_SPECTS_DIR_PATH,
]
from src.utilities import setup_directories
setup_directories(dirs)

## Write Spectrograms
For each provided audio file, we compute and save to disk a full-duration spectrogram as well as consecutive overlapping windows. Short windows are useful for training; full-duration spectrograms can be used for reference and/or for model evaluation on a song-by-song basis. As discussed, this version of Tweetynet supports processing multiple input spectrograms as "channels", so from a single audio file it is easy to generate multiple spectrograms whose different parameterizations yield different output dimensions. Our spectrogram generation procedure includes small quirks so that the output images have an identical number of pixels across FFT setups. For example, given spectrograms A, B, and C produced with FFT sizes 256, 512, and 1024, the height (frequency dimension) of C will be twice that of B and four times that of A, while the inverse relationship holds for the width (time dimension): C will be half as wide as B and a quarter as wide as A.
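One way an equal pixel count can arise is sketched below, assuming the hop size is a fixed fraction of the FFT size (here `n_fft // 4`) and the Nyquist bin is dropped so the height is exactly `n_fft // 2`. The function `magnitude_spectrogram`, the hop rule, and the 16 kHz sample rate are assumptions for illustration; `SpectWriter` may implement its quirks differently.

```python
import numpy as np

def magnitude_spectrogram(audio, n_fft, hop_divisor=4):
    """Magnitude STFT with hop = n_fft // hop_divisor and the Nyquist bin dropped (hypothetical scheme)."""
    hop = n_fft // hop_divisor
    n_frames = len(audio) // hop              # frame count scales as 1 / n_fft
    padded = np.pad(audio, (0, n_fft))        # zero-pad so the last frames are complete
    window = np.hanning(n_fft)
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )                                         # shape (n_fft, n_frames)
    spect = np.abs(np.fft.rfft(frames, axis=0))  # shape (n_fft // 2 + 1, n_frames)
    return spect[: n_fft // 2]                # drop Nyquist bin: height scales as n_fft

# 2 s of noise at a hypothetical 16 kHz sample rate
audio = np.random.default_rng(0).standard_normal(32000)
for n_fft in (256, 512, 1024):
    s = magnitude_spectrogram(audio, n_fft)
    print(n_fft, s.shape, s.size)
# 256 (128, 500) 64000
# 512 (256, 250) 64000
# 1024 (512, 125) 64000
```

Because height grows as `n_fft / 2` while the frame count shrinks as `len(audio) / hop`, their product stays constant whenever `len(audio)` is divisible by every hop size.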

NOTE: In addition to the spectrograms and spectrogram windows themselves, ```spect_writer.write()``` writes other metadata to disk with each record. Details follow.

NOTE: This step can take a while, depending on the dataset size, the number of FFT sizes (N_FFTs), and the overlap of the sliding extraction window. With default parameters on a stock MacBook Pro, each minute of audio requires 1-1.5 minutes of spectrogram generation.


In [None]:
from src.spect_writer import SpectWriter
from parameters.spect_writer_params import SPECT_WRITER_PARAMS
spect_writer = SpectWriter(**SPECT_WRITER_PARAMS)
spect_writer.write()
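The "consecutive overlapping windows" written above can be pictured with the sketch below. It assumes windows are cut along the time axis with a fixed width and an overlap measured in time bins; `sliding_windows`, `window_width`, and `overlap` are hypothetical names rather than `SpectWriter` parameters, and `SpectWriter` may treat the trailing remainder differently.

```python
import numpy as np

def sliding_windows(spect, window_width, overlap):
    """Cut a (freq, time) spectrogram into fixed-width windows that overlap along time.

    Any trailing time bins that do not fill a complete window are dropped.
    """
    step = window_width - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_width")
    starts = range(0, spect.shape[1] - window_width + 1, step)
    return [spect[:, s : s + window_width] for s in starts]

spect = np.zeros((128, 500))  # e.g. a full-duration spectrogram with 500 time bins
windows = sliding_windows(spect, window_width=88, overlap=44)
print(len(windows), windows[0].shape)  # 10 (128, 88)
```

The window count is `(T - window_width) // step + 1`, so larger overlaps produce more windows and proportionally longer generation times.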

## Determine Actual Spectrogram Window Sizes 
In order to instantiate our network, we need to provide the input spectrogram shapes. Here, we use a utility function to load a sample set of spectrogram windows, extract their shapes, and save them to the ```NETWORK_PARAMS``` dictionary.

In [None]:
from src.utilities import get_spect_window_shapes
from parameters.params import WINDOWED_SPECTS_DIR_PATH, SPECT_FILE_FMT
from parameters.net_params import NETWORK_PARAMS
n_ffts = NETWORK_PARAMS["n_ffts"]
network_input_shapes = get_spect_window_shapes(WINDOWED_SPECTS_DIR_PATH, SPECT_FILE_FMT, n_ffts)
NETWORK_PARAMS["input_shapes"] = network_input_shapes
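A shape lookup like `get_spect_window_shapes` could work roughly as follows. This is a hypothetical sketch, not the project's implementation: it assumes windows are saved as `.npy` files whose names end in `_nfft<N>`, which may not match the naming used by `SpectWriter`.

```python
import glob
import os
import tempfile

import numpy as np

def spect_window_shapes(windows_dir, n_ffts, ext="npy"):
    """Return the (freq, time) shape of one saved window per FFT size.

    ASSUMPTION: windows are saved as NumPy files named '<anything>_nfft<N>.<ext>'.
    """
    shapes = []
    for n_fft in n_ffts:
        pattern = os.path.join(windows_dir, f"*_nfft{n_fft}.{ext}")
        matches = sorted(glob.glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no windows found for n_fft={n_fft}")
        # Every window for a given n_fft should share one shape, so the first file suffices
        shapes.append(np.load(matches[0]).shape)
    return shapes

# Usage with two fake windows in a temporary directory
with tempfile.TemporaryDirectory() as d:
    np.save(os.path.join(d, "song1_win0_nfft256.npy"), np.zeros((128, 88)))
    np.save(os.path.join(d, "song1_win0_nfft512.npy"), np.zeros((256, 44)))
    print(spect_window_shapes(d, [256, 512]))  # [(128, 88), (256, 44)]
```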

## Create Network. Sample 'forward()' Call To Compute Labelvec Length
To create label vectors from audio file annotations, we need to know the network's output size for a given input size. Instantiate the network with these parameters and save the labelvec length for later use. NOTE: intuitively, we would expect the length of the labelvector to equal the length of the input spectrogram in the horizontal (time) dimension. However, as discussed, this software supports processing multiple spectrogram "channels" with different aspect ratios (time-frequency resolutions). Typically, we would not downsample in the time dimension, but in some cases we might want to hit specific dimensions across all input channels. To handle these cases, we run a test input through the network on instantiation and confirm the labelvector size.
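As a cross-check on the dummy forward pass, the time-axis output length can be predicted by hand with the output-size formula documented for PyTorch's `nn.Conv2d` and `nn.MaxPool2d` (default `ceil_mode=False`). The sketch assumes the LSTM and linear readout preserve the time length, as implied by per-time-step classification; the layer stack below is hypothetical, so substitute the kernel, stride, and padding values from your own network parameters.

```python
def conv_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one axis, per the PyTorch Conv2d/MaxPool2d formula (ceil_mode=False)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def time_axis_out_len(time_len, layers):
    """Apply a sequence of (kernel_size, stride, padding) layers along the time axis."""
    for kernel_size, stride, padding in layers:
        time_len = conv_out_len(time_len, kernel_size, stride, padding)
    return time_len

# Hypothetical time-axis stack: conv(k=5, pad=2) -> pool(k=2, s=2) -> conv(k=5, pad=2) -> pool(k=2, s=2)
layers = [(5, 1, 2), (2, 2, 0), (5, 1, 2), (2, 2, 0)]
print(time_axis_out_len(88, layers))  # 22
```

Odd lengths are floored at each pooling step, which is one reason the labelvec length is best confirmed empirically rather than assumed.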

In [None]:
from src.network import MultiChannelTweetynet
tweetynet = MultiChannelTweetynet(**NETWORK_PARAMS)
labelvec_len = tweetynet.labelvec_len

In [None]:
print(labelvec_len)

## View Tweetynet Architecture

In [None]:
print(tweetynet)

## Write Labelvectors
For training and evaluation, each spectrogram and spectrogram window needs a labelvector. ```labelvec_writer.write()``` writes labelvectors to disk both for complete spectrograms and for the shorter training windows.

This step typically runs slightly faster than spectrogram generation, though its computational cost depends on the total downsampling performed by the convolution and pooling layers, and thus on the user-provided pooling parameters.
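Conceptually, a labelvector assigns one class index to each output time bin. The sketch below assumes annotations are given as onset/offset times in seconds with one label per syllable, that the `labelvec_len` bins evenly divide the clip duration, and that index 0 denotes non-singing. `make_labelvec` and `label_map` are hypothetical; `LabelVecWriter` may resolve bin boundaries differently.

```python
import numpy as np

def make_labelvec(onsets, offsets, labels, duration, labelvec_len, label_map):
    """Assign each of labelvec_len equal time bins a class index; 0 means non-singing."""
    bin_edges = np.linspace(0.0, duration, labelvec_len + 1)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    labelvec = np.zeros(labelvec_len, dtype=np.int64)
    for on, off, lab in zip(onsets, offsets, labels):
        # Mark every bin whose center falls inside this syllable
        labelvec[(bin_centers >= on) & (bin_centers < off)] = label_map[lab]
    return labelvec

labelvec = make_labelvec(
    onsets=[0.1, 0.5], offsets=[0.3, 0.8], labels=["a", "b"],
    duration=1.0, labelvec_len=10, label_map={"a": 1, "b": 2},
)
print(labelvec)  # [0 1 1 0 0 2 2 2 0 0]
```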

In [None]:
from src.labelvec_writer import LabelVecWriter
from parameters.labelvec_writer_params import LABELVEC_WRITER_PARAMS
LABELVEC_WRITER_PARAMS["labelvec_len"] = labelvec_len
labelvec_writer = LabelVecWriter(**LABELVEC_WRITER_PARAMS)
labelvec_writer.write()

## Create Training and Evaluation Datasets

Note: `sample_train_eval_files` randomly selects a subset of the data we have generated. It chooses all training windows associated with `num_train_files` randomly selected audio files and then creates an evaluation set from `num_eval_files` additional files. For example, `num_train_files=40` and `num_eval_files=10` would use 40 training files and 10 evaluation files; the cell below uses one of each for a quick run. The returned training and evaluation sets do not overlap.

Try indexing into ```train_dataset``` and ```eval_dataset``` to inspect how training and evaluation samples are structured.
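The non-overlap guarantee can be obtained with a single draw without replacement, as in this sketch. `split_files` is a hypothetical stand-in, not the implementation of `sample_train_eval_files`, and the file names are made up.

```python
import random

def split_files(files, num_train, num_eval, seed=None):
    """Randomly draw disjoint train/eval file lists (hypothetical stand-in)."""
    if num_train + num_eval > len(files):
        raise ValueError("not enough files for the requested split")
    rng = random.Random(seed)
    # One draw without replacement guarantees the two lists cannot overlap
    chosen = rng.sample(sorted(files), num_train + num_eval)
    return chosen[:num_train], chosen[num_train:]

files = [f"song_{i:03d}.wav" for i in range(60)]
train_files, eval_files = split_files(files, num_train=40, num_eval=10, seed=0)
print(len(train_files), len(eval_files), set(train_files).isdisjoint(eval_files))  # 40 10 True
```

Splitting by file rather than by window keeps overlapping windows from the same song out of both sets, which would otherwise inflate evaluation accuracy.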

In [None]:
from src.dataset import EvalDataset, TrainDataset
from parameters.dataset_params import EVAL_DATASET_PARAMS, TRAIN_DATASET_PARAMS
from parameters.params import AUDIO_FILE_FMT, SPECT_FILE_FMT, UNCUT_SPECTS_DIR_PATH
from src.utilities import sample_train_eval_files
train_files_list, eval_files_list = sample_train_eval_files(
    uncut_spect_path=UNCUT_SPECTS_DIR_PATH,
    num_train_files=1,
    num_eval_files=1,
    audio_file_fmt=AUDIO_FILE_FMT,
    spect_file_fmt=SPECT_FILE_FMT,
)
TRAIN_DATASET_PARAMS['audio_files_list'] = train_files_list
EVAL_DATASET_PARAMS['spect_files_list'] = eval_files_list
train_dataset = TrainDataset(**TRAIN_DATASET_PARAMS)
eval_dataset = EvalDataset(**EVAL_DATASET_PARAMS)

In [None]:
samp = train_dataset[0]
windows, labvec = samp
print(windows)
print('\n')
print(labvec)

## Load Training Parameters, Instantiate DataLoaders

In [None]:
from parameters.train_params import DEVICE, NUM_EPOCHS, TRAIN_BATCH_SIZE, EVAL_BATCH_SIZE, NUM_WORKERS, EVAL_STEP, LR
from torch.utils.data import DataLoader
train_data = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=True)
eval_data = DataLoader(eval_dataset, batch_size=EVAL_BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=False)   

## Instantiate Model
Note that the network (```tweetynet```) set up at an earlier step in this notebook is passed to the model through the ```net``` argument of ```model.run()``` in the training cell below.

Users are encouraged to write their own model classes, or modify the provided class, if they have specific requirements for saving model outputs or internal state during training.

In [None]:
from src.model import TweetynetModel
model = TweetynetModel(device=DEVICE)

## Start Training Loop



In [None]:
num_train_samps, accs = model.run(
    eval_step=EVAL_STEP,
    num_epochs=NUM_EPOCHS,
    train_data=train_data,
    eval_data=eval_data,
    net=tweetynet,
    lr=LR,
)
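The notebook does not spell out how `model.run()` computes the accuracies it returns. A common choice for frame-classification networks is frame-wise accuracy, sketched below under the assumption that network outputs have shape `(batch, n_classes, time)` and labelvectors have shape `(batch, time)`; `framewise_accuracy` is a hypothetical helper, not part of `TweetynetModel`.

```python
import numpy as np

def framewise_accuracy(logits, labelvecs):
    """Fraction of time bins whose highest-scoring class matches the labelvector.

    logits: (batch, n_classes, time) scores; labelvecs: (batch, time) integer class indices.
    """
    preds = logits.argmax(axis=1)  # (batch, time) predicted class per bin
    return float((preds == labelvecs).mean())

logits = np.zeros((1, 3, 4))
logits[0, [0, 1, 2, 1], [0, 1, 2, 3]] = 1.0  # predicts classes 0, 1, 2, 1
labelvecs = np.array([[0, 1, 2, 2]])
print(framewise_accuracy(logits, labelvecs))  # 0.75
```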

## Plot Results

In [None]:
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(8,4))
plt.title("TweetyNet Classification")
plt.xlabel("Number of training windows")
plt.ylabel("Accuracy")
plt.plot(num_train_samps, accs)
plt.show()
