# Featurization Pipeline Steps for Training and Inference

### Important Instructions

This notebook can be used in two ways.

#### 1. Generate a subset of a training set

You'll need to provide two inputs: the video and a whistle ground truth file.

The ground truth file MUST be in the same folder as the video. It must also share the video's basename, with the extension replaced by *_0.groundtruth.whistle.txt*

For example *5U5A9273.webm* and *5U5A9273_0.groundtruth.whistle.txt*
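
The naming rule can be sketched in a few lines. This is only an illustration: `ground_truth_path` is a hypothetical helper that is not part of the notebook, and it assumes the ground truth file sits next to the video.

```python
import os

def ground_truth_path(video_path):
    """Return the expected ground truth path for a video (hypothetical helper)."""
    # Strip only the final extension so dots elsewhere in the path are preserved.
    base, _ext = os.path.splitext(video_path)
    return base + "_0.groundtruth.whistle.txt"

print(ground_truth_path("5U5A9273.webm"))  # 5U5A9273_0.groundtruth.whistle.txt
```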

The ground truth file lists the time of each whistle, one per line, in the following format:
*`[Minutes:]Seconds[.Milliseconds] [# Comment]`*

> 2.9 <br>
> 11.3 # plus cheers <br>
> 30.38 <br>
> 59.149 <br>
> 1:06.49 <br>
> 1:22.34 <br>
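
Before running the notebook it can help to check that every line of a ground truth file matches this format. The sketch below is one possible reading of the format as a regular expression. `LINE_RE` and `parse_line` are illustrative names, not part of the notebook, and the `>` and `<br>` markers in the example above are treated as rendering markup rather than file content.

```python
import re

# Pattern for "[Minutes:]Seconds[.Milliseconds] [# Comment]" (an assumed reading of the format).
LINE_RE = re.compile(r"^\s*(?:(\d+):)?(\d+(?:\.\d+)?)\s*(?:#.*)?$")

def parse_line(line):
    """Return the timestamp in seconds, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    minutes = int(m.group(1)) if m.group(1) else 0
    return minutes * 60 + float(m.group(2))

for line in ["2.9", "11.3 # plus cheers", "1:06.49", "oops"]:
    print(repr(line), "->", parse_line(line))
```

Lines that come back as `None` (blank lines, stray text) are the ones worth fixing before the file is used.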

Running this notebook will create two files:

1. A file ending with *_swin_{low_1}_{high_1}.npy*, where low_1 and high_1 are the bin indices used to slice the FFT output.

2. A file ending with *_training_set_1.npy*.


#### 2. Generate input features for inference

You'll need to provide one input: the video.

Running this notebook will create a file ending with *_swin_{low_1}_{high_1}.npy*, where low_1 and high_1 are the bin indices used to slice the FFT output.

Select the [Stop](#Stop) cell and choose **Run All Above Selected Cell**.


In [102]:
#
# Given a video file, derive its base name and the name of the file
# that will hold the extracted audio.
#
video = "5U5A9278.MOV"
video = "IMG_2398.MOV"
video = "IMG_2365.MOV"

# Input 1
video = "/Volumes/Stuff/youtube-dl-2/GH020192 [wj1Vc2QSdfI].webm"

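import os

# splitext strips only the final extension, so dots elsewhere in the path are safe
name = os.path.splitext(video)
audio = f"{name[0]}.mp3"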

# Input 2 (Optional)
ground_truth_file = name[0]+"_0.groundtruth.whistle.txt"

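# Output 1
# NOTE: low_1 and high_1 are computed in the "Slice FFT" cell further down.
# On a fresh kernel this line raises NameError until that cell has run.
inference_tensor_file=f"{name[0]}_swin_{low_1}_{high_1}.npy"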

# Output 2 (Optional)
training_tensor_file=name[0]+"_training_set_1.npy"

# Extract

In [103]:
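# moviepy.editor is the MoviePy 1.x import path; MoviePy 2.x removed it
# in favour of importing VideoFileClip from the top-level moviepy package.
import moviepy.editor as mp

# Write the video's audio track to the .mp3 path derived above
clip = mp.VideoFileClip(video)
clip.audio.write_audiofile(audio)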

MoviePy - Writing audio in /Volumes/Stuff/youtube-dl-2/GH020192 [wj1Vc2QSdfI].mp3

MoviePy - Done.


Load the extracted audio. The following step bandpass-filters the signal between 2 kHz and 7.9 kHz.

In [104]:
import librosa

y,sr = librosa.load(audio)

# Transform
## Filter

In [105]:
from scipy import signal

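# 17th-order Chebyshev type II bandpass, 2-7.9 kHz, 60 dB stopband attenuation.
# High-order filters in (b, a) form can be numerically fragile; SciPy's
# output='sos' with signal.sosfilt is the usual, more stable alternative.
b, a = signal.iirfilter(17, [2000, 7900], rs=60, fs=sr,
                        btype='band', analog=False, ftype='cheby2')
y_f = signal.lfilter(b, a, y, axis=-1)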

## Frame

`frame_size_seconds` sets the frame size in seconds.

`hop_in_window_divisions` sets how many hops fit in one frame (2 means consecutive frames overlap by half).

`z` contains the sliding audio frames.

In [106]:
import math

frame_size_seconds = 0.7       # frame duration in seconds
hop_in_window_divisions = 2    # hops per frame (2 gives 50% overlap)

# Samples per frame, rounded up to a whole sample
frame_length_c = math.ceil(frame_size_seconds * sr)
frame_length = frame_length_c
# Samples between the starts of consecutive frames
hop_length = math.floor(frame_length_c / hop_in_window_divisions)

print(f"hop_length={hop_length} frame_length={frame_length}")

# With axis=0 each row of z is one frame of the filtered signal
z = librosa.util.frame(y_f,
                       frame_length=frame_length,
                       hop_length=hop_length,
                       axis=0,
                       writeable=False,
                       subok=False)



hop_length=7717 frame_length=15435
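
The printed values can be reproduced by hand. The sketch below assumes `sr = 22050`, which is `librosa.load`'s default sample rate and is consistent with the output above. `n_frames` and `frame_start_seconds` are illustrative helpers, not notebook functions.

```python
import math

sr = 22050                      # assumed: librosa.load's default sample rate
frame_size_seconds = 0.7
hop_in_window_divisions = 2

frame_length = math.ceil(frame_size_seconds * sr)
hop_length = math.floor(frame_length / hop_in_window_divisions)

def n_frames(n_samples):
    """Number of complete frames that fit in a signal of n_samples samples."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // hop_length

def frame_start_seconds(i):
    """Start time of frame i in seconds."""
    return i * hop_length / sr

print(frame_length, hop_length)           # 15435 7717
print(n_frames(60 * sr))                  # frames in a hypothetical 60 s clip
print(round(frame_start_seconds(1), 3))   # 0.35
```

Each frame therefore covers 0.7 s and starts roughly 0.35 s after the previous one. Trailing samples that do not fill a whole frame are dropped.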


## FFT

Functions for taking the **fast Fourier transform** of audio frames, and for finding the index range used to slice the FFT frames.

In [107]:

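def window_fft(y, n_fft=2048, hop_length=12000):
    # center=False: no padding, so only windows that fit entirely inside y are transformed
    return librosa.stft(y, n_fft=n_fft, center=False, hop_length=hop_length)

def find_bounds(n_fft=2048, sr=sr, low=2000, high=4000, ts=3):
    """Return (low_, high_): the index of the first bin above `low` Hz and
    the index just before the first bin above `high` Hz."""
    freq = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    low_ = None
    high_ = None
    for j, f in enumerate(freq):
        if low_ is None and f > low:
            low_ = j
        # `is not None` rather than truthiness, so a lower bound at bin 0 still counts
        if low_ is not None and high_ is None and f > high:
            high_ = j - 1

    print(f"low={low_} - hi={high_} x {ts} cross {(high_-low_)*ts} fft bins {len(freq)}")
    return (low_, high_)

def find_bounds_bands(n_fft=512, ts=3):
    # The default is a literal because `nfft` is only defined in a later cell
    return (find_bounds(n_fft, low=2000, high=4096, ts=ts),
            find_bounds(n_fft, low=6000, high=8000, ts=ts))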

In [108]:
from itertools import islice
import numpy as np

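nfft=512
# Use a separate name for the STFT hop. Reusing `hop_length` here would overwrite
# the framing hop that the ground truth conversion further down relies on.
fft_hop_length = frame_length_c  # hop equals the frame length: one STFT column per frame
def sss(ar, index, n_fft, positive_example=True):
    # positive_example is unused; it is kept so existing calls still work
    return window_fft(ar[index], n_fft=n_fft, hop_length=fft_hop_length)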


acc=None

kk=sss(z,0,nfft,True)
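
It is worth checking how many STFT columns each frame produces. With `center=False`, `librosa.stft` keeps only windows that fit entirely inside the input, so the count follows the usual framing formula. The sketch below assumes `sr = 22050` and the frame length printed earlier. `stft_columns` is an illustrative helper, not a librosa function.

```python
def stft_columns(n_samples, n_fft, hop_length):
    """Column count of a non-centered STFT: only full windows are kept."""
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop_length

sr = 22050            # assumed sample rate (librosa.load default)
frame_length = 15435  # samples per 0.7 s frame, as printed by the Frame cell
n_fft = 512

# The notebook passes a hop equal to the frame length.
cols = stft_columns(frame_length, n_fft, hop_length=frame_length)
window_ms = 1000 * n_fft / sr

print(cols)                 # 1
print(round(window_ms, 1))  # 23.2
```

Under these assumptions each frame yields a single column computed from its first 512 samples (about 23 ms), which matches the `x 1` in the bounds printed by the next cell. A smaller hop, for example 256, would give 59 columns per frame and a correspondingly wider feature vector.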

## Slice FFT

Logic for slicing the fft frames

In [109]:
((low_1,high_1),(low_2,high_2))=find_bounds_bands(n_fft=nfft,ts=np.shape(kk)[1])

# Only the first band (low_1:high_1) is used as features; the second band is computed but unused.
# Building a list and stacking once avoids repeated copies and re-run accumulation.
rows = [np.abs(sss(z, i, nfft, True))[low_1:high_1].ravel() for i in range(len(z))]
kk = np.vstack(rows)

np.shape(kk)

low=47 - hi=95 x 1 cross 48 fft bins 257
low=140 - hi=185 x 1 cross 45 fft bins 257


(2021, 48)
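
The printed bounds can be checked by hand, because bin `k` of an FFT of size `n_fft` sits at `k * sr / n_fft` Hz. The sketch below mirrors the `find_bounds` logic in closed form, assuming `sr = 22050`. `band_bins` is an illustrative name, not part of the notebook.

```python
import math

def band_bins(sr, n_fft, low_hz, high_hz):
    """Bin indices (low, high) mirroring the notebook's find_bounds logic."""
    bin_hz = sr / n_fft                      # frequency spacing between bins
    low = math.floor(low_hz / bin_hz) + 1    # first bin strictly above low_hz
    high = math.floor(high_hz / bin_hz)      # one before the first bin strictly above high_hz
    return low, high

print(band_bins(22050, 512, 2000, 4096))  # (47, 95)
print(band_bins(22050, 512, 6000, 8000))  # (140, 185)
```

Because the slice `low_1:high_1` excludes its upper index, the first band contributes 95 - 47 = 48 bins per frame, which is the second dimension of the `(2021, 48)` shape above.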

## Write Inference Feature Tensor

Save the features to a file that can be used for inference

In [110]:
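# Rebuild the name here so it always reflects the bounds computed above
inference_tensor_file = f"{name[0]}_swin_{low_1}_{high_1}.npy"
np.save(inference_tensor_file,kk,allow_pickle=False)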

<a id="Stop"></a>
## Stop 

**ONLY PROCEED RUNNING THE NEXT CELLS IF YOU HAVE PROVIDED A GROUND TRUTH FILE WITH THIS VIDEO.**

Proceed to the next cells only if this video will be used for training models.

The following logic will load the ground truth file, convert the times into positive judgements and marry them to the frames containing the input features.

## Load ground truth
Logic for converting judgements in the whistle file format into frames.

In [111]:

def read(a_file):
    with open(a_file, "r") as f:
        # Strip comments first, then skip lines that are blank or comment-only
        stripped = (stripComments("#")(line.rstrip()) for line in f)
        return [parseMinSec(s) for s in stripped if s]

def convert_judgement_in_seconds_to_frames(hop_length, judgements=()):
    for i in judgements:
        a = i * sr / hop_length # sec * (samples / sec) / (samples / frame) = frame
        a = math.floor(a)
        # Mark the frame at the whistle time and the one after it
        yield iter(range(a, a + 2))
        
def parseMinSec(cs):
    minute = seconds = 0
  
    if ':' in cs:
        left, right = cs.split(":")
        minute = float(left) * 60
        seconds = float(right)
    else:
        seconds = float(cs)
    
    return minute + seconds

from itertools import takewhile
 
 
# stripComments :: [Char] -> String -> String
def stripComments(cs):
    '''The lines of the input text, with any
       comments (defined as starting with one
       of the characters in cs) stripped out.
    '''
    def go(cs):
        return lambda s: ''.join(
            takewhile(lambda c: c not in cs, s)
        ).strip()
    return lambda txt: '\n'.join(map(
        go(cs),
        txt.splitlines()
    ))

Convert the judgements into a list of frame indices that belong to the positive class.

In [112]:
indices=[item for i in convert_judgement_in_seconds_to_frames(
    hop_length=hop_length,
    judgements=read(ground_truth_file)) for item in i]
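
As a worked example of the conversion, the sketch below maps a few of the timestamps from the instructions to frame indices. It assumes `sr = 22050` and that `hop_length` is the framing hop printed by the Frame cell (7717 samples). `whistle_frames` is an illustrative helper, not a notebook function.

```python
import math

sr = 22050          # assumed sample rate
hop_length = 7717   # assumed: the framing hop printed by the Frame cell

def whistle_frames(seconds):
    """Frame indices labelled positive for one whistle time, mirroring range(a, a + 2)."""
    first = math.floor(seconds * sr / hop_length)
    return [first, first + 1]

for t in (2.9, 11.3, 66.49):
    print(t, whistle_frames(t))
```

If `hop_length` holds a different value when the conversion runs, the indices scale accordingly, so it is worth printing it before building the labels.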

## Create Judgement Vector

Convert the judgements into a vector with the same length as the number of frames.  
Use np.put to set the indices of the positive class from the previous step.

In [113]:
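gt = np.zeros(len(kk), dtype=int)

# Drop indices past the last frame; np.put would otherwise raise IndexError
valid_indices = [i for i in indices if i < len(gt)]
np.put(gt, valid_indices, v=1)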

Marry the feature matrix with the judgement vector, resulting in a single matrix whose last column contains the judgements. The model training logic will split the judgements back out.

In [114]:
kkk=np.column_stack((kk,gt))

Verify that the matrix has gained one column.

In [115]:
np.shape(kkk)

(2021, 49)
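
The stacking and the later split can be illustrated on a toy matrix. This is a sketch with made-up data, not the notebook's training code: `features` and `labels` are hypothetical stand-ins for `kk` and `gt`.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's matrices: 5 frames, 3 features.
features = np.arange(15, dtype=float).reshape(5, 3)
labels = np.array([0, 1, 1, 0, 0])

stacked = np.column_stack((features, labels))   # shape (5, 4); labels become floats

# Split the last column back out, as a training script might.
X = stacked[:, :-1]
y = stacked[:, -1].astype(int)

print(stacked.shape, X.shape, y.tolist())
```

Note that `np.column_stack` promotes the integer labels to the features' float dtype, so a consumer that needs integer classes has to cast the last column back.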

## Write Training Tensor
Save the married features and judgements to a file that can be used for training.

In [116]:
np.save(training_tensor_file,kkk,allow_pickle=False)