# Summary
The purpose of this code is to create a TensorFlow Lite model. To do this, we supply a data set, create a TensorFlow model, train the model on that data set, then convert the TF model into a TF Lite model. This TF Lite model can then be loaded onto a mobile device for audio classification. 

There are a variety of models that can be trained on a variety of data sets. This code trains MFCC models, which take a couple of extra data-processing steps compared to their amplitude counterparts. This guide aims to describe all of the parameters in the training process, so that you can change them and build your own models. Remember, the end goal is to achieve the highest validation accuracy possible before loading that model onto a phone for use.

**Numbers to beat for our l6-data:** <br>
*A good goal--* 85% val_acc <br>
*Our best model--* 94% val_acc

# Imports
We use Keras (which is built on top of TensorFlow) to build and train our models. Librosa is used for audio processing.

In [None]:
# make sure kernel matches pip version
!pip3 install -r requirements.txt

In [None]:
%load_ext autoreload
%autoreload 2

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Tensorflow
import tensorflow as tf
from tensorflow.python.tools import freeze_graph
from tensorflow.python.tools import optimize_for_inference_lib

# Keras
import keras
from keras import regularizers
from keras.models import Sequential
from keras.layers import (Activation, Dense, Dropout, Flatten, Conv2D, Conv1D, 
                          MaxPooling2D, GlobalAveragePooling2D, MaxPooling1D, Lambda)
from keras.layers.normalization import BatchNormalization
from keras.callbacks import Callback, ReduceLROnPlateau, ModelCheckpoint
from keras.utils import to_categorical, multi_gpu_model
import keras.backend as K

import librosa
import multiprocessing as mp
import numpy as np
import scipy.io.wavfile
from scipy.fftpack import dct
from sklearn.model_selection import train_test_split
from tqdm import tqdm as tqdm
import time
from pprint import pprint
import uuid
import glob
import math

# Constants

## General

**RAW_DATA_DIR:** Where the raw training data is located. In the directory each folder name is a label, and its contents are .wav files corresponding to that label. See the "l6-data" directory for an example. 

**MFCC_PROCESSED_DATA_DIR:** Where the processed training data is to be stored. After processing the data, it will be populated with numpy files. Each numpy file is named after a label, and contains all of the training data for that label stored as a single numpy array of MFCC clips. More info on this later.

**AUDIO_LENGTH:** The desired input size for the model. An input size of 44100 at 44100 Hz would be a one second input. 

**SAMPLE_RATE:** The sample rate of the microphone.

## MFCC

**nmfcc:** The number of mfccs to use. This directly affects the shape of the input to your model. 

**nmels:** The number of Mel bands used to build the Mel spectrogram from which the MFCCs are computed. It is passed to librosa as `n_mels`; 128 is librosa's default and the value most examples use.

**hop_length:** The hop length for calculating MFCCs, i.e. the number of samples between the starts of consecutive frames.

**n_fft:** The length of each FFT window, in samples, used when calculating MFCCs.

**nframe:** The number of frames in each MFCC clip. This directly affects the shape of the input to your model and is derived from the audio length and hop length: `ceil(AUDIO_LENGTH / hop_length)`, which is 87 for the values below.
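
As a rough check on `nframe`, the sketch below compares the `ceil(AUDIO_LENGTH / hop_length)` formula used in this notebook with the frame count produced by centered framing, `1 + AUDIO_LENGTH // hop_length`, which to our understanding is what librosa produces by default. The helper names are made up for illustration and are not part of the notebook.

```python
import math

def frames_ceil(audio_length, hop_length):
    # Formula used for nframe in this notebook
    return int(math.ceil(audio_length / hop_length))

def frames_centered(audio_length, hop_length):
    # Frame count for centered framing (assumed to match librosa's default)
    return 1 + audio_length // hop_length

# Values used in this notebook: both formulas agree
print(frames_ceil(44100, 512), frames_centered(44100, 512))

# An audio length that is an exact multiple of the hop length: they differ by one
print(frames_ceil(51200, 512), frames_centered(51200, 512))
```

The two agree for the constants used here, but they can differ when the audio length is an exact multiple of the hop length, so it is worth re-checking the MFCC shape if you change either constant.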

## Training

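**channel:** We use a one-channel (mono) audio input.

**epochs:** How many full passes the model makes over your training data. 200 is usually a good number; there are diminishing returns after a certain point.

**batch_size:** The number of training clips the model processes before each weight update. Larger batches use more memory but take fewer steps per epoch.

**verbose:** Controls the Keras training log: 1 shows a progress bar, 0 is silent. It's always a good idea to have this on.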

**num_classes:** The number of classes (also referred to as labels) in your data set.
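
To make `epochs` and `batch_size` concrete, here is a small sketch of how many weight updates a training run performs. The training-set size is a hypothetical number chosen for illustration, and `steps_per_epoch` is a made-up helper, not a Keras function.

```python
import math

def steps_per_epoch(num_clips, batch_size):
    # The last batch may be smaller than batch_size, so round up
    return math.ceil(num_clips / batch_size)

num_clips = 3000   # hypothetical number of training clips
batch_size = 128
epochs = 200

per_epoch = steps_per_epoch(num_clips, batch_size)
total_updates = per_epoch * epochs
print(per_epoch, total_updates)
```

Halving `batch_size` roughly doubles the number of updates per epoch, which is one reason the two values are usually tuned together.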

In [None]:
RAW_DATA_DIR = "data/"
MFCC_PROCESSED_DATA_DIR = "mfcc-processed-data/"
AUDIO_LENGTH = 44100
SAMPLE_RATE = 44100

nmfcc = 128
nmels = 128
hop_length = 512
n_fft = 1024
nframe = int(math.ceil(AUDIO_LENGTH / hop_length))
print(nframe)

channel = 1
epochs = 200
batch_size = 128
verbose = 1
num_classes = 6

# Data Processing

### get_labels(path)
**Input:** `RAW_DATA_DIR` <br>
**Output:** `Tuple (Labels, Indices of the labels, one-hot encoded labels)` <br>
**Description:** Gets all labels (i.e. the folder names) inside your raw data directory, skipping hidden entries. The indices are the positions of these labels in sorted order, and each one-hot encoded vector is a vector of zeros whose length is the number of labels, with a 1 at the corresponding label's index.
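
As an illustration of the one-hot encoding described above, here is a minimal sketch in plain Python. The label names are hypothetical, and `one_hot` is a stand-in for what Keras's `to_categorical` does for a single index.

```python
def one_hot(index, num_classes):
    # A vector of zeros with a single 1 at the label's index
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

labels = ["cough", "door", "keyboard"]  # hypothetical label folders
indices = list(range(len(labels)))
encoded = [one_hot(i, len(labels)) for i in indices]

for label, vec in zip(labels, encoded):
    print(label, vec)
```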

### wav2mfcc(file_path)
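**Input:** `file_path` - path to a .wav file <br>
**Output:** `mfcc_vectors` - list of arrays, each with shape: (nmfcc, nframe) <br>
**Description:** Takes a .wav file, normalizes it, splits it into overlapping windows of AUDIO_LENGTH samples, and converts each window to MFCCs using librosa and the MFCC constants declared above.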

### label_to_mfcc_vecs(args)
**Input:** `Tuple (label, input_path, output_path, tqdm_position)`<br>
**Output:** Shape of the generated numpy file (stored under `output_path`)<br>
**Description:** This function is called by `process_data_mfcc()` in parallel to convert a label's raw .wav files to a single numpy file. This can take a while, so David made a super fancy tqdm display (hence the `tqdm_position` parameter). **This function converts the training data into overlapping MFCCs with dimensions (nmfcc, nframe), then stores them in one numpy array under the `output_path` dir.** It moves through each clip in steps of AUDIO_LENGTH / 2, so if a clip is 10.1 seconds long and the input length is 44100 at a 44100 Hz sampling rate, it is processed as 20 one-second clips: 19 full windows plus a final piece that is padded with zeros to match the input length. This is done for all the .wav files in a label, which are converted to amplitude arrays with `librosa.load()`, then to MFCCs with `librosa.feature.mfcc()`. The resulting numpy array contains the MFCCs that are actually trained on.
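
The overlapping-window logic can be sketched without any audio libraries. The helper below is hypothetical (it is not part of the notebook) and only mirrors the loop in `wav2mfcc()`: it reports where each window starts and how many zeros are needed to pad it.

```python
def window_starts(num_samples, audio_length):
    """Return (start, pad) pairs: where each window begins and how many zeros pad it."""
    hop = audio_length // 2          # step through the clip at AUDIO_LENGTH / 2
    windows = []
    start = 0
    remaining = num_samples
    while remaining > audio_length:
        windows.append((start, 0))   # full window, no padding needed
        start += hop
        remaining -= hop
    # Whatever is left becomes the final window, padded up to audio_length
    windows.append((start, audio_length - remaining))
    return windows

clip = 97020  # a 2.2 second clip at 44100 Hz
for start, pad in window_starts(clip, 44100):
    print(start, pad)
```

For this 2.2 second clip the loop yields three full windows and one final window padded with 13230 zeros.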

### process_data_mfcc(input_path, output_path)
**Input:** RAW_DATA_DIR, MFCC_PROCESSED_DATA_DIR<br>
**Output:** Shape of the numpy files stored in output_path<br>
**Description:** Calls `label_to_mfcc_vecs()` once per label and populates output_path with numpy files. Each numpy file is named after a label, and its content is a numpy array with all the training data for that label. The shape of this array is (number_of_clips, nmfcc, nframe). The trailing channel dimension that Keras requires is added later, when the data is reshaped before training.

In [None]:
def get_labels(path):
    labels = [i for i in sorted(os.listdir(path)) if i[0] != "."]
    label_indices = np.arange(0, len(labels))
    return labels, label_indices, to_categorical(label_indices)

def wav2mfcc(file_path):
    mfcc_vectors = []
    audio_buf,_ = librosa.load(file_path, mono=True, sr=SAMPLE_RATE)
    audio_buf = (audio_buf - np.mean(audio_buf)) / np.std(audio_buf)
    
    remaining_buf = audio_buf.copy()
    while remaining_buf.shape[0] > AUDIO_LENGTH:
        # Add the first AUDIO_LENGTH of the buffer as a new vector to train on
        new_buf = remaining_buf[ : AUDIO_LENGTH ]
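        # Pass the audio as y= so the call also works on librosa versions that require keyword arguments
        mfcc = librosa.feature.mfcc(y=new_buf, sr=SAMPLE_RATE, S=None, n_mfcc=nmfcc, n_fft=n_fft, hop_length=hop_length, n_mels=nmels)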
        mfcc_vectors.append(mfcc)

        # Remove 1/2 * AUDIO_LENGTH from the front of the buffer
        remaining_buf = remaining_buf[ int(AUDIO_LENGTH / 2) : ]

    # Whatever is left, pad and stick in the training data
    remaining_buf = np.concatenate((remaining_buf, np.zeros(shape=(AUDIO_LENGTH - len(remaining_buf)))))
    mfcc = librosa.feature.mfcc(y=remaining_buf, sr=SAMPLE_RATE, S=None, n_mfcc=nmfcc, n_fft=n_fft, hop_length=hop_length, n_mels=nmels)
    mfcc_vectors.append(mfcc)
    
    return mfcc_vectors

In [None]:
def label_to_mfcc_vecs(args):
    label, input_path, output_path, tqdm_position = args
    
    vectors = []

    wavfiles = [os.path.join(input_path, label, wavfile) for wavfile in os.listdir(os.path.join(input_path, label))]
    
    # tqdm is amazing, so print all the things this way
    print(" ", end="", flush=True)
    twavs = tqdm(wavfiles, position=tqdm_position)
    for i, wavfile in enumerate(twavs):
        vectors_for_file = wav2mfcc(wavfile)
        for v in vectors_for_file:
            vectors.append(v)        
        # Update tqdm
        twavs.set_description("Label - '{}'".format(label))
        twavs.refresh()
#     np.delete(vectors, 0)  # deletes first zero entry    
    np_vectors = np.array(vectors)
    np.save(os.path.join(output_path, label + '.npy'), np_vectors)
    return np_vectors.shape   

def process_data_mfcc(input_path, output_path):
    # Create the requested output directory if it does not exist yet
    os.makedirs(output_path, exist_ok=True)
    
    labels, _, _ = get_labels(input_path)
    pool = mp.Pool()
    result = pool.map(label_to_mfcc_vecs, 
                     [(label, input_path, output_path, tqdm_position) 
                          for tqdm_position, label in enumerate(labels)])
    pool.close()
    return result

In [None]:
process_data_mfcc(RAW_DATA_DIR, MFCC_PROCESSED_DATA_DIR)

# Training the Model

### get_train_test(split_ratio=0.75, random_state=42)
**Inputs:** `split_ratio`, `random_state` <br>
**Outputs:** `X_train, X_test, y_train, y_test` <br>
**Description:** Uses sklearn's `train_test_split` to split the processed data into training data and test data. We use a .75 ratio of training to test data. X_train is the training data, which after reshaping is a 4d array with dimensions (number_of_clips, nmfcc, nframe, 1). Again, the 4th dimension is a requirement from Keras. y_train is a 1d array containing label indices that correspond to each clip in X_train. X_test and y_test are the same, but for testing data instead of training data. The testing data is used to calculate accuracy metrics during training.
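
To show what a shuffled 75/25 split does, here is a plain-Python sketch. It is a stand-in for illustration only, not sklearn's implementation: the shuffle order and the rounding of the split point may differ from `train_test_split`, and `split_indices` is a made-up name.

```python
import random

def split_indices(num_clips, split_ratio=0.75, random_state=42):
    # Shuffle clip indices reproducibly, then cut at the split ratio
    indices = list(range(num_clips))
    random.Random(random_state).shuffle(indices)
    cut = int(num_clips * split_ratio)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing `random_state` keeps the split identical between runs, which makes validation accuracies comparable when you change other parameters.
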
### get_model()
**Inputs:** None <br>
**Outputs:** `Tensorflow Model` <br> 
**Description:** This function is where the model itself is built. The input size must match the clip length (e.g. our MFCC shapes), and the output must be a set of weights corresponding to each label (so an array of length labels). The highest weight will correspond to the final classification. The models we use are CNNs, which are built layer by layer. Any model with the proper input and output sizes can be loaded. In our case, the input shape was (nmfcc, nframes, 1), as Keras required the extra dimension. 
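
When changing the MFCC constants, it helps to check that the input survives all of the pooling stages. The sketch below traces the spatial size through six 2x2 max-pooling layers (the number used in `get_model()`), assuming Keras's default 'valid' pooling and 'same'-padded convolutions that leave the size unchanged. The helper names are illustrative only.

```python
def pooled(size, pool=2, stride=2):
    # Output length of 'valid' max pooling along one axis
    return (size - pool) // stride + 1

def trace_shape(height, width, num_pools):
    # Record the (height, width) of the feature map after each pooling stage
    shapes = [(height, width)]
    for _ in range(num_pools):
        height, width = pooled(height), pooled(width)
        shapes.append((height, width))
    return shapes

# nmfcc = 128, nframe = 87, six pooling stages
for shape in trace_shape(128, 87, 6):
    print(shape)
```

If either dimension reaches zero for your constants, the model cannot be built as written and a pooling stage would have to be removed.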

### train(model, X_train, y_train_hot, X_test, y_test_hot)
**Inputs:** `model, X_train, y_train_hot, X_test, y_test_hot`<br>
**Outputs:** The given model is trained on the given data. <br>
**Description:** We use the 'adam' optimizer during training, as it usually results in much higher accuracies. We also save a checkpoint each time the validation accuracy improves, so that we can retrieve the model from the epoch with the highest accuracy. These are saved as `out/weights.{epoch}-{val_acc}.hdf5` and can be converted to TF Lite models later.
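
The checkpoint file names come from the pattern passed to `ModelCheckpoint` in `train()`, which is filled in with Python's string formatting. The sketch below shows the names that pattern produces and one possible way to pick the best checkpoint afterwards. Both helper functions and the accuracy values are hypothetical.

```python
import re

PATTERN = 'out/weights.{epoch:02d}-{val_acc:.2f}.hdf5'

def checkpoint_name(epoch, val_acc):
    # Same formatting rules as the checkpoint pattern: 2-digit epoch, 2-decimal accuracy
    return PATTERN.format(epoch=epoch, val_acc=val_acc)

def best_checkpoint(filenames):
    # Pick the file whose name carries the highest val_acc; ties go to the later epoch
    def key(name):
        match = re.search(r'weights\.(\d+)-(\d+\.\d+)\.hdf5$', name)
        return (float(match.group(2)), int(match.group(1)))
    return max(filenames, key=key)

names = [checkpoint_name(3, 0.8412), checkpoint_name(17, 0.9134), checkpoint_name(9, 0.88)]
print(names)
print(best_checkpoint(names))
```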

In [None]:
def get_train_test(split_ratio=0.75, random_state=42):
    # Get available labels
    labels, indices, _ = get_labels(RAW_DATA_DIR)

    # Getting first arrays
    X = np.load(os.path.join(MFCC_PROCESSED_DATA_DIR, labels[0] + '.npy'))
    y = np.zeros(X.shape[0])

    # Append all of the dataset into one single array, same goes for y
    for i, label in enumerate(labels[1:]):
        x = np.load(os.path.join(MFCC_PROCESSED_DATA_DIR, label + '.npy'))
#         print(label)
#         print(x.shape)
#         print(X.shape)
        X = np.vstack((X, x))
        y = np.append(y, np.full(x.shape[0], fill_value= (i + 1)))

    assert X.shape[0] == len(y)

    return train_test_split(X, y, test_size= (1 - split_ratio), random_state=random_state, shuffle=True)

In [None]:
# # Loading train set and test set
X_train, X_test, y_train, y_test = get_train_test()

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# Reshaping to perform 2D convolution
print(X_train.shape)

X_train = X_train.reshape(X_train.shape[0], nmfcc, nframe, channel)
X_test = X_test.reshape(X_test.shape[0], nmfcc, nframe, channel)
y_train_hot = to_categorical(y_train)
y_test_hot = to_categorical(y_test)

print(X_train.shape)
print(X_test.shape)
print(y_train_hot.shape)
print(y_test_hot.shape)
print(X_train[0][0][0])

In [None]:
def get_model():
    model = Sequential()

    model.add(Conv2D(16, 
                     kernel_size=3, padding='same', activation='relu',name='voice', input_shape=(nmfcc,nframe, channel)))
    model.add(BatchNormalization())
    model.add(Conv2D(16, kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2, strides=2))
    
    model.add(Conv2D(32,kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(32, kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2, strides=2))
    
    model.add(Conv2D(64,kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(64, kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2, strides=2))
    
    model.add(Conv2D(128,kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(128, kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2, strides=2))
    
    model.add(Conv2D(256,kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(256, kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2))
    
    model.add(Conv2D(512,kernel_size=3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=2, strides=2))
    
    model.add(Conv2D(1024,kernel_size=2,padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(num_classes, kernel_size=1,padding='same', activation='sigmoid'))
    
    model.add(GlobalAveragePooling2D())

    return model

In [None]:
def train(model, X_train, y_train_hot, X_test, y_test_hot):
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    print(model.summary())
    
    reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=10, min_lr=0.0001, verbose=1)
    mcp_save = ModelCheckpoint('out/weights.{epoch:02d}-{val_acc:.2f}.hdf5', save_best_only=True, monitor='val_acc', mode='max')
    model.fit(X_train, 
              y_train_hot, 
              batch_size=batch_size, 
              epochs=epochs, 
              verbose=verbose, 
              validation_data=(X_test, y_test_hot),
              callbacks=[reduce_lr, mcp_save],
              shuffle=True)

In [None]:
if not os.path.exists('out'):
    os.mkdir('out')

model = get_model()
train(model, X_train, y_train_hot, X_test, y_test_hot)

# Testing MFCCs
Here we loaded an audio file and classified it on the computer. This sped up MFCC debugging, even though it's pretty ugly to look at.
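
The last step of classification, mapping the model's output back to a label, can be sketched in plain Python. The label names and scores below are made up for illustration; in the notebook the scores come from `model.predict()` and the labels from `get_labels()`.

```python
def classify(scores, labels):
    # The index of the highest score selects the predicted label
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index], scores[best_index]

labels = ["cough", "door", "keyboard", "phone", "speech", "typing"]  # hypothetical
scores = [0.02, 0.10, 0.05, 0.01, 0.80, 0.02]                        # made-up model output

print(classify(scores, labels))
```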

In [None]:
wave1, sr = librosa.load("cough.wav", mono=True, sr=SAMPLE_RATE,duration=1.0)
wave1
print(wave1.shape)
wave1[0:40]

In [None]:
observe=wave1*32768
observe[0:40]

In [None]:
import matplotlib.pyplot as plt  # not imported in the Imports cell above

plt.figure(1)
plt.plot(wave1)
plt.show()
#wave = wave[::3]

mfcc = librosa.feature.mfcc(y=wave1, sr=SAMPLE_RATE, S=None, n_mfcc=nmfcc, n_fft=n_fft, hop_length=hop_length, n_mels=nmels, fmax=22050)
mfcc.shape

In [None]:
mfcc[0]

In [None]:
mfcc[0].shape

In [None]:

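# Pad with zeros if the clip has fewer than nframe frames,
# so that the reshape in the next cell succeeds
if nframe > mfcc.shape[1]:
    pad_width = nframe - mfcc.shape[1]
    mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
# Else cut off the remaining frames
else:
    mfcc = mfcc[:, :nframe]
mfcc[19]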

In [None]:
#plt.figure(2)
#plt.plot(mfcc)
#plt.show        
sample_reshaped = mfcc.reshape(1, nmfcc, nframe, channel)
sample_reshaped.shape

In [None]:
# Predicts one sample
def predict(filepath, model):
    # wav2mfcc returns a list of overlapping windows; classify the first one
    sample = wav2mfcc(filepath)[0]
    print(sample.shape)
    sample_reshaped = sample.reshape(1, nmfcc, nframe, channel)
    print(sample_reshaped.shape)

    # Map the highest-scoring output index back to its label name
    return get_labels(RAW_DATA_DIR)[0][
            np.argmax(model.predict(sample_reshaped))
    ]

In [None]:
#sample_reshaped.shape
print(sample_reshaped.shape)
get_labels(RAW_DATA_DIR)[0][
            np.argmax(model.predict(sample_reshaped))
    ]

# Converting to TF Lite
**frozen_model_name:** The saved Keras model (.hdf5 checkpoint) to convert. These are saved automatically in the `out/` directory during training.<br>
**tf_lite_model_name:** The desired name of your TF Lite model. We use the following format: `dataset_feature_'i'inputsize_'l'labelsize`. So, a model trained on 6 office sounds with 1 second of amplitude at a sample rate of 44100 Hz would be named `OfficeSounds_Amplitude_i44100_l6.tflite`.
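
The naming format above is easy to generate programmatically. `tflite_name` is a hypothetical helper, not part of the notebook, shown only to pin down the convention.

```python
def tflite_name(dataset, feature, input_size, num_labels):
    # dataset_feature_'i'inputsize_'l'labelsize, plus the .tflite extension
    return "{}_{}_i{}_l{}.tflite".format(dataset, feature, input_size, num_labels)

print(tflite_name("OfficeSounds", "Amplitude", 44100, 6))
```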

In [None]:
frozen_model_name = 'out/weights-0.91.hdf5'
tf_lite_model_name = "amplitude-l6-acc91.tflite"

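# TF 1.x API; on TF 2.x, load the Keras model and use TFLiteConverter.from_keras_model instead
converter = tf.lite.TFLiteConverter.from_keras_model_file(frozen_model_name)
tfmodel = converter.convert()
with open(tf_lite_model_name, "wb") as f:
    f.write(tfmodel)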

# filename = './out/frozen_Audio_Recorder.pb'
# converter = tf.lite.TFLiteConverter.from_frozen_graph(filename,["voice_input"], ["label/Softmax"])
# tflite_model = converter.convert()
# open("Yunus_OfficeSounds.tflite", "wb").write(tflite_model)