In [19]:
import librosa
import numpy as np
import pandas as pd
from scipy.fftpack import dct

# PRE-EMPHASIS

Pre-emphasis is an initial stage in the Mel Frequency Cepstral Coefficients (MFCC) extraction process to improve the quality of the sound signal before extracting its features. Pre-emphasis is done by applying a high-pass filter to amplify the high frequency components of the audio signal[1].

$$y(n)=x(n)-\alpha \cdot x(n-1)$$

- $x(n)$ is the input signal at sample index $n$.
- $y(n)$ is the signal after pre-emphasis.
- $\alpha$ is the *pre-emphasis* coefficient, a value between 0 and 1. Commonly used values are around 0.95 to 0.97; the function below defaults to 0.97.
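
As a small illustration of the equation above, the sketch below applies the filter to two toy signals. The helper name `pre_emphasis_demo` and the test signals are assumptions made for this example only. A constant (low-frequency) signal is almost cancelled, while a rapidly alternating (high-frequency) signal is roughly doubled in amplitude.

```python
import numpy as np

def pre_emphasis_demo(x, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1); the first sample is kept unchanged."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

# A constant signal stands in for low-frequency content.
constant = np.ones(5)
# A sign-alternating signal stands in for high-frequency content.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

y_constant = pre_emphasis_demo(constant)        # approximately [1, 0.03, 0.03, 0.03, 0.03]
y_alternating = pre_emphasis_demo(alternating)  # approximately [1, -1.97, 1.97, -1.97, 1.97]

print(y_constant)
print(y_alternating)
```
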


### `func pre_emphasis__`
> **parameter:**
> * `signal`: input signal, np.ndarray [shape=(n,) or (…, n)]
> * `coefficient`: pre-emphasis coefficient, float 

> **output:**
> * `pre-emphasis signal`, np.ndarray [shape=(n,) or (…, n)]

In [2]:
def pre_emphasis__(signal, coefficient=0.97):
  return np.append(signal[0], signal[1:] - coefficient * signal[:-1])

# FRAME BLOCKING

In frame blocking, the speech signal is split into short segments called frames, with consecutive frames overlapping. The overlap reduces the risk of losing information at frame boundaries, where the signal would otherwise be cut abruptly. Framing continues until the entire signal has been covered. Working on short frames lets later stages treat the signal as approximately stationary within each frame, and it yields a consistent feature layout regardless of the signal's total duration, which makes feature extraction more reliable for tasks such as speech recognition or audio analysis.

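$$\text{frames} = \frac{I-N}{M} + 1$$

Description:
- $I$ is the total number of samples in the signal.
- $N$ is the *frame* size in samples.
- $M$ is the *frame* step in samples, that is, the distance between the starts of consecutive frames; consecutive frames overlap by $N-M$ samples.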
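
The frame count can be checked with a few lines of arithmetic. The sketch below reads $I$ as the signal length in samples, $N$ as the frame length in samples, and $M$ as the step between frame starts; that reading is an assumption chosen to match the rounding used in `framing__`, where a final partial frame is zero-padded. The helper name `count_frames` and the example lengths are hypothetical.

```python
import numpy as np

def count_frames(signal_length, frame_length, frame_step):
    """Frame count when a final partial frame is zero-padded to full length."""
    return int(np.ceil(abs(signal_length - frame_length) / frame_step)) + 1

sr = 22050                            # assumed sampling rate
frame_length = int(round(1.0 * sr))   # 1 second   -> 22050 samples
frame_step = int(round(0.5 * sr))     # 0.5 second -> 11025 samples

# A 3-second signal divides evenly: (66150 - 22050) / 11025 + 1 = 5 frames.
exact_frames = count_frames(3 * sr, frame_length, frame_step)

# A 70000-sample signal does not divide evenly, so the count is rounded up.
padded_frames = count_frames(70000, frame_length, frame_step)
padded_length = (padded_frames - 1) * frame_step + frame_length

print(exact_frames, padded_frames, padded_length)
```

In the second case the padded length exceeds the original 70000 samples; the difference is the number of zeros appended before framing.
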

### `func framing__`
> **parameter**
> * `signal`: pre-emphasized signal, np.ndarray [shape=(n,) or (…, n)]
> * `sr`: signal sampling rate, int or float
> * `frame_size`: size of frame in seconds (> 0), default 1
> * `frame_stride`: size of frame step in seconds (> 0), default 0.5

> **output**
> `framed signal`, 2D arrays

In [3]:
def framing__(signal, sr, frame_size=1, frame_stride=0.5):
  # Convert frame size and stride from seconds to samples.
  frame_length, frame_step = int(round(frame_size * sr)), int(round(frame_stride * sr))
  signal_length = len(signal)
  # Number of frames needed to cover the signal, counting a final partial frame.
  num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step) + 1)
  # Zero-pad the signal so that the last frame has full length.
  pad_signal_length = (num_frames - 1) * frame_step + frame_length
  z = np.zeros((pad_signal_length - signal_length))
  pad_signal = np.append(signal, z)
  # Build a (num_frames, frame_length) index matrix: row i holds the sample indices of frame i.
  indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
  framed_signal = pad_signal[indices.astype(np.int32, copy=False)]
  return framed_signal

# WINDOWING

After framing, each frame is multiplied by a window function. Cutting a signal into frames creates abrupt discontinuities at the frame edges, which distort the spectrum (spectral leakage) because the Fourier transform treats each frame as one period of a periodic signal. A window tapers the samples toward the edges of each frame, which reduces the influence of those edges.

Several window functions can be used in MFCC extraction, including Hann, Hamming, Bartlett, Blackman, Kaiser, and Gaussian. The Hamming window is defined by the equation

$$ w(n)=0.54-0.46 \cos \bigg (\frac {2\pi n} {N-1} \bigg) $$

Description:

- $n$ is the sample index within the *frame*, with $0 \le n \le N-1$,
- $N$ is the size of the *frame* in samples,
- $w(n)$ is the value of the *Hamming window* at index $n$.
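
To make the formula concrete, the sketch below evaluates it directly and compares the result with NumPy's built-in `np.hamming`. The helper name `hamming_window` and the window length of 8 are assumptions for illustration.

```python
import numpy as np

def hamming_window(N):
    """Evaluate w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

w = hamming_window(8)

# The window is small at both edges (about 0.08) and symmetric around its centre.
print(w)
print(np.allclose(w, np.hamming(8)))
```

Because the edge values are close to zero, samples near the frame boundaries contribute little to the spectrum computed in the next step.
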


### `func windowing__`
> **parameter**
> * `signal`: framed signal, 2D arrays
> * `sr`: signal sampling rate, int or float
> * `frame_size`: size of frame in seconds (> 0), default 1

> **output**
> `windowed signal`, 2D arrays


In [4]:
def windowing__(signal, sr, frame_size=1):
  windowed = signal * np.hamming(int(round(frame_size * sr)))
  return windowed

# FFT

The Fast Fourier Transform (FFT) transforms an audio signal from the time domain to the frequency domain so that its frequency content can be analyzed. The time-domain signal (amplitude versus time) does not directly reveal spectral information such as the dominant frequency, so the FFT is applied to obtain a spectral representation (amplitude versus frequency).

The FFT is an efficient algorithm for computing the discrete Fourier transform (DFT), which is defined by the equation:

$$ y[k]=\sum_{n=0}^{N-1}e^{-2\pi j\frac{kn}{N}}x[n] $$

Description:

- $y[k]$ is the representation of signal $x[n]$ in the frequency domain.
- $e^{-2\pi j\frac{kn}{N}}$ is the complex exponential factor (also known as Fourier basis), which is responsible for mapping the signal from time domain to frequency domain.
- $x[n]$ is the amplitude value of the signal at time index $n$.
- $N$ is the signal length or the number of samples in the discrete signal $x[n]$
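
The sum above can be evaluated literally for a short signal. The sketch below does that with a matrix of complex exponentials and checks the result against `np.fft.fft`; the helper name `naive_dft` and the 16-sample cosine are assumptions for illustration, and the direct sum is far slower than the FFT for realistic frame sizes.

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of y[k] = sum_n exp(-2*pi*j*k*n/N) * x[n]."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

N = 16
t = np.arange(N)
x = np.cos(2 * np.pi * 2 * t / N)   # exactly 2 cycles across the 16 samples

y = naive_dft(x)

# One-sided power spectrum, scaled by 1/N as in the notebook's fft__ function.
power = np.abs(y[: N // 2 + 1]) ** 2 / N
peak_bin = int(np.argmax(power))

print(np.allclose(y, np.fft.fft(x)))
print(peak_bin)
```

The peak lands in bin 2, matching the two cycles contained in the test signal.
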


### `func fft__`
> **parameter**
> * `signal`: windowed signal, 2D arrays
> * `NFFT`: Number of points along transformation axis in the input to use, int, default 512

> **output**
> `power spectrum of fft`, 2D arrays


In [5]:
def fft__(signal, NFFT=512):
  mag_frames = np.absolute(np.fft.rfft(signal, NFFT))
  return ((1.0 / NFFT) * ((mag_frames) ** 2))

# MEL FILTERBANK

The Mel filterbank maps the linear frequency spectrum (the result of the Fourier transform) onto the Mel frequency scale, which better matches human auditory perception: perceived pitch is not proportional to frequency in hertz, and resolution is finer at low frequencies than at high ones.
The conversion from frequency to the Mel scale is defined by the equation
$$ m=2595 \cdot \log_{10} \bigg ( 1+\frac {f} {700} \bigg ) $$

Description:
- $f$ is the linear frequency in *hertz.*
- $m$ is the frequency in *Mel scale.*
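
The conversion and its inverse can be sketched as two small functions. The names `hz_to_mel` and `mel_to_hz`, the sampling rate of 22050 Hz, and the choice of 40 filters are assumptions for this example. Points spaced evenly on the Mel scale become progressively wider apart in hertz, which is why the filters are narrow at low frequencies and wide at high frequencies.

```python
import numpy as np

def hz_to_mel(f):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

sr = 22050        # assumed sampling rate
n_filters = 40    # assumed number of filters

# n_filters triangular filters need n_filters + 2 edge points.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
hz_points = mel_to_hz(mel_points)

print(hz_points[:4])    # closely spaced near 0 Hz
print(hz_points[-4:])   # widely spaced near the Nyquist frequency
```
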




### `func melbank__`
> **parameter**
> * `signal`: power spectrum of fft, 2D arrays
> * `sr`: signal sampling rate, int or float
> * `NFFT`: Number of points along transformation axis in the input to use, int, default 512
> * `NFILT`: Number of total filter, int, default 40

> **output**
> `filtered bank signal`, 2D arrays


In [6]:
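def melbank__(signal, sr, NFFT=512, NFILT=40):
  # Filter edges are spaced evenly on the Mel scale between 0 Hz and the Nyquist frequency.
  low_freq_mel = 0
  high_freq_mel = (2595 * np.log10(1 + (sr / 2) / 700))
  mel_points = np.linspace(low_freq_mel, high_freq_mel, NFILT + 2)
  # Convert the Mel points back to hertz, then to FFT bin indices.
  hz_points = (700 * (10**(mel_points / 2595) - 1))
  bins = np.floor((NFFT + 1) * hz_points / sr)  # named bins to avoid shadowing the built-in bin()
  fbank = np.zeros((NFILT, int(np.floor(NFFT / 2 + 1))))
  # Each filter is a triangle rising from bins[m - 1] to bins[m] and falling to bins[m + 1].
  for m in range(1, NFILT + 1):
    f_m_minus = int(bins[m - 1])
    f_m = int(bins[m])
    f_m_plus = int(bins[m + 1])
    for k in range(f_m_minus, f_m):
      fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
    for k in range(f_m, f_m_plus):
      fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
  # Apply the filters to the power spectrum; replace zeros so the logarithm stays finite.
  filter_banks = np.dot(signal, fbank.T)
  filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)
  # Log-compress the filterbank energies.
  return 20 * np.log10(filter_banks)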

# DCT

The DCT is applied after the Mel filterbank stage, to the log energies computed for each filter. It serves two purposes: it reduces the dimensionality of the data, and it moves the information from the frequency domain into the cepstral domain. The resulting cepstral coefficients describe the spectral characteristics of the sound signal and are used as acoustic features for speech analysis and recognition.

There are several types of DCT; one of them is type 2 (DCT-II), defined by the equation:

$$y_{k}=2\sum_{n=0}^{N-1}x_{n}\cos \bigg( \frac {\pi k (2n+1)} {2N} \bigg)$$

Description:

- $y_k$ is the DCT Coefficient at the $k$th index.
- $N$ is the total number of samples.
- $x_n$ is the input data value (in the context of MFCC, the log energy of a Mel filterbank channel).
- $\cos$ is the cosine basis function that the input values are projected onto.
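
The DCT-II sum can be written out directly for a short input. In the sketch below, the helper name `dct2` and the 8-sample inputs are assumptions for illustration. The sum runs from $n=0$ and is unnormalized, whereas the notebook's `dct__` passes `norm='ortho'`, which rescales the coefficients. A flat input puts all of its energy into coefficient 0, which shows how the DCT compacts smooth spectra into a few low-order coefficients.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II: y_k = 2 * sum_{n=0}^{N-1} x_n * cos(pi*k*(2n+1) / (2N))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return 2.0 * np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ x

# A flat input: only coefficient 0 is non-zero (2 * 8 * 3 = 48).
flat = np.full(8, 3.0)
y_flat = dct2(flat)

# An input equal to the k = 1 cosine basis: only coefficient 1 is non-zero.
n = np.arange(8)
basis_1 = np.cos(np.pi * 1 * (2 * n + 1) / (2 * 8))
y_basis = dct2(basis_1)

print(np.round(y_flat, 6))
print(np.round(y_basis, 6))
```
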



### `func dct__`
> **parameter**
> * `signal`: filtered bank signal, 2D arrays
> * `coefficient`: number of coefficients to keep, int, default 13 (the 0th coefficient is skipped)

> **output**
> `transformed melbank`, 2D arrays


In [7]:
def dct__(signal, coefficient=13):
  return dct(signal, type=2, axis=1, norm='ortho')[:, 1 : (coefficient + 1)]

# APPLY MFCC TO ALL DATASET

In [8]:
train = []
test_normal_50, test_normal_100 = [], []
test_noise_50, test_noise_100 = [], []

In [9]:
def apply_schema(schema, csv_path, audio_path):
  csv = pd.read_csv(csv_path)
  for i, row in csv.iterrows():
    schema.append({
      'title': row['title'],
      'audio_path': f"{audio_path}/{row['title']}.mp3",
    })

In [10]:
apply_schema(train, csv_path='csv/train.csv', audio_path='audio/train')
apply_schema(test_normal_50, csv_path='csv/test_normal_50.csv', audio_path='audio/test/normal/50')
apply_schema(test_normal_100, csv_path='csv/test_normal_100.csv', audio_path='audio/test/normal/100')
# apply_schema(test_noise_50, csv_path='csv/test_noise_50.csv', audio_path='audio/test/noise/50')
# apply_schema(test_noise_100, csv_path='csv/test_noise_100.csv', audio_path='audio/test/noise/100')

In [11]:
pd.DataFrame(train).head(5)

| | title | audio_path |
|---|---|---|
| 0 | 4U | audio/train/4U.mp3 |
| 1 | 23 | audio/train/23.mp3 |
| 2 | Watch The World Burn | audio/train/Watch The World Burn.mp3 |
| 3 | Ark | audio/train/Ark.mp3 |
| 4 | Arrow | audio/train/Arrow.mp3 |


In [12]:
def apply_mfcc(
        schema,
        npy_path,
        em_coeff=0.97, 
        frame_size=1, 
        frame_stride=0.5,
        NFFT=512,
        NFILT=40,
        dct_coeff=13):
  
  print(f"\nPROCESSING {npy_path}...")
  for audio in schema:
    print(f"\textracting: {audio['title']} to {npy_path}...")
    y, sr = librosa.load(audio['audio_path'])
    pre_emphasis_signal = pre_emphasis__(y, coefficient=em_coeff)
    framed_signal = framing__(signal=pre_emphasis_signal, sr=sr, frame_size=frame_size, frame_stride=frame_stride)
    windowed_signal = windowing__(signal=framed_signal, sr=sr, frame_size=frame_size)
    fft_signal = fft__(signal=windowed_signal, NFFT=NFFT)
    melbank_signal = melbank__(signal=fft_signal, sr=sr, NFFT=NFFT, NFILT=NFILT)
    dct_signal = dct__(signal=melbank_signal, coefficient=dct_coeff)
    np.save(f"{npy_path}/{audio['title']}.npy", dct_signal)

In [13]:
apply_mfcc(train, npy_path="npy/train/05", frame_size=0.5, frame_stride=0.25)
apply_mfcc(train, npy_path="npy/train/10", frame_size=1, frame_stride=0.5)
apply_mfcc(train, npy_path="npy/train/15", frame_size=1.5, frame_stride=0.75)

apply_mfcc(test_normal_50, npy_path="npy/test/normal/50/05", frame_size=0.5, frame_stride=0.25)
apply_mfcc(test_normal_50, npy_path="npy/test/normal/50/10", frame_size=1, frame_stride=0.5)
apply_mfcc(test_normal_50, npy_path="npy/test/normal/50/15", frame_size=1.5, frame_stride=0.75)
# apply_mfcc(test_noise_50, npy_path="npy/test/noise/50/05", frame_size=0.5, frame_stride=0.25)
# apply_mfcc(test_noise_50, npy_path="npy/test/noise/50/10", frame_size=1, frame_stride=0.5)
# apply_mfcc(test_noise_50, npy_path="npy/test/noise/50/15", frame_size=1.5, frame_stride=0.75)
apply_mfcc(test_normal_100, npy_path="npy/test/normal/100/05", frame_size=0.5, frame_stride=0.25)
apply_mfcc(test_normal_100, npy_path="npy/test/normal/100/10", frame_size=1, frame_stride=0.5)
apply_mfcc(test_normal_100, npy_path="npy/test/normal/100/15", frame_size=1.5, frame_stride=0.75)
# apply_mfcc(test_noise_100, npy_path="npy/test/noise/100/05", frame_size=0.5, frame_stride=0.25)
# apply_mfcc(test_noise_100, npy_path="npy/test/noise/100/10", frame_size=1, frame_stride=0.5)
# apply_mfcc(test_noise_100, npy_path="npy/test/noise/100/15", frame_size=1.5, frame_stride=0.75)


PROCESSING npy/train/05...
	extracting: 4U to npy/train/05...
	extracting: 23 to npy/train/05...
	extracting: Watch The World Burn to npy/train/05...
	extracting: Ark to npy/train/05...
	extracting: Arrow to npy/train/05...
	extracting: Awakening to npy/train/05...
	extracting: Be Around to npy/train/05...
	extracting: Blank VIP to npy/train/05...
	extracting: Blank to npy/train/05...
	extracting: Bleed to npy/train/05...
	extracting: C U Again to npy/train/05...
	extracting: Castle to npy/train/05...
	extracting: Cetus to npy/train/05...
	extracting: Circles to npy/train/05...
	extracting: Clear My Head to npy/train/05...
	extracting: Close to npy/train/05...
	extracting: Coming Home to npy/train/05...
	extracting: Control to npy/train/05...
	extracting: Cradles to npy/train/05...
	extracting: Crazy to npy/train/05...
	extracting: Crest to npy/train/05...
	extracting: Cyberpunk to npy/train/05...
	extracting: Dancefloor to npy/train/05...
	extracting: Defeat The Night to npy/train/05