# Audio Analysis (FLAC File Analysis)

In this Colab notebook, we explore ways of performing audio analysis on FLAC files. We work with the files that correspond to a specific speaker identity.

I use a manual method (instead of a library such as SoX) so that we can examine each piece of information stored in the FLAC format in detail.

The format of the file is described here: https://xiph.org/flac/documentation_format_overview.html.

In [10]:
import numpy as np
import zipfile
import gc
import cv2
import math
import warnings
import random
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import albumentations as A
import sox
import torchvision
import torchvision.transforms.functional as TF
import torchvision.transforms as transforms
import torch.utils.data as data_utils
import torch
import torch.optim as optim
import torch.nn as nn
# import torchmetrics
import torch.nn.functional as F
import PIL
import json
import struct # Library to parse the data as binary file.

# import pytorch_lightning as pl
import imutils

from torch.utils.data import Dataset
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.python.client import device_lib
from zipfile import ZipFile
from IPython import display
from torchvision import models, transforms
from google.colab.patches import cv2_imshow
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.preprocessing import LabelEncoder
from torchvision.models.feature_extraction import create_feature_extractor
from PIL import Image
from collections import defaultdict
from google.colab import drive

In [2]:
! sudo apt install sox
! pip install sox

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa libsox-fmt-base
  libsox3 libwavpack1
Suggested packages:
  libsox-fmt-all
The following NEW packages will be installed:
  libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa libsox-fmt-base
  libsox3 libwavpack1 sox
0 upgraded, 7 newly installed, 0 to remove and 19 not upgraded.
Need to get 617 kB of archives.
After this operation, 1,764 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libopencore-amrnb0 amd64 0.1.5-1 [94.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libopencore-amrwb0 amd64 0.1.5-1 [49.1 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libsox3 amd64 14.4.2+git20190427-2+deb11u2ubuntu0.22.04.1 [240 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 

Load the dataset from LibriSpeech.

In [3]:
! wget https://www.openslr.org/resources/12/train-clean-100.tar.gz -O libspeech.tar.gz
! tar -xzf libspeech.tar.gz

--2023-11-12 12:54:53--  https://www.openslr.org/resources/12/train-clean-100.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://us.openslr.org/resources/12/train-clean-100.tar.gz [following]
--2023-11-12 12:54:54--  http://us.openslr.org/resources/12/train-clean-100.tar.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6387309499 (5.9G) [application/x-gzip]
Saving to: ‘libspeech.tar.gz’


2023-11-12 12:57:51 (34.5 MB/s) - ‘libspeech.tar.gz’ saved [6387309499/6387309499]



## Data Analysis (Single File)

From the documentation, the basic structure of a FLAC stream is:
* The four byte string "fLaC"
* The STREAMINFO metadata block (METADATA_BLOCK)
* Zero or more other metadata blocks (METADATA_BLOCK*)
* One or more audio frames (FRAME+)

Characteristics of the blocks defined in the FLAC file:

* All numbers used in a FLAC bitstream are integers
* There are no floating-point representations.
* All numbers are big-endian coded.
* All numbers are unsigned unless otherwise specified.
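
As a small illustration of these rules, the sketch below builds a few bytes by hand and decodes them as big-endian unsigned integers. The helper name `read_uint_be` and the sample bytes are assumptions made for this example; they are not part of any FLAC tooling.

```python
import struct

FLAC_MARKER = b"fLaC"  # the four-byte string that opens every FLAC stream


def read_uint_be(raw: bytes) -> int:
    """Decode raw bytes as one unsigned big-endian integer (hypothetical helper)."""
    return int.from_bytes(raw, byteorder="big", signed=False)


# Synthetic example data: the marker followed by a 16-bit value of 4096.
example = FLAC_MARKER + struct.pack(">H", 4096)

has_marker = example[:4] == FLAC_MARKER
value = read_uint_be(example[4:6])

print(has_marker)  # True
print(value)       # 4096

# The same two bytes read as little-endian give a different number,
# which is why the ">" prefix matters in every struct format string.
print(int.from_bytes(example[4:6], byteorder="little"))  # 16
```

Because `int.from_bytes` accepts any length, the same helper also covers the 24-bit fields that `struct` has no format code for.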



In [24]:
sample_path = "/content/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac"
sample_file = open(sample_path, mode = "rb")
sample_file

<_io.BufferedReader name='/content/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac'>

Read the first 4 bytes, which should be the string fLaC.

In [25]:
chunk_id = sample_file.read(4)
chunk_id

b'fLaC'

A METADATA_BLOCK consists of two parts: METADATA_BLOCK_HEADER and METADATA_BLOCK_DATA.

### METADATA_BLOCK_HEADER



The METADATA_BLOCK_HEADER has three fields: a last-metadata-block flag (1 bit), the block type (7 bits), and the length of the block data in bytes (24 bits). Total = 32 bits.

Now, let's read the STREAMINFO block, which is mandatory in the FLAC format. This block holds a lot of information, such as the sample rate and the number of channels.

How do we get each field out of the structure? We read all 32 bits of the METADATA_BLOCK_HEADER as one integer, then isolate each field with bit shifts and masks.

In [26]:
metadata_block_header_packed = sample_file.read(4)

# Unpack the 4 bytes as one big-endian unsigned 32-bit integer
metadata_block_header_unpacked = struct.unpack('>I', metadata_block_header_packed)[0]
print(metadata_block_header_unpacked)

34


In [27]:
metadata_block_header_unpacked_copied = metadata_block_header_unpacked

# Most significant bit: last-metadata-block flag
is_last_block = (metadata_block_header_unpacked_copied >> 31) & 0x1

# Next 7 bits: block type (0 = STREAMINFO)
bits2_to_8 = (metadata_block_header_unpacked_copied >> 24) & 0b1111111

# Lowest 24 bits: length of the block data in bytes
bits9_to_32 = metadata_block_header_unpacked_copied & 0xFFFFFF

print(f"Last metadata block flag: {is_last_block}")
print(f"Block type: {bits2_to_8}")
print(f"Length of metadata: {bits9_to_32}")

Last metadata block flag: 0
Block type: 0
Length of metadata: 34
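
The flag and length fields are enough to walk over every metadata block without decoding its contents. The sketch below shows one way to do that on a synthetic in-memory stream; the helper name `list_metadata_blocks` and the fake bytes are assumptions for illustration, while the block type codes follow the FLAC format documentation.

```python
import io
import struct

# Block type codes from the FLAC format documentation.
BLOCK_TYPES = {0: "STREAMINFO", 1: "PADDING", 2: "APPLICATION", 3: "SEEKTABLE",
               4: "VORBIS_COMMENT", 5: "CUESHEET", 6: "PICTURE"}


def list_metadata_blocks(stream):
    """Return (type_name, length) for each metadata block (hypothetical helper).

    `stream` must be positioned right after the "fLaC" marker.
    """
    blocks = []
    while True:
        header_bytes = stream.read(4)
        if len(header_bytes) < 4:
            raise ValueError("stream ended inside a metadata block header")
        header = struct.unpack(">I", header_bytes)[0]
        is_last = (header >> 31) & 0x1      # 1 bit
        block_type = (header >> 24) & 0x7F  # 7 bits
        length = header & 0xFFFFFF          # 24 bits, in bytes
        name = BLOCK_TYPES.get(block_type, f"UNKNOWN({block_type})")
        blocks.append((name, length))
        stream.seek(length, io.SEEK_CUR)    # skip the block data
        if is_last:
            return blocks


# Synthetic stream: a 34-byte STREAMINFO block, then a final 8-byte PADDING block.
fake = io.BytesIO(
    struct.pack(">I", (0 << 31) | (0 << 24) | 34) + bytes(34)
    + struct.pack(">I", (1 << 31) | (1 << 24) | 8) + bytes(8)
)
metadata_blocks = list_metadata_blocks(fake)
print(metadata_blocks)  # [('STREAMINFO', 34), ('PADDING', 8)]
```

On a real file, one would open it in binary mode, read the 4-byte marker first, and then pass the file handle to the helper.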


### METADATA_BLOCK_STREAMINFO

The block consists of this information:

* Minimum block size (in samples) used in the stream (16 bits)
* Maximum block size (in samples) used in the stream. (16 bits)
* Minimum frame size (in bytes) used in the stream. (24 bits)
* Maximum frame size (in bytes) used in the stream. (24 bits)
* Sample rate in Hz (20 bits)
* Number of channels - 1 (range: 1 - 8 channels) (3 bits)
* Bits per sample - 1 (range: 4 - 32) (5 bits)
* Total samples in stream. (36 bits)

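Total: 144 bits (18 bytes). The block ends with a 128-bit MD5 signature of the unencoded audio data, which the code below does not parse; together these make up the 34 bytes reported in the block header.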

To simplify reading the bytes, we can split the fields into three groups:

* Minimum block size (16 bits)
* Maximum block size and the two frame sizes (16 + 24 + 24 = 64 bits)
* The remaining fields (20 + 3 + 5 + 36 = 64 bits)

In [28]:
minblocksize_packed = sample_file.read(2)
maxblocksize_framesize_packed = sample_file.read(8)
samplerate_nochannels_bits_totalsample_packed = sample_file.read(8)

# Unpack the data using struct
minblocksize_unpacked = struct.unpack(
    ">H", minblocksize_packed
)[0]
maxblocksize_framesize_unpacked = struct.unpack(
    ">Q", maxblocksize_framesize_packed
)[0]
samplerate_nochannels_bits_totalsample_unpacked = struct.unpack(
    ">Q", samplerate_nochannels_bits_totalsample_packed
)[0]

In [29]:
# Work on copies of the unpacked integers
minblocksize_packed_unpacked_copied = minblocksize_unpacked
maxblocksize_framesize_unpacked_copied = maxblocksize_framesize_unpacked
samplerate_nochannels_bits_totalsample_unpacked_copied = samplerate_nochannels_bits_totalsample_unpacked

print(f"Min block size: {minblocksize_packed_unpacked_copied}")

# Top 16 bits of the first 64-bit group: maximum block size (in samples)
max_block_size = (maxblocksize_framesize_unpacked_copied >> 48) & 0xFFFF

# Next 24 bits: minimum frame size (in bytes)
min_frame_size = (maxblocksize_framesize_unpacked_copied >> 24) & 0xFFFFFF

# Lowest 24 bits: maximum frame size (in bytes)
max_frame_size = maxblocksize_framesize_unpacked_copied & 0xFFFFFF

print(f"Max block size: {max_block_size}")
print(f"Min frame size: {min_frame_size}")
print(f"Max frame size: {max_frame_size}")

# Top 20 bits of the second 64-bit group: sample rate in Hz
sample_rate = (samplerate_nochannels_bits_totalsample_unpacked_copied >> 44) & 0xFFFFF

# Next 3 bits: number of channels minus 1
no_channels = (samplerate_nochannels_bits_totalsample_unpacked_copied >> 41) & 0b111

# Next 5 bits: bits per sample minus 1
bits = (samplerate_nochannels_bits_totalsample_unpacked_copied >> 36) & 0b11111

# Lowest 36 bits: total samples in the stream
total_sample = samplerate_nochannels_bits_totalsample_unpacked_copied & 0xFFFFFFFFF

# Channels and bits per sample are stored minus 1, so add 1 back when printing
print(f"Sample rate: {sample_rate}")
print(f"No channels: {no_channels + 1}")
print(f"Bits: {bits + 1}")
print(f"Total sample: {total_sample}")

Min block size: 4096
Max block size: 4096
Min frame size: 107
Max frame size: 5961
Sample rate: 16000
No channels: 1
Bits: 16
Total sample: 225360
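
The same fields can also be decoded in one pass by a small function. The sketch below is one possible way to package the steps above; `parse_streaminfo` is a hypothetical helper, and the test block is rebuilt from the values printed above with an all-zero MD5 signature as a placeholder (the real signature of this file is not shown in the notebook).

```python
import struct


def parse_streaminfo(block: bytes) -> dict:
    """Decode a 34-byte STREAMINFO block (hypothetical helper)."""
    if len(block) != 34:
        raise ValueError("STREAMINFO must be exactly 34 bytes")
    # ">HQQ" = big-endian: one 16-bit and two 64-bit unsigned integers (18 bytes).
    min_block, sizes, rest = struct.unpack(">HQQ", block[:18])
    return {
        "min_block_size": min_block,
        "max_block_size": (sizes >> 48) & 0xFFFF,
        "min_frame_size": (sizes >> 24) & 0xFFFFFF,
        "max_frame_size": sizes & 0xFFFFFF,
        "sample_rate": (rest >> 44) & 0xFFFFF,
        "channels": ((rest >> 41) & 0b111) + 1,             # stored minus 1
        "bits_per_sample": ((rest >> 36) & 0b11111) + 1,    # stored minus 1
        "total_samples": rest & 0xFFFFFFFFF,
        "md5": block[18:].hex(),                            # last 16 bytes
    }


# Rebuild a block from the values printed above (MD5 is a zero placeholder).
sizes = (4096 << 48) | (107 << 24) | 5961
rest = (16000 << 44) | (0 << 41) | (15 << 36) | 225360
block = struct.pack(">HQQ", 4096, sizes, rest) + bytes(16)

info = parse_streaminfo(block)
duration = info["total_samples"] / info["sample_rate"]

print(info["channels"], info["bits_per_sample"])  # 1 16
print(f"{duration:.3f} seconds")                  # 14.085 seconds
```

Dividing the total number of samples by the sample rate gives the duration of the recording, which is a convenient value to compare against other tools.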


## Comparison with the SoX Library

To check our parsing logic, we can compare the results with SoX.

Option 1: Using the sox --i command.

In [14]:
! sox --i "/content/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac"


Input File     : '/content/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:14.09 = 225360 samples ~ 1056.38 CDDA sectors
File Size      : 255k
Bit Rate       : 145k
Sample Encoding: 16-bit FLAC
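
The duration line in this report can be reproduced from the two STREAMINFO values alone. The short check below is a sketch of that arithmetic; the variable names are chosen for this example, and it assumes the standard CD audio rate of 75 sectors per second.

```python
# Values taken from the STREAMINFO block and confirmed by the SoX report.
check_total_samples = 225360
check_sample_rate = 16000  # Hz

# Duration in seconds = samples / samples per second.
duration_seconds = check_total_samples / check_sample_rate

# CD audio (CDDA) is divided into 75 sectors per second.
cdda_sectors = check_total_samples * 75 / check_sample_rate

print(f"Duration: {duration_seconds} s")
print(f"CDDA sectors: {cdda_sectors}")
```

SoX rounds both figures to two decimals, so 14.085 s appears as 00:00:14.09 and 1056.375 sectors as 1056.38.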



Option 2: Using the SoX wrapper in Python.

In [13]:
sox_sample_path = sample_path

sox_sample_rate = sox.file_info.sample_rate(sox_sample_path)
sox_channels = sox.file_info.channels(sox_sample_path)
sox_extension_file = sox.file_info.file_extension(sox_sample_path)
sox_file_type = sox.file_info.file_type(sox_sample_path)

print("Analyzing file using SoX:")
print(f"Sample rate: {sox_sample_rate}")
print(f"No channels: {sox_channels}")
print(f"Extension file: {sox_extension_file}")
print(f"File type: {sox_file_type}")

Analyzing file using SoX:
Sample rate: 16000.0
No channels: 1
Extension file: flac
File type: flac
