# Trusted Zone (Audio)

This notebook contains the scripts that extract audio from the Formatted Zone, process and clean it, and store the result in the Trusted Zone. The Trusted Zone is a separate bucket that replicates the folder structure of the Formatted Zone. The difference is that its data has been processed and transformed to ensure clean audio.

This notebook focuses only on audio (the equivalent notebooks for the other data types can be found in the same folder). In particular, the scripts below are responsible for the following tasks:

- Extraction of audio files from the Formatted Zone.
- Treatment of the data to ensure data quality.

First, we will connect to MinIO and prepare the new bucket:

In [1]:
import boto3
import os
from dotenv import load_dotenv

load_dotenv()
access_key_id = os.getenv("ACCESS_KEY_ID")
secret_access_key = os.getenv("SECRET_ACCESS_KEY")
minio_url = "http://" + os.getenv("S3_API_ENDPOINT")


minio_client = boto3.client(
    "s3",
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    endpoint_url=minio_url
)

new_bucket = "trusted-zone"
try:
    minio_client.create_bucket(Bucket=new_bucket)
except (minio_client.exceptions.BucketAlreadyExists, minio_client.exceptions.BucketAlreadyOwnedByYou):
    print(f"Bucket '{new_bucket}' already exists")

Bucket 'trusted-zone' already exists


This script is responsible for extracting audio files from the Formatted Zone, processing and cleaning them to ensure their quality, and finally storing them in the Trusted Zone. The process begins by reading all audio files from the formatted-zone bucket and then applying several treatments to enhance their quality.

First, it normalizes the sampling rate to 44.1 kHz to ensure compatibility and consistent quality across all audio files. Then, it detects and removes long silences at the beginning and end of the files using a threshold of –50 decibels and a minimum duration of 500 milliseconds, which improves storage efficiency.
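The trimming rule can be sketched in isolation, without decoding any audio. The helper below is a hypothetical illustration, not part of the notebook: `trim_bounds` and its `edge_ms` parameter are names introduced here, and the 1,000 ms tolerance for "close to the start or end" is an assumption taken from the processing loop. It expects silent ranges as `[start_ms, end_ms]` pairs, the shape returned by `pydub.silence.detect_silence`.

```python
def trim_bounds(silence_ranges, total_ms, edge_ms=1000):
    """Return (start_ms, end_ms) of the audio to keep.

    silence_ranges: list of [start_ms, end_ms] pairs, sorted by start.
    edge_ms: hypothetical tolerance for "close to the start/end".
    """
    start, end = 0, total_ms
    if silence_ranges:
        first_start, first_end = silence_ranges[0]
        last_start, last_end = silence_ranges[-1]
        # Trailing silence: the last silent range ends near the end of the file.
        if last_end > total_ms - edge_ms:
            end = last_start
        # Leading silence: the first silent range starts near the beginning.
        if first_start < edge_ms:
            start = first_end
    # Guard against a fully silent file, where start could pass end.
    return min(start, end), end


# A 10 s clip with 0.8 s of silence at the start and 1.5 s at the end.
print(trim_bounds([[0, 800], [8500, 10000]], total_ms=10000))  # (800, 8500)
# Silence only in the middle of the clip: nothing is trimmed.
print(trim_bounds([[4000, 5000]], total_ms=10000))  # (0, 10000)
```

Silences in the middle of a file are left alone under this rule; only ranges that touch the edges of the clip affect the result.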

Next, it normalizes the volume of each file so that playback levels are consistent across files, and applies dynamic range compression with a threshold of –20 decibels and a 4:1 ratio (5 ms attack, 50 ms release) to balance dynamic levels and enhance clarity and presence.
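To make the 4:1 ratio concrete, here is a minimal sketch of the textbook static compressor curve. It ignores attack and release and is not claimed to match pydub's internal implementation; `compressed_level_db` is a hypothetical helper used only for the arithmetic.

```python
def compressed_level_db(input_db, threshold_db=-20.0, ratio=4.0):
    """Idealized static compressor curve (no attack or release)."""
    if input_db <= threshold_db:
        return input_db  # below the threshold the level is unchanged
    # Above the threshold, every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (input_db - threshold_db) / ratio


for level in (-30.0, -20.0, -8.0, 0.0):
    print(f"{level:6.1f} dB in -> {compressed_level_db(level):6.1f} dB out")
```

Under this idealized curve, a peak at –8 dB sits 12 dB above the threshold and comes out 3 dB above it, at –17 dB, while anything below –20 dB passes through untouched.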

The process also includes quality filters: a high-pass filter at 80 Hz to remove low-frequency noise and a low-pass filter at 16 kHz to eliminate high-frequency noise. Finally, it normalizes the volume a second time and increases the gain by 2 decibels to improve the presence of the audio and optimize it for playback.
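As a rough check on what a 2 dB boost means for the samples, the sketch below converts decibels to a linear amplitude factor. `db_to_amplitude_ratio` is a hypothetical helper, and the –0.1 dBFS starting peak is an assumption (it corresponds to a file whose loudest peak sits just under full scale).

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


gain = db_to_amplitude_ratio(2.0)
print(f"+2 dB multiplies sample amplitudes by about {gain:.3f}")

# Hypothetical scenario: a peak that sits at -0.1 dBFS before the boost.
peak_before_db = -0.1
peak_after_db = peak_before_db + 2.0
print(f"peak after boost: {peak_after_db:+.1f} dBFS, "
      f"exceeds full scale: {peak_after_db > 0}")
```

If a file's peaks are already close to full scale, a boost of this size pushes them past 0 dBFS, so the loudest samples may clip. Applying the gain before the last normalization pass, or normalizing with more headroom, are two possible ways to avoid that.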

The final result is a set of processed MP3 files exported at 192 kbps, with leading and trailing silences trimmed and low- and high-frequency noise attenuated, stored in the Trusted Zone under the original folder structure.
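Preserving the folder structure comes down to how the destination object key is derived from the source key. The sketch below is one possible mapping, under the assumption that both buckets use the same `audio/` prefix; `destination_key` and the example keys are hypothetical.

```python
def destination_key(source_key, prefix="audio/"):
    """Map a source object key to the key used in the destination bucket.

    Returns None for keys outside the prefix and for "folder" placeholders
    (keys ending in "/"), which carry no audio data.
    """
    if not source_key.startswith(prefix) or source_key.endswith("/"):
        return None
    # Reuse the full key so that nested folders survive the copy.
    return source_key


print(destination_key("audio/podcasts/ep01.mp3"))  # audio/podcasts/ep01.mp3
print(destination_key("audio/podcasts/"))          # None
```

Keeping only the last path component of the key would instead place every file directly under `audio/` and could overwrite files that share a name in different subfolders.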

In [2]:
from pydub import AudioSegment
from pydub.effects import normalize, compress_dynamic_range
from pydub.silence import detect_silence
import io
from tqdm import tqdm

bucket_origen = "formatted-zone"
bucket_desti = "trusted-zone"
prefix_origen = "audio/"
freq_final = 44100

paginator = minio_client.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket_origen, Prefix=prefix_origen):
    for obj in tqdm(page.get("Contents", []), desc="Processant àudios"):
        key = obj["Key"]
        filename = key.split("/")[-1]

        # Read the original audio
        response = minio_client.get_object(Bucket=bucket_origen, Key=key)
        audio_data = response["Body"].read()

        # Open with pydub; skip objects that cannot be decoded as MP3
        try:
            audio = AudioSegment.from_file(io.BytesIO(audio_data), format="mp3")
        except Exception as e:
            print(f"Error with {filename}: {e}")
            continue

        # APPLY CLEANING AND ENHANCEMENT TREATMENTS

        # 1. Normalize the sampling rate
        audio = audio.set_frame_rate(freq_final)

        # 2. Trim long silences at the beginning and end
        silence_threshold = -50  # dBFS
        silence_duration = 500   # ms

        # Detect silent ranges as [start_ms, end_ms] pairs
        silence_ranges = detect_silence(audio, min_silence_len=silence_duration, silence_thresh=silence_threshold)

        if silence_ranges:
            # Remove trailing silence if the last silent range ends within 1 s of the end
            if silence_ranges[-1][1] > len(audio) - 1000:
                audio = audio[:silence_ranges[-1][0]]

            # Remove leading silence if the first silent range starts within 1 s of the start
            if silence_ranges[0][0] < 1000:
                audio = audio[silence_ranges[0][1]:]

        # 3. Normalize the volume (peak normalization to even out levels)
        audio = normalize(audio)

        # 4. Dynamic range compression to balance levels
        audio = compress_dynamic_range(audio, threshold=-20.0, ratio=4.0, attack=5.0, release=50.0)

        # 5. Low-frequency noise filter
        # Apply a gentle high-pass filter to remove low-frequency noise
        audio = audio.high_pass_filter(80)  # Hz

        # 6. High-frequency noise filter
        # Apply a low-pass filter to remove high-frequency noise
        audio = audio.low_pass_filter(16000)  # Hz

        # 7. Final volume normalization
        audio = normalize(audio)

        # 8. Final gain adjustment (optional: +2 dB for more presence)
        # Note: applied right after normalize(), so the loudest peaks may clip
        audio = audio + 2

        # Save as MP3 and upload to the trusted-zone bucket
        buffer = io.BytesIO()
        audio.export(buffer, format="mp3", bitrate="192k")
        buffer.seek(0)

        new_key = key  # reuse the full source key so subfolders under audio/ are preserved
        minio_client.put_object(
            Bucket=bucket_desti,
            Key=new_key,
            Body=buffer
        )

Processant àudios: 100%|██████████| 100/100 [01:44<00:00,  1.05s/it]
