# Benchmarking Audioseal against the SHUSH attack on the RAVDESS dataset

In this notebook, we benchmark the Audioseal architecture against an audio attack, the SHUSH attack, on a dataset of audio files.  
In particular, we follow these steps:
- Load audio files from a dataset
- Watermark each audio file using Audioseal
- Apply perturbations/attacks to the watermarked audio
- Run the detector on the attacked audio and record Audioseal's confidence that each file is watermarked


For a better understanding of Audioseal and its functionalities, it is highly recommended to go through the [Getting started notebook](https://github.com/facebookresearch/audioseal/blob/main/examples/Getting_started.ipynb).

## Dataset

We use the [RAVDESS Emotional Speech audio](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio) dataset for this experiment.   
When added to a Kaggle notebook environment, all input datasets are stored in the read-only `/kaggle/input` path. If you are not using Kaggle, or have stored your files elsewhere, you can load nested audio files by modifying `PARENT_FILES_DIR` in the cell below.

In [1]:
import numpy as np 
import pandas as pd
import os

all_input_files = []
PARENT_FILES_DIR = '/kaggle/input'

for dirname, _, filenames in os.walk(PARENT_FILES_DIR):
    for filename in filenames:
        # Match on the extension so that names merely containing "wav" are not picked up
        if filename.lower().endswith(".wav"):
            all_input_files.append(os.path.join(dirname, filename))
            
print(f"Number of input files: {len(all_input_files)}")

Number of input files: 2880
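
The count of 2880 is twice the number of distinct recordings: the skip log further down lists the same file name under both `Actor_05/` and `audio_speech_actors_01-24/Actor_05/`, so this copy of the dataset stores each file twice. The recorded run keeps both copies. If you prefer to score each recording once, a minimal sketch is to keep the first path seen per file name; `deduplicate_by_basename` and `example_paths` are hypothetical names introduced here, not part of the notebook.

```python
import os


def deduplicate_by_basename(paths):
    """Keep the first path (in sorted order) for each distinct file name."""
    seen = set()
    unique = []
    for path in sorted(paths):
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            unique.append(path)
    return unique


# Paths taken from the run log; the first two point at the same recording
example_paths = [
    "/kaggle/input/ravdess-emotional-speech-audio/Actor_05/03-01-02-01-02-02-05.wav",
    "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_05/03-01-02-01-02-02-05.wav",
    "/kaggle/input/ravdess-emotional-speech-audio/Actor_01/03-01-02-01-01-02-01.wav",
]
unique_paths = deduplicate_by_basename(example_paths)
print(f"{len(example_paths)} paths, {len(unique_paths)} distinct recordings")
```

Applying this to `all_input_files` would halve the file count, so the statistics reported later would be computed over fewer rows than in the recorded run.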


### Installations and Imports 

In [2]:
import sys
!{sys.executable} -m pip install -q torchaudio soundfile matplotlib audioseal

import typing as tp
import julius
import torch
import torchaudio
import urllib

### Load Audioseal models

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [4]:
from audioseal import AudioSeal

model = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

### Helper functions to load audio data, watermark audio, and get prediction scores for audio

In [5]:
model = model.to(device)
detector = detector.to(device)

In [6]:
secret_message = torch.randint(0, 2, (1, 16), dtype=torch.int32)
secret_message = secret_message.to(device)
print(f"Secret message: {secret_message}")

# Function to load an audio file from its file path
def load_audio_file(
    file_path: str
) -> tp.Optional[tp.Tuple[torch.Tensor, int]]:
    try:
        wav, sample_rate = torchaudio.load(file_path)
        return wav, sample_rate
    except Exception as e:
        print(f"Error while loading audio: {e}")
        return None
    
# Function to generate a watermark for the audio and embed it into a new audio tensor
def generate_watermark_audio(
    tensor: torch.Tensor,
    sample_rate: int
) -> tp.Optional[torch.Tensor]:
    try:
        global model, device, secret_message
        audios = tensor.unsqueeze(0).to(device)
        watermarked_audio = model(audios, sample_rate=sample_rate, message=secret_message.to(device), alpha=1)
        return watermarked_audio

    
    except Exception as e:
        print(f"Error while watermarking audio: {e}")
        return None
    
# Function to get the confidence score that an audio tensor was watermarked by Audioseal
def detect_watermark_audio(
    tensor: torch.Tensor,
    sample_rate: int,
    message_threshold: float = 0.50
) -> tp.Optional[float]:
    try:
        global detector, device
        # In our analysis we are not concerned with the hidden/embedded message as of now
        result, _ = detector.detect_watermark(tensor, sample_rate=sample_rate, message_threshold=message_threshold)
        return float(result)
    except Exception as e:
        print(f"Error while detecting watermark: {e}")
        return None

Secret message: tensor([[1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]], device='cuda:0',
       dtype=torch.int32)
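
The run log further down shows ten files being skipped because they have two channels, while the generator expects one (`expected input[1, 2, 67807] to have 1 channels`). One possible remedy, not used in the recorded run, is to downmix to mono before watermarking. The sketch below assumes the `(channels, frames)` layout returned by `torchaudio.load`; `to_mono` is a hypothetical helper, and averaging the channels is only one of several reasonable downmix choices.

```python
import torch


def to_mono(wav: torch.Tensor) -> torch.Tensor:
    """Average all channels of a (channels, frames) tensor into one channel."""
    if wav.dim() != 2:
        raise ValueError(f"Expected a (channels, frames) tensor, got shape {tuple(wav.shape)}")
    if wav.size(0) == 1:
        # Already mono: return unchanged
        return wav
    return wav.mean(dim=0, keepdim=True)


# Toy stereo clip: left channel all zeros, right channel all ones
stereo = torch.stack([torch.zeros(8), torch.ones(8)])
mono = to_mono(stereo)
print(stereo.shape, "->", mono.shape)
```

Calling `to_mono(wav)` inside `load_audio_file` before returning would let the stereo files through. It would also change the number of scored files, so the statistics would no longer match the recorded output exactly.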


## Audio attacks

- The attack implementations below are copied from the [Audioseal repo](https://github.com/facebookresearch/audioseal/blob/main/examples/attacks.py).
- We add the SHUSH attack, which silences the leading part of the audio, and use it for benchmarking.

In [7]:
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import typing as tp

import julius
import torch


def generate_pink_noise(length: int) -> torch.Tensor:
    """
    Generate pink noise using Voss-McCartney algorithm with PyTorch.
    """
    num_rows = 16
    array = torch.randn(num_rows, length // num_rows + 1)
    reshaped_array = torch.cumsum(array, dim=1)
    reshaped_array = reshaped_array.reshape(-1)
    reshaped_array = reshaped_array[:length]
    # Normalize
    pink_noise = reshaped_array / torch.max(torch.abs(reshaped_array))
    return pink_noise


def audio_effect_return(
    tensor: torch.Tensor, mask: tp.Optional[torch.Tensor]
) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
    """Return the mask if it was in the input otherwise only the output tensor"""
    if mask is None:
        return tensor
    else:
        return tensor, mask


class AudioEffects:
    @staticmethod
    def speed(
        tensor: torch.Tensor,
        speed_range: tuple = (0.5, 1.5),
        sample_rate: int = 16000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """
        Function to change the speed of a batch of audio data.
        The output will have a different length !

        Parameters:
        audio_batch (torch.Tensor): The batch of audio data in torch tensor format.
        speed (float): The speed to change the audio to.

        Returns:
        torch.Tensor: The batch of audio data with the speed changed.
        """
        speed = torch.FloatTensor(1).uniform_(*speed_range)
        new_sr = int(sample_rate * 1 / speed)
        resampled_tensor = julius.resample_frac(tensor, sample_rate, new_sr)
        if mask is None:
            return resampled_tensor
        else:
            return resampled_tensor, torch.nn.functional.interpolate(
                mask, size=resampled_tensor.size(-1), mode="nearest-exact"
            )

    @staticmethod
    def updownresample(
        tensor: torch.Tensor,
        sample_rate: int = 16000,
        intermediate_freq: int = 32000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:

        orig_shape = tensor.shape
        # upsample
        tensor = julius.resample_frac(tensor, sample_rate, intermediate_freq)
        # downsample
        tensor = julius.resample_frac(tensor, intermediate_freq, sample_rate)

        assert tensor.shape == orig_shape
        return audio_effect_return(tensor=tensor, mask=mask)

    @staticmethod
    def echo(
        tensor: torch.Tensor,
        volume_range: tuple = (0.1, 0.5),
        duration_range: tuple = (0.1, 0.5),
        sample_rate: int = 16000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """
        Attenuating the audio volume by a factor of 0.4, delaying it by 100ms,
        and then overlaying it with the original.

        :param tensor: 3D Tensor representing the audio signal [bsz, channels, frames]
        :param echo_volume: volume of the echo signal
        :param sample_rate: Sample rate of the audio signal.
        :return: Audio signal with reverb.
        """

        # Create a simple impulse response
        # Duration of the impulse response in seconds
        duration = torch.FloatTensor(1).uniform_(*duration_range)
        volume = torch.FloatTensor(1).uniform_(*volume_range)

        n_samples = int(sample_rate * duration)
        impulse_response = torch.zeros(n_samples).type(tensor.type()).to(tensor.device)

        # Define a few reflections with decreasing amplitude
        impulse_response[0] = 1.0  # Direct sound

        impulse_response[
            int(sample_rate * duration) - 1
        ] = volume  # First reflection after 100ms

        # Add batch and channel dimensions to the impulse response
        impulse_response = impulse_response.unsqueeze(0).unsqueeze(0)

        # Convolve the audio signal with the impulse response
        reverbed_signal = julius.fft_conv1d(tensor, impulse_response)

        # Normalize to the original amplitude range for stability
        reverbed_signal = (
            reverbed_signal
            / torch.max(torch.abs(reverbed_signal))
            * torch.max(torch.abs(tensor))
        )

        # Ensure tensor size is not changed
        tmp = torch.zeros_like(tensor)
        tmp[..., : reverbed_signal.shape[-1]] = reverbed_signal
        reverbed_signal = tmp

        return audio_effect_return(tensor=reverbed_signal, mask=mask)

    @staticmethod
    def random_noise(
        waveform: torch.Tensor,
        noise_std: float = 0.001,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """Add Gaussian noise to the waveform."""
        noise = torch.randn_like(waveform) * noise_std
        noisy_waveform = waveform + noise
        return audio_effect_return(tensor=noisy_waveform, mask=mask)

    @staticmethod
    def pink_noise(
        waveform: torch.Tensor,
        noise_std: float = 0.01,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """Add pink background noise to the waveform."""
        noise = generate_pink_noise(waveform.shape[-1]) * noise_std
        noise = noise.to(waveform.device)
        # Assuming waveform is of shape (bsz, channels, length)
        noisy_waveform = waveform + noise.unsqueeze(0).unsqueeze(0).to(waveform.device)
        return audio_effect_return(tensor=noisy_waveform, mask=mask)

    @staticmethod
    def lowpass_filter(
        waveform: torch.Tensor,
        cutoff_freq: float = 5000,
        sample_rate: int = 16000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:

        return audio_effect_return(
            tensor=julius.lowpass_filter(waveform, cutoff=cutoff_freq / sample_rate),
            mask=mask,
        )

    @staticmethod
    def highpass_filter(
        waveform: torch.Tensor,
        cutoff_freq: float = 500,
        sample_rate: int = 16000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:

        return audio_effect_return(
            tensor=julius.highpass_filter(waveform, cutoff=cutoff_freq / sample_rate),
            mask=mask,
        )

    @staticmethod
    def bandpass_filter(
        waveform: torch.Tensor,
        cutoff_freq_low: float = 300,
        cutoff_freq_high: float = 8000,
        sample_rate: int = 16000,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """
        Apply a bandpass filter to the waveform by cascading
        a high-pass filter followed by a low-pass filter.

        Parameters:
        - waveform (torch.Tensor): Input audio waveform.
        - low_cutoff (float): Lower cutoff frequency.
        - high_cutoff (float): Higher cutoff frequency.
        - sample_rate (int): The sample rate of the waveform.

        Returns:
        - torch.Tensor: Filtered audio waveform.
        """

        return audio_effect_return(
            tensor=julius.bandpass_filter(
                waveform,
                cutoff_low=cutoff_freq_low / sample_rate,
                cutoff_high=cutoff_freq_high / sample_rate,
            ),
            mask=mask,
        )

    @staticmethod
    def smooth(
        tensor: torch.Tensor,
        window_size_range: tuple = (2, 10),
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """
        Smooths the input tensor (audio signal) using a moving average filter with the given window size.

        Parameters:
        - tensor (torch.Tensor): Input audio tensor. Assumes tensor shape is (batch_size, channels, time).
        - window_size (int): Size of the moving average window.

        Returns:
        - torch.Tensor: Smoothed audio tensor.
        """

        window_size = int(torch.FloatTensor(1).uniform_(*window_size_range))
        # Create a uniform smoothing kernel
        kernel = torch.ones(1, 1, window_size).type(tensor.type()) / window_size
        kernel = kernel.to(tensor.device)

        smoothed = julius.fft_conv1d(tensor, kernel)
        # Ensure tensor size is not changed
        tmp = torch.zeros_like(tensor)
        tmp[..., : smoothed.shape[-1]] = smoothed
        smoothed = tmp

        return audio_effect_return(tensor=smoothed, mask=mask)

    @staticmethod
    def boost_audio(
        tensor: torch.Tensor,
        amount: float = 20,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        return audio_effect_return(tensor=tensor * (1 + amount / 100), mask=mask)

    @staticmethod
    def duck_audio(
        tensor: torch.Tensor,
        amount: float = 20,
        mask: tp.Optional[torch.Tensor] = None,
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        return audio_effect_return(tensor=tensor * (1 - amount / 100), mask=mask)

    @staticmethod
    def identity(
        tensor: torch.Tensor, mask: tp.Optional[torch.Tensor] = None
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        return audio_effect_return(tensor=tensor, mask=mask)
    
    @staticmethod
    def shush(
        tensor: torch.Tensor,
        fraction: float = 0.001,
        mask: tp.Optional[torch.Tensor] = None
    ) -> tp.Union[tp.Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
        """
        Sets a leading fraction of the samples of the input tensor (audio signal) to 0.

        Parameters:
        - tensor (torch.Tensor): Input audio tensor. Assumes tensor shape is (batch_size, channels, time).
        - fraction (float): Fraction of samples, counted from the start of the tensor, to be set to 0 (default: 0.001, i.e., 0.1%)

        Returns:
        - torch.Tensor: "Shushed" audio tensor.
        """
        time = tensor.size(-1)
        shush_tensor = tensor.detach().clone()
        
        # Set the first `fraction*time` indices of the waveform to 0.0
        shush_tensor[:, :, :int(fraction*time)] = 0.0
                
        return audio_effect_return(tensor=shush_tensor, mask=mask)

### Experimental setup
- `fraction` values: 0.1%, 1%, 10%, 30%
- Nomenclature: `n` (0.1%), `s` (1%), `m` (10%), `l` (30%)

In this notebook, we set the above parameters for the SHUSH attack and note the average confidence scores of Audioseal in predicting the presence of watermarks for these attacked audio files.
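
To make the four settings concrete, the sketch below computes how many leading samples `shush` silences for each `fraction`, using the same `int(fraction * time)` rule as the function above. The clip length of 67807 samples is one that appears in the run log; `silenced_samples` and `settings` are illustrative names.

```python
def silenced_samples(time: int, fraction: float) -> int:
    """Number of leading samples set to zero, mirroring int(fraction * time) in shush."""
    return int(fraction * time)


clip_length = 67807  # a clip length taken from the run log
settings = {"n": 0.001, "s": 0.01, "m": 0.1, "l": 0.3}

counts = {name: silenced_samples(clip_length, frac) for name, frac in settings.items()}
for name, count in counts.items():
    print(f"{name}: first {count} of {clip_length} samples set to 0")
```

Because of the truncation in `int(...)`, very short clips combined with the smallest fraction can end up with no samples silenced at all.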

In [8]:
import random
random.seed(42)
torch.backends.cudnn.benchmark = True
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x791cf49c2890>

In [9]:
from tqdm import tqdm

all_scores_n = []
all_scores_s = []
all_scores_m = []
all_scores_l = []
all_saved_files = []

for input_file in tqdm(all_input_files):
    try:
        # Load audio
        audio, sample_rate = load_audio_file(input_file)

        # Generate watermarked audio
        watermarked_audio = generate_watermark_audio(audio, sample_rate)

        # Perform SHUSH attacks
        shush_attack_audio_n = AudioEffects.shush(watermarked_audio, fraction=0.001)
        shush_attack_audio_s = AudioEffects.shush(watermarked_audio, fraction=0.01)
        shush_attack_audio_m = AudioEffects.shush(watermarked_audio, fraction=0.1)
        shush_attack_audio_l = AudioEffects.shush(watermarked_audio, fraction=0.3)

        # Compute scores
        shush_score_n = detect_watermark_audio(shush_attack_audio_n, sample_rate)
        shush_score_s = detect_watermark_audio(shush_attack_audio_s, sample_rate)
        shush_score_m = detect_watermark_audio(shush_attack_audio_m, sample_rate)
        shush_score_l = detect_watermark_audio(shush_attack_audio_l, sample_rate)

        # Store scores
        all_scores_n.append(float(shush_score_n))
        all_scores_s.append(float(shush_score_s))
        all_scores_m.append(float(shush_score_m))
        all_scores_l.append(float(shush_score_l))
        all_saved_files.append(input_file)
    except Exception as e:
        print(f"Skipping file {input_file} due to {e}")
        pass

  5%|▌         | 148/2880 [01:38<09:22,  4.86it/s]  

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 67807] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/Actor_05/03-01-02-01-02-02-05.wav due to 'NoneType' object has no attribute 'shape'


 12%|█▏        | 335/2880 [02:27<04:57,  8.56it/s]  

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 57663] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/Actor_01/03-01-02-01-01-02-01.wav due to 'NoneType' object has no attribute 'shape'


 12%|█▏        | 339/2880 [02:27<04:13, 10.02it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 52324] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/Actor_01/03-01-08-01-02-02-01.wav due to 'NoneType' object has no attribute 'shape'


 15%|█▍        | 425/2880 [02:45<03:49, 10.68it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 69942] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/Actor_20/03-01-06-01-01-02-20.wav due to 'NoneType' object has no attribute 'shape'


 15%|█▍        | 431/2880 [02:45<03:42, 11.01it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 55528] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/Actor_20/03-01-03-01-02-01-20.wav due to 'NoneType' object has no attribute 'shape'


 45%|████▍     | 1289/2880 [04:43<02:16, 11.62it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 67807] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_05/03-01-02-01-02-02-05.wav due to 'NoneType' object has no attribute 'shape'


 51%|█████▏    | 1476/2880 [05:02<02:07, 11.02it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 57663] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_01/03-01-02-01-01-02-01.wav due to 'NoneType' object has no attribute 'shape'


 51%|█████▏    | 1478/2880 [05:02<01:55, 12.10it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 52324] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_01/03-01-08-01-02-02-01.wav due to 'NoneType' object has no attribute 'shape'


 54%|█████▍    | 1564/2880 [05:10<01:57, 11.20it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 69942] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_20/03-01-06-01-01-02-20.wav due to 'NoneType' object has no attribute 'shape'


 55%|█████▍    | 1570/2880 [05:11<01:52, 11.61it/s]

Error while watermarking audio: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 55528] to have 1 channels, but got 2 channels instead
Skipping file /kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/Actor_20/03-01-03-01-02-01-20.wav due to 'NoneType' object has no attribute 'shape'


100%|██████████| 2880/2880 [07:29<00:00,  6.40it/s]


## Store results and calculate metrics

In [10]:
df = pd.DataFrame({
    "input_file" : all_saved_files,
    "watermark_confidence_n" : all_scores_n,
    "watermark_confidence_s" : all_scores_s,
    "watermark_confidence_m" : all_scores_m,
    "watermark_confidence_l" : all_scores_l,
})

In [11]:
df.describe()

| | watermark_confidence_n | watermark_confidence_s | watermark_confidence_m | watermark_confidence_l |
|---|---|---|---|---|
| count | 2870.0 | 2870.0 | 2870.0 | 2870.0 |
| mean | 0.998849 | 0.990138 | 0.900209 | 0.699678 |
| std | 0.000776 | 0.000738 | 0.000763 | 0.000516 |
| min | 0.976302 | 0.967376 | 0.876146 | 0.694676 |
| 25% | 0.998877 | 0.990022 | 0.900083 | 0.699631 |
| 50% | 0.998922 | 0.990202 | 0.90026 | 0.699777 |
| 75% | 0.998963 | 0.990334 | 0.900352 | 0.699923 |
| max | 0.999177 | 0.99055 | 0.900958 | 0.700464 |


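We note that Audioseal recalls the watermarks well under the SHUSH attack: even in the extreme condition of masking the first 30% of the audio, the average confidence is $0.699678$. For all four settings, the mean confidence stays within about $0.001$ of $1 - \text{fraction}$, which suggests that the score drops roughly in proportion to the silenced span while the untouched remainder is still detected as watermarked.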
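
As a quick arithmetic check, the sketch below compares each mean reported by `df.describe()` with `1 - fraction`, the share of samples that the corresponding setting leaves untouched. The values are copied from the table above; `reported_means` and `max_gap` are illustrative names.

```python
# Mean confidence per SHUSH fraction, copied from the df.describe() output
reported_means = {0.001: 0.998849, 0.01: 0.990138, 0.1: 0.900209, 0.3: 0.699678}

for fraction, mean_score in reported_means.items():
    untouched = 1.0 - fraction
    print(f"fraction={fraction}: mean={mean_score:.6f}, "
          f"1 - fraction={untouched:.3f}, gap={mean_score - untouched:+.6f}")

# Largest absolute deviation between the mean score and the untouched share
max_gap = max(abs(mean - (1.0 - frac)) for frac, mean in reported_means.items())
print(f"largest gap: {max_gap:.6f}")
```

The small gaps are consistent with a score that reflects the share of the audio still flagged as watermarked. This is an interpretation of the numbers, not something the notebook verifies directly.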