# Audio Anonymization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abderrahmane-mhd/audio-anonymization/blob/main/Audio_Anonymization.ipynb)


In this notebook, we will explore the process of audio anonymization, which involves three key steps:

* Transcribing audio into text using a Speech-to-Text model.
* Applying Named Entity Recognition (NER) to identify sensitive information in the text.
* Replacing the time ranges of detected entities in the audio with a beep sound.

## 1. Speech-To-Text Model

In this phase, we transcribe the audio and obtain word-level timestamps.

In [1]:
!pip install git+https://github.com/linto-ai/whisper-timestamped

Collecting git+https://github.com/linto-ai/whisper-timestamped
  Cloning https://github.com/linto-ai/whisper-timestamped to /tmp/pip-req-build-bzqrrln8
  Running command git clone --filter=blob:none --quiet https://github.com/linto-ai/whisper-timestamped /tmp/pip-req-build-bzqrrln8
  Resolved https://github.com/linto-ai/whisper-timestamped to commit f69750eae23c586f828744b3a4d6c4785125d84f
  Preparing metadata (setup.py) ... done
Collecting dtw-python (from whisper-timestamped==1.15.8)
  Downloading dtw_python-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Collecting openai-whisper (from whisper-timestamped==1.15.8)
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build d

In [25]:
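import whisper_timestamped as whisper
import json

# Load the audio file as a waveform
audio = whisper.load_audio("input_audio.mp3")

# Load the Whisper "base" model on the GPU (use device="cpu" if no GPU is available)
model = whisper.load_model("base", device="cuda")

# Transcribe; the result includes word-level timestamps for each segment
transcript_result = whisper.transcribe(model, audio, language="en")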

100%|██████████| 1965/1965 [00:01<00:00, 1195.56frames/s]


In [26]:
# Full model output
print(json.dumps(transcript_result, indent=2, ensure_ascii=False))

{
  "text": " Hello, this is John Smith calling. My client number is 55,632, and I placed in order with your company, Fastrac Supplies. The order reference is FT7890. I called yesterday, and Stephane told me that it was scheduled for delivery today. But I still haven't received it. Could you help me with this, please?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.1,
      "end": 8.86,
      "text": " Hello, this is John Smith calling. My client number is 55,632, and I placed in order with your company, Fastrac Supplies.",
      "tokens": [
        50364,
        2425,
        11,
        341,
        307,
        2619,
        8538,
        5141,
        13,
        1222,
        6423,
        1230,
        307,
        12330,
        11,
        21,
        11440,
        11,
        293,
        286,
        7074,
        294,
        1668,
        365,
        428,
        2237,
        11,
        15968,
        12080,
        9391,
        24119,
       

In [27]:
# Text transcript
transcript_result['text']

" Hello, this is John Smith calling. My client number is 55,632, and I placed in order with your company, Fastrac Supplies. The order reference is FT7890. I called yesterday, and Stephane told me that it was scheduled for delivery today. But I still haven't received it. Could you help me with this, please?"

In [78]:
words_details = [word for segment in transcript_result['segments'] for word in segment['words']]

words_details

[{'text': 'Hello,', 'start': 0.1, 'end': 0.32, 'confidence': 0.955},
 {'text': 'this', 'start': 0.62, 'end': 0.86, 'confidence': 0.974},
 {'text': 'is', 'start': 0.86, 'end': 1.0, 'confidence': 0.999},
 {'text': 'John', 'start': 1.0, 'end': 1.3, 'confidence': 0.779},
 {'text': 'Smith', 'start': 1.3, 'end': 1.64, 'confidence': 0.982},
 {'text': 'calling.', 'start': 1.64, 'end': 2.06, 'confidence': 0.606},
 {'text': 'My', 'start': 2.26, 'end': 2.36, 'confidence': 0.712},
 {'text': 'client', 'start': 2.36, 'end': 2.72, 'confidence': 0.989},
 {'text': 'number', 'start': 2.72, 'end': 3.04, 'confidence': 0.976},
 {'text': 'is', 'start': 3.04, 'end': 3.22, 'confidence': 0.995},
 {'text': '55,632,', 'start': 3.22, 'end': 5.5, 'confidence': 0.895},
 {'text': 'and', 'start': 5.9, 'end': 6.04, 'confidence': 0.82},
 {'text': 'I', 'start': 6.04, 'end': 6.16, 'confidence': 0.983},
 {'text': 'placed', 'start': 6.16, 'end': 6.4, 'confidence': 0.91},
 {'text': 'in', 'start': 6.4, 'end': 6.62, 'confiden

## 2. Detecting PII Words

In [16]:
!pip install presidio-analyzer
!pip install presidio-anonymizer
!python -m spacy download en_core_web_lg

Collecting presidio-analyzer
  Downloading presidio_analyzer-2.2.355-py3-none-any.whl.metadata (2.9 kB)
Collecting phonenumbers<9.0.0,>=8.12 (from presidio-analyzer)
  Downloading phonenumbers-8.13.52-py2.py3-none-any.whl.metadata (10 kB)
Collecting tldextract (from presidio-analyzer)
  Downloading tldextract-5.1.3-py3-none-any.whl.metadata (11 kB)
Collecting requests-file>=1.4 (from tldextract->presidio-analyzer)
  Downloading requests_file-2.1.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading presidio_analyzer-2.2.355-py3-none-any.whl (109 kB)
Downloading phonenumbers-8.13.52-py2.py3-none-any.whl (2.6 MB)
Downloading tldextract-5.1.3-py3-none-any.whl (104 kB)

In [17]:
import spacy

nlp = spacy.load("en_core_web_lg")

In [18]:
from presidio_anonymizer import AnonymizerEngine
from presidio_analyzer import AnalyzerEngine, EntityRecognizer, RecognizerResult, Pattern, PatternRecognizer

from presidio_analyzer.nlp_engine import NlpArtifacts, NlpEngineProvider


In [19]:
# Use spaCy's large English model as Presidio's NLP engine
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["en"],
)



In [58]:
result = analyzer.analyze(text=transcript_result['text'], language='en')

In [59]:
result

[type: PERSON, start: 16, end: 26, score: 0.85,
 type: DATE_TIME, start: 163, end: 172, score: 0.85,
 type: PERSON, start: 178, end: 186, score: 0.85,
 type: DATE_TIME, start: 230, end: 235, score: 0.85,
 type: US_DRIVER_LICENSE, start: 146, end: 152, score: 0.3,
 type: IN_VEHICLE_REGISTRATION, start: 146, end: 152, score: 0.01]

In [75]:
result_dicts = [obj.to_dict() for obj in result]

In [76]:
for obj in result_dicts:
    obj['word'] = transcript_result['text'][obj['start']:obj['end']]

In [77]:
result_dicts

[{'entity_type': 'PERSON',
  'start': 16,
  'end': 26,
  'score': 0.85,
  'analysis_explanation': None,
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer',
   'recognizer_identifier': 'SpacyRecognizer_138684627599520'},
  'word': 'John Smith'},
 {'entity_type': 'DATE_TIME',
  'start': 163,
  'end': 172,
  'score': 0.85,
  'analysis_explanation': None,
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer',
   'recognizer_identifier': 'SpacyRecognizer_138684627599520'},
  'word': 'yesterday'},
 {'entity_type': 'PERSON',
  'start': 178,
  'end': 186,
  'score': 0.85,
  'analysis_explanation': None,
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer',
   'recognizer_identifier': 'SpacyRecognizer_138684627599520'},
  'word': 'Stephane'},
 {'entity_type': 'DATE_TIME',
  'start': 230,
  'end': 235,
  'score': 0.85,
  'analysis_explanation': None,
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer',
   'recognizer_identifier': 'SpacyRecognizer_138
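
The analyzer output above contains a very low-confidence hit (`IN_VEHICLE_REGISTRATION` at 0.01) and two entity types covering the same span (146 to 152). Below is a minimal sketch of how such detections could be filtered once they are in dict form. The helper `filter_detections` and its parameters `min_score` and `ignored_types` are hypothetical names introduced for illustration, and whether dates should be treated as sensitive depends on the use case.

```python
def filter_detections(detections, min_score=0.3, ignored_types=()):
    """Keep detections at or above min_score, one per (start, end) span."""
    best_by_span = {}
    for det in detections:
        if det["score"] < min_score or det["entity_type"] in ignored_types:
            continue
        span = (det["start"], det["end"])
        # When several entity types cover the same span, keep the highest score
        if span not in best_by_span or det["score"] > best_by_span[span]["score"]:
            best_by_span[span] = det
    # Return the survivors in transcript order
    return sorted(best_by_span.values(), key=lambda d: d["start"])


# Sample values copied from the analyzer output above
sample = [
    {"entity_type": "PERSON", "start": 16, "end": 26, "score": 0.85},
    {"entity_type": "DATE_TIME", "start": 163, "end": 172, "score": 0.85},
    {"entity_type": "US_DRIVER_LICENSE", "start": 146, "end": 152, "score": 0.3},
    {"entity_type": "IN_VEHICLE_REGISTRATION", "start": 146, "end": 152, "score": 0.01},
]
kept = filter_detections(sample, min_score=0.3, ignored_types=("DATE_TIME",))
print([d["entity_type"] for d in kept])
```

Presidio's `analyze` method also accepts `entities` and `score_threshold` arguments, which may be a simpler way to get a similar effect at detection time.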

## 3. Replacing PII Words with a Beep Sound

In [34]:
!pip install librosa Unidecode pydub

Collecting Unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, Unidecode
Successfully installed Unidecode-1.3.8 pydub-0.25.1


In [79]:
from unidecode import unidecode
from pydub import AudioSegment
import librosa
import numpy as np

# Soundfile is used to export the new anonymized audio
import soundfile as sf

In [80]:
def map_words(text_details, anonymized):
    """
    Map each detected PII entity to the transcribed words it covers.

    Returns the matching word dicts, each carrying its time range ('start', 'end').
    """
    # Work on a copy so the caller's list of words is not modified
    remaining = list(text_details)
    mapped_words = []
    not_found_words = []  # kept for debugging; not returned

    for anon_word in anonymized:
        for word in anon_word['word'].strip().split(' '):
            target = unidecode(word.lower())

            # Match words ignoring case and accents.
            # Short words must match exactly; longer ones may match as a substring,
            # which tolerates attached punctuation such as "today."
            if len(word) <= 3:
                matches = [detail for detail in remaining if unidecode(detail['text'].lower()) == target]
            else:
                matches = [detail for detail in remaining if target in unidecode(detail['text'].lower())]

            if matches:
                mapped_words.extend(matches)
                for match in matches:
                    remaining.remove(match)
            else:
                not_found_words.append(word)

    # Potential enhancement: sometimes audio has names spelled such as C-H-R-I-S-T-O-P-H-E
    # We can add custom regex detector in Presidio and custom logic to map words

    return mapped_words
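
The comment at the end of `map_words` mentions spelled-out names as a possible enhancement. As a rough sketch of what such a detector could look for, the snippet below uses only Python's `re` module. It assumes the transcript renders spelled names as single letters joined by hyphens; Whisper may render them differently, in which case the pattern would need adjusting. The names `SPELLED_NAME` and `find_spelled_names` are hypothetical.

```python
import re

# Assumption: spelled-out names appear as single letters joined by hyphens,
# e.g. "C-H-R-I-S". At least three letters are required to limit false positives.
SPELLED_NAME = re.compile(r"\b[A-Za-z](?:-[A-Za-z]){2,}\b")


def find_spelled_names(text):
    """Return (start, end, matched_text) for each spelled-out sequence."""
    return [(m.start(), m.end(), m.group()) for m in SPELLED_NAME.finditer(text)]


sample = "My name is C-H-R-I-S, and this is a well-known e-mail issue."
print(find_spelled_names(sample))
```

Ordinary hyphenated words such as "well-known" are not matched because each hyphen must separate single letters. A pattern like this could then be wrapped in Presidio's `Pattern` and `PatternRecognizer` classes, which are already imported above; that wiring is not shown here.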

In [81]:
def anonymize_audio(audio_file, time_ranges, output_file="anonymized_audio.mp3"):
    """
    Replace each time range of audio_file with a beep tone and write the result to output_file.

    time_ranges is a list of dicts with 'start' and 'end' keys, in seconds.
    """
    # librosa resamples to 22050 Hz by default
    y, sr = librosa.load(audio_file)

    for t_range in time_ranges:
        # Convert the time range (in seconds) to sample indices, clamped to the signal length
        start_sample = min(int(float(t_range['start']) * sr), len(y))
        end_sample = min(int(float(t_range['end']) * sr), len(y))
        if end_sample <= start_sample:
            continue

        # Generate a 440 Hz tone with exactly as many samples as the range it replaces
        beep = librosa.tone(440, sr=sr, length=end_sample - start_sample)

        # In this demo the volume is set to 0, so the PII is replaced by silence.
        # Use a non-zero factor (for example 0.5) to hear the beep.
        beep = beep * 0

        # Overwrite the range in place so the rest of the audio keeps its timing
        y[start_sample:end_sample] = beep

    # Save the audio
    sf.write(output_file, y, sr)

    return output_file
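
`anonymize_audio` works by turning each time range in seconds into positions in the sample array. The arithmetic can be sketched on its own as follows. The helper `to_sample_range` is a hypothetical name, and the 10-second signal length is an assumed value chosen only to show the clamping.

```python
def to_sample_range(start_s, end_s, sr, n_samples):
    """Convert a time range in seconds to (start, end) sample indices within the signal."""
    start = max(0, min(int(start_s * sr), n_samples))
    end = max(start, min(int(end_s * sr), n_samples))
    return start, end


sr = 22050            # librosa.load's default sampling rate
n_samples = 10 * sr   # a hypothetical 10-second signal

print(to_sample_range(1.0, 2.0, sr, n_samples))   # (22050, 44100)
print(to_sample_range(9.5, 12.0, sr, n_samples))  # (209475, 220500), end clamped to the signal
```

Clamping matters when a word timestamp runs slightly past the end of the recording, since an out-of-range index would otherwise produce a replacement segment of the wrong length.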

In [82]:
anonymized_words = map_words(words_details, result_dicts)

In [83]:
anonymized_words

[{'text': 'John', 'start': 1.0, 'end': 1.3, 'confidence': 0.779},
 {'text': 'Smith', 'start': 1.3, 'end': 1.64, 'confidence': 0.982},
 {'text': 'yesterday,', 'start': 12.36, 'end': 12.9, 'confidence': 0.993},
 {'text': 'Stephane', 'start': 13.42, 'end': 13.92, 'confidence': 0.561},
 {'text': 'today.', 'start': 15.6, 'end': 15.98, 'confidence': 0.951},
 {'text': 'FT7890.', 'start': 10.24, 'end': 11.86, 'confidence': 0.49}]
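
`map_words` matches by word text, so a name that occurs several times in the recording is selected everywhere, and longer words are matched as substrings. Since the analyzer results already carry character offsets into the transcript, an alternative is to align words by position instead. The sketch below assumes each word's text appears verbatim and in order in the transcript string; `words_in_span` is a hypothetical helper, not part of the notebook.

```python
def words_in_span(text, words, span_start, span_end):
    """Return the words whose characters overlap text[span_start:span_end]."""
    selected = []
    cursor = 0
    for word in words:
        # Locate this word in the transcript, searching from the previous word's end
        begin = text.find(word["text"], cursor)
        if begin == -1:
            continue  # word text not present verbatim; skip it
        finish = begin + len(word["text"])
        cursor = finish
        # Keep the word if its character range overlaps the entity span
        if begin < span_end and finish > span_start:
            selected.append(word)
    return selected


text = " Hello, this is John Smith calling."
words = [
    {"text": "Hello,", "start": 0.1, "end": 0.32},
    {"text": "this", "start": 0.62, "end": 0.86},
    {"text": "is", "start": 0.86, "end": 1.0},
    {"text": "John", "start": 1.0, "end": 1.3},
    {"text": "Smith", "start": 1.3, "end": 1.64},
    {"text": "calling.", "start": 1.64, "end": 2.06},
]
# The analyzer reported PERSON at characters 16 to 26 ("John Smith")
hits = words_in_span(text, words, 16, 26)
print([(w["text"], w["start"], w["end"]) for w in hits])
```

Searching from a moving cursor keeps "is" from being found inside "this", which a plain `text.find` from position 0 would do.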

In [84]:
anonymized_audio = anonymize_audio("input_audio.mp3", anonymized_words)
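
The list passed to `anonymize_audio` contains adjacent ranges (for example "John" ends at 1.3 s exactly where "Smith" starts) and is not in chronological order. One option is to sort and merge the ranges first, optionally padding them because word timestamps are approximate. The following is a sketch under those assumptions; `merge_ranges`, `padding`, and `max_gap` are hypothetical names.

```python
def merge_ranges(time_ranges, padding=0.0, max_gap=0.0):
    """Merge ranges (in seconds) that overlap or lie within max_gap of each other."""
    padded = sorted(
        (max(0.0, r["start"] - padding), r["end"] + padding) for r in time_ranges
    )
    merged = []
    for start, end in padded:
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous range instead of starting a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{"start": s, "end": e} for s, e in merged]


# Time ranges taken from the mapped words shown above
ranges = [
    {"start": 1.0, "end": 1.3},      # John
    {"start": 1.3, "end": 1.64},     # Smith
    {"start": 12.36, "end": 12.9},   # yesterday,
    {"start": 10.24, "end": 11.86},  # FT7890.
]
print(merge_ranges(ranges))
```

The returned dicts keep the `start` and `end` keys, so the merged list can be passed to `anonymize_audio` in place of the original one.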

In [73]:
anonymized_audio

'anonymized_audio.mp3'

In [74]:
from IPython.display import Audio
Audio(anonymized_audio)