Basic template for using pretrained Whisper (no fine-tuning) to detect individual curse words in a music track. Demucs splits the audio into stems first, then the vocals track is muted during the curse words.

Probably everything should be done with .wav files to preserve audio quality. A lot of temp files also get created (the separated tracks) that could be deleted at the end.

Needed packages:
- Whisper (incl pytorch, torchaudio)
- Demucs (vocals stem separation)
- ffmpeg (for mp3)
- soundfile (for wav)
- pydub (for editing audio files)
- mutagen (for preserving metadata)

In [25]:
import whisper
import torch
import pandas as pd
import os

# Check for cuda/cpu
device = ('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using {device}')

# Load the model. I've tried the following:
# large-v3-turbo, large-v3, medium.en (english only), large (needs 10GB of VRAM and takes FOREVER)
model = whisper.load_model("large-v3-turbo", device=device)

Using cuda


Get the file location and audio path, then split the track into vocals and instruments. Demucs does a bunch of things automatically; perhaps we can investigate further.

In [17]:
import demucs.separate
import os

# I put a few tracks in here
song = 'big'
audio_file = f"music/{song}.mp3"

# Format file path
notebook_dir = os.getcwd()
full_audio_path = os.path.join(notebook_dir, audio_file)

In [18]:
# Split the vocals with demucs
demucs.separate.main(["--two-stems", "vocals", "-n", "mdx_extra", full_audio_path])

# Comments:
# Add the "--mp3" flag for output in mp3 format. But .wav is lossless and will probably (?) sound better
# listening with headphones the audio processed with --mp3 sounded pretty weird
# mdx_extra is just one of the models included in demucs. try other models

Selected model is a bag of 4 models. You will see that many progress bars per track.
Separated tracks will be stored in C:\Users\dacla\Documents\auto-censoring-local\separated\mdx_extra
Separating track c:\Users\dacla\Documents\auto-censoring-local\music\big.mp3


100%|████████████████████████████████████████████████████████████████████████| 330.0/330.0 [00:03<00:00, 99.80seconds/s]
100%|████████████████████████████████████████████████████████████████████████| 330.0/330.0 [00:03<00:00, 93.22seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 330.0/330.0 [00:03<00:00, 102.40seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 330.0/330.0 [00:03<00:00, 103.61seconds/s]


Process the audio with Whisper. 

(I'm getting Triton kernel issues when I run this. I don't know why; this is apparently a Windows problem.)

In [26]:
# Paths for vocals and no_vocals stems
vocals = f"separated/mdx_extra/{song}/vocals.wav"
no_vocals = f"separated/mdx_extra/{song}/no_vocals.wav"

vocals_path = os.path.join(notebook_dir, vocals)
no_vocals_path = os.path.join(notebook_dir, no_vocals)

### Apply the transcription with Whisper

# word_timestamps=True for timestamp info

# this one for running Whisper on the vocals track only
result = model.transcribe(vocals_path, word_timestamps=True)


## or run on the full track
#result = model.transcribe(full_audio_path, word_timestamps=True)



Here we consider the different words to look out for. I think this is the part that will need the most NLP work to figure out what to edit. Obvious bad words are easy, but there are context-dependent cases that will need a separate model to interpret.
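A minimal sketch of a slightly stricter word check than plain substring matching, which would also flag harmless words that merely contain a listed word (e.g. "cocktail" contains "cock"). The names `normalize`, `is_curse`, `EXACT` and `PREFIXES` and the word lists themselves are made up for illustration and are not part of Whisper.

```python
import string

# Hypothetical lists for illustration only; extend as needed.
EXACT = {"cock"}                                     # flagged only as a whole word
PREFIXES = ("fuck", "motherfuck", "shit", "bitch")   # flagged when the word starts with one of these


def normalize(token):
    """Lowercase a word token and strip surrounding whitespace and punctuation."""
    return token.strip().lower().strip(string.punctuation)


def is_curse(token, exact=EXACT, prefixes=PREFIXES):
    """Return True if the normalized token is a listed word or starts with a listed prefix."""
    word = normalize(token)
    return word in exact or any(word.startswith(p) for p in prefixes)
```

Prefix matching still catches variants such as "shitting" or "motherfucking", while whole-word matching avoids false positives for short entries. Which words belong in which list is a judgment call.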

There's also the issue of knowing the probabilities of the outputs. I'm not sure how to access the 2nd or 3rd most likely word that Whisper considers for a particular segment. If the 2nd most likely word is a curse word and its probability is close to the 1st, it is probably worth editing.
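This does not solve the alternative-word problem, but as a stopgap the per-word `probability` that Whisper already returns with `word_timestamps=True` can be used to flag low-confidence words for a manual listen. A rough sketch; the function name `needs_review` and the 0.5 threshold are arbitrary assumptions, not anything Whisper defines.

```python
def needs_review(words, threshold=0.5):
    """Return the word entries whose probability is below the threshold.

    `words` is a list of dicts shaped like Whisper's per-word output:
    {'word': str, 'start': float, 'end': float, 'probability': float}.
    The threshold is a guess and would need tuning per model and genre.
    """
    return [w for w in words if w["probability"] < threshold]
```

The list for one song could be built with `[w for seg in result["segments"] for w in seg["words"]]`, and the flagged entries checked by ear before deciding whether to edit them.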

Another problem is that there are some curse words that are only offensive in pairs: e.g., "god damn" is not allowed but "damn" generally is. 
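One possible way to handle pairs is to slide a window over the word list and compare consecutive words against a list of phrases. A sketch under the assumption that the words are available as `(word, start, end)` tuples in transcript order; `find_phrases` and the phrase list are hypothetical names for illustration.

```python
import string


def _norm(token):
    """Lowercase a word token and strip surrounding whitespace and punctuation."""
    return token.strip().lower().strip(string.punctuation)


def find_phrases(words, phrases):
    """Find multi-word phrases in a list of (word, start, end) tuples.

    Returns (phrase_text, start, end) for each hit, where the time span runs
    from the start of the first word to the end of the last word.
    """
    tokens = [_norm(w) for w, _, _ in words]
    hits = []
    for phrase in phrases:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == tuple(phrase):
                hits.append((" ".join(phrase), words[i][1], words[i + n - 1][2]))
    return hits
```

A lone "damn" is left alone because only the full tuple `("god", "damn")` matches. If the transcript spells the pair as a single word, it would need an entry in the single-word list instead.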

In [27]:
# Edit for things to look for
curses = {'fuck', 'motherfucker', 'shit', 'bitch', 'nigga', 'cock'}  # To name a few...?
to_add = []
all_words = []  # named all_words so the built-in all() is not shadowed

# Collect words
for segment in result["segments"]:
    for word_info in segment['words']:
        word = word_info['word'].strip().lower()
        row = [word, float(word_info['start']), float(word_info['end']), word_info['probability']]
        all_words.append(row)
        # Substring match, so variants like 'shitting' or 'shit.' are caught too.
        # any() adds each word once, even if it matches several entries in curses.
        if any(curse in word for curse in curses):
            to_add.append(row)

# Create Dataframe
columns = ['word', 'start', 'end', 'prob']
df = pd.DataFrame(to_add, columns=columns)
df_all = pd.DataFrame(all_words, columns=columns)

Inspect the words identified by Whisper and save a log.

In [28]:
# Inspect dataframe
df = df.drop_duplicates()
df_all = df_all.drop_duplicates()

df

Unnamed: 0,word,start,end,prob
0,shit.,59.48,59.88,0.003777
1,shitting,235.68,236.0,0.590071
2,shit,241.22,241.42,0.381918


In [None]:
# Save to csv if desired (create the logs folder first so to_csv does not fail)
os.makedirs('logs', exist_ok=True)
df.to_csv(f'logs/{song}.csv')

In [9]:
## Print full transcript
full_text = result["text"].strip()
print(full_text)

Thank you. Thank you. Don't be cross this sick I want. I've seen the boss bling on and on. Fake concerns is what's the matter man. And you think I ought to shake your motherfucking hand. Well, I know how much you care. Don't be cross this sick I want. I've seen the boss bling on and on. Come here by me. I want you here. Nightmares come me. It's so fucking clear. I want you here. Don't be cross this sick I want. I've seen the boss bling on and on. Come here by me. I want you here. Nightmares come me. It's so fucking clear. Nightmares come me. It's so fucking clear. Nightmares come me. I want you here. Nightmares come me. Nightmares come me. I want you here.


These functions do the actual editing. I have two procedures here: one mutes the flagged words on the *vocals* track (ideally), and the other reverses the flagged segments of the whole track.

In [22]:
# Pydub does the audio effects
from pydub import AudioSegment

## Applies silencing to input_audio_path at a given list of (start_ms, end_ms) times
def silence_audio_segment(input_audio_path, output_audio_path, times):
    
    print(f'Applying silencing edits to vocals: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Split the track around the segment to mute
        before_segment = audio[:start_ms]

        # -60dB to the audio effectively mutes it
        target_segment = audio[start_ms:end_ms] - 60

        after_segment = audio[end_ms:]

        # Concatenate
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting edited vocals to {output_audio_path}')
    audio.export(output_audio_path, format="wav") 
    return

# Combines two audio tracks via their paths (vocals and instruments for example)
def combine_audio(path1, path2, outpath):
    audio1 = AudioSegment.from_file(path1, format='wav')
    audio2 = AudioSegment.from_file(path2, format='wav')

    combined_audio = audio1.overlay(audio2)

    # format='mp3' for mp3 files
    combined_audio.export(outpath, format="wav") 
    
    return

Mute the explicit content, then recombine the vocals and no_vocals tracks.

In [23]:
## Maybe a bit inefficient
times = []
for row in df.itertuples():
    #word = row[1]
    start = int(row[2]*1000)
    end = int(row[3]*1000)
    times.append((start, end))

# Run the silencing script (note: this overwrites the vocals stem in place)
silence_audio_segment(vocals_path, vocals_path, times)

## Output file name
output_file = f"music/{song}-edit-silenced.wav"
output_path = os.path.join(notebook_dir, output_file)

print('Combining the audio...')
combine_audio(vocals_path, no_vocals_path, output_path)
print(f'Exported to {output_path}')

Applying silencing edits to vocals: c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/big/vocals.wav
Outputting edited vocals to c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/big/vocals.wav
Combining the audio...
Exported to c:\Users\dacla\Documents\auto-censoring-local\music/big-edit-silenced.wav
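Word timestamps can be tight, so a small margin around each word may help cover the whole word, and overlapping windows are easier to handle once merged. A sketch of that idea in plain Python; the helper name `pad_and_merge` and the 50 ms default are assumptions to tune by ear, not values taken from Whisper or pydub.

```python
def pad_and_merge(times_ms, pad_ms=50, track_len_ms=None):
    """Widen each (start_ms, end_ms) window by pad_ms and merge overlapping windows.

    Windows are clamped to [0, track_len_ms] when the track length is given.
    """
    merged = []
    for start, end in sorted(times_ms):
        start = max(0, start - pad_ms)
        end = end + pad_ms
        if track_len_ms is not None:
            end = min(end, track_len_ms)
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The result could be passed in place of `times`, e.g. `silence_audio_segment(vocals_path, vocals_path, pad_and_merge(times))`.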


Method 2: reverse the flagged segments in the full track. It's less pleasing to the ear, honestly.



In [None]:
# Reverse track at specified times
def reverse_audio_segment(input_audio_path, output_audio_path, times):
    print(f'Applying reverse edits to: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Select times to reverse
        before_segment = audio[:start_ms]


        target_segment = audio[start_ms:end_ms]
        target_segment = target_segment.reverse()

        after_segment = audio[end_ms:]

        # Concatenate
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting audio to {output_audio_path}')
    audio.export(output_audio_path, format="wav") 
    return


In [None]:
# Path formatting
notebook_dir = os.getcwd()

# Note: this expects a .wav copy of the track in music/ (the Demucs cell above used the .mp3)
input_file = f"music/{song}.wav"
output_file = f"music/{song}-edit-reversed.wav"

input_path = os.path.join(notebook_dir, input_file)
output_path = os.path.join(notebook_dir, output_file)

## Maybe a bit inefficient
times = []
for row in df.itertuples():
    #word = row[1]
    start = int(row[2]*1000)
    end = int(row[3]*1000)
    times.append((start, end))

# Run the reversing script
reverse_audio_segment(input_path, output_path, times)

[(66400, 66800), (67540, 67900), (104340, 104620), (119340, 119620), (124360, 125040), (130060, 130240)]
Applying edits to: c:\Users\dacla\Documents\auto-censoring\separated\mdx_extra\5th-ward\vocals.mp3
Outputting audio to output.mp3
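For the temp files mentioned at the top, a possible cleanup step is to delete the stem folder for a song once the edited track has been exported. A sketch assuming the `separated/<model>/<song>/` layout shown in the Demucs output; `remove_stems` is a hypothetical helper, and it deletes files permanently, so it should only run after the final export has been checked.

```python
import shutil
from pathlib import Path


def remove_stems(separated_root, model_name, song):
    """Delete <separated_root>/<model_name>/<song>/ and everything in it.

    Returns True if the folder existed and was removed, False otherwise.
    """
    stem_dir = Path(separated_root) / model_name / song
    if not stem_dir.is_dir():
        return False
    shutil.rmtree(stem_dir)
    return True
```

In this notebook that would be something like `remove_stems(os.path.join(notebook_dir, "separated"), "mdx_extra", song)`.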
