Basic template for using off-the-shelf (not fine-tuned) Whisper to detect single curse words in a music track. Demucs first splits the audio into vocals and instrumental stems, then the vocals track is muted during the curse words.

Needed packages:
- Whisper (audio-to-text, incl. pytorch, torchaudio)
- Demucs (vocals stem separation)
- pandas
- os (file path management)
- ffmpeg (for mp3)
- soundfile (for wav)
- pydub (for editing audio files)
- mutagen (for preserving metadata)

Ideas for improvement:
- Use a fine-tuned HuBERT model instead of Whisper. Train on the DALI dataset (or similar) to produce better audio-to-text from the original files.
- Train a separate model to process the transcribed lyrics and flag words/phrases as explicit. Can likely start from a pre-trained BERT, e.g., one trained on the Jigsaw Toxic Comment data, but this will need a lot of work.

In [9]:
import whisper
import torch
import pandas as pd
import os

# Check for cuda/cpu
device = ('cuda' if torch.cuda.is_available() else 'cpu')

#### Load the model. I've tried the following:

# large-v3-turbo (seems to perform the best?)
# large-v3
# medium.en (english only)
# large (needs 10GB of VRAM and takes FOREVER)
whisper_type = "medium.en"

print(f'Loading Whisper {whisper_type} on {device}...')
model = whisper.load_model(whisper_type, device=device)

print('Done')


Loading Whisper medium.en on cuda...
Done


Get the file location and audio path, then split the track into vocals and instruments. Demucs does a bunch of things automatically; perhaps we can investigate this further.

In [2]:
import demucs.separate
import os

# Audio file path
song = 'acumen'
audio_file = f"music/{song}.mp3"

# Format file path
notebook_dir = os.getcwd()
full_audio_path = os.path.join(notebook_dir, audio_file)

In [12]:
# Split the vocals with demucs
demucs.separate.main(["--two-stems", "vocals", "-n", "mdx_extra", full_audio_path])

# Comments:
# Add the "--mp3" flag for output in mp3 format. But .wav is lossless and will probably (?) sound better;
# listening with headphones, the audio processed with --mp3 sounded pretty weird.
# mdx_extra is just one of the models included in demucs. Try other models.

Selected model is a bag of 4 models. You will see that many progress bars per track.
Separated tracks will be stored in C:\Users\dacla\Documents\auto-censoring-local\separated\mdx_extra
Separating track c:\Users\dacla\Documents\auto-censoring-local\music\acumen.mp3


100%|███████████████████████████████████████████████████████████████████████| 297.0/297.0 [00:02<00:00, 103.91seconds/s]
100%|████████████████████████████████████████████████████████████████████████| 297.0/297.0 [00:03<00:00, 97.21seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 297.0/297.0 [00:02<00:00, 106.44seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 297.0/297.0 [00:02<00:00, 112.80seconds/s]


Process the audio with Whisper. 

(I'm getting Triton kernel issues when I run this. I don't know why, this is apparently a Windows problem)

In [13]:
# Paths for vocals and no_vocals stems
vocals = f"separated/mdx_extra/{song}/vocals.wav"
no_vocals = f"separated/mdx_extra/{song}/no_vocals.wav"

vocals_path = os.path.join(notebook_dir, vocals)
no_vocals_path = os.path.join(notebook_dir, no_vocals)

### Apply the transcription with Whisper
# word_timestamps=True for timestamp info

# run Whisper on vocals track only
print(f'Transcribing audio at {vocals_path}...')
result = model.transcribe(vocals_path, word_timestamps=True)

Transcribing audio at c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/acumen/vocals.wav...


Here we consider the different words to look out for. I think this is the part that will need the most NLP work to figure out what to edit. Obvious bad words are easy, but context-dependent cases will need a separate model to interpret.

There's also the issue of knowing the probabilities of the outputs. I'm not sure how to access the 2nd or 3rd most likely word that Whisper considers for a particular segment. If the 2nd most likely word is a curse word and its probability is close to the 1st, it is probably worth editing.

Another problem is that some curse words are only offensive in pairs: e.g., "god damn" is not allowed but "damn" generally is.
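
A minimal sketch of how two-word phrases could be flagged, assuming the transcript is available as a list of (word, start, end) tuples. The helper name `find_phrases` and the timings are made up for illustration and are not part of the notebook.

```python
# Hypothetical helper: flag multi-word phrases in a word-level transcript.
def find_phrases(words, phrases):
    """Return (phrase, start, end) for each occurrence of a phrase in `words`.

    words   : list of (word, start_seconds, end_seconds), already lower-cased
    phrases : iterable of space-separated phrases, e.g. ["god damn"]
    """
    targets = [tuple(p.split()) for p in phrases]
    tokens = [w for w, _, _ in words]
    hits = []
    for i in range(len(tokens)):
        for target in targets:
            n = len(target)
            if tuple(tokens[i:i + n]) == target:
                # Span runs from the first word's start to the last word's end
                hits.append((" ".join(target), words[i][1], words[i + n - 1][2]))
    return hits


# Made-up timings, for illustration only
example = [("well", 1.0, 1.2), ("god", 1.3, 1.5), ("damn", 1.5, 1.9), ("damn", 4.0, 4.4)]
hits = find_phrases(example, ["god damn"])
print(hits)
```

Only the paired occurrence is returned; the standalone "damn" at 4.0 s is left alone.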

In [15]:
import re

# Whisper can sometimes add bits of punctuation, we don't care about those
def remove_punctuation(s):
    s = re.sub(r'[^a-zA-Z0-9\s]', '', s)
    return s.lower()

# Collect one row per transcribed word
# (named `rows` to avoid shadowing the built-in all())
rows = []
for segment in result["segments"]:
    for word_info in segment['words']:
        word = word_info['word'].strip()
        word = remove_punctuation(word)
        
        start_time = float(word_info['start'])
        end_time = float(word_info['end'])
        prob = word_info['probability']
        
        rows.append([word, start_time, end_time, prob])

# Create Dataframe
columns = ['word', 'start', 'end', 'prob']
df_all = pd.DataFrame(rows, columns=columns)

In [None]:
# Edit for things to look for
curses = {'fuck', 'shit', 'bitch', 'nigga', 'cock', 'faggot', 'cunt', 'pussy', 'dick'} #To name a few...?
pattern = '|'.join(curses)

# I noticed it can create duplicate entries for some reason, delete them then save the log to a .csv
df_all = df_all.drop_duplicates()

# Add a column which is boolean 1 for "is curse" and 0 for "is not curse"
df_all['explicit'] = df_all['word'].str.contains(pattern, case=False, na=False, regex=True).astype(int)

# Save the dataframe (create the logs folder first if it doesn't exist)
os.makedirs('logs', exist_ok=True)
df_all.to_csv(f'logs/{song}-{whisper_type}-all.csv')
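
One thing to watch with the substring pattern above is that it also flags harmless words that merely contain a listed term. The sketch below compares it with a stricter match that only allows a short list of suffixes. The suffix list is an assumption for illustration, and it would miss compound words, so it is a trade-off rather than a fix.

```python
import re

terms = ["shit", "cock", "dick"]  # a subset of the set above
joined = "|".join(re.escape(t) for t in terms)

# Substring match: equivalent to str.contains with the joined pattern
substring = re.compile(joined)

# Stricter match: the whole word must be a term plus an optional suffix
# (hypothetical suffix list, not exhaustive)
strict = re.compile(rf"^(?:{joined})(?:s|es|ed|ing|er|ers)?$")

words = ["shits", "cocktail", "peacock", "dickens", "dick"]
sub_hits = [w for w in words if substring.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print("substring:", sub_hits)
print("strict:   ", strict_hits)
```

The substring pattern flags all five words, while the stricter pattern flags only "shits" and "dick".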

In [24]:
# Select only those words which are curse words
df = df_all[df_all['explicit']==1]
print(df)


        word   start     end      prob  explicit
179  bastard  128.22  128.78  0.865051         1
183  bastard  129.30  129.30  0.798574         1
187  bastard  134.14  134.14  0.726056         1
192  bastard  140.20  140.26  0.928996         1
198  bastard  144.94  145.72  0.991119         1
204  bastard  150.88  151.04  0.935638         1
221  bastard  166.86  168.32  0.000246         1
225  bastard  170.34  171.18  0.631440         1
229  bastard  173.30  173.64  0.772456         1
233  bastard  176.00  176.00  0.805456         1
237  bastard  178.50  180.02  0.818085         1
241  bastard  181.06  181.94  0.000038         1
262  bastard  215.90  216.84  0.807133         1
267  bastard  221.94  222.58  0.997818         1
272  bastard  227.18  227.98  0.804872         1
278  bastard  232.88  233.24  0.995458         1
289  fucking  266.94  267.22  0.777497         1
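
Two of the rows above have a probability close to zero. One option, sketched here, is to separate low-confidence hits for manual review instead of muting them automatically. The threshold is a hypothetical value, not a tuned one, and the rows are copied from the table.

```python
REVIEW_THRESHOLD = 0.5  # hypothetical cut-off, not a tuned value

# (word, start, end, prob) rows copied from the table above
flagged = [
    ("bastard", 128.22, 128.78, 0.865051),
    ("bastard", 166.86, 168.32, 0.000246),
    ("bastard", 181.06, 181.94, 0.000038),
    ("fucking", 266.94, 267.22, 0.777497),
]

# Split the flagged words by Whisper's word-level probability
confident = [row for row in flagged if row[3] >= REVIEW_THRESHOLD]
needs_review = [row for row in flagged if row[3] < REVIEW_THRESHOLD]

print(f"{len(confident)} confident, {len(needs_review)} to review")
```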


In [22]:
## Inspect the full lyrics
full_text = " ".join(df_all['word'].tolist())
print(full_text)

therefore the keep the failures from me its like im a sass its for the better half the weaker just watching upon my shores and ive got a need to always kick em down me jack reaction got a mouth on me im sure to make enemies our best friends why am i stricken with a pair of these ribs like a voice in an all round trap i hear snap down with me i hear snap down with me got a knack for not letting all my loved ones down maybe i keep smokes coming back how can i expect to ever be loved when all i got to do is just to break his heart we have a thing to censor the pride inside taking shots for nothing better than art always have to have that last word no matter what is it in my blood to have to take you down me jack reaction got a mouth on me i hear snap down with me i hear snap down with me as i am just a bastard im just a bastard im just a bastard cause im just a bastard cause i am just a bastard cause i am just a bastard to find the way to my own to be a stupid slushy hag im just a bastard

These functions do the actual editing. Using the curses identified in df, the first mutes the *vocals* track only at the specified times; the second recombines the vocals and instrumental tracks.

In [19]:
# Pydub does the audio effects
from pydub import AudioSegment

## Applies silencing to input_audio_path at a given list of (start_ms, end_ms) times
def silence_audio_segment(input_audio_path, output_audio_path, times):
    
    print(f'Applying silencing edits to vocals: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Audio before the window to mute
        before_segment = audio[:start_ms]

        # -60dB to the audio effectively mutes it
        target_segment = audio[start_ms:end_ms] - 60

        after_segment = audio[end_ms:]

        # Concatenate: this can be made faster, but it's not a priority
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting edited vocals to {output_audio_path}')             
    audio.export(output_audio_path, format='wav') 
    return

# Combines two audio tracks via their paths (vocals and instruments for example)
def combine_audio(path1, path2, outpath):
    audio1 = AudioSegment.from_file(path1, format='wav')
    audio2 = AudioSegment.from_file(path2, format='wav')

    combined_audio = audio1.overlay(audio2)

    # format='mp3' for mp3 files   
    combined_audio.export(outpath, format="mp3") 
    
    return
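
For reference, the -60 dB gain change used in silence_audio_segment can be converted to a linear amplitude ratio with the standard decibel formula. The helper name below is made up for illustration.

```python
def db_to_amplitude_ratio(db):
    """Convert a gain change in decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)


ratio = db_to_amplitude_ratio(-60)
print(f"-60 dB scales the amplitude by a factor of about {ratio:.4f}")
```

So the muted words are attenuated to roughly one thousandth of their original amplitude rather than replaced with true digital silence.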

Mute the explicit content, then recombine the vocals and no_vocals tracks.

In [25]:
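## Convert start/end from seconds to integer milliseconds for pydub (maybe a bit inefficient)
times = []
for row in df.itertuples():
    # round() before int() avoids off-by-one errors from float truncation
    start = int(round(row.start * 1000))
    end = int(round(row.end * 1000))

    times.append((start, end))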

# Run the silencing script
silence_audio_segment(vocals_path, vocals_path, times)

## Output file name
output_file = f"music/{song}-{whisper_type}-edit.mp3"
output_path = os.path.join(notebook_dir, output_file)

print('Combining the audio...')
combine_audio(vocals_path, no_vocals_path, output_path)

print(f'Exported to {output_path}')

Applying silencing edits to vocals: c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/acumen/vocals.wav
Outputting edited vocals to c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/acumen/vocals.wav
Combining the audio...
Exported to c:\Users\dacla\Documents\auto-censoring-local\music/acumen-medium.en-edit.mp3
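
Some of the flagged words in the table have identical start and end times (e.g., 129.30 to 129.30), which gives an empty window and mutes nothing. A minimal sketch of one way around this is to pad each window and merge any that overlap. The function name and the 100 ms padding are assumptions, not tuned values.

```python
def pad_and_merge(times_ms, pad_ms=100, total_ms=None):
    """Widen each (start, end) window by pad_ms on both sides and merge overlaps."""
    padded = []
    for start, end in sorted(times_ms):
        start = max(0, start - pad_ms)
        end = end + pad_ms
        if total_ms is not None:
            # Keep the window inside the track length, if it is known
            end = min(total_ms, end)
        padded.append((start, end))

    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# First three flagged windows from the table, in milliseconds
windows = pad_and_merge([(128220, 128780), (129300, 129300), (134140, 134140)], pad_ms=100)
print(windows)
```

The returned list could be passed to silence_audio_segment in place of `times`.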


Last step is to transfer the metadata from the original track to the edited track

In [21]:
from mutagen.easyid3 import EasyID3

# Transfer the metadata from the original to the edited track
def transfer_metadata(original_audio_path, edited_audio_path):
    
    audio_orig = EasyID3(original_audio_path)
    audio_edit = EasyID3(edited_audio_path)

    metadata = dict()

    ## Add more if wanted
    metadata['title'] = audio_orig.get('title', [None])[0]
    metadata['artist'] = audio_orig.get('artist', [None])[0]
    metadata['album'] = audio_orig.get('album', [None])[0]
    metadata['date'] = audio_orig.get('date', [None])[0] # Often 'year' or full date
    metadata['tracknumber'] = audio_orig.get('tracknumber', [None])[0]
    
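    # Apply metadata to edited track, skipping tags missing from the original
    # (otherwise the string "None" would be written as the tag value)
    for key, value in metadata.items():
        if value is not None:
            audio_edit[key] = [str(value)]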

    # and save
    audio_edit.save()
    return
        
transfer_metadata(original_audio_path=full_audio_path,
                  edited_audio_path=output_path)