Basic template for using off-the-shelf (not fine-tuned) Whisper to detect individual curse words in a music track. Demucs is used to split the audio into stems first, then the vocals track is muted during the curse words.

Everything should probably be done with .wav files to preserve audio quality. A lot of temp files also get created (the separated tracks) that could be deleted at the end.
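
For the temp files, a minimal cleanup sketch could look like the following. It assumes the Demucs output layout seen later in this notebook (`separated/<model>/<song>/`); the function name `remove_separated_stems` and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path

def remove_separated_stems(song, model_name="mdx_extra", root="separated"):
    """Delete the Demucs output folder for one song.

    Returns True if a folder was removed, False if there was nothing to remove.
    The root/model_name/song layout is an assumption based on this notebook.
    """
    stem_dir = Path(root) / model_name / song
    if stem_dir.is_dir():
        shutil.rmtree(stem_dir)
        return True
    return False
```

This would only be safe to call after the edited track has been exported, since the muted vocals stem lives in that folder.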

Needed packages:
- Whisper (audio-to-text, incl. pytorch, torchaudio)
- Demucs (vocals stem separation)
- pandas
- os (file path management; part of the standard library)
- ffmpeg (for mp3)
- soundfile (for wav)
- pydub (for editing audio files)
- mutagen (for preserving metadata)

In [38]:
import whisper
import torch
import pandas as pd
import os

# Check for cuda/cpu
device = ('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using {device}')

#### Load the model. I've tried the following:

# large-v3-turbo (seems to perform the best?)
# large-v3
# medium.en (english only)
# large (needs 10GB of VRAM and takes FOREVER)
whisper_type = "large-v3-turbo"
model = whisper.load_model(whisper_type, device=device)

Using cuda


Get the file location and audio path, then split the track into vocals and instruments. Demucs does a bunch of things automatically, which may be worth investigating further.

In [54]:
import demucs.separate
import os

# I put a few tracks in here
song = 'jpeg'
audio_file = f"music/{song}.mp3"

# Format file path
notebook_dir = os.getcwd()
full_audio_path = os.path.join(notebook_dir, audio_file)

In [55]:
# Split the vocals with demucs
demucs.separate.main(["--two-stems", "vocals", "-n", "mdx_extra", full_audio_path])

# Comments:
# Add the "--mp3" flag to output in mp3 format. But .wav is lossless and will probably (?) sound better;
# listening with headphones, the audio processed with --mp3 sounded pretty weird.
# mdx_extra is just one of the models included in demucs; other models are worth trying.

Selected model is a bag of 4 models. You will see that many progress bars per track.
Separated tracks will be stored in C:\Users\dacla\Documents\auto-censoring-local\separated\mdx_extra
Separating track c:\Users\dacla\Documents\auto-censoring-local\music\jpeg.mp3


100%|███████████████████████████████████████████████████████████████████████| 132.0/132.0 [00:01<00:00, 117.74seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 132.0/132.0 [00:01<00:00, 118.26seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 132.0/132.0 [00:01<00:00, 119.33seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 132.0/132.0 [00:01<00:00, 120.28seconds/s]


Process the audio with Whisper. 

(I'm getting Triton kernel issues when I run this. I don't know why; this is apparently a Windows problem.)

In [56]:
# Paths for vocals and no_vocals stems
vocals = f"separated/mdx_extra/{song}/vocals.wav"
no_vocals = f"separated/mdx_extra/{song}/no_vocals.wav"

vocals_path = os.path.join(notebook_dir, vocals)
no_vocals_path = os.path.join(notebook_dir, no_vocals)

### Apply the transcription with Whisper
# word_timestamps=True for timestamp info

# this one for running Whisper on the vocals track only
result = model.transcribe(vocals_path, word_timestamps=True)




Here we decide which words to look out for. I think this is the part that will need the most NLP work to figure out what to edit. Obvious bad words are easy, but some things are context-dependent and will need a separate model to interpret.
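
One easy-to-miss problem even for the "obvious" words is that plain substring matching also flags innocent words that happen to contain a listed word (for example "hello" contains "hell"). A rough sketch of a stricter matcher is below. The function name `build_word_regex` and the suffix list are my own assumptions, not something tested on real lyrics.

```python
import re

def build_word_regex(blocklist, suffixes=("s", "es", "ed", "er", "ing", "in", "y")):
    """Compile a regex matching a blocklisted word plus an optional common suffix.

    The match is anchored at the start and end of the word, so a listed word
    buried inside a longer, unrelated word is not flagged.
    """
    # Longest alternatives first so the regex prefers the most specific match
    stems = "|".join(re.escape(w) for w in sorted(blocklist, key=len, reverse=True))
    endings = "|".join(re.escape(s) for s in sorted(suffixes, key=len, reverse=True))
    return re.compile(rf"^(?:{stems})(?:{endings})?$", re.IGNORECASE)

# Mild placeholder words, just to show the behaviour
rx = build_word_regex({"damn", "hell"})
flagged = [w for w in ["damn", "damned", "hello", "shell"] if rx.match(w)]
```

The trade-off is that compound words starting with something else are no longer caught, so those would have to be added to the list explicitly.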

There's also the issue of knowing the probabilities of the outputs. I'm not sure how to access the 2nd or 3rd most likely word that Whisper considers for a particular segment. If the 2nd most likely word is a curse word and its probability is close to the 1st, it is probably worth editing.
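
This does not solve the problem of getting at the alternative words, but as a crude stand-in the per-word probability that Whisper does return can be used to flag uncertain words for a manual listen. A small sketch, assuming rows in the same `[word, start, end, prob]` layout collected in the next cell; the function name and the 0.5 threshold are arbitrary choices of mine:

```python
def low_confidence_words(rows, threshold=0.5):
    """Return the rows whose word probability is below `threshold`.

    Each row is expected to be [word, start, end, prob].
    A low probability only means Whisper was unsure about the word it chose;
    it says nothing about what the alternative words were.
    """
    return [row for row in rows if row[3] < threshold]
```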

Another problem is that there are some curse words that are only offensive in pairs: e.g., "god damn" is not allowed but "damn" generally is. 
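
A possible way to handle pairs is to scan consecutive words for listed phrases and mute the whole span from the start of the first word to the end of the last. The sketch below assumes time-ordered rows in the `[word, start, end, prob]` layout used in this notebook; `find_phrase_spans` is a hypothetical helper, not part of any library.

```python
def find_phrase_spans(rows, phrases):
    """Find multi-word phrases in a time-ordered word list.

    rows:    lists of [word, start, end, prob]
    phrases: iterable of word tuples, e.g. [("god", "damn")]
    Returns sorted (start, end) spans in seconds, one per match.
    """
    words = [row[0] for row in rows]
    spans = []
    for phrase in phrases:
        phrase = tuple(phrase)
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                # Span runs from the first word's start to the last word's end
                spans.append((rows[i][1], rows[i + n - 1][2]))
    return sorted(spans)
```

Whisper might also transcribe a pair as a single word, so the one-word spelling probably still belongs in the single-word list.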

In [57]:
import re

# Whisper can sometimes add bits of punctuation, we don't care about those
def remove_punctuation(s):
    s = re.sub(r'[^a-zA-Z0-9\s]', '', s)
    return s.lower()

# Rows of [word, start, end, prob]
rows = []

# Collect words
for segment in result["segments"]:
    for word_info in segment['words']:
        word = word_info['word'].strip()
        word = remove_punctuation(word)
        
        start_time = float(word_info['start'])
        end_time = float(word_info['end'])
        prob = word_info['probability']
        
        rows.append([word, start_time, end_time, prob])

# Create Dataframe
columns = ['word', 'start', 'end', 'prob']
df_all = pd.DataFrame(rows, columns=columns)

In [58]:
# Edit for things to look for
curses = {'fuck', 'shit', 'bitch', 'nigga', 'cock', 'faggot', 'cunt', 'pussy', 'dick'} #To name a few...?
pattern = '|'.join(curses)

# I noticed it can create duplicate entries for some reason, so delete them
df_all = df_all.drop_duplicates()

# Add an integer column: 1 for "is curse" and 0 for "is not curse"
# (this is a substring match, so plurals and other variants are caught too)
df_all['curse'] = df_all['word'].str.contains(pattern, case=False, na=False, regex=True).astype(int)

# Save the dataframe as a log (create the logs folder first if it doesn't exist)
os.makedirs('logs', exist_ok=True)
df_all.to_csv(f'logs/{song}-{whisper_type}-all.csv')

In [65]:
# Select only those words which are curse words
df = df_all[df_all['curse']==1]
df 

index  word    start  end    prob      curse
2      fuck    9.84   10.0   0.941962  1
8      bitch   11.34  11.44  0.610328  1
45     niggas  17.2   17.36  0.985287  1
61     shit    20.12  20.32  0.978746  1
79     fuck    24.94  25.26  0.474615  1
82     niggas  25.52  25.74  0.920795  1
117    niggas  35.98  36.16  0.958646  1
128    bitch   38.06  38.28  0.891795  1
135    bitch   39.22  39.36  0.138935  1
154    shit    42.84  43.08  0.969069  1


In [64]:
## Inspect the full lyrics
full_text = " ".join(df_all['word'].tolist())
print(full_text)

first off fuck elon musk 8 too much bitch thats expensive put a hose in the back and a crack in the slack if i tweet and delete and i meant it i dont really need a check cause i got no respect and these niggas might not be like a dentist city all about free speech still he had some shit in his head and his fingers this aint what you want no this aint what you want fuck y all niggas i feel like papa john left respect to the bank im tony khan and they hate what i say cause i aint wrong what kind of rabbit is this the kind of man rat niggas miss you know it i put a throw in my bitch gotta push it in finna like bitch she know it you think its me texting with kids it doesnt get stranger than this thats oh shit i dont fuck you niggas like im unfugged wait i am me written eyes with a shot of repeat while you stretching the truth in your tweets ive been stretching it girl i am she once a deets in the crib you get so under for weeks watch your energy watch what you tweet you gon feel like you a

These functions do the actual editing. The first uses the curses identified in df to mute the *vocals* track only, at the specified times. The second recombines the vocals and instruments.

In [61]:
# Pydub does the audio effects
from pydub import AudioSegment

## Applies silencing to input_audio_path at a given list of (start_ms, end_ms) times
def silence_audio_segment(input_audio_path, output_audio_path, times):
    
    print(f'Applying silencing edits to vocals: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Audio before the section to mute (pydub slices are in milliseconds)
        before_segment = audio[:start_ms]

        # Reducing the gain by 60 dB effectively mutes it
        target_segment = audio[start_ms:end_ms] - 60

        # Audio after the section to mute
        after_segment = audio[end_ms:]

        # Concatenate: this can be made faster, but it's not a priority
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting edited vocals to {output_audio_path}')             
    audio.export(output_audio_path, format='wav') 
    return

# Combines two audio tracks via their paths (vocals and instruments for example)
def combine_audio(path1, path2, outpath):
    audio1 = AudioSegment.from_file(path1, format='wav')
    audio2 = AudioSegment.from_file(path2, format='wav')

    combined_audio = audio1.overlay(audio2)

    # format='mp3' for mp3 files   
    combined_audio.export(outpath, format="mp3") 
    
    return
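
One possible refinement of the list of times passed to `silence_audio_segment`: word timestamps can be slightly off, so it may help to pad each interval a little and merge intervals that then overlap, which also avoids muting the same stretch twice. This is only a sketch; `pad_and_merge` and the 50 ms default are assumptions of mine, not something tuned on real tracks.

```python
def pad_and_merge(times, pad_ms=50, track_len_ms=None):
    """Pad each (start_ms, end_ms) interval and merge any that overlap.

    times:        list of (start_ms, end_ms) tuples in milliseconds
    pad_ms:       padding added on both sides of every interval
    track_len_ms: optional track length used to clamp the end times
    """
    merged = []
    for start, end in sorted(times):
        start = max(0, start - pad_ms)
        end = end + pad_ms
        if track_len_ms is not None:
            end = min(end, track_len_ms)
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

It could be applied as `times = pad_and_merge(times)` right before the silencing call.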

Mute the explicit content, then recombine the vocals and no_vocals tracks.

In [62]:
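## Convert the start/end times (seconds) to integer milliseconds for pydub
## Maybe a bit inefficient
times = []
for row in df.itertuples():
    start = int(row.start * 1000)
    end = int(row.end * 1000)
    times.append((start, end))

# Run the silencing script
# Note: this overwrites the separated vocals stem in place
silence_audio_segment(vocals_path, vocals_path, times)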

## Output file name
output_file = f"music/{song}-{whisper_type}-edit.mp3"
output_path = os.path.join(notebook_dir, output_file)

print('Combining the audio...')
combine_audio(vocals_path, no_vocals_path, output_path)
print(f'Exported to {output_path}')

Applying silencing edits to vocals: c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/jpeg/vocals.wav
Outputting edited vocals to c:\Users\dacla\Documents\auto-censoring-local\separated/mdx_extra/jpeg/vocals.wav
Combining the audio...
Exported to c:\Users\dacla\Documents\auto-censoring-local\music/jpeg-large-v3-turbo-edit.mp3


The last step is to transfer the metadata from the original track to the edited track.

In [63]:
from mutagen.easyid3 import EasyID3

# Transfer the metadata from the original to the edited track
def transfer_metadata(original_audio_path, edited_audio_path):
    
    audio_orig = EasyID3(original_audio_path)
    audio_edit = EasyID3(edited_audio_path)

    ## Add more keys if wanted
    keys = ['title', 'artist', 'album', 'date', 'tracknumber']  # 'date' is often a year or a full date

    # Copy each tag that exists on the original track.
    # Missing tags are skipped so the string 'None' never gets written.
    for key in keys:
        value = audio_orig.get(key)
        if value:
            audio_edit[key] = value

    # and save
    audio_edit.save()
    return
        
transfer_metadata(original_audio_path=full_audio_path,
                  edited_audio_path=output_path)