Basic template for using pretrained (not fine-tuned) Whisper to detect individual curse words in a music track. Demucs splits the audio first, then the vocals track is muted during the curse words.

Everything should probably be done with .wav files to preserve audio quality. Many temporary files (the separated tracks) also get created, and these could be deleted at the end.
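
A minimal sketch of the cleanup step, assuming the stems live under `separated/<model>/<song>` as in the Demucs output further down. The helper name `remove_separated_stems` and its parameters are made up for illustration and are not part of the notebook.

```python
import shutil
from pathlib import Path

def remove_separated_stems(song, model_name="mdx_extra", root="separated"):
    """Delete the stem folder for one song; return True if a folder was removed."""
    # Assumed layout: <root>/<model_name>/<song>/vocals.mp3, no_vocals.mp3
    stem_dir = Path(root) / model_name / song
    if stem_dir.is_dir():
        shutil.rmtree(stem_dir)
        return True
    return False
```

It would be called once at the very end, e.g. `remove_separated_stems(song)`, after the final mix has been exported.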

In [1]:
import whisper
import torch
import pandas as pd
import os

# Check for cuda/cpu
device = ('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using {device}')

# Load the model: large-v3-turbo or large-v3
model = whisper.load_model("large-v3-turbo", device=device)

Using cuda


Get the file location and audio path, then split the track into vocals and instruments. Demucs handles a lot automatically; its options may be worth investigating further.

In [68]:
import demucs.separate

# I put a few tracks in here
song = 'clint'
audio_file = f"music/{song}.mp3"

# Format file path
notebook_dir = os.getcwd()
full_audio_path = os.path.join(notebook_dir, audio_file)

# Split the vocals with demucs
# There's some parameters here that could be improved on
demucs.separate.main(["--mp3", "--two-stems", "vocals", "-n", "mdx_extra", full_audio_path])


Selected model is a bag of 4 models. You will see that many progress bars per track.
Separated tracks will be stored in C:\Users\dacla\Documents\auto-censoring\separated\mdx_extra
Separating track c:\Users\dacla\Documents\auto-censoring\music\clint.mp3


100%|███████████████████████████████████████████████████████████████████████| 363.0/363.0 [00:03<00:00, 102.11seconds/s]
100%|████████████████████████████████████████████████████████████████████████| 363.0/363.0 [00:03<00:00, 97.98seconds/s]
100%|████████████████████████████████████████████████████████████████████████| 363.0/363.0 [00:03<00:00, 98.78seconds/s]
100%|███████████████████████████████████████████████████████████████████████| 363.0/363.0 [00:03<00:00, 104.37seconds/s]
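
One option worth trying is keeping the stems as WAV. As far as I know, Demucs writes WAV by default when `--mp3` is omitted, and `-o` sets the output folder. The sketch below only builds the argument list; `build_demucs_args` is a hypothetical helper, not something Demucs provides.

```python
def build_demucs_args(audio_path, model_name="mdx_extra", use_mp3=False, out_dir=None):
    """Build the argument list passed to demucs.separate.main()."""
    args = ["--two-stems", "vocals", "-n", model_name]
    if use_mp3:
        # Same behaviour as the cell above: stems are saved as MP3
        args.append("--mp3")
    if out_dir is not None:
        # Write stems somewhere other than the default "separated" folder
        args += ["-o", out_dir]
    args.append(audio_path)
    return args
```

Usage would be `demucs.separate.main(build_demucs_args(full_audio_path))`. With WAV output the stem paths in later cells would need to end in `.wav` instead of `.mp3`.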


In [69]:
# Paths for vocals and no_vocals stems
vocals = f"separated/mdx_extra/{song}/vocals.mp3"
no_vocals = f"separated/mdx_extra/{song}/no_vocals.mp3"

vocals_path = os.path.join(notebook_dir, vocals)
no_vocals_path = os.path.join(notebook_dir, no_vocals)

### Apply the transcription with Whisper

# word_timestamps=True for timestamp info
# Getting Triton kernel issues when I run this ?
result = model.transcribe(vocals_path, word_timestamps=True)



Here we define the words to look out for. I think this part will need the most NLP work to decide what to edit. Obvious bad words are easy, but some cases are context-dependent and will need a separate model to interpret.

There is also the issue of knowing the probabilities of the outputs. I'm not sure how to access the 2nd or 3rd most likely word that Whisper considers for a given segment. If the 2nd most likely word is a curse word and its probability is close to the 1st, the segment is probably worth editing.
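
As far as I can tell, the word entries returned by `model.transcribe(..., word_timestamps=True)` only carry the top word and its `probability`, not the alternatives. A cheap stand-in is to flag low-probability words for manual review. This is a sketch under that assumption; the helper name and the 0.5 threshold are arbitrary choices, not values from the notebook.

```python
def collect_uncertain_words(result, threshold=0.5):
    """Return (word, start, end, probability) for words Whisper was unsure about."""
    uncertain = []
    for segment in result["segments"]:
        # "words" is only present when word_timestamps=True was used
        for w in segment.get("words", []):
            if w["probability"] < threshold:
                uncertain.append(
                    (w["word"].strip().lower(), w["start"], w["end"], w["probability"])
                )
    return uncertain
```

These words could be listened to by hand, or re-checked with a second model, before deciding whether to edit them.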

Another problem is that some curse words are only offensive in pairs: e.g., "god damn" is not allowed but "damn" generally is.
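
One way to handle pairs is to match sequences of consecutive words rather than single words. The sketch below assumes a list of `(word, start, end)` tuples in transcript order; `normalize` and `find_phrases` are hypothetical helpers. It will not catch a phrase that Whisper writes as one token (e.g. "goddamn").

```python
import string

def normalize(word):
    """Lowercase and strip surrounding whitespace and punctuation."""
    return word.strip().lower().strip(string.punctuation)

def find_phrases(words, phrases):
    """Return (phrase, start, end) for each run of consecutive words matching a phrase.

    words:   list of (word, start, end) tuples in transcript order
    phrases: iterable of tuples of normalized words, e.g. [("god", "damn")]
    """
    tokens = [normalize(w) for w, _, _ in words]
    hits = []
    for phrase in phrases:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == tuple(phrase):
                # Span runs from the first word's start to the last word's end
                hits.append((" ".join(phrase), words[i][1], words[i + n - 1][2]))
    return hits
```

The returned spans could be appended to the same list of times used for single words, so a lone "damn" stays untouched while "god damn" is edited as one block.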

In [70]:
# Edit for things to look for
curses = {'fuck', 'motherfucker', 'shit', 'bitch', 'nigga', 'cock'}  # To name a few...?
to_add = []
all_words = []  # not named `all`, which would shadow the built-in

# Collect words
for segment in result["segments"]:
    for word_info in segment['words']:
        word = word_info['word'].strip().lower()
        row = [word, float(word_info['start']), float(word_info['end']), word_info['probability']]
        all_words.append(row)
        # Substring match so variants like "fucking" are caught;
        # any() adds each word at most once even if several entries match
        if any(curse in word for curse in curses):
            to_add.append(row)

# Create Dataframe
columns = ['word', 'start', 'end', 'prob']
df = pd.DataFrame(to_add, columns=columns)
df_all = pd.DataFrame(all_words, columns=columns)

Inspect the words identified by Whisper and save a log.

In [71]:
# Inspect dataframe
df = df.drop_duplicates()
df_all = df_all.drop_duplicates()

df

            word   start     end      prob
0          shit,   80.94   81.22  0.394846
1        fucking  132.84  133.10  0.607768
2  motherfuckers  149.04  149.78  0.998553


In [None]:
# Save to csv (create the logs folder first if it doesn't exist)
os.makedirs('logs', exist_ok=True)
df.to_csv(f'logs/{song}.csv')

In [72]:
## Print full transcript
full_text = result["text"].strip()
print(full_text)

I ain't happy, I'm feeling glad I got sunshine In a bag I'm useless, but not for long The future is coming on I ain't happy, I'm feeling glad I got sunshine In a bag I'm useless, but not for long The future is coming on It's coming on It's coming on Finally, someone let me out of my cage Now, time for me is nothing because I'm counting no A's Nah, I couldn't be there Nah, you shouldn't be scared I'm good at repairs And I'm under each snare Intangible Bet you didn't think so I command you to panoramic view Look, I'll make it all manageable Pick and choose, sit and lose All you different crews Chicks and dudes Who you think is really kicking tunes Picture you getting down in a picture tube Like you lit the fuse You think it's fictional, mystical, maybe Spiritual Hero who appears in you to clear your view When you're too crazy Lifeless to those The definition for what life is Priceless to you Because I put you on the hype Shit, you like it Gunsmoking righteous with one token You're psychi

These functions do the actual editing. I have two procedures here: one reverses the flagged segments of a track, and the other mutes the flagged segments (ideally applied to the vocals track only).

In [57]:
# Pydub does the audio effects
from pydub import AudioSegment

def reverse_audio_segment(input_audio_path, output_audio_path, times):
    print(f'Applying reverse edits to: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Select times to reverse
        before_segment = audio[:start_ms]


        target_segment = audio[start_ms:end_ms]
        target_segment = target_segment.reverse()

        after_segment = audio[end_ms:]

        # Concatenate
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting audio to {output_audio_path}')
    audio.export(output_audio_path, format="mp3") 
    return

## Applies silencing to input_audio_path at the given list of times
def silence_audio_segment(input_audio_path, output_audio_path, times):
    
    print(f'Applying silencing edits to vocals: {input_audio_path}')
    # Load the audio file
    audio = AudioSegment.from_file(input_audio_path)
    for (start_ms, end_ms) in times:
        # Select times to silence
        before_segment = audio[:start_ms]

        # -60dB to the audio effectively mutes it
        target_segment = audio[start_ms:end_ms] - 60

        after_segment = audio[end_ms:]

        # Concatenate
        audio = before_segment + target_segment + after_segment

    # Export the modified audio
    print(f'Outputting edited vocals to {output_audio_path}')
    audio.export(output_audio_path, format="mp3") 
    return

# Combines two audio tracks via their paths (vocals and instruments for example)
def combine_audio(path1, path2, outpath):
    audio1 = AudioSegment.from_file(path1, format='mp3')
    audio2 = AudioSegment.from_file(path2, format='mp3')

    combined_audio = audio1.overlay(audio2)
    combined_audio.export(outpath, format="mp3") 
    return
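
For reference, the `- 60` in `silence_audio_segment` is a gain change in decibels. Using the standard amplitude relation, ratio = 10^(dB / 20), -60 dB scales the signal to about 0.1% of its original amplitude, which is why it is effectively muted rather than strictly silent. A small sketch of that arithmetic (the helper name is made up for illustration):

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to a linear amplitude ratio."""
    return 10 ** (gain_db / 20)

# -60 dB leaves roughly one thousandth of the original amplitude
ratio = db_to_amplitude_ratio(-60)
print(f"{ratio:.4f}")
```

If true silence is preferred, pydub's `AudioSegment.silent(duration=...)` could be substituted for the attenuated segment, though I have not tested whether that sounds better in the final mix.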

Method 1 uses the vocals track to mute the bad words. Then recombines the vocals and no_vocals tracks

In [73]:
# Convert each flagged word's start/end from seconds to integer milliseconds (pydub slices in ms)
times = [(int(row.start * 1000), int(row.end * 1000)) for row in df.itertuples()]

# Run the silencing script
silence_audio_segment(vocals_path, vocals_path, times)

## Output file name
output_file = f"music/{song}-edit-silenced.mp3"
output_path = os.path.join(notebook_dir, output_file)

print('Combining the audio...')
combine_audio(vocals_path, no_vocals_path, output_path)
print(f'Exported to {output_path}')

Applying silencing edits to vocals: c:\Users\dacla\Documents\auto-censoring\separated/mdx_extra/clint/vocals.mp3
Outputting edited vocals to c:\Users\dacla\Documents\auto-censoring\separated/mdx_extra/clint/vocals.mp3
Combining the audio...
Exported to c:\Users\dacla\Documents\auto-censoring\music/clint-edit-silenced.mp3


Method 2 reverses the flagged segments in the full track (vocals and instruments together). It's less pleasing to the ear, honestly.



In [26]:
# Path formatting
notebook_dir = os.getcwd()

input_file = f"music/{song}.mp3"
output_file = f"music/{song}-edit-reversed.mp3"

input_path = os.path.join(notebook_dir, input_file)
output_path = os.path.join(notebook_dir, output_file)

# Convert each flagged word's start/end from seconds to integer milliseconds (pydub slices in ms)
times = [(int(row.start * 1000), int(row.end * 1000)) for row in df.itertuples()]

# Run the reversing script
reverse_audio_segment(input_path, output_path, times)

[(66400, 66800), (67540, 67900), (104340, 104620), (119340, 119620), (124360, 125040), (130060, 130240)]
Applying edits to: c:\Users\dacla\Documents\auto-censoring\separated\mdx_extra\5th-ward\vocals.mp3
Outputting audio to output.mp3
