<a href="https://colab.research.google.com/github/dvdtoth/whisper-to-stablediffusion/blob/main/whisper2sd_recorder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper to Stable Diffusion

Two open-source models chained together into a speech-to-image pipeline: speak a prompt, get a picture.

Using the recently released speech recognition model, Whisper from openai.com,
and the text-to-image latent diffusion model, Stable Diffusion from stability.ai.

Requires a https://huggingface.co/ account to download the Stable Diffusion weights.
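
The flow of the notebook can be sketched as two chained steps: audio goes in, text comes out of the speech model, and that text becomes the prompt for the image model. The snippet below is only an illustration of that shape. `speech_to_image`, `fake_transcribe` and `fake_generate` are hypothetical names that do not appear in the notebook, and the stand-ins replace the real models so the sketch runs without a GPU.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def speech_to_image(
    audio_path: str,
    transcribe: Callable[[str], str],
    generate: Callable[[str], Image],
) -> Tuple[str, Image]:
    """Chain a speech-to-text step into a text-to-image step."""
    prompt = transcribe(audio_path).strip()
    if not prompt:
        raise ValueError("no speech recognised, nothing to draw")
    return prompt, generate(prompt)


# Hypothetical stand-ins so the sketch runs without models or a GPU.
def fake_transcribe(path: str) -> str:
    return " a lighthouse at dusk "


def fake_generate(prompt: str) -> str:
    return f"<image of {prompt}>"


prompt, image = speech_to_image("audio.webm", fake_transcribe, fake_generate)
print(prompt)
print(image)
```

In the cells below, Whisper plays the role of `transcribe` and the Stable Diffusion pipeline plays the role of `generate`.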


### Sources

Models:
 * https://github.com/openai/whisper
 * https://github.com/CompVis/stable-diffusion

Whisper to SD:
 * https://huggingface.co/spaces/fffiloni/whisper-to-stable-diffusion
 * https://colab.research.google.com/drive/12DzxLRpCEDE7OtKMlkAtHKbNVGBlk7Pz



@dvdtoth

In [None]:
!nvidia-smi #Check for GPU
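
If you would rather have the check fail loudly than read the `nvidia-smi` table by eye, something like the following could work. It is a rough sketch using only the standard library; `gpu_names` is a hypothetical helper, not part of the notebook.

```python
import shutil
import subprocess


def gpu_names():
    """Return GPU names reported by nvidia-smi, or [] if none is usable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
if names:
    print("GPU found:", ", ".join(names))
else:
    print("No GPU found; switch the Colab runtime type to GPU.")
```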

In [None]:
# Prep environment
!pip install diffusers==0.3.0
!pip install transformers scipy ftfy ipywebrtc
!pip install "ipywidgets>=7,<8"
!pip install git+https://github.com/openai/whisper.git
!apt update && apt install -y ffmpeg
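
After the installs finish, a quick sanity check can confirm that the Python packages are importable and that `ffmpeg` is on the `PATH`. This is a minimal sketch; `missing_requirements` is a hypothetical helper, and the module names are assumed to match the packages installed above.

```python
import importlib.util
import shutil


def missing_requirements(modules, executables):
    """List the modules that cannot be found and the executables not on PATH."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


missing = missing_requirements(
    ["diffusers", "transformers", "whisper", "ipywebrtc"],
    ["ffmpeg"],
)
print("Missing:", ", ".join(missing) if missing else "nothing")
```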

In [None]:
# Sign in to huggingface
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# Load models
import torch
from diffusers import StableDiffusionPipeline
from torch import autocast
import whisper
import IPython.display as ipd

# Half-precision Stable Diffusion weights; needs the Hugging Face login above
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
)
pipe = pipe.to("cuda")
whisper_model = whisper.load_model("medium")

In [None]:
# Record audio
from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

audio = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=audio)
recorder

In [None]:
# Recognise and translate prompt
audio_file_path = '/content/audio.webm'
recorder.save(audio_file_path)

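# Load the recording and fit it to Whisper's 30-second input window
audio = whisper.load_audio(audio_file_path)
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram on the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(whisper_model.device)

# Detect the spoken language
_, probs = whisper_model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Translate the speech to English text
options = whisper.DecodingOptions(task="translate", fp16=False)
prompt = whisper.decode(whisper_model, mel, options)
print("Translated prompt:", prompt.text)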
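
One thing worth knowing about the cell above: `whisper.pad_or_trim` fits the audio to a 30-second window at 16 kHz, so only the first 30 seconds of a longer recording reach the decoder. The plain-Python sketch below illustrates the idea on a list of samples. It is not Whisper's implementation, and `pad_or_trim_sketch` is a hypothetical name.

```python
SAMPLE_RATE = 16_000   # Whisper works on 16 kHz audio
CHUNK_SECONDS = 30     # length of one decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Cut input that is too long, zero-pad input that is too short."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = pad_or_trim_sketch([0.1] * (SAMPLE_RATE * 5))   # 5 s recording
long_clip = pad_or_trim_sketch([0.1] * (SAMPLE_RATE * 45))   # 45 s recording
print(len(short_clip) / SAMPLE_RATE, len(long_clip) / SAMPLE_RATE)
```

Both clips come out exactly 30 seconds long: the short one is padded with silence and the long one loses its last 15 seconds, so keep spoken prompts brief.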

In [None]:
# Generate image
with autocast("cuda"):
    image = pipe(prompt.text).images[0]
image
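
To keep the result, you could save the image under a name derived from the spoken prompt. The helper below is a small sketch for building such a filename; `prompt_to_filename` is a hypothetical function, not part of the notebook.

```python
import re


def prompt_to_filename(prompt, max_len=60, ext=".png"):
    """Turn a free-text prompt into a safe, lowercase filename."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    slug = slug[:max_len].rstrip("-")
    return (slug or "image") + ext


filename = prompt_to_filename("A lighthouse at dusk, oil painting!")
print(filename)  # a-lighthouse-at-dusk-oil-painting.png
```

In the notebook this would be used as `image.save(prompt_to_filename(prompt.text))`, assuming the pipeline returns a PIL image as it does in the cell above.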