# Video Killed The Radio Star ...Diffusion.

Notebook by David Marx ([@DigThatData](https://twitter.com/digthatdata))

Shared under MIT license


## FAQ

**What is this?**

Point this notebook at a YouTube URL and it'll make a music video for you.

**How does this animation technique work?**

For each text prompt you provide, the notebook will...

1. Generate an image based on that text prompt (using Stable Diffusion).
2. Use the generated image as the `init_image`, recombining it with the text prompt to generate variations similar to the first image. This produces a sequence of extremely similar images based on the original text prompt.
3. Reorder the images to find the smoothest animation sequence of those frames.
4. Repeat this image sequence to pad out the animation duration as needed.
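
The reordering and padding steps can be sketched with toy data. This is an illustration only, not the notebook's implementation: the notebook reorders frames with `batched_tsp_permute_frames` from `vktrs.tsp`, whereas this sketch swaps in a simple greedy nearest-neighbor heuristic. The names `distance`, `greedy_order`, and `pad_to_length`, and the use of short numeric tuples in place of images, are assumptions made for the example.

```python
from itertools import cycle, islice
import math


def distance(a, b):
    """Euclidean distance between two frames (here: equal-length tuples of numbers)."""
    return math.dist(a, b)


def greedy_order(frames):
    """Order frames so each one is followed by its nearest unused neighbor.

    A greedy stand-in for a travelling-salesman style ordering; it is not
    guaranteed to find the smoothest possible sequence.
    """
    if not frames:
        return []
    remaining = list(frames[1:])
    ordered = [frames[0]]  # keep the first (init) frame as the starting point
    while remaining:
        nearest = min(remaining, key=lambda f: distance(ordered[-1], f))
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered


def pad_to_length(frames, n_frames):
    """Repeat the sequence cyclically until it contains exactly n_frames items."""
    return list(islice(cycle(frames), n_frames))


# Toy "images": each frame is a 2-value tuple instead of real pixel data.
frames = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (4.0, 4.0)]
ordered = greedy_order(frames)
padded = pad_to_length(ordered, 6)
```

With real images, `distance` would compare pixel arrays instead of tuples; the cyclic padding mirrors how the notebook repeats a short sequence to fill a lyric's duration.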

The technique demonstrated in this notebook was inspired by a [video](https://www.youtube.com/watch?v=WJaxFbdjm8c) created by Ben Gillin.

**How are lyrics transcribed?**

This notebook uses OpenAI's recently released Whisper model for automatic speech recognition.
OpenAI was kind enough to offer several different sizes of this model, each with its own pros and cons.
This notebook uses the largest Whisper model to transcribe the actual lyrics, and the
smallest model to perform the lyric segmentation. Neither of these models is perfect, but the results
so far seem pretty decent.
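
As a rough illustration of what the segmentation step produces, the sketch below converts a Whisper-style transcription result into a list of prompts with start times. It assumes the result is a dict with a `segments` list whose items carry `start` (in seconds) and `text`, which is the shape Whisper's `transcribe` returns. The sample data is made up, and `segments_to_prompt_starts` is a hypothetical helper, not a function from `vktrs`. The `td` and `prompt` keys mirror the record fields used later in this notebook.

```python
import datetime as dt


def segments_to_prompt_starts(result):
    """Turn Whisper-style segments into records with a start time and a prompt."""
    prompt_starts = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments (e.g. instrumental passages)
        prompt_starts.append({
            "td": dt.timedelta(seconds=seg["start"]),
            "prompt": text,
        })
    return prompt_starts


# Made-up sample mimicking the structure of a transcription result.
sample_result = {
    "text": " First line of the song. Second line of the song.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " First line of the song."},
        {"start": 2.5, "end": 4.0, "text": "  "},
        {"start": 4.0, "end": 6.0, "text": " Second line of the song."},
    ],
}

prompt_starts = segments_to_prompt_starts(sample_result)
for rec in prompt_starts:
    print(rec["td"], rec["prompt"])
```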

The first draft of this notebook relied on subtitles from YouTube videos to determine timing, which was
then aligned with user-provided lyrics. YouTube's automated captions are powerful and I'll update the
notebook shortly to leverage those again, but for the time being we're just using Whisper for everything
and not referencing user-provided captions at all.

**Something didn't work quite right in the transcription process. How do I fix the timing or the actual lyrics?**

The notebook is divided into several steps. Between each step, a "storyboard" file is updated. If you want to
make modifications, you can edit this file directly and those edits should be reflected when you next load the
file. Depending on what you changed and what step you run next, your changes may be ignored or even overwritten.
Still playing with different solutions here.
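
One case where an edit is ignored: the image-generation cells only create new images for a record when its cached file paths are missing (`frame0_fpath` for the init image, `images_fpaths` for the animation frames). The sketch below shows the idea on a plain list of dicts standing in for the storyboard's `prompt_starts`. `edit_prompt` is a hypothetical helper, the file paths are placeholders, and clearing `variations_fpaths` as well is an assumption made for tidiness.

```python
def edit_prompt(prompt_starts, idx, new_prompt,
                cached_keys=("frame0_fpath", "variations_fpaths", "images_fpaths")):
    """Change a record's prompt and drop cached image paths so they get regenerated."""
    rec = prompt_starts[idx]
    rec["prompt"] = new_prompt
    for key in cached_keys:
        rec.pop(key, None)  # no error if the key was never set
    return rec


# Toy stand-in for storyboard.prompt_starts after images were generated.
prompt_starts = [
    {"prompt": "misheard lyric", "frames": 24,
     "frame0_fpath": "frames/test/0.png",
     "images_fpaths": ["frames/test/0.png"]},
    {"prompt": "second line", "frames": 12,
     "frame0_fpath": "frames/test/1.png"},
]

edit_prompt(prompt_starts, 0, "corrected lyric")
```

Only the edited record loses its cached paths, so the other segments keep their existing images.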

**Can I provide my own images to 'bring to life' and associate with certain lyrics/sequences?**

Yes, you can! As described above: you just need to modify the storyboard. Will describe this functionality in
greater detail after the implementation stabilizes a bit more.
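
In the meantime, here is a hedged sketch of the idea, based on how the init-image cell behaves: it only generates an image for a record whose `frame0_fpath` is unset, so pointing that field at your own file should make the notebook use your image as the starting frame. `use_own_image` is a hypothetical helper, the paths are placeholders, and a list of dicts stands in for the storyboard's `prompt_starts`.

```python
def use_own_image(prompt_starts, idx, image_path):
    """Point a record's init image at a user-supplied file."""
    rec = prompt_starts[idx]
    rec["frame0_fpath"] = str(image_path)
    # Drop previously generated frames so variations are rebuilt from the new image.
    rec.pop("images_fpaths", None)
    return rec


# Toy stand-in for storyboard.prompt_starts.
prompt_starts = [
    {"prompt": "first line", "frames": 24},
    {"prompt": "second line", "frames": 12,
     "images_fpaths": ["frames/test/1.png"]},
]

use_own_image(prompt_starts, 1, "my_images/chorus.png")
```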

**This gave me an idea and I'd like to use just a part of your process here. What's the best way to reuse just some of the machinery you've developed here?**

Most of the functionality in this notebook has been offloaded to a library I published to PyPI called `vktrs`. I strongly encourage you to import anything you need 
from there rather than cutting and pasting functions into a notebook. Similarly, if you have ideas for improvements, please don't hesitate to submit a PR!

**How can I support your work or work like it?**

This notebook was made possible thanks to ongoing support from [stability.ai](https://stability.ai/). The best way to support my work is to share it with your friends, [report bugs](https://github.com/dmarx/video-killed-the-radio-star/issues/new), or to donate to open source non-profits :) 

In [None]:
%%capture
# @title # 0. Setup
!pip install vktrs[api,hf]

In [None]:
# @markdown # Check GPU Status

from vktrs.utils import gpu_info

gpu_info()

In [None]:
# @title # 1. 🔑 Provide your API Key
# @markdown Running this cell will prompt you to enter your API Key below. 

# @markdown To get your API key, visit https://beta.dreamstudio.ai/membership

# @markdown ---

# @markdown A note on security best practices: **don't publish your API key.**

# @markdown We're using a form field designed for sensitive data like passwords.
# @markdown This notebook does not save your API key in the notebook itself,
# @markdown but instead loads your API Key into the colab environment. This way,
# @markdown you can make changes to this notebook and share it without concern
# @markdown that you might accidentally share your API Key. 
# @markdown 

use_stability_api = True # @param {type:'boolean'}

if use_stability_api:
    import os, getpass
    os.environ['STABILITY_KEY'] = getpass.getpass('Enter your API Key')
else:
    # use diffusers
    !pip install diffusers
    !pip install "ipywidgets>=7,<8"
    !pip install transformers

    !sudo apt -qq install git-lfs
    !git config --global credential.helper store

    from google.colab import output
    from huggingface_hub import notebook_login

    output.enable_custom_widget_manager()
    notebook_login()

In [None]:
# @title # 2. 📋 Animation parameters

import datetime as dt

import json
import os
from pathlib import Path
import re
import string
from subprocess import Popen, PIPE
import textwrap
import time
import warnings

import tokenizations
import webvtt

from IPython.display import display
import numpy as np


from itertools import chain, cycle

from tqdm.autonotebook import tqdm

from vktrs.youtube import (
    YoutubeHelper,
    parse_timestamp,
    vtt_to_token_timestamps,
    srv2_to_token_timestamps,
)

from omegaconf import OmegaConf
# to do: use project name to name file
# to do: separate global params defined here from the storyboard object.
#        users will not anticipate that updates here will destroy their work
storyboard = OmegaConf.create()

storyboard.params = dict(

     video_url = 'https://www.youtube.com/watch?v=REojIUxX4rw' # @param {type:'string'}
    #, audio_fpath = '' # @param {type:'string'} # TO DO: drop reliance on youtube for audio
    , audio_fpath = None
    , theme_prompt = "extremely detailed, painted by ralph steadman and radiohead, beautiful, wow" # @param {type:'string'}

    , n_variations=5 # @param {type:'integer'}
    , image_consistency=0.9 # @param {type:"slider", min:0, max:1, step:0.01}
    , fps = 12 # @param {type:"slider", min:4, max:60, step:1}
    , height = 512 # @param {type:'integer'}
    , width = 512 # @param {type:'integer'}

    , output_filename = 'output.mp4' # @param {type:'string'}
    , add_caption = True # @param {type:'boolean'}
    , display_frames_as_we_get_them = True # @param {type:'boolean'}

    , optimal_ordering = True # @param {type:'boolean'}
    , whisper_seg = True # @param {type:'boolean'}
    , max_video_duration_in_seconds = 300 # @param {type:'integer'}

    , max_variations_per_opt_pass=15
    , use_stability_api = use_stability_api

)

if not storyboard.params.audio_fpath:
    storyboard.params.audio_fpath = None


# @markdown ---

# @markdown `video_url` - URL of a youtube video to download as a source for audio and potentially for text transcription as well.

##################
# markdown `audio_fpath` - Optionally provide an audio file instead of relying on a youtube download. Name it something other than 'audio.mp3', 
# markdown                 otherwise it might get overwritten accidentally.
##################

# @markdown `theme_prompt` - Text that will be appended to the end of each lyric, useful for e.g. applying a consistent aesthetic style


# @markdown `n_variations` - How many unique variations to generate for a given text prompt

# @markdown `image_consistency` - controls similarity between images generated by the prompt.
# @markdown - 0: ignore the init image
# @markdown - 1: as true as possible to the init image

# @markdown `fps` - Frames-per-second of generated animations


# @markdown `output_filename` - filename your video will be saved to

# @markdown `add_caption` - Whether or not to overlay the prompt text on the image

# @markdown `display_frames_as_we_get_them` - Displaying frames will make the notebook slightly slower


# @markdown `optimal_ordering` - Intelligently permutes animation frames to provide a smoother animation.

# @markdown `whisper_seg` - Whether or not to use openai's whisper model for lyric segmentation. This is currently the only option, but that will change in a few days.

# @markdown `max_video_duration_in_seconds` - Early stopping if you don't want to generate a video the full duration of the provided audio file. Defaults to 5 minutes.




storyboard.params.max_frames = storyboard.params.fps * storyboard.params.max_video_duration_in_seconds


print(f"Max total frames: {storyboard.params.max_frames}")
#print(f"Max API requests: {int(max_frames/repeat)}")

if storyboard.params.optimal_ordering:

    opt_batch_size = storyboard.params.n_variations
    while opt_batch_size > storyboard.params.max_variations_per_opt_pass:
        opt_batch_size //= 2  # integer division keeps the batch size a whole number of frames
    print(f"Frames per re-ordering batch: {opt_batch_size}")
    storyboard.params.opt_batch_size = opt_batch_size

storyboard_fname = 'storyboard.yaml'
# OmegaConf.save accepts a file path directly, so no separate file handle is needed
OmegaConf.save(config=storyboard, f=storyboard_fname)



In [None]:
%%capture

# @title # 3. 📥 Download audio from youtube

storyboard_fname = 'storyboard.yaml'
storyboard = OmegaConf.load(storyboard_fname)

video_url = storyboard.params.video_url

if video_url:
    # check if user provided an audio filepath (or we already have one from youtube) before attempting to download
    if storyboard.params.get('audio_fpath') is None:
        helper = YoutubeHelper(video_url)

        input_audio = helper.info['requested_downloads'][-1]['filepath']
        !ffmpeg -y -i "{input_audio}" -acodec libmp3lame audio.mp3

        # to do: write audio and subtitle paths/meta to storyboard
        storyboard.params.audio_fpath = 'audio.mp3'

        if False:
            subtitle_format = helper.info['requested_subtitles']['en']['ext']
            subtitle_fpath = helper.info['requested_subtitles']['en']['filepath']

            if subtitle_format == 'srv2':
                with open(subtitle_fpath, 'r') as f:
                    srv2_xml = f.read() 
                token_start_times = srv2_to_token_timestamps(srv2_xml)
                # to do: handle timedeltas...
                #storyboard.params.token_start_times = token_start_times

            elif subtitle_format == 'vtt':
                captions = webvtt.read(subtitle_fpath)
                token_start_times = vtt_to_token_timestamps(captions)
                # to do: handle timedeltas...
                #storyboard.params.token_start_times = token_start_times

            # If unable to download supported subtitles, force use whisper
            else:
                storyboard.params.whisper_seg = True

# force use
storyboard.params.whisper_seg = True

# OmegaConf.save accepts a file path directly, so no separate file handle is needed
OmegaConf.save(config=storyboard, f=storyboard_fname)

whisper_seg = storyboard.params.whisper_seg

In [None]:
# @title # 4. 💬 Transcribe and segment speech using whisper

storyboard_fname = 'storyboard.yaml'
storyboard = OmegaConf.load(storyboard_fname)

if whisper_seg:
    !pip install git+https://github.com/openai/whisper
    from vktrs.asr import whisper_lyrics

    prompt_starts = whisper_lyrics(audio_fpath=storyboard.params.audio_fpath)

    #storyboard.prompt_starts = prompt_starts
    # to do: deal with these td objects
    #with open('storyboard.yaml') as fp:
    #    OmegaConf.save(config=storyboard, f=fp.name)

In [None]:
# @title # 5. 🧮 Math

### This cell computes how many frames are needed for each segment
### based on the start times for each prompt

import datetime as dt
fps = storyboard.params.fps

ifps = dt.timedelta(seconds=1/fps)

# estimate video end
video_duration = dt.timedelta(seconds=helper.info['duration'])

# dummy prompt for last scene duration
prompt_starts.append({'td':video_duration})

# make sure we respect the duration of the previous phrase
frame_start=dt.timedelta(seconds=0)
prompt_starts[0]['anim_start']=frame_start
for i, rec in enumerate(prompt_starts[1:], start=1):
  rec_prev = prompt_starts[i-1]
  k=0
  while rec_prev['anim_start'] + k*ifps < rec['td']:
    k+=1
  k-=1
  rec_prev['frames'] = k
  rec_prev['anim_duration'] = k*ifps
  frame_start+=k*ifps
  rec['anim_start']=frame_start

# record each phrase's transcribed duration (time until the next phrase starts)
# to do: push end time into a timedelta and consider it... somewhere near here
# start at 1 so the first record is never paired with the dummy record at index -1
for i, rec1 in enumerate(prompt_starts[1:], start=1):
    rec0 = prompt_starts[i-1]
    rec0['duration'] = rec1['td'] - rec0['td']

# drop the dummy frame
prompt_starts = prompt_starts[:-1]

# to do: given a 0 duration prompt, assume its duration is captured in the next prompt 
#        and guesstimate a corrected prompt start time and duration 


### checkpoint the processing work we've done to this point

import copy

prompt_starts_copy = copy.deepcopy(prompt_starts)

for rec in prompt_starts_copy:
    for k,v in list(rec.items()):
        if isinstance(v, dt.timedelta):
            rec[k] = v.total_seconds()

        # flush image objects if they're there, they anger omegaconf
        if k in ('frame0','variations','images', 'images_raw'):
            rec.pop(k)

storyboard.prompt_starts = prompt_starts_copy

# to do: deal with these td objects
storyboard_fname = 'storyboard.yaml'
# OmegaConf.save accepts a file path directly, so no separate file handle is needed
OmegaConf.save(config=storyboard, f=storyboard_fname)

In [None]:
# @title # 6. 🙭 Generate init images

import copy
import datetime as dt
from omegaconf import OmegaConf
from pathlib import Path
import random
import string
from tqdm.autonotebook import tqdm

import PIL

from vktrs.tsp import (
    tsp_permute_frames,
    batched_tsp_permute_frames,
)

from vktrs.utils import (
    add_caption2image,
    save_frame,
    remove_punctuation,
)


storyboard_fname = 'storyboard.yaml'
storyboard = OmegaConf.load(storyboard_fname)

prompt_starts = storyboard.prompt_starts
use_stability_api = storyboard.params.use_stability_api


if use_stability_api:
    from vktrs.api import get_image_for_prompt
else:
    from vktrs.hf import HfHelper
    helper = HfHelper()
    get_image_for_prompt = helper.get_image_for_prompt


def get_variations_w_init(prompt, init_image, **kargs):
    return list(get_image_for_prompt(prompt=prompt, init_image=init_image, **kargs))

def get_close_variations_from_prompt(prompt, n_variations=2, image_consistency=.7):
    """
    prompt: a text prompt
    n_variations: total number of images to return
    image_consistency: float in [0,1], controls similarity between images generated by the prompt.
                        you can think of this as controlling how much "visual vibration" there will be.
                        - 0: ignore the init image (each image is regenerated independently)
                        - 1: stay as true as possible to the init image
    """
    images = list(get_image_for_prompt(prompt))
    for _ in range(n_variations - 1):
        img = get_variations_w_init(prompt, images[0], start_schedule=(1-image_consistency))[0]
        images.append(img)
    return images



theme_prompt = storyboard.params.theme_prompt

display_frames_as_we_get_them = storyboard.params.display_frames_as_we_get_them

height = storyboard.params.height
width = storyboard.params.width

# to do: move this up to run params
proj_name = 'test'

print("Ensuring each prompt has an associated image")
for idx, rec in enumerate(prompt_starts):
    print(
        f"[{rec['anim_start']} | {rec['ts']}] [{rec['duration']} | {rec['anim_duration']}] - {rec['frames']} - {rec['prompt']}"
    )
    lyric = rec['prompt']
    prompt = f"{lyric}, {theme_prompt}"
    if rec.get('frame0_fpath') is None:
        init_image = list(get_image_for_prompt(
              prompt, 
              height=height,
              width=width,
              )
          )[0]
        rec['frame0_fpath'] = save_frame(
            init_image,
            idx,
            root_path=Path('./frames') / proj_name,
            name=proj_name, ## to do.... uh... i dunno
            )

        if display_frames_as_we_get_them:
            print(lyric)
            display(init_image)

########################
# update config

prompt_starts_copy = copy.deepcopy(prompt_starts)

for rec in prompt_starts_copy:
    for k,v in list(rec.items()):
        if isinstance(v, dt.timedelta):
            rec[k] = v.total_seconds()
        # flush images for now
        if k in ('frame0','variations','images', 'images_raw'):
            rec.pop(k)

storyboard.prompt_starts = prompt_starts_copy

# to do: deal with these td objects
storyboard_fname = 'storyboard.yaml'
# OmegaConf.save accepts a file path directly, so no separate file handle is needed
OmegaConf.save(config=storyboard, f=storyboard_fname)

In [None]:
# @title # 7. 🚀 Generate animation frames

from omegaconf import OmegaConf
from PIL import Image

import copy
import datetime as dt
from itertools import cycle

# reload config
storyboard_fname = 'storyboard.yaml'
storyboard = OmegaConf.load(storyboard_fname)
prompt_starts = OmegaConf.to_container(storyboard.prompt_starts, resolve=True)

add_caption = storyboard.params.get('add_caption')
optimal_ordering = storyboard.params.optimal_ordering
display_frames_as_we_get_them = storyboard.params.display_frames_as_we_get_them
image_consistency = storyboard.params.image_consistency
max_frames = storyboard.params.max_frames
max_variations_per_opt_pass = storyboard.params.max_variations_per_opt_pass
n_variations = storyboard.params.n_variations


# load init_images and generate variations as needed
# to do: use SDK args to request multiple images in single request...
frames = []
print("Fetching variations")
for idx, rec in enumerate(prompt_starts):
    images = []

    if rec.get('images_fpaths') is None:
        init_image = Image.open(rec['frame0_fpath'])
        n_variations = rec.get('variations', storyboard.params.n_variations)
        n_variations = min(n_variations, rec['frames']) # don't generate variations we won't use
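        # rebuild this record's full prompt; the `prompt` variable left over from the
        # previous cell would otherwise apply the last lyric to every segment
        prompt = f"{rec['prompt']}, {storyboard.params.theme_prompt}"
        for _ in range(n_variations - 1):
            img = get_variations_w_init(prompt, init_image, start_schedule=(1-image_consistency))[0]
            images.append(img)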

        # to do: collect images in a separate object to facilitate storyboard updates
        rec['variations'] = images
        images = [init_image] + images

        rec['variations_fpaths'] = [
            save_frame(
                img,
                idx,
                root_path=Path('./frames') / proj_name,
                #name=proj_name, ## need to make sure each image gets a unique name
            ) for j, img in enumerate(rec['variations'])
        ]

        # to do: persist the ordering in the storyboard
        if optimal_ordering:
            images = batched_tsp_permute_frames(
                images,
                max_variations_per_opt_pass
            )
        rec['images'] = rec['images_raw'] = images

        if add_caption:
            rec['images'] = [add_caption2image(im, rec['prompt']) for im in rec['images']]
        
        rec['images_fpaths'] = [
            save_frame(
                img,
                idx,
                root_path=Path('./frames') / proj_name,
                #name=proj_name, ## need to make sure each image gets a unique name
            ) for j, img in enumerate(rec['images'])
        ]
    else:
        # load frames if we've already generated them
        for im_fpath in rec['images_fpaths']:
            im = Image.open(im_fpath)
            images.append(im)
        rec['images'] = images

    if display_frames_as_we_get_them:
        print(rec['prompt'])
        for im in rec['images']:
            display(im)

    #images *= repeat
    sequence = []
    frame_factory = cycle(rec['images'])
    while len(sequence) < rec['frames']:
        sequence.append(next(frame_factory))
    frames.extend(sequence)
    if len(frames) >= max_frames:
        break

########################
# update config

prompt_starts_copy = copy.deepcopy(prompt_starts)

for rec in prompt_starts_copy:
    for k,v in list(rec.items()):
        if isinstance(v, dt.timedelta):
            rec[k] = v.total_seconds()
        # flush images for now
        if k in ('frame0','variations','images', 'images_raw'):
            rec.pop(k)

storyboard.prompt_starts = prompt_starts_copy

# to do: deal with these td objects
storyboard_fname = 'storyboard.yaml'
# OmegaConf.save accepts a file path directly, so no separate file handle is needed
OmegaConf.save(config=storyboard, f=storyboard_fname)

In [None]:
# @title # 8. 🎥 Compile your video!

from subprocess import Popen, PIPE

from omegaconf import OmegaConf
from tqdm.autonotebook import tqdm

# reload config
storyboard_fname = 'storyboard.yaml'
storyboard = OmegaConf.load(storyboard_fname)

fps = storyboard.params.fps
input_audio = storyboard.params.audio_fpath
output_filename = storyboard.params.output_filename


# to do: read frames and variations back into memory. This should be the last cell that gets run, so we need to 
# update state wrt any user interventions in the storyboard object. actually, should probably do the text overlay step here


cmd_in = ['ffmpeg', '-y', '-f', 'image2pipe', '-vcodec', 'png', '-r', str(fps), '-i', '-']
cmd_out = ['-vcodec', 'libx264', '-r', str(fps), '-pix_fmt', 'yuv420p', '-crf', '1', '-preset', 'veryslow', '-shortest', output_filename]

if input_audio:
  cmd_in += ['-i', str(input_audio)]

cmd = cmd_in + cmd_out

p = Popen(cmd, stdin=PIPE)
#for im in tqdm(chain(frames)):
for im in tqdm(frames):
  im.save(p.stdin, 'PNG')
p.stdin.close()

print("Encoding video...")
p.wait()
print("Video complete.")
print(f"Video saved to: {output_filename}")

# to do: optionally compress for download
# !tar -czvf output.tar.gz output.mp4