# Notebook 1: Text-to-video generation

In this notebook, we will use a text-to-video (T2V) diffusion model, such as the one presented in the lecture, to generate a short video. The learning goals of this notebook are:
1. Familiarise yourselves with the HuggingFace T2V workflow
2. Identify the salient components of a diffusion model pipeline
3. Observe the effect of prompts on the final video quality

In the other notebooks, we will use a text-to-speech model to generate a spoken track for the video, and finally lip-sync the video to the text.

This notebook is designed to run on the free version of Google Colab. It will run on any instance (even the CPU instance type), but for best results, go to `Runtime` and choose `Change Runtime Type`, then select a GPU runtime if available. GPU runtimes are sometimes unavailable, so you might need to wait a bit.

These notebooks usually run best in Google Chrome.

Let's begin!

First, we will import the relevant parts of the HuggingFace Diffusers library and some tools to compile the final video. To prepare, please read the following material:
- [Overview of HuggingFace Diffusers for video generation](https://huggingface.co/docs/diffusers/en/using-diffusers/text-img2vid)
- [Information about the video model used below](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b)

If you would like to prepare for the lecture and challenge yourselves, research the relevant components of a diffusion model pipeline. Don't worry if you don't yet understand every part; we will cover them in the lecture. In particular, try to find out what the **Noise Scheduler**, the **UNet** and the **VAE** are used for. You do not need to understand any of the mathematics yet.

In [1]:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import HTML
from base64 import b64encode
import imageio
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from skimage.transform import resize

We are enabling some memory optimisations that the Diffusers library includes. These are:
- Half-precision inference (`fp16`)
- CPU offloading
- VAE slicing. This means that the VAE decodes a batch one item at a time instead of all at once.

You can read more about memory optimisations [here](https://huggingface.co/docs/diffusers/v0.12.0/en/optimization/fp16) if you are interested.
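
To get a feel for why half precision helps, here is a rough back-of-the-envelope estimate of the memory needed for the model weights alone. It assumes roughly 1.7 billion parameters (inferred from the model name `text-to-video-ms-1.7b`) and ignores activations and framework overhead, so treat the numbers as an approximation only.

```python
# Rough estimate of weight memory only; activations and overhead are ignored.
# ASSUMPTION: ~1.7 billion parameters, inferred from the model name.
NUM_PARAMS = 1.7e9

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(num_params, dtype):
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions, `fp16` halves the weight memory compared with `fp32`, which matters on the small GPUs of the free Colab tier.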

In [2]:
%%capture
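# Load the pipeline in half precision. Do not call .to("cuda") here:
# enable_model_cpu_offload() below manages device placement itself.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # keep sub-models on the CPU until they are needed
pipe.enable_vae_slicing()  # decode the batch one item at a time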

Note: Diffusers may print a deprecation notice at this point, stating that the `TextToVideoSDPipeline` has been deprecated and will not receive bug fixes or feature updates after Diffusers version 0.33.1.

We are now ready to create the video. In the lecture, we will learn about prompt control, i.e., the ability to steer the content that the diffusion model generates. The prompt has a large impact on the quality of the output, perhaps more so than with the LLMs you may be familiar with. Below, we provide two types of prompt: a short, factual one and a long, elaborate one. Try both to see how the video output differs.
The "negative prompt" tries to steer the model away from undesirable outputs. We will also discuss how this works in the lecture.
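
As a small preview of the lecture, a common way for diffusion pipelines to use a negative prompt is classifier-free guidance: the model predicts the noise once for the prompt and once for the negative prompt, and the result is pushed away from the negative prediction. The sketch below shows only that combination step on plain Python lists. The function name, the toy numbers and the guidance scale are illustrative assumptions, not values taken from this pipeline.

```python
def guided_noise(noise_negative, noise_positive, guidance_scale):
    """Combine two noise predictions, moving away from the negative one.

    guided = negative + scale * (positive - negative)
    """
    return [
        neg + guidance_scale * (pos - neg)
        for neg, pos in zip(noise_negative, noise_positive)
    ]


# Toy numbers standing in for the model's noise predictions (hypothetical).
noise_neg = [0.0, 1.0, -1.0]
noise_pos = [1.0, 1.0, 0.0]

# With a scale of 1 the result equals the positive prediction.
print(guided_noise(noise_neg, noise_pos, guidance_scale=1.0))
# A larger scale moves the result further away from the negative prediction.
print(guided_noise(noise_neg, noise_pos, guidance_scale=3.0))
```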

Note that this model only supports relatively short prompts. If your prompt is too long, generation will not fail, but the text beyond the limit is ignored, so put the most important parts of the prompt at the start.
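
One way to check whether a prompt fits is to count its tokens before generating. The helper below is a sketch that assumes a HuggingFace-style tokenizer (in this notebook, `pipe.tokenizer` would be the natural candidate) which exposes `model_max_length` and returns `input_ids` when called. The helper name `prompt_token_report` is our own and not part of any library.

```python
def prompt_token_report(tokenizer, text):
    """Return (number of tokens in text, tokenizer limit, whether the text fits)."""
    # truncation=False so that we see the full length rather than the cut-off length.
    token_ids = tokenizer(text, truncation=False)["input_ids"]
    limit = tokenizer.model_max_length
    return len(token_ids), limit, len(token_ids) <= limit
```

For example, once the pipeline is loaded and the prompts below are defined, `prompt_token_report(pipe.tokenizer, neg_prompt)` indicates whether the long negative prompt exceeds the limit.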

In [3]:
# prompt = 'A woman holding a cat says hello.'
# neg_prompt = "low quality"

prompt = "A woman stands in a living room filled with warm, natural light. She cradles a fluffy cat in her arms. She radiates warmth and friendliness. The cat nuzzles into her chest as she turns toward the camera with a genuine smile. She says “hello”. The scene is framed in a wide shot with a static camera"
neg_prompt = "blurry image, motion blur, camera shake, distorted face, unnatural facial expressions, incorrect lip sync, uncanny valley effect, over-smoothed skin, disproportionate body parts, abnormal hand or finger positions, unnatural cat behavior, floating objects, misplaced shadows, inconsistent lighting, low-resolution textures, pixelation, noise, background clutter, surreal or sci-fi elements, horror aesthetics, dark or moody lighting, excessive contrast, glitch effects, unintentional camera movements, wrong perspective, dynamic angles, broken anatomy, poorly animated clothing or hair, unnatural eye movement, flickering frames, compression artifacts"

To keep generation fast, we will only generate a short video of 3 seconds at 10 FPS. If you have time, you can experiment with longer videos too.
Note the `num_inference_steps` parameter below. It tells the diffusion pipeline how many steps of the reverse diffusion (denoising) process to run. We will talk about this in the lecture.
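
The relationship between duration, frame rate and frame count can be made explicit with a small helper. This is only a sketch: the function name is hypothetical, and the resulting frame count matches the intended duration only if the same frame rate is used when the video is exported and displayed.

```python
def frames_needed(duration_seconds, fps=10):
    """Number of frames to generate for a clip of the given length."""
    return int(round(duration_seconds * fps))


print(frames_needed(3))  # 30 frames, as used in this notebook
print(frames_needed(5))  # 50 frames for a 5-second clip
```

Every additional frame has to be denoised as well, so longer clips take more time and memory.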

In [4]:
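video_duration_seconds = 3
fps = 10  # matches the 100 ms frame interval used for display below
num_frames = video_duration_seconds * fps
# .frames has a leading batch dimension; [0] selects the single generated video
video_frames = pipe(prompt, negative_prompt=neg_prompt, num_inference_steps=25, num_frames=num_frames).frames[0]
video_path = export_to_video(video_frames, "./video.mp4", fps=fps)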

Finally, here is a helper function to display the video in the notebook. To use the video in the other notebooks, download it to your computer now and upload it again when you open them.

In [5]:
def display_video(video):
    """Build a matplotlib animation from a sequence of frames."""
    fig = plt.figure(figsize=(4.2, 4.2))  # display size in inches
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)  # remove the margins
    mov = []
    for frame in video:  # append the frames one by one to mov
        img = plt.imshow(frame, animated=True)
        plt.axis('off')
        mov.append([img])

    # Create the animation; 100 ms per frame corresponds to 10 FPS
    anime = animation.ArtistAnimation(fig, mov, interval=100, repeat_delay=1000)

    plt.close(fig)  # avoid showing the static figure as well
    return anime


video = imageio.mimread(video_path)  # load the frames of the exported video
HTML(display_video(video).to_html5_video())  # inline video display in HTML5
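
If you are running in Google Colab, the download can also be triggered from code rather than from the file browser. The sketch below assumes the video was saved as `./video.mp4`, as in the cell above. The helper name `download_from_colab` is our own, and outside Colab it simply does nothing.

```python
def download_from_colab(path):
    """Start a browser download of `path` when running inside Google Colab.

    Returns True if a download was started, False when not running in Colab.
    """
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(path)
    return True


started = download_from_colab("./video.mp4")
print("Download started" if started else "Not running in Colab; copy the file manually")
```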

**Conclusion**: In this notebook, we have generated a short video using the HuggingFace Diffusers library. Please remember to:
- Download the video we generated for later use.
- Shut down your instance (`Runtime` $\to$ `Disconnect and Delete Runtime`) after you are finished, so you can spin up another runtime for the next notebooks.