# Generating Audio Reactive Visuals

By:
* Daniel STULBERG HUF
* Lawson OLIVEIRA LIMA
* Lucas VITORIANO DE QUEIROZ LIRA

## 1. Introduction

In this Applied Course, we want you to help us create and play with a music visualizer. More specifically, we want you to complete a program that generates a video that responds to the features of a soundtrack.

Along the way, we also want to introduce you to a powerful tool known as Generative Adversarial Networks (or simply GANs), a class of deep learning frameworks often used to produce generative artworks.

<br><strong>Acknowledgment:</strong> This file was adapted from the deep music visualizer created by <strong>Matt Siegelman</strong>. The article describing the tool can be read <a href="https://towardsdatascience.com/the-deep-music-visualizer-using-sound-to-explore-the-latent-space-of-biggan-198cd37dac9a">here</a> and the repo containing the original script is available <a href="https://github.com/msieg/deep-music-visualizer">here</a>. 

<br><strong>Note</strong>: Make sure that the runtime type of your Colab notebook is set to GPU.

First of all, let's install and import all the libraries required to run this notebook.

In [None]:
!pip install -q pytorch_pretrained_biggan
!pip install -q imageio==2.4.1
!pip install -q imageio-ffmpeg

import sys
import librosa
import librosa.display
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import torch
import random
import moviepy.editor as mpy
from pytorch_pretrained_biggan import BigGAN, truncated_noise_sample
from tqdm import tqdm
from google.colab import files

%matplotlib inline

We will also clone the GitHub repo of one of the members of our group in order to import some helper functions (you don't need to worry about them).

In [None]:
!git clone 'https://github.com/danielhuf/audio-reactive-visuals.git'
sys.path.append('/content/audio-reactive-visuals')
from utils import *

Of course, we also have to choose and load the song that we want to visualize. For your first run, we recommend that you stick with the song provided in the repo and load only its first 30 seconds to make the execution faster. But feel free to upload another mp3 file with the song of your choice.

In [None]:
print('Reading audio') 
song = '/content/audio-reactive-visuals/audios/up.mp3'   
y, sr = librosa.load(song, duration=30)   # Loading an audio file as a floating point time series with librosa
print('Audio successfully read') 

## 2. Signal analysis

In this section, we will process the audio by adjusting some parameters of the soundtrack. After that, we will perform a feature extraction to obtain a latent vector representation of the musical features at each timestep corresponding to a frame at a specific rate.

### 2.1. Setting audio parameters

The frame length is the number of audio samples per video frame. A lower frame length corresponds to a higher frame rate, which is suitable for visualizing very fast music (but the rendering time will be longer). Conversely, a higher frame length corresponds to a lower frame rate, which cuts down runtime.

In [None]:
frame_length = 256 # values of 512 or lower give higher quality (must be a multiple of 64)
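
To make the trade-off concrete, here is a minimal sketch of the frame rate implied by a few frame lengths. It assumes the audio was loaded at librosa's default sampling rate of 22050 Hz (which is what `librosa.load` does when no `sr` is passed); `frames_per_second` is a hypothetical helper, not part of the notebook.

```python
SAMPLE_RATE = 22050  # assumption: librosa's default sampling rate

def frames_per_second(frame_length, sample_rate=SAMPLE_RATE):
    """Video frame rate implied by a number of audio samples per frame."""
    return sample_rate / frame_length

for fl in (128, 256, 512):
    print(f"frame_length={fl}: {frames_per_second(fl):.1f} frames per second")
```

Halving the frame length doubles the number of frames that BigGAN has to render for the same song.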

The pitch sensitivity controls how strongly the video reacts to changes in the song's pitch. A higher pitch sensitivity makes the shapes, textures and objects of the generated video change more rapidly according to the music notes.

In [None]:
pitch_sensitivity = 220 # range [1, 299] ~int
pitch_sensitivity = (300 - pitch_sensitivity) * 512 / frame_length

The tempo sensitivity controls how strongly the video reacts to changes in the song's volume and tempo (how fast or slow a piece of music is performed). A higher tempo sensitivity brings more movement to the generated video.

In [None]:
tempo_sensitivity = 0.25 # range [0, 1] ~float
tempo_sensitivity = tempo_sensitivity * frame_length / 512
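
The two rescaling lines above can be read as a conversion from user-facing values to per-frame values. The sketch below wraps them in a hypothetical helper, `effective_sensitivities`, to show how the converted values depend on the frame length; the formulas are copied from the cells above, only the function name is an assumption.

```python
def effective_sensitivities(pitch_sensitivity, tempo_sensitivity, frame_length):
    """Apply the notebook's rescaling to the raw sensitivity values."""
    # larger converted pitch value = slower change of the class vector
    pitch = (300 - pitch_sensitivity) * 512 / frame_length
    # converted tempo value = size of the noise step taken per frame
    tempo = tempo_sensitivity * frame_length / 512
    return pitch, tempo

for fl in (256, 512):
    print(fl, effective_sensitivities(220, 0.25, fl))
```

With shorter frames there are more frames per second, so the per-frame step shrinks and the smoothing constant grows, which keeps the perceived speed of the video roughly comparable.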

### 2.2. Extracting audio features

Now that we've set the basic parameters of the audio, it is time to extract some of its features. In order to create a music visualizer, we will be playing mainly with two tools, the <strong>chromagram</strong> and the <strong>Mel spectrogram</strong>.

The chromagram is a representation that maps the whole spectral audio information into one octave, and each octave is split into 12 bins, each one representing a semitone of the song. The plot of the chromagram across the music time window allows us to observe how the representation's pitch content is spread over the 12 chroma bands and how much energy is present in each pitch class. 

In [None]:
# create chromagram of pitches X time points
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=frame_length)

# sort pitches by overall power 
chromasort = np.argsort(np.mean(chroma,axis=1))[::-1]

# plot chromagram
img = librosa.display.specshow(chroma, y_axis='chroma', x_axis='time')
plt.title('Chromagram: ' + song.split('/')[-1])
plt.colorbar(img, label='Power')
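
To illustrate what "mapping everything into one octave" means, here is a small sketch that folds a frequency into one of the 12 pitch classes. It assumes equal temperament with A4 = 440 Hz and a bin order starting at C (which, as far as we know, matches librosa's default chroma layout); `pitch_class` is a hypothetical helper and not how `chroma_cqt` is implemented internally.

```python
import math

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(frequency_hz, a4_hz=440.0):
    """Return the pitch class of the semitone closest to a frequency."""
    # MIDI note number: 69 is A4, and each semitone adds 1
    midi_note = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    # notes one octave apart (12 semitones) land in the same bin
    return PITCH_CLASSES[midi_note % 12]

for f in (261.63, 440.0, 880.0):
    print(f, pitch_class(f))
```

Both 440 Hz and 880 Hz end up in the same bin, which is why the chromagram has only 12 rows regardless of the song's range.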

A spectrogram allows us to visualize the energy of each frequency component of the song across time. Instead of a traditional spectrogram, in which the frequencies are linearly spaced, we will be using the Mel spectrogram. Mapping the frequency bins onto the Mel scale approximates how the human ear resolves pitch, so the result corresponds better to the human perception of a song (the power values are converted to decibels only when plotting). In the code below, besides creating the spectrogram containing the overall power of the frequencies over time, we will also keep track of the gradient of that power.

Complete the following line of code to compute a Mel spectrogram of the uploaded audio. The spectrogram should have 128 Mel bands, a highest frequency of 8 kHz, and a hop length equal to ```frame_length```. (See the documentation for the function <a href="https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html">here</a>).
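
As a rough illustration of the Mel scale, the sketch below uses the common HTK-style conversion formula. This is an assumption made for illustration: librosa's default conversion uses a slightly different (Slaney-style) variant, so the exact numbers differ from what the library computes. The function names are hypothetical.

```python
import math

def hz_to_mel_htk(frequency_hz):
    """HTK-style Hz to Mel conversion (illustrative, not librosa's default)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (1000, 2000, 4000, 8000):
    print(f"{f} Hz -> {hz_to_mel_htk(f):.0f} mel")
```

Equal steps in Hz become smaller and smaller steps in Mel as the frequency rises, mirroring the ear's reduced resolution at high frequencies.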

In [None]:
# TO DO: create spectrogram (replace the placeholder with the function arguments)
spec = librosa.feature.melspectrogram(##############)

<details>
  <summary> Solution </summary>
  <pre>
spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000, hop_length=frame_length)</pre>
</details>

In [None]:
# get mean power at each time point
specm = np.mean(spec,axis=0)

# compute power gradient across time points
gradm = np.gradient(specm)

# set max gradient to 1
gradm = gradm/np.max(gradm)

# set negative gradient time points to zero 
gradm = gradm.clip(min=0)
    
# normalize mean power between 0-1
specm = (specm-np.min(specm))/np.ptp(specm)

# plot spectrogram
librosa.display.specshow(librosa.power_to_db(spec, ref=np.max), y_axis='mel', fmax=8000, x_axis='time');
plt.title('Mel Spectrogram: ' + song.split('/')[-1]);
plt.colorbar(format='%+2.0f dB');
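
To see what the normalization steps above do, here is the same sequence of operations applied to a tiny made-up power curve (the values in `toy_power` are invented for illustration).

```python
import numpy as np

toy_power = np.array([0.0, 2.0, 1.0, 5.0, 3.0])  # made-up mean power per frame

grad = np.gradient(toy_power)      # rate of change of the power
grad = grad / np.max(grad)         # largest increase becomes 1
grad = grad.clip(min=0)            # decreases in power are ignored

power = (toy_power - np.min(toy_power)) / np.ptp(toy_power)  # rescale to [0, 1]

print(grad)
print(power)
```

Only moments where the music gets louder contribute to the gradient signal, while the normalized power tracks the overall loudness.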

Okay, but what's all this for? We'll get back to these tools in a moment, but first we need to introduce you to BigGAN.

## 3. Visualization algorithm

You may be wondering where all the images are going to come from. The answer is: <strong>BigGAN</strong> and <strong>ImageNet</strong>.

Briefly speaking, generative adversarial networks, or simply GANs, represent a class of machine learning frameworks trained by two neural networks competing in a zero-sum game: a generator creates new images learning from a database of images (ImageNet in our case), while a discriminator tries to classify the images as real or fake. At the end of this process, GANs are capable of generating high-quality synthetic images.  
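
The zero-sum game can be sketched with the classic GAN value function, which the discriminator tries to maximize and the generator tries to minimize. This is the textbook formulation, shown only for intuition; it is an assumption that it resembles BigGAN's training objective, which may use a different loss. `gan_value` is a hypothetical helper operating on discriminator outputs given as probabilities.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Classic GAN value: mean log D(x) + mean log(1 - D(G(z)))."""
    d_real = np.asarray(d_real, dtype=float)  # D's belief that real images are real
    d_fake = np.asarray(d_fake, dtype=float)  # D's belief that fake images are real
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# a confident, correct discriminator versus one that is completely fooled
print(gan_value([0.9], [0.1]))
print(gan_value([0.5], [0.5]))
```

A higher value is good for the discriminator and bad for the generator, so any gain for one network is a loss for the other.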

BigGAN is a state-of-the-art GAN architecture created by Google in 2018. The version that we'll use, which is ```BigGAN-Deep-256```, contains over 1 million features and 781 million parameters.

We are interested in 1128 input values, which are split between:
1.   A 1000-unit class vector corresponding to the 1000 classes of ImageNet images (you can consult them all in the file named ```/content/audio-reactive-visuals/imagenet_classes.txt```).
2.   A 128-unit noise vector with values ranging from -2 to 2 that controls the visual features of the generated images, introducing randomness and diversity to them.
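
The two inputs listed above can be sketched with plain NumPy. The helper `truncated_noise_sketch` is hypothetical: it only approximates what `truncated_noise_sample` provides, by redrawing standard normal values that fall outside [-2, 2], and the library's own implementation may differ. Class index 309 is the one listed for the `bee_thoven` example at the end of this notebook.

```python
import numpy as np

def truncated_noise_sketch(dim=128, truncation=1.0, seed=0):
    """Draw standard normal values, redrawing any that fall outside [-2, 2]."""
    rng = np.random.default_rng(seed)
    values = rng.standard_normal(dim)
    out_of_range = np.abs(values) > 2
    while out_of_range.any():
        values[out_of_range] = rng.standard_normal(out_of_range.sum())
        out_of_range = np.abs(values) > 2
    return truncation * values

class_vector = np.zeros(1000)
class_vector[309] = 1.0            # put all the weight on a single ImageNet class

noise_vector = truncated_noise_sketch()

print(class_vector.size + noise_vector.size)
```

Spreading the weight over several entries of the class vector blends the corresponding classes, which is what the visualizer does with the pitches of the song.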





In [None]:
# load pre-trained model
model = BigGAN.from_pretrained('biggan-deep-256')

# set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# set number of classes to visualize
num_classes = 12 # range [1, 12] ~int

# set classes (you can select a random set of classes, or define a numbered list of them by hand)
# NOTE: if you select a list of classes by hand, comment out the following 3 lines and
# make sure that the length of the list is the same as the number of classes set above!
cls1000 = list(range(1000))
random.shuffle(cls1000)
classes = cls1000[:num_classes]

# set truncation, i.e. variability of generated images. Higher truncation yields more variable images, while lower truncation yields simpler images
truncation = 1 # range [0.1, 1] ~float

# set depth, i.e. the maximum value of the class vector units. Higher depth yields more thematically rich content, while lower depth yields more deep structures like human faces.
depth = 1 # range [0.01, 1] ~float

# set smooth factor, i.e. the bin size for linearly interpolating between the means of the class vectors in order to smooth fluctuations in the video.
# A higher smooth factor yields smoother results, while a lower smooth factor is suitable for visualizing fast music with rapid changes.
smooth_factor = 20 # range [10, 30] ~int
smooth_factor = int(smooth_factor * 512 / frame_length)

In [None]:
print('Initializing input vectors')

# initialize first class vector
cv1 = np.zeros(1000)

for pi, p in enumerate(chromasort[:num_classes]):
    
    if num_classes < 12:
        cv1[classes[pi]] = chroma[p][np.min([np.where(chrow>0)[0][0] for chrow in chroma])]       
    else:
        cv1[classes[p]] = chroma[p][np.min([np.where(chrow>0)[0][0] for chrow in chroma])]

# initialize first noise vector
nv1 = truncated_noise_sample(truncation=truncation)[0]

# initialize list of class and noise vectors
class_vectors = [cv1]
noise_vectors = [nv1]

# initialize previous vectors (will be used to track the previous frame)
cvlast = cv1
nvlast = nv1

# initialize the direction of noise vector unit updates
update_dir = np.zeros(128)

for ni, n in enumerate(nv1):
    if n < 0:
        update_dir[ni] = 1
    else:
        update_dir[ni] = -1

# initialize noise unit update
update_last = np.zeros(128)

print('Vectors successfully initialized') 
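
The update direction initialized above points each noise unit toward zero. In the generation loop further down, the direction is flipped whenever a unit gets close to the edge of the allowed range. The sketch below isolates that idea for a single unit with a constant step; `bounce` and its values are made up for illustration, whereas in the notebook the step size changes with the music at every frame.

```python
def bounce(value, direction, step, bound, n_steps):
    """Move a value back and forth, flipping direction near +/- bound."""
    trajectory = []
    for _ in range(n_steps):
        value = value + direction * step
        if value >= bound:
            direction = -1      # too high: start moving down
        elif value < -bound:
            direction = 1       # too low: start moving up
        trajectory.append(value)
    return trajectory

# illustrative numbers: the bound plays the role of 2*truncation - tempo_sensitivity
path = bounce(value=0.5, direction=-1, step=0.25, bound=1.75, n_steps=40)
print(min(path), max(path))
```

Because the flip happens one step before the limit, the unit keeps wandering without drifting far outside the range the generator was trained on.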

Interpolating between classes and/or noises in the latent space may lead to interesting results, which artists often explore to build AI artworks. In our case, we will set BigGAN to play with music.

You do not need to understand in depth how the following algorithm works. What is important to keep in mind is that the algorithm syncs pitch with the class vector, and tempo and volume with the noise vector. At each time point of the song, the weights of the ImageNet classes in the class vector are determined by the power of the pitches in the chromagram (as seen in the figure below). Independently, the tempo and volume of the song determine how quickly the noise vector changes.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*FKaaLad96Mhqsx4W_ShQTw.jpeg" width="500px" align="center">

Source: <a href="https://towardsdatascience.com/the-deep-music-visualizer-using-sound-to-explore-the-latent-space-of-biggan-198cd37dac9a">Matt Siegelman (Towards Data Science)</a>
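
The class vector update used in the next cell can be read as an exponential moving average of the chroma values. The sketch below checks this algebraically equivalent form on a single class weight; `update_class_weight` copies the notebook's formula, while `ema` and `weight_after` are hypothetical helpers added for illustration.

```python
import math

def update_class_weight(previous, chroma_value, pitch_sensitivity):
    """Update rule as written in the generation loop."""
    return (previous + chroma_value / pitch_sensitivity) / (1 + 1 / pitch_sensitivity)

def ema(previous, chroma_value, pitch_sensitivity):
    """Same rule written as an exponential moving average."""
    alpha = pitch_sensitivity / (pitch_sensitivity + 1)
    return alpha * previous + (1 - alpha) * chroma_value

def weight_after(n_frames, pitch_sensitivity):
    """Weight reached after n_frames of a constant chroma value of 1, starting at 0."""
    weight = 0.0
    for _ in range(n_frames):
        weight = update_class_weight(weight, 1.0, pitch_sensitivity)
    return weight

print(math.isclose(update_class_weight(0.2, 1.0, 160), ema(0.2, 1.0, 160)))
print(weight_after(100, 160), weight_after(100, 20))
```

A smaller converted `pitch_sensitivity` gives the newest chroma value more weight, so the class weights (and therefore the image content) follow the notes more quickly.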

In [None]:
print('\nGenerating input vectors \n')

for i in tqdm(range(len(gradm))):   

    # update jitter vector (property for avoiding noise repetitiveness) every 200 frames by setting ~half of noise vector units to lower sensitivity
    if i%200 == 0:
        jitters = new_jitters()

    # get last noise vector
    nv1 = nvlast

    # set noise vector update based on direction, tempo sensitivity, jitter, and combination of overall power and gradient of power
    update = np.array([tempo_sensitivity for k in range(128)]) * (gradm[i]+specm[i]) * update_dir * jitters 
    
    # smooth the update with the previous update (to avoid overly sharp frame transitions)
    update = (update+update_last*3)/4
    
    # set last update
    update_last = update
        
    # update noise vector
    nv2 = nv1 + update

    # append to noise vectors
    noise_vectors.append(nv2)
    
    # set last noise vector
    nvlast = nv2
                   
    # update the direction of noise units
    for ni, n in enumerate(nv2):

      if n >= 2*truncation - tempo_sensitivity:
        update_dir[ni] = -1  
                    
      elif n < -2*truncation + tempo_sensitivity:
        update_dir[ni] = 1 

    # get last class vector
    cv1 = cvlast
    
    # generate new class vector based on chromatic notes and tempo sensitivity
    cv2 = np.zeros(1000)
    for j in range(num_classes):
        
        cv2[classes[j]] = (cvlast[classes[j]] + ((chroma[chromasort[j]][i])/(pitch_sensitivity)))/(1+(1/((pitch_sensitivity))))

    # if more than 6 classes, normalize new class vector between 0 and 1, else simply set max class val to 1
    if num_classes > 6:
        cv2 = normalize_cv(cv2)
    else:
        cv2 = cv2/np.max(cv2)
    
    # adjust depth    
    cv2 = cv2 * depth
    
    # this prevents rare bugs where all classes have the same value
    if np.std(cv2[np.where(cv2!=0)]) < 0.0000001:
        cv2[classes[0]] = cv2[classes[0]] + 0.01

    # append new class vector
    class_vectors.append(cv2)
    
    # set last class vector
    cvlast = cv2


# interpolate between class vectors of bin size [smooth_factor] to smooth frames 
class_vectors = smooth(class_vectors, smooth_factor)


# convert vectors to Tensor
noise_vectors = torch.Tensor(np.array(noise_vectors))      
class_vectors = torch.Tensor(np.array(class_vectors))
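
The `smooth` helper comes from the cloned repository, so its exact implementation is not shown here. As a rough idea of what "interpolating between the means of bins" can look like, here is a simplified, hypothetical `smooth_sketch`; it is an assumption and may differ from the repository's function in its details, for example in how the last bin is handled. It needs at least two full bins of input.

```python
import numpy as np

def smooth_sketch(vectors, bin_size):
    """Average vectors per bin, then interpolate linearly between bin means."""
    vectors = np.asarray(vectors, dtype=float)
    n_bins = len(vectors) // bin_size
    # mean vector of each consecutive bin of bin_size frames
    bin_means = vectors[:n_bins * bin_size].reshape(n_bins, bin_size, -1).mean(axis=1)
    # interpolation weights 0, 1/bin_size, ..., (bin_size-1)/bin_size
    weights = np.linspace(0.0, 1.0, bin_size, endpoint=False)[:, None]
    pieces = [(1 - weights) * a + weights * b
              for a, b in zip(bin_means[:-1], bin_means[1:])]
    return np.concatenate(pieces)

toy = np.arange(8).reshape(8, 1)   # 8 frames of a 1-unit "class vector"
print(smooth_sketch(toy, 2).ravel())
```

Larger bins average over more frames, which removes fast fluctuations at the cost of reacting more slowly to the music.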

## 4. Rendering final video

In this section, we will finally generate the frames and then combine them into one single video.

Before doing that, we first need to set some parameters.

In [None]:
# set batch size, i.e. size of the batch that BigGAN will use to generate the images
batch_size = 30

# set the number of batches to render based on the song duration, the frame length, and the batch size
frame_lim = int(np.floor(len(y)/sr*22050/frame_length/batch_size))

# set output file name (same base name as the song, with an .mp4 extension)
outname = song.split('/')[-1].rsplit('.', 1)[0] + '.mp4'
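
To get a feeling for the numbers involved, the sketch below applies the same formula to a 30-second clip. `video_plan` is a hypothetical helper; the constant 22050 is the sampling rate that the notebook's formula assumes.

```python
import math

def video_plan(n_samples, sample_rate, frame_length, batch_size):
    """Return (number of batches, number of frames, video duration in seconds)."""
    duration_s = n_samples / sample_rate
    n_batches = math.floor(duration_s * 22050 / frame_length / batch_size)
    n_frames = n_batches * batch_size          # only complete batches are rendered
    fps = 22050 / frame_length
    return n_batches, n_frames, n_frames / fps

# 30 seconds of audio at 22050 Hz, with the default settings of this notebook
print(video_plan(n_samples=30 * 22050, sample_rate=22050, frame_length=256, batch_size=30))
```

Because the number of batches is rounded down, the video can end a fraction of a second before the audio clip does.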

After that, for each frame of the video, the model will produce one image. In the end, the frames are all combined by using the ```FFmpeg``` library. We also add the accompanying audio to the final video.

In [None]:
print('\n\nGenerating frames \n')

# send to CUDA if running on GPU
model = model.to(device)
noise_vectors = noise_vectors.to(device)
class_vectors = class_vectors.to(device)

frames = []

for i in tqdm(range(frame_lim)):

    if (i+1)*batch_size > len(class_vectors):
        torch.cuda.empty_cache()
        break
    
    # get batch
    noise_vector = noise_vectors[i*batch_size:(i+1)*batch_size]
    class_vector = class_vectors[i*batch_size:(i+1)*batch_size]

    # TO DO: call the model passing the noise vector, the class vector, and the truncation value in order to generate the images
    with torch.no_grad():
        output = model(###########)

    output_cpu = output.cpu().data.numpy()

    # convert to image array and add to frames
    for out in output_cpu:  
        im = np.array(toimage(out))
        frames.append(im)
        
    #empty cuda caches
    torch.cuda.empty_cache()

#Save video  
aud = mpy.AudioFileClip(song, fps = 44100) 
aud.duration = int(len(y)/sr)

clip = mpy.ImageSequenceClip(frames, fps=22050/frame_length)
clip = clip.set_audio(aud)
clip.write_videofile(outname,audio_codec='aac')

<details>
  <summary> Solution </summary>
  <pre>
output = model(noise_vector, class_vector, truncation)</pre>
</details>
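
The `toimage` helper used above comes from the cloned repository. As an illustration of the kind of conversion involved, the sketch below turns one generator output into an 8-bit RGB array. It assumes the output has shape (3, height, width) with values in [-1, 1]; `to_uint8_image` is hypothetical and the repository's helper may scale or round differently.

```python
import numpy as np

def to_uint8_image(chw_output):
    """Convert a (3, H, W) array in [-1, 1] to an (H, W, 3) uint8 image."""
    scaled = (np.asarray(chw_output, dtype=float) + 1.0) / 2.0 * 255.0
    hwc = np.transpose(scaled, (1, 2, 0))      # channels last, as image libraries expect
    return np.clip(np.rint(hwc), 0, 255).astype(np.uint8)

fake_output = np.full((3, 2, 2), -1.0)   # made-up 2x2 output, all channels at minimum
fake_output[0] = 1.0                     # red channel at maximum
print(to_uint8_image(fake_output)[0, 0])
```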

If you want to check the final result, run the following line of code to visualize the generated mp4 file.

In [None]:
show_video('/content/' + outname)

## 5. MCQ

<b>1. What do the class and noise vectors mean in the BigGAN network?</b>

<div>
  A. <input type="checkbox">
  <label><strong>Class vector</strong>: the 12 pitches of the song
  <br>&nbsp &nbsp &nbsp &nbsp &nbsp <strong>Noise vector</strong>: variations of each song pitch</label>
</div>

<br>
<div>
  B. <input type="checkbox">
  <label><strong>Class vector</strong>: ImageNet groups of images
  <br>&nbsp &nbsp &nbsp &nbsp &nbsp <strong>Noise vector</strong>: randomness of the generated images</label>
</div>

<br>
<div>
  C. <input type="checkbox">
  <label><strong>Class vector</strong>: ImageNet groups of images
  <br>&nbsp &nbsp &nbsp &nbsp &nbsp <strong>Noise vector</strong>: visual features of generated images</label>
</div>

<br>
<div>
  D. <input type="checkbox">
  <label><strong>Class vector</strong>: visual features of generated images
  <br>&nbsp &nbsp &nbsp &nbsp &nbsp <strong>Noise vector</strong>: variations of each song pitch</label>
</div>

<br>
<details>
  <summary> Answer </summary>
C
</details>

<b>2. Which of these options represent parameters of the visualizer that we have implemented? (1 answer or more)</b>

<div>
  A. <input type="checkbox">
  <label>Tempo sensitivity</label>
</div>

<div>
  B. <input type="checkbox">
  <label>Equalization</label>
</div>

<div>
  C. <input type="checkbox">
  <label>Overdrive</label>
</div>

<div>
  D. <input type="checkbox">
  <label>Truncation</label>
</div>

<br>
<details>
  <summary> Answer </summary>
AD
</details>

<b>3. Which statement(s) is (are) true about the content of this notebook?</b>

<div>
  A. <input type="checkbox">
  <label>This audio reactive visualizer always produces the same visual output for the same sound input.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>The algorithm is capable of automatically selecting ImageNet classes based on semantic associations with song lyrics.</label>
</div>

<div>
  C. <input type="checkbox">
  <label>The generated images are created by interpolating the values of the class and noise vectors in the latent space.</label>
</div>

<div>
  D. <input type="checkbox">
  <label>None of the above.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
C
</details>

<b>4. In which of these applications can an audio reactive visualizer be used?</b>

<div>
  A. <input type="checkbox">
  <label>Build a visualizer that responds to live music in real time.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>Interface the class and noise vectors with neural activity to create deep music videos from the brain.</label>
</div>

<div>
  C. <input type="checkbox">
  <label>Meditation techniques that create a calming visual environment, complementing the soundscape and promoting relaxation.</label>
</div>

<div>
  D. <input type="checkbox">
  <label>All of the above.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
D
</details>

## 6. Bonus: Playground (and other cool examples)

Now, it is up to you to generate a deep music visualizer the way you want! As a reminder, here are all the parameters that you can change in this notebook:


*   ```song```
*   ```frame_length```
*   ```pitch_sensitivity```
*   ```tempo_sensitivity```
*   ```num_classes```
*   ```classes```
*   ```truncation```
*   ```depth```
*   ```smooth_factor```


We have also created some cool examples that you can check out by yourself.

In [None]:
show_video('/content/audio-reactive-visuals/creations/bee_thoven.mp4')            # classes = [309]

In [None]:
show_video('/content/audio-reactive-visuals/creations/strawberry_beatles.mp4')    # classes = [949]

In [None]:
show_video('/content/audio-reactive-visuals/creations/french_pride.mp4')          # classes = [975, 978]

In [None]:
show_video('/content/audio-reactive-visuals/creations/acid_rave.mp4')             # classes = [947]

In [None]:
show_video('/content/audio-reactive-visuals/creations/otaku_bird.mp4')            # classes = [13, 14]

Lastly, you can also take a look at what the original authors have created using this algorithm. 

* <a href="https://instagram.com/deep_music_visualizer?igshid=YmMyMTA2M2Y=">@deep_music_visualizer (Matt Siegelman)</a>
* <a href="https://instagram.com/lucidsonicdreams?igshid=YmMyMTA2M2Y=">@lucidsonicdreams (Mikael Alafriz)</a>