# Multimedia Structure Analysis
In this lab, you will practice inferring the structure of audiovisual content. You will work with a **full** video (Big Buck Bunny) and will be asked to find strong episodical changes in the video from shot boundaries and audio novelty points.

By 'strong episodical changes', we mean major turning points in the video's storyline, identified from the video's content.


## What to hand in
As the final deliverable demonstrating your successful completion of this assignment, please submit a file named [studentNumberMember1_studentNumberMember2.pdf] through Brightspace.

This file should contain:
* three detected scenes in the video (formatted as hh:mm:ss - hh:mm:ss) indicating the strongest episodical changes in the video used in this lab. Discuss what parameters and audio features you used to detect them (e.g., threshold choices, choice of audio feature, choice of similarity metric), and briefly discuss what content is displayed within these shots.
* your ideas on what further features could contribute to picking scenes with strong episodical changes in the Big Buck Bunny video.

Further instructions can be found further down this notebook, at the point where we give an example plot.

## Installing dependencies to run on Google Colab!
We will first set up the system and install the dependencies for this lab.

**Note**: This cell needs to be run every time you start a new session of the Colab Notebook.

In [0]:
import os, time


!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
# Redirect stdout first, then stderr, so that both streams are silenced.
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse

# Packages for the lab notebooks
!apt-get -qq -y install mono-complete
!apt-get -qq -y install libsamplerate0 libsamplerate0-dev
!apt-get install ffmpeg

!pip install --upgrade setuptools

### One-time user authentication for accessing google drive folders

 - Open the link that appears in the output cell, then copy and paste the token into the text box here.
 - If only the text box appears (i.e., without authentication links), enter your Google password in the text box.

**Note**: This cell needs to be run every time you start a new session of the Colab Notebook.

In [0]:
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

### Accessing Google Drive files

In this step, we create an empty directory and then **mount** your Google Drive on it.

In [0]:
# Check the current working directory first
!pwd
!ls

In [0]:
# If the gdrive folder does not exist already then perform the following steps, else skip this
!mkdir -p gdrive
!google-drive-ocamlfuse gdrive
# !google-drive-ocamlfuse nonempty gdrive

### Change working directory to the MMSR repository folder

In [0]:
# Check the current working directory
!pwd
!ls

#### If running for the FIRST time, **`CREATE`** a new directory for the MMSR labs.
**NOTE**: **SKIP this cell if the directory is already created earlier.** 

In [0]:
os.mkdir("gdrive/MMSR_lab")

#### Change to the MMSR lab directory

In [0]:
# Change to the created directory
os.chdir("gdrive/MMSR_lab")

# Verify if we are in the right directory
!pwd

### Clone the MMSR repository (https://gitlab.ewi.tudelft.nl/mmc-tudelft/education/cs4065.git)
Alternatively, you can download the MMSR repository on your local machine and then upload the repository folder manually to the `MMSR_lab` folder.

**NOTE 1**: Downloading the git repository from GitLab can sometimes take a long time. If so, we suggest that you download/clone the above repository on your local machine first and then upload the resulting folder to your Google Drive, inside the `MMSR_lab` folder.

**NOTE 2**: Skip this cell if the repository is already cloned or uploaded manually

In [0]:
!git clone https://gitlab.ewi.tudelft.nl/mmc-tudelft/education/cs4065.git

**Important**: Check if the repository exists in the current directory

In [0]:
!ls

In [0]:
# Change directory to the repository
os.chdir("cs4065")
!ls

### Pull the updated git repository 

**NOTE 1**: Updating the git repository can sometimes take a long time. If so, we suggest that you download/clone/pull the above repository on your local machine first and then upload the resulting folder to your Google Drive, inside the `MMSR_lab` folder.

In [0]:
!git pull

### Install python dependencies for this lab exercise

Python requirements for the **Colab Notebook** versions can be found in the *colab_requirements* directory

For this lab, install the requirements listed in `systems-lab2-requirements.txt`.

In [0]:
!pip install -r colab_requirements/systems-lab2-requirements.txt
!pip install bokeh

## Getting started

As usual, we will first import necessary libraries.

In [0]:
import datetime
import numpy as np
import os
import urllib

import cv2
import librosa
import matplotlib.pyplot as plt
# Instead of %matplotlib inline we use %matplotlib notebook this time.
# This allows for more interactive examination of graphs, which will be useful
# as you will manually inspect the results curves.
%matplotlib notebook
from scipy.signal import resample
from scipy.signal import find_peaks_cwt
from scipy.spatial import distance

from IPython.display import Audio
from IPython.display import YouTubeVideo

from cvtools import ipynb_show_cv2_image
from cvtools import VideoReader

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool


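def bokeh_imshow(img, title=None, colormap='Spectral11'):
    """Show a 2D array as an interactive Bokeh image with hover values."""
    hover = HoverTool(tooltips=[
        # ("index", "$index"),
        ("(x,y)", "($x, $y)"),
        ("value", "@image")
    ])
    output_notebook()
    # A 2D array has shape (rows, columns), i.e., (height, width).
    h, w = img.shape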
    p = figure(title=title, tools=[hover], x_range=(0, w), y_range=(0, h))
    p.image([img], x=0, y=0, dw=w, dh=h, palette=colormap)
    show(p)
    
def bokeh_plot(data, title=None, plot_width=800, plot_height=250,
               x_axis_label=None, y_axis_label=None, plot_color='blue'):
    """Plot a 1D sequence as an interactive Bokeh line plot."""
    hover = HoverTool(tooltips=[
        # ("index", "$index"),
        ("(x,y)", "($x, $y)"),
        ("value", "@y")
    ])
    output_notebook()
    p = figure(
        title=title, tools=[hover],
        plot_width=plot_width, plot_height=plot_height,
        x_axis_label=x_axis_label, y_axis_label=y_axis_label,
    )    
    p.line(np.arange(len(data)), data, line_color=plot_color)
    show(p)
    
def bokeh_plot_n_peak(plot_data, peak_data_x, peak_data_y,
                      title=None, plot_width=800, plot_height=250,
                      x_axis_label=None, y_axis_label=None, plot_color='blue'):
    """Plot a 1D sequence as a line and mark the given peaks with red circles."""
    hover = HoverTool(tooltips=[
        # ("index", "$index"),
        ("(x,y)", "($x, $y)"),
        ("value", "@y")
    ])
    output_notebook()
    p = figure(
        title=title, tools=[hover],
        plot_width=plot_width, plot_height=plot_height,
        x_axis_label=x_axis_label, y_axis_label=y_axis_label
    )
    p.line(np.arange(len(plot_data)), plot_data, line_color=plot_color)
    p.circle(peak_data_x, peak_data_y, fill_color=None, line_color='red')
    show(p)

You will be analyzing the Blender video 'Big Buck Bunny'. Let's first check out the full (audiovisual) video, which is available on YouTube, below:

In [0]:
YouTubeVideo("YE7VzlLtp-4")

As you may notice, the video contains several semantic episodes, and within each episode several highlights or surprising events occur. In this lab, we will investigate to what extent structural segmentation and highlight detection can be performed based on audiovisual analysis.

We will initially consider the visual and audio domain separately in the analysis. For convenience, we already made a separate video and audio track available to you.

In [0]:
# DATA_PATH = '/home/student/data/cs4065/mm_structure_analysis'
DATA_PATH = os.path.join(os.getcwd(), "data/mm_structure_analysis")

VIDEO_URL = 'https://www.dropbox.com/s/g8ta0t47hzz40u0/BigBuckBunny_video.mp4?dl=1'
VIDEO_PATH = os.path.join(DATA_PATH, 'BigBuckBunny_video.mp4')

AUDIO_URL = 'https://www.dropbox.com/s/doqcxtojqigo4s0/BigBuckBunny_audio.aac?dl=1'
AUDIO_PATH = os.path.join(DATA_PATH, 'BigBuckBunny_audio.aac')

def fetch_data(url, filepath):
  if os.path.exists(filepath):
    print '<%s> already available' %  url
    return filepath
  try:
    os.makedirs(os.path.dirname(filepath))
  except OSError:
    # The directory already exists.
    pass
  print 'fetching <%s>...' %  url,
  (filepath, _) = urllib.urlretrieve(url, filepath)
  print 'done.'
  return filepath


_ = fetch_data(VIDEO_URL, VIDEO_PATH)
_ = fetch_data(AUDIO_URL, AUDIO_PATH)

## Video analysis
You will have to find shot/scene boundaries in the given video. To simplify our implementation, in this lab, we will extract a feature vector for each frame, and then compute a **full** self-similarity matrix.

Note that this approach is convenient for visualization, but not efficient. If you ever need a more efficient solution, you can iterate over the frames while keeping only a buffer of past frames (or of features derived from them), so that not all frames have to be held in memory. Self-similarity can then be computed efficiently from a circular buffer of feature vectors. If you are only interested in novelty detection, it may be enough to restrict the analysis to a band close to the diagonal of the self-similarity matrix. For an implementation of this idea, see the documentation of <code>mirsimatrix()</code> in https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox/MIRtoolbox1.6.1guide (from page 163 on), in particular the "Width" option, which restricts the computation to a band around the diagonal.
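The band restriction mentioned above can be sketched as follows. This is a minimal illustration on toy data, not the MIRtoolbox implementation: the function name `banded_self_similarity`, the `bandwidth` parameter, and the cosine similarity used here are assumptions made for the example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def banded_self_similarity(features, similarity, bandwidth):
    """Self-similarity for pairs with |i - j| <= bandwidth; other entries stay 0."""
    n = len(features)
    result = np.zeros((n, n))
    for i in range(n):
        # Compare only against the next `bandwidth` vectors; fill by symmetry.
        for j in range(i, min(n, i + bandwidth + 1)):
            value = similarity(features[i], features[j])
            result[i, j] = value
            result[j, i] = value
    return result


# Toy stream: six 4-dimensional feature vectors forming two blocks of three.
toy = np.array([[0.7, 0.1, 0.1, 0.1]] * 3 + [[0.1, 0.1, 0.1, 0.7]] * 3)
band = banded_self_similarity(toy, cosine_similarity, bandwidth=2)
print(np.round(band, 2))
```

With `bandwidth=2`, only pairs at most two steps apart are compared, so the cost grows linearly with the stream length instead of quadratically. The band must be at least as wide as the novelty kernel you later slide along the diagonal.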

In any case, as said before, in the current lab, we will just consider a full analysis of our video.

### Reading a video
Below, we show how to read a video.

In [0]:
# Let's open the video and read its properties.
video_reader = VideoReader()
video_reader.open(VIDEO_PATH)

# Video duration (in seconds).
video_duration = float(
    video_reader.get_number_of_frames()) / float(video_reader.get_frame_rate())

print 'resolution: %d x %d' % (video_reader.get_width(), video_reader.get_height())
print 'number of frames: %d' % video_reader.get_number_of_frames()
print 'duration: %s' % datetime.timedelta(seconds=video_duration)

*The <code>VideoReader</code> class is very simple and has no <code>seek()</code> method. Therefore, the next frame read by <code>VideoReader</code> is always the next unread frame in the video. If you want to read the same video multiple times, you need to re-instantiate <code>video_reader</code> and re-open the video.*

We first examine a number of frames in the first 10 seconds of the video (feel free to experiment with the parameters).

In [0]:
# Re-open the video to now process it from the start.
video_reader = VideoReader()
video_reader.open(VIDEO_PATH)

# Read a number of video frames (say, the first 10 seconds).
video_fps = video_reader.get_frame_rate()
sampling_period = int(3.0 * video_fps)  # One every 3 seconds.
stop_at = int(10.0 * video_fps)  # For the first 10 seconds.

%matplotlib inline

for frame in video_reader.get_frames():
  index = video_reader.get_current_frame_index()
  if 0 != index % sampling_period:
    continue
  ipynb_show_cv2_image(frame, 'frame %d' % index)
  if index > stop_at:
    break

### Extract frame features
We will now extract HS color histograms for each frame in the video and downsample the resulting feature matrix by retaining 1 histogram per second.

### Question
Complete the <code>extract_frame_hs_histogram()</code> function below, so it will extract HS color histograms. Use 8 bins for each channel when computing the histogram; the ranges for the H and S channels are 0-180 and 0-256 respectively.

*If you are lost, check back on what you did in the second lab assignment (week 2) of this course.*

NB: when computing your histogram, verify its dimensions with <code>np.shape()</code>. If the histogram is originally stored as a matrix, call its <code>flatten()</code> method to make sure that you return a vector (and not a matrix).
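To make the binning and flattening steps concrete, here is a small sketch that uses plain NumPy on synthetic data. It is not the intended OpenCV solution: the `hue` and `saturation` arrays are made-up stand-ins for the H and S channels of a converted frame, and `np.histogram2d` is used only to illustrate how 8 x 8 bins over the stated ranges turn into a 64-element vector.

```python
import numpy as np

# Synthetic stand-ins for the H and S channels of a 4x4 frame (assumption:
# in the lab these would come from an OpenCV color conversion instead).
hue = np.array([[0, 10, 100, 179]] * 4, dtype=np.uint8)
saturation = np.array([[0, 31, 130, 255]] * 4, dtype=np.uint8)

# 8 bins per channel over the ranges stated above: H in 0-180, S in 0-256.
counts, _, _ = np.histogram2d(
    hue.ravel().astype(np.float64),
    saturation.ravel().astype(np.float64),
    bins=(8, 8), range=((0, 180), (0, 256)))

print(np.shape(counts))   # the 2D histogram is a matrix
flat = counts.flatten()   # turn it into a vector
normalized = flat / np.sum(flat)
print(np.shape(flat))
```

The normalized vector sums to 1, which is what makes histograms from frames of any resolution comparable.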

In [0]:
def extract_frame_hs_histogram(frame):
  # TODO: replace empty list by actual histogram.
  histogram = []
  return histogram / np.sum(histogram)  # Return a normalized histogram.


# Find the size of the hs histograms computed by extract_frame_hs_histogram().
# We do this by passing an 8x8 black patch.
histogram_size = len(extract_frame_hs_histogram(np.zeros((8, 8, 3), dtype=np.uint8)))
print 'HS histogram size: %d' % histogram_size

You now have to extract the matrix of HS histograms.

In [0]:
# Re-open the video to now process it from the start.
video_reader = VideoReader()
video_reader.open(VIDEO_PATH)

# Initialize the matrix (one row for each frame).
hs_histogram_matrix = np.zeros((
    video_reader.get_number_of_frames(), histogram_size))

# Compute features per frame. Note this will take some time.
for frame in video_reader.get_frames():
  index = video_reader.get_current_frame_index()
  hs_histogram_matrix[index, :] = extract_frame_hs_histogram(frame)

We now have a histogram per frame. We will resample the data to keep one value per second.

In [0]:
number_of_seconds = int(np.round(video_duration))
hs_histogram_matrix_resampled = np.array(resample(hs_histogram_matrix, number_of_seconds)).astype(np.float32)

print 'HS histogram matrix size: %d x %d' % np.shape(hs_histogram_matrix_resampled)

# Remove the full size HS histogram matrix (we don't need it anymore).
del hs_histogram_matrix

### Self-similarity analysis
To understand the structure of the video, we will proceed by conducting a self-similarity analysis. For any type of stream, the self-similarity matrix is computed using two "ingredients": (i) a matrix of feature vectors representing the stream and (ii) a similarity function to compare feature vector pairs.

### Question
Complete <code>hs_histograms_similarity()</code> (HS histograms similarity function) using the intersection area similarity metric. When passing numpy arrays to OpenCV functions, cast them to 32-bit floats using <code>my_array.astype(np.float32)</code>.
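As a reference for what the intersection area measures, the sketch below computes it by hand with NumPy on three toy histograms. The helper name `intersection_area` is hypothetical. In the lab you would rather rely on OpenCV's histogram comparison (<code>cv2.compareHist</code> with its intersection method).

```python
import numpy as np


def intersection_area(p, q):
    """Sum of bin-wise minima of two histograms of equal length."""
    return float(np.sum(np.minimum(p, q)))


a = np.array([0.5, 0.5, 0.0, 0.0])
b = np.array([0.0, 0.5, 0.5, 0.0])
c = np.array([0.0, 0.0, 0.0, 1.0])

print(intersection_area(a, a))  # identical histograms
print(intersection_area(a, b))  # partial overlap
print(intersection_area(a, c))  # no shared bins
```

For normalized histograms the value lies between 0 (no overlap) and 1 (identical), so it can be used directly as a similarity.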

In [0]:
def hs_histograms_similarity(hs_histogram0, hs_histogram1):
  # TODO: implement.
  return 0.0


# Example.
print '100 vs 101: %.6f' % hs_histograms_similarity(
    hs_histogram_matrix_resampled[100, :],
    hs_histogram_matrix_resampled[101, :])

print '100 vs 105: %.6f' % hs_histograms_similarity(
    hs_histogram_matrix_resampled[100, :],
    hs_histogram_matrix_resampled[105, :])

We now will define a function called <code>compute_self_similarity</code> that computes a square self-similarity matrix, given a matrix of feature vectors that represents a stream (video, audio) over time.

The second parameter, named <code>similarity_function</code>, is a function handle that uses your <code>hs_histograms_similarity</code> function by default. You can, however, use any other similarity function of your choice. (If you do, keep in mind that we want to measure *similarity*, not distance.)

### Question
Complete the <code>compute_self_similarity()</code> function below.

Some questions to think of when doing this:
- the result should be a matrix. What should be the dimensions of this matrix?
- what does each element in this matrix represent? Here, your similarity function will have to be used.

In [0]:
def compute_self_similarity(feature_vector_matrix, similarity_function = hs_histograms_similarity):
  # TODO: implement. 
  # For calling the similarity function, you can just use similarity_function(first_vector, second_vector).
  return []

Let's now build the self-similarity matrix.

In [0]:
hs_histograms_self_similarity = compute_self_similarity(
    hs_histogram_matrix_resampled, hs_histograms_similarity)
print 'HS histograms self-similarity matrix size: %d x %d' % np.shape(hs_histograms_self_similarity)

Subsequently, we will visualize the self-similarity matrix as an image.

In [0]:
bokeh_imshow(hs_histograms_self_similarity)

### Novelty curve
The self-similarity matrix shows blocks around the diagonal that correspond to regions of high self-similarity. As discussed in the lecture, detecting transitions between these blocks can point you to *novelty* points, indicating that a new coherent episode is starting. Therefore, we will now focus on extracting a *novelty curve*.

To this end, you have to create a 2D square *kernel* matrix of arbitrary size. It will be used to slide along the diagonal of the self-similarity matrix to compute the correlation with the block it lies on. Sliding along the diagonal corresponds to moving along the timeline of the analyzed stream.

We first show the *checkerboard kernel*; then, you will have to compute a smoother version of it called *Gaussian checkerboard kernel* (by completing <code>compute_gaussian_checkerboard_kernel()</code>).

In [0]:
def compute_checkerboard_kernel(kernel_size = 10):
  # This is the size on a side of the kernel.
  kernel_side = int(np.ceil(kernel_size / 2.0))

  # Initialize.
  kernel = np.ones((kernel_size, kernel_size))
  
  # Set the top-right and bottom-left blocks to -1.
  kernel[0:kernel_side, kernel_side:] = -1.0
  kernel[kernel_side:, 0:kernel_side] = -1.0
  
  return kernel


%matplotlib inline

fig = plt.figure('Checkerboard kernel')
cax = plt.imshow(
    compute_checkerboard_kernel(), interpolation='nearest', cmap='gray')
fig.colorbar(cax)

The Gaussian checkerboard kernel is a smoother version of the checkerboard kernel displayed above. Similarity values around its center are weighted more than those at the edges. Using this instead of the checkerboard kernel leads to a smoother novelty curve.

Use the following definition of a 2D Gaussian (centered in the middle of the kernel) to complete <code>compute_gaussian_checkerboard_kernel()</code>:

$$
f(x, y) = \exp\left(-4 \cdot \left[\frac{(x - \mu)^2}{\mu^2} + \frac{(y - \mu)^2}{\mu^2}\right]\right)
$$

where $\mu$ is equal to <code>kernel_side</code> (see its value in <code>compute_checkerboard_kernel()</code>).

*Tip: re-use <code>compute_checkerboard_kernel()</code>.*
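To see what the formula does, the following sketch evaluates it on a small 6 x 6 grid and applies it to a checkerboard sign pattern. This is only an illustration under the definitions above. The variable names are made up, and the sign pattern is rebuilt inline rather than taken from the notebook's function.

```python
import numpy as np

kernel_size = 6
mu = int(np.ceil(kernel_size / 2.0))  # same value as kernel_side above

# Grid of (x, y) indices covering the kernel.
y, x = np.mgrid[0:kernel_size, 0:kernel_size]
gaussian = np.exp(-4.0 * (((x - mu) ** 2) / float(mu ** 2)
                          + ((y - mu) ** 2) / float(mu ** 2)))

# Checkerboard signs: +1 on the diagonal blocks, -1 on the off-diagonal blocks.
signs = np.ones((kernel_size, kernel_size))
signs[:mu, mu:] = -1.0
signs[mu:, :mu] = -1.0

tapered = signs * gaussian
print(np.round(tapered, 3))
```

The weight equals 1 at index $(\mu, \mu)$ and decays to $e^{-8}$ at index $(0, 0)$. For even kernel sizes, index $\mu$ lies half a cell past the geometric center of the grid, so the taper is slightly asymmetric.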

### Question
Complete <code>compute_gaussian_checkerboard_kernel()</code>.

In [0]:
def compute_gaussian_checkerboard_kernel(kernel_size = 10):
  # TODO: implement.
  return []


%matplotlib inline

fig = plt.figure('Gaussian checkerboard kernel')
cax = plt.imshow(
    compute_gaussian_checkerboard_kernel(64), interpolation='nearest', cmap='gray')
fig.colorbar(cax)

As anticipated above, the last step is computing the correlation of the kernel with self-similarity blocks extracted along the diagonal. Note that the kernel always is centered around the time points it is targeting for calculation; as a consequence, its scope will slightly 'fall outside' the self-similarity matrix boundaries when targeting the very first and very last seconds of the video.

To deal with this, we will apply zero padding to the self-similarity matrix, adding extra zeros around the matrix to handle the boundary cases at the start and end of the analysis.
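The padding and sliding steps can be tried out on a toy matrix first. The sketch below is an assumption-laden illustration rather than the lab's skeleton: it pads with `np.pad` instead of a preallocated zero matrix, and it uses a plain (non-Gaussian) 4 x 4 checkerboard kernel on an 8-second stream with one boundary.

```python
import numpy as np

# Toy self-similarity matrix: two homogeneous blocks of 4 seconds each.
S = np.zeros((8, 8))
S[:4, :4] = 1.0
S[4:, 4:] = 1.0

# Plain 4x4 checkerboard kernel (+1 on diagonal blocks, -1 off-diagonal).
half = 2
kernel = np.ones((2 * half, 2 * half))
kernel[:half, half:] = -1.0
kernel[half:, :half] = -1.0

# Zero-pad by half a kernel on every side, so the kernel can also be
# centered on the first and last time points.
padded = np.pad(S, half, mode='constant', constant_values=0.0)

# Slide the kernel along the diagonal and correlate it with each sub-block.
novelty = np.array([
    np.sum(padded[t:t + 2 * half, t:t + 2 * half] * kernel)
    for t in range(S.shape[0])
])
print(novelty)
```

The curve peaks at index 4, where the second block starts. It is also non-zero at the very first time point: this comes from the zero padding, which looks like a change to the kernel, so treat peaks at the edges with caution.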

### Question
Complete the function <code>compute_novelty()</code>.

In [0]:
def compute_novelty(self_similarity_matrix, kernel):
  diag_length = np.shape(self_similarity_matrix)[0]
  kernel_size = np.shape(kernel)[0]
  kernel_size_half = int(np.ceil(kernel_size / 2.0))
  
  # Pad the self-similarity matrix.
  padded_size = 0  # TODO: the padding size is not zero. Replace this with the correct value

  padded_self_sim_matrix = np.zeros((padded_size, padded_size))
  stop = - kernel_size_half + int(1 == kernel_size % 2)
  padded_self_sim_matrix[
      kernel_size_half:stop,
      kernel_size_half:stop] = self_similarity_matrix
  
  # Compute novelty.
  novelty = np.zeros(diag_length)
  for x in range(diag_length):
    stop = 0  # TODO: compute the first excluded index in the interval x:stop (see below).
    sub_block = padded_self_sim_matrix[x:stop, x:stop]
    novelty[x] = np.sum(sub_block * kernel)  # Correlation between the kernel and the subblock.
    
  return novelty

In [0]:
video_novelty = compute_novelty(
    hs_histograms_self_similarity, compute_gaussian_checkerboard_kernel(32))

bokeh_plot(video_novelty, title='Video Novelty')

## Audio analysis
So far, we considered the video channel. We will now compute a novelty curve for the audio stream in similar fashion.

In general, long audio signals can be analyzed efficiently with the sliding window technique: as shown in the picture below, the signal is split into (overlapping) frames of fixed size (which, for efficiency, is a power of 2) and each frame is analyzed independently. The output is a list of feature vectors (one per frame).

<p style="text-align: center;"><a href="https://commons.wikimedia.org/wiki/File:Depiction_of_overlap-add_algorithm.png#/media/File:Depiction_of_overlap-add_algorithm.png"><img src="https://upload.wikimedia.org/wikipedia/commons/7/77/Depiction_of_overlap-add_algorithm.png" style="width: 640px" alt="Depiction of overlap-add algorithm.png"></a><br>By en:Bob K (modifications), <a href="//en.wikipedia.org/wiki/en:User:Paolostar" class="extiw" title="w:en:User:Paolostar">User:Paolostar</a>, Paolo Serena (original, released for free use) - en wikipedia, derived from <a href="//en.wikipedia.org/wiki/en:File:Oa_idea.jpg" class="extiw" title="w:en:File:Oa idea.jpg">w:en:File:Oa idea.jpg</a> by <a href="//en.wikipedia.org/wiki/en:User:Paolostar" class="extiw" title="w:en:User:Paolostar">User:Paolostar</a>, Paolo Serena, University of Parma (Italy), Public Domain, https://commons.wikimedia.org/w/index.php?curid=5015398</p>
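The framing step itself can be sketched in a few lines of NumPy. The helper names `next_power_of_two` and `split_into_frames` and the tiny sample rate are assumptions chosen for readability. Library routines may pad or center the signal, so their frame counts can differ from this sketch.

```python
import numpy as np


def next_power_of_two(n):
    """Smallest power of two that is >= n (n is a positive integer)."""
    return 1 << (n - 1).bit_length()


def split_into_frames(signal, frame_size, hop_size):
    """Return a (number_of_frames, frame_size) array of overlapping frames."""
    # Trailing samples that do not fill a whole frame are dropped.
    number_of_frames = 1 + (len(signal) - frame_size) // hop_size
    return np.array([
        signal[i * hop_size:i * hop_size + frame_size]
        for i in range(number_of_frames)
    ])


toy_rate = 64  # hypothetical sample rate, chosen small for readability
toy_signal = np.arange(4 * toy_rate, dtype=np.float64)  # 4 "seconds" of samples

frame_size = next_power_of_two(toy_rate // 4)  # about a quarter of a second
hop_size = frame_size // 2                     # 50% overlap
frames = split_into_frames(toy_signal, frame_size, hop_size)
print(frames.shape)
```

With a hop of half a frame, every sample (except near the ends) falls into two frames, which is why the feature matrix has roughly twice as many rows as there are non-overlapping frames.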

### Reading an audio file
Let's read the audio file and print some of its properties.

In [0]:
audio_signal, sample_rate = librosa.core.load(AUDIO_PATH)

In [0]:
# Audio duration (in seconds).
number_of_audio_samples = len(audio_signal)
audio_duration = float(
    number_of_audio_samples / float(sample_rate))

print 'sample rate: %d' % sample_rate
print 'number of samples: %d' % number_of_audio_samples
print 'duration: %s' % datetime.timedelta(seconds=audio_duration)

In [0]:
# When defining the frame size, we want a frame with a length which is a power of 2
# (in order for Fast Fourier Transform techniques to be applicable).
# Let's add a function to find the smallest next power of 2 for a given number.

def next_pow_2(x):
    """Smallest next power of two of a given value x."""
    return 1 << (x - 1).bit_length()

In [0]:
# Sliding window analysis parameters.
audio_frame_size = next_pow_2(int(sample_rate / 4.0))  # i.e., about 0.25 seconds.
audio_hop_size = int(audio_frame_size / 2.0)  # i.e., 50% overlap.
print ' - sliding window analysis'
print '   frame size: %d' % audio_frame_size
print '   hop size: %d' % audio_hop_size

### Self-similarity analysis
We first compute the self-similarity matrix for an audio feature called Chroma (which was discussed in the lectures). Later on, you will have to do the same using MFCCs.

In [0]:
# Extract Chroma vectors.
chroma_matrix = librosa.feature.chroma_stft(
    audio_signal, n_fft=audio_frame_size, hop_length=audio_hop_size)

Reduce the data to process by resampling the feature matrix. We will keep one value per second.

In [0]:
number_of_seconds = int(np.round(audio_duration))
chroma_matrix_resampled = np.array(resample(chroma_matrix.transpose(), number_of_seconds))

print 'Chroma feature matrix size: %d x %d' % np.shape(chroma_matrix_resampled)

# Remove the full size Chroma matrix (we don't need it anymore).
del chroma_matrix

### Question
Compute and plot the Chroma self-similarity matrix using the cosine distance. To do that, replace the empty matrix below by the actual self-similarity matrix. Feel free to create a helper function for this. Also note that you could use pdist() (as you did in the very first lab) as an intermediate step, but then you would still have to convert dissimilarity into similarity. 
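The conversion from pairwise distances to a similarity matrix can be checked on a toy feature matrix before applying it to the Chroma data. This sketch assumes the subtraction `1 - distance` as the conversion, which is one common choice among several.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy feature matrix: one row per second, one column per feature dimension.
features = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different magnitude
    [0.0, 1.0, 0.0],   # orthogonal to rows 0 and 1
])

# pdist returns the condensed pairwise cosine *distances*;
# squareform expands them into a symmetric matrix with a zero diagonal.
cosine_distance = squareform(pdist(features, metric='cosine'))
cosine_similarity = 1.0 - cosine_distance
print(np.round(cosine_similarity, 3))
```

Rows 0 and 1 come out as fully similar because the cosine measure ignores magnitude. Rows that contain only zeros have no direction and can produce undefined values, so it is worth checking the feature matrix for them first.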

In [0]:
chroma_self_similarity = []  # TODO: replace this empty matrix by an actual self-similarity matrix.

print 'Chroma self-similarity matrix size: %d x %d' % np.shape(chroma_self_similarity)

In [0]:
# inspect the results visually
bokeh_imshow(chroma_self_similarity)

### Question
Extract the MFCCs matrix for the audio recording using <code>librosa.feature.mfcc()</code>. Remember to subsample the full matrix as done for Chroma. 

In [0]:
# Extract MFCCs vectors.
mfccs_matrix = []  # TODO: complete.

# Subsample.
mfccs_matrix_resampled = []  # TODO: complete.

print 'MFCCs feature matrix size: %d x %d' % np.shape(mfccs_matrix_resampled)

# Remove the full size MFCCs matrix (we don't need it anymore).
del mfccs_matrix

### Question
Compute the MFCC self-similarity matrix.

In [0]:
mfccs_self_similarity = []  # TODO: complete.
print 'MFCCs self-similarity matrix size: %d x %d' % np.shape(mfccs_self_similarity)

In [0]:
bokeh_imshow(mfccs_self_similarity)

### Question
Choose to use either <code>chroma_self_similarity</code> or <code>mfccs_self_similarity</code> to extract novelty and motivate your choice.

In [0]:
# TODO: make your choice by uncommenting one of the following two lines.
# audio_self_similarity = mfccs_self_similarity
# audio_self_similarity = chroma_self_similarity

### Novelty curve
Use the chosen self-similarity matrix to derive a novelty curve for the audio stream.

### Question
Extract the novelty curve. Try different kernel sizes.

In [0]:
audio_novelty = []  # TODO: complete.

bokeh_plot(audio_novelty, title='Audio Novelty')

# Finding scenes via analysis of the novelty curves.
Above, you have extracted two novelty curves (namely, <code>audio_novelty</code> and <code>video_novelty</code>). The peaks in those signals can be interpreted as *boundaries* between shots/scenes. Depending on the parameters you chose above, you will find the strongest peaks at different locations.

The two novelty signals may be complementary: that is, they may reflect different types of episodical changes. As a final step, you will therefore combine the two novelty signals, extract the peaks of the combined signal, and spot the strongest ones by visual inspection of the plots.

First, we create a scaling function for the novelty vectors. We do this as a normalization step, to ensure that both vectors are in a similar range.

In [0]:
def scale_vector(v):
  return (v - np.min(v)) / (np.max(v) - np.min(v))

Then, we create a function that detects peaks in a vector. 

In [0]:
def detect_peaks(data, threshold = None):
  # Use scipy.signal.find_peaks_cwt to detect peaks.
  peaks_positions = np.array(
      find_peaks_cwt(data, np.arange(1, 10)))

  if 0 == len(peaks_positions):
    return []

  if threshold is not None:
    # Filter peaks by thresholding.
    mask = data[peaks_positions] > threshold
    peaks_positions = peaks_positions[mask]
  
  return peaks_positions

In [0]:
video_novelty_normalized = scale_vector(video_novelty)
video_boundaries = detect_peaks(video_novelty_normalized)
print 'number of video boundaries: %d' % len(video_boundaries)

audio_novelty_normalized = scale_vector(audio_novelty)
audio_boundaries = detect_peaks(audio_novelty_normalized)
print 'number of audio boundaries: %d' % len(audio_boundaries)

bokeh_plot_n_peak(video_novelty_normalized, video_boundaries, 
                  video_novelty_normalized[video_boundaries],
                  title=None, plot_width=800, plot_height=200,
                  x_axis_label='time (s)', y_axis_label='Video Novelty', 
                  plot_color='blue')
bokeh_plot_n_peak(audio_novelty_normalized, audio_boundaries, 
                  audio_novelty_normalized[audio_boundaries],
                  title=None, plot_width=800, plot_height=200,
                  x_axis_label='time (s)', y_axis_label='Audio Novelty', 
                  plot_color='green')

Let's combine the two novelty curves. Below, you find two possible ways. Feel free to test different options.

In [0]:
# Soft-and merging. If novelty is high in both signals, the combined result will be high.
# However, if it is high in one signal but low in the other, the combined result will be penalized.
soft_and_novelty = video_novelty_normalized * audio_novelty_normalized
soft_and_boundaries = detect_peaks(soft_and_novelty)
print 'number of soft-and boundaries: %d' % len(soft_and_boundaries)

# Soft-or merging. If novelty is high in at least one of the signals, the combined result will be high.
soft_or_novelty = video_novelty_normalized + audio_novelty_normalized
soft_or_boundaries = detect_peaks(soft_or_novelty)
print 'number of soft-or boundaries: %d' % len(soft_or_boundaries)

bokeh_plot_n_peak(soft_and_novelty, soft_and_boundaries, 
                  soft_and_novelty[soft_and_boundaries],
                  title=None, plot_width=800, plot_height=200,
                  x_axis_label='time (s)', y_axis_label='Soft-and Novelty',
                  plot_color='blue')
bokeh_plot_n_peak(soft_or_novelty, soft_or_boundaries, 
                  soft_or_novelty[soft_or_boundaries],
                  title=None, plot_width=800, plot_height=200,
                  x_axis_label='time (s)', y_axis_label='Soft-or Novelty',
                  plot_color='green')
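The difference between the two merging strategies is easiest to see on a pair of hand-made curves. The values below are invented for illustration: index 1 is a strong peak in the video curve only, while index 3 is a peak in both curves.

```python
import numpy as np

video = np.array([0.0, 1.0, 0.0, 0.9, 0.0])
audio = np.array([0.0, 0.1, 0.0, 0.8, 0.0])

soft_and = video * audio  # high only where both curves are high
soft_or = video + audio   # high where at least one curve is high

print(soft_and)
print(soft_or)
```

In the product, the video-only peak at index 1 is almost suppressed, whereas in the sum it remains prominent. Note that the sum of two normalized curves ranges from 0 to 2, so a threshold tuned for one merged curve does not carry over to the other.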

# What to hand in

Based on the analysis so far, you can conduct different types of multimodal structural analyses on the Big Buck Bunny video. As you may have noticed, the default analysis conducted so far may not yet be optimal.

Experiment with parameter settings, different analysis window sizes, audio feature types, and similarity metrics to see whether you can detect the strongest episodical changes in the video, based on video and audio content, better than was done so far in the lab.

As an alternative (or complement) to novelty point analysis from a self-similarity matrix, you may also want to look at feature statistics *within* episode blocks over time. For example, you could investigate audio loudness, which you can calculate using <code>librosa.feature.rmse</code> (named <code>librosa.feature.rms</code> in more recent librosa releases), or consider how shot boundary density evolves over the timeline of the video.
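Both statistics can be prototyped with NumPy alone before moving to the real data. The sketch below is a simplified stand-in, not the librosa computation: `rms_per_window` and `boundary_density` are hypothetical helpers, and the signal and boundary times are toy values.

```python
import numpy as np


def rms_per_window(signal, window_size):
    """Root-mean-square energy of consecutive, non-overlapping windows."""
    number_of_windows = len(signal) // window_size
    trimmed = signal[:number_of_windows * window_size]
    windows = trimmed.reshape(number_of_windows, window_size)
    return np.sqrt(np.mean(windows ** 2, axis=1))


def boundary_density(boundaries, duration, window):
    """Number of detected boundaries in each consecutive window (in seconds)."""
    edges = np.arange(0, duration + window, window)
    counts, _ = np.histogram(boundaries, bins=edges)
    return counts


# Toy signal: one quiet second followed by one loud second at 100 samples/s.
toy_signal = np.concatenate([0.1 * np.ones(100), 0.8 * np.ones(100)])
loudness = rms_per_window(toy_signal, 100)

# Hypothetical boundary times (in seconds) in a 30-second clip.
toy_boundaries = np.array([2, 4, 5, 7, 9, 25])
density = boundary_density(toy_boundaries, duration=30, window=10)

print(loudness)
print(density)
```

A sudden rise in loudness or a burst of closely spaced shot boundaries can hint at an action-heavy episode, even when the self-similarity blocks themselves look alike.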

To demonstrate that you completed this assignment, please upload a file [studentNumberMember1_studentNumberMember2.pdf] to Brightspace in which you include the following:

- three detected *scene* boundaries in the video (formatted as hh:mm:ss) indicating the strongest episodical changes in the video used in this lab. Discuss what parameters and audio features you used to detect them (e.g., threshold choices, choice of audio feature, choice of similarity metric), and briefly discuss what content is displayed within the shots across these boundaries.
- your ideas on what further features could contribute to picking scene boundaries with strong episodical changes in the Big Buck Bunny video.
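Since the novelty curves in this lab hold roughly one value per second, a peak index can be read as a time in seconds. A small helper such as the one below (the name `seconds_to_hhmmss` and the example peak positions are made up) may be handy for reporting boundaries in the requested format.

```python
def seconds_to_hhmmss(seconds):
    """Format a non-negative number of seconds as hh:mm:ss."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return '%02d:%02d:%02d' % (hours, minutes, secs)


# Hypothetical peak positions (indices into a one-value-per-second curve).
example_peaks = [95, 372, 3601]
print([seconds_to_hhmmss(p) for p in example_peaks])
```

Because the resampling rounds the duration to whole seconds, the resulting timestamps are approximate, so check them against the video before reporting.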