In [1]:
import cv2
import numpy as np
import random
import tensorly as tl
from tensorly.decomposition import tucker

Using numpy backend.


# Data Ingestion

We read all of the files and get the number of frames in each one. When reading them as tensors, we truncate every video to the smallest frame count. I tried to record videos of the same length (~11 s), but small discrepancies are bound to exist. For this application, truncation should not matter much.

In [2]:
# Create VideoCapture objects
parking_lot = cv2.VideoCapture('parking_lot.MOV')
patio = cv2.VideoCapture('patio.MOV')
commute = cv2.VideoCapture('commute.MOV')

# Get number of frames in each video
parking_lot_frames = int(parking_lot.get(cv2.CAP_PROP_FRAME_COUNT))
patio_frames = int(patio.get(cv2.CAP_PROP_FRAME_COUNT))
commute_frames = int(commute.get(cv2.CAP_PROP_FRAME_COUNT))

parking_lot_frames, patio_frames, commute_frames

(321, 328, 314)
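
The truncation length can be computed from these counts instead of being hard-coded. A minimal sketch using the frame counts printed above; the names `frame_counts`, `max_frames`, and `shortest_video` are illustrative and do not appear elsewhere in this notebook.

```python
# Frame counts as reported by CAP_PROP_FRAME_COUNT above
frame_counts = {"parking_lot": 321, "patio": 328, "commute": 314}

# Truncate every video to the length of the shortest one
max_frames = min(frame_counts.values())
shortest_video = min(frame_counts, key=frame_counts.get)

print(shortest_video, max_frames)
```

Passing `max_frames` to the reading step would then keep working if a video were re-recorded with a different length.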

In [3]:
# Get dimensions of each frame
parking_lot_height = int(parking_lot.get(cv2.CAP_PROP_FRAME_HEIGHT))
parking_lot_width = int(parking_lot.get(cv2.CAP_PROP_FRAME_WIDTH))
patio_height = int(patio.get(cv2.CAP_PROP_FRAME_HEIGHT))
patio_width = int(patio.get(cv2.CAP_PROP_FRAME_WIDTH))
commute_height = int(commute.get(cv2.CAP_PROP_FRAME_HEIGHT))
commute_width = int(commute.get(cv2.CAP_PROP_FRAME_WIDTH))

print(parking_lot_height, parking_lot_width)
print(patio_height, patio_width)
print(commute_height, commute_width)

1080 1920
1080 1920
1080 1920


Based on the frame counts and the frame dimensions, each video needs a 4D tensor (314x1080x1920x3):
- 314 for the frames (we drop the extra frames of the patio and parking lot videos)
- 1080x1920 for the height and width of each frame
- 3 for the color channels (OpenCV returns them in BGR order, not RGB)
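
To get a feel for the size of such a tensor, the following sketch computes its memory footprint. It assumes the frames are stored as unsigned 8-bit integers when read and as 64-bit floats after conversion; the variable names are illustrative only.

```python
import numpy as np

# Shape of one full video tensor: frames x height x width x channels
shape = (314, 1080, 1920, 3)
n_entries = int(np.prod(shape, dtype=np.int64))

# Bytes needed for 8-bit integers and for double precision
bytes_uint8 = n_entries * np.dtype(np.uint8).itemsize
bytes_float64 = n_entries * np.dtype(np.float64).itemsize

print(f"entries: {n_entries}")
print(f"uint8:   {bytes_uint8 / 1e9:.2f} GB")
print(f"float64: {bytes_float64 / 1e9:.2f} GB")
```

Under these assumptions, one video takes roughly 2 GB as 8-bit integers and eight times that in double precision, which is one motivation for working on a subset of frames.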

In [4]:
# Create function to read all frames of a video in an array
def read_frames(video_capture, max_frames):
    """
    INPUTS:
    video_capture: an OpenCV VideoCapture object whose frames we want to read
    max_frames: the maximum number of frames we want to read
    
    OUTPUT:
    array of all the frames until max_frames
    """
    # Initialize an empty list to collect the frames
    frames_array = []

    # Read frames until the video ends or max_frames is reached
    while video_capture.isOpened() and len(frames_array) < max_frames:
        ret, frame = video_capture.read()
        if not ret:
            # No more frames could be read
            break
        frames_array.append(frame)

    # Release the video capture (no window is shown, so there is nothing else to close)
    video_capture.release()

    # Return the list of frames
    return frames_array

In [5]:
# Read in all the videos
parking_lot_array = read_frames(video_capture=parking_lot, max_frames=commute_frames)
patio_array = read_frames(video_capture=patio, max_frames=commute_frames)
commute_array = read_frames(video_capture=commute, max_frames=commute_frames)

# Data Manipulation

We create tensors out of the lists of frames with the TensorLy library.

In [6]:
# Create tensors from the lists of frames
parking_lot_tensor = tl.tensor(parking_lot_array)
patio_tensor = tl.tensor(patio_array)
commute_tensor = tl.tensor(commute_array)

To speed up later steps, we restrict each tensor to the same 50 randomly selected frames.

In [7]:
# Set the seed for reproducibility
random.seed(42)
random_frames = random.sample(range(0, commute_frames), 50)

In [8]:
# Use these random frames to subset the tensors
subset_parking_lot = parking_lot_tensor[random_frames, :, :, :]
subset_patio = patio_tensor[random_frames, :, :, :]
subset_commute = commute_tensor[random_frames, :, :, :]
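
Indexing the first axis with a list of frame numbers keeps the other three axes intact and returns the frames in the order in which the indices were drawn. A small sketch on a toy array, assuming a made-up shape (10 frames of 4x6 pixels); `toy_video`, `picked`, and `subset` are hypothetical names.

```python
import random
import numpy as np

# Toy stand-in for a video tensor: 10 frames, 4x6 pixels, 3 channels
toy_video = np.arange(10 * 4 * 6 * 3).reshape(10, 4, 6, 3)

# Draw 3 distinct frame indices, reproducibly
random.seed(42)
picked = random.sample(range(10), 3)

# Select those frames along the first axis only
subset = toy_video[picked, :, :, :]
print(subset.shape)

# Sorting the indices would restore chronological order if that matters
subset_sorted = toy_video[sorted(picked), :, :, :]
```

Because the same index list is applied to every video, the selected frames correspond to the same positions in each recording.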

We convert the tensors to double precision (float64); otherwise, the Tucker decomposition will not work on the 8-bit integer frames.

In [9]:
# Convert three tensors to double
subset_parking_lot = subset_parking_lot.astype('d')
subset_patio = subset_patio.astype('d')
subset_commute = subset_commute.astype('d')
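
Working in floating point also matters for the subtraction used later on. Assuming the frames come back from OpenCV as unsigned 8-bit integers, differences computed on the raw values would wrap around instead of going negative. A small sketch with made-up pixel values:

```python
import numpy as np

# Two made-up pixel rows stored as unsigned 8-bit integers
a = np.array([10, 200], dtype=np.uint8)
b = np.array([20, 100], dtype=np.uint8)

# uint8 arithmetic wraps modulo 256, so 10 - 20 becomes 246
wrapped = a - b

# After conversion to double, the sign of the difference is preserved
exact = a.astype('d') - b.astype('d')

print(wrapped)
print(exact)
```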

# Naive Comparison

A natural way of comparing two tensors of the same shape is to compute the norm of the difference between them: the larger the norm, the more the tensors differ entry by entry.

In [17]:
# Parking and patio
parking_patio_naive_diff = tl.norm(subset_parking_lot - subset_patio)

# Parking and commute
parking_commute_naive_diff = tl.norm(subset_parking_lot - subset_commute)

# Patio and commute
patio_commute_naive_diff = tl.norm(subset_patio - subset_commute)

# Print our differences
print("The difference between parking and patio tensors is {}, {} between parking and commute and {} between patio and commute".format(
    int(parking_patio_naive_diff), int(parking_commute_naive_diff), int(patio_commute_naive_diff)))

The difference between parking and patio tensors is 1832804, 1862115 between parking and commute and 1840975 between patio and commute
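
For reference, the norm used here can be written out by hand. The sketch below assumes `tl.norm` is called with its default order of 2, in which case it is the square root of the sum of squared entries; it uses plain NumPy on small random tensors, and all names are illustrative.

```python
import numpy as np

# Two small random 4D tensors standing in for video subsets
rng = np.random.default_rng(0)
x = rng.random((5, 4, 6, 3))
y = rng.random((5, 4, 6, 3))

# Norm of the difference: square root of the sum of squared entries
diff = x - y
manual_norm = np.sqrt(np.sum(diff ** 2))

# The same value via NumPy's vector 2-norm on the flattened difference
numpy_norm = np.linalg.norm(diff.ravel())

print(manual_norm, numpy_norm)
```

Every pixel contributes to this sum, so the result is sensitive to pixel-level differences between the recordings.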


# Unsupervised Learning

Now that we have the tensors, we can perform a Tucker decomposition and use the resulting core tensor as a more robust representation of each video. The low-rank approximation filters out much of the noise, which gives a better sense of the similarity between two videos.

The main tuning parameter is the n-rank of the tensor. If we were seeking the optimal decomposition, the Akaike information criterion (AIC) could be used to choose the best value of this hyperparameter. In this context, however, we are not looking for an optimal setting, only a usable one. Besides, the core tensors must have the same dimensions for us to compare them.

For this reason, we choose an n-rank of [2,2,2,2] for all tensors and compare the resulting core tensors. This fairly small n-rank also limits the computational cost of our operations (trying an n-rank of [5,5,5,5] exceeds the capabilities of LAPACK, which is used under the hood).
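
To make the role of the n-rank concrete, here is a minimal sketch of a truncated higher-order SVD (HOSVD) in plain NumPy. This is not TensorLy's `tucker` routine, which refines its factors iteratively; it is a simpler, related technique shown on a small random tensor. The helper names `unfold`, `mode_dot`, and `truncated_hosvd` are hypothetical and defined only for this illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Multiply a tensor by a matrix along one mode."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return a core tensor and one orthonormal factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the mode-wise unfolding
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    # Project the tensor onto the factors to obtain the core
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_dot(core, factor.T, mode)
    return core, factors

# Toy 4D tensor standing in for a video subset (made-up shape)
rng = np.random.default_rng(0)
toy = rng.random((6, 5, 4, 3))

core, factors = truncated_hosvd(toy, ranks=[2, 2, 2, 2])
print(core.shape)
print([f.shape for f in factors])
```

With an n-rank of [2,2,2,2], the core has only 16 entries whatever the size of the input, which is why cores of different videos can be subtracted from one another.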

In [11]:
# Get core tensor for the parking lot video
core_parking_lot, factors_parking_lot = tucker(subset_parking_lot, ranks = [2,2,2,2])

In [12]:
# Get core tensor for the patio video
core_patio, factors_patio = tucker(subset_patio, ranks = [2,2,2,2])

In [13]:
# Get core tensor for the commute video
core_commute, factors_commute = tucker(subset_commute, ranks = [2,2,2,2])

In [18]:
# Compare core parking lot and patio
parking_patio_diff = tl.norm(core_parking_lot - core_patio)
int(parking_patio_diff)

3707771

In [19]:
# Compare core parking lot and commute
parking_commute_diff = tl.norm(core_parking_lot - core_commute)
int(parking_commute_diff)

675670

In [20]:
# Compare core patio and commute
patio_commute_diff = tl.norm(core_patio - core_commute)
int(patio_commute_diff)

3630173
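
The three distances can be ranked programmatically to see which pair of videos is closest. The sketch below hard-codes the values printed in the cells above; `core_distances` and `closest_pair` are illustrative names.

```python
# Core-tensor distances as printed in the three cells above
core_distances = {
    ("parking_lot", "patio"): 3707771,
    ("parking_lot", "commute"): 675670,
    ("patio", "commute"): 3630173,
}

# Pair with the smallest distance between core tensors
closest_pair = min(core_distances, key=core_distances.get)

# List all pairs from most to least similar
for pair, dist in sorted(core_distances.items(), key=lambda kv: kv[1]):
    print(pair, dist)
```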

# Conclusion

Tucker decomposition allows us to make robust comparisons between videos by extracting the core tensor, which holds the main information contained in each video.

This approach has very broad applications (recommender systems, materials science), but it also needs a lot of computing power. There is some potential for parallelization, which would be needed to use it at scale.

For more information on the mathematical underpinning of tensor decomposition as well as broader context on this analysis, please refer to the Medium article linked in the associated README file.