# Machine Learning-Powered Video Library
**By [Czarina Luna](https://www.linkedin.com/in/czarinagluna/)**
***

Video sharing applications today lack the functionality for users to search videos by their content. As a solution, I developed a searchable video library that processes videos and returns exact matches to queries using machine learning and artificial intelligence, including speech recognition, optical character recognition, and object detection.

*[Link to Web Application](https://share.streamlit.io/czarinagluna/ml-powered-video-library/main)*

### Contents
* [Business Problem](#Business-Problem)
* [Data and Methodology](#Data-and-Methodology)
* [Video Processing](#Video-Processing)
    * [I. Audio Processing](#I.-Audio-Processing)
    * [II. Extracting Visual Text](#II.-Extracting-Visual-Text)
    * [III. Detecting Image Object](#III.-Detecting-Image-Object)
* [Natural Language Processing](#Natural-Language-Processing)
* [Search Results](#Search-Results)
* [Further Research](#Further-Research)

## Business Problem

Applications for video sharing and storage could enhance the user experience by allowing users to search for videos by their content, such as specific words or objects in the video. One of the most popular video sharing apps right now is TikTok, where users can save the videos they like to their profile but cannot yet search through those liked videos.

Because it lacks that functionality, its millions of users are forced to scroll through every video they have ever liked to find a single clip, again and again. To address this problem, I created a library of TikTok videos and built a search engine that breaks the videos down into several features and returns exact matches to any given query.

## Data and Methodology
A sample of 140 videos is provided in the [videos](https://github.com/czarinagluna/ml-video-library/tree/main/data/videos) folder of this repository to demonstrate the end-to-end process I performed. This sample was originally saved from my personal user account. In addition, I downloaded two datasets from Kaggle, each containing 1000 videos (found [here](https://www.kaggle.com/datasets/marqueurs404/tiktok-trending-videos) and [here](https://www.kaggle.com/datasets/erikvdven/tiktok-trending-december-2020?select=videos)). Altogether I analyzed over 2000 videos for the whole project, which I uploaded to [Google Drive](https://drive.google.com/drive/folders/1-OMkbBMzBGWH9PVU0ojZACtnFlP3ANbE?usp=sharing). You may download all the videos to explore the complete dataset.

**Multimedia Data**

A video is a complex data type that can be broken down in a lot of different ways. Through feature engineering, I turned the raw videos into multiple data features that I extracted using the following approaches:
- Converting the video to audio and transcribing the speech
- Breaking down the video as a sequence of images or frames
    - Recognizing on-screen text in the video frames
    - Detecting image objects in the video frames

**Data Processing**

1. Audio processing using `moviepy`, `pydub`, and `speech_recognition`
2. Optical character recognition using `opencv-python`, `PIL`, and `pytesseract`
3. Object detection using `opencv-python` and `YOLOv3` algorithm

Using the above packages and models, the features are extracted as text, so I applied Natural Language Processing (NLP) to process the text and create a corpus of all the words to search through. Lastly, I built the search engine using `BM25` and deployed the full app via Streamlit.

## Video Processing

To start, let's create a dataframe containing the file paths of the videos.

In [1]:
# Create a list of names of all the video files
import os
directory = 'data/videos'
file_list = []

for file in os.listdir(directory):
    f = os.path.join(directory, file)
    if os.path.isfile(f):
        file_list.append(f)
        
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Turn the list into a pandas dataframe
import pandas as pd
import numpy as np

data = pd.DataFrame(file_list, columns=['file_path'])

In [3]:
# Display the first five rows of the table
data.head()

| | file_path |
|---|---|
| 0 | data/videos/v09044b10000bt9ch34evc02dolrmnm0.mov |
| 1 | data/videos/v090440c0000btvqv225mcbk472mqcfg.MP4 |
| 2 | data/videos/v09044650000bqp6ame0bkbl9lnj30ng.MP4 |
| 3 | data/videos/v090440c0000bu92lni5mcbk473oa89g.MP4 |
| 4 | data/videos/v09044f20000btekl246h3878d1mjtcg.MP4 |


The file paths are used to find the videos. For example, let's display the first video.

In [4]:
video_0 = data.iloc[0, 0]

In [5]:
# Use the Display module from IPython to play a video
from IPython.display import Video
Video(video_0, width=300)

### I. Audio Processing

The first feature to extract from the videos is audio. Audio processing is the fastest part of the full feature engineering process.

- To create the audio file, the video formatted as *mov* or *mp4* is converted to a *wav* file. 
- The audio file is sliced into smaller chunks of audio, split by silence of 500 milliseconds or more, for faster processing. 
- To recognize the contents, I create an instance of the `Recognizer` class and call the method `recognize_google`. 
- The full transcription is stored in a text file, which is returned by the function I define as `transcribe_audio`:
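To make the silence-splitting step concrete, here is a simplified, pure-Python illustration. It is not how `pydub` implements `split_on_silence`; the function name `split_on_quiet_runs` and the per-millisecond loudness list are hypothetical stand-ins chosen for this sketch, and the real function additionally keeps some silence around each chunk (`keep_silence`).

```python
def split_on_quiet_runs(loudness_db, silence_thresh, min_silence_len):
    """Split per-millisecond loudness values (in dBFS) into chunks, cutting
    wherever at least `min_silence_len` consecutive values fall below
    `silence_thresh`. Returns (start, end) index pairs, end exclusive."""
    chunks = []
    chunk_start = None   # index where the current non-silent chunk began
    quiet_run = 0        # length of the current run of quiet values

    for i, level in enumerate(loudness_db):
        if level < silence_thresh:
            quiet_run += 1
            # Close the open chunk once the quiet run is long enough
            if chunk_start is not None and quiet_run == min_silence_len:
                chunks.append((chunk_start, i - min_silence_len + 1))
                chunk_start = None
        else:
            quiet_run = 0
            if chunk_start is None:
                chunk_start = i

    # Close a chunk that runs to the end, trimming any short trailing quiet
    if chunk_start is not None:
        chunks.append((chunk_start, len(loudness_db) - quiet_run))
    return chunks


# Toy signal: speech, a long pause, speech with a short pause inside it
levels = [-10] * 3 + [-60] * 5 + [-12] * 2 + [-60] * 2 + [-9]
print(split_on_quiet_runs(levels, silence_thresh=-40, min_silence_len=4))
# [(0, 3), (8, 13)]
```

The short pause inside the second stretch of speech does not trigger a cut because it is shorter than `min_silence_len`, which is the same role the 500 millisecond setting plays in the project.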

In [6]:
# Import Python libraries and modules
import os
import speech_recognition as sr 
from moviepy.editor import AudioFileClip
from pydub import AudioSegment
from pydub.silence import split_on_silence

def transcribe_audio(file_path):
    '''
    Converts video to audio and returns audio transcription.
    
    Parameters:
    file_path (str): file path of video to be transcribed.
    
    Returns:
    full_text (str): full text transcription of video's audio.
    '''
    # Write the audio file from the video using MoviePy
    # to convert the MP4 or MOV video format to a wav file
    transcribed_audio_file = './data/audio/transcribed_audio.wav'
    audioclip = AudioFileClip(file_path)
    audioclip.write_audiofile(transcribed_audio_file)

    # Load the wav file that was just written so that PyDub
    # works on the extracted audio track rather than the video file
    sound = AudioSegment.from_wav(transcribed_audio_file)

    # Split the wav file into chunks where there is silence 
    # for 500 milliseconds or more using PyDub
    chunks = split_on_silence(sound, min_silence_len = 500, 
                              silence_thresh = sound.dBFS-14, keep_silence=500)

    # Create a folder called audio_chunks to save the chunks of wav files
    folder_name = './data/audio/audio_chunks'
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
        
    full_text = ''
    
    # Create an instance of the Recognizer class from SpeechRecognition
    r = sr.Recognizer()
    
    # Call the method that uses Google Speech Recognition API
    # to transcribe the audio and return a string of text
    for i, audio_chunk in enumerate(chunks, start=1):
        chunk_filename = os.path.join(folder_name, f'chunk{i}.wav')
        audio_chunk.export(chunk_filename, format='wav')

        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)

            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print('Error:', str(e))
            else:
                text = f'{text.capitalize()}.'
                print(chunk_filename, ':', text)
                full_text += text

    return full_text

In [None]:
# Apply the function and add a column to the table for the audio transcription
data['audio_text'] = data['file_path'].apply(transcribe_audio)

### II. Extracting Visual Text

The other features to extract from the videos come from their visual content. Extracting visual text is faster than detecting visual objects.
- To create the sequence of images, the video is broken down into frames by `VideoCapture` in the function I define as `save_frames`.
- The images are captured every *n*th frame and then opened with the Python Imaging Library (`PIL`).
- To recognize the text, the image frames are passed onto the method `image_to_string`.
- The extracted text is returned by the function I define as `extract_visual_text` and processed later using NLP.
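As a rough guide to how many images the OCR step has to read, the frame sampling can be sketched with plain arithmetic. The helper `sampled_frame_indices` and the 15-second, 30 frames-per-second clip are hypothetical values used only for illustration; the step of 20 matches the value used in this project.

```python
def sampled_frame_indices(total_frames, step=20):
    """Return the indices of the frames kept when saving every
    `step`-th frame, starting from frame 0."""
    return list(range(0, total_frames, step))


# A hypothetical 15-second clip recorded at 30 frames per second
total_frames = 15 * 30
indices = sampled_frame_indices(total_frames)
print(len(indices), indices[:4])
# 23 [0, 20, 40, 60]
```

A larger step means fewer images to process and faster extraction, at the risk of missing text that appears on screen only briefly.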

In [8]:
# Import Python libraries and modules
import cv2
import pytesseract
import shutil
import re
import numpy as np
try:
    from PIL import Image
except ImportError:
    import Image

In [9]:
image_frames = './data/images/image_frames'

def save_frames(file_path):
    '''
    Creates image folder and saves video frames in the folder.
    
    Parameters:
    file_path (str): file path of video to be captured as images.
    
    Returns:
    image_frames folder where the video frames are stored.
    '''
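    # Remove any frames left over from a previous video; os.remove cannot
    # delete a directory, so use shutil.rmtree (imported above) instead
    shutil.rmtree(image_frames, ignore_errors=True)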

    # Create a folder called image_frames to save the images or frames of the video
    if not os.path.exists(image_frames):
        os.makedirs(image_frames)
    
    # Capture every 20th frame of the video using cv2 from OpenCV and save to folder
    src_vid = cv2.VideoCapture(file_path)

    index = 0
    while src_vid.isOpened():
        ret, frame = src_vid.read()
        if not ret:
            break

        name = './data/images/image_frames/frame' + str(index) + '.png'

        if index % 20 == 0:
            print('Extracting frames...' + name)
            cv2.imwrite(name, frame)
        index = index + 1
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
  
    src_vid.release()
    cv2.destroyAllWindows()

In [10]:
def sorted_alphanumeric(name_list):
    '''
    Sorts names according to alphanumeric characters.
    
    Parameters:
    name_list (list): list of names to be sorted.
    
    Returns:
    sorted_names (list): sorted list using natural sorting e.g. 1, 2, 10 rather than 1, 10, 2
    '''
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)] 
    sorted_names = sorted(name_list, key=alphanum_key)
    return sorted_names

# Credits to user136036 for this function found on stack overflow
# https://stackoverflow.com/questions/4813061/non-alphanumeric-list-order-from-os-listdir

In [11]:
def extract_visual_text(file_path):
    '''
    Extracts visual text from images saved of video frames.

    Parameters:
    file_path (str): file path of video from which to extract the visual text.
    
    Returns:
    full_text (str): text as seen in the video taken from every 20th frame.
    '''
    save_frames(file_path)
    print('Folder created.')
    
    text_list = []
    
    # Sort the frames in the folder using the function above for correct ordering
    image_list = sorted_alphanumeric(os.listdir(image_frames))
    
    # Open each image frame using PIL, and pass as argument in a function that uses 
    # Google Tesseract OCR to recognize text in the image
    for i in image_list:
        print(str(i))
        single_frame = Image.open(image_frames + '/' + i)
        text = pytesseract.image_to_string(single_frame, lang='eng')
        text_list.append(text)

    # Remove the new line character `\n` and the word TikTok 
    # from the strings of text returned and joined together
    full_text = ' '.join([i for i in text_list])
    full_text = full_text.replace('\n', '').replace('\x0c', '').replace('TikTok', '')

    # Remove the folder to erase the image frames of the video
    shutil.rmtree('./data/images/image_frames/')
    print('Folder removed.')
    
    return full_text

In [None]:
# Apply the function and add a column to the table for the visual text
data['visual_text'] = data['file_path'].apply(extract_visual_text)

#### Extracting Username

To extract the username from the visual text, I define the function `extract_username` that executes the following steps:

- Create a list of all words in the string of words lowercased.
- Create another list of the strings that start with the sign '@'.
- Return the most frequent word in the list using the function `most_frequent`.

The most frequent word is most likely the username because TikTok automatically displays it for the full duration of the video, that is, in every single video frame.
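The idea can be illustrated with a compact sketch built on `collections.Counter`. This is a simplified variant for illustration, not the function used in the project: the name `most_common_handle` is hypothetical, and unlike the project code it does not strip digits first. The sample string is the visible part of one of the OCR outputs from the sample videos.

```python
from collections import Counter


def most_common_handle(visual_text):
    """Return the most frequent @-prefixed token in the text, or None."""
    # Keep the part after the last '@' of every word that contains one
    handles = [word.split('@')[-1]
               for word in visual_text.lower().split() if '@' in word]
    handles = [h for h in handles if h]  # drop empty strings from a bare '@'
    if not handles:
        return None
    return Counter(handles).most_common(1)[0][0]


print(most_common_handle('ob @angpark oft 4@angpark A op 4 2@angpark'))
# angpark
```

Because the OCR output is noisy, counting occurrences across many frames is what makes the result reliable: a misread handle in one frame is outvoted by the correct readings in the others.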

In [13]:
def most_frequent(username_list):
    '''Takes in a list of strings and returns the most frequent word in the list, or NaN if that word is empty.'''
    # Use a separate name so the result does not shadow the function itself
    top_word = max(set(username_list), key=username_list.count)
    if top_word == '':
        return np.nan
    else:
        return top_word

def extract_username(visual_text):
    '''
    Lists possible usernames from visual text and returns the most frequent one that may most likely be the username.
    
    Parameters:
    visual_text (str): full visual text extracted from video.
    
    Returns:
    username (str): most frequent word that starts with @ sign; if none, returns none.
    '''
    visual_text = ''.join([i for i in visual_text.lower() if not i.isdigit()])
    
    text = ' '.join(visual_text.split())
    text_list = [word for word in text.lower().split()]

    username_list = []
    for word in text_list:
        if re.search(r'[@]', word):
            username_list.extend([word.rsplit('@')[-1]])
    if username_list == []:
        return np.nan

    else:
        username_list = ' '.join([username for username in username_list])
        username_list = [username for username in username_list.strip().split()]
        try:
            return most_frequent(username_list)
        except ValueError:
            # max() raises ValueError when no candidate usernames remain
            return ' '.join(username_list)

In [14]:
# Apply the function and add a column to the table for the username 
data['username'] = data['visual_text'].apply(extract_username)

### III. Detecting Image Object

The final feature to extract is the objects in the videos, which are detected with YOLO, a state-of-the-art object detection system based on a deep learning algorithm.

**Deep Learning**

[YOLO](https://pjreddie.com/darknet/yolo/) applies a convolutional neural network with the architecture illustrated below:
![](data/images/cnn.png)

*Image Source: https://arxiv.org/pdf/1506.02640.pdf*

***

Compared to prior object detection systems, YOLO takes a different approach: the network looks at the image only once, hence the name *You Only Look Once*.
- The input image is divided into a grid of *x* by *x* cells.
- For each cell, bounding boxes are predicted along with confidence scores.
- Class probabilities are mapped for each cell, and the bounding boxes are weighted by those predicted probabilities.
- Detected objects are output only if they meet the set threshold.
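The thresholding in the last step, together with the non-maximum suppression that usually follows it to discard duplicate boxes around the same object, can be sketched in pure Python. This is a simplified illustration under stated assumptions, not OpenCV's or Darknet's implementation: the helper names `iou` and `greedy_nms` are hypothetical, boxes are assumed to be `[x, y, w, h]` with a top-left origin, and the default thresholds of 0.5 and 0.4 mirror the values used later in this project.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, confidences, score_thresh=0.5, iou_thresh=0.4):
    """Return the indices of the boxes kept after score thresholding
    and greedy non-maximum suppression."""
    # Keep confident boxes only, most confident first
    order = sorted(
        (i for i, c in enumerate(confidences) if c > score_thresh),
        key=lambda i: confidences[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept


boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [200, 200, 50, 50], [0, 0, 10, 10]]
confidences = [0.9, 0.8, 0.7, 0.3]
print(greedy_nms(boxes, confidences))
# [0, 2]
```

In this toy input, the second box is dropped because it almost entirely overlaps the more confident first box, and the fourth is dropped because its confidence is below the threshold.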

**Transfer Learning**

The weights from the YOLO pre-trained network model are adapted to our data. To load the network model, download the weight and configuration files from Darknet. The configuration file describes the layout of the network by block.

> "YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily tradeoff between speed and accuracy simply by changing the size of the model, no retraining required!" ([Darknet](https://pjreddie.com/darknet/yolo))

I chose `YOLOv3-spp` for accuracy: it is the model with the highest mean average precision (60.6) on the COCO [dataset](https://cocodataset.org/#home). For speed, you may try `YOLOv3-tiny`, the model with the highest frame rate (220 frames per second).

In [15]:
# Load the network model into OpenCV using the configuration and weight files 
net = cv2.dnn.readNetFromDarknet('./data/yolo/yolov3-spp.cfg', './data/yolo/yolov3-spp.weights')

This YOLO neural network consists of 263 parts, such as convolutional layers (`conv`) and batch normalization layers (`bn`).

Printing them...

`ln = net.getLayerNames()
print(len(ln), ln)`

...returns the following:

`263 ('conv_0', 'bn_0', 'leaky_1', 'conv_1', 'bn_1', 'leaky_2', 'conv_2', 'bn_2', 'leaky_3', 'conv_3', 'bn_3', 'leaky_4', 'shortcut_4', 'conv_5', 'bn_5', 'leaky_6', 'conv_6', 'bn_6', 'leaky_7', 'conv_7', 'bn_7', 'leaky_8', 'shortcut_8', 'conv_9', 'bn_9', 'leaky_10', ...)`

***

**COCO**

To get the labels of the model trained on the COCO dataset, download the [COCO](https://github.com/pjreddie/darknet/blob/master/data/coco.names) name file that contains the names of all the classes. The model can detect a total of 80 object classes. The name file, the weights, and the configuration files are all available in the [data](https://github.com/czarinagluna/ml-video-library/tree/main/data/yolo) folder of the repository.

- The objects detected are returned by the function I define as `detect_object`:

In [16]:
def detect_object(file_path):
    '''
    Uses YOLO algorithm to detect objects in video frames.
    
    Parameters:
    file_path (str): file path of video from which to detect objects in the frames.
    
    Returns:
    object_set (list): list of unique objects detected in the video.
    '''
    classes = []

    with open('./data/yolo/coco.names', 'r') as f:
        classes = f.read().splitlines()
  
    try:
        cap = cv2.VideoCapture(file_path)
        count = 0
        object_list = []

        while cap.isOpened():
            ret, img = cap.read()

            if not ret:
                break

            if ret:
                # Jump ahead so that only every 50th frame is analyzed
                count += 50
                cap.set(cv2.CAP_PROP_POS_FRAMES, count)

                height, width, _ = img.shape
                blob = cv2.dnn.blobFromImage(img, 1/255, (416, 416), (0,0,0), swapRB=True, crop=False)
                net.setInput(blob)

                output_layers_names = net.getUnconnectedOutLayersNames()
                layerOutputs = net.forward(output_layers_names)

                boxes = []
                confidences = []
                class_ids = []

                for output in layerOutputs:
                    for detection in output:
                        scores = detection[5:]
                        class_id = np.argmax(scores)
                        confidence = scores[class_id]
                        if confidence > 0.5:
                            center_x = int(detection[0]*width)
                            center_y = int(detection[1]*height)
                            w = int(detection[2]*width)
                            h = int(detection[3]*height)
                            x = int(center_x - w/2)
                            y = int(center_y - h/2)
                            boxes.append([x, y, w, h])
                            confidences.append(float(confidence))
                            class_ids.append(class_id)
                print(len(boxes))
                indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
                if len(indexes) > 0:
                    print(indexes.flatten())
                    # Record the label of every box kept by non-maximum suppression
                    for i in indexes.flatten():
                        label = str(classes[class_ids[i]])
                        object_list.append(label)
            
        cap.release()
        cv2.destroyAllWindows()
        
        object_set = list(set(object_list))
        
        print('Done detecting object in this video.')
        print(f'These are the objects detected: {object_set}')
    
        return object_set
    
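    except Exception as e:
        # Report the failing video and return an empty list so the column stays consistent
        print(f'{file_path} did not work: {e}')
        return []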

In [None]:
# Apply the function and add a column to the table for the list of objects detected
data['object_list'] = data['file_path'].apply(detect_object)

**Feature Engineering Results**

In [18]:
data = data.fillna('')
data

| | file_path | audio_text | visual_text | username | object_list |
|---|---|---|---|---|---|
| 0 | data/videos/v09044b10000bt9ch34evc02dolrmnm0.mov | Just came home from work.To find this.On our b... | eelee age Brebs ryJ@ 7 a)D we yamked... | lexwolff | [person, bed, book] |
| 1 | data/videos/v090440c0000btvqv225mcbk472mqcfg.MP4 | | PASTA Di pi pasta without him knowingWheat Td!... | abh.giving | [person, bowl, diningtable] |
| 2 | data/videos/v09044650000bqp6ame0bkbl9lnj30ng.MP4 | | n WrteiticentPowder Powder o Faced - Peachu... | jhk.cof | [person, cell phone] |
| 3 | data/videos/v090440c0000bu92lni5mcbk473oa89g.MP4 | | » QUICK/EASYRH“HEALTHY MEAL) = >» QUICK/EASYHE... | gusmevean | [person, pizza, chair, bowl, spoon, cup] |
| 4 | data/videos/v09044f20000btekl246h3878d1mjtcg.MP4 | Play wherever the b\*\*\*\* that ain't got no ass ... | ob @kamiorellano@ How u do that @S®@Dteach us ... | kamiorellano | [person, chair] |
| ... | ... | ... | ... | ... | ... |
| 135 | data/videos/v09044080000btsg4svg6g0t4bro25q0.MP4 | Honda huntington beach can we end of the s. | most- worn jeans most- worn jeansf d rcmost- w... | vivianeaudiiiob | [person, handbag, refrigerator] |
| 136 | data/videos/v09044c10000bud34lrgnk9tslqmsi80.MP4 | Guy let's play. | isThis NY restaurant will “make you feel like ... | amorraytravels | [person, laptop, chair, bus, umbrella, bottle,... |
| 137 | data/videos/v09044100000brk27uahq10bgjc7ags0.MP4 | One day somebody who changes my mind. | ob @jannamoreau ob @jannamoreau ... | jannamoreau | [person] |
| 138 | data/videos/v090449d0000br13ppm0bkbnmcmu533g.MP4 | Do they record your phone and you're moving to... | ob @angpark oft 4@angpark A op 4 2@angpark i... | angpark | [person] |


## Natural Language Processing

To process the text features, I use the Natural Language Toolkit (`nltk`) library to standardize the text (lowercasing the letters and removing punctuation marks and stopwords) and to tokenize and lemmatize it. [`WordSegment`](https://pypi.org/project/wordsegment/) is also used to segment strings of words that have no spaces between them.
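The standardization steps can be illustrated with a minimal sketch that uses only the Python standard library. It is a simplified stand-in, not the project's pipeline: the function name `standardize`, the tiny stopword set, and the sample sentence are hypothetical, and lemmatization is left out because it needs the WordNet data that `nltk` downloads.

```python
import string

# Tiny illustrative stopword set; the project itself uses NLTK's English list
STOPWORDS = {'the', 'a', 'to', 'from', 'this', 'on', 'our'}


def standardize(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]


print(standardize('Just came home from work. To find this.'))
# ['just', 'came', 'home', 'work', 'find']
```

Reducing every text feature to the same normalized tokens is what lets a single query match words that came from speech, on-screen text, or detected objects.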

In [19]:
# Convert speech transcribed to lowercase and remove full stop
data['standardized_audio_text'] = data['audio_text'].apply(lambda x: x.lower().replace('.', ''))

In [20]:
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())

import wordsegment
wordsegment.load()

import sys
sys.setrecursionlimit(2000)

[nltk_data] Downloading package words to
[nltk_data]     /Users/czarinaluna/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [21]:
def process_visual_text(text):
    '''Processes a string of text by removing punctuation marks and segmenting words.'''
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace; Python's re module
    # does not support POSIX classes such as [:punct:]
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = wordsegment.segment(text) 
    text = ' '.join([i for i in text if i in words])
    return text

In [22]:
# Apply the function and add a column to the table for the processed visual text
data['processed_visual_text'] = data['visual_text'].apply(process_visual_text)

In [23]:
def segment_text(text):
    '''Segments strings of words without spaces between them.'''
    text = wordsegment.segment(text)
    text = ' '.join([i for i in text])
    return text

In [24]:
# Convert list to string and add a column to the table for the processed objects
data['object_text'] = data['object_list'].apply(lambda x: ' '.join([word for word in x]))

data['object_text'] = data['object_text'].apply(segment_text)

**Text Feature**

In [25]:
# Put all the features together
data['text'] = data['standardized_audio_text'] + ' ' + data['visual_text'] + ' ' + data['processed_visual_text'] + ' ' + data['username'] + ' ' + data['object_text']

In [26]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()

nltk.download('stopwords')
nltk.download('wordnet')

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(["i've", "let's", "lets", "youve", "theyve", "they've"])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/czarinaluna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/czarinaluna/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [27]:
def preprocess_text(text):
    '''Processes text by lemmatization and stopwords removal.'''
    text = ' '.join([i for i in text.split() if len(i)>3])
    text = text.split()
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if word not in stopwords]
    text = ' '.join(text)
    return text

In [28]:
# Apply the function and add a column to the table for the processed text
data['preprocessed_text'] = data['text'].fillna('').apply(preprocess_text)

In [29]:
# Save the dataset
data.to_csv('./data/data.csv', index=False)
%store data

Stored 'data' (DataFrame)


## Search Results

In [None]:
import spacy
from tqdm import tqdm

nlp = spacy.load('en_core_web_sm')

tokenized_text = [] 

for doc in tqdm(nlp.pipe(data['preprocessed_text'].fillna('').str.lower().values, disable=['tagger', 'parser', 'ner'])):
    tokenized = [token.text for token in doc if token.is_alpha]
    tokenized_text.append(tokenized)

In [31]:
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(tokenized_text)
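
For intuition about what the BM25 ranking does, here is a minimal pure-Python sketch of a textbook BM25 scoring function. It is an illustration under assumptions, not the `rank_bm25` implementation: the function name `bm25_scores` is hypothetical, the smoothed IDF shown is one common variant, and the library's exact IDF handling and default parameters may differ. The toy corpus reuses object labels from the sample videos.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_tokens`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length with b
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ['person', 'pizza', 'bowl', 'spoon'],
    ['person', 'chair'],
    ['person', 'laptop', 'umbrella', 'restaurant'],
]
scores = bm25_scores(['restaurant', 'umbrella'], corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best)
# 2
```

Rare query terms contribute more than common ones such as `person`, which appears in every document here, so a video is ranked highly when it contains the distinctive words of the query.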

In [32]:
def search_video(query, result=3, n=1):
    '''
    Uses BM25 to rank the videos against a query and displays one of the top results.
    
    Parameters:
    query (str): word or phrases to search.
    result (int): the number of results to search.
    n (int): the nth result to display.
    
    Returns:
    video to display from the list of results.
    '''
    tokenized_query = query.lower().split(' ')
    
    results = bm25.get_top_n(tokenized_query, data['file_path'], result)
    results_list = [video for video in results]

    video = Video(results_list[n-1], width=300)
    print(results_list[n-1])
    return video

In [33]:
search_video('chinatown dumplings')

data/videos/v090440e0000bttubbe6r5jsiqc8bji0.MP4


In [34]:
search_video('italian restaurant ny umbrella')

data/videos/v09044c10000bud34lrgnk9tslqmsi80.MP4


## Further Research

Further developments that could add value to the model are **music recognition** to resolve the limits of speech recognition, and application of a **neural network-based search** feature to further improve accuracy. Developing this on **other applications** for video sharing and storage such as Apple Photos could help users better manage videos on their devices.

## Contact
Feel free to contact me with any questions and connect with me on [LinkedIn](https://www.linkedin.com/in/czarinagluna).