<b><font color="black" size="+4">Natural Language YouTube Searcher</font></b>

<b><font color="black" size="+2">Based on:</font></b>

**GitHub repository**: [CLIP](https://github.com/openai/CLIP)

**Article**: [Learning Transferable Visual Models From Natural Language Supervision](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf)

**Creator**: [OpenAI](https://openai.com/)

Colab based on **[Vladimir Haltakov's](https://haltakov.net/)** [notebook](https://colab.research.google.com/github/haltakov/natural-language-youtube-search/blob/main/natural-language-youtube-search.ipynb).

<b><font color="black" size="+2">Colab created by:</font></b>

GitHub: [@tg-bomze](https://github.com/tg-bomze),
Telegram: [@bomze](https://t.me/bomze),
Twitter: [@tg_bomze](https://twitter.com/tg_bomze).

---


```
To get started, click the run button of each cell in turn (where the red arrow points), and wait for each cell to finish executing before starting the next one.
```

In [None]:
#@title <b><font color="red" size="+3">←</font><font color="black" size="+3"> Install all necessary libraries</font></b>

!nvidia-smi -L

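# googletrans is optional: it is used to translate non-English queries to English
try:
  !pip3 install googletrans==3.1.0a0
  from googletrans import Translator
  translator = Translator()
except Exception:
  translator = None  # search queries will be used untranslated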

import subprocess
from IPython.display import clear_output
CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

# pytube is used to download videos from YouTube
!pip install pytube

# Install a newer version of plotly
!pip install plotly==4.14.3

# Install CLIP from the GitHub repo
!pip install git+https://github.com/openai/CLIP.git

# Install torch 1.7.1 with GPU support
!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html ftfy regex

import clip
import torch
from pytube import YouTube
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import os
import math
import numpy as np
import plotly.express as px

# Load the open CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def search_video(search_query, display_heatmap=True, display_results_count=3):

  # Encode and normalize the search query using CLIP
  with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(search_query).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

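  # Compute the cosine similarity between the search query and each frame
  # (both sets of features are unit-length, so the dot product is the cosine)
  similarities = (100.0 * video_features @ text_features.T)
  # Never request more results than there are frames
  values, best_photo_idx = similarities.topk(min(display_results_count, len(video_frames)), dim=0)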

  # Display the heatmap
  if display_heatmap:
    print("Search query heatmap over the frames of the video:")
    fig = px.imshow(similarities.T.cpu().numpy(), height=50, aspect='auto', color_continuous_scale='viridis')
    fig.update_layout(coloraxis_showscale=False)
    fig.update_xaxes(showticklabels=False)
    fig.update_yaxes(showticklabels=False)
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
    fig.show()
    print()

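  # Display the top N frames with their approximate timestamps
  for frame_id in best_photo_idx.flatten().tolist():
    display(video_frames[frame_id])
    try:
      in_sec = int(frame_id * frames_to_skip / fps_of_video)
      print(f'This frame is at {in_sec} seconds of the video.\n')
    except ZeroDivisionError:
      # fps_of_video is 0 when OpenCV could not read the frame rate
      print()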

clear_output()
print('Done!')
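
The ranking step inside `search_video` comes down to a dot product between unit-length vectors followed by a top-k selection. The following is a minimal, dependency-free sketch of that idea. The toy 3-dimensional vectors and the helper names `normalize`, `cosine_scores`, and `top_k` are illustrative assumptions and are not part of CLIP or of this notebook.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (the role of `x /= x.norm(...)` in the notebook)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_scores(frame_vecs, query_vec):
    """The dot product of two unit vectors equals their cosine similarity."""
    q = normalize(query_vec)
    return [sum(a * b for a, b in zip(normalize(f), q)) for f in frame_vecs]

def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    k = min(k, len(scores))  # never ask for more results than there are frames
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy "frame embeddings" (the notebook stores 512 values per frame)
frames = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 1.0, 0.0]]
query = [0.0, 2.0, 0.0]

scores = cosine_scores(frames, query)
best = top_k(scores, 2)
print(best)
```

The notebook multiplies the scores by 100 before ranking; scaling by a positive constant does not change the order of the results.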

In [None]:
#@title <b><font color="red" size="+3">←</font><font color="black" size="+3"> Get Video</font></b>

# Dashcam video of driving around San Francisco
video_url = "https://www.youtube.com/watch?v=odM92ap8_c0" #@param {type:'string'}  

# How many frames to skip between two sampled frames
frames_to_skip = 60 #@param {type:"integer"}
#@markdown *A larger "frames_to_skip" value makes processing faster, but the search becomes less precise because fewer frames are analyzed.*

# Choose a video stream with resolution of 360p
streams = YouTube(video_url).streams.filter(adaptive=True, subtype="mp4", resolution="360p", only_video=True)

# Check if there is a valid stream
if len(streams) == 0:
  raise RuntimeError("No suitable stream found for this YouTube video!")

# Download the video as video.mp4
# (assumes a recent pytube release, which uses the filename exactly as given)
print("Downloading...")
streams[0].download(filename="video.mp4")
print("Download completed.")

# The frame images will be stored in video_frames
video_frames = []

# Open the video file
capture = cv2.VideoCapture('video.mp4')
# Keep the frame rate as a float (e.g. 29.97) so that timestamps do not drift
fps_of_video = capture.get(cv2.CAP_PROP_FPS)

current_frame = 0
while capture.isOpened():
  # Read the current frame
  ret, frame = capture.read()

  # Stop when no more frames can be read
  if not ret:
    break

  # Convert BGR to RGB, then to a PIL image (required for CLIP), and store it
  video_frames.append(Image.fromarray(frame[:, :, ::-1]))

  # Skip N frames
  current_frame += frames_to_skip
  capture.set(cv2.CAP_PROP_POS_FRAMES, current_frame)

# Release the video file now that all frames are extracted
capture.release()

# Print some statistics
print(f"Frames extracted: {len(video_frames)}")

# You can try tuning the batch size for very large videos, but it should usually be OK
batch_size = 256
batches = math.ceil(len(video_frames) / batch_size)

# The encoded features will be stored in video_features
video_features = torch.empty([0, 512], dtype=torch.float16).to(device)

# Process each batch
for i in range(batches):
  print(f"Processing batch {i+1}/{batches}")

  # Get the relevant frames
  batch_frames = video_frames[i*batch_size : (i+1)*batch_size]
  
  # Preprocess the images for the batch
  batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
  
  # Encode with CLIP and normalize
  with torch.no_grad():
    batch_features = model.encode_image(batch_preprocessed)
    batch_features /= batch_features.norm(dim=-1, keepdim=True)

  # Append the batch to the tensor containing all features
  video_features = torch.cat((video_features, batch_features))

# Print some stats
print(f"Features: {video_features.shape}")
print('Done!')
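
To make the `frames_to_skip` trade-off concrete, the sketch below estimates how many frames are sampled and how far apart they are in time. The helper names `sampled_frame_count` and `frame_timestamp` and the example numbers (a 10-minute clip at 30 fps) are assumptions chosen for illustration; they do not come from the notebook.

```python
import math

def sampled_frame_count(total_frames, frames_to_skip):
    """Number of frames kept when reading frames 0, N, 2N, ... below total_frames."""
    return math.ceil(total_frames / frames_to_skip)

def frame_timestamp(sample_index, frames_to_skip, fps):
    """Approximate position, in seconds, of the sample_index-th stored frame."""
    return sample_index * frames_to_skip / fps

# Hypothetical clip: 10 minutes at 30 fps
total_frames = 10 * 60 * 30

for skip in (30, 60, 120):
    count = sampled_frame_count(total_frames, skip)
    spacing = frame_timestamp(1, skip, fps=30)
    print(f"frames_to_skip={skip}: {count} frames, one every {spacing:.0f} s")
```

Doubling `frames_to_skip` halves the number of frames CLIP has to encode, at the cost of doubling the gap between searchable moments in the video.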

In [None]:
#@title <b><font color="red" size="+3">←</font><font color="black" size="+3"> Start Searching</font></b>
search_text = 'king kong' #@param {type:'string'}
# Translate the query to English if googletrans is available; otherwise use it as typed
try:
  prompt = translator.translate(search_text).text
except Exception:
  prompt = search_text

max_num_results = 3 #@param {type:"slider", min:1, max:12, step:1}

search_video(prompt, display_results_count=max_num_results)
#@markdown *A larger "max_num_results" value shows more search results, but the lower-ranked ones match the query less closely.*
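
Because lower-ranked results are less similar to the query, one possible way to keep weak matches out is to drop results whose score falls too far below the best one. This is only a sketch of that idea and is not something the notebook does; `filter_results` and the `max_drop` margin are hypothetical names, and the scores are made-up values.

```python
def filter_results(scored_frames, max_results, max_drop=2.0):
    """Keep up to max_results frames whose score is within max_drop of the best.

    scored_frames: list of (frame_index, similarity) pairs in any order.
    """
    ranked = sorted(scored_frames, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    best_score = ranked[0][1]
    kept = [pair for pair in ranked if best_score - pair[1] <= max_drop]
    return kept[:max_results]

# Made-up similarities on the notebook's 100x cosine scale
scored = [(12, 27.5), (40, 30.1), (7, 29.4), (88, 22.0)]
results = filter_results(scored, max_results=3)
print(results)
```

A suitable margin depends on the video and the query, so `max_drop` would need to be tuned by hand.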
