# Video subcategories preprocessing and clustering
To obtain a meaningful semantic clustering, we use a pre-trained CLIP model and project the text features onto a 768-dimensional latent space with CLIP's text encoder. In this first phase, we project the title and all the tags of each video, since they are the most representative textual features available. These projections give a point cloud for each video, from which we compute a single representative as a weighted average of the title embedding and the mean of the tag embeddings.  
In later steps we will apply PCA to reduce the dimensionality of the projected data, and finally apply a clustering algorithm to obtain the subcategories.
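
The representative described above can be illustrated with a minimal NumPy sketch. The function name `video_representative` and the toy 3-dimensional vectors are hypothetical and only stand in for real CLIP embeddings; the 0.6/0.4 weights are the ones used in the extraction loop further down in this notebook.

```python
import numpy as np

def video_representative(title_vec, tag_vecs, title_weight=0.6):
    """Weighted average of the title embedding and the mean tag embedding."""
    title_vec = np.asarray(title_vec, dtype=float)
    # Collapse the tag point cloud to its centroid first.
    tag_centroid = np.asarray(tag_vecs, dtype=float).mean(axis=0)
    return title_weight * title_vec + (1.0 - title_weight) * tag_centroid

# Toy vectors standing in for CLIP embeddings (illustrative values only).
title = [1.0, 0.0, 0.0]
tags = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rep = video_representative(title, tags)
print(rep)
```

Averaging the tags before mixing them with the title keeps the title's influence fixed regardless of how many tags a video has.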

To speed up the process, we run the model on the GPU, so this notebook requires a PyTorch build with CUDA support and a GPU with at least 4 GB of VRAM.
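
One way to check this requirement before loading the model is sketched below. The helper `describe_device` is hypothetical and not part of the notebook; it only reports what PyTorch sees and leaves the comparison against the 4 GB figure to the reader, since cards marketed as 4 GB may report slightly less.

```python
import importlib.util

def describe_device():
    """Return (device_name, vram_gib); vram_gib is 0.0 when no CUDA GPU is usable."""
    if importlib.util.find_spec("torch") is None:
        return "cpu", 0.0  # PyTorch is not installed
    import torch
    if not torch.cuda.is_available():
        return "cpu", 0.0  # CPU-only build or no visible GPU
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda", total_bytes / 1024**3

device, vram_gib = describe_device()
print(f"device={device}, VRAM={vram_gib:.2f} GiB")
```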

In [1]:
import pandas as pd
import numpy as np

# data head for visualization
data = pd.read_json('data/yt_metadata_en.jsonl.gz', compression='gzip', chunksize=20, lines=True)
for chunk in data:
    display(chunk.head(2))
    break

| | categories | channel_id | crawl_date | description | dislike_count | display_id | duration | like_count | tags | title | upload_date | view_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Film & Animation | UCzWrhkg9eK5I8Bm3HfV-unA | 2019-10-31 20:19:26.270363 | Lego City Police Lego Firetruck Cartoons about... | 1 | SBqSc91Hn9g | 1159 | 8 | lego city,lego police,lego city police,lego ci... | Lego City Police Lego Firetruck Cartoons about... | 2016-09-28 00:00:00 | 1057 |
| 1 | Film & Animation | UCzWrhkg9eK5I8Bm3HfV-unA | 2019-10-31 20:19:26.914516 | Lego Marvel SuperHeroes Lego Hulk Smash Iron-M... | 1 | UuugEl86ESY | 2681 | 23 | Lego superheroes,lego hulk,hulk smash,lego mar... | Lego Marvel SuperHeroes Lego Hulk Smash Iron-M... | 2016-09-28 00:00:00 | 12894 |


In [2]:
# load pre-trained CLIP from transformers and move the model to the GPU

from transformers import CLIPProcessor, CLIPModel
import torch

model_im = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor_im = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_im.to(torch.device('cuda'));

The `yt_metadata_en.jsonl.gz` file is very large and needs to be processed in chunks. To make the computation interruptible with minimal loss, we write the results to several files, one per "CPU chunk". Within each CPU chunk, the data is further divided into smaller "GPU chunks" that fit into GPU memory. For each video in a GPU chunk, we use the model to project its title and tags onto the latent space and compute the representative as described above. Splitting the computation into CPU chunks lets us handle the dataset conveniently, stop the computation, and resume it from the last file written.
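
The resume and subchunking logic can be sketched without a GPU or the dataset. The helpers `resume_index` and `subchunk_bounds` are hypothetical names introduced here for illustration; the `features_part_{i}.csv` naming follows the cell below. The highest-numbered part is recomputed rather than skipped, because it may have been interrupted halfway.

```python
import re

PART_RE = re.compile(r"features_part_(\d+)\.csv$")

def resume_index(filenames):
    """Index of the CPU chunk to restart from: the highest part on disk, else 0."""
    parts = [int(m.group(1)) for m in map(PART_RE.search, filenames) if m]
    return max(parts, default=0)

def subchunk_bounds(n_rows, gpu_chunksize):
    """(start, stop) row ranges that split one CPU chunk into GPU-sized pieces."""
    return [(s, min(s + gpu_chunksize, n_rows)) for s in range(0, n_rows, gpu_chunksize)]

print(resume_index(["features_part_0.csv", "features_part_1.csv", "notes.txt"]))  # 1
print(subchunk_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Deriving the bounds from the actual number of rows also covers the final CPU chunk, which is usually shorter than the nominal chunk size.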

In [3]:
import torch
import csv
from math import log10, ceil
import os
import re

# Resume from the last part file written so far; it may be incomplete, so it is recomputed.
os.makedirs('features', exist_ok=True)
part_pattern = re.compile(r"part_([0-9]+)\.csv$")
files = sorted(int(m.group(1)) for m in map(part_pattern.search, os.listdir('features')) if m)
start_index = files[-1] if len(files) > 0 else 0

cpu_chunksize = 10000
gpu_chunksize = 1
tags_batch_size = 15

# Every full CPU chunk must split evenly into GPU chunks.
assert cpu_chunksize % gpu_chunksize == 0
tot_gpu_chunks = cpu_chunksize // gpu_chunksize

def get_features(texts):
    """Project a list of strings onto the CLIP latent space and return the result on the CPU."""
    encoded = processor_im(text=texts, padding=True, truncation=True, return_tensors="pt").to('cuda')
    with torch.no_grad():  # inference only, so no autograd graph is kept in GPU memory
        result = model_im.get_text_features(**encoded)
    return result.cpu()

display_handle_cpu = display("", display_id=True)
display_handle_gpu = display("", display_id=True)
data = pd.read_json('data/yt_metadata_en.jsonl.gz', compression='gzip', chunksize=cpu_chunksize, lines=True, dtype={"tags": pd.StringDtype()})

for i, cpu_chunk in enumerate(data):
    display_handle_cpu.update(f"Processing chunk {i+1:3d} (processed {cpu_chunksize*(i+1):8d} rows)")

    # Skip the chunks that were completed in a previous run.
    if i < start_index:
        continue

    # The last CPU chunk of the file may contain fewer than cpu_chunksize rows.
    n_gpu_chunks = ceil(len(cpu_chunk) / gpu_chunksize)

    with open(f"features/features_part_{i}.csv", 'w', newline="") as f_part:
        part_writer = csv.writer(f_part)

        for j in range(n_gpu_chunks):
            chunk = cpu_chunk.iloc[gpu_chunksize * j : gpu_chunksize * (j+1)]

            # mem_get_info() returns (free, total) in bytes for the current device.
            free, tot = torch.cuda.mem_get_info()
            display_handle_gpu.update(f"Processing subchunk {j+1:{int(ceil(log10(tot_gpu_chunks)))}d}/{n_gpu_chunks} - memory allocated: {(tot - free) / tot * 100}% ({(tot - free) / 1024**3:.2f}/{tot / 1024**3:.2f} GB)")

            # Strings are cut to 77 characters; the tokenizer then truncates to CLIP's token limit.
            title_features = get_features(chunk['title'].str[:77].to_list())

            # Missing tags are treated as an empty tag string.
            tag_series = chunk['tags'].fillna("").str.split(",").apply(lambda xs: [x.strip()[:77] for x in xs])
            # Embed each video's tags in batches, then average them into one vector per video.
            tags_features = torch.stack([
                torch.cat([get_features(video_tags[b:b + tags_batch_size]) for b in range(0, len(video_tags), tags_batch_size)], dim=0).mean(dim=0)
                for video_tags in tag_series
            ])

            # Representative: weighted average of the title and the mean tag embedding.
            text_features = 0.6 * title_features + 0.4 * tags_features
            part_writer.writerows(text_features.numpy())

'Processing chunk 173 (processed  1730000 rows)'

'Processing subchunk  566/10000 - memory allocated: 87.27461391044523% (3.49/4.00 GB)'