<a href="https://colab.research.google.com/github/faizanhemotra/yt-watch-later-analysis/blob/main/watch_later_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Background: I've amassed >2000 videos in my YouTube Watch Later playlist (the count was even higher, but I moved many 🐱 videos to dedicated playlists).
* Task: Analyse which videos are more likely to get added to the watch later playlist.
    * The outcome of this would be to reduce information overload.
* Plan:
    * Scrape YouTube and get a list. (Done)
    * Plot videos by channel. (Done)
    * List potential fields for analysis. (Done)
    * Find similarity between videos by title. (Done)
        * Investigate false positives and check how video descriptions can be used to reduce them. (Pending)
    * Assign granular categories based on similarities (manual process?). (Pending)
    * Choose a random video on the home page and assign the category. (Pending)
        * Check how likely it would be added to the watch later. (Pending)

### Scraping Guide
- Pre-requisites
1. [yt-dlp](https://github.com/yt-dlp/yt-dlp) for scraping
2. [PhantomJS](https://phantomjs.org/) [refer to yt-dlp docs for more info]
3. [jq](https://jqlang.github.io/jq/) for post-processing
4. Log in to YouTube in a browser with your credentials (I used the Brave browser)
- The fastest way is to use the flat-playlist flag:
    ```bat
    yt-dlp --cookies-from-browser brave --flat-playlist --output-na-placeholder "" --print-to-file %(title)#S,%(url)q,%(channel)#S,%(channel_url)q,%(duration)d,%(duration_string)q,%(view_count)d export-user-wl.csv https://www.youtube.com/playlist?list=WL
    ```
    - The obvious limitation is that you can't get tags and categories, which are nested. You can use the JSON dump and then process it:
        ```bat
        yt-dlp --cookies-from-browser brave --no-flat-playlist --dump-json https://www.youtube.com/playlist?list=WL | jq --slurp "[.[] | {title: .title, webpage_url: .webpage_url, description: .description, uploader: .uploader, channel_url: .channel_url, duration: .duration, duration_string: .duration_string, view_count: .view_count, age_limit: .age_limit, categories: .categories, tags: .tags, availability: .availability}]" > filtered_wl_dump.json
        ```

- Appendix
    1. get raw_dump
    ```bat
    yt-dlp --cookies-from-browser brave --no-flat-playlist --dump-json https://www.youtube.com/playlist?list=WL > full_wl_dump.json
    ```
    2. post-process json dump
    ```bat
    jq --slurp "[.[] | {title: .title, webpage_url: .webpage_url, description: .description, uploader: .uploader, channel_url: .channel_url, duration: .duration, duration_string: .duration_string, view_count: .view_count, age_limit: .age_limit, categories: .categories, tags: .tags, availability: .availability}]" full_wl_dump.json > filtered_wl_dump.json
    ```
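
If jq is not available, the same field selection can be sketched in Python. This is a minimal sketch, assuming the dump contains one JSON object per line (which is how `--dump-json` output is normally laid out); `filter_dump` and `FIELDS` are hypothetical names introduced here, not part of yt-dlp or jq.

```python
import json
from typing import Iterable, List

# Fields kept by the jq filter above
FIELDS = [
    "title", "webpage_url", "description", "uploader", "channel_url",
    "duration", "duration_string", "view_count", "age_limit",
    "categories", "tags", "availability",
]

def filter_dump(lines: Iterable[str], fields: List[str] = FIELDS) -> List[dict]:
    """Keep only `fields` from each JSON object (assumes one object per line)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        video = json.loads(line)
        # dict.get returns None for a missing key, mirroring jq's null
        records.append({field: video.get(field) for field in fields})
    return records

# Usage (assumes full_wl_dump.json exists in the working directory):
# with open("full_wl_dump.json", encoding="utf-8") as src:
#     filtered = filter_dump(src)
# with open("filtered_wl_dump.json", "w", encoding="utf-8") as dst:
#     json.dump(filtered, dst)
```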


## Exploratory Analysis

### Imports

In [1]:
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import string
import torch

from sklearn.metrics.pairwise import cosine_similarity
from transformers import (
    AutoTokenizer,
    AutoModel,
    PreTrainedTokenizer,
    PreTrainedModel
    )
from tqdm import tqdm
from typing import (
    Iterable,
    List,
    TypeAlias
    )

# Type hints
PandasSeries: TypeAlias = pd.Series
PandasDataFrame: TypeAlias = pd.DataFrame

In [3]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
df = pd.read_json('filtered_wl_dump.json')

In [6]:
df.shape, df.dtypes

((2386, 12),
 title              object
 webpage_url        object
 description        object
 uploader           object
 channel_url        object
 duration            int64
 duration_string    object
 view_count          int64
 age_limit           int64
 categories         object
 tags               object
 availability       object
 dtype: object)

In [7]:
# drop duplicate videos from the playlist
df = df.drop_duplicates(subset=['webpage_url'])

In [9]:
# Should've watched News & Politics when the videos were uploaded.
# I think I've got a few cat videos in watch later as well, needs a dedicated playlist :)
df['categories'].explode().value_counts()

Education                846
Science & Technology     426
Entertainment            318
News & Politics          209
People & Blogs           151
Howto & Style            118
Gaming                   106
Comedy                    88
Film & Animation          42
Travel & Events           34
Nonprofits & Activism     17
Music                     13
Sports                     7
Autos & Vehicles           6
Pets & Animals             5
Name: categories, dtype: int64

In [10]:
df_public = df.query('availability=="public"')
MIN_VIDEOS = 5
df_public_most_frequent = df_public[df_public.groupby('uploader')['uploader'].transform('size') > MIN_VIDEOS]

# Channels with more than MIN_VIDEOS (five) videos in Watch Later account for roughly half of the playlist,
# so only those channels are considered in the analysis below.
df_public_most_frequent_counts = (
    df_public_most_frequent['uploader']
    .value_counts().reset_index()
    )
df_public_most_frequent_counts['uploader'].sum()

1131
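
The column names produced by `value_counts().reset_index()` differ between pandas 1.x and 2.x, so the cell above and the `rename` in the plotting cell are tied to the pandas version this notebook ran on. A minimal sketch of the same channel filter with explicitly named columns, using a toy frame and an illustrative threshold (both are assumptions, not the notebook's data):

```python
import pandas as pd

# Toy stand-in for df_public; the real frame comes from filtered_wl_dump.json
toy = pd.DataFrame({"uploader": ["A"] * 3 + ["B"] * 2 + ["C"]})
min_videos = 2  # illustrative threshold (the notebook uses MIN_VIDEOS = 5)

# Number of rows per uploader, broadcast back onto every row
per_row_counts = toy.groupby("uploader")["uploader"].transform("size")
frequent = toy[per_row_counts > min_videos]

# Name both columns explicitly so the result does not depend on the pandas version
counts = (
    frequent["uploader"]
    .value_counts()
    .rename_axis("Channel")
    .reset_index(name="Videos")
)
print(counts)
```

With the columns named this way, the total is `counts["Videos"].sum()` and the bar chart can use `x='Channel', y='Videos'` directly.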

### Some plots

In [11]:
# Plot to see which channels have the most videos added to watch later.
channel_counts = (
    df_public_most_frequent_counts
    .rename(columns={'index': 'Channel', 'uploader': 'Videos'})
    )
fig = px.bar(channel_counts, x='Channel', y='Videos')
fig.update_layout(bargap=0.2)
fig.show()

In [12]:
# Let's inspect most frequent tags (>10)
# Surprised that there are so many history tags. Seems to be from a single channel.
fig = px.bar(df_public_most_frequent['tags'].explode().value_counts().reset_index().query('tags>10'), x='index', y='tags')
fig.update_layout(bargap=0.2)
fig.show()

In [13]:
# Why are there so many history tags?
history_tags = ['geography', 'geography (field of study)', 'world geography', 'map of the world', 'world map real size', 'world map with countries', 'world map', 'real life maps', 'real life lore geography', 'real life lore maps', 'world map is wrong']

# Query the Series to filter rows containing history tags
query_result = df_public_most_frequent['tags'].apply(lambda x: any(item in x for item in history_tags))

# Filter the Series based on the query result
filtered_series = df_public_most_frequent['tags'][query_result]

# Check channels with the history tags
df_public_most_frequent.loc[filtered_series.index]['uploader'].unique()

array(['RealLifeLore', 'Be Smart', 'Wendover Productions',
       'Half as Interesting'], dtype=object)

### Notes about tags:

* Maybe tags shouldn't be used for analysis, as many of them are near-duplicates (e.g. 'world map' and 'map of the world').

* Maybe in a future analysis the tags can be cleaned to remove these duplicates, and a meaningful column can be derived from them.
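
One possible first step for that cleaning is sketched below. It only normalises case and strips parenthetical qualifiers; `normalize_tags` is a hypothetical helper, and paraphrased tags such as 'world map' and 'map of the world' would still need a similarity-based approach.

```python
import re
from typing import List

def normalize_tags(tags: List[str]) -> List[str]:
    """Lowercase tags, drop parenthetical qualifiers, and de-duplicate in order."""
    seen = []
    for tag in tags:
        # "geography (field of study)" -> "geography"
        cleaned = re.sub(r"\s*\(.*?\)", "", tag).strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return seen

print(normalize_tags(["Geography", "geography (field of study)", "world map"]))
# ['geography', 'world map']
```

On the notebook's data this could be applied per video with `df['tags'].apply(normalize_tags)`.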

### Transformers to find similarity between video titles

In [28]:
def get_device_based_on_cuda_availability() -> torch.device:
    """Get the appropriate device based on the availability of CUDA.

    Returns
    -------
    torch.device
        The selected device (cuda or cpu).
    """
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def move_model_to_device(model: torch.nn.Module) -> torch.nn.Module:
    """Moves the given model to the appropriate device based on the availability of CUDA.

    Parameters
    ----------
    model : torch.nn.Module
        The model to be moved to the device.

    Returns
    -------
    torch.nn.Module
        The model moved to the selected device.
    """
    return model.to(get_device_based_on_cuda_availability())

def encode_texts(texts: pd.Series,
                 tokenizer: PreTrainedTokenizer,
                 model: PreTrainedModel,
                 batch_size: int = 16) -> np.ndarray:
    """Encode the given texts using a tokenizer and model.

    Parameters
    ----------
    texts : pd.Series
        The texts to encode.
    tokenizer : PreTrainedTokenizer
        The tokenizer to use for encoding.
    model : PreTrainedModel
        The model to use for encoding.
    batch_size : int, default 16
        The batch size for encoding, by default 16.

    Returns
    -------
    np.ndarray
        Array of shape (len(texts), hidden_size) with the [CLS] embedding of each text.
    """
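    device = get_device_based_on_cuda_availability()
    text_list = texts.tolist()
    batch_embeddings = []

    with torch.no_grad():
        try:
            # Encode batch_size texts per forward pass to bound memory use
            for start in tqdm(range(0, len(text_list), batch_size)):
                encoded_inputs = tokenizer(
                    text_list[start:start + batch_size],
                    padding=True,
                    truncation=True,
                    return_tensors='pt'
                )
                input_ids = encoded_inputs["input_ids"].to(device)
                attention_mask = encoded_inputs["attention_mask"].to(device)
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                # Use the hidden state of the first ([CLS]) token as the text embedding
                batch_embeddings.append(outputs.last_hidden_state[:, 0, :].cpu().numpy())
            return np.concatenate(batch_embeddings, axis=0)
        except Exception as e:
            raise RuntimeError(f"Error occurred during encoding: {str(e)}") from e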

def find_top_similar_texts(texts: pd.Series,
                           similarity_scores: np.ndarray,
                           threshold: float):
    """Find the top similar texts based on similarity scores.

    Parameters
    ----------
    texts : pd.Series
        The original texts.
    similarity_scores : np.ndarray
        The similarity scores between the texts.
    threshold : float
        The similarity threshold for considering texts as similar.

    Returns
    -------
    list
        The list of dictionaries containing the top similar texts and their similarity scores.
    """
    results = []
    for i, text in enumerate(texts):
        scores = similarity_scores[i, :]
        top_indices = scores.argsort()[::-1]
        top_scores = scores[top_indices]
        top_texts = texts.iloc[top_indices]

        text_results = {
            'Text': text,
            'Similarity': [],
            'Top Text': []
        }

        # Start at 0: with tied scores the top hit may not be the text itself (idx != i skips it)
        for j in range(len(top_texts)):
            idx = top_indices[j]
            score = top_scores[j]
            top_text = top_texts.iloc[j]
            if idx != i and score > threshold:
                text_results['Similarity'].append(score)
                text_results['Top Text'].append(top_text)

        if text_results['Similarity']:
            results.append(text_results)

    return results

In [29]:
# Example pandas series
texts = df['title']

# Load pre-trained BERT (bert-base-uncased) model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
move_model_to_device(model)

threshold = 0.97  # Similarity threshold

try:
    embeddings = encode_texts(texts, tokenizer=tokenizer, model=model)
    if embeddings is not None:
        similarity_scores = cosine_similarity(embeddings)
        results = find_top_similar_texts(texts, similarity_scores, threshold)

        # Create DataFrame from results dictionary
        df_results = pd.DataFrame(results)
    else:
        print("Text encoding failed.")
except RuntimeError as e:
    print(f"Error occurred: {str(e)}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
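
The similarity step above reduces to normalising each embedding and taking dot products. A minimal NumPy sketch on toy 2-D vectors (the vectors and the helper name `cosine_similarity_matrix` are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity` on BERT embeddings):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # assumes no all-zero rows
    return unit @ unit.T

# Toy 2-D "embeddings": rows 0 and 1 point in nearly the same direction
toy_embeddings = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
scores = cosine_similarity_matrix(toy_embeddings)

# Keep each unordered pair once (i < j) when its score exceeds the threshold
threshold = 0.97
rows, cols = np.triu_indices(len(scores), k=1)
pairs = [(int(i), int(j)) for i, j in zip(rows, cols) if scores[i, j] > threshold]
print(pairs)
# [(0, 1)]
```

Reading only the upper triangle of the score matrix yields every pair exactly once, so (a, b) and (b, a) never both appear.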


### Results post-processing

In [17]:
def remove_reciprocal_duplicates(df: PandasDataFrame, cols: List[str]) -> PandasDataFrame:
    """Remove reciprocal duplicates in a Pandas Dataframe.

    Parameters
    ----------
    df: PandasDataFrame
        Input dataframe.
    cols: list of str
        List of columns to remove duplicates from.

    Examples
    --------
    >>> example_df = pd.DataFrame([['a', 'spider'], [ 'spider', 'a'], ['ant', 'c'], ['d', 'aardvark']], columns=['A', 'B'])
    >>> print(example_df)
            A         B
    0       a    spider
    1  spider         a
    2     ant         c
    3       d  aardvark
    >>> remove_reciprocal_duplicates(example_df, ['A', 'B'])
         A         B
    0    a    spider
    2  ant         c
    3    d  aardvark
    """
    type_col1 = pd.api.types.infer_dtype(df[cols[0]])
    type_col2 = pd.api.types.infer_dtype(df[cols[1]])
    if type_col1 != type_col2:
        raise ValueError('Unequal data types')
    if type_col1.startswith('mixed'):
        raise ValueError('Does not work for mixed data types')
    transpose_and_sort = (df[cols].T.apply(sorted).T)
    duplicate_sorted_columns = transpose_and_sort.duplicated()
    return df.loc[~duplicate_sorted_columns]
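
The same idea can also be sketched with a row-wise NumPy sort instead of the transpose-and-apply step. `drop_reciprocal_pairs` is a hypothetical name, and the sketch assumes the values in `cols` are mutually comparable (for example, all strings):

```python
import numpy as np
import pandas as pd

def drop_reciprocal_pairs(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Drop rows whose values in `cols` repeat an earlier row in any order."""
    # Sort the values within each row so ('a', 'spider') and ('spider', 'a') match
    row_sorted = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
    return df.loc[~row_sorted.duplicated()]

example_df = pd.DataFrame(
    [['a', 'spider'], ['spider', 'a'], ['ant', 'c'], ['d', 'aardvark']],
    columns=['A', 'B'],
)
print(drop_reciprocal_pairs(example_df, ['A', 'B']))
```

On the docstring's example frame this keeps rows 0, 2 and 3, the same rows as `remove_reciprocal_duplicates`.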

In [30]:
df_results_exploded = df_results.explode(['Similarity', 'Top Text']).reset_index(drop=True)

In [31]:
# We can assess de-duplicated results now
remove_reciprocal_duplicates(df_results_exploded, ['Text', 'Top Text']).reset_index(drop=True)

| | Text | Similarity | Top Text |
|---|---|---|---|
| 0 | What Low Testosterone Does to the Body | 0.98638 | What Testosterone Does to the Body |
| 1 | What Low Testosterone Does to the Body | 0.970925 | What Sugar Really Does to the Body |
| 2 | What The Internet Gets Wrong About Philosophy | 0.970553 | Why you don’t hear about the ozone layer anymore |
| 3 | The Argument from Improbability DEBUNKED | 0.978391 | The Truth About Declawing Cats |
| 4 | The Argument from Improbability DEBUNKED | 0.976283 | The hidden history of “Hand Talk” |
| ... | ... | ... | ... |
| 239 | What Megalodon’s Teeth Say About Their Parenting | 0.983644 | How the Egg Came First |
| 240 | What Megalodon’s Teeth Say About Their Parenting | 0.974401 | I Was Challenged |
| 241 | I Was Challenged | 0.972628 | How the Egg Came First |
| 242 | How Not To Die From A Cardiac Arrest | 0.975998 | How Not To Die From Electrocution |


In [33]:
# 159 distinct titles have at least one match above the threshold, so there are potentially up to 159 categories.
df_results_exploded['Text'].nunique()

159
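
One way the pending "assign categories" step could start is to treat each similar pair as an edge and take connected components as candidate categories. This is a sketch under that assumption, not the notebook's method; `group_similar_titles` is a hypothetical helper, and the example pairs are taken from the results table above.

```python
from typing import Dict, Iterable, List, Tuple

def group_similar_titles(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group titles into connected components of the 'is similar to' relation."""
    parent: Dict[str, str] = {}

    def find(title: str) -> str:
        parent.setdefault(title, title)
        while parent[title] != title:
            parent[title] = parent[parent[title]]  # path halving
            title = parent[title]
        return title

    for left, right in pairs:
        parent[find(left)] = find(right)  # union the two components

    groups: Dict[str, List[str]] = {}
    for title in parent:
        groups.setdefault(find(title), []).append(title)
    return list(groups.values())

pairs = [
    ("What Low Testosterone Does to the Body", "What Testosterone Does to the Body"),
    ("What Low Testosterone Does to the Body", "What Sugar Really Does to the Body"),
    ("How Not To Die From A Cardiac Arrest", "How Not To Die From Electrocution"),
]
for group in group_similar_titles(pairs):
    print(group)
```

Because components are transitive, a single false-positive pair can merge two unrelated groups, so the grouping is only as good as the similarity threshold.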

### For future development

In [None]:
# some post-processing code for youtube video descriptions
def get_stop_words(lan: str = 'english') -> Iterable[str]:
    """Get nltk stop words for input language."""
    return nltk.corpus.stopwords.words(lan)

def stop_words_regex(stop_words: Iterable[str]) -> str:
    """Get regex for replacing stop word given stop word list, set or any iterable."""
    return r'\b(?:{})\b'.format('|'.join(stop_words))

def remove_english_stop_words(pandas_series: PandasSeries, pattern: str) -> PandasSeries:
    """Use regex and replace stop words with an empty string for a pandas series."""
    return pandas_series.str.replace(pattern, '', regex=True, case=False)

def remove_links(pandas_series: PandasSeries) -> PandasSeries:
    """Use regex to remove links from descriptions."""
    return pandas_series.str.replace(r'https?://\S+', '', regex=True)

def remove_non_ascii_chars(pandas_series: PandasSeries, pattern: str = r"[^\x00-\x7F]") -> PandasSeries:
    """Use regex to remove non-ASCII characters."""
    return pandas_series.str.replace(pattern, "", regex=True)

# clean the descriptions of youtube videos
english_stop_words_regex = stop_words_regex(get_stop_words('english'))
df['description_processed'] = (
    remove_english_stop_words(
        remove_links(
            remove_non_ascii_chars(
                df['description'])
            ), english_stop_words_regex
        )
    )
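
The effect of the cleaning chain can be checked on a toy description. This sketch uses a hypothetical three-word stop list so it runs without the nltk download, and it adds a whitespace-collapsing step that the cell above does not have:

```python
import pandas as pd

# Hypothetical three-word stop list; the notebook uses nltk's English list instead
toy_stop_words = ["the", "a", "to"]
toy_pattern = r'\b(?:{})\b'.format('|'.join(toy_stop_words))

toy_descriptions = pd.Series(["Subscribe to the channel 🐱 https://example.com/sub"])

cleaned = (
    toy_descriptions
    .str.replace(r'[^\x00-\x7F]', '', regex=True)           # drop non-ASCII characters
    .str.replace(r'https?://\S+', '', regex=True)           # drop links
    .str.replace(toy_pattern, '', regex=True, case=False)   # drop stop words
    .str.replace(r'\s+', ' ', regex=True)                   # collapse leftover whitespace (added step)
    .str.strip()
)
print(cleaned[0])
# Subscribe channel
```

Removing stop words leaves runs of spaces behind, which is why the sketch collapses whitespace at the end.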