epfl-dlab/youtube-embeddings

YouTube topics

The goal of this repository is to create embeddings for YouTube channels. These embeddings can be used as-is for content similarity, and can also be used to extract social dimensions.
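As a minimal sketch of the "content similarity" use, the snippet below ranks channels by cosine similarity to a query channel. The channel names and 3-dimensional vectors are made up for illustration, and loading the real vectors from the repository's files is left out because their format is not described here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, k=2):
    """Return the k channels closest to `query`, most similar first."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical toy vectors; real embeddings have many more dimensions.
toy_embeddings = {
    "cooking_a": [0.9, 0.1, 0.0],
    "cooking_b": [0.8, 0.2, 0.1],
    "gaming_a": [0.0, 0.1, 0.9],
}

neighbours = most_similar("cooking_a", toy_embeddings, k=2)
```

Cosine similarity ignores vector length, which is usually what you want when comparing embeddings of channels with very different amounts of data.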

Types of Embeddings

We propose three types of embeddings.

  • Social Sharing / Reddit embedding: built from shares of YouTube videos on Reddit (using Pushshift data).
  • Content embedding: built from video titles and descriptions, fed through a Sentence Transformer.
  • Recommendation embedding: built by recording the recommendations YouTube provides to a user with no watch history, then computing a node embedding from them.
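To make the idea behind the social sharing embedding concrete, here is one simple way share events could be turned into channel vectors: count how often each channel is shared in each subreddit and normalise the counts. This is an illustrative assumption, not the method used in the notebooks, and the subreddit and channel names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def share_vectors(shares):
    """Turn (subreddit, channel) share events into L2-normalised count vectors.

    `shares` is a list of pairs. Each channel gets one coordinate per
    subreddit, in sorted subreddit order.
    """
    counts = defaultdict(Counter)
    for subreddit, channel in shares:
        counts[channel][subreddit] += 1

    subreddits = sorted({subreddit for subreddit, _ in shares})
    vectors = {}
    for channel, per_subreddit in counts.items():
        raw = [per_subreddit[s] for s in subreddits]
        norm = math.sqrt(sum(x * x for x in raw))  # never zero: >= 1 share
        vectors[channel] = [x / norm for x in raw]
    return subreddits, vectors


# Hypothetical share events, for illustration only.
toy_shares = [
    ("r/cooking", "chef_channel"),
    ("r/cooking", "chef_channel"),
    ("r/food", "chef_channel"),
    ("r/gaming", "speedrun_channel"),
]

subreddits, vectors = share_vectors(toy_shares)
```

Channels shared in the same communities end up with similar vectors, which is the intuition a sharing-based embedding relies on.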

The embeddings for our filtered set of 40K channels are available in the embeds/ folder.

Similarly, the social dimensions are available in the dims/ folder.

Recreating embeddings

Create a conda (or mamba) environment using conda env create -f environment.yml. This creates an environment named ytb with all the libraries needed to run the code. If you get an error, you may need to upgrade conda first (conda upgrade conda).

The repository uses jupytext for notebook version control, so notebooks are saved in Markdown format. This keeps them readable on GitHub and strips their output.
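If you prefer to work with .ipynb files, the Markdown notebooks can be converted back with the jupytext command-line tool (`jupytext --to ipynb <file>`). The helper below is a small sketch that builds one such command per Markdown file in a folder and runs them only when jupytext is installed. It assumes the notebooks sit directly in the folder you pass and that every .md file there is a notebook, which may not hold (for example, for a README.md).

```python
import shutil
import subprocess
from pathlib import Path


def jupytext_commands(folder):
    """Build one `jupytext --to ipynb` command per Markdown file in `folder`."""
    return [
        ["jupytext", "--to", "ipynb", str(path)]
        for path in sorted(Path(folder).glob("*.md"))
    ]


def convert_notebooks(folder):
    """Run the commands if the jupytext CLI is on PATH; return how many ran."""
    if shutil.which("jupytext") is None:
        return 0  # jupytext is not installed in this environment
    commands = jupytext_commands(folder)
    for command in commands:
        subprocess.run(command, check=True)
    return len(commands)
```

For example, `convert_notebooks("generate_embeddings")` would write an .ipynb file next to each Markdown notebook in that folder.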

All of the notebooks for recreating the embeddings are in the generate_embeddings/ folder. Please note that getting everything to work will take some effort. In particular, the notebooks assume you have already extracted all links to YouTube from Reddit comments and submissions (the PySpark code for extracting them is not public, at least not yet).
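Since the extraction code is not included, here is a rough, single-machine sketch of what pulling YouTube video IDs out of Reddit text might look like. It is not the authors' PySpark code. The regular expression is an assumption that covers only the common youtube.com/watch?v= and youtu.be/ URL shapes and relies on the observed convention of 11-character video IDs. Mapping video IDs to channels is a separate step that is not shown.

```python
import re

# Assumed URL shapes: youtube.com/watch?...v=<id> and youtu.be/<id>.
YOUTUBE_ID = re.compile(
    r"(?:youtube\.com/watch\?(?:[^\s#]*&)?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def extract_video_ids(text):
    """Return the YouTube video IDs found in `text`, in order of appearance."""
    return YOUTUBE_ID.findall(text)


# Hypothetical comment body, for illustration only.
sample_comment = (
    "Watch https://www.youtube.com/watch?v=dQw4w9WgXcQ and "
    "https://youtu.be/9bZkp7q19f0?t=5, but not https://example.com/video"
)

video_ids = extract_video_ids(sample_comment)
```

On the full Pushshift dumps the same function could be applied to each comment body and submission URL, for instance inside a distributed job.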


Unfortunately, the Pushshift dumps currently appear to be inaccessible at https://files.pushshift.io/ (although a torrent still seems to be available). According to this post, Reddit revoked Pushshift's access, so more recent posts cannot be included in datasets.

Notebook & methods diagram
