## 1. inspiration from
* https://stackoverflow.com/questions/71894141/how-to-get-the-search-list-from-pytube-on-python
* https://pytube.io/en/latest/user/quickstart.html
* https://codeium.com/

## 2. initialization

Videos can be identified in three ways:
- by `video_id` (a single video)
- by `playlist_id` (every video in a playlist)
- by a free-text search query

In [1]:
from pytube import YouTube, Playlist, Search
import pandas as pd

In [2]:
video_id = "JctmnczWg0U"
playlist_id = "PLsWyhklHwjExuXrXjJktcdYkCFL0PNdW7"
search_id = "Manolis Kellis Lex Friedman"

In [3]:
y = YouTube(f"https://www.youtube.com/watch?v={video_id}")
p = Playlist(f"https://www.youtube.com/playlist?list={playlist_id}")
s = Search(search_id)
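
The objects in cell 3 are built from URLs assembled by string interpolation. As a minimal sketch, the same kinds of URLs can be built with the standard library so that the query string is always escaped correctly. The helper names `watch_url` and `playlist_url` are hypothetical and not part of pytube.

```python
from urllib.parse import urlencode

def watch_url(video_id: str) -> str:
    """Hypothetical helper: standard watch URL for a single video ID."""
    return "https://www.youtube.com/watch?" + urlencode({"v": video_id})

def playlist_url(playlist_id: str) -> str:
    """Hypothetical helper: playlist URL for a playlist ID."""
    return "https://www.youtube.com/playlist?" + urlencode({"list": playlist_id})

print(watch_url("JctmnczWg0U"))
print(playlist_url("PLsWyhklHwjExuXrXjJktcdYkCFL0PNdW7"))
```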

## 3. helper functions

In [4]:
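def vl(x):
    """Return the videos behind x as a list (empty if x is unsupported)."""
    if isinstance(x, Playlist):
        return list(x.videos)
    if isinstance(x, Search):
        return x.results
    if isinstance(x, YouTube):
        return [x]
    if isinstance(x, list):  # already a list of videos, as y2a's docstring allows
        return x
    print(f"Invalid type, x={x}; expected YouTube, Search, Playlist, or a list of videos")
    return []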

In [5]:
def y2a(x, n=2, output_path="../audio"):
    """Download the audio stream of the videos behind x.

    x can be a Playlist, Search, YouTube, or a list of videos.
    Only the first n videos are downloaded (default: 2).
    output_path is the directory the audio is saved to (default: ../audio).
    """
    for v in vl(x)[:n]:
        try:
            # take the first audio-only stream and save it to output_path
            v.streams.filter(only_audio=True).first().download(output_path=output_path)
            print(f"DOWNLOADED -> {v.title}")
        except Exception:
            # any failure (no stream, network error, ...) skips this video
            print(f"SKIPPED -> {v.title}")
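
The loop in `y2a` prints which videos were skipped but not why. As a sketch under stated assumptions, a variant of the same loop can return the failures together with the error that caused them. `download_all` and `fake_download` are hypothetical names, and the downloader is a stand-in so the example runs without network access.

```python
def download_all(videos, download, n=2):
    """Hypothetical variant of the y2a loop: returns (downloaded, skipped).

    download is any callable that takes one video and raises on failure.
    """
    downloaded, skipped = [], []
    for v in videos[:n]:
        try:
            download(v)
            downloaded.append(v)
        except Exception as e:
            # keep the reason next to the item that failed
            skipped.append((v, repr(e)))
    return downloaded, skipped

# Stand-in downloader: fails for one item, succeeds for the others.
def fake_download(name):
    if name == "bad":
        raise RuntimeError("no audio stream")

ok, failed = download_all(["a", "bad", "c"], fake_download, n=3)
print(ok)
print(failed)
```

With real pytube objects, `download` could wrap the call already used in `y2a`, for example `lambda v: v.streams.filter(only_audio=True).first().download(output_path="../audio")`.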

In [6]:
def vm(videos):
    """Return a DataFrame with one row of metadata per video, indexed by video_id."""
    out = {}
    for v in videos:
        out[v.video_id] = {
            "author": v.author, "title": v.title,
            "publish_date": v.publish_date,
            "length": v.length, "views": v.views,
            "rating": v.rating, "watch_url": v.watch_url,
            "keywords": v.keywords
        }
    # the dict is keyed by video_id, so transpose to get one row per video
    return pd.DataFrame(out).T
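
Because `vm` keys its dictionary by `video_id`, a video that appears more than once in the input (for example in both the search results and a playlist) ends up as a single row. A minimal sketch of that behaviour, using `SimpleNamespace` stand-ins rather than real pytube objects:

```python
from types import SimpleNamespace

# Stand-ins for pytube YouTube objects; only the attributes used here are set.
videos = [
    SimpleNamespace(video_id="wMavKrA-4do", views=583652),
    SimpleNamespace(video_id="brslF-Cy3HU", views=370872),
    SimpleNamespace(video_id="wMavKrA-4do", views=583652),  # same video, seen twice
]

out = {}
for v in videos:
    out[v.video_id] = {"views": v.views}  # a later duplicate overwrites the earlier entry

print(len(videos), len(out))
```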

## 4. download data

In [7]:
# assemble list of videos
all_videos = vl(y)+vl(s)+vl(p)
# select first 3 videos
top3 = all_videos[:3]

Unexpected renderer encountered.
Renderer name: dict_keys(['showingResultsForRenderer'])
Search term: Manolis Kellis Lex Friedman
Please open an issue at https://github.com/pytube/pytube/issues and provide this log output.
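
The lines above look like warnings from pytube's search module about a result type it does not parse; the later cells still run on the videos that were returned. Assuming pytube emits these messages through Python's standard `logging` module under logger names that start with `pytube` (an assumption worth checking against the installed version), a sketch for silencing them is:

```python
import logging

# Assumption: pytube's modules log under names starting with "pytube",
# so raising the level on that parent logger hides their warnings.
logging.getLogger("pytube").setLevel(logging.ERROR)
```

This only hides the messages. It does not change which search results are returned.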


In [8]:
# download audio for each video in the list
for x in top3:
    y2a(x)

DOWNLOADED -> AI Blog Post Summarization with Hugging Face Transformers & Beautiful Soup Web Scraping
DOWNLOADED -> Manolis Kellis: Evolution of Human Civilization and Superintelligent AI | Lex Fridman Podcast #373
SKIPPED -> Manolis Kellis: Human Genome and Evolutionary Dynamics | Lex Fridman Podcast #113


In [9]:
# download audio for up to the first 5 videos of the playlist
y2a(p, n=5)

DOWNLOADED -> Git and GitHub for Beginners Tutorial
DOWNLOADED -> What Is GitLab Pipeline? | How To Create GitLab Pipeline | GitLab Tutorial For Beginners | Part V
DOWNLOADED -> How to use Git inside of VSCode - 2020


## 5. download metadata

In [10]:
# list of watch URLs from a video list
[v.watch_url for v in top3]

['https://youtube.com/watch?v=JctmnczWg0U',
 'https://youtube.com/watch?v=wMavKrA-4do',
 'https://youtube.com/watch?v=brslF-Cy3HU']

In [11]:
# list of watch urls with titles from a video list
[f"{v.watch_url} {v.title}" for v in top3]

['https://youtube.com/watch?v=JctmnczWg0U AI Blog Post Summarization with Hugging Face Transformers & Beautiful Soup Web Scraping',
 'https://youtube.com/watch?v=wMavKrA-4do Manolis Kellis: Evolution of Human Civilization and Superintelligent AI | Lex Fridman Podcast #373',
 'https://youtube.com/watch?v=brslF-Cy3HU Manolis Kellis: Human Genome and Evolutionary Dynamics | Lex Fridman Podcast #113']

In [12]:
df = vm(all_videos)
df.head(3)

| video_id | author | title | publish_date | length | views | rating | watch_url | keywords |
|---|---|---|---|---|---|---|---|---|
| JctmnczWg0U | Nicholas Renotte | AI Blog Post Summarization with Hugging Face T... | 2021-02-17 | 1980 | 13004 | | https://youtube.com/watch?v=JctmnczWg0U | [web scraping, python web scraping, web scrapi... |
| wMavKrA-4do | Lex Fridman | Manolis Kellis: Evolution of Human Civilizatio... | 2023-04-21 | 9028 | 583652 | | https://youtube.com/watch?v=wMavKrA-4do | [agi, ai, ai podcast, artificial, artificial i... |
| brslF-Cy3HU | Lex Fridman | Manolis Kellis: Human Genome and Evolutionary ... | 2020-07-31 | 8964 | 370872 | | https://youtube.com/watch?v=brslF-Cy3HU | [] |


In [13]:
# visualize the top 5 videos: title and views
df[['views','title']].sort_values('views',ascending=False).head(5)

| video_id | views | title |
|---|---|---|
| wMavKrA-4do | 583652 | Manolis Kellis: Evolution of Human Civilizatio... |
| tRZGeaHPoaw | 581705 | Git and GitHub for Beginners Tutorial |
| t06rkOOUa7g | 513194 | Manolis Kellis: Origin of Life, Humans, Ideas,... |
| brslF-Cy3HU | 370872 | Manolis Kellis: Human Genome and Evolutionary ... |
| F2DBSH2VoHQ | 247737 | How to use Git inside of VSCode - 2020 |
