For background, see this tutorial - https://towardsdatascience.com/the-easiest-way-to-download-youtube-videos-using-python-2640958318ab.

PyTube appears to be maintained again - install it with `pip install pytube`.

In [1]:
from pytube import YouTube
# misc
import os
import shutil
import math
import datetime
# plots
import matplotlib.pyplot as plt
%matplotlib inline
# image operation
import cv2

In [2]:
# Load a video via URL
video = YouTube("https://www.youtube.com/watch?v=ZOnzBSWIAzo")

In [3]:
# View the available video streams
video.streams.all()

[<Stream: itag="18" mime_type="video/mp4" res="360p" fps="30fps" vcodec="avc1.42001E" acodec="mp4a.40.2" progressive="True" type="video">,
 <Stream: itag="133" mime_type="video/mp4" res="240p" fps="30fps" vcodec="avc1.4d4015" progressive="False" type="video">,
 <Stream: itag="242" mime_type="video/webm" res="240p" fps="30fps" vcodec="vp9" progressive="False" type="video">,
 <Stream: itag="160" mime_type="video/mp4" res="144p" fps="30fps" vcodec="avc1.4d400c" progressive="False" type="video">,
 <Stream: itag="278" mime_type="video/webm" res="144p" fps="30fps" vcodec="vp9" progressive="False" type="video">,
 <Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">,
 <Stream: itag="249" mime_type="audio/webm" abr="50kbps" acodec="opus" progressive="False" type="audio">,
 <Stream: itag="250" mime_type="audio/webm" abr="70kbps" acodec="opus" progressive="False" type="audio">,
 <Stream: itag="251" mime_type="audio/webm" abr="160kbps" acodec

In [4]:
# Filter to show only mp4 streams
video.streams.filter(file_extension = "mp4").all()


[<Stream: itag="18" mime_type="video/mp4" res="360p" fps="30fps" vcodec="avc1.42001E" acodec="mp4a.40.2" progressive="True" type="video">,
 <Stream: itag="133" mime_type="video/mp4" res="240p" fps="30fps" vcodec="avc1.4d4015" progressive="False" type="video">,
 <Stream: itag="160" mime_type="video/mp4" res="144p" fps="30fps" vcodec="avc1.4d400c" progressive="False" type="video">,
 <Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">]

In [5]:
# Download a video stream using the itag ID
video.streams.get_by_itag(18).download()


'/home/ben/Projects/var_processor/Frege Russell & Modern Logic - A J Ayer.mp4'

In [6]:
# Define a wrapper class for accessing frame information
class FrameExtractor():
    '''
    Class used for extracting frames from a video file.
    '''
    def __init__(self, video_path):
        self.video_path = video_path
        self.vid_cap = cv2.VideoCapture(video_path)
        self.n_frames = int(self.vid_cap.get(cv2.CAP_PROP_FRAME_COUNT))
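        # int() truncates fractional frame rates (e.g. 29.97 becomes 29),
        # so durations derived from self.fps come out slightly too long
        self.fps = int(self.vid_cap.get(cv2.CAP_PROP_FPS))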
        
    def get_video_duration(self):
        duration = self.n_frames/self.fps
        print(f'Duration: {datetime.timedelta(seconds=duration)}')
        
    def get_n_images(self, every_x_frame):
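        # Frames 0, x, 2x, ... are saved, so the count is ceil(n_frames / x);
        # floor(...) + 1 over-counts by one when n_frames is a multiple of x
        n_images = math.ceil(self.n_frames / every_x_frame)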
        print(f'Extracting every {every_x_frame} (nd/rd/th) frame would result in {n_images} images.')
        
    def extract_frames(self, every_x_frame, img_name, dest_path=None, img_ext = '.jpg'):
        if not self.vid_cap.isOpened():
            self.vid_cap = cv2.VideoCapture(self.video_path)
        
        if dest_path is None:
            dest_path = os.getcwd()
        else:
            if not os.path.isdir(dest_path):
                os.makedirs(dest_path)  # also creates missing parent directories
                print(f'Created the following directory: {dest_path}')
        
        frame_cnt = 0
        img_cnt = 0

        while self.vid_cap.isOpened():
            
            success,image = self.vid_cap.read() 
            
            if not success:
                break
            
            if frame_cnt % every_x_frame == 0:
                img_path = os.path.join(dest_path, ''.join([img_name, '_', str(img_cnt), img_ext]))
                cv2.imwrite(img_path, image)  
                img_cnt += 1
                
            frame_cnt += 1
        
        self.vid_cap.release()
        cv2.destroyAllWindows()

In [7]:
path = '/home/ben/Projects/var_processor/Frege Russell & Modern Logic - A J Ayer.mp4'

In [8]:
fe = FrameExtractor(path)

In [9]:
print(fe.n_frames, fe.fps)

75694 29


In [10]:
fe.get_video_duration()

Duration: 0:43:30.137931


In [11]:
fe.get_n_images(every_x_frame=1000)

Extracting every 1000 (nd/rd/th) frame would result in 76 images.


So what do we need?
* A way to get synchronised audio and video streams.
* A way to get subtitles/captions.
* A possible way to stream the data rather than download.
* A way to search and access a large number of videos.

Let's start with the PyTube docs - https://python-pytube.readthedocs.io/en/latest/user/quickstart.html#working-with-streams.

## Audio

For mp4 there are separate audio and video streams (the ones marked `progressive="False"`). There are also Opus audio streams in webm containers.

We can download and process the audio separately, using the timing information to synchronise it with the video frames.
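As a rough sketch of what synchronising on timing information could look like, the helpers below map a video frame index to the range of audio samples played while that frame is on screen. The function names are hypothetical, and the audio sample rate (44100 Hz in the usage note) is an assumption; the real value has to be read from the downloaded audio file.

```python
def frame_to_seconds(frame_index, fps):
    """Return the timestamp (in seconds) at which a video frame starts."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_index / fps


def seconds_to_sample(seconds, sample_rate):
    """Return the index of the audio sample nearest to a timestamp."""
    return round(seconds * sample_rate)


def audio_window_for_frame(frame_index, fps, sample_rate):
    """Return (start, end) audio sample indices covered by one video frame."""
    start = seconds_to_sample(frame_to_seconds(frame_index, fps), sample_rate)
    end = seconds_to_sample(frame_to_seconds(frame_index + 1, fps), sample_rate)
    return start, end
```

For example, `audio_window_for_frame(30, 30, 44100)` gives the samples for the frame that starts one second into a 30 fps video. Using the exact (non-truncated) frame rate matters here, since small errors in `fps` accumulate over a long video.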

In [14]:
video.streams.filter(only_audio=True).all()

[<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">,
 <Stream: itag="249" mime_type="audio/webm" abr="50kbps" acodec="opus" progressive="False" type="audio">,
 <Stream: itag="250" mime_type="audio/webm" abr="70kbps" acodec="opus" progressive="False" type="audio">,
 <Stream: itag="251" mime_type="audio/webm" abr="160kbps" acodec="opus" progressive="False" type="audio">]

## Captions

PyTube can also retrieve caption information - but only for videos that have captions.

In [15]:
video.captions.all()


[<Caption lang="English (auto-generated)" code="a.en">]

In [19]:
caption = video.captions.get_by_language_code('a.en')

In [20]:
caption.xml_captions


'<?xml version="1.0" encoding="utf-8" ?><transcript><text start="0" dur="6.019">[Music]</text><text start="6.62" dur="5.83">a great deal of the philosophy that&amp;#39;s</text><text start="10.92" dur="4.2">going on today in the english-speaking</text><text start="12.45" dur="4.26">world in fact probably most of it can be</text><text start="15.12" dur="4.05">traced back through intermediate</text><text start="16.71" dur="4.34">developments to the work of two men got</text><text start="19.17" dur="4.41">flogged Frager and Bertrand Russell</text><text start="21.05" dur="4.63">working for the most part independently</text><text start="23.58" dur="4.38">of each other they laid the foundations</text><text start="25.68" dur="4.08">of modern logic but more than that</text><text start="27.96" dur="3.93">although the work on which they began</text><text start="29.76" dur="4.17">concerned chiefly the principles of</text><text start="31.89" dur="4.829">mathematics and the relationship between</tex

In [21]:
dir(caption)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'code',
 'download',
 'float_to_srt_time_format',
 'generate_srt_captions',
 'name',
 'url',
 'xml_caption_to_srt',
 'xml_captions']

In [22]:
srt_captions = caption.generate_srt_captions()
print(srt_captions)

1
00:00:00,000 --> 00:00:06,019
[Music]

2
00:00:06,620 --> 00:00:12,450
a great deal of the philosophy that's

3
00:00:10,920 --> 00:00:15,120
going on today in the english-speaking

4
00:00:12,450 --> 00:00:16,710
world in fact probably most of it can be

5
00:00:15,120 --> 00:00:19,170
traced back through intermediate

6
00:00:16,710 --> 00:00:21,050
developments to the work of two men got

7
00:00:19,170 --> 00:00:23,580
flogged Frager and Bertrand Russell

8
00:00:21,050 --> 00:00:25,680
working for the most part independently

9
00:00:23,580 --> 00:00:27,960
of each other they laid the foundations

10
00:00:25,680 --> 00:00:29,760
of modern logic but more than that

11
00:00:27,960 --> 00:00:31,890
although the work on which they began

12
00:00:29,760 --> 00:00:33,930
concerned chiefly the principles of

13
00:00:31,890 --> 00:00:36,719
mathematics and the relationship between

14
00:00:33,930 --> 00:00:39,059
mathematics and logic its implications

15
00:00:36,719 --> 00:00:41,

In [23]:
type(srt_captions)

str

So `generate_srt_captions` just returns one long string. But you can use the XML to get individual timings, via the `start` and `dur` attributes of each `<text>` element.

It's interesting that each caption lasts for about 4 to 6 seconds and contains multiple words, and that consecutive captions overlap in time. There is also no punctuation.
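A minimal sketch of pulling the individual timings out of the XML, using only the standard library. It assumes the XML has the shape shown in the `caption.xml_captions` output above (a `<transcript>` root with `<text start=... dur=...>` children); `parse_xml_captions` is a hypothetical helper name, not part of PyTube.

```python
import xml.etree.ElementTree as ET
from html import unescape


def parse_xml_captions(xml_string):
    """Parse caption XML into a list of dicts with start, end and text."""
    root = ET.fromstring(xml_string)
    captions = []
    for node in root.iter("text"):
        start = float(node.attrib["start"])
        # Treat a missing duration as zero rather than failing
        duration = float(node.attrib.get("dur", 0.0))
        # The text is doubly escaped (e.g. "&amp;#39;"), so unescape once more
        text = unescape(node.text or "")
        captions.append({"start": start, "end": start + duration, "text": text})
    return captions
```

Calling `parse_xml_captions(caption.xml_captions)` should then give one entry per caption, with times in seconds that can be compared directly against frame timestamps.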

## Multiple Videos

One thing we can do quickly is create a playlist and use that - https://python-pytube.readthedocs.io/en/latest/user/playlist.html.

### Searching

You can query the YouTube search page over HTTP using a library such as `urllib` - https://codefather.tech/blog/youtube-search-python/.
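The `urllib` approach might look roughly like the sketch below. It assumes the search results page still embeds links of the form `watch?v=<11-character id>`; that is an observation about the page markup rather than a documented interface, so it may break without notice. All function names here are hypothetical.

```python
import re
import urllib.parse
import urllib.request

SEARCH_URL = "https://www.youtube.com/results?search_query="
WATCH_URL = "https://www.youtube.com/watch?v="


def build_search_url(query):
    """Return the search page URL for a free-text query."""
    return SEARCH_URL + urllib.parse.quote_plus(query)


def extract_video_ids(html):
    """Return the unique video IDs found in a page, in order of appearance."""
    ids = re.findall(r"watch\?v=([\w-]{11})", html)
    return list(dict.fromkeys(ids))  # drop duplicates, keep order


def search_video_urls(query):
    """Fetch the search page and return watch URLs (requires network access)."""
    with urllib.request.urlopen(build_search_url(query)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return [WATCH_URL + video_id for video_id in extract_video_ids(html)]
```

The returned URLs can be passed straight to `YouTube(...)`. For anything beyond small experiments, the official API is likely to be the more stable option.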

Or use the official YouTube Data API from Python - https://developers.google.com/youtube/v3/quickstart/python.

Also check out this tutorial - https://towardsdatascience.com/automatic-speech-recognition-data-collection-with-youtube-v3-api-mask-rcnn-and-google-vision-api-2370d6776109.

### Streaming

To stream rather than download, you can extract YouTube URLs and pass them to a command-line media player using `subprocess` - https://stackoverflow.com/questions/49354232/how-to-stream-audio-from-a-youtube-url-in-python-without-download.

The main command-line media players are MPV and VLC. There are Python bindings for both:
* MPV - https://github.com/jaseg/python-mpv
* VLC - https://linuxconfig.org/how-to-play-audio-with-vlc-in-python
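Without using either set of bindings, handing a URL to a command-line player could be sketched as below. This assumes `mpv` is installed and on the `PATH`, and that it has youtube-dl or yt-dlp available to resolve YouTube page URLs; the helper names are hypothetical.

```python
import shutil
import subprocess


def build_mpv_command(url, audio_only=True):
    """Return the argument list for playing a URL with mpv."""
    command = ["mpv"]
    if audio_only:
        command.append("--no-video")  # play the audio track only
    command.append(url)
    return command


def play_stream(url, audio_only=True):
    """Play a URL with mpv, blocking until playback finishes."""
    if shutil.which("mpv") is None:
        raise RuntimeError("mpv was not found on PATH")
    return subprocess.run(build_mpv_command(url, audio_only), check=False)
```

This only plays the stream. Getting the decoded audio or frames back into Python for processing would need more work, which is where the bindings listed above come in.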

# Summary

The easiest way to go is to find or create a playlist containing days' worth of captioned videos, then download the video, audio and captions for each, and then process them.

This will, though, take up a lot of space (we might need an external hard drive). But setting up a streaming wrapper will take a LOT more effort.
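To get a feel for how much space is involved, a back-of-the-envelope estimate can be made from stream bitrates. The sketch below assumes 1 kbps = 1000 bits per second and a constant bitrate. The audio figure (128 kbps) comes from the stream listing above, but the listing gives no video bitrate, so any video value passed in is a placeholder assumption.

```python
def estimated_size_bytes(bitrate_kbps, duration_seconds):
    """Estimate file size in bytes for a constant-bitrate stream."""
    return bitrate_kbps * 1000 / 8 * duration_seconds


def estimated_size_gb(duration_hours, audio_kbps, video_kbps):
    """Estimate combined audio + video size in gigabytes (10**9 bytes)."""
    seconds = duration_hours * 3600
    total = estimated_size_bytes(audio_kbps + video_kbps, seconds)
    return total / 1e9
```

For instance, `estimated_size_gb(24, 128, 500)` estimates one day of material at a hypothetical 500 kbps video bitrate. Actual file sizes reported after download are the better guide.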