# Dazbo's YouTube Demos

## Overview

Examples of how to work with YouTube using Python. Here I'll demonstrate:

- How to [download videos and extract audio](#downloading-videos-and-extracting-audio)
- How to transcribe audio to text.

**To run this notebook, first execute the cells in the [Setup](#Setup) section, as described below.** Then you can experiment with any of the subsequent cells.

A few useful notes:

- The source for this notebook lives in my GitHub repo, <a href="https://github.com/derailed-dash/dazbo-python-demos" target="_blank">Dazbo-Python-Demos</a>.
- Check out further guidance, including tips on how to run the notebook, in the project's `README.md`.
- For example, you could...
  - Run the notebook locally, in your own Jupyter environment.
  - Run the notebook in a cloud-based Jupyter environment, with no setup required on your part!  For example, <a href="https://colab.research.google.com/github/derailed-dash/dazbo-python-demos/blob/main/notebooks/youtube-demos.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Google Colab"/></a>
- For more ways to run Jupyter Notebooks, check out [my guide](https://medium.com/python-in-plain-english/five-ways-to-run-jupyter-labs-and-notebooks-23209f71e5c0).


## Setup

First, let's install any dependent packages:

In [None]:
%pip install --upgrade --no-cache-dir dazbo-commons pytubefix moviepy yt_dlp

In [2]:
import logging
import re
from pathlib import Path
from dataclasses import dataclass
import dazbo_commons as dc

Now we'll set up logging. Here I'm using coloured logging from my [dazbo-commons](https://pypi.org/project/dazbo-commons/) package. Feel free to change the logging level.

In [None]:
# Setup logging
APP_NAME="dazbo-yt-demos"
logger = dc.retrieve_console_logger(APP_NAME)
logger.setLevel(logging.DEBUG)
logger.info("Logger initialised.")
logger.debug("DEBUG level logging enabled.")

Here we initialise some file path locations, e.g. an output folder.

In [None]:
locations = dc.get_locations(APP_NAME)
for attribute, value in vars(locations).items():
    logger.debug(f"{attribute}: {value}")

Now some utility functions.

In [5]:
def clean_filename(filename):
    """ Create a clean filename by replacing disallowed characters with underscores. """
    # Keep ASCII letters, digits, dots, underscores, whitespace and hyphens
    pattern = r'[^a-zA-Z0-9._\s-]'
    return re.sub(pattern, '_', filename)
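
To make the behaviour of `clean_filename()` concrete, here is a small standalone sketch. It repeats the function so it can run on its own. The sample titles are just illustrations. Note that anything outside the allowed ASCII set, including accented letters, is replaced with an underscore.

```python
import re

def clean_filename(filename):
    """ Same logic as the utility function above, repeated so this runs standalone. """
    pattern = r'[^a-zA-Z0-9._\s-]'
    return re.sub(pattern, '_', filename)

# Illustrative titles only: ampersands, slashes, colons and question marks are replaced
for title in ["Jerry Heil & Alyona Alyona - Teresa & Maria", "AC/DC: Live?", "Caf\u00e9"]:
    print(f"{title!r} -> {clean_filename(title)!r}")
```

Whitespace and hyphens survive, so the cleaned names remain readable.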

## Downloading Videos and Extracting Audio

Here I'll demonstrate a few different Python libraries for working with YouTube videos.

In [6]:
# YouTube videos to download
urls = [
    "https://www.youtube.com/watch?v=udRAIF6MOm8",  # Sigrid - Burning Bridges
    "bla", # Test a bad URL
    "https://www.youtube.com/watch?v=CiTn4j7gVvY",  # Melissa Hollick - I Believe
    "https://www.youtube.com/watch?v=d4N82wPpdg8",  # Jerry Heil & Alyona Alyona - Teresa & Maria
]

### Option 1 - With PyTubeFix

Here I'll use the [pytubefix](https://github.com/JuanBindez/pytubefix) library to download YouTube videos, and then to download mp3 audio-only streams as files.

This library is a community-maintained fork of `pytube`. It was created to provide quick fixes for issues that the official pytube library faced, particularly when YouTube's updates break `pytube`.

Pros:

- The library is very easy to use.
- We can work with video, audio, channels, playlists, and even search and filter.
- It is [well documented](https://pytubefix.readthedocs.io/en/latest/).
- It can be used from the command line, with its simple CLI.
- It is VERY FAST!

Cons:

- Does not offer some of the more sophisticated capabilities that are offered by `yt_dlp`.
- It does not appear to set mp3 headers correctly in the audio files it creates, meaning that subsequent programmatic manipulation (e.g. converting to wav) is not as trivial as it could be!

In [None]:

from pytubefix import YouTube
from pytubefix.cli import on_progress

output_locn = f"{locations.output_dir}/pytubefix"
def process_yt_videos():
    for i, url in enumerate(urls):
        logger.info(f"Downloads progress: {i+1}/{len(urls)}")

        try:
            yt = YouTube(url, on_progress_callback=on_progress)
            logger.info(f"Getting: {yt.title}")
            video_stream = yt.streams.get_highest_resolution()
            if not video_stream:
                raise Exception("Stream not available.")
            
            # YouTube resource titles may contain special characters which 
            # can't be used when saving the file. So we need to clean the filename.
            cleaned = clean_filename(yt.title)
            
            video_output = f"{output_locn}/{cleaned}.mp4"
            logger.info(f"Downloading video {cleaned}.mp4 ...")
            video_stream.download(output_path=output_locn, filename=f"{cleaned}.mp4")
        
            logger.info(f"Creating audio...")
            audio_stream = yt.streams.get_audio_only()
            audio_stream.download(output_path=output_locn, filename=cleaned, mp3=True)
            
            logger.info("Done")
            
        except Exception as e:        
            logger.error(f"Error processing URL '{url}'.")
            logger.error(f"The cause was: {e}") 
            
    logger.info(f"Downloads finished. See files in {output_locn}.")
    
process_yt_videos()


### Option 2 - PyTubeFix and MoviePy

Here I'm doing the same as before, but I'm extracting the audio using the Python [MoviePy](https://github.com/Zulko/moviepy) library. This is a powerful video and audio editing library. 

Pros:

- We can extract audio without the broken mp3 headers.
- It is [well documented](https://zulko.github.io/moviepy/).
- It is powerful.

Cons:

- It is slower to extract the audio than using `pytubefix` alone.

In [None]:

from pytubefix import YouTube
from pytubefix.cli import on_progress
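try:
    from moviepy import VideoFileClip  # MoviePy 2.x removed the moviepy.editor module
except ImportError:
    from moviepy.editor import VideoFileClip  # MoviePy 1.x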

output_locn = f"{locations.output_dir}/pytubefix_with_moviepy"
def process_yt_videos():
    for i, url in enumerate(urls):
        logger.info(f"Downloads progress: {i+1}/{len(urls)}")

        try:
            yt = YouTube(url, on_progress_callback=on_progress)
            logger.info(f"Getting: {yt.title}")
            video_stream = yt.streams.get_highest_resolution()
            if not video_stream:
                raise Exception("Stream not available.")
            
            # YouTube resource titles may contain special characters which 
            # can't be used when saving the file. So we need to clean the filename.
            cleaned = clean_filename(yt.title)

            video_output = f"{output_locn}/{cleaned}.mp4"
            logger.info(f"Downloading video {cleaned}.mp4 ...")
            video_stream.download(output_path=output_locn, filename=f"{cleaned}.mp4")
        
            logger.info(f"Creating audio...")
            video_clip = VideoFileClip(video_output) # purely to give us access to methods
            video_clip.audio.write_audiofile(f"{output_locn}/{cleaned}.mp3")
            video_clip.close()
            
            logger.info("Done")
            
        except Exception as e:        
            logger.error(f"Error processing URL '{url}'.")
            logger.error(f"The cause was: {e}") 
            
    logger.info(f"Downloads finished. See files in {output_locn}.")
    
process_yt_videos()

### Option 3 - With YT_DLP

I wanted to try the other popular YouTube package: [yt-dlp](https://pypi.org/project/yt-dlp/). Its [repo](https://github.com/yt-dlp/yt-dlp) is a fork of `youtube-dl`, based on the now-inactive `youtube-dlc`.

Pros:

- It is very powerful, with far more options and features than `pytubefix`.
- It can be installed as a standalone command-line executable, or as a pip-installable Python package.
- Sets mp3 headers properly!

Cons:

- It is more complicated to use.
- The documentation is complex. And there's no real Python-specific documentation.
- It depends on having ffmpeg installed for many use cases.
- It is significantly slower than `pytubefix` for performing video download and audio extraction.


In [None]:
import yt_dlp

output_locn = f"{locations.output_dir}/yt_dlp"

def process_yt_videos():
    for i, url in enumerate(urls):
        logger.info(f"Downloads progress: {i+1}/{len(urls)}")

        try:
            # Options for downloading the video
            video_opts = {
                'format': 'best',  # Download the best quality video
                'outtmpl': f'{output_locn}/%(title)s.%(ext)s',  # Save video in output directory
            }
            
            # Download the video
            with yt_dlp.YoutubeDL(video_opts) as ydl:
                logger.info("Downloading video...")
                ydl.download([url])
            
            # Options for extracting audio and saving as MP3
            audio_opts = {
                'format': 'bestaudio',  # Download the best quality audio
                'outtmpl': f'{output_locn}/%(title)s.%(ext)s',  # Save audio in output directory
                'postprocessors': [{
                    'key': 'FFmpegExtractAudio',
                    'preferredcodec': 'mp3',
                }],
            }
            
            # Download and extract audio
            with yt_dlp.YoutubeDL(audio_opts) as ydl:
                logger.info("Extracting and saving audio as MP3...")
                ydl.download([url])
            
        except Exception as e:        
            logger.error(f"Error processing URL '{url}'.")
            logger.error(f"The cause was: {e}") 
            
    logger.info(f"Downloads finished. Check out files at {output_locn}.")
    
process_yt_videos()
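
The cell above downloads each URL twice: once for the video and once for the audio. As a possible alternative, here is a minimal sketch of an options dictionary that asks `yt_dlp` to extract the mp3 from the downloaded video and keep the video file too. The helper name `build_single_pass_opts` and the sample output folder are hypothetical. I have not timed this against the two-pass version, so treat it as something to experiment with rather than a recommendation.

```python
def build_single_pass_opts(output_dir: str) -> dict:
    """ Hypothetical helper: yt_dlp options for one download that yields video and mp3. """
    return {
        'format': 'best',  # Same format selection as the video download above
        'outtmpl': f'{output_dir}/%(title)s.%(ext)s',
        'restrictfilenames': True,  # Ask yt_dlp to avoid spaces and "&" in filenames
        'keepvideo': True,  # Keep the video file after audio extraction
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Requires ffmpeg to be installed
            'preferredcodec': 'mp3',
        }],
    }

opts = build_single_pass_opts("output/yt_dlp")  # Assumed folder, for illustration
print(opts['outtmpl'])

# To use it for real (needs network access and ffmpeg):
# import yt_dlp
# with yt_dlp.YoutubeDL(opts) as ydl:
#     ydl.download([url])
```

Keeping the options in a function makes it easy to reuse them with a different output folder.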

### Conclusion

If you:

- Want to just download the videos and/or audio in the simplest and fastest way possible, then go with [Option 1](#option-1---with-pytubefix).
- Want to download the videos and/or audio and then carry out some sort of manipulation or conversion of the media, go with [Option 2](#option-2---pytubefix-and-moviepy).

## Audio Conversion

### Install Additional Packages

- [ffmpeg](https://ffmpeg.org/): a useful utility for video and audio format conversion. Many Python libraries use it.

In [None]:
import os
import platform
import subprocess

def run_command(command):
    """Run a shell command and log its output in real-time."""
    process = subprocess.Popen(
        command, 
        shell=True, 
        stdout=subprocess.PIPE, 
        stderr=subprocess.STDOUT  # Merge stderr into stdout, so an unread stderr pipe can't block the command
    )
    
    # Read and log the output line by line
    if process.stdout is not None:
        for line in iter(process.stdout.readline, b''):
            logger.info(line.decode(errors="replace").strip())
        process.stdout.close()
        
    process.wait()
    
def install_software():
    os_name = platform.system()
    logger.info(f"Installing packages on {os_name}...")
    
    if os_name == "Windows":
        run_command("winget install ffmpeg --no-upgrade")
    elif os_name == "Linux":
        run_command("apt-get -qq update && apt-get -qq -y install ffmpeg")
    elif os_name == "Darwin":
        run_command("brew install ffmpeg")
    else:
        logger.error(f"Unsupported operating system: {os_name}")

install_software()

logger.info("Note that installed applications may not be immediately available after first installing.\n" \
            "It may be necessary to relaunch the notebook environment.")

Check ffmpeg version. 

On Windows, this may not have been added to your path. If so, you can check your default install location using `winget --info`, and then add it to your path.
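
If you would rather check for ffmpeg from Python than from the shell, one option is the standard library's `shutil.which()`. This is a minimal sketch. The helper name `find_executable` is my own and not part of any library.

```python
import shutil

def find_executable(name: str = "ffmpeg"):
    """ Return the full path to the executable if it is on the PATH, otherwise None. """
    return shutil.which(name)

ffmpeg_path = find_executable("ffmpeg")
if ffmpeg_path:
    print(f"ffmpeg found at {ffmpeg_path}")
else:
    print("ffmpeg was not found on the PATH. You may need to add it, or relaunch the notebook environment.")
```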

In [None]:
!ffmpeg -version

## Transcribing Audio to Text

### Extracting Audio Using Python Speech Recognition

_Note: the code below requires good mp3 headers.  So it only works with options 2 and 3._

The Python `speech_recognition` package has a number of built-in `Recognizer` implementations. Here I'm using the [Google Web Speech API](https://wicg.github.io/speech-api/) `Recognizer`, which has its default API key hard-coded into the Python `speech_recognition` library. It is free, but has some limitations. For example, it only accepts audio segments of up to 60s.

It has limited speech recognition capability. Also, it's not going to natively detect language. So our next step will be to add some more smarts!
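
Because of the 60s limit, the audio has to be split into segments before it is sent for recognition. `pydub` measures and slices audio in milliseconds, so the arithmetic looks like the sketch below. It uses plain integers rather than real audio, and the function name `chunk_bounds` is purely illustrative.

```python
def chunk_bounds(total_ms: int, segment_size_secs: int = 60):
    """ Return (start, end) millisecond offsets for each segment of the audio. """
    segment_ms = segment_size_secs * 1000
    return [(start, min(start + segment_ms, total_ms))
            for start in range(0, total_ms, segment_ms)]

# A 2m30s track splits into two full segments and one 30s remainder
print(chunk_bounds(150_000))
# [(0, 60000), (60000, 120000), (120000, 150000)]
```

The final segment is simply shorter than the rest, so no audio is dropped.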

In [None]:
%pip install --upgrade --no-cache-dir pydub SpeechRecognition ffmpeg-python

In [None]:
import speech_recognition as sr
from pydub import AudioSegment
import ffmpeg
import traceback
from io import BytesIO

In [None]:
def divide_chunks(sound, segment_size_secs=60):
    """ Split an AudioSegment into chunks of segment_size_secs seconds (default 60). """
    segment_size = segment_size_secs*1000  # pydub slices audio in milliseconds
    for i in range(0, len(sound), segment_size):
        yield sound[i:i + segment_size]

def transcribe_audio():
    recogniser = sr.Recognizer()        
    for mp3_file in Path(output_locn).glob(f'*.mp3'):
        logger.info(f"Converting {mp3_file}...")
        try:
            audio = AudioSegment.from_file(mp3_file)
            # If AudioSegment is not working - e.g. due to broken mp3 headers - we
            # can use ffmpeg as a workaround. However, it's a lot slower.
            # ffmpeg.input(mp3_file).output(wav_file).run() # Convert with ffmpeg
            # logger.info(f"Successfully converted {mp3_file} to {wav_file}.")
            # audio = AudioSegment.from_wav(wav_file) # Read the audio

            segments = list(divide_chunks(audio, segment_size_secs=60)) # split the wav into 60s segments     
            transcription_extracts = {}
            for index, chunk in enumerate(segments):
                with BytesIO() as wav_io:
                    chunk.export(wav_io, format='wav')
                    wav_io.seek(0)  # Move to the start of the BytesIO object before reading from it
                        
                    with sr.AudioFile(wav_io) as source:
                        audio_data = recogniser.record(source)

                    try:
                        extracted = recogniser.recognize_google(audio_data)
                        logger.debug(f"Chunk {index} extracted.")
                        transcription_extracts[index] = extracted
                    except sr.UnknownValueError:
                        # Log the unknown value error and continue
                        logger.warning(f"Chunk {index}: Could not understand the audio. Maybe it was empty.")
            
            logger.info("Extract:")
            for idx, extract in transcription_extracts.items():
                logger.info(f"{idx}: {extract}")

        except ffmpeg.Error as e:
            logger.error(f"FFmpeg failed to convert {mp3_file}: {str(e)}")
        except Exception as e:
            logger.error("Unexpected error.", exc_info=True)
            
transcribe_audio()
logger.info("Done")