# 🔊 Extract 🔊

This notebook focuses on extracting the audio (or a transcript), along with any other useful metadata, from a YouTube video. I mainly use notebooks for experimentation and quick testing.

## Table of Contents

<ul>
    <li>1. <a href="#setup">Setup</a></li>
    <li>
        2. <a href="#extraction">Extraction</a>
        <ul>
            <li>2.1. <a href="#extract-youtube-audio">Extract YouTube Audio</a></li>
            <li>2.2. <a href="#extract-video">Extract Video</a></li>
        </ul>
    </li>
</ul>

## 1. Setup

In [1]:
import os
import pytube
from pathlib import Path
from pytube import YouTube

import openai
import langchain
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders.youtube import YoutubeLoader


In [None]:
import dotenv

# Load environment variables from the project-level .env file.
dotenv.load_dotenv("../.env")
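
The Whisper parser used later in this notebook calls the OpenAI API, so it may help to fail early when the key is missing. A minimal sketch, assuming the `.env` file is meant to define `OPENAI_API_KEY` (the notebook does not show its contents); `require_env` is a hypothetical helper, not part of `dotenv`:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv` would surface a missing key before any audio is downloaded.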

## 2. Extraction

### 2.1. Extract YouTube Audio

In [2]:
# Let's first extract the YouTube audio.
def to_snake_case(name):
    return name.lower().replace(" ", "_").replace(":", "_").replace("__", "_")

def download_youtube_audio(url, file_name=None, out_dir="."):
    """Download the highest-bitrate MP4 audio stream of a YouTube video."""
    yt = YouTube(url)
    if file_name is None:
        # Append the suffix by hand: Path.with_suffix() would cut a title at its last dot.
        file_name = to_snake_case(yt.title) + ".mp4"
    yt_stream = (yt.streams
            .filter(only_audio=True, file_extension="mp4")
            .order_by("abr")
            .desc())
    # output_path keeps out_dir in effect even when file_name is passed explicitly.
    return yt_stream.first().download(output_path=out_dir, filename=file_name)
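
`to_snake_case` only handles spaces and colons, while video titles can contain other characters (such as `/` or `?`) that are awkward or invalid in file names. A possible alternative, sketched here with a regular expression; `slugify_title` is a hypothetical name and is not used elsewhere in this notebook:

```python
import re

def slugify_title(title):
    """Lowercase a title and collapse each run of non-alphanumeric characters into one underscore."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    # Drop underscores left over from leading or trailing punctuation.
    return slug.strip("_")

print(slugify_title("Peggy Hill: I forgot to add the meat"))
# peggy_hill_i_forgot_to_add_the_meat
```

Note that this sketch also drops non-ASCII letters, which may or may not be acceptable depending on the titles being processed.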

In [3]:
url = "https://youtu.be/8ESJRRrVllI"
root = ".."
out_dir = os.path.join(root, "data", "external", "audio")
os.makedirs(out_dir, exist_ok=True)
audio = download_youtube_audio(url=url, out_dir=out_dir)

In [2]:
# Let's try a different way to do the same thing. This time using LangChain.
url = "https://youtu.be/8ESJRRrVllI"
root = ".."
out_dir = os.path.join(root, "data", "external", "audio")

loader = GenericLoader(YoutubeAudioLoader(urls=[url], save_dir=out_dir), OpenAIWhisperParser())

In [3]:
a = loader.load()

[youtube] Extracting URL: https://youtu.be/8ESJRRrVllI
[youtube] 8ESJRRrVllI: Downloading webpage
[youtube] 8ESJRRrVllI: Downloading ios player API JSON
[youtube] 8ESJRRrVllI: Downloading android player API JSON
[youtube] 8ESJRRrVllI: Downloading m3u8 information
[info] 8ESJRRrVllI: Downloading 1 format(s): 140
[download] ..\data\external\audio\Peggy Hill： I forgot to add the meat.m4a has already been downloaded
[download] 100% of  282.55KiB
[ExtractAudio] Not converting audio ..\data\external\audio\Peggy Hill： I forgot to add the meat.m4a; file is already in target format m4a
Transcribing part 1!


In [6]:
a[0]

Document(page_content='My Sloppy Joe is all sloppy and no joke. I forgot to add the meat. How could I be so freaking stupid?', metadata={'source': '..\\data\\external\\audio\\Peggy Hill： I forgot to add the meat.m4a', 'chunk': 0})

In [20]:
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=QsYGlZkevEg", add_video_info=True
)
b = loader.load()

In [17]:
b[0].dict()

{'page_content': 'LADIES AND GENTLEMEN, PEDRO PASCAL! [ CHEERS AND APPLAUSE ] >> THANK YOU, THANK YOU. THANK YOU VERY MUCH. I\'M SO EXCITED TO BE HERE. THANK YOU. I SPENT THE LAST YEAR SHOOTING A SHOW CALLED "THE LAST OF US" ON HBO. FOR SOME HBO SHOES, YOU GET TO SHOOT IN A FIVE STAR ITALIAN RESORT SURROUNDED BY BEAUTIFUL PEOPLE, BUT I SAID, NO, THAT\'S TOO EASY. I WANT TO SHOOT IN A FREEZING CANADIAN FOREST WHILE BEING CHASED AROUND BY A GUY WHOSE HEAD LOOKS LIKE A GENITAL WART. IT IS AN HONOR BEING A PART OF THESE HUGE FRANCHISEs LIKE "GAME OF THRONES" AND "STAR WARS," BUT I\'M STILL GETTING USED TO PEOPLE RECOGNIZING ME. THE OTHER DAY, A GUY STOPPED ME ON THE STREET AND SAYS, MY SON LOVES "THE MANDALORIAN" AND THE NEXT THING I KNOW, I\'M FACE TIMING WITH A 6-YEAR-OLD WHO HAS NO IDEA WHO I AM BECAUSE MY CHARACTER WEARS A MASK THE ENTIRE SHOW. THE GUY IS LIKE, DO THE MANDO VOICE, BUT IT\'S LIKE A BEDROOM VOICE. WITHOUT THE MASK, IT JUST SOUNDS PORNY. PEOPLE WALKING BY ON THE STREET SE

### 2.2. Extract Video

In [4]:
from functools import partial
from multiprocessing.pool import Pool

import cv2
import youtube_dl

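def process_video_parallel(url, skip_frames, process_number):
    """Save up to 10 frames, sampled every `skip_frames` frames, from this process's slice of the video."""
    cap = cv2.VideoCapture(url)
    num_processes = os.cpu_count()
    frames_per_process = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // num_processes
    start = frames_per_process * process_number
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    x = 0
    count = 0
    while x < 10 and count < frames_per_process:
        ret, frame = cap.read()
        if not ret:
            break
        # Include the process number so workers do not overwrite each other's files.
        filename = r"PATH\shot" + str(process_number) + "_" + str(x) + ".png"
        x += 1
        cv2.imwrite(filename, frame)
        count += skip_frames  # e.g. 300 frames is 10 seconds at 30 fps
        # Seek relative to this process's slice, not to the start of the video.
        cap.set(cv2.CAP_PROP_POS_FRAMES, start + count)
    cap.release()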



video_url = "https://youtu.be/8ESJRRrVllI"  # The YouTube URL
ydl_opts = {}
ydl = youtube_dl.YoutubeDL(ydl_opts)
info_dict = ydl.extract_info(video_url, download=False)

formats = info_dict.get('formats') or []

print("Obtaining frames")
for f in formats:
    if f.get('format_note') == '144p':
        url = f.get('url')
        cpu_count = os.cpu_count()
        with Pool(cpu_count) as pool:
            pool.map(partial(process_video_parallel, url, 300), range(cpu_count))
        break  # Several formats can share the '144p' note; process only the first.

[youtube] 8ESJRRrVllI: Downloading webpage
Obtaining frames
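
The cell above splits the video into one slice per process and samples every `skip_frames`-th frame within each slice, capped at 10 frames. The index arithmetic can be checked without opening a video. A minimal sketch; `frame_indices` and its parameters are hypothetical names introduced only for illustration:

```python
def frame_indices(total_frames, num_workers, worker, skip_frames, max_frames=10):
    """List the absolute frame indices one worker samples from its slice of the video."""
    per_worker = total_frames // num_workers
    start = per_worker * worker
    # Offsets are relative to the slice; adding `start` makes them absolute.
    offsets = range(0, per_worker, skip_frames)
    return [start + off for off in offsets][:max_frames]

print(frame_indices(total_frames=7200, num_workers=4, worker=1, skip_frames=300))
# [1800, 2100, 2400, 2700, 3000, 3300]
```

Computing the indices up front like this makes it easy to confirm that no two workers sample the same frames before any decoding happens.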


In [2]:
import pafy

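# Testing getting stream.
url = "FFXD417ugHM"  # pafy.new accepts an 11-character video ID as well as a full URL.
video = pafy.new(url)
best = video.getbest(preftype="mp4")

capture = cv2.VideoCapture(best.url)
while True:
    grabbed, frame = capture.read()
    if not grabbed:  # Stop at the end of the stream instead of looping forever.
        break
    print("a ", end="")
capture.release()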

KeyError: 'average_rating'