# A YouTube Video Summarizer Using Whisper and LangChain

• Find the [Notebook](https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2007%20-%20Create_a_YouTube_Video_Summarizer_Using_Whisper_and_LangChain_.ipynb) for this section at [towardsai.net/book](http://towardsai.net/book).

The project runs in four steps. First, the video is downloaded from YouTube. Second, its audio track is transcribed using **Whisper**. Third, the transcript is summarized with LangChain using three different approaches: *stuff*, *refine*, and *map_reduce*. Finally, multiple transcriptions are added to a Deep Lake database to enable question answering over those videos.

The following diagram explains what we are going to do in this project:

![image](./youtube_video_summarizer.jpg)

*Our YouTube video summarizer pipeline.*

As usual, start by installing the packages using the command: `!pip install langchain==0.0.208 deeplake openai==0.27.8 tiktoken yt_dlp openai-whisper`

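Next, install [ffmpeg](https://ffmpeg.org/). It is a prerequisite for the ***yt_dlp*** package, which uses it to merge separate video and audio streams, and Whisper also relies on it to decode media files.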

Next, add the API keys for the OpenAI and Deep Lake services to the environment variables.

In [None]:
import os
from langchain_custom_utils.helper import get_openai_api_key, get_deeplake_api_key, print_response
OPENAI_API_KEY = get_openai_api_key()
DEEPLAKE_API_KEY = get_deeplake_api_key()
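
The helper functions above come from a separate utility module. If that module is not available, a minimal sketch like the following places the keys into the environment directly. It assumes the OpenAI client reads `OPENAI_API_KEY` and Deep Lake reads `ACTIVELOOP_TOKEN`; the `set_api_keys` name is hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def set_api_keys(openai_key=None, deeplake_token=None):
    """Store both keys in the process environment (hypothetical helper)."""
    # Prompt interactively only when a key is neither passed in
    # nor already present in the environment.
    if openai_key is None and "OPENAI_API_KEY" not in os.environ:
        openai_key = getpass("OpenAI API key: ")
    if deeplake_token is None and "ACTIVELOOP_TOKEN" not in os.environ:
        deeplake_token = getpass("Deep Lake (Activeloop) token: ")

    # Assumed variable names: OPENAI_API_KEY and ACTIVELOOP_TOKEN.
    if openai_key is not None:
        os.environ["OPENAI_API_KEY"] = openai_key
    if deeplake_token is not None:
        os.environ["ACTIVELOOP_TOKEN"] = deeplake_token
```

Calling `set_api_keys()` with no arguments prompts for any key that is missing, which keeps secrets out of the notebook source.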

This tutorial shows how to programmatically summarize a video featuring Yann LeCun, a notable computer scientist and AI researcher. In the video, LeCun shares his thoughts on the challenges associated with large language models. The code works with any other video, provided that it can be summarized from its audio alone (the model won’t know what is shown on screen) and, ideally, that it features only a few speakers. Video podcasts are ideal for this project.

The `download_mp4_from_youtube()` function downloads the highest-quality MP4 version of the video at a given YouTube link and saves it under the filename hard-coded in the function (`lecuninterview.mp4`). To use it, pass the URL of the chosen video as the argument.

In [None]:
import yt_dlp

def download_mp4_from_youtube(url):
    # Set the options for the download
    filename = 'lecuninterview.mp4'
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
        'outtmpl': filename,
        'quiet': True,
    }

    # Download the video file
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        result = ydl.extract_info(url, download=True)

url = "https://www.youtube.com/watch?v=mBjPyte2ZZo"
download_mp4_from_youtube(url)
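
Since only the audio is transcribed, downloading the video stream is optional. The following is a minimal sketch of an audio-only variant, not the approach used in the original notebook. It relies on yt_dlp's `FFmpegExtractAudio` postprocessor (which needs ffmpeg); the function names and the choice of mp3 are assumptions made for illustration.

```python
def build_audio_opts(basename="lecuninterview"):
    """Return yt_dlp options that keep only the audio track (sketch)."""
    return {
        # Prefer the best audio-only stream; fall back to the best combined one.
        "format": "bestaudio/best",
        # yt_dlp fills in the extension; the postprocessor converts to mp3.
        "outtmpl": f"{basename}.%(ext)s",
        "quiet": True,
        # Requires ffmpeg: re-encodes the downloaded stream to mp3.
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}
        ],
    }


def download_audio_from_youtube(url, basename="lecuninterview"):
    """Download only the audio of a YouTube video (hypothetical helper)."""
    import yt_dlp  # imported lazily so build_audio_opts works without it

    with yt_dlp.YoutubeDL(build_audio_opts(basename)) as ydl:
        ydl.download([url])
```

With these options the output should be an mp3 file named after `basename`, so a later transcription call would need to point at that file instead of the MP4.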

Now that the MP4 file has been downloaded, the next step is to transcribe its audio using a speech-to-text model. One of the most popular open-source speech-to-text models at present is OpenAI’s Whisper.

## Transcribing Audio with Whisper

***Whisper*** is an advanced automatic speech recognition system developed by OpenAI. It is trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This large and diverse dataset makes the system more robust to accents, background noise, and technical jargon.

The previously installed `whisper` package provides the `load_model()` function, which downloads (on first use) and loads a model. The returned model object exposes a `transcribe()` method that takes the path to an audio or video file. Several model sizes are available (tiny, base, small, medium, and large), trading processing speed against accuracy. We will use the 'base' model for this example.

In [None]:
import whisper

model = whisper.load_model("base")
result = model.transcribe("lecuninterview.mp4")
print(result['text'])

> /home/cloudsuperadmin/.local/lib/python3.9/site-packages/whisper/transcribe.py:114:
> UserWarning: FP16 is not supported on CPU; using FP32 instead  
> warnings.warn("FP16 is not supported on CPU; using FP32 instead")     
> Hi, I'm Craig Smith, and this is I on A On. This week I talked to Jan
> LeCoon, one of the seminal figures in deep learning development and a
> long-time proponent of self-supervised learning. Jan spoke about
> what's missing in large language models and his new joint embedding
> predictive architecture which may be a step toward filling that gap.
> He also talked about his theory of consciousness and the potential for
> AI systems to someday exhibit the features of consciousness. It's a
> fascinating conversation that I hope you'll enjoy. Okay, so Jan, it's
> great to see you again. I wanted to talk to you about where you've
> gone with so supervised learning since last week's spoke. In
> particular, I'm interested in how it relates to large language models
> because they have really come on stream since we spoke. In fact, in
> your talk about JEPA, which is joint embedding predictive
> architecture. […and so on]
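
The `UserWarning` at the top of the output appears because Whisper defaults to half precision (FP16), which is not supported on CPU. A minimal sketch of one way to handle this is to pick the device and the `fp16` flag explicitly. The helper names below are hypothetical; `device` is a parameter of `whisper.load_model()` and `fp16` is a decoding option accepted by `transcribe()`.

```python
def transcribe_options(cuda_available):
    """Pick a device and fp16 flag for Whisper (hypothetical helper)."""
    return {
        "device": "cuda" if cuda_available else "cpu",
        # Half precision only makes sense on a GPU.
        "fp16": cuda_available,
    }


def transcribe_file(path, model_name="base"):
    """Load a Whisper model and transcribe one file (sketch)."""
    # Imported lazily so transcribe_options can be used on its own.
    import torch
    import whisper

    opts = transcribe_options(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=opts["device"])
    return model.transcribe(path, fp16=opts["fp16"])
```

On a CPU-only machine this requests FP32 up front, so the warning should no longer be raised; on a GPU it keeps the faster half-precision path.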

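The transcription is returned as raw text under the `'text'` key and can be saved to a text file. Note that the 'base' model misspells some names (for example, "Jan LeCoon" for Yann LeCun); the larger models trade speed for better accuracy.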

In [None]:
with open('text.txt', 'w', encoding='utf-8') as file:
    file.write(result['text'])
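
Besides `'text'`, the dictionary returned by `transcribe()` also contains a `'segments'` list whose entries carry `'start'`, `'end'`, and `'text'` fields. The sketch below formats those segments as timestamped lines, which can help when tracing a summary back to a point in the video. The function names are hypothetical and this step is not part of the original pipeline.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an HH:MM:SS string."""
    total = int(seconds)
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(result):
    """Return one '[HH:MM:SS] text' line per Whisper segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        # Segment text usually starts with a space, so strip it.
        lines.append(f"[{start}] {seg['text'].strip()}")
    return lines
```

For example, `"\n".join(segments_to_lines(result))` could be written to a second file alongside `text.txt`.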

Once the transcription is ready, the next step is to split it into chunks using a text splitter and then use the chunks to generate a summary.
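
To illustrate what splitting into chunks means, here is a minimal plain-Python sketch that cuts a string into fixed-size, optionally overlapping pieces. It is only an illustration of the idea under simple assumptions (character counts, no awareness of sentence boundaries); it is not LangChain's text splitter, and the function name and defaults are hypothetical.

```python
def split_into_chunks(text, chunk_size=1000, overlap=0):
    """Split text into character chunks that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap lets neighboring chunks share some context, so a sentence cut at a chunk boundary still appears whole in at least one of them.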